The Internet Archive discovers and captures web pages through many different web crawls. At any given time several distinct crawls are running, some for months at a time and others daily or at longer intervals. View the web archive through the Wayback Machine.
Topic: webwidecrawl
Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites. Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for...
Since 1996, Alexa Internet has been donating its crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.
Topics: web crawl, Alexa
Content crawled via the Wayback Machine Live Proxy, mostly by the Save Page Now feature on web.archive.org. The liveweb proxy is a component of the Internet Archive’s Wayback Machine project. It captures the content of a web page in real time, archives it into an ARC or WARC file, and returns the ARC/WARC record to the Wayback Machine for processing. The recorded ARC/WARC file becomes part of the Wayback Machine in due course.
Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. The group is composed entirely of volunteers and interested parties, and has expanded into a large number of related projects for saving online and digital history. History is littered with hundreds of conflicts over the future of a community, group, location or...
Survey crawls are run about twice a year, on average, and attempt to capture the content of the front page of every web host ever seen by the Internet Archive since 1996.
Topic: survey crawls
These crawls are part of an effort to archive pages as they are created and to archive the pages that they refer to. That way, as the referenced pages are changed or taken from the web, a link to the version that was live when the page was written will be preserved. The Internet Archive hopes that references to these archived pages will then be put in place of a link that would otherwise be broken, or a companion link to allow people to see what was originally intended by a page's...
Crawl of outlinks from wikipedia.org. These files are currently not publicly accessible. From Wikipedia: Wikipedia is a multilingual, web-based, free-content encyclopedia project operated by the Wikimedia Foundation and based on an openly editable model. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links to guide the...
Web crawl data from Common Crawl.
This is a collection of web page captures from links added to, or changed on, Wikipedia pages. The idea is to bring reliability to Wikipedia outlinks so that if the pages referenced by Wikipedia articles are changed, or go away, a reader can still find what was originally referred to. This is part of the Internet Archive's attempt to rid the web of broken links.
Topics: Wikipedia, Wikimedia
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Web wide crawl number 16. The seed list for Wide00016 was made from the join of the top 1 million domains from CISCO and the top 1 million domains from Alexa.
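As a rough illustration of how a seed list might be built from two top-domain lists, the sketch below merges them into one de-duplicated list. It assumes that "join" here means a union with duplicates removed, which the description does not confirm; the function name `merge_seed_lists`, the normalization rules, and the sample domains are all hypothetical and are not taken from the Internet Archive's tooling.

```python
def merge_seed_lists(*lists):
    """Return the de-duplicated union of several domain lists.

    Assumption: "join" means set union. Domains are normalized by
    trimming whitespace, lowercasing, and dropping a trailing dot.
    First-seen order is preserved.
    """
    seen = set()
    merged = []
    for domains in lists:
        for domain in domains:
            key = domain.strip().lower().rstrip(".")
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical stand-ins for the two top-1-million lists.
cisco_top = ["example.com", "Example.org", "archive.org"]
alexa_top = ["archive.org", "example.net", "example.com."]

seeds = merge_seed_lists(cisco_top, alexa_top)
print(len(seeds), "unique seed domains")
```

Because the two source lists overlap, the merged list is generally shorter than the sum of their lengths.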
A daily crawl of more than 200,000 home pages of news sites, including the pages linked from those home pages. Site list provided by The GDELT Project.
Topics: GDELT, News
The seeds for this crawl came from: 251 million domains that had at least one link from a different domain in the Wayback Machine, across all time; ~300 million domains that we had in the Wayback, across all time; and 55,945,067 domains from https://archive.org/details/wide00016. This crawl was run with a Heritrix setting of "maxHops=0" (URLs including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Web wide crawl with initial seedlist and crawler configuration from January 2015.
Web wide crawl with initial seedlist and crawler configuration from April 2013.
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Web wide crawl with initial seedlist and crawler configuration from June 2014.
This is a collection of pages and embedded objects from WordPress blogs and the external pages they link to. Captures of these pages are made on a continuous basis, seeded from a feed of new or changed pages hosted by WordPress.com or by sites running a properly configured Jetpack WordPress plugin.
Topics: Wordpress.com, blogs, jetpack
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Listen to free audio books and poetry recordings! This library of audio books and poetry features digital recordings and MP3s from the Naropa Poetics Audio Archive, LibriVox, Project Gutenberg, Maria Lectrix, and Internet Archive users.
Wayback indexes. This data is currently not publicly accessible.
Images contributed by Internet Archive users and community members. These images are available for free download. Please select a Creative Commons License during upload so that others will know what they may (or may not) do with your images.
Topic: images
Web wide crawl with initial seedlist and crawler configuration from August 2013.
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Crawls performed by Internet Archive on behalf of the National Library of Australia. This data is currently not publicly accessible.
Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.
Web wide crawl with initial seedlist and crawler configuration from April 2012.
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Survey crawl of .com domains started January 2011.
Topic: webcrawl
Web wide crawl with initial seedlist and crawler configuration from February 2014.
Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
Web wide crawl with initial seedlist and crawler configuration from September 2012.
Web wide crawl with initial seedlist and crawler configuration from October 2010.
Web wide crawl with initial seedlist and crawler configuration from March 2011 using HQ software.
Screen captures of hosts discovered during wide crawls. This data is currently not publicly accessible.
The seed for this crawl was a list of every host in the Wayback Machine. This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds). The WARC files associated with this crawl are not currently available to the general public.
Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi. What’s in the data set: crawl start date: 09 March, 2011; crawl end date: 23 December, 2011; number of captures: 2,713,676,341; number of unique URLs: 2,273,840,159; number of hosts: 29,032,069. The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT)...
Crawls of International News Sites
Listen to sermons and lectures concerning religion and spirituality here.
Data crawled on behalf of Internet Memory Foundation. This data is currently not publicly accessible. From Wikipedia: The Internet Memory Foundation (formerly the European Archive Foundation) is a non-profit foundation whose purpose is archiving web content; it supports projects and research that include the preservation and protection of multimedia content. Its archives form a digital library of cultural content.
Crawl of outlinks from wikipedia.org started February, 2012. These files are currently not publicly accessible.
Programs in TV News Archive for research and educational purposes. The programs allow users to search across a collection of television news programs dating back to 2009 for research and educational purposes such as fact checking. Users may view short clips, share links to customized short quotes, embed customized short quotes, or borrow a copy of the full program.
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Crawl EG from Alexa Internet. This data is currently not publicly accessible.
Watch full-length feature films, classic shorts, world culture documentaries, World War II propaganda, movie trailers, and films created in just ten hours: These options are all featured in this diverse library! Many of these videos are available for free download.
A great resource for podcasters: the Creative Commons Podcasting Legal Guide.
Books contributed by the Internet Archive.
Topic: internet archive books
Free books for people with disabilities that impact reading. If you have a disability that interferes with reading printed text, then all of these books can be instantaneously available in your browser or via protected download. Want access? Individuals: If you would like to apply for access (it is free), make sure you have an Archive.org account and then fill in this form to contact the Vermont Mutual Aid Society. If you are affiliated with any of...
Topics: print disabled, print disability
Books in this collection may be borrowed by logged-in patrons. You may read the books online in your browser or, in some cases, download them into Adobe Digital Editions, a free piece of software used for managing loans. Please note that works in this collection are protected by copyright law (Title 17 U.S. Code) and copying, redistribution or sale, whether or not for profit, by the recipient is not permitted unless authorized by the rightsholder or by law. See FAQs about...
Data collected by Internet Archive on behalf of the National Library of Spain. This data is currently not publicly accessible.
Crawl performed by Internet Archive. This data is currently not publicly accessible.
A number of religious and spiritual organizations regularly upload their sermons and lectures to the Archive through the Open Source Audio collection. You may easily locate them here.
Crawl EI from Alexa Internet. This data is currently not publicly accessible.
Crawl of outlinks from wikipedia.org started May, 2011. These files are currently not publicly accessible.
Crawl EH from Alexa Internet. This data is currently not publicly accessible.
This collection features audio collections reflecting music, art and culture. Collections include the unique contemporary compositions and performances found in the Other Minds collection, the hundreds of popular songs from the early 20th Century found in the 78 RPM collection and oral history projects.
Items included in the Television News search service. Part of TV News Archive.
Shallow crawls that collect content 1 level deep including embeds. This data is currently not publicly accessible.
Crawls of the French domain space performed by Internet Archive on behalf of Bibliothèque nationale de France. This data is currently not publicly accessible.
Captures of pages from YouTube. Currently these are discovered by searching for YouTube links on Twitter.
Topics: YouTube, Twitter, Video
An analysis of news and public affairs independent from traditional corporate media is available from this diverse video library. From Democracy Now's daily news program, to three days of TV news coverage following the 9/11 attacks, to Mosaic’s timely clips of Middle East newscasts, to UCSF's Tobacco Industry Videos: these collections offer an alternative way to view and interpret current news and public affairs. Many of these videos are available for free download.
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Crawl DX from Alexa Internet. This data is currently not publicly accessible.
Collections of items recorded from television, including commercials, old television shows, government proceedings, and more.
National Archives and Records Administration crawl performed by Internet Archive. This data is currently not publicly accessible.
GeoCities crawl performed by Internet Archive. This data is currently not publicly accessible. From Wikipedia: Yahoo! GeoCities is a Web hosting service. GeoCities was originally founded by David Bohnett and John Rezner in late 1994 as Beverly Hills Internet (BHI), and by 1999 GeoCities was the third-most visited Web site on the World Wide Web. In its original form, site users selected a "city" in which to place their Web pages. The "cities" were metonymously named after...
Crawl of outlinks from wikipedia.org started July, 2011. These files are currently not publicly accessible.
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Crawl EB from Alexa Internet. This data is currently not publicly accessible.
Crawl DZ from Alexa Internet. This data is currently not publicly accessible.
CDX Index shards for the Wayback Machine. The Wayback Machine works by looking up historic URLs based on a query. This is done by searching an index of all the web objects (pages, images, etc.) that have been archived over the years. This collection holds the index used for this purpose, which is broken up into 300 pieces so that they fit into items more naturally and distribute the lookup load. Each of these 300 pieces is stored in at least 2 items, and then those are also stored on the backup...
Digitized version from Serials In Microform collection originally from NA Publishing. Record of the acquisition of the microfilm: https://archive.org/details/SerialsOnMicrofilmCollection
Newest uploads! Auto-78-twitter. Through the Great 78 Project the Internet Archive has begun to digitize 78rpm discs for preservation, research, and discovery with the help of George Blood, L.P. 78s were mostly made from shellac, i.e., beetle resin, and were the brittle predecessors to the LP (microgroove) era. @great78project for uploads as they happen. Turntable used for 78rpm digitization of four simultaneous recordings with different needles. The...
Topics: 78rpm, digitization
Source: 78
Periodical publications including magazines, trade magazines, and journals. Please peruse the growing list of publications.
Topics: periodicals, journals, serials, magazines
Crawl EF from Alexa Internet. This data is currently not publicly accessible.
The newspapers in this collection have been scanned as part of a pilot project using microfilm and microfiche. After using a microfilm/fiche scanner to create a digital image of each page, we process the resulting images so that each reel is contained in a single item with easily navigable files. For a few examples, please see: The New York times (Oct 16 31 1915); The New York times (1919 July 1-15); The New York times (May 1-15 1915).
Crawl DL from Alexa Internet. This data is currently not publicly accessible.
Non-English language collections contributed to the Open Source Audio collection are featured here.
COM survey crawl data collected by Internet Archive in 2009-2010. This data is currently not publicly accessible.
Books scanned in Shenzhen and Beijing, China.
Topic: books
This library of arts and music videos features This or That (a burlesque game show), the Coffee House TV arts program, punk bands from Punkcast and live performances from Groove TV. Many of these movies are available for free download.
Web crawl snapshots generously donated by Accelovation. This data is currently not publicly accessible. From the site: Accelovation is pioneering the delivery of Insight Discovery™ software solutions that help companies move from innovation idea to product reality faster and with more success. Our solutions are used by leading firms in the Fortune 500 and beyond – companies from a diverse set of industries ranging from consumer packaged goods to high tech, foods to chemicals, and...
The John P. Robarts Research Library, commonly referred to as Robarts Library, is the main humanities and social sciences library of the University of Toronto Libraries and the largest individual library in the university. Opened in 1973 and named for John Robarts, the 17th Premier of Ontario, the library contains more than 4.5 million bookform items, 4.1 million microform items and 740,000 other items. The library building is one of the most significant examples of brutalist architecture in...
Inspiring discovery through free access to biodiversity knowledge. | The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community. | Please read BHL's Acknowledgment of Harmful Content. About the Biodiversity Heritage Library: The Biodiversity Heritage Library (BHL) is the world's largest open access digital library for biodiversity literature and archives. BHL is...
Crawl EE from Alexa Internet. This data is currently not publicly accessible.
Survey crawl of .net domains started December 2010.
Topic: webcrawl
Crawl DJ from Alexa Internet. This data is currently not publicly accessible.
Question or comment about digitized items from the Library of Congress that are presented on this website? Please use the Library of Congress Ask a Librarian form. The Library of Congress is the world’s largest library, offering access to the creative record of the United States—and extensive materials from around the world—both on-site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore...
Crawl data from Institut national de l’audiovisuel in France. This data is currently not publicly accessible. From Wikipedia: The Institut national de l'audiovisuel (or INA, French for National Audiovisual Institute) is a repository of all French radio and television audiovisual archives. Since 2006, it has allowed free online consultation on a website called ina.fr with a search tool indexing 100,000 archives of historical programs, for a total of 20,000 hours.
Shallow crawl started 2013 that collects content 1 level deep, including embeds. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Shallow crawl started 2013 that collects content 1 level deep, including embeds. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
Crawl DI from Alexa Internet. This data is currently not publicly accessible.
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Crawl Image from Alexa Internet. This data is currently not publicly accessible.