Internet Archive: Archive-It and Contract Crawling, C. Mumma

Web Crawling Tools and Services from the Internet Archive:
Archive-It and Contract Crawling
Courtney C. Mumma, Internet Archive
November 17, 2016 - Dutch Institute for Sound and Vision

Talk overview
● Archiving the web at IA
● Partnerships and services
○ Contract crawls
○ Archive-It
■ Research Services
○ Interoperability & Distributed Preservation
● New technology for new challenges

The Internet Archive
Non-Profit Library
Founded in 1996 by Brewster Kahle
Universal Access to All Knowledge

30,000,000,000,000,000 Bytes Archived
(30 PetaBytes)
20 Years of Archiving the Web
500,000,000,000+ URLs

1996
US Presidential Campaigns with Smithsonian
218,342,520
Web Captures

1997
First Full Crawl
525,362,846
Web Captures

1998
Donation of Crawl to the Library of Congress
1,166,891,826
Web Captures

2000
US Presidential Campaigns with the Library of Congress
6,153,042,235
Web Captures

2001
Launch of the Wayback Machine
12,082,859,018
Web Captures

2003
International Internet Preservation Consortium Founded
38,868,116,181
Web Captures

2006
Archive-It Started
103,943,903,726
Web Captures

2007
Ireland
184,277,909,308
Web Captures

2008
National Archive Government Crawls
209,160,715,829
Web Captures

2009
Archive-It Adds its 100th Partner
7 National Library Partners
225,658,093,516
Web Captures

2010
Broad and Survey Web-Scale Crawls
246,744,306,660
Web Captures

2015
Archive-It Adds its 400th Partner
467,195,419,069
Web Captures

Global Wayback
● Broad snapshot
● Deep crawl on popular sites
● Broad crawl on known domains
● No more 404s
● On-demand
● Donated and targeted crawls
● https://web-beta.archive.org/ with keyword search and more!
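
As a rough illustration of how a single capture in the Wayback Machine is addressed, the sketch below builds a replay URL from an original URL and a capture time. It assumes the public `https://web.archive.org/web/<timestamp>/<url>` addressing scheme with a 14-digit UTC timestamp (YYYYMMDDhhmmss), not the beta host named on the slide; the function name `wayback_url` is hypothetical.

```python
from datetime import datetime, timezone

# Assumption: the public replay prefix, not the web-beta host from the slide.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a replay URL for the capture closest to `when` (hypothetical helper)."""
    # Treat naive datetimes as UTC; convert aware ones to UTC.
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    talk_date = datetime(2016, 11, 17, 12, 0, 0, tzinfo=timezone.utc)
    print(wayback_url("http://example.com/", talk_date))
```

If no capture exists at exactly that second, the Wayback Machine redirects to the nearest capture it holds, so the timestamp works as a target rather than an exact key.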

Web Archiving Partnerships
and Services

Contract Crawling
Domain-scale
• Run by Internet Archive
• Average 300 million URLs per collection
Partial List of Partners
• National Libraries of Australia and New Zealand
• U.S. National Archives and Library of Congress
• Luxembourg National Library
• Israel National Library
Partial List of Collections
• Iraq War (2003-2011)
• 2005 US Supreme Court Nominations

Archive-It
Curated, Selective Web Archiving

Archive-It
Web-based - nothing to install
Fully hosted service with unlimited support
Simple to select, manage, scope and catalog with metadata
10 different crawl frequencies
Includes quick access and storage
html, videos, audio, social media, PDFs, images, news
Full text search
Restricted access options

How our partners use Archive-It
● Enhance and supplement traditional offline collections
○ archives, topical collections
● Support records retention and archival policies
● Capture event-based content
○ Spontaneous
○ Planned
● Individual organizations and Consortial collaboration

Goals of Archive-It Research Services
● Expand access models for web archives
● Enable new insights into collections
● Leverage Internet Archive infrastructure for large-scale processing to produce datasets for research
● Facilitate computational analysis and new use cases
● Increase use, visibility, and value of Archive-It partner collections

Web Archives Datasets
Archive-It Research Services
http://bit.ly/ait_ars

Exploring the Canadian Political Interest Group and Political Parties Web Sphere via WAT files

Named Entities in the Human Rights Collection

Systems Interoperability and Distributed Preservation

Lost in the maze in Labyrinth (1986, LucasFilm, screen capture)
WARCs, CDXs and derivatives
Access
Storage
Preservation
Content Mgmt
Web Archiving Tools

APIs
(*application programming interfaces)
● Interoperability
● Flexibility and modularity
● Loose coupling of services (so we can improve pieces as needed)
● Scalability - Bulk data upload and download
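
To make the loose-coupling point concrete, here is a minimal sketch of a client that depends only on a small HTTP/JSON contract. It uses the Internet Archive's public Wayback availability endpoint as a stand-in, which is not necessarily one of the APIs the slide has in mind. The helper names are hypothetical, and the sample response is a hand-written illustration of the documented shape, not real data. No network call is made.

```python
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; `timestamp` is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the closest capture's replay URL, or None if nothing is available."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


if __name__ == "__main__":
    print(availability_query("example.com", "20161117"))
    # Illustrative, hand-written response (assumed shape, not a real capture).
    sample = {
        "archived_snapshots": {
            "closest": {
                "available": True,
                "url": "https://web.archive.org/web/20161117000000/http://example.com/",
                "timestamp": "20161117000000",
                "status": "200",
            }
        }
    }
    print(closest_snapshot(sample))
```

Because the client only knows the endpoint and the JSON shape, the service behind it can be replaced or improved without changing the caller.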

New technology to face new challenges

Ongoing efforts
• OpenWayback
• Social media / Dynamic content
– Brozzler and Umbra (Archive-It)
– Social Feed Manager (GWU)
• URL nomination tools (UNT)
• Capture tools (GWU, IA, Rhizome)
• WASAPI - Community building and API
• Memento
BROZZLER!

Questions?
THANK YOU
courtney@archive.org
