Courtney Mumma presented at the study day Een web van webarchieven (17-11-2016, NCDD/Beeld en Geluid/Netwerk Digitaal Erfgoed) on the web crawling tools and services of the Internet Archive.
Internet Archive: Archive-It and Contract Crawling, C. Mumma
1. Web Crawling Tools and Services from the Internet Archive: Archive-It and Contract Crawling
Courtney C. Mumma, Internet Archive
November 17, 2016 - Dutch Institute for Sound and Vision
2. Talk overview
● Archiving the web at IA
● Partnerships and services
○ Contract crawls
○ Archive-It
■ Research Services
○ Interoperability & Distributed Preservation
● New technology for new challenges
18. Global Wayback
● Broad snapshot
● Deep crawl on popular sites
● Broad crawl on known domains
● No more 404s
● On-demand
● Donated and targeted crawls
● https://web-beta.archive.org/ with KEYWORD SEARCH and more!
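The "No more 404s" and on-demand items depend on being able to ask whether an archived copy of a URL exists. A minimal sketch of such a lookup follows, assuming the public Wayback Machine availability endpoint at https://archive.org/wayback/available, which the slides do not name. The helper names and the sample response are illustrative, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Assumption: the public Wayback availability endpoint (not named in the slides).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the lookup URL for a page, optionally near a YYYYMMDD[hhmmss] timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the URL of the closest archived snapshot, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Illustrative response body in the shape the endpoint returns; values are made up.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

if __name__ == "__main__":
    print(availability_query("example.com", "20161117"))
    print(closest_snapshot(SAMPLE_RESPONSE))
```

A site's 404 handler could call something like `closest_snapshot` and redirect visitors to the archived copy when one is returned.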
22. Contract Crawling
Domain-scale
• Run by Internet Archive
• Average 300 million URLs per collection
Partial List of Partners
• National Libraries of Australia and New Zealand
• U.S. National Archives and Library of Congress
• Luxembourg National Library
• Israel National Library
Partial List of Collections
• Iraq War (2003-2011)
• 2005 US Supreme Court Nominations
24. Archive-It
Web-based - nothing to install
Fully hosted service with unlimited support
Simple to select, manage, scope and catalog with metadata
10 different crawl frequencies
Includes quick access and storage
HTML, videos, audio, social media, PDFs, images, news
Full text search
Restricted access options
28. How our partners use Archive-It
● Enhance and supplement traditional offline collections
○ archives, topical collections
● Support records retention and archival policies
● Capture event-based content
○ Spontaneous
○ Planned
● Individual organizations and consortial collaboration
30. Goals of Archive-It Research Services
● Expand access models for web archives
● Enable new insights into collections
● Leverage Internet Archive infrastructure for large-scale processing to produce datasets for research
● Facilitate computational analysis and new use cases
● Increase use, visibility, and value of Archive-It partner collections
35. Lost in the maze in Labyrinth (1986, LucasFilm, screen capture)
WARCs, CDXs and derivatives
Access
Storage
Preservation
Content Mgmt
Web Archiving Tools
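CDX files are the plain-text indexes that point into WARC files. Field layouts vary between tools, so the sketch below assumes one particular seven-field, space-separated layout (urlkey, timestamp, original URL, MIME type, status code, digest, length). The sample line is invented for illustration and the names `CdxRecord` and `parse_cdx_line` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CdxRecord:
    urlkey: str          # canonicalized, sortable form of the URL
    timestamp: datetime  # capture time, parsed from a 14-digit string
    original: str        # URL as it was captured
    mimetype: str
    status: int          # HTTP status code of the capture
    digest: str          # content hash
    length: int          # size of the stored record in bytes


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one index line in the assumed seven-field layout."""
    parts = line.split()
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    urlkey, ts, original, mimetype, status, digest, length = parts
    # Assumes numeric status and length; some real indexes use "-" as a placeholder.
    return CdxRecord(
        urlkey=urlkey,
        timestamp=datetime.strptime(ts, "%Y%m%d%H%M%S"),
        original=original,
        mimetype=mimetype,
        status=int(status),
        digest=digest,
        length=int(length),
    )


# Invented sample line, for illustration only.
SAMPLE_LINE = (
    "com,example)/ 20161117120000 http://example.com/ "
    "text/html 200 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 1234"
)

if __name__ == "__main__":
    print(parse_cdx_line(SAMPLE_LINE))
```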
37. APIs (application programming interfaces)
● Interoperability
● Flexibility and modularity
● Loose coupling of services (so we can improve pieces as needed)
● Scalability - Bulk data upload and download
40. Ongoing efforts
• OpenWayback
• Social media / Dynamic content
– Brozzler and Umbra (Archive-It)
– Social Feed Manager (GWU)
• URL nomination tools (UNT)
• Capture tools (GWU, IA, Rhizome)
• WASAPI - Community building and API
• Memento
BROZZLER!
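Of the efforts listed above, Memento (RFC 7089) is the one with the simplest wire-level mechanism: a client asks a TimeGate for the capture of a URL closest to a given moment by sending an Accept-Datetime header. A minimal sketch of building such a request follows. The TimeGate base URL is a placeholder and `memento_request` is a hypothetical helper; nothing is sent over the network.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request(timegate_base, target_url, when):
    """Return (request URL, headers) for a Memento datetime negotiation.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    rendered in the HTTP-date form that Accept-Datetime expects.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    # Assumption: the TimeGate takes the target URL appended to its base path.
    return timegate_base + target_url, headers


if __name__ == "__main__":
    url, headers = memento_request(
        "https://timegate.example.org/",  # placeholder, not a real service
        "http://example.com/",
        datetime(2016, 11, 17, 12, 0, tzinfo=timezone.utc),
    )
    print(url)
    print(headers)
```

The returned pair can be handed to any HTTP client. A compliant TimeGate would typically answer with a redirect to the selected capture.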