Archive-It: Scaling Beyond a Billion Archival Web-pages       Aaron Binns, Internet Archive      aaron@archive.org, 2011-1...
My Background§    Aaron Binns (aaron@archive.org)§    Internet Archive§    Senior Software Engineer§    Full-text sear...
Internet Archive§    Universal access to all knowledge§    http://archive.org§    Founded 1996§    501(c)(3) non-profi...
§  http://web.archive.org§  165,000,000,000+ archived web pages   –  HTML   –  Images   –  CSS   –  JavaScript   –  Mult...
5
http://archive-it.org§  Subscription web archiving service   –  Select websites to harvest, frequency, depth   –  Crawlin...
Collections & Documents§  Collection   –  Web harvest configuration       •  URLs to crawl       •  Frequency & depth   –...
Archive-It: Collection            8
Archive-It: Wayback           9
Archive-It: Replay                July 27, 2002Sept 15, 2011                10
Archive-It: Search          11
Archive-It: Search          12
Challenges and....Solutions?§  Scale§  Archival web search != web search§  Document formats  –  HTML (1996....2011)  – ...
Scale§  200+ customers§  2,272 collections   –  Largest: 33,470,659 documents   –  24 collections, 10,000,000+ docs   – ...
Scale...each day§  30-40 simultaneous crawls/harvests§  ~150GB of data: HTML, images, media§  ~1.3 million new unique d...
Architecture§  Offline indexing   –  10 dedicated indexing machines   –  ~10% of collections per machine   –  Add new doc...
Diversity     17
Diversity     18
Diversity     19
Field Collapsing / Grouping§  Applied to web documents       “Give me the best 1-2 hits from a site”§  Lucene   –  Group...
Time§  User experience & understanding   –  Archival web search != web search§  Information Architecture   –  Publicatio...
Time   22
Searching across collections§  Search all collections of a user§  Search arbitrary group of collections§  1 collection ...
Custom Solutions§  Java§  Built on Lucene§  Investigating Solr   –  Capabilities   –  Cost§  Internet Archive   –  Ope...
Custom Solutions: Indexing§    http://github.com/aaronbinns/jbs§    Archive-It & other archival web collections§    Had...
Custom Solutions: Searching§  http://github.com/aaronbinns/tnh§  Custom Java web application with Lucene§  Federated se...
CollapsingCollector§    http://github.com/aaronbinns/tnh§    Extends Lucene Collector§    Field cache: “site”§    Reta...
Web Archives!§  Archive-It   –  http://archive-it.org/§  US National Archives   –  http://webharvest.gov/§  UK Web Arch...
Upcoming SlideShare
Loading in...5
×

Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

1,476

Published on

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

Description of Archive-It, the Internet Archive's subscription, self-serve web archiving service, focused on the full-text search system. With nearly 200 partners and over 2000 collections the custom Lucene-based system handles 3+ million index updates per day across an index that totals over 1.3 billion documents. This session will give a detailed description of the architecture and implementation of the Archive-It search system; highlighting many of the challenges due to the scale as well as complex use cases.

Published in: Technology, Design
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,476
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
8
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns"

  1. 1. Archive-It: Scaling Beyond a Billion Archival Web-pages Aaron Binns, Internet Archive aaron@archive.org, 2011-10-19
  2. 2. My Background§  Aaron Binns (aaron@archive.org)§  Internet Archive§  Senior Software Engineer§  Full-text search & cool stuff •  Full-text search •  Hadoop •  “Big Data”•  http://github.com/aaronbinns 2
  3. 3. Internet Archive§  Universal access to all knowledge§  http://archive.org§  Founded 1996§  501(c)(3) non-profit org.§  Digital Library§  San Francisco, CA, USA§  7+ PB of publicly accessible digital materials –  Web archive –  Books, music, video, etc. 3
  4. 4. §  http://web.archive.org§  165,000,000,000+ archived web pages –  HTML –  Images –  CSS –  JavaScript –  Multimedia§  1996-today 4
  5. 5. 5
  6. 6. http://archive-it.org§  Subscription web archiving service –  Select websites to harvest, frequency, depth –  Crawling/Harvesting –  Wayback –  Full-text search§  Customers –  Public, State & University Libraries –  Local governments –  Museums –  Non-Governmental Organizations (NGOs) 6
  7. 7. Collections & Documents§  Collection –  Web harvest configuration •  URLs to crawl •  Frequency & depth –  Set of documents archived •  Access via Wayback Machine •  Full-text search§  Document –  Unique version of a URL –  “Text” documents: HTML, PDF, Office, etc. 7
  8. 8. Archive-It: Collection 8
  9. 9. Archive-It: Wayback 9
  10. 10. Archive-It: Replay July 27, 2002Sept 15, 2011 10
  11. 11. Archive-It: Search 11
  12. 12. Archive-It: Search 12
  13. 13. Challenges and....Solutions?§  Scale§  Archival web search != web search§  Document formats –  HTML (1996....2011) –  PDF, Office, text, etc.§  English, Français, Español,漢字, …§  Diversity§  Time 13
  14. 14. Scale§  200+ customers§  2,272 collections –  Largest: 33,470,659 documents –  24 collections, 10,000,000+ docs –  250 collections, 1,000,000+ docs§  Total: –  1,375,473,187 unique documents 14
  15. 15. Scale...each day§  30-40 simultaneous crawls/harvests§  ~150GB of data: HTML, images, media§  ~1.3 million new unique documents –  New URLs never seen before –  New versions of URLs§  ~1.3 million updates –  Documents unchanged –  New crawl dates 15
  16. 16. Architecture§  Offline indexing –  10 dedicated indexing machines –  ~10% of collections per machine –  Add new documents –  Update existing documents with new dates –  1CPU x 2core, 4GB RAM, 3x2TB disk§  Search service –  11 machines: 1 master, 10 slaves –  ~10% of collections per slave –  1 collection → 1 Lucene index –  1CPU x 2core, 8GB RAM, 3x2TB disk 16
  17. 17. Diversity 17
  18. 18. Diversity 18
  19. 19. Diversity 19
  20. 20. Field Collapsing / Grouping§  Applied to web documents “Give me the best 1-2 hits from a site”§  Lucene –  Grouping contrib package§  Solr –  Field Collapsing§  What is the performance cost?§  Custom solution 20
  21. 21. Time§  User experience & understanding –  Archival web search != web search§  Information Architecture –  Publication date for web pages – difficult§  Temporal diversity –  Multiple hits per site –  Multiple versions per URL 21
  22. 22. Time 22
  23. 23. Searching across collections§  Search all collections of a user§  Search arbitrary group of collections§  1 collection → 1 Lucene index –  Search 100 collections.... –  Search 100 indexes§  Collections distributed over 10 searchers 23
  24. 24. Custom Solutions§  Java§  Built on Lucene§  Investigating Solr –  Capabilities –  Cost§  Internet Archive –  Open Source –  Apache License –  http://github.com/aaronbinns 24
  25. 25. Custom Solutions: Indexing§  http://github.com/aaronbinns/jbs§  Archive-It & other archival web collections§  Hadoop-based, or stand-alone§  Java code with Lucene –  Hard-coded “schema” for web documents –  Title, body, keywords, date, mime-type, etc. –  Link analysis & curation to augment scoring 25
  26. 26. Custom Solutions: Searching§  http://github.com/aaronbinns/tnh§  Custom Java web application with Lucene§  Federated search –  1 master, 10 slaves –  OpenSearch§  Multiple collections & arbitrary grouping§  CollapsingCollector 26
  27. 27. CollapsingCollector§  http://github.com/aaronbinns/tnh§  Extends Lucene Collector§  Field cache: “site”§  Retains top N hits per “site” –  Control N via URL parameter 27
  28. 28. Web Archives!§  Archive-It –  http://archive-it.org/§  US National Archives –  http://webharvest.gov/§  UK Web Archive –  http://www.webarchive.org.uk/ –  Solr-based§  Web Archive of Catalonia / PADICAT –  Biblioteca de Catalunya –  http://www.padicat.cat/ 28
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×