SlideShare a Scribd company logo
Archive-It: Scaling Beyond a
 Billion Archival Web-pages
       Aaron Binns, Internet Archive
      aaron@archive.org, 2011-10-19
My Background
§    Aaron Binns (aaron@archive.org)
§    Internet Archive
§    Senior Software Engineer
§    Full-text search & cool stuff
   •  Full-text search
   •  Hadoop
   •  “Big Data”
•  http://github.com/aaronbinns




                           2
Internet Archive
§    Universal access to all knowledge
§    http://archive.org
§    Founded 1996
§    501(c)(3) non-profit org.
§    Digital Library
§    San Francisco, CA, USA
§    7+ PB of publicly accessible digital materials
      –  Web archive
      –  Books, music, video, etc.


                               3
§  http://web.archive.org
§  165,000,000,000+ archived web pages
   –  HTML
   –  Images
   –  CSS
   –  JavaScript
   –  Multimedia
§  1996-today


                        4
5
http://archive-it.org
§  Subscription web archiving service
   –  Select websites to harvest, frequency, depth
   –  Crawling/Harvesting
   –  Wayback
   –  Full-text search
§  Customers
   –  Public, State & University Libraries
   –  Local governments
   –  Museums
   –  Non-Governmental Organizations (NGOs)


                             6
Collections & Documents
§  Collection
   –  Web harvest configuration
       •  URLs to crawl
       •  Frequency & depth
   –  Set of documents archived
       •  Access via Wayback Machine
       •  Full-text search
§  Document
   –  Unique version of a URL
   –  “Text” documents: HTML, PDF, Office, etc.


                             7
Archive-It: Collection




            8
Archive-It: Wayback




           9
Archive-It: Replay
                July 27, 2002




Sept 15, 2011


                10
Archive-It: Search




          11
Archive-It: Search




          12
Challenges and....Solutions?

§  Scale
§  Archival web search != web search
§  Document formats
  –  HTML (1996....2011)
  –  PDF, Office, text, etc.
§  English, Français, Español,漢字, …
§  Diversity
§  Time


                               13
Scale

§  200+ customers
§  2,272 collections
   –  Largest: 33,470,659 documents
   –  24 collections, 10,000,000+ docs
   –  250 collections, 1,000,000+ docs
§  Total:
     –  1,375,473,187 unique documents



                            14
Scale...each day

§  30-40 simultaneous crawls/harvests
§  ~150GB of data: HTML, images, media
§  ~1.3 million new unique documents
   –  New URLs never seen before
   –  New versions of URLs
§  ~1.3 million updates
   –  Documents unchanged
   –  New crawl dates



                            15
Architecture
§  Offline indexing
   –  10 dedicated indexing machines
   –  ~10% of collections per machine
   –  Add new documents
   –  Update existing documents with new dates
   –  1CPU x 2core, 4GB RAM, 3x2TB disk
§  Search service
   –  11 machines: 1 master, 10 slaves
   –  ~10% of collections per slave
   –  1 collection → 1 Lucene index
   –  1CPU x 2core, 8GB RAM, 3x2TB disk

                            16
Diversity




     17
Diversity




     18
Diversity




     19
Field Collapsing / Grouping

§  Applied to web documents
       “Give me the best 1-2 hits from a site”
§  Lucene
   –  Grouping contrib package
§  Solr
   –  Field Collapsing
§  What is the performance cost?
§  Custom solution


                            20
Time

§  User experience & understanding
   –  Archival web search != web search
§  Information Architecture
   –  Publication date for web pages – difficult
§  Temporal diversity
   –  Multiple hits per site
   –  Multiple versions per URL




                              21
Time




   22
Searching across collections

§  Search all collections of a user
§  Search arbitrary group of collections
§  1 collection → 1 Lucene index
     –  Search 100 collections....
     –  Search 100 indexes
§  Collections distributed over 10 searchers




                          23
Custom Solutions

§  Java
§  Built on Lucene
§  Investigating Solr
   –  Capabilities
   –  Cost
§  Internet Archive
   –  Open Source
   –  Apache License
   –  http://github.com/aaronbinns


                             24
Custom Solutions: Indexing


§    http://github.com/aaronbinns/jbs
§    Archive-It & other archival web collections
§    Hadoop-based, or stand-alone
§    Java code with Lucene
      –  Hard-coded “schema” for web documents
      –  Title, body, keywords, date, mime-type, etc.
      –  Link analysis & curation to augment scoring



                                 25
Custom Solutions: Searching

§  http://github.com/aaronbinns/tnh
§  Custom Java web application with Lucene
§  Federated search
   –  1 master, 10 slaves
   –  OpenSearch
§  Multiple collections & arbitrary grouping
§  CollapsingCollector




                            26
CollapsingCollector


§    http://github.com/aaronbinns/tnh
§    Extends Lucene Collector
§    Field cache: “site”
§    Retains top N hits per “site”
      –  Control N via URL parameter




                               27
Web Archives!
§  Archive-It
   –  http://archive-it.org/
§  US National Archives
   –  http://webharvest.gov/
§  UK Web Archive
   –  http://www.webarchive.org.uk/
    –  Solr-based
§  Web Archive of Catalonia / PADICAT
   –  Biblioteca de Catalunya
   –  http://www.padicat.cat/

                                28

More Related Content

Similar to Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

The Archivists' Toolkit presented at MARAC, November 13, 2010
The Archivists' Toolkit presented at MARAC, November 13, 2010The Archivists' Toolkit presented at MARAC, November 13, 2010
The Archivists' Toolkit presented at MARAC, November 13, 2010
Holly Mengel
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Andy Jackson
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
Roxanne Missingham
 
IIPC GA 2014 Solr
IIPC GA 2014 SolrIIPC GA 2014 Solr
IIPC GA 2014 Solr
Andy Jackson
 
Aglin
AglinAglin
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
dwig
 
Web and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of CongressWeb and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of Congress
nullhandle
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
National Library of Australia
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
Ahmed AlSum
 
Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
eby
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
NekoGato
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
Micah Altman
 
High and Lows of Library Linked Data
High and Lows of Library Linked DataHigh and Lows of Library Linked Data
High and Lows of Library Linked Data
Adrian Stevenson
 
Online Exhibits in Plone
Online Exhibits in PloneOnline Exhibits in Plone
Online Exhibits in Plone
Jazkarta, Inc.
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3
Essam Obaid
 
Connecting the Dots: Constellations in the Linked Data Universe
Connecting the Dots: Constellations in the Linked Data UniverseConnecting the Dots: Constellations in the Linked Data Universe
Connecting the Dots: Constellations in the Linked Data Universe
National Information Standards Organization (NISO)
 
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Marcus Smith
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Lucidworks (Archived)
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
lucenerevolution
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
Murat Çakal
 

Similar to Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns (20)

The Archivists' Toolkit presented at MARAC, November 13, 2010
The Archivists' Toolkit presented at MARAC, November 13, 2010The Archivists' Toolkit presented at MARAC, November 13, 2010
The Archivists' Toolkit presented at MARAC, November 13, 2010
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
IIPC GA 2014 Solr
IIPC GA 2014 SolrIIPC GA 2014 Solr
IIPC GA 2014 Solr
 
Aglin
AglinAglin
Aglin
 
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
 
Web and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of CongressWeb and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of Congress
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
High and Lows of Library Linked Data
High and Lows of Library Linked DataHigh and Lows of Library Linked Data
High and Lows of Library Linked Data
 
Online Exhibits in Plone
Online Exhibits in PloneOnline Exhibits in Plone
Online Exhibits in Plone
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3
 
Connecting the Dots: Constellations in the Linked Data Universe
Connecting the Dots: Constellations in the Linked Data UniverseConnecting the Dots: Constellations in the Linked Data Universe
Connecting the Dots: Constellations in the Linked Data Universe
 
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
lucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
lucenerevolution
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
lucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
lucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
lucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
lucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
lucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
lucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
lucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
lucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
lucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
lucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
lucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
lucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
lucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
lucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 

Recently uploaded

Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
BibashShahi
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
ScyllaDB
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)
HarpalGohil4
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ortus Solutions, Corp
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
Fwdays
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 

Recently uploaded (20)

Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Principle of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptxPrinciple of conventional tomography-Bibash Shahi ppt..pptx
Principle of conventional tomography-Bibash Shahi ppt..pptx
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
"What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w..."What does it really mean for your system to be available, or how to define w...
"What does it really mean for your system to be available, or how to define w...
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 

Archive-It: Scaling Beyond a Billion Archival Webpages - Aaron Binns

  • 1. Archive-It: Scaling Beyond a Billion Archival Web-pages Aaron Binns, Internet Archive aaron@archive.org, 2011-10-19
  • 2. My Background §  Aaron Binns (aaron@archive.org) §  Internet Archive §  Senior Software Engineer §  Full-text search & cool stuff •  Full-text search •  Hadoop •  “Big Data” •  http://github.com/aaronbinns 2
  • 3. Internet Archive §  Universal access to all knowledge §  http://archive.org §  Founded 1996 §  501(c)(3) non-profit org. §  Digital Library §  San Francisco, CA, USA §  7+ PB of publicly accessible digital materials –  Web archive –  Books, music, video, etc. 3
  • 4. §  http://web.archive.org §  165,000,000,000+ archived web pages –  HTML –  Images –  CSS –  JavaScript –  Multimedia §  1996-today 4
  • 5. 5
  • 6. http://archive-it.org §  Subscription web archiving service –  Select websites to harvest, frequency, depth –  Crawling/Harvesting –  Wayback –  Full-text search §  Customers –  Public, State & University Libraries –  Local governments –  Museums –  Non-Governmental Organizations (NGOs) 6
  • 7. Collections & Documents §  Collection –  Web harvest configuration •  URLs to crawl •  Frequency & depth –  Set of documents archived •  Access via Wayback Machine •  Full-text search §  Document –  Unique version of a URL –  “Text” documents: HTML, PDF, Office, etc. 7
  • 10. Archive-It: Replay July 27, 2002 Sept 15, 2011 10
  • 13. Challenges and....Solutions? §  Scale §  Archival web search != web search §  Document formats –  HTML (1996....2011) –  PDF, Office, text, etc. §  English, Français, Español,漢字, … §  Diversity §  Time 13
  • 14. Scale §  200+ customers §  2,272 collections –  Largest: 33,470,659 documents –  24 collections, 10,000,000+ docs –  250 collections, 1,000,000+ docs §  Total: –  1,375,473,187 unique documents 14
  • 15. Scale...each day §  30-40 simultaneous crawls/harvests §  ~150GB of data: HTML, images, media §  ~1.3 million new unique documents –  New URLs never seen before –  New versions of URLs §  ~1.3 million updates –  Documents unchanged –  New crawl dates 15
  • 16. Architecture §  Offline indexing –  10 dedicated indexing machines –  ~10% of collections per machine –  Add new documents –  Update existing documents with new dates –  1CPU x 2core, 4GB RAM, 3x2TB disk §  Search service –  11 machines: 1 master, 10 slaves –  ~10% of collections per slave –  1 collection → 1 Lucene index –  1CPU x 2core, 8GB RAM, 3x2TB disk 16
  • 17. Diversity 17
  • 18. Diversity 18
  • 19. Diversity 19
  • 20. Field Collapsing / Grouping §  Applied to web documents “Give me the best 1-2 hits from a site” §  Lucene –  Grouping contrib package §  Solr –  Field Collapsing §  What is the performance cost? §  Custom solution 20
  • 21. Time §  User experience & understanding –  Archival web search != web search §  Information Architecture –  Publication date for web pages – difficult §  Temporal diversity –  Multiple hits per site –  Multiple versions per URL 21
  • 22. Time 22
  • 23. Searching across collections §  Search all collections of a user §  Search arbitrary group of collections §  1 collection → 1 Lucene index –  Search 100 collections.... –  Search 100 indexes §  Collections distributed over 10 searchers 23
  • 24. Custom Solutions §  Java §  Built on Lucene §  Investigating Solr –  Capabilities –  Cost §  Internet Archive –  Open Source –  Apache License –  http://github.com/aaronbinns 24
  • 25. Custom Solutions: Indexing §  http://github.com/aaronbinns/jbs §  Archive-It & other archival web collections §  Hadoop-based, or stand-alone §  Java code with Lucene –  Hard-coded “schema” for web documents –  Title, body, keywords, date, mime-type, etc. –  Link analysis & curation to augment scoring 25
  • 26. Custom Solutions: Searching §  http://github.com/aaronbinns/tnh §  Custom Java web application with Lucene §  Federated search –  1 master, 10 slaves –  OpenSearch §  Multiple collections & arbitrary grouping §  CollapsingCollector 26
  • 27. CollapsingCollector §  http://github.com/aaronbinns/tnh §  Extends Lucene Collector §  Field cache: “site” §  Retains top N hits per “site” –  Control N via URL parameter 27
  • 28. Web Archives! §  Archive-It –  http://archive-it.org/ §  US National Archives –  http://webharvest.gov/ §  UK Web Archive –  http://www.webarchive.org.uk/ –  Solr-based §  Web Archive of Catalonia / PADICAT –  Biblioteca de Catalunya –  http://www.padicat.cat/ 28