Profiling Web Archives
Michael L. Nelson
Ahmed AlSum, Michele C. Weigle
Herbert Van de Sompel, David Rosenthal
IIPC General Assembly
Paris, France, May 21, 2014
Where's that issue
with the Afghan girl?
Prior IIPC Memento Aggregator Project
• Ten IIPC archives, led by LANL
• Conceived at 2011 IIPC meeting
• Results reported at 2012 IIPC meeting
o http://netpreserve.org/sites/default/files/resources/Sanderson.pdf
• Two highlights:
Stop and Rethink…
• LANL's processing was informative from a
"big data" perspective, but was neither
scalable nor sustainable
o "send us your CDX" == hard for both parties
o there are lots of URIs in the world
• Will only get worse with:
o more archives…
o …doing more archiving
Leverage Memento Aggregators
• Memento aggregators currently broadcast
URI lookups to all known archives
• New approach:
1. Build profiles based on sampling from URI lookups
(optionally supplement with CDX files when available)
2. Use archive profiles for informing Memento
aggregator "query routing" decisions
3. Share serialized profiles with other IIPC partners
http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.bnf.fr/
Profiling Studies
• TPDL 2013
o 12 archives, March 2013, public web archives used
but techniques apply generally
o sampling only, no CDX access
• IJDL 2014 (to appear)
o 15 archives (+4, -1), October 2013
o slightly larger sample URI dataset
o results similar
URI Lookup = Limited Information
GET /aggr/timegate/http://www.bnf.fr/ HTTP/1.1
Host: mementoproxy.lanl.gov
Accept-Datetime: Sun, 29 May 2005 02:46:53 GMT
Accept-Language: fr; q=1.0, en; q=0.5
…
1. Original URI
2. Memento-Datetime
3. Preferred URI
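The lookup above can be sketched in Python with the standard library. This is a minimal illustration, not the aggregator's own client: the helper name `build_timegate_request` is hypothetical, and the request is only built, never sent.

```python
from urllib.request import Request

# TimeGate prefix taken from the request shown on this slide.
TIMEGATE_PREFIX = "http://mementoproxy.lanl.gov/aggr/timegate/"


def build_timegate_request(original_uri, accept_datetime, accept_language=None):
    """Build (but do not send) a TimeGate lookup for original_uri.

    Hypothetical helper: the original URI is appended to the TimeGate
    prefix and the desired datetime travels in Accept-Datetime.
    """
    headers = {"Accept-Datetime": accept_datetime}
    if accept_language:
        headers["Accept-Language"] = accept_language
    return Request(TIMEGATE_PREFIX + original_uri, headers=headers)


req = build_timegate_request(
    "http://www.bnf.fr/",
    "Sun, 29 May 2005 02:46:53 GMT",
    "fr; q=1.0, en; q=0.5",
)
print(req.get_method(), req.selector)
```

Passing `req` to `urllib.request.urlopen` would issue the lookup; that step is left out so the sketch runs without network access.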
Where to find Mementos for …
http://www.japantimes.co.jp/
http://www.bnf.fr
Research Question
Problem
• Profile public web archives according to the following
dimensions:
o Top-level domains
o Languages
o Growth rate
o Archival date
Motivation
• Determine who is archiving what
• Optimize query routing for a Memento Aggregator
Web Archives in this Experiment
Archive                        Full text   URI-lookup
Internet Archive                           √
Library of Congress                        √
Icelandic Web Archive                      √
Library and Archives Canada    √           √
British Library                √           √
UK National Library            √           √
Portuguese Web Archive         √           √
Web Archive of Catalonia       √           √
Croatian Web Archive           √           √
Archive of the Czech Web       √           √
National Taiwan University     √           √
Archive It                     √           √
Experiment Set Up
• Sample URIs from seven different sources
• Retrieve the TimeMap for each URI from all archives
o A TimeMap lists all Mementos for a given URI
o A Memento is an archived version of a resource
• Analyze who has holdings for which URIs
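The "who has holdings" step can be sketched as follows, assuming each archive returns its TimeMap in the link format. The parser is a simplified regular-expression sketch rather than a full link-format parser, and the archive names and `SAMPLE` data are made up for illustration.

```python
import re

# One link-format entry: <target> followed by its ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (uri, datetime) pairs for every memento link in text."""
    mementos = []
    for uri, attr_text in LINK_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        # rel may hold several values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            mementos.append((uri, attrs.get("datetime")))
    return mementos


def holdings(timemaps_by_archive):
    """Map archive name -> number of Mementos it returned for one URI."""
    return {name: len(parse_timemap(tm))
            for name, tm in timemaps_by_archive.items()}


# Made-up sample data for illustration only.
SAMPLE = (
    '<http://www.bnf.fr/>; rel="original",\n'
    '<http://archive.example/19990101000000/http://www.bnf.fr/>; '
    'rel="first memento"; datetime="Fri, 01 Jan 1999 00:00:00 GMT",\n'
    '<http://archive.example/20050529024653/http://www.bnf.fr/>; '
    'rel="last memento"; datetime="Sun, 29 May 2005 02:46:53 GMT"'
)
print(holdings({"archive-a": SAMPLE, "archive-b": ""}))
```

An archive with no holdings simply yields a count of zero, which is the signal the profiling step needs.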
Sampling URIs - DMOZ
1. DMOZ:Random
o 10,000 URIs randomly sampled from DMOZ directory (~5M URIs).
2. DMOZ:TLD - 2% for each TLD from DMOZ or 100 URIs
whichever is greater
o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net
2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au
764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319),
(cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149),
(tw 141), (za 118), (fi 113), (100 URIs each for [ae, cat, cl, cu, eg, gov,
id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy,
zw])
3. DMOZ:Languages - 100 URIs for each language
o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans,
Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional),
Dutch, Spanish, French, Greek, Hindi, Italian, Japanese,
Korean, Norwegian, Persian, Polish, Russian, Turkish,
Ukrainian
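The DMOZ:TLD rule ("2% for each TLD or 100 URIs, whichever is greater") can be sketched as below. Two details are assumptions not stated on the slide: the 2% is rounded up, and a TLD with fewer than 100 URIs contributes all of them. The function names are illustrative.

```python
import math
import random
from collections import defaultdict
from urllib.parse import urlsplit


def tld_of(uri):
    """Return the last label of the host name, e.g. 'fr' for www.bnf.fr."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def sample_size(population, rate_percent=2, floor=100):
    """2% of the TLD's URIs or `floor`, whichever is greater.

    Assumptions: the 2% is rounded up, and a TLD with fewer than
    `floor` URIs contributes all of them.
    """
    wanted = max(floor, math.ceil(population * rate_percent / 100))
    return min(population, wanted)


def sample_by_tld(uris, rng=None):
    """Group URIs by TLD and draw a random sample from each group."""
    rng = rng or random.Random()
    groups = defaultdict(list)
    for uri in uris:
        groups[tld_of(uri)].append(uri)
    return {tld: rng.sample(members, sample_size(len(members)))
            for tld, members in groups.items()}
```

Under this rule, a TLD with 10,000 directory entries yields 200 sample URIs, while one with 3,000 entries is raised to the floor of 100.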
Sampling URIs – Web Archives Full Text
• Query the full-text search interface of select web archives
with two sets of query terms.
4. Top 1-grams from Bing
o Most are English
5. Top 1000 query terms from Yahoo in 9 languages
o Excluding general keywords such as: Obama, Facebook.
Sampling URIs – User Requests
• Sampling from user requests for archived web resources
6. Sample from IA Wayback Machine log files
o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26,
2012.
7. Sample from Memento Aggregator log files
o 100 URIs randomly sampled from LANL Memento Aggregator
logs between 2011 and 2013.
Archive Coverage per Sample
[Figure: entire sample; labeled values 100% and 35%]
TLD Coverage across Archives (1)
[Figure: entire sample]
TLD Coverage across Archives (2)
[Figure: entire sample]
TLD Distribution per Archive
[Figure: DMOZ:TLD sample]
TLD Distribution per Archive
[Figure: Web Archives Full Text sample]
Language Coverage per Archive
[Figure: DMOZ sample]
Archive Growth Rate
[Figure: entire sample]
Query Routing Evaluation
Study Results
• Introduced sampling to profile web archives using
available infrastructure, no privileged access
• Coverage:
o Internet Archive provides broad coverage
o National archives have good coverage for their domains
o Surprising coverage by certain archives
• Query Routing:
o In 84% of the cases, all existing Mementos for a TLD can be
found by using IA and two additional top archives for a TLD
o In 55% of the cases, all existing Mementos for a TLD can be
found by using the top 3 archives for a TLD, excluding IA
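One way to measure a query-routing result like the ones above is sketched here. It assumes a "case" is a sampled URI and counts a hit when every archive holding Mementos for that URI is among the archives selected for routing; the study's exact definition may differ. The function names and the sample data are made up.

```python
from collections import Counter


def top_archives(holdings, k=3, always=()):
    """Pick `always` plus the most frequent archives, k in total.

    holdings maps URI -> set of archive names holding Mementos for it.
    """
    counts = Counter(a for archives in holdings.values() for a in archives)
    chosen = list(always)
    for name, _ in counts.most_common():
        if len(chosen) >= k:
            break
        if name not in chosen:
            chosen.append(name)
    return chosen


def full_recall_rate(holdings, chosen):
    """Fraction of archived URIs whose holders are all in `chosen`."""
    selected = set(chosen)
    archived = [a for a in holdings.values() if a]
    if not archived:
        return 0.0
    hits = sum(1 for a in archived if a <= selected)
    return hits / len(archived)


# Made-up holdings for one TLD, for illustration only.
EXAMPLE = {
    "u1": {"IA"},
    "u2": {"IA", "BL"},
    "u3": {"PT"},
    "u4": {"IA", "CZ", "BL"},
    "u5": set(),  # not archived anywhere, so it is ignored
}
chosen = top_archives(EXAMPLE, k=2, always=("IA",))
print(chosen, full_recall_rate(EXAMPLE, chosen))
```

Running the same measurement with and without a fixed archive in `always` gives the kind of comparison reported in the two Query Routing bullets.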
Next Steps With the IIPC
• Finding the right granularity
o too fine:
http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html
o too coarse: .fr
o just right?: bnf.fr, www.bnf.fr, gallica.bnf.fr, www.bnf.fr/fr/
• Generating profiles
o what are desirable / representative sample sets: domains,
languages, regions, etc. -- what's missing?
o local CDX analysis tools (can help with cold start problem)
• Profile format
o community input (yet another metadata format)
o github (or other tools) for exchange & integration
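The granularity question can be made concrete with a small sketch that derives candidate profile keys from a URI, from coarsest to finest. The function name is hypothetical, and the registered-domain step naively takes the last two host labels, which is wrong for suffixes such as co.jp.

```python
from urllib.parse import urlsplit


def profile_keys(uri):
    """Return candidate profile keys for uri, coarsest first."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    keys = [labels[-1]]  # TLD, e.g. "fr"
    # Naive registered domain: the last two labels. This is an
    # assumption and is wrong for suffixes such as co.jp; a public
    # suffix list would be needed to handle those.
    if len(labels) >= 2:
        keys.append(".".join(labels[-2:]))
    if host not in keys:
        keys.append(host)
    segments = [s for s in parts.path.split("/") if s]
    # Keep directory segments only: drop a trailing file name.
    directories = segments if parts.path.endswith("/") else segments[:-1]
    if directories:
        keys.append(host + "/" + directories[0] + "/")
    return keys


print(profile_keys(
    "http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html"))
```

A profile could then record statistics under whichever of these keys turns out to be "just right" for a given archive.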
A Possible Serialization
{"Profile":{
  "Name":"Taiwan Web Archive",
  "URI":"http://webarchive.lib.ntu.edu.tw",
  "TimeGate":
    "http://mementoproxy.cs.odu.edu/tw/timegate/",
  "Code":"TW",
  "Age":"Tue, 15 Jul 1997 00:00:00 GMT",
  "TLD":[{"tw":0.6},{"cn":0.08},{"hk":0.04},
    {"eg":0.04},{"gov":0.04},{"my":0.04},
    {"jp":0.04},{"kr":0.02}],
  "Language":[{"zh-TW":0.5},{"zh-CN":0.25},
    {"id":0.08},{"ar":0.08}],
  "GrowthRate":[
    {"199707":[4,4]},{"200202":[1,1]},
    {"200607":[30,62]},{"200608":[20,80]},
    {"200609":[5,9]},{"200612":[77,129]},
    ... // other values truncated
    {"201308":[7,94]},{"201309":[2,94]}]
  }
}
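A rough sketch of how an aggregator might consume such a profile for query routing follows. The slide does not define the numbers in "TLD", so they are treated here as relative weights; the `threshold` value, the function names, and the trimmed profile copy are all assumptions made for illustration.

```python
import json
from urllib.parse import urlsplit

# Trimmed copy of the example profile; other fields omitted for brevity.
PROFILE_JSON = """
{"Profile": {
  "Name": "Taiwan Web Archive",
  "Code": "TW",
  "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}],
  "Language": [{"zh-TW": 0.5}, {"zh-CN": 0.25}]
}}
"""


def flatten(pairs):
    """Merge a list of single-key objects into one dict."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


def tld_weight(profile, uri):
    """Weight the profile gives to the TLD of uri (0.0 if absent)."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return flatten(profile["Profile"]["TLD"]).get(tld, 0.0)


def route(profiles, uri, threshold=0.05):
    """Codes of archives whose TLD weight meets a hypothetical threshold."""
    scored = [(tld_weight(p, uri), p["Profile"]["Code"]) for p in profiles]
    return [code for weight, code in sorted(scored, reverse=True)
            if weight >= threshold]


profile = json.loads(PROFILE_JSON)
print(route([profile], "http://www.ntu.edu.tw/"))  # TLD "tw" has weight 0.6
print(route([profile], "http://www.bnf.fr/"))      # TLD "fr" is not listed
```

With several profiles loaded, the same call returns the archives worth querying in descending order of weight, instead of broadcasting the lookup to all of them.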
{Light, Dim, Dark} Archives
• Work to date has assumed light archives
because our focus has been on sampling
archives we don't control
• Applicable to a continuum of archives:
o download/fork and run "dark-sample.py"
o it accesses sample URIs from IIPC github
o issues URI lookups to local archive
o write/update your archive profile in IIPC github with machine
readable IP restrictions
o all profiles -- light/dim/dark -- now available to Memento
aggregators and other IIPC analysis tools
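The workflow above might look roughly like the following sketch. It is not the contents of "dark-sample.py", which the slides do not show: `build_profile`, the `lookup` callable, and the holdings data are hypothetical, and only a TLD distribution is computed.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def build_profile(name, sample_uris, lookup):
    """Profile a local archive from URI lookups.

    `lookup(uri)` is a hypothetical callable returning the number of
    Mementos the local archive holds for uri.
    """
    tld_hits = Counter()
    for uri in sample_uris:
        if lookup(uri) > 0:
            host = urlsplit(uri).hostname or ""
            tld_hits[host.rsplit(".", 1)[-1].lower()] += 1
    total = sum(tld_hits.values())
    # Share of the sampled URIs with holdings that fall in each TLD.
    tld = [{t: round(n / total, 2)} for t, n in tld_hits.most_common()]
    return {"Profile": {"Name": name, "TLD": tld}}


# Stand-in for a real archive: made-up holdings for illustration.
FAKE_HOLDINGS = {
    "http://www.ntu.edu.tw/": 12,
    "http://www.gov.tw/": 3,
    "http://www.example.cn/": 1,
}
profile = build_profile(
    "Example Archive",
    list(FAKE_HOLDINGS) + ["http://www.bnf.fr/"],
    lambda uri: FAKE_HOLDINGS.get(uri, 0),
)
print(json.dumps(profile))
```

Because the lookups run locally, only the resulting profile leaves the archive, which is what makes the approach usable for dim and dark archives.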
Profiles = Easy Discovery, Sharing
http://netpreserve.org/aggr/timemap/link/1/http://www.bnf.fr/

More Related Content

What's hot

Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)
Web@rchive Austria
 

What's hot (20)

Creating Pockets of Persistence
Creating Pockets of PersistenceCreating Pockets of Persistence
Creating Pockets of Persistence
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
 
Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)Web@rchive Austria (Archiving Online Media)
Web@rchive Austria (Archiving Online Media)
 
Viaf and isni ifla 2013 08-16
Viaf and isni  ifla 2013 08-16Viaf and isni  ifla 2013 08-16
Viaf and isni ifla 2013 08-16
 
Interoperability for web based scholarship
Interoperability for web based scholarshipInteroperability for web based scholarship
Interoperability for web based scholarship
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
FAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueFAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning Issue
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 
PID Signposting Pattern
PID Signposting PatternPID Signposting Pattern
PID Signposting Pattern
 
A Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly RecordA Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly Record
 
Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.Intro to Linked Open Data in Libraries Archives & Museums.
Intro to Linked Open Data in Libraries Archives & Museums.
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Signposting for Repositories
Signposting for RepositoriesSignposting for Repositories
Signposting for Repositories
 
OAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall ForumOAC Presentation at CNI 09 Fall Forum
OAC Presentation at CNI 09 Fall Forum
 
How to read a million books?
How to read a million books?How to read a million books?
How to read a million books?
 
To the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly CommunicationTo the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly Communication
 
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
What is #LODLAM?! Understanding linked open data in libraries, archives [and ...
 
Linked Data Basics
Linked Data BasicsLinked Data Basics
Linked Data Basics
 
What’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collectionsWhat’s in a URL? Analysing COVID-19 web archive collections
What’s in a URL? Analysing COVID-19 web archive collections
 

Viewers also liked

Viewers also liked (20)

Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived Pages
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over Time
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web Archives
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 

Similar to Profiling Web Archives

Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013
Ahmed AlSum
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
Roxanne Missingham
 
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
Lucidworks
 
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
Nuno Freire
 
Snrg2011 6.15.2.sta canney_suranofsky
Snrg2011 6.15.2.sta canney_suranofskySnrg2011 6.15.2.sta canney_suranofsky
Snrg2011 6.15.2.sta canney_suranofsky
karan saini
 

Similar to Profiling Web Archives (20)

Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013
 
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
ISSA: Generic Pipeline, Knowledge Model and Visualization tools to Help Scien...
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
Embracing Diversity: Searching over Multiple Languages - Suneel Marthi, Red H...
 
IIIF at europeana, IIIF conference, Vatican, 2017
IIIF at europeana, IIIF conference, Vatican, 2017IIIF at europeana, IIIF conference, Vatican, 2017
IIIF at europeana, IIIF conference, Vatican, 2017
 
Web Archiving – Lessons and Potential
 Web Archiving – Lessons and Potential Web Archiving – Lessons and Potential
Web Archiving – Lessons and Potential
 
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
Metadata Aggregation: Assessing the Application of IIIF and Sitemaps within C...
 
Arcomem training Specifying Crawls Advanced
Arcomem training Specifying Crawls AdvancedArcomem training Specifying Crawls Advanced
Arcomem training Specifying Crawls Advanced
 
Lecture semantic augmentation
Lecture semantic augmentationLecture semantic augmentation
Lecture semantic augmentation
 
Arcomem training Specifying Crawls Beginners
Arcomem training Specifying Crawls BeginnersArcomem training Specifying Crawls Beginners
Arcomem training Specifying Crawls Beginners
 
Finalrevc
FinalrevcFinalrevc
Finalrevc
 
Cosi Opac Tweaks
Cosi   Opac TweaksCosi   Opac Tweaks
Cosi Opac Tweaks
 
Snrg2011 6.15.2.sta canney_suranofsky
Snrg2011 6.15.2.sta canney_suranofskySnrg2011 6.15.2.sta canney_suranofsky
Snrg2011 6.15.2.sta canney_suranofsky
 
Named Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana NewspapersNamed Entity Recognition for Europeana Newspapers
Named Entity Recognition for Europeana Newspapers
 
Digital Presentation Best Practices: Lessons Learned From Across the Pond
Digital Presentation Best Practices: Lessons Learned From Across the PondDigital Presentation Best Practices: Lessons Learned From Across the Pond
Digital Presentation Best Practices: Lessons Learned From Across the Pond
 
Digital Preservation Best Practices: Lessons Learned From Across the Pond
Digital Preservation Best Practices: Lessons Learned From Across the PondDigital Preservation Best Practices: Lessons Learned From Across the Pond
Digital Preservation Best Practices: Lessons Learned From Across the Pond
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 

More from Michael Nelson

More from Michael Nelson (10)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?
 

Recently uploaded

Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Continuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discsContinuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discs
Sérgio Sacani
 

Recently uploaded (20)

Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
Virulence Analysis of Citrus canker caused by Xanthomonas axonopodis pv. citr...
 
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...
 
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...
 
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptxPlasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
Plasmapheresis - Dr. E. Muralinath - Kalyan . C.pptx
 
TEST BANK for Organic Chemistry 6th Edition.pdf
TEST BANK for Organic Chemistry 6th Edition.pdfTEST BANK for Organic Chemistry 6th Edition.pdf
TEST BANK for Organic Chemistry 6th Edition.pdf
 
Continuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discsContinuum emission from within the plunging region of black hole discs
Continuum emission from within the plunging region of black hole discs
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
 
Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
 
MSCII_ FCT UNIT 5 TOXICOLOGY.pdf
MSCII_              FCT UNIT 5 TOXICOLOGY.pdfMSCII_              FCT UNIT 5 TOXICOLOGY.pdf
MSCII_ FCT UNIT 5 TOXICOLOGY.pdf
 
Plasma proteins_ Dr.Muralinath_Dr.c. kalyan
Plasma proteins_ Dr.Muralinath_Dr.c. kalyanPlasma proteins_ Dr.Muralinath_Dr.c. kalyan
Plasma proteins_ Dr.Muralinath_Dr.c. kalyan
 
Lubrication System in forced feed system
Lubrication System in forced feed systemLubrication System in forced feed system
Lubrication System in forced feed system
 
The Scientific names of some important families of Industrial plants .pdf
The Scientific names of some important families of Industrial plants .pdfThe Scientific names of some important families of Industrial plants .pdf
The Scientific names of some important families of Industrial plants .pdf
 
Factor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary GlandFactor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary Gland
 
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdfFORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf
 
ANALEPTICS Mrs Namrata Sanjay Mane  Department of Pharmaceutical Chemistry...
ANALEPTICS  Mrs Namrata Sanjay  Mane   Department of Pharmaceutical Chemistry...ANALEPTICS  Mrs Namrata Sanjay  Mane   Department of Pharmaceutical Chemistry...
ANALEPTICS Mrs Namrata Sanjay Mane  Department of Pharmaceutical Chemistry...
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...
 
EU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdfEU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdf
 
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday LifeGBSN - Microbiology (Unit 7) Microbiology in Everyday Life
GBSN - Microbiology (Unit 7) Microbiology in Everyday Life
 
short range interaction for protein and factors influencing or affecting the ...
short range interaction for protein and factors influencing or affecting the ...short range interaction for protein and factors influencing or affecting the ...
short range interaction for protein and factors influencing or affecting the ...
 

Profiling Web Archives

  • 1. Profiling Web ArchivesProfiling Web Archives Michael L. Nelson Ahmed AlSum, Michele C. Weigle Herbert Van de Sompel, David Rosenthal IIPC General Assembly Paris, France, May 21, 2014 1
  • 2.
  • 3.
  • 4.
  • 5.
  • 6. Where's that issue with the Afghan girl?
  • 7. 7
  • 8. 8
  • 9. 9
  • 10. Prior IIPC Memento Aggregator ProjectPrior IIPC Memento Aggregator Project • Ten IIPC archives, led by LANL • Conceived at 2011 IIPC meeting • Results reported at 2012 IIPC meeting o http://netpreserve.org/sites/default/files/resources/Sanderson.pdf • Two highlights:
  • 11.
  • 12.
  • 13. Stop and Rethink…Stop and Rethink… • LANL's processing was informative from a "big data" perspective, but was neither scalable nor sustainable o "send us your CDX" == hard for both parties o there are lots of URIs in the world • Will only get worse with: o more archives… o …doing more archiving
  • 14. Leverage Memento AggregatorsLeverage Memento Aggregators • Memento aggregator currently broadcast URI lookups to all known archives • New approach: 1. build profiles based on sampling from URI lookups (optionally supplement with CDX files when available) 2. Use archive profiles for informing Memento aggregator "query routing" decisions 3. Share serialized profiles with other IIPC partners http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.bnf.fr/
  • 15. Profiling StudiesProfiling Studies • TPDL 2013 o 12 archives, March 2013, public web archives used but techniques apply generally o sampling only, no CDX access • IJDL 2014 (to appear) o 15 archives (+4, -1), October 2013 o slightly larger sample URI dataset o results similar
  • 16. URI Lookup = Limited InformationURI Lookup = Limited Information 16 GET /aggr/timegate/http://www.bnf.fr/ HTTP/1.1 Host: mementoproxy.lanl.gov Accept-Datetime: Sun, 29 May 2005 02:46:53 GMT Accept-Language: fr; q=1.0, en; q=0.5 … 1. Original URI 2. Memento-Datetime 3. Preferred URI 2 1 3
  • 17. Where to find Mementos for … http://www.japantimes.co.jp/
  • 18. Where to find Mementos for … http://www.japantimes.co.jp/
  • 19. Where to find Mementos for … http://www.bnf.fr
  • 20. Where to find Mementos for … http://www.bnf.fr
  • 21. Research Question. Problem: • Profile public web archives according to the following dimensions: o Top-level domains o Languages o Growth rate o Archival date Motivation: • Determine who is archiving what • Optimize query routing for a Memento Aggregator
  • 22. Web Archives in this Experiment (columns: Full text, URI-lookup) Internet Archive √; Library of Congress √; Icelandic Web Archive √; Library and Archives Canada √ √; British Library √ √; UK National Library √ √; Portuguese Web Archive √ √; Web Archive of Catalonia √ √; Croatian Web Archive √ √; Archive of the Czech Web √ √; National Taiwan University √ √; Archive It √ √
  • 23. Experiment Set Up • Sample URIs from seven different sources • Retrieve the TimeMap for each URI from all archives o A TimeMap lists all Mementos for a given URI o A Memento is an archived version of a resource • Analyze who has holdings for which URIs
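The "analyze who has holdings" step can be sketched as counting the Mementos listed in each archive's TimeMap. This is a rough illustration under stated assumptions: the TimeMap is in link format, the regular expression is a simplification of that syntax, `archive.example` and the sample body are hypothetical, and the function names are made up for this example.

```python
import re

# One link-value: "<URI>; param; param". This is a simplification of the
# link-format syntax and assumes "<" never appears inside parameter values.
LINK_RE = re.compile(r'<([^>]+)>\s*;([^<]*)')

# Hypothetical TimeMap body; archive.example is a placeholder host.
SAMPLE_TIMEMAP = """\
<http://www.bnf.fr/>; rel="original",
<http://archive.example/timemap/link/http://www.bnf.fr/>; rel="self"; type="application/link-format",
<http://archive.example/web/20050529024653/http://www.bnf.fr/>; rel="first memento"; datetime="Sun, 29 May 2005 02:46:53 GMT",
<http://archive.example/web/20130301000000/http://www.bnf.fr/>; rel="last memento"; datetime="Fri, 01 Mar 2013 00:00:00 GMT"
"""

def mementos_in_timemap(timemap_text):
    """Return (memento URI, datetime string) pairs listed in a TimeMap."""
    found = []
    for uri, params in LINK_RE.findall(timemap_text):
        rel = re.search(r'rel="([^"]*)"', params)
        when = re.search(r'datetime="([^"]*)"', params)
        # rel may hold several tokens, e.g. "first memento"
        if rel and when and "memento" in rel.group(1).split():
            found.append((uri, when.group(1)))
    return found

def holdings(timemaps_by_archive):
    """Map each archive name to the number of Mementos it lists for the URI."""
    return {name: len(mementos_in_timemap(text))
            for name, text in timemaps_by_archive.items()}

print(holdings({"archive-a": SAMPLE_TIMEMAP, "archive-b": ""}))
```

An archive that returns an empty TimeMap simply counts as zero holdings for that URI.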
  • 24. Sampling URIs - DMOZ 1. DMOZ:Random o 10,000 URIs randomly sampled from DMOZ directory (~5M URIs). 2. DMOZ:TLD - 2% for each TLD from DMOZ or 100 URIs, whichever is greater o 52 TLDs: (com 23,470), (de 6,332), (org 4,025), (uk 3,309), (net 2,073), (it 1,775), (jp 1,379), (ru 1,244), (fr 1,154), (pl 1,062), (au 764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319), (cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149), (tw 141), (za 118), (fi 113), (100 URIs for [ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw]) 3. DMOZ:Languages - 100 URIs for each language o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish, Russian, Turkish, Ukrainian
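The DMOZ:TLD rule ("2% or 100 URIs, whichever is greater") can be written as a small function. This is a sketch only: rounding up and capping at the population size are assumptions the slide does not state, the function names are hypothetical, and the input below is synthetic rather than DMOZ data.

```python
import math
import random

def tld_sample_size(population, fraction=0.02, minimum=100):
    """2% of a TLD's URIs or `minimum`, whichever is greater, capped at the population.

    Rounding up and the cap are assumptions of this sketch.
    """
    return min(population, max(minimum, math.ceil(population * fraction)))

def sample_by_tld(uris_by_tld, seed=0):
    """Draw a per-TLD random sample without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return {tld: rng.sample(uris, tld_sample_size(len(uris)))
            for tld, uris in uris_by_tld.items()}

# Synthetic input: 10,000 made-up URIs for one TLD, 40 for another.
synthetic = {
    "big": [f"http://site{i}.big/" for i in range(10_000)],
    "small": [f"http://site{i}.small/" for i in range(40)],
}
sizes = {tld: len(sample) for tld, sample in sample_by_tld(synthetic).items()}
print(sizes)
```

With this rule, large TLDs contribute in proportion to their size while small ones are still represented by up to 100 URIs.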
  • 25. Sampling URIs – Web Archives Full Text • Query the full-text search interface of select web archives with two sets of query terms. 4. Top 1-Gram from Bing o Most are English 5. Top 1000 query terms from Yahoo in 9 languages o Excluding general keywords such as: Obama, Facebook.
  • 26. Sampling URIs – Web Archives Full Text
  • 27. Sampling URIs – Web Archives Full Text
  • 28. Sampling URIs – User Requests • Sampling from user requests for archived web resources 6. Sample from IA Wayback Machine log files o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012. 7. Sample from Memento Aggregator log files o 100 URIs randomly sampled from LANL Memento Aggregator between 2011 and 2013.
  • 29. Archive Coverage per Sample: Entire Sample (figure; callouts 100% and 35%)
  • 30. TLD Coverage across Archives (1): Entire Sample
  • 31. TLD Coverage across Archives (2): Entire Sample
  • 32. TLD Distribution per Archive: DMOZ:TLD Sample
  • 33. TLD Distribution per Archive: Web Archives Full Text Sample
  • 34. Language Coverage per Archive: DMOZ Sample
  • 35. Archive Growth Rate: Entire Sample
  • 36. Query Routing Evaluation
  • 37. Study Results • Introduced sampling to profile web archives using available infrastructure, no privileged access • Coverage: o Internet Archive provides broad coverage o National archives have good coverage for their domains o Surprising coverage by certain archives • Query Routing: o In 84% of the cases, all existing Mementos for a TLD can be found by using IA and two additional top archives for a TLD o In 55% of the cases, all existing Mementos for a TLD can be found by using the top 3 archives for a TLD, excluding IA
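The "top 3 archives for a TLD" routing idea can be sketched as a ranking over per-archive TLD coverage. Everything concrete here is an assumption made for illustration: the archive names and coverage numbers are invented, the TLD is taken naively as the last hostname label, and `route` is a hypothetical name, not the aggregator's implementation.

```python
from urllib.parse import urlsplit

# Hypothetical coverage table: archive -> TLD -> share of sampled URIs held.
COVERAGE = {
    "archive-a": {"fr": 0.90, "jp": 0.85, "tw": 0.70},
    "archive-b": {"fr": 0.60},
    "archive-c": {"tw": 0.80, "jp": 0.10},
    "archive-d": {"fr": 0.05, "jp": 0.30},
}

def tld_of(uri):
    """Last label of the hostname (a naive notion of top-level domain)."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1]

def route(uri, coverage, k=3):
    """Return up to k archives with non-zero coverage for the URI's TLD, best first."""
    tld = tld_of(uri)
    ranked = sorted(coverage, key=lambda name: coverage[name].get(tld, 0.0), reverse=True)
    return [name for name in ranked if coverage[name].get(tld, 0.0) > 0][:k]

print(route("http://www.bnf.fr/", COVERAGE))
print(route("http://www.japantimes.co.jp/", COVERAGE))
```

Instead of broadcasting to every archive, the aggregator would query only the returned subset; archives with no recorded coverage for the TLD are skipped entirely.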
  • 38. Next Steps With the IIPC • Finding the right granularity o too fine: http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html o too coarse: .fr o just right?: bnf.fr, www.bnf.fr, gallica.bnf.fr, www.bnf.fr/fr/ • Generating profiles o what are desirable / representative sample sets: domains, languages, regions, etc. -- what's missing? o local CDX analysis tools (can help with cold start problem) • Profile format o community input (yet another metadata format) o github (or other tools) for exchange & integration
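The granularity question can be made concrete by listing the candidate profile keys one URI could be filed under, from coarse to fine. This is an illustrative sketch: `profile_keys` is a hypothetical name, and taking the last two hostname labels as the domain is a naive assumption that gives the wrong answer for hosts such as www.japantimes.co.jp.

```python
from urllib.parse import urlsplit

def profile_keys(uri):
    """List candidate profile keys for a URI, coarsest first."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    labels = host.split(".")
    keys = ["." + labels[-1]]                  # too coarse: the TLD
    if len(labels) >= 2:
        keys.append(".".join(labels[-2:]))     # naive domain: last two labels
    if host not in keys:
        keys.append(host)                      # full hostname
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > 1:
        keys.append(f"{host}/{segments[0]}/")  # first directory under the host
    keys.append(uri)                           # too fine: the full URI
    return keys

for key in profile_keys("http://www.bnf.fr/fr/evenements_et_culture/a.passe_bnf.html"):
    print(key)
```

A profile builder could count holdings at each of these levels and keep whichever level discriminates between archives without exploding in size.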
  • 39. A Possible Serialization {"Profile":{ "Name":"Taiwan Web Archive", "URI":"http://webarchive.lib.ntu.edu.tw", "TimeGate": "http://mementoproxy.cs.odu.edu/tw/timegate/", "Code":"TW", "Age":"Tue, 15 Jul 1997 00:00:00 GMT", "TLD":[{"tw":0.6},{"cn":0.08},{"hk":0.04}, {"eg":0.04},{"gov":0.04},{"my":0.04}, {"jp":0.04},{"kr":0.02}], "Language":[{"zh-TW":0.5},{"zh-CN":0.25}, {"id":0.08},{"ar":0.08}], "GrowthRate":[ {"199707":[4,4]},{"200202":[1,1]}, {"200607":[30,62]},{"200608":[20,80]}, {"200609":[5,9]},{"200612":[77,129]}, ... // other values truncated {"201308":[7,94]},{"201309":[2,94]}] } }
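A consumer of such a profile might read it as follows. This is a sketch under assumptions: the JSON below is a trimmed excerpt of the slide's example (the truncated GrowthRate list and its comment are left out because they are not valid JSON), and `flatten` is a hypothetical helper, not part of any proposed format.

```python
import json

# Trimmed excerpt of the slide's example profile, kept to valid JSON.
PROFILE_JSON = """
{"Profile": {
  "Name": "Taiwan Web Archive",
  "Code": "TW",
  "TLD": [{"tw": 0.6}, {"cn": 0.08}, {"hk": 0.04}],
  "Language": [{"zh-TW": 0.5}, {"zh-CN": 0.25}]
}}
"""

def flatten(pairs):
    """Merge a list of single-key objects, e.g. [{"tw": 0.6}, ...], into one dict."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged

profile = json.loads(PROFILE_JSON)["Profile"]
tld_weights = flatten(profile["TLD"])
language_weights = flatten(profile["Language"])

print(profile["Name"], tld_weights.get("tw", 0.0), language_weights.get("zh-CN", 0.0))
```

Flattening the lists of single-key objects into dictionaries makes per-TLD and per-language lookups direct, which is what a routing decision would need.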
  • 40.
  • 41.
  • 42. {Light, Dim, Dark} Archives • Work to date has assumed light archives because our focus has been on sampling archives we don't control • Applicable to a continuum of archives: o download/fork and run "dark-sample.py" o it accesses sample URIs from IIPC github o issues URI lookups to local archive o write/update your archive profile in IIPC github with machine readable IP restrictions o all profiles -- light/dim/dark -- now available to Memento aggregators and other IIPC analysis tools
  • 43. Profiles = Easy Discovery, Sharing http://netpreserve.org/aggr/timemap/link/1/http://www.bnf.fr/