SlideShare a Scribd company logo
1 of 25
Download to read offline
Profiling Web Archive Coverage
for
Top-Level Domain &Content Language
Ahmed AlSum, Michele C. Weigle, Michael L. Nelson,
Herbert Van de Sompel
International Conference on Theory and Practice of Digital Libraries
September 22-26, 2013
Valletta, Malta 1
2
Where to find Mementos for …
3
http://www.japantimes.co.jp/
Where to find Mementos for …
4
http://www.japantimes.co.jp/
Where to find Mementos for …
5
http://www.google.com/
Where to find Mementos for …
6
http://www.google.com/
Research Question
Problem
• Profile public web archives according to the following
dimensions:
o Top-level domains
o Languages
o Growth rate
o Archival date
Motivation
• To determine who is archiving what
• To optimize the query routing for a Memento Aggregator
7
Web Archives in this Experiment
Full text URI-lookup
Internet Archive √
Library of Congress √
Icelandic Web Archive √
Library and Archives Canada √ √
British Library √ √
UK National Library √ √
Portuguese Web Archive √ √
Web Archive of Catalonia √ √
Croatian Web Archive √ √
Archive of the Czech Web √ √
National Taiwan University √ √
Archive It √ √
8
Experiment Set Up
• Sample URIs from different sources
o Details coming up
• Retrieve the TimeMap for each URI from all archives
o A TimeMap lists all Mementos for a given URI
o A Memento is an archived version of a resource
• Analyze
o Details coming up
9
Sampling URIs
Web
1. DMOZ:Random
2. DMOZ:TLD - 2% of each
TLD from DMOZ (.com,
.org, .jp, etc 52 TLD)
3. DMOZ:Languages - 100
URIs for each Languages
(24 lang.)
Web Archives Full Text
4. Top 1-Gram from Bing
5. Top 1000 queries term
by Yahoo in 9 languages
User requests
6. IA Wayback Machine Log
files
7. Memento aggregator log files
10
Sampling URIs - DMOZ
1. DMOZ:Random
o 10,000 URIs randomly sampled from DMOZ directory (~5M URIs).
2. DMOZ:TLD - 2% for each TLD from DMOZ or 100 URIs
whichever is greater
o 52 TLDs (com 23,470) (de 6,332), (org 4,025), (uk 3,309), (net
2,073), (it 1,775), (jp 1379), (ru 1244), (fr 1154), (pl 1062), (au
764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319),
(cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149),
(tw 141), (za 118), (fi 113), ( 100 URIs for [ae, cat, cl, cu, eg, gov,
id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy,
zw])
3. DMOZ:Languages - 100 URIs for each language
o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic,
Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch,
Spanish, French, Greek, Hindi, Italian, Japanese, Korean,
Norwegian, Persian, Polish , Russian, Turkish, Ukrainian
11
• Query the fulltext search interface of select web archives
with two sets of query terms.
4. Top 1-Gram from Bing
o Most are English
5. Top 1000 query terms from Yahoo in 9 languages
o Excluding general keywords such as: Obama, Facebook.
12
Sampling URIs – Web Archives Full Text
Chinese
English
French
German
Italian
Japanese
Korean
Portuguese
Spanish
Yahoo Bing
rchivewithFullTextsearch
AIT 26 2066 3512 3837 3321 119 2 2434
214
1
12617 3953
BL 163 2354 2350 2240 2068 225 131 1940
205
6
6430 3187
CAN 49 800 804 646 601 77 113 580 514 1351 1107
CR 54 706 697 703 701 74 19 599 600 1599 1201
CZ 363 1782 1578 1695 1519 577 114 1310
127
8
6081 3360
CAT 28 2775 2496 2448 2280 209 129 2164
242
9
8996 4241
PO 91 2460 3603 3081 3113 53 69 3267
317
7
14126 5004
TW 357 178 176 165 157 106 7 198 119 1004 354
13
Sampling URIs – Web Archives Full Text
Chinese
English
French
German
Italian
Japanese
Korean
Portuguese
Spanish
Yahoo Bing
rchivewithFullTextsearch
AIT 26 2066 3512 3837 3321 119 2 2434
214
1
12617 3953
BL 163 2354 2350 2240 2068 225 131 1940
205
6
6430 3187
CAN 49 800 804 646 601 77 113 580 514 1351 1107
CR 54 706 697 703 701 74 19 599 600 1599 1201
CZ 363 1782 1578 1695 1519 577 114 1310
127
8
6081 3360
CAT 28 2775 2496 2448 2280 209 129 2164
242
9
8996 4241
PO 91 2460 3603 3081 3113 53 69 3267
317
7
14126 5004
TW 357 178 176 165 157 106 7 198 119 1004 354
14
Sampling URIs – Web Archives Full Text
Sampling URIs – User Requests
• Sampling from user requests for archived web resources
6. Sample from IA Wayback Machine Log files
o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26,
2012.
7. Sample from Memento Aggregator log files
o 100 URIs randomly sampled from LANL Memento Aggregator
between 2011 to 2013.
15
Archive Coverage per Sample
16
1
0
0
%
3
5
%
Entire Sample
TLD Coverage across Archives (1)
17
Entire Sample
TLD Coverage across Archives (2)
18
Entire Sample
TLD Distribution per Archive
19
DMOZ:TLD Sample
TLD Distribution per Archive
20
Web Archives Full Text Sample
Language Coverage per Archive
21
DMOZ Sample
Archive Growth Rate
22
Entire Sample
TLD Coverage across Archives
23
Entire Sample
Query Routing Evaluation
24
Conclusions
• Introduced automated methodology to profile web
archives using available infrastructure, no privileged
access
• Coverage:
o Internet Archive provides broad coverage
o National archives have good coverage for their domains
o Surprising coverage by certain archives
• Query Routing:
o In 84% of the cases, all existing Mementos for a TLD can be
found by using IA and two additional top archives for a TLD
o In 55% of the cases, all existing Mementos for a TLD can be
found by using the top 3 archives for a TLD, excluding IA
25

More Related Content

What's hot

Something about links
Something about linksSomething about links
Something about linksRoderic Page
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsJustin Brunelle
 
ResourceSync in 24x7
ResourceSync in 24x7ResourceSync in 24x7
ResourceSync in 24x7Simeon Warner
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
Apachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselApachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselNuxeo
 
Free Web Hosting Directory List
Free Web Hosting Directory ListFree Web Hosting Directory List
Free Web Hosting Directory ListAlice Marshal
 
Richard Wallis Linked Data
Richard Wallis Linked DataRichard Wallis Linked Data
Richard Wallis Linked DataIncisive_Events
 
An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)Cliff Landis
 

What's hot (8)

Something about links
Something about linksSomething about links
Something about links
 
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred RepresentationsScripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations
 
ResourceSync in 24x7
ResourceSync in 24x7ResourceSync in 24x7
ResourceSync in 24x7
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Apachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogriselApachecon 2011 stanbol_ogrisel
Apachecon 2011 stanbol_ogrisel
 
Free Web Hosting Directory List
Free Web Hosting Directory ListFree Web Hosting Directory List
Free Web Hosting Directory List
 
Richard Wallis Linked Data
Richard Wallis Linked DataRichard Wallis Linked Data
Richard Wallis Linked Data
 
An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)An Introduction to Linked Data for Librarians (2018-06-28)
An Introduction to Linked Data for Librarians (2018-06-28)
 

Viewers also liked

Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015Michael Nelson
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet ArchiveMichael Nelson
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better Michael Nelson
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesMichael Nelson
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Michael Nelson
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web ArchivesMichael Nelson
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Michael Nelson
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple ArchivesMichael Nelson
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesMichael Nelson
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesMichael Nelson
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingMichael Nelson
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?Michael Nelson
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web ArchivesMichael Nelson
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research ObjectYasmin AlNoamany, PhD
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web ArchivesMichael Nelson
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesMichael Nelson
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange ProjectMichael Nelson
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?Michael Nelson
 

Viewers also liked (20)

Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived Pages
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web Archives
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research Object
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?
 

Similar to Profiling Web Archive Coverage for Top-Level Domain and Content Language

TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesSawood Alam
 
Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Ahmed AlSum
 
2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekinge2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekingeProf. Wim Van Criekinge
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data ChallengeKnud Möller
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSawood Alam
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Semantic Technology for Development: Semantic Web without the Web?
Semantic Technology for Development: Semantic Web without the Web?Semantic Technology for Development: Semantic Web without the Web?
Semantic Technology for Development: Semantic Web without the Web?Victor de Boer
 
BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014William Ulate
 
Do MORe with your data
Do MORe with your dataDo MORe with your data
Do MORe with your datalocloud
 
Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)Dag Endresen
 
Bioinformatics t2-databases wim-vancriekinge_v2013
Bioinformatics t2-databases wim-vancriekinge_v2013Bioinformatics t2-databases wim-vancriekinge_v2013
Bioinformatics t2-databases wim-vancriekinge_v2013Prof. Wim Van Criekinge
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika! ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika! TheContentMine
 
ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013Ahmed AlSum
 
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...CIARD Movement
 
Semantic Web in an SMS as presented at EKAW2016
Semantic Web in an SMS as presented at EKAW2016Semantic Web in an SMS as presented at EKAW2016
Semantic Web in an SMS as presented at EKAW2016Victor de Boer
 

Similar to Profiling Web Archive Coverage for Top-Level Domain and Content Language (20)

TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
 
Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013Web Archiving Profile - WADL 2013
Web Archiving Profile - WADL 2013
 
2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekinge2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekinge
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data Challenge
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive Profiling
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
Bioinformatics t2-databases v2014
Bioinformatics t2-databases v2014Bioinformatics t2-databases v2014
Bioinformatics t2-databases v2014
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Semantic Technology for Development: Semantic Web without the Web?
Semantic Technology for Development: Semantic Web without the Web?Semantic Technology for Development: Semantic Web without the Web?
Semantic Technology for Development: Semantic Web without the Web?
 
BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014BHL Technical Director's Report, Mar. 2014
BHL Technical Director's Report, Mar. 2014
 
Do MORe with your data
Do MORe with your dataDo MORe with your data
Do MORe with your data
 
Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)
 
Bioinformatics t2-databases wim-vancriekinge_v2013
Bioinformatics t2-databases wim-vancriekinge_v2013Bioinformatics t2-databases wim-vancriekinge_v2013
Bioinformatics t2-databases wim-vancriekinge_v2013
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika! ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
 
ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013ArcLink - IIPC GA 2013
ArcLink - IIPC GA 2013
 
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
Accelerating your research with Microsoft Azure
Accelerating your research with Microsoft AzureAccelerating your research with Microsoft Azure
Accelerating your research with Microsoft Azure
 
Semantic Web in an SMS as presented at EKAW2016
Semantic Web in an SMS as presented at EKAW2016Semantic Web in an SMS as presented at EKAW2016
Semantic Web in an SMS as presented at EKAW2016
 

More from Michael Nelson

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Michael Nelson
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Michael Nelson
 

More from Michael Nelson (9)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 

Recently uploaded

Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneUiPathCommunity
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 

Recently uploaded (20)

Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

Profiling Web Archive Coverage for Top-Level Domain and Content Language

  • 1. Profiling Web Archive Coverage for Top-Level Domain &Content Language Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, Herbert Van de Sompel International Conference on Theory and Practice of Digital Libraries September 22-26, 2013 Valletta, Malta 1
  • 2. 2
  • 3. Where to find Mementos for … 3 http://www.japantimes.co.jp/
  • 4. Where to find Mementos for … 4 http://www.japantimes.co.jp/
  • 5. Where to find Mementos for … 5 http://www.google.com/
  • 6. Where to find Mementos for … 6 http://www.google.com/
  • 7. Research Question Problem • Profile public web archives according to the following dimensions: o Top-level domains o Languages o Growth rate o Archival date Motivation • To determine who is archiving what • To optimize the query routing for a Memento Aggregator 7
  • 8. Web Archives in this Experiment Full text URI-lookup Internet Archive √ Library of Congress √ Icelandic Web Archive √ Library and Archives Canada √ √ British Library √ √ UK National Library √ √ Portuguese Web Archive √ √ Web Archive of Catalonia √ √ Croatian Web Archive √ √ Archive of the Czech Web √ √ National Taiwan University √ √ Archive It √ √ 8
  • 9. Experiment Set Up • Sample URIs from different sources o Details coming up • Retrieve the TimeMap for each URI from all archives o A TimeMap lists all Mementos for a given URI o A Memento is an archived version of a resource • Analyze o Details coming up 9
  • 10. Sampling URIs Web 1. DMOZ:Random 2. DMOZ:TLD - 2% of each TLD from DMOZ (.com, .org, .jp, etc 52 TLD) 3. DMOZ:Languages - 100 URIs for each Languages (24 lang.) Web Archives Full Text 4. Top 1-Gram from Bing 5. Top 1000 queries term by Yahoo in 9 languages User requests 6. IA Wayback Machine Log files 7. Memento aggregator log files 10
  • 11. Sampling URIs - DMOZ 1. DMOZ:Random o 10,000 URIs randomly sampled from DMOZ directory (~5M URIs). 2. DMOZ:TLD - 2% for each TLD from DMOZ or 100 URIs whichever is greater o 52 TLDs (com 23,470) (de 6,332), (org 4,025), (uk 3,309), (net 2,073), (it 1,775), (jp 1379), (ru 1244), (fr 1154), (pl 1062), (au 764), (ca 642), (at 438), (edu 390), (cz 385), (tr 334), (info 319), (cn 278), (us 266), (nz 265), (es 238), (ar 213), (no 150), (br 149), (tw 141), (za 118), (fi 113), ( 100 URIs for [ae, cat, cl, cu, eg, gov, id, in, ir, is, ke, kr, ma, mt, mx, my, na, pe, pk, pt, sa, to, uy, zw]) 3. DMOZ:Languages - 100 URIs for each language o 24 languages: Icelandic, Portuguese, Catalan, Afrikaans, Arabic, Indonesian, Chinese (Simplified), Chinese (Traditional), Dutch, Spanish, French, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Persian, Polish , Russian, Turkish, Ukrainian 11
  • 12. • Query the fulltext search interface of select web archives with two sets of query terms. 4. Top 1-Gram from Bing o Most are English 5. Top 1000 query terms from Yahoo in 9 languages o Excluding general keywords such as: Obama, Facebook. 12 Sampling URIs – Web Archives Full Text
  • 13. Chinese English French German Italian Japanese Korean Portuguese Spanish Yahoo Bing rchivewithFullTextsearch AIT 26 2066 3512 3837 3321 119 2 2434 214 1 12617 3953 BL 163 2354 2350 2240 2068 225 131 1940 205 6 6430 3187 CAN 49 800 804 646 601 77 113 580 514 1351 1107 CR 54 706 697 703 701 74 19 599 600 1599 1201 CZ 363 1782 1578 1695 1519 577 114 1310 127 8 6081 3360 CAT 28 2775 2496 2448 2280 209 129 2164 242 9 8996 4241 PO 91 2460 3603 3081 3113 53 69 3267 317 7 14126 5004 TW 357 178 176 165 157 106 7 198 119 1004 354 13 Sampling URIs – Web Archives Full Text
  • 14. Chinese English French German Italian Japanese Korean Portuguese Spanish Yahoo Bing rchivewithFullTextsearch AIT 26 2066 3512 3837 3321 119 2 2434 214 1 12617 3953 BL 163 2354 2350 2240 2068 225 131 1940 205 6 6430 3187 CAN 49 800 804 646 601 77 113 580 514 1351 1107 CR 54 706 697 703 701 74 19 599 600 1599 1201 CZ 363 1782 1578 1695 1519 577 114 1310 127 8 6081 3360 CAT 28 2775 2496 2448 2280 209 129 2164 242 9 8996 4241 PO 91 2460 3603 3081 3113 53 69 3267 317 7 14126 5004 TW 357 178 176 165 157 106 7 198 119 1004 354 14 Sampling URIs – Web Archives Full Text
  • 15. Sampling URIs – User Requests • Sampling from user requests for archived web resources 6. Sample from IA Wayback Machine Log files o 1,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012. 7. Sample from Memento Aggregator log files o 100 URIs randomly sampled from LANL Memento Aggregator between 2011 to 2013. 15
  • 16. Archive Coverage per Sample 16 1 0 0 % 3 5 % Entire Sample
  • 17. TLD Coverage across Archives (1) 17 Entire Sample
  • 18. TLD Coverage across Archives (2) 18 Entire Sample
  • 19. TLD Distribution per Archive 19 DMOZ:TLD Sample
  • 20. TLD Distribution per Archive 20 Web Archives Full Text Sample
  • 21. Language Coverage per Archive 21 DMOZ Sample
  • 23. TLD Coverage across Archives 23 Entire Sample
  • 25. Conclusions • Introduced automated methodology to profile web archives using available infrastructure, no privileged access • Coverage: o Internet Archive provides broad coverage o National archives have good coverage for their domains o Surprising coverage by certain archives • Query Routing: o In 84% of the cases, all existing Mementos for a TLD can be found by using IA and two additional top archives for a TLD o In 55% of the cases, all existing Mementos for a TLD can be found by using the top 3 archives for a TLD, excluding IA 25