Finding Pages on the Unarchived Web 
Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, Arjen de Vries
University of Amsterdam, Centrum Wiskunde & Informatica
Presentation ACM/IEEE Digital Libraries conference 2014
Introduction 
• Web archives preserve the fast-changing Web
• However, they cannot capture the entire Web due to various limitations
• Recrawling “lost” webpages is impossible
• Would it be possible to recover parts of the unarchived Web?
Background & experimental setup
0.1 Background: Web archiving 
• Web archives, keepers of our future cultural heritage, are inherently incomplete
• e.g. due to limitations in crawling [Masanès06]
• However, crawlers do register additional information, e.g.
• page source, link structure, server metadata, timestamps, ..
• potentially usable for analytical purposes (e.g. [Rauber02])
0.1 Background: Link evidence and anchor text 
• Defining property of the Web: graph-based structure
• links: source, destination, anchor text
• Widely used in Web retrieval
• [e.g. Craswell01, Fujii08, Koolen10]
• Our approach
• inspired by previous results on Web-centric document representations
• Our use case: the Web archive
0.2 Data: Dutch Web Archive 
• National Library of the Netherlands (KB)
• Selective Web archive
• 2007-now
• 10+ Terabyte
• seedlist: 8000+ websites
• 25,000+ harvests
• Our focus: one year of data (2012)
0.2 Data: extraction and processing 
• extracting links from all pages: {destination URL, anchor text, hashcode src, crawldate}
• matching with seedlist, adding KB metadata
• deduplication (per year), to correct for harvesting frequencies
• cleaning and processing, e.g. URL normalization
• MySQL DB (13M rows)
• aggregation and data enrichment, e.g. filetypes, counts, ..
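The extraction, normalization and deduplication steps above could look roughly like the following sketch. It is not the authors' pipeline: the record fields (`src_hash`, `dest`, `anchor`, `crawldate`), the ISO date format, and the rules in `normalize_url` are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def normalize_url(url):
    """Assumed normalization: drop scheme, port, 'www.' prefix, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def make_link(source_html, dest, anchor, crawldate):
    """Build one link record; the source page is identified by an MD5 hash of its content."""
    return {
        "src_hash": hashlib.md5(source_html.encode("utf-8")).hexdigest(),
        "dest": normalize_url(dest),
        "anchor": anchor.strip(),
        "crawldate": crawldate,  # assumed ISO format, e.g. "2012-03-01"
    }


def deduplicate_per_year(links):
    """Keep the first record per (year, source hash, destination, anchor text)."""
    seen = set()
    unique = []
    for link in links:
        key = (link["crawldate"][:4], link["src_hash"], link["dest"], link["anchor"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique
```

Deduplicating on the source hash rather than the source URL means that a page harvested several times in one year contributes each of its links only once, which is the correction for harvesting frequency named above.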
Research Questions 
1. Can we recover a significant fraction of unarchived pages from references to them in the Web archive?
2. How rich are the representations that can be created for unarchived URLs?
3. Are the resulting derived representations of unarchived pages useful in practice? Do they capture enough of the unique page content to make them retrievable amongst millions of other pages?
1. Expanding the Web archive 
Can we recover a significant fraction of unarchived pages from references in the Web archive?
1.1 Archived content (2012) 
[Diagram: Dutch Web Archive, regions 1 and 2]
1. Contents in seedlist (2012)
• 10.2M unique pages
• 6,157 unique hosts
• 3,413 unique domains
• 16 TLDs
2. Contents not in seedlist (2012)
• 0.9M unique pages
• 37,166 unique hosts
• 30,367 unique domains
• 181 TLDs
1.2 Unarchived content: the aura 
• the aura of the Web archive
• pages not in the archive
• but whose existence can be derived from link evidence in the archive
• distinguishing
• inner aura (parent domain on the seedlist)
• outer aura (parent domain not on the seedlist)
[Diagram: Dutch Web Archive, regions 1 and 2]
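The inner/outer distinction can be sketched as a small classifier over link targets. This is an illustration only: `archived_urls` and `seed_domains` are hypothetical inputs, and taking the last two host labels as the "parent domain" is a simplifying assumption that fails for suffixes such as .co.uk (a public suffix list would be needed there).

```python
from urllib.parse import urlsplit


def parent_domain(url):
    """Assumed heuristic: the parent domain is the last two labels of the hostname."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def classify_target(url, archived_urls, seed_domains):
    """Label a link target as archived, inner aura, or outer aura."""
    if url in archived_urls:
        return "archived"
    if parent_domain(url) in seed_domains:
        return "inner aura"  # unarchived page, parent domain on the seedlist
    return "outer aura"      # unarchived page, parent domain not on the seedlist
```

For example, with `seed_domains = {"example.nl"}`, an unarchived `http://shop.example.nl/item` falls in the inner aura, while `http://example.org/about` falls in the outer aura.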
1.2 Unarchived content (2012) 
3. Inner aura
• 5.5M unique pages
• 9,039 unique hosts
• 3,019 unique domains
• 17 TLDs
4. Outer aura
• 5.2M unique pages
• 481,797 unique hosts
• 369,721 unique domains
• 100 TLDs
[Diagram: Dutch Web Archive, regions 1 to 4]
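1.3 Characterizing the Aura: TLD distribution
[Figure: two pie charts of TLD shares; legend: nl, com, org, net, other (inner aura) and nl, com, org, jp, net, other (outer aura)]
• Inner aura: mainly .nl content (slices of 96% and 2% shown)
• Outer aura: more mixed distribution (incl. .com, .org & .net); slices of 35%, 31%, 18%, 10%, 5% and 2% shown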
1.3 Characterizing the Aura: coverage Alexa top 100 
• Inner aura
• includes 7 of 100 most popular Dutch sites
• Outer aura
• includes 90 of 100 most popular Dutch sites (1.2M references)
[Figure: two bar charts (axis 0 to 280), one for twitter.com, facebook.com, linkedin.com, hyves.nl, google.com and one for nu.nl, wikipedia, blogspot.com, kvk.nl, anwb.nl]
1.4 Expanding the Web archive: summary 
• Recovered pages and hosts:
• Substantial amount
• roughly as many references to unarchived content (about 10.7M pages in the inner and outer aura) as pages in the archive (about 11.1M)
• Complementing sites in archive
• Indirect evidence of lost Webpages holds the potential to significantly expand the Web archive’s coverage
[Figure: bar chart of unarchived pages (M) and archived pages (M), axis 0 to 20]
2. Representations
How rich are the representations that can be created for unarchived URLs?
2.1 Representations of unarchived content: indegree 
• Characteristics of incoming links (indegree)
• All target representations: link from at least 1 unique page (based on MD5)
• 18% have at least 3 unique incoming links
• 10% have 5 links or more
[Figure: subset coverage (0% to 100%) against indegree (unique source pages, 1 to 10), for inner aura and outer aura]
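A coverage curve of this kind can be computed as the share of target URLs that have at least k unique source pages. The sketch below assumes links are given as hypothetical `(src_hash, dest)` pairs, with the source page identified by its content hash.

```python
from collections import defaultdict


def indegree_coverage(links, max_k=10):
    """Return {k: share of targets with at least k unique source pages}."""
    sources = defaultdict(set)
    for src_hash, dest in links:
        sources[dest].add(src_hash)  # a set ignores repeated links from the same page
    total = len(sources)
    if total == 0:
        return {k: 0.0 for k in range(1, max_k + 1)}
    return {
        k: sum(1 for srcs in sources.values() if len(srcs) >= k) / total
        for k in range(1, max_k + 1)
    }
```

By construction the value at k = 1 is 1.0 for any non-empty input, matching the observation that every target in the aura has at least one unique incoming link.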
2.2 Representations: anchor text distribution 
• Further inspecting the richness: number of unique words
• 95% have 1 unique word or more
• christinaconcours.nl: concertagenda (5)
• 30% have 3 unique words or more
• watou2009.be: watou (3) collection (2) stories (2)
• 3% have 10 words or more
• jos.rotterdam.nl: jos (4) rotterdam (3) society (2) service (2) youth (2) and (2) education (2) wwwjosrotterdamnl (1) municipality (1) governance (1)
[Figure: subset coverage (0% to 100%) against unique word count (anchor text, 0 to 15), for inner aura and outer aura]
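Word counts like the ones in these examples can be produced by aggregating all anchor texts that point to the same target. The sketch below is an assumption about the tokenization, not the authors' code: it strips punctuation before splitting on whitespace, which is one way an anchor such as "www.jos.rotterdam.nl" would end up as the single token "wwwjosrotterdamnl".

```python
import re
from collections import Counter, defaultdict


def anchor_words(text):
    """Lowercase, remove punctuation, split on whitespace (assumed tokenization)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def aggregate_anchor_text(links):
    """Map each destination to a Counter of the words in all its anchor texts."""
    representations = defaultdict(Counter)
    for dest, anchor in links:
        representations[dest].update(anchor_words(anchor))
    return representations
```

The number of unique words for a target is then `len(representations[dest])`.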
2.3 Representations: homepages & non-homepages 
• Anchors often refer to homepages [e.g. Craswell01]
• In our dataset: homepages for 336K of 481K hosts (69.8%)
• homepage: vakcentrum.nl (6 unique anchors)
• non-homepage: nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/ (2 unique anchors; combine with URL words)
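For non-homepages with few anchors, the URL itself carries descriptive words. A minimal sketch of extracting such "URL words" follows; the splitting rule (any run of characters other than ASCII letters and digits is a separator) is an assumption for illustration.

```python
import re


def url_words(url):
    """Split a URL into lowercase words, dropping the scheme if present."""
    url = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower())
    return [word for word in re.split(r"[^a-z0-9]+", url) if word]
```

Applied to the non-homepage example above, this yields words such as "dutch", "students", "study", "grants" and "loans", which can be added to the two anchor texts.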
2.4 Representations of unarchived content: summary 
• Richness of representations:
• Results mixed:
• skewed distribution
• majority of pages: relatively sparse descriptions
• minority of pages: relatively rich descriptions
• Are the representations rich enough to characterize the page’s contents?
3. Finding Unarchived Pages
Are the representations of unarchived pages useful in practice?
3.1 Finding unarchived pages: evaluation setup 
• Indexed 5.19M representations of unarchived content (outer aura)
• three indexes:
• anchT: aggregated anchor text
• urlW: only URL words
• anchT urlW: both
• Stratified sample: 500 homepages & 500 non-homepages
• Pages (if available via IA / live Web) consulted by two annotators
• creating known-item topics (150 per category)
• inspect target page
• write down query for refinding (without knowledge of anchor text)
• result: 300 queries (~5-7 words)
3.2 Evaluation: results 
• Mean Reciprocal Rank (MRR)
• average, over all queries, of the score of the first correct result
• score: 1/rank
• Results:
• homepages score better for anchor text representations
• URL words representation better for non-homepages
• combined representation improves MRR score for both
• average close to 0.5: a score of 0.5 corresponds to the correct result at the 2nd rank
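The MRR definition above can be written out as a short sketch. Each run is assumed to be a hypothetical `(ranked_urls, target)` pair: the ranked result list for one known-item query and the single correct page. Queries whose target is not retrieved score 0.

```python
def reciprocal_rank(ranked_urls, target):
    """Return 1/rank of the target in the ranking, or 0.0 if it is absent."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over all (ranked_urls, target) runs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, target) for ranked, target in runs) / len(runs)
```

Note that an MRR of 0.5 does not mean every query finds its target at rank 2: a mix of rank-1 hits and misses produces the same average.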
3.2 Evaluation: results 
• Success Rate @10: correct target page in top 10
• Similar to MRR results:
• homepages score better for anchor text
• non-homepages score better for URL words representations
• On average, 59.7% of the correct homepages and non-homepages can be retrieved in the top 10
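Success rate at 10 is the fraction of queries whose target page appears among the first ten results. A minimal sketch, again assuming hypothetical `(ranked_urls, target)` pairs per query:

```python
def success_at_k(runs, k=10):
    """Fraction of runs whose target appears in the top k of the ranking."""
    if not runs:
        return 0.0
    hits = sum(1 for ranked, target in runs if target in ranked[:k])
    return hits / len(runs)
```

Unlike MRR, this measure ignores where in the top 10 the target appears, so it is less sensitive to small ranking differences.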
3.3 Evaluation: Impact of indegree 
• Impact of incoming links on richness of representations
• cer.org.uk (5 anchor words)
• actionaid.org/kenya (1 anchor word)
3.3 Evaluation: Impact of indegree (unique hosts)
• Again, skewed nature:
• 251 out of 300 pages (84%) have links from 1 source
• 49 pages (16%) have links from 2 or more sources
• Higher indegree (unique hosts) results in a rise in
• mean word count
• MRR
• proportion of homepages
3.4 Finding unarchived pages: summary
• Usefulness in practice
• Critical test: known-item finding
• Generally positive results
• Unavailability of pages strengthens the potential utility of the representations:
• 20.1% of homepages and 45.4% of non-homepages are not available via the live Web or the Internet Archive
4. Conclusion and Discussion
4.1 Conclusions 
• Approach to recover significant parts of the unarchived Web
• by reconstructing descriptions based on link evidence
1. Evidence of a high number of unarchived pages
• potentially increasing archive coverage
2. Skewed distribution of generated descriptions
• popular pages have more terms
• richness tapers off quickly
3. Succinct representation generally rich enough to identify pages
• in a known-item search setting
4.2 Future work & Discussion
• Representations could be useful in research and institutional contexts, e.g.
• helping to assess the completeness of the archive
• extending seedlists for selection-based archives
• potential representation of popular unarchived sites, excluded from archiving
• Potentially enrich Web archive systems with contextual information
• Aggregation per year: refine and extend to longitudinal case
• Assessing the impact of crawling strategies
• Incorporating additional contextual information
• e.g. text surrounding anchors
• Optimally weigh all sources of evidence, using advanced retrieval models
Acknowledgements 
• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.
• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
References 
• [Craswell01] N. Craswell, D. Hawking, and S. Robertson, “Effective site finding using link anchor information,” in SIGIR. ACM, 2001, pp. 250–257.
• [Fujii08] A. Fujii, “Modeling anchor text and classifying queries to enhance web document retrieval,” in WWW, J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, Eds. ACM, 2008, pp. 337–346.
• [Kamps06] J. Kamps, “Web-centric language models,” in CIKM, O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, and W. Teiken, Eds. ACM, 2005, pp. 307–308.
• [Masanès06] J. Masanès, Web archiving. Springer, 2006.
• [Koolen10] M. Koolen and J. Kamps, “The importance of anchor text for ad hoc search revisited,” in SIGIR, F. Crestani, S. Marchand-Maillet, H.-H. Chen, E. N. Efthimiadis, and J. Savoy, Eds. ACM, 2010, pp. 122–129.
• [Rauber02] A. Rauber, R. M. Bruckner, A. Aschenbrenner, O. Witvoet, and M. Kaiser, “Uncovering information hidden in web archives: A glimpse at web analysis building on data warehouses,” D-Lib Magazine, vol. 8, no. 12, 2002.
• [Unesco03] UNESCO, “Charter on the preservation of digital heritage (article 3.4),” 2003.
Finding Pages on the Unarchived Web 
Hugo Huurdeman, Anat Ben-David, Jaap Kamps 
Thaer Samar, Arjen de Vries" 
" 
University of Amsterdam, Centrum Wiskunde & Informatica" 
" 
" 
"

More Related Content

What's hot

Best Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web ArchivingBest Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web Archiving
OCLC
 
Data Designed for Discovery
Data Designed for DiscoveryData Designed for Discovery
Data Designed for Discovery
OCLC
 
Gonzalez-8-jun15
Gonzalez-8-jun15Gonzalez-8-jun15
Gary Price, MIT Program on Information Science
Gary Price, MIT Program on Information ScienceGary Price, MIT Program on Information Science
Gary Price, MIT Program on Information Science
Micah Altman
 
Let's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library SystemLet's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library System
WiLS
 
Best Practices for Descriptive Metadata
Best Practices for Descriptive MetadataBest Practices for Descriptive Metadata
Best Practices for Descriptive Metadata
OCLC
 
Linked Data Implementations—Who, What and Why?
Linked Data Implementations—Who, What and Why?Linked Data Implementations—Who, What and Why?
Linked Data Implementations—Who, What and Why?
OCLC
 
The Future of Finding: Resource Discovery @ The University of Oxford
The Future of Finding: Resource Discovery @ The University of OxfordThe Future of Finding: Resource Discovery @ The University of Oxford
The Future of Finding: Resource Discovery @ The University of Oxford
Christine Madsen
 
An Open Context for Archaeology
An Open Context for ArchaeologyAn Open Context for Archaeology
An Open Context for Archaeology
guest756e05
 
Ir1
Ir1Ir1
The library in the life of the user
The library in the life of the userThe library in the life of the user
The library in the life of the user
lisld
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
Anja Jentzsch
 
BIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering EvidenceBIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering Evidence
OCLC
 
Lauruhn-5-jun15
Lauruhn-5-jun15Lauruhn-5-jun15
Exploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadataExploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadata
Shenghui Wang
 
20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...
20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...
20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...
Andrew Bourgeois
 
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
kulibrarians
 
Library Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic ControlLibrary Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic Control
University of Toronto Libraries - Information Technology Services
 
FAST Update
FAST UpdateFAST Update
FAST Update
OCLC
 
Role of libraries in research and scholarly communication
Role of libraries in research and scholarly communicationRole of libraries in research and scholarly communication
Role of libraries in research and scholarly communication
Nikesh Narayanan
 

What's hot (20)

Best Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web ArchivingBest Practices for Descriptive Metadata for Web Archiving
Best Practices for Descriptive Metadata for Web Archiving
 
Data Designed for Discovery
Data Designed for DiscoveryData Designed for Discovery
Data Designed for Discovery
 
Gonzalez-8-jun15
Gonzalez-8-jun15Gonzalez-8-jun15
Gonzalez-8-jun15
 
Gary Price, MIT Program on Information Science
Gary Price, MIT Program on Information ScienceGary Price, MIT Program on Information Science
Gary Price, MIT Program on Information Science
 
Let's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library SystemLet's Get Visible! with Karla Smith, Winnefox Library System
Let's Get Visible! with Karla Smith, Winnefox Library System
 
Best Practices for Descriptive Metadata
Best Practices for Descriptive MetadataBest Practices for Descriptive Metadata
Best Practices for Descriptive Metadata
 
Linked Data Implementations—Who, What and Why?
Linked Data Implementations—Who, What and Why?Linked Data Implementations—Who, What and Why?
Linked Data Implementations—Who, What and Why?
 
The Future of Finding: Resource Discovery @ The University of Oxford
The Future of Finding: Resource Discovery @ The University of OxfordThe Future of Finding: Resource Discovery @ The University of Oxford
The Future of Finding: Resource Discovery @ The University of Oxford
 
An Open Context for Archaeology
An Open Context for ArchaeologyAn Open Context for Archaeology
An Open Context for Archaeology
 
Ir1
Ir1Ir1
Ir1
 
The library in the life of the user
The library in the life of the userThe library in the life of the user
The library in the life of the user
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
BIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering EvidenceBIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering Evidence
 
Lauruhn-5-jun15
Lauruhn-5-jun15Lauruhn-5-jun15
Lauruhn-5-jun15
 
Exploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadataExploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadata
 
20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...
20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...
20161019-dlc-making-it-happen-together-demonstrating-resilience-thru-successf...
 
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
 
Library Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic ControlLibrary Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic Control
 
FAST Update
FAST UpdateFAST Update
FAST Update
 
Role of libraries in research and scholarly communication
Role of libraries in research and scholarly communicationRole of libraries in research and scholarly communication
Role of libraries in research and scholarly communication
 

Viewers also liked

@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
Michael Nelson
 
Web Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives UnleashedWeb Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives Unleashed
mwe400
 
JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?
LulwahMA
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Sawood Alam
 
You can you up - Chinoiserie - 60 words about China
You can you up - Chinoiserie - 60 words about ChinaYou can you up - Chinoiserie - 60 words about China
You can you up - Chinoiserie - 60 words about China
Pengyuan Zhao
 
Viacom_Jenn Lim_New York
Viacom_Jenn Lim_New YorkViacom_Jenn Lim_New York
Viacom_Jenn Lim_New York
Delivering Happiness
 
[Infografik] Talent Trends Report 2015
[Infografik] Talent Trends Report 2015[Infografik] Talent Trends Report 2015
[Infografik] Talent Trends Report 2015
LinkedIn D-A-CH
 
Freshdesk Arcade - Gamify Your Helpdesk
Freshdesk Arcade - Gamify Your HelpdeskFreshdesk Arcade - Gamify Your Helpdesk
Freshdesk Arcade - Gamify Your Helpdesk
Freshdesk Inc.
 
Art for Literacy's Sake
Art for Literacy's SakeArt for Literacy's Sake
Art for Literacy's Sake
Katie Carmichael
 
Creating Innovation in Schools
Creating Innovation in SchoolsCreating Innovation in Schools
Creating Innovation in Schools
Rafael Parente
 
500’s Demo Day Batch 16 >> Many Chat
500’s Demo Day Batch 16 >>  Many Chat500’s Demo Day Batch 16 >>  Many Chat
500’s Demo Day Batch 16 >> Many Chat
500 Startups
 
Asuhan Keperawatan Pada Kanker Tulang
Asuhan Keperawatan Pada Kanker TulangAsuhan Keperawatan Pada Kanker Tulang
Asuhan Keperawatan Pada Kanker Tulang
pjj_kemenkes
 
Micro Interactions
Micro InteractionsMicro Interactions
Micro Interactions
David Armano
 
The shaping of the earth´s relief.ppt
The shaping of the earth´s relief.pptThe shaping of the earth´s relief.ppt
The shaping of the earth´s relief.ppt
davmartse
 
Five Things Senior Living Needs to Rethink - Steve Moran, Senior Housing Forum
Five Things Senior Living Needs to Rethink - Steve Moran, Senior Housing ForumFive Things Senior Living Needs to Rethink - Steve Moran, Senior Housing Forum
Five Things Senior Living Needs to Rethink - Steve Moran, Senior Housing Forum
Healthcare Network marcus evans
 
2. creation
2. creation2. creation
2. creation
Bafowethu Mavule
 
Foundations of Strategic Competitiveness
Foundations of Strategic CompetitivenessFoundations of Strategic Competitiveness
Foundations of Strategic Competitiveness
drnurhizam
 
ROUSSEAU, Henri, Featured Paintings in Detail (2)
ROUSSEAU, Henri, Featured Paintings in Detail (2)ROUSSEAU, Henri, Featured Paintings in Detail (2)
ROUSSEAU, Henri, Featured Paintings in Detail (2)
guimera
 
9 ways to make the wrong impression on your first day
9 ways to make the wrong impression on your first day9 ways to make the wrong impression on your first day
9 ways to make the wrong impression on your first day
CAREEREALISM
 

Viewers also liked (19)

@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Web Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives UnleashedWeb Archives and Data Challenges - Archives Unleashed
Web Archives and Data Challenges - Archives Unleashed
 
JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?JCDL2015: How Well are Arabic Websites Archived?
JCDL2015: How Well are Arabic Websites Archived?
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
You can you up - Chinoiserie - 60 words about China
You can you up - Chinoiserie - 60 words about ChinaYou can you up - Chinoiserie - 60 words about China
You can you up - Chinoiserie - 60 words about China
 
Viacom_Jenn Lim_New York
Viacom_Jenn Lim_New YorkViacom_Jenn Lim_New York
Viacom_Jenn Lim_New York
 
[Infografik] Talent Trends Report 2015
[Infografik] Talent Trends Report 2015[Infografik] Talent Trends Report 2015
[Infografik] Talent Trends Report 2015
 
Freshdesk Arcade - Gamify Your Helpdesk
Freshdesk Arcade - Gamify Your HelpdeskFreshdesk Arcade - Gamify Your Helpdesk
Freshdesk Arcade - Gamify Your Helpdesk
 
Art for Literacy's Sake
Art for Literacy's SakeArt for Literacy's Sake
Art for Literacy's Sake
 
Creating Innovation in Schools
Creating Innovation in SchoolsCreating Innovation in Schools
Creating Innovation in Schools
 
500’s Demo Day Batch 16 >> Many Chat
500’s Demo Day Batch 16 >>  Many Chat500’s Demo Day Batch 16 >>  Many Chat
500’s Demo Day Batch 16 >> Many Chat
 
Asuhan Keperawatan Pada Kanker Tulang
Asuhan Keperawatan Pada Kanker TulangAsuhan Keperawatan Pada Kanker Tulang
Asuhan Keperawatan Pada Kanker Tulang
 
Micro Interactions
Micro InteractionsMicro Interactions
Micro Interactions
 
The shaping of the earth´s relief.ppt
The shaping of the earth´s relief.pptThe shaping of the earth´s relief.ppt
The shaping of the earth´s relief.ppt
 
Five Things Senior Living Needs to Rethink - Steve Moran, Senior Housing Forum
Five Things Senior Living Needs to Rethink - Steve Moran, Senior Housing ForumFive Things Senior Living Needs to Rethink - Steve Moran, Senior Housing Forum
Five Things Senior Living Needs to Rethink - Steve Moran, Senior Housing Forum
 
2. creation
2. creation2. creation
2. creation
 
Foundations of Strategic Competitiveness
Foundations of Strategic CompetitivenessFoundations of Strategic Competitiveness
Foundations of Strategic Competitiveness
 
ROUSSEAU, Henri, Featured Paintings in Detail (2)
ROUSSEAU, Henri, Featured Paintings in Detail (2)ROUSSEAU, Henri, Featured Paintings in Detail (2)
ROUSSEAU, Henri, Featured Paintings in Detail (2)
 
9 ways to make the wrong impression on your first day
9 ways to make the wrong impression on your first day9 ways to make the wrong impression on your first day
9 ways to make the wrong impression on your first day
 

Similar to Finding Pages on the Unarchived Web (DL 2014)

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
Roxanne Missingham
 
DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
Harini Sirisena
 
On building a search interface discovery system
On building a search interface discovery systemOn building a search interface discovery system
On building a search interface discovery system
Denis Shestakov
 
Library Mashups & APIs
Library Mashups & APIsLibrary Mashups & APIs
Library Mashups & APIs
librarywebchic
 
Google Paper
Google Paper Google Paper
Google Paper
girish1m
 
Analyzing Web Archives
Analyzing Web ArchivesAnalyzing Web Archives
Analyzing Web Archives
vinaygo
 
Federated to library discovery platfoms
Federated to library discovery platfomsFederated to library discovery platfoms
Federated to library discovery platfoms
Nikesh Narayanan
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
NekoGato
 
Arabic Text mining Classification
Arabic Text mining Classification Arabic Text mining Classification
Arabic Text mining Classification
Zakaria Zubi
 
Leveraging Library Thing (2009)
Leveraging Library Thing (2009)Leveraging Library Thing (2009)
Leveraging Library Thing (2009)
Niamh Walker-Headon
 
Pandora
PandoraPandora
Websrc~1
Websrc~1Websrc~1
Websrc~1
Ram Dutt Shukla
 
The commitment of arabic sites in the field of libraries and information that...
The commitment of arabic sites in the field of libraries and information that...The commitment of arabic sites in the field of libraries and information that...
The commitment of arabic sites in the field of libraries and information that...
Alexander Decker
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
ScrbifPt
 
Content Management and Page Structure for SharePoint
Content Management and Page Structure for SharePointContent Management and Page Structure for SharePoint
Content Management and Page Structure for SharePoint
D'arce Hess
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
Stefanos Anastasiadis
 
Online Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and MuseumsOnline Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and Museums
mherbison
 
Web Mining
Web MiningWeb Mining
Web Mining
Mudit Dholakia
 
Web mining
Web miningWeb mining
Web mining
Innovative Pencils
 
Increasing the findability of digital heritage documents by using Search Engi...
Increasing the findability of digital heritage documents by using Search Engi...Increasing the findability of digital heritage documents by using Search Engi...
Increasing the findability of digital heritage documents by using Search Engi...
Andrea Hrckova
 

Similar to Finding Pages on the Unarchived Web (DL 2014) (20)

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
 
On building a search interface discovery system
On building a search interface discovery systemOn building a search interface discovery system
On building a search interface discovery system
 
Library Mashups & APIs
Library Mashups & APIsLibrary Mashups & APIs
Library Mashups & APIs
 
Google Paper
Google Paper Google Paper
Google Paper
 
Analyzing Web Archives
Analyzing Web ArchivesAnalyzing Web Archives
Analyzing Web Archives
 
Federated to library discovery platfoms
Federated to library discovery platfomsFederated to library discovery platfoms
Federated to library discovery platfoms
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Arabic Text mining Classification
Arabic Text mining Classification Arabic Text mining Classification
Arabic Text mining Classification
 
Leveraging Library Thing (2009)
Leveraging Library Thing (2009)Leveraging Library Thing (2009)
Leveraging Library Thing (2009)
 
Pandora
PandoraPandora
Pandora
 
Websrc~1
Websrc~1Websrc~1
Websrc~1
 
The commitment of arabic sites in the field of libraries and information that...
The commitment of arabic sites in the field of libraries and information that...The commitment of arabic sites in the field of libraries and information that...
The commitment of arabic sites in the field of libraries and information that...
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
 
Content Management and Page Structure for SharePoint
Content Management and Page Structure for SharePointContent Management and Page Structure for SharePoint
Content Management and Page Structure for SharePoint
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technology
 
Online Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and MuseumsOnline Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and Museums
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web mining
Web miningWeb mining
Web mining
 
Increasing the findability of digital heritage documents by using Search Engi...
Increasing the findability of digital heritage documents by using Search Engi...Increasing the findability of digital heritage documents by using Search Engi...
Increasing the findability of digital heritage documents by using Search Engi...
 

Finding Pages on the Unarchived Web (DL 2014)

  • 1. Finding Pages on the Unarchived Web. Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, Arjen de Vries. University of Amsterdam, Centrum Wiskunde & Informatica. Presentation, ACM/IEEE Digital Libraries conference 2014
  • 2. Introduction • Web archives preserve the fast-changing Web • However, they cannot capture the entire Web due to various limitations • Recrawling “lost” webpages is impossible • Would it be possible to recover parts of the unarchived Web?
  • 4. 0.1 Background: Web archiving • Web archives, keepers of our future cultural heritage, are inherently incomplete • e.g. due to limitations in crawling [Masanès06] • However, crawlers do register additional information, e.g. • page source, link structure, server metadata, timestamps, ... • potentially usable for analytical purposes (e.g. [Rauber02])
  • 5. 0.1 Background: Link evidence and anchor text • Defining property of the Web: graph-based structure • links: source, destination, anchor text • Widely used in Web retrieval • [e.g. Craswell01, Fujii08, Koolen10] • Our approach • inspired by previous results on Web-centric document representations • Our use case: the Web archive
  • 6. 0.2 Data: Dutch Web Archive • National Library of the Netherlands (KB) • Selective Web archive • 2007-now • 10+ terabytes • seedlist: 8000+ websites • 25,000+ harvests • Our focus: one year of data (2012)
  • 7. 0.2 Data: extraction and processing • extracting links from all pages: {destination URL, anchor text, hashcode src, crawldate} • matching with seedlist • adding KB metadata • deduplication (per year), to correct for harvesting frequencies • cleaning and processing, e.g. URL normalization • MySQL DB (13M rows) • aggregation and data enrichment, e.g. filetypes, counts, ...
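
The cleaning and deduplication steps on this slide could look roughly like the sketch below. It is only an illustration: the slides do not say which normalization rules were applied or which fields formed the deduplication key, so `normalize_url`, `deduplicate_links`, and the choice of key (normalized destination, lowercased anchor text, source hash, crawl year) are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource (assumed rule).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop default port, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def deduplicate_links(links):
    """Keep one record per (destination, anchor, source hash, year).

    `links` is an iterable of (destination URL, anchor text, source hash,
    crawl date as 'YYYY-MM-DD'). Collapsing to the year is a stand-in for the
    per-year deduplication mentioned on the slide.
    """
    seen = set()
    unique = []
    for dest, anchor, src_hash, crawl_date in links:
        key = (normalize_url(dest), anchor.strip().lower(), src_hash, crawl_date[:4])
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With this key, a page that was harvested many times in 2012 and always carried the same link contributes a single record, which is one way to correct for differences in harvesting frequency.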
  • 8. Research Questions 1. Can we recover a significant fraction of unarchived pages from references to them in the Web archive? 2. How rich are the representations that can be created for unarchived URLs? 3. Are the resulting derived representations of unarchived pages useful in practice? Do they capture enough of the unique page content to make them retrievable amongst millions of other pages?
  • 9. 1. Expanding the Web archive Can we recover a significant fraction of unarchived pages from references in the Web archive?
  • 10. 1.1 Archived content (2012) [Diagram: Dutch Web Archive, regions 1 and 2] 1. Contents in seedlist (2012) • 10.2M unique pages • 6,157 unique hosts • 3,413 unique domains • 16 TLDs 2. Contents not in seedlist (2012) • 0.9M unique pages • 37,166 unique hosts • 30,367 unique domains • 181 TLDs
  • 11. 1.2 Unarchived content: the aura • the aura of the web archive • pages not in the archive • but whose existence can be derived from link evidence in the archive • distinguishing • inner aura (parent domain on the seedlist) • outer aura (parent domain not on the seedlist) [Diagram: Dutch Web Archive, regions 1 and 2]
  • 12. 1.2 Unarchived content (2012) 3. Inner aura • 5.5M unique pages • 9,039 unique hosts • 3,019 unique domains • 17 TLDs 4. Outer aura • 5.2M unique pages • 481,797 unique hosts • 369,721 unique domains • 100 TLDs [Diagram: Dutch Web Archive, regions 1 to 4]
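
The inner/outer distinction can be sketched as a small classifier. This is a hypothetical illustration, not the authors' code: `parent_domain` naively takes the last two host labels (which is wrong for suffixes such as `.co.uk`), and the sets `archived_urls` and `seed_domains` are assumed inputs.

```python
from urllib.parse import urlsplit


def parent_domain(url: str) -> str:
    """Naive parent domain: the last two labels of the host (assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def classify(url: str, archived_urls: set, seed_domains: set) -> str:
    """Label a link target as archived, inner aura or outer aura."""
    if url in archived_urls:
        return "archived"
    if parent_domain(url) in seed_domains:
        return "inner aura"  # unarchived, parent domain on the seedlist
    return "outer aura"      # unarchived, parent domain not on the seedlist
```

For example, with `seed_domains = {"kb.nl"}`, an unarchived `http://www.kb.nl/missing` falls in the inner aura, while `http://twitter.com/kb` falls in the outer aura.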
  • 13. 1.3 Characterizing the Aura: TLD distribution [Pie charts: TLD distribution of the inner aura (categories: nl, com, org, net, other; slice values as extracted: 2%, 96%) and of the outer aura (categories: nl, com, org, jp, net, other; slice values as extracted: 10%, 2%, 18%, 5%, 31%, 35%)] • Inner aura: mainly .nl content • Outer aura: more mixed distribution (incl. .com, .org & .net)
  • 14. 1.3 Characterizing the Aura: coverage of the Alexa top 100 • Inner aura • includes 7 of the 100 most popular Dutch sites • Outer aura • includes 90 of the 100 most popular Dutch sites (1.2M references) [Two bar charts, axis 0 to 280: one listing twitter.com, facebook.com, linkedin.com, hyves.nl, google.com; the other listing nu.nl, wikipedia, blogspot.com, kvk.nl, anwb.nl]
  • 15. 1.4 Expanding the Web archive: summary • Recovered pages and hosts: • Substantial amount • as many references to unarchived content as pages in the archive • Complementing sites in the archive • Indirect evidence of lost webpages holds the potential to significantly expand the Web archive’s coverage [Bar chart: unarchived pages vs. archived pages, in millions, axis 0 to 20]
  • 16. 2. Representations: How rich are the representations that can be created for unarchived URLs?
  • 17. 2.1 Representations of unarchived content: indegree • Characteristics of incoming links (indegree) • All target representations: link from at least 1 unique page (based on MD5) • 18% have at least 3 unique incoming links • 10% have 5 links or more [Chart: subset coverage (0% to 100%) by indegree (unique source pages, 1 to 10), for inner aura and outer aura]
  • 18. 2.2 Representations: anchor text distribution • Further inspecting the richness: number of unique words • 95% have 1 unique word or more • christinaconcours.nl: concertagenda (5) • 30% have 3 unique words or more • watou2009.be: watou (3) collection (2) stories (2) • 3% have 10 words or more • jos.rotterdam.nl: society (2) service (2) youth (2) and (2) education (2) wwwjosrotterdamnl (1) municipality (1) governance (1) jos (4) rotterdam (3) [Chart: subset coverage (0% to 100%) by unique word count in anchor text (0 to 15), for inner aura and outer aura]
  • 19. 2.3 Representations: homepages & non-homepages • Anchors often refer to homepages [e.g. Craswell01] • In our dataset: homepages for 336K of 481K hosts (69.8%) • homepage: vakcentrum.nl (6 unique anchors) • non-homepage: nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/ (2 unique anchors; combine with URL words)
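
One way to picture "combine with URL words" is to build term counts from both the aggregated anchor text and the tokens of the URL itself. The sketch below is an assumption about how such a representation might be assembled; the tokenization rules and the function names `url_words` and `build_representation` are hypothetical, while the field names `anchT`, `urlW` and `anchTUrlW` follow the index names used later in the evaluation setup.

```python
import re
from collections import Counter


def url_words(url: str) -> list:
    """Split a URL into lowercase alphanumeric tokens (ASCII only, scheme dropped)."""
    without_scheme = re.sub(r"^[a-z]+://", "", url.lower())
    return [w for w in re.split(r"[^a-z0-9]+", without_scheme) if w]


def build_representation(url: str, anchors: list) -> dict:
    """Term counts from anchor text, from URL words, and from both combined."""
    anchor_terms = Counter(
        word for anchor in anchors for word in re.findall(r"\w+", anchor.lower())
    )
    url_terms = Counter(url_words(url))
    return {
        "anchT": anchor_terms,                  # aggregated anchor text only
        "urlW": url_terms,                      # URL words only
        "anchTUrlW": anchor_terms + url_terms,  # both sources combined
    }
```

For a deep page with few anchors, such as the nesomexico.org example, the URL path alone contributes terms like `study`, `grants` and `loans`, which is why URL words can compensate for sparse anchor text.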
  • 20. 2.4 Representations of unarchived content: summary • Richness of representations: • Results mixed: • skewed distribution • majority of pages: relatively sparse descriptions • minority of pages: relatively rich descriptions • Are the representations rich enough to characterize the page’s contents?
  • 21. 3. Finding Unarchived Pages: Are the representations of unarchived pages useful in practice?
  • 22. 3.1 Finding unarchived pages: evaluation setup • Indexed 5.19M representations of unarchived content (outer aura) • three indexes: anchT (aggregated anchor text only), urlW (URL words), anchTUrlW (both) • Stratified sample: 500 homepages & 500 non-homepages • Pages (if available via IA / live Web) consulted by two annotators • creating known-item topics (150 per category) • inspect target page • write down query for refinding (without knowledge of anchor text) • result: 300 queries (~5-7 words)
  • 23. 3.2 Evaluation: results • Mean Reciprocal Rank (MRR) • average score of the first correct result for each query • score: 1/rank • Results: • homepages score better for anchor text representations • URL words representation is better for non-homepages • combined representation improves MRR score for both • average close to 0.5: in the average case, the correct result is at rank 2
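
As a reminder of how the metric on this slide is computed, here is a minimal sketch of MRR for known-item search, where each query has exactly one correct target. The data layout (a list of `(ranking, target)` pairs) and the function names are assumptions made for illustration.

```python
def reciprocal_rank(ranking: list, target: str) -> float:
    """Return 1/rank of the target in the ranking, or 0.0 if it is absent."""
    for rank, doc in enumerate(ranking, start=1):
        if doc == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list) -> float:
    """Average the reciprocal rank over all (ranking, target) pairs."""
    return sum(reciprocal_rank(ranking, target) for ranking, target in runs) / len(runs)
```

A query whose target is returned first scores 1, second scores 0.5, and a query that misses the target entirely scores 0, so an MRR near 0.5 mixes many top-ranked hits with some misses rather than meaning every target sits at rank 2.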
  • 24. 3.2 Evaluation: results • Success Rate @10: correct target page in top 10 • Similar to MRR results: • homepages score better for anchor text • non-homepages score better for URL words representations • On average, 59.7% of the correct homepages and non-homepages can be retrieved in the top 10
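
Success Rate @10 is the fraction of queries whose target appears among the first ten results. A minimal sketch, assuming rankings are given as lists of page identifiers paired with the known target (a hypothetical layout):

```python
def success_at_k(runs: list, k: int = 10) -> float:
    """Fraction of (ranking, target) pairs with the target in the top k."""
    hits = sum(1 for ranking, target in runs if target in ranking[:k])
    return hits / len(runs)
```

Unlike a reciprocal-rank score, this measure does not reward a target at rank 1 more than one at rank 10; it only records whether a user scanning the first results page would find the page at all.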
  • 25. 3.3 Evaluation: Impact of indegree • Impact of incoming links on richness of representations • Examples: cer.org.uk (5 anchor words), actionaid.org/kenya (1 anchor word)
  • 26. 3.3 Evaluation: Impact of indegree (unique hosts) • Again, skewed nature: • 251 out of 300 pages (84%) have links from 1 source • 49 pages (16%) have links from 2 or more sources • Higher indegree (unique hosts) results in a rise in • mean word count • MRR • degree of homepages
  • 27. 3.3 Finding unarchived pages: summary • Usefulness in practice • Critical test: known-item finding • Generally positive results • Unavailability of pages strengthens the potential utility of the representations: • 20.1% of homepages and 45.4% of non-homepages are not available via the live Web or the Internet Archive
  • 28. 4. Conclusion and Discussion
  • 29. 4.1 Conclusions • Approach to recover significant parts of the unarchived Web • by reconstructing descriptions based on link evidence 1. Evidence of a high number of unarchived pages • potentially increasing archive coverage 2. Skewed distribution of generated descriptions • popular pages have more terms • richness tapers off quickly 3. Succinct representations are generally rich enough to identify pages • in a known-item search setting
  • 30. 4.2 Future work & Discussion • Representations could be useful in research and institutional contexts, e.g. • helping to assess the completeness of the archive • extending seedlists for selection-based archives • potential representation of popular unarchived sites, excluded from archiving • Potentially enrich web archive systems with contextual information • Aggregation per year: refine and extend to longitudinal case • Assessing the impact of crawling strategies • Incorporating additional contextual information • e.g. text surrounding anchors • Optimally weigh all sources of evidence, using advanced retrieval models
  • 31. Acknowledgements • We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands. • This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
  • 32. References • [Craswell01] N. Craswell, D. Hawking, and S. Robertson, “Effective site finding using link anchor information,” in SIGIR. ACM, 2001, pp. 250–257. • [Fujii08] A. Fujii, “Modeling anchor text and classifying queries to enhance web document retrieval,” in WWW, J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, Eds. ACM, 2008, pp. 337–346. • [Kamps06] J. Kamps, “Web-centric language models,” in CIKM, O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, and W. Teiken, Eds. ACM, 2005, pp. 307–308. • [Masanès06] J. Masanès, Web archiving. Springer, 2006. • [Koolen10] M. Koolen and J. Kamps, “The importance of anchor text for ad hoc search revisited,” in SIGIR, F. Crestani, S. Marchand-Maillet, H.-H. Chen, E. N. Efthimiadis, and J. Savoy, Eds. ACM, 2010, pp. 122–129. • [Rauber02] A. Rauber, R. M. Bruckner, A. Aschenbrenner, O. Witvoet, and M. Kaiser, “Uncovering information hidden in web archives: A glimpse at web analysis building on data warehouses,” D-Lib Magazine, vol. 8, no. 12, 2002. • [Unesco03] UNESCO, “Charter on the preservation of digital heritage (article 3.4),” 2003.