Finding Pages on the Unarchived Web 
Hugo Huurdeman, Anat Ben-David, Jaap Kamps 
Thaer Samar, Arjen de Vries
University of Amsterdam, Centrum Wiskunde & Informatica
Presentation ACM/IEEE Digital Libraries conference 2014
Introduction 
• Web archives preserve the fast-changing Web
• However, they cannot capture the entire Web due to various limitations
• Recrawling “lost” webpages is impossible
• Would it be possible to recover parts of the unarchived Web?
Background & 
experimental setup
0.1 Background: Web archiving 
• Web archives, keepers of our future cultural heritage, are inherently incomplete
  • e.g. due to limitations in crawling [Masanès06]
• However, crawlers do register additional information, e.g.
  • page source, link structure, server metadata, timestamps, ...
  • potentially usable for analytical purposes (e.g. [Rauber02])
0.1 Background: Link evidence and anchor text 
• Defining property of the Web: graph-based structure
  • links: source, destination, anchor text
• Widely used in Web retrieval
  • e.g. [Craswell01, Fujii08, Koolen10]
• Our approach
  • inspired by previous results on Web-centric document representations
  • Our use case: the Web archive
0.2 Data: Dutch Web Archive 
• National Library of the Netherlands (KB)
• Selective Web archive
  • 2007-now
  • 10+ terabytes
  • seedlist: 8000+ websites
  • 25,000+ harvests
• Our focus: one year of data (2012)
0.2 Data: extraction and processing 
• extracting links from all pages: {destination URL, anchor text, hashcode src, crawldate}
• matching with seedlist
• adding KB metadata
• deduplication (per year), to correct for harvesting frequencies
• cleaning and processing, e.g. URL normalization
• MySQL DB (13M rows)
• aggregation and data enrichment, e.g. filetypes, counts, ...
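The cleaning and deduplication steps above can be sketched as follows. This is a minimal illustration, not the pipeline used for the study: the normalization rules (lowercasing the host, dropping a leading "www.", the fragment and a trailing slash) and the deduplication key are assumptions, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Normalize an absolute URL with a few common rules (assumed rules)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Treat "/page/" and "/page" as the same resource; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Keep the query string, drop the fragment.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def deduplicate(links):
    """Keep one link per (destination, anchor text, source hash, year).

    Each link is a tuple (destination URL, anchor text, source hash,
    crawl date as "YYYY-MM-DD"), mirroring the extracted fields.
    """
    seen = set()
    unique = []
    for dest, anchor, src_hash, crawl_date in links:
        key = (normalize_url(dest), anchor.strip().lower(), src_hash, crawl_date[:4])
        if key not in seen:
            seen.add(key)
            unique.append((key[0], anchor, src_hash, crawl_date))
    return unique
```

Deduplicating per year means that a page harvested many times in 2012 contributes each of its links only once, so frequently harvested sites do not dominate the counts.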
Research Questions 
1. Can we recover a significant fraction of unarchived pages from references to them in the Web archive?
2. How rich are the representations that can be created for unarchived URLs?
3. Are the resulting derived representations of unarchived pages useful in practice? Do they capture enough of the unique page content to make them retrievable amongst millions of other pages?
1. Expanding the Web archive 
Can we recover a significant fraction of unarchived pages from references in the Web archive?
1.1 Archived content (2012) 
[Figure: Dutch Web Archive, regions 1 and 2]
1. Contents in seedlist (2012)
  • 10.2M unique pages
  • 6,157 unique hosts
  • 3,413 unique domains
  • 16 TLDs
2. Contents not in seedlist (2012)
  • 0.9M unique pages
  • 37,166 unique hosts
  • 30,367 unique domains
  • 181 TLDs
1.2 Unarchived content: the aura 
• the aura of the Web archive
  • pages not in the archive
  • but whose existence can be derived from link evidence in the archive
• distinguishing
  • inner aura (parent domain on the seedlist)
  • outer aura (parent domain not on the seedlist)
[Figure: Dutch Web Archive, regions 1 and 2]
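The distinction between archived pages, inner aura and outer aura can be sketched as a simple classification of link targets. This is an illustrative sketch only: the function names are hypothetical, and taking the last two host labels as the parent domain is a simplifying assumption that fails for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def parent_domain(url):
    """Naive parent domain: the last two labels of the host (assumption)."""
    host = urlsplit(url).netloc.lower().split(":")[0]
    return ".".join(host.split(".")[-2:])


def classify(url, archived_urls, seed_domains):
    """Label a link target as 'archived', 'inner aura' or 'outer aura'."""
    if url in archived_urls:
        return "archived"
    if parent_domain(url) in seed_domains:
        # Unarchived page whose parent domain is on the seedlist.
        return "inner aura"
    # Unarchived page whose parent domain is not on the seedlist.
    return "outer aura"
```

For example, with a hypothetical seedlist containing only kb.nl, an unarchived page under www.kb.nl falls in the inner aura, while an unarchived page on twitter.com falls in the outer aura.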
1.2 Unarchived content (2012) 
3. Inner aura
  • 5.5M unique pages
  • 9,039 unique hosts
  • 3,019 unique domains
  • 17 TLDs
4. Outer aura
  • 5.2M unique pages
  • 481,797 unique hosts
  • 369,721 unique domains
  • 100 TLDs
[Figure: Dutch Web Archive, regions 1 to 4]
1.3 Characterizing the Aura: tld distribution 
[Pie charts: TLD distribution of the inner and outer aura; legend: nl, com, org, jp, net, other]
• Inner aura: mainly .nl content (96%)
• Outer aura: more mixed distribution (incl. .com, .org & .net); .nl accounts for 35%
1.3 Characterizing the Aura: coverage Alexa top 100 
• Inner aura
  • includes 7 of the 100 most popular Dutch sites
• Outer aura
  • includes 90 of the 100 most popular Dutch sites (1.2M references)
[Bar charts, axis 0 to 280: twitter.com, facebook.com, linkedin.com, hyves.nl, google.com; nu.nl, wikipedia, blogspot.com, kvk.nl, anwb.nl]
1.4 Expanding the Web archive: summary 
• Recovered pages and hosts:
  • Substantial amount: as many references to unarchived content as pages in the archive
  • Complementing sites in archive
• Indirect evidence of lost webpages holds the potential to significantly expand the Web archive’s coverage
[Bar chart, axis 0 to 20: unarchived pages (M) and archived pages (M)]
2. Representations
How rich are the representations that can be created for unarchived URLs?
2.1 Representations of unarchived content: indegree 
• Characteristics of incoming links (indegree)
  • All target representations: link from at least 1 unique page (based on MD5 hash)
  • 18% have at least 3 unique incoming links
  • 10% have 5 links or more
[Chart: subset coverage (0% to 100%) by indegree (unique source pages, 1 to 10), for inner aura and outer aura]
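The indegree figures above could be computed along these lines. This is a minimal sketch under assumptions: links are given as (target URL, source page hash) pairs, and the function names are hypothetical.

```python
from collections import defaultdict


def indegree(links):
    """Map each target URL to its number of unique source pages (by hash)."""
    sources = defaultdict(set)
    for target, src_hash in links:
        sources[target].add(src_hash)
    return {target: len(hashes) for target, hashes in sources.items()}


def coverage_at_least(degrees, k):
    """Fraction of targets that have an indegree of at least k."""
    if not degrees:
        return 0.0
    return sum(1 for d in degrees.values() if d >= k) / len(degrees)
```

Counting unique source hashes rather than raw links keeps repeated harvests of the same source page from inflating the indegree.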
2.2 Representations: anchor text distribution 
• Further inspecting the richness: number of unique words
• 95% have 1 unique word or more
  • christinaconcours.nl: concertagenda (5)
• 30% have 3 unique words or more
  • watou2009.be: watou (3) collection (2) stories (2)
• 3% have 10 words or more
  • jos.rotterdam.nl: society (2) service (2) youth (2) and (2) education (2) wwwjosrotterdamnl (1) municipality (1) governance (1) jos (4) rotterdam (3)
[Chart: subset coverage (0% to 100%) by unique word count in anchor text (0 to 15), for inner aura and outer aura]
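Word counts like the ones shown per URL can be obtained by aggregating all anchor texts that point to the same target. The sketch below is illustrative only: the tokenization (lowercasing and splitting on non-word characters) is an assumption, and the function name is hypothetical.

```python
import re
from collections import Counter


def anchor_word_counts(anchor_texts):
    """Aggregate the anchor texts of one target URL into word counts."""
    counts = Counter()
    for text in anchor_texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Hypothetical anchor texts pointing to a single unarchived URL.
counts = anchor_word_counts(["Watou collection", "watou stories", "Watou"])
unique_words = len(counts)
```

The number of unique words is then simply the number of keys in the counter, which is the quantity plotted on the horizontal axis.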
2.3 Representations: homepages & non-homepages 
• Anchors often refer to homepages [e.g. Craswell01]
• In our dataset: homepages for 336K of 481K hosts (69.8%)
• homepage: vakcentrum.nl (6 unique anchors)
• non-homepage: nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/ (2 unique anchors; combine with URL words)
2.4 Representations of unarchived content: summary 
• Richness of representations: results mixed
  • skewed distribution
  • majority of pages: relatively sparse descriptions
  • minority of pages: relatively rich descriptions
• Are the representations rich enough to characterize the page’s contents?
3. Finding Unarchived Pages
Are the representations of unarchived pages useful in practice?
3.1 Finding unarchived pages: evaluation setup 
• Indexed 5.19M representations of unarchived content (outer aura)
• three indexes:
  • anchT: aggregated anchor text only
  • urlW: URL words
  • anchTUrlW: both
• Stratified sample: 500 homepages & 500 non-homepages
• Pages (if available via IA / live Web) consulted by two annotators
• creating known-item topics (150 per category)
  • inspect target page
  • write down query for refinding (without knowledge of anchor text)
  • result: 300 queries (~5-7 words)
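The three kinds of representation can be illustrated with a small sketch. It is not the indexing setup used in the evaluation: the URL tokenization rule (splitting host and path on non-alphanumeric characters) and the function names are assumptions; only the three index names come from the slide.

```python
import re
from urllib.parse import urlsplit


def url_words(url):
    """Split the host and path of a URL into lowercase word tokens."""
    parts = urlsplit(url)
    text = (parts.netloc + parts.path).lower()
    return [w for w in re.split(r"[^a-z0-9]+", text) if w]


def representations(url, anchor_texts):
    """Build the terms for the anchT, urlW and combined representations."""
    anch = " ".join(anchor_texts).lower().split()
    urlw = url_words(url)
    return {"anchT": anch, "urlW": urlw, "anchTUrlW": anch + urlw}
```

For a non-homepage with few anchors, the URL words add terms such as "study" and "mexico" that the anchor text alone may not contain, which is the motivation for the combined representation.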
3.2 Evaluation: results 
• Mean Reciprocal Rank (MRR)
  • average over all queries of the score of the first correct result
  • score: 1/rank
• Results:
  • homepages score better for anchor text representations
  • URL words representation better for non-homepages
  • combined representation improves MRR score for both
  • average close to 0.5: roughly, the correct result is found at rank 2 in the average case
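The MRR measure described above can be written down in a few lines. This is a generic sketch of the standard definition, with hypothetical function names; each query is assumed to have exactly one correct target page (known-item search), and a query whose target is not retrieved scores 0.

```python
def reciprocal_rank(ranking, target):
    """Return 1/rank of the target in the ranking, or 0.0 if it is absent."""
    for rank, doc in enumerate(ranking, start=1):
        if doc == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranking, target) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)
```

Note that an MRR of 0.5 can also arise from a mix of queries answered at rank 1 and queries that fail entirely, so "rank 2 on average" is a rough reading of the score.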
3.2 Evaluation: results 
• Success Rate @10: correct target page in top 10
• Similar to MRR results:
  • homepages score better for anchor text
  • non-homepages score better for URL words representations
• On average, 59.7% of the correct homepages and non-homepages can be retrieved in the top 10
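Success rate at 10 is the fraction of queries whose target page appears among the first ten results. A minimal sketch, with a hypothetical function name and the same (ranking, target) input format assumed for illustration:

```python
def success_at_k(runs, k=10):
    """Fraction of (ranking, target) pairs with the target in the top k."""
    if not runs:
        return 0.0
    hits = sum(1 for ranking, target in runs if target in ranking[:k])
    return hits / len(runs)
```

Unlike MRR, this measure ignores where in the top 10 the target appears, so it rewards any representation that brings the page onto the first result page.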
3.3 Evaluation: Impact of indegree 
• Impact of incoming links on richness of representations
  • cer.org.uk (5 anchor words)
  • actionaid.org/kenya (1 anchor word)
3.3 Evaluation: Impact of indegree (unique hosts)
• Again, skewed nature:
  • 251 out of 300 pages (84%) have links from 1 source
  • 49 pages (16%) have links from 2 or more sources
• Higher indegree (unique hosts) results in a rise in
  • mean word count
  • MRR
  • degree of homepages
3.3 Finding unarchived pages: summary 
• Usefulness in practice
  • Critical test: known-item finding
  • Generally positive results
• Unavailability of pages strengthens the potential utility of the representations: 20.1% of homepages and 45.4% of non-homepages are not available via the live Web or the Internet Archive
4. Conclusion and Discussion
4.1 Conclusions 
• Approach to recover significant parts of the unarchived Web
  • by reconstructing descriptions based on link evidence
1. Evidence of a high number of unarchived pages
  • potentially increasing archive coverage
2. Skewed distribution of generated descriptions
  • popular pages have more terms
  • richness tapers off quickly
3. Succinct representations generally rich enough to identify pages
  • in a known-item search setting
4.2 Future work & Discussion
• Representations could be useful in research and institutional contexts, e.g.
  • helping to assess the completeness of the archive
  • extending seedlists for selection-based archives
  • potential representation of popular unarchived sites that are excluded from archiving
• Potentially enrich Web archive systems with contextual information
• Aggregation per year: refine and extend to longitudinal case
• Assessing the impact of crawling strategies
• Incorporating additional contextual information
  • e.g. text surrounding anchors
• Optimally weigh all sources of evidence, using advanced retrieval models
Acknowledgements 
• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.
• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
References 
• [Craswell01] N. Craswell, D. Hawking, and S. Robertson, “Effective site finding using link anchor information,” in SIGIR. ACM, 2001, pp. 250–257.
• [Fujii08] A. Fujii, “Modeling anchor text and classifying queries to enhance web document retrieval,” in WWW, J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, Eds. ACM, 2008, pp. 337–346.
• [Kamps06] J. Kamps, “Web-centric language models,” in CIKM, O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, and W. Teiken, Eds. ACM, 2005, pp. 307–308.
• [Masanès06] J. Masanès, Web archiving. Springer, 2006.
• [Koolen10] M. Koolen and J. Kamps, “The importance of anchor text for ad hoc search revisited,” in SIGIR, F. Crestani, S. Marchand-Maillet, H.-H. Chen, E. N. Efthimiadis, and J. Savoy, Eds. ACM, 2010, pp. 122–129.
• [Rauber02] A. Rauber, R. M. Bruckner, A. Aschenbrenner, O. Witvoet, and M. Kaiser, “Uncovering information hidden in web archives: A glimpse at web analysis building on data warehouses,” D-Lib Magazine, vol. 8, no. 12, 2002.
• [Unesco03] UNESCO, “Charter on the preservation of digital heritage (article 3.4),” 2003.
 
From Exploration to Construction
 - How to Support the Complex Dynamics of In...
From Exploration to Construction
 - How to Support the Complex Dynamics of In...From Exploration to Construction
 - How to Support the Complex Dynamics of In...
From Exploration to Construction
 - How to Support the Complex Dynamics of In...TimelessFuture
 
Active & Passive Utility of Search Interface Features in different Informatio...
Active & Passive Utility of Search Interface Features in different Informatio...Active & Passive Utility of Search Interface Features in different Informatio...
Active & Passive Utility of Search Interface Features in different Informatio...TimelessFuture
 
Supporting the Process - Adapting Search Systems To Search Stages (ECIL15)
Supporting the Process - Adapting Search Systems To Search Stages (ECIL15)Supporting the Process - Adapting Search Systems To Search Stages (ECIL15)
Supporting the Process - Adapting Search Systems To Search Stages (ECIL15)TimelessFuture
 
The Value of Multistage Search Systems for Book Search
The Value of Multistage Search Systems for Book SearchThe Value of Multistage Search Systems for Book Search
The Value of Multistage Search Systems for Book SearchTimelessFuture
 
WebART: hoe maak je webarchieven bruikbaar voor de wetenschap? (Dutch)
WebART: hoe maak je webarchieven bruikbaar voor de wetenschap? (Dutch)WebART: hoe maak je webarchieven bruikbaar voor de wetenschap? (Dutch)
WebART: hoe maak je webarchieven bruikbaar voor de wetenschap? (Dutch)TimelessFuture
 
From multistage information seeking models to multistage search systems (IIiX...
From multistage information seeking models to multistage search systems (IIiX...From multistage information seeking models to multistage search systems (IIiX...
From multistage information seeking models to multistage search systems (IIiX...TimelessFuture
 

Finding Pages on the Unarchived Web (DL 2014)

  • 1. Finding Pages on the Unarchived Web • Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, Arjen de Vries • University of Amsterdam, Centrum Wiskunde & Informatica • Presentation at the ACM/IEEE Digital Libraries conference 2014
  • 2. Introduction • Web archives preserve the fast-changing Web • However, they cannot capture the entire Web due to various limitations • Recrawling “lost” webpages is impossible • Would it be possible to recover parts of the unarchived Web?
  • 4. 0.1 Background: Web archiving • Web archives, keepers of our future cultural heritage, are inherently incomplete • e.g. due to limitations in crawling [Masanès06] • However, crawlers do register additional information, e.g. • page source, link structure, server metadata, timestamps, .. • potentially usable for analytical purposes (e.g. [Rauber02])
  • 5. 0.1 Background: Link evidence and anchor text • Defining property of the Web: graph-based structure • links: source, destination, anchor text • Widely used in Web retrieval • [e.g. Craswell01, Fujii08, Koolen10] • Our approach • inspired by previous results on Web-centric document representations • Our use case: the Web archive
  • 6. 0.2 Data: Dutch Web Archive • National Library of the Netherlands (KB) • Selective Web archive • 2007-now • 10+ terabytes • seedlist: 8000+ websites • 25,000+ harvests • Our focus: one year of data (2012)
  • 7. 0.2 Data: extraction and processing • extracting links from all pages: {destination URL, anchor text, hashcode src, crawldate} • matching with seedlist • adding KB metadata • deduplication (per year), to correct for harvesting frequencies • cleaning and processing, e.g. URL normalization • MySQL DB (13M rows) • aggregation and data enrichment, e.g. filetypes, counts, ..
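The cleaning and deduplication steps on this slide can be illustrated with a minimal sketch. The slides do not specify the normalization rules, so the ones below (lowercasing, dropping a leading `www.`, a trailing slash, the fragment and any port) are assumptions, as are the function names and the tuple layout of a link record.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Assumed normalization: lowercase scheme/host, drop 'www.', port, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def deduplicate_per_year(records):
    """Keep one link record per (destination, anchor, source hash, crawl year).

    Each record is a hypothetical tuple:
    (destination URL, anchor text, source page hash, crawl date as 'YYYY-MM-DD').
    """
    seen = set()
    unique = []
    for dest, anchor, src_hash, crawl_date in records:
        key = (normalize_url(dest), anchor.strip().lower(), src_hash, crawl_date[:4])
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

records = [
    ("http://www.a.nl/x", "Home", "h1", "2012-01-05"),
    ("http://a.nl/x/", "home ", "h1", "2012-06-01"),   # same link, harvested again
    ("http://a.nl/x", "Home", "h2", "2012-06-01"),     # different source page
]
unique_links = deduplicate_per_year(records)
```

Collapsing repeated harvests of the same source page into one record per year is what keeps frequently crawled sites from inflating the link counts.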
  • 8. Research Questions • 1. Can we recover a significant fraction of unarchived pages from references to them in the Web archive? • 2. How rich are the representations that can be created for unarchived URLs? • 3. Are the resulting derived representations of unarchived pages useful in practice? Do they capture enough of the unique page content to make them retrievable amongst millions of other pages?
  • 9. 1. Expanding the Web archive: Can we recover a significant fraction of unarchived pages from references in the Web archive?
  • 10. 1.1 Archived content (2012), Dutch Web Archive • 1. Contents in seedlist (2012) • 10.2M unique pages • 6,157 unique hosts • 3,413 unique domains • 16 TLDs • 2. Contents not in seedlist (2012) • 0.9M unique pages • 37,166 unique hosts • 30,367 unique domains • 181 TLDs
  • 11. 1.2 Unarchived content: the aura • the aura of the Web archive • pages not in the archive • but whose existence can be derived from link evidence in the archive • distinguishing • inner aura (parent domain on the seedlist) • outer aura (parent domain not on the seedlist)
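The inner/outer aura distinction can be sketched as a small classifier over link targets. This is an illustration rather than the authors' implementation: the names are hypothetical, and taking the last two host labels as the parent domain is a simplifying assumption that breaks for suffixes such as `.co.uk` (a public suffix list would be needed there).

```python
from urllib.parse import urlsplit

def parent_domain(url):
    """Assumed simplification: the parent domain is the last two labels of the host."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])

def classify_target(url, archived_urls, seed_domains):
    """Label a link target as archived, inner aura, or outer aura."""
    if url in archived_urls:
        return "archived"
    if parent_domain(url) in seed_domains:
        return "inner aura"   # unarchived page, parent domain on the seedlist
    return "outer aura"       # unarchived page, parent domain not on the seedlist

archived_urls = {"http://kb.nl/"}
seed_domains = {"kb.nl"}
labels = [
    classify_target(u, archived_urls, seed_domains)
    for u in ("http://kb.nl/", "http://www.kb.nl/missing", "http://twitter.com/x")
]
```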
  • 12. 1.2 Unarchived content (2012) • 3. Inner aura • 5.5M unique pages • 9,039 unique hosts • 3,019 unique domains • 17 TLDs • 4. Outer aura • 5.2M unique pages • 481,797 unique hosts • 369,721 unique domains • 100 TLDs
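  • 13. 1.3 Characterizing the Aura: TLD distribution • Inner aura: mainly .nl content [pie chart; legend: nl, com, org, net, other; shares as extracted: 96%, 2%] • Outer aura: more mixed distribution (incl. .com, .org & .net) [pie chart; legend: nl, com, org, jp, net, other; shares as extracted, mapping to legend not recoverable: 10%, 2%, 18%, 5%, 31%, 35%]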
  • 14. 1.3 Characterizing the Aura: coverage of Alexa top 100 • Inner aura • includes 7 of 100 most popular Dutch sites • Outer aura • includes 90 of 100 most popular Dutch sites (1.2M references) • [Two bar charts, y-axis 0 to 280; sites shown: twitter.com, facebook.com, linkedin.com, hyves.nl, google.com; and nu.nl, wikipedia, blogspot.com, kvk.nl, anwb.nl]
  • 15. 1.4 Expanding the Web archive: summary • Recovered pages and hosts: • Substantial amount • as many references to unarchived content as pages in the archive • Complementing sites in archive • Indirect evidence of lost webpages holds the potential to significantly expand the Web archive’s coverage • [Bar chart: unarchived pages (M) vs. archived pages (M)]
  • 16. 2. Representations: How rich are the representations that can be created for unarchived URLs?
  • 17. 2.1 Representations of unarchived content: indegree • Characteristics of incoming links (indegree) • All target representations: link from at least 1 unique page (based on MD5 hash) • 18% have at least 3 unique incoming links • 10% have 5 links or more • [Chart: subset coverage (%) by indegree (unique source pages, 1 to 10), inner aura vs. outer aura]
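As a rough sketch of the indegree statistic above: count the distinct source pages (identified here by a content hash, as on the slide) linking to each target, then report the fraction of targets with at least k such sources. Function names and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def indegree(links):
    """Map each target URL to its number of unique source pages.

    `links` is an iterable of (target URL, source page hash) pairs.
    """
    sources = defaultdict(set)
    for target, src_hash in links:
        sources[target].add(src_hash)
    return {target: len(hashes) for target, hashes in sources.items()}

def coverage(degrees, k):
    """Fraction of targets with indegree of at least k."""
    return sum(1 for d in degrees.values() if d >= k) / len(degrees)

links = [("a", "h1"), ("a", "h1"), ("a", "h2"), ("a", "h3"), ("b", "h1")]
degrees = indegree(links)
```

On this toy input, `coverage(degrees, 3)` is the share of targets linked from three or more distinct pages, the same kind of figure as the 18% reported on the slide.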
  • 18. 2.2 Representations: anchor text distribution • Further inspecting the richness: number of unique words • 95% have 1 unique word or more • christinaconcours.nl: concertagenda (5) • 30% have 3 unique words or more • watou2009.be: watou (3) collection (2) stories (2) • 3% have 10 words or more • jos.rotterdam.nl: society (2) service (2) youth (2) and (2) education (2) wwwjosrotterdamnl (1) municipality (1) governance (1) jos (4) rotterdam (3) • [Chart: subset coverage (%) by unique word count of anchor text (0 to 15), inner aura vs. outer aura]
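The per-page term lists shown above (for example "watou (3) collection (2) stories (2)") suggest aggregating all anchor texts that point to one target into term counts. A minimal sketch follows, assuming lowercasing and a simple word-character tokenizer; the slides do not state the actual tokenization.

```python
import re
from collections import Counter

def anchor_terms(anchors):
    """Aggregate the anchor texts pointing to one target into term counts."""
    counts = Counter()
    for text in anchors:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

terms = anchor_terms(["Watou 2009", "watou stories", "Stories"])
unique_word_count = len(terms)  # the richness measure plotted on the slide
```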
  • 19. 2.3 Representations: homepages & non-homepages • Anchors often refer to homepages [e.g. Craswell01] • In our dataset: homepages for 336K of 481K hosts (69.8%) • homepage: vakcentrum.nl (6 unique anchors) • non-homepage: nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/ (2 unique anchors; combine with URL words)
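For sparsely described non-homepages, the slide suggests combining anchor text with words taken from the URL itself. The sketch below shows one way to extract such URL words and to tell homepages from non-homepages; splitting on non-alphanumeric characters and treating an empty or root path as a homepage are assumptions, not rules stated in the slides.

```python
import re
from urllib.parse import urlsplit

def _split(url):
    # Add a scheme when missing so that urlsplit recognizes the host part.
    return urlsplit(url if "://" in url else "http://" + url)

def url_words(url):
    """Split host and path into lowercase word tokens."""
    parts = _split(url)
    text = (parts.hostname or "") + " " + parts.path
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

def is_homepage(url):
    """Assumed rule: a URL with an empty or root path is a homepage."""
    return _split(url).path in ("", "/")

words = url_words("nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/")
```

For the non-homepage example on the slide, this yields tokens such as "study", "grants" and "loans", which add descriptive terms beyond its two unique anchors.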
  • 20. 2.4 Representations of unarchived content: summary • Richness of representations: • Results mixed: • skewed distribution • majority of pages: relatively sparse descriptions • minority of pages: relatively rich descriptions • Are the representations rich enough to characterize the page’s contents?
  • 21. 3. Finding Unarchived Pages: Are the representations of unarchived pages useful in practice?
  • 22. 3.1 Finding unarchived pages: evaluation setup • Indexed 5.19M representations of unarchived content (outer aura) • three indexes: • anchT: aggregated anchor text • urlW: only URL words • anchTUrlW: both • Stratified sample: 500 homepages & 500 non-homepages • Pages (if available via IA / live Web) consulted by two annotators • creating known-item topics (150 per category) • inspect target page • write down query for refinding (without knowledge of anchor text) • result: 300 queries (~5-7 words)
  • 23. 3.2 Evaluation: results • Mean Reciprocal Rank (MRR) • average over all queries of the score of the first correct result • score: 1/rank • Results: • homepages score better for anchor text representations • URL words representation better for non-homepages • combined representation improves MRR score for both • average close to 0.5: in the average case, the correct result is at rank 2
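The MRR definition above translates directly into code. This is a generic sketch of the metric, not the evaluation tooling used in the study; a query whose target is never retrieved is assumed to score 0.

```python
def reciprocal_rank(ranking, target):
    """Return 1/rank of the target in the ranked list, or 0.0 if it is absent."""
    for rank, doc in enumerate(ranking, start=1):
        if doc == target:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranking, target) pairs, one pair per query."""
    return sum(reciprocal_rank(r, t) for r, t in runs) / len(runs)

runs = [
    (["a", "b"], "a"),        # target at rank 1 -> 1.0
    (["x", "y", "z"], "z"),   # target at rank 3 -> 1/3
    (["p"], "q"),             # target not retrieved -> 0.0
]
mrr = mean_reciprocal_rank(runs)
```

Because the score is 1/rank, a query answered at rank 2 contributes exactly 0.5, which is the reading behind the "2nd rank" remark on the slide.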
  • 24. 3.2 Evaluation: results • Success Rate @10: correct target page in top 10 • Similar to MRR results: • homepages score better for anchor text • non-homepages score better for URL words representations • On average, 59.7% of the correct homepages and non-homepages can be retrieved in the top 10
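Success Rate @10 can be sketched in the same generic way: the fraction of queries whose known target appears among the first k results. The function name and toy data are illustrative.

```python
def success_at_k(runs, k=10):
    """Fraction of (ranking, target) pairs with the target in the top k results."""
    hits = sum(1 for ranking, target in runs if target in ranking[:k])
    return hits / len(runs)

runs = [
    (["a", "b", "c"], "c"),                  # found at rank 3
    ([str(i) for i in range(20)], "15"),     # found, but only at rank 16
]
rate_at_10 = success_at_k(runs, k=10)
rate_at_20 = success_at_k(runs, k=20)
```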
  • 25. 3.3 Evaluation: Impact of indegree • Impact of incoming links on richness of representations • cer.org.uk (5 anchor words) • actionaid.org/kenya (1 anchor word)
  • 26. 3.3 Evaluation: Impact of indegree (unique hosts) • Again, skewed nature: • 251 out of 300 pages (84%) have links from 1 source • 49 pages (16%) have links from 2 or more sources • Higher indegree (unique hosts) results in a rise in • mean word count • MRR • degree of homepages
  • 27. 3.3 Finding unarchived pages: summary • Usefulness in practice • Critical test: known-item finding • Generally positive results • Unavailability of pages strengthens the potential utility of the representations: • 20.1% of homepages and 45.4% of non-homepages are not available via the live Web or the Internet Archive
  • 28. 4. Conclusion and Discussion
  • 29. 4.1 Conclusions • Approach to recover significant parts of the unarchived Web • by reconstructing descriptions based on link evidence • 1. Evidence of a high number of unarchived pages • potentially increasing archive coverage • 2. Skewed distribution of generated descriptions • popular pages have more terms • richness tapers off quickly • 3. Succinct representations are generally rich enough to identify pages • in a known-item search setting
  • 30. 4.2 Future work & Discussion • Representations could be useful in research and institutional contexts, e.g. • helping to assess the completeness of the archive • extending seedlists for selection-based archives • potentially representing popular unarchived sites that are excluded from archiving • Potentially enrich web archive systems with contextual information • Aggregation per year: refine and extend to longitudinal case • Assessing the impact of crawling strategies • Incorporating additional contextual information • e.g. text surrounding anchors • Optimally weigh all sources of evidence, using advanced retrieval models
  • 31. Acknowledgements • We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands. • This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
  • 32. References • [Craswell01] N. Craswell, D. Hawking, and S. Robertson, “Effective site finding using link anchor information,” in SIGIR. ACM, 2001, pp. 250–257. • [Fujii08] A. Fujii, “Modeling anchor text and classifying queries to enhance web document retrieval,” in WWW, J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, Eds. ACM, 2008, pp. 337–346. • [Kamps06] J. Kamps, “Web-centric language models,” in CIKM, O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, and W. Teiken, Eds. ACM, 2005, pp. 307–308. • [Koolen10] M. Koolen and J. Kamps, “The importance of anchor text for ad hoc search revisited,” in SIGIR, F. Crestani, S. Marchand-Maillet, H.-H. Chen, E. N. Efthimiadis, and J. Savoy, Eds. ACM, 2010, pp. 122–129. • [Masanès06] J. Masanès, Web archiving. Springer, 2006. • [Rauber02] A. Rauber, R. M. Bruckner, A. Aschenbrenner, O. Witvoet, and M. Kaiser, “Uncovering information hidden in web archives: A glimpse at web analysis building on data warehouses,” D-Lib Magazine, vol. 8, no. 12, 2002. • [Unesco03] UNESCO, “Charter on the preservation of digital heritage (article 3.4),” 2003.