Using the Web Infrastructure for Real Time Recovery of Missing Web Pages
Dissertation Defense
Martin Klein, mklein@cs.odu.edu
Old Dominion University, Norfolk, VA
07/18/2011
Committee: Dr. Michael L. Nelson (Advisor), Dr. Yaohang Li, Dr. Michele C. Weigle, Dr. Mohammad Zubair, Dr. Robert Sanderson, Dr. Herbert Van de Sompel
Agenda
 Motivation
 Background
 LSs for Web Pages
 DF Estimation Techniques
 TC-DF Correlation
 Web Page Titles
 Synchronicity
 Link Neighborhood LSs
 Book of the Dead
 Web Page Tags
The Problem
The Problem - 404 Errors
URIs inaccessible in CS papers: 23%-53% [Lawrence2001]
Inaccessible web pages: 67% after 4 years [Koehler2002]
Inaccessible objects in DLs: 3% [Nelson2002]
URIs inaccessible in high IF journals: 3.8% after 3 months; 13% after 27 months [Dellavalle2003]
URIs inaccessible in D-Lib Magazine: ~30% [McCown2005]
URIs inaccessible (and not archived) in scholarly articles: ~25% [Sanderson2011]
The Problem - 404 Errors
Has anybody crawled and indexed it?
Do Google, Yahoo!, Bing have a copy of the page?
Has the page been archived by a web archive?
Information retrieval techniques needed to (re-)discover content
The Solution?
Requires knowledge about content
Problem with homographs (jaguar, present, lead, M/mobile, etc.)
Problem with very frequent terms/names (Michael Nelson, Eric Miller, etc.)
Web archives
Helps for, e.g., an apple pie recipe but not for the web page of a faculty member who has transferred
Content Similarity: JCDL 2005. http://www.jcdl2005.org/ (July 2005 vs. today)
Content Similarity: Hypertext 2006. http://www.ht06.org/ (August 2006 vs. today)
Content Similarity: PSP 2003. http://www.pspcentral.org/events/annual_meeting_2003.html and http://www.pspcentral.org/events/archive/annual_meeting_2003.html (August 2003 vs. today)
Content Similarity: ECDL 1999. http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html and http://www-rocq.inria.fr/EuroDL99/ (October 1999 vs. today)
Content Similarity: Greynet 1999. http://www.konbib.nl/infolev/greynet/2.5.htm (1999 vs. today: ? ?)
Research Questions (1)
Based on the Web Infrastructure (WI), can we use content- and link-structure-based methods to (re-)discover missing web pages in real time?
Investigated methods:
 Lexical signatures
 Titles
 Tags
 Link neighborhood lexical signatures
Research Questions (2)
What are the optimal characteristics of these methods (age, length, etc.) with respect to retrieval performance?
Can we improve the performance by consolidating two or more methods?
Can we have a real-world implementation and evaluation of the above?
Memento, Web Infrastructure (WI)
Lexical Signatures (LSs)
First introduced by Phelps and Wilensky [Phelps2000]
Small set of terms capturing the “aboutness” of a document; “lightweight” metadata
Resource: 10,000 terms; Abstract: 200 terms
Lexical Signature Generation
Term frequency (TF):
 “How often does this word appear in this document?”
Inverse document frequency (IDF):
 “In how many documents does this word appear?” (the fewer documents, the higher the IDF)
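The TF and IDF questions above can be combined into a term score, and the highest-scoring terms form the lexical signature. The following is a minimal sketch under stated assumptions: the slides do not give the exact weighting, so the common `tf * log(N / df)` form is used here, and `doc_freq`, `num_docs`, and the simple tokenizer are hypothetical inputs chosen for illustration.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, num_docs, k=5):
    """Return the k terms of `text` with the highest TF-IDF score.

    doc_freq maps a term to the number of documents containing it and
    num_docs is the collection size; both are assumed to be supplied by
    some external DF estimation step.
    """
    terms = re.findall(r"[a-z]+", text.lower())
    tf = Counter(terms)
    scores = {}
    for term, count in tf.items():
        # Assumption: a term missing from doc_freq occurs in one document.
        df = doc_freq.get(term, 1)
        scores[term] = count * math.log(num_docs / df)
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

A very frequent term such as "the" gets a score near zero even when it occurs often in the page, which is why it does not end up in the signature.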
Lexical Signatures -- Examples
A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)
Accurate IDF Values for LSs
Screen scraping the Google web interface
The Dataset
Local universe consisting of copies of URIs from the Internet Archive between 1996 and 2007
The Idea
Local collection of web pages
“Screen scraping” SE result pages
 Google N-Grams
Note: N-Grams provide term count (TC) and not DF values (ask me for details)
LSs Example
Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms
Comparing LSs
Normalized term overlap
 k-term LSs normalized by k
Kendall Tau
M-Score
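Two of the similarity measures named above can be sketched as follows. This is an illustration, not the evaluation code behind the slides: the function names are hypothetical, LSs are assumed to be ranked lists without duplicate terms, and Kendall Tau is computed only over the terms both LSs share, which is one possible way to handle lists with different members. The M-Score is not sketched because the slides do not define it.

```python
from itertools import combinations


def normalized_overlap(ls_a, ls_b):
    """Number of shared terms divided by k, the LS length."""
    k = max(len(ls_a), len(ls_b))
    if k == 0:
        return 0.0
    return len(set(ls_a) & set(ls_b)) / k


def kendall_tau_shared(ls_a, ls_b):
    """Kendall Tau over the terms that occur in both ranked LSs.

    Returns None when fewer than two terms are shared, since no pair
    can be compared in that case.
    """
    shared = [t for t in ls_a if t in ls_b]
    if len(shared) < 2:
        return None
    rank_b = {term: i for i, term in enumerate(ls_b)}
    concordant = discordant = 0
    for x, y in combinations(shared, 2):
        # x precedes y in ls_a by construction of `shared`.
        if rank_b[x] < rank_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

Overlap ignores term order, while Kendall Tau looks only at order, so the two measures complement each other.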
Comparing LSs
Top 5, 10 and 15 terms
LC: local universe
SC: screen scraping
NG: N-Grams
Conclusions
 Compared to the Google N-Gram baseline
 Screen scraping method seems preferable
Similarity scores are slightly higher
Feasible in real time!
Contribution: Established a well-performing IDF estimation technique.
Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)
The Idea
Evaluate evolution of LSs over time by
 Conduct overlap analysis
Neither Phelps and Wilensky nor Park et al. [Park2004] did that
Park et al. just re-confirmed their findings after 6 months
LSs Over Time - Example
10-term LSs generated for http://www.perfect10wines.com
LS Overlap Analysis
Rooted: overlap between the LS of the year of the first observation in the IA and the LSs of all subsequent years in which the URI has been observed
Sliding: overlap between the LSs of two consecutive years, starting with the first year and ending with the last
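The difference between the rooted and the sliding analysis can be illustrated with a small sketch. It assumes a hypothetical `ls_by_year` dictionary mapping each observed year to that year's LS, and it uses a plain normalized term overlap as the comparison; the slides do not state which overlap measure the analysis uses.

```python
def overlap(ls_a, ls_b):
    """Shared terms divided by the LS length (assumed overlap measure)."""
    k = max(len(ls_a), len(ls_b))
    return len(set(ls_a) & set(ls_b)) / k if k else 0.0


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    if not years:
        return {}
    root = ls_by_year[years[0]]
    return {year: overlap(root, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare the LSs of each pair of consecutive observed years."""
    years = sorted(ls_by_year)
    return {
        (a, b): overlap(ls_by_year[a], ls_by_year[b])
        for a, b in zip(years, years[1:])
    }
```

Rooted overlap shows how far an LS has drifted from its first version, whereas sliding overlap shows how much it changes from one observation to the next.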
Evolution of LSs over Time
Rooted Results:
Highest overlap in the first 1-2 years after creation of the LS
Rarely peaks after that; once terms are gone, they do not return
Evolution of LSs over Time
Sliding Results:
Overlap increases over time
Seems to reach steady state around 2003
Performance of LSs
Idea:
Identify URI in result set
For each URI it is possible that:
 URI is returned as the top ranked result
 URI is ranked somewhere between 2 and 10
 URI is ranked somewhere between 11 and 100
 URI is ranked somewhere beyond rank 100 (considered as not returned)
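The four outcomes listed above amount to a simple classification of the rank at which the target URI appears. A minimal sketch, assuming a hypothetical `result_uris` list ordered by rank and exact string matching of URIs (a real evaluation would likely need URI canonicalization first):

```python
def rank_bucket(target_uri, result_uris):
    """Classify where target_uri appears in a ranked result list."""
    try:
        rank = result_uris.index(target_uri) + 1  # ranks start at 1
    except ValueError:
        return "undiscovered"
    if rank == 1:
        return "top"
    if rank <= 10:
        return "top 10"
    if rank <= 100:
        return "top 100"
    # Beyond rank 100 is treated the same as not returned.
    return "undiscovered"
```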
Performance of LSs wrt Length
Results:
5-, 6- and 7-term LSs seem best
Top mean rank (MR) value with 5 terms
Most top ranked with 7 terms
Binary pattern: either in top 10 or undiscovered
8 terms and beyond do not show improvement
Performance of LSs wrt Length
nDCG for LSs consisting of 2-15 terms (mean over all years)
Performance of LSs over Time
nDCG for LSs consisting of 2, 5, 7 and 10 terms
Conclusions
 Rooted: quickly after generation
 Sliding: seem to stabilize
 LSs older than 5 years perform poorly
 5-, 6- and 7-term LSs seem to perform best
 7: most top ranked
 5: lowest mean rank
 2..4 as well as 8+ term LSs are insufficient
Contribution: Determined age and length limits for LSs.
Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)
The Problem
www.aircharter-international.com
Internet Archive - Wayback Machine: 59 copies (http://web.archive.org/web/*/http://www.aircharter-international.com)
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
The Problem
www.aircharter-international.com
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
The Problem
www.aircharter-international.com
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
The Idea
Compare performance of two automated methods to rediscover web pages
 Lexical signatures (LSs)
 Titles
Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery
LS Retrieval Performance
5- and 7-Term LSs
Binary retrieval pattern, URI either within top 10 or undiscovered
Title Retrieval Performance
Non-Quoted and Quoted Titles
Google and Yahoo! return more URIs for non-quoted titles
Same binary retrieval pattern
Combination of Methods
Top Results for Combination of Methods
Conclusions
 Return 50%-70% of URIs top ranked
BUT
 Preferred primary method
 5-term LSs secondary method
 Results in 75% top ranked URIs
Contributions: Provided evidence for suitability of titles and introduced web page discovery framework.
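The primary/secondary ordering suggested in these conclusions could be arranged as a simple fallback chain. The sketch below is only one possible reading of that workflow and is not the framework from the dissertation: `search` is a hypothetical callable that takes a query string and returns a ranked list of URIs, and the fallback condition (an empty result list) is an assumption.

```python
def rediscover(title, ls_terms, search, top_n=10):
    """Query with the title first, then fall back to a 5-term LS.

    Returns a (method, results) pair; method is None when neither
    query returns anything.
    """
    queries = [
        ("title", title),
        ("lexical signature", " ".join(ls_terms[:5])),
    ]
    for method, query in queries:
        if not query:
            continue  # nothing to ask for this method
        results = search(query)[:top_n]
        if results:
            return method, results
    return None, []
```

Only the top 10 results are kept by default because the binary retrieval pattern reported earlier suggests a URI is either found within the top 10 or not at all.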
Is This a Good Title? (Hypertext 2010)
The Problem
http://www.drbartell.com/
Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University
???
The Problem
http://www.drbartell.com/
Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery
The Problem
www.reagan.navy.mil
Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding
The Problem
www.reagan.navy.mil
Title: Home Page
???
Is This a Good Title?
The Idea
Display title evolution over time
Compare to content evolution
“Normalize” time as fixed size windows
Provide prediction model for title’s retrieval potential
Title (and LS) Retrieval Performance
Titles
5- and 7-Term LSs
Binary retrieval pattern, URI either within top 10 or undiscovered
Title Evolution - Example I
www.sun.com/solutions
1998-01-27 Sun Software Products Selector Guides - Solutions Tree
1999-02-20 Sun Software Solutions
2002-02-01 Sun Microsystems Products
2002-06-01 Sun Microsystems - Business & Industry Solutions
2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions
Sun Solutions
2004-02-02 Sun Microsystems – Solutions
2004-06-10 Gateway Page - Sun Solutions
2006-01-09 Sun Microsystems Solutions & Services
2007-01-03 Services & Solutions
2007-02-07 Sun Services & Solutions
2008-01-19 Sun Solutions
Title Evolution - Example II
www.datacity.com/mainf.html
2000-06-19 DataCity of Manassas Park Main Page
2000-10-12 DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21 DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14 Est1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
Title Evolution Over Time
How much do titles change over time?
Extract available titles of past 14 years
Compute normalized Levenshtein edit distance between titles of copies and baseline (today) (0 = identical; 1 = completely dissimilar)
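The normalized edit distance described above can be sketched as follows. The slides do not say how the distance is normalized, so dividing by the length of the longer title is an assumption; it yields 0 for identical titles and 1 for titles that share no aligned character.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For example, `normalized_distance("Sun Solutions", "Sun Solutions")` is 0.0, while two titles with no characters in common at any position, such as "abc" and "xyz", give 1.0.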
Title Evolution Over Time
Title edit distance frequencies
Decay from 2005 on (with fewer copies available)
4-year-old title: 40% chance to be unchanged

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 

Recently uploaded (20)

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 

Dissertation Defense

  • 1. Using the Web Infrastructure for Real Time Recovery of Missing Web Pages Dissertation Defense Martin Klein mklein@cs.odu.edu Old Dominion University Norfolk, VA 07/18/2011 Committee: Dr. Michael L. Nelson (Advisor) Dr. Yaohang Li Dr. Michele C. Weigle Dr. Mohammad Zubair Dr. Robert Sanderson Dr. Herbert Van de Sompel
  • 2. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 2 Motivation Background
  • 4. The Problem - 404 Errors
  • 5. URIs inaccessible in CS papers: 23%-53% [Lawrence2001]
  • 6. Inaccessible web pages: 67% after 4 years [Koehler2002]
  • 7. Inaccessible objects in DLs: 3% [Nelson2002]
  • 8. URIs inaccessible in high IF journals: 3.8% after 3 months; 13% after 27 months [Dellavalle2003]
  • 9. URIs inaccessible in D-Lib Magazine: ~30% [McCown2005]
  • 10. URIs inaccessible (and not archived) in scholarly articles: ~25% [Sanderson2011]
  • 11. The Problem - 404 Errors
  • 12. Has anybody crawled and indexed it?
  • 13. Do Google, Yahoo!, Bing have a copy of the page?
  • 14. Has the page been archived by a web archive?
  • 15. Information retrieval techniques needed to (re-)discover content
  • 16. The Solution?
  • 17. Requires knowledge about content
  • 18. Problem with homographs (jaguar, present, lead, M/mobile, etc.)
  • 19. Problem with very frequent terms/names (Michael Nelson, Eric Miller, etc.)
  • 20. Web archives
  • 21. Helps for an apple pie recipe but not, e.g., for the web page of a transferred faculty member
  • 22. Content Similarity JCDL 2005 http://www.jcdl2005.org/ July 2005 http://www.jcdl2005.org/ Today 7
  • 23. Content Similarity Hypertext 2006 http://www.ht06.org/ August 2006 http://www.ht06.org/ Today 8
  • 24. Content Similarity PSP 2003 http://www.pspcentral.org/events/annual_meeting_2003.html http://www.pspcentral.org/events/archive/annual_meeting_2003.html August 2003 Today 9
  • 25. Content Similarity ECDL 1999 http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html http://www-rocq.inria.fr/EuroDL99/ October 1999 Today 10
  • 26. Content Similarity Greynet 1999 http://www.konbib.nl/infolev/greynet/2.5.htm 1999 Today ? ? 11
  • 27. Research Questions (1) The Problem Based on the WI, can we use content- and link structure based methods to (re-)discover missing web pages in real time? Investigated Methods: Lexical signatures Titles Tags Link neighborhood lexical signatures 12
  • 28. Research Questions (2) The Problem What are the optimal characteristics of these methods (age, length, etc) with respect to retrieval performance? Can we improve the performance by consolidating two or more methods? Can we have a real-world implementation and evaluation of the above? 13
  • 29. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 14 Motivation Background
  • 31. Lexical Signatures (LSs) First introduced by Phelps and Wilensky [Phelps2000]. Small set of terms capturing “aboutness” of a document, “lightweight” metadata. Resource Abstract 10,000 terms 200 terms 16
  • 32.
  • 34. “How often does this word appear in this document?”
  • 36. “In how many documents does this word appear?”
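The two questions above are the TF and IDF halves of a lexical signature. A minimal sketch, assuming a plain TF × log(N/DF) weighting, a simple alphanumeric tokenizer, and a hypothetical `doc_freq` table; the slides do not state the exact weighting, and every number below is made up for illustration:

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms of `text` with the highest TF-IDF score."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)  # "How often does this word appear in this document?"
    scores = {}
    for term, count in tf.items():
        # "In how many documents does this word appear?"
        # Assumption: a term missing from the table occurs in one document.
        df = doc_freq.get(term, 1)
        scores[term] = count * math.log(corpus_size / df)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical document frequencies in a corpus of 1000 documents.
doc_freq = {"the": 900, "of": 950, "and": 980,
            "wine": 40, "perfect": 120, "cellar": 10}
text = "The perfect wine of the cellar: wine, wine and the cellar"
signature = lexical_signature(text, doc_freq, corpus_size=1000, k=3)
print(signature)
```

Frequent words such as "the" score low despite a high TF because their DF is close to the corpus size, which is why DF estimation matters for the rest of the talk.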
  • 37. Lexical Signatures -- Examples 18
  • 38. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 19 A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008) Motivation Background
  • 39. Accurate IDF Values for LSs Screen scraping the Google web interface 20
  • 40. The Dataset Local universe consisting of copies of URIs from the Internet Archive between 1996 and 2007 21
  • 41.
  • 42. Google N-Grams. Note: N-Grams provide term count (TC) and not DF values – ask me for details 22
  • 43. LSs Example Based on all 3 methods URL: http://www.perfect10wines.com Year: 2007 Union: 12 unique terms 23
  • 44.
  • 45.
  • 46. Comparing LSs Top 5, 10 and 15 terms LC – local universe SC – screen scraping NG – N-Grams 25
  • 47.
  • 48. Compared to the Google N-Gram baseline
  • 49. Screen scraping method seems preferable
  • 50. Similarity scores are slightly higher
  • 51. Feasible in real time!!! Contribution: Established well performing IDF estimation technique. 26
  • 52. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 27 Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008) Motivation Background
  • 53.
  • 55. Neither Phelps and Wilensky nor Park et al. [Park2004] did that
  • 56. Park et al. just re-confirmed their findings after 6 months
  • 57. LSs Over Time - Example 10-term LSs generated for http://www.perfect10wines.com 29
  • 58. LS Overlap Analysis Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URI has been observed Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last 30
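The rooted and sliding comparisons can be sketched as follows. This is an illustration only: the overlap measure (shared terms divided by the size of the larger signature) is an assumption, since the slide does not define it, and the yearly signatures are hypothetical.

```python
def overlap(ls_a, ls_b):
    """Fraction of shared terms between two lexical signatures (assumed measure)."""
    a, b = set(ls_a), set(ls_b)
    if not a and not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    root = ls_by_year[years[0]]
    return {year: overlap(root, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare the LSs of each pair of consecutive observed years."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}


# Hypothetical 4-term signatures of one URI, keyed by year of the archived copy.
ls_by_year = {
    2005: ["wine", "cellar", "vintage", "red"],
    2006: ["wine", "cellar", "vintage", "shop"],
    2007: ["wine", "shop", "order", "gift"],
}
rooted = rooted_overlap(ls_by_year)
sliding = sliding_overlap(ls_by_year)
print(rooted)
print(sliding)
```

In this toy data the rooted overlap drops from 0.75 to 0.25 while the sliding overlap stays at 0.5 or above, the kind of contrast the two analyses are designed to expose.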
  • 59.
  • 60. Highest overlap in the first 1-2 years after creation of the LS
  • 61. Rarely peaks after that – once terms are gone, they do not return
  • 62. Evolution of LSs over Time Sliding Results: Overlap increases over time Seem to reach steady state around 2003 32
  • 63.
  • 64. Identify URI in result set
  • 65. For each URI it is possible that: the URI is returned as the top ranked result; the URI is ranked somewhere between 2 and 10; the URI is ranked somewhere between 11 and 100; the URI is ranked somewhere beyond rank 100 (considered as not returned) 33
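The four retrieval cases can be expressed as a small classifier. A sketch under assumptions: URIs are compared by exact string match (the actual experiments may have normalized URIs), and the function name and case labels are hypothetical.

```python
def retrieval_case(target_uri, result_uris):
    """Map the rank of target_uri in a result list to one of four cases."""
    # Anything beyond rank 100 is treated as not returned.
    for rank, uri in enumerate(result_uris[:100], start=1):
        if uri == target_uri:
            if rank == 1:
                return "top"
            if rank <= 10:
                return "top10"
            return "top100"
    return "undiscovered"


# Hypothetical result list of 150 URIs.
results = ["http://example.org/page%d" % i for i in range(1, 151)]
print(retrieval_case("http://example.org/page1", results))
print(retrieval_case("http://example.org/page120", results))
```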
  • 66.
  • 67. 5-, 6- and 7-term LSs seem best
  • 68. Top mean rank (MR) value with 5 terms
  • 69. Most top ranked with 7 terms
  • 70. Binary pattern: either in top 10 or undiscovered
  • 71. 8 terms and beyond do not show improvement
  • 72. Performance of LSs wrt Length nDCG for LSs consisting of 2-15 terms (mean over all years) 35
  • 73. Performance of LSs over Time nDCG for LSs consisting of 2, 5, 7 and 10 terms 36
  • 74.
  • 75. Rooted: quickly after generation
  • 76. Sliding: seem to stabilize
  • 77. LSs older than 5 years perform poorly
  • 78. 5-, 6- and 7-term LSs seem to perform best
  • 79. 7 – most top ranked
  • 80. 5 – lowest mean rank
  • 81. 2..4 as well as 8+ term LSs are insufficient Contribution: Determined age and length limits for LSs. 37
  • 82. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 38 Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010) Motivation Background
  • 83. The Problem Internet Archive - Wayback Machine (59 copies) www.aircharter-international.com http://web.archive.org/web/*/http://www.aircharter-international.com Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry. Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International 39
  • 84. The Problem www.aircharter-international.com Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry 40
  • 85. The Problem www.aircharter-international.com Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International 41
  • 86. The Idea Contributions Compare performance of two automated methods to rediscover web pages Lexical signatures (LSs) Titles Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery 42
  • 87.
  • 88. Binary retrieval pattern, URI either within top 10 or undiscovered
  • 89.
  • 90. Google and Yahoo! return more URIs for non-quoted titles
  • 92. Combination of Methods: Top Results for Combination of Methods 45
  • 93.
  • 94.
  • 96. 5-term LSs secondary method
  • 97. Results in 75% top ranked URIs. Contributions: Provided evidence for suitability of titles and introduced web page discovery framework. 46
  • 98. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 47 Is This a Good Title? (Hypertext 2010) Motivation Background
  • 99. ??? The Problem http://www.drbartell.com/ Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University 48
  • 100. The Problem http://www.drbartell.com/ Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery 49
  • 101. The Problem www.reagan.navy.mil Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding 50
  • 102. The Problem www.reagan.navy.mil ??? Title: Home Page. Is This a Good Title? 51
  • 103. The Idea Contributions Display title evolution over time Compare to content evolution “Normalize” time as fixed size windows Provide prediction model for title’s retrieval potential 52
  • 104.
  • 105. Binary retrieval pattern, URI either within top 10 or undiscovered
  • 106. Title Evolution - Example I www.sun.com/solutions 1998-01-27 Sun Software Products Selector Guides - Solutions Tree 1999-02-20 Sun Software Solutions 2002-02-01 Sun Microsystems Products 2002-06-01 Sun Microsystems - Business & Industry Solutions 2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions 2004-02-02 Sun Microsystems – Solutions 2004-06-10 Gateway Page - Sun Solutions 2006-01-09 Sun Microsystems Solutions & Services 2007-01-03 Services & Solutions 2007-02-07 Sun Services & Solutions 2008-01-19 Sun Solutions 54
  • 107. Title Evolution - Example II www.datacity.com/mainf.html 2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free 2006-03-14 Est1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB 2000-06-19 DataCity of Manassas Park Main Page 2000-10-12 DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives 2001-08-21 DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives 55
  • 108.
  • 109. Extract available titles of past 14 years
  • 110. Compute normalized Levenshtein edit distance between titles of copies and baseline (today) (0 = identical; 1 = completely dissimilar)
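A normalized edit distance on the 0 to 1 scale described above can be sketched like this. Dividing by the length of the longer title is an assumption; the slide names the measure but not the normalization. The two titles are taken from the www.sun.com/solutions example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """0 = identical, 1 = completely dissimilar (assumed: divide by longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


old_title = "Sun Software Solutions"   # archived copy from 1999
new_title = "Sun Solutions"            # copy from 2008
distance = normalized_distance(old_title, new_title)
print(round(distance, 3))
```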
  • 111.
  • 112. Decay from 2005 on (with fewer copies available)
  • 113. 4-year-old title: 40% chance to be unchanged
  • 114.
  • 115. X: avg edit distance of corresponding titles
  • 117. Semi-transparent: total amount of points plotted
  • 118.
  • 119. Number of nouns, articles etc.
  • 120. Amount of title terms, characters [Ntoulas2006]
  • 121. Observation of re-occurring terms in poorly performing titles - “Stop Titles”: home, index, home page, welcome, untitled document. The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”! 59
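The 75% rule can be illustrated with a short check. This sketch makes two assumptions that the slide does not confirm: the ratio is computed over title terms rather than characters, and the listed stop titles are flattened into a set of single terms. All names are hypothetical.

```python
import re

# Assumption: the stop titles from the slide, split into individual terms.
STOP_TITLE_TERMS = {"home", "index", "page", "welcome", "untitled", "document"}


def stop_title_ratio(title):
    """Share of title terms that belong to the stop-title vocabulary."""
    terms = re.findall(r"[a-z0-9]+", title.lower())
    if not terms:
        return 1.0  # assumption: an empty title is treated as the worst case
    hits = sum(1 for term in terms if term in STOP_TITLE_TERMS)
    return hits / len(terms)


def predicted_insufficient(title, threshold=0.75):
    """Apply the 75% rule: True if the title is mostly stop-title terms."""
    return stop_title_ratio(title) >= threshold


for title in ("Home Page", "Welcome to the home page",
              "Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery"):
    print(title, "->", predicted_insufficient(title))
```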
  • 122.
  • 123. Not all titles equally good
  • 124. If the majority of title terms are Stop Titles, its quality can be predicted to be poor. Contribution: Quantified title evolution and introduced stop titles. 60
  • 125. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 61 Motivation Background Find, New, Copy, Web, Page - Tagging for the (Re-)Discovery of Web Pages (TPDL 2011)
  • 126.
  • 127. Titles BUT What if no archived/cached copy can be found? 62
  • 128. The Solution? The Problem Conferences Digitallibraries Conference Library Jcdl2005 63
  • 129.
  • 130. Test combination of methods to improve retrieval performance
  • 132.
  • 133. Same four retrieval cases introduced earlier
  • 134. nDCG w/ binary relevance scoring
  • 135. Mean Average Precision
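With binary relevance and a single relevant URI per query, both measures reduce to simple functions of the rank at which that URI is returned. A minimal sketch, assuming the common log2 discount for DCG and a cutoff at rank 100; the slides do not say which DCG variant was used, and the ranks below are hypothetical.

```python
import math


def ndcg_single(rank, cutoff=100):
    """nDCG when exactly one result is relevant; rank is 1-based, None if not found.

    The ideal ranking puts the relevant URI at rank 1, so IDCG = 1 / log2(2) = 1.
    """
    if rank is None or rank > cutoff:
        return 0.0
    return 1.0 / math.log2(rank + 1)


def mean_average_precision(ranks):
    """MAP over queries that each have one relevant URI (AP = 1 / rank)."""
    if not ranks:
        return 0.0
    precisions = [0.0 if rank is None else 1.0 / rank for rank in ranks]
    return sum(precisions) / len(precisions)


# Hypothetical ranks of the sought URI for four queries (None = not returned).
ranks = [1, 2, None, 4]
print([round(ndcg_single(r), 3) for r in ranks])
print(mean_average_precision(ranks))
```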
  • 136. The Experiment The Problem Combining methods 66
  • 137.
  • 138. ~50% of tags do not occur in page [Bischoff2008]
  • 140. ~50% of tags do not occur in current version of page
  • 141. ergo: How about previous versions?
  • 142.
  • 143. 66.3% of our tags do not occur in page
  • 144. 4.9% of tags occur in a previous version of the page: Ghost Tags
  • 145. represent a previous version better than the current one
  • 146. What kind of tags are these?
  • 147. Important to the document, to the Delicious user?
  • 148. Ghost Tags The Problem Document importance: TF rank User importance: Delicious rank Normalized rank: 0 - top 1 - bottom 69
  • 149.
  • 150. Combining tags with titles and LSs gains URIs
  • 152. 1/3 of them are important to the page and user. Contributions: Added tags to web page discovery framework and introduced notion of Ghost Tags. 70
  • 153. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 71 Motivation Background Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures (JCDL 2011)
  • 154.
  • 155. Titles BUT What if no archived/cached copy can be found? Plan A: Tags 72
  • 156. Plan B The Problem Link neighborhood Lexical Signatures (LNLSs) is about Computer Dominion Norfolk Monarch extract 73
  • 157.
  • 158. Length
  • 161. Radius of terms on backlink page
  • 162. The Radius on a Backlink Page The Problem Entire page Paragraph Anchor text 75
  • 163.
  • 164. IDF values from Yahoo!
  • 165. 1..7 and 10 terms. Query Yahoo! API. Compute “goodness” (nDCG) 76
  • 166. The Results The Problem 1st and 2nd level level-radius-rank better 77
  • 167. The Results – Radius The Problem All Radii level-radius-rank 78
  • 168. The Results – Backlink Rank The Problem Ranks 10 100 1000 level-radius-rank 79
  • 169. The Results – In Numbers The Problem GOOD 1-anchor-1000 WINNER 1-anchor-10 80
  • 170.
  • 171. Parsed from top 10 backlink pages
  • 173. Consider anchor text only. Contributions: Added LNLS to web page discovery framework. 81
  • 174. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 82 Motivation Background Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)
  • 175. Synchronicity Concluding Remarks Firefox add-on Triggers on 404 error Rediscover page via: Memento Title Lexical signature Tags Link neighborhood lexical signature URI modification http://bit.ly/no-more-404 83
  • 176. Contributions Concluding Remarks Introduce reliable real-time approach to estimate IDF values Workflow for generation of well performing lexical signatures Performance evaluation of web page titles Investigation of tags for web page discovery Analysis of link neighborhood lexical signatures and their optimal parameter Introduce Synchronicity implementing the entire framework 84
  • 178. Next Stop… New Mexico Concluding Remarks 86
  • 179. List of my Relevant Publications
      M. Klein, M. L. Nelson, “A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web”, WIDM 2008, pp. 39-46
      M. Klein, M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008, pp. 371-382
      M. Klein, M. L. Nelson, “Correlation of Term Count and Document Frequency for Google N-Grams”, ECIR 2009, pp. 620-627
      M. Klein, M. L. Nelson, “Inter-Search Engine Lexical Signature Performance”, JCDL 2009, pp. 413-414
      M. Klein, M. L. Nelson, “Investigating the Change of Web Pages Titles Over Time”, InDP 2009
      M. Klein, J. Shipman, M. L. Nelson, “Is This a Good Title?”, Hypertext 2010, pp. 3-12
      M. Klein, M. L. Nelson, “Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure”, JCDL 2010, pp. 59-68
      M. Klein, J. Ware, M. L. Nelson, “Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures”, JCDL 2011
      M. Klein, M. Aly, M. L. Nelson, “Synchronicity - Automatically Rediscover Missing Web Pages in Real Time”, JCDL 2011
      M. Klein, M. L. Nelson, “Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages”, TPDL 2011, to appear
  • 180. References
      Bischoff2008: K. Bischoff, C. Firan, W. Nejdl, R. Paiu, “Can All Tags Be Used for Search?”, Proceedings of CIKM '08, pp. 193-202, 2008
      Dellavalle2003: R. P. Dellavalle, E. J. Hester, L. F. Heilig, A. L. Drake, J. W. Kuntzman, M. Graber, L. M. Schilling, “Information Science: Going, Going, Gone: Lost Internet References”, Science 302(5646), pp. 787-788, 2003
      Jones1973: K. Spärck Jones, “Index Term Weighting”, Information Storage and Retrieval, pp. 619-633, 1973
      Kahle1997: B. Kahle, “Preserving the Internet”, Scientific American 276, pp. 82-83, 1997
      Koehler2002: W. C. Koehler, “Web Page Change and Persistence - A Four-Year Longitudinal Study”, JASIST 53(2), pp. 162-171, 2002
      Lawrence2001: S. Lawrence, D. M. Pennock, G. W. Flake, R. Krovetz, F. M. Coetzee, E. Glover, F. A. Nielsen, A. Kruger, C. L. Giles, “Persistence of Web References in Scientific Research”, Computer 34(2), pp. 26-31, 2001
      McCown2005: F. McCown, S. Chan, M. L. Nelson, J. Bollen, “The Availability and Persistence of Web References in D-Lib Magazine”, Proceedings of IWAW '05, 2005
      Nelson2002: M. L. Nelson, B. D. Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine 8(1), 2002
      Ntoulas2006: A. Ntoulas, M. Najork, M. Manasse, D. Fetterly, “Detecting Spam Web Pages Through Content Analysis”, Proceedings of WWW '06, pp. 83-92, 2006
      Park2004: S. T. Park, D. M. Pennock, C. L. Giles, R. Krovetz, “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web”, TOIS 22(4), pp. 540-572, 2004
      Phelps2000: T. A. Phelps, R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, technical report, UC Berkeley, 2000
      Sanderson2011: R. Sanderson, M. Phillips, H. Van de Sompel, “Analyzing the Persistence of Referenced Web Resources with Memento”, Proceedings of OR '11, 2011
  • 181. Using the Web Infrastructure for Real Time Recovery of Missing Web Pages Martin Klein mklein@cs.odu.edu http://www.cs.odu.edu/~mklein/
  • 183.
  • 184. Find more Stop Titles
  • 189.
  • 190.
  • 191. Shown, screen scraping works but
  • 192. missing validation of baseline (Google N-Grams)
  • 193. N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF → what is their relationship?
  • 194.
  • 195. 95 Experiment Results Investigate correlation between TC and DF within “Web as Corpus” (WaC) Rank similarity of all terms
  • 196. 96 Experiment Results Investigate correlation between TC and DF within “Web as Corpus” (WaC) Spearman’s ρ and Kendall τ
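Both rank correlation coefficients named above can be computed in a few lines. A sketch in pure Python, assuming no tied values and at least two terms (tau-a and the simple Spearman formula); the TC and DF numbers are hypothetical, not taken from the WaC experiment.

```python
from itertools import combinations


def ranks(values):
    """Rank positions (1 = largest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical term counts and document frequencies for five terms.
tc = [500, 300, 120, 80, 10]
df = [200, 150, 90, 95, 5]
rho = spearman_rho(tc, df)
tau = kendall_tau(tc, df)
print(round(rho, 2), round(tau, 2))
```

Only one pair of terms swaps order between the two lists, so both coefficients stay close to 1, which is the pattern that would support using TC as a stand-in for DF.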
  • 197. 97 Experiment Results Top 10 terms in decreasing order of their TF/IDF values taken from http://ecir09.irit.fr U = 14 ∩ = 6 Strong indicator that TC can be used to estimate DF for web pages! Google: screen scraping DF values from the Google web interface
  • 198. 98 Experiment Results Show similarity between WaC based TC and Google N-Gram based TC TC frequencies N-Grams have a threshold of 200
  • 199. Experiment Results Frequency of TC/DF Ratio Within the WaC Integer Values Two Decimals One Decimal 99
  • 200.
  • 201. TC frequencies of WaC and Google N-Grams are very similar
  • 202. N-Grams are suitable for accurate IDF estimation for web pages Does not mean everything correlated to TC can be used as DF substitute! 100
  • 203. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 101 Inter-Search Engine Lexical Signature Performance (JCDL 2009) Motivation Background
  • 204. Inter-Search Engine Lexical Signature Performance http://en.wikipedia.org/wiki/Elephant Elephant Tusks Trunk African Loxodonta Elephant, Asian, African Species, Trunk Elephant, African, Tusks Asian, Trunk
  • 205. 103
  • 206. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 104 Motivation Background Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)
  • 207.
  • 208. Events reveal underlying pattern, framework bigger than any of the synchronous systems
  • 209. Carl Gustav Jung (1875-1961)
  • 211. Deschamps – de Fontgibu plum pudding example; picture from http://www.crystalinks.com/jung.html 105
  • 212. Synchro…What? Repo Man (1984) http://www.imdb.com/title/tt0087995/ http://www.youtube.com/watch?v=X4HQyqc-aVU 106
  • 213. Agenda LSs for Web Pages DF Estimation Techniques TC-DF Correlation Web Page Titles Synchronicity Link Neighborhood LSs Book of the Dead Web Page Tags 107 Motivation Background (Not yet published)
  • 214.
  • 215. 233 URIs returning status 404
  • 216. Mechanical Turk to determine “aboutness”
  • 217. Guess from URI string
  • 219. Apply lexical signatures and title
  • 220. 5-term LSs Titles 109 Experiment Results Dice Similarity Coefficient of Top 100 Results D = 0 0.0 < D ≤ 0.3 0.3 < D ≤ 0.6 0.6 < D ≤ 1.0
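The Dice similarity coefficient and the four bins shown on this slide can be sketched as follows. What exactly was compared in the experiment is not stated in the transcript, so treating both inputs as sets of terms is an assumption, and the example term sets are hypothetical.

```python
def dice(terms_a, terms_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over two collections of terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def dice_bucket(d):
    """Assign a coefficient to one of the four bins used on the slide."""
    if d == 0:
        return "D = 0"
    if d <= 0.3:
        return "0.0 < D <= 0.3"
    if d <= 0.6:
        return "0.3 < D <= 0.6"
    return "0.6 < D <= 1.0"


# Hypothetical term sets for a sought page and one search result.
sought = {"charter", "jet", "air"}
result = {"jet", "air", "cargo", "lease"}
d = dice(sought, result)
print(round(d, 3), dice_bucket(d))
```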
  • 221. 5-term LSs Titles 110 Experiment Results Jaro Distance of Top 100 Results J = 0 0.0 < J ≤ 0.3 0.3 < J ≤ 0.6 0.6 < J ≤ 1.0
  • 222.
  • 228. nDCG of top 10 results
  • 229. 5-term LSs Titles 112 Experiment Results Relevance of Top 10 Results
  • 230. 113 Experiment Results nDCG of Top 10 Results