Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure

Presentation I gave at JCDL 2010 in Surfers Paradise, Gold Coast, Australia. The paper can be found at:
http://doi.acm.org/10.1145/1816123.1816133

Presentation Transcript

  • Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure. Martin Klein and Michael L. Nelson, Old Dominion University, {mklein,mln}@cs.odu.edu. JCDL 2010, Gold Coast, Australia, 06/22/2010. This work is supported in part by the Library of Congress.
  • The Problem: Professional Scholarly Publishing 2003, http://www.pspcentral.org/events/annual_meeting_2003.html
  • The Problem: Professional Scholarly Publishing 2003, http://www.pspcentral.org/events/archive/annual_meeting_2003.html
  • The Problem: URI Content Mapping (four cases, each comparing time A with a later time B)
      1. The same URI maps to the same or very similar content at a later time
      2. The same URI maps to different content at a later time
      3. A different URI maps to the same or very similar content at the same or at a later time
      4. The content can not be found at any URI
  • The Problem: Internet Archive, www.aircharter-international.com
      • Wayback Machine: http://web.archive.org/web/*/http://www.aircharter-international.com (59 copies)
      • Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
      • Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
  • The Problem: www.aircharter-international.com, Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
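
A lexical signature of this kind can be sketched as the k terms of a page with the highest TF-IDF score. The snippet below is a minimal illustration, not the authors' implementation: the tokenizer, the `doc_freq` table, the `corpus_size` value, and the sample text are all made-up assumptions (the deck obtains IDF values from search engines instead).

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms of `text` with the highest TF-IDF score.

    doc_freq maps a term to the (assumed) number of documents containing it;
    corpus_size is the (assumed) total number of documents.
    """
    terms = re.findall(r"[a-z]+", text.lower())
    tf = Counter(terms)
    scores = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 1)  # unseen terms are treated as very rare
        scores[term] = count * math.log(corpus_size / df)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical page text and document frequencies, for illustration only.
page = "charter aircraft cargo charter jet the the the air charter"
df = {"the": 900, "air": 200, "charter": 20, "aircraft": 30, "cargo": 40, "jet": 50}
signature = lexical_signature(page, df, corpus_size=1000, k=3)
print(" ".join(signature))
```

Frequent but common words such as "the" score low because their IDF is close to zero, which is why they drop out of the signature.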
  • The Problem: www.aircharter-international.com, Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
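
Obtaining a title from an archived or cached copy can be sketched with Python's standard `html.parser`. This is an assumed approach for illustration; the class and function names are hypothetical, and the quoted and non-quoted query forms simply mirror the two variants the deck evaluates.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <Title> and <TITLE> also match.
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or '' if none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def title_queries(title):
    """Build the non-quoted and quoted search engine query for a title."""
    return {"non_quoted": title, "quoted": f'"{title}"'}


sample = "<html><head><Title> Air Charter\n International </Title></head><body>x</body></html>"
print(title_queries(extract_title(sample)))
```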
  • The Problem: If no archived/cached copy can be found...
      • Tags: GET https://user:pass@api.del.icio.us/v1/posts/suggest?url=http://yahoo.com/
          <?xml version="1.0" encoding="UTF-8"?>
          <suggest>
            <popular>web</popular>
            <popular>tools</popular>
            <popular>searchengines</popular>
            <recommended>yahoo!</recommended>
            <recommended>yahoo</recommended>
            <recommended>web</recommended>
            <recommended>tools</recommended>
            <recommended>search</recommended>
            <recommended>reference</recommended>
            <recommended>portal</recommended>
            <recommended>news</recommended>
          </suggest>
      • Link Neighborhood (LNLS): [diagram of the missing page "?" and its neighboring pages A, B, and C]
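
The tag response shown on the slide can be parsed with the standard library. The sketch below works on the XML copied from the slide rather than on a live request, since the availability of that endpoint is not something this transcript can confirm; `parse_suggest` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Response body as shown on the slide (bytes, so the encoding declaration is honored).
SUGGEST_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<suggest>
  <popular>web</popular>
  <popular>tools</popular>
  <popular>searchengines</popular>
  <recommended>yahoo!</recommended>
  <recommended>yahoo</recommended>
  <recommended>web</recommended>
  <recommended>tools</recommended>
  <recommended>search</recommended>
  <recommended>reference</recommended>
  <recommended>portal</recommended>
  <recommended>news</recommended>
</suggest>"""


def parse_suggest(xml_bytes):
    """Split a <suggest> response into its popular and recommended tags."""
    root = ET.fromstring(xml_bytes)
    return {
        "popular": [el.text for el in root.findall("popular")],
        "recommended": [el.text for el in root.findall("recommended")],
    }


tags = parse_suggest(SUGGEST_XML)
# One possible tag query: the popular tags joined into a single search string.
print(" ".join(tags["popular"]))
```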
  • The Problem: http://www.drbartell.com/, Lexical Signature (TF/IDF) ???: Plastic Surgeon Reconstructive Dr Bartell Symbol University
  • The Problem: www.reagan.navy.mil, Title ???: Home Page
  • Contributions
      • Compare performance of four automated methods to rediscover web pages
          1. Lexical signatures (LSs)
          2. Titles
          3. Tags
          4. Link neighborhood lexical signatures (LNLS)
      • Analysis of title characteristics with respect to their retrieval performance
      • Evaluate performance of combinations of methods and suggest a workflow for real-time web page rediscovery
  • Experiment - Data Gathering
      • 500 URIs randomly sampled from DMOZ
      • Applied filters
          • .com, .org, .net, .edu domains
          • English language
          • min. of 50 terms [Park]
      • Results in 309 URIs to download and parse
      [Park] S.T. Park et al. “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web” ACM ToIS 22(4):540-572, 2004
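
The domain and minimum-length filters can be sketched as a simple predicate. This is an illustration under stated assumptions: terms are counted by whitespace splitting, the English-language filter is left out because the transcript does not say how it was applied, and `passes_filters` is a hypothetical name.

```python
from urllib.parse import urlparse

ALLOWED_TLDS = (".com", ".org", ".net", ".edu")
MIN_TERMS = 50  # minimum page length in terms, following [Park]


def passes_filters(uri, page_text):
    """Return True if the URI is in an allowed domain and the page is long enough.

    The language filter from the slide is intentionally not implemented here.
    """
    host = urlparse(uri).hostname or ""
    if not host.endswith(ALLOWED_TLDS):
        return False
    return len(page_text.split()) >= MIN_TERMS


# Made-up sample data for illustration.
long_text = "term " * 60
candidates = [
    ("http://www.example.com/a", long_text),
    ("http://www.example.de/b", long_text),
    ("http://www.example.org/c", "too short"),
]
kept = [uri for uri, text in candidates if passes_filters(uri, text)]
print(kept)
```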
  • Experiment - Data Gathering
      • Extract title
          • <Title>...</Title>
      • Generate 3 LSs per page
          • IDF values obtained from Google, Yahoo!, MSN Live
      • Obtain tags from delicious.com API
          • only 15% of URIs
      • Obtain link neighborhood from Yahoo! API (max. 50 URIs)
      • Generate LNLS
          • TF from “bucket” of words per neighborhood
          • IDF obtained from Yahoo! API
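
The "bucket of words" idea for the LNLS can be sketched as follows: the terms of all neighboring pages are pooled, term frequency is counted over the pool, and the top-scoring terms form the signature. The tokenizer, the document frequencies, and the sample neighbor texts are assumptions for illustration; the deck takes IDF values from the Yahoo! API.

```python
import math
import re
from collections import Counter


def neighborhood_signature(neighbor_texts, doc_freq, corpus_size, k=5):
    """Build a k-term signature from the pooled text of neighboring pages."""
    bucket = Counter()
    for text in neighbor_texts:
        # TF is counted over the whole neighborhood, not per page.
        bucket.update(re.findall(r"[a-z]+", text.lower()))

    def score(term):
        return bucket[term] * math.log(corpus_size / doc_freq.get(term, 1))

    return sorted(bucket, key=lambda t: (-score(t), t))[:k]


# Made-up neighbor pages and document frequencies.
neighbors = ["jet charter flights", "private jet lease", "the jet the charter"]
df = {"the": 900, "jet": 50, "charter": 20, "private": 100, "lease": 80, "flights": 60}
lnls = neighborhood_signature(neighbors, df, corpus_size=1000, k=3)
print(" ".join(lnls))
```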
  • LS Retrieval Performance: 5- and 7-Term LSs
      • Yahoo! returns most URIs top ranked and leaves least undiscovered
      • Binary retrieval pattern: URI either within top 10 or undiscovered
      [Bar chart: URLs in % for Top, Top10, Top100, Undiscovered; series Google, Yahoo, MSN]
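
The four result categories used in these charts can be sketched as a function of the rank at which the target URI appears. The exact boundaries are an assumption here (Top as rank 1, Top10 as ranks 2 to 10, Top100 as ranks 11 to 100); the transcript does not spell them out.

```python
from collections import Counter


def classify_rank(result_uris, target_uri):
    """Map the position of target_uri in a result list to a category."""
    try:
        rank = result_uris.index(target_uri) + 1  # 1-based rank
    except ValueError:
        return "Undiscovered"
    if rank == 1:
        return "Top"
    if rank <= 10:
        return "Top10"
    if rank <= 100:
        return "Top100"
    return "Undiscovered"


def category_percentages(categories):
    """Turn a list of per-URI categories into percentages per category."""
    counts = Counter(categories)
    total = len(categories)
    return {name: 100.0 * n / total for name, n in counts.items()}


# Made-up result lists for illustration.
results = [f"http://example.com/{i}" for i in range(20)]
outcomes = [
    classify_rank(results, "http://example.com/0"),
    classify_rank(results, "http://example.com/5"),
    classify_rank(results, "http://example.com/14"),
    classify_rank(results, "http://example.com/missing"),
]
print(category_percentages(outcomes))
```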
  • Title Retrieval Performance: Non-Quoted and Quoted Titles
      • Results at least as good as for LSs
      • Google and Yahoo! return more URIs for non-quoted titles
      • Same binary retrieval pattern
      [Bar chart: URLs in % for Top, Top10, Top100, Undiscovered; series Google, Yahoo, MSN]
  • Tags Retrieval Performance: Yahoo Results
      • API returns up to top10 tags - distinguish between # of tags queried
      • Low # of URIs
      [Bar chart: Frequency by Number of Tags (1 to 10); series Top, Top10, Top100, Undiscovered]
  • LNLS Retrieval Performance: Yahoo Results
      • 5- and 7-term LNLSs
      • < 5% top ranked
      [Bar chart: URLs in % for Top, Top10, Top100, Undiscovered; 5- and 7-Term Neighborhood Lexical Signatures]
  • Combination of Methods: Can we achieve better retrieval performance if we combine 2 or more methods?
      [Workflow diagram: Query LS → Done; Query Title → Done; Query Tags → Done; Query LNLS]
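
One way to read this workflow is as a cascade: each method's query is issued in turn and the process stops as soon as a result is accepted. The sketch below assumes that reading. The `search` and `accept` callables are hypothetical stand-ins for a search engine call and for whatever check (or user judgment) decides that a result is the wanted page.

```python
def rediscover(queries, search, accept):
    """Try (method, query) pairs in order and stop at the first accepted hit.

    queries: ordered list of (method_name, query_string_or_None)
    search:  callable returning a ranked list of URIs for a query (assumed)
    accept:  callable deciding whether a URI is the wanted page (assumed)
    """
    for method, query in queries:
        if not query:
            continue  # e.g. no tags exist for this page
        for uri in search(query):
            if accept(uri):
                return method, uri  # "Done"
    return None  # nothing found by any method


# Made-up search index standing in for a real search engine.
fake_index = {
    "plastic surgeon reconstructive": ["http://other.example.com/"],
    "Dr Bartell Home": ["http://new.example.com/bartell"],
}
queries = [
    ("LS", "plastic surgeon reconstructive"),
    ("Title", "Dr Bartell Home"),
    ("Tags", None),
    ("LNLS", "surgery clinic"),
]
found = rediscover(
    queries,
    search=lambda q: fake_index.get(q, []),
    accept=lambda uri: uri == "http://new.example.com/bartell",
)
print(found)
```

Passing the methods as an ordered list keeps the order itself a parameter, which matters because the deck compares several orderings.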
  • Combination of Methods (% of URIs; LS5/LS7 = 5-/7-term lexical signature, TI = title, TA = tags, Undis = undiscovered)

      Engine     Method   Top    Top10   Undis
      Google     LS5      50.8   12.6    32.4
      Google     LS7      57.3    9.1    31.1
      Google     TI       69.3    8.1    19.7
      Google     TA        2.1   10.6    75.5
      Yahoo!     LS5      67.6    7.8    22.3
      Yahoo!     LS7      66.7    4.5    26.9
      Yahoo!     TI       63.8    8.1    27.5
      Yahoo!     TA        6.4   17.0    63.8
      MSN Live   LS5      63.1    8.1    27.2
      MSN Live   LS7      62.8    5.8    29.8
      MSN Live   TI       61.5    6.8    30.7
      MSN Live   TA        0      8.5    80.9
  • Combination of Methods: Top Results for Combination of Methods

      Combination   Google   Yahoo!   MSN Live
      LS5-TI        65.0     73.8     71.5
      LS7-TI        70.9     75.7     73.8
      TI-LS5        73.5     75.7     73.1
      TI-LS7        74.1     75.1     74.1
      LS5-TI-LS7    65.4     73.8     72.5
      LS7-TI-LS5    71.2     76.4     74.4
      TI-LS5-LS7    73.8     75.7     74.1
      TI-LS7-LS5    74.4     75.7     74.8
      LS5-LS7       52.8     68.0     64.4
      LS7-LS5       59.9     71.5     66.7
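
Because the order of methods changes the numbers in this table, a combination cannot be a plain union of top-ranked results. One plausible reading, which is an assumption and not stated in the transcript, is that a later method is only tried for URIs the earlier method left undiscovered. The sketch below computes the top-ranked percentage under that reading, using made-up per-URI outcomes.

```python
def combined_top_rate(outcomes, order):
    """Percentage of URIs top ranked when methods are tried in `order`.

    outcomes: one dict per URI mapping method name to its category
    order:    method names in the order they are tried
    Assumption: the next method is tried only if the previous one
    left the URI "Undiscovered".
    """
    top = 0
    for per_method in outcomes:
        for method in order:
            category = per_method[method]
            if category != "Undiscovered":
                if category == "Top":
                    top += 1
                break  # a discovered URI is not queried again
    return 100.0 * top / len(outcomes)


# Made-up outcomes for four URIs, for illustration only.
toy = [
    {"LS5": "Top", "TI": "Top"},
    {"LS5": "Top10", "TI": "Top"},
    {"LS5": "Undiscovered", "TI": "Top"},
    {"LS5": "Undiscovered", "TI": "Undiscovered"},
]
print(combined_top_rate(toy, ["LS5", "TI"]), combined_top_rate(toy, ["TI", "LS5"]))
```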
  • Title Characteristics: Length in # of Terms
      • Length varies between 1 and 43 terms
      • Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]
      [Histogram: Frequency by Title Length in Number of Terms; series Top, Top10, Top100, Undiscovered]
      [Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2006, pp 83-92
  • Title Characteristics: Length in # of Characters
      • Length varies between 4 and 294 characters
      • Short titles (<10) do not perform well
      • Length between 10 and 70 most common
      • Length between 10 and 45 seems to perform best
      [Histogram: Frequency by Title Length in Number of Characters; series Top, Top10, Top100, Undiscovered]
  • Title Characteristics: Mean # of Characters, # of Stop Words
      • Title terms with a mean of 5, 6, 7 characters seem most suitable for well performing terms
      • More than 1 or 2 stop words hurts performance
      [Two histograms: Frequency by Mean Characters per Title Term and Frequency by Number of Stopwords; series Top, Top10, Top100, Undiscovered]
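
The title characteristics examined on these slides (length in terms, length in characters, mean characters per term, number of stop words) can be computed with a few lines. The stop word list below is a small illustrative assumption, and terms are split on whitespace; neither is taken from the paper.

```python
# Small illustrative stop word list; the list used in the study is not given here.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to"}


def title_features(title):
    """Compute simple length and stop word statistics for a title."""
    terms = title.split()
    mean_chars = sum(len(t) for t in terms) / len(terms) if terms else 0.0
    return {
        "num_terms": len(terms),
        "num_chars": len(title),
        "mean_chars_per_term": mean_chars,
        "num_stop_words": sum(t.lower() in STOP_WORDS for t in terms),
    }


print(title_features("Home Page"))
print(title_features("History of the Web"))
```

Features like these could feed a simple rule, for example preferring the title query only when the title has 3 to 6 terms and few stop words, in line with the observations on the slides.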
  • Concluding Remarks
      • Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% URIs top ranked.
      • Tags and link neighborhood LSs do not seem to significantly contribute to the retrieval of the web pages.
      • Titles are much cheaper to obtain than LSs.
      • The combination of primarily querying titles and 5-term LSs as a second option returns more than 75% URIs top ranked.
      • Not all titles are equally good. [Klein]
      • Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.
      [Klein] M. Klein et al. “Is This a Good Title?” In Proceedings of Hypertext 2010
  • Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure. Questions? Martin Klein and Michael L. Nelson, Old Dominion University, {mklein,mln}@cs.odu.edu