Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures


Lexical signatures, small sets of words chosen to represent the “aboutness” of a page, have previously been proposed for discovering the new URI of a missing web page. However, prior methods relied on computing the lexical signature before the page was lost, or on using cached or archived versions of the page to calculate it. We demonstrate a system that constructs a lexical signature for a page from its link neighborhood, that is, its “backlinks”: the pages that link to the missing page. After testing various methods, we show that one can construct a lexical signature for a missing web page using only ten backlink pages. Further, we show that only the first level of backlinks is useful in this effort. The text that the backlinks use to point to the missing page serves as input for creating a four-word lexical signature. That lexical signature is shown to successfully find the target URI in more than half of the test cases.


  1. 1. Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures
     Martin Klein, Jeb Ware, Michael L. Nelson
     {mklein,jware,mln}
     JCDL 2011, Ottawa, Canada, 07/14/2011
  2. 2. The Problem
  3. 3. The Problem
  4. 4. The Problem
  5. 5. Previously on… Missing Web Pages
     There was a (IKEA) web page
     Use Memento to get an archived copy
       • Extract title [Klein2010a]
       • Generate lexical signature [Klein2010b]
     Query search engines
     What if there is no archived copy?
       • Tags (scarcity problem) [Klein2011]
     Link neighborhood lexical signatures
  6. 6. Link Neighborhood
     [Diagram: backlink pages A, B, C (#1, #2, #3); labels “is about” and “extract”; example terms: IKEA, Bjoern, Oslo, Dorm room, Nobel, Herring]
  7. 7. Lexical Signatures (LSs)
     First introduced by Phelps and Wilensky [Phelps2000]
     Small set of terms capturing “aboutness” of a document, “lightweight” metadata
     [Diagram labels: Resource, Abstract; 10,000 terms, 200 terms]
  8. 8. Research Questions
     What is a good length of link neighborhood lexical signatures?
       • 5 or 7 terms for lexical signatures [Klein2008]
       • 5..8 for tags [Klein2011]
     How many backlinks to include?
     The more backlink levels the better?
     What radius on the backlink page to use?
  9. 9. The Radius on a Backlink Page
     Entire page
     Paragraph
     Anchor text
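
To make the notion of a radius concrete, here is a minimal Python sketch that pulls the anchor text of links to a target URI out of a backlink page, optionally widened by a number of words on each side (the appendix slides mention ±5 and ±10 words). The class and function names are hypothetical, the exact-match test on `href` is a simplifying assumption, and the paragraph and entire-page radii are not covered; this is not the authors' implementation.

```python
from html.parser import HTMLParser


class AnchorRadiusParser(HTMLParser):
    """Collect page words and remember which ones sit inside links to the target."""

    def __init__(self, target_uri):
        super().__init__()
        self.target_uri = target_uri
        self.words = []          # all text words in document order (scripts included)
        self.anchor_spans = []   # (start, end) word indexes of matching anchors
        self._anchor_start = None

    def handle_starttag(self, tag, attrs):
        # Assumption: a link matches only if its href equals the target URI exactly.
        if tag == "a" and dict(attrs).get("href") == self.target_uri:
            self._anchor_start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_start is not None:
            self.anchor_spans.append((self._anchor_start, len(self.words)))
            self._anchor_start = None

    def handle_data(self, data):
        self.words.extend(data.split())


def text_in_radius(html, target_uri, radius=0):
    """Return anchor text plus `radius` words on each side, one string per link."""
    parser = AnchorRadiusParser(target_uri)
    parser.feed(html)
    parser.close()
    return [
        " ".join(parser.words[max(0, start - radius):end + radius])
        for start, end in parser.anchor_spans
    ]
```

With `radius=0` only the anchor text is returned; a larger value approximates the "anchor text ± n words" variants.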
  10. 10. The Dataset
      309 URIs [Klein2010b]
      28,325 first-level and 306,700 second-level backlinks
      Filter for language, file type, etc.
      12% discarded
        • Lexical signature generation
        • IDF values from Yahoo! [Klein2010b]
        • 1..7 and 10 terms
      Query Yahoo! API
      Compute “goodness” (nDCG)
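
The “goodness” score can be illustrated with a short Python sketch of nDCG. The slides do not spell out the relevance scheme, so this assumes binary relevance with the target URI as the only relevant result; under that assumption the ideal DCG is 1 and nDCG reduces to 1 / log2(rank + 1). The function name is hypothetical.

```python
import math


def ndcg_single_target(result_uris, target_uri):
    """nDCG when exactly one result (the target URI) is relevant.

    DCG = sum(rel_i / log2(i + 1)); with a single relevant document the ideal
    DCG is 1, so nDCG is 1 / log2(rank + 1), or 0 if the target is absent.
    """
    for rank, uri in enumerate(result_uris, start=1):
        if uri == target_uri:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

A target returned at rank 1 scores 1.0, at rank 3 it scores 0.5, and a result list without the target scores 0.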
  11. 11. The Results
      [Plot: 1st and 2nd backlink level; series labeled level-radius-rank; “better” direction marker]
  12. 12. The Results – Radius
      [Plot: all radii; series labeled level-radius-rank]
  13. 13. The Results – Backlink Rank
      [Plot: ranks 10, 100, 1000; series labeled level-radius-rank]
  14. 14. The Results – In Numbers
      GOOD: 1-anchor-1000
      WINNER: 1-anchor-10
  15. 15. Synchronicity
      Concluding Remarks
      Firefox add-on
      Triggers on 404 error
      Rediscover page via:
        • Title
        • Lexical signature
        • Tags
        • Link neighborhood lexical signature
        • URI modification
      Example: conference home page
        • http://www.jcdl2011.org 2011 JCDL Libraries
  16. 16. In Conclusion…
  17. 17. Conclusions and Future Work
      Optimal link neighborhood lexical signatures:
        • Contain 4 terms
        • Parsed from top 10 backlink pages
        • Include first backlink level only
        • Consider anchor text only
      Define “stop anchors”
        • “click here”, “homepage”, etc.
      Find optimum between 10 and 100 backlinks
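
The “stop anchors” idea is listed as future work, so the following Python sketch is only one way it might look: anchor texts that carry no information about the target page are dropped before signature terms are computed. The stop list contains just the two examples named on the slide; the function name and the normalization (lowercasing, collapsing whitespace) are assumptions.

```python
STOP_ANCHORS = {"click here", "homepage"}  # the two examples named on the slide


def drop_stop_anchors(anchor_texts, stop_anchors=STOP_ANCHORS):
    """Remove anchor texts that, after normalization, are generic stop anchors."""
    kept = []
    for text in anchor_texts:
        # Lowercase and collapse runs of whitespace before comparing.
        normalized = " ".join(text.lower().split())
        if normalized and normalized not in stop_anchors:
            kept.append(text)
    return kept
```

Filtering whole anchors rather than individual words keeps a term like “homepage” when it appears inside a longer, more descriptive anchor.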
  18. 18. References
      [Jones73] K. Spärck Jones, “Index Term Weighting”, Information Storage and Retrieval, pp. 619-633, 1973
      [Klein2008] M. Klein, M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008, pp. 371-382
      [Klein2010a] M. Klein, J. Shipman, M. L. Nelson, “Is This a Good Title”, Hypertext 2010, pp. 3-12
      [Klein2010b] M. Klein, M. L. Nelson, “Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure”, JCDL 2010, pp. 59-68
      [Klein2011] M. Klein, M. L. Nelson, “Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages”, TPDL 2011, to appear
      [Phelps2000] T. A. Phelps, R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, technical report, University of California at Berkeley, 2000
  19. 19. Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures
      Martin Klein, Jeb Ware, Michael L. Nelson
      {mklein,jware,mln}
  20. 20. Generation of Lexical Signatures
      Following TF-IDF scheme first introduced by Karen Spärck Jones [Jones73]
      Term frequency (TF):
        • “How often does this word appear in this document?”
      Inverse document frequency (IDF):
        • “In how many documents does this word appear?”
  21. 21. The Results – Backlink Level
      [Plot: anchor text ± 5 words; series labeled level-radius-rank]
  22. 22. The Results – Backlink Level
      [Plot: anchor text ± 10 words; series labeled level-radius-rank]
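
The TF-IDF scheme named on the “Generation of Lexical Signatures” slide can be sketched in Python as follows. The slides state that IDF values came from Yahoo!; here a plain dictionary of document frequencies (`doc_freq`) and a corpus size stand in for that source. Both are hypothetical inputs, as are the function name, the tokenizer, and the common IDF form log(N / df), which the slides do not specify.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, corpus_size, length=4):
    """Return the `length` terms of `text` with the highest TF-IDF score."""
    # TF: how often does each word appear in this document?
    tf = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {}
    for term, count in tf.items():
        # IDF: in how many documents does the word appear?
        df = doc_freq.get(term, 1)  # assumption: unseen terms count as one document
        scores[term] = count * math.log(corpus_size / df)
    # Sort by score (descending), then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:length]
```

The default `length=4` mirrors the four-term signatures reported as optimal; passing the concatenated anchor texts of the top backlink pages as `text` would yield a link neighborhood lexical signature under these assumptions.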