Dissertation Defense


Martin Klein's dissertation defense slides.

Published in: Technology, News & Politics

  1. 1. Using the Web Infrastructure<br />for Real Time Recovery<br />of Missing Web Pages<br />Dissertation Defense<br />Martin Klein<br />mklein@cs.odu.edu<br />Old Dominion University<br />Norfolk, VA<br />07/18/2011<br />Committee:<br />Dr. Michael L. Nelson (Advisor)<br />Dr. Yaohang Li<br />Dr. Michele C. Weigle<br />Dr. Mohammad Zubair<br />Dr. Robert Sanderson<br />Dr. Herbert Van de Sompel<br />
  2. 2. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />2<br />Motivation<br />Background<br />
  3. 3. The Problem<br />3<br />
  4. 4. The Problem - 404 Errors<br /><ul><li>Expected lifetime of a web page is 44 days [Kahle1997]
  5. 5. URIs inaccessible in CS papers: 23%-53% [Lawrence2001]
  6. 6. Inaccessible web pages: 67% after 4 years [Koehler2002]
  7. 7. Inaccessible objects in DLs: 3% [Nelson2002]
  8. 8. URIs inaccessible in high IF journals: 3.8% after 3 months; 13% after 27 months [Dellavalle2003]
  9. 9. URIs inaccessible in D-Lib Magazine: ~30% [McCown2005]
  10. 10. URIs inaccessible (and not archived) in scholarly articles: ~25% [Sanderson2011]</li></ul>4<br />
  11. 11. The Problem - 404 Errors<br /><ul><li>Are they really gone? Or just relocated?
  12. 12. Has anybody crawled and indexed it?
  13. 13. Do Google, Yahoo!, Bing have a copy of the page?
  14. 14. Has the page been archived by a web archive?
  15. 15. Information retrieval techniques needed to (re-)discover content</li></ul>5<br />
  16. 16. The Solution?<br /><ul><li>Search engines
  17. 17. Requires knowledge about content
  18. 18. Problem with homographs (jaguar, present, lead, M/mobile, etc)
  19. 19. Problem with very frequent terms/names (Michael Nelson, Eric Miller, etc)
  20. 20. Web archives
  21. 21. Helps for apple pie recipe but not for web page of transferred faculty, e.g.</li></ul>6<br />
  22. 22. Content Similarity<br />JCDL 2005<br />http://www.jcdl2005.org/<br />July 2005<br />http://www.jcdl2005.org/<br />Today<br />7<br />
  23. 23. Content Similarity<br />Hypertext 2006<br />http://www.ht06.org/<br />August 2006<br />http://www.ht06.org/<br />Today<br />8<br />
  24. 24. Content Similarity<br />PSP 2003<br />http://www.pspcentral.org/events/annual_meeting_2003.html<br />http://www.pspcentral.org/events/archive/annual_meeting_2003.html<br />August 2003<br />Today<br />9<br />
  25. 25. Content Similarity<br />ECDL 1999<br />http://www.informatik.uni-trier.de/~ley/<br />db/conf/ercimdl/ercimdl99.html<br />http://www-rocq.inria.fr/EuroDL99/<br />October 1999<br />Today<br />10<br />
  26. 26. Content Similarity<br />Greynet 1999<br />http://www.konbib.nl/infolev/greynet/2.5.htm<br />1999<br />Today<br />?<br />?<br />11<br />
  27. 27. Research Questions (1)<br />The Problem<br />Based on the WI, can we use content- and link structure based methods to (re-)discover missing web pages in real time?<br />Investigated Methods:<br />Lexical signatures<br />Titles<br />Tags<br />Link neighborhood lexical signatures<br />12<br />
  28. 28. Research Questions (2)<br />The Problem<br />What are the optimal characteristics of these methods (age, length, etc) with respect to retrieval performance?<br />Can we improve the performance by consolidating two or more methods?<br />Can we have a real-world implementation and evaluation of the above?<br />13<br />
  29. 29. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />14<br />Motivation<br />Background<br />
  30. 30. Memento, Web Infrastructure (WI)<br />15<br />
  31. 31. Lexical Signatures (LSs)<br />First introduced by Phelps and Wilensky [Phelps2000]<br />Small set of terms capturing “aboutness” of a document, “lightweight” metadata<br />Resource<br />Abstract<br />10,000 terms<br />200 terms<br />16<br />
  32. 32. Lexical Signature Generation<br /><ul><li>Following the TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones1973]
  33. 33. Term frequency (TF):
  34. 34. “How often does this word appear in this document?”
  35. 35. Inverse document frequency (IDF):
  36. 36. “In how many documents does this word appear?” (the fewer documents, the higher the IDF)</li></ul>17<br />
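The TF-IDF scheme on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the dissertation: the function name `lexical_signature`, the natural-log IDF variant, the fallback document frequency of 1 for unseen terms, and all numbers in the example are assumptions made for the sketch.

```python
import math
from collections import Counter


def lexical_signature(doc_terms, doc_freq, corpus_size, k=5):
    """Return the k terms of doc_terms with the highest TF-IDF score."""
    tf = Counter(doc_terms)  # TF: how often each term occurs in this document
    scores = {}
    for term, count in tf.items():
        # Assumption: a term missing from doc_freq occurs in one document.
        df = doc_freq.get(term, 1)
        # IDF: rare terms (small DF) receive a large weight.
        scores[term] = count * math.log(corpus_size / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical page terms and document frequencies.
page = "charter jet charter air cargo the the the jet charter".split()
df = {"charter": 50, "jet": 200, "air": 5000, "cargo": 300, "the": 1_000_000}
signature = lexical_signature(page, df, corpus_size=1_000_000, k=3)
```

A term such as "the" occurs often in the page but in nearly every document, so its IDF is close to zero and it never enters the signature.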
  37. 37. Lexical Signatures -- Examples<br />18<br />
  38. 38. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />19<br />A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)<br />Motivation<br />Background<br />
  39. 39. Accurate IDF Values for LSs<br />Screen scraping the Google web interface<br />20<br />
  40. 40. The Dataset<br />Local universe consisting of copies of URIs<br />from the Internet Archive between 1996 and 2007<br />21<br />
  41. 41. The Idea<br /><ul><li>Use IDF values obtained from </li></ul>Local collection of web pages<br />“screen scraping” SE result pages<br /><ul><li> Validate both methods against a baseline
  42. 42. Google N-Grams</li></ul>Note: N-Grams provide term count (TC) and not DF values – ask me for details<br />22<br />
  43. 43. LSs Example<br />Based on all 3 methods<br />URL: http://www.perfect10wines.com<br />Year: 2007<br />Union: 12 unique terms<br />23<br />
  44. 44. Comparing LSs<br />Normalized term overlap<br /><ul><li>Assume term commutativity
  45. 45. k-term LSs normalized by k </li></ul>Kendall Tau<br /><ul><li>Modified version since LSs to compare may contain different terms</li></ul>M-Score<br /><ul><li>Penalizes discordance in higher ranks</li></ul>24<br />
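The first measure, normalized term overlap, can be sketched like this. The slide only states that term order is ignored and that the overlap of k-term LSs is normalized by k; the function name and the use of the longer signature's length when the two differ are assumptions of this sketch.

```python
def normalized_overlap(ls_a, ls_b):
    """Fraction of terms shared by two lexical signatures, ignoring order."""
    # Assumption: if the LSs differ in length, normalize by the longer one.
    k = max(len(ls_a), len(ls_b))
    if k == 0:
        return 0.0
    return len(set(ls_a) & set(ls_b)) / k


# Two hypothetical 5-term LSs sharing two terms.
score = normalized_overlap(
    ["wine", "perfect", "ten", "cellar", "shop"],
    ["shop", "wine", "cart", "order", "red"],
)
```

Unlike Kendall Tau and M-Score, this measure does not reward agreement in rank, only agreement in membership.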
  46. 46. Comparing LSs<br />Top 5, 10 and 15 terms<br />LC – local universe<br />SC – screen scraping<br />NG – N-Grams<br />25<br />
  47. 47. Conclusions<br /><ul><li> Both methods for the computation of IDF values provide accurate results
  48. 48. Compared to the Google N-Gram baseline
  49. 49. Screen scraping method seems preferable
  50. 50. Similarity scores are slightly higher
  51. 51. Feasible in real time!!!</li></ul>Contribution:<br />Established well performing IDF estimation technique.<br />26<br />
  52. 52. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />27<br />Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)<br />Motivation<br />Background<br />
  53. 53. The Idea<br />Evaluate Evolution of LSs over Time by<br /><ul><li>Generate LSs of URIs (from local universe mentioned above) over time
  54. 54. Conduct overlap analysis
  55. 55. Neither Phelps and Wilensky nor Park et al. [Park2004] did that
  56. 56. Park et al. just re-confirmed their findings after 6 months</li></ul>28<br />
  57. 57. LSs Over Time - Example<br />10-term LSs generated for<br />http://www.perfect10wines.com<br />29<br />
  58. 58. LS Overlap Analysis<br />Rooted:<br />overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URI has been observed<br />Sliding:<br />overlap between two LSs of consecutive years starting with the first year and ending with the last<br />30<br />
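The two overlap modes can be sketched as below, assuming one LS per observed year. The helper names, the dictionary layout, and the example signatures are hypothetical; only the rooted and sliding definitions come from the slide.

```python
def overlap(ls_a, ls_b):
    """Share of terms two LSs have in common (term order ignored)."""
    k = max(len(ls_a), len(ls_b))
    return len(set(ls_a) & set(ls_b)) / k if k else 0.0


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    if not years:
        return {}
    root = ls_by_year[years[0]]
    return {year: overlap(root, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare each observed year with the next observed year."""
    years = sorted(ls_by_year)
    return {
        (a, b): overlap(ls_by_year[a], ls_by_year[b])
        for a, b in zip(years, years[1:])
    }


# Hypothetical 3-term LSs for a URI observed in three years.
observed = {
    2000: ["wine", "perfect", "ten"],
    2001: ["wine", "perfect", "shop"],
    2003: ["wine", "shop", "cart"],
}
rooted = rooted_overlap(observed)
sliding = sliding_overlap(observed)
```

In this toy data the rooted overlap falls from year to year while the sliding overlap stays constant, which is the kind of difference the two modes are meant to expose.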
  59. 59. Evolution of LSs over Time<br />Rooted<br />Results:<br /><ul><li>Little overlap between the early years and more recent ones
  60. 60. Highest overlap in the first 1-2 years after creation of the LS
  61. 61. Rarely peaks after that – once terms are gone, they do not return</li></ul>31<br />
  62. 62. Evolution of LSs over Time<br />Sliding<br />Results:<br />Overlap increases over time<br />Seem to reach steady state around 2003<br />32<br />
  63. 63. Performance of LSs<br />Idea:<br /><ul><li>Query LSs against Google search API
  64. 64. Identify URI in result set
  65. 65. For each URI it is possible that:</li></ul>URI is returned as the top ranked result<br />URI is ranked somewhere between 2 and 10<br />URI is ranked somewhere between 11 and 100<br />URI is ranked somewhere beyond rank 100 (considered as not returned)<br />33<br />
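The four retrieval cases can be expressed as a small classifier. This is a sketch: the function names and the case labels are made up here, and matching URIs by exact string comparison is a simplifying assumption (a real evaluation would likely normalize URIs first).

```python
def rank_of(target_uri, result_uris):
    """1-based rank of target_uri in the result list, or None if absent."""
    for rank, uri in enumerate(result_uris, start=1):
        if uri == target_uri:  # assumption: exact string match
            return rank
    return None


def retrieval_case(rank):
    """Map a rank to one of the four cases listed on the slide."""
    if rank is None or rank > 100:
        return "undiscovered"  # beyond rank 100 counts as not returned
    if rank == 1:
        return "top"
    if rank <= 10:
        return "top 10"
    return "top 100"


# Hypothetical result list for a query built from an LS.
results = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
case = retrieval_case(rank_of("http://example.org/b", results))
```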
  66. 66. Performance of LSs wrt Length<br />Results:<br /><ul><li>2-, 3- and 4-term LSs perform poorly
  67. 67. 5-, 6- and 7-term LSs seem best
  68. 68. Top mean rank (MR) value with 5 terms
  69. 69. Most top ranked with 7 terms
  70. 70. Binary pattern: either in top 10 or undiscovered
  71. 71. 8 terms and beyond do not show improvement</li></ul>34<br />
  72. 72. Performance of LSs wrt Length<br />nDCG for LSs consisting of 2-15 terms<br />(mean over all years)<br />35<br />
  73. 73. Performance of LSs over Time<br />nDCG for LSs consisting of 2, 5, 7 and 10 terms<br />36<br />
  74. 74. Conclusions<br /><ul><li> LSs decay over time
  75. 75. Rooted: quickly after generation
  76. 76. Sliding: seem to stabilize
  77. 77. LSs older than 5 years perform poorly
  78. 78. 5-, 6- and 7-term LSs seem to perform best
  79. 79. 7 – most top ranked
  80. 80. 5 – lowest mean rank
  81. 81. 2..4 as well as 8+ term LSs are insufficient </li></ul>Contribution:<br />Determined age and length limits for LSs.<br />37<br />
  82. 82. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />38<br />Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)<br />Motivation<br />Background<br />
  83. 83. 59 copies<br />The Problem<br />The Problem<br />Internet Archive - Wayback Machine<br />www.aircharter-international.com<br />http://web.archive.org/web/*/http://www.aircharter-international.com<br />Lexical Signature<br />(TF/IDF)<br />Charter Aircraft Cargo Passenger Jet Air Enquiry<br />Title<br />ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International<br />39<br />
  84. 84. The Problem<br />The Problem<br />www.aircharter-international.com<br />Lexical Signature<br />(TF/IDF)<br />Charter Aircraft Cargo Passenger Jet Air Enquiry <br />40<br />
  85. 85. The Problem<br />The Problem<br />www.aircharter-international.com<br />Title<br />ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International<br />41<br />
  86. 86. The Idea<br />Contributions<br />Compare performance of two automated methods to rediscover web pages<br />Lexical signatures (LSs)<br />Titles<br />Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery<br />42<br />
  87. 87. LS Retrieval Performance<br />LS Retrieval Performance<br />5- and 7-Term LSs<br /><ul><li>Yahoo! returns most URIs top ranked and leaves least undiscovered
  88. 88. Binary retrieval pattern, URI either within top 10 or undiscovered</li></ul>43<br />
  89. 89. Title Retrieval Performance<br />Title Retrieval Performance<br />Non-Quoted and Quoted Titles<br /><ul><li>Results at least as good as for LSs
  90. 90. Google and Yahoo! return more URIs for non-quoted titles
  91. 91. Same binary retrieval pattern</li></ul>44<br />
  92. 92. Combination of Methods<br />Combination of Methods<br />Top Results for Combination of Methods<br />45<br />
  93. 93. Conclusions<br />Concluding Remarks<br /><ul><li> LSs and titles are suitable as search engine queries
  94. 94. Return 50%-70% URIs top ranked</li></ul>BUT<br /><ul><li> Titles are cheaper to obtain, hence
  95. 95. Preferred primary method
  96. 96. 5-term LSs secondary method
  97. 97. Results in 75% top ranked URIs</li></ul>Contributions:<br />Provided evidence for suitability of titles and introduced web page discovery framework.<br />46<br />
  98. 98. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />47<br />Is This a Good Title? (Hypertext 2010)<br />Motivation<br />Background<br />
  99. 99. ???<br />The Problem<br />The Problem<br />http://www.drbartell.com/<br />Lexical Signature<br />(TF/IDF)<br />Plastic Surgeon Reconstructive Dr Bartell Symbol University<br />48<br />
  100. 100. The Problem<br />The Problem<br />http://www.drbartell.com/<br />Title<br />Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery<br />49<br />
  101. 101. The Problem<br />The Problem<br />www.reagan.navy.mil<br />Lexical Signature<br />(TF/IDF)<br />Ronald USS MCSN Torrey Naval Sea Commanding <br />50<br />
  102. 102. The Problem<br />The Problem<br />www.reagan.navy.mil<br />???<br />Title<br />Home Page<br />Is This a Good Title?<br />51<br />
  103. 103. The Idea<br />Contributions<br />Display title evolution over time<br />Compare to content evolution<br />“Normalize” time as fixed size windows<br />Provide prediction model for title’s retrieval potential<br />52<br />
  104. 104. Title and LS Retrieval Performance<br />Title (and LS) Retrieval Performance<br />Titles<br />5- and 7-Term LSs<br /><ul><li>Titles return more than 60% URIs top ranked
  105. 105. Binary retrieval pattern, URI either within top 10 or undiscovered</li></ul>53<br />
  106. 106. Title Evolution – Example I<br />Title Evolution - Example I<br />www.sun.com/solutions<br />1998-01-27<br />Sun Software Products Selector Guides - Solutions Tree<br />1999-02-20<br />Sun Software Solutions<br />2002-02-01<br />Sun Microsystems Products<br />2002-06-01<br />Sun Microsystems - Business & Industry Solutions<br />2003-08-01<br />Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions<br />2004-02-02<br />Sun Microsystems – Solutions<br />2004-06-10<br />Gateway Page - Sun Solutions <br />2006-01-09<br />Sun Microsystems Solutions & Services<br />2007-01-03<br />Services & Solutions<br />2007-02-07<br />Sun Services & Solutions<br />2008-01-19<br />Sun Solutions<br />54<br />
  107. 107. Title Evolution – Example II<br />Title Evolution - Example II<br />www.datacity.com/mainf.html<br />2000-06-19<br />DataCity of Manassas Park Main Page<br />2000-10-12<br />DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives<br />2001-08-21<br />DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives<br />2002-10-16<br />computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free<br />2006-03-14<br />Est1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB<br />55<br />
  108. 108. Title Evolution Over Time<br />Title Evolution Over Time<br />How much do titles change over time?<br /><ul><li>Copies from fixed size time windows per year
  109. 109. Extract available titles of past 14 years
  110. 110. Compute normalized Levenshtein edit distance between titles of copies and baseline (today) (0 = identical; 1 = completely dissimilar)</li></ul>56<br />
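The normalized edit distance between an archived title and today's title can be sketched as follows. The slide does not say how the distance is normalized; dividing by the length of the longer title is an assumption here, chosen because it yields 0 for identical and 1 for completely dissimilar strings.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(old_title, new_title):
    """Edit distance scaled to [0, 1]; assumption: divide by the longer length."""
    longest = max(len(old_title), len(new_title))
    if longest == 0:
        return 0.0
    return levenshtein(old_title, new_title) / longest


# Two titles of www.sun.com/solutions taken from the example slide above.
distance = normalized_distance("Sun Software Solutions", "Sun Solutions")
```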
  111. 111. Title Evolution Over Time<br />Title Evolution Over Time<br />Title edit distance frequencies<br /><ul><li>Half the titles of available copies from recent years are (close to) identical
  112. 112. Decay from 2005 on (with fewer copies available)
  113. 113. 4 year old title: 40% chance to be unchanged</li></ul>57<br />
  114. 114. [0,0] - 122 times<br />[0,1] - over 1600 times<br />Title Evolution Over Time<br />Title Evolution Over Time<br />Title vs Document<br /><ul><li>Y: avg shingle value for all copies per URI
  115. 115. X: avg edit distance of corresponding titles
  116. 116. overlap indicated by: green: <10; red: >90
  117. 117. Semi-transparent: total amount of points plotted</li></ul>58<br />
  118. 118. Title Performance Prediction<br />Title Performance Prediction<br /><ul><li>Quality prediction of title by
  119. 119. Number of nouns, articles etc.
  120. 120. Amount of title terms, characters [Ntoulas2006]
  121. 121. Observation of re-occurring terms in poorly performing titles - “Stop Titles”</li></ul>home, index, home page, welcome, untitled document<br />The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!<br />59<br />
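The 75% rule can be sketched as a simple check. The slide does not say whether the share is measured in terms or in characters; this sketch counts terms, splits multi-word stop titles into single words, and treats an empty title as the worst case. All three choices, and the function names, are assumptions.

```python
import re

# Stop titles as listed on the slide.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]
# Assumption: every word of a stop title counts as a stop term on its own.
STOP_TERMS = {word for title in STOP_TITLES for word in title.split()}


def stop_title_ratio(title):
    """Share of title terms that are stop terms."""
    terms = re.findall(r"[a-z0-9]+", title.lower())
    if not terms:
        return 1.0  # assumption: an empty title is treated as the worst case
    return sum(term in STOP_TERMS for term in terms) / len(terms)


def predicted_insufficient(title, threshold=0.75):
    """True if the title is predicted to perform poorly as a query."""
    return stop_title_ratio(title) >= threshold


# Titles taken from the earlier "Is This a Good Title?" slides.
poor = predicted_insufficient("Home Page")
good = predicted_insufficient(
    "Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery"
)
```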
  122. 122. Conclusions<br />Concluding Remarks<br /><ul><li>Titles change more slowly and less significantly over time than web page content
  123. 123. Not all titles equally good
  124. 124. If the majority of title terms are Stop Titles, its quality can be predicted to be poor</li></ul>Contribution:<br />Quantified title evolution and introduced stop titles.<br />60<br />
  125. 125. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />61<br />Motivation<br />Background<br />Find, New, Copy, Web, Page - Tagging for the (Re-)Discovery of Web Pages (TPDL 2011)<br />
  126. 126. The Problem<br />The Problem<br />We have seen that we have a good chance to rediscover missing pages with<br /><ul><li> Lexical signatures
  127. 127. Titles </li></ul>BUT<br />What if no archived/cached copy can be found?<br />62<br />
  128. 128. The Solution?<br />The Problem<br />Conferences<br />Digitallibraries<br />Conference<br />Library<br />Jcdl2005<br />63<br />
  129. 129. The Idea<br />The Problem<br /><ul><li>Experimental evaluation of tag based query length cf. 5- or 7-term LSs
  130. 130. Test combination of methods to improve retrieval performance
  131. 131. Investigate “descriptive” power of tags</li></ul>64<br />
  132. 132. The Experiment<br />The Problem<br /><ul><li> Tags queried against the Yahoo! BOSS API
  133. 133. Same four retrieval cases introduced earlier
  134. 134. nDCG w/ binary relevance scoring
  135. 135. Mean Average Precision</li></ul>65<br />
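With binary relevance and a single relevant URI per query, nDCG reduces to a function of the rank at which that URI is returned. The slides do not give the exact formula; the sketch below uses the common log2 discount and a cutoff of 100 (matching the "not returned" case), both of which are assumptions.

```python
import math


def ndcg_single_target(rank, cutoff=100):
    """nDCG for one relevant URI with binary relevance.

    The ideal ranking places the URI at rank 1, so the ideal DCG is
    1 / log2(2) = 1 and nDCG equals the discounted gain at the actual rank.
    """
    if rank is None or rank > cutoff:
        return 0.0  # undiscovered URIs contribute nothing
    return 1.0 / math.log2(1 + rank)


# Hypothetical ranks for four queries, one per retrieval case.
scores = [ndcg_single_target(r) for r in (1, 3, 50, None)]
mean_score = sum(scores) / len(scores)
```

A top ranked URI scores 1.0, and the score decays with the logarithm of the rank, so the difference between ranks 1 and 3 weighs more than the difference between ranks 50 and 52.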
  136. 136. The Experiment<br />The Problem<br />Combining methods<br />66<br />
  137. 137. The Experiment<br />The Problem<br /><ul><li> Fact:
  138. 138. ~50% of tags do not occur in page [Bischoff2008]
  139. 139. “Secret”:
  140. 140. ~50% of tags do not occur in current version of page
  141. 141. ergo: How about previous versions?</li></ul>67<br />
  142. 142. Ghost Tags<br />The Problem<br /><ul><li> 3,306 URIs w/ older copies
  143. 143. 66.3% of our tags do not occur in page
  144. 144. 4.9% of tags occur in previous version of page: Ghost Tags
  145. 145. represent a previous version better than the current one
  146. 146. What kind of tags are these?
  147. 147. Important to the document, to the Delicious user?</li></ul>68<br />
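The notion of a ghost tag can be sketched as a three-way classification of each tag. The function names, the labels, the simple tokenization, and the example texts are hypothetical; only the definition (a tag absent from the current page but present in a previous version) comes from the slides.

```python
import re


def page_terms(text):
    """Lowercased alphanumeric terms of a page's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_tags(tags, current_text, previous_text):
    """Label each tag by where it occurs; ghost tags occur only in the old copy."""
    current = page_terms(current_text)
    previous = page_terms(previous_text)
    labels = {}
    for tag in tags:
        term = tag.lower()
        if term in current:
            labels[tag] = "in current page"
        elif term in previous:
            labels[tag] = "ghost tag"
        else:
            labels[tag] = "not in page"
    return labels


# Hypothetical current and archived text of a conference page.
labels = classify_tags(
    ["conference", "hotel", "denver"],
    current_text="Digital library conference archive",
    previous_text="Conference hotel and registration",
)
```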
  148. 148. Ghost Tags<br />The Problem<br />Document importance:<br />TF rank<br />User importance:<br />Delicious rank<br />Normalized rank:<br />0 - top<br />1 - bottom<br />69<br />
  149. 149. Conclusions<br />Concluding Remarks<br /><ul><li>Tags can be used for search (if available)
  150. 150. Combining tags with titles and LSs gains URIs
  151. 151. Ghost Tags exist!
  152. 152. 1/3 of them are important to the page and user</li></ul>Contributions:<br />Added tags to web page discovery framework and introduced notion of Ghost Tags.<br />70<br />
  153. 153. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />71<br />Motivation<br />Background<br />Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures (JCDL 2011)<br />
  154. 154. The Problem<br />The Problem<br />We have seen that we have a good chance to rediscover missing pages with<br /><ul><li> Lexical signatures
  155. 155. Titles </li></ul>BUT<br />What if no archived/cached copy can be found?<br />Plan A: Tags<br />72<br />
  156. 156. Plan B<br />The Problem<br />Link neighborhood Lexical Signatures (LNLSs)<br />is about<br />Computer<br />Dominion<br />Norfolk<br />Monarch<br />extract<br />73<br />
  157. 157. The Idea<br />The Problem<br /><ul><li>Determine for well performing LNLS:
  158. 158. Length
  159. 159. Number of backlinks
  160. 160. Backlink levels
  161. 161. Radius of terms on backlink page</li></ul>74<br />
  162. 162. The Radius on a Backlink Page<br />The Problem<br />Entire page<br />Paragraph<br />Anchor text<br />75<br />
  163. 163. The Dataset<br />309 URIs<br />28,325 first level<br />306,700 second level backlinks<br />Filter for language, file type, etc. <br /> 12% discarded<br /><ul><li>Lexical signature generation
  164. 164. IDF values from Yahoo!
  165. 165. 1..7 and 10 terms</li></ul>Query Yahoo! API<br />Compute “goodness” (nDCG)<br />76<br />
  166. 166. The Results<br />The Problem<br />1st and 2nd<br />level<br />level-radius-rank<br />better<br />77<br />
  167. 167. The Results – Radius<br />The Problem<br />All Radii<br />level-radius-rank<br />78<br />
  168. 168. The Results – Backlink Rank<br />The Problem<br />Ranks<br />10<br />100<br />1000<br />level-radius-rank<br />79<br />
  169. 169. The Results – In Numbers<br />The Problem<br />GOOD<br />1-anchor-1000<br />WINNER<br />1-anchor-10<br />80<br />
  170. 170. Conclusions<br />Concluding Remarks<br />Optimal link neighborhood lexical signatures:<br /><ul><li>Contain 4 terms
  171. 171. Parsed from top 10 backlink pages
  172. 172. Include first backlink level only
  173. 173. Consider anchor text only</li></ul>Contributions:<br />Added LNLS to web page discovery framework.<br />81<br />
  174. 174. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />82<br />Motivation<br />Background<br />Synchronicity – Automatically Rediscover Missing Web Pages in Real Time<br />(JCDL 2011)<br />
  175. 175. Synchronicity<br />Concluding Remarks<br />Firefox add-on<br />Triggers on 404 error<br />Rediscover page via:<br />Memento<br />Title<br />Lexical signature<br />Tags<br />Link neighborhood lexical signature<br />URI modification<br />http://bit.ly/no-more-404<br />83<br />
  176. 176. Contributions<br />Concluding Remarks<br />Introduce reliable real-time approach to estimate IDF values<br />Workflow for generation of well performing lexical signatures<br />Performance evaluation of web page titles<br />Investigation of tags for web page discovery<br />Analysis of link neighborhood lexical signatures and their optimal parameters<br />Introduce Synchronicity implementing the entire framework<br />84<br />
  177. 177. Concluding Remarks<br />85<br />
  178. 178. Next Stop… New Mexico<br />Concluding Remarks<br />86<br />
  179. 179. List of my Relevant Publications<br />Concluding Remarks<br />M.Klein, M.L.Nelson, “A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web”, WIDM 2008, pp. 39-46<br />M.Klein, M.L.Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008, pp. 371-382<br />M.Klein, M.L.Nelson, “Correlation of Term Count and Document Frequency for Google N-Grams”, ECIR 2009, pp. 620-627<br />M.Klein, M.L.Nelson, “Inter-Search Engine Lexical Signature Performance”, JCDL 2009, pp. 413-414<br />M.Klein, M.L.Nelson, “Investigating the Change of Web Pages Titles Over Time”, InDP 2009<br />M.Klein, J.Shipman, M.L.Nelson, “Is This a Good Title?”, Hypertext 2010, pp. 3-12<br />M.Klein, M.L.Nelson, “Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure”, JCDL 2010, pp. 59-68<br />M.Klein, J.Ware, M.L.Nelson, “Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures”, JCDL 2011<br />M.Klein, M.Aly, M.L.Nelson, “Synchronicity - Automatically Rediscover Missing Web Pages in Real Time”, JCDL 2011<br />M.Klein, M.L.Nelson, “Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages”, TPDL 2011, to appear<br />87<br />
  180. 180. References<br />Concluding Remarks<br />Bischoff2008<br />K.Bischoff, C.Firan, W.Nejdl, R.Paiu, “Can All Tags Be Used for Search?” In: Proceedings of CIKM '08, pp.193-202, 2008<br />Dellavalle2003<br />R.P.Dellavalle, E.J.Hester, L.F.Heilig, A.L.Drake, J.W.Kuntzman, M.Graber, L.M.Schilling, “Information Science: Going, Going, Gone: Lost Internet References”, Science 302(5646), pp.787-788, 2003<br />Jones1973<br />K.Spärck Jones, “Index Term Weighting”, Information Storage and Retrieval, pp. 619-633, 1973<br />Kahle1997<br />B.Kahle, “Preserving the Internet”, Scientific American 276, pp.82-83, 1997<br />Koehler2002<br />W.C.Koehler, “Web Page Change and Persistence - A Four-Year Longitudinal Study”, JASIST 53(2), pp.162-171, 2002<br />Lawrence2001<br />S.Lawrence, D.M.Pennock, G.W.Flake, R.Krovetz, F.M.Coetzee, E.Glover, F.A.Nielsen, A.Kruger, C.L.Giles, “Persistence of Web References in Scientific Research”, Computer 34(2), pp.26-31, 2001<br />McCown2005<br />F.McCown, S.Chan, M.L.Nelson, J.Bollen, “The Availability and Persistence of Web References in D-Lib Magazine”, Proceedings of IWAW '05, 2005<br />Nelson2002<br />M.L.Nelson, B.D.Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine 8(1), 2002<br />Ntoulas2006<br />A.Ntoulas, M.Najork, M.Manasse, D.Fetterly, “Detecting Spam Web Pages Through Content Analysis”, Proceedings of WWW ’06, pp. 83-92, 2006<br />Park2004<br />S.T.Park, D.M.Pennock, C.L.Giles, R.Krovetz, “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web”, TOIS 22(4), pp.540-572, 2004<br />Phelps2000<br />T.A.Phelps, R.Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, technical report, UC Berkeley, 2000<br />Sanderson2011<br />R.Sanderson, M.Phillips, H.Van de Sompel, “Analyzing the Persistence of Referenced Web Resources with Memento”, Proceedings of OR '11, 2011<br />88<br />
  181. 181. Using the Web Infrastructure<br />for Real Time Recovery<br />of Missing Web Pages<br />Martin Klein<br />mklein@cs.odu.edu<br />http://www.cs.odu.edu/~mklein/<br />
  182. 182. Backup Slides<br />
  183. 183. Future Work<br />91<br /><ul><li>“Story Telling” with Memento
  184. 184. Find more Stop Titles
  185. 185. Find more Ghost Tags
  186. 186. Identify “Stop Anchors”
  187. 187. Synchronicity 1.0
  188. 188. Web service
  189. 189. CMD line tool</li></li></ul><li>Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />92<br />Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)<br />Motivation<br />Background<br />
  190. 190. The Problem<br /><ul><li> Need for a reliable source to accurately compute IDF values of web pages (in real time)
  191. 191. Shown, screen scraping works but
  192. 192. missing validation of baseline (Google N-Grams)
  193. 193. N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?</li></ul>93<br />
  194. 194. 94<br />Background<br /><ul><li> Google N-grams provide term count (TC) values</li></ul> D1 = “Please, Please Me” D2 = “Can’t Buy Me Love”<br /> D3 = “All You Need Is Love” D4 = “Long, Long, Long” <br />TC >= DF, but is there a correlation?<br />Can we use TC to estimate DF?<br />
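The difference between term count (TC) and document frequency (DF) can be worked through on the four example documents of this slide. The function name and the tokenization (lowercasing, keeping apostrophes inside words) are assumptions of the sketch.

```python
import re
from collections import Counter


def term_counts_and_doc_freqs(docs):
    """Return (TC, DF): total occurrences per term and documents containing it."""
    tc, df = Counter(), Counter()
    for doc in docs:
        terms = re.findall(r"[a-z']+", doc.lower())
        tc.update(terms)       # every occurrence counts towards TC
        df.update(set(terms))  # each document counts at most once towards DF
    return tc, df


docs = [
    "Please, Please Me",     # D1
    "Can't Buy Me Love",     # D2
    "All You Need Is Love",  # D3
    "Long, Long, Long",      # D4
]
tc, df = term_counts_and_doc_freqs(docs)
```

Here "long" has TC 3 but DF 1, while "love" has TC 2 and DF 2. TC can never be smaller than DF, yet the gap depends on how often a term repeats within a single document.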
  195. 195. 95<br />Experiment Results<br /> Investigate correlation between TC and DF<br />within “Web as Corpus” (WaC)<br />Rank similarity of all terms<br />
  196. 196. 96<br />Experiment Results<br /> Investigate correlation between TC and DF<br />within “Web as Corpus” (WaC)<br />Spearman’s ρ and Kendall τ<br />
  197. 197. 97<br />Experiment Results<br />Top 10 terms in decreasing order of their TF/IDF values<br />taken from http://ecir09.irit.fr<br />U = 14<br />∩ = 6<br />Strong indicator that TC can be used to estimate DF for web pages!<br />Google: screen scraping DF values from the Google web interface<br />
  198. 198. 98<br />Experiment Results<br />Show similarity between WaC based TC and<br />Google N-Gram based TC<br />TC frequencies<br />N-Grams have a threshold of 200<br />
  199. 199. Experiment Results<br />Frequency of TC/DF Ratio Within the WaC<br />Integer Values<br />Two Decimals<br />One Decimal<br />99<br />
  200. 200. Conclusions<br /><ul><li> TC and DF Ranks within the WaC show strong correlation
  201. 201. TC frequencies of WaC and Google N-Grams are very similar
  202. 202. N-Grams are suitable for accurate IDF estimation for web pages</li></ul> Does not mean everything correlated to TC can be used as DF substitute!<br />100<br />
  203. 203. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />101<br />Inter-Search Engine Lexical Signature Performance<br />(JCDL 2009)<br />Motivation<br />Background<br />
  204. 204. Inter-Search Engine Lexical Signature Performance<br />http://en.wikipedia.org/wiki/Elephant<br />Elephant<br />Tusks<br />Trunk<br />African<br />Loxodonta<br />Elephant, Asian, African<br />Species, Trunk<br />Elephant, African, Tusks<br />Asian, Trunk<br />
  205. 205. 103<br />
  206. 206. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />104<br />Motivation<br />Background<br />Synchronicity – Automatically Rediscover Missing Web Pages in Real Time<br />(JCDL 2011)<br />
  207. 207. Synchro…What?<br />Synchronicity<br /><ul><li>Experience of causally unrelated events occurring together in a meaningful manner
  208. 208. Events reveal underlying pattern, framework bigger than any of the synchronous systems
  209. 209. Carl Gustav Jung (1875-1961)
  210. 210. “meaningful coincidence”
  211. 211. Deschamps – de Fontgibu plum pudding example</li></ul>picture from http://www.crystalinks.com/jung.html<br />105<br />
  212. 212. Synchro…What?<br />Repo Man (1984)<br />http://www.imdb.com/title/tt0087995/<br />http://www.youtube.com/watch?v=X4HQyqc-aVU<br />106<br />
  213. 213. Agenda<br />LSs for Web Pages<br />DF Estimation Techniques<br />TC-DF Correlation<br />Web Page Titles<br />Synchronicity<br />Link Neighborhood LSs<br />Book of the Dead<br />Web Page Tags<br />107<br />Motivation<br />Background<br />(Not yet published)<br />
  214. 214. Book of the Dead<br /><ul><li>Corpus of missing web pages
  215. 215. 233 URIs returning status 404
  216. 216. Mechanical Turk to determine “aboutness”
  217. 217. Guess from URI string
  218. 218. Mementos for 161 URIs
  219. 219. Apply lexical signatures and title</li></ul>108<br />
  220. 220. 5-term LSs<br />Titles<br />109<br />Experiment Results<br />Dice Similarity Coefficient<br />of Top 100 Results<br />D = 0<br />0.0 < D ≤ 0.3<br />0.3 < D ≤ 0.6<br />0.6 < D ≤ 1.0<br />
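For reference, the Dice similarity coefficient and the four bins shown on this slide can be sketched as below. The slide does not state which term sets are compared for each result, so the inputs here are generic sets; the function names and the value returned for two empty sets are assumptions.

```python
def dice(terms_a, terms_b):
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return 2 * len(a & b) / (len(a) + len(b))


def dice_bin(d):
    """Assign a coefficient to one of the four bins used on the slide."""
    if d == 0:
        return "D = 0"
    if d <= 0.3:
        return "0.0 < D <= 0.3"
    if d <= 0.6:
        return "0.3 < D <= 0.6"
    return "0.6 < D <= 1.0"


# Hypothetical term sets sharing two of three terms each.
d = dice({"charter", "jet", "cargo"}, {"jet", "cargo", "lease"})
label = dice_bin(d)
```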
  221. 221. 5-term LSs<br />Titles<br />110<br />Experiment Results<br />Jaro Distance <br />of Top 100 Results<br />J = 0<br />0.0 < J ≤ 0.3<br />0.3 < J ≤ 0.6<br />0.6 < J ≤ 1.0<br />
  222. 222. Book of the Dead<br /><ul><li>Mechanical Turk to determine relevance of results
  223. 223. Top 10 only
  224. 224. Relevant
  225. 225. Somewhat relevant
  226. 226. Not relevant
  227. 227. Broken URI
  228. 228. nDCG of top 10 results</li></ul>111<br />
  229. 229. 5-term LSs<br />Titles<br />112<br />Experiment Results<br />Relevance of Top 10 Results<br />
  230. 230. 113<br />Experiment Results<br />nDCG of Top 10 Results<br />
