Investigating the Change of
Web Pages’ Titles Over Time

   Martin Klein and Michael L. Nelson
         Old Dominion Unive...
The Problem

http://www.pspcentral.org/events/annual_meeting_2003.html
The Environment

Web Infrastructure (WI) [McCown07]

•   Web search engines (Goog...
The Bigger Picture

(1) query fo...

•   System catches
    ...
Search Engine Queries

  •    Lexical signatures (LSs)

      •    Small set of terms capturing the “aboutness” of a docum...
Web Pages’ Titles

•   Easier/cheaper to obtain than LSs

•   High availability (1-2% of web pages have no title)

•   Als...
Dataset

•   6k URLs randomly sampled from DMOZ

•   Parsed the pages and extracted up to three URLs
    referencing to in...
Dataset

Length = 1             Length = 2
foo.bar/               foo.bar/bar/
foo.bar/index.html     ...
Frequency of Change
                                        Number of Changes and Observations in the IA

ordered in      ...
Investigating the Change of Web Pages’ Titles Over Time

  1. 1. Investigating the Change of Web Pages’ Titles Over Time. Martin Klein and Michael L. Nelson, Old Dominion University, {mklein,mln}@cs.odu.edu. InDP 2009, Austin, TX, 06/19/2009
  2. 2. The Problem: http://www.pspcentral.org/events/annual_meeting_2003.html
  7. 7. The Environment
       Web Infrastructure (WI) [McCown07]
       • Web search engines (Google, Yahoo!, MSN Live) and their caches
       • Research projects (CiteSeer)
       • Web archives (Internet Archive)
       [McCown07] F. McCown, “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
  8. 8. The Bigger Picture
       Flowchart:
       (1) Query for the URL in search engine caches and the Internet Archive.
       (2) Present results. If the user is satisfied: DONE.
       (3) Identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines. If no results are found, continue with (5).
       (4) Present results. If the user is satisfied: DONE.
       (5) Include link neighborhood, relevance feedback, and user interaction: request keywords, change number of terms in LS, add/delete term from LS, advanced search operators.
       (6) Present results. DONE.
       Annotation on the flowchart, near step (3): REAL TIME!!!
       • System catches 404 “Page not found” errors
       • Discovers copy of missing page in WI and provides it to the user
       • Obtains further data about the missing page (LS, title, tags) and feeds that back into the WI
       • Provides page at its new location or a “good enough” alternative page
       • More sophisticated methods needed if unsuccessful so far
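The staged fallback in the flowchart can be read as a chain of lookups that stops as soon as the user accepts a result. The following is only a minimal sketch of that control flow, not the authors' system: `rediscover`, the `accept` callback, and the three stand-in stage functions are hypothetical names, and the stages return canned data instead of querying real caches, archives, or search engines.

```python
from typing import Callable, Iterable, List, Optional

# A stage takes the missing URL and returns candidate pages (possibly none).
Stage = Callable[[str], List[str]]


def rediscover(url: str, stages: Iterable[Stage],
               accept: Callable[[List[str]], bool]) -> Optional[List[str]]:
    """Run the stages in order; stop at the first result set that is accepted."""
    for stage in stages:
        results = stage(url)
        if results and accept(results):
            return results  # corresponds to "user is satisfied" -> DONE
    return None  # every stage ran without an accepted result


# Hypothetical stand-ins for the real lookups (assumptions, canned data only).
def cached_copies(url: str) -> List[str]:
    return []  # pretend neither the caches nor the archive hold a copy


def title_and_ls_queries(url: str) -> List[str]:
    return ["http://example.org/new-location"]


def interactive_search(url: str) -> List[str]:
    return ["http://example.org/fallback"]


STAGES = [cached_copies, title_and_ls_queries, interactive_search]
```

Calling `rediscover("http://gone.example/page", STAGES, lambda results: True)` skips the empty first stage and returns the candidates of the second one; later, more expensive stages only run when earlier ones fail or are rejected.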
  16. 16. Search Engine Queries
       • Lexical signatures (LSs)
         • Small set of terms capturing the “aboutness” of a document
         • Generated following the TF-IDF scheme
         • Phelps and Wilensky assumed 5 terms [Phelps00]
         • We have shown that 5- and 7-term LSs perform best [Klein08]
       BUT:
       • IDF can only be estimated when the entire web is the corpus
       • Expensive to generate
       Alternative: web pages’ titles
       [Phelps00] T. A. Phelps and R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000.
       [Klein08] M. Klein and M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008.
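To make the TF-IDF scheme concrete, here is a minimal sketch that ranks a document's terms by term frequency times inverse document frequency and keeps the top k as its lexical signature. It is an illustration under stated assumptions, not the procedure from [Phelps00] or [Klein08]: the function name `lexical_signature`, the simple tokenizer, the add-one smoothing in the IDF, and the tiny local corpus standing in for the web are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n = len(doc_sets)

    def idf(term):
        # Document frequency counted over the local corpus; add-one smoothing
        # (an assumption) avoids division by zero for unseen terms.
        df = sum(term in s for s in doc_sets)
        return math.log((n + 1) / (df + 1))

    scores = {term: count * idf(term) for term, count in tf.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

With the whole web as the corpus, the document frequencies in `idf` cannot be counted directly and have to be estimated, which is the cost the slide points out.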
  21. 21. Web Pages’ Titles
       • Easier/cheaper to obtain than LSs
       • High availability (only 1-2% of web pages have no title)
       • Also capture the “aboutness” of a web page
       • We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
       • Investigate change of titles over time
         • General frequency of change
         • Degree of change as Levenshtein score
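The degree of change between two versions of a title can be measured with the Levenshtein (edit) distance. The sketch below uses the standard dynamic-programming formulation; the slides do not say how the score is scaled, so `normalized_change`, which divides by the length of the longer title, is an assumption made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_change(old: str, new: str) -> float:
    """Edit distance scaled to [0, 1] by the longer title (assumed normalization)."""
    longest = max(len(old), len(new))
    return levenshtein(old, new) / longest if longest else 0.0
```

A value of 0.0 means the title is unchanged, and 1.0 means no character of the old title survives in place.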
  26. 26. Dataset
       • 6k URLs randomly sampled from DMOZ
       • Parsed the pages and extracted up to three URLs referencing in-domain pages
       • Filtered out:
         • Inaccessible pages
         • Pages not containing any links
         • Pages not in the .com, .net, .org or .edu domain
         • Pages without copies in the IA
       Result: 1090 URLs and more than 100K observations
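The four filter criteria can be expressed as a single predicate over a candidate page. This is a sketch under assumptions: the `Candidate` record and its field names are hypothetical, and the values for accessibility, link count, and archived copies are taken as already collected rather than fetched here.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

ALLOWED_TLDS = {"com", "net", "org", "edu"}


@dataclass
class Candidate:
    # Hypothetical record; the field names are assumptions for this example.
    url: str
    accessible: bool
    link_count: int
    archived_copies: int


def keep(candidate: Candidate) -> bool:
    """Return True if the page passes all four filter criteria."""
    # urlsplit only reports a hostname when the URL includes a scheme;
    # URLs without one end up with an empty TLD and are rejected.
    host = urlsplit(candidate.url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return (candidate.accessible
            and candidate.link_count > 0
            and tld in ALLOWED_TLDS
            and candidate.archived_copies > 0)
```

For example, `keep(Candidate("http://example.org/a", True, 3, 10))` passes, while the same record with a `.de` host or with zero archived copies is dropped.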
  28. 28. Dataset
       Length = 1            Length = 2
       foo.bar/              foo.bar/bar/
       foo.bar/index.html    foo.bar/bar/index.html
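The four example URLs suggest that the length counts directory levels, with the root counting as 1 and a trailing file name such as index.html not adding to it. The sketch below encodes that reading; the function name `url_length` is hypothetical, and the treatment of a final segment without a trailing slash (counted as a file, not a directory) is an assumption the slide does not settle.

```python
from urllib.parse import urlsplit


def url_length(url: str) -> int:
    """Directory depth of a URL, with the site root counting as length 1."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to separate host and path
    path = urlsplit(url).path or "/"
    # Drop the empty string before the leading slash and the last component,
    # which is either a file name or empty (when the path ends in a slash).
    directories = path.split("/")[1:-1]
    return 1 + len(directories)
```

Applied to the examples, `foo.bar/` and `foo.bar/index.html` both give 1, and `foo.bar/bar/` and `foo.bar/bar/index.html` both give 2.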
  32. 32. Frequency of Change
       [Figure: Number of Changes and Observations in the IA. x-axis: URLs (0 to 1000), ordered in increasing order by 1) observations, 2) changes. y-axis: Number of Changes/Observations (1 to 10000). Series: Number of Changes, Number of Observations.]
       • Generally low number of changes
       • Max changes: 25
       • Number of observations does not impact the number of changes
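Counting title changes for one URL amounts to walking its archived observations in chronological order and counting how often the title differs from the previous one. This is a minimal sketch of that count, not the authors' code: the function name `count_title_changes` is hypothetical, and collapsing whitespace before comparing is an assumption.

```python
def count_title_changes(titles):
    """Count transitions between differing consecutive titles.

    `titles` is the sequence of titles observed for one URL, oldest first.
    """
    changes = 0
    previous = None
    for title in titles:
        current = " ".join(title.split())  # collapse whitespace (assumption)
        if previous is not None and current != previous:
            changes += 1
        previous = current
    return changes
```

A URL observed five times with titles `["A", "A", "B", "B", "A"]` has two changes, so the number of observations bounds the number of changes from above without determining it.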
