1. Investigating the Change of Web Pages’ Titles Over Time
Martin Klein and Michael L. Nelson
Old Dominion University
{mklein,mln}@cs.odu.edu
InDP 2009
Austin, TX
06/19/2009
7. The Environment
Web Infrastructure (WI) [McCown07]
• Web search engines (Google, Yahoo!, MSN Live) and their caches
• Research projects (CiteSeer)
• Web archives (Internet Archive)
[McCown07] - F. McCown, “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
13. The Bigger Picture
(1) Start
• System catches 404 “Page not found” errors
(2) Query for URL in:
·search engine caches
·Internet Archive
Present results; user is satisfied: DONE
• Discovers copy of missing page in WI and provides to user
(3) Otherwise:
·identify dissimilar pages
·extract titles
·generate LSs
·obtain tags
·query search engines
• Obtains further data about missing page (LS, title, tags) and feeds that back into WI
(4) Results found: present results; user is satisfied: DONE
• Provides page at its new location or “good enough” alternative page
(5) No results found, or user is not satisfied:
·include link neighborhood
·relevance feedback
·user interaction:
  ·request keywords
  ·change number of terms in LS
  ·add/delete term from LS
  ·advanced search operators
• More sophisticated methods needed if unsuccessful so far
(6) Present results: DONE
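The staged flow above can be read as a fallback chain: try the cheapest source first and only move on when the user is not satisfied. The sketch below is one possible reading, not the authors' system. The stage names, the `rediscover` function, and the `is_satisfied` callback are all hypothetical, and each stage is passed in as a plain function so that no real search engine or archive API is assumed.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A stage takes the missing URL and returns candidate replacement URLs.
Stage = Callable[[str], Sequence[str]]


def rediscover(
    url: str,
    stages: Sequence[Tuple[str, Stage]],
    is_satisfied: Callable[[Sequence[str]], bool],
) -> Optional[Tuple[str, List[str]]]:
    """Run the stages in order; stop at the first one that satisfies the user.

    Returns (stage_name, results), or None if every stage fails.
    """
    for name, stage in stages:
        results = stage(url)
        # An empty result list counts as "no results found".
        if results and is_satisfied(results):
            return name, list(results)
    return None


# Hypothetical stand-ins for steps (2), (3) and (5) of the flow.
example_stages = [
    ("caches_and_archive", lambda u: []),
    ("titles_ls_tags", lambda u: ["http://example.org/new-location"]),
    ("advanced_methods", lambda u: ["http://example.org/alternative"]),
]
```

Calling `rediscover("http://example.org/old", example_stages, lambda r: True)` returns the result of the second stage, because the first stage finds nothing.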
15. The Bigger Picture
The same flow, steps (1) through (6), with one annotation at step (3) (identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines): REAL TIME!!!
20. Search Engine Queries
• Lexical signatures (LSs)
  • Small set of terms capturing the “aboutness” of a document
  • Generated following the TF-IDF scheme
  • Phelps and Wilensky assumed 5 terms per LS [Phelps00]
  • We have shown that 5- and 7-term LSs perform best [Klein08]
BUT:
• IDF can only be estimated when the entire web is the corpus
• Expensive to generate
Web pages’ titles
[Phelps00] - T. A. Phelps and R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000
[Klein08] - M. Klein and M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008
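To make the TF-IDF idea concrete, here is a minimal sketch of generating a k-term lexical signature. It is an illustration only: the function names, the tokenizer, and the smoothed IDF formula are assumptions, and the IDF is computed from a small local corpus, which is exactly the kind of stand-in the slide warns about (a faithful IDF would need the whole web as corpus).

```python
import math
import re
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(document: str, corpus: Sequence[str], k: int = 5) -> List[str]:
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is an assumed local stand-in for web-scale document frequencies.
    """
    term_counts = Counter(tokenize(document))
    doc_term_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    scores = {}
    for term, tf in term_counts.items():
        df = sum(1 for terms in doc_term_sets if term in terms)
        # Smoothed IDF (an assumption) so that unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Terms that are frequent in the page but rare in the corpus rank highest, which is what lets a handful of terms stand in for the page in a search engine query.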
25. Web Pages’ Titles
• Easier/cheaper to obtain than LSs
• High availability (only 1-2% of web pages have no title)
• Also capture the “aboutness” of a web page
• We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
• Investigate change of titles over time
  • General frequency of change
  • Degree of change as Levenshtein score
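The degree of change between two versions of a title can be measured with the Levenshtein (edit) distance. The sketch below is a standard dynamic-programming implementation. The slides do not say whether the score was normalized, so `normalized_change`, which divides by the longer title's length, is an assumption added for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row (indexed by b) as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_change(old_title: str, new_title: str) -> float:
    """Edit distance scaled to [0, 1]; the normalization is an assumption."""
    longest = max(len(old_title), len(new_title))
    return levenshtein(old_title, new_title) / longest if longest else 0.0
```

A score of 0.0 means the title is unchanged, and 1.0 means no character position could be kept.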
27. Dataset
• 6k URLs randomly sampled from DMOZ
• Parsed the pages and extracted up to three URLs referencing in-domain pages
• Filtered out:
  • Inaccessible pages
  • Pages not containing any links
  • Pages not in the .com, .net, .org or .edu domain
  • Pages without copies in the IA
Result: 1090 URLs and more than 100K observations
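Two of the dataset steps above are easy to sketch: the top-level-domain filter and the selection of up to three in-domain links. The helper names are hypothetical, and treating "in-domain" as "same hostname" is an assumption. The accessibility and Internet Archive checks are left out because they need network access.

```python
from typing import List, Sequence
from urllib.parse import urlparse

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


def in_allowed_domain(url: str) -> bool:
    """True if the URL's host ends in one of the four allowed top-level domains."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)


def in_domain_links(page_url: str, links: Sequence[str], limit: int = 3) -> List[str]:
    """Return up to `limit` links whose hostname matches that of page_url.

    "In-domain" is interpreted as "same hostname" here, which is an assumption.
    """
    host = urlparse(page_url).hostname
    if host is None:
        return []
    same_host = [
        link for link in links
        if urlparse(link).hostname == host and link != page_url
    ]
    return same_host[:limit]
```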
35. Frequency of Change
[Figure: Number of Changes and Observations in the IA. x-axis: URLs (0 to 1000), ordered in increasing order by 1) observations, 2) changes. y-axis: Number of Changes/Observations (log scale, 1 to 10000).]
• generally low number of changes
• max changes: 25
• number of observations does not impact the number of changes
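The counts behind such a plot can be derived from the sequence of titles observed for each URL, in chronological order. The sketch below is an illustration under assumed names and an assumed input shape (a dict mapping each URL to its list of observed titles). It counts a change whenever two consecutive observations differ, then orders the URLs by observations and then by changes, as in the figure.

```python
from typing import Dict, List, Sequence, Tuple


def count_title_changes(titles: Sequence[str]) -> int:
    """Number of times the title differs from the previous observation."""
    return sum(1 for prev, cur in zip(titles, titles[1:]) if prev != cur)


def summarize(observations: Dict[str, Sequence[str]]) -> List[Tuple[str, int, int]]:
    """Return (url, number of observations, number of changes) per URL,
    sorted by observations first and by changes second."""
    rows = [
        (url, len(titles), count_title_changes(titles))
        for url, titles in observations.items()
    ]
    return sorted(rows, key=lambda row: (row[1], row[2]))
```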