Investigating the Change of Web Pages’ Titles Over Time

Martin Klein and Michael L. Nelson
Old Dominion University

{mklein,mln}@cs.odu.edu

InDP 2009
Austin, TX
06/19/2009

The Problem

http://www.pspcentral.org/events/annual_meeting_2003.html

The Environment

Web Infrastructure (WI) [McCown07]

• Web search engines (Google, Yahoo!, MSN Live) and their caches
• Research projects (CiteSeer)
• Web archives (Internet Archive)

[McCown07] - F. McCown, “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.

The Bigger Picture

[Flowchart]
(1) Query for URL in:
    · search engine caches
    · Internet Archive
(2) Present results; user is satisfied → DONE
(3) REAL TIME!!!
    · identify dissimilar pages
    · extract titles
    · generate LSs
    · obtain tags
    · query search engines
(4) Present results; user is satisfied → DONE
    no results found →
(5) · include link neighborhood
    · relevance feedback
    · user interaction:
        → request keywords
        → change number of terms in LS
        → add/delete term from LS
        → advanced search operators
(6) Present results → DONE

• System catches 404 “Page not found” errors
• Discovers copy of missing page in WI and provides to user
• Obtains further data about missing page (LS, title, tags) and feeds that back into WI
• Provides page at its new location or “good enough” alternative page
• More sophisticated methods needed if unsuccessful so far

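The first four steps of the flowchart can be read as a simple fallback chain. The following is only a rough sketch of that reading, not the authors' system: every callable passed in (`find_copies`, `derive_queries`, `search`, `user_accepts`) is a hypothetical placeholder for a component the slides name but do not specify, and steps (5) and (6) are left out.

```python
from typing import Callable, List, Optional


def rediscover(
    url: str,
    find_copies: Callable[[str], List[str]],
    derive_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    user_accepts: Callable[[List[str]], Optional[str]],
) -> Optional[str]:
    """Fallback chain for a URL that returned 404 (all callables are hypothetical)."""
    # Steps (1)-(2): look for copies in caches/archives and present them.
    copies = find_copies(url)
    if copies:
        choice = user_accepts(copies)
        if choice is not None:
            return choice

    # Steps (3)-(4): turn titles, LSs or tags into queries and present the hits.
    for query in derive_queries(url):
        results = search(query)
        if results:
            choice = user_accepts(results)
            if choice is not None:
                return choice

    # Steps (5)-(6), the more sophisticated methods, are not sketched here.
    return None
```

Passing the components in as functions keeps the sketch independent of any particular search engine or archive interface.
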
Search Engine Queries

• Lexical signatures (LSs)
  • Small set of terms capturing the “aboutness” of a document
  • Generated following the TF-IDF scheme
  • Phelps and Wilensky assumed 5 terms per LS [Phelps00]
• We have shown that 5- and 7-term LSs perform best [Klein08]

BUT:

• IDF can only be estimated when the entire web is the corpus
• Expensive to generate

Alternative: Web pages’ titles

[Phelps00] - T. A. Phelps and R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000
[Klein08] - M. Klein and M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008

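To make the TF-IDF idea concrete, here is a minimal sketch that ranks the terms of one document against a small in-memory corpus and keeps the top k as its lexical signature. The function names, the tokenizer, the add-one smoothing and the toy corpus size are assumptions made for illustration; the slides do not say how IDF was estimated, only that a realistic estimate needs the whole web as the corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score."""
    tf = Counter(tokenize(document))
    n_docs = len(corpus)

    # Document frequency: number of corpus documents containing each term.
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))

    scores = {}
    for term, freq in tf.items():
        # Add-one smoothing keeps the IDF finite for terms unseen in the corpus.
        idf = math.log((n_docs + 1) / (df[term] + 1))
        scores[term] = freq * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

A term that occurs in every corpus document gets an IDF of zero here, so common words drop to the bottom of the ranking without a stop-word list.
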
Web Pages’ Titles

• Easier/cheaper to obtain than LSs
• High availability (only 1-2% of web pages have no title)
• Also capture the “aboutness” of a web page
• We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
• Investigate change of titles over time
  • General frequency of change
  • Degree of change as Levenshtein score

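The degree of change between two versions of a title can be measured with the Levenshtein (edit) distance. The sketch below uses the standard dynamic-programming formulation; the helper `normalized_change`, which divides by the longer title's length to get a value between 0 and 1, is an assumption for illustration, since the slides do not state whether or how the score was normalized.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_change(old_title: str, new_title: str) -> float:
    """Edit distance scaled to [0, 1]; the scaling is an assumed convention."""
    longest = max(len(old_title), len(new_title))
    if longest == 0:
        return 0.0
    return levenshtein(old_title, new_title) / longest
```

For example, "Annual Meeting 2003" and "Annual Meeting 2004" differ by a single substitution, so their distance is 1.
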
Dataset

• 6k URLs randomly sampled from DMOZ
• Parsed the pages and extracted up to three URLs referencing in-domain pages
• Filtered out:
  • Inaccessible pages
  • Pages not containing any links
  • Pages not in the .com, .net, .org or .edu domain
  • Pages without copies in the IA

Result: 1090 URLs and more than 100K observations

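The link extraction and the top-level-domain filter described on this slide could look roughly like the following sketch, which uses only the Python standard library. Treating "in-domain" as "same host name as the sampled page" is an assumption, as are the names `LinkCollector`, `in_allowed_tld` and `in_domain_links`; the accessibility and Internet Archive checks are not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def in_allowed_tld(url):
    """True if the URL's host ends in one of the four accepted TLDs."""
    host = urlsplit(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)


def in_domain_links(page_url, html, limit=3):
    """Return up to `limit` distinct links that stay on the page's own host."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    selected = []
    for href in collector.hrefs:
        if len(selected) >= limit:
            break
        absolute = urljoin(page_url, href)  # resolve relative links
        same_host = urlsplit(absolute).hostname == page_host
        if same_host and absolute != page_url and absolute not in selected:
            selected.append(absolute)
    return selected
```

A page for which `in_domain_links` returns an empty list would correspond to the "pages not containing any links" case, at least as far as in-domain links are concerned.
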
Dataset

Length = 1: foo.bar/, foo.bar/index.html
Length = 2: foo.bar/bar/, foo.bar/bar/index.html

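One way to compute a URL length that agrees with the four examples on this slide is to count the host as level 1 and add one level per directory in the path, ignoring the final file name. This is an assumed reading of "Length", and the function name `url_length` is hypothetical; under this rule a last path segment without a trailing slash (such as foo.bar/bar) is treated as a file, not a directory.

```python
from urllib.parse import urlsplit


def url_length(url: str) -> int:
    """Host counts as 1; each directory in the path adds 1 (file name ignored)."""
    # urlsplit only recognises the host when a scheme or "//" prefix is present.
    if "//" not in url:
        url = "//" + url
    path = urlsplit(url).path or "/"
    # Everything before the final "/" is the directory part of the path.
    directory = path.rsplit("/", 1)[0]
    return 1 + len([segment for segment in directory.split("/") if segment])
```
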
Frequency of Change

[Figure: Number of Changes and Observations in the IA. x-axis: URLs (0 to 1000), ordered in increasing order by 1) observations, 2) changes; y-axis: Number of Changes/Observations (logarithmic, 1 to 10000); series: Number of Changes, Number of Observations]

• generally low number of changes
• max changes: 25
• number of observations does not impact the number of changes

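Counting title changes for one URL amounts to walking its archived copies in time order and counting how often consecutive titles differ. The sketch below assumes each observation is a (timestamp, title) pair with sortable timestamps and compares titles by exact string equality; both choices, and the function name, are illustrative assumptions that the slides do not confirm.

```python
def count_title_changes(observations):
    """Count how often the title differs between consecutive observations.

    `observations` is a list of (timestamp, title) pairs, one per archived copy.
    """
    ordered = sorted(observations)  # oldest copy first
    changes = 0
    for (_, before), (_, after) in zip(ordered, ordered[1:]):
        if before != after:
            changes += 1
    return changes
```

With this definition a URL with n observations can have at most n - 1 changes, and a title that changes and later reverts counts as two changes.
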
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time

Investigating the Change of Web Pages’ Titles Over Time

  • 1. Investigating the Change of Web Pages’ Titles Over Time Martin Klein and Michael L. Nelson Old Dominion University {mklein,mln}@cs.odu.edu InDP 2009 Austin, TX 06/19/2009
  • 7. The Environment
      Web Infrastructure (WI) [McCown07]
      • Web search engines (Google, Yahoo!, MSN Live) and their caches
      • Research projects (CiteSeer)
      • Web archives (Internet Archive)
      [McCown07] - F. McCown “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
  • 8-15. The Bigger Picture (one flowchart with stages (1) to (6), shown in incremental builds)
      Flowchart text:
      • query for URL in: search engine caches, Internet Archive
      • present results; user is satisfied: DONE
      • no results found: identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines
      • present results; user is satisfied: DONE
      • include link neighborhood, relevance feedback, user interaction (request keywords, change number of terms in LS, add/delete term from LS, advanced search operators)
      • present results: DONE
      Annotations added build by build:
      • System catches 404 “Page not found” errors
      • Discovers copy of missing page in WI and provides to user
      • Obtains further data about missing page (LS, title, tags) and feeds that back into WI
      • Provides page at its new location or “good enough” alternative page
      • More sophisticated methods needed if unsuccessful so far
      • Final build: “REAL TIME!!!”
  • 16-20. Search Engine Queries
      • Lexical signatures (LSs)
      • Small set of terms capturing the “aboutness” of a document
      • Generated following the TF-IDF scheme
      • Phelps and Wilensky assumed ‘5’ [Phelps00]
      • We have shown that 5- and 7-term LSs perform best [Klein08]
      BUT:
      • IDF can only be estimated when the entire web is the corpus
      • Expensive to generate
      Web pages’ titles
      [Phelps00] - T. A. Phelps and R. Wilensky “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000.
      [Klein08] - M. Klein and M. L. Nelson “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008.
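The TF-IDF scheme named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tokenizer, and the use of a small in-memory list of documents as the corpus are all assumptions. As the slide notes, a real IDF value would have to be estimated against the web as a whole.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of document strings standing in for the web
    (an assumption made for this sketch).
    """
    tf = Counter(tokenize(document))
    corpus_term_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_term_sets)

    scores = {}
    for term, count in tf.items():
        # Document frequency; clamp to 1 so a term missing from the
        # corpus does not cause a division by zero.
        df = max(sum(1 for terms in corpus_term_sets if term in terms), 1)
        scores[term] = count * math.log(n_docs / df)

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

The default `k=5` mirrors the 5-term signatures mentioned on the slide; passing `k=7` gives the other length reported to perform well.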
  • 21-25. Web Pages’ Titles
      • Easier/cheaper to obtain than LSs
      • High availability (only 1-2% of web pages have no title)
      • Also capture the “aboutness” of a web page
      • We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
      • Investigate change of titles over time
          • General frequency of change
          • Degree of change as Levenshtein score
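The Levenshtein score mentioned above is the edit distance between two strings. A minimal sketch of how two versions of a title could be compared is given below; the function names are hypothetical, and the normalization by the longer title's length is an assumption of this sketch, since the slide does not say whether the score is normalized.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Scale the distance to [0, 1] by the longer string's length
    (an assumed normalization, not taken from the slides)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

A score of 0 means the title did not change between two observations; values closer to 1 indicate that most of the title was rewritten.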
  • 26-27. Dataset
      • 6k URLs randomly sampled from DMOZ
      • Parsed the pages and extracted up to three URLs referencing in-domain pages
      • Applied filter for:
          • Inaccessible pages
          • Pages not containing any links
          • Pages not in the .com, .net, .org or .edu domain
          • Pages without copies in the IA
      Result: 1090 URLs and more than 100K observations
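Two of the dataset steps above, extracting up to three in-domain links and filtering by top-level domain, could look roughly like the sketch below. It uses only the Python standard library; the class and function names are hypothetical, and treating "in-domain" as "same host name" is an assumption. The accessibility check and the Internet Archive lookup are left out because they need network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def in_allowed_tld(url):
    """True if the URL's host ends in one of the four sampled TLDs."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)


def in_domain_links(page_url, html, limit=3):
    """Return up to `limit` distinct links that stay on the page's host."""
    host = urlparse(page_url).hostname
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        same_host = urlparse(absolute).hostname == host
        if same_host and absolute != page_url and absolute not in links:
            links.append(absolute)
        if len(links) == limit:
            break
    return links
```

A page for which `in_domain_links` returns an empty list would fall under the "pages not containing any links" filter in this reading.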
  • 28-31. Dataset
      Length = 1: foo.bar/, foo.bar/index.html
      Length = 2: foo.bar/bar/, foo.bar/bar/index.html
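The slide defines URL length only through its four examples. One rule consistent with all of them is "1 for the site root plus 1 per directory level", sketched below. The rule and the function name `url_length` are inferred assumptions, not a definition given in the slides.

```python
from urllib.parse import urlparse


def url_length(url):
    """Return 1 for a page at the site root, plus 1 per directory level."""
    if "//" not in url:
        # Let urlparse read a bare 'foo.bar/...' as host plus path.
        url = "//" + url
    path = urlparse(url).path
    # The last path component is either empty (trailing slash) or a file
    # name such as index.html; everything before it is a directory.
    directories = [part for part in path.split("/")[:-1] if part]
    return 1 + len(directories)
```

Note that a directory URL written without a trailing slash (for example `foo.bar/bar`) is indistinguishable from a file name here and is counted as length 1.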
  • 32-35. Frequency of Change
      [Plot: “Number of Changes and Observations in the IA”; URLs ordered in increasing order by 1) observations, 2) changes; x-axis: URLs (0 to 1000); y-axis: Number of Changes/Observations (log scale, 1 to 10000); series: Number of Changes, Number of Observations]
      • generally low number of changes
      • max changes: 25
      • number of observations does not impact the number of changes
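The number of changes per URL plotted above can be derived from a list of archived observations. The sketch below assumes each observation is a `(timestamp, title)` pair with a sortable timestamp string, and counts a change whenever two consecutive titles differ; both the data layout and the function name are assumptions for illustration.

```python
def count_title_changes(observations):
    """Count how often the title differs between consecutive observations.

    `observations` is a list of (timestamp, title) pairs. Timestamps are
    assumed to be strings that sort chronologically, e.g. '2003-01-31'.
    """
    ordered = sorted(observations)
    titles = [title for _, title in ordered]
    return sum(1 for prev, cur in zip(titles, titles[1:]) if prev != cur)
```

Under this definition a title that changes and later reverts counts as two changes, and a URL with a single observation has zero changes.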