Investigating the Change of
Web Pages’ Titles Over Time

   Martin Klein and Michael L. Nelson
         Old Dominion University

       {mklein,mln}@cs.odu.edu

                InDP 2009
                Austin, TX
                06/19/2009
The Problem

http://www.pspcentral.org/events/annual_meeting_2003.html
The Environment

Web Infrastructure (WI) [McCown07]

• Web search engines (Google, Yahoo!, MSN Live) and their caches
• Research projects (CiteSeer)
• Web archives (Internet Archive)

[McCown07] - F. McCown “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
The Bigger Picture

[Flowchart, steps (1) to (6); step (3) is annotated "REAL TIME!!!"]

(1) query for URL in:
    ·search engine caches
    ·Internet Archive
(2) present results; user is satisfied: DONE
(3) ·identify dissimilar pages
    ·extract titles
    ·generate LSs
    ·obtain tags
    ·query search engines
(4) present results; user is satisfied: DONE
(5) ·include link neighborhood
    ·relevance feedback
    ·user interaction:
        ·request keywords
        ·change number of terms in LS
        ·add/delete term from LS
        ·advanced search operators
(6) present results: DONE

Branch label in the flowchart: no results found

•   System catches 404 “Page not found” errors
•   Discovers copy of missing page in WI and provides to user
•   Obtains further data about missing page (LS, title, tags) and feeds that back into WI
•   Provides page at its new location or “good enough” alternative page
•   More sophisticated methods needed if unsuccessful so far
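The escalation in the flowchart can be read as "try the next method until the user is satisfied". The following is only a minimal sketch of that control flow, not the authors' system: `rediscover`, the strategy names, and the `is_satisfied` callback are hypothetical stand-ins for the boxes in the diagram.

```python
def rediscover(url, strategies, is_satisfied):
    """Try each strategy in order and stop once the user accepts the results.

    `strategies` is a list of (name, function) pairs (hypothetical); each
    function takes the missing URL and returns a list of candidate pages.
    `is_satisfied` stands in for the "user is satisfied" decision.
    """
    for name, find in strategies:
        results = find(url)
        if results and is_satisfied(results):
            return name, results  # DONE
    # Every strategy was tried without an accepted result.
    return None, []
```

A caller would pass the strategies in the order of the flowchart, for example cache and archive lookup first, then title, LS, and tag queries, then the interactive methods.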
Search Engine Queries

  •    Lexical signatures (LSs)

      •    Small set of terms capturing the “aboutness” of a document

      •    Generated following the TF-IDF scheme

      •    Phelps and Wilensky assumed 5 terms per LS [Phelps00]

  •    We have shown that 5- and 7-term LSs perform best [Klein08]

       BUT:

  •    IDF can only be estimated when the entire web is the corpus

  •    Expensive to generate

       Alternative: web pages’ titles

[Phelps00] - T. A. Phelps and R. Wilensky “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000.
[Klein08] - M. Klein and M. L. Nelson “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008.
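To make the TF-IDF idea concrete, here is a minimal sketch of generating a k-term lexical signature. It is an illustration under stated assumptions, not the method used in [Klein08]: the tokenizer, the add-one smoothing, and the tiny in-memory `corpus` (standing in for the web-scale corpus that a real IDF estimate needs) are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a small list of strings (an assumption of this sketch);
    it replaces the web-scale document frequencies a real system needs.
    """
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(text)) for text in corpus]
    n_docs = len(doc_sets)

    def idf(term):
        df = sum(1 for terms in doc_sets if term in terms)
        # Add-one smoothing keeps the ratio finite for unseen terms.
        return math.log((n_docs + 1) / (df + 1))

    scores = {term: count * idf(term) for term, count in tf.items()}
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

Terms that occur in every corpus document score zero and fall to the end of the ranking, which is why the quality of the IDF estimate matters so much for this approach.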
Web Pages’ Titles

•   Easier/cheaper to obtain than LSs

•   High availability (only 1-2% of web pages have no title)

•   Also capture the “aboutness” of a web page

•   We have shown that LSs decay over time and their
    retrieval performance decreases [Klein08]

•   Investigate change of titles over time
    •   General frequency of change
    •   Degree of change as Levenshtein score
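A short sketch of both ingredients named above: pulling the title out of a page and scoring the change between two titles with the Levenshtein edit distance. The class and function names are hypothetical, and the score here is the raw edit distance; the slides do not say whether the authors normalized it.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the whitespace-normalized title, or "" if the page has none."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.parts).split())


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

Applied to two archived copies of the same page, `levenshtein(extract_title(old_html), extract_title(new_html))` gives the degree of change between the two observations.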
Dataset

•   6k URLs randomly sampled from DMOZ

•   Parsed the pages and extracted up to three URLs
    referencing in-domain pages

•   Filtered out:

    •   Inaccessible pages

    •   Pages not containing any links

    •   Pages not in the .com, .net, .org or .edu domain

    •   Pages without copies in the IA

Result: 1090 URLs and more than 100K observations
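Two of the steps above can be sketched offline: extracting up to three in-domain links from a page and checking the top-level domain. This is a minimal sketch, not the authors' crawler; "in-domain" is assumed here to mean "same host name", and all names are hypothetical. The accessibility and Internet Archive checks need network access and are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_allowed_tld(url):
    """True if the URL's host ends in one of the four accepted domains."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)


def in_domain_links(page_url, html, limit=3):
    """Return up to `limit` distinct links that stay on the page's host."""
    host = urlparse(page_url).hostname
    parser = LinkParser()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).hostname == host and absolute not in found:
            found.append(absolute)
        if len(found) == limit:
            break
    return found
```

A page for which `in_domain_links` returns an empty list corresponds to the "pages not containing any links" filter, at least for in-domain links.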
Dataset

[Figure: example URLs by length]

Length = 1: foo.bar/ and foo.bar/index.html
Length = 2: foo.bar/bar/ and foo.bar/bar/index.html
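The four examples suggest a rule: the host counts as length 1, each directory adds one, and a trailing file name adds nothing. The sketch below implements that inferred rule; the rule itself and the function name `url_length` are assumptions drawn only from these examples.

```python
from urllib.parse import urlparse


def url_length(url):
    """Count the host as level 1 plus one level per directory in the path."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    path = urlparse(url).path
    # Everything after the last slash is treated as a file name,
    # so "foo.bar/bar" (no trailing slash) counts as length 1.
    directory = path.rsplit("/", 1)[0]
    return 1 + len([segment for segment in directory.split("/") if segment])
```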
Frequency of Change

[Figure: Number of Changes and Observations in the IA. x-axis: URLs (0 to 1000); y-axis: Number of Changes/Observations (ticks at 1, 10, 100, 1000, 10000); series: Number of Changes, Number of Observations; ordered in increasing order by: 1) observations, 2) changes]

• generally low number of changes
• max changes: 25
• number of observations does not impact the number of changes
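Counting changes per URL can be sketched as comparing each archived title with the one before it. This is an illustrative sketch with hypothetical names; it assumes the titles for each URL are already in chronological order, and it sorts by observations and then by changes, which is one reading of the ordering note in the figure.

```python
def count_title_changes(titles):
    """Count how often consecutive observations carry different titles."""
    return sum(1 for old, new in zip(titles, titles[1:]) if old != new)


def summarize(observations_by_url):
    """Return (url, observations, changes) rows, smallest counts first."""
    rows = [(url, len(titles), count_title_changes(titles))
            for url, titles in observations_by_url.items()]
    return sorted(rows, key=lambda row: (row[1], row[2]))
```

A title that changes and later changes back counts as two changes under this definition.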
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time
Investigating the Change of Web Pages’ Titles Over Time




Investigating the Change of Web Pages’ Titles Over Time

  • 1. Investigating the Change of Web Pages’ Titles Over Time Martin Klein and Michael L. Nelson Old Dominion University {mklein,mln}@cs.odu.edu InDP 2009 Austin, TX 06/19/2009
  • 7. The Environment
      Web Infrastructure (WI) [McCown07]
      • Web search engines (Google, Yahoo!, MSN Live) and their caches
      • Research projects (CiteSeer)
      • Web archives (Internet Archive)
      [McCown07] F. McCown, “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.
  • 8-15. The Bigger Picture (flowchart; the annotations are added one per build)
      • (1) System catches 404 “Page not found” errors.
      • (2) Query for URL in search engine caches and the Internet Archive, then present results. Discovers copy of missing page in WI and provides it to the user. User is satisfied: DONE.
      • (3) No results found: identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines. Obtains further data about the missing page (LS, title, tags) and feeds that back into WI. The last build (slide 15) adds the label “REAL TIME!!!” near this step.
      • (4) Present results. Provides page at its new location or a “good enough” alternative page. User is satisfied: DONE.
      • (5) More sophisticated methods are needed if unsuccessful so far:
          • include link neighborhood
          • relevance feedback
          • user interaction: request keywords, change number of terms in LS, add/delete term from LS, advanced search operators
      • (6) Present results: DONE.
  • 16-20. Search Engine Queries
      • Lexical signatures (LSs)
          • Small set of terms capturing the “aboutness” of a document
          • Generated following the TF-IDF scheme
          • Phelps and Wilensky assumed ‘5’ [Phelps00]
          • We have shown that 5- and 7-term LSs perform best [Klein08]
      BUT:
      • IDF can only be estimated when the entire web is the corpus
      • Expensive to generate
      Web pages’ titles
      [Phelps00] T. A. Phelps and R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000.
      [Klein08] M. Klein and M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008.
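The TF-IDF scheme named on the slide above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lexical_signature`, the tokenizer, and the `doc_freq` table (document frequencies from some corpus) are assumptions. The table stands in for the web-scale IDF estimate that the slide describes as the expensive part.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms of `text` with the highest TF-IDF score.

    `doc_freq` maps a term to the number of corpus documents containing it;
    both it and `corpus_size` are hypothetical inputs for this sketch.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)
    scores = {}
    for term, count in tf.items():
        # Terms missing from the table are treated as appearing in one document.
        df = doc_freq.get(term, 1)
        scores[term] = count * math.log(corpus_size / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

The resulting terms would be submitted to a search engine as a query; `k=5` mirrors the 5-term signatures mentioned on the slide.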
  • 21-25. Web Pages’ Titles
      • Easier/cheaper to obtain than LSs
      • High availability (1-2% of web pages have no title)
      • Also capturing “aboutness” of a web page
      • We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
      • Investigate change of titles over time
          • General frequency of change
          • Degree of change as Levenshtein score
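The two measures listed above (frequency of change and degree of change as a Levenshtein score) can be sketched as below. The slides do not say whether or how the score is normalized, so dividing by the longer title's length in `normalized_change` is an assumption of this sketch, as are all function names.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_change(old_title, new_title):
    """Edit distance scaled to [0, 1] by the longer title (assumed normalization)."""
    longest = max(len(old_title), len(new_title))
    return levenshtein(old_title, new_title) / longest if longest else 0.0


def count_changes(titles):
    """Number of times the title differs between consecutive observations."""
    return sum(1 for old, new in zip(titles, titles[1:]) if old != new)
```

Here `titles` would be the chronologically ordered titles of one URL's archived copies.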
  • 26-27. Dataset
      • 6k URLs randomly sampled from DMOZ
      • Parsed the pages and extracted up to three URLs referencing in-domain pages
      • Applied filter for:
          • Inaccessible pages
          • Pages not containing any links
          • Pages not in the .com, .net, .org or .edu domain
          • Pages without copies in the IA
      Result: 1090 URLs and more than 100K observations
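The link extraction and the domain filter above could look roughly like this. It is a sketch under assumptions: “in-domain” is read as “same hostname as the sampled page”, the names `in_domain_links` and `has_allowed_tld` are made up for illustration, and the accessibility and Internet Archive checks are left out because they require network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_domain_links(page_url, html, limit=3):
    """Return up to `limit` distinct links that stay on the page's own host."""
    host = urlsplit(page_url).hostname
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlsplit(absolute).hostname == host and absolute not in links:
            links.append(absolute)
        if len(links) == limit:
            break
    return links


def has_allowed_tld(url):
    """True if the URL's host is in .com, .net, .org or .edu."""
    host = urlsplit(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)
```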
  • 28-31. Dataset
      • Length = 1: foo.bar/, foo.bar/index.html
      • Length = 2: foo.bar/bar/, foo.bar/bar/index.html
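The four examples above suggest that a URL's length counts the host plus each directory level, while a trailing file name does not add to it. A minimal sketch of that reading follows; the function name `url_length` and the rule that a final segment without a trailing slash is a file name are assumptions.

```python
from urllib.parse import urlsplit


def url_length(url):
    """Host counts as 1; every directory below it adds 1; file names add 0."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit recognize the host in scheme-less URLs
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # Assumption: a last segment not followed by "/" is a file, not a directory.
    if segments and not path.endswith("/"):
        segments.pop()
    return 1 + len(segments)
```

With this definition `foo.bar/index.html` has length 1 and `foo.bar/bar/index.html` has length 2, matching the slide.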
  • 32-35. Frequency of Change
      [Figure: “Number of Changes and Observations in the IA”. Series: Number of Changes, Number of Observations. URLs ordered in increasing order by 1) observations, 2) changes. x-axis: URLs (0 to 1000); y-axis: Number of Changes/Observations (1 to 10000).]
      • generally low number of changes
      • max changes: 25
      • number of observations does not impact the number of changes