Is This a Good Title?


Martin Klein and Jeffery Shipman and Michael L. Nelson
                Old Dominion University

         {mklein,jshipman,mln}@cs.odu.edu

                      Hypertext 2010
                      Toronto, Canada
                        06/14/2010

                   This work is supported in part by the Library of Congress
The Problem

Professional Scholarly Publishing 2003
http://www.pspcentral.org/events/annual_meeting_2003.html




                                                            2
The Problem
Internet Archive - Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com

59 copies

Lexical Signature
(TF/IDF)
Charter Aircraft Cargo
Passenger Jet Air
Enquiry

Title
ACMI, Private Jet
Charter, Private Jet
Lease, Charter Flight
Service: Air Charter
International
                                                                                                3
The Problem
                     www.aircharter-international.com



Lexical Signature
(TF/IDF)
Charter Aircraft Cargo
Passenger Jet Air
Enquiry




                                                        4
The Problem
                        www.aircharter-international.com


Title
ACMI, Private Jet
Charter, Private Jet
Lease, Charter Flight
Service: Air Charter
International




                                                           5
The Problem
                    http://www.drbartell.com/


Lexical Signature
(TF/IDF)

                                                ???
Plastic Surgeon
Reconstructive Dr
Bartell Symbol
University




                                                      6
The Problem
                    http://www.drbartell.com/


Title
Thomas Bartell MD
Board-Certified -
Cosmetic Plastic
Reconstructive
Surgery




                                                7
The Problem
                     www.reagan.navy.mil



Lexical Signature
(TF/IDF)
Ronald USS MCSN
Torrey Naval Sea
Commanding




                                           8
The Problem
              www.reagan.navy.mil




                                    ???
Title
Home Page



Is This a Good Title?

                                          9
Contributions


•   Discuss the discovery performance of web page titles
    (compared to lexical signatures, LSs)

•   Analyze the relevancy of the discovered pages

•   Show title evolution compared to content
    evolution over time

•   Provide a prediction model for a title’s retrieval potential




                                                               10
Experiment - Data Gathering

      •     20k URIs randomly sampled from DMOZ

      •     Applied filters
           •     English language
           •     min. of 50 terms [Park]

      •     Results in 6,875 URIs

      •     Downloaded and parsed the pages

      •     Extracted the title and generated an LS per page (baseline)

                   .com      .org      .net      .edu       sum

      Original    15289      2755      1459       497     20000
      Filtered     4863      1327       369       316      6875
[Park]
S.T. Park et al. “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web” ACM ToIS 22(4):540-572, 2004
                                                                                                                                             11
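
The baseline step above (generating an LS per page) can be illustrated with a minimal TF/IDF sketch. This is an assumption-laden example, not the setup used in the experiment: the slides do not say where document frequencies come from, so the sketch takes them from a small local corpus, and the names `tokenize` and `lexical_signature` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    Assumption: document frequency (DF) is counted over corpus_texts,
    a small local collection standing in for the real DF source.
    """
    tf = Counter(tokenize(page_text))
    n_docs = len(corpus_texts)
    doc_terms = [set(tokenize(t)) for t in corpus_texts]
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for terms in doc_terms if term in terms)
        # Add-one smoothing keeps the logarithm defined for unseen terms.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = freq * idf
    # Sort by descending score, then alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


corpus = ["the jet is fast", "the cargo is heavy", "the page is here"]
page = "charter jet charter aircraft the the the is"
print(lexical_signature(page, corpus, k=3))
```

Terms that occur in every corpus document (here "the" and "is") score zero and drop to the end of the ranking, which is why an LS tends to keep rare, page-specific terms.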
Title (and LS) Retrieval Performance

[Figure: two bar charts, "Titles" and "5- and 7-Term LSs". x-axis: Top Ranked, Top 10, Top 100, Undiscovered; y-axis: Relative Number of URLs.]

                               •   Titles return more than 60% of URIs top ranked

                               •   Binary retrieval pattern: a URI is either within the top 10 or
                                   undiscovered
                                                                                                                                                    12
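
The four categories in the charts can be sketched as a simple classification of the rank at which the original URI appears in a result list. This is only an illustration: the function name `retrieval_category` is hypothetical, and it assumes exact string matching of URIs and that only the first 100 results are inspected.

```python
from collections import Counter


def retrieval_category(original_uri, result_uris):
    """Classify where original_uri appears in a ranked list of result URIs.

    Assumptions: URIs are compared as exact strings (no canonicalization)
    and anything beyond rank 100 counts as undiscovered.
    """
    try:
        rank = result_uris.index(original_uri) + 1  # ranks start at 1
    except ValueError:
        return "undiscovered"
    if rank == 1:
        return "top ranked"
    if rank <= 10:
        return "top 10"
    if rank <= 100:
        return "top 100"
    return "undiscovered"


# Tally categories over a few made-up query results.
results = {
    "http://a.example/": ["http://a.example/", "http://x.example/"],
    "http://b.example/": ["http://x.example/", "http://b.example/"],
    "http://c.example/": ["http://x.example/", "http://y.example/"],
}
tally = Counter(retrieval_category(uri, hits) for uri, hits in results.items())
print(tally)
```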
Relevancy of Retrieval Results

Do titles return relevant
results besides the
original URI?

•   Distinguish between
    discovered (top 10) and
    undiscovered URIs

•   Analyze content of top 10
    results

•   Measure relevancy in terms of
    normalized term overlap
    and shingles between original
    URI and search result by rank
                                          13
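
The two relevancy measures named above can be sketched as set comparisons. The slides do not define the normalization, so this example assumes a Jaccard-style ratio (intersection over union) for both measures and a shingle width of 3 terms; the function names are hypothetical.

```python
def terms(text):
    """Split text into lowercase whitespace-separated terms."""
    return text.lower().split()


def term_overlap(a, b):
    """Normalized term overlap: shared terms over all terms (assumed Jaccard)."""
    ta, tb = set(terms(a)), set(terms(b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def shingles(text, w=3):
    """Return the set of contiguous w-term sequences (shingles) in text."""
    toks = terms(text)
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}


def shingle_similarity(a, b, w=3):
    """Shared shingles over all shingles; 0.0 if neither text yields a shingle."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


original = "private jet charter service"
reordered = "charter service private jet"
print(term_overlap(original, reordered))        # same terms, different order
print(shingle_similarity(original, reordered))  # no shared 3-term sequence
```

Term overlap ignores word order, while shingles are order sensitive, so a result that reuses the same vocabulary in a different arrangement scores high on the first measure and low on the second.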
Relevancy of Retrieval Results

[Figure: "Term Overlap", two bar charts, "Discovered" and "Undiscovered". x-axis: Rank (1-10); y-axis: Frequency; legend (term overlap value): 1, > 0.75, > 0.5, > 0.0, 0.]

                                                    High relevancy in the top ranks
                                                  with possible aliases and duplicates.
                                                                                                                                                                                   14
Relevancy of Retrieval Results

[Figure: "Shingles", two bar charts, "Discovered" and "Undiscovered". x-axis: Rank (1-10); y-axis: Frequency; legend (shingle value): 1, > 0.75, > 0.5, > 0.0, 0.]

                               More results with an optimal shingle value than top-ranked URIs,
                                   which suggests possible aliases and duplicates.
                                                                                                                                                                                   15
Title Evolution - Example I
                         www.sun.com/solutions
1998-01-27   Sun Software Products Selector Guides - Solutions Tree
1999-02-20   Sun Software Solutions
2002-02-01   Sun Microsystems Products
2002-06-01   Sun Microsystems - Business & Industry Solutions
2003-08-01   Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02   Sun Microsystems - Solutions
2004-06-10   Gateway Page - Sun Solutions
2006-01-09   Sun Microsystems Solutions & Services
2007-01-03   Services & Solutions
2007-02-07   Sun Services & Solutions
2008-01-19   Sun Solutions
                                                                                 16
Title Evolution - Example II
                www.datacity.com/mainf.html
2000-06-19   DataCity of Manassas Park Main Page

2000-10-12   DataCity of Manassas Park sells Custom Built Computers &
             Removable Hard Drives

2001-08-21   DataCity a computer company in Manassas Park sells Custom
             Built Computers & Removable Hard Drives

2002-10-16   computer company in Manassas Virginia sells Custom Built
             Computers with Removable Hard Drives Kits and Iomega 2GB Jaz
             Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free

2006-03-14   Est 1989 Computer company in Stafford Virginia sells Custom
             Built Secure Computers with DoD 5200.1-R Approved Removable
             Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz
             drives), introduces the IllumiNite; lighted keyboard DataCity
             800-326-5051 Service Disabled Veteran Owned Business SDVOB
                                                                            17
Title Evolution Over Time

How much do titles
change over time?

•   Copies from fixed-size time
    windows per year

•   Extract available titles of the past
    14 years

•   Compute the normalized
    Levenshtein edit
    distance between titles of
    copies and baseline
    (0 = identical; 1 = completely
    dissimilar)
                                        18
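
The normalized Levenshtein edit distance can be sketched as follows. The slides do not state the normalization, so this example assumes division by the length of the longer title, which yields 0 for identical titles and 1 for completely dissimilar ones; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty titles are treated as identical
    return levenshtein(a, b) / longest


baseline = "Sun Solutions"
for old_title in ["Sun Solutions", "Sun Services & Solutions", "Gateway Page - Sun Solutions"]:
    print(old_title, normalized_edit_distance(old_title, baseline))
```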
Title Evolution Over Time

[Figure: title edit distance frequencies (0-100) per year, from 2/2009 back to 2/1997; legend: edit distance 0 (Unchanged), 0.1 (Slightly Changed), 0.2 through 1.0.]

•   Half the titles of
    available copies from
    recent years are
    (close to) identical

•   Decay from 2005 on
    (with fewer copies
    available)

•   4 year old title:
    40% chance to be
    unchanged
                                                                                                           19
Title Evolution Over Time
    Title vs Document

•   Y: avg shingle value for
    all copies per URI

•   X: avg edit distance of
    corresponding titles

•   overlap indicated by:
    green: <10
    red: >90

•   Semi-transparent: total
    amount of points
    plotted

[Figure annotations: [0,1] - over 1600 times; [0,0] - 122 times]
                                                         20
Title Performance Prediction

    •    Quality prediction of title by

        •    Number of nouns, articles etc.

        •    Number of title terms and characters ([Ntoulas])

    •    Observation of recurring terms in poorly performing
         titles - “Stop Titles”

    home, index, home page, welcome, untitled document

                  The performance of any given title can
                  be predicted as insufficient if 75% or more
                     of it consists of a “Stop Title”!
[Ntoulas]
A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2006, pp 83-92
                                                                                                             21
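
The 75% rule can be sketched as a ratio test. The slide does not say whether the share is measured in terms or in characters, so this example assumes the share of title terms that belong to one of the listed Stop Titles; it also treats each word of a multi-word Stop Title as a stop term and an empty title as poor, both of which are simplifying assumptions. The function names are hypothetical.

```python
import re

# Stop Titles listed on the slide.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]


def stop_title_ratio(title):
    """Share of title terms that occur in any Stop Title (assumed measure)."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        return 1.0  # assumption: an empty title is treated as a Stop Title
    stop_words = {w for phrase in STOP_TITLES for w in phrase.split()}
    covered = sum(1 for w in words if w in stop_words)
    return covered / len(words)


def predicted_insufficient(title, threshold=0.75):
    """Predict poor retrieval performance at or above the threshold."""
    return stop_title_ratio(title) >= threshold


for title in ["Home Page",
              "Welcome to the Home Page",
              "Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery"]:
    print(title, stop_title_ratio(title), predicted_insufficient(title))
```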
Concluding Remarks

The “aboutness” of web pages can be determined from either
the content or from the title.

More than 60% of URIs are returned top ranked when using
the title as a search engine query.

Titles change more slowly and less significantly over time than
the web pages’ content.

Not all titles are equally good.
If the majority of a title’s terms are Stop Titles, its quality can be
predicted to be poor.

                                                                    22
Is This a Good Title?



                    Questions?



Martin Klein and Jeffery Shipman and Michael L. Nelson
                Old Dominion University

         {mklein,jshipman,mln}@cs.odu.edu
                                                         23
