Is This a Good Title?

Presentation I gave at ACM Hypertext 2010 in Toronto, Canada. The paper can be found at:
http://doi.acm.org/10.1145/1810617.1810621

  • Is This a Good Title?

    1. Is This a Good Title? Martin Klein, Jeffery Shipman, and Michael L. Nelson. Old Dominion University, {mklein,jshipman,mln}@cs.odu.edu. Hypertext 2010, Toronto, Canada, 06/14/2010. This work is supported in part by the Library of Congress.
    2. The Problem: Professional Scholarly Publishing 2003, http://www.pspcentral.org/events/annual_meeting_2003.html
    5. The Problem: Internet Archive (Wayback Machine), 59 copies of www.aircharter-international.com, http://web.archive.org/web/*/http://www.aircharter-international.com
        Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
        Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
    6. The Problem: www.aircharter-international.com
        Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
    7. The Problem: www.aircharter-international.com
        Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
    8. The Problem: http://www.drbartell.com/
        Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University
    9. The Problem: http://www.drbartell.com/
        Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery
    10. The Problem: www.reagan.navy.mil
        Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding
    12. The Problem: www.reagan.navy.mil
        Title: Home Page
        Is This a Good Title?
    13. Contributions
        • Discuss the discovery performance of web page titles (compared to LSs)
        • Analyze discovered pages regarding their relevancy
        • Display title evolution compared to content evolution over time
        • Provide a prediction model for a title’s retrieval potential
    14. Experiment - Data Gathering
        • 20k URIs randomly sampled from DMOZ
        • Applied filters
            • English language
            • min. of 50 terms [Park]
        • Results in 6,875 URIs
        • Downloaded and parsed the pages
        • Extract title and generate LS per page (baseline)

                     .com    .org    .net    .edu     sum
        Original    15289    2755    1459     497   20000
        Filtered     4863    1327     369     316    6875

        [Park] S.T. Park et al. “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web” ACM ToIS 22(4):540-572, 2004
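The baseline step of generating a lexical signature (LS) per page can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `tokenize` and `lexical_signature`, the regular-expression tokenizer, the smoothed IDF formula, and the small in-memory corpus used to estimate document frequencies are all assumptions, since the slides do not say how document frequencies were obtained.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on non-letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `corpus` is a list of raw document strings used only to estimate
    document frequencies (a hypothetical stand-in for a real index).
    """
    docs_tokens = [set(tokenize(d)) for d in corpus]
    n_docs = len(docs_tokens)
    tf = Counter(tokenize(doc))

    def idf(term):
        df = sum(1 for tokens in docs_tokens if term in tokens)
        # Smoothed IDF so that unseen terms do not cause a division by zero.
        return math.log((1 + n_docs) / (1 + df)) + 1

    scored = {term: count * idf(term) for term, count in tf.items()}
    # Sort by score (descending), then alphabetically for a stable order.
    ranked = sorted(scored, key=lambda t: (-scored[t], t))
    return ranked[:k]
```

Calling `lexical_signature(page_text, corpus, k=5)` or `k=7` would yield 5- or 7-term signatures that can then be submitted as a search engine query.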
    15. Title (and LS) Retrieval Performance
        [Figure: two bar charts, one for titles and one for 5- and 7-term LSs, showing the relative number of URLs that are top ranked, in the top 10, in the top 100, or undiscovered]
        • Titles return more than 60% of URIs top ranked
        • Binary retrieval pattern: a URI is either within the top 10 or undiscovered
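The four retrieval categories in the chart can be derived from the rank at which the original URI appears in the result list. The sketch below is only illustrative: the names `rank_bucket` and `bucket_shares` are hypothetical, and it assumes that "top 10" excludes rank 1 and that anything beyond rank 100 counts as undiscovered.

```python
from collections import Counter


def rank_bucket(rank):
    """Map a 1-based result rank (or None if not found) to a category."""
    if rank is None or rank > 100:
        return "undiscovered"
    if rank == 1:
        return "top ranked"
    if rank <= 10:
        return "top 10"
    return "top 100"


def bucket_shares(ranks):
    """Return the percentage of URIs that fall into each category."""
    if not ranks:
        return {}
    counts = Counter(rank_bucket(r) for r in ranks)
    total = len(ranks)
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}
```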
    16. Relevancy of Retrieval Results
        Do titles return relevant results besides the original URI?
        • Distinguish between discovered (top 10) and undiscovered URIs
        • Analyze content of top 10 results
        • Measure relevancy in terms of normalized term overlap and shingles between original URI and search result by rank
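The two relevancy measures named above could be computed along these lines. The slides do not give the exact normalization or shingle size, so this sketch assumes Jaccard similarity over term sets and over word 3-shingles; the function names and the tokenizer are likewise hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_overlap(a, b):
    """Jaccard overlap of the two term sets, in [0, 1]."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def shingles(text, w=3):
    """Return the set of contiguous w-term sequences in the text."""
    tokens = tokenize(text)
    if len(tokens) < w:
        # Short texts form a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_similarity(a, b, w=3):
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Term overlap ignores word order, while shingles reward identical word sequences, so a value of 1 on both is a strong hint that a result is an alias or duplicate of the original page.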
    17. Relevancy of Retrieval Results: Term Overlap
        [Figure: two histograms (discovered, undiscovered) of frequency by rank 1-10, with term overlap binned as 1, > 0.75, > 0.5, > 0.0, and 0]
        High relevancy in the top ranks, with possible aliases and duplicates.
    18. Relevancy of Retrieval Results: Shingles
        [Figure: two histograms (discovered, undiscovered) of frequency by rank 1-10, with shingle values binned as 1, > 0.75, > 0.5, > 0.0, and 0]
        More results with optimal shingle values than top ranked URIs, indicating possible aliases and duplicates.
    19. Title Evolution - Example I: www.sun.com/solutions
        1998-01-27  Sun Software Products Selector Guides - Solutions Tree
        1999-02-20  Sun Software Solutions
        2002-02-01  Sun Microsystems Products
        2002-06-01  Sun Microsystems - Business & Industry Solutions
        2003-08-01  Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
        2004-02-02  Sun Microsystems - Solutions
        2004-06-10  Gateway Page - Sun Solutions
        2006-01-09  Sun Microsystems Solutions & Services
        2007-01-03  Services & Solutions
        2007-02-07  Sun Services & Solutions
        2008-01-19  Sun Solutions
    21. Title Evolution - Example II: www.datacity.com/mainf.html
        2000-06-19  DataCity of Manassas Park Main Page
        2000-10-12  DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
        2001-08-21  DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
        2002-10-16  computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
        2006-03-14  Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
    23. Title Evolution Over Time
        How much do titles change over time?
        • Copies from fixed size time windows per year
        • Extract available titles of past 14 years
        • Compute normalized Levenshtein edit distance between titles of copies and baseline (0 = identical; 1 = completely dissimilar)
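A normalized Levenshtein distance with the stated range (0 = identical, 1 = completely dissimilar) can be sketched as below. Dividing by the length of the longer title is an assumption, as the slides do not state the normalization, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest
```

Applied to an archived title and the baseline title of the same URI, a result near 0 means the title barely changed between the two copies.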
    24. Title Evolution Over Time
        [Figure: frequencies (0-100) of title edit distance per time window from 2/2009 back to 2/1997, binned in steps of 0.1 from 0 (unchanged) through slightly changed values to 1.0]
        • Half the titles of available copies from recent years are (close to) identical
        • Decay from 2005 on (with fewer copies available)
        • 4 year old title: 40% chance to be unchanged
    27. Title Evolution Over Time: Title vs Document
        • Y: avg shingle value for all copies per URI
        • X: avg edit distance of corresponding titles
        • Overlap indicated by color: green: <10, red: >90
        • Semi-transparent: total amount of points plotted
        • [0,0] - 122 times
        • [0,1] - over 1600 times
    29. Title Performance Prediction
        • Quality prediction of title by
            • Number of nouns, articles etc.
            • Amount of title terms, characters ([Ntoulas])
        • Observation of re-occurring terms in poorly performing titles - “Stop Titles”: home, index, home page, welcome, untitled document
        The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!
        [Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2006, pp 83-92
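The 75% rule can be illustrated with a short sketch. It assumes the share is measured in title terms (the slides do not say whether terms or characters are counted), uses only the five stop titles listed on the slide, and treats an empty title as fully uninformative; the names `STOP_TITLES`, `stop_title_ratio`, and `predicted_insufficient` are hypothetical.

```python
import re

# Stop titles from the slide, as term tuples; longer phrases come first
# so that "home page" is matched before the single term "home".
STOP_TITLES = [
    ("untitled", "document"),
    ("home", "page"),
    ("home",),
    ("index",),
    ("welcome",),
]


def stop_title_ratio(title):
    """Return the fraction of title terms that belong to a stop title."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    if not tokens:
        return 1.0  # assumption: an empty title carries no information
    covered = 0
    i = 0
    while i < len(tokens):
        for phrase in STOP_TITLES:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                covered += len(phrase)
                i += len(phrase)
                break
        else:
            i += 1  # no stop title starts at this term
    return covered / len(tokens)


def predicted_insufficient(title, threshold=0.75):
    """Apply the 75% rule: True if the title is mostly stop titles."""
    return stop_title_ratio(title) >= threshold
```

Under these assumptions the title "Home Page" from www.reagan.navy.mil is flagged, while the descriptive www.drbartell.com title is not.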
    30. Concluding Remarks
        • The “aboutness” of web pages can be determined from either the content or the title.
        • More than 60% of URIs are returned top ranked when using the title as a search engine query.
        • Titles change more slowly and less significantly over time than the web pages’ content.
        • Not all titles are equally good. If the majority of a title’s terms are Stop Titles, its quality can be predicted to be poor.
    31. Is This a Good Title? Questions? Martin Klein, Jeffery Shipman, and Michael L. Nelson. Old Dominion University, {mklein,jshipman,mln}@cs.odu.edu
