Key Phrases for Better Search

  • 10 days of data (5K/min)

    1. 1. Content Analytics for Better Search Otis Gospodnetić, Sematext International
    2. 2. Agenda <ul><li>Intro: Otis & Sematext
    3. 3. Basic Search
    4. 4. Taming Search Results
    5. 5. Key Phrases
    6. 6. Beyond Search </li></ul>
    7. 7. About Otis Gospodnetić <ul><li>Member: Apache Lucene, Solr, Nutch, Mahout
    8. 8. Author: Lucene in Action 1 & 2
    9. 9. Entrepreneur: Simpy (2004), Lucene Consulting (2005), Sematext Int'l (2007)
    10. 10. Organizer: NY Search & Discovery Meetup </li></ul>
    11. 11. About Sematext <ul>Consulting, development, support: <li>Big Data (Hadoop, HBase, Voldemort...)
    12. 12. Search (Lucene, Solr, Elastic Search...)
    13. 13. Web Crawling (Nutch)
    14. 14. Machine Learning (Mahout) </li></ul>
    15. 15. Basic Search
    16. 16. Taming Search Results <ul><li>Related searches (high query volume)
    17. 17. Search results clustering (fuzzy)
    18. 18. Named Entity Recognition (NER)
    19. 19. Faceted search (structured input)
    20. 20. … </li></ul>
    21. 21. Example: Related Searches
    22. 22. Example: Results Clustering
    23. 23. Example: Named Entities <ul>Sorry, no screenshot, but I know sites use this! Really, I do! :) </ul>
    24. 24. Example: Faceted Search
    25. 25. Content Analysis: Key Phrases <ul><li>Related searches
    26. 26. Search results clustering
    27. 27. Named Entity Recognition (NER)
    28. 28. Faceted search
    29. 29. Key Phrases </li><ul><li>Collocations
    30. 30. Statistically Improbable Phrases (SIPs) </li></ul></ul>
    31. 31. Example: Key Phrases & Search
    32. 32. Example: Key Phrases & Search
    33. 33. Definitions: Collocations <ul><li>Collocations are phrases whose words appear together in a text more often than you would expect from how frequently each word appears individually in that same text.
    34. 34. Source: http://sematext.com/demo/kpe/
    35. 35. See: http://en.wikipedia.org/wiki/Collocation </li></ul>
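One common way to make the collocation definition concrete is pointwise mutual information (PMI), which compares how often two words are adjacent with how often they would be adjacent if they occurred independently. The sketch below is an illustration only: PMI is a stand-in chosen here, not necessarily what the Key Phrase Extractor does, and the `collocations` function, the `min_count` threshold, and the sample text are all hypothetical.

```python
import math
from collections import Counter


def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI is an assumed scoring choice for illustration; the slides do not
    say which statistic the original tool uses.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bigrams
        # Probability of the pair if the two words were independent.
        p_independent = (unigrams[w1] / n_unigrams) * (unigrams[w2] / n_unigrams)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample text, already tokenized by whitespace.
text = ("new york is big . new york is busy . "
        "the city is big . the park is big .")
ranked = collocations(text.split())
print(ranked[0][0])  # ('new', 'york')
```

A higher score means the pair co-occurs more than its individual word frequencies would predict. On real text a larger `min_count` is usually needed, because PMI tends to overrate pairs of very rare words.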
    36. 36. Definitions: SIPs <ul><li>Statistically Improbable Phrases (SIPs) are phrases that appear in a text more often than you would expect given how often they appear in another text.
    37. 37. Source: http://sematext.com/demo/kpe/
    38. 38. See: http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases </li></ul>
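The SIP definition can be sketched as a comparison of phrase frequencies between a foreground text and a background text. The example below is a minimal illustration under stated assumptions: the ratio of relative frequencies with add-one smoothing is one plausible score, not the method confirmed by the slides or by Amazon, and `ngrams`, `improbable_phrases`, and both sample texts are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return all runs of n adjacent tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def improbable_phrases(foreground, background, n=2, min_count=2):
    """Rank foreground phrases by how over-represented they are
    relative to the background (assumed scoring, for illustration)."""
    fg = Counter(ngrams(foreground, n))
    bg = Counter(ngrams(background, n))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = len(set(fg) | set(bg))

    scores = {}
    for phrase, count in fg.items():
        if count < min_count:
            continue
        p_fg = count / fg_total
        # Add-one smoothing so phrases unseen in the background
        # do not cause a division by zero.
        p_bg = (bg[phrase] + 1) / (bg_total + vocab)
        scores[phrase] = p_fg / p_bg
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up texts: one "document" and one "everything else" sample.
document = ("capital gains tax applies when capital gains exceed "
            "the limit and the limit is low").split()
reference = ("the limit is low and the tax is low and "
             "the limit is high").split()
sips = improbable_phrases(document, reference)
print(sips[0][0])  # ('capital', 'gains')
```

Here "capital gains" outranks "the limit" because the latter is just as common in the reference text, which is the behavior the definition describes.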
    39. 39. Language Models
    40. 40. Hybrid Key Phrases
    41. 41. Beyond Search <ul><li>Content enrichment / Tagging
    42. 42. Navigation / Cross-linking / Site stickiness
    43. 43. Related content / At-a-glance “aboutness”
    44. 44. Trend spotting / Buzz monitoring
    45. 45. Social media
    46. 46. Customer reviews
    47. 47. Brand and market campaign monitoring </li></ul>
    48. 48. Book Content Analysis
    49. 49. SIPs at Amazon <ul><li>Amazon SIPs are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
    50. 50. SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements. </li></ul>
    51. 51. News Content Analysis <ul><li>Source: http://sematext.com/demo/kpe/ </li></ul>
    52. 52. SIPs & News Topic Trending <ul><li>The text for the new (or "current") period covers the last 7 days, from now back to 7 days ago. The text for the old (or "past") period covers the 7 days before that.
    53. 53. now ← new text → (now - 7 days) ← old text → (now - 14 days) </li></ul>
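The two-period split described above can be sketched as a partition of timestamped documents into a new window and an old window. This is only an illustration: the `(timestamp, text)` input shape, the function names, the boundary handling (each window includes its later edge), the simple count comparison, and the sample dates are all assumptions, not details from the slides.

```python
from collections import Counter
from datetime import datetime, timedelta


def split_periods(docs, now, days=7):
    """Split (timestamp, text) pairs into new-period and old-period texts.

    New period: (now - days, now]. Old period: (now - 2*days, now - days].
    Anything older is ignored.
    """
    new_start = now - timedelta(days=days)
    old_start = now - timedelta(days=2 * days)
    new_text, old_text = [], []
    for timestamp, text in docs:
        if new_start < timestamp <= now:
            new_text.append(text)
        elif old_start < timestamp <= new_start:
            old_text.append(text)
    return new_text, old_text


def rising_terms(new_text, old_text):
    """List terms seen more often in the new period than in the old one,
    largest increase first (a crude stand-in for SIP scoring)."""
    new_counts = Counter(" ".join(new_text).lower().split())
    old_counts = Counter(" ".join(old_text).lower().split())
    rising = [t for t in new_counts if new_counts[t] > old_counts[t]]
    return sorted(rising, key=lambda t: new_counts[t] - old_counts[t],
                  reverse=True)


# Arbitrary reference time and made-up headlines.
now = datetime(2011, 5, 15, 12, 0)
docs = [
    (now - timedelta(days=1), "election results election"),
    (now - timedelta(days=3), "election debate"),
    (now - timedelta(days=9), "budget debate"),
    (now - timedelta(days=20), "old news"),
]
new_text, old_text = split_periods(docs, now)
print(rising_terms(new_text, old_text)[0])  # election
```

In a trending setup the old-period text plays the role of the background, so phrases that are frequent now but were rare last week rise to the top.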
    54. 54. Customer Experience <ul><li>Mindshare Technologies (MT) is a Voice of the Customer company that helps companies make operational improvements based on customer feedback. MT's client list includes many of the world's largest restaurant chains, hotels, car rental agencies, and telecommunications companies. Much of the feedback we collect is from surveys that contain open-ended questions where customers can leave comments. MT has used the Key Phrase Extractor to unlock the value contained in these comments. We are able to identify common problems experienced by customers and are even able to detect emerging topics that are starting to catch fire. Mindshare's clients are able to leverage this information and make operational changes that improve customer experiences. </li></ul>
    55. 55. Lessons <ul><li>GIGO
    56. 56. Language-awareness (POS)
    57. 57. Filtering (England v) </li></ul>
    58. 58. Contact <ul><li>sematext.com
    59. 59. blog.sematext.com
    60. 60. @sematext
    61. 61. @otisg
    62. 62. [email_address] </li></ul>
