
Key Phrases for Better Search

  1. 1. Content Analytics for Better Search (Otis Gospodnetić, Sematext International)
  2. 2. Agenda
       • Intro: Otis & Sematext
       • Basic Search
       • Taming Search Results
       • Key Phrases
       • Beyond Search
  7. 7. About Otis Gospodnetić
       • Member: Apache Lucene, Solr, Nutch, Mahout
       • Author: Lucene in Action 1 & 2
       • Entrepreneur: Simpy (2004), Lucene Consulting (2005), Sematext Int'l (2007)
       • Organizer: NY Search & Discovery Meetup
  11. 11. About Sematext
       Consulting, development, support:
       • Big Data (Hadoop, HBase, Voldemort...)
       • Search (Lucene, Solr, Elasticsearch...)
       • Web Crawling (Nutch)
       • Machine Learning (Mahout)
  15. 15. Basic Search
  16. 16. Taming Search Results
       • Related searches (high query volume)
       • Search results clustering (fuzzy)
       • Named Entity Recognition (NER)
       • Faceted search (structured input)
       • …
  21. 21. Example: Related Searches
  22. 22. Example: Results Clustering
  23. 23. Example: Named Entities
       • Sorry, no screenshot, but I know sites use this! Really, I do! :)
  24. 24. Example: Faceted Search
  25. 25. Content Analysis: Key Phrases
       • Related searches
       • Search results clustering
       • Named Entity Recognition (NER)
       • Faceted search
       • Key Phrases
           • Collocations
           • Statistically Improbable Phrases (SIPs)
  31. 31. Example: Key Phrases & Search
  32. 32. Example: Key Phrases & Search
  33. 33. Definitions: Collocations
       • Collocations are phrases whose words appear together in a text more often than you would expect from how frequent each individual word is in that same text.
       • Source: http://sematext.com/demo/kpe/
       • See: http://en.wikipedia.org/wiki/Collocation
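
One common way to turn the collocation definition into a score is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is only an illustration under that assumption; the slides do not say which statistic the Key Phrase Extractor uses, and the function name `collocations` and the `min_count` parameter are hypothetical.

```python
import math
from collections import Counter


def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by PMI (an assumed scoring choice).

    PMI compares how often a pair is actually seen together with how
    often it would be seen if the two words occurred independently.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_pairs
        p_independent = (unigrams[w1] / n_words) * (unigrams[w2] / n_words)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)

    # Highest score first: pairs seen together far more than chance predicts.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    text = "new york is big and new york is busy and the big city is busy"
    for pair, score in collocations(text.split()):
        print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than the individual word frequencies would predict. Real systems usually add a minimum-frequency cutoff like `min_count`, because PMI overrates pairs of rare words.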
  36. 36. Definitions: SIPs
       • Statistically Improbable Phrases (SIPs) are phrases that appear in a text more often than you would expect given how often they appear in another text.
       • Source: http://sematext.com/demo/kpe/
       • See: http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
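
The SIP definition compares two texts, so a minimal sketch needs a foreground text and a background text. The example below scores each foreground bigram by the ratio of its relative frequency in the foreground to its smoothed relative frequency in the background. The ratio, the add-one smoothing, and the names `sips`, `min_count`, and `smoothing` are assumptions made for illustration, not the method used by Sematext or Amazon.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def sips(foreground, background, min_count=2, smoothing=1.0):
    """Rank foreground bigrams by how over-represented they are
    relative to the background text (assumed scoring: frequency ratio)."""
    fg = bigram_counts(foreground)
    bg = bigram_counts(background)
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    vocab = len(set(fg) | set(bg)) or 1

    scores = {}
    for phrase, count in fg.items():
        if count < min_count:
            continue
        p_fg = count / fg_total
        # Smoothing keeps phrases unseen in the background from dividing by zero.
        p_bg = (bg[phrase] + smoothing) / (bg_total + smoothing * vocab)
        scores[phrase] = p_fg / p_bg

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    book = "tax credit rules and tax credit forms for the year".split()
    all_books = "rules for the year and forms for the year for the year".split()
    for phrase, score in sips(book, all_books, min_count=1):
        print(phrase, round(score, 2))
```

A score above 1 means the phrase is more frequent in the foreground than the background suggests; phrases common to both texts score low even when they are frequent.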
  39. 39. Language Models
  40. 40. Hybrid Key Phrases
  41. 41. Beyond Search
       • Content enrichment / Tagging
       • Navigation / Cross-linking / Site stickiness
       • Related content / At-a-glance “aboutness”
       • Trend spotting / Buzz monitoring
       • Social media
       • Customer reviews
       • Brand and market campaign monitoring
  48. 48. Book Content Analysis
  49. 49. SIPs at Amazon
       • Amazon SIPs are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
       • SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
  51. 51. News Content Analysis
       • Source: http://sematext.com/demo/kpe/
  52. 52. SIPs & News Topic Trending
       • The text for the new (or "current") period covers the last 7 days, from now back to 7 days ago. The text for the old (or "past") period covers the 7 days before that.
       • now ← new text → (now - 7 days) ← old text → (now - 14 days)
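
The two-period split described on this slide can be sketched as a small helper that sorts timestamped documents into a new window and an old window. The function name `split_periods`, the `(published, text)` tuple shape, and the choice to put a document published exactly 7 days ago into the old period are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def split_periods(docs, now, window_days=7):
    """Split (published, text) pairs into new-period and old-period texts.

    New period: (now - 7 days, now]
    Old period: (now - 14 days, now - 7 days]
    Anything older than 14 days, or dated in the future, is ignored.
    """
    window = timedelta(days=window_days)
    new_start = now - window
    old_start = now - 2 * window

    new_texts, old_texts = [], []
    for published, text in docs:
        if new_start < published <= now:
            new_texts.append(text)
        elif old_start < published <= new_start:
            old_texts.append(text)
    return new_texts, old_texts


if __name__ == "__main__":
    now = datetime(2010, 6, 15)
    docs = [
        (now - timedelta(days=1), "fresh story"),
        (now - timedelta(days=8), "last week's story"),
        (now - timedelta(days=20), "stale story"),
    ]
    new_texts, old_texts = split_periods(docs, now)
    print(new_texts, old_texts)
```

Treating the new-period text as the foreground and the old-period text as the background, a SIP-style comparison would then surface phrases that became more frequent in the last 7 days.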
  54. 54. Customer Experience
       • Mindshare Technologies (MT) is a Voice of the Customer company that helps companies make operational improvements based on customer feedback. MT's client list includes many of the world's largest restaurant chains, hotels, car rental agencies, and telecommunications companies. Much of the feedback we collect is from surveys that contain open-ended questions where customers can leave comments. MT has used the Key Phrase Extractor to unlock the value contained in these comments. We are able to identify common problems experienced by customers and are even able to detect emerging topics that are starting to catch fire. Mindshare's clients are able to leverage this information and make operational changes that improve customer experiences.
  55. 55. Lessons
       • GIGO (garbage in, garbage out)
       • Language-awareness (part-of-speech tagging, POS)
       • Filtering (England v)
  58. 58. Contact
       • sematext.com
       • blog.sematext.com
       • @sematext
       • @otisg
       • [email_address]