Key Phrases for Better Search
  • 10 days of data (5K/min)
  • Transcript

    • 1. Content Analytics for Better Search. Otis Gospodnetić, Sematext International
    • 2. Agenda
      • Intro: Otis & Sematext
      • Basic Search
      • Taming Search Results
      • Key Phrases
      • Beyond Search
    • 7. About Otis Gospodnetić
      • Member: Apache Lucene, Solr, Nutch, Mahout
      • Author: Lucene in Action 1 & 2
      • Entrepreneur: Simpy (2004), Lucene Consulting (2005), Sematext Int'l (2007)
      • Organizer: NY Search & Discovery Meetup
    • 11. About Sematext. Consulting, development, support:
      • Big Data (Hadoop, HBase, Voldemort...)
      • Search (Lucene, Solr, Elastic Search...)
      • Web Crawling (Nutch)
      • Machine Learning (Mahout)
    • 15. Basic Search
    • 16. Taming Search Results
      • Related searches (high query volume)
      • Search results clustering (fuzzy)
      • Named Entity Recognition (NER)
      • Faceted search (structured input)
      • …
    • 21. Example: Related Searches
    • 22. Example: Results Clustering
    • 23. Example: Named Entities. Sorry, no screenshot, but I know sites use this! Really, I do! :)
    • 24. Example: Faceted Search
    • 25. Content Analysis: Key Phrases
      • Related searches
      • Search results clustering
      • Named Entity Recognition (NER)
      • Faceted search
      • Key Phrases
        • Collocations
        • Statistically Improbable Phrases (SIPs)
    • 31. Example: Key Phrases & Search
    • 32. Example: Key Phrases & Search
    • 33. Definitions: Collocations
      • Collocations are phrases whose words appear together in a text more often than you would expect from how frequent each individual word is in that same text.
      • Source: http://sematext.com/demo/kpe/
      • See: http://en.wikipedia.org/wiki/Collocation
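The collocation definition can be illustrated with a small scoring sketch. This is not the Key Phrase Extractor's actual method, which the slides do not describe; it assumes pointwise mutual information (PMI) as the "more than you would expect" measure, and the function name `collocations`, the `min_count` cutoff, and the sample text are all made up for illustration.

```python
import math
from collections import Counter


def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by PMI (an assumed scoring choice)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        # How often the pair is actually seen together.
        p_pair = count / n_pairs
        # How often we would expect it if the words were independent.
        p_expected = (unigrams[w1] / n_words) * (unigrams[w2] / n_words)
        scores[(w1, w2)] = math.log2(p_pair / p_expected)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy text, invented for this example.
text = ("new york is big . new york is busy . "
        "the city is big and the city is busy .")
ranked = collocations(text.split())
for pair, score in ranked[:3]:
    print(" ".join(pair), round(score, 2))
```

Pairs built from a very common word such as "is" score lower than pairs like "new york", because the common word's high individual frequency raises the expected co-occurrence.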
    • 36. Definitions: SIPs
      • Statistically Improbable Phrases are phrases that appear in a text more often than you would expect given how often they appear in another text.
      • Source: http://sematext.com/demo/kpe/
      • See: http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
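A rough sketch of the SIP idea follows: compare how often a phrase occurs in one text (the foreground) with how often it occurs in another (the background). The ratio-with-smoothing score, the helper names `ngrams` and `improbable_phrases`, and both sample texts are assumptions for illustration, not the scoring used by Sematext or Amazon.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return all n-word phrases in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def improbable_phrases(foreground, background, n=2, top=3):
    """Rank foreground phrases by how over-represented they are
    relative to the background text (assumed ratio score)."""
    fg = Counter(ngrams(foreground.lower().split(), n))
    bg = Counter(ngrams(background.lower().split(), n))
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    vocab = len(set(fg) | set(bg))
    scores = {}
    for phrase, count in fg.items():
        p_fg = count / fg_total
        # Add-one smoothing so phrases unseen in the background
        # do not cause a division by zero.
        p_bg = (bg[phrase] + 1) / (bg_total + vocab)
        scores[phrase] = p_fg / p_bg
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Invented sample texts.
book = ("capital gains tax applies when capital gains exceed "
        "the limit and capital gains are reported")
corpus = ("the tax applies when income is reported and "
          "the limit is set by the tax office")
top_phrases = improbable_phrases(book, corpus)
print(top_phrases)
```

Phrases that are frequent in both texts, such as "tax applies" here, score low even though they are common in the foreground, which is the point of measuring against a background.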
    • 39. Language Models
    • 40. Hybrid Key Phrases
    • 41. Beyond Search
      • Content enrichment / Tagging
      • Navigation / Cross-linking / Site stickiness
      • Related content / At-a-glance “aboutness”
      • Trend spotting / Buzz monitoring
      • Social media
      • Customer reviews
      • Brand and market campaign monitoring
    • 48. Book Content Analysis
    • 49. SIPs at Amazon
      • Amazon SIPs are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
      • SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
    • 51. News Content Analysis
      • Source: http://sematext.com/demo/kpe/
    • 52. SIPs & News Topic Trending
      • The text for the new (or you can think of it as "current") period runs from now back up to 7 days. The text for the old (or "past") period covers the 7 days before that.
      • now ← new text → (now - 7 days) ← old text → (now - 14 days)
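The two-period split described on this slide can be sketched as below. The function name `split_periods`, the `(timestamp, text)` input shape, and the sample headlines are hypothetical; the resulting new and old texts would then be compared with a SIP-style score to find phrases that are trending.

```python
from datetime import datetime, timedelta


def split_periods(docs, now, window_days=7):
    """Split (timestamp, text) pairs into the new period
    (now - 7 days, now] and the old period (now - 14 days, now - 7 days].
    Anything older is ignored."""
    new_start = now - timedelta(days=window_days)
    old_start = now - timedelta(days=2 * window_days)
    new_parts, old_parts = [], []
    for ts, text in docs:
        if new_start < ts <= now:
            new_parts.append(text)
        elif old_start < ts <= new_start:
            old_parts.append(text)
    return " ".join(new_parts), " ".join(old_parts)


# Invented sample data with a fixed "now" so the result is reproducible.
now = datetime(2011, 5, 15, 12, 0)
docs = [
    (now - timedelta(days=1), "royal wedding coverage"),
    (now - timedelta(days=6), "royal wedding guests"),
    (now - timedelta(days=8), "budget vote delayed"),
    (now - timedelta(days=13), "budget talks continue"),
    (now - timedelta(days=20), "too old to count"),
]
new_text, old_text = split_periods(docs, now)
print(new_text)
print(old_text)
```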
    • 54. Customer Experience
      • Mindshare Technologies (MT) is a Voice of the Customer company who helps companies make operational improvements based on customer feedback. MT's client list includes many of the world's largest restaurant chains, hotels, car rental agencies, and telecommunications companies. Much of the feedback we collect is from surveys that contain open-ended questions where customers can leave comments. MT has used the Key Phrase Extractor to unlock the value contained in these comments. We are able to identify common problems experienced by customers and are even able to detect emerging topics that are starting to catch fire. Mindshare's clients are able to leverage this information and make operational changes that improve customer experiences.
    • 55. Lessons
      • GIGO
      • Language-awareness (POS)
      • Filtering (England v)
    • 58. Contact
      • sematext.com
      • blog.sematext.com
      • @sematext
      • @otisg
      • [email_address]