Copyright 2010 Sematext Int'l. All rights reserved.
1
Content Analytics
for
Better Search
Otis Gospodnetić ••• Sematext In...
Copyright 2010 Sematext Int'l. All rights reserved.
2
Agenda
● Intro: Otis & Sematext
● Basic Search
● Taming Search Resul...
Copyright 2010 Sematext Int'l. All rights reserved.
3
About Otis Gospodnetić
• Member: Apache Lucene/Solr/Nutch/Mahout
• A...
Copyright 2010 Sematext Int'l. All rights reserved.
4
About Sematext
Consulting, development, support:
● Big Data (Hadoop,...
Copyright 2010 Sematext Int'l. All rights reserved.
5
Basic Search
Copyright 2010 Sematext Int'l. All rights reserved.
6
Taming Search Results
● Related searches (high query volume)
● Searc...
Copyright 2010 Sematext Int'l. All rights reserved.
7
Example: Related Searches
Copyright 2010 Sematext Int'l. All rights reserved.
8
Example: Results Clustering
Copyright 2010 Sematext Int'l. All rights reserved.
9
Example: Named Entities
Sorry, no screenshot, but I know sites use t...
Copyright 2010 Sematext Int'l. All rights reserved.
10
Example: Faceted Search
Copyright 2010 Sematext Int'l. All rights reserved.
11
Content Analysis: Key Phrases
● Related searches
● Search results c...
Copyright 2010 Sematext Int'l. All rights reserved.
12
Example: Key Phrases & Search
Copyright 2010 Sematext Int'l. All rights reserved.
13
Example: Key Phrases & Search
Copyright 2010 Sematext Int'l. All rights reserved.
14
Definitions: Collocations
● Collocations are phrases whose words ar...
Copyright 2010 Sematext Int'l. All rights reserved.
15
Definitions: SIPs
● Statistically Improbably Phrases are phrases
th...
Copyright 2010 Sematext Int'l. All rights reserved.
16
Language Models
Copyright 2010 Sematext Int'l. All rights reserved.
17
Hybrid Key Phrases
Copyright 2010 Sematext Int'l. All rights reserved.
18
Beyond Search
● Content analysis
● Trend spotting / Buzz monitoring...
Copyright 2010 Sematext Int'l. All rights reserved.
19
Book Content Analysis
Copyright 2010 Sematext Int'l. All rights reserved.
20
SIPs at Amazon
● Amazon SIPs are the most distinctive phrases in th...
Copyright 2010 Sematext Int'l. All rights reserved.
21
News Content Analysis
● Source: http://sematext.com/demo/kpe/
Copyright 2010 Sematext Int'l. All rights reserved.
22
SIPs & News Topic Trending
● The text for the new (or you can think...
Copyright 2010 Sematext Int'l. All rights reserved.
23
Customer Experience
● Mindshare Technologies (MT) is a Voice of the...
Copyright 2010 Sematext Int'l. All rights reserved.
24
Lessons
● GIGO
● Language-awareness (POS)
● Filtering (England v)
Copyright 2010 Sematext Int'l. All rights reserved.
25
• sematext.com
• blog.sematext.com
• @sematext
• @otisg
• otis@sema...
Upcoming SlideShare
Loading in …5
×

Content Analytics for Better Search

Presentation by Otis Gospodnetić, Sematext International, at Smart Content: The Content Analytics Conference, October 19, 2010, http://smartcontentconference.com

Published in: Business, Technology

Slide notes:
  • 10 days of data (5K/min)

    1. Content Analytics for Better Search
       Otis Gospodnetić ••• Sematext International
    2. Agenda
       ● Intro: Otis & Sematext
       ● Basic Search
       ● Taming Search Results
       ● Key Phrases
       ● Beyond Search
    3. About Otis Gospodnetić
       • Member: Apache Lucene/Solr/Nutch/Mahout
       • Author: Lucene in Action 1 & 2
       • Entrepreneur: Simpy, Lucene Consulting, Sematext Int'l since 2007
    4. About Sematext
       Consulting, development, support:
       ● Big Data (Hadoop, HBase, Voldemort...)
       ● Search (Lucene, Solr, Elastic Search...)
       ● Web Crawling (Nutch)
       ● Machine Learning (Mahout)
    5. Basic Search
    6. Taming Search Results
       ● Related searches (high query volume)
       ● Search results clustering (fuzzy)
       ● Named Entity Recognition (NER)
       ● Faceted search (structured input)
       ● …
    7. Example: Related Searches
    8. Example: Results Clustering
    9. Example: Named Entities
       Sorry, no screenshot, but I know sites use this! Really, I do! :)
    10. Example: Faceted Search
    11. Content Analysis: Key Phrases
        ● Related searches
        ● Search results clustering
        ● Named Entity Recognition (NER)
        ● Faceted search
        ● Key Phrases
        ● Collocations
        ● Statistically Improbable Phrases (SIPs)
    12. Example: Key Phrases & Search
    13. Example: Key Phrases & Search
    14. Definitions: Collocations
        ● Collocations are phrases whose words are seen together more than you would expect given an estimate of how frequent each individual word is in the given text vs. how often they are seen together in the same text.
        ● Source: http://sematext.com/demo/kpe/
        ● See: http://en.wikipedia.org/wiki/Collocation
    15. Definitions: SIPs
        ● Statistically Improbable Phrases are phrases that appear in a text more often than you would expect given how often they appear in another text. In this demo we extract SIPs by comparing texts from two different time periods.
        ● Source: http://sematext.com/demo/kpe/
        ● See: http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
    16. Language Models
    17. Hybrid Key Phrases
    18. Beyond Search
        ● Content analysis
        ● Trend spotting / Buzz monitoring
        ● Social media
        ● Customer reviews / Brand
    19. Book Content Analysis
    20. SIPs at Amazon
        ● Amazon SIPs are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book. SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
    21. News Content Analysis
        ● Source: http://sematext.com/demo/kpe/
    22. SIPs & News Topic Trending
        ● The text for the new (or you can think of it as "current") period goes from now to up to 7 days back. The text for the old (or "past") period is for the 7 days before that.
        now ← new text → (now - 7 days) ← old text → (now - 14 days)
    23. Customer Experience
        ● Mindshare Technologies (MT) is a Voice of the Customer company that helps companies make operational improvements based on customer feedback. MT's client list includes many of the world's largest restaurant chains, hotels, car rental agencies, and telecommunications companies. Much of the feedback we collect is from surveys that contain open-ended questions where customers can leave comments. MT has used the Key Phrase Extractor to unlock the value contained in these comments. We are able to identify common problems experienced by customers and are even able to detect emerging topics that are starting to catch fire. Mindshare's clients are able to leverage this information and make operational changes that improve customer experiences.
    24. Lessons
        ● GIGO
        ● Language-awareness (POS)
        ● Filtering (England v)
    25. Contact
        • sematext.com
        • blog.sematext.com
        • @sematext
        • @otisg
        • otis@sematext.com
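
The collocation and SIP definitions on slides 14, 15, and 22 can be made concrete with a small Python sketch. This is an illustration only, not the Sematext Key Phrase Extractor: the slides do not say which statistics the demo uses, so pointwise mutual information (PMI) for collocations and an add-one smoothed frequency ratio for SIPs are assumptions. The function names (`bigrams`, `collocations`, `sips`, `split_periods`), the `min_count` threshold, and the restriction to two-word phrases are likewise hypothetical.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def bigrams(tokens):
    """Return adjacent word pairs; longer phrases are left out for brevity."""
    return list(zip(tokens, tokens[1:]))


def collocations(tokens, min_count=2):
    """Rank bigrams by PMI (an assumed measure, not taken from the slides).

    PMI compares how often two words occur together with how often they
    would co-occur if each word appeared independently in the same text.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(bigrams(tokens))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI values
        p_pair = count / n_pairs
        p_first = word_counts[first] / n_words
        p_second = word_counts[second] / n_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def sips(new_tokens, old_tokens, min_count=2):
    """Rank bigrams by how much more frequent they are in the new text.

    The score is an add-one smoothed ratio of relative frequencies
    (an assumed measure); values above 1 mean "more common than before".
    """
    new_counts = Counter(bigrams(new_tokens))
    old_counts = Counter(bigrams(old_tokens))
    new_total = sum(new_counts.values())
    old_total = sum(old_counts.values())
    vocab = len(set(new_counts) | set(old_counts))
    scores = {}
    for phrase, count in new_counts.items():
        if count < min_count:
            continue
        p_new = (count + 1) / (new_total + vocab)
        p_old = (old_counts[phrase] + 1) / (old_total + vocab)
        scores[phrase] = p_new / p_old
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def split_periods(docs, now, days=7):
    """Split (timestamp, text) pairs into the new and old periods.

    New period: (now - days, now]. Old period: (now - 2*days, now - days].
    Texts are concatenated, so pairs spanning two documents are counted
    too; a real system would keep document boundaries.
    """
    new_start = now - timedelta(days=days)
    old_start = now - timedelta(days=2 * days)
    new_tokens, old_tokens = [], []
    for timestamp, text in docs:
        words = text.lower().split()
        if new_start < timestamp <= now:
            new_tokens.extend(words)
        elif old_start < timestamp <= new_start:
            old_tokens.extend(words)
    return new_tokens, old_tokens
```

To mirror the news trending setup, one would pass timestamped article texts to `split_periods` and feed the two token lists to `sips`. The whitespace tokenizer is deliberately naive; the lessons on slide 24 (language-awareness, filtering) suggest a production extractor needs considerably more care at that step.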
