Otis Gospodnetić: Search Analytics, Lucene Eurocon 2011

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

Slide note: 10 days of data (5K/min)

    1. Search Analytics: Business Value & NoSQL Backend
         Otis Gospodnetić, Sematext International
         @otisg ◦ @sematext ◦ sematext.com ◦ sematext.com/search-analytics
    2. About Otis Gospodnetić
         • ASF Member: Lucene, Solr, Nutch, Mahout
         • Author: Lucene in Action 1 & 2
         • Entrepreneur: Sematext, Simpy
    3. Sematext Metrics
         • 100% organic: no GMO, no VC
         • 4 years old
         • < 10 people
         • 7 countries
         • 3 timezones
         • 2 continents
         • > 100 customers
    4. About Sematext
         • Products & Services
         • Consulting, Development, Tech Support:
         • Search (Lucene, Solr, ElasticSearch...)
         • Big Data (Hadoop, HBase, Voldemort...)
         • Web Crawling (Nutch, Droids)
         • Machine Learning (Mahout)
    5. Agenda
         • What is Search Analytics and why it matters
         • Example reports and their value
         • What we built, why, and how
    6. Communication
         • twitter.com/sematext
         • twitter.com/otisg
         • Hash tags: #stsa or #stanalytics
         • http://sematext.com/search-analytics/index.html
         • Raise your hand!
         • otis@sematext.com
    7. The Compass
         • Search logs are your Map
         • Search Analytics is your Compass
    8. High Level Why
         • search users, search providers, search experience
    9. High Level Why
         • search users, search providers, search experience
         • Search users: "This search sucks! It takes 17 tries to find anything here! F!?@#$%^&?!?"
         • Search providers: "Cool, the latest search tweaks made our site really sticky! Awesome!"
    10. Don't Be Like This Dude
    11. Got Clue?
         • Search Analytics, Performance Monitoring, Quality Assurance, Tuning, UI
    12. More Concrete Why
         • Measure and monitor everything. Introspection.
         • Supports (re)design, navigation choices
         • Helps with content acquisition & enhancement
         • Improve search experience
         • Mula
    13. The Moment of Truth
         • Question for the audience #1: What do you use for Search Analytics?
         • a) Home grown stuff  b) Google Analytics  c) Omniture  d) Webtrends  e) Other  f) Nothing
    14. Search Analytics Outline
         • Collect: queries & clicks & interactions & ...
         • Analyze: actions / xactions / conversions
         • Output: reports – over time
         • Output++: feedback loop
         • The means, not the goal
         • Ongoing, not one-off (remember this)
    15. Search vs. Web Analytics
         • User intent and information needs vs. inferring
         • Hand in hand
         • Ideally you can relate data from both or even unify it
    16. Example Core Reports
         • Rate & Volume, Latency (mean, avg, 90%)
         • Click Through Rate, Mean Reciprocal Rank
         • Top Queries by count, clicks, 0 hits...
         • Query Trending
         • Top Seen Docs, Top Clicked Docs (msft)
         • Page & Click Depth
         • Facet & Sort Usage
         • ...
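A few of the core reports above reduce to short formulas over logged searches. The following is a minimal sketch, not Sematext's implementation: the record layout (`latency_ms`, `clicked_rank`) is hypothetical, searches without a click are assumed to contribute 0 to Mean Reciprocal Rank, and the 90% latency uses the nearest-rank percentile.

```python
import math
from statistics import mean

# Hypothetical search log records: latency in milliseconds and the 1-based
# rank of the clicked result (None when the user clicked nothing).
searches = [
    {"query": "solr facets", "latency_ms": 40, "clicked_rank": 1},
    {"query": "hbase scan", "latency_ms": 120, "clicked_rank": 2},
    {"query": "flume sink", "latency_ms": 65, "clicked_rank": None},
    {"query": "lucene", "latency_ms": 300, "clicked_rank": 4},
]


def click_through_rate(records):
    """Fraction of searches that led to at least one click."""
    clicked = sum(1 for r in records if r["clicked_rank"] is not None)
    return clicked / len(records)


def mean_reciprocal_rank(records):
    """Average of 1/rank of the clicked result; no click counts as 0."""
    return mean(
        1 / r["clicked_rank"] if r["clicked_rank"] else 0.0 for r in records
    )


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct percent of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [r["latency_ms"] for r in searches]
report = {
    "ctr": click_through_rate(searches),
    "mrr": mean_reciprocal_rank(searches),
    "latency_avg_ms": mean(latencies),
    "latency_p90_ms": percentile(latencies, 90),
}
```

A low CTR together with a low MRR suggests users either find nothing worth clicking or have to dig deep into the results, which is the kind of signal the reports are meant to surface.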
    17. More Reports in More Detail
         • See "Search Analytics What? Why? How?": http://blog.sematext.com/tag/analytics/
    18. Part Dos
         • Switching gears... Juno digs NoSQL
    19. What We've Built
         • Search Analytics SaaS
             ◦ Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
             ◦ Trending over time
             ◦ Comparisons of time periods
             ◦ Top N reports
             ◦ Filter, slice and dice
    20. Who Needs a Compass?
         • We need it
             ◦ search-hadoop.com & search-lucene.com
         • Our customers need it!
         • You?
    21. Sematext Search Analytics
    22. Big Dreams
         • SaaS
         • Multitenant
         • Large Scale – Massive Data
         • Cloud
    23. Storage Choices
         • RDBMS: MySQL, PostgreSQL
         • HDFS
         • Hive
         • HBase
         • Cassandra
    24. SaaS vs. In-House
         • Question for the audience #2: SaaS vs. in-house Search Analytics?
         • a) SaaS  b) in-house
    25. Sematext Search Analytics
    26. Sematext Search Analytics
    27. Sematext Search Analytics
    28. Sematext Search Analytics
    29. Data Flow
         • See "Search Analytics with Flume and HBase": http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
    30. Data Collection
         • See "Search Analytics with Flume and HBase": http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
    31. Core Tech
         • JavaScript Beacons
         • Metric Capture Web App aka Receiver
         • Flume Agents, Collectors, Sinks
         • HBase
         • MapReduce Aggregations
         • Search Analytics Reporting Web App
    32. What is Flume
         • Distributed data/log collection service
         • Scalable, configurable, extensible
         • Centrally manageable, open source
         • Agents get data from app, Collectors save it
         • Abstractions: Source -> Decorator(s) -> Sink
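The Source -> Decorator(s) -> Sink abstraction can be illustrated with plain Python generators. This is only a sketch of the idea, not Flume's API: every name below (`list_source`, `add_tenant`, `drop_empty`, `ListSink`, `run_flow`) is hypothetical, and a real Flume flow is declared in Flume's own configuration rather than in code like this.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turns raw input lines into events."""
    for line in lines:
        yield {"body": line}


def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filters out events with a blank body."""
    for event in events:
        if event["body"].strip():
            yield event


def add_tenant(tenant: str) -> Decorator:
    """Decorator factory: annotates every event with a tenant id."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            yield {**event, "tenant": tenant}
    return decorate


class ListSink:
    """Sink: stores events in memory (a stand-in for HDFS, HBase, etc.)."""

    def __init__(self) -> None:
        self.stored: List[Event] = []

    def write(self, events: Iterator[Event]) -> None:
        for event in events:
            self.stored.append(event)


def run_flow(source: Iterator[Event], decorators: List[Decorator], sink: ListSink) -> None:
    """Chains the decorators onto the source and drains the result into the sink."""
    stream = source
    for decorate in decorators:
        stream = decorate(stream)
    sink.write(stream)


sink = ListSink()
run_flow(list_source(["q=solr", "   ", "q=hbase"]), [drop_empty, add_tenant("acme")], sink)
```

Because each stage only consumes and yields events, stages can be reordered or swapped without touching the others, which is what makes this kind of pipeline configurable and extensible.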
    33. What is HBase
         • Scalable, reliable, distributed, column-oriented DB
         • On top of HDFS
         • MapReducable
    34. Data Flow, Detailed
    35. Why Flume
         • Reliable delivery
             ◦ e.g. queue msgs locally if destination unreachable
         • Easy, centralized management via Web UI or console
         • Good community, good progress, now @ASF
         • But: more complex, more moving parts
         • On Flume: slideshare.net/cloudera/inside-flume
         • Alternatives: Kafka, Scribe...
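"Queue msgs locally if destination unreachable" can be sketched as a sender that buffers messages and retries them in order. This is an illustration of the behaviour only, under stated assumptions: `BufferingSender` and its `send` callback are hypothetical, the buffer lives in memory (a durable agent would persist it to disk), and the oldest messages are dropped once `max_buffered` is exceeded.

```python
from collections import deque


class BufferingSender:
    """Queues messages locally and delivers them in order when possible."""

    def __init__(self, send, max_buffered=10000):
        # `send` is any callable that raises ConnectionError when the
        # destination is unreachable (a hypothetical transport).
        self._send = send
        # A bounded deque silently drops the OLDEST message when full.
        self._pending = deque(maxlen=max_buffered)

    @property
    def pending(self):
        return len(self._pending)

    def submit(self, message):
        """Buffers the message, then tries to drain the whole queue."""
        self._pending.append(message)
        return self.flush()

    def flush(self):
        """Sends queued messages oldest-first; stops at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # keep the message for the next attempt
            self._pending.popleft()  # remove only after a successful send
        return True


# Usage with a destination that is down at first and then recovers.
delivered = []
destination = {"up": False}


def send(message):
    if not destination["up"]:
        raise ConnectionError("destination unreachable")
    delivered.append(message)


sender = BufferingSender(send)
sender.submit("query=solr")
sender.submit("query=hbase")
queued_while_down = sender.pending
destination["up"] = True
sender.submit("query=flume")
```

Removing a message only after a successful send gives at-least-once delivery: a crash between the send and the removal can produce a duplicate, but not a loss.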
    36. Why HBase
         • Scalable raw & aggregate data storage
         • MapReduce data input
         • Fast scans for time ranges, fast key lookups
         • Easy storage and compute power expansion
         • Good-looking roadmap, community, progress
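Fast time-range scans and key lookups follow from HBase keeping rows sorted by row key. The sketch below imitates that with a sorted in-memory list; it is not the HBase client API and not the schema used in the talk. `row_key`, `SortedStore`, and the `tenant:timestamp` key layout are assumptions made for illustration.

```python
import bisect


def row_key(tenant, epoch_seconds):
    """Tenant prefix plus zero-padded timestamp, so keys sort by time per tenant."""
    return f"{tenant}:{epoch_seconds:010d}"


class SortedStore:
    """Tiny stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Point lookup by exact row key."""
        return self._rows.get(key)

    def scan(self, start_key, stop_key):
        """Range scan: start key inclusive, stop key exclusive."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


store = SortedStore()
store.put(row_key("acme", 100), "solr")
store.put(row_key("acme", 200), "hbase")
store.put(row_key("acme", 300), "flume")
store.put(row_key("other", 150), "lucene")

# All of tenant "acme" between t=100 (inclusive) and t=300 (exclusive).
window = store.scan(row_key("acme", 100), row_key("acme", 300))
```

Because the tenant and the time range are both encoded in the key, a report for one tenant and one period touches only a contiguous slice of rows instead of the whole table.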
    37. Open Sourcing
         • 2 open-source projects:
         • github.com/sematext/HBaseWD
         • github.com/sematext/HBaseHUT
         • See sematext.com/open-source/index.html
         • Patches for Flume and HBase: blog.sematext.com/tag/flume/
    38. Challenges
         • Data size. Solutions:
             ◦ Compression (4-5x smaller with lzo)
             ◦ Data pruning (variable levels)
         • Query string distribution: very long-tail
             ◦ Lots of data to process, update, aggregate
         • Young tools: Flume, HBase
         • Poor IO on EC2
         • Hadoop distributions
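The effect of compression on repetitive log data is easy to measure. The sketch below uses `zlib` from the Python standard library as a stand-in, since LZO is not part of the standard library; the ratio it yields depends on the codec and the data and should not be read as the 4-5x figure from the slide. The log line format in `sample_log` is made up for illustration.

```python
import zlib


def sample_log(lines):
    """Builds a synthetic, repetitive search log (hypothetical format)."""
    rows = (
        f"ts={1318000000 + i} tenant=acme query=solr facets hits={i % 50}"
        for i in range(lines)
    )
    return "\n".join(rows).encode("utf-8")


def compression_ratio(raw, level=6):
    """Uncompressed size divided by compressed size (higher is better)."""
    return len(raw) / len(zlib.compress(raw, level))


raw = sample_log(2000)
ratio = compression_ratio(raw)
```

Search logs compress well because field names, tenant ids, and popular queries repeat on almost every line.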
    39. Output++
         • AutoComplete - $MM improvement
         • Better DYM Spellchecker
         • Related Searches
         • Recommendations
         • Relevance Feedback
         • ...
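As one example of the feedback loop, logged queries can drive AutoComplete directly. The following is a minimal sketch under simple assumptions, not the approach described in the talk: `build_suggester` is a hypothetical helper that ranks suggestions purely by how often a query was logged, with ties broken alphabetically.

```python
from collections import Counter


def build_suggester(logged_queries):
    """Returns a suggest(prefix) function backed by query frequencies."""
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())

    def suggest(prefix, limit=5):
        prefix = prefix.strip().lower()
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most frequent first; alphabetical order breaks ties.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


suggest = build_suggester(
    ["solr", "Solr", "solr facets", "hbase", "solr facets", "solr"]
)
top_for_so = suggest("so")
```

A linear scan over all distinct queries is fine for a sketch; a production suggester would typically use a prefix index and also weigh clicks and zero-hit queries.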
    40. Closing the Loop
         • search users, search providers, search experience
    41. Resource
         • Search Analytics for Your Site, Louis Rosenfeld: http://rosenfeldmedia.com/books/searchanalytics/
    42. We're Hiring
         • Dig Search?
         • Dig Analytics?
         • Dig Big Data?
         • Dig Performance?
         • Dig working with and in open-source?
         • We're hiring world-wide!
         • http://sematext.com/about/jobs.html
    43. Contact
         • sematext.com
         • blog.sematext.com
         • @sematext
         • @otisg
         • [email_address]
         • Want SA? Grab me or go to: sematext.com/search-analytics
         • Hash tags: #stsa or #stanalytics
