Search Analytics Business Value & NoSQL Backend

Published in: Technology
  1. Search Analytics Business Value & NoSQL Backend
     Otis Gospodnetić – Sematext International
     @otisg ◦ @sematext ◦ sematext.com
     sematext.com/search-analytics
  2. About Otis Gospodnetić
     • ASF Member: Lucene, Solr, Nutch, Mahout
     • Author: Lucene in Action 1 & 2
     • Entrepreneur: Sematext, Simpy
     Copyright 2011 Sematext Intl. All rights reserved.
  3. Sematext Metrics
     ● 100% organic: no GMO, no VC
     ● 4 years old
     ● < 10 people
     ● 7 countries
     ● 3 timezones
     ● 2 continents
     ● > 100 customers
  4. About Sematext
     Products & Services
     Consulting, Development, Tech Support:
     ● Search (Lucene, Solr, ElasticSearch...)
     ● Big Data (Hadoop, HBase, Voldemort...)
     ● Web Crawling (Nutch, Droids)
     ● Machine Learning (Mahout)
  5. Agenda
     ● What is Search Analytics and why it matters
     ● Example reports and their value
     ● What we built, why, and how
  6. Communication
     ● twitter.com/sematext
     ● twitter.com/otisg
     ● hash tags: #stsa or #stanalytics
     ● http://sematext.com/search-analytics/index.html
     ● Raise your hand!
     ● otis@sematext.com
  7. The Compass
     Search logs are your Map
     Search Analytics is your Compass
  8. High Level Why
     search users | search experience | search providers
  9. High Level Why
     search users: "This search sucks! It takes 17 tries to find anything here! F!?@#$%^&?!?"
     search experience
     search providers: "Cool, the latest search tweaks made our site really sticky! Awesome!"
  10. Don't Be Like This Dude
  11. Got Clue?
      Performance Monitoring Tuning Search Analytics UI Quality Assurance
  12. More Concrete Why
      ● Measure and monitor everything. Introspection.
      ● Supports (re)design, navigation choices
      ● Helps with content acquisition & enhancement
      ● Improves search experience
      ● Moola
  13. The Moment of Truth
      Question for the audience #1
      What do you use for Search Analytics?
      a) Home grown stuff
      b) Google Analytics
      c) Omniture
      d) Webtrends
      e) Other
      f) Nothing
  14. Search Analytics Outline
      ● Collect: queries & clicks & interactions & ...
      ● Analyze: actions / transactions / conversions
      ● Output: reports – over time
      ● Output++: feedback loop (remember this)
      ● The means, not the goal
      ● Ongoing, not one-off
  15. Search vs. Web Analytics
      ● User intent and information needs vs. inferring
      ● Hand in hand
      ● Ideally you can relate data from both or even unify it
  16. Example Core Reports
      ● Rate & Volume, Latency (average, 90th percentile)
      ● Click Through Rate, Mean Reciprocal Rank
      ● Top Queries by count, clicks, 0 hits...
      ● Query Trending
      ● Top Seen Docs, Top Clicked Docs (msft)
      ● Page & Click Depth
      ● Facet & Sort Usage
      ● ...
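A few of the core reports can be computed directly from collected search events. The following is a minimal sketch, not the Sematext implementation: the `SearchEvent` record and all function names are hypothetical, searches without a click are assumed to contribute 0 to Mean Reciprocal Rank, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SearchEvent:
    """Hypothetical record for one search; field names are assumptions."""
    query: str
    hits: int                           # number of results returned
    latency_ms: float                   # time taken to serve the search
    clicked_rank: Optional[int] = None  # 1-based rank of the first click, if any


def click_through_rate(events: List[SearchEvent]) -> float:
    """Fraction of searches that led to at least one click."""
    if not events:
        return 0.0
    clicked = sum(1 for e in events if e.clicked_rank is not None)
    return clicked / len(events)


def mean_reciprocal_rank(events: List[SearchEvent]) -> float:
    """Average of 1/rank of the first click; searches without a click count as 0."""
    if not events:
        return 0.0
    total = sum(1.0 / e.clicked_rank for e in events if e.clicked_rank is not None)
    return total / len(events)


def latency_percentile(events: List[SearchEvent], p: float) -> float:
    """Nearest-rank percentile of latency, e.g. p=90 for the 90th percentile."""
    if not events:
        raise ValueError("no events")
    ordered = sorted(e.latency_ms for e in events)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def top_zero_hit_queries(events: List[SearchEvent], n: int = 10) -> List[Tuple[str, int]]:
    """Most frequent queries that returned no results."""
    counts = Counter(e.query for e in events if e.hits == 0)
    return counts.most_common(n)
```

A low Click Through Rate combined with a long list of zero-hit queries usually points at missing content or missing synonyms rather than at ranking problems.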
  17. More Reports in More Detail
      ● See Search Analytics What? Why? How?
        http://blog.sematext.com/tag/analytics/
  18. Part Dos
      Switching gears... Juno digs NoSQL
  19. What We've Built
      ● Search Analytics SaaS
        ● Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
        ● Trending over time
        ● Comparisons of time periods
        ● Top N reports
        ● Filter, slice and dice
  20. Who Needs a Compass?
      ● We need it
        ● search-hadoop.com & search-lucene.com
      ● Our customers need it!
      ● You?
  21. Sematext Search Analytics
  22. Big Dreams
      ● SaaS
      ● Multitenant
      ● Large Scale – Massive Data
      ● Cloud
  23. Storage Choices
      ● RDBMS: MySQL, PostgreSQL
      ● HDFS
      ● Hive
      ● HBase
      ● Cassandra
  24. SaaS vs. In-House
      Question for the audience #2
      SaaS vs in-house Search Analytics?
      a) SaaS
      b) in-house
  25. Sematext Search Analytics
  26. Sematext Search Analytics
  27. Sematext Search Analytics
  28. Sematext Search Analytics
  29. Data Flow
      ● See Search Analytics with Flume and HBase
        http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
  30. Data Collection
      ● See Search Analytics with Flume and HBase
        http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
  31. Core Tech
      ● JavaScript Beacons
      ● Metric Capture Web App aka Receiver
      ● Flume Agents, Collectors, Sinks
      ● HBase
      ● MapReduce Aggregations
      ● Search Analytics Reporting Web App
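The first hop in this stack, a JavaScript beacon calling the Receiver, can be pictured as a plain HTTP request whose query string carries the event. The sketch below shows how a receiver might turn such a request URL into an event record. The URL, the parameter names (`t`, `e`, `q`, `h`, `r`) and the `parse_beacon` function are assumptions made for illustration; the slides do not describe the actual beacon format.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def parse_beacon(url: str) -> dict:
    """Turn a beacon request URL into an event dict.

    Assumed (hypothetical) parameters:
      t = tenant id, e = event type, q = query string,
      h = number of hits, r = 1-based rank of the clicked result.
    """
    params = parse_qs(urlparse(url).query)

    def first(name: str, default: Optional[str] = None) -> Optional[str]:
        values = params.get(name)
        return values[0] if values else default

    rank = first("r")
    return {
        "tenant": first("t"),
        "type": first("e", "search"),
        "query": first("q", ""),
        "hits": int(first("h", "0")),
        "rank": int(rank) if rank is not None else None,
    }
```

For example, `parse_beacon("http://receiver.example.com/b.gif?t=acme&e=click&q=hbase+scan&r=3")` yields a click event for tenant `acme` on the query `hbase scan` at rank 3. A real receiver would also validate input and hand the event to the collection layer.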
  32. What is Flume
      ● Distributed data/log collection service
      ● Scalable, configurable, extensible
      ● Centrally manageable, open source
      ● Agents get data from app, Collectors save it
      ● Abstractions: Source → Decorator(s) → Sink
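The Source → Decorator(s) → Sink abstraction can be illustrated with a few lines of plain Python. This is a conceptual sketch only and does not use Flume's API or configuration syntax; `source`, `drop_empty`, `add_host`, `ListSink` and `run` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List

Event = dict
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turns raw log lines into events."""
    for line in lines:
        yield {"body": line.strip()}


def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filters out events with an empty body."""
    return (e for e in events if e["body"])


def add_host(host: str) -> Decorator:
    """Decorator factory: tags every event with the originating host."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for e in events:
            yield {**e, "host": host}
    return decorate


class ListSink:
    """Sink: stores events in memory; a real sink would write to HDFS or HBase."""
    def __init__(self) -> None:
        self.stored: List[Event] = []

    def write(self, events: Iterator[Event]) -> None:
        for e in events:
            self.stored.append(e)


def run(src: Iterator[Event], decorators: List[Decorator], sink: ListSink) -> None:
    """Chain the decorators between the source and the sink, in order."""
    stream = src
    for decorator in decorators:
        stream = decorator(stream)
    sink.write(stream)
```

Because each stage only consumes and produces a stream of events, stages can be added, removed or reordered without touching the others, which is the appeal of the abstraction.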
  33. What is HBase
      ● Scalable, reliable, distributed, column-oriented DB
      ● On top of HDFS
      ● MapReducable
  34. Data Flow, Detailed
  35. Why Flume
      ● Reliable delivery
        ● e.g. queue msgs locally if destination unreachable
      ● Easy, centralized management via Web UI or console
      ● Good community, good progress, now @ASF
      ● But: more complex, more moving parts
      ● On Flume: slideshare.net/cloudera/inside-flume
      ● Alternatives: Kafka, Scribe...
  36. Why HBase
      ● Scalable raw & aggregate data storage
      ● MapReduce data input
      ● Fast scans for time ranges, fast key lookups
      ● Easy storage and compute power expansion
      ● Good looking roadmap, community, progress
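Fast time-range scans and fast key lookups both follow from HBase keeping rows sorted by row key. The sketch below imitates that behaviour with an in-memory sorted table instead of a real HBase client. The key layout (tenant, separator, zero-padded timestamp) is an assumption chosen for illustration and is not the schema used by Sematext.

```python
import bisect
from typing import List, Optional, Tuple


def row_key(tenant: str, epoch_seconds: int) -> bytes:
    """Hypothetical key layout: zero-padding makes byte order match time order."""
    return f"{tenant}|{epoch_seconds:010d}".encode("ascii")


class SortedTable:
    """In-memory stand-in for a table whose rows are kept sorted by key."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._rows: dict = {}

    def put(self, key: bytes, value: str) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key: bytes) -> Optional[str]:
        """Point lookup by exact key."""
        return self._rows.get(key)

    def scan(self, start: bytes, stop: bytes) -> List[Tuple[bytes, str]]:
        """Rows with start <= key < stop, located by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]
```

With this layout, all events for one tenant in a time window form one contiguous key range, so a report for "tenant acme, last hour" reads only the rows it needs.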
  37. Open Sourcing
      ● 2 open-source projects:
        github.com/sematext/HBaseWD
        github.com/sematext/HBaseHUT
      ● See sematext.com/open-source/index.html
      ● Patches for Flume and HBase
        blog.sematext.com/tag/flume/
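The slide only names the projects, so the following rests on an assumption: that HBaseWD addresses the write hotspot caused by sequential row keys such as timestamps. The sketch shows the general key-prefixing ("salting") technique for that problem in plain Python. It is not HBaseWD's API, and `salted_key`, `original_key` and `distributed_scan` are hypothetical names.

```python
import heapq
import zlib
from typing import Dict, List, Tuple

BUCKETS = 4  # assumed number of key prefixes (write targets)


def salted_key(key: bytes, buckets: int = BUCKETS) -> bytes:
    """Prefix the key with a one-byte bucket id derived from a hash of the key."""
    bucket = zlib.crc32(key) % buckets
    return bytes([bucket]) + key


def original_key(salted: bytes) -> bytes:
    """Strip the one-byte bucket prefix."""
    return salted[1:]


def distributed_scan(store: Dict[bytes, str], start: bytes, stop: bytes,
                     buckets: int = BUCKETS) -> List[Tuple[bytes, str]]:
    """Scan start <= key < stop by scanning every bucket and merging the results."""
    per_bucket = []
    for b in range(buckets):
        prefix = bytes([b])
        lo, hi = prefix + start, prefix + stop
        keys = sorted(k for k in store if lo <= k < hi)
        per_bucket.append([(original_key(k), store[k]) for k in keys])
    # Each per-bucket list is sorted, so a merge restores global key order.
    return list(heapq.merge(*per_bucket, key=lambda kv: kv[0]))
```

The trade-off is that a single range scan becomes one scan per bucket, in exchange for writes that no longer all land on the same region.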
  38. Challenges
      ● Data size. Solutions:
        ● Compression (4-5x smaller with LZO)
        ● Data pruning (variable levels)
      ● Query string distribution: very long-tail
        ● Lots of data to process, update, aggregate
      ● Young tools: Flume, HBase
      ● Poor IO on EC2
      ● Hadoop distributions
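How long the tail of a query log is can be quantified with two simple numbers: the share of search volume covered by the most frequent queries, and the share of distinct queries seen only once. A small sketch, with `tail_stats` and its parameters being hypothetical names:

```python
from collections import Counter
from typing import Dict, Iterable


def tail_stats(queries: Iterable[str], head_size: int = 10) -> Dict[str, float]:
    """Summarize how concentrated or long-tailed a query log is."""
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        return {"distinct": 0, "head_share": 0.0, "singleton_share": 0.0}
    # Volume covered by the `head_size` most frequent queries.
    head_volume = sum(c for _, c in counts.most_common(head_size))
    # Distinct queries that occur exactly once.
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "distinct": len(counts),
        "head_share": head_volume / total,
        "singleton_share": singletons / len(counts),
    }
```

A high `singleton_share` means most distinct query strings are rare, which is what makes per-query aggregates expensive to store and update.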
  39. Output++
      ● AutoComplete - $MM improvement
      ● Better DYM Spellchecker
      ● Related Searches
      ● Recommendations
      ● Relevance Feedback
      ● ...
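AutoComplete is the most direct example of feeding analytics back into search: past queries become suggestions. The sketch below ranks logged queries by frequency and filters them by prefix. It is an illustration under simple assumptions (case-insensitive matching, ties broken alphabetically), not the product's algorithm, and `build_suggester` is a hypothetical name.

```python
from collections import Counter
from typing import Callable, Iterable, List


def build_suggester(queries: Iterable[str], min_count: int = 1) -> Callable[..., List[str]]:
    """Build a prefix suggester from logged queries, most frequent first."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    ranked = sorted(
        ((q, c) for q, c in counts.items() if c >= min_count),
        key=lambda qc: (-qc[1], qc[0]),  # by count descending, then alphabetically
    )

    def suggest(prefix: str, limit: int = 5) -> List[str]:
        p = prefix.strip().lower()
        return [q for q, _ in ranked if q.startswith(p)][:limit]

    return suggest
```

Raising `min_count` keeps rare or mistyped queries out of the suggestions, and excluding queries that returned zero hits would be a natural next filter.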
  40. Closing the Loop
      search users | search experience | search providers
  41. Resource
      Search Analytics for Your Site
      Louis Rosenfeld
      http://rosenfeldmedia.com/books/searchanalytics/
  42. We're Hiring
      Dig Search?
      Dig Analytics?
      Dig Big Data?
      Dig Performance?
      Dig working with and in open-source?
      We're hiring world-wide!
      http://sematext.com/about/jobs.html
  43. Contact
      sematext.com
      blog.sematext.com
      @sematext
      @otisg
      otis@sematext.com
      Want SA? Grab me or go to: sematext.com/search-analytics
      Hash tags: #stsa or #stanalytics
