Otis Gospodnetić: Search Analytics, Lucene Eurocon 2011
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

Published in: Technology
  • 10 days of data (5K/min)
  • Transcript

    • 1. Search Analytics: Business Value & NoSQL Backend. Otis Gospodnetić, Sematext International. @otisg ◦ @sematext ◦ sematext.com ◦ sematext.com/search-analytics
    • 2. About Otis Gospodnetić
      • ASF Member: Lucene, Solr, Nutch, Mahout
      • Author: Lucene in Action 1 & 2
      • Entrepreneur: Sematext, Simpy
    • 3. Sematext Metrics
      • 100% organic: no GMO, no VC
      • 4 years old
      • < 10 people
      • 7 countries
      • 3 timezones
      • 2 continents
      • > 100 customers
    • 4. About Sematext
      • Products & Services
      • Consulting, Development, Tech Support:
        • Search (Lucene, Solr, ElasticSearch...)
        • Big Data (Hadoop, HBase, Voldemort...)
        • Web Crawling (Nutch, Droids)
        • Machine Learning (Mahout)
    • 5. Agenda
      • What is Search Analytics and why it matters
      • Example reports and their value
      • What we built, why, and how
    • 6. Communication
      • twitter.com/sematext
      • twitter.com/otisg
      • Hash tags: #stsa or #stanalytics
      • http://sematext.com/search-analytics/index.html
      • Raise your hand!
      • otis@sematext.com
    • 7. The Compass
      • Search logs are your Map
      • Search Analytics is your Compass
    • 8. High Level Why (diagram labels: search users, search providers, search experience)
    • 9. High Level Why (diagram labels: search providers, search experience, search users)
      • "This search sucks! It takes 17 tries to find anything here! F!?@#$%^&?!?"
      • "Cool, the latest search tweaks made our site really sticky! Awesome!"
    • 10. Don't Be Like This Dude
    • 11. Got Clue? Search Analytics Performance Monitoring Quality Assurance Tuning UI
    • 12. More Concrete Why
      • Measure and monitor everything. Introspection.
      • Supports (re)design, navigation choices
      • Helps with content acquisition & enhancement
      • Improve search experience
      • Mula
    • 13. The Moment of Truth
      • Question for the audience #1: What do you use for Search Analytics?
        • a) Home grown stuff
        • b) Google Analytics
        • c) Omniture
        • d) Webtrends
        • e) Other
        • f) Nothing
    • 14. Search Analytics Outline
      • Collect: queries & clicks & interactions & ...
      • Analyze: actions / xactions / conversions
      • Output: reports, over time
      • Output++: feedback loop
      • The means, not the goal
      • Ongoing, not one-off
      • (callout: remember this)
    • 15. Search vs. Web Analytics
      • User intent and information needs vs. inferring
      • Hand in hand
      • Ideally you can relate data from both or even unify it
    • 16. Example Core Reports
      • Rate & Volume, Latency (mean, avg, 90%)
      • Click Through Rate, Mean Reciprocal Rank
      • Top Queries by count, clicks, 0 hits...
      • Query Trending
      • Top Seen Docs, Top Clicked Docs (msft)
      • Page & Click Depth
      • Facet & Sort Usage
      • ...
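Three of the core reports named on slide 16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Search` record and its fields are hypothetical (the talk does not show a schema), "90%" is read as the 90th percentile, and the percentile uses the nearest-rank method, which may differ from what the Sematext product computes.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Search:
    """One logged search. Hypothetical schema, not taken from the talk."""
    query: str
    latency_ms: float
    clicked_rank: Optional[int]  # 1-based rank of the first clicked result, None if no click


def click_through_rate(searches: List[Search]) -> float:
    """Fraction of searches that led to at least one click."""
    if not searches:
        return 0.0
    clicked = sum(1 for s in searches if s.clicked_rank is not None)
    return clicked / len(searches)


def mean_reciprocal_rank(searches: List[Search]) -> float:
    """Mean of 1/rank of the first click; searches without a click contribute 0."""
    if not searches:
        return 0.0
    total = sum(1.0 / s.clicked_rank for s in searches if s.clicked_rank is not None)
    return total / len(searches)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample log for illustration only.
log = [
    Search("solr facets", 40.0, 1),
    Search("hbase scan", 55.0, 2),
    Search("flume sink", 120.0, None),
    Search("lucene", 35.0, 4),
]
ctr = click_through_rate(log)
mrr = mean_reciprocal_rank(log)
p90 = percentile([s.latency_ms for s in log], 90)
```

A low MRR alongside a healthy click-through rate would suggest that users do find something, but only far down the result list.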
    • 17. More Reports in More Detail
      • See "Search Analytics: What? Why? How?" http://blog.sematext.com/tag/analytics/
    • 18. Part Dos
      • Switching gears... Juno digs NoSQL
    • 19. What We've Built
      • Search Analytics SaaS
        • Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
        • Trending over time
        • Comparisons of time periods
        • Top N reports
        • Filter, slice and dice
    • 20. Who Needs a Compass?
      • We need it
        • search-hadoop.com & search-lucene.com
      • Our customers need it!
      • You?
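A Top N report, including the "0 hits" variant mentioned among the core reports, might be sketched as below. The `(query, hits)` input shape, the normalization rule, and the sample data are assumptions made for illustration, not a description of how the Sematext service is implemented.

```python
from collections import Counter
from typing import List, Tuple

# Each entry is (query string, number of hits returned). Hypothetical format.
QueryLog = List[Tuple[str, int]]


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries are counted together."""
    return " ".join(query.lower().split())


def top_queries(log: QueryLog, n: int) -> List[Tuple[str, int]]:
    """The n most frequent queries with their counts."""
    counts = Counter(normalize(q) for q, _ in log)
    return counts.most_common(n)


def top_zero_hit_queries(log: QueryLog, n: int) -> List[Tuple[str, int]]:
    """The n most frequent queries that returned no results."""
    counts = Counter(normalize(q) for q, hits in log if hits == 0)
    return counts.most_common(n)


# Made-up sample data.
sample = [
    ("Solr", 12),
    ("solr ", 9),
    ("hbase", 4),
    ("elasticsearch sharding", 0),
    ("Elasticsearch  sharding", 0),
    ("flume", 0),
    ("solr", 7),
]
top2 = top_queries(sample, 2)
zero1 = top_zero_hit_queries(sample, 1)
```

Frequent zero-hit queries are the ones that point at missing content or missing synonyms, which is where the "content acquisition & enhancement" benefit comes from.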
    • 21. Sematext Search Analytics
    • 22. Big Dreams
      • SaaS
      • Multitenant
      • Large Scale: Massive Data
      • Cloud
    • 23. Storage Choices
      • RDBMS: MySQL, PostgreSQL
      • HDFS
      • Hive
      • HBase
      • Cassandra
    • 24. SaaS vs. In-House
      • Question for the audience #2: SaaS vs in-house Search Analytics?
        • a) SaaS
        • b) in-house
    • 25. Sematext Search Analytics
    • 26. Sematext Search Analytics
    • 27. Sematext Search Analytics
    • 28. Sematext Search Analytics
    • 29. Data Flow
      • See "Search Analytics with Flume and HBase" http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
    • 30. Data Collection
      • See "Search Analytics with Flume and HBase" http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
    • 31. Core Tech
      • JavaScript Beacons
      • Metric Capture Web App aka Receiver
      • Flume Agents, Collectors, Sinks
      • HBase
      • MapReduce Aggregations
      • Search Analytics Reporting Web App
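The "MapReduce Aggregations" step can be pictured with a small in-memory map, shuffle, and reduce over raw events. This is a sketch of the general pattern only: it does not use Hadoop, and the event shape `(unix_seconds, query)` and the hourly bucketing are assumptions, not details from the talk.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[int, str]   # (unix_seconds, query): hypothetical raw event
Key = Tuple[str, str]     # (hour bucket, normalized query)


def map_event(event: Event) -> Iterator[Tuple[Key, int]]:
    """Map phase: emit ((hour, query), 1) for every raw event."""
    ts, query = event
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H")
    yield (hour, query.lower()), 1


def shuffle(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    grouped: Dict[Key, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: Key, values: List[int]) -> Tuple[Key, int]:
    """Reduce phase: sum the counts for one key."""
    return key, sum(values)


def aggregate(events: Iterable[Event]) -> Dict[Key, int]:
    """Run map, shuffle, and reduce in memory and return hourly query counts."""
    mapped = (pair for event in events for pair in map_event(event))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Made-up events: three in the first hour of the epoch, one in the second.
events = [(0, "Solr"), (1800, "solr"), (3599, "hbase"), (3600, "solr")]
hourly = aggregate(events)
```

Pre-aggregating like this is what lets a reporting application read small per-hour rows instead of rescanning raw events for every chart.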
    • 32. What is Flume
      • Distributed data/log collection service
      • Scalable, configurable, extensible
      • Centrally manageable, open source
      • Agents get data from app, Collectors save it
      • Abstractions: Source -> Decorator(s) -> Sink
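The Source -> Decorator(s) -> Sink abstraction can be illustrated with plain Python generators. To be clear, this is not Flume's API or configuration syntax: every name below (`list_source`, `add_host`, `drop_empty`, `MemorySink`, `run_pipeline`) is hypothetical and exists only to show how events flow through the three roles.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turn raw lines into events."""
    for line in lines:
        yield {"body": line}


def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filter out events with a blank body."""
    for event in events:
        if event["body"].strip():
            yield event


def add_host(host: str) -> Decorator:
    """Decorator factory: annotate each event with the originating host."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            yield {**event, "host": host}
    return decorate


class MemorySink:
    """Sink: store events in a list (a real sink would write to storage)."""

    def __init__(self) -> None:
        self.stored: List[Event] = []

    def write(self, events: Iterator[Event]) -> None:
        for event in events:
            self.stored.append(event)


def run_pipeline(source: Iterator[Event], decorators: List[Decorator], sink: MemorySink) -> None:
    """Chain the decorators over the source in order and drain the result into the sink."""
    stream = source
    for decorator in decorators:
        stream = decorator(stream)
    sink.write(stream)


sink = MemorySink()
run_pipeline(list_source(["q=solr", "", "q=hbase"]), [drop_empty, add_host("web-1")], sink)
```

Because each stage only consumes and yields events, stages can be added or reordered without touching the source or the sink, which is the appeal of the abstraction.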
    • 33. What is HBase
      • Scalable, reliable, distributed, column-oriented DB
      • On top of HDFS
      • MapReducable
    • 34. Data Flow, Detailed
    • 35. Why Flume
      • Reliable delivery
        • e.g. queue msgs locally if destination unreachable
      • Easy, centralized management via Web UI or console
      • Good community, good progress, now @ASF
      • But: more complex, more moving parts
      • On Flume: slideshare.net/cloudera/inside-flume
      • Alternatives: Kafka, Scribe...
    • 36. Why HBase
      • Scalable raw & aggregate data storage
      • MapReduce data input
      • Fast scans for time ranges, fast key lookups
      • Easy storage and compute power expansion
      • Good looking roadmap, community, progress
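"Fast scans for time ranges" follows from rows being stored sorted by key: if the key starts with a tenant id and continues with a big-endian timestamp, one tenant's events for a time window form a contiguous key range. The sketch below imitates that with a sorted in-memory list. The key layout and the `SortedTable` class are assumptions for illustration; the talk does not describe Sematext's actual schema, and no HBase client calls are used.

```python
import bisect
import struct
from typing import List, Tuple


def row_key(tenant_id: int, unix_seconds: int) -> bytes:
    """4-byte tenant id + 8-byte timestamp, both big-endian, so byte order matches (tenant, time) order."""
    return struct.pack(">IQ", tenant_id, unix_seconds)


class SortedTable:
    """In-memory stand-in for a table whose rows are kept sorted by key."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._values: List[str] = []

    def put(self, key: bytes, value: str) -> None:
        """Insert a row at its sorted position."""
        i = bisect.bisect_left(self._keys, key)
        self._keys.insert(i, key)
        self._values.insert(i, value)

    def scan(self, start: bytes, stop: bytes) -> List[Tuple[bytes, str]]:
        """Return rows with start <= key < stop, located by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


table = SortedTable()
table.put(row_key(1, 300), "tenant1: lucene")
table.put(row_key(2, 150), "tenant2: flume")
table.put(row_key(1, 100), "tenant1: solr")
table.put(row_key(1, 200), "tenant1: hbase")

# All of tenant 1's events with 150 <= timestamp <= 300.
window = [value for _, value in table.scan(row_key(1, 150), row_key(1, 301))]
```

One known trade-off of purely time-ordered keys is that new writes all land at the end of the key range, so real designs often add a prefix or salt to spread the load.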
    • 37. Open Sourcing
      • 2 open-source projects:
        • github.com/sematext/HBaseWD
        • github.com/sematext/HBaseHUT
      • See sematext.com/open-source/index.html
      • Patches for Flume and HBase: blog.sematext.com/tag/flume/
    • 38. Challenges
      • Data size. Solutions:
        • Compression (4-5x smaller with lzo)
        • Data pruning (variable levels)
      • Query string distribution: very long-tail
        • Lots of data to process, update, aggregate
      • Young tools: Flume, HBase
      • Poor IO on EC2
      • Hadoop distributions
    • 39. Output++
      • AutoComplete: $MM improvement
      • Better DYM Spellchecker
      • Related Searches
      • Recommendations
      • Relevance Feedback
      • ...
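As one example of the feedback loop, logged queries can feed an AutoComplete feature by suggesting the most frequent past queries that start with what the user has typed. The following is a minimal sketch under that assumption; the function names and sample queries are hypothetical, and a production system would likely use a prefix index rather than a linear scan.

```python
from collections import Counter
from typing import Iterable, List


def build_counts(queries: Iterable[str]) -> Counter:
    """Count normalized queries (lowercased, whitespace collapsed) from the search log."""
    return Counter(" ".join(q.lower().split()) for q in queries)


def suggest(counts: Counter, prefix: str, limit: int = 5) -> List[str]:
    """Return up to `limit` logged queries starting with `prefix`, most frequent first."""
    wanted = prefix.lower().lstrip()
    matches = [(q, c) for q, c in counts.items() if q.startswith(wanted)]
    # Sort by descending count, then alphabetically so ties are deterministic.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [q for q, _ in matches[:limit]]


# Made-up query log.
counts = build_counts([
    "solr", "Solr", "solr facets", "solr facets", "solr facets",
    "solr cloud", "hbase",
])
suggestions = suggest(counts, "sol", limit=2)
```

Filtering the log to queries that returned hits and received clicks before building the counts would keep the suggestions from steering users toward dead ends.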
    • 40. Closing the Loop (diagram labels: search users, search providers, search experience)
    • 41. Resource: Search Analytics for Your Site, Louis Rosenfeld. http://rosenfeldmedia.com/books/searchanalytics/
    • 42. We're Hiring
      • Dig Search?
      • Dig Analytics?
      • Dig Big Data?
      • Dig Performance?
      • Dig working with and in open-source?
      • We're hiring world-wide!
      • http://sematext.com/about/jobs.html
    • 43. Contact
      • sematext.com
      • blog.sematext.com
      • @sematext
      • @otisg
      • [email_address]
      • Want SA? Grab me or go to: sematext.com/search-analytics
      • Hash tags: #stsa or #stanalytics
