Search Analytics with Flume & HBase
Otis Gospodnetić ••• Sematext International
Published in: Technology
  • 10 days of data (5K/min)
  • Flume is used simply to collect logs to a central place (HDFS) from multiple agents. But at the end we still have a single log file that something (raw log importer) then needs to process. No HBase is involved directly with Flume here and there is no HBase sink in this scenario.
  • Making use of Flume's ability to plug in different Sinks, so instead of just collecting data to a log file on HDFS, we hook up the FLUME-247 Sink to Flume and make it write directly to HBase.
  • 2h, 2K/min, 1sys (240K actions, 43 MB of input data):
      • 1193 MB: no prune, no compress
      • 624 MB: prune sort index only, no compress
      • 408 MB: prune, no compress
      • 196 MB: no prune, compress
      • 106 MB: prune sort index only, compress
      • 64 MB: prune, compress
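
The sizes in the note above can be turned into reduction factors. The following is only a quick calculation over the figures quoted in the note; the dictionary layout and function names are introduced here for illustration and are not from the slides.

```python
# Sizes in MB exactly as quoted in the slide note,
# keyed by (pruning level, whether compression was on).
sizes_mb = {
    ("no prune", False): 1193,
    ("prune sort index only", False): 624,
    ("prune", False): 408,
    ("no prune", True): 196,
    ("prune sort index only", True): 106,
    ("prune", True): 64,
}

# The unpruned, uncompressed figure is the reference point.
baseline_mb = sizes_mb[("no prune", False)]


def reduction_factor(pruning, compressed):
    """How many times smaller a setting is than the unpruned, uncompressed figure."""
    return baseline_mb / sizes_mb[(pruning, compressed)]


def compression_ratio(pruning):
    """Effect of compression alone at a fixed pruning level."""
    return sizes_mb[(pruning, False)] / sizes_mb[(pruning, True)]


for (pruning, compressed), size in sizes_mb.items():
    label = "compress" if compressed else "no compress"
    factor = reduction_factor(pruning, compressed)
    print(f"{size:>5} MB  {pruning}, {label}: {factor:.1f}x smaller than baseline")
```

On these figures, compression alone shrinks each pruning level by roughly 6x, and pruning plus compression together bring the size down by about 18.6x.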
  • Transcript of "Search Analytics with Flume and HBase"

    Search Analytics with Flume & HBase
      Otis Gospodnetić ••• Sematext International
    Agenda
      • Who I am
      • What, Why, How
      • Architecture Evolution
      • Role of Flume and HBase + Flume HBase Sink
      • Challenges
    About Otis Gospodnetić
      • Lucene/Solr/Nutch/Mahout committer
      • Lucene in Action 1 & 2 co-author
      • Lucene Consulting since 2005
      • Sematext Int'l since 2007
    About Sematext
      Consulting, development, support for:
      • Big Data (Hadoop, HBase, Voldemort...)
      • Search (Lucene, Solr, Elastic Search...)
      • Web Crawling (Nutch)
      • Machine Learning (Mahout)
    What We Built
      • Analytics for Search
          • Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
          • Trending over time
          • Comparisons of time periods
          • Top N reports
          • Various report filters
    Report Example
    Why We Built it
      • We need it
          • search-hadoop.com & search-lucene.com
      • Search customers need it
          • Want to know what their visitors are searching for
          • Want to know how their search is behaving
          • …
      subliminal msg: go use this site
    How We Built it
      • JavaScript Beacons
      • Metric Capture Web App
      • Data Capture Mechanisms
          • Custom Log4J Appender
          • Flume Agents, Collectors, Sinks
      • HBase
      • MapReduce Aggregations
      • Search Analytics Reporting Web App
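
To make the "MapReduce Aggregations" step concrete, here is a minimal single-process sketch of the map, shuffle, and reduce shape such a job could take when producing a Top N report. The event fields (`query`, `latency_ms`) and all function names are assumptions made for illustration; the slides do not show the actual job or its schema.

```python
from collections import defaultdict

# Hypothetical raw search events; field names are assumptions, not the real schema.
raw_events = [
    {"query": "hbase sink", "latency_ms": 120},
    {"query": "flume", "latency_ms": 80},
    {"query": "hbase sink", "latency_ms": 100},
    {"query": "lucene", "latency_ms": 60},
    {"query": "flume", "latency_ms": 90},
    {"query": "hbase sink", "latency_ms": 110},
]


def map_phase(event):
    """Emit (query, (count, latency)) for one raw event."""
    yield event["query"], (1, event["latency_ms"])


def reduce_phase(query, values):
    """Combine all values for one query into a single aggregate row."""
    count = sum(c for c, _ in values)
    total_latency = sum(latency for _, latency in values)
    return {"query": query, "count": count, "avg_latency_ms": total_latency / count}


def run_job(events):
    """Run map, group by key (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for event in events:
        for key, value in map_phase(event):
            grouped[key].append(value)
    return [reduce_phase(key, values) for key, values in grouped.items()]


def top_n(aggregates, n):
    """Return the n aggregate rows with the highest query counts."""
    return sorted(aggregates, key=lambda row: row["count"], reverse=True)[:n]


for row in top_n(run_job(raw_events), 2):
    print(row)
```

In a real deployment the grouping would be done by the MapReduce framework across machines, and the aggregate rows would be written back to a store for the reporting web app to read.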
    What's Flume
      • Distributed data/log collection service
      • Scalable, configurable, extensible
      • Centrally manageable, open source
      • Agents get data from app, Collectors save it
      • Abstractions: Source -> Decorator(s) -> Sink
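
The Source -> Decorator(s) -> Sink abstraction can be pictured as a small chain of functions. The sketch below is a plain Python analogy, not Flume's API: `list_source`, `add_host`, `drop_empty`, `ListSink`, and `run_flow` are all hypothetical names chosen for illustration.

```python
def list_source(lines):
    """Source: produce one event (a dict) per input line."""
    for line in lines:
        yield {"body": line}


def drop_empty(events):
    """Decorator: pass through only events with a non-blank body."""
    for event in events:
        if event["body"].strip():
            yield event


def add_host(host):
    """Decorator factory: build a decorator that tags each event with a host name."""
    def decorate(events):
        for event in events:
            yield {**event, "host": host}
    return decorate


class ListSink:
    """Sink: collect events in memory (a real sink would write to HDFS, HBase, etc.)."""

    def __init__(self):
        self.received = []

    def append(self, event):
        self.received.append(event)


def run_flow(source, decorators, sink):
    """Wire source -> decorators (in order) -> sink and drain the source."""
    events = source
    for decorate in decorators:
        events = decorate(events)
    for event in events:
        sink.append(event)


sink = ListSink()
run_flow(list_source(["q=hbase", "", "q=flume"]), [drop_empty, add_host("web-1")], sink)
print(sink.received)
```

The point of the abstraction is that the sink can be swapped without touching the source or the decorators, which is what makes plugging in an HBase sink possible later in the deck.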
    What's HBase
      • Scalable, reliable, distributed, column-oriented DB
      • On top of HDFS
      • MapReducable
    High Level Architecture
    Architecture #1
    Architecture #1 - Getting Messy
    Arch #2 – HBaseLog4JAppender
    HBaseLog4JAppender Cons
      • Doesn't help with reliable delivery
          • e.g. when network or HBase down
      • Non-centralized config with larger clusters
          • e.g. changing destination table in HBase
          • e.g. changing sampling rate
    Architecture #3 – Flume OOTB
    Arch #4 – Flume HBase Sink
    FLUME-247 – Flume HBase sink
      • Contributed by Sematext in September 2010
      • Reviewed, pending commit
      • Similar to FLUME-6 (basic example), but more flexible
      • https://issues.cloudera.org/browse/FLUME-247
    Walk-Through
      • Start EC2 micro instance, configure logs-generation tool to simulate user actions
      • User actions start getting logged to a log file
      • Configure Flume Agent to "tail" the generated logs and send data to Flume Collector
      • Collector processes log messages and sends them to HBase's "raw logs" table
      • Later these logs are processed by the MapReduce job
      Search Action -> Metric Capture -> Log File -> Flume Agent -> Flume Collector -> Decorators -> HBase Sink -> HBase
      • Decorator: processes Flume Collector log events and prepares them for HBase
      • HBase sink: FLUME-247
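
The decorator step in the walk-through, "prepares them for HBase", amounts to turning a log line into a row key plus column values. The sketch below shows one way that could look. The log format (tab-separated `key=value` pairs), the field names, the `raw` column family, and the row key layout are all assumptions for illustration; the slides do not specify any of them.

```python
def parse_log_line(line):
    """Split an assumed tab-separated key=value log line into a dict of fields."""
    fields = {}
    for part in line.rstrip("\n").split("\t"):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


def to_put(fields, family="raw"):
    """Shape parsed fields as (row_key, {"family:qualifier": value}).

    The key fields (site, ts, session) go into the row key; everything
    else becomes a column in the given (hypothetical) column family.
    """
    key_fields = ("site", "ts", "session")
    row_key = "|".join(fields[name] for name in key_fields)
    columns = {
        f"{family}:{name}": value
        for name, value in fields.items()
        if name not in key_fields
    }
    return row_key, columns


line = "ts=1290000000\tsite=search-hadoop.com\tsession=abc\tquery=flume hbase\thits=42"
row_key, columns = to_put(parse_log_line(line))
print(row_key)
print(columns)
```

Keeping this parsing in a decorator rather than in the sink means the sink only has to know how to write a key and a set of columns.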
    Why Flume
      • Reliable delivery
          • e.g. queue msgs locally if destination unreachable
      • Easy, centralized management via Web UI or console
      • Good community, good progress
      • But: more complex, more moving parts
      • On Flume: slideshare.net/cloudera/inside-flume
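
The "queue msgs locally if destination unreachable" idea can be illustrated with a small buffering sender. This is a simplified in-memory sketch of the general technique, not Flume's implementation; `BufferingSender` and `FlakyDestination` are hypothetical names, and a production system would persist the queue to disk so messages survive a restart.

```python
from collections import deque


class BufferingSender:
    """Queue messages locally and deliver them in order once the destination is reachable."""

    def __init__(self, send):
        # `send` is any callable that raises ConnectionError when delivery fails.
        self._send = send
        self._pending = deque()

    def submit(self, message):
        """Queue a message, then try to deliver everything that is waiting."""
        self._pending.append(message)
        self.flush()

    def flush(self):
        """Deliver queued messages oldest first; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # keep the message queued for a later retry
            self._pending.popleft()
        return True

    @property
    def pending(self):
        return len(self._pending)


class FlakyDestination:
    """Test stand-in for a destination that can be switched off."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, message):
        if not self.up:
            raise ConnectionError("destination unreachable")
        self.received.append(message)


dest = FlakyDestination()
sender = BufferingSender(dest.send)
sender.submit("a")
dest.up = False
sender.submit("b")   # stays queued locally
dest.up = True
sender.flush()       # "b" is delivered now
print(dest.received, sender.pending)
```

A message is removed from the queue only after the send call returns without error, so a failure mid-delivery never drops it.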
    Why HBase
      • Scalable raw search data storage
      • MapReduce data input
      • Scalable aggregate data storage
      • Fast scans for time ranges, fast key lookups
      • Easy storage and compute power expansion
      • Good-looking roadmap, community, progress
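
"Fast scans for time ranges, fast key lookups" follows from HBase keeping rows sorted by row key: if the timestamp is part of the key, a time range is a contiguous slice. The toy model below illustrates that idea with a sorted list; `SortedTable` and the `site|timestamp` key layout are assumptions for illustration, not the schema used in the talk and not the HBase client API.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> row data

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def get(self, key):
        """Point lookup by exact row key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Yield (key, row) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


def row_key(site, epoch_seconds):
    """Build a key whose string order matches time order within a site.

    Zero-padding the timestamp makes lexicographic and numeric order agree.
    """
    return f"{site}|{epoch_seconds:010d}"


table = SortedTable()
table.put(row_key("a", 300), {"query": "solr"})
table.put(row_key("a", 100), {"query": "flume"})
table.put(row_key("b", 150), {"query": "nutch"})
table.put(row_key("a", 200), {"query": "hbase"})

# All events for site "a" with 100 <= ts < 300.
for key, row in table.scan(row_key("a", 100), row_key("a", 300)):
    print(key, row)
```

Putting the site before the timestamp keeps each site's events together; the trade-off is that a scan across all sites for one time window would need one range per site.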
    Challenges
      • “HBase in a box” is like “dynamic equilibrium”, or “virtual reality”, or “jumbo shrimp” – search-hadoop.com/m/p68C12nb7Hn
      • Data size. Solutions:
          • Compression (4-5x smaller with lzo)
          • Data pruning (variable levels)
      • Query string distribution: very long-tail
          • Lots of data to process, update, aggregate
    Work @ Sematext
      We are hiring world-wide!
      • Search & Data Analytics
      • Machine Learning & NLP
      • Biiig Data
    Contact
      • sematext.com
      • blog.sematext.com
      • @sematext
      • @otisg
      • [email_address]