Advertisement
Advertisement

More Related Content

Slideshows for you(20)

Similar to HBaseCon 2012 | Real-time Analytics with HBase - Sematext(20)

Advertisement

More from Cloudera, Inc.(20)

Advertisement

HBaseCon 2012 | Real-time Analytics with HBase - Sematext

  1. Real-time Analytics with HBase Alex Baranau, Sematext International Sunday, May 20, 12
  2. About me Software Engineer at Sematext International http://blog.sematext.com/author/abaranau @abaranau http://github.com/sematext (abaranau) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  3. Plan Problem background: what? why? Going real-time with append-only updates approach: how? Open-source implementation: how exactly? Q&A Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  4. Background: our services Systems Monitoring Service (Solr, HBase, ...) Search Analytics Service data collector Reports Data 100 75 data collector Analytics & 50 Storage 25 data collector 0 2007 2008 2009 2010 Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  5. Background: Report Example Search engine (Solr) request latency Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  6. Background: Report Example HBase flush operations Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  7. Background: requirements High volume of input data Multiple filters/dimensions Interactive (fast) reports Show wide range of data intervals Real-time data changes visibility No sampling, accurate data needed Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  8. Background: serve raw data? simply storing all data points doesn’t work to show 1-year worth of data points collected every second 31,536,000 points have to be fetched pre-aggregation (at least partial) needed Data Analytics & Storage Reports aggregated data input data data processing (pre-aggregating) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  9. Background: pre-aggregation OLAP-like Solution aggregation rules * filters/dimensions * time range granularities aggregated * ... value aggregated input data processing value item logic aggregated value Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  10. Background: pre-aggregation Simplified Example aggregated record groups by minute minute: 22214701 aggregation rules value: 30.0 * by sensor ... * by minute/day by day day: 2012-04-26 input data item value: 10.5 time: 1332882078 processing ... sensor: sensor55 value: 80.0 logic by minute & sensor minute: 22214701 sensor: sensor55 cpu: 70.3 ... ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  11. Background: RMW updates are slow more dimensions/filters -> greater output data vs input data ratio individual ready-modify-write (Get+Put) operations are slow and not efficient (10-20+ times slower than only Puts) sensor1 sensor2 ... <...> sensor2 value:15.0 ... value:41.0 input Get Put Get Put Get Put ... sensor1 sensor2 ... reports <...> avg : 28.7 Get/Scan sensor2 min: 15.0 storage max: 41.0 (HBase) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  12. Background: improve updates Using in-place increment operations? Not fast enough and not flexible... Buffering input records on the way in and writing in small batches? Doesn’t scale and possible loss of data... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  13. Background: batch updates More efficient data processing: multiple updates processed at once, not individually Decreases aggregation output (per input record) Reliable, no data loss Using “dictionary records” helps to reduce number of Get operations Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  14. Background: batch updates “Dictionary Records” Using data de-normalization to reduce random Get operations while doing “Get+Put” updates: Keep compound records which hold data of multiple “normal” records that are usually updated together N Get+Put operations replaced with M (Get+Put) and N Put operations, where M << N Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  15. Background: batch updates Not real-time If done frequently (closer to real-time), still a lot of costly Get+Put update operations Bad (any?) rollback support Handling of failures of tasks which partially wrote data to HBase is complex Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  16. Going Real-time with Append-based Updates Sunday, May 20, 12
  17. Append-only: main goals Increase record update throughput Process updates more efficiently: reduce operations number and resources usage Ideally, apply high volume of incoming data changes in real-time Add ability to roll back changes Handle well high update peaks Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  18. Append-only: how? 1. Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put) 2. Defer processing of updates to periodic jobs 3. Perform processing of updates on the fly only if user asks for data earlier than updates are processed. Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  19. Append-only: writing updates 1 Replace update (Get+Put) operations at write time with simple append-only writes (Put) sensor1 sensor2 sensor2 ... ... value:15.0 ... value:41.0 input Put Put Put ... sensor1 sensor2 ... <...> avg : 22.7 max: 31.0 sensor1 <...> ... ... sensor2 value: 15.0 storage sensor2 value: 41.0 (HBase) ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  20. Append-only: writing updates 2 Defer processing of updates to periodic jobs processing updates with MR job ... sensor1 sensor2 ... <...> avg : 22.7 ... sensor1 sensor2 ... max: 31.0 <...> avg : 23.4 sensor1 <...> ... max: 41.0 ... sensor2 value: 15.0 sensor2 value: 41.0 ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  21. Append-only: writing updates 3 Perform aggregations on the fly if user asks for data earlier than updates are processed ... sensor1 sensor2 ... reports <...> avg : 22.7 sensor1 ... max: 31.0 sensor1 <...> ... ... sensor2 value: 15.0 sensor2 storage value: 41.0 ... sensor2 avg : 23.4 max: 41.0 Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  22. Append-only: benefits High update throughput Real-time updates visibility Efficient updates processing Handling high peaks of update operations Ability to roll back any range of changes Automatically handling failures of tasks which only partially updated data (e.g. in MR jobs) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  23. 1/6 Append-only: high update throughput Avoid Get+Put operations upon writing Use only Put operations (i.e. insert new records only) which is very fast in HBase Process updates when flushing client-side buffer to reduce the number of actual writes Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  24. 2/6 Append-only: real-time updates Increased update throughput allows to apply updates in real-time User always sees the latest data changes Updates processed on the fly during Get or Scan can be stored back right away Periodic updates processing helps avoid doing a lot of work during reads, making reading very fast Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  25. 3/6 Append-only: efficient updates To apply N changes: N Get+Put operations replaced with N Puts and 1 Scan (shared) + 1 Put operation Applying N changes at once is much more efficient than performing N individual changes Especially when updated value is complex (like bitmaps), takes time to load in memory Skip compacting if too few records to process Avoid a lot of redundant Get operations when large portion of operations - inserting new data Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  26. 4/6 Append-only: high peaks handling Actual updates do not happen at write time Merge is deferred to periodic jobs, which can be scheduled to run at off-peak time (nights/ week-ends) Merge speed is not critical, doesn’t affect the visibility of changes Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  27. 5/6 Append-only: rollback Rollbacks are easy when updates were not processed yet (not merged) To preserve rollback ability after they are processed (and result is written back), updates can be compacted into groups written at: processing updates 9:00 ... sensor2 ... ... ... sensor2 ... ... sensor2 ... ... 10:00 sensor2 sensor2 ... ... sensor2 ... ... 11:00 sensor2 sensor2 ... ... ... ... Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  28. 5/6 Append-only: rollback Example: * keep all-time avg value for sensor * data collected every 10 second for 30 days Solution: * perform periodic compactions every 4 hours * compact groups based on 1-hour interval Result: At any point of time there are no more than 24 * 30 + 4 * 60 * 6 = 2160 non-compacted records that needs to be processed on the fly Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  29. 6/6 Append-only: idempotency Using append-only approach helps recover from failed tasks which write data to HBase without rolling back partial updates avoids applying duplicate updates fixes task failure with simple restart of task Note: new task should write records with same row keys as failed one easy, esp. given that input data is likely to be same Very convenient when writing from MapReduce Updates processing periodic jobs are also idempotent Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  30. Append-only: cons Processing on the fly makes reading slower Looking for data to compact (during periodic compactions) may be inefficient Increased amount of stored data depending on use-case (in 0.92+) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  31. Append-only + Batch? Works very well together, batch approach benefits from: increased update throughput automatic task failures handling rollback ability Use when HBase cluster cannot cope with processing updates in real-time or update operations are bottleneck in your batch We use it ;) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  32. Append-only updates implementation HBaseHUT Sunday, May 20, 12
  33. HBaseHUT: Overview Simple Easy to integrate into existing projects Packed as a singe jar to be added to HBase client classpath (also add it to RegionServer classpath to benefit from server-side optimizations) Supports native HBase API: HBaseHUT classes implement native HBase interfaces Apache License, v2.0 Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  34. HBaseHUT: Overview Processing of updates on-the-fly (behind ResultScanner interface) Allows storing back processed Result Can use CPs to process updates on server-side Periodic processing of updates with Scan or MapReduce job Including processing updates in groups based on write ts Rolling back changes with MapReduce job Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  35. HBaseHUT vs(?) OpenTSDB “vs” is wrong, they are simply different things OpenTSDB is a time-series database HBaseHUT is a library which implements append-only updates approach to be used in your project OpenTSDB uses “serve raw data” approach (with storage improvements), limited to handling numeric values HBaseHUT is meant for (but not limited to) “serve aggregated data” approach, works with any data Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  36. HBaseHUT: API overview Writing data: Put put = new Put(HutPut.adjustRow(rowKey)); // ... hTable.put(put); Reading data: Scan scan = new Scan(startKey, stopKey); ResultScanner resultScanner = new HutResultScanner(hTable.getScanner(scan), updateProcessor); for (Result current : resultScanner) {...} Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  37. HBaseHUT: API overview Example UpdateProcessor: public class MaxFunction extends UpdateProcessor { // ... constructor & utility methods @Override public void process(Iterable<Result> records, UpdateProcessingResult result) { Double maxVal = null; for (Result record : records) { double val = getValue(record); if (maxVal == null || maxVal < val) { maxVal = val; } } result.add(colfam, qual, Bytes.toBytes(maxVal)); } } Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  38. HBaseHUT: how we use it Data Analytics & Storage Reports aggregated data HBaseHUT HBaseHUT input initial data data processing HBase HBaseHUT HBaseHUT periodic MapReduce updates jobs processing Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  39. HBaseHUT: Next Steps Wider CPs (HBase 0.92+) utilization Process updates during memstore flush Make use of Append operation (HBase 0.94+) Integrate with asynchbase lib Reduce storage overhead from adjusting row keys etc. Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
  40. Qs? http://github.com/sematext/HBaseHUT http://blog.sematext.com @abaranau http://github.com/sematext (abaranau) http://sematext.com, we are hiring! ;) Alex Baranau, Sematext International, 2012 Sunday, May 20, 12
Advertisement