Real-time Analytics with HBase (short version)

7,686 views
7,686 views

Published on

Published in: Technology, Business
1 Comment
6 Likes
Statistics
Notes
No Downloads
Views
Total views
7,686
On SlideShare
0
From Embeds
0
Number of Embeds
5,733
Actions
Shares
0
Downloads
8
Comments
1
Likes
6
Embeds 0
No embeds

No notes for slide

Real-time Analytics with HBase (short version)

  1. 1. Real-time Analytics with HBase Alex Baranau, Sematext International (short version)Tuesday, June 5, 12
  2. 2. About me Software Engineer at Sematext International http://blog.sematext.com/author/abaranau @abaranau http://github.com/sematext (abaranau) Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  3. 3. Plan Problem background: what? why? Going real-time with append-only updates approach: how? Open-source implementation: how exactly? Q&A Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  4. 4. Background: our services Systems Monitoring Service (Solr, HBase, ...) Search Analytics Service data collector Reports Data 100 75 data collector Analytics & 50 Storage 25 data collector 0 2007 2008 2009 2010 Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  5. 5. Background: Report Example Search engine (Solr) request latency Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  6. 6. Background: requirements High volume of input data Multiple filters/dimensions Interactive (fast) reports Show wide range of data intervals Real-time data changes visibility No sampling, accurate data needed Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  7. 7. Background: serve raw data? simply storing all data points doesn’t work to show 1-year worth of data points collected every second 31,536,000 points have to be fetched pre-aggregation (at least partial) needed Data Analytics & Storage Reports aggregated data input data data processing (pre-aggregating) Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  8. 8. Background: pre-aggregation OLAP-like Solution aggregation rules * filters/dimensions * time range granularities aggregated * ... value aggregated input data processing value item logic aggregated value Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  9. 9. Background: RMW updates are slow more dimensions/filters -> greater output data vs input data ratio individual ready-modify-write (Get+Put) operations are slow and not efficient (10-20+ times slower than only Puts) sensor1 sensor2 ... <...> sensor2 value:15.0 ... value:41.0 input Get Put Get Put Get Put ... sensor1 sensor2 ... reports <...> avg : 28.7 Get/Scan sensor2 min: 15.0 storage max: 41.0 (HBase) Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  10. 10. Background: batch updates More efficient data processing: multiple updates processed at once, not individually Decreases aggregation output (per input record) Reliable, no data loss in case of failures Not real-time If done frequently (closer to real-time), still a lot of costly Get+Put update operations Handling of failures of tasks which partially wrote data to HBase is complex Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  11. 11. Going Real-time with Append-based UpdatesTuesday, June 5, 12
  12. 12. Append-only: main goals Increase record update throughput Process updates more efficiently: reduce operations number and resources usage Ideally, apply high volume of incoming data changes in real-time Add ability to roll back changes Handle well high update peaks Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  13. 13. Append-only: how? 1. Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put) 2. Defer processing of updates to periodic jobs 3. Perform processing of updates on the fly only if user asks for data earlier than updates are processed. Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  14. 14. Append-only: writing updates 1 Replace update (Get+Put) operations at write time with simple append-only writes (Put) sensor1 sensor2 sensor2 ... ... value:15.0 ... value:41.0 input Put Put Put ... sensor1 sensor2 ... <...> avg : 22.7 max: 31.0 sensor1 <...> ... ... sensor2 value: 15.0 storage sensor2 value: 41.0 (HBase) ... Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  15. 15. Append-only: writing updates 2 Defer processing of updates to periodic jobs processing updates with MR job ... sensor1 sensor2 ... <...> avg : 22.7 ... sensor1 sensor2 ... max: 31.0 <...> avg : 23.4 sensor1 <...> ... max: 41.0 ... sensor2 value: 15.0 sensor2 value: 41.0 ... Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  16. 16. Append-only: writing updates 3 Perform aggregations on the fly if user asks for data earlier than updates are processed ... sensor1 sensor2 ... reports <...> avg : 22.7 sensor1 ... max: 31.0 sensor1 <...> ... ... sensor2 value: 15.0 sensor2 storage value: 41.0 ... sensor2 avg : 23.4 max: 41.0 Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  17. 17. Append-only: benefits High update throughput Real-time updates visibility Efficient updates processing Handling high peaks of update operations Ability to roll back any range of changes Automatically handling failures of tasks which only partially updated data (e.g. in MR jobs) Update operation becomes idempotent & atomic, easy to scale writers horizontally Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  18. 18. 3/7 Append-only: efficient updates To apply N changes: N Get+Put operations replaced with N Puts and 1 Scan (shared) + 1 Put operation Applying N changes at once is much more efficient than performing N individual changes Especially when updated value is complex (like bitmaps), takes time to load in memory Skip compacting if too few records to process Avoid a lot of redundant Get operations when large portion of operations - inserting new data Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  19. 19. 5/7 Append-only: rollback Rollbacks are easy when updates were not processed yet (not merged) To preserve rollback ability after they are processed (and result is written back), updates can be compacted into groups written at: processing updates 9:00 ... sensor2 ... ... ... sensor2 ... ... sensor2 ... ... 10:00 sensor2 sensor2 ... ... sensor2 ... ... 11:00 sensor2 sensor2 ... ... ... ... Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  20. 20. 6/7 Append-only: idempotency Using append-only approach helps recover from failed tasks which write data to HBase without rolling back partial updates avoids applying duplicate updates fixes task failure with simple restart of task Note: new task should write records with same row keys as failed one easy, esp. given that input data is likely to be same Very convenient when writing from MapReduce Updates processing periodic jobs are also idempotent Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  21. 21. Append-only: cons Processing on the fly makes reading slower Looking for data to compact (during periodic compactions) may be inefficient Increased amount of stored data depending on use-case (in 0.92+) Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  22. 22. Append-only updates implementation HBaseHUTTuesday, June 5, 12
  23. 23. HBaseHUT: Overview Simple Easy to integrate into existing projects Packed as a singe jar to be added to HBase client classpath (also add it to RegionServer classpath to benefit from server-side optimizations) Supports native HBase API: HBaseHUT classes implement native HBase interfaces Apache License, v2.0 Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  24. 24. HBaseHUT: Overview Processing of updates on-the-fly (behind ResultScanner interface) Allows storing back processed Result Can use CPs to process updates on server-side Periodic processing of updates with Scan or MapReduce job Including processing updates in groups based on write ts Rolling back changes with MapReduce job Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  25. 25. HBaseHUT: API overview Writing data: Put put = new Put(HutPut.adjustRow(rowKey)); // ... hTable.put(put); Reading data: Scan scan = new Scan(startKey, stopKey); ResultScanner resultScanner = new HutResultScanner(hTable.getScanner(scan), updateProcessor); for (Result current : resultScanner) {...} Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  26. 26. HBaseHUT: API overview Example UpdateProcessor: public class MaxFunction extends UpdateProcessor { // ... constructor & utility methods @Override public void process(Iterable<Result> records, UpdateProcessingResult result) { Double maxVal = null; for (Result record : records) { double val = getValue(record); if (maxVal == null || maxVal < val) { maxVal = val; } } result.add(colfam, qual, Bytes.toBytes(maxVal)); } } Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  27. 27. HBaseHUT: Next Steps Wider CPs (HBase 0.92+) utilization Process updates during memstore flush Make use of Append operation (HBase 0.94+) Integrate with asynchbase lib Reduce storage overhead from adjusting row keys etc. Contributors are welcome! Alex Baranau, Sematext International, 2012Tuesday, June 5, 12
  28. 28. Qs? http://github.com/sematext/HBaseHUT http://blog.sematext.com @abaranau http://github.com/sematext (abaranau) http://sematext.com, we are hiring! ;) there will be a longer version of the presentation on the web Alex Baranau, Sematext International, 2012Tuesday, June 5, 12

×