
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experience with High Speed Writes


Presented by: Hari Shreedharan, Cloudera



  1. Streaming data into HBase using Flume (Hari Shreedharan | Software Engineer, Cloudera)
  2. Apache Flume Fundamentals
     • Scalable collection and aggregation of event data (i.e., logs)
     • The simplest “unit” of data: the “Event”
     • Event = {Map<String, String>, byte[] body}
     • Dynamic, contextual event routing
     • Low latency, high throughput
     • Declarative configuration
     • Productive out of the box, yet powerfully extensible
     • Open source software
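The Event shape on slide 2 (a header map plus an opaque byte[] body) can be sketched as a self-contained stand-in. The real interface is org.apache.flume.Event; the SimpleEvent class and the "host" header below are illustrative, not from the talk:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for Flume's Event: string headers plus an opaque body.
// Headers drive Flume's dynamic, contextual routing; the body is the payload.
class SimpleEvent {
    private final Map<String, String> headers = new HashMap<>();
    private final byte[] body;

    SimpleEvent(byte[] body) {
        this.body = body;
    }

    Map<String, String> getHeaders() {
        return headers;
    }

    byte[] getBody() {
        return body;
    }
}
```

A sink or interceptor would inspect the headers (e.g., a hypothetical "host" key) to decide where the event goes, while leaving the body untouched.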
  3. Inside a Flume NG agent (diagram)
  4. Why Flume?
     • A real user issue: the HBase REST Server did not scale
       • OOMs, very high latency
       • High ops cost
       • Flume was a viable alternative
     • Schema changes require app changes
       • With Flume, just change a plugin, deploy it, and restart Flume
     • HBase downtime, compactions, and GC are isolated from the production app
     • More data? Just add more Flume agents, with no app changes!
  5. Topology: connecting agents together
     [Client]+ → Agent → [Agent]* → Destination (HBase)
  6. Flume writes to HBase: the HBase sinks
     • HBase Sink
       • Currently supports HBase 0.90.x, 0.92.x, and 0.94.x
       • Uses the “standard” HBase client API
       • Supports security
     • Async HBase Sink
       • Uses Async HBase 1.4.1
       • No security support
       • Faster
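Slide 2's declarative configuration and the sinks on slide 6 come together in an agent properties file. The sketch below is illustrative: the agent/component names (a1, r1, c1, k1), the table and column family, and the port are made-up examples, not from the talk.

```
# Name the components of a hypothetical agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# An Avro source receiving events from upstream agents or clients
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

# The HBase sink; swap in org.apache.flume.sink.hbase.AsyncHBaseSink
# (with its matching serializer) for the faster, non-secure variant
a1.sinks.k1.type = org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.table = events
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
a1.sinks.k1.channel = c1
```

Changing the serializer or the sink type is a configuration-and-restart operation, which is the point made on slide 4: schema changes stay out of the producing application.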
  7. Highly flexible sinks
     • Both sinks are extremely flexible
     • The HBase sink uses a “serializer” to convert Flume events into an HBase-friendly format
     • Plugin architecture: users can drop in their own serializer
     • Serializers implement a very simple interface
  8. Serializers

     public interface HbaseEventSerializer {
       void initialize(Event event, byte[] columnFamily);
       List<Row> getActions();
       List<Increment> getIncrements();
       void close();
     }
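A custom serializer implementing the interface on slide 8 turns each event into HBase mutations. The real interface depends on Flume's Event and HBase's Row/Increment classes; this self-contained sketch substitutes minimal stand-in types, and the "host" header, "payload" qualifier, and row-key scheme are invented for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-ins for the HBase client types the real interface returns.
class Row { byte[] rowKey; byte[] family; byte[] qualifier; byte[] value; }
class Increment { byte[] rowKey; long amount; }

// Sketch of a serializer: one Put-style Row per event, keyed by the
// event's "host" header, plus a per-host counter Increment.
class HostKeyedSerializer {
    private Map<String, String> headers;
    private byte[] body;
    private byte[] columnFamily;

    void initialize(Map<String, String> headers, byte[] body, byte[] columnFamily) {
        this.headers = headers;
        this.body = body;
        this.columnFamily = columnFamily;
    }

    List<Row> getActions() {
        Row put = new Row();
        put.rowKey = headers.getOrDefault("host", "unknown")
                .getBytes(StandardCharsets.UTF_8);
        put.family = columnFamily;
        put.qualifier = "payload".getBytes(StandardCharsets.UTF_8);
        put.value = body;
        List<Row> actions = new ArrayList<>();
        actions.add(put);
        return actions;
    }

    List<Increment> getIncrements() {
        Increment inc = new Increment();
        inc.rowKey = ("count:" + headers.getOrDefault("host", "unknown"))
                .getBytes(StandardCharsets.UTF_8);
        inc.amount = 1;
        List<Increment> incs = new ArrayList<>();
        incs.add(inc);
        return incs;
    }

    void close() { }
}
```

Note that host-keyed rows like this concentrate writes and increments on hot rows, which is exactly the read-latency interaction slide 9 warns about; pre-split tables and uniformly distributed keys avoid it.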
  9. HBase cluster performance
     • The HBase cluster itself scaled really well
       • No one I know of has hit scaling issues writing from Flume
     • Sometimes read performance was affected
       • Primarily due to row locks held by writes/increments
       • Increments made this situation more problematic
       • When Flume was writing to the same rows that were being read, read latency could be visibly high
     • Pre-split tables and a uniform distribution of data also helped
  10. Issues we faced: why two sinks?
     • We wrote the HBase Sink first, using the HBase client API
     • The HBase client API is great at conserving resources
       • Several static maps hidden away in the API meant we could not open as many connections as we wanted from the same JVM
       • Region servers and Flume agents sat idle while data was being sent over the wire!
       • Adding more threads didn't seem to help much
  11. Async HBase to the rescue!
     • Async HBase offered an easy way out
     • Maintains its own thread pools; callback-based
     • Helped us get the full power of HBase
     • Scaled really well, allowing good HBase cluster utilization
     • We have never seen a user complain about Async HBase Sink performance!
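The callback-based model on slide 11 issues many writes without tying up a caller thread per RPC, then reacts to completions on a pool. The real client is org.hbase.async.HBaseClient with its Deferred callbacks (not shown here); this self-contained sketch uses java.util.concurrent.CompletableFuture as a stand-in for the pattern:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Callback-style writes: issue a whole batch of simulated RPCs without
// blocking, count acknowledgements in callbacks, then wait once for all.
class AsyncWriteSketch {
    static int writeBatch(List<byte[]> rows) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger acked = new AtomicInteger();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (byte[] row : rows) {
            pending.add(CompletableFuture
                    .runAsync(() -> { /* simulated RPC for this row */ }, pool)
                    .thenRun(acked::incrementAndGet)); // completion callback
        }
        // Single synchronization point for the whole batch, not per write
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        return acked.get();
    }
}
```

Because nothing blocks between submissions, the pipeline stays full, which is how the Async HBase Sink keeps region servers and Flume agents busy instead of idle.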
  12. What happens now?
     • HBase 0.95+ is no longer wire-compatible with Async HBase
     • Hoping to see Async HBase support HBase 0.95+ (and willing to contribute!)
     • Hoping to see an HBase API that supports a “use all my resources” mode (and willing to contribute!)
  13. Read and contribute!
     • Apache Flume:
       • hitecture
       • a_into_apache_hbase
       • mance_tuning_part_1
  16. Thank you! Hari Shreedharan, Software Engineer, Cloudera (@harisr1234)