HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experience with High Speed Writes
4.
Why Flume?
• Real user issue:
• HBase REST Server – did not scale
• OOM, very high latency
• High ops cost
• Flume was a viable alternative
• Schema changes – require app changes
• In Flume, just change and deploy a plugin and restart Flume.
• HBase downtime/compaction/GC isolated from production app
• More data – just add more Flume agents, no app changes!
6.
Flume writes to HBase – HBase Sinks
• HBase Sink
• Currently supports 0.90.x, 0.92.x, 0.94.x
• Uses the “standard” HBase Client API
• Supports security
• Async HBase Sink
• Uses Async HBase
• No security support
• Faster
• Uses Async HBase 1.4.1
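Choosing between the two in a Flume agent comes down to the sink type. A minimal configuration sketch follows; the agent, sink, and channel names, the table, and the column family are placeholders, not values from the talk:

# Standard HBase sink – HBase client API, supports secure clusters
agent.sinks.sink1.type = hbase
agent.sinks.sink1.table = flume_events
agent.sinks.sink1.columnFamily = data
agent.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.sink1.channel = ch1

# Async HBase sink – asynchbase client, faster, no security support
agent.sinks.sink2.type = asynchbase
agent.sinks.sink2.table = flume_events
agent.sinks.sink2.columnFamily = data
agent.sinks.sink2.channel = ch1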
7.
Highly flexible sinks
• Both sinks are extremely flexible.
• HBase sink uses a “serializer” to convert Flume events to an HBase-friendly format.
• Plugin architecture – users can drop in their own serializer
• Serializers implement a very simple interface.
8.
Serializers
import java.util.List;
import org.apache.flume.Event;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Row;

public interface HbaseEventSerializer {
  // Called for each event, with the column family configured on the sink
  void initialize(Event event, byte[] columnFamily);
  // Puts/Deletes to write for the current event
  List<Row> getActions();
  // Counter increments to apply for the current event
  List<Increment> getIncrements();
  void close();
}
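For illustration, a bare-bones serializer built against the interface as shown above might look like the sketch below. The class name, row-key scheme, and column name are made up for this example, and the shipping Flume interface also extends Flume's Configurable, so a real serializer would implement configuration hooks as well.

import java.util.Collections;
import java.util.List;
import org.apache.flume.Event;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;

// Hypothetical example: one Put per event, event body stored in a single column.
public class BodyToColumnSerializer implements HbaseEventSerializer {
  private Event event;
  private byte[] columnFamily;

  @Override
  public void initialize(Event event, byte[] columnFamily) {
    this.event = event;
    this.columnFamily = columnFamily;
  }

  @Override
  public List<Row> getActions() {
    // A timestamp row key is illustrative only; monotonically increasing keys
    // hot-spot one region, which the pre-splitting advice later avoids.
    Put put = new Put(Long.toString(System.currentTimeMillis()).getBytes());
    put.add(columnFamily, "payload".getBytes(), event.getBody());
    return Collections.<Row>singletonList(put);
  }

  @Override
  public List<Increment> getIncrements() {
    return Collections.emptyList(); // no counter columns in this sketch
  }

  @Override
  public void close() {
  }
}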
9.
HBase Cluster performance
• HBase cluster itself scaled really well
• No one I know of has hit scaling issues writing from Flume
• Sometimes read performance was affected
• Primarily due to row locks held by writes/increments
• Increments made this situation more problematic
• When Flume was writing to the same rows that were being read, read latency could be visibly high.
• Pre-split tables and uniform distribution of data also helped.
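Pre-splitting happens at table-creation time. A minimal sketch against the HBase 0.94-era admin API is below; the table name, column family, and split points are placeholders, and real split points should match the row-key distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor("flume_events"); // hypothetical table
    table.addFamily(new HColumnDescriptor("data"));                // hypothetical family

    // Four split points create five regions up front, so writes are spread
    // across region servers from the start instead of hammering one region.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("2"), Bytes.toBytes("4"),
        Bytes.toBytes("6"), Bytes.toBytes("8")
    };
    admin.createTable(table, splits);
    admin.close();
  }
}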
10.
Issues we faced – why two sinks?
• Wrote the HBase Sink first, using the HBase client API
• The HBase client API is great at conserving resources
• Several static maps hidden away in the API meant we could not open as many connections as we wanted from the same JVM
• Region Servers and Flume Agents were sitting idle while data was being sent over the wire!
• More threads didn’t seem to help much.
11.
Async HBase to the rescue!
• Async HBase – an easy way out
• Maintains its own thread pools – callback-based
• Helped us get the full power of HBase
• Scaled really well – allowing good HBase cluster utilization
• Never seen a user complaining about Async HBase Sink performance!
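To illustrate the callback model, a minimal standalone asynchbase write might look like the sketch below; the ZooKeeper quorum, table, family, and column names are placeholders:

import com.stumbleupon.async.Callback;
import com.stumbleupon.async.Deferred;
import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;

public class AsyncHBaseExample {
  public static void main(String[] args) throws Exception {
    HBaseClient client = new HBaseClient("zk-host:2181");

    PutRequest put = new PutRequest(
        "flume_events".getBytes(), "row-1".getBytes(),
        "data".getBytes(), "payload".getBytes(), "hello".getBytes());

    // put() returns immediately; the callback fires when the RPC completes,
    // so the writing thread never blocks waiting on HBase.
    Deferred<Object> d = client.put(put);
    d.addCallback(new Callback<Object, Object>() {
      public Object call(Object result) {
        System.out.println("write acknowledged");
        return result;
      }
    });

    d.join();                  // block here only for the sake of the demo
    client.shutdown().join();  // flush pending RPCs and release resources
  }
}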
12.
What happens now?
• HBase 0.95+ no longer wire-compatible with Async HBase
• Hoping to see Async HBase support HBase 0.95+ (and willing to contribute!)
• Hoping to see an HBase API which supports a “use all my resources” mode (and willing to contribute!)