HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experience with High Speed Writes

This talk discusses using Apache Flume to stream data into Apache HBase. It describes how Flume provides a scalable, flexible way to collect and transport log and event data to HBase, and covers the two Flume HBase sinks, which write Flume events to HBase tables. The original HBase sink, built on the standard HBase client API, could not fully utilize the cluster; the Async HBase sink improved throughput by keeping the HBase cluster busy. Overall, the talk presents Flume as a viable alternative to writing directly to HBase, one that lets schemas change by redeploying a serializer plugin rather than changing application code.

Streaming data into HBase using Flume
Hari Shreedharan | Software Engineer, Cloudera

Apache Flume Fundamentals
• Scalable collection, aggregation of event data (e.g., logs)
• The simplest “unit” of data – “Event”
• Event = {Map<String, String>, byte[] body}
• Dynamic, contextual event routing
• Low latency, high throughput
• Declarative configuration
• Productive out of the box, yet powerfully extensible
• Open source software
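
To make the "Event" bullet concrete, here is a minimal sketch of an event as a header map plus an opaque byte body, with a toy header-based routing helper. `SketchEvent` and `EventRouting` are hypothetical names for illustration; they are stand-ins, not Flume's own `Event` type or its channel selector API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical stand-in for a Flume event: string headers plus an opaque byte[] body.
final class SketchEvent {
    private final Map<String, String> headers;
    private final byte[] body;

    SketchEvent(Map<String, String> headers, byte[] body) {
        this.headers = Map.copyOf(headers); // immutable defensive copy
        this.body = body.clone();
    }

    Map<String, String> headers() { return headers; }
    byte[] body() { return body.clone(); }
}

final class EventRouting {
    // Picks a destination name from a header, falling back to a default.
    // This only mimics the idea of contextual routing.
    static String route(SketchEvent e, String headerKey, String fallback) {
        return e.headers().getOrDefault(headerKey, fallback);
    }

    public static void main(String[] args) {
        SketchEvent e = new SketchEvent(
                Map.of("datacenter", "us-east"),
                "GET /index.html 200".getBytes(StandardCharsets.UTF_8));
        System.out.println(route(e, "datacenter", "default"));
    }
}
```

The body stays an uninterpreted byte array; only the headers are inspected when deciding where an event goes.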

Why Flume?
• Real user issue:
• HBase Rest Server – did not scale
• OOM, very high latency
• High ops cost
• Flume was a viable alternative
• Schema changes – would otherwise require app changes
• In Flume, just change and deploy a plugin and restart Flume.
• HBase downtime/compaction/GC isolated from production app
• More data – just add more Flume agents, no app changes!

Topology: Connecting agents together
[Client]+ → Agent [→ Agent]* → Destination (HBase)

Flume writes to HBase – HBase Sinks
• HBase Sink
• Currently supports 0.90.x, 0.92.x, 0.94.x
• Uses the “standard” HBase Client API
• Supports security
• Async HBase Sink
• Uses Async HBase
• No security support
• Faster
• Uses Async HBase 1.4.1

Highly flexible sinks
• Both sinks are extremely flexible.
• HBase sink uses a “serializer” to convert Flume events to an HBase-friendly format.
• Plugin architecture – user can drop in their own serializer
• Serializers implement a very simple interface.
• Serializers implement a very simple interface.

Serializers
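import java.util.List;
import org.apache.flume.Event;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Row;
public interface HbaseEventSerializer {
  // Hands the serializer the next event and the target column family.
  void initialize(Event event, byte[] columnFamily);
  // Row actions (e.g. puts) to write for the current event.
  List<Row> getActions();
  // Counter increments to apply for the current event.
  List<Increment> getIncrements();
  // Releases any resources held by the serializer.
  void close();
}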
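
As an illustration of what a drop-in serializer might do, the sketch below turns a "rowKey,value" event body into a single put. It is written against simplified stand-in types so it compiles without Flume or HBase on the classpath: `PutSketch`, `SerializerSketch`, `CsvSerializerSketch`, and the `payload` qualifier are all hypothetical, and increments are omitted for brevity. A real serializer would implement the interface above and return HBase `Row` objects.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an HBase put: one cell addressed by row/family/qualifier.
final class PutSketch {
    final byte[] row;
    final byte[] family;
    final byte[] qualifier;
    final byte[] value;

    PutSketch(byte[] row, byte[] family, byte[] qualifier, byte[] value) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.value = value;
    }
}

// Same shape as the slide's interface, reduced to the stand-in type.
interface SerializerSketch {
    void initialize(byte[] eventBody, byte[] columnFamily);
    List<PutSketch> getActions();
    void close();
}

// Splits a "rowKey,value" body into one put under a fixed qualifier.
final class CsvSerializerSketch implements SerializerSketch {
    private static final byte[] QUALIFIER = "payload".getBytes(StandardCharsets.UTF_8);
    private byte[] body;
    private byte[] family;

    @Override
    public void initialize(byte[] eventBody, byte[] columnFamily) {
        this.body = eventBody;
        this.family = columnFamily;
    }

    @Override
    public List<PutSketch> getActions() {
        List<PutSketch> actions = new ArrayList<>();
        String[] parts = new String(body, StandardCharsets.UTF_8).split(",", 2);
        if (parts.length == 2) { // skip malformed events instead of failing the batch
            actions.add(new PutSketch(
                    parts[0].getBytes(StandardCharsets.UTF_8), family, QUALIFIER,
                    parts[1].getBytes(StandardCharsets.UTF_8)));
        }
        return actions;
    }

    @Override
    public void close() { /* nothing to release in this sketch */ }
}
```

Because the mapping from event to row lives entirely in the serializer, a schema change means swapping this class, not touching the application that produces the events.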

HBase Cluster performance
• HBase cluster itself scaled really well
• No one I know of has hit scaling issues writing from Flume
• Sometimes read performance was affected
• Primarily due to row locks held by writes/increments
• Increments made this situation more problematic
• When Flume was writing to the same rows that were being read, the read latency could be visibly high.
• Pre-split tables and uniform distribution of data also helped.
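
One common way to get a uniform distribution over a pre-split table is to salt row keys with a hash-derived prefix. The talk does not say which technique was used, so treat the sketch below as an assumption: `SaltedKeys`, the two-digit prefix format, and the bucket count are all hypothetical choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class SaltedKeys {
    // Prefixes the key with a two-digit bucket derived from its hash, so that
    // sequential keys spread across `buckets` regions (assumes buckets <= 100).
    static String salt(String rowKey, int buckets) {
        int bucket = Math.floorMod(rowKey.hashCode(), buckets);
        return String.format(Locale.ROOT, "%02d-%s", bucket, rowKey);
    }

    // Split points matching the prefixes above: "01", "02", ..., one per region boundary.
    static byte[][] splitPoints(int buckets) {
        byte[][] splits = new byte[buckets - 1][];
        for (int i = 1; i < buckets; i++) {
            splits[i - 1] = String.format(Locale.ROOT, "%02d", i).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(salt("20130613-host1", 8));
        System.out.println(splitPoints(8).length + " split points");
    }
}
```

The trade-off is that salted keys no longer sort by their original order, so a range scan has to fan out over every bucket.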

Issues we faced – why two sinks?
• Wrote the HBase Sink first using HBase client API
• HBase Client API is great at conserving resources
• Several static maps hidden away in the API meant we could not open as many connections as wanted from the same JVM
• Region Servers and Flume Agents were sitting idle while data was being sent over the wire!
• More threads didn’t seem to help much.

Async HBase to the rescue!
• Async HBase – an easy way out
• Maintained thread pools – callbacks based
• Helped us get the full power of HBase
• Scaled really well – allowing good HBase cluster utilization
• Never seen a user complaining about Async HBase Sink performance!
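
The callback-based style can be sketched with the JDK's `CompletableFuture`: every write in a batch is issued without waiting, completions are counted in callbacks, and the caller blocks only once at the end. This uses `CompletableFuture` as a stand-in and is not the Async HBase API; `AsyncWriteSketch` and `putAsync` are hypothetical, and the "write" is a no-op.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class AsyncWriteSketch {
    // Hypothetical non-blocking put; a real client would send an RPC here.
    static CompletableFuture<Void> putAsync(String row, ExecutorService io) {
        return CompletableFuture.runAsync(() -> { /* pretend to write `row` */ }, io);
    }

    // Issues every write up front, counts acknowledgements via callbacks,
    // and waits only once for the whole batch.
    static int writeBatch(List<String> rows, ExecutorService io) {
        AtomicInteger acked = new AtomicInteger();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String row : rows) {
            pending.add(putAsync(row, io).thenRun(acked::incrementAndGet));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return acked.get();
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newFixedThreadPool(4);
        try {
            System.out.println(writeBatch(List.of("r1", "r2", "r3"), io));
        } finally {
            io.shutdown();
        }
    }
}
```

Compared with one blocking call per write, this keeps many requests in flight at once, which is the property the slide credits for good cluster utilization.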

What happens now?
• HBase 0.95+ no longer wire compatible with Async HBase
• Hoping to see Async HBase support HBase 0.95+ (and willing to contribute!)
• Hoping to see an HBase API which supports a “use all my resources” mode (and willing to contribute!)

Read and contribute!
• Apache Flume: http://flume.apache.org/
• https://blogs.apache.org/flume/entry/flume_ng_architecture
• https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
• https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1

Hari Shreedharan, Software Engineer, Cloudera @harisr1234
Thank you!

YapMap is a new kind of search platform that does multi-quanta search to better understand threaded discussions. This talk will cover how HBase made it possible for two self-funded guys to build a new kind of search platform. We will discuss our data model and how we use row based atomicity to manage parallel data integration problems. We’ll also talk about where we don’t use HBase and instead use a traditional SQL based infrastructure. We’ll cover the benefits of using MapReduce and HBase for index generation. Then we’ll cover our migration of some tasks from a message based queue to the Coprocessor framework as well as our future Coprocessor use cases. Finally, we’ll talk briefly about our operational experience with HBase, our hardware choices and challenges we’ve had.

HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster

Cloudera, Inc.

HBase at Bloomberg: High Availability Needs for the Financial Industry

HBaseCon

Speaker: Sudarshan Kadambi and Matthew Hunt (Bloomberg LP) Bloomberg is a financial data and analytics provider, so data management is core to what we do. There's tremendous diversity in the type of data we manage, and HBase is a natural fit for many of these datasets - from the perspective of the data model as well as in terms of a scalable, distributed database. This talk covers data and analytics use cases at Bloomberg and operational challenges around HA. We'll explore the work currently being done under HBASE-10070, further extensions to it, and how this solution is qualitatively different to how failover is handled by Apache Cassandra.

HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment

HBaseCon

HBaseCon 2015: HBase Operations in a Flurry

HBaseCon

Digital Library Collection Management using HBase

HBaseCon

HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket

Cloudera, Inc.

Solbase is an exciting new open-source, real-time search engine being developed at Photobucket to service the over 30 million daily search requests Photobucket handles. Solbase replaces Lucene’s file system-based index with HBase. This allows the system to update in real-time and linearly scale to serve millions of daily search requests on a large dataset. This session will explore the architecture of Solbase as well as some of Lucene/Solr’s inherent issues we overcame. Finally, we’ll go over performance metrics of Solbase against production traffic.

Nitin Verma, Pravin Mittal, and Maxim Lukiyanov (Microsoft) This session presents our success story of enabling a big internal customer on Microsoft Azure’s HBase service along with the methodology and tools used to meet high-throughput goals. We will also present how new features in HBase (like BucketCache and MultiWAL) are helping our customers in the medium-latency/high-bandwidth cloud-storage scenario.

HBase Read High Availability Using Timeline-Consistent Region Replicas

HBaseCon

Speakers: Enis Soztutar and Devaraj Das (Hortonworks) HBase has ACID semantics within a row that make it a perfect candidate for a lot of real-time serving workloads. However, single homing a region to a server implies some periods of unavailability for the regions after a server crash. Although the mean time to recovery has improved a lot recently, for some use cases, it is still preferable to do possibly stale reads while the region is recovering. In this talk, you will get an overview of our design and implementation of region replicas in HBase, which provide timeline-consistent reads even when the primary region is unavailable or busy.

HBaseCon 2015: State of HBase Docs and How to Contribute

HBaseCon

hbaseconasia2017: HBase Disaster Recovery Solution at Huawei

HBaseCon

Ashish Singhi HBase Disaster recovery solution aims to maintain high availability of HBase service in case of disaster of one HBase cluster with very minimal user intervention. This session will introduce the HBase disaster recovery use cases and the various solutions adopted at Huawei like. a) Cluster Read-Write mode b) DDL operations synchronization with standby cluster c) Mutation and bulk loaded data replication d) Further challenges and pending work hbaseconasia2017 hbasecon hbase https://www.eventbrite.com/e/hbasecon-asia-2017-tickets-34935546159#

Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity

HBaseCon

Speakers: Dheeraj Kapur, Rajiv Chittajallu & Anish Mathew (Yahoo!) In early 2013, Yahoo! introduced multi-tenancy to HBase to offer it as a platform service for all Hadoop users. A certain degree of customization per tenant (a user or a project) was achieved through RegionServer groups, namespaces, and customized configs for each tenant. This talk covers how to accommodate diverse needs to individual tenants on the cluster, as well as operational tips and techniques that allow Yahoo! to automate the management of multi-tenant clusters at petabyte scale without errors.

HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase

Cloudera, Inc.

HBase: Where Online Meets Low Latency

HBaseCon

Speakers: Nick Dimiduk (Hortonworks) and Nicolas Liochon (Scaled Risk) HBase is an online database so response latency is critical. This talk will examine sources of latency in HBase, detailing steps along the read and write paths. We'll examine the entire request lifecycle, from client to server and back again. We'll also look at the different factors that impact latency, including GC, cache misses, and system failures. Finally, the talk will highlight some of the work done in 0.96+ to improve the reliability of HBase.

Tales from the Cloudera Field

HBaseCon

Speakers: Kevin O'Dell, Aleksandr Shulman & Kathleen Ting (Cloudera) From supporting the 0.90.x, 0.92, 0.94, and 0.96 HBase installations on clusters ranging from tens to hundreds of nodes, Cloudera has seen it all. Having automated the upgrade paths from the different Apache releases, we have developed a smooth path that can help the community with upcoming upgrades. In addition to automation best practices, in this talk you'll also learn proactive configuration tweaks and operational best practices to keep your HBase cluster always up and running. We'll also walk through how to contain an application bug let loose in production, to minimize the impact on HBase posed by faulty hardware, and the direct correlation between inefficient schema design and HBase performance.

HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...

Cloudera, Inc.

The newly added feature of Coprocessors within HBase allows the application designer to move functionality closer to where the data resides. While this sounds like Stored Procedures as known in the RDBMS realm, they have a different set of properties. The distributed nature of HBase adds to the complexity of their implementation, but the client side API allows for an easy, transparent access to their functionality across many servers. This session explains the concepts behind coprocessors and uses examples to show how they can be used to implement data side extensions to the application code.

Meet HBase 1.0

enissoz

HBase Data Modeling and Access Patterns with Kite SDK

HBaseCon

Speaker: Adam Warrington (Cloudera) The Kite SDK is a set of libraries and tools focused on making it easier to build systems on top of the Hadoop ecosystem. HBase support has recently been added to the Kite SDK Data Module, which allows a developer to model and access data in HBase consistent with how they would model data in HDFS using Kite. This talk will focus on Kite's HBase support by covering Kite basics and moving through the specifics of working with HBase as a data source. This feature overview will be supplemented by specifics of how that feature is being used in production applications at Cloudera.

HBaseCon 2015: HBase Operations at Xiaomi

HBaseCon

Large-scale Web Apps @ Pinterest

HBaseCon

Speaker: Varun Sharma (Pinterest) Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.

HBaseCon 2015- HBase @ Flipboard

Matthew Blair

hbaseconasia2017: HBase在Hulu的使用和实践

HBaseCon

Qianxi Zhang 1. Hulu是美国最受欢迎的在线视频网站之一，Hulu Beijing是Hulu第二大研发中心。北京大数据基础架构团队负责整个公司的大数据基础架构的研发和运维。 2. HBase在Hulu的概况 3. HBase在Hulu的使用 4. 用户画像系统，存放所有用户的基本信息，用户行为，第三方DMP数据和机器学习结果标签(几十万个Qualifier)，Spark和Spark Streaming读写HBase数据，运行各种机器学习模型，为公司的视频推荐，精准广告和Marketing团队服务 5. HBase在Hulu的优化 hbaseconasia2017 hbasecon hbase https://www.eventbrite.com/e/hbasecon-asia-2017-tickets-34935546159#

HBaseCon 2013: Apache HBase Operations at Pinterest

Cloudera, Inc.

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

Cloudera, Inc.

If you’re running an HBase cluster in production, you’ve probably noticed that HBase shares a number of useful metrics about everything from your block cache performance to your HDFS latencies over JMX (or Ganglia, or just a file). The problem is that it’s sometimes hard to know what these metrics mean to you and your users. Should you be worried if your memstore SizeMB is 1.5GB? What if your regionservers have a hundred stores each? This talk will explain how to understand and interpret the metrics HBase exports. Along the way we’ll cover some high-level background on HBase’s internals, and share some battle tested rules-of-thumb about how to interpret and react to metrics you might see.

HBaseCon 2013: ETL for Apache HBase

Cloudera, Inc.

Real-time HBase: Lessons from the Cloud

HBaseCon

Speaker: Bryan Beaudreault (HubSpot) Running HBase in real time in the cloud provides an interesting and ever-changing set of challenges -- instance types are not ideal, neighbors can degrade your performance, and instances can randomly die in unanticipated ways. This talk will cover what HubSpot has learned about running in production on Amazon EC2, how it handle DR and redundancy, and the tooling the team has found to be the most helpful.

HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera

Cloudera, Inc.

Apache HDFS, the file system on which HBase is most commonly deployed, was originally designed for high-latency high-throughput batch analytic systems like MapReduce. Over the past two to three years, the rising popularity of HBase has driven many enhancements in HDFS to improve its suitability for real-time systems, including durability support for write-ahead logs, high availability, and improved low-latency performance. This talk will give a brief history of some of the enhancements from Hadoop 0.20.2 through 0.23.0, discuss some of the most exciting work currently under way, and explore some of the future enhancements we expect to develop in the coming years. We will include both high-level overviews of the new features as well as practical tips and benchmark results from real deployments.

HBaseCon 2013: Using Metrics to Monitor and Debug Apache HBase

Cloudera, Inc.

Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...

Data Con LA

Abstract:- With its easy to use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis etc, Spark Streaming has become go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and even do exactly once processing! Bio:- Hari Shreedharan is a PMC member and committer on the Apache Flume Project. As a PMC member, he is involved in making decisions on the direction of the project. Author of the O’Reilly book Using Flume, Hari is also a software engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters, by helping them resolve any issues they are facing.

What's hot

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight

HBaseCon

HBase Read High Availability Using Timeline-Consistent Region Replicas

HBaseCon

HBaseCon 2015: State of HBase Docs and How to Contribute

HBaseCon

hbaseconasia2017: HBase Disaster Recovery Solution at Huawei

HBaseCon

Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity

HBaseCon

HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase

Cloudera, Inc.

HBase: Where Online Meets Low Latency

HBaseCon

Tales from the Cloudera Field

HBaseCon

HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...

Cloudera, Inc.

Meet HBase 1.0

enissoz

HBase Data Modeling and Access Patterns with Kite SDK

HBaseCon

HBaseCon 2015: HBase Operations at Xiaomi

HBaseCon

Large-scale Web Apps @ Pinterest

HBaseCon

HBaseCon 2015- HBase @ Flipboard

Matthew Blair

hbaseconasia2017: HBase在Hulu的使用和实践

HBaseCon

HBaseCon 2013: Apache HBase Operations at Pinterest

Cloudera, Inc.

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

Cloudera, Inc.

HBaseCon 2013: ETL for Apache HBase

Cloudera, Inc.

Real-time HBase: Lessons from the Cloud

HBaseCon

HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera

Cloudera, Inc.

What's hot (20)

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight

HBase Read High Availability Using Timeline-Consistent Region Replicas

HBaseCon 2015: State of HBase Docs and How to Contribute

hbaseconasia2017: HBase Disaster Recovery Solution at Huawei

Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity

HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase

HBase: Where Online Meets Low Latency

Tales from the Cloudera Field

HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...

Meet HBase 1.0

HBase Data Modeling and Access Patterns with Kite SDK

HBaseCon 2015: HBase Operations at Xiaomi

Large-scale Web Apps @ Pinterest

HBaseCon 2015- HBase @ Flipboard

hbaseconasia2017: HBase在Hulu的使用和实践

HBaseCon 2013: Apache HBase Operations at Pinterest

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

HBaseCon 2013: ETL for Apache HBase

Real-time HBase: Lessons from the Cloud

HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera

Viewers also liked

HBaseCon 2013: Using Metrics to Monitor and Debug Apache HBase

Cloudera, Inc.

Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...

Data Con LA

Real Time Data Processing using Spark Streaming | Data Day Texas 2015

Cloudera, Inc.

Speaker: Hari Shreedharan Data Day Texas 2015 Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still be able to spill to disk if required. Spark’s powerful, yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster. Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis etc. allowing users to process data as it comes in. In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help understand practices while writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.

HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...

Cloudera, Inc.

As small companies are adapting to handle Big Data, the cloud and HBase enable developers to leverage that data to provide revenue-generating real time applications. When developing a real time application for an existing system, one must balance incrementing counters in real time with Map Reduce jobs over the same data-set. When maintaining an analytics platform, ensuring data accuracy is essential. At Sproxil, SMS logs are ingested into HBase at a growing rate and we report metrics such as SMS throughput, unique user growth over time, and return SMS user activity in real time. Sproxil provides a versatile analytics application enabling customers to handpick statistics on demand to gain market insights enabling them react quickly to trends. This talk will identify the most profitable metrics and demonstrate how to calculate them using Map Reduce while continually updating data as it arrives.

HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase

Cloudera, Inc.

HBaseCon 2013: Real-Time Model Scoring in Recommender Systems

Cloudera, Inc.

HBaseCon 2012 | Real-time Analytics with HBase - Sematext

Cloudera, Inc.

In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using append-only approach. This approach uses HBase core strengths like fast range scans and the recently added coprocessors to enable real-time analytics. It shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive. Apart from making real-time analytics possible, we’ll show how the append-only approach to updates makes it possible to perform rollbacks of data changes and avoid data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase.

HBaseCon 2013: Scalable Network Designs for Apache HBase

Cloudera, Inc.

Spark+flume seattle

Hari Shreedharan

HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...

Cloudera, Inc.

HBase brings interactivity to Hadoop, and allows users to collect, manage and process data in real-time. Lily wraps HBase and Solr in a comprehensive Big Data platform, with HBase-native secondary indexing complementing ad-hoc structured search. Through spare write-cycles during read operations, Lily transforms HBase in an scalable data management engine providing interactive analytics, profile harvesting and real-time recommendations. This talk highlights the architecture of Lily, how it completes HBase, and explains some of its implementation use cases.

HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...

Cloudera, Inc.

HBaseCon 2013: Full-Text Indexing for Apache HBase

Cloudera, Inc.

HBaseCon 2012 | HBase, the Use Case in eBay Cassini

Cloudera, Inc.

eBay marketplace has been working hard on the next generation search infrastructure and software system, code-named Cassini. The new search engine processes over 250 million search queries and serves more than 2 billion page views each day. Its indexing platform is based on Apache Hadoop and Apache HBase. Apache HBase is a distributed persistent layer built on Hadoop to support billions of updates per day. Its easy sharding character, fast writes, and table scans, super fast data bulk load, and natural integration to Hadoop provide the cornerstones for successful continuous index builds. We will share with the audience the technical details and share the difficulties and challenges that we’ve gone through and that we are still facing in the process.

HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...

Cloudera, Inc.

Taming HBase with Apache Phoenix and SQL

HBaseCon

Speakers: Eli Levine, James Taylor (Salesforce.com) & Maryann Xue (Intel) HBase is the Turing machine of the Big Data world. It's been scientifically proven that you can do *anything* with it. This is, of course, a blessing and a curse, as there are so many different ways to implement a solution. Apache Phoenix (incubating), the SQL engine over HBase to the rescue. Come learn about the fundamentals of Phoenix and how it hides the complexities of HBase while giving you optimal performance, and hear about new features from our recent release, including updatable views that share the same physical HBase table and n-way equi-joins through a broadcast hash join mechanism. We'll conclude with a discussion about our roadmap and plans to implement a cost-based query optimization to dynamically adapt query execution based on your data sizes.

HBaseCon 2013: Near Real Time Indexing for eBay Search

Cloudera, Inc.

HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce

Cloudera, Inc.

Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.

Flume HBase

irayan

Enabling Microservices @Orbitz - DockerCon 2015

Steve Hoffman

Apache flume - an Introduction

Erik Schmiegelow

Viewers also liked (20)

HBaseCon 2013: Using Metrics to Monitor and Debug Apache HBase

Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...

Real Time Data Processing using Spark Streaming | Data Day Texas 2015

HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...

HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase

HBaseCon 2013: Real-Time Model Scoring in Recommender Systems

HBaseCon 2012 | Real-time Analytics with HBase - Sematext

HBaseCon 2013: Scalable Network Designs for Apache HBase

Spark+flume seattle

HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...

HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...

HBaseCon 2013: Full-Text Indexing for Apache HBase

HBaseCon 2012 | HBase, the Use Case in eBay Cassini

HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...

Taming HBase with Apache Phoenix and SQL

HBaseCon 2013: Near Real Time Indexing for eBay Search

HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce

Flume HBase

Enabling Microservices @Orbitz - DockerCon 2015

Apache flume - an Introduction

Similar to HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experience with High Speed Writes

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

Esther Kundin

HBase and Hadoop at Urban Airship

dave_revell

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

Esther Kundin

HDFCloud Workshop: HDF5 in the Cloud

The HDF-EOS Tools and Information Center

impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf

ssusere05ec21

HBase Low Latency, StrataNYC 2014

Nick Dimiduk

Technologies for Data Analytics Platform

N Masahiro

HBase Applications - Atlanta HUG - May 2014

larsgeorge

Hadoop World 2011: Apache HBase Road Map - Jonathan Gray - Facebook

Cloudera, Inc.

Service-Oriented Design and Implement with Rails3Wen-Tien Chang

HBase Low LatencyDataWorks Summit

Hive spark-s3acommitter-hbase-nfs

Yifeng Jiang

Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...

Dataconomy Media

[Hi c2011]building mission critical messaging system(guoqiang jerry)baggioss

Facebook keynote-nicolas-qconYiwei Ma

Facebook Messages & HBase

强王

支撑Facebook消息处理的h base存储系统yongboy

Large-scale projects development (scaling LAMP)

Alexey Rybak

This 8-hours tutorial was given at various conferences including Percona conference (London), DevConf (Moscow), Highload++ (Moscow). ABSTRACT During this tutorial we will cover various topics related to high scalability for the LAMP stack. This workshop is divided into three sections. The first section covers basic principles of shared nothing architectures and horizontal scaling for the app//cache/database tiers. Section two of this tutorial is devoted to MySQL sharding techniques, queues and a few performance-related tips and tricks. In section three we will cover the practical approach for measuring site performance and quality, porviding a "lean" support philosophy, connecting buesiness and technology metrics. In addition we will cover a very useful Pinba real-time statistical server, it's features and various use cases. All of the sections will be based on real-world examples built in Badoo, one of the biggest dating sites on the Internet.

Hp hadoop platform

Akshat Thakar

HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...

Michael Stack

Similar to HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experience with High Speed Writes (20)

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

HBase and Hadoop at Urban Airship

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

HDFCloud Workshop: HDF5 in the Cloud

impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf

HBase Low Latency, StrataNYC 2014

Technologies for Data Analytics Platform

HBase Applications - Atlanta HUG - May 2014

Hadoop World 2011: Apache HBase Road Map - Jonathan Gray - Facebook

Service-Oriented Design and Implement with Rails3

HBase Low Latency

Hive spark-s3acommitter-hbase-nfs

Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...

[Hi c2011]building mission critical messaging system(guoqiang jerry)

Facebook keynote-nicolas-qcon

Facebook Messages & HBase

支撑Facebook消息处理的h base存储系统

Large-scale projects development (scaling LAMP)

Hp hadoop platform

HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...

HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experience with High Speed Writes

1. Streaming data into HBase using Flume
Hari Shreedharan | Software Engineer, Cloudera

2. Apache Flume Fundamentals
• Scalable collection and aggregation of event data (e.g. logs)
• The simplest “unit” of data – “Event”
• Event = {Map<String, String>, byte[] body}
• Dynamic, contextual event routing
• Low latency, high throughput
• Declarative configuration
• Productive out of the box, yet powerfully extensible
• Open source software
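A minimal sketch of the event model above, and of routing on headers alone. The class names (`SketchEvent`, `HeaderRouter`), the header name `source`, and the channel names are all hypothetical stand-ins for illustration; they are not Flume's own classes or configuration keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class EventRoutingSketch {

    // Hypothetical stand-in for a Flume event: string headers plus an opaque byte[] body.
    static final class SketchEvent {
        private final Map<String, String> headers;
        private final byte[] body;

        SketchEvent(Map<String, String> headers, byte[] body) {
            this.headers = new HashMap<>(headers);
            this.body = body.clone();
        }

        Map<String, String> getHeaders() { return headers; }
        byte[] getBody() { return body.clone(); }
    }

    // Hypothetical router: maps the value of one header to a channel name.
    static final class HeaderRouter {
        private final String headerName;
        private final Map<String, String> valueToChannel;
        private final String defaultChannel;

        HeaderRouter(String headerName, Map<String, String> valueToChannel, String defaultChannel) {
            this.headerName = headerName;
            this.valueToChannel = new HashMap<>(valueToChannel);
            this.defaultChannel = defaultChannel;
        }

        // The body is never inspected; routing depends only on the headers.
        String channelFor(SketchEvent event) {
            String value = event.getHeaders().get(headerName);
            return valueToChannel.getOrDefault(value, defaultChannel);
        }
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("web", "hbaseChannel");
        HeaderRouter router = new HeaderRouter("source", mapping, "defaultChannel");

        Map<String, String> headers = new HashMap<>();
        headers.put("source", "web");
        SketchEvent event = new SketchEvent(headers, "GET /index.html".getBytes(StandardCharsets.UTF_8));

        System.out.println(router.channelFor(event));
    }
}
```

Keeping the body opaque is what lets routing stay cheap: the router reads a header and never parses the payload.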

3. Inside a Flume NG agent

4. Why Flume?
• Real user issue:
• HBase Rest Server – did not scale
• OOM, very high latency
• High ops cost
• Flume was a viable alternative
• Schema changes – require app changes
• In Flume, just change and deploy a plugin and restart Flume.
• HBase downtime/compaction/gc isolated from production app
• More data – just add more Flume agents, no app changes!

5. Topology: Connecting agents together
[Client]+ → Agent [→ Agent]* → Destination (HBase)

6. Flume writes to HBase – HBase Sinks
• HBase Sink
• Currently supports 0.90.x, 0.92.x, 0.94.x
• Uses the “standard” HBase Client API
• Supports security
• Async HBase Sink
• Uses Async HBase
• No security support
• Faster
• Uses Async HBase 1.4.1

7. Highly flexible sinks
• Both sinks are extremely flexible.
• HBase sink uses a “serializer” to convert Flume events to HBase-friendly format.
• Plugin architecture – user can drop in their own serializer
• Serializers implement a very simple interface.

8. Serializers
public interface HbaseEventSerializer {
  void initialize(Event event, byte[] columnFamily);
  public List<Row> getActions();
  public List<Increment> getIncrements();
  public void close();
}
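A sketch of what a drop-in serializer might look like. To stay self-contained it does not use the real Flume or HBase types: `PutAction`, `IncrementAction`, `EventSerializer`, and `CsvSerializer` are hypothetical stand-ins that only mirror the shape of the interface above, and `initialize` takes the raw event body instead of an `Event`. The "rowKey,value" body format and the column names `payload` and `count` are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SerializerSketch {

    // Hypothetical stand-in for an HBase write to one cell.
    static final class PutAction {
        final byte[] rowKey, columnFamily, qualifier, value;
        PutAction(byte[] rowKey, byte[] columnFamily, byte[] qualifier, byte[] value) {
            this.rowKey = rowKey; this.columnFamily = columnFamily;
            this.qualifier = qualifier; this.value = value;
        }
    }

    // Hypothetical stand-in for an HBase counter increment.
    static final class IncrementAction {
        final byte[] rowKey, columnFamily, qualifier;
        final long amount;
        IncrementAction(byte[] rowKey, byte[] columnFamily, byte[] qualifier, long amount) {
            this.rowKey = rowKey; this.columnFamily = columnFamily;
            this.qualifier = qualifier; this.amount = amount;
        }
    }

    // Mirrors the shape of the slide's interface, using the stand-in types above.
    interface EventSerializer {
        void initialize(byte[] eventBody, byte[] columnFamily);
        List<PutAction> getActions();
        List<IncrementAction> getIncrements();
        void close();
    }

    // Treats the body as "rowKey,value": one put to "payload", one increment of "count".
    static final class CsvSerializer implements EventSerializer {
        private String[] parts = new String[0];
        private byte[] columnFamily;

        @Override
        public void initialize(byte[] eventBody, byte[] columnFamily) {
            this.parts = new String(eventBody, StandardCharsets.UTF_8).split(",", 2);
            this.columnFamily = columnFamily;
        }

        @Override
        public List<PutAction> getActions() {
            if (parts.length < 2) return Collections.emptyList(); // malformed body: write nothing
            List<PutAction> actions = new ArrayList<>();
            actions.add(new PutAction(bytes(parts[0]), columnFamily, bytes("payload"), bytes(parts[1])));
            return actions;
        }

        @Override
        public List<IncrementAction> getIncrements() {
            if (parts.length < 2) return Collections.emptyList();
            List<IncrementAction> increments = new ArrayList<>();
            increments.add(new IncrementAction(bytes(parts[0]), columnFamily, bytes("count"), 1L));
            return increments;
        }

        @Override
        public void close() { /* nothing to release in this sketch */ }
    }

    static byte[] bytes(String s) { return s.getBytes(StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        CsvSerializer serializer = new CsvSerializer();
        serializer.initialize(bytes("user42,clicked"), bytes("cf"));
        System.out.println(serializer.getActions().size() + " put(s), "
                + serializer.getIncrements().size() + " increment(s)");
        serializer.close();
    }
}
```

Changing the row layout then means swapping this one class, which is the "schema change without app change" point made earlier in the deck.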

9. HBase Cluster performance
• HBase cluster itself scaled really well
• No one I know of has hit scaling issues writing from Flume
• Sometimes read performance was affected
• Primarily due to row locks held by writes/increments
• Increments made this situation more problematic
• When Flume was writing to the same rows that were being read, read latency could be visibly high.
• Pre-split tables and uniform distribution of data also helped.
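The slides do not say how the data was distributed uniformly. One common technique is to salt row keys with a hash-derived prefix and pre-split the table on those prefixes; the sketch below assumes that approach. The `NN-` prefix format, the bucket limit, and the method names are choices made for this example only.

```java
class SaltedRowKeySketch {

    // Prefixes the key with a two-digit bucket derived from its hash, so that
    // sequential keys (timestamps, counters) spread across all buckets.
    static String saltedKey(String originalKey, int buckets) {
        checkBuckets(buckets);
        int bucket = Math.floorMod(originalKey.hashCode(), buckets);
        return String.format("%02d-%s", bucket, originalKey);
    }

    // Split points matching the salt prefixes: "01", "02", ... up to buckets - 1.
    // Creating the table with these splits gives one region per bucket.
    static String[] splitPoints(int buckets) {
        checkBuckets(buckets);
        String[] splits = new String[buckets - 1];
        for (int i = 1; i < buckets; i++) {
            splits[i - 1] = String.format("%02d", i);
        }
        return splits;
    }

    // Two-digit prefixes only sort correctly for up to 100 buckets.
    private static void checkBuckets(int buckets) {
        if (buckets < 1 || buckets > 100) {
            throw new IllegalArgumentException("buckets must be between 1 and 100");
        }
    }

    public static void main(String[] args) {
        int buckets = 4;
        for (String key : new String[] {"event-0001", "event-0002", "event-0003"}) {
            System.out.println(saltedKey(key, buckets));
        }
        System.out.println(String.join(",", splitPoints(buckets)));
    }
}
```

The trade-off is on the read side: a scan over a range of original keys has to fan out across every bucket.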

10. Issues we faced – why two sinks?
• Wrote the HBase Sink first using HBase client API
• HBase Client API great at conserving resources
• Several static maps hidden away in the API meant we could not open as many connections as wanted from the same JVM
• Region Servers and Flume Agents were sitting idle while data was being sent over the wire!
• More threads didn’t seem to help much.

11. Async HBase to the rescue!
• Async HBase – an easy way out
• Maintained thread pools – callbacks based
• Helped us get the full power of HBase
• Scaled really well – allowing good HBase cluster utilization
• Never seen a user complaining about Async HBase Sink performance!
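A rough illustration of the callback style described above: every write in a batch is issued without waiting for the previous one, results are handled in callbacks, and the caller waits once for the whole batch. Async HBase has its own callback type; this sketch uses the JDK's `CompletableFuture` as a stand-in, and `fakeWrite` is a hypothetical placeholder for the remote call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class AsyncWriteSketch {

    // Hypothetical stand-in for a remote write; a real client would send an RPC here.
    static boolean fakeWrite(String row) {
        return !row.isEmpty();
    }

    // Submits all writes up front, counts successes in callbacks,
    // then blocks once until the whole batch has completed.
    static int writeBatch(List<String> rows, ExecutorService pool) {
        AtomicInteger succeeded = new AtomicInteger();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String row : rows) {
            pending.add(CompletableFuture
                    .supplyAsync(() -> fakeWrite(row), pool)
                    .thenAccept(ok -> { if (ok) succeeded.incrementAndGet(); }));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return succeeded.get();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            int ok = writeBatch(Arrays.asList("row1", "row2", "row3"), pool);
            System.out.println(ok + " writes acknowledged");
        } finally {
            pool.shutdown();
        }
    }
}
```

With a blocking client, each thread waits on its own round trip; here the in-flight writes overlap, which is the kind of utilization gain the slide attributes to the async sink.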

12. What happens now?
• HBase 0.95+ no longer wire compatible with Async HBase
• Hoping to see Async HBase support HBase 0.95+ (and willing to contribute!)
• Hoping to see an HBase API which supports a “use all my resources” mode (and willing to contribute!)

13. Read and contribute!
• Apache Flume: http://flume.apache.org/
• https://blogs.apache.org/flume/entry/flume_ng_architecture
• https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
• https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1

16. Thank you!
Hari Shreedharan, Software Engineer, Cloudera
@harisr1234
