HBase and Hadoop at Urban Airship
April 25, 2012
                           Dave Revell
                 dave@urbanairship.com
                          @dave_revell
Who are we?

•   Who am I?
     •   Airshipper for 10 months, Hadoop user for 1.5 years
     •   Database Engineer on Core Data team: we collect
         events from mobile devices and create reports
•   What is Urban Airship?
     •   SaaS for mobile developers. Features that devs
         shouldn’t build themselves.
     •   Mostly push notifications
     •   No airships :(
Goals
•   “Near real time” reporting
      •   Counters: messages sent and received, app opens, in
          various time slices
      •   More complex analyses: time-in-app, uniques,
          conversions
•   Scale
      •   Billions of “events” per month, ~100 bytes each
      •   40 billion events so far, looking exponential.
      •   Event arrival rate varies wildly, ~10K/sec (?)
Enter Hadoop

•   An Apache project with HDFS, MapReduce, and Common
      •   Open source, Apache license
•   In common usage: platform, framework, ecosystem
      •   HBase, Hive, Pig, ZooKeeper, Mahout, Oozie ....
•   It’s in Java
•   History: early 2000s, originally a clone of Google’s GFS and
    MapReduce
Enter HBase

•   HBase is a database that uses HDFS for storage
•   Based on Google’s BigTable. Not relational or SQL.
•   Solves the problem “how do I query my Hadoop data?”
      •   Operations typically take a few milliseconds
      •   MapReduce is not suitable for real time queries
•   Scales well by adding servers (if you do everything right)
•   Not highly available or multi-datacenter
UA’s basic architecture
   Events in:    Mobile devices → Queue (Kafka) → HBase (on HDFS)
   Reports out:  Reports user ← Web service ← HBase


   (not shown: analysis code that reads events from HBase
   and puts derived data back into HBase)
Analyzing events

•   Queue of incoming events
      •   Absorbs traffic spikes
      •   Partially decouples database from internet
      •   Pub/sub, groups of consumers share work
•   UA proprietary Java code
      •   Consumes event queue
      •   Does simple streaming analyses (counters); sketch below
      •   Stages data in HBase tables for more
          complex analyses that come later
•   Incremental batch jobs
      •   Calculations that are difficult or inefficient to
          compute as data streams through
      •   Read from HBase, write back to HBase
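
The simple streaming analyses amount to one atomic HBase increment per event. A minimal sketch, assuming a hypothetical Event class, table, and column family; the real consumer (queue plumbing, batching, error handling) is UA-proprietary and not shown:

import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamingCounterSketch {
  // Hypothetical event shape: which app, what happened, and when
  static class Event { String appId; String type; long timestampSecs; }

  // Called for each batch of events pulled off the queue
  static void countEvents(HTable counters, List<Event> batch) throws Exception {
    byte[] colFam = Bytes.toBytes("c");                      // hypothetical column family
    for (Event e : batch) {
      long hourBucket = (e.timestampSecs / 3600) * 3600;     // time slice: round down to the hour
      byte[] rowKey = Bytes.toBytes(e.appId + ":" + hourBucket);
      byte[] qualifier = Bytes.toBytes(e.type + "_COUNT");   // e.g. OPENS_COUNT, SENDS_COUNT
      // Atomic server-side increment: no client-side read-modify-write needed
      counters.incrementColumnValue(rowKey, colFam, qualifier, 1L);
    }
  }
}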
HBase data model

•   The abstraction offered by HBase for reading and writing
•   As useful as possible without limiting scalability too much
•   Data is in rows, rows are in tables, ordered by row key


      myApp:1335139200       OPENS_COUNT: 3987 SENDS_COUNT: 28832

      myApp:1335142800       OPENS_COUNT: 4230 SENDS_COUNT: 38990



       (not shown: column families)
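
Because rows are sorted by key, a time slice of counters for one app is a short scan over adjacent rows. A minimal sketch of reading the two rows above, assuming a hypothetical table name ("counters") and column family ("c"), not UA's actual schema:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeSliceScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable counters = new HTable(conf, "counters");            // hypothetical table name
    byte[] colFam = Bytes.toBytes("c");                         // hypothetical column family
    // Row keys sort lexicographically, so all hour buckets for "myApp" are adjacent
    Scan scan = new Scan(Bytes.toBytes("myApp:1335139200"),     // start row, inclusive
                         Bytes.toBytes("myApp:1335142801"));    // stop row, exclusive
    ResultScanner scanner = counters.getScanner(scan);
    for (Result row : scanner) {
      // Values written by HBase increments are stored as 8-byte longs
      long opens = Bytes.toLong(row.getValue(colFam, Bytes.toBytes("OPENS_COUNT")));
      long sends = Bytes.toLong(row.getValue(colFam, Bytes.toBytes("SENDS_COUNT")));
      System.out.println(Bytes.toString(row.getRow()) + " opens=" + opens + " sends=" + sends);
    }
    scanner.close();
    counters.close();
  }
}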
The HBase data model, cont.

•   This is a nested map/dictionary
•   Scannable in lexicographic key order
•   Interface is very simple:
      •   get, put, delete, scan, increment
•   Bytes only

      {“myRowKey1”: {
         “myColFam”: {
            “myQualifierX”: “foo”,
            “myQualifierY”: “bar”}},
       “rowKey2”: {
         “myColFam”: {
            “myQualifierA”: “baz”,
            “myQualifierB”: “”}}}
HBase API example

byte[] firstNameQualifier = "fname".getBytes();
byte[] lastNameQualifier = "lname".getBytes();
byte[] personalInfoColFam = "personalInfo".getBytes();

HTable hTable = new HTable("users");
Put put = new Put("dave".getBytes());
put.add(personalInfoColFam, firstNameQualifier, "Dave".getBytes());
put.add(personalInfoColFam, lastNameQualifier, "Revell".getBytes());
hTable.put(put);
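
Reading the row back is symmetric. A small follow-on sketch that continues the snippet above (same hTable, column family, and qualifiers, using the same old-style HTable API):

Get get = new Get("dave".getBytes());
get.addFamily(personalInfoColFam);
Result result = hTable.get(get);
// getValue returns the raw bytes stored under (family, qualifier)
String firstName = new String(result.getValue(personalInfoColFam, firstNameQualifier));
String lastName = new String(result.getValue(personalInfoColFam, lastNameQualifier));
// firstName is "Dave", lastName is "Revell"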
How to not fail at HBase

•   Things you should have done initially, but now it’s too late
    and you’re irretrievably screwed
      •   Keep table count and column family count low
      •   Keep rows narrow, use compound keys (key sketch after this list)
      •   Scale by adding more rows
      •   Tune your flush threshold and memstore sizes
      •   It’s OK to store complex objects as Protobuf/Thrift/etc.
      •   Always try for sequential IO over random IO
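
One way to build the compound keys mentioned above is to concatenate fixed-width components with Bytes.add, so that the byte-wise sort order matches the scan order you want. A sketch with hypothetical key components, not UA's actual key scheme:

import org.apache.hadoop.hbase.util.Bytes;

public class CompoundKeySketch {
  // One narrow row per (app, hour bucket) instead of one wide row per app:
  // scale comes from adding rows, and one app's rows stay contiguous on disk.
  // Fixed-width big-endian encodings keep lexicographic order equal to numeric
  // order (for non-negative values), so range scans over hour buckets are sequential IO.
  static byte[] makeRowKey(int appId, long hourBucket) {
    return Bytes.add(Bytes.toBytes(appId),        // 4 bytes
                     Bytes.toBytes(hourBucket));  // 8 bytes
  }
}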
MapReduce, briefly
•   The original use case for Hadoop
•   Mappers take in a large data set and send (key, value) pairs to
    reducers. Reducers aggregate input pairs and generate output.
    (Skeleton sketch below.)

                (diagram: input data items fan out to mappers; mapper output
                 pairs are grouped by key and fed to reducers, which write
                 the final output)
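
For reference, the skeleton of a MapReduce job in Java looks roughly like this: a generic sketch that counts events per app from text input, not one of UA's actual jobs (the driver class that configures and submits the job is omitted):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EventCountSketch {
  // Map: emit one (appId, 1) pair per input line, assumed to look like "appId,eventType,timestamp"
  public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String appId = line.toString().split(",")[0];
      ctx.write(new Text(appId), ONE);
    }
  }

  // Reduce: the framework groups pairs by key, so just sum the counts for each appId
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text appId, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      ctx.write(appId, new LongWritable(total));
    }
  }
}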
MapReduce issues

•   Hard to process incrementally (efficiently)
•   Hard to achieve low latency
•   Can’t have too many jobs
•   Requires elaborate workflow automation


•   Urban Airship uses MapReduce over HBase data for:
      •   Ad-hoc analysis
      •   Monthly billing
Live demo




 (Jump to web browser for HBase and MR status pages)
Batch processing at UA

•   Quartz scheduler, distributed over 3 nodes
      •   Time-in-app, audience count, conversions


•   General pattern
      •   Arriving events set a low water mark for their app
      •   Batch jobs reprocess events starting at the low water
          mark
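
A rough sketch of the low-water-mark pattern, with hypothetical table and column names rather than UA's actual schema: the streaming path records the earliest unprocessed event time per app, and the batch job later reprocesses from that point forward before advancing the mark.

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class LowWaterMarkSketch {
  static final byte[] FAM = Bytes.toBytes("m");                 // hypothetical column family
  static final byte[] LWM = Bytes.toBytes("low_water_mark");    // hypothetical qualifier

  // Streaming path: remember the earliest event time not yet covered by a batch run.
  // (A real implementation would use checkAndPut to make this race-free.)
  static void onEventArrived(HTable marks, String appId, long eventTimeSecs) throws Exception {
    Result r = marks.get(new Get(Bytes.toBytes(appId)));
    byte[] current = r.getValue(FAM, LWM);
    if (current == null || eventTimeSecs < Bytes.toLong(current)) {
      Put put = new Put(Bytes.toBytes(appId));
      put.add(FAM, LWM, Bytes.toBytes(eventTimeSecs));
      marks.put(put);
    }
  }

  // Batch path: recompute derived data from the mark forward, then advance the mark
  static void runBatchForApp(HTable marks, String appId, long nowSecs) throws Exception {
    Result r = marks.get(new Get(Bytes.toBytes(appId)));
    byte[] mark = r.getValue(FAM, LWM);
    if (mark == null) return;                                   // nothing new since the last run
    long from = Bytes.toLong(mark);
    // ... scan the event rows for appId from 'from' up to 'nowSecs' and rebuild derived data ...
    Put put = new Put(Bytes.toBytes(appId));
    put.add(FAM, LWM, Bytes.toBytes(nowSecs));
    marks.put(put);
  }
}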
Strengths

•   Uptime
      •   We know all the ways to crash by now
•   Schema design, throughput, and scaling
      •   There are many subtle mistakes to avoid
•   Writing custom tools (statshtable, hbackup, gclogtailer)
•   “Real time most of the time”
Weaknesses of our design


•   Shipping features quickly
•   Hardware efficiency
•   Infrastructure automation
•   Writing custom tools, getting bogged down at low levels,
    leaky abstractions
•   Serious operational Java skills required
Reading



•   Hadoop: The Definitive Guide by Tom White
•   HBase: The Definitive Guide by Lars George
•   http://hbase.apache.org/book.html
Questions?




•   #hbase on Freenode
•   hbase-dev, hbase-user Apache mailing lists
