HBase and Hadoop at Urban Airship
An introduction to HBase and Hadoop and how they're being used at scale by Urban Airship's core data team.


HBase and Hadoop at Urban Airship: Presentation Transcript

  • HBase and Hadoop at Urban Airship. April 25, 2012. Dave Revell, dave@urbanairship.com, @dave_revell
  • Who are we?
      • Who am I?
          • Airshipper for 10 months, Hadoop user for 1.5 years
          • Database Engineer on Core Data team: we collect events from mobile devices and create reports
      • What is Urban Airship?
          • SaaS for mobile developers. Features that devs shouldn’t build themselves.
          • Mostly push notifications
          • No airships :(
  • Goals
      • “Near real time” reporting
          • Counters: messages sent and received, app opens, in various time slices
          • More complex analyses: time-in-app, uniques, conversions
      • Scale
          • Billions of “events” per month, ~100 bytes each
          • 40 billion events so far, looking exponential
          • Event arrival rate varies wildly, ~10K/sec (?)
  • Enter Hadoop
      • An Apache project with HDFS, MapReduce, and Common
          • Open source, Apache license
      • In common usage: platform, framework, ecosystem
          • HBase, Hive, Pig, ZooKeeper, Mahout, Oozie, ...
      • It’s in Java
      • History: early 2000s, originally a clone of Google’s GFS and MapReduce
  • Enter HBase
      • HBase is a database that uses HDFS for storage
      • Based on Google’s BigTable. Not relational or SQL.
      • Solves the problem “how do I query my Hadoop data?”
          • Operations typically take a few milliseconds
          • MapReduce is not suitable for real time queries
      • Scales well by adding servers (if you do everything right)
      • Not highly-available or multi-datacenter
  • UA’s basic architecture (diagram): events in from mobile devices through a queue (Kafka) into HBase on HDFS; reports out to the reports user through a web service. (Not shown: analysis code that reads events from HBase and puts derived data back into HBase.)
  • Analyzing events
      • Queue of incoming events
          • Absorbs traffic spikes
          • Partially decouples database from internet
          • Pub/sub, groups of consumers share work
      • UA proprietary Java code
          • Consumes event queue
          • Does simple streaming analyses (counters; see the sketch after this slide)
          • Stages data in HBase tables for more complex analyses that come later
      • Incremental batch jobs
          • Calculations that are difficult or inefficient to compute as data streams through
          • Read from HBase, write back to HBase
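    A minimal sketch (not UA’s code) of what the streaming “counters” step could look like against HBase. The table name “app_hourly_counts”, column family “c”, and qualifier “OPENS_COUNT” are hypothetical; the client calls follow the same API style as the deck’s own example later on.

        import org.apache.hadoop.hbase.client.HTable;

        public class CounterSketch {
            public static void main(String[] args) throws Exception {
                HTable counts = new HTable("app_hourly_counts");  // hypothetical table name

                // Row key = appId + ":" + hour bucket (epoch seconds rounded down to the hour)
                long hourBucket = (System.currentTimeMillis() / 1000 / 3600) * 3600;
                byte[] rowKey = ("myApp:" + hourBucket).getBytes();

                // Atomically add 1 to the OPENS_COUNT cell; HBase creates the cell if it is absent.
                counts.incrementColumnValue(rowKey, "c".getBytes(), "OPENS_COUNT".getBytes(), 1L);
                counts.close();
            }
        }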
  • HBase data model
      • The abstraction offered by HBase for reading and writing
      • As useful as possible without limiting scalability too much
      • Data is in rows, rows are in tables, ordered by row key

            myApp:1335139200    OPENS_COUNT: 3987    SENDS_COUNT: 28832
            myApp:1335142800    OPENS_COUNT: 4230    SENDS_COUNT: 38990

        (not shown: column families)
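    Because rows are ordered by key, reading one app’s counters for a time range is a single sequential scan between two keys. A minimal sketch under the row-key layout shown above (appId:epochSeconds); the table name “app_hourly_counts” and column family “c” are hypothetical.

        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.util.Bytes;

        public class ScanSketch {
            public static void main(String[] args) throws Exception {
                HTable counts = new HTable("app_hourly_counts");     // hypothetical table name
                Scan scan = new Scan("myApp:1335139200".getBytes(),  // inclusive start row
                                     "myApp:1335225600".getBytes()); // exclusive stop row
                ResultScanner scanner = counts.getScanner(scan);
                for (Result row : scanner) {
                    byte[] opens = row.getValue("c".getBytes(), "OPENS_COUNT".getBytes());
                    if (opens != null) {
                        System.out.println(Bytes.toString(row.getRow()) + " opens=" + Bytes.toLong(opens));
                    }
                }
                scanner.close();
                counts.close();
            }
        }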
  • The HBase data model, cont.
      • This is a nested map/dictionary
      • Scannable in lexicographic key order
      • Interface is very simple: get, put, delete, scan, increment
      • Bytes only

            {“myRowKey1”: {
                “myColFam”: {
                    “myQualifierX”: “foo”,
                    “myQualifierY”: “bar”}},
             “rowKey2”: {
                “myColFam”: {
                    “myQualifierA”: “baz”,
                    “myQualifierB”: “”}}}
  • HBase API example

        byte[] firstNameQualifier = "fname".getBytes();
        byte[] lastNameQualifier = "lname".getBytes();
        byte[] personalInfoColFam = "personalInfo".getBytes();
        HTable hTable = new HTable("users");
        Put put = new Put("dave".getBytes());
        put.add(personalInfoColFam, firstNameQualifier, "Dave".getBytes());
        put.add(personalInfoColFam, lastNameQualifier, "Revell".getBytes());
        hTable.put(put);
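    A companion sketch, not from the deck: reading the same row back with a Get, in the same API style as the slide’s Put example.

        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;

        public class GetSketch {
            public static void main(String[] args) throws Exception {
                HTable hTable = new HTable("users");
                Get get = new Get("dave".getBytes());
                Result result = hTable.get(get);
                byte[] firstName = result.getValue("personalInfo".getBytes(), "fname".getBytes());
                System.out.println(new String(firstName));  // prints "Dave" if the Put above ran
                hTable.close();
            }
        }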
  • How to not fail at HBase
      • Things you should have done initially, but now it’s too late and you’re irretrievably screwed:
          • Keep table count and column family count low
          • Keep rows narrow, use compound keys (sketched below)
          • Scale by adding more rows
          • Tune your flush threshold and memstore sizes
          • It’s OK to store complex objects as Protobuf/Thrift/etc.
          • Always try for sequential IO over random IO
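    One possible shape of a compound key, for illustration only (the exact layout is an assumption, not UA’s schema): pack the dimensions into the row key so that related rows sort together and a time range becomes one sequential scan.

        import org.apache.hadoop.hbase.util.Bytes;

        public class RowKeySketch {
            // Compound row key: appId + ":" + epoch seconds, matching the myApp:1335139200
            // style shown earlier in the deck.
            static byte[] makeRowKey(String appId, long epochSeconds) {
                // ASCII decimal works here because 10-digit epoch seconds are fixed-width;
                // a fixed-width big-endian long (Bytes.toBytes(long)) is the safer general choice.
                return Bytes.toBytes(appId + ":" + epochSeconds);
            }

            public static void main(String[] args) {
                System.out.println(Bytes.toStringBinary(makeRowKey("myApp", 1335139200L)));
            }
        }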
  • MapReduce, briefly
      • The original use case for Hadoop
      • Mappers take in a large data set and send (key, value) pairs to reducers. Reducers aggregate input pairs and generate output.
      • (Diagram: input data items fan out to several mappers, the mappers feed reducers, and the reducers write the output)
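    A generic sketch of that mapper/reducer shape (not UA’s job): the mapper emits (key, 1) for each input record and the reducer sums the counts per key. The CSV layout with an app ID in the first field is an assumption for illustration.

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class CountPerKey {
            public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
                private static final LongWritable ONE = new LongWritable(1);
                @Override
                protected void map(LongWritable offset, Text line, Context ctx)
                        throws IOException, InterruptedException {
                    String appId = line.toString().split(",")[0];  // assumed: appId is the first CSV field
                    ctx.write(new Text(appId), ONE);
                }
            }

            public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
                @Override
                protected void reduce(Text appId, Iterable<LongWritable> counts, Context ctx)
                        throws IOException, InterruptedException {
                    long sum = 0;
                    for (LongWritable c : counts) {
                        sum += c.get();
                    }
                    ctx.write(appId, new LongWritable(sum));
                }
            }
        }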
  • MapReduce issues
      • Hard to process incrementally (efficiently)
      • Hard to achieve low latency
      • Can’t have too many jobs
      • Requires elaborate workflow automation
      • Urban Airship uses MapReduce over HBase data for (job wiring sketched below):
          • Ad-hoc analysis
          • Monthly billing
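    A sketch of how a MapReduce job over an HBase table is typically wired up with the standard HBase MapReduce glue (TableMapReduceUtil). The “events” table, the row-key parsing, and the output path are hypothetical, not UA’s actual job.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
        import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
        import org.apache.hadoop.hbase.mapreduce.TableMapper;
        import org.apache.hadoop.hbase.util.Bytes;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class HBaseJobSketch {
            // Receives one HBase row per map() call; emits (appId, 1).
            public static class RowMapper extends TableMapper<Text, LongWritable> {
                @Override
                protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
                        throws IOException, InterruptedException {
                    String key = Bytes.toString(rowKey.get(), rowKey.getOffset(), rowKey.getLength());
                    ctx.write(new Text(key.split(":")[0]), new LongWritable(1));  // assumes appId:timestamp keys
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                Job job = new Job(conf, "hbase-adhoc-sketch");
                job.setJarByClass(HBaseJobSketch.class);

                Scan scan = new Scan();
                scan.setCaching(500);        // fetch rows in batches while scanning
                scan.setCacheBlocks(false);  // don't pollute the region server block cache from MR scans

                TableMapReduceUtil.initTableMapperJob(
                    "events", scan, RowMapper.class, Text.class, LongWritable.class, job);

                job.setNumReduceTasks(0);    // map-only sketch: mapper output is written directly
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(LongWritable.class);
                FileOutputFormat.setOutputPath(job, new Path("/tmp/adhoc-output"));
                job.waitForCompletion(true);
            }
        }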
  • Live demo (Jump to web browser for HBase and MR status pages)
  • Batch processing at UA
      • Quartz scheduler, distributed over 3 nodes
          • Time-in-app, audience count, conversions
      • General pattern
          • Arriving events set a low water mark for their app
          • Batch jobs reprocess events starting at the low water mark
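    A rough sketch of that low-water-mark pattern (an interpretation, not UA’s implementation; the table, family, and qualifier names are hypothetical): the ingest side lowers the mark when an older event arrives, and the batch side reprocesses from the mark and then pushes it forward.

        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class LowWaterMarkSketch {
            static final byte[] FAM = "m".getBytes();    // hypothetical column family
            static final byte[] QUAL = "ts".getBytes();  // hypothetical qualifier

            // Ingest side: an arriving event lowers its app's mark if the event is older.
            static void onEventArrived(HTable marks, String appId, long eventTs) throws Exception {
                Result r = marks.get(new Get(appId.getBytes()));
                byte[] cell = r.getValue(FAM, QUAL);
                if (cell == null || eventTs < Bytes.toLong(cell)) {
                    Put p = new Put(appId.getBytes());
                    p.add(FAM, QUAL, Bytes.toBytes(eventTs));
                    marks.put(p);  // a real implementation would need checkAndPut or similar to avoid races
                }
            }

            // Batch side: reprocess events from the mark, then advance it.
            static void runBatch(HTable marks, String appId) throws Exception {
                Result r = marks.get(new Get(appId.getBytes()));
                byte[] cell = r.getValue(FAM, QUAL);
                long from = (cell == null) ? 0L : Bytes.toLong(cell);
                long now = System.currentTimeMillis() / 1000;
                // reprocessEvents(appId, from, now);  // hypothetical: scan appId:from..now event rows
                Put advance = new Put(appId.getBytes());
                advance.add(FAM, QUAL, Bytes.toBytes(now));
                marks.put(advance);
            }
        }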
  • Strengths
      • Uptime
          • We know all the ways to crash by now
      • Schema design, throughput, and scaling
          • There are many subtle mistakes to avoid
      • Writing custom tools (statshtable, hbackup, gclogtailer)
      • “Real time most of the time”
  • Weaknesses of our design
      • Shipping features quickly
      • Hardware efficiency
      • Infrastructure automation
      • Writing custom tools, getting bogged down at low levels, leaky abstractions
      • Serious operational Java skills required
  • Reading
      • Hadoop: The Definitive Guide by Tom White
      • HBase: The Definitive Guide by Lars George
      • http://hbase.apache.org/book.html
  • Questions?
      • #hbase on Freenode
      • hbase-dev, hbase-user Apache mailing lists