HBase and Hadoop at Urban Airship

An introduction to HBase and Hadoop and how they're being used at scale by Urban Airship's core data team.

HBase and Hadoop at Urban Airship
April 25, 2012
Dave Revell
dave@urbanairship.com
@dave_revell

Who are we?
• Who am I?
  • Airshipper for 10 months, Hadoop user for 1.5 years
  • Database Engineer on the Core Data team: we collect events from mobile devices and create reports
• What is Urban Airship?
  • SaaS for mobile developers. Features that devs shouldn’t build themselves.
  • Mostly push notifications
  • No airships :(

Goals
• “Near real time” reporting
  • Counters: messages sent and received, app opens, in various time slices
  • More complex analyses: time-in-app, uniques, conversions
• Scale
  • Billions of “events” per month, ~100 bytes each
  • 40 billion events so far, looking exponential
  • Event arrival rate varies wildly, ~10K/sec (?)

Enter Hadoop
• An Apache project with HDFS, MapReduce, and Common
  • Open source, Apache license
• In common usage: platform, framework, ecosystem
  • HBase, Hive, Pig, ZooKeeper, Mahout, Oozie ...
• It’s in Java
• History: early 2000s, originally a clone of Google’s GFS and MapReduce

Enter HBase
• HBase is a database that uses HDFS for storage
• Based on Google’s BigTable. Not relational or SQL.
• Solves the problem “how do I query my Hadoop data?”
  • Operations typically take a few milliseconds
  • MapReduce is not suitable for real-time queries
• Scales well by adding servers (if you do everything right)
• Not highly available or multi-datacenter

UA’s basic architecture

[Architecture diagram] Events in: mobile devices → queue (Kafka) → HBase on HDFS.
Reports out: HBase → web service → reports user.
(Not shown: analysis code that reads events from HBase and puts derived data back into HBase.)

Analyzing events
• Queue of incoming events
  • Absorbs traffic spikes
  • Partially decouples the database from the internet
  • Pub/sub, groups of consumers share work
• UA proprietary Java code
  • Consumes the event queue
  • Does simple streaming analyses (counters)
  • Stages data in HBase tables for more complex analyses that come later
• Incremental batch jobs
  • Calculations that are difficult or inefficient to compute as data streams through
  • Read from HBase, write back to HBase

HBase data model
• The abstraction offered by HBase for reading and writing
• As useful as possible without limiting scalability too much
• Data is in rows, rows are in tables, ordered by row key

  myApp:1335139200    OPENS_COUNT: 3987    SENDS_COUNT: 28832
  myApp:1335142800    OPENS_COUNT: 4230    SENDS_COUNT: 38990

  (not shown: column families)

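The compound row keys above combine an app name with the start of a time slice, so an app’s counter rows sort together and can be read with one sequential scan. A minimal sketch of how such keys might be built, assuming hour-sized slices and a ":"-separated format (illustrative assumptions, not necessarily UA’s exact scheme):

```java
// Sketch: compound row keys like "myApp:1335139200", built by rounding
// an event timestamp down to its hour slice. Key format and slice size
// are illustrative assumptions.
public class RowKeys {
    static final long HOUR_SECONDS = 3600;

    // Round a Unix timestamp (seconds) down to the start of its hour.
    public static long hourSlice(long epochSeconds) {
        return (epochSeconds / HOUR_SECONDS) * HOUR_SECONDS;
    }

    // App name first, then the slice start: all of one app's hourly
    // counter rows are adjacent in the table's sort order.
    public static String rowKey(String appName, long epochSeconds) {
        return appName + ":" + hourSlice(epochSeconds);
    }
}
```

Putting the app name first is what makes "scale by adding more rows" and sequential IO work together: each new hour adds a row, and a per-app report is a short contiguous scan.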
The HBase data model, cont.
• This is a nested map/dictionary
• Scannable in lexicographic key order
• Interface is very simple: get, put, delete, scan, increment
• Bytes only

  {"myRowKey1": {
     "myColFam": {
       "myQualifierX": "foo",
       "myQualifierY": "bar"}},
   "rowKey2": {
     "myColFam": {
       "myQualifierA": "baz",
       "myQualifierB": ""}}}

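The nested-map abstraction can be modeled in a few lines of plain Java. This toy uses strings for readability where real HBase deals only in bytes, and all names here are illustrative:

```java
import java.util.*;

// Toy model of the nested map above: row key -> column family ->
// qualifier -> value, with rows kept in lexicographic key order.
public class ToyTable {
    private final NavigableMap<String, Map<String, Map<String, String>>> rows =
            new TreeMap<>();

    public void put(String row, String family, String qualifier, String value) {
        rows.computeIfAbsent(row, r -> new HashMap<>())
            .computeIfAbsent(family, f -> new HashMap<>())
            .put(qualifier, value);
    }

    // Returns null when the cell doesn't exist, like an empty Result.
    public String get(String row, String family, String qualifier) {
        return rows.getOrDefault(row, Map.of())
                   .getOrDefault(family, Map.of())
                   .get(qualifier);
    }

    // Scan: row keys come back in lexicographic order.
    public List<String> scanKeys() {
        return new ArrayList<>(rows.keySet());
    }
}
```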
HBase API example

byte[] firstNameQualifier = "fname".getBytes();
byte[] lastNameQualifier = "lname".getBytes();
byte[] personalInfoColFam = "personalInfo".getBytes();

HTable hTable = new HTable("users");
Put put = new Put("dave".getBytes());
put.add(personalInfoColFam, firstNameQualifier, "Dave".getBytes());
put.add(personalInfoColFam, lastNameQualifier, "Revell".getBytes());
hTable.put(put);

How to not fail at HBase
• Things you should have done initially, but now it’s too late and you’re irretrievably screwed
  • Keep table count and column family count low
  • Keep rows narrow, use compound keys
  • Scale by adding more rows
  • Tune your flush threshold and memstore sizes
  • It’s OK to store complex objects as Protobuf/Thrift/etc.
  • Always try for sequential IO over random IO

MapReduce, briefly
• The original use case for Hadoop
• Mappers take in a large data set and send (key, value) pairs to reducers. Reducers aggregate input pairs and generate output.

  [Diagram] My input data items → four Mappers → two Reducers → Output

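The map → shuffle → reduce flow above can be illustrated with a toy word count in plain Java, with no Hadoop dependency. Class and method names are made up for the sketch; real Hadoop jobs subclass Mapper/Reducer and run distributed:

```java
import java.util.*;

// Toy MapReduce: the mapper emits (word, 1) pairs and the "shuffle"
// groups them by key before the reducer sums each group.
public class ToyMapReduce {
    // Mapper: one input line -> (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Shuffle + reduce: group pairs by key, then aggregate (here: sum).
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
        }
        return counts;
    }
}
```

In a real job the shuffle happens over the network between many mapper and reducer processes, which is where the latency and incremental-processing pain on the next slide comes from.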
MapReduce issues
• Hard to process incrementally (efficiently)
• Hard to achieve low latency
• Can’t have too many jobs
• Requires elaborate workflow automation
• Urban Airship uses MapReduce over HBase data for:
  • Ad-hoc analysis
  • Monthly billing

Live demo
(Jump to web browser for HBase and MR status pages)

Batch processing at UA
• Quartz scheduler, distributed over 3 nodes
  • Time-in-app, audience count, conversions
• General pattern
  • Arriving events set a low water mark for their app
  • Batch jobs reprocess events starting at the low water mark

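The low-water-mark pattern above might be sketched like this. The class and method names are hypothetical, and real code would persist the marks durably (e.g. in HBase) rather than in memory:

```java
import java.util.*;

// Sketch of the low-water-mark pattern: each arriving event lowers
// (never raises) a per-app mark, and the batch job reprocesses from
// that mark, clearing it so the next run starts fresh.
public class LowWaterMarks {
    private final Map<String, Long> marks = new HashMap<>();

    // Called on event arrival: remember the earliest unprocessed timestamp.
    public void recordEvent(String app, long eventTimestamp) {
        marks.merge(app, eventTimestamp, Math::min);
    }

    // Called by the batch job: where to start reprocessing for this app,
    // or empty if nothing has arrived since the last run.
    public Optional<Long> checkpoint(String app) {
        return Optional.ofNullable(marks.remove(app));
    }
}
```

Keeping only the minimum timestamp per app means late-arriving events are handled by simply re-running the analysis over a slightly larger window, rather than by tracking every event individually.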
Strengths
• Uptime
  • We know all the ways to crash by now
• Schema design, throughput, and scaling
  • There are many subtle mistakes to avoid
• Writing custom tools (statshtable, hbackup, gclogtailer)
• “Real time most of the time”

Weaknesses of our design
• Shipping features quickly
• Hardware efficiency
• Infrastructure automation
• Writing custom tools, getting bogged down at low levels, leaky abstractions
• Serious operational Java skills required

Reading
• Hadoop: The Definitive Guide by Tom White
• HBase: The Definitive Guide by Lars George
• http://hbase.apache.org/book.html

Questions?
• #hbase on Freenode
• hbase-dev, hbase-user Apache mailing lists