HBase and
Hadoop at
Urban Airship
April 25, 2012
                           Dave Revell
                 dave@urbanairship.com
                          @dave_revell
Who are we?

•   Who am I?
     •   Airshipper for 10 months, Hadoop user for 1.5 years
     •   Database Engineer on Core Data team: we collect
         events from mobile devices and create reports
•   What is Urban Airship?
     •   SaaS for mobile developers. Features that devs
         shouldn’t build themselves.
     •   Mostly push notifications
     •   No airships :(
Goals
•   “Near real time” reporting
      •   Counters: messages sent and received, app opens, in
          various time slices
      •   More complex analyses: time-in-app, uniques,
          conversions
•   Scale
      •   Billions of “events” per month, ~100 bytes each
      •   40 billion events so far, looking exponential.
      •   Event arrival rate varies wildly, ~10K/sec (?)
Enter Hadoop

•   An Apache project with HDFS, MapReduce, and Common
      •   Open source, Apache license
•   In common usage: platform, framework, ecosystem
      •   HBase, Hive, Pig, ZooKeeper, Mahout, Oozie ....
•   It’s in Java
•   History: early 2000s, originally a clone of Google’s GFS and
    MapReduce
Enter HBase

•   HBase is a database that uses HDFS for storage
•   Based on Google’s BigTable. Not relational or SQL.
•   Solves the problem “how do I query my Hadoop data?”
      •   Operations typically take a few milliseconds
      •   MapReduce is not suitable for real time queries
•   Scales well by adding servers (if you do everything right)
•   Not highly-available or multi-datacenter
UA’s basic architecture
   Events in                                     Reports out
            Mobile devices               Reports user

             Queue (Kafka)               Web service

                                 HBase

                                  HDFS



  (not shown: analysis code that reads
  events from HBase and puts derived
         data back into HBase)
Analyzing events

Queue of incoming events
•   Absorbs traffic spikes
•   Partially decouples database from internet
•   Pub/sub, groups of consumers share work

UA proprietary Java code
•   Consumes event queue
•   Does simple streaming analyses (counters)
•   Stages data in HBase tables for more complex analyses
    that come later

Incremental batch jobs
•   Calculations that are difficult or inefficient to compute
    as data streams through
•   Read from HBase, write back to HBase
HBase data model

•   The abstraction offered by HBase for reading and writing
•   As useful as possible without limiting scalability too much
•   Data is in rows, rows are in tables, ordered by row key


      myApp:1335139200       OPENS_COUNT: 3987 SENDS_COUNT: 28832

      myApp:1335142800       OPENS_COUNT: 4230 SENDS_COUNT: 38990



       (not shown: column families)
The HBase data model, cont.

•   This is a nested map/dictionary
•   Scannable in lexicographic key order
•   Interface is very simple:
      •   get, put, delete, scan, increment
•   Bytes only

      {"myRowKey1": {
          "myColFam": {
              "myQualifierX": "foo",
              "myQualifierY": "bar"}},
       "rowKey2": {
          "myColFam": {
              "myQualifierA": "baz",
              "myQualifierB": ""}}}
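
The nested-map view above can be sketched in plain Java. This is only a toy model built on `TreeMap`, not HBase code: the class name `NestedMapModel` and its methods are made up for illustration, and `String` stands in for the raw byte arrays that HBase actually stores and compares.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of the HBase data model: row key -> column family -> qualifier -> value. */
public class NestedMapModel {
    // TreeMap keeps row keys sorted, which is what makes range scans possible.
    // Assumption: Strings stand in for byte[]; real HBase compares raw bytes.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> rows =
            new TreeMap<>();

    public void put(String row, String family, String qualifier, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(qualifier, value);
    }

    /** Returns the stored value, or null if the cell does not exist. */
    public String get(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, String>> families = rows.get(row);
        if (families == null) {
            return null;
        }
        NavigableMap<String, String> columns = families.get(family);
        return columns == null ? null : columns.get(qualifier);
    }

    public void delete(String row) {
        rows.remove(row);
    }

    /** Adds amount to a numeric cell, treating a missing cell as 0. */
    public long increment(String row, String family, String qualifier, long amount) {
        String current = get(row, family, qualifier);
        long updated = (current == null ? 0L : Long.parseLong(current)) + amount;
        put(row, family, qualifier, Long.toString(updated));
        return updated;
    }

    /** Returns rows with startRow <= key < stopRow, in key order. */
    public NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> scan(
            String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        NestedMapModel table = new NestedMapModel();
        table.put("myApp:1335139200", "counts", "OPENS_COUNT", "3987");
        table.put("myApp:1335142800", "counts", "OPENS_COUNT", "4230");
        table.increment("myApp:1335139200", "counts", "OPENS_COUNT", 1);
        // Scan every row whose key starts with "myApp:" (';' is the character after ':').
        System.out.println(table.scan("myApp:", "myApp;").keySet());
    }
}
```

Because the outer map is sorted, a scan over a key prefix is just a sub-range of the map, which is why row key design matters so much.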
HBase API example

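import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Row keys, column families, qualifiers and values are all raw bytes
byte[] firstNameQualifier = "fname".getBytes();
byte[] lastNameQualifier = "lname".getBytes();
byte[] personalInfoColFam = "personalInfo".getBytes();

// Client API as of 2012; reads cluster settings from hbase-site.xml
HTable hTable = new HTable(HBaseConfiguration.create(), "users");

// One Put writes several columns of the row "dave"
Put put = new Put("dave".getBytes());
put.add(personalInfoColFam, firstNameQualifier, "Dave".getBytes());
put.add(personalInfoColFam, lastNameQualifier, "Revell".getBytes());
hTable.put(put);
hTable.close();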
How to not fail at HBase

•   Things you should have done initially, but now it’s too late
    and you’re irretrievably screwed
      •   Keep table count and column family count low
      •   Keep rows narrow, use compound keys
      •   Scale by adding more rows
      •   Tune your flush threshold and memstore sizes
      •   It’s OK to store complex objects as Protobuf/Thrift/etc.
      •   Always try for sequential IO over random IO
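
As a rough illustration of "keep rows narrow, use compound keys", here is a sketch that builds row keys shaped like the `myApp:1335139200` rows shown earlier. The layout (app ID, a colon, then the start of the hour in epoch seconds) is inferred from those example rows; the class and method names are hypothetical, and the real UA key encoding is not given in the slides.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds compound row keys of the form appId:hourStart. */
public class CompoundKey {
    private static final long SECONDS_PER_HOUR = 3600L;

    /** Rounds the timestamp down to the start of its hour and appends it to the app ID. */
    public static String hourlyRowKey(String appId, long epochSeconds) {
        long hourStart = epochSeconds - (epochSeconds % SECONDS_PER_HOUR);
        return appId + ":" + hourStart;
    }

    /** HBase stores bytes only, so the key is encoded before use. */
    public static byte[] toBytes(String rowKey) {
        return rowKey.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two events in the same hour land in the same narrow row.
        System.out.println(hourlyRowKey("myApp", 1335139200L));
        System.out.println(hourlyRowKey("myApp", 1335140000L));
        // An event one hour later gets a new row instead of a wider one.
        System.out.println(hourlyRowKey("myApp", 1335142800L));
    }
}
```

With this layout, new time slices add rows rather than columns, and all rows for one app sit next to each other in key order. Note that decimal timestamps only sort chronologically while they have the same number of digits; a fixed-width binary encoding avoids that caveat.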
MapReduce, briefly
•   The original use case for Hadoop
•   Mappers take in large data set and send (key,value) pairs to
    reducers. Reducers aggregate input pairs and generate
    output.

                    My     input    data items

                  Mapper   Mapper   Mapper   Mapper



                    Reducer            Reducer

                    Output             Output
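
The mapper/reducer flow in the diagram can be imitated in a single process. The sketch below is not Hadoop code: it only mimics the map, group-by-key, and reduce steps in memory, and the `appId,eventType` input format is an assumption made up for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map -> shuffle -> reduce flow, counting events per app. */
public class MiniMapReduce {

    /** Map step: turn one input record into a (key, value) pair. */
    static Map.Entry<String, Integer> map(String event) {
        String appId = event.split(",")[0]; // assumed record format: "appId,eventType"
        return new AbstractMap.SimpleEntry<>(appId, 1);
    }

    /** Reduce step: aggregate all values that share a key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> events) {
        // Shuffle step: group mapper output by key, as the framework would.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String event : events) {
            Map.Entry<String, Integer> pair = map(event);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("myApp,OPEN", "myApp,SEND", "otherApp,OPEN")));
    }
}
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network, which is a large part of why latency is hard to keep low.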
MapReduce issues

•   Hard to process incrementally (efficiently)
•   Hard to achieve low latency
•   Can’t have too many jobs
•   Requires elaborate workflow automation


•   Urban Airship uses MapReduce over HBase data for:
      •   Ad-hoc analysis
      •   Monthly billing
Live demo




 (Jump to web browser for HBase and MR status pages)
Batch processing at UA

•   Quartz scheduler, distributed over 3 nodes
      •   Time-in-app, audience count, conversions


•   General pattern
      •   Each arriving event sets a low water mark for its app
      •   Batch jobs reprocess events starting at the low water
          mark
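
One possible reading of the low water mark pattern, sketched in Java. The slides do not say how the mark is stored or updated, so everything here is an assumption: the class `LowWaterMarks`, keeping the marks in memory, and the rule that the mark is the earliest event timestamp seen since the last batch run.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-app low water marks: the earliest event time not yet reprocessed. */
public class LowWaterMarks {
    private final Map<String, Long> marks = new ConcurrentHashMap<>();

    /** Called for each arriving event; only ever moves the app's mark earlier. */
    public void onEvent(String appId, long eventTimestamp) {
        marks.merge(appId, eventTimestamp, Long::min);
    }

    /**
     * Called by a batch job: removes and returns the app's mark, or null if no
     * events arrived since the last run. The job reprocesses from the returned time.
     */
    public Long takeMark(String appId) {
        return marks.remove(appId);
    }

    public static void main(String[] args) {
        LowWaterMarks marks = new LowWaterMarks();
        marks.onEvent("myApp", 1335142800L);
        marks.onEvent("myApp", 1335139200L); // a late event pulls the mark back
        System.out.println(marks.takeMark("myApp"));
        System.out.println(marks.takeMark("myApp"));
    }
}
```

The appeal of the pattern is that a batch job never has to scan all history: it starts at the mark, and apps with no new events are skipped entirely.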
Strengths

•   Uptime
      •   We know all the ways to crash by now
•   Schema design, throughput, and scaling
      •   There are many subtle mistakes to avoid
•   Writing custom tools (statshtable, hbackup, gclogtailer)
•   “Real time most of the time”
Weaknesses of our design


•   Shipping features quickly
•   Hardware efficiency
•   Infrastructure automation
•   Writing custom tools, getting bogged down at low levels,
    leaky abstractions
•   Serious operational Java skills required
Reading



•   Hadoop: The Definitive Guide by Tom White
•   HBase: The Definitive Guide by Lars George
•   http://hbase.apache.org/book.html
Questions?




•   #hbase on Freenode
•   hbase-dev, hbase-user Apache mailing lists

 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

HBase and Hadoop at Urban Airship

  • 1. HBase and Hadoop at Urban Airship April 25, 2012 Dave Revell dave@urbanairship.com @dave_revell
  • 2. Who are we? • Who am I? • Airshipper for 10 months, Hadoop user for 1.5 years • Database Engineer on Core Data team: we collect events from mobile devices and create reports • What is Urban Airship? • SaaS for mobile developers. Features that devs shouldn’t build themselves. • Mostly push notifications • No airships :(
  • 5. Goals • “Near real time” reporting • Counters: messages sent and received, app opens, in various time slices • More complex analyses: time-in-app, uniques, conversions • Scale • Billions of “events” per month, ~100 bytes each • 40 billion events so far, looking exponential. • Event arrival rate varies wildly, ~10K/sec (?)
  • 10. Enter Hadoop • An Apache project with HDFS, MapReduce, and Common • Open source, Apache license • In common usage: platform, framework, ecosystem • HBase, Hive, Pig, ZooKeeper, Mahout, Oozie .... • It’s in Java • History: early 2000s, originally a clone of Google’s GFS and MapReduce
  • 16. Enter HBase • HBase is a database that uses HDFS for storage • Based on Google’s BigTable. Not relational or SQL. • Solves the problem “how do I query my Hadoop data?” • Operations typically take a few milliseconds • MapReduce is not suitable for real time queries • Scales well by adding servers (if you do everything right) • Not highly-available or multi-datacenter
  • 17. UA’s basic architecture • [Diagram] Events in: mobile devices -> queue (Kafka) -> HBase, stored on HDFS • Reports out: HBase -> web service -> reports user • (not shown: analysis code that reads events from HBase and puts derived data back into HBase)
  • 18. Analyzing events • Queue of incoming events: absorbs traffic spikes; partially decouples database from internet; pub/sub, groups of consumers share work • UA proprietary Java code: consumes event queue; does simple streaming analyses (counters); stages data in HBase tables for more complex analyses that come later • Incremental batch jobs: calculations that are difficult or inefficient to compute as data streams through; read from HBase, write back to HBase
  • 24. HBase data model • The abstraction offered by HBase for reading and writing • As useful as possible without limiting scalability too much • Data is in rows, rows are in tables, ordered by row key • Example rows: myApp:1335139200 -> OPENS_COUNT: 3987, SENDS_COUNT: 28832; myApp:1335142800 -> OPENS_COUNT: 4230, SENDS_COUNT: 38990 • (not shown: column families)
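The two example row keys on slide 24 are 3600 seconds apart and both fall on an hour boundary, which suggests counters bucketed per app and per hour. The following is a minimal sketch under that assumption; `HourlyCounters`, `rowKey`, and the in-memory `TreeMap` are hypothetical stand-ins for illustration and are not the HBase client API or Urban Airship's code.

```java
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a counters table; NOT the HBase client API. */
class HourlyCounters {
    static final long SECONDS_PER_HOUR = 3600L;

    // Row key -> (column name -> count). TreeMap keeps rows sorted by key.
    private final TreeMap<String, Map<String, Long>> rows = new TreeMap<>();

    /** Builds "app:hourStart", where hourStart is the event time rounded down to the hour. */
    static String rowKey(String app, long epochSeconds) {
        long hourStart = epochSeconds - (epochSeconds % SECONDS_PER_HOUR);
        return app + ":" + hourStart;
    }

    /** Adds one to the named counter in the row for this app and hour. */
    void increment(String app, long epochSeconds, String column) {
        rows.computeIfAbsent(rowKey(app, epochSeconds), k -> new TreeMap<>())
            .merge(column, 1L, Long::sum);
    }

    /** Reads a counter, returning 0 when the row or column does not exist. */
    long get(String app, long epochSeconds, String column) {
        Map<String, Long> row = rows.get(rowKey(app, epochSeconds));
        return row == null ? 0L : row.getOrDefault(column, 0L);
    }

    public static void main(String[] args) {
        HourlyCounters counters = new HourlyCounters();
        counters.increment("myApp", 1335139200L, "OPENS_COUNT");
        counters.increment("myApp", 1335139260L, "OPENS_COUNT"); // same hour bucket
        counters.increment("myApp", 1335142800L, "OPENS_COUNT"); // next hour bucket
        System.out.println(rowKey("myApp", 1335139260L));
        System.out.println(counters.get("myApp", 1335139200L, "OPENS_COUNT"));
    }
}
```

Because the bucket start is part of the row key, each new hour adds rows rather than widening existing ones.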
  • 25. The HBase data model, cont. • This is a nested map/dictionary • Scannable in lexicographic key order • Interface is very simple: get, put, delete, scan, increment • Bytes only • Example:
        {"myRowKey1": {
            "myColFam": {
                "myQualifierX": "foo",
                "myQualifierY": "bar"}},
         "rowKey2": {
            "myColFam": {
                "myQualifierA": "baz",
                "myQualifierB": ""}}}
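The nested-map view on slide 25 can be modeled with sorted maps. This is a rough sketch only: `NestedMapTable` is a hypothetical class, it covers put, get, delete, and scan but not increment, and it uses `String` keys for readability where HBase uses raw bytes (the two orderings agree for ASCII keys).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the row -> column family -> qualifier -> value map; NOT the HBase API. */
class NestedMapTable {
    // TreeMap keeps row keys sorted, which is what makes range scans cheap.
    private final TreeMap<String, TreeMap<String, TreeMap<String, byte[]>>> rows = new TreeMap<>();

    /** Stores one cell, creating the row and column family maps as needed. */
    void put(String row, String family, String qualifier, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(qualifier, value);
    }

    /** Returns one cell's bytes, or null if the row, family, or qualifier is absent. */
    byte[] get(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<String, byte[]>> families = rows.get(row);
        if (families == null) {
            return null;
        }
        TreeMap<String, byte[]> columns = families.get(family);
        return columns == null ? null : columns.get(qualifier);
    }

    /** Removes a whole row. */
    void delete(String row) {
        rows.remove(row);
    }

    /** Returns the row keys in [startRow, stopRow), in sorted key order. */
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }
}
```

A scan is just a walk over a contiguous slice of the sorted row keys, which is why row key design decides which queries are efficient.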
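  • 26. HBase API example
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        // Qualifiers and column family as raw bytes (HBase stores bytes only)
        byte[] firstNameQualifier = "fname".getBytes();
        byte[] lastNameQualifier = "lname".getBytes();
        byte[] personalInfoColFam = "personalInfo".getBytes();
        // Open the table; this and put() can throw IOException
        HTable hTable = new HTable("users");
        // One Put per row key; add(family, qualifier, value) sets one cell
        Put put = new Put("dave".getBytes());
        put.add(personalInfoColFam, firstNameQualifier, "Dave".getBytes());
        put.add(personalInfoColFam, lastNameQualifier, "Revell".getBytes());
        hTable.put(put);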
  • 28. How to not fail at HBase • Things you should have done initially, but now it’s too late and you’re irretrievably screwed • Keep table count and column family count low • Keep rows narrow, use compound keys • Scale by adding more rows • Tune your flush threshold and memstore sizes • It’s OK to store complex objects as Protobuf/Thrift/etc. • Always try for sequential IO over random IO
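Slide 28's advice to use compound keys interacts with the byte-wise key ordering from the data model slides. The sketch below shows one possible encoding; the layout (UTF-8 app name, a 0x00 separator, an 8-byte big-endian timestamp) and the names `CompoundKeys`, `key`, and `compare` are assumptions for illustration, not the deck's actual schema. It assumes non-negative timestamps and app names that contain no 0x00 byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical compound row key encoding; not taken from the slides. */
class CompoundKeys {
    /** Encodes app name + 0x00 separator + fixed-width big-endian timestamp. */
    static byte[] key(String app, long timestamp) {
        byte[] appBytes = app.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(appBytes.length + 1 + Long.BYTES)
                .put(appBytes)
                .put((byte) 0)
                .putLong(timestamp) // ByteBuffer writes big-endian by default
                .array();
    }

    /** Unsigned lexicographic comparison, the order in which a byte-sorted store scans keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With a fixed-width binary timestamp, byte order matches numeric order, so all rows for one app are contiguous and chronological. A decimal string suffix would not behave this way: "10" sorts before "9".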
  • 29. MapReduce, briefly • The original use case for Hadoop • Mappers take in large data set and send (key,value) pairs to reducers. Reducers aggregate input pairs and generate output. • [Diagram] My input data items -> 4 mappers -> 2 reducers -> 2 outputs
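The mapper and reducer roles from slide 29 can be illustrated in a single process. This is a toy sketch rather than Hadoop's MapReduce API: `TinyMapReduce` and the `"app,eventType"` record format are made up for the example, and a real job would distribute both phases across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of map and reduce; NOT the Hadoop API. */
class TinyMapReduce {
    /** Map phase: emit (eventType, 1) for each record of the form "app,eventType". */
    static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            String eventType = record.split(",")[1];
            pairs.add(new SimpleEntry<String, Integer>(eventType, 1));
        }
        return pairs;
    }

    /** Shuffle and reduce: group the pairs by key and sum the values for each key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }
}
```

Calling `TinyMapReduce.reduce(TinyMapReduce.map(records))` on a list of records yields one total per event type.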
  • 36. MapReduce issues • Hard to process incrementally (efficiently) • Hard to achieve low latency • Can’t have too many jobs • Requires elaborate workflow automation • Urban Airship uses MapReduce over HBase data for: • Ad-hoc analysis • Monthly billing
  • 37. Live demo (Jump to web browser for HBase and MR status pages)
  • 41. Batch processing at UA • Quartz scheduler, distributed over 3 nodes • Time-in-app, audience count, conversions • General pattern • Each arriving event sets a low water mark for its app • Batch jobs reprocess events starting at the low water mark
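One plausible reading of the low water mark pattern on slide 41 is sketched below. It assumes the mark is the earliest event timestamp seen for an app since the last batch run; the slides do not spell this out, and `LowWaterMarks`, `onEvent`, and `takeMark` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-app low water mark tracker; not Urban Airship's code. */
class LowWaterMarks {
    // App -> earliest event timestamp seen since the last batch run.
    private final Map<String, Long> marks = new HashMap<>();

    /** Called as each event arrives: lowers the app's mark if this event is older. */
    synchronized void onEvent(String app, long eventTime) {
        marks.merge(app, eventTime, Long::min);
    }

    /**
     * Called by the batch job: returns the timestamp to reprocess from and clears
     * the mark, or returns null when no events have arrived for the app.
     */
    synchronized Long takeMark(String app) {
        return marks.remove(app);
    }
}
```

Under this reading, a late-arriving event pulls the mark backward, so the next batch run recomputes everything from that point on and skips apps with no new events.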
  • 46. Strengths • Uptime • We know all the ways to crash by now • Schema design, throughput, and scaling • There are many subtle mistakes to avoid • Writing custom tools (statshtable, hbackup, gclogtailer) • “Real time most of the time”
  • 52. Weaknesses of our design • Shipping features quickly • Hardware efficiency • Infrastructure automation • Writing custom tools, getting bogged down at low levels, leaky abstractions • Serious operational Java skills required
  • 53. Reading • Hadoop: The Definitive Guide by Tom White • HBase: The Definitive Guide by Lars George • http://hbase.apache.org/book.html
  • 54. Questions? • #hbase on Freenode • hbase-dev, hbase-user Apache mailing lists
