Scaling near-realtime analytics with Kafka and HBase
OSCON 2012
Dave Revell & Nate Putnam
Urban Airship
About Us

•   Nate Putnam
    •   Team Lead, Core Data and Analytics (1 year)
    •   Previously Engineer at Jive Software (4 years)
    •   Contributor to HBase/Zookeeper
•   Dave Revell
    •   Database Engineer, Core Data and Analytics (1 year)
    •   Previously Engineer at Meebo (1 year)
    •   HBase contributor
In this Talk

•   About Urban Airship
•   Why near-realtime?
•   About Kafka
•   Data Consumption
•   Scheduling
•   HBase / High speed counting
•   Questions
What is an Urban Airship?
•   Hosting for mobile services that developers should not
    build themselves
•   Unified API for services across platforms
•   SLAs for throughput, latency
By The Numbers



•   Hundreds of millions of devices
•   Front end API sustains thousands of requests per
    second
•   Millions of Android devices online all the time
•   It took the company 6 months to deliver its first 1M
    messages; now a hundred million plus a day.
Pretty Graphs
Near-Realtime?



•   Realtime or really fast?
•   Events happen async
•   Realtime failure scenarios are hard
•   In practice, a delay of a few minutes is all right for analytics
Overview

  Kafka-0, Kafka-1, Kafka-2 → Consumer-0 → Memory Queue → Worker-0 → Indexer-0, Indexer-1, Indexer-2 → HBase
Kafka


•   Created by the SNA Team @ LinkedIn
•   Publish-Subscribe system with minimal features
•   Can keep as many messages buffered as you have disk
    space
•   Fast, with some trade-offs
Kafka

•   Topics are partitioned across servers
•   Partitioning scheme is customizable


            Topic_0         Topic_0         Topic_0
             partition_0     partition_4     partition_8

             partition_1     partition_5     partition_9

             partition_2     partition_6    partition_10

             partition_3     partition_7    partition_11
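
A minimal sketch of what a customizable partitioning scheme can look like, assuming a stable hash of the message key modulo the partition count. The function names, the "app:device" key format and the routing rule are hypothetical illustrations; only the 12-partition, three-server layout comes from the diagram above. This is not Kafka's own API.

```python
import zlib

NUM_PARTITIONS = 12          # Topic_0 in the diagram has partition_0 .. partition_11
PARTITIONS_PER_BROKER = 4    # three servers with four partitions each, as drawn


def default_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash partitioner: a stable CRC32 of the key, modulo the partition count."""
    return zlib.crc32(key) % num_partitions


def app_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Custom scheme (hypothetical): route on the app id prefix of an
    'app:device' key so all events for one app land in the same partition."""
    app_id = key.split(b":", 1)[0]
    return zlib.crc32(app_id) % num_partitions


def broker_for(partition: int) -> int:
    """Map a partition number to the server that hosts it in the diagram."""
    return partition // PARTITIONS_PER_BROKER


if __name__ == "__main__":
    for key in (b"app1:deviceA", b"app1:deviceB", b"app2:deviceC"):
        p = app_partition(key)
        print(key, "-> partition", p, "on server", broker_for(p))
```

The choice of scheme decides which events share a partition, and therefore which events a single consumer thread sees together.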
Kafka

•   Consumption is 1 thread per distinct partition
•   Consumers keep track of their own offsets
      Consumer positions over time (Topic_0, Consumer Group 1):
          Partition 0, Offset: 7
          Partition 1, Offset: 10
          Partition 3, Offset: 2
Kafka



•   Manipulation of time-based indexes is powerful
•   Monitor offsets and lag
•   Throw as much disk at this as you can
•   http://incubator.apache.org/kafka/
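
A rough sketch of consumer-side offset bookkeeping and lag monitoring, as described in the Kafka slides above. The class and method names are hypothetical, and the "latest" log offsets in the example are made-up values; only the three committed offsets (7, 10, 2) come from the earlier slide. This is not the Kafka client API.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# (topic, consumer group, partition) identifies one consumer position
PositionKey = Tuple[str, str, int]


@dataclass
class OffsetTracker:
    """The consumer, not the broker, remembers how far it has read."""
    positions: Dict[PositionKey, int] = field(default_factory=dict)

    def commit(self, topic: str, group: str, partition: int, offset: int) -> None:
        self.positions[(topic, group, partition)] = offset

    def lag(self, topic: str, group: str, latest: Dict[int, int]) -> Dict[int, int]:
        """Lag per partition = newest offset in the log minus consumed offset.
        Partitions with no committed position count from offset 0."""
        return {
            partition: end - self.positions.get((topic, group, partition), 0)
            for partition, end in latest.items()
        }


if __name__ == "__main__":
    tracker = OffsetTracker()
    tracker.commit("Topic_0", "Consumer Group 1", 0, 7)
    tracker.commit("Topic_0", "Consumer Group 1", 1, 10)
    tracker.commit("Topic_0", "Consumer Group 1", 3, 2)
    # Hypothetical newest offsets per partition
    print(tracker.lag("Topic_0", "Consumer Group 1", {0: 12, 1: 10, 3: 9}))
```

A growing lag on one partition is the signal to look at the consumer thread that owns it.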
Consumers



•   Mirror Kafka design
•   Lots of parallelism to increase throughput
•   Share nothing
•   No ordering guarantees
Consumers

    partition_0 to partition_11 (12 partitions)
        → Consumer-0 → Queue
        → Consumer-1 → Queue
        → Consumer-2 → Queue
        → Consumer-3 → Queue
Consumers

    Queue → Worker-0 through Worker-5 → Indexer-0 through Indexer-7 → HBase
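
The consumer, memory queue and worker stages in the diagrams above can be sketched with standard-library threads. This is a simplified illustration, not Urban Airship's implementation: the function names are hypothetical, partitions are plain lists, and a locked in-memory counter stands in for the indexers that write to HBase.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # tells a worker to shut down


def consumer(partition_messages, out_queue):
    """One consumer thread per partition: read messages, hand them to the queue."""
    for message in partition_messages:
        out_queue.put(message)


def worker(in_queue, sink, sink_lock):
    """Workers drain the in-memory queue. Order across partitions is not
    preserved, matching the 'no ordering guarantees' bullet."""
    while True:
        message = in_queue.get()
        if message is SENTINEL:
            break
        with sink_lock:
            sink[message] += 1   # stand-in for an indexer writing to HBase


def run_pipeline(partitions, num_workers=3):
    work_queue = queue.Queue(maxsize=1000)   # bounded, so slow workers apply backpressure
    sink, sink_lock = Counter(), threading.Lock()

    workers = [threading.Thread(target=worker, args=(work_queue, sink, sink_lock))
               for _ in range(num_workers)]
    consumers = [threading.Thread(target=consumer, args=(msgs, work_queue))
                 for msgs in partitions]
    for t in workers + consumers:
        t.start()
    for t in consumers:
        t.join()
    for _ in workers:
        work_queue.put(SENTINEL)   # one shutdown marker per worker
    for t in workers:
        t.join()
    return sink
```

Throughput here scales by adding partitions (more consumer threads) or workers, which is the parallelism the Consumers slides describe.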
Scheduled aggregation tasks


•   Challenge: aggregate values that arrive out of order
•   Example: sessions/clickstream
•   Two steps:
    •   Quickly write into HBase
    •   Periodically scan to calculate aggregates
Events arrive in arbitrary order

    Arrival order:           t6   t4   t5   t7   t3   t8   t1   t9   t2   t0
    Time-ordered on disk:    t0   t1   t2   t3   t4   t5   t6   t7   t8   t9
    Session marker:          S    E    S    S    E    E    S    E    S    S

    S = session start, E = session end

•   Initially we store in an HBase table with timestamp as key
•   Scheduled task scans sequentially and infers sessions
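
A small sketch of the two-step approach described above: write events as they arrive, then let a scheduled task scan them in key order and pair starts with ends. It assumes each event carries a device id and that a sorted list is an adequate stand-in for an HBase table keyed by timestamp; the names and the pairing rule are illustrative assumptions, not the speakers' code.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]   # (timestamp, device_id, "S" for start or "E" for end)


def store(events: Iterable[Event]) -> List[Event]:
    """Stand-in for the HBase write path: rows end up sorted by key
    (timestamp first), whatever order they arrived in."""
    return sorted(events)


def infer_sessions(rows: List[Event]) -> List[Tuple[str, int, int]]:
    """Scheduled task: one sequential scan in time order. The only state
    kept is the currently open session start per device."""
    open_starts: Dict[str, int] = {}
    sessions = []
    for timestamp, device, kind in rows:
        if kind == "S":
            open_starts[device] = timestamp
        elif kind == "E" and device in open_starts:
            sessions.append((device, open_starts.pop(device), timestamp))
    return sessions


if __name__ == "__main__":
    arrivals = [(6, "b", "S"), (4, "b", "E"), (5, "a", "E"), (7, "b", "E"),
                (3, "a", "S"), (1, "a", "E"), (2, "b", "S"), (0, "a", "S")]
    print(infer_sessions(store(arrivals)))
```

Because the scan sees events in time order, it never has to look backwards, which keeps both disk access and memory use small.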
Strengths



•   Efficient with disk and memory
•   Can trade off response time for disk usage
•   Fine granularity, tens of thousands of jobs
Compared to MapReduce

•   Similar to MapReduce shuffle: sequential IO, external sort
•   Fine grained failures, scheduling, resource allocation
•   Can’t do lots of jobs, can’t do big jobs
•   But MapReduce is easier to use

    Chart: Input data size (scale 0 to 100) for Alice, Bob, Charlie, Dave
Pro/con vs. realtime streaming


•   For example, a Storm topology
•   Challenge: avoid random reads (disk seeks) without keeping too
    much state in RAM
•   Sorting minimizes state
•   But lower latency would be good

    Figure label: Bob's app, Devices 10-20
HBase




•   What it is
•   Why it’s good for low-latency big data
HBase

•   A database that uses HDFS for storage
•   Based on Google’s BigTable
•   Solves the problem “how do I query my Hadoop data?”
      •   Operations typically measured in milliseconds
      •   MapReduce is not suitable for real time queries
•   Scales well by adding servers (if you do everything right)
•   Not partition tolerant or eventually consistent
Why we like HBase



•   Scalable
•   Fast: millions of ops/sec
•   Open source, ASF top-level project
•   Strong community
HBase difficulties


•   Low level features, harder to use than RDBMS
•   Hard to avoid accidentally introducing bottlenecks
•   Garbage collection, JVM tuning
•   HDFS

    Flowchart: Is it fast enough? → Identify bottleneck → Rethink access patterns
How to fail at HBase

•   Schema can limit scalability
    HBase: row keys map to regions, and regions are hosted by region servers
        KeyA, KeyB → Region 1
        KeyC, KeyD → Region 2
        KeyE, KeyF → Region 3
        KeyG, KeyH → Region 4
        KeyI, KeyJ → Region 5
        KeyK, KeyL → Region 6
        KeyM, KeyN → Region 7
        Region Server 1, Region Server 2
    HDFS: Datanode 1, Datanode 2, Datanode 3, Datanode 4
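
To illustrate how a schema can limit scalability, here is a toy model of the key-to-region mapping in the diagram above. The split points are read off the diagram; the two write workloads are invented for illustration and the helper names are hypothetical. It only models the routing of keys to regions, not HBase itself.

```python
import bisect
from collections import Counter

# Region boundaries from the diagram: Region 1 = [KeyA, KeyC), Region 2 = [KeyC, KeyE), ...
SPLIT_KEYS = ["KeyC", "KeyE", "KeyG", "KeyI", "KeyK", "KeyM"]


def region_for(row_key: str) -> int:
    """Each region owns a contiguous sorted key range; return its 1-based number."""
    return bisect.bisect_right(SPLIT_KEYS, row_key) + 1


def writes_per_region(row_keys) -> Counter:
    """Count how many writes each region would receive."""
    return Counter(region_for(k) for k in row_keys)


# A schema whose keys all sort together (here: a shared "KeyN" prefix followed
# by an increasing sequence number) sends every write to a single region.
hot = writes_per_region("KeyN%06d" % i for i in range(1000))

# Keys spread over the whole key space share the load across all seven regions.
spread = writes_per_region(
    "Key%s%06d" % ("ABCDEFGHIJKLMN"[i % 14], i) for i in range(1000)
)

if __name__ == "__main__":
    print("hot:", dict(hot))
    print("spread:", dict(sorted(spread.items())))
```

In the first workload one region, and so one region server, absorbs all the writes no matter how many servers are added.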
Troubleshooting
•   Isolate slow regions or servers with statshtable
    •      http://github.com/urbanairship/statshtable



Counting


•   The main thing that we do
•   Scaling dimensions:
    •   Many counters of interest per event
    •   Many events
    •   Many changes to counter definitions
A naive attempt
# Pseudocode: every added dimension multiplies the number of counters to maintain
for event in stream:
    user_id = extract_user_id(event)
    timestamp = extract_timestamp(event)
    event_type = extract_event_type(event)
    client_type = extract_client_type(event)
    location = extract_location(event)

    increment_event_type_count(event_type)
    increment_client_and_event_type_count(event_type, client_type)
    increment_user_id_and_event_type_count(user_id, event_type)
    increment_user_id_and_client_type_count(user_id, client_type)

    for time_precision in {HOURLY, DAILY, MONTHLY}:
        increment_time_count(time_precision, timestamp)
        increment_time_client_type_event_type_count(time_precision, ..)
        increment_(...)

        for location_precision in {CITY, STATE, COUNTRY}:
            increment_time_location_event_type_client_type_user_id(...)

            for bucket in yet_another_dimension: ....
Counting with datacubes



•   Challenge: count items in a stream matching various
    criteria, when criteria may change
•   github.com/urbanairship/datacube
•   A Java library for turning streams into OLAP cubes
    •   Especially multidimensional counters
The data cube abstraction

    Figure: a cube of counters, with three of its dimensions shown
    •   Time dimension: hourly buckets (7/15/12 00:00, 7/15/12 01:00, … more hourly rows...) and daily buckets (7/15/12 00:00:00, 7/16/12 00:00:00, … more daily rows...)
    •   User ID dimension: Alice, Bob, Charlie
    •   User agent dimension: Android, IOS
    •   Not shown: many more dimensions
Why datacube?

•   Handles exponential number of writes
•   Async IO with batching
•   Declarative interface: say what to count, not how
•   Pluggable database backend (currently HBase)
•   Bulk loader
•   Easily change/extend online cubes
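
A minimal Python sketch of the idea behind the bullets above: declare which dimension combinations to count, and let the library expand each event into batched increments. This is not the datacube Java API. Every name here (DIMENSIONS, ROLLUPS, Cube) is hypothetical, and an in-memory Counter stands in for the HBase backend.

```python
from collections import Counter
from datetime import datetime
from typing import Callable, Dict, List, Tuple

# Dimension name -> function that turns an event into a bucket value.
DIMENSIONS: Dict[str, Callable[[dict], str]] = {
    "hour": lambda e: e["time"].strftime("%Y-%m-%d %H:00"),
    "day": lambda e: e["time"].strftime("%Y-%m-%d"),
    "user": lambda e: e["user_id"],
    "agent": lambda e: e["user_agent"],
}

# Declarative part: say which combinations to count, not how to count them.
ROLLUPS: List[Tuple[str, ...]] = [
    ("hour",),
    ("day", "user"),
    ("hour", "user", "agent"),
]


class Cube:
    def __init__(self, backend: Counter, batch_size: int = 100):
        self.backend = backend        # stand-in for the database backend
        self.batch = Counter()        # pending increments, merged per cell
        self.batch_size = batch_size
        self.pending_events = 0

    def write(self, event: dict) -> None:
        """One event becomes one increment per rollup; increments to the same
        cell are combined in memory before they reach the backend."""
        for rollup in ROLLUPS:
            cell = tuple((dim, DIMENSIONS[dim](event)) for dim in rollup)
            self.batch[cell] += 1
        self.pending_events += 1
        if self.pending_events >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        self.backend.update(self.batch)   # Counter.update adds counts
        self.batch.clear()
        self.pending_events = 0


if __name__ == "__main__":
    backend = Counter()
    cube = Cube(backend, batch_size=2)
    cube.write({"time": datetime(2012, 7, 15, 0, 5), "user_id": "Bob", "user_agent": "Android"})
    cube.write({"time": datetime(2012, 7, 15, 0, 40), "user_id": "Bob", "user_agent": "iOS"})
    cube.flush()
    print(backend[(("day", "2012-07-15"), ("user", "Bob"))])
```

In this sketch, counting a new combination means adding one tuple to ROLLUPS rather than writing another hand-coded increment call.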
Datacube isn’t just for counters




 Courtesy Tim Robertson, GBIF

github.com/urbanairship/datacube
Questions?
Thanks!



•   HBase and Kafka for being awesome
•   We’re hiring! urbanairship.com/jobs/
•   @nateputnam @dave_revell

More Related Content

What's hot

Apache Kafka
Apache KafkaApache Kafka
Apache KafkaJoe Stein
 
Visualizing Kafka Security
Visualizing Kafka SecurityVisualizing Kafka Security
Visualizing Kafka Security
DataWorks Summit
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
Cloudera, Inc.
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
Ayyappadas Ravindran (Appu)
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
confluent
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
Cloudera, Inc.
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
Cloudera, Inc.
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
HBaseCon
 
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
DataStax Academy
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Joe Stein
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
Yahoo Developer Network
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
HBaseCon
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
Gwen (Chen) Shapira
 
Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)
Camuel Gilyadov
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
DataWorks Summit
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
Newton Alex
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 

What's hot (20)

Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Visualizing Kafka Security
Visualizing Kafka SecurityVisualizing Kafka Security
Visualizing Kafka Security
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)Apache Drill (ver. 0.1, check ver. 0.2)
Apache Drill (ver. 0.1, check ver. 0.2)
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 

Viewers also liked

Event-Stream Processing with Kafka
Event-Stream Processing with KafkaEvent-Stream Processing with Kafka
Event-Stream Processing with Kafka
Tim Lossen
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaselarsgeorge
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah
 
Real-time Analytics with HBase (short version)
Real-time Analytics with HBase (short version)Real-time Analytics with HBase (short version)
Real-time Analytics with HBase (short version)alexbaranau
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Christopher Curtin
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014
larsgeorge
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
larsgeorge
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
DataWorks Summit
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
Nick Dimiduk
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
alexbaranau
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Apache Kafka Demo
Apache Kafka DemoApache Kafka Demo
Apache Kafka Demo
Edward Capriolo
 
Transactional Streaming: If you can compute it, you can probably stream it.
Transactional Streaming: If you can compute it, you can probably stream it.Transactional Streaming: If you can compute it, you can probably stream it.
Transactional Streaming: If you can compute it, you can probably stream it.
jhugg
 
Cgc2 cdn gamingsummit-real-time-customer-analytics
Cgc2 cdn gamingsummit-real-time-customer-analyticsCgc2 cdn gamingsummit-real-time-customer-analytics
Cgc2 cdn gamingsummit-real-time-customer-analyticsbrock55
 
Real Time Recommendation System using Kiji
Real Time Recommendation System using KijiReal Time Recommendation System using Kiji
Real Time Recommendation System using Kiji
Daqing Zhao
 
HBase: Extreme makeover
HBase: Extreme makeoverHBase: Extreme makeover
HBase: Extreme makeover
bigbase
 
TDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.com
TDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.comTDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.com
TDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.com
Daqing Zhao
 
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
Karel Minarik
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
DataStax Academy
 

Viewers also liked (20)

Event-Stream Processing with Kafka
Event-Stream Processing with KafkaEvent-Stream Processing with Kafka
Event-Stream Processing with Kafka
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Real-time Analytics with HBase (short version)
Real-time Analytics with HBase (short version)Real-time Analytics with HBase (short version)
Real-time Analytics with HBase (short version)
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Apache Kafka Demo
Apache Kafka DemoApache Kafka Demo
Apache Kafka Demo
 
Transactional Streaming: If you can compute it, you can probably stream it.
Transactional Streaming: If you can compute it, you can probably stream it.Transactional Streaming: If you can compute it, you can probably stream it.
Transactional Streaming: If you can compute it, you can probably stream it.
 
Cgc2 cdn gamingsummit-real-time-customer-analytics
Cgc2 cdn gamingsummit-real-time-customer-analyticsCgc2 cdn gamingsummit-real-time-customer-analytics
Cgc2 cdn gamingsummit-real-time-customer-analytics
 
Real Time Recommendation System using Kiji
Real Time Recommendation System using KijiReal Time Recommendation System using Kiji
Real Time Recommendation System using Kiji
 
HBase: Extreme makeover
HBase: Extreme makeoverHBase: Extreme makeover
HBase: Extreme makeover
 
TDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.com
TDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.comTDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.com
TDWI Solution Summit San Diego 2014 Advanced Analytics at Macys.com
 
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
Realtime Analytics With Elasticsearch [New Media Inspiration 2013]
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 

Near-realtime analytics with Kafka and HBase

  • 1. Scaling near-realtime analytics with Kafka and HBase OSCON 2012 Dave Revell & Nate Putnam Urban Airship
  • 2. About Us • Nate Putnam • Team Lead, Core Data and Analytics (1 year) • Previously Engineer at Jive Software (4 years) • Contributor to HBase/Zookeeper • Dave Revell • Database Engineer, Core Data and Analytics (1 year) • Previously Engineer at Meebo (1 year) • HBase contributor
  • 3. In this Talk • About Urban Airship • Why near-realtime? • About Kafka • Data Consumption • Scheduling • HBase / High speed counting • Questions
  • 4. What is an Urban Airship? • Hosting for mobile services that developers should not build themselves • Unified API for services across platforms • SLAs for throughput, latency
  • 6-9. By The Numbers • Hundreds of millions of devices • Front-end API sustains thousands of requests per second • Millions of Android devices online all the time • The company took 6 months to deliver 1M messages; it now delivers more than a hundred million a day
  • 11. Near-Realtime? • Realtime or Really fast? • Events happen async • Realtime failure scenarios are hard • In practice a few minutes is all right for analytics
  • 12. Overview • Diagram: Kafka-0, Kafka-1, Kafka-2 -> Consumer-0 -> Memory Queue -> Worker-0 -> Indexer-0, Indexer-1, Indexer-2 -> HBase
  • 13. Kafka • Created by the SNA Team @ LinkedIn • Publish-subscribe system with minimal features • Can keep as many messages buffered as you have disk space • Fast, with some trade-offs
  • 14. Kafka • Topics are partitioned across servers • Partitioning scheme is customizable • Diagram: Topic_0 split into three groups of partitions: partition_0 to partition_3, partition_4 to partition_7, and partition_8 to partition_11
  • 15. Kafka • Consumption is 1 thread per distinct partition • Consumers keep track of their own offsets • Diagram (over time): Topic_0, Consumer Group 1: Partition 0 at offset 7; Partition 1 at offset 10; Partition 3 at offset 2
  • 16. Kafka • Manipulation of time based indexes is powerful • Monitor offsets and lag • Throw as much disk at this as you can • http://incubator.apache.org/kafka/
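
The advice on slides 15 and 16 (consumers keep their own offsets; monitor offsets and lag) can be illustrated with a minimal sketch. It uses no Kafka client: the `consumer_lag` helper and the sample offsets are hypothetical, and it assumes lag is simply the newest offset in a partition minus the offset the consumer group has reached.

```python
def consumer_lag(latest_offsets, consumer_offsets):
    """Return per-partition lag: newest broker offset minus consumer offset.

    Both arguments map partition number -> offset. A partition the consumer
    has never read is treated as if its offset were 0 (an assumption).
    """
    return {
        partition: latest - consumer_offsets.get(partition, 0)
        for partition, latest in latest_offsets.items()
    }


# Hypothetical snapshot for Topic_0, Consumer Group 1.
latest = {0: 12, 1: 10, 3: 9}
consumed = {0: 7, 1: 10, 3: 2}
lag = consumer_lag(latest, consumed)
print(lag)
```

A partition whose lag keeps growing is one the consumers are not keeping up with, which is the signal worth alerting on.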
  • 17. Consumers • Mirror Kafka design • Lots of parallelism to increase throughput • Share nothing • No ordering guarantees
  • 18-19. Consumers • Diagram: Kafka-0, Kafka-1, Kafka-2 -> Consumer-0 -> Memory Queue -> Worker-0 -> Indexer-0, Indexer-1, Indexer-2 -> HBase
  • 20. Consumers • Diagram: partition_0 through partition_11 are read by Consumer-0 through Consumer-3, each of which feeds a Queue
  • 21. Consumers • Diagram: Kafka-0, Kafka-1, Kafka-2 -> Consumer-0 -> Memory Queue -> Worker-0 -> Indexer-0, Indexer-1, Indexer-2 -> HBase
  • 22. Consumers • Diagram: Queue -> Worker-0 through Worker-5 -> Indexer-0 through Indexer-7 -> HBase
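
The consumer slides describe partitions read in parallel, an in-memory queue, and a pool of workers with no ordering guarantees. A rough sketch of that shape, using only the Python standard library, might look like the following. The function names, the uppercase "work", and the bounded queue size are illustrative assumptions, not details from the talk; plain lists stand in for Kafka partitions.

```python
import queue
import threading


def consume(partition_messages, out_queue):
    """Stand-in for a consumer thread: one thread per partition."""
    for message in partition_messages:
        out_queue.put(message)


def work(in_queue, results, lock):
    """Worker: pull from the shared queue until a None sentinel arrives."""
    while True:
        message = in_queue.get()
        if message is None:
            break
        with lock:
            results.append(message.upper())  # placeholder for real indexing


def run_pipeline(partitions, num_workers=3):
    # Bounded queue (size is an assumption) so slow workers block consumers.
    memory_queue = queue.Queue(maxsize=100)
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=work, args=(memory_queue, results, lock))
        for _ in range(num_workers)
    ]
    consumers = [
        threading.Thread(target=consume, args=(messages, memory_queue))
        for messages in partitions
    ]
    for thread in workers + consumers:
        thread.start()
    for thread in consumers:
        thread.join()
    for _ in workers:
        memory_queue.put(None)  # one sentinel per worker, after all messages
    for thread in workers:
        thread.join()
    return results


results = run_pipeline([["a", "b"], ["c"], ["d", "e", "f"]])
print(sorted(results))
```

The result list has to be sorted before it is compared because, as on slide 17, nothing in this design preserves ordering across partitions or workers.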
  • 23. Scheduled aggregation tasks • Challenge: aggregate values that arrive out of order • Example: sessions/clickstream • Two steps: (1) quickly write into HBase; (2) periodically scan to calculate aggregates • Diagram: events arriving out of order (t6 t4 t5 t7 t3 t8 t1 t9 t2 t0)
  • 24-26. Events arrive in arbitrary order (t6 t4 t5 t7 t3 t8 t1 t9 t2 t0) • Initially we store them in an HBase table with the timestamp as key, so they are time-ordered on disk (t0 t1 t2 t3 t4 t5 t6 t7 t8 t9) • Each event is a session start (S) or a session end (E): S E S S E E S E S S • A scheduled task scans sequentially and infers sessions
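
The two-step approach above (write events keyed by timestamp, then scan in order to infer sessions) can be sketched in a few lines. This is only an illustration under assumptions: sorting a list stands in for the time-ordered HBase table, and the `device_id` field and the `infer_sessions` name are hypothetical, since the slides show only S and E markers.

```python
def infer_sessions(events):
    """Pair session starts with session ends by scanning in time order.

    events: iterable of (timestamp, device_id, kind), kind is "S" or "E".
    Returns a list of (device_id, start_timestamp, end_timestamp).
    """
    open_starts = {}  # device_id -> timestamp of its unmatched session start
    sessions = []
    # sorted() plays the role of the timestamp-keyed table; the loop plays
    # the role of the scheduled sequential scan.
    for timestamp, device_id, kind in sorted(events):
        if kind == "S":
            open_starts[device_id] = timestamp
        elif kind == "E" and device_id in open_starts:
            sessions.append((device_id, open_starts.pop(device_id), timestamp))
    return sessions


# Events in arrival order, which is not time order.
arrived = [(6, "a", "E"), (4, "b", "S"), (1, "a", "S"), (9, "b", "E")]
sessions = infer_sessions(arrived)
print(sessions)
```

Because the scan sees events in time order, the only state it holds is the set of currently open sessions, which is the point slide 29 makes with "sorting minimizes state".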
  • 27. Strengths • Efficient with disk and memory • Can trade off response time for disk usage • Fine granularity, tens of thousands of jobs
  • 28. Compared to MapReduce • Similar to MapReduce shuffle: sequential IO, external sort • Fine-grained failures, scheduling, resource allocation • Can’t do lots of jobs, can’t do big jobs • But MapReduce is easier to use • Chart: input data size (axis 0 to 100) for Alice, Bob, Charlie, Dave
  • 29. Pro/con vs. realtime streaming • For example, a Storm topology • Challenge: avoid random reads (disk seeks) without keeping too much state in RAM • Sorting minimizes state • But latency would be good • Diagram label: Bob's app, Devices 10-20
  • 30. HBase • What it is • Why it’s good for low-latency big data
  • 31. HBase • A database that uses HDFS for storage • Based on Google’s BigTable • Solves the problem “how do I query my Hadoop data?” • Operations typically measured in milliseconds • MapReduce is not suitable for real time queries • Scales well by adding servers (if you do everything right) • Not partition tolerant or eventually consistent
  • 32. Why we like HBase • Scalable • Fast: millions of ops/sec • Open source, ASF top-level project • Strong community
  • 33. HBase difficulties • Low level features, harder to use than RDBMS • Hard to avoid accidentally introducing bottlenecks • Garbage collection, JVM tuning • HDFS • Diagram: Is it fast enough? -> Identify bottleneck -> Rethink access patterns
  • 34. How to fail at HBase • Schema can limit scalability • Diagram: HBase keys KeyA to KeyN are grouped into Regions 1 to 7, served by Region Servers 1 and 2, and stored on HDFS Datanodes 1 to 4
  • 35. Troubleshooting • Isolate slow regions or servers with statshtable • http://github.com/urbanairship/statshtable • Diagram: HBase keys KeyA to KeyN are grouped into Regions 1 to 7, served by Region Servers 1 and 2, and stored on HDFS Datanodes 1 to 4
  • 36. Counting • The main thing that we do • Scaling dimensions: • Many counters of interest per event • Many events • Many changes to counter definitions
  • 37. A naive attempt

        for event in stream:
            user_id = extract_user_id(event)
            timestamp = extract_timestamp(event)
            event_type = extract_event_type(event)
            client_type = extract_client_type(event)
            location = extract_location(event)

            increment_event_type_count(event_type)
            increment_client_and_event_type_count(event_type, client_type)
            increment_user_id_and_event_type_count(user_id, event_type)
            increment_user_id_and_client_type_count(user_id, client_type)

            for time_precision in {HOURLY, DAILY, MONTHLY}:
                increment_time_count(time_precision, timestamp)
                increment_time_client_type_event_type_count(time_precision, ..)
                increment_(...)

                for location_precision in {CITY, STATE, COUNTRY}:
                    increment_time_location_event_type_client_type_user_id(...)

                    for bucket in yet_another_dimension:
                        ....

  • 38. Counting with datacubes • Challenge: count items in a stream matching various criteria, when criteria may change • github.com/urbanairship/datacube • A Java library for turning streams into OLAP cubes • Especially multidimensional counters
  • 39. The data cube abstraction • Diagram: a cube of counters with three dimensions shown • User agent dimension: Android, iOS • Time dimension: hourly buckets (7/15/12 00:00, 7/15/12 01:00, … more hourly rows) and daily buckets (7/15/12 00:00:00, 7/16/12 00:00:00, … more daily rows) • User ID dimension: Alice, Bob, Charlie • Not shown: many more dimensions
  • 40. Why datacube? • Handles exponential number of writes • Async IO with batching • Declarative interface: say what to count, not how • Pluggable database backend (currently HBase) • Bulk loader • Easily change/extend online cubes
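
To make "say what to count, not how" concrete, here is a small sketch of the declarative idea in plain Python. It is not the datacube library's API: `DIMENSIONS`, `ROLLUPS`, and `count_event` are hypothetical names, and an in-memory `Counter` stands in for the HBase backend. Each dimension turns an event into one or more bucket labels, and each rollup names a combination of dimensions to count.

```python
from collections import Counter
from datetime import datetime
from itertools import product

# Each dimension maps an event to its bucket labels (hypothetical schema).
DIMENSIONS = {
    "time": lambda e: [
        "hour:" + e["ts"].strftime("%Y-%m-%d %H"),
        "day:" + e["ts"].strftime("%Y-%m-%d"),
    ],
    "user_agent": lambda e: [e["user_agent"]],
    "user_id": lambda e: [e["user_id"]],
}

# Declare which dimension combinations to count; no per-counter code.
ROLLUPS = [("time",), ("time", "user_agent"), ("user_id", "user_agent")]


def count_event(counters, event):
    """Increment one counter per rollup per combination of bucket labels."""
    for rollup in ROLLUPS:
        bucket_lists = [DIMENSIONS[name](event) for name in rollup]
        for buckets in product(*bucket_lists):
            counters[(rollup, buckets)] += 1


counters = Counter()
count_event(counters, {"ts": datetime(2012, 7, 15, 0, 30),
                       "user_agent": "Android", "user_id": "Alice"})
count_event(counters, {"ts": datetime(2012, 7, 15, 1, 5),
                       "user_agent": "iOS", "user_id": "Bob"})
print(counters[(("time",), ("day:2012-07-15",))])
```

Adding a new counter definition means adding a tuple to `ROLLUPS` rather than another hand-written increment call, which is the contrast with the naive loop on slide 37. The number of writes per event still grows with the number of rollups and buckets, which is why the slide lists batching and async IO.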
  • 41. Datacube isn’t just for counters Courtesy Tim Robertson, GBIF github.com/urbanairship/datacube
  • 42. Courtesy Tim Robertson, GBIF github.com/urbanairship/datacube
  • 44. Thanks! • HBase and Kafka for being awesome • We’re hiring! urbanairship.com/jobs/ • @nateputnam @dave_revell
