Storm + Hadoop

            @nathanmarz   1
So many Big Data technologies...




                                   2
So many Big Data technologies...




                                   2
So many Big Data technologies...




                                   2
So many Big Data technologies...




                                   2
So many Big Data technologies...




                                   2
So many Big Data technologies...




                                   2
So many Big Data technologies...



 Storm

                                   2
So many Big Data technologies...



 Storm

                                   2
So many Big Data technologies...



 Storm

                                   2
So many Big Data technologies...



 Storm
            Kafka
                                   2
How to make these tools work
together?




                               3
Goals of data system
• Low latency reads
• Low latency writes
• Fault-tolerant
• Scalable




                       4
What is a data system?


    Query = Function(All data)



                                 5
Is there a general purpose way to
compute arbitrary functions in
realtime?

                                    6
(What’s the title of this talk?)


                                   7
Example query


 Total number of pageviews to a
 URL over a range of time

                                  8
Example query




          Implementation   9
Too slow: “all data” is petabyte-scale


                                     10
Precomputation



          All    Query
         data


                         11
Precomputation


     All     Precomputed
                           Query
    data         view




                                   12
Example query
 Pageview

 Pageview

 Pageview                                  2930
                                   Query
 Pageview

 Pageview
  All data
                Precomputed view

                                              13
Precomputation


     All     Precomputed
                           Query
    data         view




                                   14
Precomputation


     All              Precomputed
                                               Query
    data   Function
                          view
                                    Function




                                                       15
Hadoop


 Great at computing arbitrary
 functions


                                16
Expressing those functions


                       Cascalog


                 Scalding
                                  17
Hadoop precomputation
                                    Batch view #1

                       e wo rkflow
              MapR educ
   All data



              MapRed
                    uce work
                            fl ow    Batch view #2


                                                    18
Batch view database

Need a database that...
• Is batch-writable from Hadoop
• Has fast random reads




                                  19
Batch view database


  No random writes required!



                               20
Batch view database

Examples
• ElephantDB
• Voldemort
• Manhattan




                      21
Batch view database

• Extremely simple
• ElephantDB is only a few thousand lines of code




                                                    22
Hadoop precomputation




                        23
So we’re done, right?


                        24
Not quite...
• A batch workflow is too slow
• Views are out of date


             Absorbed into batch views   Not absorbed


                                                   Now

                                Time
                                                         25
Not quite...
                                           Just a few hours
• A batch workflow is too slow              of data!
• Views are out of date


             Absorbed into batch views   Not absorbed


                                                   Now

                                Time
                                                              25
Compensating for last few hours of
data
                           Realtime view #1




New data stream
                           Realtime view #2




                                              26
Compensating for last few hours of
data
                           Realtime view #1




New data stream
                           Realtime view #2




                  Storm                       26
Realtime views
Random read / random write databases
• Cassandra
• HBase
• Riak




                                       27
Application queries

         Batch view


                        Merge
        Realtime view




                                28
Precomputation


     All     Precomputed
                           Query
    data         view




                                   29
Precomputation

               All   Precomputed
                      batch view
              data
                                     Query
                     Precomputed
                     realtime view
 New data stream


                                             30
Precomputation

               All   Hadoop Precomputed
                               batch view
              data
                                              Query
                              Precomputed
                              realtime view
 New data stream


                                                      30
Precomputation

               All   Hadoop Precomputed
                               batch view
              data
                                              Query
                              Precomputed
                              realtime view
 New data stream     Storm


                                                      30
Storm

                          Realtime view #1




New data stream
                          Realtime view #2




                  Storm                      31
Storm
Realtime computation system
• Guarantees data will be processed
• Horizontally scalable
• Fault-tolerant
• Fast




                                      32
Storm

        Source stream




        Source stream
                        Storm


                                33
Storm Cluster




                34
Storm Cluster




       Master node (similar to Hadoop JobTracker)   35
Storm Cluster




          Used for cluster coordination   36
Storm Cluster




           Run worker processes   37
Starting a topology




                      38
Killing a topology




                     39
Storm concepts
• Streams
• Spouts
• Bolts
• Topologies




                 40
Streams


    Tuple   Tuple   Tuple   Tuple   Tuple   Tuple   Tuple




               Unbounded sequence of tuples                 41
Spouts




         Source of streams   42
Spouts
• Read from Kestrel queue
• Read directly from Twitter streaming API




                                             43
Bolts




        44
Bolts
• Functions
• Filters
• Joins
• Aggregations
• Talk to databases




                      45
Topology




           46
Tasks




        47
Stream grouping




     When a tuple is emitted, to which task does it go to?   48
Stream grouping

• Shuffle grouping: pick a random task
• Fields grouping: mod hashing on a subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id

                                                            49
Streaming word count




                       50
Streaming word count




                       51
Streaming word count




                       52
Streaming word count




                       53
Streaming word count




                       54
Streaming word count




                       55
Precomputation


     All            Precomputed
                                  Query
    data   Hadoop
                        views

              +
            Storm


                                          56
Precomputation


     All            Precomputed
                                          Query
    data   Hadoop
                        views
                                  Storm
              +
            Storm


                                                  57
Distributed RPC


 Sometimes there’s very little
 you can precompute


                                 58
Distributed RPC


 And you still require a lot of
 on-the-fly computation


                                  59
Example


 Reach is the number of unique
 people exposed to a URL on
 Twitter
                                 60
Reach
                    Follower
                               Distinct
          Tweeter   Follower   follower

                    Follower
                               Distinct
    URL   Tweeter              follower
                    Follower

                    Follower   Distinct
          Tweeter              follower
                    Follower

                                          61
Reach topology




                 62
Distributed RPC




                  63
Storm + HDFS

                     HDFS




      New data       Storm       Distributed RPC



  Use HBase-like strategy to reliably store state
  within Storm bolts
                                                    64
Storm + HDFS



 https://github.com/nathanmarz/storm-contrib/tree/master/storm-state




                      storm-state library                              65
Missing pieces
• Getting data into Storm
• Getting data into Hadoop




                             66
Getting data into Storm
Queuing system
• Kestrel
• Kafka
• RabbitMQ




                          67
Getting data into Hadoop
• Scribe
• Flume
• Kafka




                           68
Learn more




        http://manning.com/marz   69
Questions?




             70

Realtime Analytics with Storm and Hadoop