Realtime Analytics with Storm and Hadoop

Published in: Technology

Transcript of "Realtime Analytics with Storm and Hadoop"

  1. Storm + Hadoop (@nathanmarz)
  11. So many Big Data technologies... Storm, Kafka
  12. How to make these tools work together?
  13. Goals of data system
      • Low latency reads
      • Low latency writes
      • Fault-tolerant
      • Scalable
  14. What is a data system? Query = Function(All data)
  15. Is there a general purpose way to compute arbitrary functions in realtime?
  16. (What’s the title of this talk?)
  17. Example query: Total number of pageviews to a URL over a range of time
  18. Example query: Implementation
  19. Too slow: “all data” is petabyte-scale
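
The slide showing the implementation is not preserved in this transcript. As a rough illustration of "Query = Function(All data)", the example query can be written as a full scan. This is a minimal sketch, assuming each pageview is a hypothetical `(url, timestamp)` pair; the function and field names are illustrative and not taken from the talk.

```python
from typing import Iterable, Tuple

# Hypothetical record shape: (url, unix timestamp in seconds).
Pageview = Tuple[str, int]

def pageviews_over_range(all_data: Iterable[Pageview], url: str, start: int, end: int) -> int:
    """Query = Function(All data): scan every record on every query."""
    return sum(1 for (u, ts) in all_data if u == url and start <= ts < end)

# Tiny stand-in for "all data"; the real dataset is described as petabyte-scale.
all_data = [
    ("example.com/a", 100),
    ("example.com/a", 3700),
    ("example.com/b", 3800),
    ("example.com/a", 7300),
]

print(pageviews_over_range(all_data, "example.com/a", 0, 7200))
```

Every call touches every record, which is why this approach stops being practical once the dataset grows large.
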
  20. Precomputation: All data -> Query
  21. Precomputation: All data -> Precomputed view -> Query
  22. Example query: All data (Pageview, Pageview, Pageview, Pageview, Pageview) -> Precomputed view (2930) -> Query
  23. Precomputation: All data -> Precomputed view -> Query
  24. Precomputation: All data -> Function -> Precomputed view -> Function -> Query
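
The two functions in the precomputation diagram can be sketched for the pageview query as follows. This assumes hourly buckets and hypothetical `(url, timestamp)` records; the bucket size and all names are illustrative choices, not something the slides specify.

```python
from collections import Counter

HOUR = 3600  # assumed bucket size in seconds

def build_view(all_data):
    """First function: fold every (url, timestamp) record into hourly counts."""
    view = Counter()
    for url, ts in all_data:
        view[(url, ts // HOUR)] += 1
    return view

def query_view(view, url, start_hour, end_hour):
    """Second function: sum the precomputed buckets in [start_hour, end_hour)."""
    return sum(view.get((url, h), 0) for h in range(start_hour, end_hour))

# Hypothetical sample data.
all_data = [
    ("example.com/a", 100),
    ("example.com/a", 3700),
    ("example.com/b", 3800),
    ("example.com/a", 7300),
]

view = build_view(all_data)
print(query_view(view, "example.com/a", 0, 2))
```

The query now reads one bucket per hour in the range instead of scanning every pageview.
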
  25. Hadoop: Great at computing arbitrary functions
  26. Expressing those functions: Cascalog, Scalding
  27. Hadoop precomputation: All data -> MapReduce workflow -> Batch view #1; All data -> MapReduce workflow -> Batch view #2
  28. Batch view database: Need a database that...
      • Is batch-writable from Hadoop
      • Has fast random reads
  29. Batch view database: No random writes required!
  30. Batch view database: Examples
      • ElephantDB
      • Voldemort
      • Manhattan
  31. Batch view database
      • Extremely simple
      • ElephantDB is only a few thousand lines of code
  32. Hadoop precomputation
  33. So we’re done, right?
  35. Not quite...
      • A batch workflow is too slow
      • Views are out of date
      Timeline: absorbed into batch views | not absorbed (just a few hours of data!) | now
  37. Compensating for last few hours of data: New data stream -> Storm -> Realtime view #1, Realtime view #2
  38. Realtime views: Random read / random write databases
      • Cassandra
      • HBase
      • Riak
  39. Application queries: Batch view, Realtime view -> Merge
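
One way to picture the merge step for the pageview query is shown below. It is a sketch under stated assumptions: both views are hypothetical dictionaries keyed by `(url, hour)`, and the realtime view counts only pageviews that the batch view has not yet absorbed, so the two can simply be added. The slides do not say how the merge is implemented.

```python
def merged_pageviews(batch_view, realtime_view, url, start_hour, end_hour):
    """Answer a query by combining the batch view with the realtime view.

    Assumes the two views cover disjoint sets of pageviews, so their
    per-bucket counts add up without double counting.
    """
    total = 0
    for hour in range(start_hour, end_hour):
        total += batch_view.get((url, hour), 0)
        total += realtime_view.get((url, hour), 0)
    return total

# Hypothetical views: the batch run stopped partway through hour 1.
batch_view = {("example.com/a", 0): 10, ("example.com/a", 1): 7}
realtime_view = {("example.com/a", 1): 2, ("example.com/a", 2): 5}

print(merged_pageviews(batch_view, realtime_view, "example.com/a", 0, 3))
```
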
  40. Precomputation: All data -> Precomputed view -> Query
  43. Precomputation: All data -> Hadoop -> Precomputed batch view -> Query; New data stream -> Storm -> Precomputed realtime view -> Query
  44. Storm: New data stream -> Storm -> Realtime view #1, Realtime view #2
  45. Storm: Realtime computation system
      • Guarantees data will be processed
      • Horizontally scalable
      • Fault-tolerant
      • Fast
  46. Storm: Source stream, Source stream -> Storm
  47. Storm Cluster
  48. Storm Cluster: Nimbus, the master node (similar to Hadoop JobTracker)
  49. Storm Cluster: ZooKeeper, used for cluster coordination
  50. Storm Cluster: Supervisors, which run worker processes
  51. Starting a topology
  52. Killing a topology
  53. Storm concepts
      • Streams
      • Spouts
      • Bolts
      • Topologies
  54. Streams: Unbounded sequence of tuples (Tuple, Tuple, Tuple, ...)
  55. Spouts: Source of streams
  56. Spouts
      • Read from Kestrel queue
      • Read directly from Twitter streaming API
  57. Bolts
  58. Bolts
      • Functions
      • Filters
      • Joins
      • Aggregations
      • Talk to databases
  59. Topology
  60. Tasks
  61. Stream grouping: When a tuple is emitted, to which task does it go?
  62. Stream grouping
      • Shuffle grouping: pick a random task
      • Fields grouping: mod hashing on a subset of tuple fields
      • All grouping: send to all tasks
      • Global grouping: pick task with lowest id
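
The four groupings listed above can be sketched as functions that map a tuple to the task ids that should receive it. This is a plain-Python illustration, not Storm's API: the function names are hypothetical, tuples are modeled as dictionaries, and CRC32 stands in for whatever hash Storm actually uses.

```python
import random
import zlib

def shuffle_grouping(task_ids, tup, rng=random):
    """Pick a random task."""
    return [rng.choice(task_ids)]

def fields_grouping(task_ids, tup, fields):
    """Mod hashing on a subset of tuple fields: equal keys reach the same task."""
    key = repr(tuple(tup[f] for f in fields)).encode("utf-8")
    # CRC32 is an assumption chosen for a hash that is stable across runs.
    return [task_ids[zlib.crc32(key) % len(task_ids)]]

def all_grouping(task_ids, tup):
    """Send to all tasks."""
    return list(task_ids)

def global_grouping(task_ids, tup):
    """Pick the task with the lowest id."""
    return [min(task_ids)]

task_ids = [3, 1, 2]
tup = {"word": "storm", "count": 1}

print(shuffle_grouping(task_ids, tup))
print(fields_grouping(task_ids, tup, ["word"]))
print(all_grouping(task_ids, tup))
print(global_grouping(task_ids, tup))
```

Fields grouping is what makes per-key aggregation possible: every tuple with the same `word` lands on the same task, so that task can keep the running count locally.
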
  63. Streaming word count
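
The code on the streaming word count slides is not preserved in this transcript. The sketch below simulates the idea in a single Python process, assuming a spout that emits sentences, a bolt that splits them into words, and a counting bolt whose tasks receive words by fields grouping. All class and function names are hypothetical and none of this is Storm's API.

```python
import zlib
from collections import Counter

def sentence_spout():
    """Stand-in spout: a finite list instead of an unbounded stream."""
    yield from ["the cow jumped over the moon", "the man went to the store"]

def split_sentence_bolt(sentence):
    """Bolt acting as a function: one sentence in, many words out."""
    for word in sentence.split():
        yield word

class WordCountBolt:
    """Bolt acting as an aggregation: keeps running counts for its words."""
    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1
        return (word, self.counts[word])

def run_topology(num_count_tasks=3):
    tasks = [WordCountBolt() for _ in range(num_count_tasks)]
    emitted = []
    for sentence in sentence_spout():
        for word in split_sentence_bolt(sentence):
            # Fields grouping on "word": the same word always reaches the same task.
            task = tasks[zlib.crc32(word.encode("utf-8")) % num_count_tasks]
            emitted.append(task.execute(word))
    return tasks, emitted

tasks, emitted = run_topology()
for word, count in emitted:
    print(word, count)
```

Because each word is routed to exactly one counting task, the counts never need to be reconciled across tasks.
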
  69. Precomputation: All data -> Hadoop + Storm -> Precomputed views -> Query
  70. Precomputation: All data -> Hadoop + Storm -> Precomputed views -> Storm -> Query
  71. Distributed RPC: Sometimes there’s very little you can precompute
  72. Distributed RPC: And you still require a lot of on-the-fly computation
  73. Example: Reach is the number of unique people exposed to a URL on Twitter
  74. Reach: URL -> Tweeters -> Followers -> Distinct followers
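
The reach computation described above can be sketched in a single process as follows. This assumes two hypothetical lookup tables, `tweeters_of` (URL to the people who tweeted it) and `followers_of` (person to their followers); the reach topology from the talk is not preserved here, and a real deployment would spread these steps across many tasks.

```python
def reach(url, tweeters_of, followers_of):
    """Count the unique people exposed to a URL.

    Steps: URL -> tweeters -> followers of each tweeter -> distinct set -> count.
    """
    distinct_followers = set()
    for tweeter in tweeters_of.get(url, []):
        distinct_followers.update(followers_of.get(tweeter, []))
    return len(distinct_followers)

# Hypothetical sample data: "dave" follows both tweeters and is counted once.
tweeters_of = {"example.com/a": ["alice", "bob"]}
followers_of = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}

print(reach("example.com/a", tweeters_of, followers_of))
```

Almost nothing here can be precomputed per URL, since the answer depends on which accounts tweeted the URL at query time, which is the case the Distributed RPC slides are addressing.
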
  75. Reach topology
  76. Distributed RPC
  77. Storm + HDFS (diagram: New data, Storm, HDFS, Distributed RPC): Use HBase-like strategy to reliably store state within Storm bolts
  78. Storm + HDFS: storm-state library, https://github.com/nathanmarz/storm-contrib/tree/master/storm-state
  79. Missing pieces
      • Getting data into Storm
      • Getting data into Hadoop
  80. Getting data into Storm: Queuing system
      • Kestrel
      • Kafka
      • RabbitMQ
  81. Getting data into Hadoop
      • Scribe
      • Flume
      • Kafka
  82. Learn more: http://manning.com/marz
  83. Questions?