Realtime Analytics with Storm and Hadoop

Published in: Technology

  1. Storm + Hadoop @nathanmarz
  2. So many Big Data technologies...
  8. So many Big Data technologies... Storm
  11. So many Big Data technologies... Storm, Kafka
  12. How to make these tools work together?
  13. Goals of a data system
      • Low latency reads
      • Low latency writes
      • Fault-tolerant
      • Scalable
  14. What is a data system? Query = Function(All data)
  15. Is there a general-purpose way to compute arbitrary functions in realtime?
  16. (What’s the title of this talk?)
  17. Example query: Total number of pageviews to a URL over a range of time
  18. Example query: Implementation
  19. Too slow: “all data” is petabyte-scale
  20. Precomputation: All data → Query
  21. Precomputation: All data → Precomputed view → Query
  22. Example query: All data (Pageview, Pageview, ...) → Precomputed view → Query (2930)
  23. Precomputation: All data → Precomputed view → Query
  24. Precomputation: All data → Function → Precomputed view → Function → Query
  25. Hadoop: Great at computing arbitrary functions
  26. Expressing those functions: Cascalog, Scalding
  27. Hadoop precomputation: All data → MapReduce workflow → Batch view #1; All data → MapReduce workflow → Batch view #2
  28. Batch view database: Need a database that...
      • Is batch-writable from Hadoop
      • Has fast random reads
  29. Batch view database: No random writes required!
  30. Batch view database: Examples
      • ElephantDB
      • Voldemort
      • Manhattan
  31. Batch view database
      • Extremely simple
      • ElephantDB is only a few thousand lines of code
  32. Hadoop precomputation
  33. So we’re done, right?
  34. Not quite...
      • A batch workflow is too slow
      • Views are out of date
      Timeline: Absorbed into batch views | Not absorbed | Now
  35. Not quite...
      • A batch workflow is too slow
      • Views are out of date
      Timeline: Absorbed into batch views | Not absorbed (just a few hours of data!) | Now
  36. Compensating for last few hours of data: New data stream → Realtime view #1, Realtime view #2
  37. Compensating for last few hours of data: New data stream → Storm → Realtime view #1, Realtime view #2
  38. Realtime views: Random read / random write databases
      • Cassandra
      • HBase
      • Riak
  39. Application queries: Batch view + Realtime view → Merge
  40. Precomputation: All data → Precomputed view → Query
  41. Precomputation: All data → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query
  42. Precomputation: All data → Hadoop → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query
  43. Precomputation: All data → Hadoop → Precomputed batch view → Query; New data stream → Storm → Precomputed realtime view → Query
  44. Storm: New data stream → Storm → Realtime view #1, Realtime view #2
  45. Storm: Realtime computation system
      • Guarantees data will be processed
      • Horizontally scalable
      • Fault-tolerant
      • Fast
  46. Storm: Source stream, Source stream → Storm
  47. Storm Cluster
  48. Storm Cluster: Master node (similar to Hadoop JobTracker)
  49. Storm Cluster: Used for cluster coordination
  50. Storm Cluster: Run worker processes
  51. Starting a topology
  52. Killing a topology
  53. Storm concepts
      • Streams
      • Spouts
      • Bolts
      • Topologies
  54. Streams: Unbounded sequence of tuples (Tuple, Tuple, Tuple, ...)
  55. Spouts: Source of streams
  56. Spouts
      • Read from Kestrel queue
      • Read directly from Twitter streaming API
  57. Bolts
  58. Bolts
      • Functions
      • Filters
      • Joins
      • Aggregations
      • Talk to databases
  59. Topology
  60. Tasks
  61. Stream grouping: When a tuple is emitted, which task does it go to?
  62. Stream grouping
      • Shuffle grouping: pick a random task
      • Fields grouping: mod hashing on a subset of tuple fields
      • All grouping: send to all tasks
      • Global grouping: pick task with lowest id
  63. Streaming word count
  69. Precomputation: All data → Hadoop + Storm → Precomputed views → Query
  70. Precomputation: All data → Hadoop + Storm → Precomputed views → Storm → Query
  71. Distributed RPC: Sometimes there’s very little you can precompute
  72. Distributed RPC: And you still require a lot of on-the-fly computation
  73. Example: Reach is the number of unique people exposed to a URL on Twitter
  74. Reach: URL → Tweeters → Followers → Distinct followers
  75. Reach topology
  76. Distributed RPC
  77. Storm + HDFS: Use an HBase-like strategy to reliably store state within Storm bolts (diagram elements: New data, Storm, HDFS, Distributed RPC)
  78. Storm + HDFS: storm-state library, https://github.com/nathanmarz/storm-contrib/tree/master/storm-state
  79. Missing pieces
      • Getting data into Storm
      • Getting data into Hadoop
  80. Getting data into Storm: Queuing system
      • Kestrel
      • Kafka
      • RabbitMQ
  81. Getting data into Hadoop
      • Scribe
      • Flume
      • Kafka
  82. Learn more: http://manning.com/marz
  83. Questions?
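
The pageview example and the batch/realtime merge (slides 17 to 43) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not code from the talk: the hourly buckets, the names `build_view` and `query_pageviews`, and the sample URLs are all hypothetical, and in-memory dicts stand in for the batch view database and the realtime view database. It also assumes each pageview lands in exactly one of the two views, so adding them never double counts.

```python
from collections import defaultdict


def build_view(pageviews):
    """Precompute pageview counts keyed by (url, hour bucket).

    `pageviews` is an iterable of (url, hour) pairs. Hourly bucketing is an
    assumption made for this sketch.
    """
    view = defaultdict(int)
    for url, hour in pageviews:
        view[(url, hour)] += 1
    return view


def query_pageviews(batch_view, realtime_view, url, start_hour, end_hour):
    """Total pageviews for `url` over [start_hour, end_hour], inclusive.

    The query reads both precomputed views and merges them by addition.
    """
    total = 0
    for hour in range(start_hour, end_hour + 1):
        total += batch_view.get((url, hour), 0)
        total += realtime_view.get((url, hour), 0)
    return total


# Hypothetical data: hours 0-2 were absorbed by the last batch run,
# hour 3 arrived afterwards and exists only in the realtime view.
absorbed = [
    ("example.com/a", 0),
    ("example.com/a", 1),
    ("example.com/a", 1),
    ("example.com/b", 2),
]
recent = [("example.com/a", 3), ("example.com/a", 3)]

batch_view = build_view(absorbed)      # stands in for the Hadoop-built view
realtime_view = build_view(recent)     # stands in for the Storm-built view

total = query_pageviews(batch_view, realtime_view, "example.com/a", 0, 3)
print(total)
```

The query never touches the raw pageviews; it only reads the two precomputed views, which is the point of the precomputation slides.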

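The four stream groupings on slide 62, and the reason a streaming word count needs a fields grouping, can be simulated without a Storm cluster. The sketch below is a single-process stand-in, not Storm's API: the function names are hypothetical, tasks are identified by integer ids starting at 0, and `zlib.crc32` is an assumed substitute for whatever hash Storm uses for its mod hashing.

```python
import random
import zlib
from collections import Counter


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick a random task."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field_indexes):
    """Fields grouping: mod hashing on a subset of tuple fields.

    crc32 is a deterministic stand-in hash chosen for this sketch.
    """
    key = "\x1f".join(str(tup[i]) for i in field_indexes)
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(tup, num_tasks):
    """All grouping: send the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global grouping: pick the task with the lowest id (0 here)."""
    return [0]


# Streaming word count: split sentences into words, then route each word
# to a counting task by fields grouping on the word itself.
NUM_COUNT_TASKS = 3
counters = [Counter() for _ in range(NUM_COUNT_TASKS)]
sentences = ["the cow jumped over the moon", "the man went to the store"]

for sentence in sentences:
    for word in sentence.split():
        (task,) = fields_grouping((word,), NUM_COUNT_TASKS, [0])
        counters[task][word] += 1

# Because equal words always hash to the same task, each task's count for
# a word is already the complete count for that word.
merged = sum(counters, Counter())
print(merged["the"])  # 4
```

With a shuffle grouping instead, occurrences of the same word would be spread over several counting tasks and no single task would hold the full count.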
×
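
The reach computation on slides 73 and 74 (URL, then tweeters, then followers, then distinct followers) can be written as a small function. This is a sketch under assumptions, not the reach topology from the talk: the lookup tables `tweeters_of` and `followers_of` and the sample names are hypothetical, the work runs in one process instead of across Storm tasks, and partitioning followers by a `zlib.crc32` hash is an assumed way to mimic how a fields grouping would let each task count its own distinct followers.

```python
import zlib


def reach(url, tweeters_of, followers_of, num_tasks=4):
    """Number of unique people exposed to `url`.

    Followers are partitioned by hash so that the same follower always
    lands in the same partition; the per-partition distinct counts can
    then simply be added together.
    """
    partitions = [set() for _ in range(num_tasks)]
    for tweeter in tweeters_of.get(url, []):
        for follower in followers_of.get(tweeter, []):
            index = zlib.crc32(follower.encode("utf-8")) % num_tasks
            partitions[index].add(follower)
    return sum(len(partition) for partition in partitions)


# Hypothetical data: two people tweeted the URL and share one follower.
tweeters_of = {"http://example.com/post": ["alice", "bob"]}
followers_of = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}

result = reach("http://example.com/post", tweeters_of, followers_of)
print(result)  # 3
```

Little of this can be precomputed per URL, which is why the talk files it under Distributed RPC: the fan-out to followers and the distinct count happen on the fly at query time.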