Realtime Analytics with Storm and Hadoop

Transcript

  • 1. Storm + Hadoop (@nathanmarz)
  • 2. So many Big Data technologies...
  • 8. So many Big Data technologies... Storm
  • 11. So many Big Data technologies... Storm, Kafka
  • 12. How to make these tools work together?
  • 13. Goals of data system
      • Low latency reads
      • Low latency writes
      • Fault-tolerant
      • Scalable
  • 14. What is a data system? Query = Function(All data)
  • 15. Is there a general purpose way to compute arbitrary functions in realtime?
  • 16. (What’s the title of this talk?)
  • 17. Example query: Total number of pageviews to a URL over a range of time
  • 18. Example query: Implementation
  • 19. Too slow: “all data” is petabyte-scale
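
The implementation slide did not survive in this transcript. As a rough illustration of "Query = Function(All data)" for the pageview example, here is a minimal Python sketch. The `Pageview` record, the `pageviews_over_range` function, and the sample data are all hypothetical names made up for this sketch, not the talk's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pageview:
    """Hypothetical raw record: one pageview of a URL at a point in time."""
    url: str
    timestamp: int  # seconds since some epoch


def pageviews_over_range(all_data, url, start, end):
    """Count pageviews to `url` with start <= timestamp < end.

    This scans every record on every query, which is exactly what
    becomes too slow once "all data" is petabyte-scale.
    """
    return sum(
        1 for pv in all_data
        if pv.url == url and start <= pv.timestamp < end
    )


# Illustrative sample data (made up).
ALL_DATA = [
    Pageview("example.com/a", 10),
    Pageview("example.com/a", 3700),
    Pageview("example.com/b", 3800),
    Pageview("example.com/a", 7300),
]
```

The query is correct but its cost grows with the size of the whole dataset, which is what motivates the precomputation slides that follow.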
  • 20. Precomputation: All data → Query
  • 21. Precomputation: All data → Precomputed view → Query
  • 22. Example query (diagram: pageviews in All data → Precomputed view → Query; value shown: 2930)
  • 23. Precomputation: All data → Precomputed view → Query
  • 24. Precomputation: All data → Function → Precomputed view → Function → Query
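
One way to picture the two functions in the diagram above is a first function that turns all data into a view keyed by (URL, hour), and a second that answers a query from that view alone. This is a minimal Python sketch. The hourly bucketing and the names `build_hourly_view` and `query_view` are assumptions for illustration, since the slides do not specify the shape of the view.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600


def build_hourly_view(pageviews):
    """First function: all data -> precomputed view.

    `pageviews` is an iterable of (url, timestamp_in_seconds) pairs.
    The view maps (url, hour_bucket) -> number of pageviews.
    """
    view = Counter()
    for url, timestamp in pageviews:
        view[(url, timestamp // SECONDS_PER_HOUR)] += 1
    return view


def query_view(view, url, start_hour, end_hour):
    """Second function: precomputed view -> query result.

    Sums the buckets for start_hour <= hour < end_hour; the cost depends
    on the number of hours in the range, not on the size of all data.
    """
    return sum(view[(url, hour)] for hour in range(start_hour, end_hour))
```

The trade-off in this sketch is granularity: queries can only be answered at hour boundaries, in exchange for reading a handful of buckets instead of every raw record.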
  • 25. Hadoop: Great at computing arbitrary functions
  • 26. Expressing those functions: Cascalog, Scalding
  • 27. Hadoop precomputation: All data → MapReduce workflow → Batch view #1; All data → MapReduce workflow → Batch view #2
  • 28. Batch view database: Need a database that...
      • Is batch-writable from Hadoop
      • Has fast random reads
  • 29. Batch view database: No random writes required!
  • 30. Batch view database: Examples
      • ElephantDB
      • Voldemort
      • Manhattan
  • 31. Batch view database
      • Extremely simple
      • ElephantDB is only a few thousand lines of code
  • 32. Hadoop precomputation
  • 33. So we’re done, right?
  • 34. Not quite...
      • A batch workflow is too slow
      • Views are out of date
      (timeline diagram: Absorbed into batch views, then Not absorbed, up to Now)
  • 35. Not quite...
      • A batch workflow is too slow
      • Views are out of date
      (timeline diagram: Absorbed into batch views, then Not absorbed, up to Now; the Not absorbed part is just a few hours of data!)
  • 36. Compensating for last few hours of data: New data stream → Realtime view #1, Realtime view #2
  • 37. Compensating for last few hours of data: New data stream → Storm → Realtime view #1, Realtime view #2
  • 38. Realtime views: Random read / random write databases
      • Cassandra
      • HBase
      • Riak
  • 39. Application queries: Batch view + Realtime view → Merge
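
For a count-style view, the merge step can be sketched as adding the two views together. This minimal Python sketch assumes both views are plain dictionaries keyed by (URL, hour), and that the realtime view holds only data not yet absorbed into the batch view. Both assumptions, and the name `merged_pageviews`, are illustrative rather than taken from the slides.

```python
def merged_pageviews(batch_view, realtime_view, url, start_hour, end_hour):
    """Answer a pageview query by merging the batch and realtime views.

    Assumption: the two views cover disjoint data (the realtime view only
    contains pageviews that the batch workflow has not absorbed yet), so
    adding the two counts does not double-count anything.
    """
    total = 0
    for hour in range(start_hour, end_hour):
        key = (url, hour)
        total += batch_view.get(key, 0) + realtime_view.get(key, 0)
    return total
```

Other query types would need a different merge function; addition is only valid here because the example view is a count.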
  • 40. Precomputation: All data → Precomputed view → Query
  • 41. Precomputation: All data → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query
  • 42. Precomputation: All data → Hadoop → Precomputed batch view → Query; New data stream → Precomputed realtime view → Query
  • 43. Precomputation: All data → Hadoop → Precomputed batch view → Query; New data stream → Storm → Precomputed realtime view → Query
  • 44. Storm: New data stream → Storm → Realtime view #1, Realtime view #2
  • 45. Storm: Realtime computation system
      • Guarantees data will be processed
      • Horizontally scalable
      • Fault-tolerant
      • Fast
  • 46. Storm (diagram: two source streams feeding Storm)
  • 47. Storm Cluster
  • 48. Storm Cluster: Master node (similar to Hadoop JobTracker)
  • 49. Storm Cluster: Used for cluster coordination
  • 50. Storm Cluster: Run worker processes
  • 51. Starting a topology
  • 52. Killing a topology
  • 53. Storm concepts
      • Streams
      • Spouts
      • Bolts
      • Topologies
  • 54. Streams: Unbounded sequence of tuples
  • 55. Spouts: Source of streams
  • 56. Spouts
      • Read from Kestrel queue
      • Read directly from Twitter streaming API
  • 57. Bolts
  • 58. Bolts
      • Functions
      • Filters
      • Joins
      • Aggregations
      • Talk to databases
  • 59. Topology
  • 60. Tasks
  • 61. Stream grouping: When a tuple is emitted, to which task does it go?
  • 62. Stream grouping
      • Shuffle grouping: pick a random task
      • Fields grouping: mod hashing on a subset of tuple fields
      • All grouping: send to all tasks
      • Global grouping: pick task with lowest id
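
The four groupings can be sketched as functions that map a tuple to the list of task ids that should receive it. This is plain Python for illustration, not Storm's API: the function names, the dictionary-shaped tuple, and the use of CRC32 as the hash are all assumptions of this sketch.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick a random task."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Fields grouping: mod hashing on a subset of tuple fields.

    CRC32 is used (an assumption of this sketch) because it is stable
    across processes, unlike Python's built-in hash() for strings.
    """
    key = "|".join(str(tup[field]) for field in fields)
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(tup, num_tasks):
    """All grouping: send the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global grouping: pick the task with the lowest id."""
    return [0]
```

The property that matters for fields grouping is that tuples with equal values in the chosen fields always reach the same task, which is what makes per-key aggregation possible.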
  • 63. Streaming word count
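
The code shown on the word count slides is not captured in this transcript. The following is a single-process Python simulation of the idea, not Storm code: a spout emits sentences, one bolt splits them into words, and a fields grouping on the word routes each word to one of several counting tasks. All names here (`sentence_spout`, `split_bolt`, `CountBolt`, `run_topology`) and the sample sentences are hypothetical.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Spout: source of the stream (a fixed sample here, not unbounded)."""
    yield from [
        "the cow jumped over the moon",
        "the man went to the store",
    ]


def split_bolt(sentence):
    """Bolt: split each sentence tuple into word tuples."""
    for word in sentence.split():
        yield word


class CountBolt:
    """Bolt task: keeps a running count for the words routed to it."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1
        return word, self.counts[word]


def run_topology(num_count_tasks=3):
    """Wire spout -> split -> count, with a fields grouping on the word."""
    tasks = [CountBolt() for _ in range(num_count_tasks)]
    for sentence in sentence_spout():
        for word in split_bolt(sentence):
            # Fields grouping: the same word always goes to the same task.
            index = zlib.crc32(word.encode("utf-8")) % num_count_tasks
            tasks[index].execute(word)
    return tasks
```

Because of the fields grouping, each word's count lives in exactly one task, so no coordination between counting tasks is needed.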
  • 69. Precomputation: All data → Hadoop + Storm → Precomputed views → Query
  • 70. Precomputation: All data → Hadoop + Storm → Precomputed views → Storm → Query
  • 71. Distributed RPC: Sometimes there’s very little you can precompute
  • 72. Distributed RPC: And you still require a lot of on-the-fly computation
  • 73. Example: Reach is the number of unique people exposed to a URL on Twitter
  • 74. Reach: URL → Tweeters → Followers → Distinct followers
  • 75. Reach topology
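
The reach computation itself can be sketched sequentially in a few lines of Python. This is only the logic of the diagram (URL, then tweeters, then followers, then distinct followers), not the parallel topology from the slide. The two lookup dictionaries and the name `reach` are hypothetical stand-ins for whatever stores hold that data.

```python
def reach(url, tweeters_by_url, followers_by_user):
    """Number of unique people exposed to `url`.

    tweeters_by_url:   dict mapping a URL to the users who tweeted it
    followers_by_user: dict mapping a user to that user's followers
    Both lookups are assumed inputs for this sketch.
    """
    distinct_followers = set()
    for tweeter in tweeters_by_url.get(url, ()):
        # A follower of several tweeters is only counted once.
        distinct_followers.update(followers_by_user.get(tweeter, ()))
    return len(distinct_followers)
```

The follower lookups and the deduplication are the expensive steps, which is why the talk treats reach as a case where on-the-fly computation is still required.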
  • 76. Distributed RPC
  • 77. Storm + HDFS (diagram: HDFS, New data, Storm, Distributed RPC): Use HBase-like strategy to reliably store state within Storm bolts
  • 78. Storm + HDFS: storm-state library, https://github.com/nathanmarz/storm-contrib/tree/master/storm-state
  • 79. Missing pieces
      • Getting data into Storm
      • Getting data into Hadoop
  • 80. Getting data into Storm: Queuing system
      • Kestrel
      • Kafka
      • RabbitMQ
  • 81. Getting data into Hadoop
      • Scribe
      • Flume
      • Kafka
  • 82. Learn more: http://manning.com/marz
  • 83. Questions?