Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop: Presentation Transcript

  • Storm + Hadoop @nathanmarz 1
  • So many Big Data technologies... Storm Kafka 2
  • How to make these tools work together? 3
  • Goals of a data system: • Low-latency reads • Low-latency writes • Fault-tolerant • Scalable 4
  • What is a data system? Query = Function(All data) 5
  • Is there a general-purpose way to compute arbitrary functions in realtime? 6
  • (What’s the title of this talk?) 7
  • Example query Total number of pageviews to a URL over a range of time 8
  • Example query Implementation 9
  • Too slow: “all data” is petabyte-scale 10
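The implementation slide did not survive in the transcript, so the following is only a minimal sketch of the example query written as a plain function over all data. The `Pageview` record, the `pageviews_over_range` name, and the use of integer second timestamps are assumptions made for illustration, not the talk's code.

```python
from typing import Iterable, NamedTuple


class Pageview(NamedTuple):
    """Hypothetical record type: one raw pageview event."""
    url: str
    timestamp: int  # seconds since some epoch (assumed unit)


def pageviews_over_range(all_data: Iterable[Pageview], url: str,
                         start: int, end: int) -> int:
    """Query = Function(All data): scan every record on every query."""
    return sum(1 for pv in all_data
               if pv.url == url and start <= pv.timestamp < end)


all_data = [
    Pageview("foo.com/blog", 10),
    Pageview("foo.com/blog", 3700),
    Pageview("bar.com", 20),
    Pageview("foo.com/blog", 7300),
]
result = pageviews_over_range(all_data, "foo.com/blog", 0, 7200)
```

Every call touches the whole dataset, which is why this approach stops working once "all data" reaches petabyte scale.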
  • Precomputation: All data → Query 11
  • Precomputation: All data → Precomputed view → Query 12
  • Example query: All data (Pageview, Pageview, Pageview, Pageview, Pageview) → Precomputed view → Query → 2930 13
  • Precomputation: All data → Precomputed view → Query 14
  • Precomputation: All data → Function → Precomputed view → Function → Query 15
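The two functions in the diagram can be sketched roughly as follows: one function folds all data into a precomputed view, and a second, much cheaper function answers the query from that view. The hourly bucketing, the `(url, hour)` key, and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

SECONDS_PER_HOUR = 3600  # assumed bucket size for the view


def build_view(all_data: Iterable[Tuple[str, int]]) -> Counter:
    """First function: fold every (url, timestamp) into per-hour counts."""
    view: Counter = Counter()
    for url, timestamp in all_data:
        view[(url, timestamp // SECONDS_PER_HOUR)] += 1
    return view


def query_view(view: Counter, url: str, start_hour: int, end_hour: int) -> int:
    """Second function: sum the precomputed buckets in [start_hour, end_hour)."""
    return sum(view[(url, hour)] for hour in range(start_hour, end_hour))


data = [("foo.com/blog", 10), ("foo.com/blog", 3700),
        ("bar.com", 20), ("foo.com/blog", 7300)]
view = build_view(data)
two_hours = query_view(view, "foo.com/blog", 0, 2)
```

The query now reads one entry per hour in the range instead of scanning every pageview.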
  • Hadoop Great at computing arbitrary functions 16
  • Expressing those functions Cascalog Scalding 17
  • Hadoop precomputation: All data → MapReduce workflow → Batch view #1; All data → MapReduce workflow → Batch view #2 18
  • Batch view database: Need a database that... • Is batch-writable from Hadoop • Has fast random reads 19
  • Batch view database No random writes required! 20
  • Batch view database: Examples • ElephantDB • Voldemort • Manhattan 21
  • Batch view database: • Extremely simple • ElephantDB is only a few thousand lines of code 22
  • Hadoop precomputation 23
  • So we’re done, right? 24
  • Not quite... • A batch workflow is too slow • Views are out of date (timeline: data absorbed into batch views, then just a few hours of data not absorbed, up to now) 25
  • Compensating for last few hours of data: New data stream → Storm → Realtime view #1, Realtime view #2 26
  • Realtime views: Random read / random write databases • Cassandra • HBase • Riak 27
  • Application queries: Batch view + Realtime view → Merge 28
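As a rough sketch of the merge step for the pageview example, an application query could read the same key from both views and add the results. This assumes the realtime view holds only data not yet absorbed into the batch view, so nothing is counted twice; the dict-backed views and the `merged_pageviews` name are hypothetical stand-ins for real databases.

```python
from typing import Mapping, Tuple

Key = Tuple[str, int]  # (url, hour bucket), an assumed view layout


def merged_pageviews(batch_view: Mapping[Key, int],
                     realtime_view: Mapping[Key, int],
                     url: str, start_hour: int, end_hour: int) -> int:
    """Sum batch and realtime counts for every hour in [start_hour, end_hour)."""
    total = 0
    for hour in range(start_hour, end_hour):
        key = (url, hour)
        total += batch_view.get(key, 0) + realtime_view.get(key, 0)
    return total


# Hours 0-1 were absorbed by the last batch run; hour 2 exists only in realtime.
batch_view = {("foo.com/blog", 0): 5, ("foo.com/blog", 1): 7}
realtime_view = {("foo.com/blog", 2): 3}
total = merged_pageviews(batch_view, realtime_view, "foo.com/blog", 0, 3)
```

Addition works here because counts are additive; other queries would need a merge function suited to their own view format.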
  • Precomputation: All data → Precomputed view → Query 29
  • Precomputation: All data → Hadoop → Precomputed batch view; New data stream → Storm → Precomputed realtime view; both views → Query 30
  • Storm: New data stream → Storm → Realtime view #1, Realtime view #2 31
  • Storm: Realtime computation system • Guarantees data will be processed • Horizontally scalable • Fault-tolerant • Fast 32
  • Storm: Source stream, Source stream → Storm 33
  • Storm Cluster 34
  • Storm Cluster Master node (similar to Hadoop JobTracker) 35
  • Storm Cluster Used for cluster coordination 36
  • Storm Cluster Run worker processes 37
  • Starting a topology 38
  • Killing a topology 39
  • Storm concepts: • Streams • Spouts • Bolts • Topologies 40
  • Streams: Unbounded sequence of tuples (Tuple, Tuple, Tuple, ...) 41
  • Spouts Source of streams 42
  • Spouts: • Read from Kestrel queue • Read directly from Twitter streaming API 43
  • Bolts 44
  • Bolts: • Functions • Filters • Joins • Aggregations • Talk to databases 45
  • Topology 46
  • Tasks 47
  • Stream grouping: When a tuple is emitted, to which task does it go? 48
  • Stream grouping: • Shuffle grouping: pick a random task • Fields grouping: mod hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id 49
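The four groupings can be illustrated with a small routing sketch. This is not Storm's API: the function names, the use of CRC32 as the hash, and the representation of tasks as a list of integer ids are all assumptions for illustration. Each function returns the list of task ids that would receive a tuple.

```python
import random
import zlib
from typing import List, Sequence


def shuffle_grouping(task_ids: Sequence[int], rng: random.Random) -> List[int]:
    """Pick a random task."""
    return [rng.choice(task_ids)]


def fields_grouping(field_values: Sequence[str],
                    task_ids: Sequence[int]) -> List[int]:
    """Mod hashing on the chosen fields: equal values always reach the same task."""
    # CRC32 is an assumed stand-in hash; it is stable across processes,
    # unlike Python's built-in hash() for strings.
    key = "\x1f".join(field_values).encode("utf-8")
    return [task_ids[zlib.crc32(key) % len(task_ids)]]


def all_grouping(task_ids: Sequence[int]) -> List[int]:
    """Send to all tasks."""
    return list(task_ids)


def global_grouping(task_ids: Sequence[int]) -> List[int]:
    """Pick the task with the lowest id."""
    return [min(task_ids)]


tasks = [3, 4, 5]
first = fields_grouping(["storm"], tasks)
second = fields_grouping(["storm"], tasks)
```

Fields grouping is the one that makes per-key aggregation possible, since every tuple with the same field values lands on the same task.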
  • Streaming word count 50-55
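The code on the word count slides is not in the transcript. As a stand-in, here is a plain-Python simulation of the usual shape of a streaming word count: one step splits sentences into words, and a counting step keeps running totals, with words routed by fields grouping so each word is always counted by the same task. It does not use the Storm API, and every name in it is hypothetical.

```python
import zlib
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def split_sentence(sentence: str) -> Iterator[str]:
    """Splitting step: emit one word per whitespace-separated token."""
    yield from sentence.split()


class WordCountTask:
    """Counting step: one task holding running counts for the words routed to it."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def execute(self, word: str) -> Tuple[str, int]:
        self.counts[word] += 1
        return word, self.counts[word]


def run(sentences: Iterable[str], num_count_tasks: int = 2) -> List[Tuple[str, int]]:
    """Feed sentences through both steps and collect each (word, count) update."""
    tasks = [WordCountTask() for _ in range(num_count_tasks)]
    emitted = []
    for sentence in sentences:
        for word in split_sentence(sentence):
            # Fields grouping on the word: the same word always reaches the same task.
            task = tasks[zlib.crc32(word.encode("utf-8")) % num_count_tasks]
            emitted.append(task.execute(word))
    return emitted


updates = run(["the cow jumped over the moon", "the man went to the store"])
```

Because routing depends only on the word, the counts stay correct however many counting tasks are used.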
  • Precomputation: All data → (Hadoop + Storm) → Precomputed views → Query 56
  • Precomputation: All data → (Hadoop + Storm) → Precomputed views → (Storm) → Query 57
  • Distributed RPC Sometimes there’s very little you can precompute 58
  • Distributed RPC And you still require a lot of on-the-fly computation 59
  • Example Reach is the number of unique people exposed to a URL on Twitter 60
  • Reach: URL → Tweeters → Followers → Distinct followers 61
  • Reach topology 62
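The reach computation can be sketched on a single machine as follows: look up who tweeted the URL, gather each tweeter's followers, and count the distinct followers. The in-memory dictionaries stand in for whatever databases hold this data, and all names are assumptions for illustration; the topology on the slide would spread the same steps across many tasks.

```python
from typing import Dict, Set


def reach(url: str,
          tweeters_by_url: Dict[str, Set[str]],
          followers_by_user: Dict[str, Set[str]]) -> int:
    """Number of distinct followers of everyone who tweeted `url`."""
    distinct_followers: Set[str] = set()
    for tweeter in tweeters_by_url.get(url, set()):
        distinct_followers |= followers_by_user.get(tweeter, set())
    return len(distinct_followers)


tweeters_by_url = {"foo.com/blog": {"sally", "bob"}}
followers_by_user = {
    "sally": {"alice", "bob", "carol"},
    "bob": {"alice", "dave"},
}
result = reach("foo.com/blog", tweeters_by_url, followers_by_user)
```

Here `alice` follows both tweeters but is counted once. The set union over every follower list is the part that is hard to precompute and expensive to run on the fly.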
  • Distributed RPC 63
  • Storm + HDFS (diagram: New data, Storm, HDFS, Distributed RPC): Use an HBase-like strategy to reliably store state within Storm bolts 64
  • Storm + HDFS: storm-state library, https://github.com/nathanmarz/storm-contrib/tree/master/storm-state 65
  • Missing pieces: • Getting data into Storm • Getting data into Hadoop 66
  • Getting data into Storm: Queuing system • Kestrel • Kafka • RabbitMQ 67
  • Getting data into Hadoop: • Scribe • Flume • Kafka 68
  • Learn more http://manning.com/marz 69
  • Questions? 70