Map takes one pair of data with a type in one data domain and returns a list of pairs in a different domain: Map(k1, v1) -> list(k2, v2)
The intermediate pairs are grouped by key. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain: Reduce(k2, list(v2)) -> list(v3)
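As an illustration of the two signatures, here is a minimal word-count sketch in plain Java. It runs in a single process and does not use Hadoop; the class name MapReduceSketch and the helper run are hypothetical names chosen for this example, and the choice of word counting as the job is an assumption.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the Map and Reduce signatures.
final class MapReduceSketch {

    // Map(k1, v1) -> list(k2, v2): k1 is a line number, v1 the line text;
    // every word in the line becomes a (word, 1) pair.
    public static List<Map.Entry<String, Integer>> map(Integer lineNo, String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce(k2, list(v2)) -> list(v3): sums the counts collected for one word.
    public static List<Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        List<Integer> out = new ArrayList<>();
        out.add(sum);
        return out;
    }

    // Applies map to every line, groups the intermediate pairs by key,
    // then applies reduce to each group.
    public static Map<String, List<Integer>> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> p : map(i, lines.get(i))) {
                groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            }
        }
        Map<String, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be")));
    }
}
```

In a real cluster the map calls and the per-key reduce calls run on different machines; the grouping loop in run stands in for that shuffle step.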
A bolt can subscribe to an unlimited number of streams, by chaining groupings.
declareOutputFields is used to declare streams and their schemas. It is possible to declare several streams and specify the stream to use when outputting tuples in the emit function call.
Each spout or bolt runs X instances in parallel (called tasks).
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
Storm
• Originally from BackType, for analyzing tweets
  – (More than 2000 watchers on GitHub)
• “the realtime Hadoop”
  – continuous computation system (open source)
• distributed, reliable, fault-tolerant
  – suitable for big data processing
Big Data challenges
• Scalability
  – vertical, horizontal
• (high) Availability
• Stability (fault-tolerance)
caching, replication, partitioning/sharding, load-balancing, …
Google!
• published papers on MapReduce, Google File System (GFS), BigTable
Hadoop limits
• Batch processing with jobs -> not realtime
• Stateful nodes, SPOF (single point of failure)
  – JobTracker/NameNode
• Cumbersome API
[Figure: timeline t ending at "now". Data up to the latest full period is fully processed; data after it is unprocessed. Label: "Hadoop job takes this long for this data".]
Agenda
• Why Storm was created
• Basic concepts
• Some use cases
• Q&A
Task
• Thread which executes a Spout or Bolt
• Deploy a topology: $ storm jar myCode.jar com.example.MyTopology arg1 arg2
• Kill a topology: $ storm kill topologyName
Sample code
• Create stream called “word”, run 10 tasks
• Create stream called “first-…”, run 3 tasks
• Subscribes to stream “word”, using shuffle grouping
Source code of this sample: https://ducquoc.googlecode.com/svn/trunk/storm/
Sample code (2/3)
• RandomWordSpout emits a random string from the array words every 100 milliseconds
Sample code (3/3)
• InterrogativeBolt appends a question mark to the first field of the tuple, then emits it
Stream grouping
• Decides which task of the bolt a tuple is sent to
• ShuffleGrouping: randomly
• FieldsGrouping: groups tuples by named fields
• Global grouping, All grouping, None grouping, Direct grouping
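To make the routing decision concrete, the following is a plain-Java simulation of four of the groupings. It is a sketch only and does not use the Storm API: the class GroupingSketch, its method names, the numbering of tasks from 0 to numTasks - 1, and the use of a hash modulo for fields grouping are all assumptions made for illustration, and Storm's internal implementation may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical simulation: each method returns the ids of the tasks
// that would receive one tuple. Task ids are assumed to be 0..numTasks-1.
final class GroupingSketch {

    // Shuffle grouping: one task picked at random, so load spreads evenly.
    public static List<Integer> shuffle(int numTasks, Random rng) {
        return Collections.singletonList(rng.nextInt(numTasks));
    }

    // Fields grouping: hash the values of the named fields, so tuples with
    // equal values for those fields always go to the same task.
    public static List<Integer> fields(int numTasks, List<Object> fieldValues) {
        return Collections.singletonList(Math.floorMod(fieldValues.hashCode(), numTasks));
    }

    // All grouping: every task receives a copy of the tuple.
    public static List<Integer> all(int numTasks) {
        List<Integer> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(i);
        }
        return tasks;
    }

    // Global grouping: the whole stream goes to the task with the lowest id.
    public static List<Integer> global(int numTasks) {
        return Collections.singletonList(0);
    }

    public static void main(String[] args) {
        List<Object> key = Arrays.<Object>asList("storm");
        System.out.println("shuffle: " + shuffle(3, new Random()));
        System.out.println("fields:  " + fields(3, key) + " and again " + fields(3, key));
        System.out.println("all:     " + all(3));
        System.out.println("global:  " + global(3));
    }
}
```

The property that matters for fields grouping is determinism: the same field values map to the same task on every call, which is what lets a bolt keep per-key state such as a running count.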
Bonus
• I wanna know how many queries I get
  – Per second, minute, day, week
• Results should be available
  – within <2 seconds 99.8+% of the time
  – within 50 seconds almost always
• History should last >2 years
• Should work for 0.01 q/s up to 50,000 q/s
• Failure tolerant, yadda, yadda
Real-time and Long-time together
[Figure: timeline t ending at "now", showing a blended view. Labels: "Hadoop works great back here", "Storm works here".]