Storm - pipes and filters on steroids. Andre Sprenger, BigData Roundtable Hamburg, 30 Nov 2011

Bigdata roundtable-storm

Andre Sprenger presentation on the Twitter Storm framework at the first bigdata-roundtable in Hamburg

Published in: Technology

  1. Storm - pipes and filters on steroids
     Andre Sprenger, BigData Roundtable Hamburg, 30 Nov 2011
  2. My background
     • info@andresprenger.de
     • Studied Computer Science and Economics
     • Background: banking, ecommerce, online advertising
     • Freelancer
     • Java, Scala, Ruby, Rails
     • Hadoop, Pig, Hive, Cassandra
  3. “Next click” problem
     Raymie Stata (CTO, Yahoo): “With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”
  4. “Next click” problem
     [Diagram labels: HTTP request, HTTP response, (next) HTTP request, HTTP response; max latency 80 ms for each response; web server; realtime response; near realtime response; real time layer; collect data; process data; time]
  5. Example problems
     • Realtime statistics - counting, trends, moving average
     • Read Twitter stream and output images that are trending in the last 10 minutes
     • CTR calculation - read ad clicks / ad impressions and calculate new click through rate
     • ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist
     • Search advertising
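The CTR calculation from slide 5 can be sketched in a few lines of plain Python. This is only an illustration of the kind of computation meant, not code from the talk and not Storm API: the class name `SlidingCtr`, the event kinds and the 600 second window (borrowed from the "last 10 minutes" trending example) are assumptions.

```python
from collections import deque


class SlidingCtr:
    """Click-through rate over the last `window_seconds` of events (illustrative)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, kind), oldest first
        self.counts = {"impression": 0, "click": 0}

    def add(self, timestamp, kind):
        # kind is "impression" or "click"; timestamps must be non-decreasing.
        self.events.append((timestamp, kind))
        self.counts[kind] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_kind = self.events.popleft()
            self.counts[old_kind] -= 1

    def ctr(self):
        impressions = self.counts["impression"]
        return self.counts["click"] / impressions if impressions else 0.0


ctr = SlidingCtr(window_seconds=600)  # assumed 10 minute window
for t in (0, 10, 20, 30):
    ctr.add(t, "impression")
ctr.add(40, "click")
print(ctr.ctr())  # 1 click over 4 impressions
```

In a streaming system this state would live inside one processing step and be updated per incoming event, so the rate is always current instead of being recomputed by a batch job.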
  6. Pick your framework...
     • S4 - Yahoo, “real time map reduce”, actor model
     • Storm - Twitter
     • MapReduce Online - Yahoo
     • Cloud Map Reduce - Accenture
     • HStreaming - Startup, based on Hadoop
     • Brisk - DataStax, Cassandra
  7. System requirements
     • Fault tolerance - system keeps running when a node fails
     • Horizontal scalability - should be easy, just add a node
     • Low latency
     • Reliable - does not lose data
     • High availability - well, if it’s down for an hour it’s not realtime
  8. Storm in a nutshell
     • Written by Backtype (acquired by Twitter)
     • Open Source, Github
     • Runs on JVM
     • Clojure, Python, Zookeeper, ZeroMQ
     • Currently used by Twitter for real time statistics
  9. Programming model
     • Tuple - name/value list
     • Stream - unbounded sequence of Tuples
     • Spout - source of Streams
     • Bolt - consumer / producer of Streams
     • Topology - network of Streams, Spouts and Bolts
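The five terms on slide 9 map naturally onto Python generators. The sketch below is a plain-Python model of the concepts, not Storm's API; the function names and the example data are made up for illustration, and the spout is bounded only so that the example terminates.

```python
def number_spout(limit):
    """Spout: the source of a stream. Real streams are unbounded."""
    for n in range(limit):
        yield {"n": n}  # a tuple, modelled as a name/value mapping


def even_filter_bolt(stream):
    """Bolt: consumes a stream and emits a filtered stream."""
    for tup in stream:
        if tup["n"] % 2 == 0:
            yield tup


def square_bolt(stream):
    """Bolt: consumes a stream and emits a transformed stream."""
    for tup in stream:
        yield {"n": tup["n"], "square": tup["n"] ** 2}


def build_topology(limit):
    """Topology: spouts and bolts wired together by streams."""
    return square_bolt(even_filter_bolt(number_spout(limit)))


result = list(build_topology(6))
print(result)
```

Unlike this single-process chain, a real topology is a network rather than a line, and each spout and bolt runs as several parallel tasks.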
  10. Spout
     [Diagram: two Spouts, each emitting a stream of tuples]
  11. Bolt
     Processes streams and generates new streams.
     [Diagram: a Bolt consuming two streams of tuples and emitting a new stream of tuples]
  12. Bolt
     • filtering
     • transformation
     • split / aggregate streams
     • counting, statistics
     • read from / write to database
  13. Topology
     Network of Streams, Spouts and Bolts
     [Diagram: two Spouts feeding a network of five Bolts]
  14. Task
     Parallel processor inside Spouts and Bolts.
     Each Spout / Bolt has a fixed number of Tasks.
     [Diagram: a Spout and a Bolt, each containing several Tasks]
  15. Stream grouping
     Which Task does a Tuple go to?
     • shuffle grouping - distribute randomly
     • field grouping - partition by field value
     • all grouping - send to all Tasks
     • custom grouping - implement your own logic
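The four groupings on slide 15 all answer the same question: given a tuple and a bolt with a fixed number of tasks, which task indices receive the tuple? A minimal plain-Python sketch, assuming tuples are dicts and tasks are numbered from 0. The function signatures are hypothetical and are not Storm's API; the word-length rule in `custom_grouping` is an arbitrary stand-in for "your own logic".

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    # Distribute randomly: any task may receive the tuple.
    return [rng.randrange(num_tasks)]


def field_grouping(tup, num_tasks, field):
    # Partition by field value: equal values always reach the same task.
    # crc32 is used because Python's built-in hash() of str varies per process.
    key = str(tup[field]).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    # Send to all tasks.
    return list(range(num_tasks))


def custom_grouping(tup, num_tasks):
    # Own logic (made up for this sketch): route by word length.
    return [len(tup["word"]) % num_tasks]


tup = {"word": "storm"}
print(shuffle_grouping(tup, 4))
print(field_grouping(tup, 4, "word"))
print(all_grouping(tup, 4))
print(custom_grouping(tup, 4))
```

Field grouping is what makes stateful bolts such as a word counter work in parallel: every occurrence of the same word reaches the task that holds its counter.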
  16. Word count example
     [Diagram: Sentence Spout → Splitter Bolt → Word Count Bolt]
     • Sentence Spout emits (“a b c a b d”)
     • Splitter Bolt emits (“a”) (“b”) (“c”) (“a”) (“b”) (“d”)
     • Word Count Bolt emits (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1)
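The word count pipeline from slide 16 can be reproduced with three small generators. This is a single-process sketch in plain Python, not Storm code; the function names are chosen to mirror the diagram labels, and tuples are modelled as Python tuples.

```python
from collections import Counter


def sentence_spout():
    # Emits the single sentence used on the slide.
    yield ("a b c a b d",)


def splitter_bolt(stream):
    # Splits each sentence tuple into one tuple per word.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    # Keeps a running count and emits an updated (word, count) per input.
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


updates = list(count_bolt(splitter_bolt(sentence_spout())))
final_counts = dict(updates)  # the last update per word wins
print(final_counts)  # {'a': 2, 'b': 2, 'c': 1, 'd': 1}
```

With several count tasks running in parallel, the stream between splitter and counter would need to be partitioned by the word field so that each word is always counted by the same task.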
  17. Guaranteed processing
     [Diagram: tuple tree rooted at the Spout tuple (“a b c a b d”), branching into the word tuples (“a”) (“b”) (“c”) (“a”) (“b”) (“d”) and on to the counts (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1)]
     Topology has a timeout for processing of the tuple tree
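The idea of a tuple tree with a timeout can be illustrated with a small tracker: a root tuple counts as processed only once every tuple derived from it has been acknowledged, and it counts as failed if that does not happen in time. This is a deliberately naive plain-Python sketch; Storm's own bookkeeping is implemented differently, and the class name, method names, tuple ids and the 30 second timeout are all assumptions.

```python
class TupleTreeTracker:
    """Tracks unacknowledged tuples per root tuple (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}  # root_id -> set of unacked tuple ids
        self.started = {}  # root_id -> time the spout emitted the root

    def emit_root(self, root_id, now):
        self.pending[root_id] = {root_id}
        self.started[root_id] = now

    def emit_child(self, root_id, child_id):
        # A bolt emits a new tuple that belongs to the root's tree.
        self.pending[root_id].add(child_id)

    def ack(self, root_id, tuple_id):
        self.pending[root_id].discard(tuple_id)

    def status(self, root_id, now):
        if not self.pending[root_id]:
            return "done"
        if now - self.started[root_id] > self.timeout_seconds:
            return "failed"  # a reliable spout would replay the root tuple
        return "pending"


tracker = TupleTreeTracker(timeout_seconds=30)
tracker.emit_root("s1", now=0)  # ("a b c a b d")
for i in range(6):  # six word tuples
    tracker.emit_child("s1", f"w{i}")
tracker.ack("s1", "s1")  # the sentence was split
for i in range(5):
    tracker.ack("s1", f"w{i}")  # five of six words counted

print(tracker.status("s1", now=10))  # pending
print(tracker.status("s1", now=31))  # failed
tracker.ack("s1", "w5")
print(tracker.status("s1", now=32))  # done
```

Children are registered before their parent is acknowledged, so the pending set never becomes empty while work is still in flight.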
  18. Runtime view
  19. Reliability
     • Nimbus / Supervisor are SPOF
     • both are stateless, easy to restart without data loss
     • Failure of master node (?)
     • Running Topologies should not be affected!
     • Failed Workers are restarted
     • Guaranteed message processing
  20. Administration
     • Nimbus / Supervisor / Zookeeper need monitoring and supervision (e.g. Monit)
     • Cluster nodes can be added at runtime
     • But: existing Topologies are not rebalanced (there is a ticket)
     • Administration web GUI
  21. Community
     • Source is on Github - https://github.com/nathanmarz/storm.git
     • Wiki - https://github.com/nathanmarz/storm/wiki
     • Nice documentation
     • Google Group
     • People start to build add-ons: JRuby integration, adapters for JMS, AMQP
  22. Storm summary
     • Nice programming model
     • Easy to deploy new topologies
     • Horizontal scalability
     • Low latency
     • Fault tolerance
     • Easy to set up on EC2
  23. Questions?