BigData Roundtable: Storm


Andre Sprenger's presentation on the Twitter Storm framework at the first BigData Roundtable in Hamburg.


  1. 1. Storm - pipes and filters on steroids. Andre Sprenger, BigData Roundtable Hamburg, 30 Nov 2011
  2. 2. My background • Studied Computer Science and Economics • Background: banking, ecommerce, online advertising • Freelancer • Java, Scala, Ruby, Rails • Hadoop, Pig, Hive, Cassandra
  3. 3. “Next click” problem. Raymie Stata (CTO, Yahoo!): “With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”
  4. 4. “Next click” problem (cont.) [diagram: timeline of HTTP request / response cycles, each with a max latency of 80 ms; the web server gives a realtime response while the real time layer collects and processes data to give a near-realtime response on the next click]
  5. 5. Example problems • Realtime statistics - counting, trends, moving average • Read the Twitter stream and output images that are trending in the last 10 minutes • CTR calculation - read ad clicks / ad impressions and calculate a new click-through rate • ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist • Search advertising
  6. 6. Pick your framework... • S4 - Yahoo, “real time map reduce”, actor model • Storm - Twitter • MapReduce Online - Yahoo • Cloud Map Reduce - Accenture • HStreaming - startup, based on Hadoop • Brisk - DataStax, Cassandra
  7. 7. System requirements • Fault tolerance - system keeps running when a node fails • Horizontal scalability - should be easy, just add a node • Low latency • Reliable - does not lose data • High availability - well, if it’s down for an hour it’s not realtime
  8. 8. Storm in a nutshell • Written by BackType (acquired by Twitter) • Open Source, on GitHub • Runs on the JVM • Clojure, Python, Zookeeper, ZeroMQ • Currently used by Twitter for real time statistics
  9. 9. Programming model • Tuple - name/value list • Stream - unbounded sequence of Tuples • Spout - source of Streams • Bolt - consumer / producer of Streams • Topology - network of Streams, Spouts and Bolts
  10. 10. Spout [diagram: a Spout emitting streams of tuples]
  11. 11. Bolt. Processes streams and generates new streams. [diagram: tuples flowing into a Bolt, new tuples flowing out]
  12. 12. Bolt • filtering • transformation • split / aggregate streams • counting, statistics • read from / write to database
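Typical Bolt work like filtering and transformation can be sketched in plain Python, with generators standing in for Bolts. This is an illustration of the data flow only, not the Storm API; the tuple shapes and the bot-traffic example are made up for the sketch.

```python
# Two chained "Bolts": the first filters out bot traffic,
# the second transforms the surviving tuples.
# Plain-Python illustration, not the Storm API.

def filter_bolt(stream):
    """Drop tuples flagged as bot traffic."""
    for (url, is_bot) in stream:
        if not is_bot:
            yield (url,)

def transform_bolt(stream):
    """Normalize the remaining tuples."""
    for (url,) in stream:
        yield (url.lower(),)

clicks = [("HTTP://EXAMPLE.COM/A", True), ("http://example.com/B", False)]
out = list(transform_bolt(filter_bolt(iter(clicks))))
print(out)  # [('http://example.com/b',)]
```

In Storm the two steps would be separate Bolts wired together in a Topology, so each can be scaled independently.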
  13. 13. Topology. Network of Streams, Spouts and Bolts. [diagram: two Spouts feeding a network of five Bolts]
  14. 14. Task. Parallel processor inside Spouts and Bolts. Each Spout / Bolt has a fixed number of Tasks. [diagram: Tasks running inside a Spout and a Bolt]
  15. 15. Stream grouping. Which Task does a Tuple go to? • shuffle grouping - distribute randomly • field grouping - partition by field value • all grouping - send to all Tasks • custom grouping - implement your own logic
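The groupings above can be sketched as functions that map a tuple to a task index. Plain Python, not the Storm API; the task count and the use of CRC32 as the partitioning hash are assumptions for the sketch.

```python
# Stream groupings as tuple -> task-index functions (illustration only).
import random
import zlib

NUM_TASKS = 4  # assumed fixed number of Tasks in the receiving Bolt

def shuffle_grouping(tup):
    # distribute tuples randomly across tasks
    return random.randrange(NUM_TASKS)

def field_grouping(tup, field_index=0):
    # same field value -> same task, so per-key state
    # (e.g. a word counter) stays local to one task
    key = str(tup[field_index]).encode()
    return zlib.crc32(key) % NUM_TASKS

def all_grouping(tup):
    # replicate the tuple to every task
    return list(range(NUM_TASKS))

# ("a",) always lands on the same task:
assert field_grouping(("a",)) == field_grouping(("a",))
```

Field grouping is what makes stateful Bolts like a word counter work: every tuple for a given word reaches the same Task.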
  16. 16. Word count example [diagram: Sentence Spout emits (“a b c a b d”) → Splitter Bolt emits (“a”), (“b”), (“c”), (“a”), (“b”), (“d”) → Word Count Bolt emits (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1)]
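The word count pipeline on this slide can be sketched end-to-end in plain Python, with generators standing in for the Spout and the two Bolts (an illustration of the data flow, not the Storm API):

```python
# Sentence Spout -> Splitter Bolt -> Word Count Bolt (illustration only).
from collections import Counter

def sentence_spout():
    """Spout: emits one tuple per sentence."""
    yield ("a b c a b d",)

def splitter_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: counts words; in Storm, field grouping on the word
    would keep all tuples for one word on the same Task."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    for word, n in counts.items():
        yield (word, n)

result = dict(count_bolt(splitter_bolt(sentence_spout())))
print(result)  # {'a': 2, 'b': 2, 'c': 1, 'd': 1}
```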
  17. 17. Guaranteed processing [diagram: tuple tree for the word count example - the spout tuple (“a b c a b d”) anchors the word tuples (“a”), (“b”), (“c”), (“a”), (“b”), (“d”) and the counts (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1)] The Topology has a timeout for processing of the tuple tree.
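Storm tracks tuple-tree completion with an XOR trick: each tuple gets a random 64-bit id, the acker XORs ids in as tuples are emitted and out as they are acked, and when the running value returns to zero the whole tree is done. A minimal sketch of that idea (not Storm's actual implementation):

```python
# XOR-based tuple-tree completion tracking (sketch of the idea only).
import random

class Acker:
    def __init__(self):
        self.ack_val = 0

    def emitted(self, tuple_id):
        # XOR the id in when a tuple joins the tree
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # XOR the id out when the tuple is fully processed
        self.ack_val ^= tuple_id

    def tree_complete(self):
        # zero means every emitted id has been acked
        return self.ack_val == 0

acker = Acker()
ids = [random.getrandbits(64) for _ in range(3)]  # spout tuple + children
for i in ids:
    acker.emitted(i)
assert not acker.tree_complete()
for i in ids:
    acker.acked(i)
assert acker.tree_complete()
```

If the tree does not reach zero before the timeout mentioned on the slide, the spout tuple is replayed.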
  18. 18. Runtime view
  19. 19. Reliability • Nimbus / Supervisor are single points of failure (SPOFs) • both are stateless, easy to restart without data loss • Failure of master node (?) • Running Topologies should not be affected! • Failed Workers are restarted • Guaranteed message processing
  20. 20. Administration • Nimbus / Supervisor / Zookeeper need monitoring and supervision (e.g. Monit) • Cluster nodes can be added at runtime • But: existing Topologies are not rebalanced (there is a ticket) • Administration web GUI
  21. 21. Community • Source is on GitHub - nathanmarz/storm.git • Wiki - • Nice documentation • Google Group • People are starting to build add-ons: JRuby integration, adapters for JMS, AMQP
  22. 22. Storm summary • Nice programming model • Easy to deploy new topologies • Horizontal scalability • Low latency • Fault tolerance • Easy to set up on EC2
  23. 23. Questions?