Bigdata roundtable-storm



Andre Sprenger's presentation on the Twitter Storm framework at the first BigData Roundtable in Hamburg




Usage Rights

© All Rights Reserved

Bigdata roundtable-storm Presentation Transcript

  • 1. Storm - pipes and filters on steroids
    Andre Sprenger
    BigData Roundtable Hamburg
    30 Nov 2011
  • 2. My background
    • Studied Computer Science and Economics
    • Background: banking, ecommerce, online advertising
    • Freelancer
    • Java, Scala, Ruby, Rails
    • Hadoop, Pig, Hive, Cassandra
  • 3. “Next click” problem
    Raymie Stata (CTO, Yahoo): “With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”
  • 4. “Next click” problem (continued)
    [Diagram: a timeline of two HTTP request / HTTP response pairs handled by the web server, each with a max latency of 80 ms. A real time layer collects data and processes data over time, producing a realtime response and a near realtime response.]
  • 5. Example problems
    • Realtime statistics - counting, trends, moving average
    • Read Twitter stream and output images that are trending in the last 10 minutes
    • CTR calculation - read ad clicks / ad impressions and calculate new click-through rate
    • ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist
    • Search advertising
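The CTR calculation mentioned on this slide can be sketched in a few lines of plain Python. This is only an illustration of the idea of a click-through rate over a moving time window, not Storm code: the class name `SlidingCtr`, the event kinds `"impression"` and `"click"`, and the 10-minute window are assumptions made for the example, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class SlidingCtr:
    """Click-through rate over the most recent `window_seconds` of events.

    Hypothetical helper for illustration; not part of Storm.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, kind) pairs, oldest first
        self.counts = {"impression": 0, "click": 0}

    def add(self, timestamp, kind):
        """Record one event; timestamps are assumed to be non-decreasing."""
        self.events.append((timestamp, kind))
        self.counts[kind] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the time window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, kind = self.events.popleft()
            self.counts[kind] -= 1

    def ctr(self):
        """Clicks divided by impressions inside the window (0.0 if no impressions)."""
        impressions = self.counts["impression"]
        return self.counts["click"] / impressions if impressions else 0.0


window = SlidingCtr(window_seconds=600)  # 10-minute window, an assumed value
for ts, kind in [(0, "impression"), (10, "impression"), (20, "click"),
                 (30, "impression"), (40, "click")]:
    window.add(ts, kind)
print(window.ctr())  # two clicks over three impressions
```

In a streaming system the same state would live inside a processing step that receives click and impression events one at a time, which is why the sketch updates its counters incrementally instead of rescanning all events.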
  • 6. Pick your framework...
    • S4 - Yahoo, “real time map reduce”, actor model
    • Storm - Twitter
    • MapReduce Online - Yahoo
    • Cloud Map Reduce - Accenture
    • HStreaming - Startup, based on Hadoop
    • Brisk - DataStax, Cassandra
  • 7. System requirements
    • Fault tolerance - system keeps running when a node fails
    • Horizontal scalability - should be easy, just add a node
    • Low latency
    • Reliable - does not lose data
    • High availability - well, if it’s down for an hour it’s not realtime
  • 8. Storm in a nutshell
    • Written by BackType (acquired by Twitter)
    • Open Source, GitHub
    • Runs on JVM
    • Clojure, Python, ZooKeeper, ZeroMQ
    • Currently used by Twitter for real time statistics
  • 9. Programming model
    • Tuple - name/value list
    • Stream - unbounded sequence of Tuples
    • Spout - source of Streams
    • Bolt - consumer / producer of Streams
    • Topology - network of Streams, Spouts and Bolts
  • 10. Spout
    [Diagram: two Spouts, each emitting a stream of tuples.]
  • 11. Bolt
    Processes streams and generates new streams.
    [Diagram: streams of tuples flowing into a Bolt, which emits a new stream of tuples.]
  • 12. Bolt
    • filtering
    • transformation
    • split / aggregate streams
    • counting, statistics
    • read from / write to database
  • 13. Topology
    Network of Streams, Spouts and Bolts
    [Diagram: two Spouts feeding a network of five Bolts.]
  • 14. Task
    Parallel processor inside Spouts and Bolts. Each Spout / Bolt has a fixed number of Tasks.
    [Diagram: a Spout and a Bolt, each containing several Tasks.]
  • 15. Stream grouping
    Which Task does a Tuple go to?
    • shuffle grouping - distribute randomly
    • field grouping - partition by field value
    • all grouping - send to all Tasks
    • custom grouping - implement your own logic
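The grouping strategies listed on this slide can be illustrated as small routing functions that map a tuple to one or more task indices. This is a minimal sketch in plain Python, not Storm's API: the function names, the dict-based tuple, and the use of CRC32 as the partitioning hash are all assumptions chosen for the example.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Send the tuple to one randomly chosen task."""
    return [rng.randrange(num_tasks)]


def field_grouping(tup, num_tasks, field):
    """Partition by field value: equal values always go to the same task.

    CRC32 is used here only because it is stable across runs; it is an
    assumption of this sketch, not necessarily what Storm does.
    """
    key = str(tup[field]).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def first_letter_grouping(tup, num_tasks):
    """A hypothetical custom grouping: route by the first letter of the word."""
    return [ord(tup["word"][0]) % num_tasks]


word = {"word": "storm"}
print(shuffle_grouping(word, 4))
print(field_grouping(word, 4, "word"))
print(all_grouping(word, 4))
print(first_letter_grouping(word, 4))
```

Partitioning by field value is what makes a counting step correct when it runs on several tasks in parallel: every occurrence of the same word reaches the same task, so each task can keep its own counters without coordination.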
  • 16. Word count example
    [Diagram: a Sentence Spout emits (“a b c a b d”); a Splitter Bolt emits (“a”), (“b”), (“c”), (“a”), (“b”), (“d”); a Word Count Bolt emits (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1).]
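The word count pipeline on this slide can be mimicked with Python generators. This is only a single-process sketch of the data flow (spout, then splitter, then counter); it does not use Storm, and the function names are hypothetical stand-ins for the Sentence Spout, Splitter Bolt and Word Count Bolt.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream: emits one single-field tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def splitter_bolt(stream):
    """Consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def word_count_bolt(stream):
    """Keeps a running count and emits (word, count) after every word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_word_count(sentences):
    """Wire the three steps together and keep the latest count per word."""
    latest = {}
    stream = word_count_bolt(splitter_bolt(sentence_spout(sentences)))
    for word, count in stream:
        latest[word] = count
    return latest


print(run_word_count(["a b c a b d"]))
# prints {'a': 2, 'b': 2, 'c': 1, 'd': 1}
```

The counting step emits an updated tuple for every incoming word, so a downstream consumer sees intermediate counts as well; the helper above simply keeps the last value seen per word.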
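  • 17. Guaranteed processing
    [Diagram: the tuple tree rooted at the Spout tuple (“a b c a b d”), branching into the word tuples (“a”), (“b”), (“c”), (“a”), (“b”), (“d”) and the counts (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1).]
    The Topology has a timeout for processing the tuple tree.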
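The idea of a tuple tree with a processing timeout can be sketched as a small bookkeeping class. This is a conceptual illustration only and is not how Storm implements its tracking: the class `TupleTreeTracker`, its method names, and the 30-second timeout are assumptions. A root tuple counts as fully processed once it and every tuple derived from it have been acknowledged; if that does not happen before the deadline, the root is reported as failed so the source can replay it.

```python
class TupleTreeTracker:
    """Tracks which tuples of each tree are still unacknowledged (hypothetical)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}   # root_id -> set of unacknowledged tuple ids
        self.deadline = {}  # root_id -> time by which the tree must complete

    def emit_root(self, root_id, now):
        """Start tracking a new tree when the source emits a tuple."""
        self.pending[root_id] = {root_id}
        self.deadline[root_id] = now + self.timeout_seconds

    def anchor(self, root_id, child_id):
        """Register a derived tuple as part of the root's tree."""
        self.pending[root_id].add(child_id)

    def ack(self, root_id, tuple_id):
        """Acknowledge one tuple; return True once the whole tree is done."""
        tree = self.pending.get(root_id)
        if tree is None:
            return False  # unknown or already timed out
        tree.discard(tuple_id)
        if tree:
            return False
        del self.pending[root_id]
        del self.deadline[root_id]
        return True

    def expired(self, now):
        """Return and forget the roots whose trees missed their deadline."""
        late = [root for root, due in self.deadline.items() if now >= due]
        for root in late:
            del self.pending[root]
            del self.deadline[root]
        return late


tracker = TupleTreeTracker(timeout_seconds=30)  # assumed timeout value
tracker.emit_root("sentence-1", now=0)
tracker.anchor("sentence-1", "word-a")
tracker.anchor("sentence-1", "word-b")
print(tracker.ack("sentence-1", "sentence-1"))  # children still pending
print(tracker.ack("sentence-1", "word-a"))
print(tracker.ack("sentence-1", "word-b"))      # last ack completes the tree
```

Children are registered before the parent is acknowledged, which keeps the pending set from becoming empty while work is still in flight.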
  • 18. Runtime view
  • 19. Reliability
    • Nimbus / Supervisor are SPOF
    • both are stateless, easy to restart without data loss
    • Failure of master node (?)
    • Running Topologies should not be affected!
    • Failed Workers are restarted
    • Guaranteed message processing
  • 20. Administration
    • Nimbus / Supervisor / ZooKeeper need monitoring and supervision (e.g. Monit)
    • Cluster nodes can be added at runtime
    • But: existing Topologies are not rebalanced (there is a ticket)
    • Administration web GUI
  • 21. Community
    • Source is on GitHub - nathanmarz/storm.git
    • Wiki -
    • Nice documentation
    • Google Group
    • People start to build add-ons: JRuby integration, adapters for JMS, AMQP
  • 22. Storm summary
    • Nice programming model
    • Easy to deploy new topologies
    • Horizontal scalability
    • Low latency
    • Fault tolerance
    • Easy to set up on EC2
  • 23. Questions?