Bigdata roundtable-storm

Andre Sprenger presentation on the Twitter Storm framework at the first bigdata-roundtable in Hamburg

Transcript

  • 1. Storm - pipes and filters on steroids. Andre Sprenger, BigData Roundtable Hamburg, 30 Nov 2011
  • 2. My background
    • info@andresprenger.de
    • Studied Computer Science and Economics
    • Background: banking, e-commerce, online advertising
    • Freelancer
    • Java, Scala, Ruby, Rails
    • Hadoop, Pig, Hive, Cassandra
  • 3. "Next click" problem. Raymie Stata (CTO, Yahoo): "With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call 'next click,' where I click and by the time the page loads, the semantic implication of my decision is reflected in the page."
  • 4. "Next click" problem (diagram: a timeline of successive HTTP request/response cycles with a max latency of 80 ms per response; the web server returns a real-time response, while a separate real-time layer collects and processes the data so its result arrives as a near-real-time response on the next click)
  • 5. Example problems
    • Realtime statistics - counting, trends, moving average
    • Read the Twitter stream and output images that are trending in the last 10 minutes
    • CTR calculation - read ad clicks / ad impressions and calculate the new click-through rate
    • ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist
    • Search advertising
  • 6. Pick your framework...
    • S4 - Yahoo, "real-time MapReduce", actor model
    • Storm - Twitter
    • MapReduce Online - Yahoo
    • Cloud MapReduce - Accenture
    • HStreaming - startup, based on Hadoop
    • Brisk - DataStax, Cassandra
  • 7. System requirements
    • Fault tolerance - the system keeps running when a node fails
    • Horizontal scalability - should be easy, just add a node
    • Low latency
    • Reliable - does not lose data
    • High availability - well, if it's down for an hour it's not realtime
  • 8. Storm in a nutshell
    • Written by BackType (acquired by Twitter)
    • Open source, on GitHub
    • Runs on the JVM
    • Clojure, Python, ZooKeeper, ZeroMQ
    • Currently used by Twitter for real-time statistics
  • 9. Programming model
    • Tuple - named list of values
    • Stream - unbounded sequence of Tuples
    • Spout - source of Streams
    • Bolt - consumer / producer of Streams
    • Topology - network of Streams, Spouts and Bolts
  • 10. Spout (diagram: a Spout emitting two streams of tuples)
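
A minimal Spout sketch against Storm's backtype.storm Java API (the class name and hard-coded sentences are illustrative, not from the slides):

```java
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Source of a Stream: emits one sentence Tuple per call to nextTuple().
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = { "a b c a b d", "a c d" };
    private int index = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; each emit appends one Tuple to the stream.
        collector.emit(new Values(sentences[index]));
        index = (index + 1) % sentences.length;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```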
  • 11. Bolt - processes streams and generates new streams (diagram: a Bolt consuming input streams of tuples and emitting a new stream)
  • 12. Bolt
    • filtering
    • transformation
    • split / aggregate streams
    • counting, statistics
    • read from / write to database
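
As a sketch of the consumer/producer role, a splitting Bolt in the same API (BaseBasicBolt handles acking automatically; the class name is illustrative):

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Consumes a stream of sentences, produces a stream of words (a split transformation).
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        for (String word : tuple.getString(0).split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```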
  • 13. Topology - network of Streams, Spouts and Bolts (diagram: two Spouts feeding a network of five Bolts)
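
Wiring the pieces into a Topology might look like this (a sketch: string component ids are from the later API, the earliest releases used integers; WordCountBolt is sketched under the word-count slide below):

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The last argument is the parallelism hint: the number of Tasks (see slide 14).
        builder.setSpout("sentences", new SentenceSpout(), 2);
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");                // random distribution
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word")); // partition by word

        // In-process cluster for testing; a real deploy goes through StormSubmitter.
        Config conf = new Config();
        new LocalCluster().submitTopology("word-count", conf, builder.createTopology());
    }
}
```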
  • 14. Task - parallel processor inside Spouts and Bolts. Each Spout / Bolt has a fixed number of Tasks (diagram: a Spout with two Tasks next to a Bolt with three Tasks)
  • 15. Stream grouping - which Task does a Tuple go to?
    • shuffle grouping - distribute randomly
    • field grouping - partition by field value
    • all grouping - send to all Tasks
    • custom grouping - implement your own logic
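
The topology sketch above already shows shuffle and fields grouping; the remaining two are declared the same way when a Bolt subscribes to a stream. A fragment continuing that builder (MetricsBolt, RouterBolt and MyGrouping are hypothetical placeholders):

```java
// all grouping: every Task of "metrics" receives every Tuple from "split"
builder.setBolt("metrics", new MetricsBolt()).allGrouping("split");

// custom grouping: MyGrouping implements backtype.storm.grouping.CustomStreamGrouping
// and decides per Tuple which target Task(s) it goes to
builder.setBolt("router", new RouterBolt()).customGrouping("split", new MyGrouping());
```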
  • 16. Word count example (diagram: a Spout emits the sentence ("a b c a b d"); a Sentence Splitter Bolt emits the words ("a"), ("b"), ("c"), ("a"), ("b"), ("d"); a Word Count Bolt emits the counts ("a", 2), ("b", 2), ("c", 1), ("d", 1))
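
A counting Bolt to match the slide (a sketch; because of the fields grouping on "word", every occurrence of a given word reaches the same Task, so a plain in-memory map suffices):

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Keeps a running count per word and emits ("word", count) on every update.
public class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```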
  • 17. Guaranteed processing (diagram: the Spout tuple ("a b c a b d") fans out into the word tuples ("a"), ("b"), ("c"), ("a"), ("b"), ("d") and the count tuples ("a", 2), ("b", 2), ("c", 1), ("d", 1), forming a tuple tree). The Topology has a timeout for processing of the tuple tree.
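
To take part in guaranteed processing, a Spout emits each Tuple with a message id (collector.emit(new Values(sentence), msgId)) and receives ack()/fail() callbacks, while a Bolt anchors derived Tuples to their input and acks it. A sketch with the explicit BaseRichBolt API (BaseBasicBolt does the anchoring and acking implicitly):

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A splitter that participates in the tuple tree explicitly.
public class ReliableSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getString(0).split(" ")) {
            // Anchored emit: each word Tuple becomes a child of the
            // sentence Tuple in the tuple tree.
            collector.emit(input, new Values(word));
        }
        // Mark this node done; the Spout's ack() fires once the whole tree is
        // acked, its fail() if the tree does not complete within the topology
        // timeout (Config.setMessageTimeoutSecs).
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```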
  • 18. Runtime view
  • 19. Reliability
    • Nimbus / Supervisor are SPOFs
    • both are stateless, easy to restart without data loss
    • Failure of the master node (?)
    • Running Topologies should not be affected!
    • Failed Workers are restarted
    • Guaranteed message processing
  • 20. Administration
    • Nimbus / Supervisor / ZooKeeper need monitoring and supervision (e.g. Monit)
    • Cluster nodes can be added at runtime
    • But: existing Topologies are not rebalanced (there is a ticket)
    • Administration web GUI
  • 21. Community
    • Source is on GitHub - https://github.com/nathanmarz/storm.git
    • Wiki - https://github.com/nathanmarz/storm/wiki
    • Nice documentation
    • Google Group
    • People are starting to build add-ons: JRuby integration, adapters for JMS, AMQP
  • 22. Storm summary
    • Nice programming model
    • Easy to deploy new topologies
    • Horizontal scalability
    • Low latency
    • Fault tolerance
    • Easy to set up on EC2
  • 23. Questions?
