Introduction to Storm

Transcript

  • 1. Storm: Distributed and fault-tolerant realtime computation system. Chandler@PyHug, previa [at] gmail.com
  • 2. Outline • Background • Why Storm • Components • Topology • Storm & DRPC • Multilang Protocol • Experience
  • 3. Background
  • 4. Background • Created by Nathan Marz @ BackType/Twitter – Analyzes tweets, links, and users on Twitter • Open-sourced in Sep 2011 – Eclipse Public License 1.0 – Storm 0.5.2 – 16k Java and 7k Clojure LOC – Current stable release 0.8.2 • 0.9.0: major core improvements
  • 5. Background • Active user group – https://groups.google.com/group/storm-user – https://github.com/nathanmarz/storm – Most-watched Java repo on GitHub (>4k watchers) – Used by over 30 companies • Twitter, Groupon, Alibaba, GumGum, ...
  • 6. Why Storm?
  • 7. Before Storm
  • 8. Problems • Scaling is painful • Poor fault-tolerance – Hadoop is stateful • Coding is tedious • Batch processing – Long latency – No realtime
  • 9. Storm • Scalable and robust – No persistent layer • Guarantees no data loss • Fault-tolerant • Programming language agnostic • Use cases – Stream processing – Distributed RPC – Continuous computation
  • 10. Components
  • 11. Based on • Apache ZooKeeper – Distributed system, used to store metadata • ØMQ – Asynchronous message transport layer • Apache Thrift – Cross-language bridge, RPC • LMAX Disruptor – High-performance queue shared by threads • Kryo – Serialization framework
  • 12. System architecture
  • 13. System architecture • Nimbus – Like the JobTracker in Hadoop • Supervisor – Manages workers • ZooKeeper – Stores metadata • UI – Web UI
  • 14. Topology
  • 15. Topology • Tuples – Ordered list of elements – (“user”, “link”, “event”, “10/3/12 17:50”) • Streams – Unbounded sequence of tuples
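
A minimal Python sketch of the two terms above, not Storm code: a tuple is modeled as an ordered Python tuple, and a stream as a generator that never ends. The field names and the `event_stream` and `as_named` helpers are hypothetical, chosen only to match the example tuple on the slide.

```python
from itertools import count, islice

# Hypothetical field names matching the slide's example tuple.
FIELDS = ("user", "link", "event", "timestamp")


def event_stream():
    """Yield tuples forever: an unbounded sequence, like a stream."""
    for n in count():
        yield (f"user{n % 3}", f"http://example.com/{n}", "click", "10/3/12 17:50")


def as_named(values):
    """Pair each positional value with its declared field name."""
    return dict(zip(FIELDS, values))


# A consumer never sees "the end" of a stream; it only takes what it needs.
first_two = list(islice(event_stream(), 2))
```

Because values are positional, a consumer needs the declared field names to address them by name, which is what `as_named` illustrates.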
  • 16. Spouts • Sources of streams • Examples – Read from logs, API calls, event data, queues, …
  • 17. Spout • Interface ISpout – BaseRichSpout, ClojureSpout, DRPCSpout, FeederSpout, FixedTupleSpout, MasterBatchCoordinator, NoOpSpout, RichShellSpout, RichSpoutBatchTriggerer, ShellSpout, SpoutTracker, TestPlannerSpout, TestWordSpout, TransactionalSpoutCoordinator
  • 18. Topology • Bolts – Process input streams and produce new streams – Examples • Stream joins, DBs, APIs, filters, aggregation, …
  • 19. Bolts • Interface IBolt – BaseRichBolt, BasicBoltExecutor, BatchBoltExecutor, BoltTracker, ClojureBolt, CoordinatedBolt, JoinResult, KeyedFairBolt, NonRichBoltTracker, ReturnResults, BaseShellBolt, ShellBolt, TestAggregatesCounter, TestGlobalCount, TestPlannerBolt, TransactionalSpoutBatchExecutor, TridentBoltExecutor, TupleCaptureBolt
  • 20. Topology • Topology – A directed graph of Spouts and Bolts
  • 21. Tasks • Instances of Spouts and Bolts • Managed by Supervisor – http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  • 22. Stream grouping • All grouping – Send to all tasks • Global grouping – Pick the task with the lowest id • Shuffle grouping – Pick a random task • Fields grouping – Consistent hashing on a subset of tuple fields
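
The "directed graph" idea can be sketched in a few lines of plain Python. This is a single-process toy, not Storm's API: the `Topology` class and its `set_spout`, `set_bolt` and `run` methods are hypothetical names, and the example components (`sentences`, `split`, `shout`) are made up for illustration.

```python
from collections import defaultdict


class Topology:
    """A toy directed graph of named components (not Storm's own builder)."""

    def __init__(self):
        self.components = {}            # name -> callable
        self.edges = defaultdict(list)  # source name -> downstream names
        self.spouts = []

    def set_spout(self, name, source):
        self.components[name] = source
        self.spouts.append(name)

    def set_bolt(self, name, func, inputs):
        self.components[name] = func
        for src in inputs:
            self.edges[src].append(name)

    def run(self):
        """Push every spout tuple through the graph; collect sink outputs."""
        results = defaultdict(list)
        for spout in self.spouts:
            for tup in self.components[spout]():
                self._deliver(spout, tup, results)
        return dict(results)

    def _deliver(self, src, tup, results):
        downstream = self.edges.get(src, [])
        if not downstream:  # no outgoing edge: this component is a sink
            results[src].append(tup)
            return
        for name in downstream:
            for out in self.components[name](tup):
                self._deliver(name, out, results)


def sentences():
    """Spout: the source of the stream."""
    yield ("the quick fox",)
    yield ("the dog",)


def split(tup):
    """Bolt: one input tuple in, several tuples out."""
    for word in tup[0].split():
        yield (word,)


def shout(tup):
    """Bolt: transforms each tuple."""
    yield (tup[0].upper(),)


topo = Topology()
topo.set_spout("sentences", sentences)
topo.set_bolt("split", split, inputs=["sentences"])
topo.set_bolt("shout", shout, inputs=["split"])
output = topo.run()
```

The sketch only shows how spouts feed bolts along edges. A real topology runs its components in parallel across machines and never terminates on its own.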
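
The four groupings above can be illustrated as routing functions that map a tuple to the list of task ids that should receive it. This is a sketch under stated assumptions, not Storm's implementation: the function names are hypothetical, and the fields grouping uses a plain CRC32 hash modulo the task count as a stand-in for whatever hashing scheme Storm actually uses.

```python
import random
import zlib


def all_grouping(tup, task_ids):
    """Replicate the tuple to every task."""
    return list(task_ids)


def global_grouping(tup, task_ids):
    """Send the whole stream to the task with the lowest id."""
    return [min(task_ids)]


def shuffle_grouping(tup, task_ids, rng=random):
    """Pick one task at random."""
    return [rng.choice(list(task_ids))]


def fields_grouping(tup, task_ids, field_indexes):
    """Route by a hash of the selected fields (assumed: CRC32 modulo task count)."""
    key = "|".join(str(tup[i]) for i in field_indexes)
    ordered = sorted(task_ids)
    return [ordered[zlib.crc32(key.encode("utf-8")) % len(ordered)]]


tasks = [3, 1, 2]
same_user_a = fields_grouping(("alice", "http://a"), tasks, [0])
same_user_b = fields_grouping(("alice", "http://b"), tasks, [0])
```

The property that matters for fields grouping is that tuples with equal values in the chosen fields always land on the same task, which is what makes per-key counting or joining possible.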
  • 23. Storm fault-tolerance • Reliability API – Spout tuple creation • collector.emit(values, msgID); – Child tuple creation (Bolts) • collector.emit(parentTuples, values); – Tuple end of processing • collector.ack(tuple); – Tuple failed to process • collector.fail(tuple);
  • 24. Storm fault-tolerance • Disabling the reliability API – Globally • Set Config.TOPOLOGY_ACKER_EXECUTORS to 0 – Per spout tuple • collector.emit(values); (omit msgID) – For a single downstream tuple • collector.emit(values); (omit parentTuples)
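
To make the bookkeeping behind the reliability calls (emit with a message ID, anchored emit, ack, fail) concrete, here is a toy tracker in Python. It is a deliberate simplification and not how Storm's acker is implemented: the `AckTracker` class and its method names are hypothetical, and each child tuple is assumed to have exactly one parent.

```python
class AckTracker:
    """Toy bookkeeping for tuple trees (simplified, single-parent only)."""

    def __init__(self):
        self.pending = {}    # msg_id -> ids of tuples not yet acked
        self.root_of = {}    # tuple id -> msg_id of its originating spout tuple
        self.completed = []  # msg_ids whose whole tree was acked
        self.failed = []     # msg_ids the spout should replay
        self._next_id = 0

    def _new_id(self):
        self._next_id += 1
        return self._next_id

    def emit_from_spout(self, msg_id):
        """Like emitting with a message ID: start tracking a new tree."""
        tid = self._new_id()
        self.pending[msg_id] = {tid}
        self.root_of[tid] = msg_id
        return tid

    def emit_anchored(self, parent_tid):
        """Like emitting with a parent tuple: the child joins the parent's tree."""
        msg_id = self.root_of[parent_tid]
        tid = self._new_id()
        self.root_of[tid] = msg_id
        if msg_id in self.pending:
            self.pending[msg_id].add(tid)
        return tid

    def ack(self, tid):
        """The tree completes once every tuple in it has been acked."""
        msg_id = self.root_of[tid]
        tree = self.pending.get(msg_id)
        if tree is None:
            return  # tree already completed or failed
        tree.discard(tid)
        if not tree:
            del self.pending[msg_id]
            self.completed.append(msg_id)

    def fail(self, tid):
        """One failed tuple fails the whole tree."""
        msg_id = self.root_of[tid]
        if self.pending.pop(msg_id, None) is not None:
            self.failed.append(msg_id)


tracker = AckTracker()
root = tracker.emit_from_spout("m1")
child = tracker.emit_anchored(root)  # emitted before the parent is acked
tracker.ack(root)
tracker.ack(child)
```

A tuple emitted without an anchor would simply never be added to `pending`, which is why omitting the anchor or the message ID switches tracking off for that tuple.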
  • 25. Storm & DRPC
  • 26. Distributed RPC
  • 27. Multilang Protocol
  • 28. Multilang protocol • Uses ShellSpout/ShellBolt • Processes communicate over standard input/output • Messages are encoded as JSON or lines of plain text
  • 29. Three steps • Initiate a handshake – Keeps track of the process by its process ID – Storm sends a JSON object to the script's standard input at startup – It contains • Storm configuration, topology context, PID directory
  • 30. Three steps • Start looping – storm_sync expects storm_ack • Read or write tuples – Follow the defined structure – Implement read_msg(), storm_emit(), …
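
The handshake and message framing can be sketched in Python as follows. The sketch assumes the framing used by Storm's multilang protocol, where each JSON message is followed by a line containing only `end`, and where the setup message carries the keys `conf`, `context` and `pidDir`. The helper names `read_msg`, `send_msg`, `handshake` and `emit` are illustrative, and in-memory streams stand in for the real standard input and output.

```python
import io
import json
import os
import tempfile


def read_msg(stream):
    """Read lines up to a line containing only 'end', then decode the JSON."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))


def send_msg(stream, msg):
    """Write one JSON message followed by the 'end' terminator line."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()


def handshake(stdin, stdout):
    """Read the setup message, record our PID, and report it back."""
    setup = read_msg(stdin)
    pid = os.getpid()
    # An empty file named after the PID lets the parent track this process.
    open(os.path.join(setup["pidDir"], str(pid)), "w").close()
    send_msg(stdout, {"pid": pid})
    return setup["conf"], setup["context"]


def emit(stdout, values):
    """Send a tuple to the parent process."""
    send_msg(stdout, {"command": "emit", "tuple": list(values)})


# Simulate the parent side with in-memory streams instead of real pipes.
pid_dir = tempfile.mkdtemp()
fake_stdin = io.StringIO()
send_msg(fake_stdin, {"conf": {}, "context": {}, "pidDir": pid_dir})
fake_stdin.seek(0)
fake_stdout = io.StringIO()
conf, context = handshake(fake_stdin, fake_stdout)
emit(fake_stdout, ["hello", 1])
```

In a real component, `sys.stdin` and `sys.stdout` would replace the in-memory streams, and the loop after the handshake would keep calling `read_msg` and responding to each command.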
  • 31. Experience
  • 32. Experience • Not hard to set up, but – Beware of certain versions of ZooKeeper – Wait a while after the topology is deployed • Fast – Better to use Fabric • Stable – But beware of memory leaks
  • 33. Reference
  • 34. Reference • “Getting Started with Storm”, O’Reilly • Twitter Storm – Sergey Lukjanov@slideshare – http://www.slideshare.net/lukjanovsv/twitter-storm • Storm – nathanmarz@slideshare – http://www.slideshare.net/nathanmarz/storm-11164672 • Realtime Analytics with Storm and Hadoop – Hadoop_Summit@slideshare – http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with- storm
  • 35. Q/A
  • 36. Thanks