Storm
My presentation of Storm at the Bay Area Hadoop User Group on January 18th, 2012.

My presentation of Storm at the Bay Area Hadoop User Group on January 18th, 2012.

Statistics

Views

Total Views
11,390
Views on SlideShare
7,114
Embed Views
4,276

Actions

Likes
26
Downloads
401
Comments
1

17 Embeds 4,276

http://blog.xiuwz.com 2414
http://tjun.jp 802
http://tjun.org 608
http://dataddict.wordpress.com 300
http://hochul.net 96
http://www.tuicool.com 18
http://a0.twimg.com 11
http://webcache.googleusercontent.com 11
http://cache.baidu.com 5
http://us-w1.rockmelt.com 4
http://translate.googleusercontent.com 1
http://www.linkedin.com 1
https://twitter.com 1
http://localhost 1
http://snapshot.soso.com 1
http://cncc.bingj.com 1
https://dataddict.wordpress.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

11 of 1

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • How to monetize your searching skills and make money working from home pretty easy, using computer, internet and a little of your time. Complete guidance at http://MakeCash25.com
    I came across this site quite randomly 4 months ago, registered, and till now I receive payments on a weekly basis.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Storm Presentation Transcript

    • Storm: distributed and fault-tolerant realtime computation. Nathan Marz, Twitter
    • Basic info
      • Open sourced September 19th
      • Implementation is 12,000 lines of code
      • Used by over 25 companies
      • >2280 watchers on GitHub (most watched JVM project)
      • Very active mailing list: >1700 messages, >520 members
    • Hadoop: batch computation. Distributed, fault-tolerant
    • Storm: realtime computation. Distributed, fault-tolerant
    • Hadoop
      • Large, finite jobs
      • Process a lot of data at once
      • High latency
    • Storm
      • Infinite computations called topologies
      • Process infinite streams of data
      • Tuple-at-a-time computational model
      • Low latency
    • Before Storm: queues and workers
    • Example (simplified)
    • Example: workers schemify tweets and append to Hadoop
    • Example: workers update statistics on URLs by incrementing counters in Cassandra
    • Problems
      • Scaling is painful
      • Poor fault-tolerance
      • Coding is tedious
    • What we want
      • Guaranteed data processing
      • Horizontal scalability
      • Fault-tolerance
      • No intermediate message brokers!
      • Higher-level abstraction than message passing
      • “Just works”
    • Storm provides all of the above: guaranteed data processing, horizontal scalability, fault-tolerance, no intermediate message brokers, a higher-level abstraction than message passing, and it “just works”
    • Use cases: stream processing, distributed RPC, continuous computation
    • Storm Cluster
    • Storm cluster: master node (similar to the Hadoop JobTracker)
    • Storm cluster: Zookeeper is used for cluster coordination
    • Storm cluster: worker nodes run the worker processes
    • Starting a topology
    • Killing a topology
    • Concepts• Streams• Spouts• Bolts• Topologies
    • Streams: unbounded sequences of tuples
    • Spouts: sources of streams
    • Spout examples• Read from Kestrel queue• Read from Twitter streaming API
    • Bolts: process input streams and produce new streams
    • Bolts
      • Functions
      • Filters
      • Aggregation
      • Joins
      • Talk to databases
    • Topology: a network of spouts and bolts
    • Tasks: spouts and bolts execute as many tasks across the cluster
    • Task execution: tasks are spread across the cluster
    • Stream grouping: when a tuple is emitted, which task does it go to?
    • Stream grouping
      • Shuffle grouping: pick a random task
      • Fields grouping: mod hashing on a subset of tuple fields
      • All grouping: send to all tasks
      • Global grouping: pick the task with the lowest id
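The four groupings above boil down to simple task-selection rules. This standalone Python sketch (an illustration, not Storm's actual implementation) shows how each grouping maps a tuple to target task ids, assuming tasks are numbered 0..n-1:

```python
import random

def shuffle_grouping(num_tasks, tup):
    # Shuffle grouping: pick a random task (load balancing).
    return [random.randrange(num_tasks)]

def fields_grouping(num_tasks, tup, fields):
    # Fields grouping: mod-hash a subset of the tuple's fields, so the
    # same field values always route to the same task (needed when a
    # task keeps per-key state, e.g. word counts).
    key = tuple(tup[f] for f in fields)
    return [hash(key) % num_tasks]

def all_grouping(num_tasks, tup):
    # All grouping: replicate the tuple to every task.
    return list(range(num_tasks))

def global_grouping(num_tasks, tup):
    # Global grouping: always the task with the lowest id.
    return [0]
```

The key property of the fields grouping is determinism: two tuples with the same values for the grouped fields always land on the same task within a run.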
    • Topology: edges annotated with stream groupings, e.g. shuffle, fields [“id1”, “id2”], fields [“url”], all
    • Streaming word count: TopologyBuilder is used to construct topologies in Java
    • Streaming word count: define a spout in the topology with a parallelism of 5 tasks
    • Streaming word count: split sentences into words with a parallelism of 8 tasks
    • Streaming word count: the consumer decides what data it receives and how it gets grouped
    • Streaming word count: create a word count stream
    • Streaming word count: splitsentence.py
    • Streaming word count
    • Streaming word count: submitting the topology to a cluster
    • Streaming word count: running the topology in local mode
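The slides build this topology with Storm's Java TopologyBuilder; as a rough stand-in, the following self-contained Python sketch mimics the two bolts (split sentences into words, then count words behind a fields grouping so every occurrence of a word reaches the same counter task). Function names here are illustrative, not Storm API:

```python
from collections import Counter

def split_sentence(sentence):
    # Split bolt: one sentence tuple in, one tuple per word out
    # (roughly what splitsentence.py does in the slides).
    for word in sentence.split():
        yield word

def word_count(sentences, num_tasks=8):
    # Fields grouping on "word": mod-hash routes every occurrence of a
    # word to the same counter task, so each task's counts are complete
    # for the words it owns.
    tasks = [Counter() for _ in range(num_tasks)]
    for sentence in sentences:
        for word in split_sentence(sentence):
            tasks[hash(word) % num_tasks][word] += 1
    # Merge per-task counts only to inspect the result here; in Storm
    # the counts would stay partitioned across the running tasks.
    total = Counter()
    for c in tasks:
        total += c
    return total

counts = word_count(["the cow jumped over the moon",
                     "the man went to the store"])
```

Unlike a real topology, this processes a finite list; a Storm topology would apply the same per-tuple logic to an unbounded stream.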
    • Demo
    • Distributed RPC: data flow for distributed RPC
    • DRPC example: computing the “reach” of a URL on the fly
    • Reach: the number of unique people exposed to a URL on Twitter
    • Computing reach: URL → tweeters → followers → distinct followers → count → reach
    • Reach topology
    • Reach topology: keep a set of followers for each request id in memory
    • Reach topology: update the followers set when a new follower is received
    • Reach topology: emit a partial count after receiving all followers for a request id
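The reach pipeline sketched on the slides (URL → tweeters → followers → distinct count) reduces to a fan-out plus a distinct-set union. A standalone Python sketch, with made-up follower data for illustration:

```python
def compute_reach(url, tweeters_of, followers_of):
    # Reach = number of *unique* people exposed to the URL:
    # fan out to everyone who tweeted it, then to their followers,
    # and count the distinct union (the "distinct follower" step
    # in the slides' diagram).
    reached = set()
    for tweeter in tweeters_of.get(url, []):
        reached.update(followers_of.get(tweeter, []))
    return len(reached)

# Hypothetical data: two tweeters share a follower, so reach is 3, not 4.
tweeters_of = {"http://example.com": ["alice", "bob"]}
followers_of = {"alice": ["carol", "dan"], "bob": ["dan", "erin"]}
reach = compute_reach("http://example.com", tweeters_of, followers_of)
```

In the actual DRPC topology this work is parallelized: follower sets are partitioned across tasks (keyed by request id), each task emits a partial distinct count, and a final bolt sums the partials.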
    • Demo
    • Guaranteeing message processing: the “tuple tree”
    • Guaranteeing message processing: a spout tuple is not fully processed until all tuples in the tree have been completed
    • Guaranteeing message processing: if the tuple tree is not completed within a specified timeout, the spout tuple is replayed
    • Guaranteeing message processing: the reliability API
    • Guaranteeing message processing: “anchoring” creates a new edge in the tuple tree
    • Guaranteeing message processing: “acking” marks a single node in the tree as complete
    • Guaranteeing message processing: Storm tracks tuple trees for you in an extremely efficient way
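The detail behind “extremely efficient”: Storm tracks each tuple tree with a single 64-bit value, XORing in a random edge id when a tuple is anchored and XORing the same id again when it is acked; pairs cancel, so the value returns to zero exactly when the whole tree is complete. A minimal standalone sketch of that bookkeeping (names are illustrative):

```python
import os

class TreeTracker:
    # Tracks one spout tuple's tree as a single integer ("ack val"),
    # regardless of how many tuples the tree contains.
    def __init__(self):
        self.ack_val = 0

    def anchor(self):
        # Anchoring: create a new edge with a random 64-bit id and
        # XOR it in. Return the id so it can be acked later.
        edge_id = int.from_bytes(os.urandom(8), "big")
        self.ack_val ^= edge_id
        return edge_id

    def ack(self, edge_id):
        # Acking: XOR the same id again; anchor/ack pairs cancel out.
        self.ack_val ^= edge_id

    def complete(self):
        # Zero iff every anchored edge has been acked (ignoring the
        # astronomically unlikely random-id collision).
        return self.ack_val == 0
```

This is why tracking costs constant memory per spout tuple no matter how large the tree grows.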
    • Transactional topologies: how do you do idempotent counting with an at-least-once delivery guarantee?
    • Transactional topologies: won’t you overcount?
    • Transactional topologies: transactional topologies solve this problem
    • Transactional topologies: built completely on top of Storm’s primitives of streams, spouts, and bolts
    • Transactional topologies: process small batches of tuples
    • Transactional topologies: if a batch fails, replay the whole batch
    • Transactional topologies: once a batch is completed, commit the batch
    • Transactional topologies: bolts can optionally implement a “commit” method
    • Transactional topologies: commits are ordered; if there’s a failure during commit, the whole batch plus commit is retried
    • Example
    • Example: a new instance of this object is created for every transaction attempt
    • Example: aggregate the count for this batch
    • Example: only update the database if the transaction ids differ
    • Example: this enables idempotency since commits are ordered
    • Example: credit goes to the Kafka guys for figuring out this trick
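The idempotent-commit trick on the slides above can be sketched in a few lines: store the transaction id alongside the count, and skip the update when the stored id already equals the current one, so a retried commit is a no-op. A standalone sketch with a dict standing in for the database:

```python
def commit_count(db, key, batch_count, txid):
    # db maps key -> (count, txid of the transaction that last wrote it).
    count, last_txid = db.get(key, (0, None))
    if last_txid == txid:
        # This exact transaction already committed: a replayed
        # batch must not be counted twice.
        return
    db[key] = (count + batch_count, txid)

db = {}
commit_count(db, "storm", 5, txid=1)
commit_count(db, "storm", 5, txid=1)  # retry of the same transaction
commit_count(db, "storm", 3, txid=2)
```

This only works because commits are ordered: when the stored txid differs from the current one, every earlier transaction is guaranteed to have already been applied.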
    • Transactional topologies: multiple batches can be processed in parallel, but commits are guaranteed to be ordered
    • Transactional topologies
      • Will be available in the next version of Storm (0.7.0)
      • Requires a source queue that can replay identical batches of messages
      • Aiming for the first TransactionalSpout implementation to use Kafka
    • Storm UI
    • Storm on EC2: https://github.com/nathanmarz/storm-deploy (one-click deploy tool)
    • Starter code: https://github.com/nathanmarz/storm-starter (example topologies)
    • Documentation
    • Ecosystem
      • Scala, JRuby, and Clojure DSLs
      • Kestrel, AMQP, JMS, and other spout adapters
      • Serializers
      • Multilang adapters
      • Cassandra and MongoDB integration
    • Questions? http://github.com/nathanmarz/storm