Introduction to Streaming Distributed Processing with Storm



Introduces streaming data concepts, Storm cluster architecture, and Storm topology architecture, and demonstrates a working example of a WordCount topology for the SIGKDD Seattle chapter meetup.

Presented by Brandon O'Brien
Code example:

Published in: Data & Analytics

  1. Introduction to Streaming Distributed Processing with Storm
     Presenter: Brandon O’Brien, Data Engineer @ Expedia
  2. Outline
     • Distributed Systems & Batch Processing
     • Streaming Processing. Introduce Storm
     • WordCount Demo & Setup
     • Storm Cluster Architecture
     • Storm Topology Architecture
     • WordCount Deep Dive
     • Discussion and Q&A: Storm Use Cases & Patterns
  3. Distributed Systems
     • Distribute work across N nodes
     • Hadoop Ecosystem
     • Batch processing
     • Massively parallel (horizontal scale out)
     • Problems: data latency, 24-hour batching vs. a global client base
     • What’s next? Increasing need to move to real-time & streaming processing models
  4. Streaming Processing
     • Provides near-real-time views into analytical data sets and system status
     • Allows for real-time intervention & response to events
     • Streaming frameworks: Spark, Azure Stream Analytics, AWS Kinesis+Lambda, Storm
     • Storm: created by Nathan Marz, first used at Twitter
     • Storm: “Doing for realtime processing what Hadoop did for batch processing”
     • Stream definition: “unbounded sequence of tuples”
  5. Storm WordCount Demo
     • WordCount Storm Topology
       • Streams text blobs
       • Counts word occurrences
       • Reports results every 10 seconds
     • Getting it running
       mvn clean install exec:java -Dexec.mainClass="dataclub.storm.TokenCountingTopology"
  6. Storm Cluster Architecture
     • Core components:
       • ZooKeeper
       • Nimbus
       • Supervisors
       • Workers/JVM
       • Executor/thread
       • Component/task (bolts & spouts)
     • Scalability: supervisors can be added while topologies are running, no code change required
     • Supervisors run Worker JVMs
     • Workers run Executor Threads
     • Executors run Tasks (instances of Spouts and Bolts)
  7. Storm Topology Architecture
     • DAG (Directed Acyclic Graph) processing model
     • Components: Spout & Bolt (benefit: decouple logic from scalability)
     • Tasks (instances of Spouts & Bolts)
     • Executors (run Tasks)
  8. Storm WordCount Deep Dive
     • Topology structure
     • Classes
       • Spout:
       • Bolt:
       • Bolt:
     • Putting it all together:
  9. Storm Use Cases & Patterns
     • Consume data from Kafka, Kinesis, or another queue
     • Persist data to a datastore with high write performance, such as Cassandra
     • Streaming map reduce, multi-stage map reduce
     • Storm is stateless & fail-fast. Externalize state using Redis or another cache for resiliency
     • Online learning / real-time model updates (using frameworks like WEKA or others)
     • Real-world use cases: real-time ad targeting, travel market analytics, user behavior analytics, system monitoring & SLA
     • Storm multi-lang API (Python, Ruby, Perl, JavaScript, Scala, and more)
  10. Distributed Streaming Processing with Storm
     • Going Further
       • Frameworks: Trident, Summingbird
       • Stand up Storm cluster: http://www.michael-
     • Contact Brandon O’Brien, Data Engineer @ Expedia
     • Q&A
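
Slides 5 and 8 describe a topology that streams text blobs and counts word occurrences, but the spout and bolt classes themselves are not part of this transcript. As a rough illustration of the data flow only, here is a Storm-free sketch in plain Java. The class name `WordCountSketch`, its methods, and the tokenizing rule are assumptions made for this example; they are not the presenter's `TokenCountingTopology` code and they do not use the Storm API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Storm-free sketch of the WordCount data flow.
final class WordCountSketch {

    // Stands in for a tokenizing bolt: one text blob in, many word tuples out.
    // Assumption: words are lowercased runs of letters, digits, and apostrophes.
    static List<String> tokenize(String blob) {
        List<String> words = new ArrayList<>();
        for (String token : blob.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Stands in for a counting bolt: keeps a running count per word.
    static Map<String, Integer> count(List<String> blobs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String blob : blobs) {
            for (String word : tokenize(blob)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stands in for the spout: a fixed list instead of an unbounded stream.
        List<String> blobs = List.of("Storm processes streams", "streams of tuples");
        // prints {of=1, processes=1, storm=1, streams=2, tuples=1}
        System.out.println(count(blobs));
    }
}
```

In a real topology the input never ends, so the counting step would report its current totals periodically (every 10 seconds in the demo) rather than once at the end.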
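
The containment hierarchy on slide 6 (workers run executor threads, executors run tasks) can be made concrete with a small sketch. The one below assumes a simple round-robin assignment purely for illustration; it is not Storm's actual scheduler, and `ParallelismSketch` and `roundRobin` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the worker -> executor -> task hierarchy.
final class ParallelismSketch {

    // Spreads item ids 0..items-1 over `slots` buckets in round-robin order.
    static List<List<Integer>> roundRobin(int items, int slots) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int s = 0; s < slots; s++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < items; i++) {
            buckets.get(i % slots).add(i);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int workers = 2;   // worker JVMs
        int executors = 4; // threads, spread across the workers
        int tasks = 8;     // spout/bolt instances, spread across the executors

        List<List<Integer>> tasksPerExecutor = roundRobin(tasks, executors);
        List<List<Integer>> executorsPerWorker = roundRobin(executors, workers);

        for (int w = 0; w < workers; w++) {
            System.out.println("worker " + w);
            for (int e : executorsPerWorker.get(w)) {
                System.out.println("  executor " + e + " runs tasks " + tasksPerExecutor.get(e));
            }
        }
    }
}
```

With these numbers, each worker hosts two executors and each executor runs two tasks. Changing only the three counts changes the layout, which is the sense in which the component logic is decoupled from scalability.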
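
Slide 9 recommends externalizing state because Storm components are stateless and fail-fast. A minimal sketch of that pattern follows, assuming a hypothetical `CountStore` interface in place of a real Redis client and an in-memory stand-in so the example runs without a server. None of these names come from the talk or from a Redis library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo: a restarted counter picks up where the failed one left off.
final class ExternalStateSketch {
    public static void main(String[] args) {
        CountStore store = new InMemoryCountStore();

        StatelessCounter first = new StatelessCounter(store);
        first.onWord("storm");
        first.onWord("storm");

        // Simulate a fail-fast restart: the old instance is discarded and a
        // new one is created against the same external store.
        StatelessCounter replacement = new StatelessCounter(store);
        System.out.println(replacement.onWord("storm")); // prints 3
    }
}

// Hypothetical interface standing in for an external store such as Redis.
interface CountStore {
    long increment(String key);
    long get(String key);
}

// In-memory stand-in; a real deployment would talk to a separate process.
final class InMemoryCountStore implements CountStore {
    private final Map<String, Long> data = new HashMap<>();

    @Override
    public long increment(String key) {
        return data.merge(key, 1L, Long::sum);
    }

    @Override
    public long get(String key) {
        return data.getOrDefault(key, 0L);
    }
}

// Counting component that holds no counts of its own.
final class StatelessCounter {
    private final CountStore store;

    StatelessCounter(CountStore store) {
        this.store = store;
    }

    long onWord(String word) {
        return store.increment(word);
    }
}
```

Because the counter keeps nothing locally, losing an instance loses no counts; the trade-off is a store round trip per update, which a real implementation might batch.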