Introduction to Streaming Distributed Processing with Storm
Presenter: Brandon O’Brien
Data Engineer @ Expedia
Outline
• Distributed Systems & Batch Processing
• Streaming Processing: Introducing Storm
• WordCount Demo & Setup
• Storm Cluster Architecture
• Storm Topology Architecture
• WordCount Deep Dive
• Discussion and Q&A: Storm Use Cases & Patterns
Distributed Systems
• Distribute work across N nodes
• Hadoop ecosystem
• Batch processing
• Massively parallel (horizontal scale-out)
• Problems – data latency, 24-hour batching vs. a global client base
• What’s next? Increasing need to move to real-time & streaming processing models
Streaming Processing
• Provides near-real-time views into analytical data sets and system status. Allows for real-time intervention & response to events
• Streaming frameworks: Spark Streaming, Azure Stream Analytics, AWS Kinesis + Lambda, Storm
• Storm was created by Nathan Marz at BackType and open-sourced by Twitter
• Storm: “Doing for realtime processing what Hadoop did for batch processing”
• Stream definition: “unbounded sequence of tuples”
Storm WordCount Demo
• WordCount Storm topology
  • Streams text blobs
  • Counts word occurrences
  • Reports results every 10 seconds
• Getting it running
  https://github.com/OpenDataMining/brandonobrien
  mvn clean install exec:java -Dexec.mainClass="dataclub.storm.TokenCountingTopology"
Storm Cluster Architecture
• Core components:
  • ZooKeeper
  • Nimbus
  • Supervisors
  • Workers (JVMs)
  • Executors (threads)
  • Components/tasks (bolts & spouts)
• Scalability – can add supervisors while topologies are running, no code change required
• Supervisors run Worker JVMs
• Workers run Executor threads
• Executors run Tasks (instances of Spouts and Bolts)
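The nesting of workers, executors and tasks can be illustrated with a small plain-Java model. This is a sketch, not Storm's scheduler: the class name `ParallelismSketch`, the method `assignTasks` and the round-robin placement are assumptions made for illustration. It only shows how a fixed number of tasks might be spread over a smaller number of executor threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not part of the Storm API.
class ParallelismSketch {

    // Spread task ids 0..numTasks-1 over numExecutors as evenly as possible
    // (round-robin is an assumption chosen for simplicity).
    static List<List<Integer>> assignTasks(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int t = 0; t < numTasks; t++) {
            executors.get(t % numExecutors).add(t);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Four tasks of one component shared by two executor threads.
        List<List<Integer>> plan = assignTasks(4, 2);
        for (int e = 0; e < plan.size(); e++) {
            System.out.println("executor " + e + " runs tasks " + plan.get(e));
        }
    }
}
```

With four tasks and two executors, each executor thread ends up running two tasks. In Storm itself the default is one task per executor, so a layout like this only arises when the task count is set higher than the executor count.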
Storm Topology Architecture
• DAG processing model
  • Directed Acyclic Graph
• Components: Spout & Bolt (benefit: decouple logic from scalability)
• Tasks (instances of Spouts & Bolts)
• Executors (run Tasks)
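To make the DAG model concrete, here is a minimal plain-Java sketch that records which component feeds which and checks that the wiring is acyclic. It does not use the Storm API; the class `TopologyDagSketch` and the component names are hypothetical, and the ordering uses Kahn's algorithm as one common way to validate a DAG.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a topology graph, not part of the Storm API.
class TopologyDagSketch {
    // Edges point from a component to the components that consume its output.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    void addComponent(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
    }

    void connect(String from, String to) {
        addComponent(from);
        addComponent(to);
        downstream.get(from).add(to);
    }

    // Kahn's algorithm: returns an upstream-to-downstream order,
    // or throws if the wiring contains a cycle.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String c : downstream.keySet()) {
            inDegree.put(c, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String c = ready.remove();
            order.add(c);
            for (String t : downstream.get(c)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("topology contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        TopologyDagSketch dag = new TopologyDagSketch();
        dag.connect("sentence-spout", "tokenizer-bolt");
        dag.connect("tokenizer-bolt", "counter-bolt");
        System.out.println(dag.topologicalOrder());
    }
}
```

Spouts are the nodes with no incoming edges; every other node is a bolt that consumes one or more upstream streams.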
Storm WordCount Deep Dive
• Topology structure
• Classes
  • Spout: SentenceProducer.java
  • Bolt: SentenceTokenizer.java
  • Bolt: TokenCounter.java
• Putting it all together: TokenCountingTopology.java
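The three-stage structure (produce sentences, tokenize, count) can be sketched in plain Java without a Storm dependency. This is not the code from the demo repository: the class `WordCountSketch`, its method names and the sample sentences are assumptions that only mirror the roles of the spout and the two bolts, and everything runs in a single thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-threaded stand-in for the word-count pipeline.
class WordCountSketch {

    // Plays the role of the spout: emits a fixed batch of sentences.
    static List<String> produceSentences() {
        return Arrays.asList("the quick brown fox", "the lazy dog", "the end");
    }

    // Plays the role of the tokenizer bolt: one sentence in, many tokens out.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Plays the role of the counting bolt: keeps a running count per token.
    static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            for (String token : tokenize(sentence)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(produceSentences()));
    }
}
```

In a real topology each stage would run as its own component with its own parallelism, and the stream would be unbounded rather than a fixed list.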
Storm Use Cases & Patterns
• Consume data from Kafka, Kinesis or another queue
• Persist data to a datastore with high write throughput, such as Cassandra
• Streaming map reduce, multi-stage map reduce
• Storm is stateless & fail-fast. Externalize state using Redis or another cache for resiliency
• Online learning / real-time model updates (using frameworks like WEKA or others)
• Real-world use cases: real-time ad targeting, travel market analytics, user behavior analytics, system monitoring & SLA
• Storm multi-lang API (Python, Ruby, Perl, JavaScript, Scala, and more)
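The "externalize state" pattern can be sketched as a counting component that keeps nothing in its own fields and writes every update to a store it is handed. This is a minimal sketch under stated assumptions: `CounterStore`, `InMemoryStore` and `Counter` are hypothetical names, and the in-memory map stands in for Redis or another cache so the example runs without a server.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of keeping state outside the processing component.
class ExternalStateSketch {

    // Assumed minimal store interface; a real one would wrap a cache client.
    interface CounterStore {
        long increment(String key);
        long get(String key);
    }

    // In-memory stand-in for the external store.
    static class InMemoryStore implements CounterStore {
        private final Map<String, Long> data = new HashMap<>();

        public long increment(String key) {
            return data.merge(key, 1L, Long::sum);
        }

        public long get(String key) {
            return data.getOrDefault(key, 0L);
        }
    }

    // Counting component with no state of its own, so it can fail and restart freely.
    static class Counter {
        private final CounterStore store;

        Counter(CounterStore store) {
            this.store = store;
        }

        void onToken(String token) {
            store.increment(token);
        }
    }

    public static void main(String[] args) {
        CounterStore store = new InMemoryStore();
        Counter first = new Counter(store);
        first.onToken("storm");
        first.onToken("storm");

        // Simulate a crash and restart: the new instance continues from the stored counts.
        Counter restarted = new Counter(store);
        restarted.onToken("storm");
        System.out.println(store.get("storm"));
    }
}
```

Because the counts live in the store, replacing the component instance loses nothing, which is the property that makes a fail-fast design workable.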
Distributed Streaming Processing with Storm
• Going Further
  https://storm.apache.org/
  http://storm.apache.org/documentation/Common-patterns.html
  Frameworks: Trident, Summingbird
  Stand up a Storm cluster: http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
• Contact
  Brandon O’Brien, Data Engineer @ Expedia
  https://www.linkedin.com/in/brandonjobrien
• Q&A

Editor's Notes

  • #1 A framework I’ve used extensively for real-time processing of travel market analytics data. It underpins the analytics platform I’m building, so I wanted to share what I’ve learned for anyone interested in getting started with streaming processing and Storm.
  • #2 Gauge the audience: engineers vs. data scientists. Today’s talk is focused on data engineering: setting up analytics pipelines and services to realize the value of data science models.
  • #3 For analytics processing that can’t fit in memory on a single machine, we need to scale horizontally, that is, add more machines. Ideally, we’d like to scale out using cheap commodity hardware. For a variety of reasons, many analytics teams need to move to a streaming processing model.
  • #4 Batch processing was great for reports, but if we can get real-time views into markets & systems, then we can get real-time alerts and updates. This unlocks whole new categories of use cases where we can see what’s happening in systems and markets in real time, and respond or intervene in real time.
  • #5 Hands-on demo so you can see a concrete example of what we’re talking about.
  • #8 Dig into the code.
  • #9 Feel dubious about Java? Use Python.