Introduction to Streaming Distributed Processing with Storm

Presenter: Brandon O’Brien
Data Engineer @ Expedia

Outline
- Distributed Systems & Batch Processing
- Streaming Processing & Introducing Storm
- WordCount Demo & Setup
- Storm Cluster Architecture
- Storm Topology Architecture
- WordCount Deep Dive
- Discussion and Q&A: Storm Use Cases & Patterns

Distributed Systems
- Distribute work across N nodes
- Hadoop ecosystem
- Batch processing
- Massively parallel (horizontal scale-out)
- Problems: data latency; 24-hour batch cycles fit poorly with a global client base
- What's next? Increasing need to move to real-time and streaming processing models

Streaming Processing
- Provides near-real-time views into analytical data sets and system status. Allows for real-time intervention and response to events
- Streaming frameworks: Spark, Azure Stream Analytics, AWS Kinesis + Lambda, Storm
- Storm: created by Nathan Marz at BackType, then open-sourced and run in production at Twitter
- Storm: “Doing for realtime processing what Hadoop did for batch processing”
- Stream definition: an “unbounded sequence of tuples”

Storm WordCount Demo
- WordCount Storm Topology
  - Streams text blobs
  - Counts word occurrences
  - Reports results every 10 seconds
- Getting it running
  - https://github.com/OpenDataMining/brandonobrien
  - mvn clean install exec:java -Dexec.mainClass="dataclub.storm.TokenCountingTopology"

Storm Cluster Architecture
- Core components:
  - Zookeeper
  - Nimbus
  - Supervisors
  - Workers (JVM processes)
  - Executors (threads)
  - Components/tasks (bolts & spouts)
- Scalability: supervisors can be added while topologies are running, with no code change required
- Supervisors run Worker JVMs
- Workers run Executor threads
- Executors run Tasks (instances of Spouts and Bolts)
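
The Supervisor > Worker > Executor > Task nesting above can be pictured with a small sketch. This is plain Java with no Storm dependency; the round-robin placement and the numbers (2 workers, 4 executors, 8 tasks) are assumptions chosen for illustration, not a description of Storm's actual scheduler.

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelismSketch {
    // Spread item indices 0..items-1 over `slots` buckets in round-robin order.
    // Illustrative only: Storm's real scheduler may place things differently.
    static List<List<Integer>> roundRobin(int items, int slots) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int s = 0; s < slots; s++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < items; i++) {
            buckets.get(i % slots).add(i);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int workers = 2, executors = 4, tasks = 8; // hypothetical sizes
        List<List<Integer>> executorsPerWorker = roundRobin(executors, workers);
        List<List<Integer>> tasksPerExecutor = roundRobin(tasks, executors);
        for (int w = 0; w < workers; w++) {
            System.out.println("worker JVM " + w);
            for (int e : executorsPerWorker.get(w)) {
                System.out.println("  executor thread " + e
                        + " runs tasks " + tasksPerExecutor.get(e));
            }
        }
    }
}
```

In a real topology these numbers are set through the topology configuration and per-component parallelism hints rather than computed by hand.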

Storm Topology Architecture
- DAG processing model
  - Directed Acyclic Graph
- Components: Spout & Bolt (benefit: decouple logic from scalability)
- Tasks (instances of Spouts & Bolts)
- Executors (run Tasks)

Storm WordCount Deep Dive
- Topology structure
- Classes
  - Spout: SentenceProducer.java
  - Bolt: SentenceTokenizer.java
  - Bolt: TokenCounter.java
  - Putting it all together: TokenCountingTopology.java
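
The data flow through the three classes (sentences in, tokens out, counts accumulated) can be sketched in plain Java. This is not the code from the demo repository and does not use the Storm API; the method names `tokenize` and `count` and the sample sentences are hypothetical stand-ins for the tokenizer bolt, the counter bolt, and the spout's output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Stand-in for the tokenizer bolt: one sentence in, zero or more tokens out.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for the counter bolt: keeps a running count per token.
    static Map<String, Integer> count(Iterable<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            for (String token : tokenize(sentence)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the spout: a fixed list instead of an unbounded stream.
        List<String> sentences = Arrays.asList(
                "the cow jumped over the moon",
                "the moon is bright");
        System.out.println(count(sentences));
    }
}
```

In Storm each stage would run as its own component with its own parallelism, which is the decoupling of logic from scalability noted on the topology slide.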

Storm Use Cases & Patterns
- Consume data from Kafka, Kinesis, or another queue
- Persist data to a datastore with high write performance, such as Cassandra
- Streaming map reduce, multi-stage map reduce
- Storm is stateless & fail-fast. Externalize state using Redis or another cache for resiliency
- Online learning / real-time model updates (using frameworks like WEKA or others)
- Real-world use cases: real-time ad targeting, travel market analytics, user behavior analytics, system monitoring & SLA
- Storm multi-lang API (Python, Ruby, Perl, JavaScript, Scala, and more)
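
The point about externalizing state can be sketched as follows. The `CounterStore` interface and the class names are hypothetical, and the in-memory map is only a stand-in so the sketch runs on its own; a real deployment might back the same interface with Redis or another cache.

```java
import java.util.HashMap;
import java.util.Map;

public class ExternalStateSketch {
    // Hypothetical store interface; not part of Storm or any Redis client.
    interface CounterStore {
        long increment(String key);
        long get(String key);
    }

    // In-memory stand-in so the sketch runs without external services.
    static class InMemoryStore implements CounterStore {
        private final Map<String, Long> data = new HashMap<>();

        public long increment(String key) {
            return data.merge(key, 1L, Long::sum);
        }

        public long get(String key) {
            return data.getOrDefault(key, 0L);
        }
    }

    // Bolt-like component that holds no counts of its own.
    static class StatelessCounter {
        private final CounterStore store;

        StatelessCounter(CounterStore store) {
            this.store = store;
        }

        void process(String token) {
            store.increment(token);
        }
    }

    public static void main(String[] args) {
        CounterStore store = new InMemoryStore();
        StatelessCounter first = new StatelessCounter(store);
        first.process("storm");
        // Simulate a crash and restart: a fresh instance, same external store.
        StatelessCounter restarted = new StatelessCounter(store);
        restarted.process("storm");
        System.out.println("count for storm: " + store.get("storm"));
    }
}
```

Because the component itself keeps nothing, a failed instance can be replaced without losing the counts held in the store.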

Distributed Streaming Processing with Storm
- Going Further
  - https://storm.apache.org/
  - http://storm.apache.org/documentation/Common-patterns.html
  - Frameworks: Trident, Summingbird
  - Stand up a Storm cluster: http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
- Contact
  - Brandon O’Brien, Data Engineer @ Expedia
  - https://www.linkedin.com/in/brandonjobrien
- Q&A
