1. 16BIT IITR
Data Collection Module
Streaming Data Processing with Apache Storm
Data Stream Processing
Slides @ https://goo.gl/BJRf9A
2. Overview
• Streaming Data Processing
• What is Apache Storm?
• Storm Architecture and Key Concepts
• Monitoring of Storm Cluster
• Development of Storm Apps
• Comparison with other software
• Resources
3. Types of Processing of Big Data
• Batch Processing
Takes a large amount of data at a time, analyzes it, and produces a large output.
• Real-Time Processing
Collects, analyzes, and produces output in real time.
4. Streaming Data Processing
• Today, most data is continuously produced
user activity logs, web logs, sensors, database transactions, social data…
• The common approach to analyzing such data so far:
o Record data stream to stable storage (DBMS, HDFS, …)
o Periodically analyze data with batch processing engine
(DBMS, MapReduce, ...)
• Stream processing engines analyze data as it arrives
5. Why do Stream Processing?
• Decreases the overall latency to obtain results
o No need to persist data in stable storage
o No periodic batch analysis jobs
• Simplifies the data infrastructure
o Fewer moving parts to be maintained and coordinated
• Makes time dimension of data explicit
o Each event has a timestamp
o Data can be processed based on timestamps
6. What are the Requirements?
• Large Scale
• Low latency
Results in milliseconds
• High throughput
Millions of events per second
• Exactly-once consistency
Correct results in case of failures
• Out-of-order events
Process events based on their associated time
• Intuitive APIs
7. Streaming Data Architecture
8. Stream Processing Technologies
9. Apache Storm
• Distributed, fault-tolerant, real-time computation system
• Originated at BackType/Twitter; open-sourced in late 2011
• Implemented in Clojure and some Java
• Supports APIs in many languages, including Java, Python, and Scala
10. Use Cases of Apache Storm
11. Storm Cluster Architecture
[Diagram: a Nimbus master node coordinating several Supervisor nodes via a three-node ZooKeeper ensemble]
Nimbus – The Master Node
o Distributes code around the cluster
o Assigns tasks to machines/supervisors
o Monitors for failures
o Stateless
Apache ZooKeeper
o Highly robust
o Provides service discovery and coordination
o Stores the cluster state
Supervisor
o Listens for work assigned to its machine
o Starts and stops worker processes based on instructions from Nimbus
o Stateless
12. Storm Architecture – Fault Tolerance
• What happens when Nimbus dies (master node)?
o If Nimbus is run under process supervision as recommended (e.g. via supervisord), it will restart like
nothing happened.
o While Nimbus is down:
Existing topologies will continue to run, but you cannot submit new topologies.
Running worker processes will not be affected. Also, Supervisors will restart their (local) workers if needed.
However, failed tasks will not be reassigned to other machines, as this is the responsibility of Nimbus.
• What happens when a Supervisor dies (slave node)?
o If a Supervisor is run under process supervision as recommended (e.g. via supervisord), it will restart like
nothing happened.
o Running worker processes will not be affected.
• What happens when a worker process dies?
Its parent Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus,
Nimbus will reassign the worker to another machine.
13. Key Concepts – Data Model
Data Stream: an unbounded sequence of tuples
14. Key Concepts – Spouts and Bolts
Spouts
• Source of data streams
Example: connect to the Twitter API and emit a stream of tweets.
Bolts
• Consume streams and potentially produce new streams
• Can do anything: run functions, filter tuples, perform joins, talk to databases, etc.
• Complex stream transformations often require multiple steps, and thus multiple bolts.
[Diagram: Spout 1 → Bolt 1 → Bolt 2]
15. Key Concepts – Topology
• Network of Spouts and Bolts
• Wires data and functions via a DAG.
• Executes forever and on many machines.
[Diagram: Spout 1 and Spout 2 feeding data into a DAG of Bolts 1–4]
16. Deploying a Storm Cluster
• http://storm.apache.org/about/deployment.html
• http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
• http://knowm.org/how-to-install-a-distributed-apache-storm-cluster/
• https://github.com/nathanmarz/storm-deploy
17. Monitoring your Storm Cluster
• Storm UI
18. Developing Storm Apps
A trivial “Hello, Storm” topology
[Diagram: a Spout (“emit random number < 100”, e.g. 74) feeding a Bolt (“multiply by 2”, e.g. 148)]
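The data flow above can be sketched in plain Java, with no Storm dependency (the class and variable names here are purely illustrative):

```java
import java.util.Random;
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

// Plain-Java sketch of the "Hello, Storm" data flow (no Storm dependency):
// a source emits random numbers below 100, an operator doubles each one.
public class HelloStormSketch {
    public static void main(String[] args) {
        IntSupplier spout = () -> new Random().nextInt(100); // "emit random number < 100"
        IntUnaryOperator bolt = n -> n * 2;                  // "multiply by 2"

        for (int i = 0; i < 5; i++) {
            int in = spout.getAsInt();
            int out = bolt.applyAsInt(in); // e.g. 74 -> 148
            System.out.println(in + " -> " + out);
        }
    }
}
```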
19. Developing Storm Apps – Writing Spouts
• Multiple kinds of built-in spouts are available to connect to various kinds of streams
Example: a basic spout that generates its own data
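A sketch of such a self-generating spout, assuming Storm 1.x (org.apache.storm packages); the class name RandomNumberSpout is illustrative:

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Illustrative spout that generates its own data: random numbers below 100.
public class RandomNumberSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Called once per task when the spout is initialized on a worker.
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; emit one tuple per call.
        Utils.sleep(100); // throttle the toy example
        collector.emit(new Values(random.nextInt(100)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}
```

Note the block requires the Storm dependency on the classpath; it is a sketch of the API shape, not a production spout.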
20. Developing Storm Apps – Writing Bolts
• Two main options for JVM users:
o Implement the IRichBolt or IBasicBolt interfaces
o Extend the BaseRichBolt or BaseBasicBolt abstract classes
• BaseRichBolt
o You must – and are able to – manually ack() an incoming tuple.
• BaseBasicBolt
o Auto-acks the incoming tuple at the end of its execute() method.
o These bolts are typically simple functions or filters.
21. Developing Storm Apps – Writing Bolts
Extending BaseRichBolt
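A sketch of such a bolt, assuming Storm 1.x; DoublerBolt is an illustrative name. It doubles each incoming number and, as BaseRichBolt requires, acks manually:

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Illustrative bolt: multiplies each incoming number by 2.
// With BaseRichBolt you must ack (or fail) every tuple yourself.
public class DoublerBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // "Second constructor": runs on the target JVM after deserialization.
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        int number = input.getIntegerByField("number");
        // Anchor the output to the input tuple so failed tuples can be replayed.
        collector.emit(input, new Values(number * 2));
        collector.ack(input); // manual ack, required when extending BaseRichBolt
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("doubled"));
    }
}
```

Had this extended BaseBasicBolt instead, the ack() call would be unnecessary: the tuple is auto-acked when execute() returns.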
22. Developing Storm Apps – Writing Bolts
execute() is the heart of the bolt.
This is where you will focus most of your attention when implementing your bolt or when trying to
understand somebody else’s bolt.
23. Developing Storm Apps – Writing Bolts
prepare() acts as a “second constructor” for the bolt’s class.
Because of Storm’s distributed execution model and serialization, prepare() is often needed to fully
initialize the bolt on the target JVM.
24. Developing Storm Apps – Writing Bolts
declareOutputFields() tells downstream bolts about this bolt’s output. What you declare must
match what you actually emit().
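For a bolt that only performs side effects (e.g. writes to a database) and never emits tuples, there is nothing to declare. A sketch assuming Storm 1.x, with the illustrative name DbWriterBolt:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Illustrative side-effect-only bolt: it consumes tuples (e.g. to write them
// to a database) but never calls emit(), so it declares no output fields.
public class DbWriterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // write the tuple to a database here; no emit(), so nothing to declare
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // intentionally empty: this bolt produces no output stream
    }
}
```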
25. Developing Storm Apps – Writing a Topology
• When creating a topology you’re essentially defining the DAG – that is, which spouts and bolts to use,
and how they interconnect.
26. Developing Storm Apps – Writing a Topology
• You must specify the initial parallelism of the topology
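The wiring and initial parallelism can be sketched like this, assuming Storm 1.x (RandomNumberSpout and DoublerBolt are illustrative placeholder classes):

```java
import org.apache.storm.topology.TopologyBuilder;

// Illustrative DAG wiring: one spout feeding one bolt, with explicit parallelism.
public class TopologySketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Two executors (threads) running the spout.
        builder.setSpout("numbers", new RandomNumberSpout(), 2);

        // Four executors running the bolt; shuffle grouping distributes
        // tuples randomly but evenly across them.
        builder.setBolt("doubler", new DoublerBolt(), 4)
               .shuffleGrouping("numbers");

        // builder.createTopology() yields a StormTopology ready to submit.
    }
}
```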
27. Developing Storm Apps – Submitting and Running a Topology
• You submit a topology either to a “local” cluster or to a real cluster.
• To run a topology you must first package your code into a “fat jar”.
o You must include all of your code’s dependencies, but:
o Exclude the Storm dependency itself, as the Storm cluster will provide this.
Note: You may need to tweak your build script so that your local tests do include the Storm dependency.
See e.g. assembly.sbt in kafka-storm-starter for an example.
• A topology is run via the storm jar command.
o It connects to Nimbus, uploads your jar, and runs the topology.
o Use any machine that can run storm jar and talk to Nimbus’ Thrift port.
o The configuration of the machine on which the topology is deployed is passed through a separate
config file.
$ storm jar all-my-code.jar com.miguno.MyTopology arg1 arg2
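In code, the local-versus-real-cluster choice looks roughly like this (a sketch assuming Storm 1.x; the topology name and argument handling are illustrative):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;

// Illustrative submission logic: run in-process for local testing, or
// submit to a real cluster (storm jar uploads this jar to Nimbus first).
public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        StormTopology topology = new TopologyBuilder().createTopology();
        Config conf = new Config();

        if (args.length == 0) {
            // "Local" cluster: simulates a Storm cluster inside this JVM.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("my-topology-local", conf, topology);
        } else {
            // Real cluster: the name comes from the command line, e.g. arg1.
            StormSubmitter.submitTopology(args[0], conf, topology);
        }
    }
}
```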
28. Many stream processing frameworks – which one to use?
The right choice depends on your use case.
29. Resources
• A few Storm books are already available.
• Storm documentation
http://storm.incubator.apache.org/documentation/Home.html
• Storm-kafka
https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
• Mailing lists
http://storm.incubator.apache.org/community.html
• Code examples
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/miguno/kafka-storm-starter/
32. Use Cases of Apache Storm
• Stream processing:
Storm can process a stream of data and update a variety of databases in real time.
The processing speed needs to match the speed of the incoming data.
• Continuous computation:
Storm can run continuous computations on data streams and stream the results to clients in
real time.
• Distributed RPC (DRPC):
Storm can parallelize an intense query so that it can be computed in real time.
• Real-time analytics:
Storm can analyze and respond, in real time, to data arriving from different sources.
33. What can I do with Wirbelsturm?
• Get a first impression of Storm
• Test-drive your topologies
• Test failure handling
• Stop/kill Nimbus, check what happens to Supervisors.
• Stop/kill ZooKeeper instances, check what happens to topology.
• Use as sandbox environment to test/validate deployments
• “What will actually happen when I deactivate this topology?”
• “Will my Hiera changes actually work?”
• Reproduce production issues, share results with Dev
• Also helpful when reporting back to Storm project and mailing lists.
• Any further cool ideas?
Editor's Notes
You will use this information in downstream bolts to “extract” the data from the emitted tuples.
If your bolt only performs side effects (e.g. talks to a DB) but does not emit an actual tuple, override this method with an empty {} body.