1. 16BIT IITR
Data Collection Module
Streaming Data Processing with Apache Storm
Data Stream Processing
Slides @ https://goo.gl/BJRf9A
2. Overview
• Streaming Data Processing
• What is Apache Storm?
• Storm Architecture and Key Concepts
• Monitoring of Storm Cluster
• Development of Storm Apps
• Comparison with other software
• Resources
3. Types of Processing of Big Data
• Batch Processing
Takes a large amount of data at a time, analyzes it, and produces a large output.
• Real-Time Processing
Collects, analyzes, and produces output in real time.
4. Streaming Data Processing
• Today, most data is continuously produced
user activity logs, web logs, sensors, database transactions, social data…
• The common approach to analyzing such data so far:
o Record data stream to stable storage (DBMS, HDFS, …)
o Periodically analyze data with batch processing engine
(DBMS, MapReduce, ...)
• Stream processing engines analyze data as it arrives
5. Why do Stream Processing?
• Decreases the overall latency to obtain results
o No need to persist data in stable storage
o No periodic batch analysis jobs
• Simplifies the data infrastructure
o Fewer moving parts to be maintained and coordinated
• Makes time dimension of data explicit
o Each event has a timestamp
o Data can be processed based on timestamps
6. What are the Requirements?
• Large Scale
• Low latency
Results in milliseconds
• High throughput
Millions of events per second
• Exactly-once consistency
Correct results in case of failures
• Out-of-order events
Process events based on their associated time
• Intuitive APIs
7. Streaming Data Architecture
8. Stream Processing Technologies
9. Apache Storm
• Distributed, fault-tolerant, real-time computation system
• Originated at BackType/Twitter; open-sourced in late 2011
• Implemented in Clojure and some Java
• Supports APIs in many languages, including Java, Python, and Scala
10. Use Cases of Apache Storm
11. Storm Cluster Architecture
[Diagram: a Nimbus master node coordinating several Supervisor nodes via a three-node ZooKeeper ensemble]
Nimbus – The Master Node
o Distributes code around the cluster
o Assigns tasks to machines/supervisors
o Monitors for failures
o Stateless
Apache ZooKeeper
o Highly robust
o Provides service discovery and coordination
o Stores the cluster state
Supervisor
o Listens for work assigned to its machine
o Starts and stops worker processes based on instructions from Nimbus
o Stateless
12. Storm Architecture – Fault Tolerance
• What happens when Nimbus dies (master node)?
o If Nimbus is run under process supervision as recommended (e.g. via supervisord), it will restart like
nothing happened.
o While Nimbus is down:
Existing topologies will continue to run, but you cannot submit new topologies.
Running worker processes will not be affected. Also, Supervisors will restart their (local) workers if needed.
However, failed tasks will not be reassigned to other machines, as this is the responsibility of Nimbus.
• What happens when a Supervisor dies (slave node)?
o If a Supervisor is run under process supervision as recommended (e.g. via supervisord), it will restart like
nothing happened.
o Running worker processes will not be affected.
• What happens when a worker process dies?
Its parent Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus,
Nimbus will reassign the worker to another machine.
13. Key Concepts – Data Model
Data Stream: an unbounded sequence of tuples
14. Key Concepts – Spouts and Bolts
Spouts
• Source of data streams
Example: connect to the Twitter API and emit a stream of tweets.
Bolts
• Consume streams and potentially produce new streams
• Can do anything: run functions, filter tuples, perform joins, talk to databases, etc.
• Complex stream transformations often require multiple steps, and thus multiple bolts.
[Diagram: Spout 1 → Bolt 1 → Bolt 2]
15. Key Concepts – Topology
• Network of Spouts and Bolts
• Wires data and functions via a DAG.
• Executes forever and on many machines.
[Diagram: Spout 1 and Spout 2 feeding data into a DAG of Bolts 1–4]
16. Deploying a Storm Cluster
• http://storm.apache.org/about/deployment.html
• http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
• http://knowm.org/how-to-install-a-distributed-apache-storm-cluster/
• https://github.com/nathanmarz/storm-deploy
17. Monitoring your Storm Cluster
• Storm UI
18. Developing Storm Apps
A trivial “Hello, Storm” topology
[Diagram: a Spout (“emit random number < 100”, e.g. 74) feeding a Bolt (“multiply by 2”, e.g. 148)]
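The data flow above can be sketched in plain Java, with no Storm dependency (the class and variable names here are purely illustrative):

```java
import java.util.Random;
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

// Plain-Java sketch of the "Hello, Storm" data flow (no Storm dependency):
// a source emits random numbers below 100, an operator doubles each one.
public class HelloStormSketch {
    public static void main(String[] args) {
        IntSupplier spout = () -> new Random().nextInt(100); // "emit random number < 100"
        IntUnaryOperator bolt = n -> n * 2;                  // "multiply by 2"

        for (int i = 0; i < 5; i++) {
            int in = spout.getAsInt();
            int out = bolt.applyAsInt(in); // e.g. 74 -> 148
            System.out.println(in + " -> " + out);
        }
    }
}
```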
19. Developing Storm Apps – Writing Spouts
• Multiple kinds of built-in spouts are available to connect to various kinds of streams
Example: a basic spout that generates its own data
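A sketch of such a self-generating spout, assuming Storm 1.x (org.apache.storm packages); the class name RandomNumberSpout is illustrative:

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Illustrative spout that generates its own data: random numbers below 100.
public class RandomNumberSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Called once per task when the spout is initialized on a worker.
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; emit one tuple per call.
        Utils.sleep(100); // throttle the toy example
        collector.emit(new Values(random.nextInt(100)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}
```

Note the block requires the Storm dependency on the classpath; it is a sketch of the API shape, not a production spout.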
20. Developing Storm Apps – Writing Bolts
• Two main options for JVM users:
o Implement the IRichBolt or IBasicBolt interfaces
o Extend the BaseRichBolt or BaseBasicBolt abstract classes
• BaseRichBolt
o You must – and are able to – manually ack() an incoming tuple.
• BaseBasicBolt
o Auto-acks the incoming tuple at the end of its execute() method.
o These bolts are typically simple functions or filters.
21. Developing Storm Apps – Writing Bolts
Extending BaseRichBolt
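A sketch of such a bolt, assuming Storm 1.x; DoublerBolt is an illustrative name. It doubles each incoming number and, as BaseRichBolt requires, acks manually:

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Illustrative bolt: multiplies each incoming number by 2.
// With BaseRichBolt you must ack (or fail) every tuple yourself.
public class DoublerBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // "Second constructor": runs on the target JVM after deserialization.
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        int number = input.getIntegerByField("number");
        // Anchor the output to the input tuple so failed tuples can be replayed.
        collector.emit(input, new Values(number * 2));
        collector.ack(input); // manual ack, required when extending BaseRichBolt
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("doubled"));
    }
}
```

Had this extended BaseBasicBolt instead, the ack() call would be unnecessary: the tuple is auto-acked when execute() returns.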
22. Developing Storm Apps – Writing Bolts
execute() is the heart of the bolt.
This is where you will focus most of your attention when implementing your bolt or when trying to
understand somebody else’s bolt.
23. Developing Storm Apps – Writing Bolts
prepare() acts as a “second constructor” for the bolt’s class.
Because of Storm’s distributed execution model and serialization, prepare() is often needed to fully
initialize the bolt on the target JVM.
24. Developing Storm Apps – Writing Bolts
declareOutputFields() tells downstream bolts about this bolt’s output. What you declare must
match what you actually emit().
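For a bolt that only performs side effects (e.g. writes to a database) and never emits tuples, there is nothing to declare. A sketch assuming Storm 1.x, with the illustrative name DbWriterBolt:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Illustrative side-effect-only bolt: it consumes tuples (e.g. to write them
// to a database) but never calls emit(), so it declares no output fields.
public class DbWriterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // write the tuple to a database here; no emit(), so nothing to declare
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // intentionally empty: this bolt produces no output stream
    }
}
```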
25. Developing Storm Apps – Writing a Topology
• When creating a topology you’re essentially defining the DAG – that is, which spouts and bolts to use,
and how they interconnect.
26. Developing Storm Apps – Writing a Topology
• You must specify the initial parallelism of the topology
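The wiring and initial parallelism can be sketched like this, assuming Storm 1.x (RandomNumberSpout and DoublerBolt are illustrative placeholder classes):

```java
import org.apache.storm.topology.TopologyBuilder;

// Illustrative DAG wiring: one spout feeding one bolt, with explicit parallelism.
public class TopologySketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Two executors (threads) running the spout.
        builder.setSpout("numbers", new RandomNumberSpout(), 2);

        // Four executors running the bolt; shuffle grouping distributes
        // tuples randomly but evenly across them.
        builder.setBolt("doubler", new DoublerBolt(), 4)
               .shuffleGrouping("numbers");

        // builder.createTopology() yields a StormTopology ready to submit.
    }
}
```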
27. Developing Storm Apps – Submitting and Running a Topology
• You submit a topology either to a “local” cluster or to a real cluster.
• To run a topology you must first package your code into a “fat jar”.
o You must include all of your code’s dependencies, but:
o Exclude the Storm dependency itself, as the Storm cluster will provide this.
Note: You may need to tweak your build script so that your local tests do include the Storm dependency.
See e.g. assembly.sbt in kafka-storm-starter for an example.
• A topology is run via the storm jar command.
o It connects to Nimbus, uploads your jar, and runs the topology.
o Use any machine that can run storm jar and talk to Nimbus’ Thrift port.
o The configuration of the machine on which the topology is deployed is passed through a separate
config file.
$ storm jar all-my-code.jar com.miguno.MyTopology arg1 arg2
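In code, the local-versus-real-cluster choice looks roughly like this (a sketch assuming Storm 1.x; the topology name and argument handling are illustrative):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;

// Illustrative submission logic: run in-process for local testing, or
// submit to a real cluster (storm jar uploads this jar to Nimbus first).
public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        StormTopology topology = new TopologyBuilder().createTopology();
        Config conf = new Config();

        if (args.length == 0) {
            // "Local" cluster: simulates a Storm cluster inside this JVM.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("my-topology-local", conf, topology);
        } else {
            // Real cluster: the name comes from the command line, e.g. arg1.
            StormSubmitter.submitTopology(args[0], conf, topology);
        }
    }
}
```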
28. Many stream processing frameworks – which one to use?
The right choice depends on your use case.
29. Resources
• A few Storm books are already available.
• Storm documentation
http://storm.incubator.apache.org/documentation/Home.html
• Storm-kafka
https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
• Mailing lists
http://storm.incubator.apache.org/community.html
• Code examples
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/miguno/kafka-storm-starter/
32. Use Cases of Apache Storm
• Stream processing:
Storm can process a stream of data and update a variety of databases in real time.
The processing speed needs to match the speed of the incoming data.
• Continuous computation:
Storm can run continuous computations on data streams and stream the results to clients in
real time.
• Distributed RPC (DRPC):
Storm can parallelize an intense query so that it can be computed in real time.
• Real-time analytics:
Storm can analyze and respond, in real time, to data arriving from different sources.
33. What can I do with Wirbelsturm?
• Get a first impression of Storm
• Test-drive your topologies
• Test failure handling
• Stop/kill Nimbus, check what happens to Supervisors.
• Stop/kill ZooKeeper instances, check what happens to topology.
• Use as sandbox environment to test/validate deployments
• “What will actually happen when I deactivate this topology?”
• “Will my Hiera changes actually work?”
• Reproduce production issues, share results with Dev
• Also helpful when reporting back to Storm project and mailing lists.
• Any further cool ideas?
Editor's Notes
You will use this information in downstream bolts to “extract” the data from the emitted tuples.
If your bolt only performs side effects (e.g. talks to a DB) but does not emit an actual tuple, override this method with an empty {} body.