Apache Storm

Presented By: Rajind Ruparathna (179349M)

Outline
• What is Storm?
• Who uses Storm?
• Storm Vs Hadoop
• Storm Components
• Storm Topology
• Storm Primitives
• Why is Storm ideal for Real-Time Processing?

What is Storm?
• Apache Storm is a free and open source distributed real-time computation
system.
• Storm makes it easy to reliably process unbounded streams of data.
• Storm does for real-time processing what Hadoop did for batch processing.
• Storm is simple and can be used with any programming language.

Storm Vs Hadoop
Storm is used for real-time computation, whereas Hadoop is used for batch
computation.

Storm Vs Hadoop contd.
               Hadoop            Storm
Components     JobTracker        Nimbus
               TaskTracker       Supervisor
               Child             Worker
Applications   Job               Topology
Primitives     Mapper/Reducer    Spout/Bolt

Batch Data Processing Architecture

Data Processing Architecture: Batch and Real-time

Storm Components
A Storm cluster has three sets of nodes: Nimbus, ZooKeeper, and Supervisor nodes.
Nimbus Node (Master)
• Uploads computations for execution
• Distributes code across the cluster
• Launches workers across the cluster
• Monitors computation and reallocates workers as needed

Storm Components contd.
ZooKeeper Nodes
• Coordinate the Storm cluster
Supervisor Nodes
• Communicate with Nimbus through ZooKeeper, and start and stop workers
according to signals from Nimbus
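
The division of labour between the three node types can be sketched in a few lines of plain Python. This is a toy model under loud assumptions: a dict stands in for ZooKeeper, and the function names `nimbus_assign` and `supervisor_sync`, the task names, and the round-robin placement are all invented for illustration. None of it is Storm's API or its actual scheduling logic.

```python
# Toy model only: a dict stands in for ZooKeeper, and plain functions stand in
# for the Nimbus and Supervisor daemons. None of these names are Storm APIs.

def nimbus_assign(coordination_store, tasks, live_supervisors):
    """Spread tasks round-robin over live supervisors and publish the plan."""
    assignments = {name: [] for name in live_supervisors}
    for i, task in enumerate(tasks):
        target = live_supervisors[i % len(live_supervisors)]
        assignments[target].append(task)
    coordination_store["assignments"] = assignments


def supervisor_sync(coordination_store, name, running):
    """Work out which local workers to start or stop to match the plan."""
    wanted = set(coordination_store["assignments"].get(name, []))
    started = wanted - running
    stopped = running - wanted
    return wanted, started, stopped


store = {}
tasks = ["spout-1", "bolt-1", "bolt-2", "bolt-3"]

nimbus_assign(store, tasks, ["sup-a", "sup-b"])
running_a, started_a, _ = supervisor_sync(store, "sup-a", set())
print("sup-a runs:", sorted(running_a))

# sup-b dies: Nimbus republishes, and sup-a picks up the orphaned tasks.
nimbus_assign(store, tasks, ["sup-a"])
running_a, started_a, _ = supervisor_sync(store, "sup-a", running_a)
print("sup-a now runs:", sorted(running_a), "newly started:", sorted(started_a))
```

The point of the sketch is that Nimbus and the supervisors never call each other directly: Nimbus only writes to the shared store, and each supervisor only reads from it.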

Storm Topology
• The work is delegated to different types of components that are each responsible
for a simple specific processing task.
• The input stream of a Storm cluster is handled by a component called a spout.

Storm Topology contd.
• The spout passes the data to a component called a bolt, which transforms it in
some way.
• A bolt either persists the data in some sort of storage, or passes it to some
other bolt.
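
The spout-to-bolt flow described above can be imitated with Python generators. This is only a sketch of the data flow, not the Storm API: `log_spout`, `normalize_bolt`, `persist_bolt`, and the in-memory `storage` list are hypothetical names chosen for the example.

```python
# Plain-Python sketch of the spout -> bolt -> bolt flow; not the Storm API.

def log_spout(lines):
    """Spout: emit one tuple per input line."""
    for line in lines:
        yield (line,)


def normalize_bolt(stream):
    """Bolt: transform each tuple and pass it on to the next bolt."""
    for (line,) in stream:
        yield (line.strip().lower(),)


def persist_bolt(stream, storage):
    """Terminal bolt: persist each tuple instead of emitting it further."""
    for (line,) in stream:
        storage.append(line)


storage = []  # hypothetical stand-in for a database or cache
raw = ["  GET /index.html ", "POST /login  "]
persist_bolt(normalize_bolt(log_spout(raw)), storage)
print(storage)
```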

Storm Primitives
• Streams
• Spouts
• Bolts
• Topologies

Storm Primitives contd.
Streams - Unbounded sequence of tuples

Spouts - Sources of streams
● Read from a Kestrel/Kafka queue. {tuples = events}
● Read from an HTTP server log. {tuples = HTTP requests}
● Read from the Twitter Streaming API. {tuples = tweets}

Bolts - Process input streams and produce output
streams
● Filtering tuples in a stream
● Aggregation of tuples
● Joining multiple streams
● Arbitrary functions on streams
● Communication with external caches/dbs.

Topology - Directed acyclic graph (DAG) of spouts
and bolts
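
To make the DAG idea concrete, here is a minimal sketch that describes a topology as a dependency graph and checks that it is acyclic using Python's standard `graphlib` module (Python 3.9+). The component names are made up, and this is not how Storm itself represents or validates topologies.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a component; its set lists the components it reads from.
# Component names are made up for illustration.
topology = {
    "tweet_spout": set(),
    "log_spout": set(),
    "filter_bolt": {"tweet_spout"},
    "join_bolt": {"filter_bolt", "log_spout"},
    "store_bolt": {"join_bolt"},
}


def processing_order(graph):
    """Return a valid upstream-to-downstream order, or raise on a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = processing_order(topology)
print(order)

# Feeding store_bolt back into filter_bolt would create a cycle.
cyclic = {**topology, "filter_bolt": {"tweet_spout", "store_bolt"}}
try:
    processing_order(cyclic)
    is_dag = True
except CycleError:
    is_dag = False
print("cyclic variant is a DAG:", is_dag)
```

Spouts are the nodes with no upstream components, so they always come before the bolts that consume them in any valid order.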

Storm Sample - Word Count
https://docs.microsoft.com/en-us/azure/hdinsight/storm/apache-storm-develop-java-topology
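
The linked sample is written in Java against the Storm API. As a rough illustration of the same idea, the sketch below imitates a word-count topology in plain Python. The function names, the sample sentences, and `NUM_COUNT_TASKS` are assumptions made for this example. Routing each word to a task by hash mimics what Storm calls a fields grouping, but the code does not use Storm.

```python
from collections import Counter
from zlib import crc32

# Plain-Python imitation of a word-count topology; not the Storm Java API.
NUM_COUNT_TASKS = 2  # assumed parallelism of the counting bolt


def sentence_spout():
    """Spout: emit a fixed list of sentences as one-field tuples."""
    for sentence in ["the cow jumped over the moon", "the moon is full"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: split each sentence into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: route each word to a task by hash, then count within that task."""
    tasks = [Counter() for _ in range(NUM_COUNT_TASKS)]
    for (word,) in stream:
        # The same word always lands on the same task, so its count stays
        # in one place.
        tasks[crc32(word.encode()) % NUM_COUNT_TASKS][word] += 1
    return tasks


tasks = count_bolt(split_bolt(sentence_spout()))
totals = sum(tasks, Counter())
print(totals.most_common(2))
```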

Why is Storm ideal for Real-Time Processing?
• Fast – Benchmarked at one million 100-byte messages per second per node
(about 100 MB/s per node).
• Scalable – With parallel calculations that run across a cluster of machines.
• Fault-tolerant – When workers die, Storm will automatically restart them. If a
node dies, the worker will be restarted on another node.
• Reliable – Storm guarantees that each unit of data (tuple) will be processed at
least once, or exactly once when using the higher-level Trident API. Messages are
replayed only when there are failures.
• Easy to operate – Standard configurations are suitable for production on day
one. Once deployed, Storm is easy to operate.
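
The "at least once" guarantee and the replay-on-failure behaviour can be illustrated with a toy model. Everything here is an assumption made for the sketch: `run_with_replay`, `flaky_bolt`, and the `max_attempts` limit are hypothetical names, and Storm's real acking mechanism is more involved than this loop.

```python
# Toy model of at-least-once delivery; the names here are not Storm APIs.

def run_with_replay(messages, process, max_attempts=3):
    """Replay each failed message until it is acked or attempts run out."""
    pending = dict(enumerate(messages))  # message id -> payload awaiting ack
    attempts = {msg_id: 0 for msg_id in pending}
    while pending:
        for msg_id, payload in list(pending.items()):
            attempts[msg_id] += 1
            try:
                process(payload)
                del pending[msg_id]  # ack: never replayed again
            except RuntimeError:
                if attempts[msg_id] >= max_attempts:
                    raise  # give up instead of looping forever
    return attempts


seen = []
failed_once = set()


def flaky_bolt(payload):
    """Record the payload, then fail the first time it sees 'b'."""
    seen.append(payload)  # side effect happens before the failure
    if payload == "b" and payload not in failed_once:
        failed_once.add(payload)
        raise RuntimeError("simulated worker failure")


attempts = run_with_replay(["a", "b", "c"], flaky_bolt)
print(attempts)
print(seen)
```

Only the failed message is replayed, but its side effect is recorded twice. This is why at-least-once processing usually calls for idempotent bolts or downstream deduplication.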

Storm Use Cases @Twitter contd.
• Discovery of emerging topics/stories.
• Online learning of tweet features for search result ranking.
• Real-time analytics for ads.
• Internal log processing.

References
http://storm.apache.org/index.html
https://docs.microsoft.com/en-us/azure/hdinsight/storm/apache-storm-develop-java-topology
https://www.tutorialspoint.com/apache_storm/index.htm
https://github.com/apache/storm
