2. Introduction
Hadoop and related technologies have made it possible to store and process data at large scales.
Unfortunately, these data processing technologies are not realtime systems.
Hadoop does batch processing instead of realtime processing.
Apache Storm 2
5. Introduction
E.g. airline system
Realtime data processing at massive scale is becoming more and more of a requirement for businesses.
The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.
There's no hack that will turn Hadoop into a realtime system.
8. Advantages
Free, simple and open source
Can be used with any programming language
Very fast
Scalable
Fault-tolerant
Guarantees your data will be processed
Integrates with any database technology
10. Storm vs Hadoop
A Storm cluster is superficially similar to a Hadoop cluster.
Hadoop runs "MapReduce jobs", while Storm runs "topologies".
11. A MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).
Spouts and Bolts
Spouts
Bolts
13. Spouts and Bolts
[Diagram: a topology made of Spout 1, Spout 2, and Bolts 1 to 4]
A stream is an unbounded sequence of tuples.
A spout is a source of streams.
A spout is a source of streams.
14. Spouts and Bolts
For example, a spout may read tuples off a queue and emit them as a stream.
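The queue-reading spout described above can be sketched roughly as follows. This is a toy model in plain Java, not the Storm API: the class name QueueSpoutSketch, the nextTuple method, and the in-memory queue are assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Toy model of a spout (hypothetical, not Storm's own interface):
// it pulls messages off a queue and emits each one as a tuple.
public class QueueSpoutSketch {
    private final Queue<String> source;                        // stands in for an external queue
    private final List<List<String>> emitted = new ArrayList<>(); // the outgoing stream of tuples

    public QueueSpoutSketch(Queue<String> source) {
        this.source = source;
    }

    // Emits at most one tuple per call. Returns false when the queue is
    // currently empty; the stream itself is unbounded and never "finishes".
    public boolean nextTuple() {
        String message = source.poll();
        if (message == null) {
            return false;
        }
        emitted.add(Arrays.asList(message)); // a tuple is modelled as a list of values
        return true;
    }

    public List<List<String>> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(Arrays.asList("a", "b", "c"));
        QueueSpoutSketch spout = new QueueSpoutSketch(queue);
        while (spout.nextTuple()) {
            // keep draining until the queue is empty for now
        }
        System.out.println(spout.emitted());
    }
}
```

In a real topology the runtime, not a loop in main, would call the spout repeatedly, and new messages could arrive on the queue at any time.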
15. Spouts and Bolts
A bolt consumes any number of input streams, does some processing, and possibly emits new streams.
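As a rough illustration of "consume, process, possibly emit", here is a toy bolt in plain Java that takes a sentence tuple and emits one word per output tuple. It does not use the Storm API; the class name SplitBoltSketch and the execute method are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of a bolt (hypothetical, not Storm's own interface):
// one input tuple in, zero or more output tuples out.
public class SplitBoltSketch {

    // Splits a sentence on whitespace and returns the lower-cased words.
    // An empty or blank sentence produces no output tuples at all.
    public static List<String> execute(String sentence) {
        List<String> out = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(execute("Storm processes  Streams"));
        System.out.println(execute("   "));
    }
}
```

The blank-input case shows why the slide says a bolt "possibly" emits: some inputs produce nothing downstream.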
16. Spouts and Bolts
Each node (spout or bolt) in a Storm topology executes in parallel.
17. Architecture
[Diagram: a worker process containing executors, each running tasks]
A machine in a Storm cluster may run one or more worker processes.
Each topology has one or more worker processes.
Each worker process runs executors (threads) for a specific topology.
Each executor runs one or more tasks of the same component (spout or bolt).
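To make the executor/task relationship concrete, the sketch below spreads the tasks of one component over its executors. The round-robin layout is an assumption made for illustration; how Storm's scheduler actually places tasks is not stated in these slides. The class name TaskLayoutSketch and the assign method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy layout of tasks onto executors for a single component.
// The round-robin rule here is an illustrative assumption only.
public class TaskLayoutSketch {

    // Returns one list of task ids per executor.
    // Every executor must end up with at least one task.
    public static List<List<Integer>> assign(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need at least one task per executor");
        }
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            executors.get(task % numExecutors).add(task); // deal tasks out in turn
        }
        return executors;
    }

    public static void main(String[] args) {
        // 4 tasks on 2 executors: each executor thread runs 2 tasks.
        System.out.println(assign(4, 2)); // prints [[0, 2], [1, 3]]
    }
}
```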
18. Architecture
[Diagram: Nimbus, a ZooKeeper cluster, and the Supervisor nodes]
Hadoop v1 vs Storm
Hadoop v1: JobTracker | Storm: Nimbus (only 1)
. distributes code around the cluster
. assigns tasks to machines/supervisors
. failure monitoring
Hadoop v1: TaskTracker | Storm: Supervisor (many)
. listens for work assigned to its machine
. starts and stops worker processes as necessary, based on what Nimbus has assigned
Storm: ZooKeeper
. coordination between Nimbus and the Supervisors
19. Architecture
Nimbus and the Supervisors are stateless.
All state is kept in ZooKeeper.
1 ZK instance per machine
When Nimbus or a Supervisor fails, it starts back up as if nothing happened.
storm jar all-my-code.jar org.apache.storm.MyTopology arg1 arg2
23. Shuffle grouping: randomized round-robin
Fields grouping: all tuples with the same field value(s) are always routed to the same task
Direct grouping: the producer of the tuple decides which task of the consumer will receive the tuple
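The three groupings above can be modelled as small routing functions. This is a plain-Java sketch, not Storm's implementation: the class GroupingSketch, its nested Shuffle class, and the hash-modulo rule used for fields grouping are assumptions chosen to show the stated behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy routing rules (hypothetical names; not Storm's own classes).
public class GroupingSketch {

    // Shuffle grouping as "randomized round-robin": shuffle the task order
    // once, then hand out tasks in that order, cycling forever.
    public static final class Shuffle {
        private final List<Integer> order = new ArrayList<>();
        private int next = 0;

        public Shuffle(int numTasks, Random random) {
            for (int t = 0; t < numTasks; t++) {
                order.add(t);
            }
            Collections.shuffle(order, random);
        }

        public int nextTask() {
            int task = order.get(next);
            next = (next + 1) % order.size();
            return task;
        }
    }

    // Fields grouping: equal field values always map to the same task.
    // Hash-modulo is an assumed rule that has this property.
    public static int fieldsGrouping(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Direct grouping: the producer picks the consumer task itself.
    public static int directGrouping(int chosenTask, int numTasks) {
        if (chosenTask < 0 || chosenTask >= numTasks) {
            throw new IllegalArgumentException("no such task: " + chosenTask);
        }
        return chosenTask;
    }

    public static void main(String[] args) {
        Shuffle shuffle = new Shuffle(3, new Random());
        System.out.println(shuffle.nextTask() + " " + shuffle.nextTask() + " " + shuffle.nextTask());
        System.out.println(fieldsGrouping("user-42", 3) == fieldsGrouping("user-42", 3)); // prints true
        System.out.println(directGrouping(1, 3)); // prints 1
    }
}
```

Fields grouping is the one to reach for when a bolt keeps per-key state (for example, a running count per word), since every tuple for a given key lands on the same task.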
24. A Sample Code of Configuring
TopologyBuilder topologyBuilder = new TopologyBuilder();