2. What is Storm
• Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
3. What is Storm
• The work is delegated to different types of components, each responsible for a specific processing task.
• The input stream of a Storm cluster is handled by a component called a spout.
• The spout passes the data to a component called a bolt, which transforms it in some way.
• A bolt either persists the data in some sort of storage or passes it to another bolt.
[Figure: an input data source feeds two spouts; the spouts pass data to bolts, which pass data on to other bolts.]
4. Storm Components
• A Storm cluster has three types of nodes:
1. Nimbus node
Nimbus is a daemon that runs on the master node of a Storm cluster. It is responsible for distributing code among the worker nodes, assigning tasks to machines for processing, and monitoring for failures.
The Nimbus service is an Apache Thrift service, which lets you submit code in any programming language. You can therefore use the language you are proficient in without having to learn a new one to use Apache Storm.
The Nimbus service relies on the Apache ZooKeeper service to monitor message processing tasks, because all worker nodes update their task status in ZooKeeper.
2. ZooKeeper nodes
ZooKeeper coordinates the Storm cluster.
3. Supervisor nodes
All worker nodes in a Storm cluster run a daemon called the Supervisor. The Supervisor service receives the work assigned to a machine by the Nimbus service and manages worker processes to complete the tasks assigned by Nimbus. Each of these worker processes executes a subset of a topology, which is described next.
6. Storm Components
• The following key abstractions help explain how Storm processes data:
• Tuples - an ordered list of elements.
• Streams - an unbounded sequence of tuples.
• Spouts - sources of streams in a computation.
• Bolts - process input streams and produce output streams.
• Topologies - a topology is, in simple terms, a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed between nodes. A topology typically runs distributed across multiple worker processes on multiple worker nodes.
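The abstractions above can be pictured with a small plain-Java model. This is only a sketch: none of the class or method names below are Storm classes, a finite list stands in for an unbounded stream, and everything runs in one thread rather than across worker nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of tuples, a stream, a spout, bolts and a topology.
// All names are hypothetical; nothing here is part of the Storm API.
final class TopologyModelSketch {

    // A tuple is modelled as an ordered list of values.
    static List<Object> tuple(Object... values) {
        return Arrays.asList(values);
    }

    // "Bolt": turns one sentence tuple into one tuple per word.
    static List<List<Object>> splitBolt(List<Object> sentenceTuple) {
        List<List<Object>> out = new ArrayList<>();
        for (String word : ((String) sentenceTuple.get(0)).split("\\s+")) {
            out.add(tuple(word));
        }
        return out;
    }

    // "Topology": wires the spout's stream into the split bolt and then into a counting bolt.
    static Map<String, Integer> run(List<String> spoutSentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : spoutSentences) {                 // stream emitted by the "spout"
            for (List<Object> wordTuple : splitBolt(tuple(sentence))) {
                counts.merge((String) wordTuple.get(0), 1, Integer::sum); // counting "bolt"
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the cow jumped", "the moon")));
    }
}
```

In a real topology each of these stages would be a separate component, and the links between them would be declared with a stream grouping.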
7. Storm Technology Stack
• Java and Clojure
• Storm runs on the Java Virtual Machine and is written in a combination of Java and Clojure.
• Storm is highly polyglot in nature.
• Spouts and bolts can be written in virtually any programming language that can read data from a stream.
8. Basic Storm Daemon Commands
• Daemon commands used to start Storm:
Command      Usage              Description
Nimbus       storm nimbus       Launches the Nimbus daemon.
Supervisor   storm supervisor   Launches the Supervisor daemon.
UI           storm ui           Launches the Storm UI, a web-based UI for monitoring Storm clusters.
9. Storm Management Commands
Command: jar
Usage: storm jar <topology jar> <topology class> <args>
Description:
1. Submits a topology to the cluster.
2. Runs the main() method of the topology class.
3. Uploads the topology jar to Nimbus for distribution to the cluster.
4. Once the topology is submitted, Storm activates it and starts processing.
5. The main() method in the topology class is responsible for supplying a unique name for the topology.
6. If a topology with that name already exists on the cluster, the jar command fails.
Command: kill
Usage: storm kill <topology name>
Description: Storm does not kill the topology immediately. Instead, it deactivates all the spouts so that they emit no more tuples, and then waits Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS seconds before destroying all the workers. This gives the topology enough time to complete any tuples it was processing when it was killed.
10. Storm Management Commands
Command: deactivate
Usage: storm deactivate <topology name>
Description: Deactivates the specified topology's spouts.
Command: activate
Usage: storm activate <topology name>
Description: Activates the specified topology's spouts.
Command: rebalance
Usage: storm rebalance topology-name [-w wait-time-secs] [-n new-num-workers] [-e component=parallelism]*
Description: For example, suppose you have a 10-node cluster running 4 workers per node, and you then add another 10 nodes to the cluster. You may want Storm to spread out the workers of the running topology so that each node runs 2 workers. One way to do this is to kill the topology and resubmit it, but the rebalance command provides an easier way. Rebalance first deactivates the topology for the duration of the message timeout (overridable with the -w flag) and then redistributes the workers evenly around the cluster. The topology then returns to its previous state of activation.
13. Local Mode
• Storm topologies run on the local machine in a single JVM.
• This mode is used for development, testing and debugging because it is the easiest way to see all topology components working together.
• We can adjust topologies and see how they run in different configurations.
• http://storm.apache.org/releases/1.1.0/Local-mode.html
14. Remote Mode
• In remote mode, we submit the topology to a Storm cluster, which is composed of many processes, usually running on different machines.
• Remote mode does not show debugging information, which is why it is considered production mode.
• http://storm.apache.org/releases/1.1.1/Running-topologies-on-a-production-cluster.html
15. Groupings in Storm
• Before designing a topology, it is important to define how data is exchanged between components.
• A stream grouping specifies which streams each bolt consumes and how each stream is consumed.
• Stream groupings are defined when the topology is defined:
builder.setSpout("firstSpout", new FirstSpout(), 2);
builder.setBolt("firstBolt", new FirstBolt(), 3).globalGrouping("firstSpout");
16. Shuffle Grouping
• Shuffle grouping is the most commonly used grouping.
• Tuples are randomly distributed across the bolt's tasks in such a way that each task is guaranteed to get an equal number of tuples.
• It is useful for atomic operations.
• It is not suitable for a topology that counts words, because all tuples for a given word must reach the same task and so cannot be distributed randomly.
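One way to get a spread that is both random and equal is to shuffle the list of target tasks once and then visit it round-robin. The sketch below is a simplified model of that idea under those assumptions; it is not Storm's implementation, and the class name, the seed parameter and the task numbering are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Simplified model of shuffle grouping; not Storm's implementation.
final class ShuffleGroupingSketch {
    private final List<Integer> order;
    private int next = 0;

    ShuffleGroupingSketch(List<Integer> targetTasks, long seed) {
        order = new ArrayList<>(targetTasks);
        Collections.shuffle(order, new Random(seed)); // random visiting order
    }

    // Task id that receives the next tuple.
    int chooseTask() {
        int task = order.get(next);
        next = (next + 1) % order.size();
        return task;
    }

    // Sends the given number of tuples and returns how many each task index received.
    static int[] distribute(int tuples, int numTasks, long seed) {
        List<Integer> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(i);
        }
        ShuffleGroupingSketch grouping = new ShuffleGroupingSketch(tasks, seed);
        int[] received = new int[numTasks];
        for (int i = 0; i < tuples; i++) {
            received[grouping.chooseTask()]++;
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(distribute(300, 3, 42L)));
    }
}
```

Because the receiving task does not depend on the tuple's content, two tuples carrying the same word can land on different tasks, which is why this grouping does not suit word counting.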
17. Fields Grouping
• Fields grouping lets you control how tuples are sent to bolts, based on one or more fields of the tuple.
• It ensures that a given set of values for a combination of fields is always sent to the same bolt.
[Figure: fields grouping between Bolt A and Bolt B on fields x, y and z.]
18. Fields Grouping
• Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id" values may go to different tasks.
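The "same value, same task" property can be obtained by hashing the grouping-field values and reducing the hash modulo the number of tasks. The sketch below illustrates that idea in plain Java; the class and method names are hypothetical, and the exact hash function Storm uses is not stated in the text above.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of fields grouping: the chosen task depends only on the grouping-field values.
final class FieldsGroupingSketch {

    // Returns a task index in the range [0, numTasks).
    static int chooseTask(List<?> groupingValues, int numTasks) {
        // floorMod keeps the result non-negative even when the hash code is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int first = chooseTask(Arrays.asList("user-42"), 4);
        int second = chooseTask(Arrays.asList("user-42"), 4);
        System.out.println(first == second); // the same "user-id" always maps to the same task
    }
}
```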
19. All Grouping
• All grouping is used to send signals to bolts.
• It sends a single copy of each tuple to all instances of the receiving bolt.
[Figure: all grouping between Bolt A and Bolt B.]
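As a minimal plain-Java sketch of the delivery rule for all grouping (hypothetical names, no Storm types), every target task ends up with its own copy of every tuple:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of all grouping.
final class AllGroupingSketch {

    // Delivers a copy of every tuple to every target task and returns each task's inbox.
    static Map<Integer, List<String>> deliver(List<Integer> targetTasks, List<String> tuples) {
        Map<Integer, List<String>> inbox = new LinkedHashMap<>();
        for (int task : targetTasks) {
            inbox.put(task, new ArrayList<>(tuples)); // each task gets its own copy
        }
        return inbox;
    }
}
```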
20. Custom Grouping
• We can create our own custom stream grouping by implementing the
backtype.storm.grouping.CustomStreamGrouping interface. This gives
us the power to decide which bolt(s) will receive each tuple.
21. Direct Grouping
• This is a special grouping where the source decides which task of the receiving component will receive the tuple.
• To use direct grouping, the emitting bolt calls the emitDirect method instead of emit.
22. Direct Grouping
• The emitting bolt works out the number of target tasks in its prepare method.
• In the topology definition, we specify that the stream will be grouped directly.
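The selection step on the emitting side can be sketched in plain Java as follows. The rule used here (first character of a word, modulo the number of target tasks) is only an illustrative assumption, and the class and method names are hypothetical. In a real bolt the list of target task ids would be obtained during prepare, and the chosen id would be passed to emitDirect.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified model of the sender-side decision in direct grouping.
final class DirectGroupingSketch {
    private final List<Integer> targetTasks; // learned once, for example during prepare

    DirectGroupingSketch(List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    // The emitting side decides the receiver: here, by the word's first character.
    int chooseTargetTask(String word) {
        String normalized = word.trim().toUpperCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return targetTasks.get(0);
        }
        return targetTasks.get(normalized.charAt(0) % targetTasks.size());
    }

    public static void main(String[] args) {
        DirectGroupingSketch sender = new DirectGroupingSketch(Arrays.asList(7, 8, 9));
        System.out.println(sender.chooseTargetTask("apple"));
    }
}
```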
23. Global Grouping
• Global Grouping sends tuples generated by all instances of the source
to a single target instance (specifically, the task with lowest id).
[Figure: global grouping between Bolt A and Bolt B.]
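The selection rule for global grouping is small enough to state as a one-method plain-Java sketch (hypothetical names, no Storm types):

```java
import java.util.Collections;
import java.util.List;

// Simplified model of global grouping.
final class GlobalGroupingSketch {

    // Every tuple, from every source instance, goes to the target task with the lowest id.
    static int chooseTask(List<Integer> targetTasks) {
        return Collections.min(targetTasks);
    }
}
```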
24. Local Cluster Vs Storm Submitter
• LocalCluster helps to run the topology on our local computer.
• Running the Storm infrastructure on our computer lets us run and debug different topologies easily. But what about when we want to submit our topology to a running Storm cluster?
• One of the interesting features of Storm is that it is easy to send our topology to run in a real cluster. We need to replace LocalCluster with StormSubmitter and call its submitTopology method, which is responsible for sending the topology to the cluster.
27. Spouts
• A spout is a source of streams that generates input tuples for the topology.
28. ISpout
• ISpout is the core interface for implementing spouts.
Method
    Description
void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    Called when a task for this component is initialized within a worker on the cluster.
void close();
    Called when an ISpout is going to be shut down.
void activate();
    Called when a spout has been activated out of a deactivated mode.
void deactivate();
    Called when a spout has been deactivated. nextTuple will not be called while a spout is deactivated.
void nextTuple();
    When this method is called, Storm is requesting that the spout emit tuples to the output collector.
void ack(Object msgId);
    Storm has determined that the tuple emitted by this spout with the msgId identifier has been fully processed. Typically, an implementation of this method will take that message off the queue and prevent it from being replayed.
void fail(Object msgId);
    The tuple emitted by this spout with the msgId identifier has failed to be fully processed. Typically, an implementation of this method will put that message back on the queue to be replayed at a later time.
29. Ack in Storm
• The ack method is called when a message has been processed correctly.
• Tuple processing succeeds when the tuple has been processed by all target bolts.
30. Fail in Storm
• The fail method is called when processing of a tuple fails.
• This method is used either to resend the tuple or to throw an exception.
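The usual bookkeeping behind ack and fail can be sketched without any Storm types: remember each emitted message under its msgId, forget it on ack, and put it back on the queue on fail. The class below is a plain-Java model under that assumption; the class name and the helper methods offer, pendingCount and queuedCount are hypothetical, and the three core methods only mirror the roles of nextTuple, ack and fail.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Simplified model of a replaying message source.
final class ReplayingSourceSketch {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextId = 0;

    void offer(String message) {
        queue.add(message);
    }

    // Mirrors nextTuple(): emits one message and remembers it under a fresh msgId.
    // Returns the msgId, or -1 when there is nothing to emit.
    long nextTuple() {
        String message = queue.poll();
        if (message == null) {
            return -1;
        }
        long msgId = nextId++;
        pending.put(msgId, message);
        return msgId;
    }

    // Mirrors ack(msgId): the message was fully processed, so it can be forgotten.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    // Mirrors fail(msgId): the message goes back on the queue to be emitted again.
    void fail(long msgId) {
        String message = pending.remove(msgId);
        if (message != null) {
            queue.add(message);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int queuedCount() {
        return queue.size();
    }
}
```

A replayed message is emitted under a new msgId in this sketch; a real spout may prefer to reuse the identifier of the underlying source message.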
31. Bolts
• A bolt consumes input streams of tuples, performs business logic, and can potentially emit new streams.
32. Bolts
• Bolts may subscribe to streams emitted by spouts or other bolts.
• Bolts may perform any sort of processing imaginable: filtering, joins, calculations, database reads and writes.
• All bolts must implement the IRichBolt interface. BaseRichBolt is the most basic implementation.
33. Bolts
Method
    Description
void prepare(Map stormConf, TopologyContext context, OutputCollector collector)
    Called just before the bolt starts processing tuples. Called once in the lifetime of the bolt.
void execute(Tuple input);
    Processes a single tuple of input.
void cleanup()
    Called when a bolt is shut down.
void declareOutputFields(OutputFieldsDeclarer declarer);
    Declares the output schema of the bolt. If this method is left empty, the next bolt must access tuple values by position.
34. Parallelism
• Distributed applications take advantage of horizontally-scaled clusters by dividing computation
tasks across nodes in a cluster. Storm offers this and additional finer-grained ways to increase the
parallelism of a Storm topology:
• Increase the number of workers
• Increase the number of executors
• Increase the number of tasks
35. Creating first Storm Topology
By default, Storm uses a parallelism factor of 1. Assuming a single-node Storm cluster, a parallelism factor of 1 means that one worker, or JVM, is assigned to execute the topology, and each component in the topology is assigned to a single executor. The following diagram illustrates this scenario. The topology defines a data flow with three tasks, a spout and two bolts.
[Figure: one node running one worker JVM with four executor threads, each running one task: sentence spout, split sentence bolt, word count bolt and report bolt.]
36. Controlling Parallelism with task
Increasing Parallelism with Tasks
Finally, Storm developers can increase the number of tasks assigned to a single topology component, such as a spout or bolt. By default, Storm assigns a single task to each component, but developers can increase this number with the setNumTasks() method on the BoltDeclarer and SpoutDeclarer objects returned by the setBolt() and setSpout() methods.
[Figure: one node running one worker JVM with four executor threads (sentence spout, split sentence bolt, word count bolt, report bolt), with a second sentence spout task added.]
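When a component has more tasks than executors, each executor thread runs several tasks. The arithmetic can be sketched as follows, assuming tasks are dealt out to executors as evenly as possible; that even spread is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Worked arithmetic: how many tasks each executor runs under an even spread.
final class ParallelismArithmeticSketch {

    // Deals numTasks tasks out to numExecutors executors, one at a time.
    static int[] tasksPerExecutor(int numTasks, int numExecutors) {
        int[] perExecutor = new int[numExecutors];
        for (int task = 0; task < numTasks; task++) {
            perExecutor[task % numExecutors]++;
        }
        return perExecutor;
    }

    public static void main(String[] args) {
        // A component with 2 executors and 4 tasks.
        System.out.println(Arrays.toString(tasksPerExecutor(4, 2)));
    }
}
```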
37. Controlling Multiple Workers
[Figure: two nodes running worker JVMs; executor threads run sentence spout, split sentence bolt (two tasks), word count bolt (two tasks) and report bolt tasks spread across the workers.]
conf.setNumWorkers(2);
Increasing Parallelism with Executors
The parallelism API enables Storm developers to specify the number of executors for each worker with a parallelism hint, an optional third parameter to the setBolt() method.
42. Trident
• Trident is a high-level abstraction for real-time processing on top of Storm.
• Trident provides joins, aggregations, grouping, functions and filters.
• A typical Trident topology consists of Trident spouts and Trident operators.
• There are no Trident bolts.
• It makes programming easier.
43. Trident Spouts
• Unlike Storm, Trident has the concept of tuple batches.
• Unlike Storm spouts, Trident spouts emit tuples in batches.
• Each batch is assigned its own unique batch identifier.
• Batch size and contents are configurable.
44. Trident Spout Component
A Trident spout has two components:
• Batch coordinator: responsible for batch management such that the emitter can properly replay batches.
• Emitter function: responsible for emitting tuples.
45. Trident Spout Code Snippet
public class TridentSpout implements ITridentSpout {
    public BatchCoordinator getCoordinator(String txStateId, Map conf, TopologyContext context) {
        // TODO Auto-generated method stub
        return null;
    }

    public Emitter getEmitter(String txStateId, Map conf, TopologyContext context) {
        // TODO Auto-generated method stub
        return null;
    }

    public Map<String, Object> getComponentConfiguration() {
        // TODO Auto-generated method stub
        return null;
    }

    public Fields getOutputFields() {
        // TODO Auto-generated method stub
        return null;
    }
}
46. ITridentSpout
• ITridentSpout is the core interface for implementing a spout in Trident.
Method
    Description
getCoordinator(String txStateId, Map conf, TopologyContext context)
    The coordinator for a TransactionalSpout runs in a single thread and indicates when batches of tuples should be emitted.
getEmitter(String txStateId, Map conf, TopologyContext context)
    Emitters are responsible for emitting batches of tuples.
47. BatchCoordinator Code Snippet
interface BatchCoordinator<X> {
    /**
     * Create metadata for this particular transaction id which has never
     * been emitted before. The metadata should contain whatever is necessary
     * to be able to replay the exact batch for the transaction at a later point.
     *
     * The metadata is stored in Zookeeper.
     *
     * Storm uses JSON encoding to store the metadata. Only simple types
     * such as numbers, booleans, strings, lists, and maps should be used.
     */
    X initializeTransaction(long txid, X prevMetadata, X currMetadata);

    /**
     * This attempt committed successfully, so all state for this commit and before can be safely cleaned up.
     */
    void success(long txid);

    /**
     * Hint to Storm if the spout is ready for the transaction id.
     */
    boolean isReady(long txid);

    /**
     * Release any resources from this coordinator.
     */
    void close();
}
48. Emitter Code Snippet
interface Emitter<X> {
    /**
     * Emit a batch for the specified transaction attempt and metadata for the transaction. The metadata
     * was created by the Coordinator in the initializeTransaction method. This method must always emit
     * the same batch of tuples across all tasks for the same transaction id.
     */
    void emitBatch(TransactionAttempt tx, X coordinatorMeta, TridentCollector collector);

    /**
     * This attempt committed successfully, so all state for this commit and before can be safely cleaned up.
     */
    void success(TransactionAttempt tx);

    /**
     * Release any resources held by this emitter.
     */
    void close();
}
50. Trident Functions
• Functions are similar to Storm bolts in that they consume tuples and can emit new tuples.
• Functions cannot remove or mutate existing fields; they can only add fields.
51. Trident Aggregators
• Similar to functions, aggregators allow topologies to combine fields.
• Unlike functions, they replace tuple fields and values.
• Trident provides three types of aggregator: Combiner Aggregator, Reducer Aggregator and Aggregator.
52. Combiner Aggregator
• A combiner aggregator is used to combine a set of tuples into a single field.
• Storm calls the init method with each tuple and then repeatedly calls the combine method until the partition is processed.
• After combining the values from processing the tuples, Storm emits the result as a single new field.
• If the partition is empty, Storm emits the value returned by the zero method.
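The init, combine and zero calls can be sketched in plain Java. The Combiner interface below is a hypothetical stand-in shaped after the three methods named above; it is not Storm's own type, and it takes a plain value where Storm would pass a tuple. The driver method shows one possible calling order.

```java
import java.util.List;

// Simplified model of a combiner aggregator.
final class CombinerSketch {

    // Hypothetical stand-in for the three methods named in the text.
    interface Combiner<I, T> {
        T init(I tuple);      // value contributed by one tuple
        T combine(T a, T b);  // merges two partial results
        T zero();             // result for an empty partition
    }

    // Counts tuples: every tuple contributes 1.
    static final Combiner<String, Long> COUNT = new Combiner<String, Long>() {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    };

    // Drives a combiner over one partition and returns the single combined value.
    static <I, T> T aggregate(Combiner<I, T> combiner, List<I> partition) {
        T result = combiner.zero();
        for (I tuple : partition) {
            result = combiner.combine(result, combiner.init(tuple));
        }
        return result;
    }
}
```

Because combine only merges two partial results, partial counts computed on different partitions can themselves be combined later.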
53. Reducer Aggregator
• Storm calls the init method to retrieve the initial value.
• The reduce method is called with each tuple until the batch is fully processed.
• The first argument to the reduce() method is the current cumulative aggregation, which the method returns after applying the tuple to the aggregation. When all tuples in the partition have been processed, Storm emits the last value returned by reduce().
• The ReducerAggregator interface therefore defines two methods, init() and reduce().
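The init and reduce sequence can be modelled in plain Java as a fold over the tuples. The Reducer interface below is a hypothetical stand-in with the same two-method shape; it is not Storm's ReducerAggregator type and uses a plain value in place of a tuple.

```java
import java.util.List;

// Simplified model of a reducer aggregator.
final class ReducerSketch {

    // Hypothetical stand-in: init() supplies the start value, reduce() folds in one tuple.
    interface Reducer<I, T> {
        T init();
        T reduce(T current, I tuple);
    }

    // Sums integer tuples.
    static final Reducer<Integer, Integer> SUM = new Reducer<Integer, Integer>() {
        public Integer init() { return 0; }
        public Integer reduce(Integer current, Integer tuple) { return current + tuple; }
    };

    // Calls init() once, then reduce() for each tuple, and returns the last value.
    static <I, T> T run(Reducer<I, T> reducer, List<I> partition) {
        T current = reducer.init();
        for (I tuple : partition) {
            current = reducer.reduce(current, tuple);
        }
        return current;
    }
}
```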
54. Aggregator
• A key difference between Aggregator and other Trident aggregation interfaces is that an instance of TridentCollector is
passed as a parameter to every method. This allows Aggregator implementations to emit tuples at any time during
execution.
• Storm executes Aggregator instances as follows:
• Storm calls the init() method, which returns an object T representing the initial state of the aggregation.
• T is also passed to the aggregate() and complete() methods.
• Storm calls the aggregate() method repeatedly, to process each tuple in the batch.
• Storm calls complete() with the final value of the aggregation.
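The init, aggregate and complete sequence listed above can be sketched in plain Java. The BatchAggregator interface is a hypothetical stand-in, and a java.util.List plays the role of the collector; neither is a Storm type. Because the collector is passed to every method, an implementation could emit from any of them; this example emits once, from complete.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the general Aggregator.
final class AggregatorSketch {

    // Hypothetical stand-in: S is the aggregation state, I a tuple, O an emitted value.
    interface BatchAggregator<S, I, O> {
        S init(List<O> collector);
        void aggregate(S state, I tuple, List<O> collector);
        void complete(S state, List<O> collector);
    }

    // Counts the tuples in a batch and emits the count once, from complete().
    static final BatchAggregator<long[], String, Long> COUNT =
        new BatchAggregator<long[], String, Long>() {
            public long[] init(List<Long> collector) { return new long[] {0L}; }
            public void aggregate(long[] state, String tuple, List<Long> collector) { state[0]++; }
            public void complete(long[] state, List<Long> collector) { collector.add(state[0]); }
        };

    // Makes the three calls in the order listed above and returns what was emitted.
    static <S, I, O> List<O> run(BatchAggregator<S, I, O> aggregator, List<I> batch) {
        List<O> collector = new ArrayList<>();
        S state = aggregator.init(collector);
        for (I tuple : batch) {
            aggregator.aggregate(state, tuple, collector);
        }
        aggregator.complete(state, collector);
        return collector;
    }
}
```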
55. Trident Transactional Spouts
• Trident defines three spout types that differ with respect to batch content, failure response, and support for exactly-once
semantics:
• Non-transactional spouts
• Non-transactional spouts make no guarantees for the contents of each batch. As a result, processing may be at-most-once or at-least-once. It is not possible to achieve exactly-once processing when using non-transactional Trident spouts.
• Transactional spouts
• Transactional spouts support exactly-once processing in a Trident topology. They define success at the batch level, and have several important
properties that allow them to accomplish this:
• Batches with a given transaction ID are always identical in terms of tuple content, even when replayed.
• Batch content never overlaps. A tuple can never be in more than one batch.
• Tuples are never skipped.
• With transactional spouts, idempotent state updates are relatively easy: because batch transaction IDs are strongly ordered, the ID can be used
to track data that has already been persisted. For example, if the current transaction ID is 5 and the data store contains a value for ID 5, the
update can be safely skipped.
• Opaque transactional spouts
• Opaque transactional spouts define success at the tuple level. Opaque transactional spouts have the following properties:
• There is no guarantee that a batch for a particular transaction ID is always the same.
• Each tuple is successfully processed in exactly one batch, though it is possible for a tuple to fail in one batch and succeed in another.
• The difference in focus between transactional and opaque transactional spouts—success at the batch level versus the tuple level,
respectively—has key implications in terms of achieving exactly-once semantics when combining different spouts with different state types.
56. Repeat Transactional state
• In repeat transactional state, the last committed batch identifier is stored with the data.
• The state is updated if and only if the batch identifier being applied is next in sequence.
• If the batch identifier is equal to or lower than the persisted identifier, the update is ignored because it has already been applied.
Batch id   State update
1          {SF.320:27811 = 4}
2          {SF.320:27811 = 10}
3          {SF.320:27811 = 8}
57. Repeat Transactional state
• The batches then complete processing in the following order:
1 -> 2 -> 3 -> 3 (replayed)
When batch 3 completes its replay, it has no effect on the state because Trident has already incorporated its update. For the repeat transactional state to function properly, batch contents cannot change between replays.
Batch id       State
1              {batch = 1} {SF:320:378911 = 4}
2              {batch = 2} {SF:320:378911 = 14}
3              {batch = 3} {SF:320:378911 = 22}
3 (replayed)   {batch = 3} {SF:320:378911 = 22}
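The update rule can be checked against the numbers in the tables with a short plain-Java sketch. The class and method names are hypothetical; the state keeps one running count together with the last committed batch identifier.

```java
// Simplified model of repeat transactional state for a single counter.
final class RepeatTransactionalStateSketch {
    private long lastBatchId = 0; // last committed batch identifier, stored with the data
    private long value = 0;

    // Applies the update only when batchId is the next identifier in sequence.
    // Returns true if the state changed.
    boolean apply(long batchId, long update) {
        if (batchId != lastBatchId + 1) {
            return false; // a replay of an already applied batch is ignored
        }
        value += update;
        lastBatchId = batchId;
        return true;
    }

    long value() {
        return value;
    }

    public static void main(String[] args) {
        RepeatTransactionalStateSketch state = new RepeatTransactionalStateSketch();
        state.apply(1, 4);
        state.apply(2, 10);
        state.apply(3, 8);
        state.apply(3, 8); // replay of batch 3: no effect
        System.out.println(state.value()); // 22
    }
}
```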
58. Opaque transactional State
• The approach used in repeat transactional state relies on the batch composition remaining constant which
may not be possible if a system encounters a fault.
• If the spout is emitting from a source that may have a partial failure, some of the tuples emitted in the initial
batch might not be available for re-emission.
• The opaque transactional state allows the changing of batch composition by storing both current and
previous state.
• Assume we have the same batches as in the previous example, but this time when batch 3 is replayed, the
aggregate count will be different since it contains a different set of tuples as shown in table below.
Batch id       State update
1              {SF.320:27811 = 4}
2              {SF.320:27811 = 10}
3              {SF.320:27811 = 8}
3 (replayed)   {SF.320:27811 = 6}
59. Opaque transactional State
• With opaque state, the state is updated as follows.
Completed batch   Batch committed   Previous state         Current state
1                 1                 {}                     {SF.320:27811 = 4}
2                 2                 {SF.320:27811 = 4}     {SF.320:27811 = 14}
3 (applied)       3                 {SF.320:27811 = 14}    {SF.320:27811 = 22}
3 (replayed)      3                 {SF.320:27811 = 14}    {SF.320:27811 = 20}
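The rows of the table can be reproduced with a short plain-Java sketch of the update rule, written under the assumption that a replayed batch carries the same batch identifier as the one already committed. The class and method names are hypothetical.

```java
// Simplified model of opaque transactional state for a single counter.
final class OpaqueStateSketch {
    private long committedBatchId = 0;
    private long previous = 0;
    private long current = 0;

    void apply(long batchId, long update) {
        if (batchId == committedBatchId) {
            // Replay: discard the earlier result and recompute from the previous state.
            current = previous + update;
        } else {
            // New batch: the current state becomes the previous state.
            previous = current;
            current = current + update;
            committedBatchId = batchId;
        }
    }

    long previous() {
        return previous;
    }

    long current() {
        return current;
    }

    public static void main(String[] args) {
        OpaqueStateSketch state = new OpaqueStateSketch();
        state.apply(1, 4);
        state.apply(2, 10);
        state.apply(3, 8);
        state.apply(3, 6); // batch 3 replayed with different contents
        System.out.println(state.previous() + " " + state.current()); // 14 20
    }
}
```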
61. When to use Trident
• Many use cases require exactly-once processing, which we can achieve by writing a transactional topology in Trident. By contrast, exactly-once processing is difficult to achieve with vanilla Storm. Hence, Trident is useful for use cases that require exactly-once processing.
• Trident is not a fit for all use cases, especially high-performance ones, because Trident adds complexity on top of Storm and manages state.
62. Packaging Storm Topologies
• Maven Shade Plugin
• Use the maven-shade-plugin, rather than the maven-assembly-plugin to package your Apache Storm topologies. The maven-shade-plugin
provides the ability to merge JAR manifest entries, which are used by the Hadoop client to resolve URL schemes.
• Use the following Maven plugin configuration to package your topology:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>1.4</version>
  <configuration>
    <createDependencyReducedPom>true</createDependencyReducedPom>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass></mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
63. Deploying and Managing Apache Storm Topologies
• Use the command line interface to deploy a Storm topology after packaging it in a .jar file.
• For example, you can use the following command to deploy WordCountTopology from the storm-starter jar:
storm jar storm-starter-<starter_version>-storm-<storm_version>.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
• The remainder of this chapter describes the Storm UI, which shows diagnostics for a cluster and topologies, allowing you to monitor and manage deployed topologies.
64. Moving Data Into and Out of Apache Storm Using Spouts and Bolts
• The following spouts are available in HDP 2.5:
• Kafka spout based on Kafka 0.7.x/0.8.x, plus a new Kafka consumer spout available as a technical preview (not for production use)
• HDFS
• EventHubs
• Kinesis (technical preview)
• The following bolts are available in HDP 2.5:
• Kafka
• HDFS
• EventHubs
• HBase
• Hive
• JDBC (supports Phoenix)
• Solr
• Cassandra
• MongoDB
• ElasticSearch
• Redis
• OpenTSDB (technical preview)