What is Storm
• Storm is a distributed, reliable, fault-tolerant system for processing
streams of data.
• The work is delegated to different types of components, each responsible for a specific processing task.
• The input stream of a Storm cluster is handled by a component called a spout.
• The spout passes the data to a component called a bolt, which transforms it in some way.
• A bolt either persists the data in some sort of storage or passes it to another bolt.
[Diagram: an input data source feeds spouts, which pass data to bolts; the bolts in turn pass data to other bolts.]
Storm Components
• A Storm cluster has three types of nodes:
1. Nimbus node
• Nimbus is a daemon that runs on the master node of a Storm cluster.
• It is responsible for distributing code among the worker nodes, assigning tasks to machines, and monitoring for failures.
• The Nimbus service is an Apache Thrift service, which lets you submit code in any programming language. You can therefore use a language you are already proficient in instead of learning a new one to use Apache Storm.
• The Nimbus service relies on Apache ZooKeeper to monitor message processing tasks, because all worker nodes update their task status in ZooKeeper.
2. ZooKeeper nodes
• Coordinate the Storm cluster.
3. Supervisor nodes
• All worker nodes in a Storm cluster run a daemon called Supervisor. The Supervisor service receives the work assigned to a machine by Nimbus and manages worker processes to complete those tasks. Each worker process executes a subset of a topology, which is described next.
Storm Components
• The following key abstractions help to understand how Storm processes data:
• Tuples – an ordered list of elements.
• Streams – an unbounded sequence of tuples.
• Spouts – sources of streams in a computation.
• Bolts – process input streams and produce output streams.
• Topologies – a topology, in simple terms, is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed between nodes. A topology typically runs distributed across multiple worker processes on multiple worker nodes.
Storm technology stack
• Java and Clojure
• Storm runs on the Java virtual machine and is written in a combination of Java and Clojure.
• Storm is highly polyglot in nature.
• Spouts and bolts can be written in virtually any programming language that can read data from a stream.
Basic Storm Daemon Commands
• Daemon commands used to start Storm:
Command | Usage | Description
Nimbus | storm nimbus | Launches the Nimbus daemon.
Supervisor | storm supervisor | Launches the Supervisor daemon.
UI | storm ui | Launches the Storm UI, a web-based UI for monitoring Storm clusters.
Storm Management Commands
• jar
Usage: storm jar <topology jar> <topology class> <args>
Description:
1. Submits a topology to the cluster.
2. Runs the main() method of the topology class.
3. Uploads the topology jar to Nimbus for distribution to the cluster.
4. Once the topology is submitted, Storm activates it and starts processing.
5. The main() method in the topology class is responsible for supplying a unique name for the topology.
6. If a topology with that name already exists on the cluster, the jar command fails.
• kill
Usage: storm kill <topology name>
Description: Storm does not kill the topology immediately. Instead, it deactivates all the spouts so that they emit no more tuples, and then waits Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS seconds before destroying all the workers. This gives the topology enough time to complete any tuples it was processing when it was killed.
Storm Management Commands
• deactivate
Usage: storm deactivate <topology name>
Description: Deactivates the specified topology's spouts.
• activate
Usage: storm activate <topology name>
Description: Activates the specified topology's spouts.
• rebalance
Usage: storm rebalance topology-name [-w wait-time-secs] [-n new-num-workers] [-e component=parallelism]*
Description: Suppose you have a 10-node cluster running 4 workers per node, and you then add another 10 nodes to the cluster. You may wish to have Storm spread out the workers of the running topology so that each node runs 2 workers. One way to do this is to kill the topology and resubmit it, but the rebalance command provides an easier way. Rebalance first deactivates the topology for the duration of the message timeout (overridable with the -w flag) and then redistributes the workers evenly around the cluster. The topology then returns to its previous state of activation.
Storm Management Commands
• http://storm.apache.org/releases/1.0.1/Command-line-client.html
Storm Running Modes
• Local mode
• Remote mode
Local Mode
• Storm topologies run on the local machine in a single JVM.
• This mode is used for development, testing, and debugging because it is the easiest way to see all topology components working together.
• We can adjust the topology and see how it runs in different configurations.
• http://storm.apache.org/releases/1.1.0/Local-mode.html
Remote Mode
• In remote mode, we submit the topology to a Storm cluster, which is composed of many processes, usually running on different machines.
• Remote mode does not show debugging information, which is why it is considered production mode.
• http://storm.apache.org/releases/1.1.1/Running-topologies-on-a-production-cluster.html
Groupings in Storm
• Before designing a topology, it is important to define how data is exchanged between components.
• A stream grouping specifies which streams each bolt consumes and how each stream is consumed.
• Stream groupings are declared when the topology is defined.
builder.setSpout("firstSpout", new FirstSpout(), 2);
builder.setBolt("firstBolt", new FirstBolt(), 3).globalGrouping("firstSpout");
Shuffle Grouping
• Shuffle grouping is the most commonly used grouping.
• Tuples are randomly distributed across the bolt's tasks in such a way that each task is guaranteed to get an equal number of tuples.
• It is useful for doing atomic operations.
• It cannot be used in a topology that needs to count words, because that operation cannot be randomly distributed: all tuples for the same word must reach the same task.
Fields Grouping
• Fields grouping lets you control how tuples are sent to bolts, based on one or more fields of the tuple.
• It ensures that a given set of values for a combination of fields is always sent to the same bolt task.
[Diagram: Bolt A sends tuples with fields x, y, and z to Bolt B using fields grouping.]
• Fields grouping: The stream is partitioned by the fields specified in the
grouping. For example, if the stream is grouped by the "user-id" field,
tuples with the same "user-id" will always go to the same task, but
tuples with different "user-id"'s may go to different tasks.
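The partitioning rule above can be illustrated with a minimal plain-Java sketch. It does not use the Storm API: the class name, the chooseTask helper, and the hash-modulo rule are assumptions made for illustration, and Storm's own hashing may differ in detail. The point is only that equal field values always select the same task.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of fields grouping. It does not use the Storm API. */
final class FieldsGroupingSketch {

    /** Maps the values of the grouping fields to one of numTasks task indexes. */
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index in [0, numTasks) even when the hash code is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 3;
        for (String userId : Arrays.asList("alice", "bob", "alice", "carol", "bob")) {
            int task = chooseTask(Arrays.<Object>asList(userId), numTasks);
            System.out.println(userId + " -> task " + task);
        }
    }
}
```

Running main prints the same task index for every repeated "user-id", which is the property a word-count or per-user aggregation relies on.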
All Grouping
• All grouping is used to send signals to bolts.
• It sends a single copy of each tuple to all instances of the receiving bolt.
Custom Grouping
• We can create our own custom stream grouping by implementing the
backtype.storm.grouping.CustomStreamGrouping interface. This gives
us the power to decide which bolt(s) will receive each tuple.
Direct Grouping
• This is a special grouping in which the source decides which task of the receiving component gets the tuple.
• To use direct grouping, the emitting component calls the emitDirect method instead of emit.
• The emitter works out the number of target tasks in its prepare method, and the topology definition specifies that the stream is grouped directly (directGrouping).
Global Grouping
• Global Grouping sends tuples generated by all instances of the source
to a single target instance (specifically, the task with lowest id).
LocalCluster vs. StormSubmitter
• LocalCluster helps us run the topology on our local computer.
• Running the Storm infrastructure on our computer lets us run and debug different topologies easily. But what about when we want to submit our topology to a running Storm cluster?
• One of the interesting features of Storm is that it is easy to send our topology to run in a real cluster. We need to change LocalCluster to StormSubmitter and call the submitTopology method, which is responsible for sending the topology to the cluster.
Reference
• https://github.com/xetorthio/getting-started-with-storm/blob/master/ch03Topologies.asc
Code Walkthrough
Spouts, bolts and topology
Spouts
• A spout is a source of stream that generates input tuples to topology.
ISpout
• ISpout is the core interface for implementing spouts.
Method | Description
void open(Map conf, TopologyContext context, SpoutOutputCollector collector); | Called when a task for this component is initialized within a worker on the cluster.
void close(); | Called when an ISpout is going to be shut down.
void activate(); | Called when a spout has been activated out of a deactivated mode.
void deactivate(); | Called when a spout has been deactivated. nextTuple will not be called while a spout is deactivated.
void nextTuple(); | When this method is called, Storm is requesting that the spout emit tuples to the output collector.
void ack(Object msgId); | Storm has determined that the tuple emitted by this spout with the msgId identifier has been fully processed. Typically, an implementation of this method takes that message off the queue and prevents it from being replayed.
void fail(Object msgId); | The tuple emitted by this spout with the msgId identifier has failed to be fully processed. Typically, an implementation of this method puts that message back on the queue to be replayed at a later time.
Ack in Storm
• The ack method is called when a message has been processed correctly.
• Tuple processing succeeds when the tuple has been processed by all target bolts.
Fail in Storm
• The fail method is called when processing of a tuple has failed.
• It is typically used either to resend the tuple or to throw an exception.
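The ack and fail behavior described for spouts can be sketched in plain Java. This is not a real ISpout implementation: the class ReplayQueueSketch, its in-memory queue, and the numeric message ids are hypothetical stand-ins that only show the bookkeeping (remember what was emitted, forget it on ack, re-queue it on fail).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Plain-Java sketch of spout-side bookkeeping for ack and fail. Not a real ISpout. */
final class ReplayQueueSketch {
    private final Queue<String> queue = new ArrayDeque<>();    // messages waiting to be emitted
    private final Map<Long, String> pending = new HashMap<>(); // emitted, not yet acked or failed
    private long nextId = 0;

    void offer(String message) {
        queue.add(message);
    }

    /** Emits the next message under a fresh message id; returns the id, or -1 if nothing is queued. */
    long nextTuple() {
        String message = queue.poll();
        if (message == null) {
            return -1;
        }
        long msgId = nextId++;
        pending.put(msgId, message);
        return msgId;
    }

    /** Fully processed: forget the message so that it is never replayed. */
    void ack(long msgId) {
        pending.remove(msgId);
    }

    /** Processing failed: put the message back on the queue to be replayed later. */
    void fail(long msgId) {
        String message = pending.remove(msgId);
        if (message != null) {
            queue.add(message);
        }
    }

    int queued() {
        return queue.size();
    }

    int inFlight() {
        return pending.size();
    }
}
```

A failed message is re-emitted under a new id in this sketch; a real spout may instead reuse an id that identifies the message in its source system.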
Bolts
• A bolt consumes input streams of tuples, performs business logic, and can emit new streams.
• Bolts may subscribe to streams emitted by spouts or other bolts.
• Bolts may perform any sort of processing: filtering, joins, calculations, and database reads and writes.
• Bolts commonly implement the IRichBolt interface; BaseRichBolt is the most basic implementation.
Bolts
Method | Description
void prepare(Map stormConf, TopologyContext context, OutputCollector collector); | Called just before the bolt starts processing tuples. Called only once in the lifecycle of the bolt.
void execute(Tuple input); | Processes a single tuple of input.
void cleanup(); | Called when a bolt is shut down.
void declareOutputFields(OutputFieldsDeclarer declarer); | Declares the output schema of the bolt. If this method is left empty, the next bolt reads the tuple values by position.
Parallelism
• Distributed applications take advantage of horizontally-scaled clusters by dividing computation
tasks across nodes in a cluster. Storm offers this and additional finer-grained ways to increase the
parallelism of a Storm topology:
• Increase the number of workers
• Increase the number of executors
• Increase the number of tasks
Creating first Storm Topology
By default, Storm uses a parallelism factor of 1. Assuming a single-node Storm cluster, a parallelism factor of 1 means that one worker, or JVM, is assigned to execute the topology, and each component in the topology is assigned to a single executor. The diagram illustrates this scenario. The topology in the diagram defines a data flow with four tasks: a spout and three bolts.
[Diagram: one node running one worker JVM with four executor threads, each running a single task: sentence spout, split sentence bolt, word count bolt, report bolt.]
Controlling Parallelism with Tasks
[Diagram: one node running one worker JVM with four executor threads, each running a single task: sentence spout, split sentence bolt, word count bolt, report bolt.]
Increasing Parallelism with Tasks
Finally, Storm developers can increase the number of tasks
assigned to a single topology component, such as a spout or
bolt. By default, Storm assigns a single task to each
component, but developers can increase this number with
the setNumTasks() method on
the BoltDeclarer and SpoutDeclarer objects returned by
the setBolt() and setSpout() methods.
Controlling Multiple Workers
[Diagram: the topology's tasks (sentence spout, split sentence bolt, word count bolt, report bolt) spread across executor threads in worker JVMs on two nodes.]
conf.setNumWorkers(2);
Increasing Parallelism with Executors
The parallelism API enables Storm developers to specify the number of executors for each component with a parallelism hint, an optional third parameter to the setBolt() method.
Simple Illustration
Java Storm (Word Count Example)
• https://github.com/Viyaan/StormWordCount
Storm installation guide (single-node setup)
• https://vincenzogulisano.com/2015/07/30/5-minutes-storm-installation-guide-single-node-setup/
• http://storm.apache.org/releases/1.0.3/Setting-up-a-Storm-cluster.html
• https://www.youtube.com/watch?v=1-HWFArDACA
Trident
Join operations, aggregations, grouping, functions, and filters, as well as
fault-tolerant state management
• Trident is a high-level abstraction for real-time processing on top of Storm.
• Trident provides joins, aggregations, grouping, functions, and filters.
• A typical Trident topology consists of Trident spouts and Trident operators.
• There are no Trident bolts.
• It makes programming easier.
Trident Spouts
• Unlike core Storm, Trident has the concept of tuple batches.
• Unlike Storm spouts, Trident spouts emit tuples in batches.
• Each batch is assigned its own unique batch identifier.
• Batch size and contents are configurable.
Trident Spout Components
• Batch coordinator: responsible for batch management, so that the emitter can properly replay batches.
• Emitter function: responsible for emitting tuples.
Trident Spout Code Snippet
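// Package names assume Storm 1.x (org.apache.storm); releases before 1.0 use backtype.storm and storm.trident.
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.trident.spout.ITridentSpout;
import org.apache.storm.tuple.Fields;

// Skeleton only: every method still has to be implemented.
public class TridentSpout implements ITridentSpout {
    // Returns the coordinator that manages batch metadata.
    public BatchCoordinator getCoordinator(String txStateId, Map conf, TopologyContext context) {
        return null;
    }

    // Returns the emitter that emits the tuples of each batch.
    public Emitter getEmitter(String txStateId, Map conf, TopologyContext context) {
        return null;
    }

    // Returns component-specific configuration, if any.
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }

    // Declares the fields of the tuples this spout emits.
    public Fields getOutputFields() {
        return null;
    }
}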
ITridentSpout
• ITridentSpout is the core interface for implementing a spout in Trident.
Method | Description
getCoordinator(String txStateId, Map conf, TopologyContext context) | The coordinator for a TransactionalSpout runs in a single thread and indicates when batches of tuples should be emitted.
getEmitter(String txStateId, Map conf, TopologyContext context) | Emitters are responsible for emitting batches of tuples.
BatchCoordinator Code Snippet
interface BatchCoordinator<X> {
    /**
     * Create metadata for this particular transaction id which has never
     * been emitted before. The metadata should contain whatever is necessary
     * to be able to replay the exact batch for the transaction at a later point.
     *
     * The metadata is stored in Zookeeper.
     *
     * Storm uses JSON encoding to store the metadata. Only simple types
     * such as numbers, booleans, strings, lists, and maps should be used.
     */
    X initializeTransaction(long txid, X prevMetadata, X currMetadata);

    /**
     * This attempt committed successfully, so all state for this commit and before can be safely cleaned up.
     */
    void success(long txid);

    /**
     * Hint to Storm if the spout is ready for the transaction id.
     */
    boolean isReady(long txid);

    /**
     * Release any resources from this coordinator.
     */
    void close();
}
Emitter Code Snippet
interface Emitter<X> {
    /**
     * Emit a batch for the specified transaction attempt and metadata for the transaction. The metadata
     * was created by the Coordinator in the initializeTransaction method. This method must always emit
     * the same batch of tuples across all tasks for the same transaction id.
     */
    void emitBatch(TransactionAttempt tx, X coordinatorMeta, TridentCollector collector);

    /**
     * This attempt committed successfully, so all state for this commit and before can be safely cleaned up.
     */
    void success(TransactionAttempt tx);

    /**
     * Release any resources held by this emitter.
     */
    void close();
}
Trident Filters
• Filters take a tuple as input and decide whether to keep that tuple or not.
Trident Functions
• Functions are similar to Storm bolts in that they also consume tuples and potentially emit new tuples.
• Functions cannot remove or mutate existing fields; they can only add fields.
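The contrast between filters and functions can be sketched in plain Java. This does not use Trident's filter or function classes: tuples are modeled as List<Object>, and the class FilterFunctionSketch and its run helper are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Plain-Java sketch of filter and function semantics. It does not use the Trident API. */
final class FilterFunctionSketch {

    /** Applies a filter, then a function, to every tuple in a batch. */
    static List<List<Object>> run(List<List<Object>> batch,
                                  Predicate<List<Object>> keep,
                                  Function<List<Object>, List<Object>> newFields) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : batch) {
            if (!keep.test(tuple)) {
                continue; // filter: the tuple is dropped entirely
            }
            List<Object> extended = new ArrayList<>(tuple); // function: existing fields stay untouched
            extended.addAll(newFields.apply(tuple));        // and the new fields are appended
            out.add(extended);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> batch = Arrays.asList(
                Arrays.<Object>asList("storm", 5),
                Arrays.<Object>asList("bolt", -1));
        // Keep tuples whose second field is positive, then append an upper-cased copy of the first field.
        System.out.println(run(batch,
                t -> (Integer) t.get(1) > 0,
                t -> Arrays.<Object>asList(((String) t.get(0)).toUpperCase())));
    }
}
```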
Trident Aggregators
• Similar to functions, aggregators allow topologies to combine tuples.
• Unlike functions, they replace tuple fields and values.
• There are three types of aggregators: Combiner Aggregator, Reducer Aggregator, and Aggregator.
Combiner Aggregator
• A combiner aggregator is used to combine a set of tuples into a single field.
• Storm calls the init method with each tuple and then repeatedly calls the combine method until the partition is processed.
• After combining the values from processing the tuples, Storm emits the result as a single new field.
• If a partition is empty, Storm emits the value returned by the zero method.
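The init, combine, and zero steps can be sketched in plain Java. The Combiner interface below is a local stand-in with the same three-method shape, not Trident's own type; tuples are modeled as List<Object>, and the aggregate helper is a hypothetical driver that stands in for what Storm does per partition.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch of the combiner contract. It does not use the Trident API. */
final class CombinerSketch {

    /** Local stand-in with the init/combine/zero shape described above. */
    interface Combiner<T> {
        T init(List<Object> tuple);
        T combine(T left, T right);
        T zero();
    }

    /** Counts tuples: every tuple contributes 1 and partial counts are added together. */
    static final class Count implements Combiner<Long> {
        public Long init(List<Object> tuple) { return 1L; }
        public Long combine(Long left, Long right) { return left + right; }
        public Long zero() { return 0L; }
    }

    /** Folds one partition into a single value; an empty partition yields zero(). */
    static <T> T aggregate(Combiner<T> combiner, List<List<Object>> partition) {
        // Starting from zero() assumes zero() is the identity for combine, as it is for Count.
        T result = combiner.zero();
        for (List<Object> tuple : partition) {
            result = combiner.combine(result, combiner.init(tuple));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> partition = Arrays.asList(
                Arrays.<Object>asList("a"), Arrays.<Object>asList("b"), Arrays.<Object>asList("c"));
        System.out.println(aggregate(new Count(), partition));
        System.out.println(aggregate(new Count(), Collections.<List<Object>>emptyList()));
    }
}
```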
Reducer Aggregator
• Storm calls the init method to retrieve the initial value.
• The reduce method is called with each tuple until the partition is fully processed.
• The first argument to the reduce() method is the current cumulative aggregation, which the method returns after applying the tuple to the aggregation. When all tuples in the partition have been processed, Storm emits the final cumulative aggregation as a single-field tuple.
• The ReducerAggregator interface defines two methods: init() and reduce(T curr, TridentTuple tuple).
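A minimal plain-Java sketch of the reducer contract follows. The Reducer interface is a local stand-in that mirrors the init and reduce methods, with tuples modeled as List<Object> instead of Trident tuples; the Sum class and the aggregate driver are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of the reducer contract. It does not use the Trident API. */
final class ReducerSketch {

    /** Local stand-in mirroring init() and reduce(current, tuple). */
    interface Reducer<T> {
        T init();
        T reduce(T current, List<Object> tuple);
    }

    /** Sums the first field of every tuple. */
    static final class Sum implements Reducer<Long> {
        public Long init() { return 0L; }
        public Long reduce(Long current, List<Object> tuple) {
            return current + ((Number) tuple.get(0)).longValue();
        }
    }

    /** Starts from init() and applies reduce once per tuple, returning the final aggregation. */
    static <T> T aggregate(Reducer<T> reducer, List<List<Object>> partition) {
        T current = reducer.init();
        for (List<Object> tuple : partition) {
            current = reducer.reduce(current, tuple);
        }
        return current;
    }

    public static void main(String[] args) {
        List<List<Object>> partition = Arrays.asList(
                Arrays.<Object>asList(4), Arrays.<Object>asList(10), Arrays.<Object>asList(8));
        System.out.println(aggregate(new Sum(), partition));
    }
}
```

Unlike the combiner, the reducer threads one running value through every tuple, so there is no separate per-tuple init step and no combine of partial results.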
Aggregator
• A key difference between Aggregator and other Trident aggregation interfaces is that an instance of TridentCollector is
passed as a parameter to every method. This allows Aggregator implementations to emit tuples at any time during
execution.
• Storm executes Aggregator instances as follows:
• Storm calls the init() method, which returns an object T representing the initial state of the aggregation.
• T is also passed to the aggregate() and complete() methods.
• Storm calls the aggregate() method repeatedly, to process each tuple in the batch.
• Storm calls complete() with the final value of the aggregation.
Trident Transactional Spouts
• Trident defines three spout types that differ with respect to batch content, failure response, and support for exactly-once semantics:
• Non-transactional spouts
    • Non-transactional spouts make no guarantees for the contents of each batch. As a result, processing may be at-most-once or at-least-once. It is not possible to achieve exactly-once processing when using non-transactional Trident spouts.
• Transactional spouts
    • Transactional spouts support exactly-once processing in a Trident topology. They define success at the batch level, and have several important properties that allow them to accomplish this:
        • Batches with a given transaction ID are always identical in terms of tuple content, even when replayed.
        • Batch content never overlaps. A tuple can never be in more than one batch.
        • Tuples are never skipped.
    • With transactional spouts, idempotent state updates are relatively easy: because batch transaction IDs are strongly ordered, the ID can be used to track data that has already been persisted. For example, if the current transaction ID is 5 and the data store contains a value for ID 5, the update can be safely skipped.
• Opaque transactional spouts
    • Opaque transactional spouts define success at the tuple level. Opaque transactional spouts have the following properties:
        • There is no guarantee that a batch for a particular transaction ID is always the same.
        • Each tuple is successfully processed in exactly one batch, though it is possible for a tuple to fail in one batch and succeed in another.
• The difference in focus between transactional and opaque transactional spouts (success at the batch level versus the tuple level, respectively) has key implications for achieving exactly-once semantics when combining different spouts with different state types.
Repeat Transactional state
• In repeat transactional state, the last committed batch identifier is stored with the data.
• The state is updated if and only if the batch identifier being applied is the next in sequence.
• If the batch identifier is equal to or lower than the persisted identifier, the update is ignored because it has already been applied.
Batch id | State update
1 | {SF.320:27811 = 4}
2 | {SF.320:27811 = 10}
3 | {SF.320:27811 = 8}
• The batches then complete processing in the following order: 1 -> 2 -> 3 -> 3 (replayed).
• When batch 3 completes its replay, it has no effect on the state because Trident has already incorporated its update in the state. For repeat transactional state to function properly, batch contents cannot change between replays.
Batch id | State
1 | {batch = 1} {SF:320:378911 = 4}
2 | {batch = 2} {SF:320:378911 = 14}
3 | {batch = 3} {SF:320:378911 = 22}
3 (replayed) | {batch = 3} {SF:320:378911 = 22}
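The update rule for repeat transactional state can be sketched in plain Java. This is an in-memory illustration, not Trident's state API: the class name and the apply method are hypothetical, and each batch is assumed to carry a single counter update.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of repeat transactional state. It does not use the Trident API. */
final class RepeatTransactionalStateSketch {
    private long lastBatchId = 0; // identifier of the last committed batch (0 = none yet)
    private final Map<String, Long> counts = new HashMap<>();

    /** Applies the update only if the batch is next in sequence; returns true if it was applied. */
    boolean apply(long batchId, String key, long delta) {
        if (batchId <= lastBatchId) {
            return false; // already applied: a replay must not change the state
        }
        if (batchId != lastBatchId + 1) {
            throw new IllegalStateException("batch " + batchId + " is not next in sequence");
        }
        counts.merge(key, delta, Long::sum);
        lastBatchId = batchId;
        return true;
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        RepeatTransactionalStateSketch state = new RepeatTransactionalStateSketch();
        String key = "SF:320:378911";
        state.apply(1, key, 4);
        state.apply(2, key, 10);
        state.apply(3, key, 8);
        state.apply(3, key, 8); // replay of batch 3 is ignored
        System.out.println(state.get(key)); // prints 22
    }
}
```

The sketch reproduces the table above: the counter reaches 4, 14, and 22, and the replay of batch 3 leaves it at 22.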
Opaque transactional State
• The approach used in repeat transactional state relies on the batch composition remaining constant which
may not be possible if a system encounters a fault.
• If the spout is emitting from a source that may have a partial failure, some of the tuples emitted in the initial
batch might not be available for re-emission.
• The opaque transactional state allows the changing of batch composition by storing both current and
previous state.
• Assume we have the same batches as in the previous example, but this time when batch 3 is replayed, the
aggregate count will be different since it contains a different set of tuples as shown in table below.
Batch id | State update
1 | {SF.320:27811 = 4}
2 | {SF.320:27811 = 10}
3 | {SF.320:27811 = 8}
3 (replayed) | {SF.320:27811 = 6}
• With opaque state, the state is updated as follows.
Completed batch | Batch committed | Previous state | Current state
1 | 1 | {} | {SF.320:27811 = 4}
2 | 2 | {SF.320:27811 = 4} | {SF.320:27811 = 14}
3 (applied) | 3 | {SF.320:27811 = 14} | {SF.320:27811 = 22}
3 (replayed) | 3 | {SF.320:27811 = 14} | {SF.320:27811 = 20}
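The table can be reproduced with a small plain-Java sketch of opaque state for a single counter. It is not Trident's state API: the class and field names are hypothetical, and the empty previous state {} is modeled as 0.

```java
/** Plain-Java sketch of opaque transactional state for one counter. It does not use the Trident API. */
final class OpaqueStateSketch {
    private long batchId = -1; // batch that produced `current` (-1 = no batch committed yet)
    private long previous = 0; // value before that batch was applied
    private long current = 0;  // value after that batch was applied

    /** Applies a batch update; a replay of the committed batch is rebuilt from the previous value. */
    void apply(long id, long delta) {
        if (id == batchId) {
            // Replay with possibly different contents: discard the old result, recompute from previous.
            current = previous + delta;
        } else {
            // New batch: the current value becomes the previous one.
            previous = current;
            current = current + delta;
            batchId = id;
        }
    }

    long previous() { return previous; }
    long current() { return current; }

    public static void main(String[] args) {
        OpaqueStateSketch state = new OpaqueStateSketch();
        state.apply(1, 4);
        state.apply(2, 10);
        state.apply(3, 8);
        state.apply(3, 6); // batch 3 replayed with a different set of tuples
        System.out.println(state.previous() + " " + state.current()); // prints 14 20
    }
}
```

Keeping the previous value is what lets the replayed batch replace, rather than add to, the first attempt's contribution.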
Combinations of spout and state types
When to use Trident
• Many use cases require exactly-once processing, which we can achieve by writing a transactional topology in Trident. Exactly-once processing is difficult to achieve with vanilla Storm, so Trident is useful for use cases that require it.
• Trident is not a fit for all use cases, especially high-performance ones, because it adds complexity on top of Storm and manages state.
Packaging Storm Topologies
• Maven Shade Plugin
• Use the maven-shade-plugin, rather than the maven-assembly-plugin to package your Apache Storm topologies. The maven-shade-plugin
provides the ability to merge JAR manifest entries, which are used by the Hadoop client to resolve URL schemes.
• Use the following Maven configuration file to package your topology:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>1.4</version>
  <configuration>
    <createDependencyReducedPom>true</createDependencyReducedPom>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer
              implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          <transformer
              implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass></mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
Deploying and Managing Apache Storm Topologies
• Use the command line interface to deploy a Storm topology after packaging it in a .jar file.
• For example, you can use the following command to deploy WordCountTopology from the storm-starter jar:
storm jar storm-starter-<starter_version>-storm-<storm_version>.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
• The remainder of this chapter describes the Storm UI, which shows diagnostics for a cluster and topologies, allowing you to monitor and manage deployed topologies.
Moving Data Into and Out of Apache Storm Using Spouts and Bolts
• The following spouts are available in HDP 2.5:
• Kafka spout based on Kafka 0.7.x/0.8.x, plus a new Kafka consumer spout available as a technical preview (not for production
use)
• HDFS
• EventHubs
• Kinesis (technical preview)
• The following bolts are available in HDP 2.5:
• Kafka
• HDFS
• EventHubs
• HBase
• Hive
• JDBC (supports Phoenix)
• Solr
• Cassandra
• MongoDB
• ElasticSearch
• Redis
• OpenTSDB (technical preview)
Sample Connector Codes
• https://github.com/Viyaan/StormKafkaStreamingHDFS
• https://github.com/Viyaan/StormKafkaStreamingMongodb
Reference Books
• https://www.tutorialspoint.com/apache_storm/apache_storm_tutorial.pdf
• https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_storm-component-guide/bk_storm-component-guide.pdf
• https://books.google.co.in/books?id=B9crAwAAQBAJ&pg

More Related Content

What's hot

Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaAndrew Montalenti
 
streamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with stormstreamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with stormDaniel Blanchard
 
What Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versaWhat Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versaBrendan Gregg
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open MpAnshul Sharma
 
Process scheduling
Process schedulingProcess scheduling
Process schedulingHao-Ran Liu
 
Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Seiya Tokui
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2AAKASH S
 
Monitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapMonitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapPadraig O'Sullivan
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to ChainerSeiya Tokui
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryoguest40fc7cd
 
Network Simulator Tutorial
Network Simulator TutorialNetwork Simulator Tutorial
Network Simulator Tutorialcscarcas
 
Fundamental concurrent programming
Fundamental concurrent programmingFundamental concurrent programming
Fundamental concurrent programmingDimas Prawira
 

What's hot (20)

Ns2
Ns2Ns2
Ns2
 
~Ns2~
~Ns2~~Ns2~
~Ns2~
 
Session 1 introduction to ns2
Session 1   introduction to ns2Session 1   introduction to ns2
Session 1 introduction to ns2
 
Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
 
streamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with stormstreamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with storm
 
What Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versaWhat Linux can learn from Solaris performance and vice-versa
What Linux can learn from Solaris performance and vice-versa
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open Mp
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Monitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTapMonitoring MySQL with DTrace/SystemTap
Monitoring MySQL with DTrace/SystemTap
 
Deep parking
Deep parkingDeep parking
Deep parking
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
Ns2 introduction 2
Ns2 introduction 2Ns2 introduction 2
Ns2 introduction 2
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryo
 
Network Simulator Tutorial
Network Simulator TutorialNetwork Simulator Tutorial
Network Simulator Tutorial
 
Twitter Stream Processing
Twitter Stream ProcessingTwitter Stream Processing
Twitter Stream Processing
 
Ns 2 Network Simulator An Introduction
Ns 2 Network Simulator An IntroductionNs 2 Network Simulator An Introduction
Ns 2 Network Simulator An Introduction
 
Fundamental concurrent programming
Fundamental concurrent programmingFundamental concurrent programming
Fundamental concurrent programming
 

Similar to Storm

Similar to Storm (20)





Storm

  • 1.
  • 2. What is Storm • Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
  • 3. What is Storm
      • The work is delegated to different types of components, each responsible for a specific processing task.
      • The input stream of a Storm cluster is handled by a component called a spout.
      • The spout passes the data to a component called a bolt, which transforms it in some way.
      • A bolt either persists the data in some sort of storage or passes it to another bolt.
      • [Diagram: an input data source feeds spouts, which pass data on to bolts; bolts pass data on to other bolts.]
  • 4. Storm Components
      • A Storm cluster has three types of nodes:
      1. Nimbus nodes
          • Nimbus is a daemon that runs on the master node of a Storm cluster.
          • It is responsible for distributing the code among the worker nodes, assigning input data sets to machines for processing, and monitoring for failures.
          • The Nimbus service is an Apache Thrift service, which lets you submit code in any programming language. You can use the language you are proficient in without learning a new one to use Apache Storm.
          • The Nimbus service relies on Apache ZooKeeper to monitor message processing, because all worker nodes update their task status in ZooKeeper.
      2. ZooKeeper nodes
          • ZooKeeper coordinates the Storm cluster.
      3. Supervisor nodes
          • Every worker node in a Storm cluster runs a daemon called Supervisor. The Supervisor receives the work that Nimbus assigns to its machine and manages worker processes to complete those tasks. Each worker process executes a subset of a topology, which is covered next.
  • 6. Storm Components
      • The following key abstractions help explain how Storm processes data.
      • Tuples: an ordered list of elements.
      • Streams: an unbounded sequence of tuples.
      • Spouts: sources of streams in a computation.
      • Bolts: process input streams and produce output streams.
      • Topologies: a topology is, in simple terms, a graph of computation. Each node in a topology contains processing logic, and the links between nodes indicate how data is passed between them. A topology typically runs distributed across multiple worker processes on multiple worker nodes.
  • 7. Storm technology stack
      • Java and Clojure
      • Storm runs on the Java Virtual Machine and is written in a combination of Java and Clojure.
      • Storm is highly polyglot in nature.
      • Spouts and bolts can be written in virtually any programming language that can read data from a stream.
  • 8. Basic Storm Daemon Commands
      • Daemon commands used to start Storm:
      • Nimbus
          Usage: storm nimbus
          Description: Launches the Nimbus daemon.
      • Supervisor
          Usage: storm supervisor
          Description: Launches the Supervisor daemon.
      • UI
          Usage: storm ui
          Description: Launches the Storm UI, a web-based UI for monitoring Storm clusters.
  • 9. Storm Management Commands
      • jar
          Usage: storm jar <topology jar> <topology class> <args>
          Description:
          1. Submits a topology to the cluster.
          2. Runs main() of the topology class.
          3. Uploads the topology jar to Nimbus for distribution to the cluster.
          4. Once the topology is submitted, Storm activates it and starts processing.
          5. The main() method in the topology class is responsible for supplying a unique name for the topology.
          6. If a topology with that name already exists on the cluster, the jar command fails.
      • kill
          Usage: storm kill <topology name>
          Description: Storm does not kill the topology immediately. Instead, it deactivates all the spouts so that they emit no more tuples, and then waits Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS seconds before destroying all the workers. This gives the topology enough time to complete any tuples it was processing when it was killed.
  • 10. Storm Management Commands
      • deactivate
          Usage: storm deactivate <topology name>
          Description: Deactivates the specified topology's spouts.
      • activate
          Usage: storm activate <topology name>
          Description: Activates the specified topology's spouts.
      • rebalance
          Usage: storm rebalance topology-name [-w wait-time-secs] [-n new-num-workers] [-e component=parallelism]*
          Description: Suppose you have a 10-node cluster running 4 workers per node, and you add another 10 nodes to the cluster. You may want Storm to spread out the workers of the running topology so that each node runs 2 workers. One way to do this is to kill the topology and resubmit it, but the rebalance command is easier. Rebalance first deactivates the topology for the duration of the message timeout (overridable with the -w flag) and then redistributes the workers evenly around the cluster. The topology then returns to its previous state of activation.
  • 11. Storm Management Commands • http://storm.apache.org/releases/1.0.1/Command-line-client.html
  • 12. Storm Running Modes
      • Local
      • Remote
  • 13. Local Mode
      • Storm topologies run on the local machine in a single JVM.
      • This mode is used for development, testing, and debugging because it is the easiest way to see all topology components working together.
      • We can adjust topologies and see how they run under different configurations.
      • http://storm.apache.org/releases/1.1.0/Local-mode.html
  • 14. Remote Mode
      • In remote mode, we submit the topology to a Storm cluster, which is composed of many processes, usually running on different machines.
      • Remote mode does not show debugging information, which is why it is considered the production mode.
      • http://storm.apache.org/releases/1.1.1/Running-topologies-on-a-production-cluster.html
  • 15. Groupings in Storm
      • Before designing a topology, it is important to define how data is exchanged between components.
      • A stream grouping specifies which streams each bolt consumes and how those streams are consumed.
      • Stream groupings are declared when the topology is defined:
          builder.setSpout("firstSpout", new FirstSpout(), 2);
          builder.setBolt("firstBolt", new FirstBolt(), 3).globalGrouping("firstSpout");
  • 16. Shuffle Grouping
      • Shuffle grouping is the most commonly used grouping.
      • Tuples are randomly distributed across the bolt's tasks in such a way that each task is guaranteed to get an equal number of tuples.
      • It is useful for atomic operations.
      • It is not suitable for a topology that counts words, because tuples carrying the same word must reach the same task and therefore cannot be distributed randomly.
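One way to get a distribution that is both random and even is to shuffle the list of target tasks and walk through it, reshuffling after each full pass. The sketch below models that idea in plain Java. It is an illustration only, not Storm's implementation; the class name `ShuffleGroupingSketch` and the seed parameter are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative model of shuffle grouping in plain Java (not Storm's code).
final class ShuffleGroupingSketch {
    private final List<Integer> order = new ArrayList<>();
    private final Random random;
    private int next = 0;

    ShuffleGroupingSketch(int numTasks, long seed) {
        this.random = new Random(seed);
        for (int i = 0; i < numTasks; i++) {
            order.add(i);
        }
        Collections.shuffle(order, random);
    }

    // Returns the index of the task that receives the next tuple.
    int chooseTask() {
        if (next == order.size()) {
            // One full pass is done: every task got exactly one tuple. Reshuffle.
            Collections.shuffle(order, random);
            next = 0;
        }
        return order.get(next++);
    }

    public static void main(String[] args) {
        ShuffleGroupingSketch grouping = new ShuffleGroupingSketch(3, 42L);
        int[] counts = new int[3];
        for (int i = 0; i < 9; i++) {
            counts[grouping.chooseTask()]++;
        }
        System.out.println(Arrays.toString(counts)); // prints [3, 3, 3]
    }
}
```

The order within each pass is random, but after every full pass each task has received the same number of tuples.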
  • 17. Fields Grouping
      • Fields grouping lets you control how tuples are sent to bolts, based on one or more fields of the tuple.
      • It ensures that a given set of values for a combination of fields is always sent to the same bolt.
      • [Diagram: Bolt A sends tuples with fields x, y, and z to Bolt B using fields grouping.]
  • 18. Fields Grouping • The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" always go to the same task, but tuples with different "user-id" values may go to different tasks.
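The "same value, same task" property can be modeled by hashing the grouping fields and taking the result modulo the number of tasks. The following plain-Java sketch shows the idea under that assumption; `FieldsGroupingSketch` and `chooseTask` are hypothetical names, and Storm's own hashing may differ in detail.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model of fields grouping in plain Java (not Storm's code).
final class FieldsGroupingSketch {

    // Maps the values of the grouping fields to a task index in [0, numTasks).
    static int chooseTask(List<?> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        int first = chooseTask(Arrays.asList("user-1"), numTasks);
        int again = chooseTask(Arrays.asList("user-1"), numTasks);
        int other = chooseTask(Arrays.asList("user-2"), numTasks);
        System.out.println("user-1 -> task " + first + " (repeat: task " + again + ")");
        System.out.println("user-2 -> task " + other);
    }
}
```

Two tuples with the same "user-id" always map to the same index; tuples with different values may or may not share a task.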
  • 19. All Grouping
      • All grouping is used to send signals to bolts.
      • It sends a single copy of each tuple to all instances of the receiving bolt.
      • [Diagram: Bolt A → Bolt B]
  • 20. Custom Grouping • We can create our own custom stream grouping by implementing the backtype.storm.grouping.CustomStreamGrouping interface. This gives us the power to decide which bolt(s) will receive each tuple.
  • 21. Direct Grouping
      • This is a special grouping in which the source decides which task of the receiving component gets each tuple.
      • To use direct grouping, the emitting component calls the emitDirect method instead of emit.
  • 22. Direct Grouping
      • Work out the number of target tasks in the prepare method:
      • In the topology definition, specify that the stream will be grouped directly:
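The slide's own code is not part of this transcript. As a stand-in, here is a minimal plain-Java sketch of the decision a producer makes under direct grouping: given the list of target task ids, it picks one per tuple. The routing rule (first character of the word modulo the number of tasks) and all names are assumptions for illustration; in a real topology the chosen id would be passed to emitDirect.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model of direct grouping in plain Java (not Storm's API).
final class DirectGroupingSketch {

    // The producer itself decides which consumer task receives the word.
    static int chooseTargetTask(String word, List<Integer> targetTaskIds) {
        if (word.isEmpty()) {
            return targetTaskIds.get(0);
        }
        int index = word.charAt(0) % targetTaskIds.size();
        return targetTaskIds.get(index);
    }

    public static void main(String[] args) {
        List<Integer> counterTasks = Arrays.asList(7, 8, 9); // hypothetical task ids
        System.out.println(chooseTargetTask("apple", counterTasks)); // 'a' is 97, 97 % 3 = 1, prints 8
    }
}
```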
  • 23. Global Grouping
      • Global grouping sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest id).
      • [Diagram: Bolt A → Bolt B]
  • 24. LocalCluster vs. StormSubmitter
      • LocalCluster helps us run the topology on our local computer.
      • Running the Storm infrastructure on our computer lets us run and debug different topologies easily. But what about when we want to submit our topology to a running Storm cluster?
      • One of the interesting features of Storm is that it’s easy to send our topology to run in a real cluster. We need to replace LocalCluster with StormSubmitter and call its submitTopology method, which is responsible for sending the topology to the cluster.
  • 27. Spouts • A spout is a source of streams that generates the input tuples for a topology.
  • 28. ISpout
      • ISpout is the core interface for implementing spouts.
      • void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
          Called when a task for this component is initialized within a worker on the cluster.
      • void close();
          Called when an ISpout is going to be shut down.
      • void activate();
          Called when a spout has been activated out of a deactivated mode.
      • void deactivate();
          Called when a spout has been deactivated. nextTuple will not be called while a spout is deactivated.
      • void nextTuple();
          When this method is called, Storm is requesting that the spout emit tuples to the output collector.
      • void ack(Object msgId);
          Storm has determined that the tuple emitted by this spout with the msgId identifier has been fully processed. Typically, an implementation of this method takes that message off the queue and prevents it from being replayed.
      • void fail(Object msgId);
          The tuple emitted by this spout with the msgId identifier has failed to be fully processed. Typically, an implementation of this method puts that message back on the queue to be replayed at a later time.
  • 29. Ack in Storm
      • The ack method is called when a message has been processed correctly.
      • Tuple processing succeeds when the tuple has been processed by all target bolts.
  • 30. Fail in Storm
      • The fail method is called when processing of a tuple has failed.
      • It is used either to resend the tuple or to throw an exception.
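The typical ack/fail behavior described above (drop a message once it is acked, put it back on the queue when it fails) can be sketched without any Storm dependency. The class below is a hypothetical plain-Java model: it mimics the shape of nextTuple, ack, and fail but uses a long message id and returns the id instead of emitting to a collector.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative model of a replaying spout in plain Java (not Storm's ISpout).
final class ReplayingSpoutSketch {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextId = 0;

    void offer(String message) {
        queue.add(message);
    }

    // Mimics nextTuple(): emits one message and returns its id, or -1 if idle.
    long nextTuple() {
        String message = queue.poll();
        if (message == null) {
            return -1;
        }
        long msgId = nextId++;
        pending.put(msgId, message); // remember it until it is acked or failed
        return msgId;
    }

    // Fully processed: forget the message so it is never replayed.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    // Processing failed: put the message back on the queue for a later replay.
    void fail(long msgId) {
        String message = pending.remove(msgId);
        if (message != null) {
            queue.add(message);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int queuedCount() {
        return queue.size();
    }
}
```

A failed message is emitted again under a new id on a later nextTuple call, which is one simple way to implement at-least-once delivery.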
  • 31. Bolts • A bolt consumes input streams of tuples, performs business logic, and can emit new streams.
  • 32. Bolts
      • Bolts may subscribe to streams emitted by spouts or other bolts.
      • Bolts may perform any sort of processing imaginable: filtering, joins, calculations, database reads and writes.
      • All bolts must implement the IRichBolt interface. BaseRichBolt is the most basic implementation.
  • 33. Bolts
      • void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
          Called just before the bolt starts processing tuples. Called once in the lifetime of the bolt.
      • void execute(Tuple input);
          Processes a single tuple of input.
      • void cleanup();
          Called when a bolt is shut down.
      • void declareOutputFields(OutputFieldsDeclarer declarer);
          Declares the output schema of the bolt. If this method is left empty, the next bolt reads tuple values by position.
  • 34. Parallelism
      • Distributed applications take advantage of horizontally scaled clusters by dividing computation tasks across nodes in a cluster. Storm offers this and additional finer-grained ways to increase the parallelism of a Storm topology:
      • Increase the number of workers
      • Increase the number of executors
      • Increase the number of tasks
  • 35. Creating first Storm Topology
      • By default, Storm uses a parallelism factor of 1. Assuming a single-node Storm cluster, a parallelism factor of 1 means that one worker, or JVM, is assigned to execute the topology, and each component in the topology is assigned to a single executor. The following diagram illustrates this scenario. The topology defines a data flow with three tasks, a spout and two bolts.
      • [Diagram: one node running one worker JVM; each executor thread runs one task: sentence spout, split sentence bolt, word count bolt, report bolt.]
  • 36. Controlling Parallelism with Tasks
      • Increasing Parallelism with Tasks: Storm developers can increase the number of tasks assigned to a single topology component, such as a spout or bolt. By default, Storm assigns a single task to each component, but developers can increase this number with the setNumTasks() method on the BoltDeclarer and SpoutDeclarer objects returned by the setBolt() and setSpout() methods.
      • [Diagram: one node running one worker JVM; executor threads run the sentence spout, split sentence bolt, word count bolt, and report bolt tasks.]
  • 37. Controlling Multiple Workers
      • conf.setNumWorkers(2);
      • Increasing Parallelism with Executors: The parallelism API enables Storm developers to specify the number of executors for each worker with a parallelism hint, an optional third parameter to the setBolt() and setSpout() methods.
      • [Diagram: two nodes running worker JVMs; executor threads run the sentence spout, split sentence bolt, word count bolt, and report bolt tasks, with two tasks each for the split sentence and word count bolts.]
  • 39. Java Storm (Word Count Example) • https://github.com/Viyaan/StormWordCount
  • 40. Storm installation guide (single-node setup)
      • https://vincenzogulisano.com/2015/07/30/5-minutes-storm-installation-guide-single-node-setup/
      • http://storm.apache.org/releases/1.0.3/Setting-up-a-Storm-cluster.html
      • https://www.youtube.com/watch?v=1-HWFArDACA
  • 41. Trident • Join operations, aggregations, grouping, functions, and filters, as well as fault-tolerant state management.
  • 42. Trident
      • Trident is a high-level abstraction for real-time processing on top of Storm.
      • Trident has joins, aggregations, grouping, functions, and filters.
      • A typical Trident topology consists of Trident spouts and Trident operators.
      • There are no Trident bolts.
      • It makes programming easier.
  • 43. Trident Spouts
      • Unlike core Storm, Trident has the concept of tuple batches.
      • Unlike Storm spouts, Trident spouts emit tuples in batches.
      • Each batch is assigned its own unique batch identifier.
      • Batch size and contents are configurable.
  • 44. Trident Spout Components
      • Batch coordinator: responsible for batch management, so that the emitter can properly replay batches.
      • Emitter: responsible for emitting tuples.
  • 45. Trident Spout Code Snippet
      public class TridentSpout implements ITridentSpout {

          public BatchCoordinator getCoordinator(String txStateId, Map conf, TopologyContext context) {
              // TODO Auto-generated method stub
              return null;
          }

          public Emitter getEmitter(String txStateId, Map conf, TopologyContext context) {
              // TODO Auto-generated method stub
              return null;
          }

          public Map<String, Object> getComponentConfiguration() {
              // TODO Auto-generated method stub
              return null;
          }

          public Fields getOutputFields() {
              // TODO Auto-generated method stub
              return null;
          }
      }
  • 46. ITridentSpout
      • ITridentSpout is the core interface for implementing a spout in Trident.
      • getCoordinator(String txStateId, Map conf, TopologyContext context)
          The coordinator for a TransactionalSpout runs in a single thread and indicates when batches of tuples should be emitted.
      • getEmitter(String txStateId, Map conf, TopologyContext context)
          Emitters are responsible for emitting batches of tuples.
  • 47. BatchCoordinator Code Snippet
      interface BatchCoordinator<X> {
          /**
           * Create metadata for this particular transaction id which has never
           * been emitted before. The metadata should contain whatever is necessary
           * to be able to replay the exact batch for the transaction at a later point.
           *
           * The metadata is stored in Zookeeper.
           *
           * Storm uses JSON encoding to store the metadata. Only simple types
           * such as numbers, booleans, strings, lists, and maps should be used.
           */
          X initializeTransaction(long txid, X prevMetadata, X currMetadata);

          /**
           * This attempt committed successfully, so all state for this commit and before can be safely cleaned up.
           */
          void success(long txid);

          /**
           * Hint to Storm if the spout is ready for the transaction id.
           */
          boolean isReady(long txid);

          /**
           * Release any resources from this coordinator.
           */
          void close();
      }
  • 48. Emitter Code Snippet
      interface Emitter<X> {
          /**
           * Emit a batch for the specified transaction attempt and metadata for the transaction. The metadata
           * was created by the Coordinator in the initializeTransaction method. This method must always emit
           * the same batch of tuples across all tasks for the same transaction id.
           */
          void emitBatch(TransactionAttempt tx, X coordinatorMeta, TridentCollector collector);

          /**
           * This attempt committed successfully, so all state for this commit and before can be safely cleaned up.
           */
          void success(TransactionAttempt tx);

          /**
           * Release any resources held by this emitter.
           */
          void close();
      }
  • 49. Trident Filters • Filters take a tuple as input and decide whether to keep that tuple or not.
  • 50. Trident Functions
      • Functions are similar to Storm bolts in that they consume tuples and can emit new tuples.
      • Functions cannot remove or mutate existing fields; they can only add fields.
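The difference between a filter (keep or drop a tuple) and a function (append fields, never change existing ones) can be sketched in plain Java. In this hypothetical model a tuple is a map from field name to value; the field names "word" and "length" and the class name are assumptions, and none of this is Trident's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of a Trident-style filter and function in plain Java.
final class FilterFunctionSketch {

    // Filter: decides whether the tuple is kept. Here, empty words are dropped.
    static boolean keep(Map<String, Object> tuple) {
        return !((String) tuple.get("word")).isEmpty();
    }

    // Function: existing fields stay untouched; one new field is appended.
    static Map<String, Object> addLength(Map<String, Object> tuple) {
        Map<String, Object> out = new LinkedHashMap<>(tuple);
        out.put("length", ((String) tuple.get("word")).length());
        return out;
    }

    // Runs the filter and then the function over a batch of words.
    static List<Map<String, Object>> run(List<String> words) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (String word : words) {
            Map<String, Object> tuple = new LinkedHashMap<>();
            tuple.put("word", word);
            if (keep(tuple)) {
                result.add(addLength(tuple));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("storm", "", "bolt")));
    }
}
```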
  • 51. Trident Aggregators
      • Similar to functions, aggregators allow topologies to combine fields.
      • Unlike functions, they replace tuple fields and values.
      • Types of Trident aggregators: CombinerAggregator, ReducerAggregator, Aggregator.
  • 52. Combiner Aggregator
      • A combiner aggregator is used to combine a set of tuples into a single field.
      • Storm calls the init method with each tuple and then repeatedly calls the combine method until the partition is processed.
      • After combining the values from processing the tuples, Storm emits the result as a single new field.
      • If the partition is empty, Storm emits the value returned by the zero method.
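The init/combine/zero sequence can be sketched with a small stand-in interface. The `Combiner` interface below is hypothetical: it mirrors the three method names but takes a plain long where the real interface receives a tuple. The driver folds from zero(), which is one valid order of combination; a real runtime is free to combine partial results in a different order.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative model of a combiner aggregator in plain Java (not Trident's API).
final class CombinerSketch {

    // Hypothetical stand-in: a long value plays the role of the input tuple.
    interface Combiner<T> {
        T init(long tupleValue); // value contributed by one tuple
        T combine(T a, T b);     // merges two partial results
        T zero();                // result for an empty partition
    }

    static final Combiner<Long> SUM = new Combiner<Long>() {
        public Long init(long tupleValue) { return tupleValue; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    };

    // Folds a partition into one value; an empty partition yields zero().
    static <T> T aggregate(Combiner<T> combiner, List<Long> partition) {
        T result = combiner.zero();
        for (long value : partition) {
            result = combiner.combine(result, combiner.init(value));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(aggregate(SUM, Arrays.asList(1L, 2L, 3L)));    // prints 6
        System.out.println(aggregate(SUM, Collections.<Long>emptyList())); // prints 0
    }
}
```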
  • 53. Reducer Aggregator
      • Storm calls the init method to retrieve the initial value.
      • The reduce method is called with each tuple until the batch is fully processed.
      • The first argument to the reduce() method is the current cumulative aggregation, which the method returns after applying the tuple to the aggregation. When all tuples in the partition have been processed, Storm emits the last cumulative aggregation returned by reduce().
      • The ReducerAggregator<T> interface defines two methods: T init() and T reduce(T curr, TridentTuple tuple).
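The init/reduce loop described above can be sketched in plain Java. The `Reducer` interface here is a hypothetical stand-in that uses a String in place of a tuple; the counting reducer is just an example.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model of a reducer aggregator in plain Java (not Trident's API).
final class ReducerSketch {

    // Hypothetical stand-in: a String plays the role of the input tuple.
    interface Reducer<T> {
        T init();                      // initial value of the aggregation
        T reduce(T curr, String tuple); // applies one tuple to the running value
    }

    // Counts tuples: starts at 0 and adds 1 per tuple.
    static final Reducer<Integer> COUNT = new Reducer<Integer>() {
        public Integer init() { return 0; }
        public Integer reduce(Integer curr, String tuple) { return curr + 1; }
    };

    // Calls init() once, then reduce() for each tuple; the last value is the result.
    static <T> T run(Reducer<T> reducer, List<String> batch) {
        T curr = reducer.init();
        for (String tuple : batch) {
            curr = reducer.reduce(curr, tuple);
        }
        return curr;
    }

    public static void main(String[] args) {
        System.out.println(run(COUNT, Arrays.asList("a", "b", "c"))); // prints 3
    }
}
```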
  • 54. Aggregator
      • A key difference between Aggregator and the other Trident aggregation interfaces is that an instance of TridentCollector is passed as a parameter to every method. This allows Aggregator implementations to emit tuples at any time during execution.
      • Storm executes Aggregator instances as follows:
          • Storm calls the init() method, which returns an object T representing the initial state of the aggregation.
          • T is also passed to the aggregate() and complete() methods.
          • Storm calls the aggregate() method repeatedly, to process each tuple in the batch.
          • Storm calls complete() with the final value of the aggregation.
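The init/aggregate/complete lifecycle, with a collector available at each step, can be sketched as follows. This is a plain-Java model under stated assumptions: a `List<String>` stands in for the collector, words stand in for tuples, and the per-batch word count is only an example of what an implementation might emit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model of the Aggregator lifecycle in plain Java (not Trident's API).
final class AggregatorSketch {

    // init(): creates the state object that is passed to the other methods.
    static Map<String, Integer> init() {
        return new TreeMap<>();
    }

    // aggregate(): called once per tuple; may also emit through the collector.
    static void aggregate(Map<String, Integer> state, String word, List<String> collector) {
        state.merge(word, 1, Integer::sum);
    }

    // complete(): called once with the final state; emits one result per word.
    static void complete(Map<String, Integer> state, List<String> collector) {
        for (Map.Entry<String, Integer> entry : state.entrySet()) {
            collector.add(entry.getKey() + "=" + entry.getValue());
        }
    }

    // Drives the lifecycle for one batch and returns everything that was emitted.
    static List<String> runBatch(List<String> batch) {
        List<String> collector = new ArrayList<>();
        Map<String, Integer> state = init();
        for (String word : batch) {
            aggregate(state, word, collector);
        }
        complete(state, collector);
        return collector;
    }

    public static void main(String[] args) {
        System.out.println(runBatch(Arrays.asList("b", "a", "b"))); // prints [a=1, b=2]
    }
}
```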
  • 55. Trident Transactional Spouts
      • Trident defines three spout types that differ with respect to batch content, failure response, and support for exactly-once semantics:
      • Non-transactional spouts
          • Non-transactional spouts make no guarantees for the contents of each batch. As a result, processing may be at-most-once or at-least-once. It is not possible to achieve exactly-once processing when using non-transactional Trident spouts.
      • Transactional spouts
          • Transactional spouts support exactly-once processing in a Trident topology. They define success at the batch level and have several important properties that allow them to accomplish this:
              • Batches with a given transaction ID are always identical in terms of tuple content, even when replayed.
              • Batch content never overlaps. A tuple can never be in more than one batch.
              • Tuples are never skipped.
          • With transactional spouts, idempotent state updates are relatively easy: because batch transaction IDs are strongly ordered, the ID can be used to track data that has already been persisted. For example, if the current transaction ID is 5 and the data store contains a value for ID 5, the update can be safely skipped.
      • Opaque transactional spouts
          • Opaque transactional spouts define success at the tuple level. They have the following properties:
              • There is no guarantee that a batch for a particular transaction ID is always the same.
              • Each tuple is successfully processed in exactly one batch, though it is possible for a tuple to fail in one batch and succeed in another.
      • The difference in focus between transactional and opaque transactional spouts (success at the batch level versus the tuple level, respectively) has key implications for achieving exactly-once semantics when combining different spouts with different state types.
  • 56. Repeat Transactional State
      • In repeat transactional state, the last committed batch identifier is stored with the data.
      • The state is updated if and only if the batch identifier being applied is next in sequence.
      • If the batch identifier is equal to or lower than the persisted identifier, the update is ignored because it has already been applied.
      Batch id        State update
      1               {SF.320:27811 = 4}
      2               {SF.320:27811 = 10}
      3               {SF.320:27811 = 8}
  • 57. Repeat Transactional State
      • The batches then complete processing in the following order: 1 -> 2 -> 3 -> 3 (replayed).
      • When batch 3 completes its replay, it has no effect on the state because Trident has already incorporated its update.
      • For the repeat transactional state to function properly, batch contents cannot change between replays.
      Batch id        State
      1               {batch = 1} {SF:320:378911 = 4}
      2               {batch = 2} {SF:320:378911 = 14}
      3               {batch = 3} {SF:320:378911 = 22}
      3 (Replayed)    {batch = 3} {SF:320:378911 = 22}
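The rule "ignore a batch whose identifier is equal to or lower than the stored one" can be sketched in a few lines of plain Java. The class and method names are hypothetical, and the state is reduced to a single counter so that the numbers match the table above (4, then 14, then 22, and still 22 after the replay).

```java
// Illustrative model of a repeat transactional state in plain Java (not Trident's API).
final class TransactionalStateSketch {
    private long lastTxId = 0; // identifier of the last committed batch
    private long count = 0;    // the stored value

    // Applies a batch update; returns false when the batch was already applied.
    boolean apply(long txId, long delta) {
        if (txId <= lastTxId) {
            return false; // replay of a committed batch: ignore it
        }
        count += delta;
        lastTxId = txId; // the identifier is stored together with the data
        return true;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        TransactionalStateSketch state = new TransactionalStateSketch();
        state.apply(1, 4);
        state.apply(2, 10);
        state.apply(3, 8);
        state.apply(3, 8); // replay of batch 3 has no effect
        System.out.println(state.count()); // prints 22
    }
}
```

This only works because a replayed batch has the same content as the original, which is exactly the guarantee a transactional spout provides.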
  • 58. Opaque Transactional State
      • The approach used in repeat transactional state relies on the batch composition remaining constant, which may not be possible if a system encounters a fault.
      • If the spout is emitting from a source that may have a partial failure, some of the tuples emitted in the initial batch might not be available for re-emission.
      • The opaque transactional state allows the batch composition to change by storing both the current and the previous state.
      • Assume we have the same batches as in the previous example, but this time, when batch 3 is replayed, the aggregate count is different because the batch contains a different set of tuples, as shown in the table below.
      Batch id        State update
      1               {SF.320:27811 = 4}
      2               {SF.320:27811 = 10}
      3               {SF.320:27811 = 8}
      3 (Replayed)    {SF.320:27811 = 6}
  • 59. Opaque Transactional State
      • With opaque state, the state updates as follows.
      Completed batch   Batch committed   Previous state         Current state
      1                 1                 {}                     {SF.320:27811 = 4}
      2                 2                 {SF.320:27811 = 4}     {SF.320:27811 = 14}
      3 (Applies)       3                 {SF.320:27811 = 14}    {SF.320:27811 = 22}
      3 (Replayed)      3                 {SF.320:27811 = 14}    {SF.320:27811 = 20}
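The table above can be reproduced with a small plain-Java model that stores the batch identifier together with the previous and current values. When a batch arrives with the identifier that is already stored, the update is recomputed from the previous value; otherwise the current value becomes the previous one first. The class and method names are hypothetical, and the state is reduced to a single counter.

```java
// Illustrative model of an opaque transactional state in plain Java (not Trident's API).
final class OpaqueStateSketch {
    private long txId = 0; // identifier of the last batch applied
    private long prev = 0; // value before that batch
    private long curr = 0; // value after that batch

    void apply(long batchTxId, long delta) {
        if (batchTxId == txId) {
            // Replay with possibly different content: redo the update from prev.
            curr = prev + delta;
        } else {
            // New batch: the current value becomes the previous value.
            prev = curr;
            curr = curr + delta;
            txId = batchTxId;
        }
    }

    long previous() {
        return prev;
    }

    long current() {
        return curr;
    }

    public static void main(String[] args) {
        OpaqueStateSketch state = new OpaqueStateSketch();
        state.apply(1, 4);  // current = 4
        state.apply(2, 10); // previous = 4, current = 14
        state.apply(3, 8);  // previous = 14, current = 22
        state.apply(3, 6);  // replay with different content: current = 14 + 6
        System.out.println(state.previous() + " " + state.current()); // prints 14 20
    }
}
```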
  • 60. Combinations of spout and state types
  • 61. When to use Trident • Many use cases require exactly-once processing, which we can achieve by writing a transactional topology in Trident. Exactly-once processing is difficult to achieve with vanilla Storm, so Trident is useful for the use cases that require it. • Trident is not a fit for all use cases, especially high-performance ones, because it adds complexity on top of Storm and the overhead of managing state.
  • 62. Packaging Storm Topologies • Maven Shade Plugin • Use the maven-shade-plugin, rather than the maven-assembly-plugin, to package your Apache Storm topologies. The maven-shade-plugin provides the ability to merge JAR manifest entries, which are used by the Hadoop client to resolve URL schemes. • Use the following Maven plugin configuration to package your topology:
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>1.4</version>
        <configuration>
          <createDependencyReducedPom>true</createDependencyReducedPom>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass></mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
  • 63. Deploying and Managing Apache Storm Topologies • Use the command-line interface to deploy a Storm topology after packaging it in a .jar file. • For example, you can use the following command to deploy WordCountTopology from the storm-starter jar:
      storm jar storm-starter-<starter_version>-storm-<storm_version>.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
  • The Storm UI shows diagnostics for a cluster and its topologies, allowing you to monitor and manage deployed topologies.
  • 64. Moving Data Into and Out of Apache Storm Using Spouts and Bolts • The following spouts are available in HDP 2.5: • Kafka spout based on Kafka 0.7.x/0.8.x, plus a new Kafka consumer spout available as a technical preview (not for production use) • HDFS • EventHubs • Kinesis (technical preview) • The following bolts are available in HDP 2.5: • Kafka • HDFS • EventHubs • HBase • Hive • JDBC (supports Phoenix) • Solr • Cassandra • MongoDB • ElasticSearch • Redis • OpenTSDB (technical preview)
  • 65. Sample Connector Codes • https://github.com/Viyaan/StormKafkaStreamingHDFS • https://github.com/Viyaan/StormKafkaStreamingMongodb
  • 66. Reference Books • https://www.tutorialspoint.com/apache_storm/apache_storm_tutorial.pdf • https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_storm-component-guide/bk_storm-component-guide.pdf • https://books.google.co.in/books?id=B9crAwAAQBAJ&pg

Editor's Notes

  1. Finally, Storm developers can increase the number of tasks assigned to a single topology component, such as a spout or bolt. By default, Storm runs one task per executor, but developers can change the number of tasks with the setNumTasks() method on the BoltDeclarer and SpoutDeclarer objects returned by the setBolt() and setSpout() methods.
  2. The parallelism API enables Storm developers to specify the number of executors for each worker with a parallelism hint, an optional third parameter to the setBolt()