3. What is Storm?
Fast, highly scalable, fault-tolerant, real-time stream
processing system
Programming language agnostic (Java, Python, Ruby)
When should I use Storm?
Stream Processing
Continuous computation
Distributed RPC
Real Time Analytics
4. Storm Architecture
Master – Runs daemon called “Nimbus”
Distributes code, assigns tasks and monitors for failures
Stateless and Fail-fast
Worker – Runs daemon called “Supervisor”
Creates, Stops and Starts worker processes
Stateless and Fail-fast
ZooKeeper
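Stateful, manages cluster coordination
Nimbus or a Supervisor can be killed with kill -9 and restarted
without losing state, because state is kept in ZooKeeper or on local disk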
Operating Modes –
Local Cluster – Development
Remote Cluster – Production
6. Application Components
Spouts and Bolts process streams
Stream is an unbounded sequence of tuples
Tuple – named list of values (each value is read by field name or position)
Topology – abstraction that defines the network of computation;
it is a graph of Spouts and Bolts
Can deploy Topology to Storm Cluster using Storm executable
storm jar <code.jar> com.fis.YourTopology arg1 arg2
storm kill <topology-name>
storm activate <topology-name>
storm deactivate <topology-name>
storm rebalance <topology-name> -w <wait_time_secs> -n <worker_count> -e <component_name>=<executor_count>
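To make the tuple definition concrete, here is a minimal stdlib-only sketch of a tuple as an ordered list of values whose positions are named by a declared schema. NamedTuple is a hypothetical stand-in written for this illustration, not Storm's own Tuple class; only the two accessor names are borrowed from it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Storm tuple: an ordered list of values
// whose positions are named by a declared schema (the "fields").
class NamedTuple {
    private final List<String> fields;
    private final List<Object> values;

    NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("schema and values differ in length");
        }
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by its field name.
    Object getValueByField(String field) {
        int i = fields.indexOf(field);
        if (i < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(i);
    }

    // Look a value up by its position in the schema.
    Object getValue(int index) {
        return values.get(index);
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("word", "count");
        NamedTuple t = new NamedTuple(schema, Arrays.<Object>asList("storm", 3));
        System.out.println(t.getValueByField("word")); // prints "storm"
        System.out.println(t.getValue(1));             // prints "3"
    }
}
```

A stream is then simply an unbounded sequence of such tuples that all share one schema.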
8. Spout
Implements IRichSpout Interface
Methods
open(java.util.Map conf, TopologyContext context,
SpoutOutputCollector collector)
Called once when the spout task is initialized, before it starts emitting tuples
declareOutputFields(OutputFieldsDeclarer declarer)
Declare the output schema
nextTuple()
Emit tuples
ack(Object msgId) or fail(Object msgId)
Called when a tuple emitted with that message ID has been fully
processed (ack) or has failed or timed out (fail)
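The spout lifecycle can be sketched without any Storm dependency. The class below is a simplified model, assuming an in-memory list as the message source and integer message IDs; it does not implement IRichSpout, and ReplayingSpoutSketch and pendingCount() are hypothetical names. It shows why a reliable spout keeps emitted messages until they are acked, and replays them on fail.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stdlib-only model of a reliable spout; it does NOT implement IRichSpout.
class ReplayingSpoutSketch {
    private Deque<String> source;
    private final Map<Integer, String> pending = new HashMap<>();
    private int nextId = 0;

    // open(): one-time setup before any tuple is emitted.
    void open(List<String> messages) {
        source = new ArrayDeque<>(messages);
    }

    // nextTuple(): emit at most one message; returns its id, or -1 if idle.
    int nextTuple() {
        String msg = source.poll();
        if (msg == null) {
            return -1;
        }
        int id = nextId++;
        pending.put(id, msg); // remember it until it is acked or failed
        return id;
    }

    // ack(): the message was fully processed, so forget it.
    void ack(int id) {
        pending.remove(id);
    }

    // fail(): processing failed, so queue the message for replay.
    void fail(int id) {
        String msg = pending.remove(id);
        if (msg != null) {
            source.addLast(msg);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ReplayingSpoutSketch spout = new ReplayingSpoutSketch();
        spout.open(Arrays.asList("a", "b"));
        int first = spout.nextTuple();
        int second = spout.nextTuple();
        spout.ack(first);
        spout.fail(second);               // "b" goes back into the source
        int replayed = spout.nextTuple(); // "b" is emitted again under a new id
        System.out.println(replayed + " pending=" + spout.pendingCount());
    }
}
```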
9. Bolts
Implements IRichBolt Interface
Methods
declareOutputFields(OutputFieldsDeclarer declarer)
Declare the output schema for this bolt
prepare(java.util.Map conf, TopologyContext context,
OutputCollector collector)
Called just before the bolt starts processing tuples
execute(Tuple input)
Process a single tuple of input
cleanup()
Called when a bolt is going to shut down
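The bolt lifecycle can be modelled the same way. The sketch below is a stdlib-only word counter, assuming each input tuple carries a single word; CountingBoltSketch is a hypothetical name and the class does not implement IRichBolt. It only illustrates the order of the calls: prepare once, execute once per tuple, cleanup at shutdown.

```java
import java.util.HashMap;
import java.util.Map;

// Stdlib-only model of a bolt's lifecycle; it does NOT implement IRichBolt.
class CountingBoltSketch {
    private Map<String, Integer> counts;

    // prepare(): one-time setup before the first tuple arrives.
    void prepare() {
        counts = new HashMap<>();
    }

    // execute(): process exactly one input and return the updated count.
    int execute(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    // cleanup(): release state at shutdown; here it hands back a final snapshot.
    Map<String, Integer> cleanup() {
        Map<String, Integer> snapshot = new HashMap<>(counts);
        counts.clear();
        return snapshot;
    }

    public static void main(String[] args) {
        CountingBoltSketch bolt = new CountingBoltSketch();
        bolt.prepare();
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            bolt.execute(word);
        }
        System.out.println(bolt.cleanup().get("storm")); // prints "2"
    }
}
```

In a real cluster, Storm does not guarantee that cleanup() is called (for example when a worker is killed), so a bolt should not rely on it to persist results.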
10. Parallelism – Key Terms
Node
Machine that participates in a Storm cluster
Executes a portion of a Storm topology
Workers (independent JVM processes)
Executors (threads that run within a worker JVM process)
Tasks (instances of a spout or bolt, run by an executor)
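The relationship between workers, executors and tasks is mostly arithmetic. The sketch below assumes an even round-robin spread, which is a simplification and not Storm's actual scheduler; ParallelismSketch and spread() are hypothetical names, and the numbers (4 tasks, 2 executors, 10 executors, 2 workers) are illustrative only.

```java
import java.util.Arrays;

// Simplified model of how tasks map onto executors and executors onto workers.
class ParallelismSketch {
    // Spread `items` over `slots` as evenly as possible (round-robin).
    static int[] spread(int items, int slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        int[] perSlot = new int[slots];
        for (int i = 0; i < items; i++) {
            perSlot[i % slots]++;
        }
        return perSlot;
    }

    public static void main(String[] args) {
        // A bolt with 4 tasks run by 2 executors: 2 tasks per thread.
        System.out.println(Arrays.toString(spread(4, 2)));  // prints "[2, 2]"
        // A topology with 10 executors on 2 workers: 5 threads per JVM.
        System.out.println(Arrays.toString(spread(10, 2))); // prints "[5, 5]"
    }
}
```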
13. Grouping
Shuffle
Distributes tuples randomly and evenly across the target bolt's tasks
Fields
Tuples with the same value for the grouping field(s) are routed to the same bolt task
All
Replicates the tuple stream across all bolt tasks
Global
Routes all tuples in a stream to a single task
Direct
The emitting component decides which task of the receiving bolt gets a
given tuple by calling the emitDirect() method
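The groupings differ only in how a target task is chosen for each tuple. The sketch below models that choice over task indexes 0..numTasks-1 using stdlib Java only. It is not Storm code: GroupingSketch and its method names are hypothetical, the hash-modulo rule for fields grouping is an assumed simplification of what the runtime does, and global grouping is modelled as always picking index 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified model of how each grouping picks target task indexes.
// Not Storm code: the real routing happens inside the Storm runtime.
class GroupingSketch {
    // Shuffle: any task, chosen at random.
    static int shuffle(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    // Fields: equal field values always map to the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // All: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            targets.add(t);
        }
        return targets;
    }

    // Global: the whole stream goes to one fixed task (index 0 in this model).
    static int global() {
        return 0;
    }

    // Direct: the emitter names the receiving task itself.
    static int direct(int chosenTask, int numTasks) {
        if (chosenTask < 0 || chosenTask >= numTasks) {
            throw new IllegalArgumentException("no such task: " + chosenTask);
        }
        return chosenTask;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // The same key lands on the same task every time.
        System.out.println(fields("user-1", numTasks) == fields("user-1", numTasks)); // prints "true"
        System.out.println(all(numTasks));                                            // prints "[0, 1, 2, 3]"
        System.out.println(shuffle(new Random(), numTasks) < numTasks);               // prints "true"
    }
}
```

Fields grouping is the usual choice when a bolt keeps per-key state, such as a running count per word, because every tuple for a given key must reach the same task.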
14. MongoDB
Document oriented database
Has collections
Collections hold documents
Documents
are stored in a JSON-like format (BSON)
can have a dynamic schema
Start Server:
<mongo install dir>/bin/mongod --dbpath <db location>
Start Client:
<mongo install dir>/bin/mongo
15. In theaters near you…
Microsoft Azure supports Storm
SCP.Net (Stream Computing Platform) in Azure and
HDInsight
URL: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-storm-scpdotnet-csharp-develop-streaming-data-processing-application/