SERC – CADL
Indian Institute of Science
Bangalore, India
TWITTER STORM
Real Time, Fault Tolerant Distributed Framework
Created : 25th May, 2013
SONAL RAJ
National Institute of Technology,
Jamshedpur, India
Background
• Created by Nathan Marz @ BackType/Twitter
• Analyzes tweets, links, and users on Twitter
• Open-sourced in Sep 2011
• Eclipse Public License 1.0
• Storm 0.5.2
• 16k Java and 7k Clojure LOC
• Current stable release 0.8.2
• 0.9.0 brings major core improvements
Background
• Active user group
• https://groups.google.com/group/storm-user
• https://github.com/nathanmarz/storm
• Most-watched Java repo on GitHub (>4k watchers)
• Used by over 30 companies
• Twitter, Groupon, Alibaba, GumGum, ..
What led to storm . .
Problems . . .
•Scale is painful
•Poor fault-tolerance
• Hadoop is stateful
•Coding is tedious
•Batch processing
• Long latency
• No real-time processing
Storm . . .Problems Solved !!
•Scalable and robust
• No persistent layer
•Guarantees no data loss
•Fault-tolerant
•Programming language agnostic
•Use case
• Stream processing
• Distributed RPC
• Continuous computation
STORM FEATURES
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher-level abstraction than message passing
• "Just works"
Storm’s edge over hadoop
HADOOP
• Batch processing
• Jobs run to completion
• JobTracker is SPOF*
• Stateful nodes
• Scalable
• Guarantees no data loss
• Open source

STORM
• Real-time processing
• Topologies run forever
• No single point of failure
• Stateless nodes
• Scalable
• Guarantees no data loss
• Open source
* Hadoop 0.21 added some checkpointing
SPOF: Single Point Of Failure
Streaming Computation
Paradigm of stream computation: Queues / Workers
[Diagram: the general method routes messages through message queues to workers; message routing can become complex]
storm use cases
COMPONENTS
• Nimbus daemon is comparable to Hadoop JobTracker. It is
the master
• Supervisor daemon spawns workers, it is comparable to
Hadoop TaskTracker
• Worker is spawned by supervisor, one per port defined in
storm.yaml configuration
• Task is run as a thread in workers
• Zookeeper is a distributed system, used to store metadata.
Nimbus and Supervisor daemons are fail-fast and stateless.
All state is kept in Zookeeper.
Notice that all communication between Nimbus and
Supervisors is done through Zookeeper.
On a cluster with 2k+1 Zookeeper nodes, the
system can recover when at most k nodes fail.
STORM ARCHITECTURE
Storm architecture
Master Node ( Similar to Hadoop Job-Tracker )
STORM ARCHITECTURE
Used for Cluster Co-ordination
STORM ARCHITECTURE
Runs Worker Nodes / Processes
CONCEPTS
• Streams
• Topology
• A spout
• A bolt
• An edge represents a grouping
streams
spouts
• Example
• Read from logs, API calls,
event data, queues, …
SPOUTS
•Interface ISpout
Method summary:
• void ack(java.lang.Object msgId) : Storm has determined that the tuple emitted by this spout with the msgId identifier has been fully processed.
• void activate() : Called when a spout has been activated out of a deactivated mode.
• void close() : Called when an ISpout is going to be shut down.
• void deactivate() : Called when a spout has been deactivated.
• void fail(java.lang.Object msgId) : The tuple emitted by this spout with the msgId identifier has failed to be fully processed.
• void nextTuple() : When this method is called, Storm is requesting that the spout emit tuples to the output collector.
• void open(java.util.Map conf, TopologyContext context, SpoutOutputCollector collector) : Called when a task for this component is initialized within a worker on the cluster.
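For illustration, a minimal spout sketch against the backtype.storm API of the Storm 0.8.x era (not from the original deck; the RandomSentenceSpout name and sample sentences are made up):

import java.util.Map;
import java.util.Random;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private Random _rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Called once when this task is initialized within a worker
        _collector = collector;
        _rand = new Random();
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; emit one tuple per call
        String[] sentences = { "the cow jumped over the moon", "an apple a day keeps the doctor away" };
        _collector.emit(new Values(sentences[_rand.nextInt(sentences.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

BaseRichSpout provides no-op defaults for ack(), fail(), activate(), deactivate() and close().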
Bolts
•Bolts
• Processes input streams and produces new streams
• Example
• Stream Joins, DBs, APIs, Filters, Aggregation, …
BOLTS
• Interface IBolt
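For reference, the corresponding bolt interface (backtype.storm.task.IBolt in Storm 0.8.x) is roughly the following sketch; check the Javadoc for the exact signatures:

public interface IBolt extends Serializable {
    // Called when a task for this component is initialized within a worker
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
    // Process a single input tuple
    void execute(Tuple input);
    // Called when the bolt is shut down
    void cleanup();
}

IRichBolt additionally declares output fields via declareOutputFields(); IBasicBolt (used by WordCount later in this deck) acks tuples automatically.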
TOPOLOGY
•Topology
• is a graph where each node is a spout or bolt, and the edges
indicate which bolts are subscribing to which streams.
TASKS
• Parallelism is implemented by running multiple instances of each spout
and bolt as similar, simultaneous tasks. Spouts and bolts execute as
many tasks across the cluster.
• Managed by the supervisor daemon
Stream groupings
When a tuple is emitted, which task
does it go to?
Stream grouping
Shuffle grouping: pick a random task
Fields grouping: consistent hashing on a
subset of tuple fields
All grouping: send to all tasks
Global grouping: pick task with lowest id
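A short sketch of how these groupings are declared when wiring a topology (the component ids and MyBolt* classes are placeholders, not from the deck):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);

// Shuffle grouping: each tuple goes to a random task of "boltA"
builder.setBolt("boltA", new MyBoltA(), 8).shuffleGrouping("spout");
// Fields grouping: tuples with the same "word" value always reach the same task
builder.setBolt("boltB", new MyBoltB(), 12).fieldsGrouping("boltA", new Fields("word"));
// All grouping: every task of "boltC" receives a copy of each tuple
builder.setBolt("boltC", new MyBoltC()).allGrouping("boltA");
// Global grouping: all tuples go to the single task with the lowest id
builder.setBolt("boltD", new MyBoltD()).globalGrouping("boltB");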
example : streaming word count
• TopologyBuilder is used to construct topologies in Java.
• Define a Spout in the Topology with parallelism of 5 tasks.
abstraction : DRPC
Consumer decides what data it receives and how it gets
grouped
• Split Sentences into words with parallelism of 8 tasks.
• Create a word count stream
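Putting the bullets above together, the wiring likely looked roughly like this (a sketch assuming the SplitSentence and WordCount classes shown on the next slides and a sentence-emitting spout; the count bolt's parallelism of 12 is illustrative):

TopologyBuilder builder = new TopologyBuilder();

// Spout with a parallelism of 5 tasks
builder.setSpout("spout", new RandomSentenceSpout(), 5);

// Split sentences into words with a parallelism of 8 tasks
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");

// Word count stream: fields grouping so each word is always counted by the same task
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));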
ABSTRACTION : DRPC
public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])
INSIDE A BOLT ..
public static class WordCount implements IBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    public void prepare(Map conf, TopologyContext context) {
    }

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    public void cleanup() {
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
abstraction : DRPC
• Submitting Topologies to the cluster
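The code on this slide is not legible in this copy; cluster submission typically looks like the following sketch (topology name and settings are illustrative, builder is the TopologyBuilder from the word-count example):

Config conf = new Config();
conf.setNumWorkers(20);
conf.setMaxSpoutPending(5000);

// Submits to the cluster whose Nimbus host is configured in storm.yaml
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());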
abstraction : DRPC
• Running the Topology in Local Mode
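Local mode runs the whole topology inside one JVM; a sketch (names and the 10-second run time are illustrative):

Config conf = new Config();
conf.setDebug(true);

// LocalCluster simulates a Storm cluster in-process, for development and debugging
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());

Utils.sleep(10000);              // let the topology run for a while
cluster.killTopology("word-count");
cluster.shutdown();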
Fault-Tolerance
• Zookeeper stores metadata in a very robust way
• Nimbus and Supervisor are stateless and only need metadata from ZK to
work/restart
• When a node dies
• The tasks will time out and be reassigned to other workers by Nimbus.
• When a worker dies
• The supervisor will restart the worker.
• Nimbus will reassign the worker to another supervisor if no heartbeats are
sent.
• If that is not possible (no free ports), the tasks will be run on other workers in
the topology. If more capacity is added to the cluster later, Storm will
automatically initialize a new worker and spread out the tasks.
• When Nimbus or a supervisor dies
• Workers will continue to run
• Workers cannot be reassigned without Nimbus
• Nimbus and Supervisor should be run under a process-monitoring tool that
restarts them automatically if they fail.
AT LEAST ONCE Processing
• STORM guarantees at-least-once processing of tuples.
• A message id gets assigned to a tuple when it is emitted from a spout or bolt; it is
64 bits long.
• Tree of tuples is the tuples generated (directly and indirectly) from a spout tuple.
• Ack is called on spout, when tree of tuples for spout tuple is fully processed.
• Fail is called on spout, if one of the tuples in the tree of tuples fails or the tree of
tuples is not fully processed within a specified timeout (default is 30 seconds).
• It is possible to specify the message id, when emitting a tuple. This might be
useful for replaying tuples from a queue.
The ack/fail method is called when the tree of
tuples has been fully processed or has
failed / timed out.
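For example, a spout that wants replay from a queue can pass a message id as the second argument when emitting (a sketch; msgId would typically be the queue's own message identifier):

// With a message id, Storm will call ack(msgId) or fail(msgId) back on this spout
_collector.emit(new Values(sentence), msgId);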
AT Least once processing
• Anchoring is used to copy the spout tuple message id(s) to the new
tuples generated. In this way, every tuple knows the message id(s) of all
spout tuples.
• Multi-anchoring is when multiple tuples are anchored. If the tuple tree
fails, then multiple spout tuples will be replayed. Useful for doing
streaming joins and more.
• Ack called from a bolt indicates the tuple has been processed as
intended
• Fail called from a bolt, replays the spout tuple(s)
• Every tuple must be acked/failed or the task will run out of memory at
some point.
_collector.emit(tuple, new Values(word));   // uses anchoring
_collector.emit(new Values(word));          // does NOT use anchoring
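A small sketch of a reliable bolt that anchors its output tuples and acks its input (class name and fields are illustrative, not from the deck):

public static class ReliableSplitSentence extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        for (String word : tuple.getString(0).split(" ")) {
            // Anchor each emitted word to the input tuple
            _collector.emit(tuple, new Values(word));
        }
        // Mark this node of the tuple tree as complete
        _collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}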
exactly once processing
• Transactional topologies (TT) are an abstraction built on STORM primitives.
• TT guarantees exactly-once-processing of tuples.
• Acking is optimized in TT, no need to do anchoring or acking manually.
• Bolts execute as new instances per attempt of processing a batch
• Example
[Diagram: a spout (task 1) connected via an all grouping to two bolt tasks (task 2 and task 3)]
1. A spout tuple is emitted to tasks 2 and 3
2. The worker responsible for task 3 fails
3. The supervisor restarts the worker
4. The spout tuple is replayed and emitted to tasks 2 and 3
5. Tasks 2 and 3 initiate new bolt instances because of the new attempt
Now there is no problem
ABSTRACTION : DRPC
[Diagram: Distributed RPC architecture. The DRPC server receives "args" from a client and hands the topology a tuple of ["request-id", "args", "return-info"]; the topology computes and returns ["request-id", "result"], and the DRPC server sends "result" back to the client.]
Distributed RPC Architecture
WHY DRPC ?
Before Distributed RPC, time-sensitive queries relied
on a pre-computed index
Storm does away with the indexing!
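The canonical minimal DRPC example from the Storm documentation looks roughly like this (a sketch; the "exclamation" function and class names are from the standard tutorial, not this deck):

public static class ExclaimBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // tuple[0] is the request id, tuple[1] is the argument string
        collector.emit(new Values(tuple.getValue(0), tuple.getString(1) + "!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "result"));
    }
}

public static void main(String[] args) {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
    builder.addBolt(new ExclaimBolt(), 3);

    LocalDRPC drpc = new LocalDRPC();
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("drpc-demo", new Config(), builder.createLocalTopology(drpc));

    System.out.println(drpc.execute("exclamation", "hello"));   // prints "hello!"

    cluster.shutdown();
    drpc.shutdown();
}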
abstraction : DRPC example
• Calculating the “Reach” of a URL on the fly (in real time!)
• Written by Nathan Marz, the author of Storm
• A real-world application of Storm, open source, available
at http://github.com/nathanmarz/storm
• Reach is the number of unique people exposed to a URL
(tweet) on Twitter at any given time.
abstraction : DRPC >> computing reach
ABSTRACTION : DRPC >> REACH TOPOLOGY
[Diagram: Reach topology. A spout feeds the bolts via shuffle grouping; ["follower-id"] tuples are partitioned across tasks and finally combined with a global grouping.]
abstraction : DRPC >> Reach topology
Create the Topology for the DRPC
Implementation of Reach Computation
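A sketch of the reach topology wiring with LinearDRPCTopologyBuilder; the GetTweeters, GetFollowers and CountAggregator class names follow the storm-starter ReachTopology example and, along with the parallelism numbers, are assumptions here (PartialUniquer is shown on the next slide):

LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");

// Request URL -> the people who tweeted it
builder.addBolt(new GetTweeters(), 3);
// Tweeter -> followers, fanned out with a shuffle grouping
builder.addBolt(new GetFollowers(), 12).shuffleGrouping();
// De-duplicate followers per request id; fields grouping keeps a given follower on one task
builder.addBolt(new PartialUniquer(), 6).fieldsGrouping(new Fields("id", "follower"));
// Sum the partial counts per request id to get the reach
builder.addBolt(new CountAggregator(), 2).fieldsGrouping(new Fields("id"));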
ABSTRACTION : DRPC
public static class PartialUniquer implements IRichBolt, FinishedCallback {
    OutputCollector _collector;
    // Keep a set of followers for each request id in memory
    Map<Object, Set<String>> _sets = new HashMap<Object, Set<String>>();

    public void execute(Tuple tuple) {
        Object id = tuple.getValue(0);
        Set<String> curr = _sets.get(id);
        if (curr == null) {
            curr = new HashSet<String>();
            _sets.put(id, curr);
        }
        curr.add(tuple.getString(1));
        _collector.ack(tuple);
    }

    @Override
    public void finishedId(Object id) {
        Set<String> curr = _sets.remove(id);
        int count = 0;
        if (curr != null) count = curr.size();
        _collector.emit(new Values(id, count));
    }

    // prepare(), cleanup() and declareOutputFields() are omitted on the slide
}
guaranteeing message processing
Tuple Tree
Guaranteeing message processing
• A spout tuple is not fully processed until all tuples in
the tree have been completed.
• If the tuple tree is not completed within a specified
timeout, the spout tuple is replayed
• Storm provides a built-in Reliability API for this
Guaranteeing message processing
Acking marks a single node in
the tree as complete
“Anchoring” creates a
new edge in the tuple
tree
Storm tracks tuple trees for you in an extremely efficient way
Running a storm application
•Local Mode
• Runs on a single JVM
• Used for development testing and debugging
•Remote Mode
• Submits the topology to a Storm cluster, which has many processes
running on different machines.
• Doesn’t show debugging info, hence it is considered Production Mode.
STORM UI
[Screenshot: the Storm UI, showing the topology summary (executors with host, port, emitted/transferred counts and process latency), a component summary, bolt stats, and input/output stats (all time) per executor.]
DOCUMENTATION
[Screenshot: the nathanmarz/storm GitHub wiki home page.]
"Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!"
Read these first: Rationale, Setting up a development environment, Creating a new Storm project, Tutorial.
Getting help: feel free to ask questions on Storm's mailing list (https://groups.google.com/group/storm-user). You can also come to the #storm-user room on freenode, where you can usually find a Storm developer to help you out.
STORM LIBRARIES . .
STORM uses a lot of libraries. The most prominent are:
• Clojure : a Lisp-family programming language (a crash-course follows)
• Jetty : an embedded web server, used to host the UI of Nimbus
• Kryo : a fast serializer, used when sending tuples
• Thrift : a framework for building services; Nimbus is a Thrift daemon
• ZeroMQ : a very fast transport layer
• Zookeeper : a distributed coordination system, used to store metadata
References
• Twitter Storm
  • Nathan Marz
  • http://www.storm-project.org
• Storm
  • nathanmarz@github
  • http://www.github.com/nathanmarz/storm
• Realtime Analytics with Storm and Hadoop
  • Hadoop Summit
