Storm
Real-time computation made easy
Michael Vogiatzis
What’s Storm?
• Distributed real-time computation system
• Fault tolerant
• Fast
• Scalable
• Guaranteed message processing
• Open source
• Multilang capabilities
Purpose
Ok but why?
Motivation
• Queues – Workers paradigm
• Scaling is hard
• System is not robust
• Coding is not fun!
– No abstraction
– Low level message passing
– Intermediate message brokers
Use cases
• Stream processing
– Consume a stream, update DBs, etc.
• Distributed RPC
– Parallelize intense functions on top of Storm
• Ongoing computation
– Computing music trends on Twitter
Architecture
Elements
• Streams
– Set of tuples
– Unbounded sequence of data
• Spout
– Source of streams
• Bolts
– Application logic
– Functions
– Streaming aggregations, joins, DB ops
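To make these roles concrete, here is a minimal, hypothetical wiring of a spout and a bolt with the core Storm API; TweetSpout and CounterBolt are made-up names, not part of the talk:

// Minimal sketch, assuming TweetSpout and CounterBolt are defined elsewhere
TopologyBuilder builder = new TopologyBuilder();

// the spout is the source of the stream
builder.setSpout("tweets", new TweetSpout(), 2);

// the bolt carries the application logic and subscribes to the spout's stream
builder.setBolt("counter", new CounterBolt(), 4)
       .shuffleGrouping("tweets");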
Topology
Storm UI
Demo
Unshorten URLs
Evil Shorteners
Demo
Trident
● Higher level of abstraction on top of Storm
● Batch processing
● Keeps state using your persistence store, e.g. DBs, Memcached, etc.
● Exactly-once semantics
● Tuples can be replayed!
● Similar API to Pig / Cascading
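To give a feel for that higher-level API, here is a word-count style sketch in Trident; the sentence spout and the Split function are assumed, not part of the talk:

// Minimal Trident sketch, assuming sentenceSpout emits a "sentence" field
// and Split breaks each sentence into words
TridentTopology topology = new TridentTopology();
topology.newStream("sentences", sentenceSpout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));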
Trident operations
Operation | Input fields | Function fields
Trident operations
● Joins
● Aggregations
● Grouping
● Functions
● Filtering
● Sorting
Trident State
● Solid API for reading from and writing to stateful sources
● State updates are idempotent
● Different kinds of fault tolerance depending on the Spout implementation
Learn by example
Compute a male vs. female count for a particular topic on Twitter, over time
Trident Gender
1. Stream of incoming tweets
2. Filter out tweets not relevant to the topic
3. Check gender by checking first name
4. Update either male or female counter
Input (Spout impl.)
● Receives the public stream (~1% of tweets) and emits them into the system
List<Object> tweets;

public void emitBatch(long batchId, TridentCollector collector) {
    // emit each received tweet as a single-field tuple
    for (Object o : tweets)
        collector.emit(new Values(o));
}
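The emitBatch method above would normally sit inside a batch spout; a hedged skeleton of such a class, assuming Trident's IBatchSpout interface, might look like this (names are illustrative):

// Hypothetical skeleton around the emitBatch shown above
public class TweetSpout implements IBatchSpout {
    List<Object> tweets;

    public void open(Map conf, TopologyContext context) { /* connect to the public stream */ }

    public void emitBatch(long batchId, TridentCollector collector) {
        for (Object o : tweets)
            collector.emit(new Values(o));
    }

    public void ack(long batchId) { /* batch fully processed */ }
    public void close() { }
    public Map getComponentConfiguration() { return null; }

    // the single output field consumed downstream as "status"
    public Fields getOutputFields() { return new Fields("status"); }
}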
Filter
Implement a Filter class called FilterWords
.each(new Fields("status"), new FilterWords(interestingWords))
public class FilterWords extends BaseFilter {
    String[] words = {"instagram", "flickr", "pinterest", "picasa"};

    public boolean isKeep(TridentTuple tuple) {
        Tweet t = (Tweet) tuple.getValue(0);
        // keep only tweets mentioning one of the interesting words
        for (String word : words)
            if (t.getText().toLowerCase().contains(word))
                return true;
        return false;
    }
}
Function
Implement a Function class
.each(new Fields("status"), new ExpandName(), new Fields("name"))

Tuple before:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}]

Tuple after:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris"]
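The ExpandName function itself is not shown on the slide; a hedged sketch, assuming the Tweet object exposes the full name via a getter, could be:

// Hypothetical implementation: emits the first name as the new "name" field
public class ExpandName extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        Tweet t = (Tweet) tuple.getValue(0);
        String firstName = t.getFullname().split(" ")[0];  // "Iris HappyWorker" -> "Iris"
        collector.emit(new Values(firstName));
    }
}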
State Query
Implement a QueryFunction to query the persistent store.
.stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
    // collect all first names in the batch and query the gender store once
    List<String> batchToQuery = new ArrayList<String>();
    for (TridentTuple t : tuples) {
        String name = t.getStringByField("name");
        batchToQuery.add(name);
    }
    return state.getGenders(batchToQuery);
}
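Besides batchRetrieve, a QueryFunction also implements execute(), which pairs each tuple with the value retrieved for it; a minimal sketch for QueryGender could be:

// Emits the gender retrieved for this tuple as the new "gender" field
public void execute(TridentTuple tuple, String gender, TridentCollector collector) {
    collector.emit(new Values(gender));
}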
State Query
Tuple before:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris"]

Tuple after:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris", "Female"]
Grouping
● .groupBy(new Fields("gender"))
● Groups tuples with the same gender value together
● Re-partitions the stream
● Tuples are sent over the network
Grouping
● Tuples before:
1st Partition: [{TweetJson1}, "Iris", "Female"]
1st Partition: [{TweetJson2}, "Michael", "Male"]
2nd Partition: [{TweetJson3}, "Lena", "Female"]
Group By Gender
● Tuples after:
new 1st Partition: [{TweetJson1}, "Iris", "Female"]
new 1st Partition: [{TweetJson3}, "Lena", "Female"]
new 2nd Partition: [{TweetJson2}, "Michael", "Male"]
Aggregators (general case)
● Run the init() function before processing the batch
● Aggregate over a number of tuples (usually grouped by a field beforehand) and emit one or more results from the aggregate method
public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}
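As an illustration only, a count written against this interface (using Trident's BaseAggregator helper) might look like:

// Hypothetical count implemented as a full Aggregator
public class CountAggregator extends BaseAggregator<CountAggregator.CountState> {
    static class CountState { long count = 0; }

    public CountState init(Object batchId, TridentCollector collector) {
        return new CountState();                   // fresh state per batch/group
    }
    public void aggregate(CountState state, TridentTuple tuple, TridentCollector collector) {
        state.count++;                             // one more tuple seen
    }
    public void complete(CountState state, TridentCollector collector) {
        collector.emit(new Values(state.count));   // emit the final count
    }
}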
Combiner Aggregator
● Run init(TridentTuple t) on every tuple
● Run the combine method on pairs of values until no tuples are left, then return a single value.
public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) {
        return 1L;
    }
    public Long combine(Long val1, Long val2) {
        return val1 + val2;
    }
    public Long zero() {
        return 0L;
    }
}
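On a grouped stream, this combiner would typically be applied with a plain (non-persistent) aggregate call:

// counts the tuples of each "gender" group within the batch
.groupBy(new Fields("gender"))
.aggregate(new Count(), new Fields("count"))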
Reducer Aggregator
● Run init() to get an initial value
● Iterate over the tuples, reducing them into a single result
public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}
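For comparison, the same count expressed as a ReducerAggregator could look like this sketch:

// Hypothetical count implemented as a ReducerAggregator
public class CountReducer implements ReducerAggregator<Long> {
    public Long init() {
        return 0L;                 // starting value
    }
    public Long reduce(Long curr, TridentTuple tuple) {
        return curr + 1;           // one more tuple folded in
    }
}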
Back to the example
● For each gender batch run Count() aggregator
● Not only aggregate, but also store the value in memory
● Why?
● “Over time count”
persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
Putting it all together
TridentState genderDB = topology.newStaticState(new GenderDBFactory());

Stream gender = topology.newStream("spout", spout)
    .each(new Fields("status"), new Filter(topicWords))
    .each(new Fields("status"), new ExpandName(), new Fields("name"))
    .parallelismHint(4)
    .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
    .parallelismHint(10)
    .groupBy(new Fields("gender"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .newValuesStream();
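To actually run it, the Trident topology is built into a regular Storm topology and submitted; a minimal local-mode sketch (the topology name here is made up) could be:

Config conf = new Config();

// local mode for development; StormSubmitter.submitTopology(...) would be used on a real cluster
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("gender-count", conf, topology.build());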
Demo
Gender count
Some downsides
• Hard debugging
➢ pseudo-distributed mode helps, but still..
• Object serialization
➢ needed when using 3rd-party libraries
➢ register your own serializers (e.g. with Kryo) for better performance
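As a rough sketch of that registration, assuming the Tweet class from the example, custom classes can be registered through the Storm Config:

Config conf = new Config();

// register the custom Tweet class so Kryo can serialize it between workers;
// a dedicated Serializer implementation can also be registered for more control
conf.registerSerialization(Tweet.class);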
I didn't tackle
• Reliability
– Guaranteed message processing
• Distributed RPC example
• The storm-deploy companion project
– One-click automated Storm cluster deployment, e.g. on EC2
Contributions
Overall
• Express your realtime needs naturally
• Growing community
• System rapidly improving
• Not a Hadoop/MR competitor
• Fun to use
Resources
● Storm Unshortening example
https://github.com/mvogiatzis/storm-unshortening
● Understanding the Storm Parallelism
http://bit.ly/RCx4Ln
● http://storm-project.net/
● https://github.com/nathanmarz/storm
The End
Michael Vogiatzis
Follow me @mvogiatzis
Q & A
