Storm
Real-time computation made easy
Michael Vogiatzis
What’s Storm?
• Distributed real-time computation system
• Fault tolerant
• Fast
• Scalable
• Guaranteed message processing
• Open source
• Multilang capabilities
Purpose
Ok but why?
Motivation
• Queues – Workers paradigm
• Scaling is hard
• System is not robust
• Coding is not fun!
– No abstraction
– Low level message passing
– Intermediate message brokers
Use cases
• Stream processing
– Consume a stream, update DBs, etc.
• Distributed RPC
– Parallelize intense functions on top of Storm
• Ongoing computation
– Computing music trends on Twitter
Architecture
Elements
• Streams
– Set of tuples
– Unbounded sequence of data
• Spout
– Source of streams
• Bolts
– Application logic
– Functions
– Streaming aggregations, joins, DB ops
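To make these roles concrete, here is a minimal, hypothetical wiring of a spout and a bolt with the core Storm API; TweetSpout and CounterBolt are made-up names, not part of the talk:

// Minimal sketch, assuming TweetSpout and CounterBolt are defined elsewhere
TopologyBuilder builder = new TopologyBuilder();

// the spout is the source of the stream
builder.setSpout("tweets", new TweetSpout(), 2);

// the bolt carries the application logic and subscribes to the spout's stream
builder.setBolt("counter", new CounterBolt(), 4)
       .shuffleGrouping("tweets");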
Topology
Storm UI
Demo
Unshorten URLs
Evil Shorteners
Demo
Trident
● Higher level of abstraction on top of Storm
● Batch processing
● Keeps state using your persistence store, e.g. DBs, Memcached, etc.
● Exactly-once semantics
● Tuples can be replayed!
● Similar API to Pig / Cascading
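To give a feel for that higher-level API, here is a word-count style sketch in Trident; the sentence spout and the Split function are assumed, not part of the talk:

// Minimal Trident sketch, assuming sentenceSpout emits a "sentence" field
// and Split breaks each sentence into words
TridentTopology topology = new TridentTopology();
topology.newStream("sentences", sentenceSpout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));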
Trident operations
Operation | Input fields | Function fields
Trident operations
● Joins
● Aggregations
● Grouping
● Functions
● Filtering
● Sorting
Trident State
● Solid API for reading from and writing to stateful sources
● State updates are idempotent
● Different kinds of fault tolerance depending on the Spout implementation
Learn by example
Compute a male vs. female count for a particular topic on Twitter, over time
Trident Gender
1. Stream of incoming tweets
2. Filter out tweets not relevant to the topic
3. Check gender by checking first name
4. Update either male or female counter
Input (Spout impl.)
● Receives the public stream (~1% of tweets) and emits them into the system
List<Object> tweets;

public void emitBatch(long batchId, TridentCollector collector) {
    // emit each received tweet as a single-field tuple
    for (Object o : tweets)
        collector.emit(new Values(o));
}
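The emitBatch method above would normally sit inside a batch spout; a hedged skeleton of such a class, assuming Trident's IBatchSpout interface, might look like this (names are illustrative):

// Hypothetical skeleton around the emitBatch shown above
public class TweetSpout implements IBatchSpout {
    List<Object> tweets;

    public void open(Map conf, TopologyContext context) { /* connect to the public stream */ }

    public void emitBatch(long batchId, TridentCollector collector) {
        for (Object o : tweets)
            collector.emit(new Values(o));
    }

    public void ack(long batchId) { /* batch fully processed */ }
    public void close() { }
    public Map getComponentConfiguration() { return null; }

    // the single output field consumed downstream as "status"
    public Fields getOutputFields() { return new Fields("status"); }
}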
Filter
Implement a Filter class called FilterWords
.each(new Fields("status"), new FilterWords(interestingWords))
public class FilterWords extends BaseFilter {
    String[] words = {"instagram", "flickr", "pinterest", "picasa"};

    public boolean isKeep(TridentTuple tuple) {
        Tweet t = (Tweet) tuple.getValue(0);
        // keep only tweets mentioning one of the interesting words
        for (String word : words)
            if (t.getText().toLowerCase().contains(word))
                return true;
        return false;
    }
}
Function
Implement a Function class
.each(new Fields("status"), new ExpandName(), new Fields("name"))

Tuple before:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}]

Tuple after:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris"]
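The ExpandName function itself is not shown on the slide; a hedged sketch, assuming the Tweet object exposes the full name via a getter, could be:

// Hypothetical implementation: emits the first name as the new "name" field
public class ExpandName extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        Tweet t = (Tweet) tuple.getValue(0);
        String firstName = t.getFullname().split(" ")[0];  // "Iris HappyWorker" -> "Iris"
        collector.emit(new Values(firstName));
    }
}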
State Query
Implement a QueryFunction to query the persistent store.
.stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
    // collect all first names in the batch and query the gender store once
    List<String> batchToQuery = new ArrayList<String>();
    for (TridentTuple t : tuples) {
        String name = t.getStringByField("name");
        batchToQuery.add(name);
    }
    return state.getGenders(batchToQuery);
}
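Besides batchRetrieve, a QueryFunction also implements execute(), which pairs each tuple with the value retrieved for it; a minimal sketch for QueryGender could be:

// Emits the gender retrieved for this tuple as the new "gender" field
public void execute(TridentTuple tuple, String gender, TridentCollector collector) {
    collector.emit(new Values(gender));
}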
State Query
Tuple before:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris"]

Tuple after:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}, "Iris", "Female"]
Grouping
● .groupBy(new Fields("gender"))
● Groups tuples with the same gender value together
● Re-partitions the stream
● Tuples are sent over the network
Grouping
● Tuples before:
1st Partition: [{TweetJson1}, "Iris", "Female"]
1st Partition: [{TweetJson2}, "Michael", "Male"]
2nd Partition: [{TweetJson3}, "Lena", "Female"]
Group By Gender
● Tuples after:
new 1st Partition: [{TweetJson1}, "Iris", "Female"]
new 1st Partition: [{TweetJson3}, "Lena", "Female"]
new 2nd Partition: [{TweetJson2}, "Michael", "Male"]
Aggregators (general case)
● Run the init() function before processing the batch
● Aggregate over a number of tuples (usually grouped by a field beforehand) and emit one or more results from the aggregate method
public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}
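As an illustration only, a count written against this interface (using Trident's BaseAggregator helper) might look like:

// Hypothetical count implemented as a full Aggregator
public class CountAggregator extends BaseAggregator<CountAggregator.CountState> {
    static class CountState { long count = 0; }

    public CountState init(Object batchId, TridentCollector collector) {
        return new CountState();                   // fresh state per batch/group
    }
    public void aggregate(CountState state, TridentTuple tuple, TridentCollector collector) {
        state.count++;                             // one more tuple seen
    }
    public void complete(CountState state, TridentCollector collector) {
        collector.emit(new Values(state.count));   // emit the final count
    }
}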
Combiner Aggregator
● Run init(TridentTuple t) on every tuple
● Run the combine method on pairs of values until no tuples are left, then return a single value.
public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) {
        return 1L;
    }
    public Long combine(Long val1, Long val2) {
        return val1 + val2;
    }
    public Long zero() {
        return 0L;
    }
}
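On a grouped stream, this combiner would typically be applied with a plain (non-persistent) aggregate call:

// counts the tuples of each "gender" group within the batch
.groupBy(new Fields("gender"))
.aggregate(new Count(), new Fields("count"))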
Reducer Aggregator
● Run init() to get an initial value
● Iterate over the tuples, reducing them into a single result
public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}
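For comparison, the same count expressed as a ReducerAggregator could look like this sketch:

// Hypothetical count implemented as a ReducerAggregator
public class CountReducer implements ReducerAggregator<Long> {
    public Long init() {
        return 0L;                 // starting value
    }
    public Long reduce(Long curr, TridentTuple tuple) {
        return curr + 1;           // one more tuple folded in
    }
}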
Back to the example
● For each gender batch run Count() aggregator
● Not only aggregate, but also store the value in memory
● Why?
● “Over time count”
persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
Putting it all together
TridentState genderDB = topology.newStaticState(new GenderDBFactory());

Stream gender = topology.newStream("spout", spout)
    .each(new Fields("status"), new Filter(topicWords))
    .each(new Fields("status"), new ExpandName(), new Fields("name"))
    .parallelismHint(4)
    .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
    .parallelismHint(10)
    .groupBy(new Fields("gender"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .newValuesStream();
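To actually run it, the Trident topology is built into a regular Storm topology and submitted; a minimal local-mode sketch (the topology name here is made up) could be:

Config conf = new Config();

// local mode for development; StormSubmitter.submitTopology(...) would be used on a real cluster
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("gender-count", conf, topology.build());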
Demo
Gender count
Some downsides
• Hard debugging
➢ pseudo-distributed mode helps, but still..
• Object serialization
➢ needed when using 3rd-party libraries
➢ register your own serializers (e.g. with Kryo) for better performance
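As a rough sketch of that registration, assuming the Tweet class from the example, custom classes can be registered through the Storm Config:

Config conf = new Config();

// register the custom Tweet class so Kryo can serialize it between workers;
// a dedicated Serializer implementation can also be registered for more control
conf.registerSerialization(Tweet.class);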
I didn't tackle
• Reliability
– Guaranteed message processing
• Distributed RPC example
• The storm-deploy companion project
– One-click automated Storm cluster deployment, e.g. on EC2
Contributions
Overall
• Express your realtime needs naturally
• Growing community
• System rapidly improving
• Not a Hadoop/MR competitor
• Fun to use
Resources
● Storm Unshortening example
https://github.com/mvogiatzis/storm-unshortening
● Understanding the Storm Parallelism
http://bit.ly/RCx4Ln
● http://storm-project.net/
● https://github.com/nathanmarz/storm
The End
Michael Vogiatzis
Follow me @mvogiatzis
Q & A
