2. what is storm?
storm is a platform for doing analysis on streams of
data as they come in, so you can react to data as it
happens.
Sunday, June 16, 13
3. storm v hadoop
storm & hadoop are complementary!
hadoop => big batch processing
storm => fast, reactive, real time processing
4. origins
• originated at backtype, acquired by twitter
in 2011.
• to vastly simplify dealing with queues &
workers.
9. what does storm provide?
• at least once message processing.
• horizontal scalability.
• no intermediate queues.
• less operational overhead.
• “just works”.
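The "at least once" guarantee above can be illustrated with a toy model. This is a sketch in plain Java, not Storm's API: the class `AtLeastOnceSketch`, the method `deliver` and the `failOnce` parameter are hypothetical names made up for the illustration. In Storm itself, replay is driven by the spout's ack and fail callbacks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of at-least-once delivery; hypothetical names, not Storm's API.
class AtLeastOnceSketch {

    // Delivers every message until it has been processed successfully.
    // Messages listed in failOnce fail on their first attempt and are replayed.
    // Returns the delivery log, including the attempts that failed.
    static List<String> deliver(List<String> messages, Set<String> failOnce) {
        Deque<String> pending = new ArrayDeque<>(messages);
        Set<String> alreadyFailed = new HashSet<>();
        List<String> deliveries = new ArrayList<>();
        while (!pending.isEmpty()) {
            String msg = pending.poll();
            deliveries.add(msg); // the message reached the processing code
            boolean fails = failOnce.contains(msg) && alreadyFailed.add(msg);
            if (fails) {
                pending.add(msg); // "fail": queue the message for replay
            }
            // otherwise "ack": the message is done and is dropped
        }
        return deliveries;
    }

    public static void main(String[] args) {
        List<String> log = deliver(Arrays.asList("a", "b", "c"), Collections.singleton("b"));
        System.out.println(log); // prints [a, b, c, b]
    }
}
```

"b" shows up twice in the log, once for the failed attempt and once for the replay. Nothing is lost, but processing code has to tolerate seeing the same message more than once.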
13. typical spouts
• read from a kestrel/kafka queue. {tuples = events}
• read from an http server log. {tuples = http requests}
• read from twitter streaming api. {tuples = tweets}
14. bolts
process input stream - A
produce output stream - B
A A A A A A A A -> bolt -> B B B B B B B B
15. bolts
• filtering tuples in a stream.
• aggregation of tuples.
• joining multiple streams.
• arbitrary functions on streams.
• communication with external caches/
dbs.
21. recap
• worker - process that executes a
subset of a topology.
• executor - a thread spawned by a
worker.
• task - performs the actual data
processing.
22. stream grouping
• shuffle grouping - random distribution
of tuples.
• fields grouping - groups tuples by a field, so equal values go to the same task.
• all grouping - replicates to all tasks.
• global grouping - sends the entire
stream to one task.
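The four groupings can be read as four ways of picking target tasks for a tuple. The following is a simplified model in plain Java, not Storm's implementation: the class `GroupingSketch` and its methods are hypothetical, tasks are assumed to be numbered from 0, and sending the global stream to task 0 is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified model of how each grouping chooses target tasks; not Storm's code.
class GroupingSketch {

    // shuffle grouping: any task, chosen at random.
    static int shuffle(Random rand, int numTasks) {
        return rand.nextInt(numTasks);
    }

    // fields grouping: the same field value always maps to the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // all grouping: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    // global grouping: the whole stream goes to one task (task 0 in this sketch).
    static int global() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        Random rand = new Random();
        for (String word : new String[] {"storm", "hadoop", "storm"}) {
            System.out.println(word
                + " shuffle=" + shuffle(rand, numTasks)
                + " fields=" + fields(word, numTasks)
                + " all=" + all(numTasks)
                + " global=" + global());
        }
    }
}
```

In the output, the two "storm" lines always show the same fields target, while their shuffle targets may differ from run to run.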
23. streaming word-count
TopologyBuilder builder = new TopologyBuilder();

// spout: emits random tweets, parallelism hint 5
builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);

// parse bolt: splits tweets into words, fed by shuffle grouping
builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
       .shuffleGrouping("tweet_spout")
       .setNumTasks(2);

// count bolt: fields grouping sends the same word to the same task
builder.setBolt("count_bolt", new WordCountBolt(), 12)
       .fieldsGrouping("parse_bolt", new Fields("word"));

Config config = new Config();
config.setNumWorkers(3);
StormSubmitter.submitTopology("demo", config, builder.createTopology());
24. tweet spout
class RandomTweetSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random rand;
    String[] tweets = new String[] {
        "@jkrums:There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
        "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
        ...
    };
    ....
    @Override
    public void nextTuple() {
        Utils.sleep(100);
        // pick a random tweet and emit it as a one-field tuple
        String tweet = tweets[rand.nextInt(tweets.length)];
        collector.emit(new Values(tweet));
    }
}
25. parse bolt
class ParseTweetBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String tweet = tuple.getString(0);
        // emit one tuple per space-separated word
        for (String word : tweet.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
26. word count bolt
class WordCountBolt extends BaseBasicBolt {
    // running counts, held in memory by this task
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        // emit the updated count for this word
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
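To see what the parse and count bolts emit, their logic can be traced without a cluster. This is a plain-Java walk-through with no Storm classes: `WordCountSketch`, `parse` and `count` are hypothetical names that mirror the two `execute` methods, and the input sentence is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java walk-through of the parse and count steps; no Storm classes involved.
class WordCountSketch {

    // Mirrors ParseTweetBolt.execute: one output per space-separated word.
    static List<String> parse(String tweet) {
        List<String> words = new ArrayList<>();
        for (String word : tweet.split(" ")) {
            words.add(word);
        }
        return words;
    }

    // Mirrors WordCountBolt.execute: update the running count, emit "word=count".
    static List<String> count(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (String word : words) {
            Integer count = counts.get(word);
            count = (count == null) ? 1 : count + 1;
            counts.put(word, count);
            emitted.add(word + "=" + count);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical input, not one of the tweets from the spout.
        System.out.println(count(parse("to be or not to be")));
        // prints [to=1, be=1, or=1, not=1, to=2, be=2]
    }
}
```

In the real topology the counts live in one map per WordCountBolt task. The fields grouping on "word" sends every occurrence of a word to the same task, which is what keeps these per-task counts correct.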
28. how do we run storm @twitter?
29. storm on mesos
we run multiple instances of storm on
the same cluster via mesos.
mesos provides efficient resource isolation and
sharing across distributed frameworks such as
storm.
[diagram: storm (production) and storm (dev) running on mesos, which spans four nodes]
30. topology isolation
the isolation scheduler solves the problem of
multi-tenancy: it avoids resource contention
between topologies by providing full isolation
between them.
31. topology isolation
• shared pool - multiple topologies can
run on the same host.
• isolated pool - dedicated set of hosts to
run a single topology.
39. numbers
• benchmarked at a million tuples
processed per second per node.
• running 30 topologies in a 200 node
cluster.
• processing 50 billion messages a day
with an average complete latency under 50
ms.
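A quick back-of-the-envelope calculation puts the daily volume into per-second terms. It assumes that the 50 billion messages and the 200 nodes describe the same cluster and that the load is spread evenly, neither of which the slide states. `ThroughputSketch` and `perSecond` are hypothetical names.

```java
// Back-of-the-envelope arithmetic on the figures above.
class ThroughputSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average messages per second for a given daily volume.
    static double perSecond(long messagesPerDay) {
        return (double) messagesPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long messagesPerDay = 50_000_000_000L; // 50 billion, from the slide
        int nodes = 200;                       // from the slide; same cluster is an assumption
        double clusterRate = perSecond(messagesPerDay);
        double perNodeRate = clusterRate / nodes;
        System.out.printf("cluster: %.0f msg/s, per node: %.0f msg/s%n", clusterRate, perNodeRate);
    }
}
```

Under those assumptions the average works out to roughly 580 thousand messages per second across the cluster, or about 2,900 per node, far below the benchmarked per-node peak.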
42. current use-cases
• discovery of emerging topics/stories.
• online learning of tweet features for search
result ranking.
• realtime analytics for ads.
• internal log processing.
43. tweet scoring pipeline
[diagram: tweet scoring pipeline]
• data streams: tweets, impressions, interactions.
• storm topology, with a graph store and a metadata store:
join: tweets, impressions; join: tweets, interactions; write tweet features.
• last 7 days of: tweet -> feature_val, feature_type, timestamp.
• persistent store: tweet -> feature_val, feature_type, timestamp.
• thrift service, cassandra, twemcache.
• input: tweet id; output: score.