storm at twitter

storm
stream processing @twitter
Krishna Gade
Twitter
@krishnagade
Sunday, June 16, 13

what is storm?
storm is a platform for doing analysis on streams of
data as they come in, so you can react to data as it
happens.

storm vs. hadoop
storm & hadoop are complementary!
hadoop => big batch processing
storm => fast, reactive, real time processing

origins
• originated at backtype, acquired by twitter
in 2011.
• built to vastly simplify dealing with queues &
workers.

queue-worker model
queues workers
a a a a a

typical workflow
queues queues
workers workers
data store

problems
• scaling is painful - queue partitioning &
worker deploy.
• operational overhead - worker
failures & queue backups.
• no guarantees on data processing.

what does storm provide?
• at-least-once message processing.
• horizontal scalability.
• no intermediate queues.
• less operational overhead.
• “just works”.
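a minimal sketch of what at-least-once processing means for a source of tuples: a message is remembered until it is acked, and queued again if it fails. this is a plain-java model, not storm code; ReplayingSource and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of a reliable source; not a Storm class.
class ReplayingSource {
    private final Queue<String> ready = new ArrayDeque<String>();
    private final Map<Long, String> pending = new HashMap<Long, String>();
    private long nextId = 0;

    void offer(String message) {
        ready.add(message);
    }

    // Hands out the next message under a fresh id and remembers it until acked.
    // Returns -1 when nothing is ready.
    long emit() {
        String message = ready.poll();
        if (message == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    String payload(long id) {
        return pending.get(id);
    }

    // Downstream processing finished: forget the message.
    void ack(long id) {
        pending.remove(id);
    }

    // Downstream processing failed or timed out: queue the message again.
    void fail(long id) {
        String message = pending.remove(id);
        if (message != null) {
            ready.add(message);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int readyCount() {
        return ready.size();
    }
}

class AtLeastOnceDemo {
    public static void main(String[] args) {
        ReplayingSource source = new ReplayingSource();
        source.offer("tweet-1");
        long first = source.emit();
        source.fail(first);           // simulate a failed worker
        long second = source.emit();  // same payload, new id
        System.out.println(source.payload(second)); // tweet-1
        source.ack(second);
        System.out.println(source.pendingCount());  // 0
    }
}
```

the replay is why the guarantee is "at least once" rather than "exactly once": a message that was partly processed before the failure is processed again.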

storm primitives
• streams
• spouts
• bolts
• topologies

streams
unbounded sequence of tuples
T T T T T T T T T T T T T T T

spouts
source of streams
A A A A A A A A A A A A
B B B B B B B B B B B B

typical spouts
• read from a kestrel/kafka queue. {tuples = events}
• read from a http server log. {tuples = http requests}
• read from twitter streaming api. {tuples = tweets}

bolts
process input stream - A
produce output stream - B
A A A A A A A A B B B B B B B B

bolts
• filtering tuples in a stream.
• aggregation of tuples.
• joining multiple streams.
• arbitrary functions on streams.
• communication with external caches/dbs.
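a rough way to picture the list above: a bolt takes one input tuple and emits zero or more output tuples. the sketch below models that in plain java; SimpleBolt, BoltSketch and the two sample bolts are hypothetical names, not part of the storm api.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a bolt: one input tuple in, zero or more tuples out.
interface SimpleBolt<I, O> {
    List<O> execute(I input);
}

class BoltSketch {
    // Filtering bolt: emits nothing for words that are not mentions.
    static final SimpleBolt<String, String> MENTIONS_ONLY =
        word -> word.startsWith("@")
            ? Collections.singletonList(word)
            : Collections.<String>emptyList();

    // Arbitrary-function bolt: one line in, one tuple per word out.
    static final SimpleBolt<String, String> SPLIT =
        line -> Arrays.asList(line.split(" "));

    // Feeds every tuple of a stream through a bolt and collects the output stream.
    static <I, O> List<O> run(List<I> stream, SimpleBolt<I, O> bolt) {
        List<O> out = new ArrayList<O>();
        for (I tuple : stream) {
            out.addAll(bolt.execute(tuple));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("hello @storm", "@twitter scale");
        List<String> words = run(tweets, SPLIT);
        System.out.println(run(words, MENTIONS_ONLY)); // [@storm, @twitter]
    }
}
```

chaining run calls like this mirrors how the output stream of one bolt becomes the input stream of the next.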

topology
directed-acyclic-graph of spouts and bolts.
s1
s2
b1
b2
b3
b4
b5

storm cluster
nimbus
supervisor
w1 w2 w3 w4
supervisor
w1 w2 w3 w4
ZK
topology map
sync code
topology submission
master node
slave nodes

nimbus
• master node.
• manages the topologies.
• comparable to the job tracker in hadoop.
$ storm jar myapp.jar com.twitter.MyTopology demo

supervisor
• runs on slave nodes.
• co-ordinates with zookeeper.
• manages workers.

worker
jvm process
executor
task task
task
task
executor executor

recap
• worker - process that executes a
subset of a topology.
• executor - a thread spawned by a
worker.
• task - performs the actual data
processing.

stream grouping
• shuffle grouping - random distribution
of tuples.
• fields grouping - groups tuples by a field, so equal values go to the same task.
• all grouping - replicates to all tasks.
• global grouping - sends the entire
stream to one task.
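the four groupings can be read as four ways of picking target tasks for a tuple. a plain-java sketch follows, assuming tasks are numbered 0..numTasks-1 and that hashing the field value is an acceptable stand-in for how tuples are grouped; GroupingSketch is a hypothetical name and this is not storm's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Plain-Java model of the four groupings; not Storm's implementation.
class GroupingSketch {
    // shuffle grouping: any task, chosen at random.
    static int shuffle(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // fields grouping: the same field value always maps to the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // all grouping: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<Integer>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    // global grouping: the whole stream goes to a single task (task 0 in this model).
    static int global() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        System.out.println(fields("storm", numTasks) == fields("storm", numTasks)); // true
        System.out.println(all(numTasks)); // [0, 1, 2, 3]
    }
}
```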

streaming word-count
TopologyBuilder builder = new TopologyBuilder();
// spout with a parallelism hint of 5
builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);
// parse bolt: tweets from the spout are distributed randomly
builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
       .shuffleGrouping("tweet_spout")
       .setNumTasks(2);
// count bolt: tuples with the same "word" always reach the same task
builder.setBolt("count_bolt", new WordCountBolt(), 12)
       .fieldsGrouping("parse_bolt", new Fields("word"));
Config config = new Config();
config.setNumWorkers(3);
StormSubmitter.submitTopology("demo", config, builder.createTopology());

tweet spout
class RandomTweetSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random rand;
    String[] tweets = new String[] {
        "@jkrums:There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
        "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
        ...
    };
    ....
    @Override
    public void nextTuple() {
        // emit one randomly chosen tweet roughly every 100 ms
        Utils.sleep(100);
        String tweet = tweets[rand.nextInt(tweets.length)];
        collector.emit(new Values(tweet));
    }
}

parse bolt
class ParseTweetBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // split the tweet on spaces and emit one tuple per word
        String tweet = tuple.getString(0);
        for (String word : tweet.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

word count bolt
class WordCountBolt extends BaseBasicBolt {
    // running counts, kept in memory per task
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        // emit the updated count for this word
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

word-count topology
RandomTweetSpout ParseTweetBolt WordCountBolt
shuffle grouping fields grouping
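to see why the count bolt is fed by a fields grouping on "word", here is a plain-java simulation of the pipeline that runs without a storm cluster. it assumes routing by hash of the word; WordCountSketch and its methods are hypothetical names, and the per-task maps stand in for the counts held by each count task.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of split -> route by word -> count; not Storm code.
class WordCountSketch {
    // Returns one counter map per simulated count task.
    static List<Map<String, Integer>> run(String[] tweets, int numCountTasks) {
        List<Map<String, Integer>> tasks = new ArrayList<Map<String, Integer>>();
        for (int i = 0; i < numCountTasks; i++) {
            tasks.add(new HashMap<String, Integer>());
        }
        for (String tweet : tweets) {
            for (String word : tweet.split(" ")) {
                // same word -> same task, so each word has exactly one counter
                int task = Math.floorMod(word.hashCode(), numCountTasks);
                tasks.get(task).merge(word, 1, Integer::sum);
            }
        }
        return tasks;
    }

    // Sums a word's count over all tasks.
    static int total(List<Map<String, Integer>> tasks, String word) {
        int sum = 0;
        for (Map<String, Integer> counts : tasks) {
            sum += counts.getOrDefault(word, 0);
        }
        return sum;
    }

    // Number of tasks that hold a counter for the word.
    static int owners(List<Map<String, Integer>> tasks, String word) {
        int n = 0;
        for (Map<String, Integer> counts : tasks) {
            if (counts.containsKey(word)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String[] tweets = {"storm is fast", "storm is reactive"};
        List<Map<String, Integer>> tasks = run(tweets, 3);
        System.out.println(total(tasks, "storm"));  // 2
        System.out.println(owners(tasks, "storm")); // 1
    }
}
```

with a random routing instead, the same word could land on several tasks and each would emit only a partial count.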

how do we run storm @twitter?

storm on mesos
we run multiple instances of storm on
the same cluster via mesos.
storm (production)
storm (dev)
mesos
node node node node
mesos provides efficient resource isolation and
sharing across distributed frameworks such as
storm.

topology isolation
the isolation scheduler solves the problem of
multi-tenancy: it avoids resource contention
by providing full isolation between topologies.

topology isolation
• shared pool - multiple topologies can
run on the same host.
• isolated pool - dedicated set of hosts to
run a single topology.
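one way to picture the two pools is as a split of the cluster's hosts: each isolated topology gets a dedicated set of hosts, and whatever is left forms the shared pool. the sketch below is a plain-java model under that assumption; IsolationSketch, assign and the "shared" key are hypothetical, and this is not the isolation scheduler's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of shared vs. isolated pools; not the scheduler's implementation.
class IsolationSketch {
    // Gives each isolated topology its requested number of dedicated hosts.
    // Hosts left over form the shared pool, stored under the key "shared".
    static Map<String, List<String>> assign(List<String> hosts, Map<String, Integer> isolated) {
        Map<String, List<String>> pools = new LinkedHashMap<String, List<String>>();
        int next = 0;
        for (Map.Entry<String, Integer> request : isolated.entrySet()) {
            int end = next + request.getValue();
            if (end > hosts.size()) {
                throw new IllegalArgumentException("not enough hosts for " + request.getKey());
            }
            pools.put(request.getKey(), new ArrayList<String>(hosts.subList(next, end)));
            next = end;
        }
        pools.put("shared", new ArrayList<String>(hosts.subList(next, hosts.size())));
        return pools;
    }

    public static void main(String[] args) {
        Map<String, Integer> isolated = new LinkedHashMap<String, Integer>();
        isolated.put("joe", 2);
        isolated.put("jane", 1);
        List<String> hosts = Arrays.asList("h1", "h2", "h3", "h4", "h5");
        System.out.println(assign(hosts, isolated));
        // {joe=[h1, h2], jane=[h3], shared=[h4, h5]}
    }
}
```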

topology isolation
shared pool
storm
cluster

topology isolation
shared pool
storm
cluster
joe’s topology
isolated pools

topology isolation
shared pool
storm
cluster
joe’s topology
isolated pools
jane’s topology

topology isolation
shared pool
storm
cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology

topology isolation
X
shared pool
storm
cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology
host failure

topology isolation
shared pool
storm
cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology
repair host
add host

topology isolation
shared pool
storm
cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology
add to shared
pool

numbers
• benchmarked at a million tuples
processed per second per node.
• running 30 topologies in a 200 node
cluster.
• processing 50 billion messages a day
with an average complete latency under 50
ms.
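a back-of-the-envelope conversion of the daily volume into per-second rates. the per-node figure assumes the 50 billion messages are spread evenly over all 200 nodes, which the numbers above do not state; ThroughputSketch is a hypothetical name.

```java
// Back-of-the-envelope arithmetic for the figures above (integer division).
class ThroughputSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average messages per second for a given daily volume.
    static long perSecond(long messagesPerDay) {
        return messagesPerDay / SECONDS_PER_DAY;
    }

    // Average per-node rate, assuming an even spread over all nodes.
    static long perNode(long messagesPerSecond, int nodes) {
        return messagesPerSecond / nodes;
    }

    public static void main(String[] args) {
        long perSecond = perSecond(50_000_000_000L);
        System.out.println(perSecond);               // 578703
        System.out.println(perNode(perSecond, 200)); // 2893
    }
}
```

under that assumption the average load per node is a few thousand messages per second, well below the benchmarked million tuples per second per node.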

storm use-cases @twitter

stream processing applications
tweets
favorites, retweets
impressions
twitter streams
storm
spout
bolt
bolt
$$$$
realtime
dashboards
new
features

current use-cases
• discovery of emerging topics/stories.
• online learning of tweet features for search
result ranking.
• realtime analytics for ads.
• internal log processing.

tweet scoring pipeline
data streams: tweets, impressions, interactions
storm topology
graph store
metadata store
join: tweets, impressions
join: tweets, interactions
last 7 days of: tweet -> feature_val, feature_type, timestamp
persistent store: tweet -> feature_val, feature_type, timestamp
thrift service
cassandra
twemcache
input: tweet id
output: score
write tweet features

road ahead
• auto scaling.
• persistent bolts.
• better grouping schemes.
• replicated computation.
• higher-level abstractions.

companies using storm

questions?
krishna@twitter.com
project: https://storm-project.net
mailing-list: http://groups.google.com/group/storm-user
