5. What is stream processing?
“Stream processing” is the ideal platform to process data
streams or sensor data (usually a high ratio of event
throughput versus number of queries), whereas “complex
event processing” (CEP) utilizes event-by-event processing
and aggregation.
http://www.infoq.com/articles/stream-processing-hadoop
14. Message Delivery Semantics
1. At most once: messages may be lost but never redelivered
2. At least once: messages will never be lost but may be
redelivered
3. Exactly once: messages are never lost and never redelivered
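The three semantics above can be sketched with a toy simulation (plain Java, not a messaging API): under at-least-once, a lost ack causes a redelivery and the consumer sees a duplicate; deduplicating by message id on the consumer side recovers the exactly-once effect.

```java
import java.util.*;

// Hypothetical sketch: why "at least once" implies possible duplicates.
public class DeliverySemanticsDemo {
    public static List<Integer> atLeastOnceDelivery() {
        // The broker sends messages 5, 6, 7; the ack for 7 is lost,
        // so 7 is redelivered and the consumer sees it twice.
        return Arrays.asList(5, 6, 7, 7);
    }

    public static List<Integer> dedupe(List<Integer> delivered) {
        // Consumer-side idempotence: remember seen ids, drop repeats.
        Set<Integer> seen = new LinkedHashSet<>(delivered);
        return new ArrayList<>(seen);
    }
}
```

At-most-once is the same picture without the redelivery: the lost message simply never arrives again.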
15. Complex solution needed?
1. scaling
2. message delivery
3. message grouping
4. message aggregation
5. cost of development and maintenance
18. ➔ distributed real-time processing system
➔ scalable
➔ fault-tolerant
➔ simplify working with queues & workers
➔ Written in Clojure
21. Nimbus and Supervisor daemons must be run under supervision
using a tool like daemontools or monit.
No worker processes are affected by the death of Nimbus or the
Supervisors.
So the answer is that Nimbus is "sort of" a SPOF
31. Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared. Guido Schmutz.
How to scale?
(diagram: queues q1…qn feeding parallel ProcessingA and ProcessingB instances, merged by Collecting stages)
32. TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("ftp-spout", new FTPSpout(config), 1);
builder.setSpout("kafka-spout", new KafkaSpout(config), 4);
builder.setBolt("processingA-bolt", new ProcessingABolt())
       .shuffleGrouping("ftp-spout");
builder.setBolt("processingB-bolt", new ProcessingBBolt())
       .shuffleGrouping("kafka-spout");
builder.setBolt("processingC-bolt", new ProcessingCBolt())
       .shuffleGrouping("kafka-spout");
// a component id may only be registered once; subscribe the merge bolt
// to both upstream bolts on a single declarer
builder.setBolt("merge-bolt", new MergeBolt())
       .shuffleGrouping("processingA-bolt")
       .shuffleGrouping("processingB-bolt");
builder.setBolt("collecting-bolt", new CollectingBolt())
       .shuffleGrouping("processingC-bolt");
39. Message delivery/reliability
Storm guarantees that every spout tuple will be fully processed
by the topology. It does this by tracking the tree of tuples
triggered by every spout tuple and determining when that tree
of tuples has been successfully completed. Every topology has
a "message timeout" associated with it. If Storm fails to detect
that a spout tuple has been completed within that timeout,
then it fails the tuple and replays it later.
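Storm tracks each tuple tree in constant memory with an XOR trick: every tuple id is XORed into a running value once when the tuple is anchored and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked. A minimal self-contained sketch of the idea (not the real acker code):

```java
// Sketch of the XOR trick behind Storm's acker: pairs of XORs cancel out,
// so the running value is 0 exactly when every anchored tuple was acked.
public class AckerSketch {
    private long ackVal = 0;

    public void anchored(long tupleId) { ackVal ^= tupleId; }  // tuple created
    public void acked(long tupleId)    { ackVal ^= tupleId; }  // tuple processed
    public boolean treeComplete()      { return ackVal == 0; }
}
```

This is why the tracking cost per spout tuple stays constant regardless of how large the downstream tuple tree grows.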
40. public interface IOutputCollector extends IErrorReporter {
    /**
     * Returns the task ids that received the tuples.
     */
    List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple);
    void emitDirect(int taskId, String streamId, Collection<Tuple> anchors, List<Object> tuple);
    void ack(Tuple input);
    void fail(Tuple input);
}
41. public class JUGBolt extends BaseRichBolt {
    OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            process(input);
            collector.ack(input);  // mark the tuple as fully processed
        } catch (Exception e) {
            collector.fail(input); // trigger replay from the spout
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}
46. Config config = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("APR-30-JUGTopology", config,
new JUGTopology().build());
47. Scaling
Storm's usage of Zookeeper for cluster coordination
add machines
increase the parallelism
49. Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // parallelism hint of 2
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");
topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");
StormSubmitter.submitTopology(
    "mytopology",
    conf,
    topologyBuilder.createTopology()
);
http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
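The arithmetic behind the snippet above: parallelism hints set the number of executors (threads), setNumTasks sets the number of tasks, and Storm spreads executors evenly across workers. A tiny sketch of that bookkeeping (plain Java, not Storm API):

```java
// Worker/executor/task arithmetic for the topology above:
// 2 + 2 + 6 = 10 executors over 2 workers -> 5 executors per worker;
// green-bolt runs 4 tasks over 2 executors -> 2 tasks per executor.
public class ParallelismMath {
    public static int executorsPerWorker(int[] executorCounts, int workers) {
        int total = 0;
        for (int e : executorCounts) total += e;
        return total / workers; // Storm distributes executors evenly
    }

    public static int tasksPerExecutor(int tasks, int executors) {
        return tasks / executors;
    }
}
```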
52. Trident
Trident is a high-level abstraction for doing realtime
computing on top of Storm. It allows you to seamlessly
intermix high throughput (millions of messages per second),
stateful stream processing with low latency distributed
querying
https://storm.apache.org/documentation/Trident-tutorial.html
53. topology.newStream("spout1", spout)
.each(new Fields("sentence"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
.parallelismHint(6);
https://storm.apache.org/documentation/Trident-tutorial.html
59. Latency
• Is performance of the streaming application paramount?
Development Cost
• Is it desirable to have similar code bases for batch and stream processing =>
lambda architecture?
Message Delivery Guarantees
• Is processing every single record critical, or is some nominal
amount of data loss acceptable?
Process Fault Tolerance
• Is high availability of primary concern?
Choice?
1. Scala group presentation
2. Marrog presentation - zeromq/protocol buffers
3. Jacek Laskowski
4. Adam Kawa - Spotify
5. $186k
6. Storm is difficult
Why do I want to talk about Storm when I spend 90% of my time on all the other things? Nobody questions Hadoop; even the guy who doesn't know exactly what it is, but once shook the hand of Cloudera's CEO, will burst out laughing when he hears it is "not suitable".
The same goes when I hear nonsense like "Spark is better because it can run on YARN"... or because you can write it in Scala.
Or when I hear that development in Storm is hard and only the chosen few can do it, because 99% of devs are dullards and won't manage. It is exactly the other way round, and that is what I intend to show.
I even heard that developers who know Storm earn on average $186k a year, so I told my manager: "now you know why I want to work with Storm".
Because of latency, batch processing cannot provide real-time results.
source -> processors -> queues -> result ->>>
more processors
more queues
hard to maintain:
apachestormvsapachespark-v1-141203182123-conversion-gate02.pdf
- fail-over
- must recover fast
- top-layer project - discussion on scaling on JBoss, Hazelcast or something else
- persistent queues - state to recover after a failure
Batch Processing
• Familiar concept of processing data en masse
• Generally incurs a high-latency
(Event-) Stream Processing
• A one-at-a-time processing model
• A datum is processed as it arrives
• Sub-second latency
• Difficult to process state data efficiently
Micro-Batching
• A special case of batch processing with very small batch sizes (tiny)
• A nice mix between batching and streaming
• At cost of latency
• Gives stateful computation, making windowing an easy task
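The micro-batching idea above can be sketched in a few lines (illustrative only, not Trident or Spark Streaming code): cut the stream into small fixed-size batches and process each one en masse, trading a little latency for easy stateful/windowed computation.

```java
import java.util.*;

// Minimal micro-batching sketch: group an incoming stream of events
// into consecutive batches of at most `size` elements.
public class MicroBatcher {
    public static List<List<Integer>> batches(List<Integer> stream, int size) {
        List<List<Integer>> out = new ArrayList<>();
        for (int i = 0; i < stream.size(); i += size) {
            out.add(stream.subList(i, Math.min(i + size, stream.size())));
        }
        return out;
    }
}
```

Each batch here is a ready-made window: per-batch aggregation is what makes windowing "an easy task" at the cost of waiting for the batch boundary.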
OK, so now we know what stream processing is,
we know how we can process this data (philosophically),
we know that messages can get lost, and that this may or may not be acceptable.
Is a comprehensive solution needed?
In the current state of affairs there are numerous problems...
Nathan's BLOG - post about the history of Storm
- BackType acquired by Twitter
- Storm open-sourced
- Nathan Marz start-up
- Taylor Goetz
Right, but this was supposed to be fault-tolerant, and yet there is only one Nimbus...
=============
nimbus, zookeeper, supervisor, (worker, executors, tasks)
Storm cluster:
S1: Hadoop JobTracker structure - question: do you know what this is?
S2: Storm Nimbus + Supervisor - similar? What is wrong regarding fault tolerance?
S3: Is Nimbus a SPOF?
S3: For the cluster... yes
S4: For a running topology... no
S4: But... it fails fast
S5: Is the Supervisor a SPOF?
S5: What happens when Nimbus or a Supervisor fails?
What happens when Nimbus or Supervisor daemons die?
The Nimbus and Supervisor daemons are designed to be fail-fast (process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk). As described in Setting up a Storm cluster, the Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. So if the Nimbus or Supervisor daemons die, they restart like nothing happened.
Most notably, no worker processes are affected by the death of Nimbus or the Supervisors. This is in contrast to Hadoop, where if the JobTracker dies, all the running jobs are lost.
If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won't be reassigned to other machines when necessary (like if you lose a worker machine).
So the answer is that Nimbus is "sort of" a SPOF. In practice, it's not a big deal since nothing catastrophic happens when the Nimbus daemon dies. There are plans to make Nimbus highly available in the future.
stormhadoopsummit2014-140407145559-phpapp01.pdf
about 20 slides
core data unit (single queue message)
(diagram: several topologies consuming from FTP and Kafka sources and feeding percolate and alerting stages)
Stream groupings
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.
There are seven built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:
Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods on OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
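How a fields grouping can route tuples deterministically is easy to sketch (an illustration, not Storm's internal implementation): hash the grouping field and take it modulo the number of tasks, so equal "user-id" values always land on the same bolt task.

```java
// Fields-grouping sketch: deterministic task choice from a field value.
// Math.floorMod keeps the result non-negative even for negative hash codes.
public class FieldsGroupingSketch {
    public static int taskFor(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

Shuffle grouping, by contrast, would pick the task at random (round-robin), giving even load but no co-location of equal keys.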
==================
Shuffle grouping is random grouping
Fields grouping is grouped by value, such that equal value results in equal task
All grouping replicates to all tasks
Global grouping makes all tuples go to one task
None grouping makes bolt run in the same thread as bolt/spout it subscribes to
Direct grouping producer (task that emits) controls which consumer will receive
Local or shuffle grouping: similar to the shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior.
fail / ack
timeout
understanding of storm parallelism - more deeply
rebalance
diagram - worker, executors, tasks
request-reply at Asseco - well, there you had to drink a lot to avoid dehydration.