Hadoop Summit Europe 2014: Apache Storm Architecture

© Hortonworks Inc. 2011
P. Taylor Goetz
Apache Storm Committer
tgoetz@hortonworks.com
@ptgoetz
Apache Storm Architecture and Integration
Real-Time Big Data

Shedding Light on Big Data
In Real Time

Storm is Streaming
Key enabler of the Lambda Architecture

Storm is Fast
Clocked at 1M+ messages per second per node

Storm is Scalable
Thousands of workers per cluster

Storm is Fault Tolerant
Failure is expected, and embraced

Storm is Reliable
Guaranteed message delivery

Storm is Reliable
Exactly-once semantics

Tuple
{…}
• Core Unit of Data
• Immutable Set of Key/Value Pairs

Streams
{…} {…} {…} {…} {…} {…} {…}
Unbounded Sequence of Tuples

Spouts
• Source of Streams
• Wraps a streaming data source and emits Tuples
{…} {…} {…} {…} {…} {…} {…}

Spout API
public interface ISpout extends Serializable {

    void open(Map conf,
              TopologyContext context,
              SpoutOutputCollector collector);

    void close();

    void activate();

    void deactivate();

    void nextTuple();

    void ack(Object msgId);

    void fail(Object msgId);
}
Lifecycle API

Spout API
void nextTuple();
Core API

Spout API
void ack(Object msgId);

void fail(Object msgId);
Reliability API
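
A rough sketch of the bookkeeping a reliable spout might do behind ack() and fail(). It is plain Java and does not implement Storm's ISpout; the class ReplayBufferSketch and all of its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch only: mimics ack/fail bookkeeping, independent of Storm's classes.
final class ReplayBufferSketch {
    private final Queue<String> toEmit = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextMsgId = 0L;

    // Adds a payload that has not been emitted yet.
    void offer(String payload) {
        toEmit.add(payload);
    }

    // Counterpart of nextTuple(): hands out one payload and remembers it by message id.
    // Returns the message id, or -1 when there is nothing to emit.
    long next() {
        String payload = toEmit.poll();
        if (payload == null) {
            return -1L;
        }
        long msgId = nextMsgId++;
        pending.put(msgId, payload);
        return msgId;
    }

    // Counterpart of ack(msgId): processing succeeded, so forget the payload.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    // Counterpart of fail(msgId): queue the payload again so it is replayed.
    void fail(long msgId) {
        String payload = pending.remove(msgId);
        if (payload != null) {
            toEmit.add(payload);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int queuedCount() {
        return toEmit.size();
    }

    public static void main(String[] args) {
        ReplayBufferSketch buffer = new ReplayBufferSketch();
        buffer.offer("event-1");
        long first = buffer.next();   // emitted, now pending
        buffer.fail(first);           // failure downstream: payload is queued again
        long second = buffer.next();  // the replay gets a fresh message id
        buffer.ack(second);           // success: nothing left to track
        System.out.println(buffer.pendingCount() + " pending, " + buffer.queuedCount() + " queued");
    }
}
```

Keeping emitted payloads until they are acked is what makes a replay possible; a spout that forgets a tuple as soon as it is emitted has nothing to re-emit on fail().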

Bolts
• Core functions of a streaming computation
• Receive tuples and do stuff
• Optionally emit additional tuples

Bolts
• Write to a data store

Bolts
• Read from a data store

Bolts
• Perform arbitrary computation

Bolts
• (Optionally) Emit additional streams
{…} {…} {…} {…} {…} {…} {…}

Bolt API
public interface IBolt extends Serializable {

    void prepare(Map stormConf,
                 TopologyContext context,
                 OutputCollector collector);

    void cleanup();

    void execute(Tuple input);
}
Lifecycle API

Bolt API
void execute(Tuple input);
Core API

Bolt Output API
public interface IOutputCollector extends IErrorReporter {

    List<Integer> emit(String streamId,
                       Collection<Tuple> anchors,
                       List<Object> tuple);

    void emitDirect(int taskId,
                    String streamId,
                    Collection<Tuple> anchors,
                    List<Object> tuple);

    void ack(Tuple input);

    void fail(Tuple input);
}
Core API

Bolt Output API
void ack(Tuple input);

void fail(Tuple input);
Reliability API

Topologies
• DAG of Spouts and Bolts
• Data Flow Representation
• Streaming Computation

Topologies
• Storm executes spouts and bolts as individual Tasks that run in parallel on multiple machines.

Stream Groupings
Stream Groupings determine how Storm routes
Tuples between tasks in a topology

Stream Groupings
Shuffle
Randomized round-robin.
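
One way to read "randomized round-robin", sketched in plain Java: shuffle the target task ids, walk the list in order, and reshuffle after each full pass. This illustrates the idea and is not Storm's own code; ShuffleGroupingSketch and nextTask are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: randomized round-robin over a fixed set of target task ids.
final class ShuffleGroupingSketch {
    private final List<Integer> taskIds;
    private int position = 0;

    ShuffleGroupingSketch(List<Integer> targetTaskIds) {
        if (targetTaskIds.isEmpty()) {
            throw new IllegalArgumentException("at least one target task is required");
        }
        this.taskIds = new ArrayList<>(targetTaskIds);
        Collections.shuffle(this.taskIds);
    }

    // Returns the task that should receive the next tuple.
    int nextTask() {
        if (position == taskIds.size()) {
            // One full pass is done: pick a new random order for the next pass.
            Collections.shuffle(taskIds);
            position = 0;
        }
        return taskIds.get(position++);
    }

    public static void main(String[] args) {
        ShuffleGroupingSketch grouping = new ShuffleGroupingSketch(Arrays.asList(0, 1, 2));
        for (int i = 0; i < 6; i++) {
            System.out.print(grouping.nextTask() + " ");
        }
        System.out.println();
    }
}
```

Over any full pass every task receives exactly one tuple, so load stays even while the order within a pass is random.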

Stream Groupings
LocalOrShuffle
Randomized round-robin.
(With a preference for intra-worker Tasks)

Stream Groupings
Fields Grouping
Ensures all Tuples with the same field value(s)
are always routed to the same task.
(This is a simple hash of the field values,
modulo the number of tasks.)
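
The "hash modulo number of tasks" rule can be sketched in a few lines of plain Java. This is an illustration rather than Storm's internal code; FieldsGroupingSketch and chooseTask are hypothetical names, and List.hashCode() stands in for whatever hash Storm applies to the field values.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: routes by hashing the grouping field values.
final class FieldsGroupingSketch {

    // Hash the field values, then take the result modulo the number of tasks.
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        List<Object> first = Arrays.<Object>asList("storm");
        List<Object> second = Arrays.<Object>asList("storm");
        // Equal field values always map to the same task index.
        System.out.println(chooseTask(first, numTasks) == chooseTask(second, numTasks)); // prints true
    }
}
```

Because the result depends on the number of tasks, the same field value lands on a different task if that number changes.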

Physical View
ZooKeeper Nimbus
Supervisor Supervisor Supervisor Supervisor
Worker* Worker* Worker* Worker*

Topology Deployment
ZooKeeper Nimbus Topology Submitter
Topology Submitter uploads topology:
• topology.jar
• topology.ser
• conf.ser
$ bin/storm jar

Topology Deployment
Nimbus calculates assignments and sends them to ZooKeeper.

Topology Deployment
Supervisor nodes receive assignment information via ZooKeeper watches.

Topology Deployment
Supervisor nodes download topology from Nimbus:
• topology.jar
• topology.ser
• conf.ser

Topology Deployment
Supervisors spawn workers (JVM processes) to start the topology
Worker Worker Worker Worker

Fault Tolerance
Workers heartbeat back to Supervisors and Nimbus via ZooKeeper, as well as locally.

Fault Tolerance
If a worker dies (fails to heartbeat), the Supervisor will restart it

Fault Tolerance
If a worker dies repeatedly, Nimbus will reassign the work to other nodes in the cluster.

Fault Tolerance
If a supervisor node dies, Nimbus will reassign the work to other nodes.

Fault Tolerance
If Nimbus dies, topologies will continue to function normally, but no reassignments can be performed.

Parallelism
Scaling a Distributed Computation

Parallelism
Worker (JVM)
Executor (Thread) Executor (Thread) Executor (Thread)
Task Task Task
1 Worker,
Parallelism = 1

Parallelism
1 Worker, Parallelism = 2
[Diagram: Worker (JVM), Executor (Thread), Task]

Parallelism
1 Worker, Parallelism = 2, NumTasks = 2
[Diagram: Worker (JVM), Executor (Thread), Task]

Parallelism
3 Workers,
Parallelism = 1, NumTasks = 1
Worker (JVM) Worker (JVM) Worker (JVM)
Task Task Task

Internal Messaging
Worker Mechanics

Worker Internal Messaging
Worker Receive Thread
Worker Port
List<List<Tuple>>
Receive Buffer
Executor Thread *
Inbound Queue, Outbound Queue
Router, Send Thread
Worker Transfer Thread
List<List<Tuple>>
Transfer Buffer
To Other Workers
Task(s) (Spout/Bolt)

Reliable Processing
At Least Once

Reliable Processing
Bolts may emit Tuples Anchored to one received.
Tuple “B” is a descendant of Tuple “A”
{A} {B}

Reliable Processing
Multiple Anchorings form a Tuple tree
(bolts not shown)
{A} {B} {C} {D} {E} {F} {G} {H}

Reliable Processing
Bolts can Acknowledge that a tuple
has been processed successfully.
{A} {B}
ACK

Reliable Processing
Acks are delivered via a system-level bolt
ACK
{A} {B}
Acker Bolt
ack ack

Reliable Processing
Bolts can also Fail a tuple to trigger a spout to
replay the original.
FAIL
{A} {B}
Acker Bolt
fail fail

Reliable Processing
Any failure in the Tuple tree will trigger a
replay of the original tuple
{A} {B} {C} {D} {E} {F} {G} {H}
X
X

Reliable Processing
How to track a large-scale tuple tree efficiently?

Reliable Processing
A single 64-bit integer.

XOR Magic
Random random = new Random();
long a = random.nextLong(), b = random.nextLong(), c = random.nextLong();

XOR Magic
Random random = new Random();
long a = random.nextLong(), b = random.nextLong(), c = random.nextLong();

(a ^ a) == 0

XOR Magic
(a ^ a) == 0

(a ^ a ^ b) != 0

XOR Magic
(a ^ a) == 0

(a ^ a ^ b) != 0

(a ^ a ^ b ^ b) == 0

XOR Magic
(a ^ (a ^ b) ^ c ^ (b ^ c)) == 0

XOR Magic
(a ^ (a ^ b) ^ c ^ (b ^ c)) == 0
Acks can arrive asynchronously, in any order
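
The order-independence claim is easy to check with a small plain-Java sketch. It folds the four updates from the expression above into a single 64-bit value after shuffling them. XorAckSketch and fold are hypothetical names, fixed ids replace the random ones so the run is repeatable, and this is not the acker bolt's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: tracks a tuple tree with one 64-bit value.
final class XorAckSketch {

    // Folds a sequence of updates into one 64-bit value using XOR.
    static long fold(List<Long> updates) {
        long ackVal = 0L;
        for (long update : updates) {
            ackVal ^= update;
        }
        return ackVal;
    }

    public static void main(String[] args) {
        // Fixed ids instead of random ones, so the result is repeatable.
        long a = 0x1234L, b = 0x5678L, c = 0x9ABCL;
        List<Long> updates = new ArrayList<>(Arrays.asList(a, a ^ b, c, b ^ c));
        Collections.shuffle(updates); // arrival order does not matter
        System.out.println(fold(updates) == 0L);               // prints true
        // With one update still outstanding, the value is non-zero.
        System.out.println(fold(updates.subList(0, 3)) == 0L); // prints false
    }
}
```

XOR is commutative and associative, so every id that appears twice cancels no matter when its two occurrences arrive; the value reaches zero only once every update has been seen.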

Trident
High-level abstraction built on Storm’s core primitives.

Trident
Built-in support for:
• Merges and Joins
• Aggregations
• Groupings
• Functions
• Filters

Trident
Stateful, incremental processing on top
of any persistence store.

Trident
Fluent, Stream-oriented API

Trident
Fluent, Stream-Oriented API
TridentTopology topology = new TridentTopology();
FixedBatchSpout spout = new FixedBatchSpout(…);
Stream stream = topology.newStream("words", spout);

stream.each(…, new MyFunction())
      .groupBy()
      .each(…, new MyFilter())
      .persistentAggregate(…);
User-defined functions: MyFunction, MyFilter

Trident
Micro-Batch Oriented
Tuple Micro-Batch
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}

Trident
Trident Batches are Ordered
Tuple Micro-Batch
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
Tuple Micro-Batch
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
Batch #1 Batch #2

Trident
Trident Batches can be Partitioned
Tuple Micro-Batch
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
Partition Operation
Partition A
{…} {…}
{…}{…}
Partition B
{…} {…}
{…}{…}
Partition C
{…} {…}
{…}{…}
Partition D
{…} {…}
{…}{…}
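
A minimal plain-Java sketch of what a partition operation does to one batch, assuming a hash-based split (the slide does not say which repartitioning rule is used). BatchPartitionSketch and partition are hypothetical names, and tuples are reduced to plain String keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: splits one micro-batch into a fixed number of partitions.
final class BatchPartitionSketch {

    // Assigns each tuple key to a partition by hashing it (an assumed rule).
    static List<List<String>> partition(List<String> batch, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : batch) {
            int index = Math.floorMod(key.hashCode(), numPartitions);
            partitions.get(index).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("storm", "trident", "storm", "yarn");
        List<List<String>> partitions = partition(batch, 4);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("Partition " + i + ": " + partitions.get(i));
        }
    }
}
```

Every tuple of the batch ends up in exactly one partition, and equal keys share a partition, which is what lets each partition be processed in parallel.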

Trident Operation Types
1. Local Operations (Functions/Filters)
2. Repartitioning Operations (Stream Groupings, etc.)
3. Aggregations
4. Merges/Joins

Trident Topologies
each
each
shuffle
Function
Filter
partition
persist

Trident Topologies
Partitioning operations define the boundaries
between bolts, and thus network transfer
and parallelism

Trident Topologies
each
each
shuffle
Function
Filter
partition
persist
Bolt 1
Bolt 2
shuffleGrouping()
Partitioning Operation

Trident Batch Coordination
Trident Spout | Master Batch Coordinator | User Logic
next batch
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
{…} {…} {…} {…}
commit

Controlling Deployment
How do you control where spouts
and bolts get deployed in a cluster?

Pluggable Schedulers

Isolation Scheduler

Wait… Nimbus, Supervisor, Schedulers…
Doesn’t that sound kind of like
resource negotiation?

Storm on YARN
HADOOP 2.0
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
MapReduce (batch)
Tez (interactive)
Apache STORM (streaming)
Multi Use Data Platform
Batch, Interactive, Online, Streaming, …

Storm on YARN
Batch and real-time on the same cluster

Storm on YARN
Security and Multi-tenancy

Storm on YARN
Elasticity

Storm on YARN
Nimbus: Resource Management, Scheduling
Supervisor: Node and Process management
Workers: Run topology tasks
YARN RM: Resource Management
Storm AM: Manage Topology
Containers: Run topology tasks
YARN NM: Process Management
Storm’s resource management system
maps very naturally to the YARN model.

Storm on YARN
High Availability

Storm on YARN
Detect and scale around bottlenecks

Storm on YARN
Optimize for available resources

Shameless Plug
https://www.packtpub.com/storm-distributed-real-time-computation-blueprints/book

Thank You!
Contributions welcome.
Join the storm community at:
http://storm.incubator.apache.org
P. Taylor Goetz
tgoetz@hortonworks.com
@ptgoetz
