3. BIG DATA
“Big data is data sets that are so voluminous and
complex that traditional data-processing application
software are inadequate to deal with them”
5. MOTIVATION
Many big-data applications need to process
large data streams in near-real time
They require tens to hundreds of nodes
They require second-scale latencies
7. TRADITIONAL STREAMING SYSTEMS
Continuous operator model with mutable state
[Diagram: input records flow into node1 and node2, which feed node3]
The system needs a way to recover the mutable
state that is lost when a node fails
8. FAULT-TOLERANCE IN TRADITIONAL SYSTEMS
Node Replication (e.g. Borealis, Flux):
Double the cluster size – every node has a hot failover node
Synchronization – a sync protocol keeps the replicas consistent
Switch over – on failure, processing moves to the hot failover nodes
[Diagram: input feeds both the primary nodes and the hot failover nodes, kept consistent by the sync protocol]
Fast recovery, but 2x hardware cost
9. FAULT-TOLERANCE IN TRADITIONAL SYSTEMS
Upstream Backup (e.g. TimeStream, Storm):
Each node backs up the records it forwards downstream
On failure, the lost state is recreated by replaying the backed-up records into a cold failover node
[Diagram: input nodes keep a backup of forwarded records and replay it into the cold failover node]
Only need 1 standby, but slow recovery
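The buffering-and-replay idea behind upstream backup can be sketched in a few lines of plain Scala. This is an illustrative toy, not TimeStream's or Storm's actual protocol: a node keeps a copy of every record it forwards until the downstream acknowledges it, and on a downstream failure the unacknowledged records are replayed into a cold standby to rebuild the lost state.

```scala
// Hedged sketch (plain Scala, names are illustrative) of upstream backup:
// keep forwarded records until they are acknowledged; replay the rest
// into a cold standby after a downstream failure.
class UpstreamNode[A] {
  private var buffer = Vector.empty[A]

  def forward(record: A): A = { buffer :+= record; record } // keep a copy
  def ack(count: Int): Unit = buffer = buffer.drop(count)   // safe to discard
  def replay(): Seq[A] = buffer                             // feed the standby
}
```

Recovery is slow precisely because this replay runs serially through the single standby node.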
10. SLOW NODES IN TRADITIONAL SYSTEMS
[Diagram: a slow node delays processing in both the Node Replication and the Upstream Backup topologies]
Neither approach handles stragglers
13. DISCRETIZED STREAM PROCESSING
Make state immutable and break the
computation into small, deterministic,
stateless batch tasks
[Diagram: stateless tasks consume input 1, input 2, input 3; each task reads the previous state and produces a new one (state 1, state 2, …)]
14. IMPLEMENTATION ASSUMPTIONS
Store intermediate state data in cluster memory
Try to make batch sizes as small as possible
to get second-scale latencies
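The micro-batch idea above can be sketched in plain Scala (no Spark). All names here are illustrative: a "task" is a pure function from the previous state and an input batch to a new immutable state, so every intermediate state is a value that can be kept or recomputed.

```scala
// Minimal sketch of discretized stream processing: chop the input into
// small immutable batches and run each through a stateless, deterministic
// task that derives a new state from the previous one.
object MicroBatchSketch {
  type State = Map[String, Int]

  // A pure, deterministic task: (previous state, batch) => new state.
  def countTask(prev: State, batch: Seq[String]): State =
    batch.foldLeft(prev) { (st, word) =>
      st.updated(word, st.getOrElse(word, 0) + 1)
    }

  // Run the stream batch by batch; every intermediate state is kept as an
  // immutable value, which is what makes lineage-based recovery possible.
  def run(batches: Seq[Seq[String]]): Seq[State] =
    batches.scanLeft(Map.empty[String, Int])(countTask).tail
}
```

Because the task is deterministic and the states are immutable, rerunning `countTask` on the same inputs always reproduces the same state.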
15. DSTREAM INPUT SOURCES
Out of the box we provide
- Kafka
- HDFS
- MongoDB
- HBase
- Raw TCP sockets
- More…
It is possible to write a receiver for your
own data source
18. WINDOWING
Count frequency of words received in last 5 seconds
words = createNetworkStream("http://...")
ones = words.map(w => (w, 1))
freqs_5s = ones.reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1))
[Diagram: words are mapped into (word, 1) ones and reduced into freqs for the intervals t: 0-1 and t: 1-2]
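The windowed count can be simulated in plain Scala to show what the operation computes (this is not the Spark API, just its semantics): with 1-second batches, a 5-second window, and a 1-second slide, each output is the merged counts of the last 5 batches.

```scala
// Plain-Scala sketch of reduceByKeyAndWindow semantics: for each batch
// index, merge the per-word counts of the last `windowBatches` batches.
object WindowSketch {
  def countBatch(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def reduceByKeyAndWindow(
      batches: Seq[Seq[String]],
      windowBatches: Int
  ): Seq[Map[String, Int]] =
    batches.indices.map { i =>
      val window = batches.slice(math.max(0, i - windowBatches + 1), i + 1)
      window.map(countBatch).foldLeft(Map.empty[String, Int]) { (acc, m) =>
        m.foldLeft(acc) { case (a, (w, c)) => a.updated(w, a.getOrElse(w, 0) + c) }
      }
    }
}
```

For the slide's example, `windowBatches` would be 5 (a 5-second window over 1-second batches).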
19. THE LINEAGE
Datasets track operation lineage
Periodic checkpoints prevent long lineages
[Diagram: words → map → ones → reduce → freqs, for the intervals t: 0-1, t: 1-2, t: 2-3]
20. PARALLEL FAULT RECOVERY
Lineage is used to recompute partitions lost due to failures
Datasets on different time steps are recomputed in parallel
Partitions within a dataset are also recomputed in parallel
[Diagram: words → map → ones → reduce → freqs, recomputed across the intervals t: 0-1, t: 1-2, t: 2-3]
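Lineage-based recomputation can be sketched in plain Scala. The names here are illustrative, not Spark's RDD API: each dataset remembers its parent and the function that produced it, so a lost result is rebuilt by replaying its lineage rather than restored from a replica.

```scala
// Hedged sketch of lineage: a dataset is either a source or a derivation,
// and "recovering" a lost dataset just means recomputing it from its
// lineage. Each dataset (and each partition, in the real system) can be
// recomputed independently, which is what enables parallel recovery.
sealed trait Dataset[A] { def compute(): Seq[A] }

final case class Source[A](data: Seq[A]) extends Dataset[A] {
  def compute(): Seq[A] = data
}

final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  // Recomputing replays the parent's lineage through f.
  def compute(): Seq[B] = parent.compute().map(f)
}

object Recovery {
  // "Losing" a derived dataset is harmless: its lineage can rebuild it.
  def recover[A](lost: Dataset[A]): Seq[A] = lost.compute()
}
```

Checkpoints bound how deep such a recomputation chain can get, which is why the previous slide calls for periodic checkpointing.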
21. UPSTREAM BACKUP VS DSTREAMS RECOVERY
Upstream backup: serial replay on a single standby
DStreams: parallelism within a batch, and parallelism across time intervals
22. HANDLING STRAGGLERS IN DSTREAMS
Detect slow tasks (e.g. 2x slower than other tasks)
Launch speculative copies of those tasks in parallel on other machines
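The detection step can be sketched in plain Scala. The 2x threshold and the use of the median as the baseline are illustrative choices, not the exact rule from the talk: tasks running much slower than the typical task are flagged, and a speculative copy would be launched for each, with whichever copy finishes first winning.

```scala
// Sketch of straggler detection: flag tasks whose elapsed time exceeds
// 2x the median elapsed time of all tasks in the batch.
object StragglerSketch {
  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    s(s.size / 2)
  }

  // Returns the indices of tasks that look like stragglers.
  def detectStragglers(elapsed: Seq[Double]): Seq[Int] = {
    val m = median(elapsed)
    elapsed.zipWithIndex.collect { case (t, i) if t > 2 * m => i }
  }
}
```

This only works because batch tasks are stateless and deterministic: running a second copy of a task elsewhere is always safe.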
24. BATCH AND STREAM – SAME API
Spark batch:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Spark Streaming using DStreams:
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
26. RECOVERY
Upstream Backup: recovery time = (work needed for full recovery) / (work one node can do during recovery)
Parallel Recovery: recovery time = (work needed for full recovery) / (work the full cluster can do during recovery)
Last checkpoint – 1 minute ago
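The comparison can be made concrete with back-of-the-envelope arithmetic (the numbers below are illustrative, not from the talk): with the last checkpoint 1 minute ago, roughly 1 minute of work must be redone. Upstream backup replays it on one standby node, while parallel recovery spreads the same work over all N nodes.

```scala
// Toy recovery-time model: serial replay vs work shared across N nodes.
object RecoveryTime {
  // workSeconds: work to redo since the last checkpoint.
  def upstreamBackupSeconds(workSeconds: Double): Double =
    workSeconds // one cold standby replays everything serially

  def parallelRecoverySeconds(workSeconds: Double, nodes: Int): Double =
    workSeconds / nodes // the whole cluster shares the recomputation
}
```

Under this simplification, a 20-node cluster turns a 60-second serial replay into about 3 seconds of parallel recomputation.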
29. The Discretized Streams model offers a new approach to stream processing:
Breaks computation into small batches
Uses simple techniques to exploit parallelism in streams
Scales well
Recovers from failures and stragglers very fast
Offers the same API for stream and batch
The DStreams model is implemented on top of Spark, which is an Apache top-level project
31. Memory usage
o Significantly higher than continuous operators with mutable state
o It may be possible to reduce memory usage by storing only the Δ between RDDs
Replication size
o Replication algorithms can get by with less than 2x the hardware
Intervals
o There are scenarios where a latency of 0.5-2s does not meet the requirements
o There are cases where, even at the minimum interval (0.5s), the size of the
data we should process exceeds our resources – controlling the interval time is needed
o There are cases where the processing time of each batch is significantly smaller than the
interval time – therefore we lose valuable processing time