3. BIG DATA
“Big data is data sets that are so voluminous and
complex that traditional data-processing application
software are inadequate to deal with them”
5. MOTIVATION
Many big-data applications need to process
large data streams in near-real time
They require tens to hundreds of nodes
They require second-scale latencies
7. TRADITIONAL STREAMING SYSTEMS
Continuous operator model with mutable state
[Diagram: input records flow into node1 and node2, which feed node3]
The system needs a way to recover the mutable
state that is lost when a node fails
8. FAULT-TOLERANCE IN TRADITIONAL SYSTEMS
Node Replication (e.g. Borealis, Flux):
Double the cluster size – every node has a hot failover node
Synchronization – a sync protocol keeps the replicas consistent
Switch over – on failure, processing moves to the hot failover nodes
[Diagram: input feeds both the primary nodes and the hot failover nodes, kept consistent by the sync protocol]
Fast recovery, but 2x hardware cost
9. FAULT-TOLERANCE IN TRADITIONAL SYSTEMS
Upstream Backup (e.g. TimeStream, Storm):
Each node backs up the records it forwards downstream
On failure, the lost state is recreated by replaying the backed-up records into a cold failover node
[Diagram: input nodes keep a backup of forwarded records and replay it into the cold failover node]
Only need 1 standby, but slow recovery
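The buffering-and-replay idea behind upstream backup can be sketched in a few lines of plain Scala. This is an illustrative toy, not TimeStream's or Storm's actual protocol: a node keeps a copy of every record it forwards until the downstream acknowledges it, and on a downstream failure the unacknowledged records are replayed into a cold standby to rebuild the lost state.

```scala
// Hedged sketch (plain Scala, names are illustrative) of upstream backup:
// keep forwarded records until they are acknowledged; replay the rest
// into a cold standby after a downstream failure.
class UpstreamNode[A] {
  private var buffer = Vector.empty[A]

  def forward(record: A): A = { buffer :+= record; record } // keep a copy
  def ack(count: Int): Unit = buffer = buffer.drop(count)   // safe to discard
  def replay(): Seq[A] = buffer                             // feed the standby
}
```

Recovery is slow precisely because this replay runs serially through the single standby node.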
10. SLOW NODES IN TRADITIONAL SYSTEMS
[Diagram: a slow node delays processing in both the Node Replication and the Upstream Backup topologies]
Neither approach handles stragglers
13. DISCRETIZED STREAM PROCESSING
Make state immutable and break the
computation into small, deterministic,
stateless batch tasks
[Diagram: stateless tasks consume input 1, input 2, input 3; each task reads the previous state and produces a new one (state 1, state 2, …)]
14. IMPLEMENTATION ASSUMPTIONS
Store intermediate state data in cluster memory
Try to make batch sizes as small as possible
to get second-scale latencies
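The micro-batch idea above can be sketched in plain Scala (no Spark). All names here are illustrative: a "task" is a pure function from the previous state and an input batch to a new immutable state, so every intermediate state is a value that can be kept or recomputed.

```scala
// Minimal sketch of discretized stream processing: chop the input into
// small immutable batches and run each through a stateless, deterministic
// task that derives a new state from the previous one.
object MicroBatchSketch {
  type State = Map[String, Int]

  // A pure, deterministic task: (previous state, batch) => new state.
  def countTask(prev: State, batch: Seq[String]): State =
    batch.foldLeft(prev) { (st, word) =>
      st.updated(word, st.getOrElse(word, 0) + 1)
    }

  // Run the stream batch by batch; every intermediate state is kept as an
  // immutable value, which is what makes lineage-based recovery possible.
  def run(batches: Seq[Seq[String]]): Seq[State] =
    batches.scanLeft(Map.empty[String, Int])(countTask).tail
}
```

Because the task is deterministic and the states are immutable, rerunning `countTask` on the same inputs always reproduces the same state.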
15. DSTREAM INPUT SOURCES
Out of the box we provide
- Kafka
- HDFS
- MongoDB
- HBase
- Raw TCP sockets
- More…
It is possible to write a receiver for your
own data source
18. WINDOWING
Count frequency of words received in last 5 seconds
words = createNetworkStream("http://...")
ones = words.map(w => (w, 1))
freqs_5s = ones.reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1))
[Diagram: words are mapped into (word, 1) ones and reduced into freqs for the intervals t: 0-1 and t: 1-2]
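The windowed count can be simulated in plain Scala to show what the operation computes (this is not the Spark API, just its semantics): with 1-second batches, a 5-second window, and a 1-second slide, each output is the merged counts of the last 5 batches.

```scala
// Plain-Scala sketch of reduceByKeyAndWindow semantics: for each batch
// index, merge the per-word counts of the last `windowBatches` batches.
object WindowSketch {
  def countBatch(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def reduceByKeyAndWindow(
      batches: Seq[Seq[String]],
      windowBatches: Int
  ): Seq[Map[String, Int]] =
    batches.indices.map { i =>
      val window = batches.slice(math.max(0, i - windowBatches + 1), i + 1)
      window.map(countBatch).foldLeft(Map.empty[String, Int]) { (acc, m) =>
        m.foldLeft(acc) { case (a, (w, c)) => a.updated(w, a.getOrElse(w, 0) + c) }
      }
    }
}
```

For the slide's example, `windowBatches` would be 5 (a 5-second window over 1-second batches).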
19. THE LINEAGE
Datasets track operation lineage
Periodic checkpoints prevent long lineages
[Diagram: words → map → ones → reduce → freqs, for the intervals t: 0-1, t: 1-2, t: 2-3]
20. PARALLEL FAULT RECOVERY
Lineage is used to recompute partitions lost due to failures
Datasets on different time steps are recomputed in parallel
Partitions within a dataset are also recomputed in parallel
[Diagram: words → map → ones → reduce → freqs, recomputed across the intervals t: 0-1, t: 1-2, t: 2-3]
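Lineage-based recomputation can be sketched in plain Scala. The names here are illustrative, not Spark's RDD API: each dataset remembers its parent and the function that produced it, so a lost result is rebuilt by replaying its lineage rather than restored from a replica.

```scala
// Hedged sketch of lineage: a dataset is either a source or a derivation,
// and "recovering" a lost dataset just means recomputing it from its
// lineage. Each dataset (and each partition, in the real system) can be
// recomputed independently, which is what enables parallel recovery.
sealed trait Dataset[A] { def compute(): Seq[A] }

final case class Source[A](data: Seq[A]) extends Dataset[A] {
  def compute(): Seq[A] = data
}

final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  // Recomputing replays the parent's lineage through f.
  def compute(): Seq[B] = parent.compute().map(f)
}

object Recovery {
  // "Losing" a derived dataset is harmless: its lineage can rebuild it.
  def recover[A](lost: Dataset[A]): Seq[A] = lost.compute()
}
```

Checkpoints bound how deep such a recomputation chain can get, which is why the previous slide calls for periodic checkpointing.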
21. UPSTREAM BACKUP VS DSTREAMS RECOVERY
Upstream backup: serial replay on a single standby
DStreams: parallelism within a batch, and parallelism across time intervals
22. HANDLING STRAGGLERS IN DSTREAMS
Detect slow tasks (e.g. 2x slower than other tasks)
Launch speculative copies of those tasks in parallel on other machines
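The detection step can be sketched in plain Scala. The 2x threshold and the use of the median as the baseline are illustrative choices, not the exact rule from the talk: tasks running much slower than the typical task are flagged, and a speculative copy would be launched for each, with whichever copy finishes first winning.

```scala
// Sketch of straggler detection: flag tasks whose elapsed time exceeds
// 2x the median elapsed time of all tasks in the batch.
object StragglerSketch {
  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    s(s.size / 2)
  }

  // Returns the indices of tasks that look like stragglers.
  def detectStragglers(elapsed: Seq[Double]): Seq[Int] = {
    val m = median(elapsed)
    elapsed.zipWithIndex.collect { case (t, i) if t > 2 * m => i }
  }
}
```

This only works because batch tasks are stateless and deterministic: running a second copy of a task elsewhere is always safe.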
24. BATCH AND STREAM – SAME API
Spark batch:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Spark Streaming using DStreams:
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
26. RECOVERY
Upstream Backup: recovery time = (work needed for full recovery) / (work one node can do during recovery)
Parallel Recovery: recovery time = (work needed for full recovery) / (work the full cluster can do during recovery)
Last checkpoint – 1 minute ago
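The comparison can be made concrete with back-of-the-envelope arithmetic (the numbers below are illustrative, not from the talk): with the last checkpoint 1 minute ago, roughly 1 minute of work must be redone. Upstream backup replays it on one standby node, while parallel recovery spreads the same work over all N nodes.

```scala
// Toy recovery-time model: serial replay vs work shared across N nodes.
object RecoveryTime {
  // workSeconds: work to redo since the last checkpoint.
  def upstreamBackupSeconds(workSeconds: Double): Double =
    workSeconds // one cold standby replays everything serially

  def parallelRecoverySeconds(workSeconds: Double, nodes: Int): Double =
    workSeconds / nodes // the whole cluster shares the recomputation
}
```

Under this simplification, a 20-node cluster turns a 60-second serial replay into about 3 seconds of parallel recomputation.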
29. The Discretized Streams model offers a new approach to stream processing:
Breaks computation into small batches
Uses simple techniques to exploit parallelism in streams
Scales well
Recovers from failures and stragglers very fast
Offers the same API for stream and batch
The DStreams model is implemented on top of Spark, which is an Apache top-level project
31. Memory usage
o Significantly higher than continuous operators with mutable state
o It may be possible to reduce memory usage by storing only the Δ between RDDs
Replication size
o Replication algorithms can get by with less than 2x the hardware
Intervals
o There are scenarios where a latency of 0.5-2s does not meet the requirements
o There are cases where, even at the minimum interval (0.5s), the size of the
data we should process exceeds our resources – controlling the interval time is needed
o There are cases where the processing time of each batch is significantly smaller than the
interval time – therefore we lose valuable processing time