DISCRETIZED STREAMS
Fault-Tolerant Streaming Computation at Scale
Matei Zaharia, Tathagata Das (TD), Haoyuan (HY) Li,
Timothy Hunter, Scott Shenker, Ion Stoica
Presented by:
Tomer Orenstein & Lior Nussbaum
BIG DATA
“Big data is data sets that are so voluminous and
complex that traditional data-processing application
software are inadequate to deal with them”
BIG DATA PROCESSING METHODS
MOTIVATION
Many big-data applications need to process
large data streams in near-real time
Require tens to hundreds of nodes
Require second-scale latencies
TRADITIONAL STREAMING SYSTEMS
PROBLEMS
Traditional stream processing systems cannot recover
from failures and stragglers quickly and efficiently
TRADITIONAL STREAMING SYSTEMS
• Continuous operator model with mutable state
[Figure: input records flow through node 1, node 2, and node 3]
The system must be able to recover mutable state
that is lost when a node fails
FAULT-TOLERANCE IN TRADITIONAL SYSTEMS
Node Replication (e.g. Borealis, Flux)
• Double cluster size
• Synchronization
• Switch over
[Figure: input is sent to both the primary nodes and the hot failover nodes, which are kept consistent by a sync protocol]
Fast recovery, but 2x hardware cost
FAULT-TOLERANCE IN TRADITIONAL SYSTEMS
Upstream Backup (e.g. TimeStream, Storm)
• Each node keeps its own backup of the records it has forwarded
• On failure, state is recreated by replaying the backup
• Cold failover node
[Figure: nodes back up their forwarded input and replay it to a cold failover node]
Only need 1 standby, but slow recovery
SLOW NODES IN TRADITIONAL SYSTEMS
[Figure panels: Upstream Backup, Node Replication]
Neither approach handles stragglers
THE SOLUTION – DSTREAMS
DISCRETIZED STREAM PROCESSING
Make state immutable and break
computation into small, deterministic,
stateless batches
[Figure: a chain of stateless tasks; each task consumes one input batch (input 1, input 2, input 3) and passes immutable state (state 1, state 2) to the next task]
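The batch model on this slide can be illustrated with a minimal sketch in plain Scala. It is not Spark code: ordinary immutable collections stand in for datasets held in cluster memory, and the names `DiscretizedSketch`, `step`, and `run` are hypothetical. The point it tries to show is that each task is a deterministic function of the previous immutable state and one input batch.

```scala
// Minimal sketch (hypothetical names); plain collections stand in for cluster datasets.
object DiscretizedSketch {
  // Immutable running word counts.
  type State = Map[String, Int]

  // A stateless, deterministic task: the new state depends only on the arguments.
  def step(prev: State, batch: Seq[String]): State =
    batch.foldLeft(prev) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0) + 1)
    }

  // Run the task once per batch interval and keep every intermediate state.
  def run(batches: Seq[Seq[String]]): Seq[State] =
    batches.scanLeft(Map.empty[String, Int])(step).tail
}
```

Because `step` never mutates its inputs, any state in the returned sequence could be rebuilt by re-running the same task on the same batch, which is the property the later recovery slides rely on.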
IMPLEMENTATION ASSUMPTIONS
• Store intermediate state data in cluster memory
• Try to make batch sizes as small as possible
to get second-scale latencies
DSTREAM INPUT SOURCES
Out of the box we provide
- Kafka
- HDFS
- MongoDB
- HBase
- Raw TCP sockets
- More…
It is possible to write a receiver for your
own data source
[Figure: batch operations turn the input stream into an output / state stream, one interval at a time (time = 0-1, time = 1-2, ...)]
Input: replicated dataset stored in memory
Output or State: non-replicated dataset stored in memory
WINDOWING
Count the frequency of words received in the last 5 seconds
val words = createNetworkStream("http://...")
val ones = words.map(w => (w, 1))
// 5-second window, sliding every 1 second
val freqs_5s = ones.reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1))
[Figure: words → map → ones → reduce → freqs, for intervals t: 0-1 and t: 1-2]
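To make the semantics of the windowed count concrete, here is a small sketch in plain Scala rather than the streaming API. It assumes one inner sequence per 1-second batch and recomputes each window from scratch; `WindowSketch` and `windowedCounts` are hypothetical names, and a real system may compute windows incrementally instead.

```scala
// Minimal sketch (hypothetical names): sliding-window word counts over batches.
object WindowSketch {
  // `batches` holds one Seq of words per batch interval;
  // `window` is the window length measured in batch intervals.
  def windowedCounts(batches: Seq[Seq[String]], window: Int): Seq[Map[String, Int]] =
    batches.indices.map { t =>
      // The window ending at interval t covers up to `window` most recent batches.
      val inWindow = batches.slice(math.max(0, t - window + 1), t + 1).flatten
      inWindow.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    }
}
```

With 1-second batches, `window = 5` would correspond to the 5-second window sliding by 1 second in the slide's example.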
THE LINEAGE
• Datasets track operation lineage
• Periodic checkpoints prevent long lineages
[Figure: lineage graph words → map → ones → reduce → freqs, for intervals t: 0-1, t: 1-2, and t: 2-3]
PARALLEL FAULT RECOVERY
• Lineage is used to recompute partitions lost due to failures
• Datasets on different time steps are recomputed in parallel
• Partitions within a dataset are also recomputed in parallel
[Figure: lineage graph words → map → ones → reduce → freqs, for intervals t: 0-1, t: 1-2, and t: 2-3]
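The recovery idea can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the Spark implementation: `LineageSketch`, `Ones`, `Freqs`, and `recover` are hypothetical names, the input words are assumed to survive the failure because the input is replicated, and standard-library futures stand in for tasks scheduled across a cluster.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Minimal sketch (hypothetical names): recompute lost results from lineage.
object LineageSketch {
  // A dataset is described by how to compute it, not only by its data.
  sealed trait Dataset { def compute(): Seq[(String, Int)] }

  // Derived from replicated input, which is assumed to survive a node failure.
  final case class Ones(words: Seq[String]) extends Dataset {
    def compute(): Seq[(String, Int)] = words.map(w => (w, 1))
  }

  // Remembers its parent and the deterministic operation applied to it.
  final case class Freqs(parent: Dataset) extends Dataset {
    def compute(): Seq[(String, Int)] =
      parent.compute()
        .groupBy(_._1)
        .map { case (w, ps) => (w, ps.map(_._2).sum) }
        .toSeq
        .sortBy(_._1) // sorted only to make the output order deterministic
  }

  // Lost datasets from different time steps are independent,
  // so they can be recomputed concurrently.
  def recover(lost: Seq[Dataset]): Seq[Seq[(String, Int)]] =
    Await.result(Future.traverse(lost)(d => Future(d.compute())), 10.seconds)
}
```

Determinism is what makes this safe: recomputing `Freqs(Ones(...))` always yields the same result as the lost copy, so nothing beyond the replicated input and the lineage has to be kept.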
UPSTREAM BACKUP VS DSTREAMS RECOVERY
[Figure: serial recovery with upstream backup, compared with DStreams recovery]
• Parallelism within a batch
• Parallelism across time intervals
HANDLING STRAGGLERS IN DSTREAMS
• Detect slow tasks (e.g. 2x slower than other tasks)
• Launch more copies of the tasks in parallel on other machines
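The detection step can be sketched in plain Scala as below. The slide only says "2x slower than other tasks"; comparing against the median elapsed time is an assumption made here for illustration, and `StragglerSketch`, `Task`, and `stragglers` are hypothetical names rather than any scheduler's API.

```scala
// Minimal sketch (hypothetical names): flag tasks that look like stragglers.
object StragglerSketch {
  // A running task: its id and how long it has been running, in milliseconds.
  final case class Task(id: Int, elapsedMs: Long)

  // Flag tasks running more than `factor` times longer than the median task.
  // Using the median as the baseline is an assumption of this sketch.
  def stragglers(tasks: Seq[Task], factor: Double = 2.0): Seq[Task] =
    if (tasks.isEmpty) Seq.empty
    else {
      val sorted = tasks.map(_.elapsedMs).sorted
      val median = sorted(sorted.size / 2)
      tasks.filter(_.elapsedMs > factor * median)
    }
}
```

A scheduler could then launch a second copy of each flagged task on another machine and keep whichever copy finishes first; since tasks are deterministic, either copy produces the same result.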
TIME FOR SOME CODE
BATCH AND STREAM – SAME API
Spark Batch
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Spark Streaming using DStreams
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
COMPARISONS
RECOVERY
[Figure: Upstream Backup (work needed for full recovery vs. one node's work during recovery time) compared with Parallel Recovery (work needed for full cluster recovery vs. cluster work during recovery time); last checkpoint was 1 minute ago]
RECOVERY - EFFECT OF CHECKPOINT & NODES
• The Discretized Streams model offers a new approach to stream processing:
• Breaks computation into small batches
• Uses simple techniques to exploit parallelism in streams
• Scalable
• Recovers from failures and stragglers very fast
• Same API for stream and batch
• The DStreams model is implemented on top of Spark, which is an Apache top-level project
CRITICISM
• Memory usage
o Significantly higher than for continuous operators with mutable state
o It may be possible to reduce memory usage by storing only the Δ between RDDs
• Replication size
o Replication algorithms can get by with less than 2x the hardware
• Intervals
o There are scenarios whose requirements a latency of 0.5-2 s does not meet
o There are cases where, even at the minimum interval time (0.5 s),
the amount of data to process exceeds the available resources, so
the interval time needs to be controlled
o There are cases where the processing time of each batch is significantly shorter than the
interval time, so valuable processing time is lost
