So you think you can Stream?
Vida Ha (@femineer)
Prakash Chockalingam (@prakash573)
Who am I?
● Solutions Architect @ Databricks
● Formerly Square, Reporting & Analytics
● Formerly Google, Mobile Web Search & Speech Recognition
Who am I?
● Solutions Architect @ Databricks
● Formerly Netflix, Personalization Infrastructure
About Databricks
Founded by the creators of Spark in 2013
Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
Sign up for the Databricks Community Edition
Learn Apache Spark for free
● Mini 6GB cluster for learning Spark
● Interactive notebooks and dashboards
● Online learning resources
● Public environment to share your work
https://databricks.com/try-databricks
Agenda
• Introduction to Spark Streaming
• Lifecycle of a Spark Streaming app
• Aggregations and best practices
• Operationalization tips
• Key benefits of Spark Streaming
What is Spark Streaming?
How does it work?
● Receivers receive data streams and chop them into batches.
● Spark processes the batches and pushes out the results.
Word Count
val context = new StreamingContext(conf, Seconds(1))
val lines = context.socketTextStream(...)
The StreamingContext is the entry point; Seconds(1) sets the batch interval. The result, lines, is a DStream, which represents a data stream.
Word Count
val context = new StreamingContext(conf, Seconds(1))
val lines = context.socketTextStream(...)
val words = lines.flatMap(_.split(" "))
Transformations like flatMap create new DStreams from existing ones.
Word Count
val context = new StreamingContext(conf, Seconds(1))
val lines = context.socketTextStream(...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_+_)
wordCounts.print()
context.start()
print() outputs the DStream contents on screen; context.start() starts the streaming job.
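Putting the fragments above together, here is a minimal self-contained sketch of the word count; the app name, host, and port ("localhost", 9999) are illustrative assumptions, and awaitTermination is added so the driver keeps running:

// A runnable sketch of the full word count, assuming a socket source on
// localhost:9999 (e.g. started with `nc -lk 9999`).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val context = new StreamingContext(conf, Seconds(1))

val lines = context.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.print()
context.start()
context.awaitTermination()  // keep the driver alive until stopped or failed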
Lifecycle of a streaming app
Execution in any Spark Application
● User code runs in the driver process (the Spark Driver).
● The driver launches executors in the cluster (YARN / Mesos / Spark Standalone).
● Tasks are sent to the executors for processing data.
Execution in Spark Streaming: Receiving data
● The driver runs receivers as long-running tasks on the executors.
● Each receiver divides its data stream into blocks and keeps them in memory.
● The blocks are also replicated to another executor for fault tolerance.
Execution in Spark Streaming: Processing data
● Every batch interval, the driver launches tasks to process the blocks.
● The results are pushed out to the data store.
End-to-end view
t1 = ssc.socketStream("…")
t2 = ssc.socketStream("…")
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)
[Diagram: the input DStreams and output operations above form a DStreamGraph. Every batch interval, the DStreamGraph generates a DAG of RDDs (BlockRDDs from the receivers plus the RDD actions that become Spark jobs); the Spark DAGScheduler turns each DAG of RDDs into a DAG of stages (Stage 1 → Stage 2 → Stage 3), and the Spark TaskScheduler turns the stages into tasks that run on the executors every interval. You write the streaming app; Spark Streaming's JobScheduler + JobGenerator, the Spark DAGScheduler, and the Spark TaskScheduler handle the rest.]
Aggregations
Word count over a time window
val wordCounts = wordStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y, windowSize, slidingInterval)
reduceByKeyAndWindow reduces over a sliding time window.
Word count over a time window
Scenario: word count for the last 30 minutes.
How to optimize for good performance? A concrete sketch follows this list.
● Increase the batch interval, if possible.
● Use incremental aggregation with an inverse reduce function:
val wordCounts = wordStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y,
  (x: Int, y: Int) => x - y,
  windowSize, slidingInterval)
● Enable checkpointing:
wordStream.checkpoint(checkpointInterval)
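For concreteness, a sketch of the 30-minute scenario with the pieces wired together; the specific interval values are assumptions (both window and slide durations must be multiples of the batch interval), and the inverse-reduce variant requires checkpointing:

import org.apache.spark.streaming.{Minutes, Seconds}

val windowSize = Minutes(30)       // aggregate over the last 30 minutes
val slidingInterval = Seconds(30)  // emit updated counts every 30 seconds

val wordCounts = wordStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y,  // add counts entering the window
  (x: Int, y: Int) => x - y,  // subtract counts leaving the window
  windowSize, slidingInterval)

// Required when using the inverse reduce function.
wordStream.checkpoint(Minutes(5))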
Stateful: Global Aggregations
Scenario: maintain a global state based on the input events coming in. Example: word count from the beginning of time.
updateStateByKey (Spark 1.5 and before)
● Performance is proportional to the size of the state.
mapWithState (Spark 1.6+)
● Performance is proportional to the size of the batch.
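As a point of comparison, a minimal updateStateByKey sketch (the mapWithState version appears on the next slides); wordStream is assumed to be a DStream[(String, Int)] of (word, 1) pairs, and checkpointing must be enabled on the context:

// Running word count from the beginning of time, pre-1.6 style.
val totalCounts = wordStream.updateStateByKey[Long] {
  (newValues: Seq[Int], runningCount: Option[Long]) =>
    Some(runningCount.getOrElse(0L) + newValues.sum)
}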
Stateful: Global Aggregations
Key features of mapWithState:
● Initial state - read from somewhere as an RDD.
● # of partitions for the state - if you have a good estimate of the size of the state, you can specify the # of partitions.
● Partitioner - defaults to a hash partitioner. If you have a good understanding of the key space, you can provide a custom partitioner.
● Timeout - keys whose values are not updated within the specified timeout period are removed from the state.
Stateful: Global Aggregations (Word count)
val stateSpec = StateSpec.function(updateState _)
  .initialState(initialRDD)
  .numPartitions(100)
  .partitioner(MyPartitioner())
  .timeout(Minutes(120))

val wordCountState = wordStream.mapWithState(stateSpec)
Stateful: Global Aggregations (Word count)
def updateState(batchTime: Time,    // current batch time
                key: String,        // a word in the input stream
                value: Option[Int], // current value (= 1)
                state: State[Long]) // counts so far for the word
  : Option[(String, Long)]          // the word and its new count
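A sketch of a complete update function matching this signature; it assumes each incoming value is a 1 per word occurrence, as in the earlier slides:

def updateState(batchTime: Time,
                key: String,
                value: Option[Int],
                state: State[Long]): Option[(String, Long)] = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  // A key being timed out is passed in one last time; its state can't be updated.
  if (!state.isTimingOut()) state.update(newCount)
  Some((key, newCount))
}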
Operationalization
Three main themes
● Checkpointing
● Achieving good throughput
● Debugging a streaming job
Checkpoint
Two types of checkpointing:
● Checkpointing Data
● Checkpointing Metadata
Checkpoint Data
● Checkpointing DStreams
• Primarily needed to cut long lineage on past batches (updateStateByKey / reduceByKeyAndWindow).
• Example: wordStream.checkpoint(checkpointInterval)
Checkpoint Metadata
● Checkpointing Metadata
  • All the configuration, DStream operations, and incomplete batches are checkpointed.
  • Required for failure recovery if the driver process crashes.
  • Example:
    – streamingContext.checkpoint(directory)
    – StreamingContext.getActiveOrCreate(directory, ...)
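A sketch of how these fit together for driver recovery; checkpointDir is an assumed path on a fault-tolerant store (HDFS, S3, …):

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)  // enable metadata checkpointing
  // ... define DStreams and output operations here, before returning ...
  ssc
}

// Recover from the checkpoint if one exists; otherwise build a fresh context.
val context = StreamingContext.getActiveOrCreate(checkpointDir, createContext _)
context.start()
context.awaitTermination()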
Achieving good throughput
context.socketStream(...)
.map(...)
.filter(...)
.saveAsHadoopFile(...)
Problem: there is only one receiver; it receives all the data and stores it on its executor, so all the processing happens on that executor. Adding more nodes doesn't help.
Achieving good throughput
Solution: increase the # of receivers and union them.
● Each receiver runs on one executor. Having 5 receivers ensures that the data is received in parallel on 5 executors.
● The data is thus distributed across 5 executors, so all subsequent map/filter operations are distributed too.
val numStreams = 5
val inputStreams = (1 to numStreams).map(i => context.socketStream(...))
val fullStream = context.union(inputStreams)
fullStream.map(...).filter(...).saveAsHadoopFile(...)
Achieving good throughput
● In the case of direct streams (like the receiver-less Kafka direct approach), set the appropriate # of partitions in Kafka.
● Each Kafka partition gets mapped to a Spark partition.
● More partitions in Kafka = more parallelism in Spark. A sketch follows.
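A sketch of the direct approach using the Kafka 0.8 integration of that era (spark-streaming-kafka); the broker list and topic name are assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

// No receiver: each Kafka partition becomes one Spark partition.
val kafkaStream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](context, kafkaParams, topics)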
Achieving good throughput
● Provide the right # of partitions, based on your cluster size, for operations causing shuffles:
words.map(x => (x, 1)).reduceByKey(_ + _, 100)  // 100 = # of partitions
Debugging a Streaming application
Streaming tab in Spark UI
Debugging a Streaming application
Processing Time
● Make sure that the processing time < batch interval
Debugging a Streaming application
Debugging a Streaming application
Batch Details Page:
● Input to the batch
● Jobs that were run as part of the processing for the batch
Debugging a Streaming application
Job Details Page
● DAG Visualization
● Stages of a Spark job
Debugging a Streaming application
Task Details Page
Ensure that tasks are executed on multiple executors (nodes) in your cluster, so there is enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work even though you have more than one executor in your cluster.
Key benefits of Spark Streaming
Dynamic Load Balancing
Fast failure and Straggler recovery
Combine Batch and Stream Processing
Join data streams with static data sets
val dataset = sparkContext.hadoopFile("file")
…
kafkaStream.transform { batchRdd =>
  batchRdd.join(dataset).filter(...)
}
Combine ML and Stream Processing
Learn models offline, apply them online
val model = KMeans.train(dataset, …)
kafkaStream.map { event =>
  model.predict(event.feature)
}
Combine SQL and Stream Processing
inputStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  val df = sqlContext.createDataFrame(rdd)
  df.select(...).where(...).groupBy(...)
}
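A self-contained sketch of the pattern; the Word case class and the aggregation are illustrative, not from the talk, and inputStream is assumed to be a DStream[String]:

import org.apache.spark.sql.SQLContext

case class Word(word: String)

inputStream.foreachRDD { rdd =>
  // Reuse a single SQLContext across batches.
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  val df = rdd.map(Word(_)).toDF()
  df.groupBy("word").count().show()  // an action is needed to trigger the job
}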
Thank you.