Spark Streaming: Best Practices
Prakash Chockalingam
@prakash573
Who am I?
● Solutions Architect / Product Manager at Databricks
● Formerly Netflix, Personalization Infrastructure
● Formerly Yahoo!, Personalized Ad Targeting
About Databricks
Founded by creators of Spark in 2013
Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
Agenda
● Introduction to Spark Streaming
● Lifecycle of a Spark streaming app
● Aggregations and best practices
● Operationalization tips
● Key benefits of Spark streaming
What is Spark Streaming?
Spark Streaming
How does it work?
● Receivers receive data streams and chop them into batches.
● Spark processes the batches and pushes out the results.
Word Count
// Entry point; Seconds(1) is the batch interval
val context = new StreamingContext(conf, Seconds(1))
// DStream: represents a data stream
val lines = context.socketTextStream(...)
// Transformations: transform data to create new DStreams
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_+_)
// Print the DStream contents on screen
wordCounts.print()
// Start the streaming job
context.start()
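To make the batch model concrete, here is a minimal plain-Scala sketch of what the word count computes for each batch. It does not use Spark: `Event`, `toBatches` and `wordCount` are hypothetical names introduced for illustration, and the grouping by arrival time only approximates how receivers cut a stream into batches.

```scala
object MicroBatchModel {
  // A received line of text with its arrival time in milliseconds (hypothetical type).
  final case class Event(arrivalMs: Long, line: String)

  // Chop the stream into batches: events arriving in
  // [k * interval, (k + 1) * interval) all belong to batch k.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Map[Long, Seq[Event]] =
    events.groupBy(e => e.arrivalMs / batchIntervalMs)

  // Per-batch word count: the same flatMap / map / reduce-by-key steps,
  // expressed on ordinary collections instead of DStreams.
  def wordCount(batch: Seq[Event]): Map[String, Int] =
    batch
      .flatMap(_.line.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100L, "a b a"), Event(900L, "b"), Event(1200L, "a c"))
    toBatches(events, batchIntervalMs = 1000L).toSeq.sortBy(_._1).foreach {
      case (batchId, batch) => println(s"batch $batchId: ${wordCount(batch)}")
    }
  }
}
```

Each batch is counted independently, which is why the streaming word count reports counts per interval rather than running totals.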
Lifecycle of a streaming app
Execution in any Spark Application
Execution in Spark Streaming: Receiving data
Execution in Spark Streaming: Processing data
End-to-end view
t1 = ssc.socketStream("…")
t2 = ssc.socketStream("…")
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)
[Diagram: end-to-end view of one streaming app]
YOU write this: Streaming app → DStreamGraph (input DStreams, output operations)
Spark Streaming JobScheduler + JobGenerator: DStreamGraph → DAG of RDDs every interval (BlockRDDs, RDD actions / Spark jobs)
Spark DAGScheduler: DAG of RDDs → DAG of stages every interval (Stage 1, Stage 2, Stage 3)
Spark TaskScheduler: DAG of stages → Tasks every interval → Executors
Aggregations
Word count over a time window
val wordCounts = wordStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y,  // reduces over a time window
  windowSize,                 // window size
  slidingInterval)            // sliding interval
// wordStream is the parent DStream
Word count over a time window
Scenario: Word count for the last 30 minutes
How to optimize for good performance?
● Increase batch interval, if possible
● Incremental aggregations with inverse reduce function
val wordCounts = wordStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y, (x: Int, y: Int) => x - y, windowSize, slidingInterval)
● Checkpointing
wordStream.checkpoint(checkpointInterval)
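The idea behind the inverse reduce function can be sketched in plain Scala, without Spark. Instead of re-reducing every batch in the window, the new window is the previous window plus the batch entering it, minus the batch leaving it. The names `combine`, `recompute` and `slide` are hypothetical, and the model assumes the window slides by exactly one batch.

```scala
object IncrementalWindowModel {
  type Counts = Map[String, Int]

  // Merge two count maps with f; a key missing on one side counts as 0.
  def combine(a: Counts, b: Counts)(f: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).map(k => k -> f(a.getOrElse(k, 0), b.getOrElse(k, 0))).toMap

  // Without an inverse function: re-reduce every batch inside the window
  // that ends at index `end` and spans `windowLen` batches.
  def recompute(batches: Seq[Counts], end: Int, windowLen: Int): Counts =
    batches
      .slice(math.max(0, end - windowLen + 1), end + 1)
      .foldLeft(Map.empty[String, Int])((acc, batch) => combine(acc, batch)(_ + _))

  // With an inverse function: previous window + entering batch - leaving batch.
  // Keys whose count drops to 0 are filtered out of the result.
  def slide(previous: Counts, entering: Counts, leaving: Counts): Counts =
    combine(combine(previous, entering)(_ + _), leaving)(_ - _).filter(_._2 != 0)

  def main(args: Array[String]): Unit = {
    val batches = Seq(Map("a" -> 1, "b" -> 2), Map("a" -> 3), Map("c" -> 1), Map("a" -> 1, "c" -> 2))
    val windowLen = 2
    var window = recompute(batches, 1, windowLen)
    for (end <- 2 until batches.length) {
      window = slide(window, entering = batches(end), leaving = batches(end - windowLen))
      val same = window == recompute(batches, end, windowLen)
      println(s"window ending at batch $end: $window (matches recompute: $same)")
    }
  }
}
```

The work per slide depends on the sizes of the entering and leaving batches rather than on the window length, which is what makes a 30-minute window affordable. The trick only works when the reduce function has an inverse, as addition does.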
Stateful: Global Aggregations
Scenario: Maintain a global state based on the input events coming
in. Ex: Word count from beginning of time.
updateStateByKey (Spark 1.5 and before)
● Performance is proportional to the size of the state.
mapWithState (Spark 1.6+)
● Performance is proportional to the size of the batch.
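The difference between the two cost models can be illustrated with a plain-Scala sketch that counts how many keys each style visits per batch. This is a simplified model, not Spark code: `updateAllKeys` and `updateBatchKeys` are hypothetical names, and the state is an in-memory map rather than a distributed RDD.

```scala
object StateCostModel {
  type State = Map[String, Long]

  // updateStateByKey-style: every key already in the state is visited,
  // plus any new key from the batch. Returns (new state, keys visited).
  def updateAllKeys(state: State, batch: Map[String, Int]): (State, Int) = {
    val keys = state.keySet ++ batch.keySet
    val next = keys.map(k => k -> (state.getOrElse(k, 0L) + batch.getOrElse(k, 0))).toMap
    (next, keys.size)
  }

  // mapWithState-style: only keys present in the batch are visited.
  def updateBatchKeys(state: State, batch: Map[String, Int]): (State, Int) = {
    val next = batch.foldLeft(state) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
    (next, batch.size)
  }

  def main(args: Array[String]): Unit = {
    val state: State = (1 to 1000).map(i => s"w$i" -> 1L).toMap
    val batch = Map("w1" -> 2, "new" -> 1)
    println(s"all-keys style visited ${updateAllKeys(state, batch)._2} keys")
    println(s"batch-keys style visited ${updateBatchKeys(state, batch)._2} keys")
  }
}
```

Both styles produce the same state; only the number of keys touched per batch differs, and that gap grows as the state grows.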
Stateful: Global Aggregations
Key features of mapWithState:
● An initial state - Read from somewhere as an RDD
● # of partitions for the state - If you have a good estimate of the size of the
state, you can specify the # of partitions.
● Partitioner - Default: Hash partitioner. If you have a good understanding
of the key space, then you can provide a custom partitioner
● Timeout - Keys whose values are not updated within the specified timeout period
will be removed from the state.
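The timeout behaviour can be pictured with a small plain-Scala model. It is only a sketch: `Entry` and `step` are hypothetical names, and the model evicts idle keys immediately at each step, whereas the exact moment Spark removes a timed-out key may differ.

```scala
object StateTimeoutModel {
  // Running count plus the time (ms) of the key's last update (hypothetical representation).
  final case class Entry(count: Long, lastUpdatedMs: Long)

  // Apply one batch at time nowMs, then drop keys idle for longer than timeoutMs.
  def step(state: Map[String, Entry],
           batch: Map[String, Int],
           nowMs: Long,
           timeoutMs: Long): Map[String, Entry] = {
    val updated = batch.foldLeft(state) { case (acc, (word, n)) =>
      val previous = acc.get(word).map(_.count).getOrElse(0L)
      acc.updated(word, Entry(previous + n, nowMs))
    }
    updated.filter { case (_, entry) => nowMs - entry.lastUpdatedMs <= timeoutMs }
  }

  def main(args: Array[String]): Unit = {
    val timeoutMs = 120L * 60L * 1000L // 120 minutes
    val s1 = step(Map.empty, Map("a" -> 1, "b" -> 1), nowMs = 0L, timeoutMs)
    val s2 = step(s1, Map("a" -> 2), nowMs = 7000000L, timeoutMs)
    val s3 = step(s2, Map.empty, nowMs = 7300000L, timeoutMs)
    println(s"keys after each step: ${s1.keySet}, ${s2.keySet}, ${s3.keySet}")
  }
}
```

A timeout keeps the state from growing without bound when the key space is open-ended, at the cost of losing the count for keys that go quiet.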
Stateful: Global Aggregations (Word count)
val stateSpec = StateSpec.function(updateState _)
.initialState(initialRDD)
.numPartitions(100)
.partitioner(MyPartitioner())
.timeout(Minutes(120))
val wordCountState = wordStream.mapWithState(stateSpec)
Stateful: Global Aggregations (Word count)
def updateState(batchTime: Time,        // Current batch time
                key: String,            // A word in the input stream
                value: Option[Int],     // Current value (= 1)
                state: State[Long])     // Counts so far for the word
  : Option[(String, Long)]              // The word and its new count
Operationalization
Checkpoint
Two types of checkpointing:
● Checkpointing Data
● Checkpointing Metadata
Checkpoint Data
● Checkpointing DStreams
○ Primarily needed to cut long lineage on past batches
(updateStateByKey/reduceByKeyAndWindow).
○ Example: wordStream.checkpoint(checkpointInterval)
Checkpoint Metadata
● Checkpointing Metadata
○ All the configuration, DStream operations and incomplete batches are
checkpointed.
○ Required for failure recovery if the driver process crashes.
○ Example: streamingContext.checkpoint(directory)
Achieving good throughput
context.socketStream(...)
.map(...)
.filter(...)
.saveAsHadoopFiles(...)
Problem: There will be 1 receiver which receives all the data and
stores it in its executor and all the processing happens on that
executor. Adding more nodes doesn’t help.
Achieving good throughput
Solution: Increase the # of receivers and union them.
● Each receiver is run in 1 executor. Having 5 receivers will ensure
that the data gets received in parallel in 5 executors.
● Data gets distributed in 5 executors. So all the subsequent Spark
map/filter operations will be distributed
val numStreams = 5
val inputStreams = (1 to numStreams).map(i => context.socketStream(...))
val fullStream = context.union(inputStreams)
fullStream.map(...).filter(...).saveAsHadoopFiles(...)
Achieving good throughput
● With the direct (receiver-less) approach, as with Kafka, set an appropriate #
of partitions in Kafka.
● Each Kafka partition gets mapped to a Spark partition.
● More partitions in Kafka = More parallelism in Spark
Achieving good throughput
● Provide the right # of partitions based on your cluster size for
operations causing shuffles.
words.map(x => (x, 1)).reduceByKey(_+_, 100)  // 100 = # of partitions
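A rough plain-Scala model of how a shuffle spreads keys over the requested number of partitions. It assumes hash partitioning by a non-negative modulo of the key's `hashCode`, which is how Spark's default hash partitioner is understood to behave. `partitionFor` and `keysPerPartition` are hypothetical helper names.

```scala
object ShufflePartitionModel {
  // Non-negative modulo, so negative hash codes still map to a valid partition index.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Number of keys landing in each partition that receives at least one key.
  def keysPerPartition(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(k => partitionFor(k, numPartitions)).map { case (p, ks) => (p, ks.size) }

  def main(args: Array[String]): Unit = {
    val words = (1 to 10000).map(i => s"word$i")
    val spread = keysPerPartition(words, 100)
    println(s"partitions used: ${spread.size}, largest partition: ${spread.values.max} keys")
  }
}
```

Too few partitions leaves cores idle during the reduce; too many adds per-task overhead. A count in the neighbourhood of the cluster's total core count is a common starting point to tune from.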
Debugging a Streaming application
Streaming tab in Spark UI
Debugging a Streaming application
Processing Time
● Make sure that the processing time < batch interval
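Why this matters can be seen in a small plain-Scala model of the batch queue. It assumes batches are processed one at a time, in order, and that a batch can start only once its interval has closed and the previous batch has finished. `schedulingDelays` is a hypothetical helper, not a Spark API.

```scala
object BatchBacklogModel {
  // Batch i (0-based) has all its data at (i + 1) * batchIntervalMs.
  // It starts when its data is complete and the previous batch has finished.
  // Returns the scheduling delay (ms) each batch waits before it starts.
  def schedulingDelays(batchIntervalMs: Long, processingTimesMs: Seq[Long]): Seq[Long] =
    processingTimesMs.zipWithIndex.foldLeft((0L, Vector.empty[Long])) {
      case ((previousFinish, delays), (processing, i)) =>
        val ready = (i + 1) * batchIntervalMs
        val start = math.max(ready, previousFinish)
        (start + processing, delays :+ (start - ready))
    }._2

  def main(args: Array[String]): Unit = {
    println(s"keeping up:     ${schedulingDelays(1000L, Seq.fill(4)(800L))}")
    println(s"falling behind: ${schedulingDelays(1000L, Seq.fill(4)(1500L))}")
  }
}
```

When processing time stays under the batch interval, the delay stays at zero. When it exceeds the interval, every batch waits longer than the one before it, and the backlog keeps growing until the job is tuned or fails.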
Debugging a Streaming application
Batch Details Page:
● Input to the batch
● Jobs that were run as part of the processing for the batch
Debugging a Streaming application
Job Details Page
● DAG Visualization
● Stages of a Spark job
Debugging a Streaming application
Task Details Page
Ensure that tasks run on multiple executors (nodes) in your cluster so that processing has enough parallelism.
With a single receiver, one executor may end up doing all the work even though the cluster has more than one executor.
Key benefits of Spark streaming
Dynamic Load Balancing
Fast failure and Straggler recovery
Combine Batch and Stream Processing
Join data streams with static data sets
val dataset = sparkContext.hadoopFile("file")
…
kafkaStream.transform{ batchRdd =>
batchRdd.join(dataset).filter(...)
}
Combine ML and Stream Processing
Learn models offline, apply them online
val model = KMeans.train(dataset, …)
kafkaStream.map { event =>
model.predict(event.feature)
}
Combine SQL and Stream Processing
inputStream.foreachRDD{ rdd =>
val df = sqlContext.createDataFrame(rdd)
df.select(...).where(...).groupBy(...)
}
Thank you.

Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 

Spark Streaming: Best Practices

  • 1. Spark Streaming: Best Practices Prakash Chockalingam @prakash573
  • 2. Who am I? ● Solutions Architect / Product Manager at Databricks ● Formerly Netflix, Personalization Infrastructure ● Formerly Yahoo!, Personalized Ad Targeting
  • 3. About Databricks Founded by creators of Spark in 2013 Cloud enterprise data platform - Managed Spark clusters - Interactive data science - Production pipelines - Data governance, security, …
  • 4. Agenda ● Introduction to Spark Streaming ● Lifecycle of a Spark streaming app ● Aggregations and best practices ● Operationalization tips ● Key benefits of Spark streaming
  • 5. What is Spark Streaming?
  • 8. How does it work? ● Receivers receive data streams and chop them into batches. ● Spark processes the batches and pushes out the results.
  • 9. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) Entry point Batch Interval DStream: represents a data stream
  • 10. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) val words = lines.flatMap(_.split(" ")) Transformations: transform data to create new DStreams
  • 11. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_+_) wordCounts.print() context.start() Print the DStream contents on screen Start the streaming job
  • 12. Lifecycle of a streaming app
  • 13. Execution in any Spark Application
  • 14. Execution in Spark Streaming: Receiving data
  • 15. Execution in Spark Streaming: Processing data
  • 16. End-to-end view t1 = ssc.socketStream("…") t2 = ssc.socketStream("…") t = t1.union(t2).map(…) t.saveAsHadoopFiles(…) t.map(…).foreach(…) t.filter(…).foreach(…) [Diagram: the streaming app (YOU write this) defines input DStreams and output operations, which form the DStreamGraph. Every interval, the Spark Streaming JobScheduler + JobGenerator turn it into a DAG of RDDs (BlockRDDs, RDD actions / Spark jobs), the Spark DAGScheduler turns that into a DAG of stages (Stage 1, Stage 2, Stage 3), and the Spark TaskScheduler turns those into tasks that run on executors.]
  • 18. Word count over a time window val wordCounts = wordStream.reduceByKeyAndWindow((x: Int, y:Int) => x+y, windowSize, slidingInterval) Parent DStream window size sliding interval Reduces over a time window
  • 19. Word count over a time window Scenario: Word count for the last 30 minutes How to optimize for good performance? ● Increase batch interval, if possible ● Incremental aggregations with inverse reduce function val wordCounts = wordStream.reduceByKeyAndWindow((x: Int, y:Int) => x+y, (x: Int, y: Int) => x-y, windowSize, slidingInterval) ● Checkpointing wordStream.checkpoint(checkpointInterval)
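
To make the inverse reduce idea on slide 19 concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how a windowed count can be updated incrementally: add the batch that enters the window and subtract the batch that leaves it, instead of re-reducing every batch in the window. The object name, the `Counts` alias, and the helpers `add`, `subtract`, and `slide` are hypothetical and exist only for illustration; this is not how Spark implements `reduceByKeyAndWindow` internally.

```scala
object IncrementalWindowSketch {
  // Hypothetical alias: word -> count for one batch or for a whole window.
  type Counts = Map[String, Int]

  // Forward reduce: merge one batch's counts into the running window total.
  def add(window: Counts, batch: Counts): Counts =
    batch.foldLeft(window) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Inverse reduce: subtract the counts of the batch that slid out of the window.
  // Keys whose count drops to zero are removed so the map does not grow forever.
  def subtract(window: Counts, batch: Counts): Counts =
    batch.foldLeft(window) { case (acc, (word, n)) =>
      val remaining = acc.getOrElse(word, 0) - n
      if (remaining <= 0) acc - word else acc.updated(word, remaining)
    }

  // Slide the window by one batch: work is proportional to two batches,
  // not to the number of batches the window covers.
  def slide(window: Counts, entering: Counts, leaving: Counts): Counts =
    subtract(add(window, entering), leaving)

  def main(args: Array[String]): Unit = {
    val b1 = Map("spark" -> 2, "kafka" -> 1)
    val b2 = Map("spark" -> 1)
    val b3 = Map("flink" -> 4)

    val window12 = add(add(Map.empty, b1), b2)                  // window covers b1 and b2
    val window23 = slide(window12, entering = b3, leaving = b1) // window now covers b2 and b3
    println(window23)
  }
}
```

In Spark itself, the variant of `reduceByKeyAndWindow` that takes an inverse function requires checkpointing to be enabled, and it accepts an optional filter function that plays the role of the zero-count cleanup done in `subtract` here.
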
  • 20. Stateful: Global Aggregations Scenario: Maintain a global state based on the input events coming in. Ex: Word count from beginning of time. updateStateByKey (Spark 1.5 and before) ● Performance is proportional to the size of the state. mapWithState (Spark 1.6+) ● Performance is proportional to the size of the batch.
  • 22. Stateful: Global Aggregations Key features of mapWithState: ● An initial state - Read from somewhere as an RDD ● # of partitions for the state - If you have a good estimate of the size of the state, you can specify the # of partitions. ● Partitioner - Default: Hash partitioner. If you have a good understanding of the key space, then you can provide a custom partitioner. ● Timeout - Keys whose values are not updated within the specified timeout period will be removed from the state.
  • 23. Stateful: Global Aggregations (Word count) val stateSpec = StateSpec.function(updateState _) .initialState(initialRDD) .numPartitions(100) .partitioner(MyPartitioner()) .timeout(Minutes(120)) val wordCountState = wordStream.mapWithState(stateSpec)
  • 24. Stateful: Global Aggregations (Word count) def updateState(batchTime: Time, key: String, value: Option[Int], state: State[Long]) : Option[(String, Long)] Parameters: batchTime = current batch time; key = a word in the input stream; value = current value (= 1); state = counts so far for the word. Returns: the word and its new count.
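
The slides give only the signature of `updateState`. One possible body is sketched below, assuming Spark Streaming 1.6+ on the classpath. The object name `WordCountState` and the helper `nextCount` are hypothetical, and skipping keys that are timing out is an assumption about the desired behavior rather than something the deck specifies.

```scala
import org.apache.spark.streaming.{State, Time}

object WordCountState {
  // Hypothetical pure helper: previous count (if any) plus the new value (if any).
  def nextCount(soFar: Option[Long], value: Option[Int]): Long =
    soFar.getOrElse(0L) + value.getOrElse(0).toLong

  // Matches the signature on slide 24.
  def updateState(batchTime: Time,
                  key: String,
                  value: Option[Int],
                  state: State[Long]): Option[(String, Long)] = {
    if (state.isTimingOut()) {
      // The key is being removed because of the StateSpec timeout.
      // Its state can no longer be updated, so emit nothing for it.
      None
    } else {
      val count = nextCount(state.getOption(), value)
      state.update(count)   // persist the new running count for this word
      Some((key, count))    // the word and its new count
    }
  }
}
```

With this in place, the `StateSpec` on slide 23 would be built from `WordCountState.updateState _`. Because the function only touches keys present in the current batch, the work per batch scales with the batch size, which is the point made on slide 20.
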
  • 26. Checkpoint Two types of checkpointing: ● Checkpointing Data ● Checkpointing Metadata
  • 27. Checkpoint Data ● Checkpointing DStreams ○ Primarily needed to cut long lineage on past batches (updateStateByKey/reduceByKeyAndWindow). ○ Example: wordStream.checkpoint(checkpointInterval)
  • 28. Checkpoint Metadata ● Checkpointing Metadata ○ All the configuration, DStream operations and incomplete batches are checkpointed. ○ Required for failure recovery if the driver process crashes. ○ Example: streamingContext.checkpoint(directory)
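
Metadata checkpointing only helps with driver recovery if the application rebuilds its context from the checkpoint on restart. A minimal sketch of that pattern using `StreamingContext.getOrCreate` follows. The application name, the checkpoint path, and the socket host and port are placeholder assumptions; in practice the directory should live on fault-tolerant storage, and the master is expected to be supplied at submit time.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  // Placeholder path; assumed for illustration only.
  val checkpointDir = "/tmp/streaming-checkpoint"

  // All DStream setup must happen inside this function so that it can be
  // skipped when the context is restored from checkpoint data instead.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir) // enables metadata checkpointing

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // First run: calls createContext(). After a driver crash: rebuilds the
    // context, DStream operations and incomplete batches from checkpointDir.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```
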
  • 29. Achieving good throughput context.socketStream(...) .map(...) .filter(...) .saveAsHadoopFiles(...) Problem: There will be 1 receiver, which receives all the data and stores it in its executor, and all the processing happens on that executor. Adding more nodes doesn’t help.
  • 30. Achieving good throughput Solution: Increase the # of receivers and union them. ● Each receiver is run in 1 executor. Having 5 receivers will ensure that the data gets received in parallel in 5 executors. ● Data gets distributed in 5 executors, so all the subsequent Spark map/filter operations will be distributed. val numStreams = 5 val inputStreams = (1 to numStreams).map(i => context.socketStream(...)) val fullStream = context.union(inputStreams) fullStream.map(...).filter(...).saveAsHadoopFiles(...)
  • 31. Achieving good throughput ● In the case of direct receivers (like Kafka), set the appropriate # of partitions in Kafka. ● Each Kafka partition gets mapped to a Spark partition. ● More partitions in Kafka = More parallelism in Spark
  • 32. Achieving good throughput ● Provide the right # of partitions based on your cluster size for operations causing shuffles. words.map(x => (x, 1)).reduceByKey(_+_, 100) (here 100 is the # of partitions)
  • 33. Debugging a Streaming application Streaming tab in Spark UI
  • 34. Debugging a Streaming application Processing Time ● Make sure that the processing time < batch interval
  • 35. Debugging a Streaming application
  • 36. Debugging a Streaming application Batch Details Page: ● Input to the batch ● Jobs that were run as part of the processing for the batch
  • 37. Debugging a Streaming application Job Details Page ● DAG Visualization ● Stages of a Spark job
  • 38. Debugging a Streaming application Task Details Page Ensure that the tasks are executed on multiple executors (nodes) in your cluster to have enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work even though you have more than one executor in your cluster.
  • 39. Key benefits of Spark streaming
  • 41. Fast failure and Straggler recovery
  • 42. Combine Batch and Stream Processing Join data streams with static data sets val dataset = sparkContext.hadoopFile("file") … kafkaStream.transform{ batchRdd => batchRdd.join(dataset).filter(...) }
  • 43. Combine ML and Stream Processing Learn models offline, apply them online val model = KMeans.train(dataset, …) kafkaStream.map { event => model.predict(event.feature) }
  • 44. Combine SQL and Stream Processing inputStream.foreachRDD{ rdd => val df = sqlContext.createDataFrame(rdd) df.select(...).where(...).groupBy(...) }