SlideShare a Scribd company logo
1 of 45
Download to read offline
Spark Streaming: Best Practices
Prakash Chockalingam
@prakash573
Who am I ?
2
● Solutions Architect / Product Manager at Databricks
● Formerly Netflix, Personalization Infrastructure
● Formerly Yahoo!, Personalized Ad Targeting
About Databricks
Founded by creators of Spark in 2013
Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
Agenda
● Introduction to Spark Streaming
● Lifecycle of a Spark streaming app
● Aggregations and best practices
● Operationalization tips
● Key benefits of Spark streaming
What is Spark Streaming?
Spark Streaming
How does it work?
● Receivers receive data streams and chops them in to batches.
● Spark processes the batches and pushes out the results
Word Count
val context = new StreamingContext(conf, Seconds(1))
val lines = context.socketTextStream(...)
Entry point Batch Interval
DStream: represents a
data stream
Word Count
val context = new StreamingContext(conf, Seconds(1))
val lines = context.socketTextStream(...)
val words = lines.flatMap(_.split(“ “))
Transformations: transform
data to create new DStreams
Word Count
val context = new StreamingContext(conf, Seconds(1))
val lines = context.socketTextStream(...)
val words = lines.flatMap(_.split(“ “))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_+_)
wordCounts.print()
context.start()
Print the DStream contents on screen
Start the streaming job
Lifecycle of a streaming app
Execution in any Spark Application
Execution in Spark Streaming: Receiving data
Execution in Spark Streaming: Processing data
End-to-end view
16
t1 = ssc.socketStream(“…”)
t2 = ssc.socketStream(“…”)
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)
T
U
M
T
M FFE
FE FE
B
U
M
B
M F
Input
DStreams
Output
operations
RDD Actions /
Spark Jobs
BlockRDDs
DStreamGraph DAG of RDDs
every interval
DAG of stages
every interval
Stage 1
Stage 2
Stage 3
Streaming app Tasks
every interval
B
U
M
B
M F
B
U
M
B
M F
Stage 1
Stage 2
Stage 3
Stage 1
Stage 2
Stage 3
Spark Streaming
JobScheduler + JobGenerator
Spark
DAGScheduler
Spark
TaskScheduler
Executors
YOU
write this
Aggregations
Word count over a time window
val wordCounts = wordStream.reduceByKeyAndWindow((x:
Int, y:Int) => x+y, windowSize, slidingInterval)
Parent DStream
window size
sliding interval
Reduces over a time window
Word count over a time window
Scenario: Word count for the last 30 minutes
How to optimize for good performance?
● Increase batch interval, if possible
● Incremental aggregations with inverse reduce function
val wordCounts = wordStream.reduceByKeyAndWindow((x: Int, y:Int) =>
x+y, (x: Int, y: Int) => x-y, windowSize, slidingInterval)
● Checkpointing
wordStream.checkpoint(checkpointInterval)
Stateful: Global Aggregations
Scenario: Maintain a global state based on the input events coming
in. Ex: Word count from beginning of time.
updateStateByKey (Spark 1.5 and before)
● Performance is proportional to the size of the state.
mapWithState (Spark 1.6+)
● Performance is proportional to the size of the batch.
Stateful: Global Aggregations
Stateful: Global Aggregations
Key features of mapWithState:
● An initial state - Read from somewhere as a RDD
● # of partitions for the state - If you have a good estimate of the size of the
state, you can specify the # of partitions.
● Partitioner - Default: Hash partitioner. If you have a good understanding
of the key space, then you can provide a custom partitioner
● Timeout - Keys whose values are not updated within the specified timeout period
will be removed from the state.
Stateful: Global Aggregations (Word count)
val stateSpec = StateSpec.function(updateState _)
.initialState(initialRDD)
.numPartitions(100)
.partitioner(MyPartitioner())
.timeout(Minutes(120))
val wordCountState = wordStream.mapWithState(stateSpec)
Stateful: Global Aggregations (Word count)
def updateState(batchTime: Time,
key: String,
value: Option[Int],
state: State[Long])
: Option[(String, Long)]
Current batch time
A Word in the input stream
Current value (= 1)
Counts so far for the word
The word and its new count
Operationalization
Checkpoint
Two types of checkpointing:
● Checkpointing Data
● Checkpointing Metadata
Checkpoint Data
● Checkpointing DStreams
○ Primarily needed to cut long lineage on past batches
(updateStateByKey/reduceByKeyAndWindow).
○ Example: wordStream.checkpoint(checkpointInterval)
Checkpoint Metadata
● Checkpointing Metadata
○ All the configuration, DStream operations and incomplete batches are
checkpointed.
○ Required for failure recovery if the driver process crashes.
○ Example: streamingContext.checkpoint(directory)
Achieving good throughput
context.socketStream(...)
.map(...)
.filter(...)
.saveAsHadoopFile(...)
Problem: There will be 1 receiver which receives all the data and
stores it in its executor and all the processing happens on that
executor. Adding more nodes doesn’t help.
Achieving good throughput
Solution: Increase the # of receivers and union them.
● Each receiver is run in 1 executor. Having 5 receivers will ensure
that the data gets received in parallel in 5 executors.
● Data gets distributed in 5 executors. So all the subsequent Spark
map/filter operations will be distributed
val numStreams = 5
val inputStreams = (1 to numStreams).map(i => context.
socketStream(...))
val fullStream = context.union(inputStreams)
fullStream.map(...).filter(...).saveAsHadoopFile(...)
Achieving good throughput
● In the case of direct receivers (like Kafka), set the appropriate #
of partitions in Kafka.
● Each kafka paratition gets mapped to a Spark partition.
● More partitions in Kafka = More parallelism in Spark
Achieving good throughput
● Provide the right # of partitions based on your cluster size for
operations causing shuffles.
words.map(x => (x, 1)).reduceByKey(_+_, 100)
# of partitions
Debugging a Streaming application
Streaming tab in Spark UI
Debugging a Streaming application
Processing Time
● Make sure that the processing time < batch interval
Debugging a Streaming application
Debugging a Streaming application
Batch Details Page:
● Input to the batch
● Jobs that were run as part of the processing for the batch
Debugging a Streaming application
Job Details Page
● DAG Visualization
● Stages of a Spark job
Debugging a Streaming application
Task Details Page
Ensure that the tasks are executed on multiple executors (nodes) in your cluster to have enough parallelism while processing.
If you have a single receiver, sometimes only one executor might be doing all the work though you have more than one executor
in your cluster.
Key benefits of Spark
streaming
Dynamic Load Balancing
Fast failure and Straggler recovery
Combine Batch and Stream Processing
Join data streams with static data sets
val dataset = sparkContext.hadoopFile(“file”)
…
kafkaStream.transform{ batchRdd =>
batchRdd.join(dataset).filter(...)
}
Combine ML and Stream Processing
Learn models offline, apply them online
val model = KMeans.train(dataset, …)
kakfaStream.map { event =>
model.predict(event.feature)
}
Combine SQL and Stream Processing
inputStream.foreachRDD{ rdd =>
val df = SQLContext.createDataframe(rdd)
df.select(...).where(...).groupBy(...)
}
Thank you.

More Related Content

What's hot

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introductionleanderlee2
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookThe Hive
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
CRDB - Multi-Master Geo Distributed Redis with Redis Enterprise
CRDB - Multi-Master Geo Distributed Redis with Redis EnterpriseCRDB - Multi-Master Geo Distributed Redis with Redis Enterprise
CRDB - Multi-Master Geo Distributed Redis with Redis EnterpriseElad Ash
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...HostedbyConfluent
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 

What's hot (20)

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark
SparkSpark
Spark
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Mariadb une base de données NewSQL
Mariadb une base de données NewSQLMariadb une base de données NewSQL
Mariadb une base de données NewSQL
 
CRDB - Multi-Master Geo Distributed Redis with Redis Enterprise
CRDB - Multi-Master Geo Distributed Redis with Redis EnterpriseCRDB - Multi-Master Geo Distributed Redis with Redis Enterprise
CRDB - Multi-Master Geo Distributed Redis with Redis Enterprise
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
Real-time Adaptation of Financial Market Events with Kafka | Cliff Cheng and ...
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 

Viewers also liked

Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaSpark Summit
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Edureka!
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...Chris Fregly
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseMapR Technologies
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 

Viewers also liked (8)

Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 

Similar to Spark streaming: Best Practices

Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Spark Summit
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaKnoldus Inc.
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Stephan Ewen
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Spark Barcelona Meetup: Migrating Batch Jobs into Structured Streaming
Spark Barcelona Meetup: Migrating Batch Jobs into Structured StreamingSpark Barcelona Meetup: Migrating Batch Jobs into Structured Streaming
Spark Barcelona Meetup: Migrating Batch Jobs into Structured Streamingmarcgonzalez.eu
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...Databricks
 

Similar to Spark streaming: Best Practices (20)

So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Spark Barcelona Meetup: Migrating Batch Jobs into Structured Streaming
Spark Barcelona Meetup: Migrating Batch Jobs into Structured StreamingSpark Barcelona Meetup: Migrating Batch Jobs into Structured Streaming
Spark Barcelona Meetup: Migrating Batch Jobs into Structured Streaming
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 

Recently uploaded

Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AIIntroduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AISheetal Jain
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualBalamuruganV28
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisDr.Costas Sachpazis
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...archanaece3
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.benjamincojr
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdfAlexander Litvinenko
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxKarpagam Institute of Teechnology
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdfKamal Acharya
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1T.D. Shashikala
 
Microkernel in Operating System | Operating System
Microkernel in Operating System | Operating SystemMicrokernel in Operating System | Operating System
Microkernel in Operating System | Operating SystemSampad Kar
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxCHAIRMAN M
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxMustafa Ahmed
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfEr.Sonali Nasikkar
 
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...Nitin Sonavane
 
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdfALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdfMadan Karki
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdfKamal Acharya
 
Introduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of ArduinoIntroduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of ArduinoAbhimanyu Sangale
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsMathias Magdowski
 
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024EMMANUELLEFRANCEHELI
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...drjose256
 

Recently uploaded (20)

Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AIIntroduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manual
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptx
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 
Microkernel in Operating System | Operating System
Microkernel in Operating System | Operating SystemMicrokernel in Operating System | Operating System
Microkernel in Operating System | Operating System
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptx
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
 
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdfALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
Introduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of ArduinoIntroduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of Arduino
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
 

Spark streaming: Best Practices

  • 1. Spark Streaming: Best Practices Prakash Chockalingam @prakash573
  • 2. Who am I ? 2 ● Solutions Architect / Product Manager at Databricks ● Formerly Netflix, Personalization Infrastructure ● Formerly Yahoo!, Personalized Ad Targeting
  • 3. About Databricks Founded by creators of Spark in 2013 Cloud enterprise data platform - Managed Spark clusters - Interactive data science - Production pipelines - Data governance, security, …
  • 4. Agenda ● Introduction to Spark Streaming ● Lifecycle of a Spark streaming app ● Aggregations and best practices ● Operationalization tips ● Key benefits of Spark streaming
  • 5. What is Spark Streaming?
  • 7.
  • 8. How does it work? ● Receivers receive data streams and chops them in to batches. ● Spark processes the batches and pushes out the results
  • 9. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) Entry point Batch Interval DStream: represents a data stream
  • 10. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) val words = lines.flatMap(_.split(“ “)) Transformations: transform data to create new DStreams
  • 11. Word Count val context = new StreamingContext(conf, Seconds(1)) val lines = context.socketTextStream(...) val words = lines.flatMap(_.split(“ “)) val wordCounts = words.map(x => (x, 1)).reduceByKey(_+_) wordCounts.print() context.start() Print the DStream contents on screen Start the streaming job
  • 12. Lifecycle of a streaming app
  • 13. Execution in any Spark Application
  • 14. Execution in Spark Streaming: Receiving data
  • 15. Execution in Spark Streaming: Processing data
  • 16. End-to-end view 16 t1 = ssc.socketStream(“…”) t2 = ssc.socketStream(“…”) t = t1.union(t2).map(…) t.saveAsHadoopFiles(…) t.map(…).foreach(…) t.filter(…).foreach(…) T U M T M FFE FE FE B U M B M F Input DStreams Output operations RDD Actions / Spark Jobs BlockRDDs DStreamGraph DAG of RDDs every interval DAG of stages every interval Stage 1 Stage 2 Stage 3 Streaming app Tasks every interval B U M B M F B U M B M F Stage 1 Stage 2 Stage 3 Stage 1 Stage 2 Stage 3 Spark Streaming JobScheduler + JobGenerator Spark DAGScheduler Spark TaskScheduler Executors YOU write this
  • 18. Word count over a time window val wordCounts = wordStream.reduceByKeyAndWindow((x: Int, y:Int) => x+y, windowSize, slidingInterval) Parent DStream window size sliding interval Reduces over a time window
  • 19. Word count over a time window Scenario: Word count for the last 30 minutes How to optimize for good performance? ● Increase batch interval, if possible ● Incremental aggregations with inverse reduce function val wordCounts = wordStream.reduceByKeyAndWindow((x: Int, y:Int) => x+y, (x: Int, y: Int) => x-y, windowSize, slidingInterval) ● Checkpointing wordStream.checkpoint(checkpointInterval)
  • 20. Stateful: Global Aggregations Scenario: Maintain a global state based on the input events coming in. Ex: Word count from beginning of time. updateStateByKey (Spark 1.5 and before) ● Performance is proportional to the size of the state. mapWithState (Spark 1.6+) ● Performance is proportional to the size of the batch.
  • 22. Stateful: Global Aggregations Key features of mapWithState: ● An initial state - Read from somewhere as a RDD ● # of partitions for the state - If you have a good estimate of the size of the state, you can specify the # of partitions. ● Partitioner - Default: Hash partitioner. If you have a good understanding of the key space, then you can provide a custom partitioner ● Timeout - Keys whose values are not updated within the specified timeout period will be removed from the state.
  • 23. Stateful: Global Aggregations (Word count) val stateSpec = StateSpec.function(updateState _) .initialState(initialRDD) .numPartitions(100) .partitioner(MyPartitioner()) .timeout(Minutes(120)) val wordCountState = wordStream.mapWithState(stateSpec)
  • 24. Stateful: Global Aggregations (Word count) def updateState(batchTime: Time, key: String, value: Option[Int], state: State[Long]) : Option[(String, Long)] Current batch time A Word in the input stream Current value (= 1) Counts so far for the word The word and its new count
  • 26. Checkpoint Two types of checkpointing: ● Checkpointing Data ● Checkpointing Metadata
  • 27. Checkpoint Data ● Checkpointing DStreams ○ Primarily needed to cut long lineage on past batches (updateStateByKey/reduceByKeyAndWindow). ○ Example: wordStream.checkpoint(checkpointInterval)
  • 28. Checkpoint Metadata ● Checkpointing Metadata ○ All the configuration, DStream operations and incomplete batches are checkpointed. ○ Required for failure recovery if the driver process crashes. ○ Example: streamingContext.checkpoint(directory)
  • 29. Achieving good throughput context.socketStream(...) .map(...) .filter(...) .saveAsHadoopFile(...) Problem: There will be 1 receiver which receives all the data and stores it in its executor and all the processing happens on that executor. Adding more nodes doesn’t help.
  • 30. Achieving good throughput Solution: Increase the # of receivers and union them. ● Each receiver is run in 1 executor. Having 5 receivers will ensure that the data gets received in parallel in 5 executors. ● Data gets distributed in 5 executors. So all the subsequent Spark map/filter operations will be distributed val numStreams = 5 val inputStreams = (1 to numStreams).map(i => context. socketStream(...)) val fullStream = context.union(inputStreams) fullStream.map(...).filter(...).saveAsHadoopFile(...)
  • 31. Achieving good throughput ● In the case of direct receivers (like Kafka), set the appropriate # of partitions in Kafka. ● Each kafka paratition gets mapped to a Spark partition. ● More partitions in Kafka = More parallelism in Spark
  • 32. Achieving good throughput ● Provide the right # of partitions based on your cluster size for operations causing shuffles. words.map(x => (x, 1)).reduceByKey(_+_, 100) # of partitions
  • 33. Debugging a Streaming application Streaming tab in Spark UI
  • 34. Debugging a Streaming application Processing Time ● Make sure that the processing time < batch interval
  • 35. Debugging a Streaming application
  • 36. Debugging a Streaming application Batch Details Page: ● Input to the batch ● Jobs that were run as part of the processing for the batch
  • 37. Debugging a Streaming application Job Details Page ● DAG Visualization ● Stages of a Spark job
  • 38. Debugging a Streaming application Task Details Page Ensure that the tasks are executed on multiple executors (nodes) in your cluster to have enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work though you have more than one executor in your cluster.
  • 39. Key benefits of Spark streaming
  • 41. Fast failure and Straggler recovery
  • 42. Combine Batch and Stream Processing Join data streams with static data sets val dataset = sparkContext.hadoopFile(“file”) … kafkaStream.transform{ batchRdd => batchRdd.join(dataset).filter(...) }
  • 43. Combine ML and Stream Processing Learn models offline, apply them online val model = KMeans.train(dataset, …) kakfaStream.map { event => model.predict(event.feature) }
  • 44. Combine SQL and Stream Processing inputStream.foreachRDD{ rdd => val df = SQLContext.createDataframe(rdd) df.select(...).where(...).groupBy(...) }