Satendra Kumar
Sr. Software Consultant
Knoldus Software LLP
Stream Processing
Topics Covered
➢ What is a stream
➢ What is stream processing
➢ The challenges of stream processing
➢ Overview of Spark Streaming
➢ Receivers
➢ Custom receivers
➢ Transformations on DStreams
➢ Failures
➢ Fault-tolerance Semantics
➢ Kafka Integration
➢ Performance Tuning
What is a Stream
A stream is a sequence of data elements that is made available over time
and can be accessed in sequential order.
E.g., YouTube video buffering.
What is Stream Processing
Stream processing is the real-time processing of data
continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous
infinite stream of data integrated from both live and historical
sources.
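As a minimal sketch of the record-by-record idea, the plain Scala below treats an unbounded iterator as the stream. The names `RecordByRecord`, `sensorReadings` and `evenSquares` are illustrative assumptions, not part of any streaming library.

```scala
object RecordByRecord {
  // Hypothetical unbounded source: an iterator that never runs out of readings.
  def sensorReadings: Iterator[Int] = Iterator.from(1)

  // The processing is defined per record and stays lazy,
  // so nothing is materialised before a record is actually requested.
  def evenSquares(source: Iterator[Int]): Iterator[Int] =
    source.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit =
    // Only the first three results are pulled; the source itself has no end.
    evenSquares(sensorReadings).take(3).foreach(println)
}
```

Because the pipeline is lazy, the consumer decides how much of the infinite source is ever read.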
The challenges of stream processing
➢ Partitioning & Scalability
➢ Semantics & Fault tolerance
➢ Unifying the streams
➢ Time
➢ Re-Processing
Spark Streaming
➢ Provides a way to process live data streams.
➢ Scalable, high-throughput, and fault-tolerant.
➢ Built on top of the core Spark API.
➢ The API is very similar to the Spark core API.
➢ Supports many sources, such as Kafka, Flume, Kinesis, or TCP sockets.
➢ Currently based on RDDs.
Discretized Streams
➢ Spark Streaming provides a high-level abstraction called a discretized stream,
or DStream, which represents a continuous stream of data.
➢ DStreams can be created either from input data streams from
sources such as Kafka, Flume, and Kinesis, or by applying
high-level operations on other DStreams.
➢ A DStream is represented as a sequence of RDDs.
High level overview
Driver Program
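import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object StreamingApp extends App {
  // local[*] runs Spark in-process; the receiver occupies one thread, so at least two are needed
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
  // Batch interval of 5 seconds
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  // Receiver that reads lines of text from a TCP socket
  val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
  val words: DStream[String] = lines.flatMap(_.split(" "))
  val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
  val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
  val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
  // Output operation: prints the first elements of every batch
  wordCounts.print()
  streamingContext.start()
  streamingContext.awaitTermination()
}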
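Parts of the driver program:
➢ Streaming Context: new StreamingContext(sparkConf, Seconds(5))
➢ Batch Interval: Seconds(5)
➢ Receiver: socketTextStream("localhost", 9000)
➢ Transformations on DStreams: flatMap, filter, map, reduceByKey
➢ Output Operations on DStreams: print()
➢ Start the Streaming: start() and awaitTermination()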
Important Points
➢ Once a context has been started, no new streaming computations can
be set up or added to it.
➢ Once a context has been stopped, it cannot be restarted.
➢ Only one StreamingContext can be active in a JVM at the same time.
➢ stop() on StreamingContext also stops the SparkContext. To stop only
the StreamingContext, set the optional parameter of stop() called
stopSparkContext to false.
➢ A SparkContext can be re-used to create multiple StreamingContexts, as
long as the previous StreamingContext is stopped (without stopping the
SparkContext) before the next StreamingContext is created.
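The last two points can be sketched as follows. This assumes Spark Streaming is on the classpath and that something is listening on localhost:9000, as in the driver program; the object name and the 30-second timeout are illustrative choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReuseSparkContext extends App {
  val sparkConf = new SparkConf().setMaster("local[2]").setAppName("ReuseSparkContext")
  val sparkContext = new SparkContext(sparkConf)

  // First streaming context on top of the shared SparkContext.
  val first = new StreamingContext(sparkContext, Seconds(5))
  first.socketTextStream("localhost", 9000).print()
  first.start()
  first.awaitTerminationOrTimeout(30000L) // illustrative: run for about 30 seconds

  // Stop only the streaming side; the SparkContext stays alive.
  first.stop(stopSparkContext = false)

  // A stopped context cannot be restarted, so a new one is built on the same SparkContext.
  val second = new StreamingContext(sparkContext, Seconds(1))
  second.socketTextStream("localhost", 9000).count().print()
  second.start()
  second.awaitTermination()
}
```

The second context may use a different batch interval, because the interval belongs to the StreamingContext rather than to the SparkContext.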
Spark Streaming Concept
➢ Spark Streaming is based on a micro-batch architecture.
➢ Spark Streaming continuously receives live input data streams and divides
the data into batches.
➢ New batches are created at regular time intervals, called the batch interval.
➢ Each batch has N blocks, where N = batch interval / block interval.
For example, if the batch interval is 1 second and the block interval is 200 ms
(the default), then each batch has 5 blocks.
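The arithmetic can be written down as a small helper. This is plain Scala, not a Spark API; `blocksPerBatch` is a hypothetical name, and its default value mirrors the 200 ms default of the `spark.streaming.blockInterval` setting.

```scala
object BlocksPerBatch {
  // N = batch interval / block interval, both given in milliseconds.
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long = 200L): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    println(blocksPerBatch(1000L))       // 1 s batch, 200 ms blocks
    println(blocksPerBatch(5000L, 500L)) // 5 s batch, 500 ms blocks
  }
}
```

Each block received during a batch interval becomes one partition of that batch's RDD, so N also approximates the number of tasks per receiver per batch.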
Transforming DStream
➢ DStream is represented by a continuous series of RDDs
➢ Each RDD in a DStream contains data from a certain interval
➢ Any operation applied on a DStream translates to operations on the
underlying RDDs
➢ The processing time of a batch should be less than or equal to the batch
interval.
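To illustrate how an operation on a DStream becomes the same operation on every underlying RDD, here is a toy model in plain Scala. `Batches` is a hypothetical stand-in for a DStream and each inner `Vector` stands in for one RDD; none of this is Spark code.

```scala
object MiniDStream {
  // Toy model: a "DStream" is a sequence of batches, one batch per interval.
  final case class Batches[T](batches: Vector[Vector[T]]) {
    // Every operation is simply applied to each batch in turn.
    def map[U](f: T => U): Batches[U] = Batches(batches.map(_.map(f)))
    def filter(p: T => Boolean): Batches[T] = Batches(batches.map(_.filter(p)))
    def flatMap[U](f: T => Iterable[U]): Batches[U] = Batches(batches.map(_.flatMap(f)))
  }

  def main(args: Array[String]): Unit = {
    val lines = Batches(Vector(Vector("a b", "c"), Vector("d")))
    val words = lines.flatMap(_.split(" ").toList)
    // The batch boundaries survive the transformation.
    println(words.batches)
  }
}
```

The batch boundaries are never crossed, which is why operations that combine data across batches need state or windows.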
Transformations on DStreams
def map[U: ClassTag](mapFunc: T => U): DStream[U]
def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U]
def filter(filterFunc: T => Boolean): DStream[T]
def reduce(reduceFunc: (T, T) => T): DStream[T]
def count(): DStream[Long]
def repartition(numPartitions: Int): DStream[T]
def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)]
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
Transformations on PairDStream
def groupByKey(): DStream[(K, Iterable[V])]
def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))]
def updateStateByKey[S: ClassTag](
  updateFunc: (Seq[V], Option[S]) => Option[S], partitioner: Partitioner): DStream[(K, S)]
def cogroup[W: ClassTag](
  other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))]
def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]
def leftOuterJoin[W: ClassTag](
  other: DStream[(K, W)], numPartitions: Int): DStream[(K, (V, Option[W]))]
def rightOuterJoin[W: ClassTag](
  other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
updateStateByKey
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object StreamingApp extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  // Stateful transformations require a checkpoint directory
  streamingContext.checkpoint(".")
  val lines = streamingContext.socketTextStream("localhost", 9000)
  val words: DStream[String] = lines.flatMap(_.split(" "))
  val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
  val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
  // Running count per word: this batch's values plus the previous state
  val updatedState: DStream[(String, Int)] =
    pairs.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
    }
  updatedState.print()
  streamingContext.start()
  streamingContext.awaitTermination()
}
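The update function can be traced without a cluster. The sketch below is plain Scala that replays three hand-made batches through the same update function; `RunningCount` and `step` are hypothetical names, and the model ignores partitioning and checkpointing.

```scala
object RunningCount {
  // Same shape as the function passed to updateStateByKey.
  def update(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  // Applies one batch of (word, 1) pairs to the previous state, key by key.
  // Keys that already have state are updated even when the batch has no values for them.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { key =>
      // Returning None from update would drop the key from the state.
      update(grouped.getOrElse(key, Seq.empty), state.get(key)).map(key -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("a" -> 1, "b" -> 1, "a" -> 1),
      Seq("b" -> 1),
      Seq("c" -> 1, "a" -> 1))
    println(batches.foldLeft(Map.empty[String, Int])(step))
  }
}
```

Folding the batches in order plays the role of the state that the streaming job carries from one batch interval to the next.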
Window Operations
Spark Streaming also provides windowed computations, which allow
you to apply transformations over a sliding window of data.
A window operation takes two parameters:
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
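A rough model of the two parameters, in plain Scala. It assumes a batch interval of 5 seconds, a window length of 15 seconds and a sliding interval of 10 seconds; `SlidingWindowSketch` and `windows` are hypothetical names, and the exact alignment of window boundaries in the real scheduler may differ from this toy version.

```scala
object SlidingWindowSketch {
  // batches(i) holds the records of the batch that ends at (i + 1) * batchSec.
  // Returns pairs of (window end time in seconds, records inside that window).
  def windows[T](batches: Vector[Seq[T]], batchSec: Int, windowSec: Int, slideSec: Int): Vector[(Int, Seq[T])] = {
    require(windowSec % batchSec == 0 && slideSec % batchSec == 0,
      "window length and sliding interval must be multiples of the batch interval")
    val batchesPerWindow = windowSec / batchSec
    val batchesPerSlide = slideSec / batchSec
    (batchesPerSlide to batches.size by batchesPerSlide).toVector.map { end =>
      // The first windows are shorter because not enough batches exist yet.
      val start = math.max(0, end - batchesPerWindow)
      (end * batchSec, batches.slice(start, end).flatten)
    }
  }

  def main(args: Array[String]): Unit = {
    val batches = Vector(Seq(1), Seq(2), Seq(3), Seq(4), Seq(5), Seq(6))
    windows(batches, batchSec = 5, windowSec = 15, slideSec = 10).foreach(println)
  }
}
```

With these numbers each full window spans three batches and consecutive windows overlap by one batch, so a record can contribute to more than one result.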
Window Operations
def window(windowDuration: Duration): DStream[T]
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
def reduceByWindow(reduceFunc: (T, T) => T,
  windowDuration: Duration, slideDuration: Duration): DStream[T]
def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long]
def countByValueAndWindow(windowDuration: Duration,
  slideDuration: Duration, numPartitions: Int): DStream[(T, Long)]
// PairDStream operations
def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])]
def groupByKeyAndWindow(windowDuration: Duration,
  slideDuration: Duration): DStream[(K, Iterable[V])]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V,
  windowDuration: Duration, slideDuration: Duration): DStream[(K, V)]
Window Operations
pairs.window(Seconds(15), Seconds(10))
filteredWords.reduceByWindow((a, b) => a +", "+ b, Seconds(15), Seconds(10))
pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
Output Operations on DStreams
def print(num: Int): Unit
def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
def saveAsTextFiles(prefix: String, suffix: String = ""): Unit
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String,suffix: String): Unit
Receivers
Spark Streaming has two kinds of receivers:
1) Reliable Receiver - A reliable receiver sends an acknowledgment to a
reliable source once the data has been received and stored in Spark with
replication.
2) Unreliable Receiver - An unreliable receiver does not send an acknowledgment
to the source.
Custom Receiver
A custom receiver must extend the abstract Receiver class and implement
its two abstract methods:
def onStart(): Unit // Things to do to start receiving data
def onStop(): Unit // Things to do to stop receiving data
Custom Receiver
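import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // onStart must not block, so the file is read on a separate thread
  def onStart(): Unit = {
    new Thread("File Reader") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to clean up: the reading thread ends once isStopped returns true
  def onStop(): Unit = {}

  private def receive(): Unit =
    try {
      println("Reading file " + path)
      val reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))
      var userInput = reader.readLine()
      while (!isStopped && Option(userInput).isDefined) {
        store(userInput) // hand the line over to Spark
        userInput = reader.readLine()
      }
      reader.close()
      println("Stopped receiving")
      // restart() stops and starts the receiver again, so the file is read again from the beginning
      restart("Trying to connect again")
    } catch {
      case ex: Exception =>
        restart("Error reading file " + path, ex)
    }
}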
Custom Receiver
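import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// No master is set here, so the application is expected to be launched with spark-submit
object CustomReceiver extends App {
  val sparkConf = new SparkConf().setAppName("CustomReceiver")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  // args(0) is the path of the file to read
  val lines = ssc.receiverStream(new CustomReceiver(args(0)))
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}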
Failure is everywhere
Fault-tolerance Semantics
A streaming system aims to guarantee zero data loss despite any kind of
failure in the system. The strength of the guarantee is described by its semantics:
➢ At least once - Each record will be processed one or more times.
➢ Exactly once - Each record will be processed exactly once: no data will be lost and no
data will be processed multiple times.
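The difference shows up when a record is replayed after a failure. The plain Scala sketch below is an illustration under assumptions, not Spark code: `Record`, `sumAll` and `sumIdempotent` are hypothetical names, and the idempotent sink is modelled as a map keyed by record id.

```scala
object DeliverySemantics {
  final case class Record(id: Long, amount: Int)

  // Non-idempotent sink: every delivery is added, so a replayed record is counted twice.
  def sumAll(deliveries: Seq[Record]): Int =
    deliveries.map(_.amount).sum

  // Idempotent sink: writes are keyed by record id, so a replay overwrites its earlier copy.
  def sumIdempotent(deliveries: Seq[Record]): Int =
    deliveries.map(r => r.id -> r.amount).toMap.values.sum

  def main(args: Array[String]): Unit = {
    // Record 2 is delivered twice, as can happen under at-least-once delivery.
    val deliveries = Seq(Record(1, 10), Record(2, 20), Record(2, 20), Record(3, 5))
    println(sumAll(deliveries))
    println(sumIdempotent(deliveries))
  }
}
```

At-least-once delivery combined with an idempotent sink gives results that look exactly-once from the outside, which is one common way to reach that guarantee end to end.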
Kinds of Failure
There are two kinds of failure:
➢ Executor failure
1) Data received and replicated
2) Data received but not replicated
➢ Driver failure
Executor failure
Would data be lost?
Executor with WAL
Enable write ahead logs
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object Streaming2App extends App {
  // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
  val checkpointDirectory = "checkpointDir"
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
  // Enable the write ahead log for received data
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val streamingContext = new StreamingContext(sparkConf, Seconds(5))
  // Enable checkpointing, which the write ahead log requires
  streamingContext.checkpoint(checkpointDirectory)
  val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
  val words: DStream[String] = lines.flatMap(_.split(" "))
  val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
  val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
  val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
  wordCounts.print(20)
  streamingContext.start()
  streamingContext.awaitTermination()
}
Enable write ahead logs
1) Enable checkpointing first, since the WAL requires it
- streamingContext.checkpoint(checkpointDirectory)
2) Enable the WAL in the Spark configuration
- sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3) The receiver should be reliable
- Acknowledge the source only after the data has been saved to the WAL
- Unacknowledged data will be replayed from the source by the restarted receiver
4) Disable in-memory replication (the data is already replicated by HDFS)
- Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream
Driver failure
How to recover from this failure?
Driver with checkpointing
DStream checkpointing: periodically save the DAG of
DStreams to fault-tolerant storage.
Recover from Driver failure
1) Configure automatic driver restart
- All cluster managers support this
2) Set a checkpoint directory
- The directory should be on a fault-tolerant, reliable file system (e.g., HDFS, S3)
- streamingContext.checkpoint(checkpointDirectory)
3) Restart the driver from the checkpoint
Configure Automatic driver restart
Spark Standalone
- Use spark-submit with "cluster" deploy mode and the --supervise flag
YARN
- Use spark-submit with "cluster" deploy mode
Mesos
- Marathon can restart applications, or use the --supervise flag
Configure Checkpointing
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object RecoverableWordCount {
  // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
  val checkpointDirectory = "checkpointDir"

  // Builds a fresh context with the whole DStream graph defined inside
  def createContext(): StreamingContext = {
    val sparkConf = new SparkConf().setAppName("StreamingApp")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))
    streamingContext.checkpoint(checkpointDirectory)
    val lines = streamingContext.socketTextStream("localhost", 9000)
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
    val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
    val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
    wordCounts.print(20)
    streamingContext
  }
}
Restart the Driver Using Checkpointing
import org.apache.spark.streaming.StreamingContext

object StreamingApp extends App {
  import RecoverableWordCount._
  // Rebuilds the context from checkpoint data if present, otherwise calls createContext
  val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
  // do other operations
  streamingContext.start()
  streamingContext.awaitTermination()
}
Checkpointing
There are two types of data that are checkpointed.
1) Metadata checkpointing
-Configuration
-DStream operations
-Incomplete batches
2) Data checkpointing
- Saving of the generated RDDs to reliable storage. This is necessary in some stateful
transformations that combine data across multiple batches.
Checkpointing Latency
➔ Checkpointing of RDDs incurs the cost of saving to reliable storage, so the
checkpoint interval needs to be set carefully.
dstream.checkpoint(Seconds(batchIntervalSeconds * 10))
➔ A checkpoint interval of 5 to 10 sliding intervals of a DStream is a good setting to try.
Fault-tolerance Semantics
Spark Streaming & Kafka Integration
Why Kafka ?
➢ Velocity & volume of streaming data
➢ Reprocessing of streaming
➢ Reliable receiver complexity
➢ Checkpoint complexity
➢ Upgrading Application Code
Kafka Integration
There are two approaches to integrate Kafka with Spark Streaming:
➢ Receiver-based Approach
➢ Direct Approach
Receiver-based Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Receiver-based Approach
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverBasedStreaming extends App {
  val group = "streaming-test-group"
  val zkQuorum = "localhost:2181"
  // Topic name -> number of consumer threads used by the receiver
  val topics = Map("streaming_queue" -> 1)
  val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp")
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  // The write ahead log needs a checkpoint directory on fault-tolerant storage
  ssc.checkpoint("checkpointDir")
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
    .map { case (key, message) => message }
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}
Direct Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Direct Approach
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka._

object KafkaDirectStreaming extends App {
  val brokers = "localhost:9092"
  val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  ssc.checkpoint("checkpointDir") // offset recovery
  val topics = Set("streaming_queue")
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val messages: InputDStream[(String, String)] =
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  val lines = messages.map { case (key, message) => message }
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}
Direct Approach
Direct Approach has the following advantages over the receiver-based approach:
➢ Simplified Parallelism
➢ Efficiency
➢ Exactly-once semantics
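The exactly-once point rests on the application being able to see which Kafka offsets each batch covers. A minimal sketch, assuming the same Kafka 0.8 integration package, broker address and topic name as the direct-approach program; the object name is illustrative, and printing stands in for whatever store would hold the offsets.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamOffsets extends App {
  val sparkConf = new SparkConf().setAppName("DirectStreamOffsets")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("streaming_queue"))

  messages.foreachRDD { rdd =>
    // The cast only works on the RDD produced by the direct stream itself,
    // before any transformation has been applied to it.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetRanges.foreach { range =>
      println(s"${range.topic} partition ${range.partition}: ${range.fromOffset} until ${range.untilOffset}")
    }
  }

  ssc.start()
  ssc.awaitTermination()
}
```

Saving these ranges together with the batch results, or writing the results idempotently, is what turns the offset tracking into an end-to-end exactly-once guarantee.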
Performance Tuning
For best performance of a Spark Streaming application we need to
consider two things:
➢ Reducing the Batch Processing Times
➢ Setting the Right Batch Interval
Reducing the Batch Processing Times
➢ Level of Parallelism in Data Receiving
➢ Level of Parallelism in Data Processing
➢ Data Serialization
-Input data
-Persisted RDDs generated by Streaming Operations
➢ Task Launching Overheads
-Running Spark in Standalone mode or coarse-grained Mesos mode leads
to better task launch times.
Setting the Right Batch Interval
➢ Batch processing time should be less than the batch interval.
➢ Memory Tuning
-Persistence Level of DStreams
-Clearing old data
-CMS Garbage Collector
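Why the processing time has to stay below the batch interval can be seen in a small queueing model. This is plain Scala under the assumption that batches are processed one at a time in arrival order; `BacklogSketch` and `schedulingDelays` are hypothetical names.

```scala
object BacklogSketch {
  // Scheduling delay of each batch: how long it waits for earlier batches to finish.
  def schedulingDelays(batchIntervalMs: Long, processingTimeMs: Long, batches: Int): Seq[Long] = {
    var freeAt = 0L // time at which the single processing slot becomes free
    (1 to batches).map { i =>
      val readyAt = i * batchIntervalMs       // batch i is complete at this time
      val startAt = math.max(readyAt, freeAt) // it cannot start before the previous batch ends
      freeAt = startAt + processingTimeMs
      startAt - readyAt
    }
  }

  def main(args: Array[String]): Unit = {
    println(schedulingDelays(1000L, 800L, 4))  // processing keeps up with the input
    println(schedulingDelays(1000L, 1200L, 4)) // the delay grows with every batch
  }
}
```

When processing is slower than the batch interval the delay never recovers, so either the interval has to grow or the batch processing time has to come down.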
Code samples
https://github.com/knoldus/spark-streaming-meetup
https://github.com/knoldus/real-time-stream-processing-engine
https://github.com/knoldus/kafka-tweet-producer
Questions & DStream[Answer]
References
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
http://spark.apache.org/docs/latest/tuning.html
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Thanks
Presenters:
@_satendrakumar
Organizer:
@knolspeak
http://www.knoldus.com

More Related Content

What's hot

ksqlDB로 시작하는 스트림 프로세싱
ksqlDB로 시작하는 스트림 프로세싱ksqlDB로 시작하는 스트림 프로세싱
ksqlDB로 시작하는 스트림 프로세싱confluent
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
PGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander Kukushkin
PGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander KukushkinPGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander Kukushkin
PGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander KukushkinEqunix Business Solutions
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream ProcessingSuneel Marthi
 
Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...
Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...
Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...AWSKRUG - AWS한국사용자모임
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Deep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature AnnouncementsDeep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature AnnouncementsAmazon Web Services
 
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...HostedbyConfluent
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
AWS Customer Presentation - WeoGeo
AWS Customer Presentation - WeoGeo AWS Customer Presentation - WeoGeo
AWS Customer Presentation - WeoGeo Amazon Web Services
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKai Wähner
 
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksIntroducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksAmazon Web Services
 

What's hot (20)

ksqlDB로 시작하는 스트림 프로세싱
ksqlDB로 시작하는 스트림 프로세싱ksqlDB로 시작하는 스트림 프로세싱
ksqlDB로 시작하는 스트림 프로세싱
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
PGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander Kukushkin
PGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander KukushkinPGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander Kukushkin
PGConf.ASIA 2019 Bali - Patroni in 2019 - Alexander Kukushkin
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...
Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...
Athena & Step Function 으로 통계 파이프라인 구축하기 - 변규현 (당근마켓) :: AWS Community Day Onl...
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Deep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature AnnouncementsDeep Dive on Amazon Aurora - Covering New Feature Announcements
Deep Dive on Amazon Aurora - Covering New Feature Announcements
 
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
AWS Customer Presentation - WeoGeo
AWS Customer Presentation - WeoGeo AWS Customer Presentation - WeoGeo
AWS Customer Presentation - WeoGeo
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
 
Elastic-Engineering
Elastic-EngineeringElastic-Engineering
Elastic-Engineering
 
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech TalksIntroducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
Introducing Amazon Aurora with PostgreSQL Compatibility - AWS Online Tech Talks
 

Meet Up - Spark Stream Processing + Kafka

  • 1. Satendra Kumar Sr. Software Consultant Knoldus Software LLP Stream Processing
  • 2. Topics Covered ➢ What is a Stream ➢ What is Stream processing ➢ The challenges of stream processing ➢ Overview of Spark Streaming ➢ Receivers ➢ Custom receivers ➢ Transformations on DStreams ➢ Failures ➢ Fault-tolerance Semantics ➢ Kafka Integration ➢ Performance Tuning
  • 3. What is a Stream A stream is a sequence of data elements that are made available over time and can be accessed in sequential order. E.g. YouTube video buffering.
  • 4. What is Stream processing Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion. It treats data not as static tables or files, but as a continuous infinite stream of data integrated from both live and historical sources.
  • 5. ➢ Partitioning & Scalability ➢ Semantics & Fault tolerance ➢ Unifying the streams ➢ Time ➢ Re-Processing The challenges of stream processing
  • 6. Spark Streaming ➢ Provides a way to process the live data streams. ➢ Scalable, high-throughput, fault-tolerant. ➢ Built top of core Spark API. ➢ API is very similar to Spark core API. ➢ Supports many sources like Kafka, Flume, Kinesis or TCP sockets. ➢ Currently based on RDDs.
  • 11. Discretized Streams ➢ It provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data; ➢ DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high- level operations on other Dstreams. ➢ DStream is represented as a sequence of RDDs.
  • 20-26. Driver Program
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

      object StreamingApp extends App {
        // Streaming Context, created with a Batch Interval of 5 seconds
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
        val streamingContext = new StreamingContext(sparkConf, Seconds(5))

        // Receiver: reads lines of text from a TCP socket
        val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)

        // Transformations on DStreams
        val words: DStream[String] = lines.flatMap(_.split(" "))
        val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
        val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
        val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)

        // Output Operations on DStreams
        wordCounts.print()

        // Start the Streaming
        streamingContext.start()
        streamingContext.awaitTermination()
      }
  • 27. Important Points ➢ Once a context has been started, no new streaming computations can be set up or added to it. ➢ Once a context has been stopped, it cannot be restarted. ➢ Only one StreamingContext can be active in a JVM at the same time. ➢ stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false. ➢ A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
  • 28. Spark Streaming Concept ➢ Spark Streaming is based on a micro-batch architecture. ➢ Spark Streaming continuously receives live input data streams and divides the data into batches. ➢ New batches are created at regular time intervals, called the batch interval. ➢ Each batch has N blocks, where N = batch interval / block interval. E.g. if the batch interval is 1 second and the block interval is 200 ms (the default), then each batch has 5 blocks.
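The block arithmetic on this slide can be written out as a small helper. This is only a sketch in plain Scala with no Spark dependency; the object and method names are hypothetical, and the 200 ms default is the value quoted on the slide.

```scala
object BlockMath {
  /** Number of blocks a receiver produces in one batch: batch interval / block interval. */
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long = 200L): Long = {
    require(batchIntervalMs > 0 && blockIntervalMs > 0, "intervals must be positive")
    batchIntervalMs / blockIntervalMs
  }
}
```

For the slide's example, `BlockMath.blocksPerBatch(1000L)` gives 5. In Spark itself the block interval is controlled by the `spark.streaming.blockInterval` setting.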
  • 33. Transforming DStream ➢ A DStream is represented by a continuous series of RDDs. ➢ Each RDD in a DStream contains data from a certain interval. ➢ Any operation applied on a DStream translates to operations on the underlying RDDs. ➢ The processing time of a batch should be less than or equal to the batch interval.
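To make "an operation on a DStream is an operation on each underlying RDD" concrete, here is a rough model that uses plain Scala collections instead of Spark: a stream is a sequence of batches, and the word count from the driver program is applied to each batch separately. `MicroBatchModel`, `Batch` and `perBatch` are hypothetical names for illustration, not Spark APIs.

```scala
object MicroBatchModel {
  // One collection of lines per batch interval stands in for one RDD.
  type Batch[A] = Seq[A]

  /** Word count for a single batch, mirroring the steps of the driver program. */
  def wordCount(batch: Batch[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))
      .filter(_.trim.nonEmpty)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, ones) => (word, ones.map(_._2).sum) }

  /** Applying an operation to the stream means applying it to every batch. */
  def perBatch[A, B](stream: Seq[Batch[A]])(op: Batch[A] => B): Seq[B] =
    stream.map(op)
}
```

Note that the counts are per batch: a word seen in two different batches is counted twice, once in each result, which is why stateful operations such as updateStateByKey exist.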
  • 34. Transformations on DStreams
      def map[U: ClassTag](mapFunc: T => U): DStream[U]
      def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U]
      def filter(filterFunc: T => Boolean): DStream[T]
      def reduce(reduceFunc: (T, T) => T): DStream[T]
      def count(): DStream[Long]
      def repartition(numPartitions: Int): DStream[T]
      def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)]
      def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
  • 35. Transformations on PairDStream
      def groupByKey(): DStream[(K, Iterable[V])]
      def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
      def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))]
      def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S], partitioner: Partitioner): DStream[(K, S)]
      def cogroup[W: ClassTag](other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))]
      def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]
      def leftOuterJoin[W: ClassTag](other: DStream[(K, W)], numPartitions: Int): DStream[(K, (V, Option[W]))]
      def rightOuterJoin[W: ClassTag](other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
  • 36. updateStateByKey
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.DStream

      object StreamingApp extends App {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
        val streamingContext = new StreamingContext(sparkConf, Seconds(5))
        // Stateful transformations require a checkpoint directory
        streamingContext.checkpoint(".")

        val lines = streamingContext.socketTextStream("localhost", 9000)
        val words: DStream[String] = lines.flatMap(_.split(" "))
        val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
        val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))

        // Running count per word: add this batch's values to the previous state
        val updatedState: DStream[(String, Int)] = pairs.updateStateByKey[Int] {
          (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
        }

        updatedState.print()
        streamingContext.start()
        streamingContext.awaitTermination()
      }
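The update function above is an ordinary pure function, so its behaviour across batches can be tried without a cluster. The sketch below models the state as a plain `Map` and folds batches into it; `RunningCount` and `step` are hypothetical names, and the model assumes (as in Spark) that the function is called for every key that has either new values or existing state.

```scala
object RunningCount {
  /** Same shape as the function passed to updateStateByKey[Int] on the slide. */
  def update(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  /** Folds one batch of (word, 1) pairs into the state kept from earlier batches. */
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (key, pairs) => (key, pairs.map(_._2)) }
    val keys = state.keySet ++ grouped.keySet
    // A key is dropped from the state if update returns None.
    keys.flatMap { key =>
      update(grouped.getOrElse(key, Seq.empty), state.get(key)).map(value => (key, value))
    }.toMap
  }
}
```

Folding two batches, `Seq(("a", 1), ("b", 1), ("a", 1))` and then `Seq(("b", 1))`, leaves both "a" and "b" at 2: "a" keeps its count even though it did not appear in the second batch.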
  • 37. Window Operations Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. A window operation takes two parameters: ● window length - the duration of the window. ● sliding interval - the interval at which the window operation is performed.
  • 38. Window Operations
      def window(windowDuration: Duration): DStream[T]
      def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
      def reduceByWindow(reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration): DStream[T]
      def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long]
      def countByValueAndWindow(windowDuration: Duration, slideDuration: Duration, numPartitions: Int): DStream[(T, Long)]
      // pairDStream operations
      def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])]
      def groupByKeyAndWindow(windowDuration: Duration, slideDuration: Duration): DStream[(K, Iterable[V])]
      def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
      def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration, slideDuration: Duration): DStream[(K, V)]
  • 39. Window Operations
      pairs.window(Seconds(15), Seconds(10))
      filteredWords.reduceByWindow((a, b) => a + ", " + b, Seconds(15), Seconds(10))
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
  • 40. Output Operations on DStreams
      def print(num: Int): Unit
      def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
      def saveAsTextFiles(prefix: String, suffix: String = ""): Unit
      def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
      def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit
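To see which batches end up in each window, the following plain-Scala sketch models a window of 15 seconds sliding every 10 seconds over 5-second batches, i.e. a window length of 3 batches and a slide of 2 batches. `WindowModel` is a hypothetical helper, not a Spark API; it assumes that window length and sliding interval are multiples of the batch interval and that a window ending at time t covers the batches in (t - window length, t].

```scala
object WindowModel {
  /**
   * Groups per-batch values into sliding windows.
   * windowLen and slide are expressed in number of batches.
   */
  def windows[A](batches: Vector[A], windowLen: Int, slide: Int): Vector[Vector[A]] = {
    require(windowLen > 0 && slide > 0, "window length and slide must be positive")
    // A window is emitted every `slide` batches and looks back `windowLen` batches.
    (slide to batches.length by slide).map { end =>
      batches.slice(math.max(0, end - windowLen), end)
    }.toVector
  }
}
```

With six batches holding the values 1 to 6, `WindowModel.windows(Vector(1, 2, 3, 4, 5, 6), 3, 2)` yields the windows (1, 2), (2, 3, 4) and (4, 5, 6). The first window is shorter because only two batches exist at that point, and consecutive windows overlap by one batch because the window is longer than the slide.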
  • 41. Receivers Spark Streaming has two kinds of receivers: 1) Reliable receiver - a reliable receiver sends an acknowledgment to a reliable source once the data has been received and stored in Spark with replication. 2) Unreliable receiver - an unreliable receiver does not send an acknowledgment to the source.
  • 42. Custom Receiver A custom receiver must extend the abstract Receiver class and implement its two abstract methods:
      def onStart(): Unit // Things to do to start receiving data
      def onStop(): Unit  // Things to do to stop receiving data
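  • 43. Custom Receiver
      import java.io.{BufferedReader, FileInputStream, InputStreamReader}
      import java.nio.charset.StandardCharsets
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.receiver.Receiver

      class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

        // Must not block: start a thread that does the actual receiving
        def onStart(): Unit = {
          new Thread("File Reader") {
            override def run(): Unit = receive()
          }.start()
        }

        // Nothing to do: the reading thread stops by itself once isStopped returns true
        def onStop(): Unit = {}

        // Reads the file line by line and stores each line in Spark
        private def receive(): Unit =
          try {
            println("Reading file " + path)
            val reader = new BufferedReader(
              new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))
            var userInput = reader.readLine()
            while (!isStopped && Option(userInput).isDefined) {
              store(userInput)
              userInput = reader.readLine()
            }
            reader.close()
            println("Stopped receiving")
            // Restarting runs onStart() again, so the file is read again from the beginning
            restart("Trying to connect again")
          } catch {
            case ex: Exception => restart("Error reading file " + path, ex)
          }
      }
  • 44. Custom Receiver
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object CustomReceiver extends App {
        val sparkConf = new SparkConf().setAppName("CustomReceiver")
        val ssc = new StreamingContext(sparkConf, Seconds(1))
        // args(0) is the path of the file to read
        val lines = ssc.receiverStream(new CustomReceiver(args(0)))
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }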
  • 46. Fault-tolerance Semantics A streaming system should guarantee zero data loss despite any kind of failure in the system. ➢ At least once - each record will be processed one or more times. ➢ Exactly once - each record will be processed exactly once: no data will be lost and no data will be processed multiple times.
  • 47. Kinds of Failure There are two kinds of failure: ➢ Executor failure 1) Data received and replicated 2) Data received but not replicated ➢ Driver failure
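Why a reliable receiver gives at-least-once rather than exactly-once behaviour can be shown with a toy model. The sketch below is plain Scala and does not use Spark: `Source`, `receive` and the `crashBeforeAck` flag are hypothetical names that stand in for a replayable source and a receiver that stores data first and acknowledges afterwards.

```scala
object ReliableReceiverModel {
  /** A replayable source: anything not yet acknowledged is delivered again. */
  final class Source(messages: Vector[String]) {
    private var acked = 0
    def unacknowledged: Vector[String] = messages.drop(acked)
    def acknowledge(count: Int): Unit = acked += count
  }

  /**
   * One receive attempt: store the data first, acknowledge afterwards.
   * If the receiver crashes between the two steps, nothing is acknowledged.
   */
  def receive(source: Source, stored: Vector[String], crashBeforeAck: Boolean): Vector[String] = {
    val batch = source.unacknowledged
    val updated = stored ++ batch // "store" step
    if (!crashBeforeAck) source.acknowledge(batch.size)
    updated
  }
}
```

If the first attempt crashes after storing "a" and "b" but before acknowledging them, the retry receives both records again. Nothing is lost, but the stored data now contains each record twice, which is exactly the at-least-once guarantee.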
  • 56-57. Enable write ahead logs
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

      object Streaming2App extends App {
        // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
        val checkpointDirectory = "checkpointDir"
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
        // Enable write ahead logs
        sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
        val streamingContext = new StreamingContext(sparkConf, Seconds(5))
        // Enable checkpointing
        streamingContext.checkpoint(checkpointDirectory)

        val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
        val words: DStream[String] = lines.flatMap(_.split(" "))
        val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
        val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
        val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
        wordCounts.print(20)

        streamingContext.start()
        streamingContext.awaitTermination()
      }
  • 58. Enable write ahead logs 1) Enable checkpointing first, since the WAL needs it - streamingContext.checkpoint(checkpointDirectory) 2) Enable the WAL in the Spark configuration - sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") 3) The receiver should be reliable - Acknowledge the source only after the data is saved to the WAL - Unacknowledged data will be replayed from the source by the restarted receiver 4) Disable in-memory replication (the data is already replicated by HDFS) - Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream
  • 63. Driver failure How do we recover from this failure?
  • 64. Driver with checkpointing DStream checkpointing: periodically save the DAG of DStreams to fault-tolerant storage.
  • 68. Recover from Driver failure 1) Configure automatic driver restart - All cluster managers support this 2) Set a checkpoint directory - The directory should be on a fault-tolerant, reliable file system (e.g. HDFS, S3) - streamingContext.checkpoint(checkpointDirectory) 3) Restart the driver from the checkpoint
  • 69. Configure Automatic driver restart Spark Standalone - use spark-submit with "cluster" deploy mode and the --supervise flag. YARN - use spark-submit with "cluster" deploy mode. Mesos - Marathon can restart applications, or use the --supervise flag.
  • 70. Configure Checkpointing
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.DStream

      object RecoverableWordCount {
        // Should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
        val checkpointDirectory = "checkpointDir"

        // Builds a new context; used only when no checkpoint exists yet
        def createContext(): StreamingContext = {
          val sparkConf = new SparkConf().setAppName("StreamingApp")
          val streamingContext = new StreamingContext(sparkConf, Seconds(1))
          streamingContext.checkpoint(checkpointDirectory)
          val lines = streamingContext.socketTextStream("localhost", 9000)
          val words: DStream[String] = lines.flatMap(_.split(" "))
          val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
          val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
          val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
          wordCounts.print(20)
          streamingContext
        }
      }
  • 71-72. Restart the driver from the checkpoint
      import org.apache.spark.streaming.StreamingContext

      object StreamingApp extends App {
        import RecoverableWordCount._
        // Recreate the context from checkpoint data if present, otherwise build a new one
        val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
        // do other operations
        streamingContext.start()
        streamingContext.awaitTermination()
      }
  • 73. Checkpointing There are two types of data that are checkpointed. 1) Metadata checkpointing -Configuration -DStream operations -Incomplete batches 2) Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches.
  • 74. Checkpointing Latency ➔ Checkpointing of RDDs incurs the cost of saving to reliable storage, so the checkpoint interval needs to be set carefully, e.g. to ten batch intervals: dstream.checkpoint(Seconds(10 * batchIntervalInSeconds)), where batchIntervalInSeconds is the batch interval in seconds. ➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
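The two rules of thumb for the checkpoint interval can be turned into arithmetic. This is a sketch in plain Scala with hypothetical names; `defaultMs` reads the default described in the Editor's Notes ("a multiple of the batch interval that is at least 10 seconds") as the smallest such multiple, which is an assumption about the exact rounding.

```scala
object CheckpointInterval {
  /** Smallest multiple of the batch interval that is at least 10 seconds (assumed default). */
  def defaultMs(batchIntervalMs: Long): Long = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    val multiples = math.ceil(10000.0 / batchIntervalMs).toLong
    batchIntervalMs * multiples
  }

  /** Range suggested on the slide: 5 to 10 sliding intervals of the DStream. */
  def suggestedRangeMs(slideIntervalMs: Long): (Long, Long) =
    (5 * slideIntervalMs, 10 * slideIntervalMs)
}
```

For a 1-second batch interval the assumed default is 10 seconds, and for a 4-second batch interval it is 12 seconds. For a DStream with a 2-second sliding interval, the suggested range is 10 to 20 seconds.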
  • 81. Spark Streaming & Kafka Integration
  • 82. Why Kafka ? ➢ Velocity & volume of streaming data ➢ Reprocessing of streaming ➢ Reliable receiver complexity ➢ Checkpoint complexity ➢ Upgrading Application Code
  • 83. Kafka Integration There are two approaches to integrate Kafka with Spark Streaming: ➢ Receiver-based Approach ➢ Direct Approach
  • 85. Receiver-based Approach
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object ReceiverBasedStreaming extends App {
        val group = "streaming-test-group"
        val zkQuorum = "localhost:2181"
        // Topic name -> number of consumer threads
        val topics = Map("streaming_queue" -> 1)

        val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp")
        sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        // The write ahead log needs a checkpoint directory (see slide 58)
        ssc.checkpoint("checkpointDir")

        val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
          .map { case (key, message) => message }
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
  • 87. Direct Approach
      import kafka.serializer.StringDecoder
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming._
      import org.apache.spark.streaming.dstream.InputDStream
      import org.apache.spark.streaming.kafka._

      object KafkaDirectStreaming extends App {
        val brokers = "localhost:9092"
        val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        ssc.checkpoint("checkpointDir") // offset recovery

        val topics = Set("streaming_queue")
        val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
        val messages: InputDStream[(String, String)] =
          KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

        val lines = messages.map { case (key, message) => message }
        val words = lines.flatMap(_.split(" "))
        val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
  • 88. Direct Approach Direct Approach has the following advantages over the receiver-based approach: ➢ Simplified Parallelism ➢ Efficiency ➢ Exactly-once semantics
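The Editor's Notes point out that exactly-once output additionally requires the output operation to be idempotent or transactional. A minimal plain-Scala sketch of the idempotent variant follows; `RecordId`, `write` and the in-memory `Map` standing in for an external store are all hypothetical, chosen only to show that replaying a batch does not change the result when each record is written under a stable key.

```scala
object IdempotentSink {
  // Hypothetical record identity: where the record came from in Kafka.
  final case class RecordId(topic: String, partition: Int, offset: Long)

  /** Upsert keyed by RecordId: writing the same record twice leaves one copy. */
  def write(store: Map[RecordId, String], batch: Seq[(RecordId, String)]): Map[RecordId, String] =
    store ++ batch
}
```

Writing the same batch twice, as would happen when a failed batch is re-run, leaves the store identical to writing it once. An append-only sink without such a key would hold every replayed record twice.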
  • 89. Performance Tuning For best performance of a Spark Streaming application we need to consider two things: ➢ Reducing the Batch Processing Times ➢ Setting the Right Batch Interval
  • 90. Reducing the Batch Processing Times ➢ Level of Parallelism in Data Receiving ➢ Level of Parallelism in Data Processing ➢ Data Serialization -Input data -Persisted RDDs generated by Streaming Operations ➢ Task Launching Overheads -Running Spark in Standalone mode or coarse-grained Mesos mode leads to better task launch times.
  • 91. Setting the Right Batch Interval ➢ Batch processing time should be less than the batch interval. ➢ Memory Tuning - Persistence level of DStreams - Clearing old data - CMS Garbage Collector
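What happens when the processing time exceeds the batch interval can be illustrated with a small simulation. The sketch is plain Scala with hypothetical names and assumes batches are processed strictly one after another: a batch becomes ready at the end of its interval and starts as soon as the previous batch has finished.

```scala
object BatchBacklog {
  /** Scheduling delay (time spent waiting before processing starts) for each batch. */
  def schedulingDelaysMs(batchIntervalMs: Long, processingTimesMs: Seq[Long]): Seq[Long] = {
    val (_, delays) =
      processingTimesMs.zipWithIndex.foldLeft((0L, Vector.empty[Long])) {
        case ((previousFinish, acc), (processing, i)) =>
          val ready = (i + 1) * batchIntervalMs        // batch i is complete at the end of its interval
          val start = math.max(ready, previousFinish)  // wait for the previous batch if it is still running
          (start + processing, acc :+ (start - ready))
      }
    delays
  }
}
```

With a 1-second batch interval, batches that each take 800 ms never wait. Batches that each take 1500 ms wait 0 ms, 500 ms and 1000 ms, and the delay keeps growing by 500 ms per batch, which is the instability the slide warns about.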

Editor's Notes

  1. Data can be ingested from many sources like Kafka, Flume, Kinesis. Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Processed data can be pushed out to filesystems, databases, and live dashboards
  2. Unreliable - This can be used for sources that do not support acknowledgment, or even for reliable sources when one does not want or need to go into the complexity of acknowledgment.
  3. Spark Streaming can receive streaming data from any arbitrary data source beyond the ones for which it has built-in support (that is, beyond Flume, Kafka, Kinesis, files, sockets, etc.). This requires the developer to implement a receiver that is customized for receiving data from the concerned data source. This guide walks through the process of implementing a custom receiver and using it in a Spark Streaming application. Note that custom receivers can be implemented in Scala or Java.
  6. At most once: Each record will be either processed once or not processed at all. At least once: Each record will be processed one or more times. This is stronger than at-most-once as it ensures that no data will be lost, but there may be duplicates. Exactly once: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is the strongest guarantee of the three.
  7. Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
  10. Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune. Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka. Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).
  11. Level of Parallelism in Data Receiving: 1) Create multiple receivers, which results in multiple DStreams. These DStreams can be unioned together to create a single DStream, and the transformations that were being applied on a single input DStream can then be applied on the unified stream. For example, use one receiver per Kafka topic. 2) Another parameter to consider is the receiver's block interval. For most receivers, the received data is coalesced into blocks before being stored in Spark's memory. The number of tasks per receiver per batch will be approximately (batch interval / block interval). An alternative is to explicitly repartition the input data stream using inputStream.repartition(<number of partitions>). Level of Parallelism in Data Processing: Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the computation is not high enough. For example, for distributed reduce operations like reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is controlled by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument (see PairDStreamFunctions documentation), or set the spark.default.parallelism configuration property to change the default. Input data: By default, the input data received through receivers is stored in the executors' memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization has overheads: the receiver must deserialize the received data and re-serialize it using Spark's serialization format. Persisted RDDs generated by Streaming Operations: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of StorageLevel.MEMORY_ONLY, persisted RDDs generated by streaming computations are persisted with StorageLevel.MEMORY_ONLY_SER (i.e. serialized) by default to minimize GC overheads. In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the Spark Tuning Guide for more details.
  12. A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. Persistence Level of DStreams: As mentioned earlier in the Data Serialization section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage can be achieved with compression (see the Spark configuration spark.rdd.compress), at the cost of CPU time. Clearing old data: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data. Data can be retained for a longer duration (e.g. interactively querying older data) by setting streamingContext.remember. CMS Garbage Collector: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the overall processing throughput of the system, its use is still recommended to achieve more consistent batch processing times. Make sure you set the CMS GC on both the driver (using --driver-java-options in spark-submit) and the executors (using Spark configuration spark.executor.extraJavaOptions).