Satendra Kumar
Sr. Software Consultant
Knoldus Software LLP
Stream Processing
Topics Covered
➢ What is Stream
➢ What is Stream processing
➢ The challenges of stream processing
➢ Overview Spark Streaming
➢ Receivers
➢ Custom receivers
➢ Transformations on Dstreams
➢ Failures
➢ Fault-tolerance Semantics
➢ Kafka Integration
➢ Performance Tuning
What is Stream
A stream is a sequence of data elements made available over time
that can be accessed in sequential order.
E.g. buffering of a YouTube video.
What is Stream processing
Stream processing is the real-time processing of data
continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous
infinite stream of data integrated from both live and historical
sources.
The challenges of stream processing
➢ Partitioning & Scalability
➢ Semantics & Fault tolerance
➢ Unifying the streams
➢ Time
➢ Re-Processing
Spark Streaming
➢ Provides a way to process live data streams.
➢ Scalable, high-throughput, fault-tolerant.
➢ Built on top of the core Spark API.
➢ The API is very similar to the Spark core API.
➢ Supports many sources such as Kafka, Flume, Kinesis and TCP
sockets.
➢ Currently based on RDDs.
Discretized Streams
➢ Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream, which represents a continuous stream of data.
➢ DStreams can be created either from input data streams from
sources such as Kafka, Flume, and Kinesis, or by applying high-
level operations on other DStreams.
➢ Internally, a DStream is represented as a sequence of RDDs.
High level overview
Driver Program
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingApp extends App {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
wordCounts.print()
streamingContext.start()
streamingContext.awaitTermination()
}
The same program, stepped through slide by slide, highlights the key building blocks:
➢ Streaming Context
➢ Batch Interval
➢ Receiver
➢ Transformations on DStreams
➢ Output Operations on DStreams
➢ Start the Streaming
Important Points
➢ Once a context has been started, no new streaming computations can
be set up or added to it.
➢ Once a context has been stopped, it cannot be restarted.
➢ Only one StreamingContext can be active in a JVM at the same time.
➢ stop() on StreamingContext also stops the SparkContext. To stop only
the StreamingContext, set the optional stopSparkContext parameter of
stop() to false (see the snippet after this list).
➢ A SparkContext can be re-used to create multiple StreamingContexts, as
long as the previous StreamingContext is stopped (without stopping the
SparkContext) before the next StreamingContext is created.
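A minimal sketch of the last two points, assuming the streamingContext from the earlier driver program: the StreamingContext is stopped without tearing down the SparkContext, which can then back a new StreamingContext.
streamingContext.stop(stopSparkContext = false) // stop only the streaming side
// The surviving SparkContext can be reused for a new StreamingContext.
val newStreamingContext = new StreamingContext(streamingContext.sparkContext, Seconds(10))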
Spark Streaming Concept
➢ Spark Streaming is based on a micro-batch architecture.
➢ Spark Streaming continuously receives live input data streams and divides
the data into batches.
➢ New batches are created at regular time intervals called the batch interval.
➢ Each batch has N blocks,
where N = batch interval / block interval.
For example, if the batch interval is 1 second and the block interval is 200 ms (the default),
then each batch has 5 blocks (see the configuration sketch below).
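The block interval is configurable. A hedged sketch of how it could be set; spark.streaming.blockInterval is the standard property, and the values below simply reproduce the example above (1 s batch interval, 200 ms block interval, 5 blocks per batch).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// 1 s batch interval with a 200 ms block interval => 5 blocks (and 5 tasks) per batch
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "200ms")
val ssc = new StreamingContext(conf, Seconds(1))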
Transforming DStreams
➢ A DStream is represented by a continuous series of RDDs.
➢ Each RDD in a DStream contains data from a certain interval.
➢ Any operation applied on a DStream translates to operations on the
underlying RDDs.
➢ The processing time of a batch should be less than or equal to the batch
interval.
Transformations on DStreams
def map[U: ClassTag](mapFunc: T => U): DStream[U]
def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U]
def filter(filterFunc: T => Boolean): DStream[T]
def reduce(reduceFunc: (T, T) => T): DStream[T]
def count(): DStream[Long]
def repartition(numPartitions: Int): DStream[T]
def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)]
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
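A hedged usage sketch of a few of these signatures, reusing the words DStream from the earlier driver program:
val longWords: DStream[String] = words.filter(_.length > 5)
val counts: DStream[(String, Long)] = words.countByValue()
// transform exposes the underlying RDD of each batch, so any RDD operation can be applied.
val sortedCounts: DStream[(String, Long)] = counts.transform(rdd => rdd.sortBy(_._2, ascending = false))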
Transformations on PairDStream
def groupByKey(): DStream[(K, Iterable[V])]
def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))]
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],partitioner: Partitioner): DStream[(K, S)]
def cogroup[W: ClassTag](
other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))]
def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]
def leftOuterJoin[W: ClassTag](
other: DStream[(K, W)],numPartitions: Int): DStream[(K, (V, Option[W]))]
def rightOuterJoin[W: ClassTag](
other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
updateStateByKey
object StreamingApp extends App {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
streamingContext.checkpoint(".")
val lines = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val updatedState: DStream[(String, Int)] =
pairs.updateStateByKey[Int] {
(newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}
updatedState.print()
streamingContext.start()
streamingContext.awaitTermination()
}
Window Operations
Spark Streaming also provides windowed computations, which allow
you to apply transformations over a sliding window of data.
A window operation needs to specify two parameters:
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
Window Operations
def window(windowDuration: Duration): DStream[T]
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
def reduceByWindow(reduceFunc: (T, T) => T,
windowDuration: Duration, slideDuration: Duration): DStream[T]
def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long]
def countByValueAndWindow(windowDuration: Duration,
slideDuration: Duration,numPartitions: Int): DStream[(T, Long)]
//pairDStream Operations
def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])]
def groupByKeyAndWindow(windowDuration: Duration,
slideDuration: Duration): DStream[(K, Iterable[V])]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V,windowDuration: Duration): DStream[(K, V)]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V,
windowDuration: Duration,slideDuration: Duration): DStream[(K, V)]
Window Operations
pairs.window(Seconds(15), Seconds(10))
filteredWords.reduceByWindow((a, b) => a +", "+ b, Seconds(15), Seconds(10))
pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
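When windows overlap heavily, the variant of reduceByKeyAndWindow that also takes an inverse reduce function is usually cheaper, because it subtracts the records leaving the window instead of recomputing the whole window; it requires checkpointing. A hedged sketch reusing the pairs DStream above:
streamingContext.checkpoint("checkpointDir") // required by the inverse-function variant
val windowedCounts: DStream[(String, Int)] = pairs.reduceByKeyAndWindow(
(a: Int, b: Int) => a + b, // counts entering the window
(a: Int, b: Int) => a - b, // counts leaving the window
Seconds(15), Seconds(10))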
Output Operations on DStreams
def print(num: Int): Unit
def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
def saveAsTextFiles(prefix: String, suffix: String = ""): Unit
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String,suffix: String): Unit
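foreachRDD is the most general output operation. A common pattern, sketched here with a hypothetical ConnectionPool helper, is to create connections per partition on the executors instead of once on the driver:
wordCounts.foreachRDD { rdd =>
rdd.foreachPartition { partition =>
val connection = ConnectionPool.getConnection() // hypothetical pool, created on the executor
partition.foreach(record => connection.send(record.toString))
ConnectionPool.returnConnection(connection)
}
}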
Receivers
Spark Streaming has two kinds of receivers:
1) Reliable Receiver - A reliable receiver correctly sends an acknowledgment
to a reliable source when the data has been received and stored in Spark with
replication.
2) Unreliable Receiver - An unreliable receiver does not send an acknowledgment
to the source.
Custom Receiver
A custom receiver must extend the abstract Receiver class and implement
two methods:
def onStart(): Unit //Things to do to start receiving data
def onStop(): Unit // Things to do to stop receiving data
Custom Receiver
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
def onStart() {
new Thread("File Reader") {
override def run() {
receive()
}
}.start()
}
def onStop() {}
private def receive() =
try {
println("Reading file " + path)
val reader = new BufferedReader(
new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))
var userInput = reader.readLine()
while (!isStopped && Option(userInput).isDefined) {
store(userInput)
userInput = reader.readLine()
}
reader.close()
println("Stopped receiving")
restart("Trying to connect again")
} catch {
case ex: Exception =>
restart("Error reading file " + path, ex)
}
}
Custom Receiver
object CustomReceiver extends App {
val sparkConf = new SparkConf().setAppName("CustomReceiver")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.receiverStream(new CustomReceiver(args(0)))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Failure is everywhere
Fault-tolerance Semantics
A streaming system should provide zero-data-loss guarantees despite any kind of
failure in the system.
➢ At least once - each record will be processed one or more times.
➢ Exactly once - each record will be processed exactly once: no data will be lost and no
data will be processed multiple times.
Kinds of Failure
There are two kinds of failure:
➢ Executor failure
1) Data received and replicated
2) Data received but not replicated
➢ Driver failure
Executor failure
Would data be lost?
Executor with WAL
Enable write ahead logs
object Streaming2App extends App {
val checkpointDirectory = "checkpointDir" // should be a fault-tolerant, reliable file system (e.g. HDFS, S3)
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
streamingContext.checkpoint(checkpointDirectory)
val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
wordCounts.print(20)
streamingContext.start()
streamingContext.awaitTermination()
}
Enable write ahead logs
1) For WAL, first enable checkpointing
- streamingContext.checkpoint(checkpointDirectory)
2) Enable WAL in the Spark configuration
- sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3) The receiver should be reliable
- Acknowledge the source only after data is saved to the WAL
- Unacknowledged data will be replayed from the source by the restarted receiver
4) Disable in-memory replication (the WAL is already replicated by HDFS)
- Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream (see the sketch below)
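A minimal sketch of point 4, assuming the socket source from the earlier examples; socketTextStream accepts an explicit storage level, so the single-replica serialized level can be passed directly.
import org.apache.spark.storage.StorageLevel
// With the WAL enabled, skip the usual 2x in-memory replication of received data.
val lines = streamingContext.socketTextStream("localhost", 9000, StorageLevel.MEMORY_AND_DISK_SER)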
Driver failure
How to recover from this failure?
Driver with checkpointing
DStream checkpointing: periodically save the DAG of
DStreams to fault-tolerant storage.
Recover from Driver failure
1) Configure automatic driver restart
- All cluster managers support this
2) Set a checkpoint directory
- The directory should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
- streamingContext.checkpoint(checkpointDirectory)
3) Restart the driver using checkpointing
Configure Automatic driver restart
Spark Standalone
- use spark-submit in "cluster" deploy mode with "--supervise" (sketched below)
YARN
- use spark-submit in "cluster" deploy mode
Mesos
- Marathon can restart applications, or use the "--supervise" flag
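A hedged example for the Spark Standalone case; the master URL, class name and JAR are placeholders, while --deploy-mode cluster and --supervise are the standard spark-submit flags.
spark-submit \
--master spark://master-host:7077 \
--deploy-mode cluster \
--supervise \
--class com.example.StreamingApp \
streaming-app.jar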
Configure Checkpointing
object RecoverableWordCount {
// should be a fault-tolerant, reliable file system (e.g. HDFS, S3)
val checkpointDirectory = "checkpointDir"
def createContext() = {
val sparkConf = new SparkConf().setAppName("StreamingApp")
val streamingContext = new StreamingContext(sparkConf, Seconds(1))
streamingContext.checkpoint(checkpointDirectory)
val lines = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
wordCounts.print(20)
streamingContext
}
}
Restart the driver using checkpointing
object StreamingApp extends App {
import RecoverableWordCount._
val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
//do other operations
streamingContext.start()
streamingContext.awaitTermination()
}
Checkpointing
There are two types of data that are checkpointed.
1) Metadata checkpointing
-Configuration
-DStream operations
-Incomplete batches
2) Data checkpointing
- Saving of the generated RDDs to reliable storage. This is necessary in some stateful
transformations that combine data across multiple batches.
Checkpointing Latency
➔ Checkpointing of RDDs incurs the cost of saving to reliable storage. The interval of
checkpointing needs to be set carefully.
dstream.checkpoint( Seconds( (batch interval)*10 ) )
➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
Fault-tolerance Semantics
Spark Streaming & Kafka Integration
Why Kafka ?
➢ Velocity & volume of streaming data
➢ Re-processing of streams
➢ Reliable receiver complexity
➢ Checkpoint complexity
➢ Upgrading application code
Kafka Integration
There are two approaches to integrate Kafka with Spark Streaming:
➢ Receiver-based Approach
➢ Direct Approach
Receiver-based Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Receiver-based Approach
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object ReceiverBasedStreaming extends App {
val group = "streaming-test-group"
val zkQuorum = "localhost:2181"
val topics = Map("streaming_queue" -> 1)
val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
.map { case (key, message) => message }
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Direct Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Direct Approach
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka._
object KafkaDirectStreaming extends App {
val brokers = "localhost:9092"
val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpointDir") //offset recovery
val topics = Set("streaming_queue")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages: InputDStream[(String, String)] =
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val lines = messages.map { case (key, message) => message }
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Direct Approach
Direct Approach has the following advantages over the receiver-based approach:
➢ Simplified Parallelism
➢ Efficiency
➢ Exactly-once semantics (offsets can be tracked by the application, as sketched below)
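With the direct approach the offsets consumed in each batch are exposed through HasOffsetRanges. A hedged sketch, assuming the messages stream from the previous slide; where the offsets are persisted (ZooKeeper, a database, Kafka itself) is left open:
import org.apache.spark.streaming.kafka.HasOffsetRanges
messages.foreachRDD { rdd =>
// The cast is only valid on the RDD produced directly by createDirectStream.
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
offsetRanges.foreach { o =>
println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
}
// Process the batch, then store offsetRanges atomically with the results
// to achieve end-to-end exactly-once semantics.
}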
Performance Tuning
For best performance of a Spark Streaming application we need to
consider two things:
➢ Reducing the Batch Processing Times
➢ Setting the Right Batch Interval
Reducing the Batch Processing Times
➢ Level of Parallelism in Data Receiving (see the multi-receiver sketch below)
➢ Level of Parallelism in Data Processing
➢ Data Serialization
-Input data
-Persisted RDDs generated by Streaming Operations
➢ Task Launching Overheads
-Running Spark in Standalone mode or coarse-grained Mesos mode leads
to better task launch times.
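A hedged sketch of receiver-side parallelism: several socket receivers are created and unioned into one DStream, so data receiving is spread across executors; host and port are placeholders.
val numReceivers = 3
// Each call creates its own receiver, which occupies one core on an executor.
val streams = (1 to numReceivers).map(_ => streamingContext.socketTextStream("localhost", 9000))
val unifiedStream: DStream[String] = streamingContext.union(streams)
// Optionally repartition to raise parallelism for the processing stage as well.
val repartitionedStream = unifiedStream.repartition(8)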
Setting the Right Batch Interval
➢ The batch processing time should be less than the batch interval.
➢ Memory Tuning (see the configuration sketch below)
- Persistence level of DStreams
- Clearing old data
- CMS garbage collector
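A hedged sketch that combines a few of these knobs; the property names are standard Spark configuration keys, but the chosen values are illustrative only.
val tunedConf = new SparkConf()
.setAppName("TunedStreamingApp")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster, more compact serialization
.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC") // CMS GC to keep pauses short
.set("spark.streaming.unpersist", "true") // drop generated RDDs as soon as they are no longer needed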
Code samples
https://github.com/knoldus/spark-streaming-meetup
https://github.com/knoldus/real-time-stream-processing-engine
https://github.com/knoldus/kafka-tweet-producer
Questions & DStream[Answer]
References
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
http://spark.apache.org/docs/latest/tuning.html
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Thanks
Presenters:
@_satendrakumar
Organizer:
@knolspeak
http://www.knoldus.com

 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 

Spark Streaming Fundamentals and Concepts

  • 1. Satendra Kumar Sr. Software Consultant Knoldus Software LLP Stream Processing
  • 2. Topics Covered ➢ What is Stream ➢ What is Stream processing ➢ The challenges of stream processing ➢ Overview Spark Streaming ➢ Receivers ➢ Custom receivers ➢ Transformations on DStreams ➢ Failures ➢ Fault-tolerance Semantics ➢ Kafka Integration ➢ Performance Tuning
  • 3. What is Stream A stream is a sequence of data elements made available over time that can be accessed in sequential order, e.g. YouTube video buffering.
  • 4. What is Stream processing Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion. It treats data not as static tables or files, but as a continuous infinite stream of data integrated from both live and historical sources.
  • 5. ➢ Partitioning & Scalability ➢ Semantics & Fault tolerance ➢ Unifying the streams ➢ Time ➢ Re-Processing The challenges of stream processing
  • 6. Spark Streaming ➢ Provides a way to process live data streams. ➢ Scalable, high-throughput, fault-tolerant. ➢ Built on top of the core Spark API. ➢ The API is very similar to the Spark core API. ➢ Supports many sources like Kafka, Flume, Kinesis or TCP sockets. ➢ Currently based on RDDs.
  • 11. Discretized Streams ➢ Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. ➢ DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. ➢ A DStream is represented as a sequence of RDDs.
  • 20. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() }
  • 21. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context
  • 22. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval
  • 23. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver
  • 24. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver Transformations on DStreams
  • 25. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver Transformations on DStreams Output Operations on DStreams
  • 26. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver Transformations on DStreams Output Operations on DStreams Start the Streaming
  • 27. Important Points ➢ Once a context has been started, no new streaming computations can be set up or added to it. ➢ Once a context has been stopped, it cannot be restarted. ➢ Only one StreamingContext can be active in a JVM at the same time. ➢ stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false. ➢ A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
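A minimal sketch of the last two points (object names, hosts, and ports are illustrative): stop only the StreamingContext, keep the SparkContext alive, and create a second StreamingContext from the same SparkContext.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextLifecycle extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ContextLifecycle")

  // First streaming computation
  val firstContext = new StreamingContext(sparkConf, Seconds(5))
  val sparkContext = firstContext.sparkContext
  firstContext.socketTextStream("localhost", 9000).print()
  firstContext.start()
  firstContext.awaitTerminationOrTimeout(30000)
  // Stop only the StreamingContext; the shared SparkContext stays alive.
  firstContext.stop(stopSparkContext = false)

  // A new StreamingContext can reuse the same SparkContext.
  val secondContext = new StreamingContext(sparkContext, Seconds(10))
  secondContext.socketTextStream("localhost", 9001).print()
  secondContext.start()
  secondContext.awaitTermination()
}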
  • 28. Spark Streaming Concept ➢ Spark Streaming is based on a micro-batch architecture. ➢ Spark Streaming continuously receives live input data streams and divides the data into batches. ➢ New batches are created at regular time intervals called the batch interval. ➢ Each batch has N blocks, where N = batch interval / block interval. For example, if the batch interval is 1 second and the block interval is 200 ms (the default), each batch has 5 blocks, as sketched below.
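A rough configuration sketch of the block interval (spark.streaming.blockInterval is the standard property; the values are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "200ms")   // 200ms is the default

// 1-second batch interval / 200 ms block interval = 5 blocks per batch
val streamingContext = new StreamingContext(sparkConf, Seconds(1))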
  • 33. Transforming DStream ➢ A DStream is represented by a continuous series of RDDs ➢ Each RDD in a DStream contains data from a certain interval ➢ Any operation applied on a DStream translates to operations on the underlying RDDs ➢ The processing time of a batch should be less than or equal to the batch interval.
  • 34. Transformations on DStreams def map[U: ClassTag](mapFunc: T => U): DStream[U] def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U] def filter(filterFunc: T => Boolean): DStream[T] def reduce(reduceFunc: (T, T) => T): DStream[T] def count(): DStream[Long] def repartition(numPartitions: Int): DStream[T] def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)] def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
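For instance, transform exposes the underlying RDD of each batch, which allows ordinary RDD operations such as joining against a static reference dataset. The sketch below reuses the streamingContext and words DStream from the earlier driver program; the spam list is a made-up example.

// Static reference data, built once on the driver.
val spamList = streamingContext.sparkContext.parallelize(Seq("spam", "junk"))
val spamPairs = spamList.map(word => (word, true))

// RDD-level join inside a DStream via transform.
val flaggedWords = words
  .map(word => (word, 1))
  .transform { rdd => rdd.leftOuterJoin(spamPairs) }

// countByValue gives a DStream[(String, Long)] of per-batch word frequencies.
val frequencies = words.countByValue()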
  • 35. Transformations on PairDStream def groupByKey(): DStream[(K, Iterable[V])] def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)] def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))] def updateStateByKey[S: ClassTag]( updateFunc: (Seq[V], Option[S]) => Option[S],partitioner: Partitioner): DStream[(K, S)] def cogroup[W: ClassTag]( other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))] def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)] def leftOuterJoin[W: ClassTag]( other: DStream[(K, W)],numPartitions: Int): DStream[(K, (V, Option[W]))] def rightOuterJoin[W: ClassTag]( other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
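A small sketch of a few of these pair operations, again building on the streamingContext from the driver program; the clicks and purchases streams, hosts, and ports are hypothetical.

// Hypothetical pair DStreams of (userId, value).
val clicks = streamingContext.socketTextStream("localhost", 9001).map(userId => (userId, 1))
val purchases = streamingContext.socketTextStream("localhost", 9002).map(userId => (userId, 1.0))

val clicksPerUser = clicks.reduceByKey(_ + _)                 // per-batch click counts
val joined = clicksPerUser.join(purchases)                    // DStream[(String, (Int, Double))]
val clickCounts = joined.mapValues { case (count, _) => count }
val withMissing = clicksPerUser.leftOuterJoin(purchases)      // DStream[(String, (Int, Option[Double]))]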
  • 36. updateStateByKey object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) streamingContext.checkpoint(".") val lines = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val updatedState: DStream[(String, Int)] = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum +state.getOrElse(0)) } updatedState.print() streamingContext.start() streamingContext.awaitTermination() }
  • 37. Window Operations Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. Any window operation needs to specify two parameters: ● window length - the duration of the window. ● sliding interval - the interval at which the window operation is performed.
  • 38. Window Operations def window(windowDuration: Duration): DStream[T] def window(windowDuration: Duration, slideDuration: Duration): DStream[T] def reduceByWindow(reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration): DStream[T] def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long] def countByValueAndWindow(windowDuration: Duration, slideDuration: Duration,numPartitions: Int): DStream[(T, Long)] //pairDStream Operations def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])] def groupByKeyAndWindow(windowDuration: Duration, slideDuration: Duration): DStream[(K, Iterable[V])] def reduceByKeyAndWindow(reduceFunc: (V, V) => V,windowDuration: Duration): DStream[(K, V)] def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration,slideDuration: Duration): DStream[(K, V)]
  • 39. Window Operations pairs.window(Seconds(15), Seconds(10)) filteredWords.reduceByWindow((a, b) => a +", "+ b, Seconds(15), Seconds(10)) pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
  • 40. Output Operations on DStreams def print(num: Int): Unit def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit def saveAsTextFiles(prefix: String, suffix: String = ""): Unit def foreachRDD(foreachFunc: RDD[T] => Unit): Unit def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String,suffix: String): Unit
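foreachRDD is the most general output operation. A common pattern, sketched below with hypothetical createConnection and sendToExternalStore helpers, is to open one connection per partition on the executors rather than per record or on the driver.

// Reusing the wordCounts DStream from the driver program above.
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition, created on the executor.
    val connection = createConnection()                       // hypothetical helper
    partition.foreach(record => sendToExternalStore(connection, record))   // hypothetical helper
    connection.close()
  }
}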
  • 41. Receivers Spark Streaming has two kinds of receivers: 1) Reliable Receiver - A reliable receiver correctly sends an acknowledgment to a reliable source when the data has been received and stored in Spark with replication. 2) Unreliable Receiver - An unreliable receiver does not send an acknowledgment to the source.
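The practical difference is how store() is used inside the receiver: storing a buffer of records blocks until the block has been saved (and replicated) inside Spark, so the source can be acknowledged afterwards, while storing single records is buffered with no acknowledgment point. A minimal sketch; the record values and the acknowledgment call are hypothetical.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ReliableStyleReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    // Reliable style: store a whole buffer; this call blocks until the records
    // are stored (and replicated) in Spark, so only then acknowledge the source.
    val buffer = ArrayBuffer("record-1", "record-2", "record-3")
    store(buffer)
    // source.acknowledge(buffer.size)   // hypothetical acknowledgment to the source

    // Unreliable style would instead call store("record-4") per record,
    // which is buffered internally and offers no acknowledgment point.
  }
  def onStop(): Unit = {}
}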
  • 42. Custom Receiver A custom receiver must extend the abstract Receiver class by implementing two abstract methods: def onStart(): Unit // Things to do to start receiving data def onStop(): Unit // Things to do to stop receiving data
  • 43. Custom Receiver class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) { def onStart() { new Thread("File Reader") { override def run() { receive() } }.start() } def onStop() {} private def receive() = try { println("Reading file " + path) val reader = new BufferedReader( new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8)) var userInput = reader.readLine() while (!isStopped && Option(userInput).isDefined) { store(userInput) userInput = reader.readLine() } reader.close() println("Stopped receiving") restart("Trying to connect again") } catch { case ex: Exception => restart("Error reading file " + path, ex) } }
  • 44. Custom Receiver object CustomReceiver extends App { val sparkConf = new SparkConf().setAppName("CustomReceiver") val ssc = new StreamingContext(sparkConf, Seconds(1)) val lines = ssc.receiverStream(new CustomReceiver(args(0))) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() }
  • 46. Fault-tolerance Semantics A streaming system should provide zero-data-loss guarantees despite any kind of failure in the system. ➢ At least once - each record will be processed one or more times. ➢ Exactly once - each record will be processed exactly once; no data will be lost and no data will be processed multiple times.
  • 47. Kinds of Failure There are two kinds of failures: ➢ Executor failure 1) Data received and replicated 2) Data received but not replicated ➢ Driver failure
  • 56. Enable write ahead logs object Streaming2App extends App { val checkpointDirectory ="checkpointDir"//It should be fault-tolerant & reliable file system(e.g. HDFS, S3, etc.) val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) streamingContext.checkpoint(checkpointDirectory) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print(20) streamingContext.start() streamingContext.awaitTermination() } Enable write logs
  • 57. Enable write ahead logs object Streaming2App extends App { val checkpointDirectory ="checkpointDir"//It should be fault-tolerant & reliable file system(e.g. HDFS, S3, etc.) val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) streamingContext.checkpoint(checkpointDirectory) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print(20) streamingContext.start() streamingContext.awaitTermination() } Enable write logs Enable checkpointing
  • 58. Enable write ahead logs 1) For the WAL, first enable checkpointing - streamingContext.checkpoint(checkpointDirectory) 2) Enable the WAL in the Spark configuration - sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") 3) The receiver should be reliable - Acknowledge the source only after data is saved to the WAL - Unacknowledged data will be replayed from the source by the restarted receiver 4) Disable in-memory replication (already replicated by HDFS) - Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream, as sketched below
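For the last point, the storage level is chosen where the input DStream is created; the two-argument socketTextStream defaults to MEMORY_AND_DISK_SER_2, which replicates in memory. A sketch reusing the streamingContext from the WAL example above:

import org.apache.spark.storage.StorageLevel

// With the write ahead log enabled, in-memory replication is unnecessary,
// so use a non-replicated storage level for the input DStream.
val lines = streamingContext.socketTextStream(
  "localhost", 9000, StorageLevel.MEMORY_AND_DISK_SER)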
  • 63. Driver failure How to recover from this Failure ?
  • 64. Driver with checkpointing DStream checkpointing: periodically save the DAG of the DStream to fault-tolerant storage.
  • 68. Recover from Driver failure 1) Configure Automatic driver restart - All cluster managers support this 2) Set a checkpoint directory - The directory should be on a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) - streamingContext.checkpoint(checkpointDirectory) 3) The driver should be restarted using checkpointing
  • 69. Configure Automatic driver restart Spark Standalone - use spark-submit with “cluster” mode and “--supervise” YARN - use spark-submit with “cluster” mode Mesos - Marathon can restart applications, or use the “--supervise” flag
  • 70. Configure Checkpointing object RecoverableWordCount { //should a fault-tolerant,reliable file system(e.g.HDFS,S3, etc.) val checkpointDirectory = "checkpointDir" def createContext() = { val sparkConf = new SparkConf().setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(1)) streamingContext.checkpoint(checkpointDirectory) val lines = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print(20) streamingContext } }
  • 71. Driver should be restarted using checkpointing object StreamingApp extends App { import RecoverableWordCount._ val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _) //do other operations streamingContext.start() streamingContext.awaitTermination() }
  • 72. Driver should be restarted using checkpointing object StreamingApp extends App { import RecoverableWordCount._ val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _) //do other operations streamingContext.start() streamingContext.awaitTermination() }
  • 73. Checkpointing There are two types of data that are checkpointed. 1) Metadata checkpointing -Configuration -DStream operations -Incomplete batches 2) Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches.
  • 74. Checkpointing Latency ➔ Checkpointing of RDDs incurs the cost of saving to reliable storage. The interval of checkpointing needs to be set carefully. dstream.checkpoint( Seconds( (batch interval)*10 ) ) ➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
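For example, with the 5-second batch interval used in the earlier examples, that guideline suggests checkpointing a stateful DStream roughly every 10 batches; a one-line sketch on the pairs stream from the driver program:

// 5-second batches, so checkpoint about every 10 batches (50 seconds).
pairs.checkpoint(Seconds(50))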
  • 81. Spark Streaming & Kafka Integration
  • 82. Why Kafka? ➢ Velocity & volume of streaming data ➢ Reprocessing of streaming data ➢ Reliable receiver complexity ➢ Checkpoint complexity ➢ Upgrading Application Code
  • 83. Kafka Integration There are two approaches to integrate Kafka with Spark Streaming: ➢ Receiver-based Approach ➢ Direct Approach
  • 85. Receiver-based Approach import org.apache.spark.SparkConf import org.apache.spark.streaming.kafka.KafkaUtils import org.apache.spark.streaming.{Seconds, StreamingContext} object ReceiverBasedStreaming extends App { val group = "streaming-test-group" val zkQuorum = "localhost:2181" val topics = Map("streaming_queue" -> 1) val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp") sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") val ssc = new StreamingContext(sparkConf, Seconds(2)) val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics) .map { case (key, message) => message } val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() }
  • 87. Direct Approach import kafka.serializer.StringDecoder import org.apache.spark.SparkConf import org.apache.spark.streaming._ import org.apache.spark.streaming.dstream.InputDStream import org.apache.spark.streaming.kafka._ object KafkaDirectStreaming extends App { val brokers = "localhost:9092" val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming") val ssc = new StreamingContext(sparkConf, Seconds(2)) ssc.checkpoint("checkpointDir") //offset recovery val topics = Set("streaming_queue") val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers) val messages: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val lines = messages.map { case (key, message) => message } val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() }
  • 88. Direct Approach Direct Approach has the following advantages over the receiver-based approach: ➢ Simplified Parallelism ➢ Efficiency ➢ Exactly-once semantics
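With the direct approach, the Kafka offsets consumed in each batch are carried on the RDDs themselves, which is what enables exactly-once output when offsets are saved atomically together with the results. A sketch using the messages stream from the previous slide (the println is just a placeholder for that atomic save):

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  // The offset ranges for this batch, one per Kafka partition.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { range =>
    // For exactly-once output, persist these offsets atomically with the results.
    println(s"${range.topic} partition ${range.partition}: " +
      s"offsets ${range.fromOffset} to ${range.untilOffset}")
  }
}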
  • 89. Performance Tuning For best performance of a Spark Streaming application we need to consider two things: ➢ Reducing the Batch Processing Times ➢ Setting the Right Batch Interval
  • 90. Reducing the Batch Processing Times ➢ Level of Parallelism in Data Receiving ➢ Level of Parallelism in Data Processing ➢ Data Serialization -Input data -Persisted RDDs generated by Streaming Operations ➢ Task Launching Overheads -Running Spark in Standalone mode or coarse-grained Mesos mode leads to better task launch times.
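A sketch of the first point: create several receiver streams, union them, and repartition before the heavy processing. The receiver count, ports, and partition count below are illustrative, and streamingContext is assumed to be the one created earlier.

// Several receivers, each running on its own executor core.
val receiverStreams = (1 to 3).map(_ => streamingContext.socketTextStream("localhost", 9000))
val unified = streamingContext.union(receiverStreams)

// Spread the received blocks across more partitions before processing.
val repartitioned = unified.repartition(12)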
  • 91. Setting the Right Batch Interval ➢ Batch processing time should be less than the batch interval. ➢ Memory Tuning - Persistence Level of DStreams - Clearing old data - CMS Garbage Collector (see the configuration sketch below)
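A configuration sketch covering the memory-tuning points; the property names are standard Spark settings, while the concrete values are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val tunedConf = new SparkConf()
  .setAppName("TunedStreamingApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // cheaper serialization
  .set("spark.rdd.compress", "true")                                       // trade CPU for memory
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")       // CMS GC on executors

val tunedContext = new StreamingContext(tunedConf, Seconds(5))
tunedContext.remember(Minutes(10))   // keep generated RDDs around for 10 minutes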

Editor's Notes

  1. Data can be ingested from many sources like Kafka, Flume, Kinesis. Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Processed data can be pushed out to filesystems, databases, and live dashboards
  2. Unreliable - This can be used for sources that do not support acknowledgment, or even for reliable sources when one does not want or need to go into the complexity of acknowledgment.
  3. Spark Streaming can receive streaming data from any arbitrary data source beyond the ones for which it has built-in support (that is, beyond Flume, Kafka, Kinesis, files, sockets, etc.). This requires the developer to implement a receiver that is customized for receiving data from the concerned data source. This guide walks through the process of implementing a custom receiver and using it in a Spark Streaming application. Note that custom receivers can be implemented in Scala or Java.
  6. At most once: Each record will be either processed once or not processed at all. At least once: Each record will be processed one or more times. This is stronger than at-most once as it ensures that no data will be lost. But there may be duplicates. Exactly once: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.
  7. Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
  10. Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune. Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka. Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).
  11. Level of Parallelism in Data Receiving - 1) Create multiple receivers, which result in multiple DStreams. These multiple DStreams can be unioned together to create a single DStream. Then the transformations that were being applied on a single input DStream can be applied on the unified stream. For example, one Kafka topic per receiver. 2) Another parameter that should be considered is the receiver's block interval. For most receivers, the received data is coalesced together into blocks of data before storing inside Spark's memory. The number of tasks per receiver per batch will be approximately (batch interval / block interval). The received stream can also be repartitioned with inputStream.repartition(<number of partitions>). Level of Parallelism in Data Processing - Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the computation is not high enough. For example, for distributed reduce operations like reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is controlled by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument (see the PairDStreamFunctions documentation), or set the spark.default.parallelism configuration property to change the default. Input data: By default, the input data received through Receivers is stored in the executors' memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads – the receiver must deserialize the received data and re-serialize it using Spark's serialization format. Persisted RDDs generated by Streaming Operations: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of StorageLevel.MEMORY_ONLY, persisted RDDs generated by streaming computations are persisted with StorageLevel.MEMORY_ONLY_SER (i.e. serialized) by default to minimize GC overheads. In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the Spark Tuning Guide for more details.
  12. A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. Persistence Level of DStreams: As mentioned earlier in the Data Serialization section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage can be achieved with compression (see the Spark configuration spark.rdd.compress), at the cost of CPU time. Clearing old data: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data. Data can be retained for a longer duration (e.g. interactively querying older data) by setting streamingContext.remember. CMS Garbage Collector: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the overall processing throughput of the system, its use is still recommended to achieve more consistent batch processing times. Make sure you set the CMS GC on both the driver (using --driver-java-options in spark-submit) and the executors (using Spark configuration spark.executor.extraJavaOptions).