Satendra Kumar
Sr. Software Consultant
Knoldus Software LLP
Stream Processing
Topics Covered
➢ What is Stream
➢ What is Stream processing
➢ The challenges of stream processing
➢ Overview Spark Streaming
➢ Receivers
➢ Custom receivers
➢ Transformations on Dstreams
➢ Failures
➢ Fault-tolerance Semantics
➢ Kafka Integration
➢ Performance Tuning
What is Stream
A stream is a sequence of data elements made available over time
that can be accessed in sequential order.
E.g. buffering of a YouTube video.
What is Stream processing
Stream processing is the real-time processing of data
continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous
infinite stream of data integrated from both live and historical
sources.
The challenges of stream processing
➢ Partitioning & Scalability
➢ Semantics & Fault tolerance
➢ Unifying the streams
➢ Time
➢ Re-Processing
Spark Streaming
➢ Provides a way to process live data streams.
➢ Scalable, high-throughput, fault-tolerant.
➢ Built on top of the core Spark API.
➢ The API is very similar to the Spark core API.
➢ Supports many sources such as Kafka, Flume, Kinesis and TCP
sockets.
➢ Currently based on RDDs.
Discretized Streams
➢ Spark Streaming provides a high-level abstraction called a discretized stream, or
DStream, which represents a continuous stream of data.
➢ DStreams can be created either from input data streams from
sources such as Kafka, Flume, and Kinesis, or by applying high-
level operations on other DStreams.
➢ Internally, a DStream is represented as a sequence of RDDs.
High level overview
Driver Program
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingApp extends App {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
wordCounts.print()
streamingContext.start()
streamingContext.awaitTermination()
}
The same program, stepped through slide by slide, highlights the key building blocks:
➢ Streaming Context
➢ Batch Interval
➢ Receiver
➢ Transformations on DStreams
➢ Output Operations on DStreams
➢ Start the Streaming
Important Points
➢ Once a context has been started, no new streaming computations can
be set up or added to it.
➢ Once a context has been stopped, it cannot be restarted.
➢ Only one StreamingContext can be active in a JVM at the same time.
➢ stop() on StreamingContext also stops the SparkContext. To stop only
the StreamingContext, set the optional stopSparkContext parameter of
stop() to false (see the snippet after this list).
➢ A SparkContext can be re-used to create multiple StreamingContexts, as
long as the previous StreamingContext is stopped (without stopping the
SparkContext) before the next StreamingContext is created.
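A minimal sketch of the last two points, assuming the streamingContext from the earlier driver program: the StreamingContext is stopped without tearing down the SparkContext, which can then back a new StreamingContext.
streamingContext.stop(stopSparkContext = false) // stop only the streaming side
// The surviving SparkContext can be reused for a new StreamingContext.
val newStreamingContext = new StreamingContext(streamingContext.sparkContext, Seconds(10))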
Spark Streaming Concept
➢ Spark Streaming is based on a micro-batch architecture.
➢ Spark Streaming continuously receives live input data streams and divides
the data into batches.
➢ New batches are created at regular time intervals called the batch interval.
➢ Each batch has N blocks,
where N = batch interval / block interval.
For example, if the batch interval is 1 second and the block interval is 200 ms (the default),
then each batch has 5 blocks (see the configuration sketch below).
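The block interval is configurable. A hedged sketch of how it could be set; spark.streaming.blockInterval is the standard property, and the values below simply reproduce the example above (1 s batch interval, 200 ms block interval, 5 blocks per batch).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// 1 s batch interval with a 200 ms block interval => 5 blocks (and 5 tasks) per batch
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "200ms")
val ssc = new StreamingContext(conf, Seconds(1))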
Transforming DStreams
➢ A DStream is represented by a continuous series of RDDs.
➢ Each RDD in a DStream contains data from a certain interval.
➢ Any operation applied on a DStream translates to operations on the
underlying RDDs.
➢ The processing time of a batch should be less than or equal to the batch
interval.
Transformations on DStreams
def map[U: ClassTag](mapFunc: T => U): DStream[U]
def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U]
def filter(filterFunc: T => Boolean): DStream[T]
def reduce(reduceFunc: (T, T) => T): DStream[T]
def count(): DStream[Long]
def repartition(numPartitions: Int): DStream[T]
def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)]
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
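A hedged usage sketch of a few of these signatures, reusing the words DStream from the earlier driver program:
val longWords: DStream[String] = words.filter(_.length > 5)
val counts: DStream[(String, Long)] = words.countByValue()
// transform exposes the underlying RDD of each batch, so any RDD operation can be applied.
val sortedCounts: DStream[(String, Long)] = counts.transform(rdd => rdd.sortBy(_._2, ascending = false))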
Transformations on PairDStream
def groupByKey(): DStream[(K, Iterable[V])]
def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))]
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],partitioner: Partitioner): DStream[(K, S)]
def cogroup[W: ClassTag](
other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))]
def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]
def leftOuterJoin[W: ClassTag](
other: DStream[(K, W)],numPartitions: Int): DStream[(K, (V, Option[W]))]
def rightOuterJoin[W: ClassTag](
other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
updateStateByKey
object StreamingApp extends App {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
streamingContext.checkpoint(".")
val lines = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val updatedState: DStream[(String, Int)] =
pairs.updateStateByKey[Int] {
(newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
}
updatedState.print()
streamingContext.start()
streamingContext.awaitTermination()
}
Window Operations
Spark Streaming also provides windowed computations, which allow
you to apply transformations over a sliding window of data.
A window operation needs to specify two parameters:
● window length - The duration of the window.
● sliding interval - The interval at which the window operation is performed.
Window Operations
def window(windowDuration: Duration): DStream[T]
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
def reduceByWindow(reduceFunc: (T, T) => T,
windowDuration: Duration, slideDuration: Duration): DStream[T]
def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long]
def countByValueAndWindow(windowDuration: Duration,
slideDuration: Duration,numPartitions: Int): DStream[(T, Long)]
//pairDStream Operations
def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])]
def groupByKeyAndWindow(windowDuration: Duration,
slideDuration: Duration): DStream[(K, Iterable[V])]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V,windowDuration: Duration): DStream[(K, V)]
def reduceByKeyAndWindow(reduceFunc: (V, V) => V,
windowDuration: Duration,slideDuration: Duration): DStream[(K, V)]
Window Operations
pairs.window(Seconds(15), Seconds(10))
filteredWords.reduceByWindow((a, b) => a +", "+ b, Seconds(15), Seconds(10))
pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
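When windows overlap heavily, the variant of reduceByKeyAndWindow that also takes an inverse reduce function is usually cheaper, because it subtracts the records leaving the window instead of recomputing the whole window; it requires checkpointing. A hedged sketch reusing the pairs DStream above:
streamingContext.checkpoint("checkpointDir") // required by the inverse-function variant
val windowedCounts: DStream[(String, Int)] = pairs.reduceByKeyAndWindow(
(a: Int, b: Int) => a + b, // counts entering the window
(a: Int, b: Int) => a - b, // counts leaving the window
Seconds(15), Seconds(10))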
Output Operations on DStreams
def print(num: Int): Unit
def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
def saveAsTextFiles(prefix: String, suffix: String = ""): Unit
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String,suffix: String): Unit
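foreachRDD is the most general output operation. A common pattern, sketched here with a hypothetical ConnectionPool helper, is to create connections per partition on the executors instead of once on the driver:
wordCounts.foreachRDD { rdd =>
rdd.foreachPartition { partition =>
val connection = ConnectionPool.getConnection() // hypothetical pool, created on the executor
partition.foreach(record => connection.send(record.toString))
ConnectionPool.returnConnection(connection)
}
}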
Receivers
Spark Streaming has two kinds of receivers:
1) Reliable Receiver - A reliable receiver correctly sends an acknowledgment
to a reliable source when the data has been received and stored in Spark with
replication.
2) Unreliable Receiver - An unreliable receiver does not send an acknowledgment
to the source.
Custom Receiver
A custom receiver must extend the abstract Receiver class and implement
two methods:
def onStart(): Unit //Things to do to start receiving data
def onStop(): Unit // Things to do to stop receiving data
Custom Receiver
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
def onStart() {
new Thread("File Reader") {
override def run() {
receive()
}
}.start()
}
def onStop() {}
private def receive() =
try {
println("Reading file " + path)
val reader = new BufferedReader(
new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))
var userInput = reader.readLine()
while (!isStopped && Option(userInput).isDefined) {
store(userInput)
userInput = reader.readLine()
}
reader.close()
println("Stopped receiving")
restart("Trying to connect again")
} catch {
case ex: Exception =>
restart("Error reading file " + path, ex)
}
}
Custom Receiver
object CustomReceiver extends App {
val sparkConf = new SparkConf().setAppName("CustomReceiver")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.receiverStream(new CustomReceiver(args(0)))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Failure is everywhere
Fault-tolerance Semantics
A streaming system should provide zero-data-loss guarantees despite any kind of
failure in the system.
➢ At least once - each record will be processed one or more times.
➢ Exactly once - each record will be processed exactly once: no data will be lost and no
data will be processed multiple times.
Kinds of Failure
There are two kinds of failure:
➢ Executor failure
1) Data received and replicated
2) Data received but not replicated
➢ Driver failure
Executor failure
Would data be lost?
Executor with WAL
Enable write ahead logs
object Streaming2App extends App {
val checkpointDirectory = "checkpointDir" // should be a fault-tolerant, reliable file system (e.g. HDFS, S3)
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))
streamingContext.checkpoint(checkpointDirectory)
val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
wordCounts.print(20)
streamingContext.start()
streamingContext.awaitTermination()
}
Enable write ahead logs
1) For WAL, first enable checkpointing
- streamingContext.checkpoint(checkpointDirectory)
2) Enable WAL in the Spark configuration
- sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3) The receiver should be reliable
- Acknowledge the source only after data is saved to the WAL
- Unacknowledged data will be replayed from the source by the restarted receiver
4) Disable in-memory replication (the WAL is already replicated by HDFS)
- Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream (see the sketch below)
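A minimal sketch of point 4, assuming the socket source from the earlier examples; socketTextStream accepts an explicit storage level, so the single-replica serialized level can be passed directly.
import org.apache.spark.storage.StorageLevel
// With the WAL enabled, skip the usual 2x in-memory replication of received data.
val lines = streamingContext.socketTextStream("localhost", 9000, StorageLevel.MEMORY_AND_DISK_SER)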
Driver failure
How to recover from this failure?
Driver with checkpointing
DStream checkpointing: periodically save the DAG of
DStreams to fault-tolerant storage.
Recover from Driver failure
1) Configure automatic driver restart
- All cluster managers support this
2) Set a checkpoint directory
- The directory should be on a fault-tolerant, reliable file system (e.g. HDFS, S3)
- streamingContext.checkpoint(checkpointDirectory)
3) Restart the driver using checkpointing
Configure Automatic driver restart
Spark Standalone
- use spark-submit in "cluster" deploy mode with "--supervise" (sketched below)
YARN
- use spark-submit in "cluster" deploy mode
Mesos
- Marathon can restart applications, or use the "--supervise" flag
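A hedged example for the Spark Standalone case; the master URL, class name and JAR are placeholders, while --deploy-mode cluster and --supervise are the standard spark-submit flags.
spark-submit \
--master spark://master-host:7077 \
--deploy-mode cluster \
--supervise \
--class com.example.StreamingApp \
streaming-app.jar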
Configure Checkpointing
object RecoverableWordCount {
// should be a fault-tolerant, reliable file system (e.g. HDFS, S3)
val checkpointDirectory = "checkpointDir"
def createContext() = {
val sparkConf = new SparkConf().setAppName("StreamingApp")
val streamingContext = new StreamingContext(sparkConf, Seconds(1))
streamingContext.checkpoint(checkpointDirectory)
val lines = streamingContext.socketTextStream("localhost", 9000)
val words: DStream[String] = lines.flatMap(_.split(" "))
val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty)
val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1))
val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
wordCounts.print(20)
streamingContext
}
}
Restart the driver using checkpointing
object StreamingApp extends App {
import RecoverableWordCount._
val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
//do other operations
streamingContext.start()
streamingContext.awaitTermination()
}
Checkpointing
There are two types of data that are checkpointed.
1) Metadata checkpointing
-Configuration
-DStream operations
-Incomplete batches
2) Data checkpointing
- Saving of the generated RDDs to reliable storage. This is necessary in some stateful
transformations that combine data across multiple batches.
Checkpointing Latency
➔ Checkpointing of RDDs incurs the cost of saving to reliable storage. The interval of
checkpointing needs to be set carefully.
dstream.checkpoint( Seconds( (batch interval)*10 ) )
➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
Fault-tolerance Semantics
Spark Streaming & Kafka Integration
Why Kafka ?
➢ Velocity & volume of streaming data
➢ Re-processing of streams
➢ Reliable receiver complexity
➢ Checkpoint complexity
➢ Upgrading application code
Kafka Integration
There are two approaches to integrate Kafka with Spark Streaming:
➢ Receiver-based Approach
➢ Direct Approach
Receiver-based Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Receiver-based Approach
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object ReceiverBasedStreaming extends App {
val group = "streaming-test-group"
val zkQuorum = "localhost:2181"
val topics = Map("streaming_queue" -> 1)
val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
.map { case (key, message) => message }
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Direct Approach
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Direct Approach
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka._
object KafkaDirectStreaming extends App {
val brokers = "localhost:9092"
val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpointDir") //offset recovery
val topics = Set("streaming_queue")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages: InputDStream[(String, String)] =
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val lines = messages.map { case (key, message) => message }
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Direct Approach
Direct Approach has the following advantages over the receiver-based approach:
➢ Simplified Parallelism
➢ Efficiency
➢ Exactly-once semantics (offsets can be tracked by the application, as sketched below)
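With the direct approach the offsets consumed in each batch are exposed through HasOffsetRanges. A hedged sketch, assuming the messages stream from the previous slide; where the offsets are persisted (ZooKeeper, a database, Kafka itself) is left open:
import org.apache.spark.streaming.kafka.HasOffsetRanges
messages.foreachRDD { rdd =>
// The cast is only valid on the RDD produced directly by createDirectStream.
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
offsetRanges.foreach { o =>
println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
}
// Process the batch, then store offsetRanges atomically with the results
// to achieve end-to-end exactly-once semantics.
}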
Performance Tuning
For best performance of a Spark Streaming application we need to
consider two things:
➢ Reducing the Batch Processing Times
➢ Setting the Right Batch Interval
Reducing the Batch Processing Times
➢ Level of Parallelism in Data Receiving (see the multi-receiver sketch below)
➢ Level of Parallelism in Data Processing
➢ Data Serialization
-Input data
-Persisted RDDs generated by Streaming Operations
➢ Task Launching Overheads
-Running Spark in Standalone mode or coarse-grained Mesos mode leads
to better task launch times.
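A hedged sketch of receiver-side parallelism: several socket receivers are created and unioned into one DStream, so data receiving is spread across executors; host and port are placeholders.
val numReceivers = 3
// Each call creates its own receiver, which occupies one core on an executor.
val streams = (1 to numReceivers).map(_ => streamingContext.socketTextStream("localhost", 9000))
val unifiedStream: DStream[String] = streamingContext.union(streams)
// Optionally repartition to raise parallelism for the processing stage as well.
val repartitionedStream = unifiedStream.repartition(8)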
Setting the Right Batch Interval
➢ The batch processing time should be less than the batch interval.
➢ Memory Tuning (see the configuration sketch below)
- Persistence level of DStreams
- Clearing old data
- CMS garbage collector
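A hedged sketch that combines a few of these knobs; the property names are standard Spark configuration keys, but the chosen values are illustrative only.
val tunedConf = new SparkConf()
.setAppName("TunedStreamingApp")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster, more compact serialization
.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC") // CMS GC to keep pauses short
.set("spark.streaming.unpersist", "true") // drop generated RDDs as soon as they are no longer needed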
Code samples
https://github.com/knoldus/spark-streaming-meetup
https://github.com/knoldus/real-time-stream-processing-engine
https://github.com/knoldus/kafka-tweet-producer
Questions & DStream[Answer]
References
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
http://spark.apache.org/docs/latest/tuning.html
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Thanks
Presenters:
@_satendrakumar
Organizer:
@knolspeak
http://www.knoldus.com

 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 

Spark Streaming Fundamentals and Concepts

  • 1. Satendra Kumar Sr. Software Consultant Knoldus Software LLP Stream Processing
  • 2. Topics Covered ➢ What is Stream ➢ What is Stream processing ➢ The challenges of stream processing ➢ Overview Spark Streaming ➢ Receivers ➢ Custom receivers ➢ Transformations on DStreams ➢ Failures ➢ Fault-tolerance Semantics ➢ Kafka Integration ➢ Performance Tuning
  • 3. What is Stream A stream is a sequence of data elements made available over time that can be accessed in sequential order, e.g. YouTube video buffering.
  • 4. What is Stream processing Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion. It treats data not as static tables or files, but as a continuous infinite stream of data integrated from both live and historical sources.
  • 5. ➢ Partitioning & Scalability ➢ Semantics & Fault tolerance ➢ Unifying the streams ➢ Time ➢ Re-Processing The challenges of stream processing
  • 6. Spark Streaming ➢ Provides a way to process live data streams. ➢ Scalable, high-throughput, fault-tolerant. ➢ Built on top of the core Spark API. ➢ The API is very similar to the Spark core API. ➢ Supports many sources like Kafka, Flume, Kinesis or TCP sockets. ➢ Currently based on RDDs.
  • 11. Discretized Streams ➢ Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. ➢ DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. ➢ A DStream is represented as a sequence of RDDs.
  • 20. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() }
  • 21. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context
  • 22. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval
  • 23. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver
  • 24. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver Transformations on DStreams
  • 25. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver Transformations on DStreams Output Operations on DStreams
  • 26. Driver Program object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print() streamingContext.start() streamingContext.awaitTermination() } Streaming Context Batch Interval Receiver Transformations on DStreams Output Operations on DStreams Start the Streaming
  • 27. Important Points ➢ Once a context has been started, no new streaming computations can be set up or added to it. ➢ Once a context has been stopped, it cannot be restarted. ➢ Only one StreamingContext can be active in a JVM at the same time. ➢ stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false. ➢ A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
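A minimal sketch of the last two points (object names, hosts, and ports are illustrative): stop only the StreamingContext, keep the SparkContext alive, and create a second StreamingContext from the same SparkContext.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ContextLifecycle extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("ContextLifecycle")

  // First streaming computation
  val firstContext = new StreamingContext(sparkConf, Seconds(5))
  val sparkContext = firstContext.sparkContext
  firstContext.socketTextStream("localhost", 9000).print()
  firstContext.start()
  firstContext.awaitTerminationOrTimeout(30000)
  // Stop only the StreamingContext; the shared SparkContext stays alive.
  firstContext.stop(stopSparkContext = false)

  // A new StreamingContext can reuse the same SparkContext.
  val secondContext = new StreamingContext(sparkContext, Seconds(10))
  secondContext.socketTextStream("localhost", 9001).print()
  secondContext.start()
  secondContext.awaitTermination()
}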
  • 28. Spark Streaming Concept ➢ Spark Streaming is based on a micro-batch architecture. ➢ Spark Streaming continuously receives live input data streams and divides the data into batches. ➢ New batches are created at regular time intervals called the batch interval. ➢ Each batch has N blocks, where N = batch interval / block interval. For example, if the batch interval is 1 second and the block interval is 200 ms (the default), each batch has 5 blocks, as sketched below.
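A rough configuration sketch of the block interval (spark.streaming.blockInterval is the standard property; the values are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "200ms")   // 200ms is the default

// 1-second batch interval / 200 ms block interval = 5 blocks per batch
val streamingContext = new StreamingContext(sparkConf, Seconds(1))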
  • 33. Transforming DStream ➢ A DStream is represented by a continuous series of RDDs ➢ Each RDD in a DStream contains data from a certain interval ➢ Any operation applied on a DStream translates to operations on the underlying RDDs ➢ The processing time of a batch should be less than or equal to the batch interval.
  • 34. Transformations on DStreams def map[U: ClassTag](mapFunc: T => U): DStream[U] def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U] def filter(filterFunc: T => Boolean): DStream[T] def reduce(reduceFunc: (T, T) => T): DStream[T] def count(): DStream[Long] def repartition(numPartitions: Int): DStream[T] def countByValue(numPartitions: Int = ssc.sc.defaultParallelism): DStream[(T, Long)] def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]
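For instance, transform exposes the underlying RDD of each batch, which allows ordinary RDD operations such as joining against a static reference dataset. The sketch below reuses the streamingContext and words DStream from the earlier driver program; the spam list is a made-up example.

// Static reference data, built once on the driver.
val spamList = streamingContext.sparkContext.parallelize(Seq("spam", "junk"))
val spamPairs = spamList.map(word => (word, true))

// RDD-level join inside a DStream via transform.
val flaggedWords = words
  .map(word => (word, 1))
  .transform { rdd => rdd.leftOuterJoin(spamPairs) }

// countByValue gives a DStream[(String, Long)] of per-batch word frequencies.
val frequencies = words.countByValue()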
  • 35. Transformations on PairDStream def groupByKey(): DStream[(K, Iterable[V])] def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)] def join[W: ClassTag](other: DStream[(K, W)]): DStream[(K, (V, W))] def updateStateByKey[S: ClassTag]( updateFunc: (Seq[V], Option[S]) => Option[S],partitioner: Partitioner): DStream[(K, S)] def cogroup[W: ClassTag]( other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Iterable[V], Iterable[W]))] def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)] def leftOuterJoin[W: ClassTag]( other: DStream[(K, W)],numPartitions: Int): DStream[(K, (V, Option[W]))] def rightOuterJoin[W: ClassTag]( other: DStream[(K, W)], numPartitions: Int): DStream[(K, (Option[V], W))]
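A small sketch of a few of these pair operations, again building on the streamingContext from the driver program; the clicks and purchases streams, hosts, and ports are hypothetical.

// Hypothetical pair DStreams of (userId, value).
val clicks = streamingContext.socketTextStream("localhost", 9001).map(userId => (userId, 1))
val purchases = streamingContext.socketTextStream("localhost", 9002).map(userId => (userId, 1.0))

val clicksPerUser = clicks.reduceByKey(_ + _)                 // per-batch click counts
val joined = clicksPerUser.join(purchases)                    // DStream[(String, (Int, Double))]
val clickCounts = joined.mapValues { case (count, _) => count }
val withMissing = clicksPerUser.leftOuterJoin(purchases)      // DStream[(String, (Int, Option[Double]))]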
  • 36. updateStateByKey object StreamingApp extends App { val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) streamingContext.checkpoint(".") val lines = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val updatedState: DStream[(String, Int)] = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum +state.getOrElse(0)) } updatedState.print() streamingContext.start() streamingContext.awaitTermination() }
  • 37. Window Operations Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. Any window operation needs to specify two parameters: ● window length - the duration of the window. ● sliding interval - the interval at which the window operation is performed.
  • 38. Window Operations def window(windowDuration: Duration): DStream[T] def window(windowDuration: Duration, slideDuration: Duration): DStream[T] def reduceByWindow(reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration): DStream[T] def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long] def countByValueAndWindow(windowDuration: Duration, slideDuration: Duration,numPartitions: Int): DStream[(T, Long)] //pairDStream Operations def groupByKeyAndWindow(windowDuration: Duration): DStream[(K, Iterable[V])] def groupByKeyAndWindow(windowDuration: Duration, slideDuration: Duration): DStream[(K, Iterable[V])] def reduceByKeyAndWindow(reduceFunc: (V, V) => V,windowDuration: Duration): DStream[(K, V)] def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration,slideDuration: Duration): DStream[(K, V)]
  • 39. Window Operations pairs.window(Seconds(15), Seconds(10)) filteredWords.reduceByWindow((a, b) => a +", "+ b, Seconds(15), Seconds(10)) pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(15), Seconds(10))
  • 40. Output Operations on DStreams def print(num: Int): Unit def saveAsObjectFiles(prefix: String, suffix: String = ""): Unit def saveAsTextFiles(prefix: String, suffix: String = ""): Unit def foreachRDD(foreachFunc: RDD[T] => Unit): Unit def saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String,suffix: String): Unit
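foreachRDD is the most general output operation. A common pattern, sketched below with hypothetical createConnection and sendToExternalStore helpers, is to open one connection per partition on the executors rather than per record or on the driver.

// Reusing the wordCounts DStream from the driver program above.
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition, created on the executor.
    val connection = createConnection()                       // hypothetical helper
    partition.foreach(record => sendToExternalStore(connection, record))   // hypothetical helper
    connection.close()
  }
}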
  • 41. Receivers Spark Streaming has two kinds of receivers: 1) Reliable Receiver - A reliable receiver correctly sends an acknowledgment to a reliable source when the data has been received and stored in Spark with replication. 2) Unreliable Receiver - An unreliable receiver does not send an acknowledgment to the source.
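The practical difference is how store() is used inside the receiver: storing a buffer of records blocks until the block has been saved (and replicated) inside Spark, so the source can be acknowledged afterwards, while storing single records is buffered with no acknowledgment point. A minimal sketch; the record values and the acknowledgment call are hypothetical.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ReliableStyleReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    // Reliable style: store a whole buffer; this call blocks until the records
    // are stored (and replicated) in Spark, so only then acknowledge the source.
    val buffer = ArrayBuffer("record-1", "record-2", "record-3")
    store(buffer)
    // source.acknowledge(buffer.size)   // hypothetical acknowledgment to the source

    // Unreliable style would instead call store("record-4") per record,
    // which is buffered internally and offers no acknowledgment point.
  }
  def onStop(): Unit = {}
}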
  • 42. Custom Receiver A custom receiver must extend the abstract Receiver class by implementing two abstract methods: def onStart(): Unit // Things to do to start receiving data def onStop(): Unit // Things to do to stop receiving data
  • 43. Custom Receiver class CustomReceiver(path: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) { def onStart() { new Thread("File Reader") { override def run() { receive() } }.start() } def onStop() {} private def receive() = try { println("Reading file " + path) val reader = new BufferedReader( new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8)) var userInput = reader.readLine() while (!isStopped && Option(userInput).isDefined) { store(userInput) userInput = reader.readLine() } reader.close() println("Stopped receiving") restart("Trying to connect again") } catch { case ex: Exception => restart("Error reading file " + path, ex) } }
  • 44. Custom Receiver object CustomReceiver extends App { val sparkConf = new SparkConf().setAppName("CustomReceiver") val ssc = new StreamingContext(sparkConf, Seconds(1)) val lines = ssc.receiverStream(new CustomReceiver(args(0))) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() }
  • 46. Fault-tolerance Semantics A streaming system should provide zero-data-loss guarantees despite any kind of failure in the system. ➢ At least once - each record will be processed one or more times. ➢ Exactly once - each record will be processed exactly once; no data will be lost and no data will be processed multiple times.
  • 47. Kinds of Failure There are two kinds of failures: ➢ Executor failure 1) Data received and replicated 2) Data received but not replicated ➢ Driver failure
  • 56. Enable write ahead logs object Streaming2App extends App { val checkpointDirectory ="checkpointDir"//It should be fault-tolerant & reliable file system(e.g. HDFS, S3, etc.) val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) streamingContext.checkpoint(checkpointDirectory) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print(20) streamingContext.start() streamingContext.awaitTermination() } Enable write logs
  • 57. Enable write ahead logs object Streaming2App extends App { val checkpointDirectory ="checkpointDir"//It should be fault-tolerant & reliable file system(e.g. HDFS, S3, etc.) val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingApp") sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") val streamingContext = new StreamingContext(sparkConf, Seconds(5)) streamingContext.checkpoint(checkpointDirectory) val lines: ReceiverInputDStream[String] = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print(20) streamingContext.start() streamingContext.awaitTermination() } Enable write logs Enable checkpointing
  • 58. Enable write ahead logs 1) For the WAL, first enable checkpointing - streamingContext.checkpoint(checkpointDirectory) 2) Enable the WAL in the Spark configuration - sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") 3) The receiver should be reliable - Acknowledge the source only after data is saved to the WAL - Unacknowledged data will be replayed from the source by the restarted receiver 4) Disable in-memory replication (already replicated by HDFS) - Use StorageLevel.MEMORY_AND_DISK_SER for the input DStream, as sketched below
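For the last point, the storage level is chosen where the input DStream is created; the two-argument socketTextStream defaults to MEMORY_AND_DISK_SER_2, which replicates in memory. A sketch reusing the streamingContext from the WAL example above:

import org.apache.spark.storage.StorageLevel

// With the write ahead log enabled, in-memory replication is unnecessary,
// so use a non-replicated storage level for the input DStream.
val lines = streamingContext.socketTextStream(
  "localhost", 9000, StorageLevel.MEMORY_AND_DISK_SER)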
  • 63. Driver failure How to recover from this Failure ?
  • 64. Driver with checkpointing DStream checkpointing: periodically save the DAG of the DStream to fault-tolerant storage.
  • 68. Recover from Driver failure 1) Configure Automatic driver restart - All cluster managers support this 2) Set a checkpoint directory - The directory should be on a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) - streamingContext.checkpoint(checkpointDirectory) 3) The driver should be restarted using checkpointing
  • 69. Configure Automatic driver restart Spark Standalone - use spark-submit with “cluster” mode and “--supervise” YARN - use spark-submit with “cluster” mode Mesos - Marathon can restart applications, or use the “--supervise” flag
  • 70. Configure Checkpointing object RecoverableWordCount { //should a fault-tolerant,reliable file system(e.g.HDFS,S3, etc.) val checkpointDirectory = "checkpointDir" def createContext() = { val sparkConf = new SparkConf().setAppName("StreamingApp") val streamingContext = new StreamingContext(sparkConf, Seconds(1)) streamingContext.checkpoint(checkpointDirectory) val lines = streamingContext.socketTextStream("localhost", 9000) val words: DStream[String] = lines.flatMap(_.split(" ")) val filteredWords: DStream[String] = words.filter(!_.trim.isEmpty) val pairs: DStream[(String, Int)] = filteredWords.map(word => (word, 1)) val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _) wordCounts.print(20) streamingContext } }
  • 71. Driver should be restarted using checkpointing object StreamingApp extends App { import RecoverableWordCount._ val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _) //do other operations streamingContext.start() streamingContext.awaitTermination() }
  • 72. Driver should be restarted using checkpointing object StreamingApp extends App { import RecoverableWordCount._ val streamingContext = StreamingContext.getOrCreate(checkpointDirectory, createContext _) //do other operations streamingContext.start() streamingContext.awaitTermination() }
  • 73. Checkpointing There are two types of data that are checkpointed. 1) Metadata checkpointing -Configuration -DStream operations -Incomplete batches 2) Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches.
  • 74. Checkpointing Latency ➔ Checkpointing of RDDs incurs the cost of saving to reliable storage. The interval of checkpointing needs to be set carefully. dstream.checkpoint( Seconds( (batch interval)*10 ) ) ➔ A checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
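For example, with the 5-second batch interval used in the earlier examples, that guideline suggests checkpointing a stateful DStream roughly every 10 batches; a one-line sketch on the pairs stream from the driver program:

// 5-second batches, so checkpoint about every 10 batches (50 seconds).
pairs.checkpoint(Seconds(50))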
  • 81. Spark Streaming & Kafka Integration
  • 82. Why Kafka? ➢ Velocity & volume of streaming data ➢ Reprocessing of streaming data ➢ Reliable receiver complexity ➢ Checkpoint complexity ➢ Upgrading Application Code
  • 83. Kafka Integration There are two approaches to integrate Kafka with Spark Streaming: ➢ Receiver-based Approach ➢ Direct Approach
  • 85. Receiver-based Approach import org.apache.spark.SparkConf import org.apache.spark.streaming.kafka.KafkaUtils import org.apache.spark.streaming.{Seconds, StreamingContext} object ReceiverBasedStreaming extends App { val group = "streaming-test-group" val zkQuorum = "localhost:2181" val topics = Map("streaming_queue" -> 1) val sparkConf = new SparkConf().setAppName("ReceiverBasedStreamingApp") sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true") val ssc = new StreamingContext(sparkConf, Seconds(2)) val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics) .map { case (key, message) => message } val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() }
  • 87. Direct Approach import kafka.serializer.StringDecoder import org.apache.spark.SparkConf import org.apache.spark.streaming._ import org.apache.spark.streaming.dstream.InputDStream import org.apache.spark.streaming.kafka._ object KafkaDirectStreaming extends App { val brokers = "localhost:9092" val sparkConf = new SparkConf().setAppName("KafkaDirectStreaming") val ssc = new StreamingContext(sparkConf, Seconds(2)) ssc.checkpoint("checkpointDir") //offset recovery val topics = Set("streaming_queue") val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers) val messages: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val lines = messages.map { case (key, message) => message } val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() }
  • 88. Direct Approach Direct Approach has the following advantages over the receiver-based approach: ➢ Simplified Parallelism ➢ Efficiency ➢ Exactly-once semantics
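With the direct approach, the Kafka offsets consumed in each batch are carried on the RDDs themselves, which is what enables exactly-once output when offsets are saved atomically together with the results. A sketch using the messages stream from the previous slide (the println is just a placeholder for that atomic save):

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  // The offset ranges for this batch, one per Kafka partition.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { range =>
    // For exactly-once output, persist these offsets atomically with the results.
    println(s"${range.topic} partition ${range.partition}: " +
      s"offsets ${range.fromOffset} to ${range.untilOffset}")
  }
}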
  • 89. Performance Tuning For best performance of a Spark Streaming application we need to consider two things: ➢ Reducing the Batch Processing Times ➢ Setting the Right Batch Interval
  • 90. Reducing the Batch Processing Times ➢ Level of Parallelism in Data Receiving ➢ Level of Parallelism in Data Processing ➢ Data Serialization -Input data -Persisted RDDs generated by Streaming Operations ➢ Task Launching Overheads -Running Spark in Standalone mode or coarse-grained Mesos mode leads to better task launch times.
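A sketch of the first point: create several receiver streams, union them, and repartition before the heavy processing. The receiver count, ports, and partition count below are illustrative, and streamingContext is assumed to be the one created earlier.

// Several receivers, each running on its own executor core.
val receiverStreams = (1 to 3).map(_ => streamingContext.socketTextStream("localhost", 9000))
val unified = streamingContext.union(receiverStreams)

// Spread the received blocks across more partitions before processing.
val repartitioned = unified.repartition(12)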
  • 91. Setting the Right Batch Interval ➢ Batch processing time should be less than the batch interval. ➢ Memory Tuning - Persistence Level of DStreams - Clearing old data - CMS Garbage Collector (see the configuration sketch below)
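A configuration sketch covering the memory-tuning points; the property names are standard Spark settings, while the concrete values are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val tunedConf = new SparkConf()
  .setAppName("TunedStreamingApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // cheaper serialization
  .set("spark.rdd.compress", "true")                                       // trade CPU for memory
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")       // CMS GC on executors

val tunedContext = new StreamingContext(tunedConf, Seconds(5))
tunedContext.remember(Minutes(10))   // keep generated RDDs around for 10 minutes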

Editor's Notes

  1. Data can be ingested from many sources like Kafka, Flume, Kinesis. Data can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Processed data can be pushed out to filesystems, databases, and live dashboards
  2. Unreliable - This can be used for sources that do not support acknowledgment, or even for reliable sources when one does not want or need to go into the complexity of acknowledgment.
  3. Spark Streaming can receive streaming data from any arbitrary data source beyond the ones for which it has built-in support (that is, beyond Flume, Kafka, Kinesis, files, sockets, etc.). This requires the developer to implement a receiver that is customized for receiving data from the concerned data source. This guide walks through the process of implementing a custom receiver and using it in a Spark Streaming application. Note that custom receivers can be implemented in Scala or Java.
  6. At most once: Each record will be either processed once or not processed at all. At least once: Each record will be processed one or more times. This is stronger than at-most once as it ensures that no data will be lost. But there may be duplicates. Exactly once: Each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.
  7. Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
  10. Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune. Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka. Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures. In order to achieve exactly-once semantics for output of your results, your output operation that saves the data to an external data store must be either idempotent, or an atomic transaction that saves results and offsets (see Semantics of output operations in the main programming guide for further information).
  11. Level of Parallelism in Data Receiving - 1) Create multiple receivers, which result in multiple DStreams. These multiple DStreams can be unioned together to create a single DStream. Then the transformations that were being applied on a single input DStream can be applied on the unified stream. For example, one Kafka topic per receiver. 2) Another parameter that should be considered is the receiver's block interval. For most receivers, the received data is coalesced together into blocks of data before storing inside Spark's memory. The number of tasks per receiver per batch will be approximately (batch interval / block interval). The received stream can also be repartitioned with inputStream.repartition(<number of partitions>). Level of Parallelism in Data Processing - Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the computation is not high enough. For example, for distributed reduce operations like reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is controlled by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument (see the PairDStreamFunctions documentation), or set the spark.default.parallelism configuration property to change the default. Input data: By default, the input data received through Receivers is stored in the executors' memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads – the receiver must deserialize the received data and re-serialize it using Spark's serialization format. Persisted RDDs generated by Streaming Operations: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of StorageLevel.MEMORY_ONLY, persisted RDDs generated by streaming computations are persisted with StorageLevel.MEMORY_ONLY_SER (i.e. serialized) by default to minimize GC overheads. In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the Spark Tuning Guide for more details.
  12. A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. Persistence Level of DStreams: As mentioned earlier in the Data Serialization section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage can be achieved with compression (see the Spark configuration spark.rdd.compress), at the cost of CPU time. Clearing old data: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data. Data can be retained for a longer duration (e.g. interactively querying older data) by setting streamingContext.remember. CMS Garbage Collector: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the overall processing throughput of the system, its use is still recommended to achieve more consistent batch processing times. Make sure you set the CMS GC on both the driver (using --driver-java-options in spark-submit) and the executors (using Spark configuration spark.executor.extraJavaOptions).