Learning spark ch10 - Spark Streaming
3. 10.1 A Simple Example
Before we dive into the details of Spark Streaming, let's consider a simple example. We will receive a stream of newline-delimited lines of text from a server running at port 7777, filter only the lines that contain the word "error", and print them.
Spark Streaming programs are best run as standalone applications built using Maven or sbt. Spark Streaming, while part of Spark, ships as a separate Maven artifact and has some additional imports you will want to add to your project.
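To make this concrete, here is a minimal sketch of that example as a standalone Scala application (the object name and the 1-second batch interval are our choices, not mandated by the text):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object ErrorLineFilter {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("ErrorLineFilter")
      // StreamingContext is the entry point; Seconds(1) is the batch interval
      val ssc = new StreamingContext(conf, Seconds(1))
      // DStream of newline-delimited text received from the server on port 7777
      val lines = ssc.socketTextStream("localhost", 7777)
      // Keep only the lines that contain the word "error"
      val errorLines = lines.filter(_.contains("error"))
      errorLines.print()
      ssc.start()             // start receiving and processing data
      ssc.awaitTermination()  // block until the computation stops or fails
    }
  }

You can exercise it by running a simple server such as nc -lk 7777 in another terminal and typing lines at it.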
7. 10.3 Transformations
Stateless transformations: the processing of each batch does not depend on the data of its previous batches. These include the common RDD transformations like map(), filter(), and reduceByKey().
Stateful transformations: these use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time.
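As a brief sketch, a per-batch word count uses only stateless operations (it assumes a DStream[String] named lines, like the one in the example above):

  // Each batch is transformed independently; reduceByKey() aggregates
  // within a batch, not across batches, so this is still stateless
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
  wordCounts.print()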
9. 10.3.2 Stateful Transformations
Windowed transformations compute results across a longer time period than the StreamingContext's batch interval, by combining results from multiple batches.
[Figure: a windowed stream with a window duration of 3 batches and a slide duration of 2 batches; every two time steps, we compute a result over the previous 3 time steps.]
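A sketch of the window in the figure, assuming a StreamingContext with a 10-second batch interval: a 30-second window duration and a 20-second slide duration then correspond to a 3-batch window sliding by 2 batches (both durations must be multiples of the batch interval):

  val windowedLines = lines.window(Seconds(30), Seconds(20))
  // Count of lines over the last 30 seconds, computed every 20 seconds
  windowedLines.count().print()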
10. 10.3.2 Stateful Transformations (cont.)
The updateStateByKey() transformation maintains state across the batches in a DStream by providing access to a state variable for DStreams of key/value pairs. It takes a function update(events, oldState) that returns a newState (sketched below):
- events is a list of events that arrived in the current batch (may be empty)
- oldState is an optional state object, stored within an Option; it might be missing if there was no previous state for the key
- newState, the return value, is also an Option; we can return an empty Option to specify that we want to delete the state
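A sketch of such an update function, keeping a running count of events per key. The DStream name pairs is hypothetical (any DStream[(String, Int)] would do), and stateful transformations also require checkpointing to be enabled (see 10.6.1):

  // pairs: DStream[(String, Int)]; the state S here is a running count (Long)
  def updateRunningCount(events: Seq[Int], oldState: Option[Long]): Option[Long] =
    Some(oldState.getOrElse(0L) + events.size)  // return None to delete the key's state

  val runningCounts = pairs.updateStateByKey(updateRunningCount _)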
11. 10.4 Output Operations
Output operations specify what needs to be done with the final transformed data in a stream, for example print() or one of the save operations.
Saving a DStream to text files in Scala:
  ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")
Saving SequenceFiles from a DStream in Scala (the pairs must be unpacked with a pattern match and converted to Hadoop Writable types):
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.SequenceFileOutputFormat
  val writableIpAddressRequestCount = ipAddressRequestCount.map {
    case (ip, count) => (new Text(ip), new LongWritable(count))
  }
  writableIpAddressRequestCount.saveAsHadoopFiles[
    SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")
12. 10.5 Input Sources
Spark Streaming has built-in support for a number of different data sources. "Core" sources are built into the Spark Streaming Maven artifact; others are available through additional artifacts, e.g. spark-streaming-kafka.
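Such an extra artifact is added as an ordinary build dependency. A sketch for sbt follows; the version number is an assumption for a Spark 1.x era build and must match the Spark version on your cluster:

  // build.sbt
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming" % "1.2.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"
  )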
13. 10.5.1 Core Sources
Stream of files: allows a stream to be created from files written in a directory of a Hadoop-compatible filesystem. It needs a consistent date format for the directory names, and the files have to be created atomically. E.g., streaming text files written to a directory in Scala:
  val logData = ssc.textFileStream(logDirectory)
Akka actor stream: allows using Akka actors as a source for streaming. To construct an actor stream, create an Akka actor that implements the org.apache.spark.streaming.receiver.ActorHelper interface, as in the sketch below.
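A minimal sketch of such an actor, assuming a Spark 1.x release (actorStream and ActorHelper were removed in later versions). The store() method provided by ActorHelper hands received items to Spark Streaming:

  import akka.actor.{Actor, Props}
  import org.apache.spark.streaming.receiver.ActorHelper

  class LineForwarder extends Actor with ActorHelper {
    def receive = {
      case line: String => store(line)  // push each received line into the stream
    }
  }

  val actorLines = ssc.actorStream[String](Props[LineForwarder], "LineForwarder")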
15. 10.5.3 Multiple Sources and Cluster Sizing
We can combine multiple DStreams using operations like union(), which merges the data from multiple input DStreams; a sketch follows.
The receivers are executed in the Spark cluster, which is what allows us to run multiple of them. Each receiver runs as a long-running task within Spark's executors, and hence occupies CPU cores allocated to the application.
Note: do not run Spark Streaming programs locally with master configured as "local" or "local[1]"; a single receiver would then occupy the only core, leaving none to process the received data.
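A sketch of combining two receivers (the hostnames are hypothetical). Since each receiver occupies a core, testing this locally needs a master of at least "local[3]":

  val stream1 = ssc.socketTextStream("host1.example.com", 7777)
  val stream2 = ssc.socketTextStream("host2.example.com", 7777)
  // One DStream containing the data from both receivers
  val combined = stream1.union(stream2)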
16. 10.6 “24/7” Operations
Spark provides strong fault tolerance guarantees: as long as the input data is stored reliably, Spark Streaming will always compute the correct result from it, offering "exactly once" semantics, even if workers or the driver fail.
To run Spark Streaming applications 24/7, we need to:
1. set up checkpointing to a reliable storage system, such as HDFS or Amazon S3
2. worry about the fault tolerance of the driver program and of unreliable input sources
17. 10.6.1 Checkpointing
Checkpointing is the main mechanism that needs to be set up for fault tolerance. It allows periodically saving data about the application to a reliable storage system, such as HDFS or Amazon S3, for use in recovering. Checkpointing serves two purposes:
- limiting the state that must be recomputed on failure
- providing fault tolerance for the driver
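Enabling it is a single call on the StreamingContext; the HDFS path below is hypothetical, and a local temporary directory also works for development:

  ssc.checkpoint("hdfs://namenode:8020/spark/streaming/checkpoints")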
18. 10.6.2 Driver Fault Tolerance
Driver fault tolerance requires a special way of creating our StreamingContext that takes the checkpoint directory into account: the StreamingContext.getOrCreate() function.
Even if you write your initialization code using getOrCreate(), you still need to actually restart your driver program when it crashes; getOrCreate() only ensures the restarted driver recovers from the checkpoint (see the sketch below).
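A sketch of that pattern, reusing the imports from the first example; the checkpoint path and app name are illustrative. The setup function runs only when no checkpoint exists; after a crash and restart, the context is rebuilt from the checkpoint data instead:

  val checkpointDir = "hdfs://namenode:8020/spark/streaming/checkpoints"  // hypothetical
  def createStreamingContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)
    // ... define the DStream computation here ...
    ssc
  }
  // Recover from the checkpoint if one exists; otherwise create a fresh context
  val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)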
19. 10.6.3 Worker Fault Tolerance
Spark Streaming uses the same techniques as Spark for its fault tolerance. All the data received from external sources is replicated among the Spark workers, and all RDDs created through transformations of this replicated input data are tolerant to failure of a worker node: the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.
20. 10.6.4 Receiver Fault Tolerance
Spark Streaming restarts failed receivers on other nodes in the cluster. Receivers provide the following guarantees:
- All data read from a reliable filesystem (e.g., with StreamingContext.hadoopFiles) is reliable, because the underlying filesystem is replicated.
- For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark replicates the input data to other nodes, but it can briefly lose data if a receiver task is down.
21. 10.6.5 Processing Guarantees
Spark Streaming provides exactly-once semantics for all transformations: even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.
However, when the transformed result is pushed to external systems using output operations, the task pushing the result may get executed multiple times due to failures, and some data can get pushed multiple times.
22. 10.7 Streaming UI
Spark Streaming provides a UI page that lets us look at what applications are doing (typically at http://<driver>:4040).
24. 10.8.1 Batch and Window Sizes
The minimum batch size Spark Streaming can use is 500 milliseconds.
The best approach is to start with a larger batch size (around 10 seconds) and work your way down to a smaller one. If the processing times reported in the Streaming UI remain consistent, you can continue to decrease the batch size; if they are increasing, you may have reached the limit for your application.
25. 10.8.2 Level of Parallelism
Increasing the parallelism is a common way to reduce the processing time of batches. There are three ways, sketched below:
- increasing the number of receivers
- explicitly repartitioning received data
- increasing parallelism in aggregation
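Sketches of the last two approaches, reusing the hypothetical lines and pairs DStreams from the earlier examples (the first approach, adding receivers, is the union() pattern from 10.5.3):

  // Explicitly repartition received data to spread work over more cores
  val repartitioned = lines.repartition(10)
  // Raise parallelism in aggregations by passing an explicit partition count
  val counts = pairs.reduceByKey(_ + _, 10)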
26. 10.8.3 Garbage Collection and Memory Usage
Java's garbage collection is an aspect that can cause problems. To minimize large pauses due to GC, enable Java's Concurrent Mark-Sweep garbage collector; it consumes more resources overall, but introduces fewer pauses.
To reduce GC pressure:
- cache RDDs in serialized form
- use Kryo serialization
- use an LRU cache
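A sketch of enabling the Concurrent Mark-Sweep collector and Kryo serialization through standard Spark configuration properties (the app name and the choice to set these in code rather than via spark-submit are illustrative):

  val conf = new SparkConf()
    .setAppName("TunedStreamingApp")
    // Use the Concurrent Mark-Sweep collector on executors to shorten GC pauses
    .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
    // Serialize data with Kryo to reduce memory footprint and GC pressure
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")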
27. Edx and Coursera Courses
Introduction to Big Data with Apache Spark
Spark Fundamentals I
Functional Programming Principles in Scala
28. 10.9 Conclusion
In this chapter, we have seen how to work with streaming data using DStreams. Since DStreams are composed of RDDs, the techniques and knowledge you have gained from the earlier chapters remain applicable for streaming and real-time applications. In the next chapter, we will look at machine learning with Spark.
Editor's Notes
Spark Streaming uses a “micro-batch” architecture, where the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark Streaming receives data from various input sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval. The batch interval is typically between 500 milliseconds and several seconds, as configured by the application developer. Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.
Limiting the state that must be recomputed on failure: as discussed in "Architecture and Abstraction" on page 186, Spark Streaming can recompute state using the lineage graph of transformations, but checkpointing controls how far back it must go.
Providing fault tolerance for the driver: if the driver program in a streaming application crashes, you can launch it again and tell it to recover from a checkpoint, in which case Spark Streaming will read how far the previous run of the program got in processing the data and take over from there.