2. Spark Streaming - Introduction
• Spark Streaming is an extension of the core Spark
API.
• It provides scalable, high-throughput, fault-tolerant
processing of live data streams.
• Data can be ingested from many sources, such as Kafka, Flume, Kinesis, or TCP sockets.
• Data can be processed using complex algorithms
expressed with high-level functions such as map, reduce, join, and window.
• Processed data can be pushed out to filesystems,
databases, and live dashboards.
• Spark’s machine learning and graph processing
algorithms can be applied on data streams.
Image Source: Official Spark Documentation
3. Overview
■ Spark Streaming receives live input data streams and divides the data into batches.
■ Batches are processed by the Spark engine to generate the final stream of results in
batches.
Image Source: Official Spark Documentation
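The slicing of a live stream into batches can be sketched in plain Python. This is a toy simulation, not Spark code: real Spark Streaming batches by a time interval rather than by record count, and the "engine" here is just an ordinary function applied per batch.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an iterator of records into fixed-size batches,
    mimicking how Spark Streaming slices a live stream into chunks."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch is handed to the (here: simulated) engine independently,
# producing the final stream of results in batches.
results = [sum(batch) for batch in micro_batches(range(10), batch_size=4)]
# results is one output value per input batch
```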
4. DStreams
■ Spark Streaming provides a high-level abstraction called a discretized
stream, or DStream.
■ A DStream represents a continuous stream of data.
■ Can be created either from input data streams from sources such as Kafka, Flume, and
Kinesis, or by applying high-level operations on other DStreams.
■ Internally, a DStream is represented as a sequence of RDDs.
■ Each RDD in a DStream contains data from a certain interval.
Image Source: Official Spark Documentation
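The idea that each RDD covers one time interval can be modeled with ordinary Python data structures. This is a conceptual stand-in only: the lists below play the role of RDDs, and the two-second batch interval and one-record-per-second arrival rate are made-up assumptions for illustration.

```python
batch_interval = 2  # seconds; hypothetical interval, as in ssc setup

records = ["a", "b", "c", "d", "e"]

# Assign each record to the interval in which it "arrived"
# (assume exactly one record per second for this sketch).
dstream = {}
for t, rec in enumerate(records):
    interval = (t // batch_interval) * batch_interval
    dstream.setdefault(interval, []).append(rec)

# dstream now maps each interval start to the batch (the "RDD")
# containing the data that arrived during that interval.
```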
5. DStreams
■ Any operation applied on a DStream translates to operations on the underlying RDDs.
Image Source: Official Spark Documentation
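This per-batch translation can be sketched in plain Python: an operation on the DStream is simply applied to every underlying batch. Again a toy model, with lists standing in for RDDs; the `flatMap`-style word split mirrors the classic word-count example from the Spark documentation.

```python
# A "DStream" of lines, as two batches (the stand-in RDDs).
lines_dstream = [["hello world", "hi"], ["hello spark"]]

def transform(dstream, rdd_op):
    """Applying an operation to a DStream means applying the same
    operation to each underlying RDD, yielding a new DStream."""
    return [rdd_op(rdd) for rdd in dstream]

# A flatMap-like split into words, performed one batch at a time.
words_dstream = transform(
    lines_dstream,
    lambda rdd: [w for line in rdd for w in line.split(" ")],
)
```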
6. Built-in streaming sources
■ Basic sources: Sources directly available in the StreamingContext API. Examples: file
systems and socket connections.
■ Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra
utility classes. These require linking against extra dependencies.
■ Multiple input DStreams can be created. This creates multiple receivers, which will
simultaneously receive multiple data streams.
■ The number of cores allocated to the Spark Streaming application must be greater than
the number of receivers; otherwise the application can receive data but has no core left to process it.
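The core-count rule can be illustrated with a small sketch. This is plain Python, not Spark: the two "receivers" are just lists of per-interval batches (imagined as, say, a Kafka topic and a Flume channel), merged as a union-style operation would merge them, and the arithmetic shows why each long-running receiver permanently ties up one core.

```python
# Two hypothetical receivers, each delivering one batch per interval.
receiver1 = [["k1", "k2"], ["k3"]]   # imagined Kafka stream
receiver2 = [["f1"], ["f2", "f3"]]   # imagined Flume stream

# Per interval, merge the batches from all receivers into one stream.
unioned = [b1 + b2 for b1, b2 in zip(receiver1, receiver2)]

# Each receiver occupies one core for the lifetime of the application,
# so with 2 receivers the app needs at least 3 cores:
# 2 to receive, plus at least 1 to actually process the batches.
n_receivers = 2
min_cores = n_receivers + 1
```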