Spark Structured Streaming
Ajay Pratap Singh Pundhir
Agenda
- Spark Components
- Spark Data Models
- Spark Streaming
- Continuous Applications
- Structured Streaming
- Example & Demo
Spark Components
Spark Data Models: RDD
- Lineage (dependency)
- Partitioning (distributed data)
- Compute function: Partition => Iterator[T]
Spark Data Models: Structure [T]
- Arrange data according to a plan
- More optimization
- Limits what we can do, but the majority of operations are still possible
  (RDDs are still available)
Data Models in Spark

                  RDD       Dataframe      Dataset
  Syntax error    Runtime   Compile time   Compile time
  Analysis error  Runtime   Runtime        Compile time
Dataset APIs
- Type safe
- Domain objects
- Lambda functions

// assumes an active SparkSession `spark` and `import spark.implicits._`
val df = spark.read.json("people.json")

// Convert data to domain objects.
case class Person(name: String, age: Long)
val ds = df.as[Person]
ds.filter(_.age > 30)

// Compute a histogram of age by name.
val hist = ds.groupByKey(_.name).mapGroups { (name, people) =>
  val buckets = new Array[Int](10)
  people.map(_.age).foreach { a =>
    buckets((a / 10).toInt) += 1
  }
  (name, buckets)
}
Dataframe = Dataset[Row]
- Unified API since Spark 2.0
- A Dataframe is a Dataset of generic Row objects
- Types can be enforced on the generic rows using df.as[MyClass]
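A minimal sketch of the difference, assuming a people.json file with name and age fields: with the generic Row you look fields up by name at runtime, while .as[Person] gives compile-time checked access.

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.appName("rows-vs-types").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

val df = spark.read.json("people.json")        // Dataframe = Dataset[Row]
df.map((r: Row) => r.getAs[String]("name"))    // field looked up by name at runtime

val ds = df.as[Person]                         // types enforced on the generic rows
ds.map(_.name)                                 // field access checked at compile time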
Spark Streaming
- Most widely used component
- Widely used for event processing
- Fault tolerant
- High throughput
- Unified stack with batch processing
Limitations of DStream
- Processing data using event time
  - DStream natively supports batch time only. What about late arrival of data?
- Interoperability between stream and batch
  - APIs are similar, but manual translation is needed (DStream & RDD)
Limitations of DStream (contd.)
- Exactly-once processing semantics
  - The application implementation has to take care of this
- Checkpointing efficiency
  - Checkpointing saves the complete RDD; there is no incremental save
Structured Streaming Model
Structured Streaming Model
- Complete output
- Delta output
- Append output
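A minimal sketch of how an output mode is selected when starting a query ("Delta" corresponds to what the Spark API calls the update mode; the socket source and console sink below are placeholders for local experimentation):

// assumes an active SparkSession `spark`
val counts = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")   // "complete": rewrite the whole result table every trigger
  // .outputMode("update")  // "update" (delta): write only the rows that changed
  // .outputMode("append")  // "append": write only new rows, never update old ones
  .format("console")
  .start()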
Batch ETL on Dataframe
Streaming ETL on Dataframe
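A hedged sketch of the batch/streaming symmetry on the Dataframe API (the input path, JSON schema, and Parquet sink are assumptions, not from the slides): only read/readStream and write/writeStream change.

import org.apache.spark.sql.types._

// assumes an active SparkSession `spark` and `import spark.implicits._`
val schema = new StructType().add("device", StringType).add("signal", DoubleType)

// Batch ETL on a Dataframe
spark.read.schema(schema).json("/data/input")
  .where($"signal" > 10)
  .write.parquet("/data/output")

// Streaming ETL: the same transformation on continuously arriving files
spark.readStream.schema(schema).json("/data/input")
  .where($"signal" > 10)
  .writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/data/checkpoint")
  .start()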
Continuous Windowed Aggregation
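A minimal sketch of a continuous windowed aggregation on event time; the column names and the watermark threshold are assumptions. The watermark (see note #19) bounds how late data may arrive before its state is dropped.

import org.apache.spark.sql.functions.window

// assumes `events` is a streaming Dataframe with columns
// `eventTime: Timestamp` and `word: String`, and `import spark.implicits._`
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")   // tolerate up to 10 minutes of lateness
  .groupBy(
    window($"eventTime", "10 minutes", "5 minutes"),  // 10-minute windows sliding every 5
    $"word")
  .count()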
Joining Streams with Static Data
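A minimal sketch of enriching a stream with static data, along the lines of note #18 (user information is static, account activity is the stream); the Kafka topic, paths, and the userId join key are assumptions.

// assumes an active SparkSession `spark`
val users = spark.read.parquet("/data/users")            // static user information

val activity = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "account-activity")
  .load()
  .selectExpr("CAST(key AS STRING) AS userId", "CAST(value AS STRING) AS event")

// Each streaming event is joined against the static user table.
val enriched = activity.join(users, Seq("userId"))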
Example: Word Count
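The word count example, roughly as in the Spark Structured Streaming guide (the socket source on localhost:9999 is for local experimentation, e.g. fed by nc -lk 9999):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split each line into words and keep a running count per word.
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()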
Demo
1. Domain: Person
   Objective: number of persons by gender
2. Domain: Transaction
   Objective: number of transactions per category on an event-time window
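A rough sketch of what the first demo query could look like; the Person fields, schema, and input directory are assumptions, not the actual demo code.

import org.apache.spark.sql.types._

// assumes an active SparkSession `spark`
val personSchema = new StructType()
  .add("name", StringType)
  .add("gender", StringType)
  .add("age", IntegerType)

val persons = spark.readStream.schema(personSchema).json("/data/persons")

// Objective 1: number of persons by gender
persons.groupBy("gender").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()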
Thanks
apundhir
Internals

Query Management
- A handle on a running streaming computation
- Multiple queries can run at the same time; each query has a unique name used to track its state
- Several methods to operate on a query:
  - Stop the query
  - Wait for termination
  - Get status
  - Get running queries
  - ...
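A minimal sketch of the query handle, under the assumption that counts is a streaming Dataframe and spark is an active SparkSession:

val query = counts.writeStream
  .queryName("gender_counts")     // unique name used to track this query
  .outputMode("complete")
  .format("console")
  .start()

query.status                      // get the current status of the query
query.stop()                      // stop the query
// query.awaitTermination()       // or block until the query terminates

// all queries currently running on this session
spark.streams.active.foreach(q => println(q.name))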
Batch Execution Plan
Dataframe/Dataset -> Logical Plan -> Planner -> Execution Plan
- Logical Plan: abstract representation of the query
- Planner: Spark SQL Catalyst query optimizer
- Execution Plan: execute the most optimized query plan
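One way to see the planner at work is explain, which prints the logical plans and the physical plan Catalyst chose (a small sketch; the query itself is an arbitrary example):

// assumes an active SparkSession `spark` and `import spark.implicits._`
val df = spark.read.json("people.json")
val adults = df.where($"age" > 30).select($"name")

// extended = true prints the parsed, analyzed and optimized logical plans
// as well as the selected physical (execution) plan
adults.explain(true)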
Continuous Incremental Execution Plan
Dataframe/Dataset -> Logical Plan -> Planner -> Incremental Execution Plan 1, 2, 3, ...
- Logical Plan: abstract representation of the query
- Planner: Spark SQL Catalyst query optimizer
- The planner produces a new incremental execution plan for each batch of new data
Fault Tolerance Model
- All data and metadata needs to be recoverable
- Planner
  - Offsets are written to a fault-tolerant file system
  - In case of failure, read the offsets back from the file system and redo the execution
- Sources
  - Capable of regenerating the same set of data given the offsets
Fault Tolerance Model
- State
  - Intermediate state is maintained as versioned key-value pairs on Spark workers, backed by HDFS
- Sink
  - Idempotent by design; handles re-execution to avoid double commits
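In practice, the recoverable offsets and the versioned state live under the checkpoint location supplied when the query starts; a minimal sketch (the HDFS path is an assumption):

// assumes `wordCounts` is a streaming Dataframe
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/wordcount")  // offsets and state go here
  .start()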

Spark Kafka summit 2017


Editor's Notes

  • #5 Dependency: an RDD needs other data sources or RDDs to compute its result. Partitioning: given the RDD dependencies, how to parallelize work across the cluster; the scheduler also needs to take care of data locality, since with data stored in HDFS the block locations let the Spark scheduler optimize work distribution. Compute function: given a partition to work on, produce an iterator over the data. This is opaque (a black box): Spark has no idea about the type of data or the kind of operation. The data itself is opaque too, just an object that can be serialized and deserialized without any knowledge of its internals, so optimizations are limited.
  • #7 RDD: SQL as a string gives no idea about the syntax or whether a column even exists. Dataframe: knows the functions, so it can catch syntax errors at compile time, but has no idea about the data underneath. Dataset: knows the data fully; you work on typed objects, and analysis errors are reported before a distributed job starts. Which one to choose depends on the type of problem you are solving.
  • #9 When you don't know all of the fields ahead of time and don't want to compile your data class ahead of time, a Dataframe with the generic Row type is the answer. Important: the Dataset APIs are available in Java and Scala, since they need to work on JVM-based objects with generics; Python itself provides dynamic typing.
  • #11 Streaming applications do not run in isolation. They need to interact with batch data (data stores, ...) and do interactive analysis such as machine learning. Example ETL: store all data into long-term storage systems with no data loss and no data duplication (write-once guarantee). Status monitoring: late arrivals, window operations. Learning a model offline plus continuous learning. This type of application is known as a Continuous Application.
  • #12 Same as #11: streaming applications do not run in isolation; they interact with batch data and interactive analysis, which is what makes them Continuous Applications.
  • #13 Think of a continuously growing dataset: a single API for static Dataframes and continuous Dataframes.
  • #14 Think of a continuously growing dataset: a single API for static Dataframes and continuous Dataframes. Input: data from the source, treated as append-only. Trigger: how frequently to check for new data. Query: the operation on the input, e.g. filter or aggregation. Result: the final result table, updated at every trigger interval. You may not need to use the result as-is: for an aggregate query the whole result table has to be read each time, since the aggregates change over time, while for a filter you may just append new results to the sink. Output: what part of the result to write to the sink after each trigger.
  • #16 Continuous aggregation: groupBy + agg.
  • #17 The window function can also be applied to a file in batch mode with the same event-time aggregation. Sliding windows are supported.
  • #18 Example: the static data is user information and the stream is account activity, and you want to see in real time the events happening on accounts.
  • #19 Late arrival: handling late data with watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You define the watermark of a query by specifying the event-time column and a threshold on how late the data is expected to be in terms of event time.
  • #25 The planner knows how to create an incremental execution plan: it reads new data from the source, executes the incremental plan, and writes to the sink. Important: how is state passed between two executions (i.e. how are the previous aggregates passed to the new execution)? Spark does this with its own state management, maintaining running aggregates in memory backed by a file system for fault tolerance.
  • #26 Idempotent: multiple identical requests make the same change. WAL (write-ahead log): all modifications are written to the file system before they are applied to the data; both redo and undo are captured.
  • #27 Same as #26: the sink is idempotent and the write-ahead log captures modifications before they are applied.