Spark Structured Streaming
Ajay Pratap Singh Pundhir
Agenda
- Spark Components
- Spark Data Models
- Spark Streaming
- Continuous Applications
- Structured Streaming
- Example & Demo
Spark Components
Spark Data Models: RDD
- Lineage (dependency)
- Partitioning (distributed data)
- Compute function: Partition => Iterator[T]
Spark Data Models: Structure [T]
- Arrange data according to a plan
- More optimization
- Limits what we can do, but the majority of operations are still possible
  (RDDs are still available)
Data Models in Spark

                  RDD       Dataframe      Dataset
  Syntax error    Runtime   Compile time   Compile time
  Analysis error  Runtime   Runtime        Compile time
Dataset APIs
- Type safe
- Domain objects
- Lambda functions

// assumes an active SparkSession `spark` and `import spark.implicits._`
val df = spark.read.json("people.json")

// Convert data to domain objects.
case class Person(name: String, age: Long)
val ds = df.as[Person]
ds.filter(_.age > 30)

// Compute a histogram of age by name.
val hist = ds.groupByKey(_.name).mapGroups { (name, people) =>
  val buckets = new Array[Int](10)
  people.map(_.age).foreach { a =>
    buckets((a / 10).toInt) += 1
  }
  (name, buckets)
}
Dataframe = Dataset[Row]
- Unified API since Spark 2.0
- A Dataframe is a Dataset of generic Row objects
- Types can be enforced on the generic rows using df.as[MyClass]
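A minimal sketch of the difference, assuming a people.json file with name and age fields: with the generic Row you look fields up by name at runtime, while .as[Person] gives compile-time checked access.

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.appName("rows-vs-types").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

val df = spark.read.json("people.json")        // Dataframe = Dataset[Row]
df.map((r: Row) => r.getAs[String]("name"))    // field looked up by name at runtime

val ds = df.as[Person]                         // types enforced on the generic rows
ds.map(_.name)                                 // field access checked at compile time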
Spark Streaming
- Most widely used component
- Widely used for event processing
- Fault tolerant
- High throughput
- Unified stack with batch processing
Limitations of DStream
- Processing data using event time
  - DStream natively supports batch time only. What about late arrival of data?
- Interoperability between stream and batch
  - APIs are similar, but manual translation is needed (DStream & RDD)
Limitations of DStream (contd.)
- Exactly-once processing semantics
  - The application implementation has to take care of this
- Checkpointing efficiency
  - Checkpointing saves the complete RDD; there is no incremental save
Structured Streaming Model
Structured Streaming Model
- Complete output
- Delta output
- Append output
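A minimal sketch of how an output mode is selected when starting a query ("Delta" corresponds to what the Spark API calls the update mode; the socket source and console sink below are placeholders for local experimentation):

// assumes an active SparkSession `spark`
val counts = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")   // "complete": rewrite the whole result table every trigger
  // .outputMode("update")  // "update" (delta): write only the rows that changed
  // .outputMode("append")  // "append": write only new rows, never update old ones
  .format("console")
  .start()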
Batch ETL on Dataframe
Streaming ETL on Dataframe
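A hedged sketch of the batch/streaming symmetry on the Dataframe API (the input path, JSON schema, and Parquet sink are assumptions, not from the slides): only read/readStream and write/writeStream change.

import org.apache.spark.sql.types._

// assumes an active SparkSession `spark` and `import spark.implicits._`
val schema = new StructType().add("device", StringType).add("signal", DoubleType)

// Batch ETL on a Dataframe
spark.read.schema(schema).json("/data/input")
  .where($"signal" > 10)
  .write.parquet("/data/output")

// Streaming ETL: the same transformation on continuously arriving files
spark.readStream.schema(schema).json("/data/input")
  .where($"signal" > 10)
  .writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/data/checkpoint")
  .start()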
Continuous Windowed Aggregation
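A minimal sketch of a continuous windowed aggregation on event time; the column names and the watermark threshold are assumptions. The watermark (see note #19) bounds how late data may arrive before its state is dropped.

import org.apache.spark.sql.functions.window

// assumes `events` is a streaming Dataframe with columns
// `eventTime: Timestamp` and `word: String`, and `import spark.implicits._`
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")   // tolerate up to 10 minutes of lateness
  .groupBy(
    window($"eventTime", "10 minutes", "5 minutes"),  // 10-minute windows sliding every 5
    $"word")
  .count()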
Joining Streams with Static Data
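A minimal sketch of enriching a stream with static data, along the lines of note #18 (user information is static, account activity is the stream); the Kafka topic, paths, and the userId join key are assumptions.

// assumes an active SparkSession `spark`
val users = spark.read.parquet("/data/users")            // static user information

val activity = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "account-activity")
  .load()
  .selectExpr("CAST(key AS STRING) AS userId", "CAST(value AS STRING) AS event")

// Each streaming event is joined against the static user table.
val enriched = activity.join(users, Seq("userId"))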
Example: Word Count
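The word count example, roughly as in the Spark Structured Streaming guide (the socket source on localhost:9999 is for local experimentation, e.g. fed by nc -lk 9999):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split each line into words and keep a running count per word.
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()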
Demo
1. Domain: Person
   Objective: number of persons by gender
2. Domain: Transaction
   Objective: number of transactions per category on an event-time window
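A rough sketch of what the first demo query could look like; the Person fields, schema, and input directory are assumptions, not the actual demo code.

import org.apache.spark.sql.types._

// assumes an active SparkSession `spark`
val personSchema = new StructType()
  .add("name", StringType)
  .add("gender", StringType)
  .add("age", IntegerType)

val persons = spark.readStream.schema(personSchema).json("/data/persons")

// Objective 1: number of persons by gender
persons.groupBy("gender").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()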
Thanks
apundhir
Internals

Query Management
- A handle on a running streaming computation
- Multiple queries can run at the same time; each query has a unique name used to track its state
- Several methods to operate on a query:
  - Stop the query
  - Wait for termination
  - Get status
  - Get running queries
  - ...
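A minimal sketch of the query handle, under the assumption that counts is a streaming Dataframe and spark is an active SparkSession:

val query = counts.writeStream
  .queryName("gender_counts")     // unique name used to track this query
  .outputMode("complete")
  .format("console")
  .start()

query.status                      // get the current status of the query
query.stop()                      // stop the query
// query.awaitTermination()       // or block until the query terminates

// all queries currently running on this session
spark.streams.active.foreach(q => println(q.name))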
Batch Execution Plan
Dataframe/Dataset -> Logical Plan -> Planner -> Execution Plan
- Logical Plan: abstract representation of the query
- Planner: Spark SQL Catalyst query optimizer
- Execution Plan: execute the most optimized query plan
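One way to see the planner at work is explain, which prints the logical plans and the physical plan Catalyst chose (a small sketch; the query itself is an arbitrary example):

// assumes an active SparkSession `spark` and `import spark.implicits._`
val df = spark.read.json("people.json")
val adults = df.where($"age" > 30).select($"name")

// extended = true prints the parsed, analyzed and optimized logical plans
// as well as the selected physical (execution) plan
adults.explain(true)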
Continuous Incremental Execution Plan
Dataframe/Dataset -> Logical Plan -> Planner -> Incremental Execution Plan 1, 2, 3, ...
- Logical Plan: abstract representation of the query
- Planner: Spark SQL Catalyst query optimizer
- The planner produces a new incremental execution plan for each batch of new data
Fault Tolerance Model
- All data and metadata needs to be recoverable
- Planner
  - Offsets are written to a fault-tolerant file system
  - In case of failure, read the offsets back from the file system and redo the execution
- Sources
  - Capable of regenerating the same set of data given the offsets
Fault Tolerance Model
- State
  - Intermediate state is maintained as versioned key-value pairs on Spark workers, backed by HDFS
- Sink
  - Idempotent by design; handles re-execution to avoid double commits
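In practice, the recoverable offsets and the versioned state live under the checkpoint location supplied when the query starts; a minimal sketch (the HDFS path is an assumption):

// assumes `wordCounts` is a streaming Dataframe
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/wordcount")  // offsets and state go here
  .start()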

Spark Kafka summit 2017


Editor's Notes

  • #5 Dependency: an RDD needs other data sources or RDDs to compute its result. Partitioning: given the RDD dependencies, how to parallelize work across the cluster; the scheduler also needs to take care of data locality, since with data stored in HDFS the block locations let the Spark scheduler optimize work distribution. Compute function: given a partition to work on, produce an iterator over the data. This is opaque (a black box): Spark has no idea about the type of data or the kind of operation. The data itself is opaque too, just an object that can be serialized and deserialized without any knowledge of its internals, so optimizations are limited.
  • #7 RDD: SQL as a string gives no idea about the syntax or whether a column even exists. Dataframe: knows the functions, so it can catch syntax errors at compile time, but has no idea about the data underneath. Dataset: knows the data fully; you work on typed objects, and analysis errors are reported before a distributed job starts. Which one to choose depends on the type of problem you are solving.
  • #9 When you don't know all of the fields ahead of time and don't want to compile your data class ahead of time, a Dataframe with the generic Row type is the answer. Important: the Dataset APIs are available in Java and Scala, since they need to work on JVM-based objects with generics; Python itself provides dynamic typing.
  • #11 Streaming applications do not run in isolation. They need to interact with batch data (data stores, ...) and do interactive analysis such as machine learning. Example ETL: store all data into long-term storage systems with no data loss and no data duplication (write-once guarantee). Status monitoring: late arrivals, window operations. Learning a model offline plus continuous learning. This type of application is known as a Continuous Application.
  • #12 Same as #11: streaming applications do not run in isolation; they interact with batch data and interactive analysis, which is what makes them Continuous Applications.
  • #13 Think of a continuously growing dataset: a single API for static Dataframes and continuous Dataframes.
  • #14 Think of a continuously growing dataset: a single API for static Dataframes and continuous Dataframes. Input: data from the source, treated as append-only. Trigger: how frequently to check for new data. Query: the operation on the input, e.g. filter or aggregation. Result: the final result table, updated at every trigger interval. You may not need to use the result as-is: for an aggregate query the whole result table has to be read each time, since the aggregates change over time, while for a filter you may just append new results to the sink. Output: what part of the result to write to the sink after each trigger.
  • #16 Continuous aggregation: groupBy + agg.
  • #17 The window function can also be applied to a file in batch mode with the same event-time aggregation. Sliding windows are supported.
  • #18 Example: the static data is user information and the stream is account activity, and you want to see in real time the events happening on accounts.
  • #19 Late arrival: handling late data with watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You define the watermark of a query by specifying the event-time column and a threshold on how late the data is expected to be in terms of event time.
  • #25 The planner knows how to create an incremental execution plan: it reads new data from the source, executes the incremental plan, and writes to the sink. Important: how is state passed between two executions (i.e. how are the previous aggregates passed to the new execution)? Spark does this with its own state management, maintaining running aggregates in memory backed by a file system for fault tolerance.
  • #26 Idempotent: multiple identical requests make the same change. WAL (write-ahead log): all modifications are written to the file system before they are applied to the data; both redo and undo are captured.
  • #27 Same as #26: the sink is idempotent and the write-ahead log captures modifications before they are applied.