Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with Kafka and Spark's Structured Streaming

Easy, Scalable, Fault-tolerant
stream processing with
Structured Streaming
Tathagata “TD” Das
@tathadas
Spark Meetup @ Intel
Santa Clara, 23rd March 2017

About Me
Spark PMC Member
Built Spark Streaming in UC Berkeley
Currently focused on building Structured Streaming
Software engineer at Databricks
2

building robust
stream processing
apps is hard
3

Complexities in stream processing
4
Complex Data
Diverse data formats
(json, avro, binary, …)
Data can be dirty,
late, out-of-order
Complex
Systems
Diverse storage
systems and formats
(SQL, NoSQL, parquet, ...
)
System failures
Complex
Workloads
Event time processing
Combining streaming
with interactive
queries, machine
learning

Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
5

you
should not have to
reason about streaming
6

you
should only write
simple batch-like queries
Spark
should automatically streamify
them 7

Treat Streams as Unbounded Tables
8
data stream unbounded input table
new data in the
data stream
=
new rows appended
to a unbounded table

New Model Trigger: every 1 sec
Time
Input data up
to t = 3
Quer
y
Input: data from source as an
append-only table
Trigger: how frequently to check
input for new data
Query: operations on input
usual map/filter/reduce
new window, session ops
t=1 t=2 t=3
data up
to t = 1
data up
to t = 2

New Model
result
up to t
= 1
Result
Quer
y
Time
data up
to t = 1
Input data up
to t = 2
result
up to t
= 2
data up
to t = 3
result
up to t
= 3
Result: final operated table
updated after every trigger
Output: what part of result to write
to storage after every
trigger
Complete output: write full result table every
time Output
[complete mode]
write all rows in result table to storage
t=1 t=2 t=3

New Model
t=1 t=2 t=3
Result
Quer
y
Time
Input data up
to t = 3
result
up to t
= 3
Output
[append mode] write new rows since last trigger to
storage
Result: final operated table
updated after every trigger
Output: what part of result to write
to storage after every
trigger
Complete output: write full result table every
time
Append output: write only new rows that got
added to result table since previous batch
data up
to t = 1
data up
to t = 2
result
up to t
= 1
result
up to t
= 2

New Model
t=1 t=2 t=3
Result
Quer
y
Time
Input data up
to t = 3
result
up to t
= 3
Output
[append mode] write new rows since last trigger to
storage
Conceptual model that
guides how to think of a
streaming query as a
simple table query
Engine does not need to
keep the full input table in
memory once it has
streamified it
data up
to t = 1
data up
to t = 2
result
up to t
= 1
result
up to t
= 2

DataFrames,
Datasets, SQL
input = spark.readStream
.format("json")
.load("source-path")
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
Logical Plan
Streaming
Source
Project
device, signal
Filter
signal > 15
Streaming
Sink
Spark automatically streamifies!
Spark SQL converts batch-like query to a series of
incremental execution plans operating on new batches
of data
Series of Incremental
Execution Plans
process
newfiles
t =
1
t =
2
t =
3
process
newfiles
process
newfiles

Fault-tolerance with Checkpointing
Checkpointing - metadata
(e.g. offsets) of current batch
stored in a write ahead log in
HDFS/S3
Query can be restarted from the
log
Streaming sources can replay the
exact data range in case of failure
Streaming sinks can dedup
end-to-end
exactly-once
guarantees
process
newfiles
t =
1
t =
2
t =
3
process
newfiles
process
newfiles
write
ahead
log

static data =
bounded table
streaming data =
unbounded table
Unified API - Dataset/DataFrame
Single API !

16
Dataset/DataFrame  Tables
Unified, structured APIs in Spark to transform data
in Scala , Java , Python , R
SQL
spark.sql("
SELECT type, sum(signal)
FROM devices
GROUP BY type
")
val df: DataFrame =
spark
.table("device-data")
.groupBy("type")
.sum("signal"))
DataFrames Dataset
val ds: Dataset[(String, Double)] =
spark
.table("device-data")
.as[DeviceData]
.groupByKey(_.type)
.mapValues(_.signal)
.reduceGroups(_ + _)

17
Dataset/DataFrame  Tables
Unified, structured APIs in Spark to transform data
in Scala , Java , Python , R
SQL DataFrames Dataset
Compile time type
safety

Batch Queries with DataFrames
input = spark.read
.format("json")
.load("source-path")
result = input
result.write
.format("parquet")
.save("dest-path")
Read from Json file
Select some devices
Write to parquet file

Streaming Queries with DataFrames
input = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.load()
result = input
result.writeStream
.format("parquet")
.start("dest-path")
Read from Kafka
Replace read with readStream
Change format to kafka
Select some devices
Code does not change
Write to Parquet file stream
Replace save() with start()

Complex Streaming ETL
Structured Streaming enables raw data to be
available as structured data in seconds, for more
interactive and complex analytics
20
table
seconds
1010101
0

Complex Streaming ETL
21
Example
- Json data being received in
Kafka
- Parse nested json and flatten
it
- Store in structured Parquet
table
- Get end-to-end failure
guarantees
val rawData = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.load()
val parsedData = rawData
.selectExpr("cast (value as string) as json"))
.select(from_json("json", schema).as("data"))
.select("data.*")
val query = parsedData.writeStream
.option("checkpointLocation", "/checkpoint")
.partitionBy("date")
.format("parquet")
.start("/parquetTable/")

Reading from Kafka [Spark 2.1]
22
Support Kafka 0.10.0.1
Specify options to configure
How?
kafka.boostrap.servers => broker1
What?
subscribe => topic1,topic2,topic3 // fixed list of topics
subscribePattern => topic* // dynamic list of topics
assign => {"topicA":[0,1] } // specific partitions
Where?
startingOffsets => latest(default) / earliest / {"topicA":{"0":23,"1":345} }
.format("kafka")
.load()

Reading from Kafka
23
.format("kafka")
.load()
rawData dataframe
has the following
columns
key value topic partitio
n
offset timestamp
[binary] [binary] "topicA" 0 345 1486087873
[binary] [binary] "topicB" 3 2890 1486086721

Transforming Data
Cast binary value to string
Name it column json
24
.selectExpr("cast (value as string) as json")
.select(from_json("json").as("data"))
.select("data.*")

Transforming Data
Name it column json
Parse json string and expand
into nested columns, name it
data
25
.select("data.*")
json
{ "timestamp": 1486087873, "device":
"devA", …}
{ "timestamp": 1486082418, "device":
"devX", …}
data (nested)
timestamp device …
1486087873 devA …
1486086721 devX …
from_json("json")
as "data"

Transforming Data
Name it column json
data
Flatten the nested columns
26
.select("data.*")
data (nested)
14860878
73
devA …
14860867
21
devX …
14860878
73
devA …
14860867
21
devX …
select("data.*")
(not nested)

Transforming Data
Name it column json
data
Flatten the nested columns
27
.select("data.*")
powerful built-in APIs to
perform complex data
transformations
from_json, to_json, explode, ...
100s of functions
(see our blog post)

Writing to Parquet table
Save parsed data as
Parquet table in the given
path
Partition files by date so
that future queries on time
slices of data is fast
e.g. query on last 48 hours of
data
28
.option("checkpointLocation", ...)
.format("parquet")
.start("/parquetTable")

Checkpointing
Enable checkpointing by
setting the checkpoint
location to save offset
logs
start actually starts a
continuous running
StreamingQuery in the
Spark cluster
29
.format("parquet")

Streaming Query
query is a handle to the continuously
running StreamingQuery
Used to monitor and manage the
execution
30
.format("parquet")
process
newdata
t = 1 t = 2 t = 3
process
newdata
process
newdata
StreamingQuery

Data Consistency on Ad-hoc Queries
Data available for complex, ad-hoc analytics within
seconds
Parquet table is updated atomically, ensures prefix
integrity
Even if distributed, ad-hoc queries will see either all updates
from streaming query or none, read more in our blog
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
31
seconds!
complex, ad-hoc
queries on
latest
data

Event-time Aggregations
Many use cases require aggregate statistics by event
time
E.g. what's the #errors in each system in the 1 hour windows?
Many challenges
Extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event-time stuff
32

Event time Aggregations
Windowing is just another type of grouping in Struct.
Streaming
number of records every hour
33
parsedData
.groupBy(window("timestamp","1 hour"))
.count()
parsedData
.groupBy(
"device",
window("timestamp","10 mins"))
.avg("signal")
avg signal strength of each
device every 10 mins

Event-time Aggregations
Any number of
simultaneous aggs
Custom aggs using
reduceGroups, UDAFs
34

Stateful Processing for Aggregations
Aggregates has to be saved as
distributed state between
triggers
Each trigger reads previous state and
writes updated state
State stored in memory, backed by
write ahead log in HDFS/S3
Fault-tolerant, exactly-once
guarantee!
35
process
newdata
t = 1
sink
src
t = 2
process
newdata
sink
src
t = 3
process
newdata
sink
src
state state
write
ahea
d log
state updates
are written to
log for checkpointing
state

Watermarking and Late Data
Watermark [Spark 2.1] -
threshold on how late an event
is expected to be in event time
Trails behind max seen event
time
Trailing gap is configurable
36
event time
max event
time
watermark data older
than
watermark
not expected
12:30 PM
12:20 PM
trailing
gap
of 10 mins

Data newer than watermark may
be late, but allowed to aggregate
Data older than watermark is
"too late" and dropped
Windows older than watermark
automatically deleted to limit the
amount of intermediate state
37
max event
time
event time
watermark
late data
allowed to
aggregate
data too
late,
dropped

38
max event
time
event time
watermark
allowed
lateness
of 10
mins
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
late data
allowed to
aggregate
data too
late,
dropped

Watermarking to Limit State [Spark 2.1]
39
data too late,
ignored in counts,
state dropped
Processing Time12:00
12:05
12:10
12:15
12:10 12:15 12:20
12:07
12:13
12:08
EventTime
12:15
12:18
12:04
watermark updated to
12:14 - 10m = 12:04
for next trigger,
state < 12:04 deleted
data is late, but
considered in
counts
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
system tracks max
observed event
time
12:08
wm =
12:04
10min
12:14
more details in online
programming guide

Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState
allows any user-defined
stateful ops to a
user-defined state
supports timeouts
fault-tolerant, exactly-
once
supports Scala and Java 40
dataset
.groupByKey(groupingFunc)
.mapGroupsWithState(mappingFunc)
def mappingFunc(
key: K,
values: Iterator[V],
state: KeyedState[S]): U = {
// update or remove state
// set timeouts
// return mapped value
}

Many more updates!
StreamingQueryListener [Spark 2.1]
Receive of regular progress heartbeats for health and perf monitoring
Automatic in Databricks!!
Streaming Deduplication [Spark 2.2]
Automatically eliminate duplicate data from Kafka/Kinesis/etc.
More Kafka Integration [Spark 2.2]
Run batch queries on Kafka, and write to Kafka from batch/streaming
queries
Kinesis Source
Read from Amazon Kinesis 41

Future Directions
Stability, stability, stability
Needed to remove the Experimental tag
More supported operations
Stream-stream joins, …
Stable source and sink APIs
Connect to your streams and stores
More sources and sinks
JDBC, …
42

More Info
Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks blog posts for more focused discussions
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-
2.html
and more to come, stay tuned!!
43

Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with Kafka and Spark's Structured Streaming

Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with Kafka and Spark's Structured Streaming

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with Kafka and Spark's Structured Streaming

Similar to Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with Kafka and Spark's Structured Streaming (20)

More from confluent

More from confluent (20)

Recently uploaded

Recently uploaded (20)

Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with Kafka and Spark's Structured Streaming