Productionize
Spark Structured Streaming
Ivan Kosianenko
Who am I
Software Architect at AppsFlyer
Data Engineering
Machine Learning
AppsFlyer in a Nutshell
DEMO DATA
Time Series Lifetime Value
[Figure: time series and lifetime value timelines, with dates 1.1.2017 and 1.1.2018 through 5.1.2018]
Mutable data vs. immutable data
December 2017: MemSQL + Druid
Clojure
streamers
Live 1 Day Data
MemSQL
Cluster
Dashboard
Middleware
API
Druid
Daily 4 Year
LTV Table
12 billion events daily
KAFKA
Now: ClickHouse + Druid
Live 1 Day Data
ClickHouse
Cluster
Dashboard
Middleware
API
Druid
Daily 5 Year
LTV Table
38 billion events daily
KAFKA
Structured Streaming
How to DIY a Clojure streamer?
● take clj-kafka and JDBC driver
● add multi-threaded execution
● write custom business logic
How to DIY a Clojure streamer? (continued)
● distributed orchestration
● resource allocation
● retry logic
● monitoring
● delivery guarantee
● ???
The Law of Leaky Abstractions
Kafka Connect
Apache Storm
Heron Streaming
Onyx
Spark Structured Streaming
source
sink
business logic
Custom Foreach Writer
Add row
Write batch to anywhere
Create batch
Be careful!
For each row
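The create batch / add row / write batch flow can be sketched as a small buffering writer. This is a minimal sketch, not the implementation from the talk: it follows the open/process/close method shape that PySpark's DataStreamWriter.foreach accepts, while write_batch and batch_size are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, List, Optional


class BatchingForeachWriter:
    """Buffers rows and hands them to `write_batch` in chunks.

    `write_batch` (any callable taking a list of rows) and `batch_size`
    are assumptions of this sketch, not names from the talk.
    """

    def __init__(self, write_batch: Callable[[List[Any]], None], batch_size: int = 1000):
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.buffer: List[Any] = []

    def open(self, partition_id: int, epoch_id: int) -> bool:
        # Create batch: every partition/epoch starts with an empty buffer.
        self.buffer = []
        return True  # True tells the caller to process this partition's rows

    def process(self, row: Any) -> None:
        # Add row: called once for each row.
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def close(self, error: Optional[Exception]) -> None:
        # Flush the tail only if the partition finished without an error,
        # so a failed attempt does not write a partial final batch.
        if error is None:
            self._flush()
        self.buffer = []

    def _flush(self) -> None:
        # Write batch to anywhere: delegate to the injected callable.
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []
```

Full batches flushed before a failure are still written, so a retried partition can deliver the same rows again; the sink has to tolerate that.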
Spark Structured Streaming
Write-ahead logging
ClickHouse
+
Exactly-once end-to-end processing
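One common way to get this effect is to let the engine replay a batch after a failure and make the sink ignore batches it has already stored. The slide does not say how the talk's ClickHouse sink achieves this, so the in-memory class below is only an illustration of the idea; IdempotentSink and its deduplication key (partition_id, epoch_id) are assumptions.

```python
from typing import Any, List, Set, Tuple


class IdempotentSink:
    """In-memory stand-in for a sink that ignores replayed batches (hypothetical)."""

    def __init__(self) -> None:
        self.rows: List[Any] = []
        self._seen: Set[Tuple[int, int]] = set()

    def write(self, partition_id: int, epoch_id: int, batch: List[Any]) -> bool:
        """Store the batch once; return False if this key was already written."""
        key = (partition_id, epoch_id)
        if key in self._seen:
            return False  # replay after a failure: nothing is duplicated
        self.rows.extend(batch)
        self._seen.add(key)
        return True
```

Replaying the same (partition_id, epoch_id) leaves the stored rows unchanged, which is what turns at-least-once delivery into an exactly-once result.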
Spark Structured Streaming
Streaming Query
● Explain
● Stop
● Await termination
● Get termination error
● Track progress
Last progress
Checkpoint dir
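The query handle operations listed above can feed a simple health summary. This is a sketch that assumes a PySpark-style StreamingQuery exposing isActive, lastProgress (a dict, or None before the first batch) and exception(); summarize_query is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def summarize_query(query: Any) -> Dict[str, Any]:
    """Collect a few health fields from a StreamingQuery-like object."""
    progress: Optional[Dict[str, Any]] = query.lastProgress  # None before the first batch
    error = query.exception()  # None unless the query stopped with an error
    return {
        "active": query.isActive,
        "error": None if error is None else str(error),
        "batch_id": None if progress is None else progress["batchId"],
        "input_rows": None if progress is None else progress["numInputRows"],
    }
```

Because the function only reads attributes, it can be exercised with a plain stub object before it is pointed at a real query.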
Track lag between Spark and Kafka
1. Get last offsets from Spark
2. Get last offsets from Kafka
3. Send the difference to statsd
Dashboard and alerts
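The three steps above can be sketched as follows. The offsets are passed in as plain partition-to-offset mappings, so how they are fetched from Spark and Kafka is left out; the function names, the metric name, and the statsd host and port are assumptions of this sketch.

```python
import socket
from typing import Dict


def total_lag(spark_offsets: Dict[str, int], kafka_offsets: Dict[str, int]) -> int:
    """Sum per-partition lag between Kafka's latest and Spark's last offsets.

    Assumption: a partition Spark has not reported yet is counted from offset 0.
    """
    return sum(
        max(0, latest - spark_offsets.get(partition, 0))
        for partition, latest in kafka_offsets.items()
    )


def statsd_gauge(name: str, value: int) -> bytes:
    """Format a gauge in the plain statsd line protocol: <name>:<value>|g."""
    return f"{name}:{value}|g".encode("ascii")


def send_gauge(name: str, value: int, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send the gauge over UDP; host and port are placeholder defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_gauge(name, value), (host, port))
```

For example, send_gauge("spark.kafka.lag", total_lag(spark_offsets, kafka_offsets)) would publish one number per check, which is enough to drive a dashboard and an alert threshold.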
End-to-end test monitoring
1. Send a test message to Kafka
2. Wait until the message appears in ClickHouse
3. Send the latency to statsd
Dashboard and alerts
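The probe described above can be sketched as a send-then-poll loop. The send and exists callables stand in for the Kafka producer and the ClickHouse lookup, which are not shown here; all names, the timeout, and the poll interval are assumptions.

```python
import time
import uuid
from typing import Callable, Optional


def measure_latency(
    send: Callable[[str], None],
    exists: Callable[[str], bool],
    timeout_s: float = 60.0,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Send a unique marker and poll until it is visible downstream.

    Returns the observed latency in seconds, or None if the timeout expires.
    """
    marker = uuid.uuid4().hex  # unique id so runs do not see each other's messages
    started = clock()
    send(marker)
    while clock() - started < timeout_s:
        if exists(marker):
            return clock() - started
        sleep(poll_s)
    return None
```

A None result is worth reporting as its own signal, since a message that never arrives is a different failure from a slow one.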
Production config
Turn off dynamic allocation, or set min/max executors
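Both options map onto standard Spark properties. The helper below is a sketch that builds either variant as a dict of settings; executor_conf and its default values are hypothetical, and the right executor counts depend on the workload.

```python
from typing import Dict, Optional


def executor_conf(
    fixed: Optional[int] = None,
    min_executors: int = 1,
    max_executors: int = 1,
) -> Dict[str, str]:
    """Return Spark settings for a fixed or a bounded executor count."""
    if fixed is not None:
        # Option 1: dynamic allocation off, fixed number of executors.
        return {
            "spark.dynamicAllocation.enabled": "false",
            "spark.executor.instances": str(fixed),
        }
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    # Option 2: dynamic allocation on, bounded from both sides.
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
```

Each key/value pair can be passed to spark-submit as a --conf argument.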
Deploy with custom Spark operator
Airflow
Custom operator tries to figure out:
● Is the current Spark job running?
● Is it the latest version with the latest config?
If not -> submit the Spark job to YARN with
yarn.tag = hash(jar_version + run config)
Checked once a minute
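The decision the operator makes can be sketched as a tag comparison. The hashing scheme (SHA-256 over a JSON payload, truncated to 16 characters) and the function names are assumptions of this sketch; how the tags of running applications are read from YARN is left out.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def deploy_tag(jar_version: str, run_config: Dict[str, Any]) -> str:
    """Derive a stable tag from the jar version and the run config."""
    # sort_keys makes the tag independent of the order of config keys.
    payload = json.dumps({"jar": jar_version, "config": run_config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def needs_submit(running_tags: Iterable[str], wanted_tag: str) -> bool:
    """True when no running application carries the wanted tag."""
    return wanted_tag not in set(running_tags)
```

Any change to the jar version or the config produces a new tag, so one check covers both "is it running?" and "is it the latest?".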
Join us
Thank you
ivan.kosianenko@appsflyer.com
t.me/appsflyerkyiv
