Spark Streaming
Spark Streaming recipes and “exactly once” semantics revised
Appsflyer: Basic Flow
[Diagram: advertiser -> publisher -> click -> install]
Appsflyer as Marketing Platform
Attribution
Statistics: clicks, installs, in-app events, launches, uninstalls, etc.
Lifetime value
Retargeting
Fraud detection
Prediction
A/B testing
etc...
Appsflyer Technology
~7B events/day
Hundreds of machines in Amazon
Tens of micro-services
Apache Kafka
[Diagram: micro-services communicating over Apache Kafka, persisting to Amazon S3, MongoDB, Redshift, and Druid]
What is stream processing?
Stream Processing
Minimize latency between data ingestion and insights
Use cases
● Real-time dashboard
● Fraud prevention
● Ad bidding
● etc.
Stream Processing Frameworks
Key Differences
● Latency
● Windowing support
● Delivery semantics
● State management
● Ease of API use
● Programming language support
● Community support
● etc.
Apache Spark
Spark Driver
val textFile = sc.textFile("hdfs://...")      // read lines from HDFS
val counts = textFile
  .flatMap(line => line.split(" "))           // split each line into words
  .map(word => (word, 1))                     // pair each word with a count of 1
  .reduceByKey(_ + _)                         // sum the counts per word
counts.saveAsTextFile("hdfs://...")           // write the result back to HDFS
[Diagram: Spark architecture: the driver talks to a cluster manager, which manages worker nodes hosting executors that run tasks]
Apache Spark
[Diagram: RDD lineage: data is read from the external world, flows through chained transformations between RDDs, and a final action writes back to the external world]
Streaming in Spark
Advantages
● Reuse existing infra
● Rich API
● Straightforward windowing
● It’s easier to implement “exactly once”
Disadvantages
● Latency
[Diagram: the input stream is chopped into micro-batches that the Spark engine turns into batches of processed data]
Windowing in Spark Streaming
● Window length and sliding interval must be multiples of the batch interval
● Possible usages:
○ Finding the top N elements over the last time period M
○ Pre-aggregating data before inserting it into the DB
○ etc.
[Diagram: a DStream with a window of a given length sliding by a given interval]
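A minimal sketch of windowed counting (the socket source is illustrative): words are counted over a 60-second window recomputed every 20 seconds, both multiples of the 10-second batch interval.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedCounts")
// 10-second batch interval; the window parameters below must be multiples of it.
val ssc = new StreamingContext(conf, Seconds(10))

val words = ssc.socketTextStream("localhost", 9999) // illustrative source
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Window length (60s) and sliding interval (20s) are both
// multiples of the 10-second batch interval, as required.
val windowedCounts = words.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(20))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()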
Do we need “exactly once” semantics?
Data Processing Paradigms
Lambda architecture: a batch layer plus a real-time layer.
Kappa architecture: a real-time layer only.
https://en.wikipedia.org/wiki/Lambda_architecture
http://www.kappa-architecture.com/
How do we achieve “exactly once”?
Achieving “Exactly once”
Producer: doesn’t duplicate messages.
Stream processor: tracks state (checkpointing), built from resilient components.
Consumer: reads only new messages.
The “easy” way: message deduplication based on some ID, or an idempotent output destination (see the sketch below).
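A minimal sketch of the idempotent-destination option, assuming a PostgreSQL-style table whose primary key is the event ID (table name and schema are hypothetical). Replayed messages rewrite the same rows, so at-least-once delivery behaves like exactly-once end to end.

// Hypothetical idempotent sink: the event ID is the primary key, so writing
// the same event twice has no effect.
case class Event(id: String, name: String, payload: String)

def writeIdempotent(events: Iterator[Event], conn: java.sql.Connection): Unit = {
  val stmt = conn.prepareStatement(
    "INSERT INTO events (id, name, payload) VALUES (?, ?, ?) " +
    "ON CONFLICT (id) DO NOTHING") // PostgreSQL upsert syntax, assumed schema
  events.foreach { e =>
    stmt.setString(1, e.id)
    stmt.setString(2, e.name)
    stmt.setString(3, e.payload)
    stmt.addBatch()
  }
  stmt.executeBatch()
  stmt.close()
}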
Stream Checkpointing
https://en.wikipedia.org/wiki/Snapshot_algorithm
Barriers are injected into the data stream.
Once an intermediate operator has seen the barrier on all of its input streams, it emits the barrier on all of its outgoing streams.
Once all sink operators have seen the barrier for a snapshot, they acknowledge it, and the snapshot is considered committed.
Multiple barriers, belonging to different snapshots, can be in flight in the stream at the same time.
Operators store their state in external storage.
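A toy sketch of the barrier-alignment rule (this is the Chandy-Lamport style of snapshotting used by frameworks such as Flink; the classes below are illustrative, not any framework’s API):

// Toy sketch: an operator forwards a barrier downstream only after it has
// seen that barrier on ALL of its input channels, snapshotting its state first.
case class Barrier(snapshotId: Long)

class Operator(numInputs: Int,
               snapshotState: Long => Unit,   // e.g. write state to storage
               emitDownstream: Barrier => Unit) {
  private val seen = scala.collection.mutable.Map.empty[Long, Int].withDefaultValue(0)

  def onBarrier(b: Barrier): Unit = {
    seen(b.snapshotId) += 1
    if (seen(b.snapshotId) == numInputs) { // aligned: barrier arrived on every input
      snapshotState(b.snapshotId)          // store operator state externally
      emitDownstream(b)                    // propagate the barrier
      seen.remove(b.snapshotId)
    }
  }
}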
Micro-batch Checkpointing
[Diagram: a chain of receive -> process -> state steps, one per micro-batch]
while (true) {
// 1. receive next batch of data
// 2. compute next stream and state
}
The micro-batch is the unit of fault tolerance.
Resilience in Spark Streaming
All Spark components must be resilient!
Driver application process
Master process
Worker process
Executor process
Receiver thread
Worker node
Driver Resilience
Client mode
The driver application runs inside the “spark-submit” process. If this process dies, the entire application is killed.
Cluster mode
The driver application runs on one of the worker nodes. The “--supervise” option makes the driver restart on a different worker node if it fails (see the submission example below).
Running through Marathon
Marathon can restart failed applications automatically.
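For cluster mode, a supervised submission looks roughly like this (the master URL, class name, and jar are placeholders):

spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingApp \
  streaming-app.jar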
Master Resilience
Single master
If the master dies, the entire application is killed.
Multi-master mode
A standby master is elected active.
Worker nodes automatically register with the new master.
Leader election via ZooKeeper.
Worker Resilience
Worker process
If it fails, all of its child processes (driver or executor) are killed, and a new worker process is launched automatically.
Executor process
Restarted on failure by the parent worker process.
Receiver thread
Runs inside the executor process, so it is recovered the same way as the executor.
Resilience doesn’t ensure “exactly once”
Checkpointing
Checkpointing helps recover from driver failure.
Stores the computation graph in a fault-tolerant location (such as HDFS or S3).
What is saved as metadata:
Metadata of queued but not yet processed batches
Stream operations (code)
Configuration
Disadvantages
Frequent checkpointing reduces throughput.
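In code, driver recovery hinges on StreamingContext.getOrCreate; a minimal sketch (the checkpoint path is an example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/my-app" // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // metadata (and state) are persisted here
  // ... define the stream operations here; they are checkpointed as code ...
  ssc
}

// Fresh start: builds a new context. After a driver failure: reconstructs the
// computation graph and the queued-but-unprocessed batches from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()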
Write Ahead Log
Synchronously saves received data to fault-tolerant storage.
Helps recover blocks that were received but not yet committed.
Disadvantages
Additional storage is required.
Reduced throughput.
[Diagram: a receiver inside the executor writes the input stream to the write-ahead log as it arrives]
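Enabling the WAL is a single configuration flag; and since the WAL already persists every block, replicated in-memory storage becomes redundant. A sketch:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("WALApp")
  // Write every received block to fault-tolerant storage before acknowledging.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

// With the WAL on, in-memory replication is redundant; serialized
// memory-and-disk storage is the usual recommendation for receivers.
val storageLevel = StorageLevel.MEMORY_AND_DISK_SER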
Problems with Checkpointing and WAL
Data can be lost even when using checkpointing: batches held in memory are lost on driver failure.
Checkpointing and the WAL prevent data loss, but do not provide “exactly once” semantics.
If the receiver fails before updating offsets in ZooKeeper, we are in trouble: the data will be re-read from both Kafka and the WAL.
Still not exactly once!
The Solution
Don’t use receivers; read directly from the input stream instead.
The driver instructs the executors which range to read from the stream (the stream must be rewindable).
The read range is attached to the batch itself.
Example (Kafka direct stream):
Application driver (StreamingContext):
1. Periodically queries the latest offsets for topics & partitions
2. Calculates the offset ranges for the next batch
3. Schedules the next micro-batch job
Executor:
4. Consumes data for the calculated offsets
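With the Spark 1.x Kafka 0.8 integration this is KafkaUtils.createDirectStream; a minimal sketch (broker list and topic are placeholders, ssc as defined earlier):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events") // placeholder topic

// No receiver involved: the driver computes an offset range per partition for
// every batch, and executors read exactly those ranges straight from Kafka.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)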
Example #1
The Problem
Counting events, grouped by different sets of dimensions.
A pre-aggregation layer reduces the load on the DB during traffic spikes.
DB:
app_id       | event_name      | country | count
com.app.bla  | FIRST_LAUNCH    | US      | 152
com.app.bla  | purchase        | IL      | 10
com.app.jo   | custom_inapp_20 | US      | 45
Transactional Events Aggregator
Based on a SQL database.
Kafka partition offsets are stored in the DB.
Event counters are incremented in a transaction, guarded by the current and previously stored offsets.
Flow (driver, executors, SQL DB):
1. Read the last Kafka partitions and their offsets from the DB
2. Create a direct Kafka stream based on the read partitions and offsets
3. Consume events from Kafka
4. Aggregate the events
5. Upsert the event counters along with the current offsets in a single transaction
Creating Kafka Stream
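A sketch of what this step might look like, assuming a hypothetical loadOffsetsFromDb() helper that returns the (topic, partition) -> offset map persisted by step 5, with kafkaParams and ssc as before:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical helper: reads the committed (topic, partition) -> offset map
// from the SQL database.
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsetsFromDb()

val messageHandler =
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)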
Aggregation & Writing to DB
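And a sketch of the write path; parseDimensions and the db transaction helper are hypothetical. The essential point is that counters and offsets commit in the same transaction, guarded by the stored offsets, so a replayed batch becomes a no-op:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // The direct stream attaches the consumed offset ranges to each batch's RDD.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  val counts = rdd
    .map { case (_, event) => (parseDimensions(event), 1L) } // hypothetical parser
    .reduceByKey(_ + _)
    .collect()

  // Hypothetical transactional helper: upsert counters and offsets atomically.
  // If the offsets stored in the DB don't match the ranges this batch started
  // from, the batch was already applied once and is skipped.
  db.withTransaction { tx =>
    tx.upsertCounters(counts)
    tx.commitOffsets(offsetRanges)
  }
}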
Example #2
Snapshotting Events Aggregator
Aggregator application (driver, executors, S3):
1. Read the last Kafka partitions and their offsets from S3
2. Create a direct Kafka stream based on the read partitions and offsets
3. Consume events from Kafka
4. Aggregate the events
5. Store the processed data and the Kafka offsets under /data/ts=<timestamp> and /offsets/ts=<timestamp> respectively
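A sketch of step 5 (the bucket name is assumed): data and the offsets that produced it are written under the same timestamp, so they commit or fail together:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val ts = System.currentTimeMillis()

  // Same timestamp for data and offsets: the loader can always tell exactly
  // which input range a snapshot covers.
  rdd.map(_._2).saveAsTextFile(s"s3://bucket/data/ts=$ts") // assumed bucket
  rdd.sparkContext
    .parallelize(offsetRanges.map(_.toString))
    .saveAsTextFile(s"s3://bucket/offsets/ts=$ts")
}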
Snapshotting Events Aggregator
Loader application (driver, executors, S3, Cassandra):
1. Find the last committed timestamp
2. Read the data for that timestamp from /data/ts=<timestamp>
4. Aggregate the events by different dimensions, and split them into cubes
5. Increment the counters in the different cubes (Cassandra)
6. Delete the offsets and data for the timestamp: /offsets/ts=<timestamp> and /data/ts=<timestamp>
[Diagram: aggregator instances write snapshots to S3; the loader reads them and updates Cassandra]
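A sketch of the loader, with hypothetical helpers (findLastCommittedTimestamp, parseEvent, cubeKeys, createCassandraSession, deleteTimestamp). Cassandra counter increments are not idempotent, so deleting the snapshot only after a successful load is what marks the timestamp as committed:

val ts = findLastCommittedTimestamp()               // hypothetical helper
val data = sc.textFile(s"s3://bucket/data/ts=$ts")  // assumed bucket

data
  .map(parseEvent)                                  // hypothetical parser
  .flatMap(event => cubeKeys(event))                // split into per-cube keys
  .map(key => (key, 1L))
  .reduceByKey(_ + _)
  .foreachPartition { counters =>
    val session = createCassandraSession()          // hypothetical
    counters.foreach { case (key, delta) =>
      // Cassandra counter column: increments, not idempotent on replay.
      session.execute(s"UPDATE cubes SET count = count + $delta WHERE key = '$key'")
    }
    session.close()
  }

// Deleting /offsets/ts=<timestamp> and /data/ts=<timestamp> commits the
// timestamp; a crash before this point replays the whole load.
deleteTimestamp(ts)                                 // hypothetical helper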
Deployment
We use Mesos:
Master HA for free.
Marathon keeps the Spark Streaming application alive.
Tips
Read the performance tuning guide carefully:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
Inspect, re-configure, retry.
Turn off Spark dynamic allocation.
Preserve data locality.
Find the balance between cores, batch interval, and block interval.
Processing time must be less than the batch interval.
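A few of the relevant settings, as an illustration (the values are examples, not recommendations):

val conf = new SparkConf()
  // Streaming wants a stable executor set; dynamic allocation works against it.
  .set("spark.dynamicAllocation.enabled", "false")
  // How long the scheduler waits for a data-local slot before relaxing locality.
  .set("spark.locality.wait", "500ms")
  // Partitions per receiver per batch is roughly batch interval / block
  // interval, so this knob balances cores against batch and block intervals.
  .set("spark.streaming.blockInterval", "200ms")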
Thank you!
(and we’re hiring)
Right Now
Real-time analytics dashboard
Right Now
Processes ~50M events a day.
Reduces the stream in two sliding windows:
1. Last 5 seconds (“now”)
2. Last 10 minutes (“recent”)
At-most-once semantics.
Right Now
Why Spark?
Experienced with Spark
Convenient Clojure wrappers (Sparkling, Flambo)
Documentation and community
Right Now
In Production
3 m3.xlarge machines for the workers (4 cores each)
spark.default.parallelism=10
Lesson learned: foreachRDD and foreachPartition
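The pitfall: the body of foreachRDD runs on the driver, while the closure passed to foreachPartition runs on the executors, so per-partition resources (DB connections, HTTP clients) must be created inside it. A sketch with a hypothetical createConnection():

dstream.foreachRDD { rdd =>
  // Runs on the driver once per batch.
  rdd.foreachPartition { records =>
    // Runs on an executor once per partition: create the connection here,
    // not outside, or it would have to be serialized from the driver.
    val conn = createConnection() // hypothetical
    records.foreach(record => conn.send(record))
    conn.close()
  }
}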
Thank you!
