How Adobe uses Spark Structured Streaming at Scale
Yeshwanth Vijayakumar
Sr. Engineering Manager/Architect @ Adobe
Agenda
§ Know thy Lag
§ Reading Data In
§ MicroBatching Best Practices
§ Spark Speculation and its Effects
§ Calculating Streaming Statistics
Lay of the Land
Unified Profile Data Ingestion
(Architecture diagram: Adobe Campaign, AEM, Adobe Analytics, and Adobe AdCloud send Experience Data Model records into the Unified Profile; a Change Feed drives Streaming Stats Generation.)
Structured Streaming - Know thy Lag
What/How to measure?
Having a streaming-based ingestion mechanism makes consumer lag that much harder to track.
Is that enough?
Reference: https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37
Use Burrow to keep track of the lag
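Burrow tracks lag from the broker side. As a complement, a rough in-job lag number can be derived from the query's last progress event; this is only a sketch, assuming the kafka-python client, an illustrative topic name, and the Kafka-source layout of the progress fields:

```python
# Sketch: approximate consumer lag for a running StreamingQuery by comparing
# the end offsets Spark reported in its last progress event with the latest
# offsets on the brokers. Assumes the kafka-python client is installed;
# the topic name and bootstrap servers are placeholders.
from kafka import KafkaConsumer, TopicPartition

def approximate_lag(query, topic="events", bootstrap_servers="localhost:9092"):
    progress = query.lastProgress
    if not progress or not progress["sources"]:
        return None
    # For the Kafka source, endOffset is a {topic: {partition: offset}} mapping.
    processed = progress["sources"][0]["endOffset"][topic]

    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    tps = [TopicPartition(topic, int(p)) for p in processed]
    latest = consumer.end_offsets(tps)          # broker-side latest offsets
    consumer.close()

    return sum(latest[tp] - processed[str(tp.partition)] for tp in tps)
```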
Structured Streaming: Optimizing the Ingestion
Generic Flow
(Diagram: a Kafka topic with Partition 1 and Partition 2 is read by Executors 1-3, each running the business logic.)
Read In
What can we optimize way upstream?
▪ maxOffsetsPerTrigger
  ▪ Determine what QPS you want to hit
  ▪ Observe your QPS
▪ minPartitions
  ▪ Enables a fan-out processing pattern
  ▪ Maps 1 Kafka partition to multiple sub-partitions
▪ Executor Resources
  ▪ Keep these constant
▪ Rinse and repeat until you know the throughput per core you can sustain
▪ Make sure processingTime <= TriggerInterval
  ▪ If it's <<< Trigger Interval, you have headroom to grow in QPS
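A minimal sketch of these knobs wired together on the Kafka source; the topic, rate cap, partition count, and trigger interval are illustrative values to tune against your observed QPS:

```python
# Sketch of the knobs above on the Kafka source. The topic, rates, and
# trigger interval are illustrative; tune them against your observed QPS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "profile-events")
    .option("maxOffsetsPerTrigger", 500000)   # cap records per micro-batch, controls QPS
    .option("minPartitions", 96)              # fan out: more Spark partitions than Kafka partitions
    .load()
)

query = (
    events.writeStream
    .foreachBatch(lambda df, epoch_id: df.count())   # stand-in for the business logic
    .option("checkpointLocation", "/tmp/checkpoints/ingest")
    .trigger(processingTime="60 seconds")     # keep batch processing time <= this interval
    .start()
)
```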
Flow with minPartitions > number of Kafka partitions
(Diagram: Partition 1 fans out into sub-partitions 1.1, 1.2, 1.3 and Partition 2 into 2.1, 2.2, 2.3, spread across Executors 1-3.)
MicroBatch Logic Best Practices

map() + foreach()
Pros
§ Easy to code
Cons
§ Slow!
§ No local aggregation; you have to specify an explicit combiner
§ Too many individual tasks
§ Hard to get connection management right

mapPartition() + forEachBatch() (Hard!)
Pros
§ Explicit connection management
  ▪ Allows for good batching and re-use
§ Local aggregations using HashMaps at partition level
Cons
§ Needs more upfront memory
  ▪ OOM till tuning is done
§ Uglier to visualize
§ Might need some extra CPU per task
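A minimal sketch of the mapPartition() + foreachBatch() pattern described above: one connection per partition, local aggregation in a HashMap-style dict, and one batched write per partition. make_sink_client, write_counts, and the column names are placeholders, not the actual Adobe sink:

```python
# Sketch of the mapPartitions() + foreachBatch() pattern: open one connection
# per partition, aggregate locally in a dict, then write once per partition.
# `make_sink_client`, `write_counts`, and the column names are placeholders.
def process_batch(batch_df, epoch_id):
    def process_partition(rows):
        client = make_sink_client()          # one connection per partition, reused
        counts = {}
        for row in rows:
            key = row["eventType"]
            counts[key] = counts.get(key, 0) + 1   # local, combiner-style aggregation
        client.write_counts(counts)          # one batched write instead of one per record
        client.close()
        return iter([len(counts)])           # small summary per partition

    batch_df.rdd.mapPartitions(process_partition).collect()

# Reusing the `events` stream from the earlier sketch.
query = (
    events.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/tmp/checkpoints/micro-batch")
    .start()
)
```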
An Example
From SAIS2020 talk: Every Day Probabilistic Data Structures For Humans
Speculate Away!
What can we optimize way upstream?

SparkConf | Value | Description
spark.speculation | true | If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
spark.speculation.multiplier | 5 | How many times slower a task is than the median to be considered for speculation.
spark.speculation.quantile | 0.9 | Fraction of tasks which must be complete before speculation is enabled for a particular stage.
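The same three settings applied on the Spark session, as a sketch with the values from the table above:

```python
# The speculation settings from the table, set at session build time.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ingest")
    .config("spark.speculation", "true")
    .config("spark.speculation.multiplier", "5")
    .config("spark.speculation.quantile", "0.9")
    .getOrCreate()
)
```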
Calculating Streaming Statistics

Pitfalls during Aggregates
Let's take the example of consuming a retail company's event firehose.

Different Scenarios

To cater to simple scenarios, aggregate per event type (lower cardinality 😎).
Results Table in StateStore:
Key | Value
<8pm-9pm> purchase | 500
<8pm-9pm> addToCart | 5000
…
<9pm-10pm> addToCart | 70

Aggregating per product instead gives very high cardinality! ☠
Results Table in StateStore:
Key | Value
<8pm-9pm> product1 | 20
<8pm-9pm> product2 | 30
…
<9pm-10pm> product100002 | 2
…
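As a sketch, the two scenarios correspond to windowed aggregations like the following; the column names (eventTime, eventType, productId) are assumptions about the firehose schema:

```python
# Sketch of the two aggregations above as windowed groupBy queries on the
# `events` stream. Column names are assumptions about the firehose schema.
from pyspark.sql import functions as F

# Lower cardinality: counts per event type per hour.
by_event_type = (
    events.withWatermark("eventTime", "10 minutes")
    .groupBy(F.window("eventTime", "1 hour"), "eventType")
    .count()
)

# Very high cardinality: counts per product per hour, i.e. a much larger state store.
by_product = (
    events.withWatermark("eventTime", "10 minutes")
    .groupBy(F.window("eventTime", "1 hour"), "productId")
    .count()
)
```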
StateStore Issues
• By default, state is stored in memory, i.e. managed by the JVM
• A large number of keys => GC pauses
• GC pauses => higher latencies and increased lag
• Switch to an off-heap state store
  • Can manage way more keys in the StateStore safely
  • Implement your own persistent off-heap state store by extending StateStoreProvider; one example is Redis
• Also try to keep shorter windows
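As a hedged sketch of the off-heap option: Spark 3.2+ ships a RocksDB-backed provider, and a custom provider (for example a Redis-backed one) that extends StateStoreProvider plugs in through the same config; the custom class name below is hypothetical:

```python
# Moving state off the JVM heap. Spark 3.2+ ships a RocksDB-backed provider;
# set this before starting the streaming query.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)
# A custom provider that extends StateStoreProvider (e.g. Redis-backed) is
# configured the same way; this class name is hypothetical:
# spark.conf.set(
#     "spark.sql.streaming.stateStore.providerClass",
#     "com.example.streaming.RedisStateStoreProvider",
# )
```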
Skew is Real!
▪ Default partitioning of the dataframe might not be ideal
▪ Some partitions can end up with too much data
▪ Processing those can cause OOM/connection failures
▪ Repartition is your friend
▪ It still might not be enough; add some salt!
The 9997/10000 tasks that succeed don't matter; the 3/10000 that fail are what matters.
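A minimal salting sketch; recordId is an assumed key column, and the salt range of 16 and partition count of 400 are arbitrary illustrative values:

```python
# Sketch: repartition, and if a single key is still too hot, add a random
# salt column so its rows spread across several partitions.
# "recordId" is an assumed column name; 16 and 400 are arbitrary.
from pyspark.sql import functions as F

balanced = (
    df.withColumn("salt", (F.rand() * 16).cast("int"))
    .repartition(400, "recordId", "salt")   # 400 = targetPartitionCount (next slide)
)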
How to get the magic targetPartitionCount?
▪ When reading/writing Parquet on HDFS, many recommendations mimic the HDFS block size (default: 128MB)
▪ Sample a small portion of your large DF
  ▪ df.head might suffice too, with a large enough sample
▪ Estimate the size of each row and extrapolate
Sample here and sample there!
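A rough sketch of that estimation on a static (or per-micro-batch) DataFrame; the 0.1% sample fraction and the JSON-length size proxy are heuristics, not exact sizing:

```python
# Sketch: sample a sliver of the DataFrame, estimate bytes per row from the
# sample, extrapolate to the full row count, and target ~128 MB partitions.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024

sample = df.sample(fraction=0.001).toJSON().collect()
avg_row_bytes = sum(len(r) for r in sample) / max(len(sample), 1)

total_rows = df.count()
estimated_bytes = total_rows * avg_row_bytes
target_partition_count = max(int(estimated_bytes // TARGET_PARTITION_BYTES), 1)

repartitioned = df.repartition(target_partition_count)
```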
Digging into Redis Pipelining + Spark
From https://redis.io/topics/pipelining
(Diagram: one request/response round trip per command without pipelining vs. batched commands and replies with pipelining.)
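A sketch of the pipelining pattern from Spark, using the redis-py client inside foreachPartition; the host, key layout, column names, and flush size of 1000 are placeholders:

```python
# Sketch: redis-py pipelining from Spark. Each partition opens one client,
# buffers commands in a pipeline, and flushes every 1000 commands, so the
# per-record network round trip disappears. Host/key/columns are placeholders.
import redis

def write_partition(rows):
    client = redis.Redis(host="redis.internal", port=6379)
    pipe = client.pipeline(transaction=False)
    pending = 0
    for row in rows:
        pipe.hincrby(f"stats:{row['window']}", row["eventType"], 1)
        pending += 1
        if pending >= 1000:
            pipe.execute()
            pending = 0
    if pending:
        pipe.execute()

batch_df.foreachPartition(write_partition)   # e.g. inside a foreachBatch function
```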
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.