How Adobe Does 2 Million
Records Per Second using
Apache Spark!
Yeshwanth Vijayakumar
Project Lead/ Architect – Adobe Experience Platform
Goals
Share tips/experiences from our use cases
Hopefully saves at least an hour of your time! :)
What do you mean by Processing? Agenda!
Ingestion
▪ Structured Streaming - Know thy Lag
▪ Microbatch the right way
Evaluation
▪ The Art of How I learned to cache my physical Plans
▪ Know Thy Join
▪ Skew Phew! And Sample Sample Sample
Redis – The Ultimate Swiss Army Knife!
Ingestion Scenario
Unified Profile Data Ingestion
Unified Profile
Experience Data Model
Adobe Campaign
AEM
Adobe Analytics
Adobe AdCloud
Structured Streaming - Know thy Lag
What/How to measure?
Having a streaming-based ingestion mechanism makes it that much harder to track.
Is that enough?
Reference: https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37
Use Burrow to keep track of the Lag
Structured Streaming
Optimizing The Ingestion
Generic Flow
Partition 1
Partition 2
Kafka Topic
Executor 1
Executor 2
Executor 3
Business
Logic
Business
Logic
Business
Logic
Read In
What can we optimize way upstream?
maxOffsetsPerTrigger
Determine what QPS you want to hit
Observe your QPS
minPartitions
▪ Enables a fan-out processing pattern
▪ Maps 1 Kafka partition to multiple sub-partitions
Executor Resources
Keep this constant
Rinse and repeat till you have your throughput per core
Make sure processingTime <= TriggerInterval
If it's << trigger interval, you have headroom to grow in QPS
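As a sketch, the two knobs above map onto options of Spark's Kafka source. This assumes a live SparkSession `spark` and a reachable broker; "broker:9092", "ingest-topic", and the numbers are placeholders to tune:

```python
# Hedged sketch — not runnable without Spark + Kafka; names are placeholders.
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "ingest-topic")
    .option("maxOffsetsPerTrigger", "500000")  # cap records per micro-batch to control QPS
    .option("minPartitions", "60")             # fan out beyond the Kafka partition count
    .load())
```

With executor resources held constant, raise maxOffsetsPerTrigger until processingTime approaches the trigger interval.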
Flow with MinPartitions > partitions on Kafka
Partition 1
Partition 2
Kafka Topic
Executor 1
Executor 2
Executor 3
Partition 1.1 Partition 1.2
Partition 1.3 Partition 2.1
Partition 2.2 Partition 2.3
MicroBatch Hard! Logic Best Practices
map() + foreach()
Pros
▪ Easy to code
Cons
▪ Slow!
▪ No local aggregation; you must specify an explicit combiner
▪ Too many individual tasks
▪ Hard to get connection management right
mapPartitions() + foreachBatch()
Pros
▪ Explicit connection management
▪ Allows for good batching and re-use
▪ Local aggregations using HashMaps at the partition level
Cons
▪ Needs more upfront memory
▪ OOM till tuning is done
▪ Uglier to visualize
▪ Might need some extra CPU per task
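A minimal, Spark-free sketch of the mapPartitions() pattern above: one connection and one local HashMap per partition, flushed as a single batch. `FakeConnection` and `process_partition` are hypothetical names standing in for a real sink client and your partition function.

```python
class FakeConnection:
    """Stand-in for a real sink client (e.g. a DB or Redis connection)."""
    def __init__(self):
        self.writes = []

    def write_batch(self, batch):
        self.writes.append(dict(batch))

    def close(self):
        pass


def process_partition(rows, conn_factory):
    conn = conn_factory()  # one connection per partition, not per record
    local_agg = {}         # partition-local aggregation: the explicit combiner
    try:
        for key, value in rows:
            local_agg[key] = local_agg.get(key, 0) + value
        conn.write_batch(local_agg)  # one batched write instead of N tiny ones
    finally:
        conn.close()       # tear down connections diligently
    return local_agg
```

In real Spark this body would run inside `df.rdd.mapPartitions(...)`, amortizing connection setup and writes across the whole partition.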
An Example
From another SAIS2020 talk: Every Day Probabilistic Data Structures For Humans
Speculate Away!
What can we optimize way upstream?
SparkConf Value Description
spark.speculation true
If set to "true", performs speculative
execution of tasks. This means if one or
more tasks are running slowly in a stage,
they will be re-launched.
spark.speculation.multiplier 5
How many times slower a task is than the
median to be considered for speculation.
spark.speculation.quantile 0.9
Fraction of tasks which must be complete
before speculation is enabled for a
particular stage.
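As a sketch, the table above translates to builder config (assuming pyspark is available):

```python
from pyspark.sql import SparkSession

# Hedged sketch: the three speculation settings from the table above.
spark = (SparkSession.builder
    .config("spark.speculation", "true")          # re-launch slow tasks
    .config("spark.speculation.multiplier", "5")  # "slow" = 5x the median task time
    .config("spark.speculation.quantile", "0.9")  # wait for 90% of tasks to finish first
    .getOrCreate())
```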
Evaluation Scenario
What are we processing?
Run as many queries as possible in parallel on top of a denormalized dataframe
Query 1
Query 2
Query 3
Query 1000
ProfileIds field1 field1000 eventsArray
a@a.com a x [e1,2,3]
b@g.com b x [e1]
d@d.com d y [e1,2,3]
z@z.com z y [e1,2,3,5,7]
Interactive Processing!
The Art of How I learned to
Cache My Physical Plans
For Repeated Queries Over Same DF
Prepared Statements in RDBMS
▪ Avoids repeated query planning by taking in a template
▪ Compile (parse->optimize/translate to plan) ahead of time
Similarly we obtain the internal execution plan for a DF query
Taking inspiration from RDBMS Land
df.cache() gets us this ^
Main Overhead
Dataframe has 1000’s of nested columns
Printing the query plan caused an overflow while logging in debug mode
Time for query planning = 2-3 seconds or more
Significant impact when submitting interactive queries with total runtime < 10s
Ref: https://stackoverflow.com/questions/49583401/how-to-avoid-query-preparation-parsing-planning-and-optimizations-every-time
https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/util/StateCache.scala#L4
Cached
ahead of time
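In PySpark terms, a minimal version of this idea keeps already-planned, cached DataFrames in a dict keyed by query template, in the spirit of Delta Lake's StateCache (which caches the planned RDD so repeated queries skip analysis/optimization). `plan_cache` and `planned` are hypothetical names, and this assumes a running SparkSession:

```python
# Hedged sketch — plan once, reuse many times.
plan_cache = {}

def planned(key, build_df):
    if key not in plan_cache:
        df = build_df()
        df.cache()          # pin the data *and* the resolved plan
        df.count()          # force planning + materialization eagerly
        plan_cache[key] = df
    return plan_cache[key]  # later queries reuse the cached plan, no re-planning
```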
Know thy Join
Join Optimization For Interactive Queries
(Opinionated)
Avoid them by de-normalizing if possible!
Broadcast the join table if it's small enough!
▪ Can simulate a hash join
If too big to broadcast, see if the join info can be replicated into Redis-like KV stores
▪ You still get the characteristics of a hash join
Once you get into really large data, shuffles will hate you and vice versa!
Sort-merge is your friend, until it isn't
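A hedged sketch of the broadcast case (assumes pyspark; `profiles_df`, `dim_df`, and `profileId` are placeholder names):

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor: a hash join with no shuffle of profiles_df.
joined = profiles_df.join(broadcast(dim_df), "profileId")
```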
Skew! Phew!
Skew is Real!
Default Partitioning of the dataframe might not be ideal
▪ Some partitions can have too much data
▪ Processing those can cause OOM/connection failures
Repartition is your friend
Might still not be enough; add some salt!
The 9997/10000 tasks that succeed don't matter; the 3/10000 that fail are what matters
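One way to "add some salt": spread a hot key across a fixed number of sub-keys, aggregate per sub-key, then strip the salt and combine. `NUM_SALTS` is an assumed tuning knob, not a value from the talk:

```python
import random

NUM_SALTS = 8  # assumption: tune to the observed skew

def salted_key(key, num_salts=NUM_SALTS):
    """Append a random suffix so one hot key lands in num_salts partitions."""
    return f"{key}#{random.randrange(num_salts)}"

def unsalt(key):
    """Strip the salt before the final combine step."""
    return key.rsplit("#", 1)[0]
```

After aggregating on the salted key, a second, much smaller aggregation on `unsalt(key)` produces the final result.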
How to get the magic targetPartitionCount?
When reading/writing Parquet on HDFS, a common recommendation is to
mimic the HDFS block size (default: 128MB)
Sample a small portion of your large DF
▪ df.head() might suffice too with a large enough sample
Estimate size of each row and extrapolate
Sample here and sample there!
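The sample-and-extrapolate step reduces to a few lines of arithmetic; the 128MB target mirrors the HDFS block size mentioned above, and the helper name is an assumption:

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # mimic the HDFS block size default

def estimate_partition_count(sample_bytes, sample_rows, total_rows):
    """Extrapolate total DF size from a sample, targeting ~128MB per partition."""
    avg_row_bytes = sample_bytes / sample_rows
    estimated_total_bytes = avg_row_bytes * total_rows
    return max(1, math.ceil(estimated_total_bytes / TARGET_PARTITION_BYTES))
```

For example, a 10,000-row sample averaging 1KB/row over a 13-million-row dataframe suggests about 100 partitions.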
All put together!
Dataframe Size: 13 Million entries
Schema: 800 nested fields
Before vs. After (run-time screenshots)
Redis – The Ultimate Swiss Army Knife!
Using Redis With Spark Uncommonly
Maintain Bloom Filters/HLL on Redis
Interactive Counting while processing results using mapPartitions()
Accumulator Replacement
Event Queue to Convert any normal batch Spark to Interactive Spark
Best Practices
Use Pipelining + Batching!
Tear down connections diligently
Turn Off Speculative Execution
Depends whom you ask
Digging into Redis Pipelining + Spark
From https://redis.io/topics/pipelining
Without Pipelining vs. With Pipelining (diagram)
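A hedged sketch of pipelining + batching from inside a Spark partition using redis-py; the host name and batch size are placeholders, and this assumes a reachable Redis server:

```python
import redis

def write_partition(rows):
    # One client per partition; a pipeline batches commands into few round trips.
    r = redis.Redis(host="redis-host", port=6379)  # placeholder host
    pipe = r.pipeline(transaction=False)
    for i, (key, value) in enumerate(rows, 1):
        pipe.pfadd(key, value)  # e.g. maintain a HyperLogLog per key
        if i % 500 == 0:        # flush in batches to bound client memory
            pipe.execute()
    pipe.execute()
    r.close()

# df.rdd.foreachPartition(write_partition)
```

The same pattern works for Bloom filters or counters; pair it with diligent connection teardown as noted above.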
More Questions?
Feel free to reach out to me at
https://www.linkedin.com/in/yeshwanth-vijayakumar-75599431
yvijayak@adobe.com