How Adobe Does 2 Million Records Per Second Using Apache Spark!


Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.

  1. 1. How Adobe Does 2 Million Records Per Second using Apache Spark! Yeshwanth Vijayakumar, Project Lead / Architect – Adobe Experience Platform
  2. 2. Goals: Share tips/experiences from our use cases. Hopefully saves at least an hour of your time! :)
  3. 3. What do you mean by Processing? Agenda! Ingestion ▪ Structured Streaming – Know thy Lag ▪ Microbatch the right way. Evaluation ▪ The Art of How I Learned to Cache My Physical Plans ▪ Know Thy Join ▪ Skew! Phew! And sample, sample, sample. Redis – The Ultimate Swiss Army Knife!
  4. 4. Ingestion Scenario
  5. 5. Unified Profile Data Ingestion: Adobe Campaign, AEM, Adobe Analytics, and Adobe AdCloud feed Experience Data Model data into the Unified Profile.
  6. 6. Structured Streaming - Know thy Lag
  7. 7. What/How to measure? Having a streaming-based ingestion mechanism makes lag that much harder to track.
  8. 8. Is that enough? Reference: https://medium.com/@ronbarabash/how-to-measure-consumer-lag-in-spark-structured-streaming-6c3645e45a37 (a progress-listener sketch follows this slide)
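
To make the lag question concrete, here is a minimal sketch (not from the deck) of a StreamingQueryListener that surfaces per-batch progress; the log format is an assumption, and `spark` is the already-active SparkSession. Spark only reports how far it has read, which is why the next slide still reaches for Burrow to compare against the brokers' latest offsets.

```scala
// Minimal sketch: log per-batch progress so it can be compared against
// broker-side offsets (e.g. from Burrow). `spark` is the active SparkSession.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    p.sources.foreach { src =>
      // endOffset is how far Spark has read; true consumer lag still needs the
      // latest offsets on the Kafka brokers, which this listener cannot see.
      println(s"batch=${p.batchId} rows=${p.numInputRows} " +
              s"rowsPerSec=${p.processedRowsPerSecond} endOffset=${src.endOffset}")
    }
  }
})
```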
  9. 9. Use Burrow to keep track of the Lag
  10. 10. Structured Streaming – Optimizing the Ingestion
  11. 11. Generic Flow: a Kafka topic with Partition 1 and Partition 2 fans out to Executor 1, Executor 2, and Executor 3, each running the business logic.
  12. 12. Read In – What can we optimize way upstream? maxOffsetsPerTrigger ▪ Determine what QPS you want to hit ▪ Observe your QPS. minPartitions ▪ Enables a fan-out processing pattern ▪ Maps 1 Kafka partition to multiple sub-partitions across executor resources ▪ Keep this constant. Rinse and repeat till you have your throughput per core. Make sure processingTime <= triggerInterval; if it's <<< triggerInterval, you have headroom to grow in QPS. (See the sketch after this slide.)
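
A minimal sketch of how those two knobs might be wired up on a Kafka source; the broker address, topic, trigger interval, numeric values, and the console sink are illustrative assumptions, not Adobe's settings.

```scala
// Sketch only: throttle each micro-batch and fan partitions out across cores.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("ingest-sketch").getOrCreate()

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")   // assumed broker
  .option("subscribe", "profile-events")                 // assumed topic
  .option("maxOffsetsPerTrigger", "2000000") // cap records per trigger -> steady QPS
  .option("minPartitions", "120")            // fan one Kafka partition out to many tasks
  .load()

kafkaStream
  .selectExpr("CAST(value AS STRING) AS payload")        // business logic goes here
  .writeStream
  .format("console")                                     // stand-in for the real sink
  .option("checkpointLocation", "/tmp/ingest-sketch-checkpoint")
  .trigger(Trigger.ProcessingTime("60 seconds"))         // keep processingTime <= this
  .start()
```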
  13. 13. Flow with minPartitions > partitions on Kafka: Partition 1 and Partition 2 on the Kafka topic are split into sub-partitions 1.1–1.3 and 2.1–2.3, spread across Executor 1, Executor 2, and Executor 3.
  14. 14. MicroBatch Logic Best Practices – it's hard! map() + foreach(): Pros – easy to code. Cons – slow; no local aggregation, so you must specify an explicit combiner; too many individual tasks; hard to get connection management right. mapPartitions() + foreachBatch(): Pros – explicit connection management ▪ allows for good batching and re-use ▪ local aggregations using HashMaps at the partition level. Cons – needs more upfront memory ▪ OOM till tuning is done ▪ uglier to visualize ▪ might need some extra CPU per task. (A sketch of the mapPartitions pattern follows this slide.)
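
A rough sketch of the right-hand pattern: mapPartitions() inside a foreachBatch() handler, with one connection per partition and a partition-local HashMap doing the local aggregation. The SinkClient stub and the profileId column are hypothetical stand-ins for whatever downstream store and schema you actually have.

```scala
// Register with: stream.writeStream.foreachBatch(processMicroBatch _).start()
import org.apache.spark.sql.DataFrame
import scala.collection.mutable

// Hypothetical sink client standing in for a real connection/batch-write API.
object SinkClient {
  final class Conn {
    def writeBatch(m: Map[String, Long]): Unit = () // real impl would batch-write here
    def close(): Unit = ()
  }
  def connect(): Conn = new Conn
}

def processMicroBatch(batch: DataFrame, batchId: Long): Unit = {
  val keysWritten = batch.rdd.mapPartitions { rows =>
    val conn = SinkClient.connect()                 // one connection per partition
    val localCounts = mutable.HashMap.empty[String, Long]
    rows.foreach { row =>                           // local aggregation, no shuffle
      val key = row.getAs[String]("profileId")      // assumed column name
      localCounts(key) = localCounts.getOrElse(key, 0L) + 1L
    }
    conn.writeBatch(localCounts.toMap)              // batched write, not per-record
    conn.close()                                    // tear down diligently
    Iterator.single(localCounts.size.toLong)
  }.sum()                                           // force evaluation of all partitions
  println(s"batch $batchId wrote $keysWritten aggregated keys")
}
```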
  15. 15. An Example From another SAIS2020 talk: Every Day Probabilistic Data Structures For Humans
  16. 16. Speculate Away! What can we optimize way upstream? spark.speculation = true – if set to "true", performs speculative execution of tasks: if one or more tasks are running slowly in a stage, they will be re-launched. spark.speculation.multiplier = 5 – how many times slower a task must be than the median to be considered for speculation. spark.speculation.quantile = 0.9 – fraction of tasks which must be complete before speculation is enabled for a particular stage. (Sketched below.)
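
Applied at session build time, the settings above might look like this minimal sketch (the values are the ones shown on the slide; the app name is an assumption).

```scala
// Speculation settings from the slide, applied when building the session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-sketch")
  .config("spark.speculation", "true")          // re-launch straggler tasks
  .config("spark.speculation.multiplier", "5")  // 5x slower than the median => speculate
  .config("spark.speculation.quantile", "0.9")  // wait until 90% of tasks finish first
  .getOrCreate()
```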
  17. 17. Evaluation Scenario
  18. 18. What are we processing? Run as many queries as possible in parallel on top of a denormalized dataframe (Query 1, Query 2, Query 3 … Query 1000). Each row carries ProfileIds, field1 … field1000, and an eventsArray, e.g. a@a.com | a | x | [e1, e2, e3]. Interactive Processing! (A parallel-query sketch follows this slide.)
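
One way the many-queries-over-one-DF pattern could be driven: cache the denormalized DataFrame once and submit independent queries from a thread pool. `profilesDf`, the predicates, and the pool size are assumptions for illustration, not the talk's actual code.

```scala
// Sketch: fan many independent queries out over one cached, denormalized DF.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val queryPool: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))

profilesDf.cache().count()             // materialize once, reuse for every query

val predicates = Seq(                  // illustrative stand-ins for Query 1..1000
  "field1 = 'a'",
  "field1000 = 'y' AND size(eventsArray) > 2"
)

val futureCounts = predicates.map { p =>
  Future(profilesDf.filter(p).count()) // each query runs as its own Spark job
}
val counts = Await.result(Future.sequence(futureCounts), Duration.Inf)
```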
  19. 19. The Art of How I learned to Cache My Physical Plans
  20. 20. For Repeated Queries Over the Same DF – taking inspiration from RDBMS land: prepared statements in an RDBMS ▪ avoid repeated query planning by taking in a template ▪ compile (parse -> optimize / translate to a plan) ahead of time. Similarly, we obtain the internal execution plan for a DF query up front, rather than relying on df.cache() alone.
  21. 21. Main Overhead: the dataframe has 1000s of nested columns. Printing the query plan caused an overflow while printing to logs in debug mode. Time for query planning = 2-3 seconds or more, a significant impact when submitting interactive queries whose total runtime is < 10 s. (A plan-caching sketch follows this slide.) Ref: https://stackoverflow.com/questions/49583401/how-to-avoid-query-preparation-parsing-planning-and-optimizations-every-time https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/util/StateCache.scala#L4
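
A rough sketch of the plan-caching idea behind the StateCache link above: materialize the already-planned output as an RDD of InternalRow and wrap it back into a DataFrame, so later queries skip re-planning the 1000-column schema. It leans on Spark-internal, package-private APIs (Dataset.ofRows, LogicalRDD, StructType.toAttributes on Spark 3.0-era builds), so it has to live under the org.apache.spark.sql package exactly as Delta's utility does; treat it as illustrative, not as the exact code from the talk.

```scala
// Illustrative only: relies on Spark internals, mirroring delta's StateCache.
// Must be compiled inside a subpackage of org.apache.spark.sql to see them.
package org.apache.spark.sql.plancache

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.storage.StorageLevel

object PlanCache {
  def cachePlan(spark: SparkSession, df: DataFrame): DataFrame = {
    // Run analysis/planning once and pin the resulting InternalRow RDD.
    val cachedRdd = df.queryExecution.toRdd.persist(StorageLevel.MEMORY_AND_DISK)
    // Re-wrap the cached RDD in a trivial plan: queries against the returned
    // DataFrame no longer pay the multi-second planning cost of the original.
    Dataset.ofRows(spark, LogicalRDD(df.schema.toAttributes, cachedRdd)(spark))
  }
}
```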
  22. 22. Cached ahead of time
  23. 23. Know thy Join
  24. 24. Join Optimization For Interactive Queries (Opinionated): Avoid joins by de-normalizing if possible! Broadcast the join table if it's small enough ▪ can simulate a hash join. If it's too big to broadcast, see if the join info can be replicated into a Redis-like KV store ▪ you still get the characteristics of a hash join. Once you get into really large data, shuffles will hate you and vice versa! Sort-merge is your friend, until it isn't. (A broadcast-join sketch follows this slide.)
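
For the "small enough to broadcast" case, a minimal sketch; `profilesDf`, `dimDf`, and the join key are hypothetical names.

```scala
// Sketch: force a broadcast (hash-style) join when the right side is small enough.
import org.apache.spark.sql.functions.broadcast

val joined = profilesDf.join(
  broadcast(dimDf),        // ship the small table to every executor: no shuffle
  Seq("profileId"),        // assumed join key
  "left"
)
```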
  25. 25. Skew! Phew!
  26. 26. Skew is Real! Default partitioning of the dataframe might not be ideal ▪ some partitions can have too much data ▪ processing those can cause OOM/connection failures. Repartition is your friend. That still might not be enough, so add some salt! The 9997/10000 tasks that succeed don't matter; the 3/10000 that fail are what matter. (A salting sketch follows this slide.)
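
One common shape of the salting trick, with illustrative column names and salt factor: spread the hot key across N buckets, aggregate per (key, salt), then combine.

```scala
// Illustrative salting: key column, salt factor, and aggregation are made up.
import org.apache.spark.sql.functions.{col, floor, rand}

val saltBuckets = 32                              // tune to the severity of the skew

val salted = profilesDf
  .withColumn("salt", floor(rand() * saltBuckets))
  .repartition(col("profileId"), col("salt"))     // hot keys now span many partitions

// Two-stage aggregation: partial per (key, salt), then a final combine per key.
val partial = salted.groupBy("profileId", "salt").count()
val result  = partial.groupBy("profileId").sum("count")
```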
  27. 27. How to get the magic targetPartitionCount? When reading/writing Parquet on HDFS, many recommendations say to mimic the HDFS block size (default: 128 MB). Sample a small portion of your large DF ▪ df.head might suffice too with a large enough sample. Estimate the size of each row and extrapolate. Sample here and sample there! (A sizing sketch follows this slide.)
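
A back-of-the-envelope sketch of that estimate: sample, approximate bytes per row, and derive a partition count near the 128 MB target. The sampling fraction and the string-length proxy for row size are assumptions.

```scala
// Rough sizing sketch: fraction, row-size proxy, and block-size target are assumptions.
import java.nio.charset.StandardCharsets

val targetBytesPerPartition = 128L * 1024 * 1024          // ~ HDFS block size

val sampleRows = profilesDf.sample(withReplacement = false, fraction = 0.001).collect()
val avgRowBytes = sampleRows
  .map(_.toString.getBytes(StandardCharsets.UTF_8).length.toLong)
  .sum / math.max(sampleRows.length, 1)                   // crude per-row estimate

val totalRows = profilesDf.count()
val targetPartitionCount =
  math.max(1L, (totalRows * avgRowBytes) / targetBytesPerPartition).toInt

val resized = profilesDf.repartition(targetPartitionCount)
```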
  28. 28. All put together! Dataframe size: 13 million entries. Schema: 800 nested fields. (Before/after comparison shown on the slide.)
  29. 29. Redis – The Ultimate Swiss Army Knife!
  30. 30. Using Redis With Spark, Uncommonly: maintain Bloom filters/HLLs on Redis ▪ interactive counting while processing results using mapPartitions() ▪ accumulator replacement ▪ event queue to convert any normal batch Spark job into interactive Spark. Best Practices: use pipelining + batching! ▪ tear down connections diligently ▪ turn off speculative execution (depends whom you ask).
  31. 31. Digging into Redis Pipelining + Spark. From https://redis.io/topics/pipelining: the slide contrasts round-trips without pipelining vs. with pipelining. (A pipelining sketch follows this slide.)
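
A sketch of pipelined, batched writes from mapPartitions() using the Jedis client; the host, hash key, column name, and flush size are assumptions.

```scala
// Pipelining sketch: host, key names, column, and batch size are illustrative.
import redis.clients.jedis.Jedis

val totalWritten = profilesDf.rdd.mapPartitions { rows =>
  val jedis = new Jedis("redis-host", 6379)        // one connection per partition
  val pipeline = jedis.pipelined()
  var inFlight = 0
  var written = 0L
  rows.foreach { row =>
    pipeline.hincrBy("profile:counts", row.getAs[String]("profileId"), 1L)
    inFlight += 1; written += 1
    if (inFlight >= 1000) { pipeline.sync(); inFlight = 0 } // flush in batches
  }
  pipeline.sync()                                   // flush the tail of the batch
  jedis.close()                                     // tear down connections diligently
  Iterator.single(written)
}.sum()
```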
  32. 32. More Questions? Feel free to reach out to me at https://www.linkedin.com/in/yeshwanth-vijayakumar-75599431 or yvijayak@adobe.com
