Memory Optimization and Reliable Metrics in ML Pipelines at Netflix

Netflix personalizes the experience for each member, and this is achieved by several machine learning models. Our team builds the infrastructure that powers these machine learning pipelines, primarily using Spark for feature generation and training.

  1. MEMORY OPTIMIZATION AND RELIABLE METRICS IN ML PIPELINES AT NETFLIX
     Vivek Kaushal, Senior Software Engineer
  2. The lockdowns
  3. End the lockdowns
  4. Keep the lockdowns
  5. Work with the lockdowns
  6. Vivek Kaushal
     ▪ Senior Software Engineer at Netflix
     ▪ Previously at: Apple, Sumo Logic, Amazon
     ▪ Loves to: read, meditate
     ▪ vkaushal21
  7. Netflix
     ▪ Video-on-demand streaming service
     ▪ 180 million members
     ▪ Added more than 1,000 original titles in 2019
     ▪ Uses ML models to generate recommendations
  8. Types of data that will be relevant in this talk
     Video data
     ▪ Examples: video metadata, similar videos
     ▪ Size: up to 10 GB
     ▪ Can be loaded in one JVM
     Member data
     ▪ 180 million members
     ▪ Examples: viewing history, My List, thumb ratings
     ▪ Cannot be loaded in one JVM
  9. Netflix Recommendations Architecture
     (diagram: Viewing History, Video Metadata, Similar Videos, and My List feed Online Feature Generation for Compute Recommendations and are logged to the Historical Data Store; the Historical Data Store feeds Offline Feature Generation, which feeds Model Training, which pushes an updated ML model back to Compute Recommendations)
  10. Timeline of a compute recommendations node
      (diagram: over time, members 1–8 arrive with viewing histories H1–H8 while video metadata rolls through versions Mversion 1 → Mversion 2 → Mversion 3; each member's online features are generated against whichever metadata version was live at that moment)
  11. Dataframe for offline feature generation
      To reproduce the online behavior offline, each piece of member data is paired with the metadata version that was live when it was generated:

      Member Data | Video Metadata Version
      H1–H4       | Mversion 1
      H5, H6      | Mversion 2
      H7, H8      | Mversion 3
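The pairing of member data with the live metadata version is essentially an "as-of" lookup against the version rollout timeline. A minimal pure-Python sketch (the rollout timestamps below are made up for illustration; the real pipeline does this at Spark scale):

```python
from bisect import bisect_right

# Hypothetical rollout times for each metadata version (illustrative only).
version_rollouts = [(0, "Mversion 1"), (50, "Mversion 2"), (70, "Mversion 3")]
rollout_times = [t for t, _ in version_rollouts]

def version_at(event_time):
    """Return the metadata version that was live when an event occurred."""
    return version_rollouts[bisect_right(rollout_times, event_time) - 1][1]

# Viewing-history events H1..H8 with made-up timestamps.
history = [("H1", 10), ("H2", 20), ("H3", 30), ("H4", 40),
           ("H5", 55), ("H6", 60), ("H7", 75), ("H8", 80)]
rows = [(h, version_at(t)) for h, t in history]
```

This reproduces the table above: H1–H4 land on Mversion 1, H5 and H6 on Mversion 2, H7 and H8 on Mversion 3.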
  12. Netflix Recommendations Architecture
      (diagram: the same architecture as slide 9, now showing the Member Data / Video Metadata Version table produced by Offline Feature Generation flowing into Model Training)
  13. Problem visualization
      (diagram: a driver and several executors, each executor running tasks on multiple cores; a single partition can contain rows from different metadata versions, e.g. H4 on Mversion 1 and H5 on Mversion 2, so a task may need both M1 and M2 in memory)
      Goal: optimize executor memory consumption
  14. Spark memory options
      (diagram: three ways to place metadata on an executor's cores: one copy per task, one copy per partition, or broadcast all versions (M1, M2, M3) to every task)
  15. Spark
      Limitations
      ▪ Limited memory sharing options
      ▪ No option to select an executor for a particular task
      Data wrangling tools
      ▪ Repartition
      ▪ Materializing intermediate Dataframes
      ▪ Sort within partitions
  16. Approach 1: Use the Dataframe without data wrangling
      (diagram: partitions N1–N3 hold rows from mixed metadata versions, so task 1 needs M3 then M1, task 2 needs M2 then M1, and task 3 needs M1 then M2)
  17. Approach 1: Use the Dataframe without data wrangling
      Pros: parallel execution of all data; no repartitioning
      Cons: memory inefficient; switches versions on the same partition
  18. Partition data by version
      (diagram: repartitioning groups rows by metadata version: partitions R1a and R1b hold the Mversion 1 rows H1–H4, partition R2a holds the Mversion 2 rows H5 and H6, and partition R3a holds the Mversion 3 rows H7 and H8)
  19. Approach 2: Repartition data
      ▪ Execute the repartitioned data in parallel (e.g. task 1 on R1a with M1, task 2 on R1b with M1, task 3 on R3a with M3)
      Pros: parallel execution of multiple versions; one version for each partition
      Cons: memory inefficient; repartition time
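The repartition step is a shuffle keyed on the version column (in Spark, roughly `df.repartition(col("version"))`). A tiny local sketch of the effect, using stable sorting plus grouping in place of a shuffle (row values are illustrative):

```python
from itertools import groupby

# Rows of (member data, metadata version), interleaved as they arrive.
rows = [("H1", "v1"), ("H5", "v2"), ("H2", "v1"), ("H7", "v3"),
        ("H3", "v1"), ("H6", "v2"), ("H4", "v1"), ("H8", "v3")]

# Group rows so each output "partition" holds exactly one metadata version,
# mirroring what a repartition by the version column achieves.
by_version = sorted(rows, key=lambda r: r[1])
partitions = {v: list(g) for v, g in groupby(by_version, key=lambda r: r[1])}
```

Each resulting partition needs only a single metadata version in memory, but the shuffle itself costs time, and running many v1 partitions in parallel still duplicates M1 across tasks.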
  20. Approach 3: Materialize intermediate Dataframes
      ▪ Split execution by version: stage 1 broadcasts M1 and processes partitions R1a and R1b; stage 2 broadcasts M2 and processes R2a; and so on
      Pros: memory efficient; one version for each partition
      Cons: executes one version at a time; repartition time
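The staged execution above can be sketched as a loop: one metadata version is loaded (broadcast) per stage, and only that stage's rows are processed before moving on. Function names and row values here are hypothetical stand-ins:

```python
def load_metadata(version):
    # Stand-in for loading a multi-GB metadata snapshot (the broadcast step).
    return {"version": version}

def run_in_stages(rows):
    """Process rows one metadata version at a time, so only a single
    version's metadata is resident during each stage."""
    loads, features = 0, []
    for version in sorted({v for _, v in rows}):
        metadata = load_metadata(version)          # one load per stage
        loads += 1
        stage_rows = [h for h, v in rows if v == version]
        features += [(h, metadata["version"]) for h in stage_rows]
    return loads, features

rows = [("H1", "v1"), ("H5", "v2"), ("H7", "v3"), ("H2", "v1")]
loads, features = run_in_stages(rows)
```

Memory stays bounded at one version, but the stages serialize work that could otherwise run in parallel across versions.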
  21. Approach 4: Custom memory management
      ▪ Singleton
      ▪ Task context
      ▪ Task listener
      ▪ Semaphore to restrict maximum concurrent use
      Pros: parallel execution of multiple versions; memory efficient; one version for each partition
      Cons: repartition time
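The singleton/semaphore idea can be sketched as a process-wide, reference-counted cache: tasks acquire the metadata version they need, a semaphore caps how many versions are loaded at once, and a task-completion listener releases the reference. This is a simplified pure-Python sketch (the real implementation lives in Spark executors and uses TaskContext listeners); all names and the `max_loaded` bound are illustrative:

```python
import threading

class MetadataCache:
    """Process-wide singleton caching metadata by version, with a semaphore
    bounding how many versions may be resident at once."""
    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, max_loaded=2):
        self._slots = threading.Semaphore(max_loaded)
        self._lock = threading.Lock()
        self._cache = {}  # version -> [metadata, refcount]

    @classmethod
    def get(cls):
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
        return cls._instance

    def acquire(self, version):
        while True:
            with self._lock:
                if version in self._cache:          # already loaded: share it
                    self._cache[version][1] += 1
                    return self._cache[version][0]
            self._slots.acquire()                   # block if too many versions loaded
            with self._lock:
                if version not in self._cache:
                    metadata = {"version": version}  # stand-in for the real load
                    self._cache[version] = [metadata, 1]
                    return metadata
            self._slots.release()                   # another task loaded it; retry

    def release(self, version):
        # In the real pipeline this is called from a task-completion listener.
        with self._lock:
            entry = self._cache[version]
            entry[1] -= 1
            if entry[1] == 0:
                del self._cache[version]
                self._slots.release()
```

Tasks needing the same version share one copy; an unused version is evicted as soon as its last task finishes, freeing a slot for the next version.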
  22. Working approach
  23. Sort within partition
      (diagram: the unsorted partitions N1–N3 from Approach 1 are each sorted locally by metadata version, giving partitions S1–S3; e.g. S1 now orders H1 (Mversion 1) before H7 and H8 (Mversion 3), with no shuffle across partitions)
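Sorting within a partition is a local operation per partition, no shuffle (Spark exposes this as `sortWithinPartitions`). A minimal local analogue, using the partitions from the diagram:

```python
# Unsorted partitions N1 and N2 from the diagram, as (member, version) rows.
partitions = [
    [("H1", "Mversion 1"), ("H7", "Mversion 3"), ("H8", "Mversion 3")],
    [("H6", "Mversion 2"), ("H3", "Mversion 1"), ("H2", "Mversion 1")],
]

# Sort each partition independently by version; rows never move between
# partitions, which is what makes this cheaper than a repartition.
sorted_partitions = [sorted(p, key=lambda row: row[1]) for p in partitions]
```

Each partition now processes its rows in version order, which sets up the amortized version switching on the next slide.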
  24. Manage memory between sorted partitions
      (diagram: over time, tasks on sorted partitions S1–S3 first process their Mversion 1 rows against M1, then move on to M2 and M3, so only a few metadata versions are resident at any moment)
  25. Manage memory between sorted partitions
      Pros: can execute multiple versions in parallel; memory efficient; no repartition time
      Cons: switches versions on the same partition
      ▪ Amortized cost of switching versions
      ▪ Efficiency increases with more cores
      ▪ Semaphore not implemented
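Because each partition is sorted, a task only pays the version-switch cost at the boundaries between version runs, amortizing it over all the rows in each run. A sketch of a single task walking its sorted partition (row values and the loader are illustrative):

```python
def process_sorted_partition(partition, load_metadata):
    """Walk a version-sorted partition, reloading metadata only when the
    version changes; the switch cost is amortized over each run of rows."""
    loads, current_version, metadata, out = 0, None, None, []
    for member, version in partition:
        if version != current_version:       # version boundary: switch metadata
            metadata = load_metadata(version)
            current_version = version
            loads += 1
        out.append((member, metadata["version"]))
    return loads, out

# Sorted partition S2 from the earlier diagram.
partition = [("H3", "Mversion 1"), ("H2", "Mversion 1"), ("H6", "Mversion 2")]
loads, out = process_sorted_partition(partition, lambda v: {"version": v})
```

Here three rows trigger only two metadata loads; an unsorted partition in the same order as N2 would have needed three.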
  26. Reliable Metrics
  27. Shout out: Dhaval Patel
      ▪ Senior Software Engineer at Netflix
      ▪ Previously at: Twitter, NetApp
      ▪ Loves: hiking, camping, socializing
  28. Netflix Recommendations Architecture
      (diagram: the same architecture as slide 9, highlighting the offline path: Historical Data Store → Offline Feature Generation → Model Training → updated ML model)
  29. Problem visualization
      (diagram: Offline Feature Generation queries the Historical Data Store, joins in labels, generates features, and materializes a Dataframe)
      ▪ We accept small amounts of data drops
  30. Then the job fails. Debug why the job failed.
  31. Spark limitations
      ▪ Hard to know what was executed before the failure
      ▪ Lack of a consolidated view for accumulators
      ▪ Accumulators can double count
      ▪ No easy way to record failure samples
  32. Solution
      (diagram: the driver and executors emit accumulators into a consolidated view published as Atlas metrics; partial data failures are sent through Kafka into Hive)
  33. Implementation
      ▪ Spark stage listeners and task listeners
      ▪ Deduplicate logging by storing a hash in the listener context
      ▪ Gotcha: Spark listeners catch Throwable, so errors inside a listener are swallowed silently
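The hash-based deduplication guards against double counting: a speculative or retried task attempt replays the same log lines, so each record's hash is kept (in the listener context, per the slide) and repeats are dropped. A simplified sketch; class and parameter names are hypothetical:

```python
import hashlib

class DedupLogger:
    """Deduplicated logging of partial-data-failure records: a retried task
    replays the same records, so hashes of everything already logged are
    kept and repeats are dropped before they reach the metrics sink."""

    def __init__(self):
        self._seen = set()   # hashes of records already emitted
        self.records = []    # what would be shipped to Kafka/Hive

    def log(self, task_id, payload):
        # Key on task identity + payload, not on the attempt number, so a
        # retry of the same task produces the same hash.
        key = hashlib.sha256(f"{task_id}:{payload}".encode()).hexdigest()
        if key in self._seen:
            return False     # retried/speculative attempt: already logged
        self._seen.add(key)
        self.records.append(payload)
        return True
```

The same idea applies to accumulator-style counters: dropping replayed records is what keeps the consolidated metrics from double counting.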
  34. Advantages
      ▪ Consolidation of logs and accumulators
      ▪ Easy to detect which parts of the code were executed
      ▪ Significantly reduced debugging time
  35. Thank you
  36. Feedback
      Your feedback is important to us. Don't forget to rate and review the sessions.
  37. Questions
      (recap diagrams: the accumulator/Kafka/Hive metrics pipeline and the sorted-partition memory management)
