Performance Optimization of Recommendation Training Pipeline at Netflix - DB Tsai and Hua Jiang

Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.
At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this gives developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can provide significant performance gains.
In this talk, we discuss how our Spark-based machine learning pipeline has been improved over the years. Techniques such as predicate pushdown and wide-transformation minimization have led to significant run-time improvements and resource savings.
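As a minimal illustration of the predicate-pushdown technique mentioned above (the S3 path and the region column are hypothetical), filtering a Parquet-backed DataFrame lets Catalyst push the predicate into the scan rather than reading the full snapshot:

```scala
import org.apache.spark.sql.SparkSession

object PredicatePushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

    // Hypothetical fact-data location: because Parquet carries column statistics,
    // a filter expressed as a column operation can be pushed into the scan.
    val facts = spark.read.parquet("s3://example-bucket/fact_data/")
    val usFacts = facts.filter(facts("region") === "US")

    // The PushedFilters entry in the physical plan confirms the pushdown.
    usFacts.explain(true)
  }
}
```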


  1. 1. Optimization of Recommendation Pipelines using Apache Spark Hua Jiang and DB Tsai Spark Summit SF - June 6, 2017
  2. 2. Netflix Scale ▪ Started streaming videos 10 years ago ▪ > 100M members ▪ > 190 countries ▪ > 1000 device types ▪ A third of peak US downstream traffic
  3. 3. Recommendation System: Ideal State ▪ Turn on Netflix, and the absolute best content for you would automatically start playing
  4. 4. Everything is a Recommendation ▪ Title ranking, row selection & ordering ▪ Recommendations are driven by machine learning algorithms ▪ Over 80% of what members watch comes from our recommendations
  5. 5. • Try an idea offline using historical data to see if it would have made better recommendations • If it would, deploy a live A/B test to see if it performs well in production Running Experiments
  6. 6. Running Experiments ▪ Offline experiment: design experiment → collect label dataset → offline feature generation → model training → compute validation metrics → model testing ▪ Online system: online A/B testing ▪ Design a new experiment to test out different ideas
  7. 7. Feature Generation: Feature Computation [diagram: fact data and label data from the S3 snapshot flow through the feature encoders, which compute the required features and produce labeled features for model training]
  8. 8. Version 1: RDD-Based Feature Generation • RDD: Resilient Distributed Dataset • Our first version was written when only RDD operations were available • Opacity ▪ Data are opaque ▪ Computation is opaque
  9. 9. Version 1: RDD-Based Feature Generation [diagram: the S3 snapshot is read into an RDD of POJOs; feature encoders take maps of data POJOs and label data and produce labeled features for model training]
  10. 10. Version 1: RDD-Based Feature Generation RDD operations are low level: you are responsible for performance optimization. RDD operations work on whole objects, even if only one field is required (see the RDD-versus-DataFrame sketch after the slide list).
  11. 11. Version 2: Using DataFrame • DataFrame: Structured Data Organized into Named Columns • Transparency ▪ Data are structured ▪ Computations are planned based on common patterns
  12. 12. Version 2: Using DataFrame The Spark SQL optimizer, Catalyst, optimizes DataFrame operations
  13. 13. Version 2: Using DataFrame [diagram: the Version 1 flow repeated, still built on an RDD of POJOs]
  14. 14. Version 2: Using DataFrame [diagram: the S3 snapshot is read as structured data in a DataFrame; feature encoders produce structured labeled features for model training]
  15. 15. Version 2: Using DataFrame ~3x run time gain in feature generation ▪ 50 ~ 80 executors ▪ ~3 cores per executor ▪ ~24GB per executor
  16. 16. Version 2: Using DataFrame Let’s take a look at the physical plan of the DataFrame read from the snapshot...
  17. 17. Version 2: Using DataFrame (with RDD[Row]) We take the RDD[Row] from the DataFrame and create a new DataFrame by manipulating the Row objects. [diagram: S3 snapshot → structured data in a DataFrame → deduping logic with Row manipulation]
  18. 18. Version 2: Using DataFrame (with RDD[Row]) Even though the new DataFrame created from the RDD[Row] has columns with the same names, it is a different DataFrame to Spark
  19. 19. Version 2: Using DataFrame (with RDD[Row]) Manipulations on Row objects are completely opaque, blocking the optimizer from moving operations around (see the RDD[Row] round-trip sketch after the slide list).
  20. 20. Version 3: Column Operations To The Rescue Most of the operations are essentially column(s) to column(s)
  21. 21. Version 3: Column Operations To The Rescue Possible replacements for row manipulations: ▪ Spark SQL Functions ▪ User-Defined Functions ▪ Catalyst Expressions Most of the operations are essentially column(s) to column(s)
  22. 22. Version 3: Column Operations To The Rescue Spark SQL Functions (org.apache.spark.sql.functions) ▪ Built-in ▪ Highly efficient ▪ Internal data structure ▪ Code generation ▪ Supports rule-based optimization ▪ A variety of categories ▪ Aggregation ▪ Collection ▪ Math ▪ String
  23. 23. Version 3: Column Operations To The Rescue User-Defined Functions (UDFs) ▪ Scala functions with certain types ▪ Highly flexible ▪ Data encoding/decoding required (a built-in-versus-UDF sketch follows the slide list)
  24. 24. Version 3: Column Operations To The Rescue User-Defined Catalyst Expressions ▪ Flexible ▪ User defines the operations ▪ Efficient ▪ Internal data structure ▪ Code generation possible
  25. 25. Version 3: Column Operations To The Rescue [diagram: the Version 2 flow, with the Row-manipulation step in the feature encoders highlighted]
  26. 26. Version 3: Column Operations To The Rescue [diagram: the same flow with the Row-manipulation step replaced by Catalyst expressions]
  27. 27. Version 3: Column Operations To The Rescue We replaced the row manipulation with a Catalyst expression: case class RemoveDuplications(child: Expression) extends UnaryExpression { ... } [diagram: S3 snapshot → structured data in a DataFrame → Catalyst expression] (a fleshed-out sketch follows the slide list)
  28. 28. Version 3: Column Operations To The Rescue
  29. 29. Version 3: Column Operations To The Rescue Physical Plan with Column Operations
  30. 30. Version 3: Column Operations To The Rescue ~2x run time gain compared to version 2 ▪ 50 ~ 80 executors ▪ ~3 cores per executor ▪ ~24GB per executor
  31. 31. Conclusions ▪ Time Travel in Offline Training ▪ Fact logging + offline feature generation ▪ Optimization ▪ Remove “black boxes” ▪ Prefer high-level DataFrame APIs ▪ Prefer column operations over row manipulations
  32. 32. (We are hiring...) Questions?
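The sketch below illustrates the opacity point from slides 10 and 11: the same per-member aggregation expressed first as Version 1 style RDD operations over whole POJOs, then as Version 2 style DataFrame column operations that Catalyst can plan. The ViewingFact fields and the sample rows are invented for illustration, not the actual Netflix schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical POJO-style case class standing in for the fact data.
case class ViewingFact(memberId: Long, titleId: Long, country: String, watchSeconds: Long)

object OpacitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("opacity-sketch").getOrCreate()
    import spark.implicits._

    val facts = Seq(
      ViewingFact(1L, 100L, "US", 1200L),
      ViewingFact(1L, 101L, "US", 300L),
      ViewingFact(2L, 101L, "CA", 300L)
    )

    // Version 1 style: the RDD carries whole objects into the shuffle, even though
    // only two fields are needed, and Spark cannot see inside the lambdas.
    val rddTotals = spark.sparkContext.parallelize(facts)
      .map(f => (f.memberId, f.watchSeconds))
      .reduceByKey(_ + _)
    rddTotals.collect().foreach(println)

    // Version 2 style: the same aggregation as column operations on a DataFrame;
    // Catalyst knows exactly which columns are touched and can prune the rest.
    val dfTotals = facts.toDF().groupBy("memberId").agg(sum("watchSeconds"))
    dfTotals.explain()
  }
}
```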
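This next sketch illustrates slides 17 through 19: rebuilding a DataFrame from an RDD[Row] gives Catalyst a plan it cannot see through, while expressing the same deduplication as a column-level operation keeps the whole plan open to optimization. The column names and the use of dropDuplicates are assumptions; the slides do not show the actual deduping logic.

```scala
import org.apache.spark.sql.SparkSession

object RowRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-roundtrip-sketch").getOrCreate()
    import spark.implicits._

    val snapshot = Seq((1L, "US"), (1L, "US"), (2L, "CA")).toDF("memberId", "country")

    // Version 2 style: drop to RDD[Row], dedupe in plain Scala, rebuild a DataFrame.
    // The rebuilt DataFrame has the same column names, but to Catalyst it is a new
    // opaque source, so the later filter cannot be pushed below the RDD step.
    val rebuilt = spark.createDataFrame(snapshot.rdd.distinct(), snapshot.schema)
    rebuilt.filter($"country" === "US").explain()

    // Version 3 style: the same logic as column operations, so the filter and the
    // deduplication are planned together.
    snapshot.dropDuplicates().filter($"country" === "US").explain()
  }
}
```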
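A minimal sketch of the trade-off on slides 22 and 23: a built-in Spark SQL function versus a UDF for the same column-to-column operation. The title column and the lower-casing example are assumptions chosen only to keep the comparison short.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lower, udf}

object ColumnFunctionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("column-fn-sketch").getOrCreate()
    import spark.implicits._

    val titles = Seq("Stranger Things", "The Crown").toDF("title")

    // Built-in function: works on Spark's internal row format, benefits from code
    // generation, and stays fully visible to the Catalyst optimizer.
    val builtin = titles.select(lower($"title").as("title_lower"))

    // UDF: flexible, but every value is decoded to a JVM String and re-encoded,
    // and the function body is a black box to the optimizer.
    val lowerUdf = udf((s: String) => s.toLowerCase)
    val viaUdf = titles.select(lowerUdf($"title").as("title_lower"))

    builtin.explain()
    viaUdf.explain()
  }
}
```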
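Slide 27 shows only the signature case class RemoveDuplications(child: Expression) extends UnaryExpression { ... }. Below is a hedged sketch of what such a custom Catalyst expression could look like against the Spark 2.x APIs current at the time of the talk; the array-deduplication body, the CodegenFallback shortcut, and the column helper are assumptions, not the production implementation.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types.{ArrayType, DataType}

// Assumed behavior: deduplicate an array column while working directly on Spark's
// internal ArrayData, so no POJO encoding/decoding is needed. CodegenFallback keeps
// the sketch short; a production version could implement doGenCode as well.
case class RemoveDuplications(child: Expression)
    extends UnaryExpression with CodegenFallback {

  override def dataType: DataType = child.dataType

  override protected def nullSafeEval(input: Any): Any = {
    val elementType = child.dataType.asInstanceOf[ArrayType].elementType
    val arr = input.asInstanceOf[ArrayData]
    new GenericArrayData(arr.toObjectArray(elementType).distinct)
  }
}

object RemoveDuplications {
  // Hypothetical helper so the expression reads like any other column function,
  // e.g. df.select(removeDuplications(col("watchedTitles"))).
  def removeDuplications(column: Column): Column = new Column(RemoveDuplications(column.expr))
}
```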
