
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin Kulka and Michał Kaczmarczyk



Building accurate machine learning models has long been an art practiced by data scientists: algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through these "black arts" have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB of memory and 17TB of SSD into a 3U chassis. We will also share open challenges in training such a massive number of models on Spark, particularly from the reliability and stability standpoints. This talk covers the presentation already given at Spark Summit SF'17 (#SFds5), but from a more technical perspective.



  1. 1. #EUai9 Marcin Kulka and Michał Kaczmarczyk 9LivesData Oct/26/2017 No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
  2. 2. Who are we? • Marcin Kulka – Senior Software Engineer • Michał Kaczmarczyk (Ph.D.) – Software Architect, Team Leader and Project Manager 2
  3. 3. Who are we? • Advanced software R&D company (Warsaw, Poland) • 75+ scientists and software engineers • Specializing in scalable storage, distributed and big data systems • Cooperating with partners all around the world 3
  4. 4. 4
  5. 5. • Masato Asahara (Ph.D.) - Researcher, NEC Data Science Research Laboratory • Ryohei Fujimaki (Ph.D.) - Research Fellow, NEC Data Science Research Laboratory 5
  6. 6. Agenda • Typical use case for predictive modeling problem • Our technology - Automatic Predictive Modeling • Design challenges • Evaluation results • Our observations 6
  7. 7. Motivation 7
  8. 8. Predictive analysis in industry and business 8 Driver risk assessment Inventory Optimization Churn Retention Predictive Maintenance Product price optimization Sales optimization Energy/water operation mgmt
  9. 9. ... but Predictive Modeling • Takes a long time • Requires highly specialized skills 9
  10. 10. Typical predictive modeling use case 10 Training Data Validation Data Test Data Highly accurate prediction results
  11. 11. Typical predictive modeling use case 11 Predictive models Training Data Validation Data Test Data Highly accurate prediction results
  12. 12. Predictive model design 12 Algorithm selection Accuracy vs Transparency Black box White box
  13. 13. Predictive model design 13 Hyperparameters tuning Best balance Algorithm selection Accuracy vs Transparency Black box White box
  14. 14. Predictive model design 14 Hyperparameters tuning Best balance Feature selection Algorithm selection Accuracy vs Transparency Black box White box Determining a set of features Sales = f (Price, Location) Sales = f (Price, Weather) or
  15. 15. Predictive model design 15 Hyperparameters tuning Best balance Feature selection Algorithm selection Accuracy vs Transparency Black box White box Determining a set of features A lot of effort, many models… Sales = f (Price, Location) Sales = f (Price, Weather) or
  16. 16. Predictive model design 16 Hyperparameters tuning Best balance Feature selection Algorithm selection Accuracy vs Transparency Black box White box Determining a set of features A lot of effort, many models… Many iterations, weeks... Sales = f (Price, Location) Sales = f (Price, Weather) or
  17. 17. Predictive model design 17 Hyperparameters tuning Best balance Feature selection Algorithm selection Accuracy vs Transparency Black box White box Determining a set of features A lot of effort, many models… Many iterations, weeks... Sales = f (Price, Location) Sales = f (Price, Weather) or Sophisticated knowledge...
  18. 18. Automatic predictive modeling 18 Hyperparameters tuning Best balance Feature selection Algorithm selection Accuracy vs Transparency Black box White box Determining a set of features Sales = f (Price, Location) Sales = f (Price, Weather) or
  19. 19. Automatic predictive modeling 19 Hyperparameters tuning Best balance Feature selection Algorithm selection Accuracy vs Transparency Black box White box Determining a set of features Highly accurate results in a short time! Sales = f (Price, Location) Sales = f (Price, Weather) or
  20. 20. Our technology 20
  21. 21. Exploring massive modeling possibilities 21 Data preprocessing strategies
  22. 22. Exploring massive modeling possibilities 22 Algorithms Yes No Yes Data preprocessing strategies
  23. 23. Exploring massive modeling possibilities 23 Algorithms Yes No Yes Data preprocessing strategies Feature selection!
  24. 24. Exploring massive modeling possibilities 24 Algorithms Yes No Yes Hyperparameters tuning Data preprocessing strategies Feature selection!
  25. 25. Exploring massive modeling possibilities 25 Algorithms Yes No Yes Data preprocessing strategies Yes No Yes Feature selection! 1000s of models! Hyperparameters tuning
  26. 26. Exploring massive modeling possibilities 26 Algorithms Yes No Yes Data preprocessing strategies Yes No Yes Feature selection! 1000s of models! Hyperparameters tuning
  27. 27. Automating and accelerating with Spark 27 Complete in hours! Yes No Yes Algorithms Yes No Yes Data preprocessing strategies Feature selection! Hyperparameters tuning
  28. 28. 28 Training data Validation criteria Validation data Modeling flow = training + validation
  29. 29. Modeling flow = training + validation 29 Training data Validation data Training models Validating models Models Test data Best model Validation criteria
  30. 30. Modeling and prediction flow 30 Training data Validation data Training models Validating models Models Test data Prediction Best model Validation criteria Best prediction
  31. 31. Design challenges and solutions 31
  32. 32. Challenges to achieve high execution performance 32 • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing θ1 θ2 θ3
  33. 33. Challenges to achieve high execution performance 33 θ1 θ2 θ3 • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  34. 34. Using native ML engines in Spark Why? 34
  35. 35. Comparison of Spark and native ML engines 35 (+ Spark ML) Native ML engines
  36. 36. Comparison of Spark and native ML engines 36 (+ Spark ML) Native ML engines Scalability Yes No (or very limited)
  37. 37. Comparison of Spark and native ML engines 37 (+ Spark ML) Native ML engines Scalability Yes No (or very limited) Choice of algorithms Some Many (+ possibly some custom, very efficient) Accuracy
  38. 38. Comparison of Spark and native ML engines 38 (+ Spark ML) Native ML engines Scalability Yes No (or very limited) Choice of algorithms Some Many (+ possibly some custom, very efficient) Performance Medium Extremely high Distributed nature, synchronization overhead Accuracy If data fits on a single server
  39. 39. Comparison of Spark and native ML engines 39 (+ Spark ML) Native ML engines Scalability Yes No (or very limited) Choice of algorithms Some Many (+ possibly some custom, very efficient) Performance Medium Extremely high Distributed nature, synchronization overhead Accuracy If data fits on a single server
  40. 40. Comparison of Spark and native ML engines • We would like to combine Spark and ML engines 40 (+ Spark ML) Native ML engines Scalability Yes No (or very limited) Choice of algorithms Some Many (+ possibly some custom, very efficient) Performance Medium Extremely high
  41. 41. Combining Spark and ML engines for training 41 Training data (parquet) HDFS Models
  42. 42. 42 Data preprocessing (MapReduce) Training data (parquet) HDFS Models Combining Spark and ML engines for training
  43. 43. 43 Machine Learning (map operation) Data preprocessing (MapReduce) Training data (parquet) HDFS Models Yes No Yes Combining Spark and ML engines for training
  44. 44. 44 Machine Learning (map operation) Data preprocessing (MapReduce) Training data (parquet) HDFS Models Yes No Yes ’Single ML engine’ on a single executor Combining Spark and ML engines for training
  45. 45. 45 Machine Learning (map operation) Data preprocessing (MapReduce) Training data (parquet) HDFS Models Yes No Yes Input requirements: size & format ’Single ML engine’ on a single executor Combining Spark and ML engines for training
  46. 46. 46 Machine Learning (map operation) Data preprocessing (MapReduce) Training data (parquet) HDFS Models Yes No Yes Combining Spark and ML engines for training
  47. 47. 47 Machine Learning (map operation) Converting to RDD[Matrix] Data preprocessing (MapReduce) Training data (parquet) HDFS Models Yes No Yes Matrix Matrix Matrix Combining Spark and ML engines for training
  48. 48. 48 Machine Learning (map operation) Converting to RDD[Matrix] Data preprocessing (MapReduce) Training data (parquet) HDFS Models Yes No Yes Matrix Matrix Matrix Combining Spark and ML engines for training RDD of huge, efficiently stored objects optimized for ML computations!!!
  49. 49. Converting to RDD[Matrix] 49 Machine Learning (map operation) Data preprocessing (MapReduce) Training data (parquet) HDFS HDFS 1000s of models Yes No Yes Yes No Yes Matrix Matrix Matrix RDD of huge, efficiently stored objects optimized for ML computations!!! Combining Spark and ML engines for training
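The slides describe this training flow only at the diagram level. As a rough sketch of the pattern (in plain Python, with lists standing in for Spark partitions and `RDD[Matrix]`, and a toy `train_native` standing in for a native ML engine; all names here are illustrative, not the system's actual API), the per-executor pipeline of preprocess → convert to matrix → train might look like:

```python
# Hypothetical sketch of the training pipeline: preprocessing, conversion of
# each partition into one matrix-like object, then a native-engine training
# call per (matrix, hyperparameter) pair — in Spark this last step is a map
# over RDD[Matrix], one native engine per executor.

def preprocess(records):
    """Data preprocessing step (MapReduce in the real pipeline)."""
    return [(float(x), float(y)) for x, y in records]

def to_matrix(partition):
    """Convert a partition of rows into one dense matrix-like object."""
    return [list(row) for row in partition]

def train_native(matrix, hyperparams):
    """Toy 'native engine': fit y = a*x by least squares on the matrix."""
    num = sum(x * y for x, y in matrix)
    den = sum(x * x for x, _ in matrix) or 1.0
    return {"slope": num / den, **hyperparams}

def train_all(partitions, hyperparam_grid):
    # In Spark this is a map over RDD[Matrix]; here, a plain loop.
    matrices = [to_matrix(preprocess(p)) for p in partitions]
    return [train_native(m, hp) for m in matrices for hp in hyperparam_grid]

models = train_all(
    partitions=[[(1, 2), (2, 4)], [(3, 6)]],
    hyperparam_grid=[{"reg": 0.1}, {"reg": 1.0}],
)
print(len(models))  # 2 partitions x 2 hyperparameter settings -> 4 models
```

The key design point the slides emphasize is that conversion to a matrix happens once per partition, and the expensive training step then runs entirely inside the native engine.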
  50. 50. Combining Spark and ML engines for validation 50 Validation data (parquet) HDFS
  51. 51. 51 Data preprocessing (MapReduce) Validation data (parquet) HDFS Combining Spark and ML engines for validation
  52. 52. 52 Converting to RDD[Matrix] Data preprocessing (MapReduce) Validation data (parquet) HDFS Matrix Matrix Matrix Combining Spark and ML engines for validation
  53. 53. Converting to RDD[Matrix] Matrix Matrix Matrix 53 Prediction (map operation) Data preprocessing (MapReduce) Validation data (parquet) HDFS Computing validation results for many models Combining Spark and ML engines for validation
  54. 54. Converting to RDD[Matrix] Matrix Matrix Matrix 54 Validation (MapReduce) Prediction (map operation) Data preprocessing (MapReduce) Validation data (parquet) HDFS Computing validation scores Combining Spark and ML engines for validation
  55. 55. Converting to RDD[Matrix] Matrix Matrix Matrix 55 Validation (MapReduce) Prediction (map operation) Data preprocessing (MapReduce) Validation data (parquet) HDFS HDFS Best model Combining Spark and ML engines for validation
  56. 56. 56 Predict (map operation) Convert to RDD[Matrix] Data preprocessing (MapReduce) Test data (parquet) HDFS HDFS Prediction results (parquet) Matrix Matrix Matrix Computations only for selected models Combining Spark and ML engines for prediction
  57. 57. Design challenges 57 θ1 θ2 θ3 • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  58. 58. Many models to schedule 58 Matrix X3 Matrix X2 Matrix X1
  59. 59. Many models to schedule 59 Algorithms Hyperparameters Data preprocessing strategies Parameters: θ1, θ2, θ3 ... Matrix X3 Matrix X2 Matrix X1
  60. 60. Many models to schedule 60 Algorithms Hyperparameters Data preprocessing strategies Machine Learning Yes No Yes Parameters: θ1, θ2, θ3 ... Matrix X3 Matrix X2 Matrix X1
  61. 61. Naive scheduling 61 Load & Convert Parameter θ1 Parameter θ1 Parameter θ1 Matrix X1 Matrix X2 Matrix X3 • Waste of memory • Frequent data loading from other servers • Frequent data to matrix conversion
  62. 62. Parameter-aware scheduling 62 • Efficient memory usage • Infrequent data loading from other servers • Infrequent data to matrix conversion Parameter θ1 Parameter θ2 Parameter θ3 Matrix X1
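The core of parameter-aware scheduling, as described in the slides, is grouping work by input data rather than by parameter. A minimal sketch (names and data shapes are illustrative): group each (matrix, parameter) task by its matrix, so a matrix is loaded and converted once and then reused for every parameter setting assigned to that executor.

```python
from collections import defaultdict

# Sketch of parameter-aware scheduling: group (matrix, parameter) tasks by
# matrix, so each matrix is loaded/converted once and all of its parameter
# settings run against the cached copy — instead of the naive schedule,
# which reloads and reconverts the matrix for every single task.
def parameter_aware_schedule(tasks):
    groups = defaultdict(list)
    for matrix_id, theta in tasks:
        groups[matrix_id].append(theta)
    # Each entry = one load/convert of matrix_id, then all thetas run on it.
    return dict(groups)

tasks = [("X1", "θ1"), ("X2", "θ1"), ("X1", "θ2"), ("X1", "θ3")]
schedule = parameter_aware_schedule(tasks)
print(schedule["X1"])  # ['θ1', 'θ2', 'θ3'] — X1 loaded once for 3 parameters
```

With the naive schedule the four tasks above would trigger four load-and-convert steps; grouped by matrix, only two are needed.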
  63. 63. Design challenges 63 θ1 θ2 θ3 • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  64. 64. Machine learning – most work intensive & time consuming part 64 Machine Learning (map operation) Convert to matrix Data preprocessing (MapReduce) Training data (parquet) HDFS HDFS Yes No Yes We must ensure good balance of paralleled work 1000s of models Matrix Matrix Matrix
  65. 65. Naive balancing of models to compute 65 5 min 5 min Complicated model
  66. 66. Naive balancing of models to compute 66 5 min 5 min 1 min 1 min Wait 8 min…Yes No Yes Yes No Yes Decision tree model Complicated model
  67. 67. Predictive balancing • Balancing complex and simple models (based on previous estimation) • Complex models first 5 min 1 min 5 min 1 min Yes No Yes Yes No Yes ♪~ ♪~ 67
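The slides give only the intuition (complex models first, balanced across workers). One standard way to realize this — assuming, as the slides suggest, that per-model cost estimates from previous runs are available — is a longest-processing-time-first greedy assignment; the names and cost values below are illustrative:

```python
import heapq

# Sketch of predictive work balancing: using estimated per-model training
# times, schedule the most expensive models first, each onto the currently
# least-loaded worker (longest-processing-time-first greedy).
def balance(model_costs, n_workers):
    heap = [(0.0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for model, cost in sorted(model_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)  # least-loaded worker
        assignment[w].append(model)
        heapq.heappush(heap, (load + cost, w))
    return assignment, max(load for load, _ in heap)

# The slide's scenario: two 5-minute "complicated" models, two 1-minute trees.
costs = {"complex_a": 5, "complex_b": 5, "tree_a": 1, "tree_b": 1}
assignment, makespan = balance(costs, n_workers=2)
print(makespan)  # 6.0 — each worker gets one 5-min and one 1-min model
```

Naively assigning both 5-minute models to one worker would make the other worker idle after 2 minutes and stretch the total time to 10 minutes; scheduling complex models first keeps both workers busy.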
  68. 68. Evaluation 68
  69. 69. Evaluation – targeting Top-10% • Prediction problem – Comparing Top-10% precision of targeting potential positive samples • Comparing with manual predictive modeling – Done with scikit-learn v0.18.1 – Selected algorithms (Logistic Regression, SVM, Random Forests) – Selected preprocessing strategies – All parameters of algorithms set with default values • except Random Forest (n_estimators = 200) 69
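The Top-10% precision metric used above is not spelled out in the slides; the usual definition (and the one this sketch assumes) is: rank samples by predicted score, take the top 10%, and measure the fraction that are truly positive.

```python
# Top-K% precision: fraction of true positives among the K% of samples with
# the highest predicted scores (assumed definition; function name is ours).
def top_k_percent_precision(scores, labels, pct=0.10):
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    k = max(1, int(len(ranked) * pct))
    return sum(label for _, label in ranked[:k]) / k

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   0,   1,   0,   0,   0,   0,   0,   0,   0]
print(top_k_percent_precision(scores, labels))  # 1.0: the single top-10% sample is positive
```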
  70. 70. Evaluation – data sets • KDDCUP 2014 competition data – 557K records for training and validation data – 62K records for test data – Features: 500 • KDDCUP 2015 competition data – 108K records for training and validation data – 12K records for test data – Features: 500 • IJCAI 2015 competition data – 87K records for training, validation and test data – Features: 500 70
  71. 71. Evaluation – cluster specification • Size: 3U • Server modules: 34 • CPU: 272 cores (Intel Xeon D 2.1GHz) – 128 cores used in the evaluation • RAM: 2TB • Storage: 34TB SSD • Internal network: 10GbE • Spark v1.6.0, Hadoop v2.7.3 71 Scalable Modular Server (DX2000)
  72. 72. Evaluation results and conclusions 72 Data Our technology Logistic regression SVM Random Forests KDDCUP 2014 15.6% 13.5% 12.0% 14.8% KDDCUP 2015 97.1% 95.5% 93.1% 97.2% IJCAI 2015 8.2% 8.3% 8.1% 8.2% Top-10% precision results
  73. 73. Evaluation results and conclusions • Competitive results with good accuracy 73 Data Our technology Logistic regression SVM Random Forests KDDCUP 2014 15.6% 13.5% 12.0% 14.8% KDDCUP 2015 97.1% 95.5% 93.1% 97.2% IJCAI 2015 8.2% 8.3% 8.1% 8.2% Top-10% precision results
  74. 74. Evaluation results and conclusions • Short execution time • Full automation of the whole process • Handling data of any size 74 Data Our technology KDDCUP 2014 172 minutes KDDCUP 2015 45 minutes IJCAI 2015 36 minutes Execution time
  75. 75. Our observations 75
  76. 76. Our observations • Using RDD of huge but compact objects optimized for ML computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN 76
  77. 77. Our observations • Using RDD of huge but compact objects optimized for ML computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN 77
  78. 78. Converting to RDD[Matrix] 78 Machine Learning (map operation) Data preprocessing (MapReduce) Training data (parquet) HDFS HDFS 1000s of models Yes No Yes Yes No Yes Matrix Matrix Matrix RDD[DenseMatrix]
  79. 79. • Spark used for parallelization • All the necessary data for a single execution kept without memory overhead • Performance critical operations executed: – On objects with Linear Algebra operations optimized – By fast native ML algorithms 79 RDD[DenseMatrix]
  80. 80. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN 80
  81. 81. Limiting execution overhead in tests • Submitting Spark application takes time 81 TestSpark submit Spark submit Test Spark submit Test
  82. 82. Limiting execution overhead in tests • We submit only once 82 TestSpark submit Test Test ♪~
  83. 83. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN 83
  84. 84. Stable execution on YARN • Default configuration sometimes fails with out-of-memory errors • Spark Web UI: • Even when Spark is given plenty of memory, the application still fails • Known problem in Spark 84
  85. 85. Stable execution on YARN • JVM system memory spikes over YARN limitation suddenly (*) 85 (*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit”, Spark Summit 2016. YARN limitation (6GB) Time Memory(GB) Spike of JVM system memory usage
  86. 86. Stable execution on YARN • Tip: spark.yarn.executor.memoryOverhead should be carefully configured • Recommended overhead: 6-10% • 15% overhead required in our case • Must be thoroughly investigated 86 (http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
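Concretely, the tip above amounts to raising the off-heap allowance YARN grants each executor. A sketch of the submit command (the property name `spark.yarn.executor.memoryOverhead` matches the Spark 1.6 docs the slide links to; the 6 GB heap and jar name are illustrative, and the ~15% figure is what the talk reports for this particular workload, not a general recommendation):

```shell
# spark.yarn.executor.memoryOverhead is specified in MB; 1024 MB is roughly
# 15% of a 6 GB executor heap (illustrative values).
spark-submit \
  --master yarn \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  your-modeling-app.jar
```

If the overhead is too small, YARN kills the container when the JVM's system memory spikes past the limit, which matches the failure pattern shown on the previous slide.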
  87. 87. Summary 87
  88. 88. Summary • Predictive modeling problem – Requires sophisticated knowledge – Takes a long time • Our technology: Automatic Predictive Modeling – Combines Spark with native ML engines – Fully automates the whole process – Provides highly accurate results – Takes at most hours – Handles data of any size 88
  89. 89. Future work • Extending to other models (e.g. deep learning) • Speeding up with GPUs • Reducing YARN memory overhead 89
  90. 90. Thank you! 90
