Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan Cutler

Tuning a Spark ML model with cross-validation can be extremely computationally expensive. As the number of hyperparameter combinations increases, so does the number of models being evaluated. By default, Spark evaluates these models one by one to select the best-performing one. With a large number of models, if training and evaluating a single model does not fully utilize the available cluster resources, that waste is compounded for each model and leads to long run times.
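
To make the growth concrete, here is a minimal sketch in plain Python (not Spark code) that counts the candidates in an exhaustive grid and the number of model fits k-fold cross-validation would perform over it. The hyperparameter names, their values, and the choice of 5 folds are hypothetical and chosen only for illustration.

```python
from itertools import product


def count_fits(param_grid, num_folds):
    """Return (combinations, total fits) for an exhaustive grid with k-fold CV."""
    # Every combination of candidate values is one model configuration.
    combinations = list(product(*param_grid.values()))
    # Each configuration is fitted once per fold.
    return len(combinations), len(combinations) * num_folds


# Hypothetical grid: three hyperparameters with five candidate values each.
grid = {
    "hyperparam_a": [0.001, 0.01, 0.1, 1.0, 10.0],
    "hyperparam_b": [10, 100, 1000, 10000, 100000],
    "hyperparam_c": [1, 2, 3, 4, 5],
}

n_combos, n_fits = count_fits(grid, num_folds=5)
print(n_combos, n_fits)  # 125 625
```

If each of those fits leaves part of the cluster idle, the idle time is paid once per fit, which is why serial evaluation scales so poorly.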

Model parallelism in Spark cross-validation, available from Spark 2.3, allows more than one model to be trained and evaluated at the same time and makes better use of cluster resources. We will go over how to enable this setting in Spark, what effect it has on an example ML pipeline, and best practices to keep in mind when using this feature.
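
The slides describe the parallelism parameter as the size of a thread pool. The sketch below models that idea with Python's standard `concurrent.futures` rather than Spark itself: `fit_and_evaluate` is a hypothetical stand-in that sleeps instead of launching Spark jobs, and its score is made up. In Spark, the corresponding setting is the `parallelism` parameter of `CrossValidator`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_and_evaluate(params):
    """Stand-in for training and scoring one model (sleeps instead of running Spark jobs)."""
    time.sleep(0.05)
    # Made-up score that peaks at regParam = 0.01; higher is better.
    return params, -abs(params["regParam"] - 0.01)


def tune(param_maps, parallelism):
    """Evaluate every candidate with at most `parallelism` running at once."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(fit_and_evaluate, param_maps))
    # Pick the candidate with the highest score.
    return max(results, key=lambda result: result[1])


param_maps = [{"regParam": r} for r in (0.001, 0.01, 0.1, 1.0)]
best_params, best_score = tune(param_maps, parallelism=2)
print(best_params)
```

The selected model does not depend on the parallelism value; only the elapsed time changes, since up to `parallelism` candidates are in flight at once.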

Additionally, we will discuss ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines.
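
As a rough illustration of how much work is redundant, the following plain-Python sketch counts stage executions for the Tokenizer, CountVectorizer, LogisticRegression example used in the slides. It compares running every pipeline independently with running each shared prefix once. It only counts work under that assumption; it is not the authors' implementation, and the stage labels are illustrative strings.

```python
def count_stage_runs(pipelines):
    """Count stage executions: every pipeline alone vs. shared prefixes run once."""
    independent = sum(len(pipeline) for pipeline in pipelines)
    # Each distinct prefix of stages corresponds to one node in the DAG.
    shared_nodes = {
        tuple(pipeline[: i + 1])
        for pipeline in pipelines
        for i in range(len(pipeline))
    }
    return independent, len(shared_nodes)


pipelines = [
    ("Tokenizer", "CountVectorizer(nfeat=10)", "LR(reg=0.1)"),
    ("Tokenizer", "CountVectorizer(nfeat=10)", "LR(reg=0.01)"),
    ("Tokenizer", "CountVectorizer(nfeat=100)", "LR(reg=0.1)"),
    ("Tokenizer", "CountVectorizer(nfeat=100)", "LR(reg=0.01)"),
]

print(count_stage_runs(pipelines))  # (12, 7)
```

Here the tokenizer runs once instead of four times and each CountVectorizer setting is fitted once instead of twice, which is where caching the intermediate result pays off.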

  1. DBG / June 5, 2018 / © 2018 IBM Corporation
     Model Parallelism in Spark ML Cross-validation
     Nick Pentreath, Principal Engineer
     Bryan Cutler, Software Engineer
  2. About Nick
     @MLnick on Twitter & GitHub
     Principal Engineer, IBM
     CODAIT - Center for Open-Source Data & AI Technologies
     Machine Learning & AI
     Apache Spark committer & PMC
     Author of Machine Learning with Spark
     Various conferences & meetups
  3. About Bryan
     Software Engineer, IBM CODAIT
     Apache Spark committer
     Apache Arrow committer
     Python, Machine Learning OSS
     @BryanCutler on GitHub
  4. Center for Open Source Data and AI Technologies (CODAIT)
     codait.org
     CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise
     Relaunch of the Spark Technology Center (STC) to reflect expanded mission
     Improving Enterprise AI Lifecycle in Open Source
  5. Agenda
     Model Tuning in Spark
     Scaling Model Tuning
     Performance Results
     Best Practices
     Future Directions in Optimizing Pipelines
  6. Model Tuning in Spark
  7. Model selection: workflow within a workflow
     Workflow stages: Ingest, Data Processing, Feature Engineering, Model Selection, Final Model
     Model selection loop: Candidate models, Train, Evaluate, Adjust
  8. Pipeline cross-validation
     Spark ML Pipeline: Tokenizer → CountVectorizer → LogisticRegression
     Parameters: # features: 10, 100; regParam: 0.001, 0.1
  9. Pipeline cross-validation
     • Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.001)
     • Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.1)
     • Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.001)
     • Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.1)
  10. Pipeline cross-validation
     Pipeline: Tokenizer → CountVectorizer → LogisticRegression
     Parameters: # features: 10, 100; regParam: 0.001, 0.1
  11. Pipeline cross-validation
  12. Pipeline cross-validation
  13. Pipeline cross-validation
  14. Cross-validation is expensive!
     • 5 x 5 x 5 hyperparameters = 125 pipelines
     • ... across 4 machine learning models = 500
     • If training & evaluation does not fully utilize available cluster resources then that waste is compounded for each model
     Based on XKCD comic: https://xkcd.com/303/ & https://github.com/mislavcimpersak/xkcd-excuse-generator
  15. Scaling Model Tuning
  16. Parallel model evaluation
     • Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.001)
     • Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.1)
     • Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.001)
     • Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.1)
  17. Parallel model evaluation
  18. Parallel model evaluation
  19. Parallel model evaluation
  20. Parallel model evaluation
     • Added in SPARK-19357 and SPARK-21911 (PySpark)
     • Parallelism parameter governs the maximum # models to be trained at once
  21. Parallel model evaluation
     Pipeline: Tokenizer → CountVectorizer → LogisticRegression
     Parameters: # features: 10, 100; regParam: 0.001, 0.1
  22. Parallel model evaluation
  23. Parallel model evaluation
  24. Parallel model evaluation
  25. Implementation considerations
     • Parallelism parameter sets the size of threadpool under the hood
     • Dedicated ExecutionContext created to avoid deadlocks with using the default threadpool
     • Used Futures instead of parallel collections (more flexible)
     • Model-specific parallel fitting implementations not supported
     • SPARK-22126
  26. Performance tests
     • Compared parallel CV to serial CV with varying number of samples
     • Simple LogisticRegression with regParam and fitIntercept; parameter grid size 12
     • Measure elapsed time for cross-validation
     • Data size: 100,000 -> 5,000,000
     • Number features: 10
     • Number partitions: 10
     • Number CV folds: 5
     • Parallelism: 3
     • Standalone cluster with 30 cores
  27. Results
     • About 2.4x speedup
     • Stays roughly constant as # samples increases
  28. Best practices
     • Simple integer parameter is the only thing you can set (for now)
     • Too low => under-utilize resources
     • Too high => could lead to memory issues or overloading cluster
     • Rough rule: # cores / # partitions
     • But depends on data and model sizes
     • Mid-sized cluster probably <= 10
  29. Optimizing Tuning for Pipeline Models
  30. Challenges
     • Multi-stage, complex pipelines
     • Parameter grid with hyperparameters from different stages
     • Easy to have huge number of candidate parameter combinations
     • Model parallelism helps, but can we do better?
  31. Duplicating work
     • Each Pipeline treated independently
     • Depending on parameter grid and pipeline stages
     • Fit the same model multiple times
     • Perform same transformations multiple times
  32. Optimize with a DAG
     • A node is an estimator/transformer with a set of hyperparameters
     • A path in the graph is a single pipeline model
     Diagram: Tokenizer → CountVectorizer nfeat=10 → LR reg=0.1, LR reg=0.01; Tokenizer → CountVectorizer nfeat=100 → LR reg=0.1, LR reg=0.01
  33. Parallelize in breadth-first order
     • Example with parallelism parameter set to 2
     • Tokenizer is only a transform, proceed to fit CountVectorizer nodes
  34. Fit estimators
     • Cache the result and proceed to fit the first 2 LogisticRegression models
     Diagram annotations: Cache result
  35. Fit estimators
     • Unpersist when child tasks done
     • Fit final 2 LR models
     Diagram annotations: Unpersist cached dataframe; Cache result
  36. Fit estimators
     • All 4 LR models fitted
     Diagram annotations: Unpersist cached dataframe
  37. Evaluate models
     • Evaluate models using similar method
     • CountVectorizerModel is now a transformer
     • Cache transform result
     Diagram: Tokenizer → CVModel nfeat=10 → LRModel reg=0.1, LRModel reg=0.01; Tokenizer → CVModel nfeat=100 → LRModel reg=0.1, LRModel reg=0.01
     Diagram annotations: Cache result
  38. Evaluate models
     Diagram annotations: Unpersist cached dataframe; Cache result
     Metrics: 0.62 0.62
  39. Evaluate models
     • All models evaluated for this fold
     Diagram annotations: Unpersist cached dataframe
     Metrics: 0.62 0.62 0.72 0.66
  40. Select best model
     • Average the metrics from all folds and select the best PipelineModel
     Avg Metrics: 0.64 0.64 0.71 0.65
  41. Performance tests
     • Compared to Standard Spark CV with parallelism enabled
     • Pipeline: MinMaxScaler → PCA → LinearRegression
     • Measure elapsed time for cross-validation varying size of parameter grid from 36 to 80 models to evaluate
     • Data size: 1,000,000
     • Number features: 50
     • Number partitions: 16
     • Number CV folds: 4
     • Parallelism: 3
     • Standalone cluster with 30 cores
  42. Results
     • Up to 3.25x speedup
     • Increases with more models …
     • … and more complex pipelines
     • Check out: https://github.com/BryanCutler/PipelineTuning
     • Experimental!
     • Watch SPARK-19071
     [Chart: Elapsed time for DAG CV vs Simple Parallel CV; x-axis: # models (36, 48, 60, 80); series: Parallel, DAG Parallel]
  43. Thank you!
     codait.org
     twitter.com/MLnick
     github.com/MLnick
     github.com/BryanCutler
     developer.ibm.com/code
     FfDL, MAX
     Sign up for IBM Cloud and try Watson Studio! https://datascience.ibm.com/
  44. IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
     Columns: Date, Time, Location & Duration; Session title and Speaker
     • Tue, June 5 | 11 AM | 2010-2012, 30 mins: Productionizing Spark ML Pipelines with the Portable Format for Analytics. Nick Pentreath (IBM)
     • Tue, June 5 | 2 PM | 2018, 30 mins: Making PySpark Amazing: From Faster UDFs to Dependency Management and Graphing! Holden Karau (Google), Bryan Cutler (IBM)
     • Tue, June 5 | 2 PM | Nook by 2001, 30 mins: Making Data and AI Accessible for All. Armand Ruiz Gabernet (IBM)
     • Tue, June 5 | 2:40 PM | 2002-2004, 30 mins: Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System. Rajesh Bordawekar (IBM T.J. Watson Research Center)
     • Tue, June 5 | 3:20 PM | 3016-3022, 30 mins: Dynamic Priorities for Apache Spark Application’s Resource Allocations. Michael Feiman (IBM Spectrum Computing), Shinnosuke Okada (IBM Canada Ltd.)
     • Tue, June 5 | 3:20 PM | 2001-2005, 30 mins: Model Parallelism in Spark ML Cross-Validation. Nick Pentreath (IBM), Bryan Cutler (IBM)
     • Tue, June 5 | 3:20 PM | 2007, 30 mins: Serverless Machine Learning on Modern Hardware Using Apache Spark. Patrick Stuedi (IBM)
     • Tue, June 5 | 5:40 PM | 2002-2004, 30 mins: Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine. Sourav Mazumder (IBM Analytics), Aradhna Tiwari (University of South Florida)
     • Tue, June 5 | 5:40 PM | 2007, 30 mins: Transparent GPU Exploitation on Apache Spark. Dr. Kazuaki Ishizaki (IBM), Madhusudanan Kandasamy (IBM)
     • Tue, June 5 | 5:40 PM | 2009-2011, 30 mins: Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks. Yonggang Hu (IBM), Chao Xue (IBM)
  45. IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6)
     Columns: Date, Time, Location & Duration; Session title and Speaker
     • Wed, June 6 | 12:50 PM: Birds of a Feather: Apache Arrow in Spark and More. Bryan Cutler (IBM), Li Jin (Two Sigma Investments, LP)
     • Wed, June 6 | 2 PM | 2002-2004, 30 mins: Deep Learning for Recommender Systems. Nick Pentreath (IBM)
     • Wed, June 6 | 3:20 PM | 2018, 30 mins: Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer. Frederick Reiss (IBM), Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies)
     Meet us at IBM booth in the Expo area.
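
The rough rule from the Best practices slide (# cores / # partitions, with a mid-sized cluster probably staying at or below 10) can be written as a small helper. This is a sketch only: the function name, the use of integer division, and the floor of 1 are assumptions, and the slide itself cautions that the right value also depends on data and model sizes.

```python
def suggested_parallelism(num_cores, num_partitions, cap=10):
    """Rough starting point for the parallelism setting: cores / partitions.

    The result is floored at 1 and capped at `cap`, following the slide's
    remark that a mid-sized cluster probably stays at or below 10.
    """
    return max(1, min(cap, num_cores // num_partitions))


# Settings from the first performance test: 30 cores, 10 partitions.
print(suggested_parallelism(30, 10))  # 3
```

With the first performance test's settings (30 cores, 10 partitions) the rule gives 3, which matches the parallelism value used in that test.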