Augmenting Machine Learning with Databricks Labs AutoML Toolkit
<p>Instead of better understanding and optimizing their machine learning models, data scientists spend a majority of their time training and iterating through different models, even in cases where the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train), and optimizing those models. This process can be (and often is) laborious and time-consuming.</p><p>In this session, we will explore this process and then show how the AutoML Toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this on financial loan risk data, with code snippets and notebooks that will be free to download.</p>
  1. 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Mary Grace Moesta, Databricks Denny Lee, Databricks Augmenting Machine Learning with Databricks Labs AutoML Toolkit #UnifiedDataAnalytics #SparkAISummit
  3. 3. Agenda • Discuss the traditional ML pipeline problem and all of its stages • How the AutoML Toolkit solves these problems • Hyperparameter optimization • Choosing models • Best practices for scaling the AutoML Toolkit 3#UnifiedDataAnalytics #SparkAISummit
  4. 4. About Speaker Mary Grace Moesta Customer Success Engineer, Databricks • Current AutoML developer • Former data scientist at 84.51° focused on using ML for brand accelerator and several customer experience projects • Likes long walks on the beach, Spark and applied math 4#UnifiedDataAnalytics #SparkAISummit
  5. 5. About Speaker Denny Lee Developer Advocate, Databricks • Worked with Apache Spark™ since 0.5 • Former Senior Director Data Science Engineering at Concur • On Project Isotope incubation team that built what is now known as Azure HDInsight • Former SQLCAT DW BI Lead at Microsoft 5#UnifiedDataAnalytics #SparkAISummit
  6. 6. AutoML’s Tiered API Approach 6 • Citizen Data Scientist: No-Code Full Automation (High Level: AutomationRunner) • Engineer: Low-Code Augmentation (Mid Level: Individual Component APIs) • ML Expert / Researcher: Code Flexibility and Performance (Low Level: Hyperparameter tuning)
  7. 7. Let’s start at the end 7
  8. 8. Let’s start at the end • AutoML’s FeatureImportances automates the discovery of features • AutoML’s AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline to create an optimal ML model. • Improved AUC from 0.6732 to 0.723 • Business value: $23.22M to $68.88M saved • Less code, faster! 8
  9. 9. 9 ML Pipeline Stages
  10. 10. 10 ML Pipeline Stages
  11. 11. Traditional ML Pipelines Identify Important Features 11
  12. 12. 12 Exploratory Analysis to Identify Features
  13. 13. AutoML Toolkit Identify Important Features 13
  14. 14. 14 ML Pipeline with AutoML Toolkit
  15. 15. AutoML | FeatureImportances 15

     // Calculate Feature Importance (fi)
     val fiConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)
     // Since we're using XGBoost, set parallelism <= 2x number of nodes
     fiConfig.tunerConfig.tunerParallelism = nodeCount * 2
     val fiMainConfig = ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig)
     // Generate Feature Importance
     val importances = new FeatureImportances(sourceData, fiMainConfig, "count", 20.0)
       .generateFeatureImportances()
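The last two arguments above ("count", 20.0) control how features are culled. As a purely illustrative sketch (not the toolkit's internals), a "count"-style cutoff that keeps the top-N features ranked by importance score might look like:

```scala
// Hypothetical sketch of a count-based feature cutoff: keep the top-N
// features by importance score. Names and scores are made up.
def topFeaturesByCount(importances: Map[String, Double], count: Int): Seq[String] =
  importances.toSeq
    .sortBy { case (_, score) => -score } // highest importance first
    .take(count)
    .map { case (name, _) => name }

val scores = Map("loanAmount" -> 0.31, "grade" -> 0.24, "dti" -> 0.18, "zipCode" -> 0.02)
val selected = topFeaturesByCount(scores, 2) // Seq("loanAmount", "grade")
```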
  16. 16. AutoML | FeatureImportances 16
  17. 17. 17 ML Pipeline Stages
  18. 18. Traditional Model Building and Tuning Building and Tuning Models 18
  19. 19. Hand-made Model 19 • Traditionally, when we build an ML pipeline, we will need to do a number of tasks including: • Defining our category (text-based) and numeric columns • Based on previous analysis, determining which features (i.e. which columns) to include in your ML model • For numeric columns, ensuring they are double or float data types • For category columns, converting them using a stringIndexer and one-hot encoding to create a numeric representation of the category data • Build and train our ML pipeline to create our ML model (in this case, an XGBoost model) • For example, put together imputer, stringIndexer, and one-hot encoding of category data • Create a vector (e.g. vectorAssembler) to put together these features • Apply a standard scaler to the values to minimize the impact of outliers • Execute the model against our dataset • Review the metrics (e.g. AUC) • Tune the model using a Cross Validator • The better you understand the model, the more likely you are to provide better hyperparameters for cross validation • i.e. you need to choose a solid set of parameters (e.g. paramGrid) • Review the metrics again (e.g. AUC) • Review the confusion matrix (in the case of binary classification) • Review the business value
  20. 20. Hand-made Model 20 • Traditionally, when we build an ML pipeline, we will need to do a number of tasks including: • Defining our category (text-based) and numeric columns • Based on previous analysis, determining which features (i.e. which columns) to include in your ML model • For numeric columns, ensuring they are double or float data types • For category columns, converting them using a stringIndexer and one-hot encoding to create a numeric representation of the category data • Build and train our ML pipeline to create our ML model (in this case, an XGBoost model) • For example, put together imputer, stringIndexer, and one-hot encoding of category data • Create a vector (e.g. vectorAssembler) to put together these features • Apply a standard scaler to the values to minimize the impact of outliers • Execute the model against our dataset • Review the metrics (e.g. AUC) • Tune the model using a Cross Validator • The better you understand the model, the more likely you are to provide better hyperparameters for cross validation • i.e. you need to choose a solid set of parameters (e.g. paramGrid) • Review the metrics again (e.g. AUC) • Review the confusion matrix (in the case of binary classification) • Review the business value Can we make this easier?
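The category-handling step above (stringIndexer followed by one-hot encoding) can be sketched in plain Scala. This is an illustration of the transformation itself, not Spark's `StringIndexer`/`OneHotEncoder` API:

```scala
// Assign each distinct category value an integer index.
def indexCategories(values: Seq[String]): Map[String, Int] =
  values.distinct.zipWithIndex.toMap

// One-hot encode a single value against that index: a vector of zeros
// with a 1.0 at the value's position.
def oneHot(value: String, index: Map[String, Int]): Seq[Double] = {
  val vec = Array.fill(index.size)(0.0)
  vec(index(value)) = 1.0
  vec.toSeq
}

val grades = Seq("A", "B", "A", "C")
val idx = indexCategories(grades)         // Map(A -> 0, B -> 1, C -> 2)
val encoded = grades.map(g => oneHot(g, idx))
```

In Spark these stages would be assembled into a Pipeline alongside the imputer, vectorAssembler, and standard scaler the slide lists.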
  21. 21. AutoML Model Building and Tuning Building and Tuning Models 21
  22. 22. 22 ML Pipeline with AutoML Toolkit
  23. 23. AutoML | AutomationRunner 23

     val conf = ConfigurationGenerator.generateConfigFromMap("XGBoost",…)
     // Adjust model tuner configuration
     conf.tunerConfig.tunerParallelism = nodeCount
     // Generate configuration
     val XGBConfig = ConfigurationGenerator.generateMainConfig(conf)
     // Select on the important features
     val runner = new AutomationRunner(sourceData).setMainConfig(XGBConfig)
       .runWithConfusionReport()
  24. 24. 24 Model, Metrics, Configs Saved AUC from 0.6732 to 0.723
  25. 25. 25 Model, Metrics, Configs Saved AUC from 0.6732 to 0.723
  26. 26. How did AutoML Toolkit do this? It was able to find better hyperparameters because it: • Tested and tuned all modifiable hyperparameters • Performed this in a distributed fashion using a collection of optimization algorithms • Incorporated an understanding of how to use the parameters, extracted from the algorithm source code 26
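As a toy illustration of what an automated tuner does (a simple random search, not the toolkit's genetic algorithm), each candidate hyperparameter value is scored and the best one kept:

```scala
import scala.util.Random

// Toy random search over one hyperparameter: sample candidates uniformly
// in [lo, hi] and keep the one that maximizes the scoring function
// (a stand-in for validation AUC).
def randomSearch(score: Double => Double, lo: Double, hi: Double,
                 trials: Int, seed: Long = 42L): (Double, Double) = {
  val rng = new Random(seed)
  val candidates = Seq.fill(trials)(lo + (hi - lo) * rng.nextDouble())
  val best = candidates.maxBy(score)
  (best, score(best))
}

// Made-up objective that peaks at 0.3 (e.g. an ideal learning rate).
val (bestParam, bestScore) = randomSearch(x => 1.0 - math.abs(x - 0.3), 0.0, 1.0, 200)
```

The toolkit improves on this idea by searching many parameters at once, in parallel across the cluster, and by evolving candidates across generations rather than sampling blindly.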
  27. 27. Common Overrides 27 • dataPrepCache: Cache the primary DataFrame to allow for faster batch processing of data • tunerParallelism: Configure how many workflows to run in parallel; monitor if this is >30 parallel tasks as this may saturate the driver • setTrainSplitMethod: Set the appropriate sampling method for model training • tunerTrainPortion: Configure the percentages for the train / test split • tunerAutoStoppingScore: Set the auto stopping score for hyperparameter tuning in batch mode
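In practice, overrides like these are passed as a map to `ConfigurationGenerator.generateConfigFromMap` (as in the earlier slides). The sketch below uses the names from the table; the values are examples only, and the exact key strings should be checked against the toolkit's documentation:

```scala
// Illustrative override map; keys mirror the table above, values are
// examples rather than recommendations.
val overrides: Map[String, Any] = Map(
  "dataPrepCache"          -> true,         // cache the primary DataFrame
  "tunerParallelism"       -> 20,           // keep under ~30 to avoid saturating the driver
  "setTrainSplitMethod"    -> "stratified", // sampling method for model training
  "tunerTrainPortion"      -> 0.8,          // 80 / 20 train / test split
  "tunerAutoStoppingScore" -> 0.91          // early-stop score for batch-mode tuning
)
```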
  28. 28. 28 Clearing up the Confusion
  29. 29. Business Value 29 (Prediction, Label — where Label 1 = Is Bad Loan) • [1, 1] Loss Avoided: correctly found bad loans • [1, 0] Profit Forfeited: incorrectly labeled bad loans • [0, 1] Loss Still Incurred: incorrectly labeled good loans • [0, 0] Profit Retained: correctly found good loans • Business value = -(loss avoided - profit forfeited) = -([1, 1] - [1, 0])
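The dollar figures on the following slide come from weighting these confusion-matrix cells by total loan amounts. A minimal sketch of the slide's formula, with made-up cell totals (not the session's actual data), is:

```scala
// Made-up dollar totals for two confusion-matrix cells.
val lossAvoided     = 30.0e6 // [1, 1]: predicted bad, actually bad
val profitForfeited =  6.0e6 // [1, 0]: predicted bad, actually good

// The slide's formula: business value = -(loss avoided - profit forfeited)
val businessValue = -(lossAvoided - profitForfeited)
```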
  30. 30. Business Value 30 Potentially from $23.22M to $68.88M saved
  31. 31. It’s all in the Family…Runner 31
  32. 32. Model Experimentation • In the original Loan Risk Analysis blog, we tried GLM, GBT, and XGBoost • Traditional Model Building and Tuning x3! (one for each model type) 32 x3
  33. 33. Model Experimentation • In the original Loan Risk Analysis blog, we tried GLM, GBT, and XGBoost • Traditional Model Building and Tuning x3! (one for each model type) 33 x3 Can we make this easier?
  34. 34. AutoML | FamilyRunner 34

     import com.databricks.labs.automl.executor.FamilyRunner
     // RF, GBT, and XGBoost model type configurations
     val randomForestConf = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", rfMap)
     val gbtConf = ConfigurationGenerator.generateConfigFromMap("GBT", "classifier", gbtMap)
     val xgbConf = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)
     val runner = FamilyRunner(sourceData, Array(randomForestConf, gbtConf, xgbConf)).execute()
  35. 35. AutoML | FamilyRunner 35
  36. 36. Let’s end at the end • FeatureImportances automates feature discovery • AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline • FamilyRunner automates experimenting with model families • Improved AUC from 0.6732 to 0.723; potentially $23.22M to $68.88M saved • Less code, faster! 36
  37. 37. AutoML Roadmap 0.6.0 features • Serialized model and featurization stored as SparkML Pipeline • BinaryEncoder for high cardinality nominal features • Euclidean distance optimizer for post modeling search • Advanced MBO search for genetic epoch candidates to aid in faster / more effective convergence • Automated MLflow logging and configuration (log to same workspace directory) • Bug fixes 37
  38. 38. AutoML Roadmap 0.6.x release • Python API • MLeap artifact export • LightGBM support 38
  39. 39. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT