Mary Grace Moesta, Databricks
Denny Lee, Databricks
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Discuss the traditional ML pipeline problem and all of its stages
• How AutoML Toolkit solves these problems
• Hyperparameter Optimization
• Choosing Models
• Scaling AutoML Toolkit Best Practices
About Speaker
Mary Grace Moesta
Customer Success Engineer, Databricks
• Current AutoML developer
• Former data scientist at 84.51° focused on using ML
for brand accelerator and several customer
experience projects
• Likes long walks on the beach, Spark and applied
math
About Speaker
Denny Lee
Developer Advocate, Databricks
• Worked with Apache Spark™ since 0.5
• Former Senior Director Data Science Engineering at
Concur
• On Project Isotope incubation team that built what
is now known as Azure HDInsight
• Former SQLCAT DW BI Lead at Microsoft
AutoML’s Tiered API Approach
Persona | Goal | AutoML API
Citizen Data Scientist | No-Code: Full Automation | High Level - Automation Runner
Engineer | Low-Code: Augmentation | Mid Level - Individual Component APIs
ML Expert / Researcher | Code: Flexibility and Performance | Low Level - Hyperparameter tuning
Let’s start at the end
• AutoML’s FeatureImportances automates the discovery
of features
• AutoML’s AutomationRunner automates the building,
training, execution, and tuning of a Machine Learning pipeline
to create an optimal ML model.
• Improved AUC from 0.6732 to 0.723
• Business value: $23.22M to $68.88M saved
• Less code, faster!
ML Pipeline Stages
Traditional ML Pipelines
Identify Important Features
Exploratory Analysis to
Identify Features
AutoML Toolkit
Identify Important Features
ML Pipeline with AutoML Toolkit
AutoML | FeatureImportances
// Calculate Feature Importance (fi)
val fiConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)

// Since we're using XGBoost, set parallelism <= 2x number of nodes
fiConfig.tunerConfig.tunerParallelism = nodeCount * 2
val fiMainConfig = ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig)

// Generate Feature Importance
val importances = new FeatureImportances(sourceData, fiMainConfig, "count", 20.0)
  .generateFeatureImportances()
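The last two arguments in the call above ("count", 20.0) read like a cutoff type and a cutoff value. As a minimal sketch of that idea, assuming "count" means "keep the top N features" and a hypothetical "value" mode means "keep features at or above a score", the selection could look like the following. The object name, function name, column names, and scores are illustrative placeholders, not the toolkit's internals.

```scala
// Illustrative only: not the AutoML Toolkit's internal implementation.
object FeatureCutoffSketch {
  /** Keep features by rank ("count") or by minimum importance ("value"). */
  def select(importances: Seq[(String, Double)],
             cutoffType: String,
             cutoffValue: Double): Seq[String] = {
    // Rank features from most to least important.
    val ranked = importances.sortBy { case (_, score) => -score }
    cutoffType match {
      case "count" => ranked.take(cutoffValue.toInt).map(_._1)
      case "value" => ranked.filter(_._2 >= cutoffValue).map(_._1)
      case other   => throw new IllegalArgumentException(s"Unknown cutoff type: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical importance scores for made-up column names.
    val scores = Seq("col_a" -> 0.41, "col_b" -> 0.07, "col_c" -> 0.33, "col_d" -> 0.19)
    println(select(scores, "count", 2.0))
    println(select(scores, "value", 0.15))
  }
}
```

The resulting list of column names is what you would carry forward into model building.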
ML Pipeline Stages
Traditional Model Building and Tuning
Building and Tuning Models
Hand-made Model
• Traditionally, when we build an ML pipeline, we need to perform a number of tasks, including:
  • Defining our category (text-based) and numeric columns
    • Based on previous analysis, determine which features to use (i.e., which columns to include in your ML model)
    • For numeric columns, ensure they are double or float data types
    • For category columns, convert them using a stringIndexer and one-hot encoding to create a numeric representation of the category data
  • Build and train our ML pipeline to create our ML model (in this case, an XGBoost model)
    • For example, put together the imputer, stringIndexer, and one-hot encoding of category data
    • Create a vector (e.g., vectorAssembler) to put together these features
    • Apply a standard scaler to the values to minimize the impact of outliers
  • Execute the model against our dataset
    • Review the metrics (e.g., AUC)
  • Tune the model using a Cross Validator
    • The better you understand the model, the more likely you are to provide better hyperparameters for cross validation
    • i.e., you need to choose a solid set of parameters (e.g., paramGrid)
  • Review the metrics again (e.g., AUC)
  • Review the confusion matrix (in the case of binary classification)
  • Review business value
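To make the preparation steps above concrete, here is a small plain-Scala sketch of string indexing, one-hot encoding, standard scaling, and assembling a feature vector. It only illustrates the transformations on in-memory data; it is not Spark ML code, and the object name, helper names, and sample rows are hypothetical.

```scala
object ManualPrepSketch {
  /** Map each category to an index, most frequent first (ties broken alphabetically). */
  def stringIndex(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .sortBy { case (category, rows) => (-rows.size, category) }
      .map(_._1).zipWithIndex.toMap

  /** One-hot encode an index into a vector of the given width. */
  def oneHot(index: Int, width: Int): Vector[Double] =
    Vector.tabulate(width)(i => if (i == index) 1.0 else 0.0)

  /** Rescale values to zero mean and unit (population) standard deviation. */
  def standardize(values: Seq[Double]): Seq[Double] = {
    val mean = values.sum / values.size
    val std = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / values.size)
    values.map(v => if (std == 0.0) 0.0 else (v - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical rows: (category column, numeric column).
    val rows = Seq(("rent", 10.0), ("own", 20.0), ("rent", 30.0))
    val index = stringIndex(rows.map(_._1))
    val scaled = standardize(rows.map(_._2))
    // Assemble one feature vector per row: one-hot category, then the scaled numeric value.
    val features = rows.zip(scaled).map { case ((category, _), z) =>
      oneHot(index(category), index.size) :+ z
    }
    features.foreach(println)
  }
}
```

Each of these helpers corresponds to a pipeline stage you would otherwise configure and wire together by hand, which is the effort the rest of the deck aims to reduce.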
Can we make this easier?
AutoML Model Building and Tuning
Building and Tuning Models
ML Pipeline with AutoML Toolkit
AutoML | AutomationRunner
val conf = ConfigurationGenerator.generateConfigFromMap("XGBoost",…)

// Adjust model tuner configuration
conf.tunerConfig.tunerParallelism = nodeCount

// Generate configuration
val XGBConfig = ConfigurationGenerator.generateMainConfig(conf)

// Run AutoML on the selected important features
val runner = new AutomationRunner(sourceData)
  .setMainConfig(XGBConfig)
  .runWithConfusionReport()
Model, Metrics, Configs Saved
AUC from 0.6732 to 0.723
How did AutoML Toolkit do this?
Able to find better hyperparameters because it:
• Tested and tuned all modifiable hyperparameters
• Performed this in a distributed fashion using a collection of optimization algorithms
• Incorporated an understanding of how to use the parameters, extracted from the algorithm source code
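For contrast with a hand-picked paramGrid, the sketch below shows how quickly an exhaustive grid grows as hyperparameters are added, which is one reason manual tuning tends to cover only a few of them. This is a plain-Scala illustration of grid expansion, not the toolkit's search strategy; the hyperparameter names and values are hypothetical placeholders.

```scala
object ParamGridSketch {
  /** Expand named value lists into every combination (the Cartesian product). */
  def expand(grid: Seq[(String, Seq[Any])]): Seq[Map[String, Any]] =
    grid.foldLeft(Seq(Map.empty[String, Any])) { case (combos, (name, values)) =>
      for (combo <- combos; value <- values) yield combo + (name -> value)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical hand-picked grid; names and values are placeholders.
    val grid: Seq[(String, Seq[Any])] = Seq(
      "maxDepth"     -> Seq(3, 5, 7),
      "learningRate" -> Seq(0.05, 0.1, 0.3),
      "numRounds"    -> Seq(50, 100)
    )
    val combos = expand(grid)
    println(s"${combos.size} candidate models per cross-validation fold")
  }
}
```

Three hyperparameters with only two or three values each already produce 18 candidates per fold, and every additional hyperparameter multiplies that count again.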
Common Overrides
Override | Description
dataPrepCache | Cache the primary DataFrame to allow for faster batch processing
tunerParallelism | Configure how many workflows to run in parallel; monitor if this is >30 parallel tasks, as this may saturate the driver
setTrainSplitMethod | Set the appropriate sampling method for model training
tunerTrainPortion | Configure the percentages for the train / test split
tunerAutoStoppingScore | Set the auto stopping score for hyperparameter tuning in batch mode
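As a rough sketch of how such overrides might be collected before being passed to a call like generateConfigFromMap, the map below uses key names copied from the table above. The values are hypothetical, the exact key spellings the toolkit accepts should be checked against its documentation, and the helper that flags high parallelism is an illustrative addition rather than a toolkit feature.

```scala
object OverridesSketch {
  // Keys copied from the overrides table; values are hypothetical placeholders.
  val genericMapOverrides: Map[String, Any] = Map(
    "tunerParallelism"       -> 8,
    "tunerTrainPortion"      -> 0.7,
    "tunerAutoStoppingScore" -> 0.95
  )

  /** Flag parallelism settings above the level the table warns about (>30). */
  def maySaturateDriver(overrides: Map[String, Any], threshold: Int = 30): Boolean =
    overrides.get("tunerParallelism") match {
      case Some(p: Int) => p > threshold
      case _            => false
    }

  def main(args: Array[String]): Unit = {
    println(maySaturateDriver(genericMapOverrides))
    println(maySaturateDriver(genericMapOverrides + ("tunerParallelism" -> 64)))
  }
}
```

A check like this is cheap to run before launching a long tuning job.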
Clearing up the Confusion
Business Value
Prediction | Label (Is Bad Loan) | Short Description | Long Description
1 | 1 | Loss Avoided | Correctly found bad loans
1 | 0 | Profit Forfeited | Incorrectly labeled bad loans
0 | 1 | Loss Still Incurred | Incorrectly labeled good loans
0 | 0 | Profit Retained | Correctly found good loans
Business value = -(loss avoided - profit forfeited) = -([1, 1] - [1, 0])
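The formula above can be sketched in plain Scala by summing an amount per (prediction, label) cell and then applying the expression literally. The slides do not say how the amounts are signed, so that remains an assumption to settle against your own data; the object and function names here are hypothetical.

```scala
object BusinessValueSketch {
  /** Short description for each (prediction, label) cell, as in the table above. */
  def describe(prediction: Int, label: Int): String = (prediction, label) match {
    case (1, 1) => "Loss Avoided"
    case (1, 0) => "Profit Forfeited"
    case (0, 1) => "Loss Still Incurred"
    case (0, 0) => "Profit Retained"
    case other  => throw new IllegalArgumentException(s"Expected 0/1 values, got $other")
  }

  /** Sum an amount per (prediction, label) cell; each row is (prediction, label, amount). */
  def cellTotals(rows: Seq[(Int, Int, Double)]): Map[(Int, Int), Double] =
    rows.groupBy { case (p, l, _) => (p, l) }
      .map { case (cell, rs) => cell -> rs.map(_._3).sum }

  /** The slide's formula applied literally: -([1, 1] - [1, 0]). */
  def businessValue(totals: Map[(Int, Int), Double]): Double =
    -(totals.getOrElse((1, 1), 0.0) - totals.getOrElse((1, 0), 0.0))
}
```

Only the two cells where the model predicts a bad loan enter the formula; the other two cells describe outcomes the model did not change.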
Business Value
Potentially from $23.22M to $68.88M saved
It’s all in the Family…Runner
Model Experimentation
• In the original Loan Risk Analysis
blog, we tried GLM, GBT, and
XGBoost
• Traditional Model Building and
Tuning x3! (one for each model
type)
Can we make this easier?
AutoML | FamilyRunner
import com.databricks.labs.automl.executor.FamilyRunner

// RF, GBT, and XGBoost model type configurations
val randomForestConf = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", rfMap)
val gbtConf = ConfigurationGenerator.generateConfigFromMap("GBT", "classifier", gbtMap)
val xgbConf = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)

val runner = FamilyRunner(sourceData, Array(randomForestConf, gbtConf, xgbConf)).execute()
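Once several model families have been run, comparing them comes down to keeping the best run per family and ranking the families by the chosen metric. The sketch below shows that comparison in plain Scala. It makes no assumption about what FamilyRunner actually returns; the RunResult type is hypothetical.

```scala
object FamilyComparisonSketch {
  // Hypothetical record of one tuning run: the model family and its AUC.
  final case class RunResult(family: String, auc: Double)

  /** Best run per model family, ranked best-first by AUC. */
  def rank(results: Seq[RunResult]): Seq[RunResult] =
    results.groupBy(_.family).values.map(_.maxBy(_.auc)).toSeq.sortBy(-_.auc)
}
```

Doing this by hand means repeating the whole build-and-tune loop once per family, which is the "x3" cost the slides refer to.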
Let’s end at the end
• FeatureImportances automates feature discovery
• AutomationRunner automates the building, training,
execution, and tuning of a Machine Learning pipeline
• FamilyRunner automates experimenting with model
families
• Improved AUC from 0.6732 to 0.723; potentially $23.22M to $68.88M saved
• Less code, faster!
AutoML Roadmap
0.6.0 features
• Serialized model and featurization stored as SparkML Pipeline
• BinaryEncoder for high cardinality nominal features
• Euclidean distance optimizer for post modeling search
• Advanced MBO search for genetic epoch candidates to aid in faster / more
effective convergence
• Automated MLflow logging and configuration (log to same workspace
directory)
• Bug fixes
AutoML Roadmap
0.6.x release
• Python API
• MLeap artifact export
• LightGBM support
38
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

More Related Content

What's hot

Azure training
Azure trainingAzure training
Azure training
Koenig Solutions Ltd.
 
Accelerating Path to Production for Generative AI-powered Applications
Accelerating Path to Production for Generative AI-powered ApplicationsAccelerating Path to Production for Generative AI-powered Applications
Accelerating Path to Production for Generative AI-powered Applications
HostedbyConfluent
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Márton Kodok
 
[Machine Learning 15minutes! #61] Azure OpenAI Service
[Machine Learning 15minutes! #61] Azure OpenAI Service[Machine Learning 15minutes! #61] Azure OpenAI Service
[Machine Learning 15minutes! #61] Azure OpenAI Service
Naoki (Neo) SATO
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Daniel Zivkovic
 
Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020
Mikio L. Braun
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
Databricks
 
Microsoft Azure Technical Overview
Microsoft Azure Technical OverviewMicrosoft Azure Technical Overview
Microsoft Azure Technical Overview
gjuljo
 
AZ-204T00A-PowerPoint_00.pptx
AZ-204T00A-PowerPoint_00.pptxAZ-204T00A-PowerPoint_00.pptx
AZ-204T00A-PowerPoint_00.pptx
JavierMadrigal29
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Amazon Web Services
 
Oracle to Azure PostgreSQL database migration webinar
Oracle to Azure PostgreSQL database migration webinarOracle to Azure PostgreSQL database migration webinar
Oracle to Azure PostgreSQL database migration webinar
Minnie Seungmin Cho
 
AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축
AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축
AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축
Sungmin Kim
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
Rui Quintino
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
Saurabh Kaushik
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
Ido Green
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
India Quotient
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
Matthias Feys
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery ML
Dan Sullivan, Ph.D.
 
AWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform Comparison
AWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform ComparisonAWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform Comparison
AWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform Comparison
Mariya James
 
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature MappingMicrosoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Ilyas F ☁☁☁
 

What's hot (20)

Azure training
Azure trainingAzure training
Azure training
 
Accelerating Path to Production for Generative AI-powered Applications
Accelerating Path to Production for Generative AI-powered ApplicationsAccelerating Path to Production for Generative AI-powered Applications
Accelerating Path to Production for Generative AI-powered Applications
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
 
[Machine Learning 15minutes! #61] Azure OpenAI Service
[Machine Learning 15minutes! #61] Azure OpenAI Service[Machine Learning 15minutes! #61] Azure OpenAI Service
[Machine Learning 15minutes! #61] Azure OpenAI Service
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
 
Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
Microsoft Azure Technical Overview
Microsoft Azure Technical OverviewMicrosoft Azure Technical Overview
Microsoft Azure Technical Overview
 
AZ-204T00A-PowerPoint_00.pptx
AZ-204T00A-PowerPoint_00.pptxAZ-204T00A-PowerPoint_00.pptx
AZ-204T00A-PowerPoint_00.pptx
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
 
Oracle to Azure PostgreSQL database migration webinar
Oracle to Azure PostgreSQL database migration webinarOracle to Azure PostgreSQL database migration webinar
Oracle to Azure PostgreSQL database migration webinar
 
AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축
AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축
AWS Personalize 중심으로 살펴본 추천 시스템 원리와 구축
 
“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps“Houston, we have a model...” Introduction to MLOps
“Houston, we have a model...” Introduction to MLOps
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery ML
 
AWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform Comparison
AWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform ComparisonAWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform Comparison
AWS or Azure or Google Cloud | Best Cloud Platform | Cloud Platform Comparison
 
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature MappingMicrosoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
 

Similar to Augmenting Machine Learning with Databricks Labs AutoML Toolkit

Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Practical data science
Practical data sciencePractical data science
Practical data science
Ding Li
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
Databricks
 
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Philip Goddard
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
Databricks
 
Azure machine learning
Azure machine learningAzure machine learning
Azure machine learning
Simone Caldaro
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Philip Goddard
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
Vijayananda Mohire
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
Dev Raj Gautam
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
Julien SIMON
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
June Andrews
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Databricks
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
Rebecca Bilbro
 
Production model lifecycle management 2016 09
Production model lifecycle management 2016 09Production model lifecycle management 2016 09
Production model lifecycle management 2016 09
Greg Makowski
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
Jaroslaw Szymczak
 
Auto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBAuto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADB
Databricks
 
Important Concepts for Machine Learning
Important Concepts for Machine LearningImportant Concepts for Machine Learning
Important Concepts for Machine Learning
SolivarLabs
 
What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?
Matei Zaharia
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
Francesca Lazzeri, PhD
 

Similar to Augmenting Machine Learning with Databricks Labs AutoML Toolkit (20)

Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Practical data science
Practical data sciencePractical data science
Practical data science
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Azure machine learning
Azure machine learningAzure machine learning
Azure machine learning
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
 
Production model lifecycle management 2016 09
Production model lifecycle management 2016 09Production model lifecycle management 2016 09
Production model lifecycle management 2016 09
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Auto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBAuto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADB
 
Important Concepts for Machine Learning
Important Concepts for Machine LearningImportant Concepts for Machine Learning
Important Concepts for Machine Learning
 
What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 


Augmenting Machine Learning with Databricks Labs AutoML Toolkit

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Augmenting Machine Learning with Databricks Labs AutoML Toolkit. Mary Grace Moesta, Databricks; Denny Lee, Databricks. #UnifiedDataAnalytics #SparkAISummit
  • 3. Agenda
      • Discuss the traditional ML pipeline problem and all of its stages
      • How AutoML Toolkit solves these problems
      • Hyperparameter Optimization
      • Choosing Models
      • Scaling AutoML Toolkit Best Practices
  • 4. About Speaker: Mary Grace Moesta, Customer Success Engineer, Databricks
      • Current AutoML developer
      • Former data scientist at 84.51° focused on using ML for brand accelerator and several customer experience projects
      • Likes long walks on the beach, Spark and applied math
  • 5. About Speaker: Denny Lee, Developer Advocate, Databricks
      • Worked with Apache Spark™ since 0.5
      • Former Senior Director Data Science Engineering at Concur
      • On Project Isotope incubation team that built what is now known as Azure HDInsight
      • Former SQLCAT DW BI Lead at Microsoft
  • 6. AutoML’s Tiered API Approach (columns: Persona | Goal | AutoML API; the Driving Analogy column is not captured in this transcript)
      • Citizen Data Scientist | No-Code, Full Automation | High Level - Automation Runner
      • Engineer | Low-Code, Augmentation | Mid Level - Individual Component APIs
      • ML Expert / Researcher | Code, Flexibility and Performance | Low Level - Hyperparameter tuning
  • 7. Let’s start at the end
  • 8. Let’s start at the end
      • AutoML’s FeatureImportances automates the discovery of features
      • AutoML’s AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline to create an optimal ML model.
      • Improved AUC from 0.6732 to 0.723
      • Business value: $23.22M to $68.88M saved
      • Less code, faster!
  • 11. Traditional ML Pipelines: Identify Important Features
  • 14. ML Pipeline with AutoML Toolkit
  • 15. AutoML | FeatureImportances
      // Calculate Feature Importance (fi)
      val fiConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)
      // Since we're using XGBoost, set parallelism <= 2x number of nodes
      fiConfig.tunerConfig.tunerParallelism = nodeCount * 2
      val fiMainConfig = ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig)
      // Generate Feature Importance
      val importances = new FeatureImportances(sourceData, fiMainConfig, "count", 20.0)
        .generateFeatureImportances()
  • 18. Traditional Model Building and Tuning: Building and Tuning Models
  • 19. Hand-made Model. Traditionally, when we build an ML pipeline, we need to perform a number of tasks, including:
      • Defining our category (text-based) and numeric columns
          • Based on previous analysis, determine which features (i.e., which columns) to include in your ML model
          • For numeric columns, ensure they are double or float data types
          • For category columns, convert them using a stringIndexer and one-hot encoding to create a numeric representation of the category data
      • Build and train our ML pipeline to create our ML model (in this case, an XGBoost model)
          • For example, put together the imputer, stringIndexer, and one-hot encoding of category data
          • Create a vector (e.g., vectorAssembler) to put together these features
          • Apply a standard scaler to the values to minimize the impact of outliers
      • Execute the model against our dataset
      • Review the metrics (e.g., AUC)
      • Tune the model using a Cross Validator
          • The better you understand the model, the more likely you are to provide good hyperparameters for cross validation
          • i.e., you need to choose a solid set of parameters (e.g., paramGrid)
      • Review the metrics again (e.g., AUC)
      • Review the confusion matrix (in the case of binary classification)
      • Review business value
  • 20. Hand-made Model (same task list as slide 19): Can we make this easier?
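To make the "numeric representation of the category data" step concrete, here is a minimal plain-Scala sketch of string indexing followed by one-hot encoding. It is an illustration only, not Spark's `StringIndexer` or one-hot encoder implementation: the object name `CategoryEncoding`, the frequency-based ordering rule, and the sample values are assumptions made for this example.

```scala
// Illustrative sketch only: plain Scala, no Spark dependency.
object CategoryEncoding {

  // Assign each distinct category an index, most frequent first
  // (ties broken alphabetically so the result is deterministic).
  def fitIndex(values: Seq[String]): Map[String, Int] =
    values
      .groupBy(identity)
      .toSeq
      .sortBy { case (value, occurrences) => (-occurrences.size, value) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Turn one category value into a vector with a single 1.0 at its index.
  // Values never seen while fitting the index map to an all-zero vector.
  def oneHot(index: Map[String, Int], value: String): Vector[Double] =
    Vector.tabulate(index.size)(i => if (index.get(value).contains(i)) 1.0 else 0.0)
}

object CategoryEncodingDemo extends App {
  // Hypothetical category column, e.g. a home-ownership field.
  val column = Seq("RENT", "OWN", "RENT", "MORTGAGE", "RENT", "OWN")
  val index  = CategoryEncoding.fitIndex(column)
  column.distinct.foreach(v => println(s"$v -> ${CategoryEncoding.oneHot(index, v)}"))
}
```

In a hand-made pipeline this is one stage among several (imputation, vector assembly, scaling), each of which has to be configured and kept in sync with the column lists by hand.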
  • 21. AutoML Model Building and Tuning: Building and Tuning Models
  • 22. ML Pipeline with AutoML Toolkit
  • 23. AutoML | AutomationRunner
      val conf = ConfigurationGenerator.generateConfigFromMap("XGBoost",…)
      // Adjust model tuner configuration
      conf.tunerConfig.tunerParallelism = nodeCount
      // Generate configuration
      val XGBConfig = ConfigurationGenerator.generateMainConfig(conf)
      // Select on the important features
      val runner = new AutomationRunner(sourceData).setMainConfig(XGBConfig)
        .runWithConfusionReport()
  • 24. Model, Metrics, Configs Saved; AUC from 0.6732 to 0.723
  • 26. How did AutoML Toolkit do this? It was able to find better hyperparameters because it:
      • Tested and tuned all modifiable hyperparameters
      • Performed this in a distributed fashion using a collection of optimization algorithms
      • Incorporated an understanding of how to use the parameters, extracted from the algorithm source code
  • 27. Common Overrides (Override: Description)
      • dataPrepCache: Cache the primary DataFrame to allow for faster batch processing of data
      • tunerParallelism: Configure how many workflows to run in parallel; monitor if this is >30 parallel tasks, as this may saturate the driver
      • setTrainSplitMethod: Set the appropriate sampling method for model training
      • tunerTrainPortion: Configure the percentages for the train / test split
      • tunerAutoStoppingScore: Set the auto stopping score for hyperparameter tuning in batch mode
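As a rough illustration of how such overrides might be collected before being handed to a configuration generator, the plain-Scala sketch below builds an override map and checks the parallelism guidance from the slides (at most 2x the node count for XGBoost, and caution above 30 parallel tasks). The key names are copied from the slide as-is and have not been checked against the toolkit's accepted keys; the object `TunerOverrides`, its helper names, and all values are hypothetical. `setTrainSplitMethod` is left out because the slide does not list its valid values.

```scala
// Illustrative sketch only: key names come from the slide, values are made up.
object TunerOverrides {

  // The slide warns that more than ~30 parallel tasks may saturate the driver.
  val DriverSaturationThreshold = 30

  // Slide 15 suggests parallelism <= 2x the number of nodes when using XGBoost.
  def xgboostParallelism(nodeCount: Int): Int = nodeCount * 2

  // Build a hypothetical override map for a cluster with `nodeCount` nodes.
  def build(nodeCount: Int): Map[String, Any] = Map(
    "dataPrepCache"          -> true,  // cache the primary DataFrame
    "tunerParallelism"       -> xgboostParallelism(nodeCount),
    "tunerTrainPortion"      -> 0.8,   // hypothetical 80/20 train/test split
    "tunerAutoStoppingScore" -> 0.9    // hypothetical stopping score
  )

  // True when the configured parallelism exceeds the saturation threshold.
  def maySaturateDriver(overrides: Map[String, Any]): Boolean =
    overrides.get("tunerParallelism") match {
      case Some(p: Int) => p > DriverSaturationThreshold
      case _            => false
    }
}

object TunerOverridesDemo extends App {
  val overrides = TunerOverrides.build(nodeCount = 16)
  if (TunerOverrides.maySaturateDriver(overrides))
    println(s"Warning: tunerParallelism=${overrides("tunerParallelism")} may saturate the driver")
}
```

Keeping the guard next to the map makes the trade-off visible: scaling parallelism with cluster size speeds up tuning only until the driver becomes the bottleneck.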
  • 28. Clearing up the Confusion
  • 29. Business Value
      Prediction | Label (Is Bad Loan) | Short Description   | Long Description
      1          | 1                   | Loss Avoided        | Correctly found bad loans
      1          | 0                   | Profit Forfeited    | Incorrectly labeled bad loans
      0          | 1                   | Loss Still Incurred | Incorrectly labeled good loans
      0          | 0                   | Profit Retained     | Correctly found good loans
      Business value = -(loss avoided - profit forfeited) = -([1, 1] - [1, 0])
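The business value formula on this slide can be sketched in plain Scala as follows. The sketch assumes each scored loan carries a `net` amount (for example, payments received minus the amount lent, which would be negative for a bad loan); that field, the `ScoredLoan` type, and the toy figures are assumptions for illustration and do not come from the deck.

```scala
// Illustrative sketch only: `net` and the sample loans are hypothetical.
final case class ScoredLoan(prediction: Int, label: Int, net: Double)

object BusinessValue {

  // Sum of `net` over one confusion-matrix cell, written [prediction, label] on the slide.
  def cellSum(loans: Seq[ScoredLoan], prediction: Int, label: Int): Double =
    loans.filter(l => l.prediction == prediction && l.label == label).map(_.net).sum

  // Business value = -(loss avoided - profit forfeited) = -([1, 1] - [1, 0])
  def businessValue(loans: Seq[ScoredLoan]): Double = {
    val lossAvoided     = cellSum(loans, prediction = 1, label = 1)
    val profitForfeited = cellSum(loans, prediction = 1, label = 0)
    -(lossAvoided - profitForfeited)
  }
}

object BusinessValueDemo extends App {
  val loans = Seq(
    ScoredLoan(1, 1, -1000.0), // bad loan, flagged: loss avoided
    ScoredLoan(1, 1, -500.0),  // bad loan, flagged: loss avoided
    ScoredLoan(1, 0, 200.0),   // good loan, flagged: profit forfeited
    ScoredLoan(0, 1, -300.0),  // bad loan, missed: loss still incurred
    ScoredLoan(0, 0, 400.0)    // good loan, kept: profit retained
  )
  println(BusinessValue.businessValue(loans))
}
```

Only the two cells where the model predicts a bad loan enter the formula, so the metric measures what acting on the model's flags would change rather than overall portfolio profit.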
  • 30. Business Value: Potentially from $23.22M to $68.88M saved
  • 31. It’s all in the Family…Runner
  • 32. Model Experimentation
      • In the original Loan Risk Analysis blog, we tried GLM, GBT, and XGBoost
      • Traditional Model Building and Tuning x3! (one for each model type)
  • 33. Model Experimentation (same points as slide 32): Can we make this easier?
  • 34. AutoML | FamilyRunner
      import com.databricks.labs.automl.executor.FamilyRunner
      // RF, GBT, and XGBoost model type configurations
      val randomForestConf = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", rfMap)
      val gbtConf = ConfigurationGenerator.generateConfigFromMap("GBT", "classifier", gbtMap)
      val xgbConf = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)
      val runner = FamilyRunner(sourceData, Array(randomForestConf, gbtConf, xgbConf)).execute()
  • 36. Let’s end at the end
      • FeatureImportances automates feature discovery
      • AutomationRunner automates the building, training, execution, and tuning of a Machine Learning pipeline
      • FamilyRunner automates experimenting with model families
      • Improved AUC from 0.6732 to 0.723; potentially $23.22M to $68.88M saved
      • Less code, faster!
  • 37. AutoML Roadmap: 0.6.0 features
      • Serialized model and featurization stored as SparkML Pipeline
      • BinaryEncoder for high cardinality nominal features
      • Euclidean distance optimizer for post modeling search
      • Advanced MBO search for genetic epoch candidates to aid in faster / more effective convergence
      • Automated MLflow logging and configuration (log to same workspace directory)
      • Bug fixes
  • 38. AutoML Roadmap: 0.6.x release
      • Python API
      • MLeap artifact export
      • LightGBM support