Automated
Hyperparameter Tuning
June 20th, 2019
Logistics
• We can’t hear you…
• Recording will be available…
• Slides will be available…
• Code samples and notebooks will be available…
• Queue up Questions…
• Bookmark databricks.com/blog
About our speakers
Yifan Cao, Sr. Product Manager, Machine Learning at Databricks
• Product Area: ML/DL algorithms and Databricks Runtime for Machine
Learning
• Built and grew two ML products to multi-million dollars in annual
revenue
• B.S. Engineering from UC Berkeley; MBA from MIT
Joseph Bradley, Software Engineer, Machine Learning at Databricks
• Apache Spark PMC member
• Postdoc at UC Berkeley
• Ph.D. in Machine Learning from Carnegie Mellon
VISION: Accelerate innovation by unifying data science, engineering and business
WHO WE ARE:
• Original creators of
• 2000+ global companies use our platform across big data & machine learning lifecycle
SOLUTION: Unified Analytics Platform
Data & ML Tech and People are in Silos
[Figure: DATA ENGINEERS x DATA SCIENTISTS]
Hiring Data Scientists is a Key Blocker
“My team needs to build 100+
models this year, but it has
only got to 20%.”
What is Automated ML (AutoML)?
● Excel-like tool that enables anyone
to do machine learning
● Productivity tools for
data scientists
[Figure: ML workflow stages: Raw Data, ETL, Feature Engineering, Model Exploration, Hyperparameter Tuning, Cross Validation, Model Scoring, Alerting & Monitoring]
Where does AutoML fit on Databricks?
[Figure: DATA ENGINEERS, DATA SCIENTISTS, AutoML]
Great Training
AutoML on Databricks (1/3)
AutoML libraries
USER CONTROL
Watch it now >
https://dbricks.co/zynga
Custom Solution: Zynga
Automating Predictive Modeling at Zynga with Pandas UDFs
Great Training
AutoML on Databricks (2/3)
AutoML libraries
Partnerships
AUTOMATION
USER CONTROL
Databricks
ETL & ML
Databricks
ML Test & Model
Enable data scientists and citizen data scientists to accelerate and scale
the development and delivery of predictive models.
Run and deploy ML
models at Scale
Databricks and DataRobot Integration
Watch it now >
https://dbricks.co/datarobot
Great Training
AutoML on Databricks (3/3)
AutoML libraries
Partnerships
Hyperopt
AUTOMATION
USER CONTROL
AUTOMATION +
CONTROL
Integrations MLlib
Today's Content
Great Training
A simple analogy
Manual Transmission
Semi Autonomous
AUTOMATION
USER CONTROL
AUTOMATION +
CONTROL
Automatic Transmission
Today's Content
Use Case #1: Hyperparameter Tuning
[Figure: ML workflow stages: Raw Data, ETL, Feature Engineering, Model Exploration, Hyperparameter Tuning, Cross Validation, Model Scoring, Alerting & Monitoring]
Scenarios:
● Automated hyperparameter search to select models after cross validation
● Automated hyperparameter search to optimize models in production
Our Offerings:
● Distributed Hyperopt + Automated MLflow Tracking
Use Case #2: Model Search
[Figure: ML workflow stages: Raw Data, ETL, Feature Engineering, Model Exploration, Hyperparameter Tuning, Cross Validation, Model Scoring, Alerting & Monitoring]
Scenarios:
● Automated model search by exploring different combinations of featuresets, algos, hyperparameters
● Automated model search by extending a baseline model to 1000+ custom models
Our Offerings:
● MLlib + Automated MLflow Tracking
● Distributed Hyperopt + Automated MLflow Tracking, with conditional hyperparameter tuning
Use Case #3: End-to-end ML Pipeline
[Figure: ML workflow stages: Raw Data, ETL, Feature Engineering, Model Exploration, Hyperparameter Tuning, Cross Validation, Model Scoring, Alerting & Monitoring]
Scenarios:
● Automated end-to-end Machine Learning model generation pipelines incorporating customer-specified logic
Our Offerings:
● Leverage existing Databricks internal tools & frameworks on top of Databricks Runtime ML
Hyperparameters
Hyperparameters
Express high-level concepts, such as statistical assumptions
E.g.: regularization
Are fixed before training or are hard to learn from data
E.g.: neural net architecture
Affect objective, test time performance, computational cost
E.g.: # iterations or epochs
Tuning hyperparameters
E.g.: Fitting a polynomial
Common goals:
• More flexible modeling process
• Reduced generalization error
• Faster training
• Plug & play ML
Challenges in tuning
Curse of dimensionality
Non-convex optimization
Computational cost
Unintuitive hyperparameters
Data prep: train-validation-test splits
[Figure: Data split into Training Data, Validation Data, and Test Data; ML Model 1, ML Model 2, ML Model 3; Final ML Model]
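A minimal sketch of the train-validation-test workflow, using only the Python standard library. The data, the 60/20/20 split, the 1-D ridge model, and the candidate values of `lam` are all illustrative assumptions, not taken from the talk. The weight `w` is a parameter (learned from data), while `lam` is a hyperparameter (chosen by comparing validation loss).

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept.

    w is the learned parameter; lam (regularization strength) is the hyperparameter.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical synthetic data: y = 2x + noise. Samples are already in random order,
# so slicing yields a random split.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(300)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

# Assumed 60/20/20 split into training, validation, and test data.
train_x, val_x, test_x = xs[:180], xs[180:240], xs[240:]
train_y, val_y, test_y = ys[:180], ys[180:240], ys[240:]

# Train one model per hyperparameter setting and score each on validation data.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
val_losses = {
    lam: mse(fit_ridge_1d(train_x, train_y, lam), val_x, val_y) for lam in candidates
}
best_lam = min(val_losses, key=val_losses.get)

# Refit the final model with the chosen setting, then touch the test data once.
final_w = fit_ridge_1d(train_x + val_x, train_y + val_y, best_lam)
test_loss = mse(final_w, test_x, test_y)
print(best_lam, round(final_w, 3), round(test_loss, 4))
```

The test data plays no part in choosing `lam`, so `test_loss` remains an honest estimate of generalization error.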
A practical definition of tuning
ML Model: Featurization, Model family selection, Hyperparameter tuning
Parameters: configs which your ML library learns from data
Hyperparameters: configs which your ML library does not learn from data
Tuning Methods
Overview of tuning methods
•Manual search
•Grid search
•Random search
•Population-based algorithms
•Bayesian algorithms
Manual search
Select hyperparameter settings to try based on human intuition.
2 hyperparameters:
•[0, ..., 5]
•{A, B, ..., F}
[Figure: grid of candidate settings with columns A-F and rows 0-5]
Expert knowledge tells us to try:
(2,C), (2,D), (2,E), (3,C), (3,D), (3,E)
Grid Search
Try points on a grid defined by ranges and step sizes
X-axis: {A,...,F}
Y-axis: 0-5, step = 1
[Figure: grid of candidate settings with columns A-F and rows 0-5]
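As a rough illustration of grid search over the two hyperparameters on this slide, the sketch below enumerates every combination of {A,...,F} and 0-5. The `toy_loss` function is a hypothetical stand-in for training and validating a model; its minimum at (D, 2) is an arbitrary choice made for the example.

```python
from itertools import product

LETTERS = ["A", "B", "C", "D", "E", "F"]

def toy_loss(letter, level):
    """Hypothetical loss surface with its minimum at (D, 2)."""
    return (LETTERS.index(letter) - 3) ** 2 + (level - 2) ** 2

# X-axis: {A,...,F}; Y-axis: 0-5 with step = 1.
grid = list(product(LETTERS, range(0, 6)))

# Evaluate every grid point, then keep the one with the lowest loss.
results = {point: toy_loss(*point) for point in grid}
best_point = min(results, key=results.get)
print(len(grid), best_point, results[best_point])  # prints: 36 ('D', 2) 0
```

With 6 values per axis the grid already costs 36 evaluations; each extra hyperparameter multiplies that count again.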
Random Search
Sample from distributions over ranges
X-axis: Uniform({A,...,F})
Y-axis: Uniform([0,5])
[Figure: sampled points on a grid with columns A-F and rows 0-5]
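A small sketch of random search under the same setup: each trial draws the X value uniformly from {A,...,F} and the Y value uniformly from [0, 5]. The `toy_loss` function, the budget of 20 evaluations, and the seed are assumptions made for illustration.

```python
import random

LETTERS = ["A", "B", "C", "D", "E", "F"]

def toy_loss(letter, level):
    """Hypothetical loss surface with its minimum at (D, 2.0)."""
    return (LETTERS.index(letter) - 3) ** 2 + (level - 2.0) ** 2

def random_search(num_evals, seed=0):
    """Sample num_evals independent points and return the best one plus all trials."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_evals):
        point = (rng.choice(LETTERS), rng.uniform(0.0, 5.0))
        trials.append((toy_loss(*point), point))
    best = min(trials, key=lambda trial: trial[0])
    return best, trials

(best_loss, best_point), trials = random_search(num_evals=20)
print(best_point, round(best_loss, 4))
```

Unlike the grid, the budget here is set directly by `num_evals` and does not grow with the number of hyperparameters.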
Population Based Algorithms
Start with random search, then iterate:
•Use the previous “generation” to inform the next generation
•E.g., sample from best performers & then perturb them
[Figure: successive generations of points on a grid with columns A-F and rows 0-5]
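The loop described on this slide could be sketched as follows. This is one simple variant, not a specific published algorithm: the population size, number of survivors, perturbation sizes, and `toy_loss` are all hypothetical, and the letter axis is treated as ordered purely so that "perturb" has a meaning.

```python
import random

LETTERS = list("ABCDEF")

def toy_loss(letter, level):
    """Hypothetical loss surface with its minimum at (D, 2.0)."""
    return (LETTERS.index(letter) - 3) ** 2 + (level - 2.0) ** 2

def perturb(point, rng):
    """Move at most one step along the letter axis and jitter the numeric axis."""
    letter, level = point
    idx = min(max(LETTERS.index(letter) + rng.choice([-1, 0, 1]), 0), len(LETTERS) - 1)
    new_level = min(max(level + rng.gauss(0.0, 0.5), 0.0), 5.0)
    return (LETTERS[idx], new_level)

def population_search(pop_size=8, generations=5, keep=3, seed=0):
    rng = random.Random(seed)
    # Generation 0 is plain random search.
    population = [(rng.choice(LETTERS), rng.uniform(0.0, 5.0)) for _ in range(pop_size)]
    history = []  # best loss seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: toy_loss(*p))
        history.append(toy_loss(*ranked[0]))
        survivors = ranked[:keep]  # best performers carry over unchanged
        children = [perturb(rng.choice(survivors), rng) for _ in range(pop_size - keep)]
        population = survivors + children
    best = min(population, key=lambda p: toy_loss(*p))
    return best, history

best, history = population_search()
print(best, [round(h, 3) for h in history])
```

Because the survivors are kept unchanged, the best loss in `history` can never get worse from one generation to the next.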
Bayesian Optimization
Model the loss function:
Hyperparameters ⇒ loss
Iteratively search space, trading off between exploration and exploitation
•Get samples: Test new points in hyperparameter space
•Update model of space: Hyperparameters ⇒ loss
[Figure: tested points and modeled loss on a grid with columns A-F and rows 0-5]
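The sample-then-update loop can be illustrated with a toy model-based search. Note the assumption loudly: the surrogate below is a crude inverse-distance-weighted average, a stand-in for the probabilistic model a real Bayesian optimizer would use (for example a Gaussian process or Tree of Parzen Estimators). The `toy_loss`, `kappa`, and the evaluation budget are hypothetical.

```python
import math
import random
from itertools import product

LETTERS = list("ABCDEF")
GRID = list(product(range(6), range(6)))  # (letter index, level)

def toy_loss(point):
    """Hypothetical loss surface with its minimum at (3, 2), i.e. (D, 2)."""
    i, j = point
    return (i - 3) ** 2 + (j - 2) ** 2

def surrogate(point, observed):
    """Model of hyperparameters => loss: inverse-distance-weighted mean of observed losses."""
    weights = [(1.0 / math.dist(point, p) ** 2, loss) for p, loss in observed.items()]
    total = sum(w for w, _ in weights)
    return sum(w * loss for w, loss in weights) / total

def acquisition(point, observed, kappa):
    """Lower is better: predicted loss (exploitation) minus a distance bonus (exploration)."""
    nearest = min(math.dist(point, p) for p in observed)
    return surrogate(point, observed) - kappa * nearest

def model_based_search(num_evals=12, num_initial=4, kappa=2.0, seed=0):
    rng = random.Random(seed)
    observed = {p: toy_loss(p) for p in rng.sample(GRID, num_initial)}
    while len(observed) < num_evals:
        untested = [p for p in GRID if p not in observed]
        nxt = min(untested, key=lambda p: acquisition(p, observed, kappa))
        observed[nxt] = toy_loss(nxt)  # new sample; the surrogate sees it on the next pass
    best = min(observed, key=observed.get)
    return best, observed

best, observed = model_based_search()
print(LETTERS[best[0]], best[1], observed[best])
```

A larger `kappa` favors untested regions of the grid, while `kappa = 0` always chases the lowest predicted loss.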
Comparing tuning methods
Method             Iterative / adaptive?   # evaluations for P params   Model of param space
Grid search        No                      O(c^P)                       none
Random search      No                      O(k)                         none
Population-based   Yes                     O(k)                         implicit
Bayesian           Yes                     O(k)                         explicit
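To make the O(c^P) versus O(k) contrast concrete, here is a short worked calculation. It assumes c = 10 candidate values per hyperparameter and a fixed budget of k = 100 evaluations; both numbers are chosen only for illustration.

```python
def grid_evaluations(values_per_param, num_params):
    """Number of grid points when each of num_params hyperparameters has values_per_param values."""
    return values_per_param ** num_params

budget_k = 100  # assumed fixed budget for random, population-based, or Bayesian search

for num_params in range(1, 6):
    grid_cost = grid_evaluations(10, num_params)
    print(f"P={num_params}: grid search needs {grid_cost} evaluations, budgeted methods use {budget_k}")
```

Under these assumptions grid search exceeds the fixed budget as soon as P reaches 3.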
Open-source tools for tuning
Tool             Grid search  Random search  Population-based  Bayesian  PyPI downloads last month  GitHub stars  License
scikit-learn     Yes          Yes                                        ---                        ---           BSD
MLlib            Yes                                                     ---                        ---           Apache 2.0
scikit-optimize                                                Yes       49,189                     1,278         BSD
Hyperopt                      Yes                              Yes       98,282                     3,286         BSD
DEAP                                         Yes                         26,700                     2,789         LGPL v3
TPOT                                         Yes                         9,057                      5,609         LGPL v3
GPyOpt                                                         Yes       4,959                      451           BSD
As of mid-April 2019
Tracking Tuning Workflows
MLflow Overview
Tracking: Record and query experiments: code, data, config, results
Projects: Packaging format for reproducible runs on any platform
Models: General model format that supports diverse deployment tools
mlflow.org  github.com/mlflow  twitter.com/MLflow  databricks.com/mlflow
Organizing with
[Figure: Training Data, Validation Data, Test Data; ML Model 1, ML Model 2, ML Model 3; Final ML Model]
Experiment
Main run
Child runs
Tip: Tune full pipeline, not 1 model.
Instrumenting tuning with
MLflow concepts for tracking runs
Params: hyperparameters
Metrics: training & validation, loss & objective, multiple objectives
Tags: provenance, simple metadata
Artifacts: serialized model, large metadata
Analyzing how tuning performs
Questions to answer
• Am I tuning the right hyperparameters?
• Am I exploring the right parts of the search space?
• Do I need to do another round of tuning?
Examining results
• Simple case: visualize param vs metric
• Challenges: multiple params and metrics, iterative experimentation
Auto-tracking MLlib with
[Figure: Training Data, Validation Data, Test Data; ML Model 1, ML Model 2, ML Model 3; Final ML Model]
Experiment
Main run
Child runs
In Databricks
• CrossValidator & TrainValidationSplit
• 1 run per setting of hyperparameters
• Avg metrics for CV folds
(demo)
Scaling Tuning Workflows
Hyperopt
Hyperparameter tuning in Python ML workflows
● Usable with any Python ML library
● Tuning algorithms:
○ Random search
○ Bayesian (Tree of Parzen Estimators)
● Open source (3-clause BSD license)
https://github.com/hyperopt/hyperopt
Hyperopt on Apache Spark
Distribute tuning across Spark clusters
● Each Spark task trains & evaluates 1 model (hyperparameter setting)
○ Applicable to single-machine ML workloads
● Via new SparkTrials plugin
● Contributing to open source Hyperopt: github.com/hyperopt/hyperopt/pull/509
With automated MLflow tracking in Databricks
Available now in Databricks Runtime 5.4 ML
(demo)
Related Content
Blog:
• Hyperparameter Tuning with MLflow,
Apache Spark MLlib and Hyperopt
Webinar:
• How to Automate Machine Learning and
Scale Delivery
Tutorials
● Hyperparameter Tuning Documentation
● MLflow integrations with H2O.ai, GPyOpt, HyperOpt
Notebooks
● MLlib + Automated MLflow Tracking
● Distributed Hyperopt + Automated MLflow
Tracking
● Basic Introduction to DataRobot via API
Videos
● Automating Predictive Modeling at Zynga
with PySpark and Pandas UDFs
● Best Practices for Hyperparameter Tuning
with MLflow
● Advanced Hyperparameter Optimization
for Deep Learning with MLflow
Getting started
MLflow
Managed MLflow
Generally Available in
Databricks
MLlib + automated
MLflow tracking
Public preview in
Databricks Runtime 5.4
& 5.4ML
Distributed Hyperopt
+ automated MLflow
tracking
Public preview in
Databricks Runtime 5.4ML
https://docs.databricks.com/spark/latest/mllib/index.html#hyperparameter-tuning
https://docs.azuredatabricks.net/spark/latest/mllib/index.html#hyperparameter-tuning
https://mlflow.org/
Thank you
Q&A
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 

Recently uploaded (20)

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Automated Hyperparameter Tuning, Scaling and Tracking

  • 2. Logistics • We can’t hear you… • Recording will be available… • Slides will be available… • Code samples and notebooks will be available… • Queue up Questions… • Bookmark databricks.com/blog
  • 3. About our speakers Yifan Cao, Sr. Product Manager, Machine Learning at Databricks • Product Area: ML/DL algorithms and Databricks Runtime for Machine Learning • Built and grew two ML products to multi-million dollars in annual revenue • B.S. Engineering from UC Berkeley; MBA from MIT Joseph Bradley, Software Engineer, Machine Learning at Databricks • Apache Spark PMC member • Postdoc at UC Berkeley • Ph.D. in Machine Learning from Carnegie Mellon
  • 4. VISION: Accelerate innovation by unifying data science, engineering and business. SOLUTION: Unified Analytics Platform. WHO WE ARE: • Original creators of • 2000+ global companies use our platform across big data & machine learning lifecycle
  • 5. Data & ML Tech and People are in Silos: DATA ENGINEERS x DATA SCIENTISTS
  • 7. Hiring Data Scientists is a Key Blocker
  • 8. “My team needs to build 100+ models this year, but it has only got to 20%.”
  • 9. What is Automated ML (AutoML)? ● Excel-like tool that enables anyone to do machine learning ● Productivity tools for data scientists
  • 10. Where does AutoML fit on Databricks? (Pipeline diagram: Raw Data, Model Exploration, Feature Engineering, ETL, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation; DATA ENGINEERS, DATA SCIENTISTS, AutoML)
  • 11. Great Training AutoML on Databricks (1/3) AutoML libraries USER CONTROL
  • 12. Watch it now > https://dbricks.co/zynga Custom Solution: Zynga Automating Predictive Modeling at Zynga with Pandas UDFs
  • 13. Great Training AutoML on Databricks (2/3) AutoML libraries Partnerships AUTOMATION USER CONTROL
  • 14. Databricks and DataRobot Integration. Databricks ETL & ML; Databricks ML Test & Model. Enable data scientists and citizen data scientists to accelerate and scale the development and delivery of predictive models. Run and deploy ML models at scale. Watch it now > https://dbricks.co/datarobot
  • 15. Great Training AutoML on Databricks (3/3) AutoML libraries Partnerships Hyperopt AUTOMATION USER CONTROL AUTOMATION + CONTROL Integrations MLlib Today's Content
  • 16. Great Training A simple analogy: Manual Transmission (USER CONTROL), Semi Autonomous (AUTOMATION + CONTROL), Automatic Transmission (AUTOMATION). Today's Content
  • 17. Use Case #1: Hyperparameter Tuning. Scenarios: ● Automated hyperparameter search to select models after cross validation ● Automated hyperparameter search to optimize models in production Our Offerings: ● Distributed Hyperopt + Automated MLflow Tracking (Pipeline diagram: Raw Data, ETL, Model Exploration, Feature Engineering, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation)
  • 18. Use Case #2: Model Search. Scenarios: ● Automated model search by exploring different combinations of feature sets, algorithms, and hyperparameters ● Automated model search by extending a baseline model to 1000+ custom models Our Offerings: ● MLlib + Automated MLflow Tracking ● Distributed Hyperopt + Automated MLflow Tracking, with conditional hyperparameter tuning (Pipeline diagram: Raw Data, ETL, Model Exploration, Feature Engineering, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation)
  • 19. Use Case #3: End-to-end ML Pipeline. Scenarios: ● Automated end-to-end Machine Learning model generation pipelines incorporating customer-specified logic Our Offerings: ● Leverage existing Databricks internal tools & frameworks on top of Databricks Runtime ML (Pipeline diagram: Raw Data, ETL, Model Exploration, Feature Engineering, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation)
  • 21. Hyperparameters: ● Express high-level concepts, such as statistical assumptions (e.g., regularization) ● Are fixed before training or are hard to learn from data (e.g., neural net architecture) ● Affect objective, test time performance, computational cost (e.g., # iterations or epochs)
  • 22. Tuning hyperparameters. E.g.: Fitting a polynomial. Common goals: • More flexible modeling process • Reduced generalization error • Faster training • Plug & play ML
  • 23. Challenges in tuning: • Curse of dimensionality • Non-convex optimization • Computational cost • Unintuitive hyperparameters
  • 25. Data prep: train-validation-test splits Training Data Test Data ML Model
  • 26. Data prep: train-validation-test splits Training Data Validation Data Test Data Final ML Model ML Model 1 ML Model 2 ML Model 3
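
The three-way split on this slide can be sketched in plain Python. This is a minimal illustration, not code from the talk: the synthetic data, the toy "shrunken mean" model, and the names `fit`, `mse`, and `lam` are all assumptions made for the example. Each candidate value of `lam` plays the role of ML Model 1, 2, 3; the validation set picks the winner and the test set is used once at the end.

```python
import random

def fit(train, lam):
    """Learn one parameter (a shrunken mean). lam is a hyperparameter fixed beforehand."""
    return sum(train) / (len(train) + lam)

def mse(prediction, data):
    """Mean squared error of a constant prediction on a data set."""
    return sum((y - prediction) ** 2 for y in data) / len(data)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [rng.gauss(10.0, 2.0) for _ in range(100)]  # synthetic data, illustration only
rng.shuffle(data)
train, validation, test = data[:60], data[60:80], data[80:]

# Train one model per hyperparameter setting and score each on the validation set.
candidates = [0.0, 5.0, 50.0]
val_scores = {lam: mse(fit(train, lam), validation) for lam in candidates}
best_lam = min(val_scores, key=val_scores.get)

# The test set is touched once, only to score the final model.
final_model = fit(train, best_lam)
test_mse = mse(final_model, test)
print(best_lam, test_mse)
```

Keeping the test data out of the selection loop is what makes the final score an honest estimate of generalization error.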
  • 27. A practical definition of tuning. ML Model: Featurization, Model family selection, Hyperparameter tuning. Parameters: configs which your ML library learns from data. Hyperparameters: configs which your ML library does not learn from data.
  • 29. Overview of tuning methods •Manual search •Grid search •Random search •Population-based algorithms •Bayesian algorithms
  • 30. Manual search: Select hyperparameter settings to try based on human intuition. 2 hyperparameters: • [0, ..., 5] • {A, B, ..., F} Expert knowledge tells us to try: (2,C), (2,D), (2,E), (3,C), (3,D), (3,E) (Figure: grid with X-axis A to F, Y-axis 0 to 5)
  • 31. Grid Search: Try points on a grid defined by ranges and step sizes. X-axis: {A,...,F} Y-axis: 0-5, step = 1 (Figure: grid with X-axis A to F, Y-axis 0 to 5)
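
A minimal sketch of grid search over the two hyperparameters on this slide. The `validation_loss` function is a hypothetical stand-in for "train a model and measure its validation loss"; its shape (best at D, 3) is an assumption chosen only so the example has a clear answer.

```python
from itertools import product

def validation_loss(letter, level):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (ord(letter) - ord("D")) ** 2 + (level - 3) ** 2

letters = "ABCDEF"        # X-axis: {A, ..., F}
levels = range(0, 6, 1)   # Y-axis: 0-5, step = 1

# Evaluate every point on the grid: 6 * 6 = 36 evaluations.
results = {(l, n): validation_loss(l, n) for l, n in product(letters, levels)}
best_point = min(results, key=results.get)
print(len(results), best_point, results[best_point])  # 36 ('D', 3) 0
```

Every added hyperparameter multiplies the number of grid points, which is why grid search gets expensive quickly.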
  • 32. Random Search: Sample from distributions over ranges. X-axis: Uniform({A,...,F}) Y-axis: Uniform([0,5]) (Figure: grid with X-axis A to F, Y-axis 0 to 5)
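
Random search over the same space can be sketched as follows. As before, `validation_loss` is a hypothetical stand-in objective, and the budget of 20 evaluations is an arbitrary choice for the example.

```python
import random

def validation_loss(letter, level):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (ord(letter) - ord("D")) ** 2 + (level - 3.0) ** 2

rng = random.Random(0)  # fixed seed so the sketch is repeatable
budget = 20             # number of evaluations, chosen independently of the grid size

trials = []
for _ in range(budget):
    letter = rng.choice("ABCDEF")   # X-axis: Uniform({A, ..., F})
    level = rng.uniform(0.0, 5.0)   # Y-axis: Uniform([0, 5])
    trials.append((validation_loss(letter, level), letter, level))

# Tuples sort by loss first, so min() returns the best trial.
best_loss, best_letter, best_level = min(trials)
print(best_letter, round(best_level, 2), round(best_loss, 3))
```

Unlike the grid, the continuous axis is sampled at arbitrary values rather than at fixed steps.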
  • 33. Population Based Algorithms: Start with random search, then iterate: • Use the previous “generation” to inform the next generation • E.g., sample from best performers & then perturb them (Figure: grid with X-axis A to F, Y-axis 0 to 5)
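
The "sample from best performers, then perturb them" loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not any particular library's algorithm: `validation_loss`, `perturb`, the population size, and the number of generations are all hypothetical choices.

```python
import random

LETTERS = "ABCDEF"

def validation_loss(point):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    letter, level = point
    return (ord(letter) - ord("D")) ** 2 + (level - 3.0) ** 2

def perturb(point, rng):
    """Nudge a point: move at most one letter left or right and jitter the level."""
    letter, level = point
    idx = LETTERS.index(letter) + rng.choice([-1, 0, 1])
    idx = min(max(idx, 0), len(LETTERS) - 1)
    level = min(max(level + rng.gauss(0.0, 0.5), 0.0), 5.0)
    return (LETTERS[idx], level)

rng = random.Random(0)
pop_size, n_keep, n_generations = 12, 4, 5

# Generation 0 is plain random search.
population = [(rng.choice(LETTERS), rng.uniform(0.0, 5.0)) for _ in range(pop_size)]
best_per_generation = []
for _ in range(n_generations):
    ranked = sorted(population, key=validation_loss)
    best_per_generation.append(validation_loss(ranked[0]))
    survivors = ranked[:n_keep]
    # Next generation: keep the best performers and add perturbed copies of them.
    children = [perturb(rng.choice(survivors), rng) for _ in range(pop_size - n_keep)]
    population = survivors + children

print(best_per_generation)
```

Because the survivors are carried over unchanged, the best loss can only stay the same or improve from one generation to the next.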
  • 36. Bayesian Optimization: Model the loss function: Hyperparameters ⇒ loss. Iteratively search space, trading off between exploration and exploitation (Figure: grid with X-axis A to F, Y-axis 0 to 5)
  • 37. Bayesian Optimization: Get samples: Test new points in hyperparameter space (Figure: grid with X-axis A to F, Y-axis 0 to 5)
  • 38. Bayesian Optimization: Get samples: Test new points in hyperparameter space. Update model of space: Hyperparameters ⇒ loss (Figure: grid with X-axis A to F, Y-axis 0 to 5)
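
The "get samples, update model, repeat" loop can be illustrated with a deliberately crude stand-in. The sketch below is not Gaussian-process or Tree of Parzen Estimators optimization: it swaps in an inverse-distance-weighted average as the model of hyperparameters ⇒ loss, and a distance bonus as the exploration term. All names (`validation_loss`, `predict`, `acquisition`, `kappa`) are hypothetical.

```python
import math
import random

LETTERS = "ABCDEF"
# Candidate points: (letter index, level) for every cell of the 6 x 6 grid.
CANDIDATES = [(x, y) for x in range(len(LETTERS)) for y in range(6)]

def validation_loss(point):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    x, y = point
    return (x - 3) ** 2 + (y - 3) ** 2

def predict(point, observed):
    """Crude surrogate model: inverse-distance-weighted mean of observed losses."""
    weights = {p: 1.0 / math.dist(point, p) for p in observed}
    total = sum(weights.values())
    return sum(w * observed[p] for p, w in weights.items()) / total

def acquisition(point, observed, kappa=2.0):
    """Lower is better: predicted loss (exploitation) minus a bonus for distance (exploration)."""
    nearest = min(math.dist(point, p) for p in observed)
    return predict(point, observed) - kappa * nearest

rng = random.Random(0)
observed = {p: validation_loss(p) for p in rng.sample(CANDIDATES, 3)}  # initial random samples

for _ in range(10):
    untested = [p for p in CANDIDATES if p not in observed]
    nxt = min(untested, key=lambda p: acquisition(p, observed))  # choose the next point to test
    observed[nxt] = validation_loss(nxt)                          # get sample; surrogate sees it next round

best = min(observed, key=observed.get)
print(LETTERS[best[0]], best[1], observed[best])
```

The design point to notice is that each new evaluation changes where the next one is placed, which grid and random search never do.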
  • 39. Comparing tuning methods (columns: iterative / adaptive?; # evaluations for P params; model of param space)
      ● Grid search: No; O(c^P); none
      ● Random search: No; O(k); none
      ● Population-based: Yes; O(k); implicit
      ● Bayesian: Yes; O(k); explicit
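
To make the O(c^P) versus O(k) comparison concrete, the short sketch below counts grid evaluations as the number of hyperparameters grows. The choice of c = 5 values per hyperparameter and a fixed budget of k = 100 are assumptions for illustration.

```python
def grid_evaluations(values_per_param, num_params):
    """Number of points on a full grid: c ** P."""
    return values_per_param ** num_params

budget = 100  # k: a fixed evaluation budget, as used by the non-grid methods
for num_params in range(1, 7):
    print(num_params, grid_evaluations(5, num_params), budget)
# Grid sizes for P = 1..6 with c = 5: 5, 25, 125, 625, 3125, 15625
```

With only three hyperparameters the grid already exceeds the fixed budget.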
  • 40. Open-source tools for tuning, as of mid-April 2019 (columns: methods supported; PyPI downloads last month; GitHub stars; license)
      ● scikit-learn: grid search, random search; ---; ---; BSD
      ● MLlib: grid search; ---; ---; Apache 2.0
      ● scikit-optimize: Bayesian; 49,189; 1,278; BSD
      ● Hyperopt: random search, Bayesian; 98,282; 3,286; BSD
      ● DEAP: population-based; 26,700; 2,789; LGPL v3
      ● TPOT: population-based; 9,057; 5,609; LGPL v3
      ● GPyOpt: Bayesian; 4,959; 451; BSD
  • 42. MLflow Overview. Tracking: Record and query experiments: code, data, config, results. Projects: Packaging format for reproducible runs on any platform. Models: General model format that supports diverse deployment tools. mlflow.org github.com/mlflow twitter.com/MLflow databricks.com/mlflow
  • 43. Organizing with Training Data Validation Data Test Data Final ML Model ML Model 1 ML Model 2 ML Model 3 Experiment Main run Child runs Tip: Tune full pipeline, not 1 model.
  • 44. Instrumenting tuning with MLflow concepts for tracking runs. Params: hyperparameters. Metrics: training & validation, loss & objective, multiple objectives. Tags: provenance, simple metadata. Artifacts: serialized model, large metadata.
  • 45. Analyzing how tuning performs Questions to answer • Am I tuning the right hyperparameters? • Am I exploring the right parts of the search space? • Do I need to do another round of tuning? Examining results • Simple case: visualize param vs metric • Challenges: multiple params and metrics, iterative experimentation
  • 46. Auto-tracking MLlib with Training Data Validation Data Test Data Final ML Model ML Model 1 ML Model 2 ML Model 3 Experiment Main run Child runs In Databricks • CrossValidator & TrainValidationSplit • 1 run per setting of hyperparameters • Avg metrics for CV folds (demo)
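
What "1 run per setting of hyperparameters" with "avg metrics for CV folds" amounts to can be sketched in plain Python. This is not the MLlib `CrossValidator` API: the toy model, the data, and the helper names `k_fold_indices`, `fit`, and `mse` are assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    bounds = [i * n // k for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]

def fit(train, lam):
    """Toy model: a shrunken mean. lam is the hyperparameter being tuned."""
    return sum(train) / (len(train) + lam)

def mse(prediction, data):
    """Mean squared error of a constant prediction on a data set."""
    return sum((y - prediction) ** 2 for y in data) / len(data)

data = [9.0, 11.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0, 13.0, 7.0, 10.0]  # made-up values
folds = k_fold_indices(len(data), 3)

avg_metric = {}  # one entry per hyperparameter setting, like one tracked run each
for lam in [0.0, 1.0, 10.0]:
    fold_scores = []
    for held_out in folds:
        held = set(held_out)
        train = [y for i, y in enumerate(data) if i not in held]
        valid = [data[i] for i in held_out]
        fold_scores.append(mse(fit(train, lam), valid))
    avg_metric[lam] = sum(fold_scores) / len(fold_scores)  # average metric over the CV folds

best_lam = min(avg_metric, key=avg_metric.get)
print(avg_metric, best_lam)
```

Each key of `avg_metric` corresponds to what the slide describes as one child run under the main tuning run.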
  • 48. Hyperopt Hyperparameter tuning in Python ML workflows ● Usable with any Python ML library ● Tuning algorithms: ○ Random search ○ Bayesian (Tree of Parzen Estimators) ● Open source (3-clause BSD license) https://github.com/hyperopt/hyperopt
  • 49. Hyperopt on Apache Spark (demo): Distribute tuning across Spark clusters ● Each Spark task trains & evaluates 1 model (hyperparameter setting) ○ Applicable to single-machine ML workloads ● Via new SparkTrials plugin ● Contributing to open source Hyperopt: github.com/hyperopt/hyperopt/pull/509 ● With automated MLflow tracking in Databricks ● Available now in Databricks Runtime 5.4 ML
  • 50. Related Content. Blog: • Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt Webinar: • How to Automate Machine Learning and Scale Delivery Tutorials: ● Hyperparameter Tuning Documentation ● MLflow integrations with H2O.ai, GPyOpt, HyperOpt Notebooks: ● MLlib + Automated MLflow Tracking ● Distributed Hyperopt + Automated MLflow Tracking ● Basic Introduction to DataRobot via API Videos: ● Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs ● Best Practices for Hyperparameter Tuning with MLflow ● Advanced Hyperparameter Optimization for Deep Learning with MLflow
  • 51. Getting started ● MLflow: Managed MLflow Generally Available in Databricks ● MLlib + automated MLflow tracking: Public preview in Databricks Runtime 5.4 & 5.4ML ● Distributed Hyperopt + automated MLflow tracking: Public preview in Databricks Runtime 5.4ML https://docs.databricks.com/spark/latest/mllib/index.html#hyperparameter-tuning https://docs.azuredatabricks.net/spark/latest/mllib/index.html#hyperparameter-tuning https://mlflow.org/