
Automated Hyperparameter Tuning, Scaling and Tracking

Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.

For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.

Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:

Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking

Recording and notebooks will be provided after the webinar so that you can practice at your own pace.

Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from the ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. He received his B.S. from UC Berkeley and a master's degree from MIT.

Published in: Data & Analytics

Automated Hyperparameter Tuning, Scaling and Tracking

  1. Automated Hyperparameter Tuning, June 20th, 2019
  2. Logistics • We can’t hear you… • Recording will be available… • Slides will be available… • Code samples and notebooks will be available… • Queue up Questions… • Bookmark databricks.com/blog
  3. About our speakers • Yifan Cao, Sr. Product Manager, Machine Learning at Databricks: Product Area: ML/DL algorithms and Databricks Runtime for Machine Learning; built and grew two ML products to multi-million dollars in annual revenue; B.S. Engineering from UC Berkeley; MBA from MIT • Joseph Bradley, Software Engineer, Machine Learning at Databricks: Apache Spark PMC member; Postdoc at UC Berkeley; Ph.D. in Machine Learning from Carnegie Mellon
  4. VISION: Accelerate innovation by unifying data science, engineering and business • SOLUTION: Unified Analytics Platform • WHO WE ARE: Original creators of [logos] • 2000+ global companies use our platform across big data & machine learning lifecycle
  5. Data & ML Tech and People are in Silos • DATA ENGINEERS x DATA SCIENTISTS
  6. Hiring Data Scientists is a Key Blocker
  7. “My team needs to build 100+ models this year, but it has only got to 20%.”
  8. What is Automated ML (AutoML)? ● Excel-like tool that enables anyone to do machine learning ● Productivity tools for data scientists
  9. Where does AutoML fit on Databricks? • Pipeline stages shown: Raw Data, Model Exploration, Feature Engineering, ETL, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation • Labels: DATA ENGINEERS, DATA SCIENTISTS, AutoML
  10. AutoML on Databricks (1/3) • Great Training • AutoML libraries • USER CONTROL
  11. Custom Solution: Zynga • Automating Predictive Modeling at Zynga with Pandas UDFs • Watch it now > https://dbricks.co/zynga
  12. AutoML on Databricks (2/3) • Great Training • AutoML libraries • Partnerships • AUTOMATION • USER CONTROL
  13. Databricks and DataRobot Integration • Databricks ETL & ML • Databricks ML Test & Model • Enable data scientists and citizen data scientists to accelerate and scale the development and delivery of predictive models. • Run and deploy ML models at Scale • Watch it now > https://dbricks.co/datarobot
  14. AutoML on Databricks (3/3) • Great Training • AutoML libraries • Partnerships • Hyperopt • Integrations • MLlib • AUTOMATION • USER CONTROL • AUTOMATION + CONTROL • Today's Content
  15. A simple analogy • Great Training • Manual Transmission • Semi Autonomous • Automatic Transmission • AUTOMATION • USER CONTROL • AUTOMATION + CONTROL • Today's Content
  16. Use Case #1: Hyperparameter Tuning • Scenarios: ● Automated hyperparameter search to select models after cross validation ● Automated hyperparameter search to optimize models in production • Our Offerings: ● Distributed Hyperopt + Automated MLflow Tracking • Pipeline stages shown: Raw Data, ETL, Model Exploration, Feature Engineering, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation
  17. Use Case #2: Model Search • Scenarios: ● Automated model search by exploring different combinations of feature sets, algorithms, and hyperparameters ● Automated model search by extending a baseline model to 1000+ custom models • Our Offerings: ● MLlib + Automated MLflow Tracking ● Distributed Hyperopt + Automated MLflow Tracking, with conditional hyperparameter tuning • Pipeline stages shown: Raw Data, ETL, Model Exploration, Feature Engineering, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation
  18. Use Case #3: End-to-end ML Pipeline • Scenarios: ● Automated end-to-end Machine Learning model generation pipelines incorporating customer-specified logic • Our Offerings: ● Leverage existing Databricks internal tools & frameworks on top of Databricks Runtime ML • Pipeline stages shown: Raw Data, ETL, Model Exploration, Feature Engineering, Model Scoring, Hyperparameter Tuning, Alerting & Monitoring, Cross Validation
  19. Hyperparameters
  20. Hyperparameters • Express high-level concepts, such as statistical assumptions (e.g., regularization) • Are fixed before training or are hard to learn from data (e.g., neural net architecture) • Affect objective, test time performance, computational cost (e.g., # iterations or epochs)
  21. Tuning hyperparameters • E.g.: Fitting a polynomial • Common goals: • More flexible modeling process • Reduced generalization error • Faster training • Plug & play ML
  22. Challenges in tuning • Curse of dimensionality • Non-convex optimization • Computational cost • Unintuitive hyperparameters
  23. Data prep: train-validation-test splits • Data
  24. Data prep: train-validation-test splits • Training Data • Test Data • ML Model
  25. Data prep: train-validation-test splits • Training Data • Validation Data • Test Data • ML Model 1 • ML Model 2 • ML Model 3 • Final ML Model
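The three-way split on slides 23 to 25 can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name, the 60/20/20 proportions, and the fixed seed are all assumptions. Candidate models are fit on the training rows and compared on the validation rows, and the test rows are held back to score the final model.

```python
import random


def train_validation_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and cut them into training, validation and test lists."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


# Hypothetical data set of 100 row ids.
train, validation, test = train_validation_test_split(range(100))
```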
  26. A practical definition of tuning • ML Model: Featurization, Model family selection, Hyperparameter tuning • Parameters: configs which your ML library learns from data • Hyperparameters: configs which your ML library does not learn from data
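To make the parameter/hyperparameter distinction concrete, here is a small sketch using one-dimensional ridge regression without an intercept. The example is an assumption chosen for brevity and does not come from the slides. The slope is a parameter, because it is computed from the data; the L2 penalty is a hyperparameter, because the caller has to supply it.

```python
def fit_ridge_slope(xs, ys, l2_penalty):
    """Closed-form slope w that minimizes sum((y - w*x)**2) + l2_penalty * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2_penalty)


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Same data, three hyperparameter settings, three different learned slopes.
slopes = {penalty: fit_ridge_slope(xs, ys, penalty) for penalty in (0.0, 1.0, 10.0)}
```

With no penalty the fit recovers the true slope, and larger penalties shrink it toward zero. Choosing the penalty is the tuning problem.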
  27. Tuning Methods
  28. Overview of tuning methods • Manual search • Grid search • Random search • Population-based algorithms • Bayesian algorithms
  29. Manual search • Select hyperparameter settings to try based on human intuition. • 2 hyperparameters: [0, ..., 5] and {A, B, ..., F} • Expert knowledge tells us to try: (2,C), (2,D), (2,E), (3,C), (3,D), (3,E) • Plot axes: A B C D E F, 0 1 2 3 4 5
  30. Grid Search • Try points on a grid defined by ranges and step sizes • X-axis: {A,...,F} • Y-axis: 0-5, step = 1 • Plot axes: A B C D E F, 0 1 2 3 4 5
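Grid search over the two hyperparameters on the slide can be sketched as follows. The loss function is a made-up stand-in for "train a model and measure validation loss", and the names `level` and `kind` are assumptions for the numeric and categorical axes.

```python
from itertools import product


def grid_search(objective, grid):
    """Evaluate every combination in `grid`; return (best_setting, best_loss, n_evaluations)."""
    names = list(grid)
    best_setting, best_loss, n_evaluations = None, float("inf"), 0
    for values in product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        loss = objective(setting)
        n_evaluations += 1
        if loss < best_loss:
            best_setting, best_loss = setting, loss
    return best_setting, best_loss, n_evaluations


def toy_loss(setting):
    """Hypothetical loss, lowest at level=3 and kind='D'."""
    return (setting["level"] - 3) ** 2 + abs("ABCDEF".index(setting["kind"]) - 3)


# X-axis {A,...,F} and Y-axis 0-5 with step 1, as on the slide: 6 x 6 = 36 points.
grid = {"level": list(range(0, 6)), "kind": list("ABCDEF")}
best_setting, best_loss, n_evaluations = grid_search(toy_loss, grid)
```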
  31. Random Search • Sample from distributions over ranges • X-axis: Uniform({A,...,F}) • Y-axis: Uniform([0,5]) • Plot axes: A B C D E F, 0 1 2 3 4 5
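Random search replaces the fixed grid with draws from the two distributions named on the slide. In this sketch the loss function, the trial budget of 50, and the names `level` and `kind` are illustrative assumptions.

```python
import random


def random_search(objective, sample_setting, n_trials, seed=0):
    """Draw n_trials settings independently and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_setting, best_loss = None, float("inf")
    for _ in range(n_trials):
        setting = sample_setting(rng)
        loss = objective(setting)
        if loss < best_loss:
            best_setting, best_loss = setting, loss
    return best_setting, best_loss


def sample_setting(rng):
    """X-axis: Uniform({A,...,F}); Y-axis: Uniform([0, 5])."""
    return {"kind": rng.choice("ABCDEF"), "level": rng.uniform(0.0, 5.0)}


def toy_loss(setting):
    """Hypothetical loss, lowest at level=3.0 and kind='D'."""
    return (setting["level"] - 3.0) ** 2 + abs("ABCDEF".index(setting["kind"]) - 3)


best_setting, best_loss = random_search(toy_loss, sample_setting, n_trials=50)
```

Unlike the grid, the numeric axis is sampled continuously, so the search is not limited to integer values, and the number of evaluations is set directly by the budget.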
  32–34. Population Based Algorithms (the same slide shown as three animation steps) • Start with random search, then iterate: • Use the previous “generation” to inform the next generation • E.g., sample from best performers & then perturb them • Plot axes: A B C D E F, 0 1 2 3 4 5
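The "sample from best performers and then perturb them" loop can be sketched as a very small evolutionary search. This is one simple variant written for illustration, not the algorithm of any particular library. The population size, the number of survivors, the perturbation rule, and the toy loss are all assumptions.

```python
import random

KINDS = "ABCDEF"


def toy_loss(setting):
    """Hypothetical loss, lowest at level=3.0 and kind='D'."""
    return (setting["level"] - 3.0) ** 2 + abs(KINDS.index(setting["kind"]) - 3)


def sample(rng):
    """Initial generation: plain random search over both axes."""
    return {"kind": rng.choice(KINDS), "level": rng.uniform(0.0, 5.0)}


def perturb(parent, rng):
    """Nudge the numeric value (clamped to [0, 5]); occasionally resample the category."""
    level = min(5.0, max(0.0, parent["level"] + rng.gauss(0.0, 0.5)))
    kind = rng.choice(KINDS) if rng.random() < 0.2 else parent["kind"]
    return {"kind": kind, "level": level}


def evolve(objective, population_size=20, n_survivors=5, n_generations=10, seed=0):
    rng = random.Random(seed)
    population = [sample(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(n_generations):
        survivors = sorted(population, key=objective)[:n_survivors]
        best_per_generation.append(objective(survivors[0]))
        # Next generation: keep the survivors and add perturbed copies of them.
        children = [perturb(rng.choice(survivors), rng)
                    for _ in range(population_size - n_survivors)]
        population = survivors + children
    return min(population, key=objective), best_per_generation


best, best_per_generation = evolve(toy_loss)
```

Because the survivors are carried over unchanged, the best loss never gets worse from one generation to the next.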
  35. Bayesian Optimization • Model the loss function: Hyperparameters ⇒ loss • Iteratively search space, trading off between exploration and exploitation • Plot axes: A B C D E F, 0 1 2 3 4 5
  36. Bayesian Optimization • Get samples: Test new points in hyperparameter space • Plot axes: A B C D E F, 0 1 2 3 4 5
  37. Bayesian Optimization • Get samples: Test new points in hyperparameter space • Update model of space: Hyperparameters ⇒ loss • Plot axes: A B C D E F, 0 1 2 3 4 5
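One common way to write down the exploration/exploitation trade-off is a lower-confidence-bound rule. It is shown here only as an illustration; the slides do not say which acquisition rule is used, and Hyperopt's Tree of Parzen Estimators works differently. The notation is assumed: after $t$ evaluations the surrogate model predicts a mean loss $\mu_t(x)$ and an uncertainty $\sigma_t(x)$ for hyperparameters $x$, and $\kappa \ge 0$ is a user-chosen constant.

```latex
x_{t+1} = \arg\min_{x} \bigl[ \mu_t(x) - \kappa \, \sigma_t(x) \bigr]
```

A small $\kappa$ favors points the model already believes are good (exploitation), and a large $\kappa$ favors points where the model is uncertain (exploration). After $x_{t+1}$ is evaluated, the surrogate is refit on all observed (hyperparameters, loss) pairs and the step repeats.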
  38. Comparing tuning methods
      Method           | Iterative / adaptive? | # evaluations for P params | Model of param space
      Grid search      | No                    | O(c^P)                     | none
      Random search    | No                    | O(k)                       | none
      Population-based | Yes                   | O(k)                       | implicit
      Bayesian         | Yes                   | O(k)                       | explicit
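The O(c^P) entry for grid search is easy to check with arithmetic. The slide does not define its symbols, so this sketch assumes c is the number of candidate values per hyperparameter and k is an evaluation budget chosen by the user.

```python
def grid_evaluations(values_per_param, n_params):
    """Number of settings in a full grid: c ** P."""
    return values_per_param ** n_params


# With 5 candidate values per hyperparameter, the grid grows quickly with P.
counts = {n_params: grid_evaluations(5, n_params) for n_params in (1, 2, 4, 8)}
```

Under the same reading, the O(k) methods cost whatever budget is chosen, regardless of how many hyperparameters are tuned.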
  39. Open-source tools for tuning (as of mid-April 2019)
      Tool            | Grid search | Random search | Population-based | Bayesian | PyPI downloads last month | GitHub stars | License
      scikit-learn    | Yes         | Yes           |                  |          | ---                       | ---          | BSD
      MLlib           | Yes         |               |                  |          | ---                       | ---          | Apache 2.0
      scikit-optimize |             |               |                  | Yes      | 49,189                    | 1,278        | BSD
      Hyperopt        |             | Yes           |                  | Yes      | 98,282                    | 3,286        | BSD
      DEAP            |             |               | Yes              |          | 26,700                    | 2,789        | LGPL v3
      TPOT            |             |               | Yes              |          | 9,057                     | 5,609        | LGPL v3
      GPyOpt          |             |               |                  | Yes      | 4,959                     | 451          | BSD
  40. Tracking Tuning Workflows
  41. MLflow Overview • Tracking: Record and query experiments: code, data, config, results • Projects: Packaging format for reproducible runs on any platform • Models: General model format that supports diverse deployment tools • mlflow.org • github.com/mlflow • twitter.com/MLflow • databricks.com/mlflow
  42. Organizing with [logo] • Diagram labels: Training Data, Validation Data, Test Data, ML Model 1, ML Model 2, ML Model 3, Final ML Model; Experiment, Main run, Child runs • Tip: Tune full pipeline, not 1 model.
  43. Instrumenting tuning with MLflow concepts for tracking runs • Params: hyperparameters • Metrics: training & validation, loss & objective, multiple objectives • Tags: provenance, simple metadata • Artifacts: serialized model, large metadata
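A rough sketch of how a main run and its child runs might be logged with the MLflow tracking API. The run name, tag, metric names, and the toy loss are assumptions; the calls used (`mlflow.start_run` with `nested=True`, `log_params`, `log_metric`, `set_tag`) are part of MLflow's public Python API. MLflow is imported inside the function so the rest of the snippet loads without it installed.

```python
def toy_validation_loss(setting):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (setting["reg_param"] - 0.1) ** 2


def log_tuning_runs(settings, evaluate):
    """Log one main run with one nested child run per hyperparameter setting."""
    import mlflow  # lazy import: only needed when runs are actually logged

    losses = []
    with mlflow.start_run(run_name="tuning"):                 # main run
        mlflow.set_tag("tuning_method", "grid_search")        # tag: simple metadata
        for setting in settings:
            with mlflow.start_run(nested=True):               # child run
                mlflow.log_params(setting)                    # params: hyperparameters
                loss = evaluate(setting)
                mlflow.log_metric("validation_loss", loss)    # metric
                losses.append(loss)
        # Back in the main run: summarize the whole tuning job.
        mlflow.log_metric("best_validation_loss", min(losses))
    return min(losses)


# Hypothetical settings to try.
settings = [{"reg_param": value} for value in (0.01, 0.1, 1.0)]
```

Calling `log_tuning_runs(settings, toy_validation_loss)` would record three child runs under one main run, which matches the experiment / main run / child runs layout described on slide 42.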
  44. Analyzing how tuning performs • Questions to answer: • Am I tuning the right hyperparameters? • Am I exploring the right parts of the search space? • Do I need to do another round of tuning? • Examining results: • Simple case: visualize param vs metric • Challenges: multiple params and metrics, iterative experimentation
  45. Auto-tracking MLlib with [logo] (demo) • In Databricks: • CrossValidator & TrainValidationSplit • 1 run per setting of hyperparameters • Avg metrics for CV folds • Diagram labels: Training Data, Validation Data, Test Data, ML Model 1, ML Model 2, ML Model 3, Final ML Model; Experiment, Main run, Child runs
  46. Scaling Tuning Workflows
  47. Hyperopt • Hyperparameter tuning in Python ML workflows ● Usable with any Python ML library ● Tuning algorithms: ○ Random search ○ Bayesian (Tree of Parzen Estimators) ● Open source (3-clause BSD license) • https://github.com/hyperopt/hyperopt
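A minimal sketch of a Hyperopt tuning loop using the Tree of Parzen Estimators algorithm. The objective, the search space, and the hyperparameter names are assumptions made for illustration; `fmin`, `hp`, `tpe`, and `Trials` are Hyperopt's own entry points. Hyperopt is imported inside the function so the objective can be read and tested without it installed.

```python
import math


def objective(params):
    """Hypothetical stand-in for 'train a model and return its validation loss'."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 6) ** 2 / 100.0


def tune(max_evals=50):
    """Run TPE-based search and return (best_setting, trials)."""
    from hyperopt import Trials, fmin, hp, tpe  # lazy import

    space = {
        # loguniform takes its bounds in log space.
        "learning_rate": hp.loguniform("learning_rate", math.log(1e-3), math.log(1.0)),
        # quniform yields floats rounded to multiples of q (here 1).
        "max_depth": hp.quniform("max_depth", 2, 12, 1),
    }
    trials = Trials()  # records every evaluated setting and its loss
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)
    return best, trials
```

Passing `rand.suggest` (from `hyperopt.rand`) as `algo` instead of `tpe.suggest` switches the same loop to random search.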
  48. Hyperopt on Apache Spark (demo) • Distribute tuning across Spark clusters ● Each Spark task trains & evaluates 1 model (hyperparameter setting) ○ Applicable to single-machine ML workloads ● Via new SparkTrials plugin ● Contributing to open source Hyperopt: github.com/hyperopt/hyperopt/pull/509 • With automated MLflow tracking in Databricks • Available now in Databricks Runtime 5.4 ML
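A sketch of what distributing the search might look like. It assumes a Hyperopt build that includes `SparkTrials` and an environment with an active Spark session, such as the runtime named on the slide. The objective, the search space, and the parallelism value are illustrative assumptions.

```python
def objective(params):
    """Hypothetical single-machine training job; each call would run inside one Spark task."""
    return (params["reg_param"] - 0.3) ** 2


def tune_on_spark(max_evals=32, parallelism=4):
    """Same fmin call as single-machine Hyperopt, with SparkTrials swapped in for Trials."""
    from hyperopt import SparkTrials, fmin, hp, tpe  # lazy import; needs Spark at call time

    space = {"reg_param": hp.uniform("reg_param", 0.0, 1.0)}
    spark_trials = SparkTrials(parallelism=parallelism)  # settings evaluated concurrently
    return fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=spark_trials)
```

Higher parallelism finishes sooner, but an adaptive algorithm then sees fewer completed results before it proposes each new setting, so the right value depends on the workload.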
  49. Related Content • Blog: • Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt • Webinar: • How to Automate Machine Learning and Scale Delivery • Tutorials: ● Hyperparameter Tuning Documentation ● MLflow integrations with H2O.ai, GPyOpt, Hyperopt • Notebooks: ● MLlib + Automated MLflow Tracking ● Distributed Hyperopt + Automated MLflow Tracking ● Basic Introduction to DataRobot via API • Videos: ● Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs ● Best Practices for Hyperparameter Tuning with MLflow ● Advanced Hyperparameter Optimization for Deep Learning with MLflow
  50. Getting started • MLflow: Managed MLflow, Generally Available in Databricks • MLlib + automated MLflow tracking: Public preview in Databricks Runtime 5.4 & 5.4 ML • Distributed Hyperopt + automated MLflow tracking: Public preview in Databricks Runtime 5.4 ML • https://docs.databricks.com/spark/latest/mllib/index.html#hyperparameter-tuning • https://docs.azuredatabricks.net/spark/latest/mllib/index.html#hyperparameter-tuning • https://mlflow.org/
  51. Thank you • Q&A
