
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy


Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.



  2. Asynchronous Hyperparameter Optimization with Apache Spark. Jim Dowling (Logical Clocks AB and KTH) and Moritz Meister (Logical Clocks AB). #UnifiedDataAnalytics #SparkAISummit @jim_dowling @morimeister
  3. The Bitter Lesson (of AI): “Methods that scale with computation are the future of AI” and “The two (general purpose) methods that seem to scale ... are search and learning.” Rich Sutton (father of reinforcement learning). Sources: http://www.incompleteideas.net/IncIdeas/BitterLesson.html and https://www.youtube.com/watch?v=EeMCEQa85tw
  4. Spark scales with available compute => Spark is the answer! This talk is about why bulk-synchronous parallel compute (Spark) does not scale efficiently for search, and how we made Spark efficient for directed search (task-based asynchronous parallel compute).
  5. Inner and Outer Loop of Deep Learning. [Diagram: in the inner loop, workers worker1 … workerN train on the training data and synchronize their updates ∆1 … ∆N; in the outer loop, the resulting metric drives a search method that proposes new hyperparameters. http://tiny.cc/51yjdz]
  6. Inner and Outer Loop of Deep Learning. [Same diagram, annotated: the inner loop is LEARNING, the outer loop is SEARCH. http://tiny.cc/51yjdz]
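To make the two loops concrete, here is a minimal, self-contained Python sketch with toy stand-ins (build_model, train_one_epoch, validate, and the random sampling are all hypothetical placeholders, not Spark or Maggy code): the outer loop is the SEARCH over hyperparameters, the inner loop is the LEARNING of weights for one setting.

```python
import random

def build_model(hparams):
    # toy stand-in for a real model: just remember the learning rate and a loss
    return {"lr": hparams["lr"], "loss": 1.0}

def train_one_epoch(model):
    # toy stand-in for data-parallel training: shrink the loss a bit
    model["loss"] *= (1.0 - model["lr"])

def validate(model):
    return 1.0 - model["loss"]            # higher is better

def inner_loop(hparams):                  # LEARNING: fit weights for one setting
    model = build_model(hparams)
    for _ in range(hparams["epochs"]):
        train_one_epoch(model)
    return validate(model)

def outer_loop(num_trials):               # SEARCH: propose settings, consume metrics
    best = None
    for _ in range(num_trials):
        hparams = {"lr": random.uniform(0.01, 0.5), "epochs": 5}
        metric = inner_loop(hparams)
        if best is None or metric > best[0]:
            best = (metric, hparams)
    return best

print(outer_loop(10))
```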
  7. Hopsworks – a platform for Data-Intensive AI
  8. Hopsworks Technical Milestones, 2017–2019: world's first Hadoop platform to support GPUs-as-a-Resource; world's fastest HDFS, published at USENIX FAST with Oracle and Spotify; world's first open-source Feature Store for machine learning; world's first distributed filesystem to store small files in metadata on NVMe disks; winner of the IEEE Scale Challenge 2017 with HopsFS (1.2m ops/sec); world's most scalable POSIX-like hierarchical filesystem with multi-data-center availability (1.6m ops/sec on GCP); first non-Google ML platform with TensorFlow Extended (TFX) support through Beam/Flink; world's first unified hyperparameter and ablation study framework.
  9. The Complexity of Deep Learning. [Diagram: beyond the model itself, a production system needs data collection, data validation, feature engineering, hyperparameter tuning, distributed training, model serving, A/B testing, monitoring, pipeline management, and hardware management; the Hopsworks Feature Store and the Hopsworks REST API cover the Data → Model → Prediction flow. Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems”.]
  12. [Hopsworks platform overview: data sources, applications, APIs, and dashboards sit on top of pipelines for data preparation & ingestion, experimentation & model training, deploy & productionalize, and streaming; built with Apache Beam, Apache Spark, Apache Flink, Pip/Conda environments, TensorFlow, scikit-learn, Keras, Jupyter Notebooks, TensorBoard, Kubernetes, Kafka + Spark Streaming, model serving and model monitoring, orchestration in Airflow, the Hopsworks Feature Store, and filesystem and metadata storage in HopsFS.]
  13. “AI is the new Electricity” – Andrew Ng. What engine should we use?
  14. Engines Matter! [Image from https://twitter.com/_youhadonejob1/status/1143968337359187968?s=20]
  15. Engines Really Matter! [Photo by Zbynek Burival on Unsplash]
  16. Hopsworks Engine: ML Pipelines. [Diagram: data pipelines (ingest & prep) feed the Feature Store, which feeds machine learning experiments (hyperparameter optimization, ablation studies), data-parallel training, and model serving. There is horizontal scalability at all stages, but experimentation is a bottleneck due to its iterative nature and human interaction.]
  17. Iterative Model Development (set hyperparameters → train model → evaluate performance): trial and error is slow; the iterative approach is greedy; search spaces are usually large; hyperparameters are sensitive and interact with each other.
  18. Black Box Optimization. [Diagram: the learning black box takes a point from the search space and returns a metric; a meta-level learning & optimization process uses that metric to choose the next point.]
  19. Parallel Black Box Optimization. [Diagram: the meta-level optimizer pushes trials onto a queue; parallel workers pull trials, run the learning black box, and report metrics back.] Which algorithm to use for search? How to monitor progress? How to aggregate results? Fault tolerance? This should be managed with platform support!
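Slide 19's pattern, a trial queue consumed by parallel workers with results aggregated by a meta-level step, can be illustrated with a generic random search (not Maggy code; the objective and search space are toy placeholders):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def objective(hparams):
    # the "learning black box": in reality this trains a model and returns a metric
    return -(hparams["lr"] - 0.01) ** 2

def random_trial():
    # one sample from the search space
    return {"lr": random.uniform(1e-4, 1e-1), "batch_size": random.choice([32, 64, 128])}

trials = [random_trial() for _ in range(20)]            # the trial queue
with ThreadPoolExecutor(max_workers=4) as workers:      # the parallel workers
    metrics = list(workers.map(objective, trials))      # run trials, aggregate results

# meta-level step: pick the best trial (a real optimizer would also use the
# metrics to decide which trials to schedule next)
best_metric, best_trial = max(zip(metrics, trials), key=lambda mt: mt[0])
print(best_metric, best_trial)
```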
  20. Maggy: a flexible framework for running different black-box optimization algorithms on Hopsworks: ASHA, Bayesian Optimization, Random Search, Grid Search, and more to come…
  21. Synchronous Search. [Diagram: the driver launches a stage of tasks (Task11 … Task1N) that write Metrics1 to HDFS, waits at a barrier, then launches the next stage (Task21 … Task2N), and so on; every stage ends at a barrier.]
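The bulk-synchronous pattern on slide 21 maps directly onto PySpark: each stage runs one batch of trials as Spark tasks, and collect() acts as the barrier, so the driver only sees metrics (and fast tasks only free their slots) once the slowest trial in the stage finishes. A minimal sketch with a toy objective standing in for real training:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("synchronous-search").getOrCreate()
sc = spark.sparkContext

def train_and_evaluate(hparams):
    # stand-in for real training: return the trial and its metric
    return hparams, -(hparams["lr"] - 0.01) ** 2

best = None
for stage in range(3):                                    # three stages, as in the diagram
    # toy "search method": sample a fresh batch of trials per stage
    trials = [{"lr": random.uniform(1e-4, 1e-1)} for _ in range(8)]
    # one Spark stage: one task per trial; collect() is the barrier
    results = sc.parallelize(trials, len(trials)).map(train_and_evaluate).collect()
    # only after the barrier can the driver inspect metrics and plan the next stage;
    # tasks that finished (or were stopped) early sit idle waiting for stragglers
    stage_best = max(results, key=lambda r: r[1])
    if best is None or stage_best[1] > best[1]:
        best = stage_best
print(best)
```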
  22. Add Early Stopping. [Same diagram with early stopping: tasks whose trials are stopped early still occupy their slots until the stage barrier, so every stage ends with wasted compute.]
  23. Performance Enhancement. Early stopping: median stopping rule, performance-curve prediction. Multi-fidelity methods: Successive Halving Algorithm, Hyperband. [Figure: Successive Halving Algorithm]
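Of these, the median stopping rule is the simplest to state: stop a trial at step t if its best metric so far is worse than the median of the other trials' running averages at the same step. A generic sketch (not the Maggy implementation; metrics are assumed higher-is-better):

```python
from statistics import median

def median_stopping_rule(trial_history, other_trials, step):
    """Return True if the trial should be stopped at `step`.

    trial_history: this trial's metrics up to and including `step`.
    other_trials:  metric histories of previously completed trials.
    """
    # running average of each completed trial up to the same step
    running_avgs = [sum(h[: step + 1]) / (step + 1) for h in other_trials if len(h) > step]
    if not running_avgs:
        return False                       # nothing to compare against yet
    best_so_far = max(trial_history[: step + 1])
    return best_so_far < median(running_avgs)
```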
  24. Synchronous Successive Halving. Kevin G. Jamieson et al., “Non-stochastic Best Arm Identification and Hyperparameter Optimization” (2015). Animation: https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/
  25. Asynchronous Successive Halving (ASHA). Liam Li et al., “Massively Parallel Hyperparameter Tuning” (2018). Animation: https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/
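The difference from the synchronous variant is the promotion rule: a worker that becomes free promotes a configuration to the next rung as soon as that configuration is in the top 1/η of trials completed on its current rung, instead of waiting for the whole rung to finish; otherwise it starts a new random configuration at the bottom rung. A simplified sketch of that decision (not the Maggy implementation; configurations are identified by hashable ids, metrics are higher-is-better):

```python
def asha_candidate_to_promote(rungs, rung, eta=3):
    """Return a config id to promote from `rung`, or None.

    rungs[r] is a list of (config_id, metric) pairs completed at rung r.
    """
    completed = sorted(rungs.get(rung, []), key=lambda cm: cm[1], reverse=True)
    top_k = len(completed) // eta                       # size of the top 1/eta at this rung
    already_promoted = {cfg for cfg, _ in rungs.get(rung + 1, [])}
    for config_id, _ in completed[:top_k]:
        if config_id not in already_promoted:
            return config_id                            # promote immediately, no rung-wide barrier
    return None                                         # caller starts a new random config instead
```

A free worker would scan the rungs from the top down and run the first promotable configuration it finds with η times more resources; if none is promotable, it samples a new configuration at the lowest rung.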
  26. Challenge: how can we fit this into the bulk-synchronous execution model of Spark? There is a mismatch between Spark tasks and stages on the one hand and trials on the other. Databricks' approach: Project Hydrogen (barrier execution mode) and SparkTrials in Hyperopt.
  27. The Solution: long-running tasks. [Diagram: a single stage of long-running tasks (Task11 … Task1N) with one final barrier; while they run, the tasks send metrics to the driver, and the driver sends back new trials or early-stop signals.]
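In this model, each long-running task is essentially a loop that asks the driver for work and streams metrics back, so the driver can hand out new trials and early-stop decisions without a stage barrier. A rough sketch of such a loop; the rpc client and its methods (get_next_trial, report_metric, should_stop, finalize_trial) are hypothetical and not the actual Maggy worker protocol:

```python
def worker_loop(rpc, train_fn):
    # runs inside one long-running Spark task until the driver ends the experiment
    while True:
        trial = rpc.get_next_trial()                     # driver assigns the next trial, or None
        if trial is None:
            break
        for metric in train_fn(trial["hparams"]):        # train_fn yields one metric per epoch
            rpc.report_metric(trial["id"], metric)       # heartbeat with the intermediate metric
            if rpc.should_stop(trial["id"]):             # driver applied its early-stopping rule
                break                                    # free the slot immediately, no barrier
        rpc.finalize_trial(trial["id"])                  # report the final result to the driver
```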
  28. Enter Maggy
  29. User API. [Notebook/code screenshot]
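The slide shows the user API as a notebook screenshot. As a sketch of what it looked like at the time (based on the Maggy documentation; the search-space bounds, trial count, and the build_model/train_one_epoch helpers are illustrative, and exact signatures may have changed in later releases):

```python
from maggy import experiment, Searchspace

# define the search space (parameter names and ranges are illustrative)
sp = Searchspace(kernel=('INTEGER', [2, 8]), pool=('INTEGER', [2, 8]))

def train(kernel, pool, reporter):
    # runs once per trial inside a long-running Spark task; Maggy passes in the
    # sampled hyperparameters plus a reporter used to heartbeat metrics to the driver
    model = build_model(kernel=kernel, pool=pool)      # hypothetical helper
    for epoch in range(10):
        acc = train_one_epoch(model)                   # hypothetical helper
        reporter.broadcast(metric=acc)                 # enables early stopping on the driver
    return acc

result = experiment.lagom(train,
                          searchspace=sp,
                          optimizer='randomsearch',
                          direction='max',
                          num_trials=15,
                          name='mnist-hpo')
```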
  30. Developer API. [Code screenshot]
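The developer API (also shown as a screenshot) is how new optimizers are plugged in: roughly, subclass Maggy's abstract optimizer and implement initialization, suggestion of the next trial, and finalization. The sketch below is an approximation from the docs of the time; the exact module paths, class names, and helpers (AbstractOptimizer, Trial, get_random_parameter_values) should be checked against the Maggy source:

```python
from maggy.optimizer import AbstractOptimizer
from maggy.trial import Trial

class MyRandomSearch(AbstractOptimizer):
    """Toy optimizer: samples every trial independently from the search space."""

    def initialize(self):
        # called once before the experiment starts
        pass

    def get_suggestion(self, trial=None):
        # called whenever a worker is free; `trial` is the trial that just finished (or None)
        params = self.searchspace.get_random_parameter_values(1)[0]
        return Trial(params)

    def finalize_experiment(self, trials):
        # called once after all trials have finished
        pass
```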
  31. Results. [Two charts, one for the hyperparameter optimization task and one for the validation task, each comparing ASHA, RS-ES, and RS-NS.]
  32. Ablation. Replacing the Maggy Optimizer with an Ablator enables: feature ablation using the Feature Store, leave-one-layer-out ablation, and leave-one-component-out (LOCO) ablation. [Diagram: a Titanic-style example where features such as Pclass, name, and sex are dropped one at a time from the dataset used to predict survival.]
  33. Ablation API. [Code screenshot]
  34. Ablation API (continued). [Code screenshot]
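Slides 33–34 show the ablation API as screenshots. A rough sketch, based on the Maggy ablation documentation of the time (the dataset name and version, feature and layer names, the base-model generator, and the exact training-function signature are illustrative and may differ from the released API):

```python
from maggy import experiment
from maggy.ablation import AblationStudy

# study defined against a Feature Store training dataset (name/version illustrative)
ablation_study = AblationStudy('titanic_train_dataset', 1, label_name='survived')

# feature ablation: one trial per listed feature, each trained with that feature removed
ablation_study.features.include('pclass', 'sex', 'fare')

# leave-one-layer-out ablation: one trial per listed layer of the base model
ablation_study.model.layers.include('my_dense_two', 'my_dense_three')
ablation_study.model.set_base_model_generator(base_model_generator)  # hypothetical generator fn

def train(dataset_function, model_function):
    # Maggy injects functions that build the ablated dataset and model for each trial
    model = model_function()
    train_ds = dataset_function()
    history = model.fit(train_ds, epochs=5)
    return history.history['accuracy'][-1]

result = experiment.lagom(train,
                          ablation_study=ablation_study,
                          ablator='loco',
                          name='titanic-ablation')
```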
  35. Conclusion: avoid manual, iterative hyperparameter optimization; black-box optimization is hard; state-of-the-art algorithms can be deployed asynchronously; Maggy provides platform support for automated hyperparameter optimization and ablation studies; asynchronism saves resources; early stopping for sensible models.
  36. What next? More algorithms, comparability of experiments, implicit provenance, and support for PyTorch.
  37. Thank you! Box 1263, Isafjordsgatan 22, Kista, Stockholm. https://www.logicalclocks.com. Register for a free account at www.hops.site. Twitter: @logicalclocks, @hopsworks. GitHub: https://github.com/hopshadoop/maggy, https://maggy.readthedocs.io/en/latest/, https://github.com/logicalclocks/hopsworks
  38. Acknowledgements and References. Thanks to the entire Logical Clocks team ☺ Contributions from colleagues: Robin Andersson (@robzor92), Sina Sheikholeslami (@cutlash), Kim Hammar (@KimHammar1), Alex Ormenisan (@alex_ormenisan). References: Maggy, https://github.com/logicalclocks/maggy and https://maggy.readthedocs.io/en/latest/; “Feature Store: the missing data layer in ML pipelines?”, https://www.logicalclocks.com/feature-store/; Hopsworks white paper, https://www.logicalclocks.com/whitepapers/hopsworks; “ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata”, CCGrid, 2019.
