Deep Learning at Scale with Apache Spark and Determined

Despite its enormous potential to enable new applications, deep learning remains prohibitively expensive, difficult, and time-consuming for the vast majority of companies. Training DL models at scale is particularly challenging: training a single model can take days or weeks, and DL engineers are often forced to spend much of their time doing DevOps or writing boilerplate code to handle routine tasks like data loading, distributed training, or fault tolerance.

Published in: Data & Analytics

  1. Deep Learning at Scale with Spark and Determined. Spark + AI Summit, June 26, 2020. Neil Conway, David Hershey.
  2. Typical Deep Learning Workflow: RAW DATA, DATA LAKE, MODEL TRAINING & DEVELOPMENT, REAL-TIME SERVING, BATCH INFERENCE
  3. Typical Deep Learning Workflow: RAW DATA, DATA LAKE, MODEL TRAINING & DEVELOPMENT, REAL-TIME SERVING, BATCH INFERENCE
  4. The Determined DL Platform
     1. Training platform
        ● What does a DL training platform do?
        ● System architecture
        ● Key features
     2. ML ecosystem demo
        ● How to build an end-to-end deep learning environment with Determined and the Spark ecosystem
        ● Demo!
  5. Isn’t Training Solved By TensorFlow and PyTorch?
     TensorFlow and PyTorch are great tools, but they are focused on solving the challenges faced by:
        ● A single DL engineer
        ● Training a single model
        ● With a single GPU
     As your team size, cluster size, and data size all increase, you soon run into problems that are beyond the scope of TF and PyTorch:
        ● “How do I share a GPU cluster with my team?”
        ● “How do I do distributed training?”
        ● “How do I manage my models and store my training metadata?”
        ● “How do I do efficient hyperparameter search?”
  6. Determined DL Training Platform
     Stages: Data Management and Preparation; Model Development; Model Serving
        ● Data Storage and ETL (HDFS, S3, Airflow, Pachyderm, Spark, etc.)
        ● Components: Data Visualization & Debugging, Hyperparameter Search, Distributed Training, Experiment Tracking, Cluster Sharing and Resource Management, Batch Inference, NAS, DL Data Cache
        ● Model export
        ● Web services and Apps (MLflow, Seldon, SageMaker, TF Serving)
  7. Stages: Data Management and Preparation; Model Development; Model Serving
     Legend: Available Today; In Development
        ● Data Storage and ETL (HDFS, S3, Airflow, Pachyderm, Spark, etc.)
        ● Components: Data Visualization & Debugging, Hyperparameter Search, Distributed Training, Experiment Tracking, Cluster Sharing and Resource Management, Batch Inference, NAS, DL Data Cache
        ● Model export
        ● Web services and Apps (MLflow, Seldon, SageMaker, TF Serving)
  8. Determined: Key Capabilities and Benefits
     Seamlessly integrated components → Dramatically easier-to-use
     Hyperparameter Optimization/NAS
        ● 100x faster than standard methods
        ● 10x faster than research methods
        ● Tight integration w/ GPU scheduler
     Distributed Training
        ● 44x faster than single GPU
        ● 2x faster than Horovod
        ● No DevOps pain or code changes
     Tools for DL Teams
        ● Focus on models, not infrastructure
        ● Up to 70% savings on cloud instances
        ● Saves each DL engineer 1 day per week
     Other capabilities: Automatic fault tolerance; Elastic resource allocation and simultaneous progress; Experiment tracking and visualization; Specialized hardware management
     Chart labels: Determined AI, Horovod; Dataset: COCO, Target 37.8% mAP; Dataset: CIFAR10
  9. Demo
  10. Demonstration: End to End ML with Spark + Determined
      Diagram labels: Land Data into Delta Lake with Spark; Scaled Deep Learning with Determined; Inference in Spark; Read Data from Delta; Checkpoint Export
      Key Points:
        1. Easy data versioning with Delta + Determined
        2. Scaling your experiments with Determined
        3. Versioned models with Determined + using them for inference in Spark
  11. Now Open Source!
      We’ve spent the last three years building Determined and working closely with cutting-edge DL teams in several industries.
      🎉🎉🎉 We recently released the platform under the Apache 2.0 license 🎉🎉🎉
      We’re very excited to share Determined with the DL community, and would love your feedback on the product!
      Learn More & Join the Determined Community:
        ● https://github.com/determined-ai/determined
        ● https://docs.determined.ai/
        ● https://determined.ai
  12. Thank you! https://github.com/determined-ai/determined
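
Slides 5 and 8 raise efficient hyperparameter search but do not say how it works. As a rough illustration only, here is a minimal sketch of successive halving, a general early-stopping search technique: train every candidate briefly, keep the best fraction, and give the survivors a larger budget. It is not presented as Determined's implementation. The names `toy_validation_loss` and `successive_halving` are hypothetical, and the loss function is a made-up stand-in for real training.

```python
import math
import random


def toy_validation_loss(learning_rate: float, steps: int) -> float:
    """Hypothetical stand-in for training a model for `steps` steps.

    The loss shrinks with more steps and is lowest near learning_rate = 0.1.
    """
    distance = abs(math.log10(learning_rate) - math.log10(0.1))
    return distance + 1.0 / steps


def successive_halving(configs, min_steps=1, eta=2):
    """Evaluate all configs on a small budget, keep the best 1/eta, repeat.

    Returns (best config, total steps spent, largest per-config budget used).
    Each round is counted as training from scratch, which overstates the cost
    compared with resuming survivors from checkpoints.
    """
    survivors = list(configs)
    steps = min_steps
    total_steps_spent = 0
    max_steps_used = 0
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda lr: toy_validation_loss(lr, steps))
        total_steps_spent += steps * len(survivors)
        max_steps_used = steps
        survivors = ranked[: max(1, len(survivors) // eta)]
        steps *= eta
    return survivors[0], total_steps_spent, max_steps_used


# Sample 16 learning rates log-uniformly between 1e-4 and 1 (assumed range).
rng = random.Random(0)
candidates = [10 ** rng.uniform(-4, 0) for _ in range(16)]

best_lr, spent, max_steps = successive_halving(candidates)
full_budget = max_steps * len(candidates)  # cost of training every config fully

print(f"best learning rate: {best_lr:.4g}")
print(f"steps spent: {spent} (vs {full_budget} without early stopping)")
```

With 16 candidates and `eta=2`, the rounds train 16, 8, 4, and 2 configurations for 1, 2, 4, and 8 steps, so the search spends 64 steps instead of the 128 needed to train every candidate for 8 steps. The speedups quoted on slide 8 come from the presenters' own benchmarks and are not reproduced by this toy example.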