© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Setting up custom machine learning
environments on AWS
Shashank Prasanna
Sr. Technical Evangelist, AI/ML
AWS
AIM309
Agenda
A data scientist’s journey to cloud
• Meet Jane Roe, lead data scientist at a self-driving car startup
• Challenges with machine learning (ML) infrastructure
Evaluating AWS capabilities for machine learning
• Custom ML environment #1 – research and prototyping
• Custom ML environment #2 – getting production ready
• Custom ML environment #3 – training and deployment at scale
Summary and Q&A
A data scientist’s journey to cloud
On-premises
Meet Jane Roe, head of data science and research
Responsible for research and development of algorithms
for self-driving cars
Infrastructure
• Workstations with GPUs for experimentation
• Data center with more compute and storage
Software
• Open-source machine learning frameworks
• Custom proprietary toolchains for self-driving use cases
• IT-provisioned schedulers, orchestrators, and job monitoring
Challenges with machine learning infrastructure
Software management
• Building, testing, and maintaining machine learning (ML) frameworks and in-house customizations
• Keeping up with the latest from the open-source AI community
Performance optimizations
• Optimizing the full stack (drivers, libraries, dependencies) for CPUs and GPUs, and for training and inference deployments
Collaborative development
• Sharing, collaborating on, and testing the full stack in different environments while getting reproducible results
Infrastructure management
• Working with job schedulers, orchestrators, and monitoring tools provisioned and managed by centralized IT teams
Scalability
• Scaling compute resources in bursts for specific experiments
• Planning and forecasting the need for compute capacity over time
Evaluating AWS capabilities for machine learning
[Diagram: the AWS ML stack in three layers]
• AI services (vision, speech, language, chatbots, forecasting, recommendations): Rekognition Image, Rekognition Video, Textract, Polly, Transcribe, Translate, Comprehend, Lex, Forecast, Personalize
• ML services: Amazon SageMaker (Ground Truth, notebooks, algorithms + Marketplace, reinforcement learning, training, optimization, deployment, hosting)
• ML frameworks + infrastructure (frameworks, interfaces, infrastructure): EC2 P3 & P3dn, EC2 G4, EC2 C5, FPGAs, Greengrass, Elastic Inference, Inferentia
Evaluating AWS capabilities for machine learning
✓ Custom environments: Customized setups with domain-specific modifications to open-source
frameworks
✓ Custom pipelines: Build integrated pipelines with in-house developed upstream and downstream
tools
✓ Flexibility: Variety of compute (CPUs, GPUs, FPGAs, inference accelerators) and storage options
Custom setup 1 – research and prototyping
1. Compute (CPUs, GPUs)
2. Storage
3. Source control
4. Frameworks
Custom setup 1 – research and prototyping
[Diagram: in the AWS Cloud, an EC2 instance with GPUs running the DL AMI, accessed through the CLI; Amazon EBS for datasets and checkpoints; Amazon S3 for trained models and metadata; AWS CodeCommit as a private Git repository]
1. P3 instances with up to 8 NVIDIA V100 GPUs per instance
2. Amazon EBS with up to 16 TiB of block storage, and Amazon S3 for models and metadata
3. AWS CodeCommit private Git repository
4. AWS Deep Learning AMI with pre-configured, optimized frameworks
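The four components above could be provisioned from a script instead of the console. The following is a minimal sketch, assuming the boto3 SDK is installed and credentials are configured; the AMI ID, key pair name, volume size, and root device name are placeholders rather than values from the talk, and the root device name differs between AMIs.

```python
def build_launch_params(ami_id, key_name, instance_type="p3.16xlarge", volume_gib=500):
    """Return keyword arguments for boto3's EC2 run_instances call.

    ami_id and key_name are placeholders: look up the current Deep Learning
    AMI ID for your region and use your own key pair.
    """
    # EBS volumes go up to 16 TiB, which is 16384 GiB.
    if not 1 <= volume_gib <= 16384:
        raise ValueError("EBS volume size must be between 1 and 16384 GiB")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,  # p3.16xlarge has 8 NVIDIA V100 GPUs
        "KeyName": key_name,
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                # Assumed root device name; check the AMI you actually use.
                "DeviceName": "/dev/sda1",
                "Ebs": {"VolumeSize": volume_gib, "VolumeType": "gp2"},
            }
        ],
    }


def launch(params, region="us-west-2"):
    """Launch the instance and return its ID (needs AWS credentials)."""
    import boto3  # imported here so the parameter builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]


params = build_launch_params("ami-EXAMPLE", "my-key-pair")
print(params["InstanceType"], params["BlockDeviceMappings"][0]["Ebs"]["VolumeSize"])
```

Keeping the parameter builder separate from the launch call makes the configuration easy to review or version in the CodeCommit repository before anything is billed.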
AWS Deep Learning Amazon Machine Image (DL AMI)
• Pre-configured with popular deep-learning
frameworks and interfaces
• Optimized for performance with latest NVIDIA
driver, CUDA libraries, and Intel libraries
• Dedicated Conda environments for each
framework
• Customizable and extensible for custom
workflows
AWS DL AMIs
OS
• Ubuntu
• Amazon Linux
• Amazon Linux 2
AMI
• Conda AMI
• Base AMI
DEMO
Custom setup 1 – benefits and gaps
Benefits
• Replicate on-premises development experience
• AWS Deep Learning AMI with optimized frameworks
• Up to 8 NVIDIA V100 GPUs, 96 vCPUs, and 768 GB of memory per instance, with up to 16 TiB of storage on Amazon EBS
Gaps
• Each user works on their own dedicated instance, which leads to underutilization
• Hard to collaborate on the full development stack and guarantee reproducibility across environments
• Distributed training and large-scale parallel experiments are difficult: they require managing multiple instances and configuring them manually
Custom setup 2 – Getting production ready
1. Compute (CPUs, GPUs)
2. Storage
3. Source control
4. Frameworks
5. Reproducible environments
6. Shared file system
Custom setup 2 – Getting production ready
[Diagram: three EC2 GPU instances in the AWS Cloud, each running AWS DL containers and accessed through the CLI]
5. AWS Deep Learning Containers provide lightweight, reproducible environments, managed with Amazon ECR
6. Amazon EFS and Amazon FSx for Lustre offer scalable, elastic file storage accessible simultaneously by multiple EC2 instances
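To make items 5 and 6 concrete, here is a hedged sketch of how a container image stored in Amazon ECR might be referenced and run with a shared file system mounted. The account ID, repository name, tag, mount paths, and `train.py` script are all hypothetical; the sketch assumes the EFS file system is already mounted on the host at `/mnt/efs` and that Docker 19.03 or later is installed so that `--gpus` is available.

```python
import shlex


def ecr_image_uri(account_id, region, repository, tag):
    """Build an image URI in the standard Amazon ECR registry format."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def docker_run_command(image_uri, efs_mount="/mnt/efs", container_dir="/data"):
    """Return a docker run command (as an argument list) for a training job.

    efs_mount is where the shared file system is assumed to be mounted on
    the EC2 host; train.py is a placeholder for your own entry point.
    """
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                       # expose all GPUs (Docker 19.03+)
        "-v", f"{efs_mount}:{container_dir}",  # share datasets across instances
        image_uri,
        "python", "train.py", "--data-dir", container_dir,
    ]


uri = ecr_image_uri("123456789012", "us-west-2", "my-dl-training", "v1")
command = docker_run_command(uri)
print(uri)
print(" ".join(shlex.quote(arg) for arg in command))
```

Because every instance pulls the same tagged image and mounts the same file system, collaborators get the same software stack and the same data regardless of which instance they run on.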
[Diagram, shared services: Amazon S3 (trained models and metadata), Amazon ECR (Docker container registry), AWS CodeCommit (private Git repository), Amazon EFS (shared file system)]
AWS Deep Learning Containers
Containers (two framework columns; the framework logos did not survive extraction), each offering:
• Training (4), Inference (4)
• CPU, GPU
• Python 2, Python 3
Deep Learning Containers
• 16 container images configured for training, inference, CPUs,
GPUs, and multiple frameworks to meet your needs
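The count of 16 follows from the options on the slide: two frameworks, two job types, two device types, and two Python versions. The sketch below enumerates the combinations; the framework names are placeholders because the logos are missing from the transcript, and the generated labels are illustrative only, not real image tags.

```python
from itertools import product

FRAMEWORKS = ["framework-a", "framework-b"]  # placeholders for the two logos
JOB_TYPES = ["training", "inference"]
DEVICES = ["cpu", "gpu"]
PYTHON_VERSIONS = ["py2", "py3"]


def image_variants():
    """Return one illustrative label per container image combination."""
    return [
        "-".join(combo)
        for combo in product(FRAMEWORKS, JOB_TYPES, DEVICES, PYTHON_VERSIONS)
    ]


variants = image_variants()
print(len(variants))  # 2 * 2 * 2 * 2 = 16

# Four images per framework and job type, matching "Training (4)" on the slide.
training_a = [v for v in variants if v.startswith("framework-a-training")]
print(len(training_a))  # 4
```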
DEMO
Custom setup 2 – benefits and gaps
Benefits
• Lightweight, scalable, and consistent development and deployment environments
• Packages code, configuration, and dependencies that can be extended by collaborators
• DL containers are fully configured and validated and include performance optimizations for CPU,
GPU, training, and inference workflows
Gaps
• Difficult to scale out training, since users need to manage multiple instances
• Need additional software and setup to run distributed training and large-scale parallel experiments
Custom setup 3 – training and deployment at scale
1. Compute (CPUs, GPUs)
2. Storage
3. Source control
4. Frameworks
5. Reproducible environments
6. Shared file system
7. Cluster management, scheduling and
orchestration
Custom setup 3 – training and deployment at scale
[Diagram: in the AWS Cloud, three Amazon EKS clusters of EC2 instances, each reached from the CLI through its own EKS endpoint: cluster #1 for large-scale experiments, cluster #2 for distributed training, cluster #3 for testing and validation. Shared services: Amazon S3 (trained models and metadata), Amazon ECR (Docker container registry), AWS CodeCommit (private Git repository), Amazon EFS (shared file system)]
7. Amazon EKS makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.
AWS container services
Image registry: container image repository
• Amazon Elastic Container Registry (Amazon ECR)
Management: deployment, scheduling, scaling, and management of containerized applications
• Amazon Elastic Kubernetes Service (Amazon EKS)
• Amazon Elastic Container Service (Amazon ECS)
Compute: where the containers run
• Amazon EC2
• AWS Fargate
Deep learning training on Amazon EKS
Create cluster
eksctl create cluster \
  --name eks-gpu \
  --version 1.12 \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.8xlarge \
  --nodes 8 \
  --timeout=40m \
  --ssh-access \
  --ssh-public-key=<public-key> \
  --auto-kubeconfig
Submit a training job
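The contents of the "Submit a training job" panel are not in the transcript. As one possible illustration, the sketch below generates a Kubernetes Job manifest that requests GPUs on the cluster's nodes. The job name, image URI, and `train.py` command are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is running on the cluster. The default of 4 GPUs matches one p3.8xlarge node.

```python
import json


def training_job_manifest(name, image, gpus=4, command=("python", "train.py")):
    """Return a Kubernetes batch/v1 Job manifest for a single training pod."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "command": list(command),
                            # GPU requests go under limits for extended resources.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


manifest = training_job_manifest(
    "example-train",  # hypothetical job name
    "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-dl-training:v1",
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running `kubectl apply -f` on it would submit the job, since kubectl accepts JSON as well as YAML manifests.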
DEMO
Custom setup 3 – benefits and gaps
Benefits
• Easily scale out training and let Amazon EKS or Amazon ECS manage workload scheduling and orchestration
• Improve cluster utilization and save cost by sharing available resources
• Use cases:
o Run large-scale experiments (model architecture and hyperparameter search) to find the best
model and approach
o Run distributed training jobs on large datasets
Gaps
• Users still need to manage the EC2 instances that are part of an Amazon EKS or Amazon ECS cluster
• Users still need to set up experiment and workflow management tools such as Kubeflow, Argo, etc.
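For the hyperparameter-search use case, each combination of settings typically becomes its own containerized job that the cluster schedules in parallel. The following is a small sketch of that expansion; the hyperparameter names, values, and job naming scheme are assumptions for illustration, not part of the talk.

```python
from itertools import product


def experiment_grid(learning_rates, batch_sizes):
    """Expand a hyperparameter grid into one job description per combination."""
    jobs = []
    for index, (lr, bs) in enumerate(product(learning_rates, batch_sizes)):
        jobs.append(
            {
                "name": f"hpo-{index:03d}",  # hypothetical naming scheme
                "args": ["--learning-rate", str(lr), "--batch-size", str(bs)],
            }
        )
    return jobs


jobs = experiment_grid([0.1, 0.01, 0.001], [64, 128])
print(len(jobs))  # 3 learning rates * 2 batch sizes = 6
print(jobs[0])
```

Each entry could then be passed as container arguments to a separate job, so the scheduler fills the available GPUs instead of leaving each user's instance idle between runs.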
Amazon SageMaker: End-to-end machine learning
• Collect and prepare training data: pre-built notebooks for common problems
• Choose and optimize your machine learning algorithm: built-in, high-performance algorithms
• Set up and manage environments for training: one-click training on the highest-performance infrastructure
• Train and tune model (trial and error): model optimization
• Deploy model in production: one-click deployment
• Scale and manage the production environment: fully managed with automatic scaling
AWS benefits for custom ML environments
Software management
• AWS Deep Learning AMI and AWS Deep Learning Containers include popular deep-learning frameworks and come fully configured and validated.
Performance optimizations
• DL AMI and DL containers include frameworks optimized by experts to deliver the best training and inference performance on CPUs and GPUs.
Collaborative development
• With DL containers, Amazon ECR, and AWS CodeCommit, collaborative development across different environments is easy.
Infrastructure management
• Simplify infrastructure management, scheduling, and orchestration with Amazon container services such as Amazon EKS, Amazon ECS, and Amazon ECR.
Scalability
• Scale out in bursts for large-scale training experiments or for long-running distributed training jobs. Use as many or as few resources as you need.
Thank you!
Shashank Prasanna
@shshnkp

More Related Content

What's hot

HK-AWS-Quick-Start-Workshop
HK-AWS-Quick-Start-WorkshopHK-AWS-Quick-Start-Workshop
HK-AWS-Quick-Start-Workshop
Amazon Web Services
 
Building AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdf
Building AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdfBuilding AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdf
Building AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdf
Amazon Web Services
 
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
Amazon Web Services
 
Build-Train-Deploy-Machine-Learning-Models-at-Any-Scale
Build-Train-Deploy-Machine-Learning-Models-at-Any-ScaleBuild-Train-Deploy-Machine-Learning-Models-at-Any-Scale
Build-Train-Deploy-Machine-Learning-Models-at-Any-Scale
Amazon Web Services
 
Progetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWSProgetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWS
Amazon Web Services
 
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS SummitScalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Amazon Web Services
 
Best-Practices-for-Running-Windows-Workloads-on-AWS
Best-Practices-for-Running-Windows-Workloads-on-AWSBest-Practices-for-Running-Windows-Workloads-on-AWS
Best-Practices-for-Running-Windows-Workloads-on-AWS
Amazon Web Services
 
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS SummitAsk me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Amazon Web Services
 
[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...
[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...
[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...
Amazon Web Services
 
Increase the value of video using ML and AWS media services - SVC301 - Atlant...
Increase the value of video using ML and AWS media services - SVC301 - Atlant...Increase the value of video using ML and AWS media services - SVC301 - Atlant...
Increase the value of video using ML and AWS media services - SVC301 - Atlant...
Amazon Web Services
 
Pro-Tips-for-Builders-on-AWS
Pro-Tips-for-Builders-on-AWSPro-Tips-for-Builders-on-AWS
Pro-Tips-for-Builders-on-AWS
Amazon Web Services
 
Introducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-RegionIntroducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-Region
Amazon Web Services
 
Compliance-Data-Archival
Compliance-Data-ArchivalCompliance-Data-Archival
Compliance-Data-Archival
Amazon Web Services
 
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
Amazon Web Services
 
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS SummitBuilding Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Amazon Web Services
 
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Amazon Web Services
 
Machine learning for developers & data scientists with Amazon SageMaker - AIM...
Machine learning for developers & data scientists with Amazon SageMaker - AIM...Machine learning for developers & data scientists with Amazon SageMaker - AIM...
Machine learning for developers & data scientists with Amazon SageMaker - AIM...
Amazon Web Services
 
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
Amazon Web Services
 
Architetture per l'analisi di flussi di dati in tempo reale
Architetture per l'analisi di flussi di dati in tempo realeArchitetture per l'analisi di flussi di dati in tempo reale
Architetture per l'analisi di flussi di dati in tempo reale
Amazon Web Services
 
Machine learning at the edge for industrial applications - SVC302 - New York ...
Machine learning at the edge for industrial applications - SVC302 - New York ...Machine learning at the edge for industrial applications - SVC302 - New York ...
Machine learning at the edge for industrial applications - SVC302 - New York ...
Amazon Web Services
 

What's hot (20)

HK-AWS-Quick-Start-Workshop
HK-AWS-Quick-Start-WorkshopHK-AWS-Quick-Start-Workshop
HK-AWS-Quick-Start-Workshop
 
Building AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdf
Building AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdfBuilding AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdf
Building AR/VR apps with AWS - SVC201 - Santa Clara AWS Summit.pdf
 
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
Get hands-on with AWS DeepRacer and compete in the AWS DeepRacer League - AIM...
 
Build-Train-Deploy-Machine-Learning-Models-at-Any-Scale
Build-Train-Deploy-Machine-Learning-Models-at-Any-ScaleBuild-Train-Deploy-Machine-Learning-Models-at-Any-Scale
Build-Train-Deploy-Machine-Learning-Models-at-Any-Scale
 
Progetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWSProgetta, crea e gestisci Modern Application per web e mobile su AWS
Progetta, crea e gestisci Modern Application per web e mobile su AWS
 
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS SummitScalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
Scalable, secure log analytics with Amazon ES - ADB302 - Chicago AWS Summit
 
Best-Practices-for-Running-Windows-Workloads-on-AWS
Best-Practices-for-Running-Windows-Workloads-on-AWSBest-Practices-for-Running-Windows-Workloads-on-AWS
Best-Practices-for-Running-Windows-Workloads-on-AWS
 
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS SummitAsk me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
 
[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...
[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...
[REPEAT] Optimize your workloads with Amazon EC2 & AMD EPYC - DEM01-R - Santa...
 
Increase the value of video using ML and AWS media services - SVC301 - Atlant...
Increase the value of video using ML and AWS media services - SVC301 - Atlant...Increase the value of video using ML and AWS media services - SVC301 - Atlant...
Increase the value of video using ML and AWS media services - SVC301 - Atlant...
 
Pro-Tips-for-Builders-on-AWS
Pro-Tips-for-Builders-on-AWSPro-Tips-for-Builders-on-AWS
Pro-Tips-for-Builders-on-AWS
 
Introducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-RegionIntroducing-AWS-Hong-Kong-Region
Introducing-AWS-Hong-Kong-Region
 
Compliance-Data-Archival
Compliance-Data-ArchivalCompliance-Data-Archival
Compliance-Data-Archival
 
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
Introduction to EC2 A1 instances, powered by the AWS Graviton processor - CMP...
 
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS SummitBuilding Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
Building Data Lakes for Analytics on AWS - ADB201 - Anaheim AWS Summit
 
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
Tech deep dive: Cloud data management with Veeam and AWS - SVC216-S - New Yor...
 
Machine learning for developers & data scientists with Amazon SageMaker - AIM...
Machine learning for developers & data scientists with Amazon SageMaker - AIM...Machine learning for developers & data scientists with Amazon SageMaker - AIM...
Machine learning for developers & data scientists with Amazon SageMaker - AIM...
 
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
Migliora la disponibilità e le prestazioni delle tue applicazioni con Amazon ...
 
Architetture per l'analisi di flussi di dati in tempo reale
Architetture per l'analisi di flussi di dati in tempo realeArchitetture per l'analisi di flussi di dati in tempo reale
Architetture per l'analisi di flussi di dati in tempo reale
 
Machine learning at the edge for industrial applications - SVC302 - New York ...
Machine learning at the edge for industrial applications - SVC302 - New York ...Machine learning at the edge for industrial applications - SVC302 - New York ...
Machine learning at the edge for industrial applications - SVC302 - New York ...
 

Similar to Setting up custom machine learning environments on AWS - AIM309 - New York AWS Summit

Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Amazon Web Services
 
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS SummitDeploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Amazon Web Services
 
Deploying cost-effective machine learning models - AIM202 - Atlanta AWS Summit
Deploying cost-effective machine learning models - AIM202 - Atlanta AWS SummitDeploying cost-effective machine learning models - AIM202 - Atlanta AWS Summit
Deploying cost-effective machine learning models - AIM202 - Atlanta AWS Summit
Amazon Web Services
 
[AWS Tech Talk] Using containers for deep learning workflows
[AWS Tech Talk] Using containers for deep learning workflows[AWS Tech Talk] Using containers for deep learning workflows
[AWS Tech Talk] Using containers for deep learning workflows
shashank4
 
Machine Learning using Kubernetes - AI Conclave 2019
Machine Learning using Kubernetes - AI Conclave 2019Machine Learning using Kubernetes - AI Conclave 2019
Machine Learning using Kubernetes - AI Conclave 2019
Arun Gupta
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
Julien SIMON
 
High-Performance-Computing-on-AWS-and-Industry-Simulation
High-Performance-Computing-on-AWS-and-Industry-SimulationHigh-Performance-Computing-on-AWS-and-Industry-Simulation
High-Performance-Computing-on-AWS-and-Industry-Simulation
Amazon Web Services
 
MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNet
Apache MXNet
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統
Amazon Web Services
 
Building well architected .NET applications - SVC209 - Atlanta AWS Summit
Building well architected .NET applications - SVC209 - Atlanta AWS SummitBuilding well architected .NET applications - SVC209 - Atlanta AWS Summit
Building well architected .NET applications - SVC209 - Atlanta AWS Summit
Amazon Web Services
 
Well Archictecture Framework dotNET.pdf
Well Archictecture Framework dotNET.pdfWell Archictecture Framework dotNET.pdf
Well Archictecture Framework dotNET.pdf
ConradoDeBiasi
 
[NEW LAUNCH] Introducing AWS Deep Learning Containers
[NEW LAUNCH] Introducing AWS Deep Learning Containers[NEW LAUNCH] Introducing AWS Deep Learning Containers
[NEW LAUNCH] Introducing AWS Deep Learning Containers
Amazon Web Services
 
Technical Essentials Training: AWS Innovate Ottawa
Technical Essentials Training: AWS Innovate OttawaTechnical Essentials Training: AWS Innovate Ottawa
Technical Essentials Training: AWS Innovate Ottawa
Amazon Web Services
 
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
How Netflix Tunes Amazon EC2 Instances for Performance - CMP325 - re:Invent 2017
Amazon Web Services
 
Modern-Application-Design-with-Amazon-ECS
Modern-Application-Design-with-Amazon-ECSModern-Application-Design-with-Amazon-ECS
Modern-Application-Design-with-Amazon-ECS
Amazon Web Services
 
What Can HPC on AWS Do?
What Can HPC on AWS Do?What Can HPC on AWS Do?
What Can HPC on AWS Do?
inside-BigData.com
 
High Performance Computing on AWS
High Performance Computing on AWSHigh Performance Computing on AWS
High Performance Computing on AWS
Amazon Web Services
 
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
Running Amazon Elastic Compute Cloud (Amazon EC2) workloads at scale - CMP202...
Amazon Web Services
 
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
AWS Summits
 
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019Optimize your Machine Learning workloads  | AWS Summit Tel Aviv 2019
Optimize your Machine Learning workloads | AWS Summit Tel Aviv 2019
Amazon Web Services
 

Similar to Setting up custom machine learning environments on AWS - AIM309 - New York AWS Summit (20)

Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
Setting up custom machine learning environments on AWS - AIM204 - Chicago AWS...
 
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS SummitDeploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
Deploying Cost-Effective Machine Learning Models - AIM204 - Anaheim AWS Summit
 
Deploying cost-effective machine learning models - AIM202 - Atlanta AWS Summit
Deploying cost-effective machine learning models - AIM202 - Atlanta AWS SummitDeploying cost-effective machine learning models - AIM202 - Atlanta AWS Summit
Deploying cost-effective machine learning models - AIM202 - Atlanta AWS Summit
 
[AWS Tech Talk] Using containers for deep learning workflows
[AWS Tech Talk] Using containers for deep learning workflows[AWS Tech Talk] Using containers for deep learning workflows
[AWS Tech Talk] Using containers for deep learning workflows
 
Machine Learning using Kubernetes - AI Conclave 2019
Machine Learning using Kubernetes - AI Conclave 2019Machine Learning using Kubernetes - AI Conclave 2019
Machine Learning using Kubernetes - AI Conclave 2019
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
 
High-Performance-Computing-on-AWS-and-Industry-Simulation
High-Performance-Computing-on-AWS-and-Industry-SimulationHigh-Performance-Computing-on-AWS-and-Industry-Simulation
High-Performance-Computing-on-AWS-and-Industry-Simulation
 
MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNet
 
利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統利用 Fargate - 無伺服器的容器環境建置高可用的系統
利用 Fargate - 無伺服器的容器環境建置高可用的系統
 
Building well architected .NET applications - SVC209 - Atlanta AWS Summit
Building well architected .NET applications - SVC209 - Atlanta AWS SummitBuilding well architected .NET applications - SVC209 - Atlanta AWS Summit
Setting up custom machine learning environments on AWS - AIM309 - New York AWS Summit

  • 1. Setting up custom machine learning environments on AWS (AIM309)
    Shashank Prasanna, Sr. Technical Evangelist, AI/ML, AWS
    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 2. Agenda
    A data scientist’s journey to cloud
      • Meet Jane Roe, lead data scientist at a self-driving car startup
      • Challenges with machine learning (ML) infrastructure
    Evaluating AWS capabilities for machine learning
      • Custom ML environment #1 – research and prototyping
      • Custom ML environment #2 – getting production ready
      • Custom ML environment #3 – training and deployment at scale
    Summary and Q&A
  • 3. A data scientist’s journey to cloud: on-premises
    Meet Jane Roe, head of data science and research, responsible for research and development of algorithms for self-driving cars.
    Infrastructure
      • Workstations with GPUs for experimentation
      • Data center with more compute and storage
    Software
      • Open-source machine learning frameworks
      • Custom proprietary toolchains for self-driving use cases
      • IT-provisioned schedulers, orchestrators, and job monitoring
  • 4. Challenges with machine learning infrastructure
    Software management
      • Building, testing, and maintaining machine learning (ML) frameworks and in-house customizations
      • Keeping up with the latest from the open-source AI community
    Performance optimizations
      • Optimizing the full stack (drivers, libraries, dependencies) for CPUs and GPUs, and for training and inference deployments
    Collaborative development
      • Sharing, collaborating on, and testing the full stack in different environments and getting reproducible results
    Infrastructure management
      • Working with job schedulers, orchestrators, and monitoring tools provisioned and managed by centralized IT teams
    Scalability
      • Scaling compute resources in bursts for specific experiments
      • Planning and forecasting the need for compute capacity over time
  • 5. Evaluating AWS capabilities for machine learning
    [Diagram: the AWS ML stack in three numbered layers]
      • AI services (vision, speech, language, chatbots, forecasting, recommendations): Polly, Transcribe, Translate, Comprehend, Lex, Forecast, Rekognition Image, Rekognition Video, Textract, Personalize
      • ML services: Amazon SageMaker (Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting)
      • ML frameworks + infrastructure (frameworks, interfaces, infrastructure): FPGAs, EC2 P3 & P3dn, EC2 G4, EC2 C5, Inferentia, Greengrass, Elastic Inference
  • 6. Evaluating AWS capabilities for machine learning
      ✓ Custom environments: customized setups with domain-specific modifications to open-source frameworks
      ✓ Custom pipelines: integrated pipelines built with in-house upstream and downstream tools
      ✓ Flexibility: a variety of compute (CPUs, GPUs, FPGAs, inference accelerators) and storage options
  • 7. Custom setup 1 – research and prototyping
    Requirements, accessed from the CLI:
      1. Compute (CPUs, GPUs)
      2. Storage
      3. Source control
      4. Frameworks
  • 8. Custom setup 1 – research and prototyping
    [Diagram: a CLI user connects to an EC2 GPU instance in the AWS Cloud running the DL AMI, with Amazon EBS (datasets and checkpoints), Amazon S3 (trained models and metadata), and AWS CodeCommit (private Git repository)]
      1. P3 instances with up to 8 NVIDIA V100 GPUs per instance
      2. Amazon EBS with up to 16 TiB of block storage, and Amazon S3 for models and metadata
      3. AWS CodeCommit private Git repository
      4. AWS Deep Learning AMI with pre-configured, optimized frameworks
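The single-instance setup on slide 8 can be written down as a launch request. The following is a minimal sketch, not the speaker's demo: it only builds a dictionary shaped like the keyword arguments of boto3's `ec2.run_instances` and makes no AWS call. The AMI ID, key pair name, root device name, volume type, and volume size are placeholder assumptions; look up the actual Deep Learning AMI ID for your region before using anything like this.

```python
import json

# p3.16xlarge is a P3 size with 8 NVIDIA V100 GPUs.
INSTANCE_TYPE = "p3.16xlarge"
# 16 TiB upper bound for an EBS volume, expressed in GiB.
EBS_MAX_GIB = 16 * 1024


def build_launch_request(ami_id, key_name, volume_gib):
    """Return keyword arguments shaped for boto3's ec2.run_instances."""
    if not 1 <= volume_gib <= EBS_MAX_GIB:
        raise ValueError(f"volume size must be between 1 and {EBS_MAX_GIB} GiB")
    return {
        "ImageId": ami_id,              # placeholder: a Deep Learning AMI ID
        "InstanceType": INSTANCE_TYPE,
        "KeyName": key_name,            # placeholder: an existing key pair
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                # Assumed root device name; it differs between AMIs.
                "DeviceName": "/dev/sda1",
                "Ebs": {"VolumeSize": volume_gib, "VolumeType": "gp2"},
            }
        ],
    }


request = build_launch_request("ami-EXAMPLE", "my-key-pair", 500)
print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("ec2").run_instances(**request)`; that call is left out here so the sketch runs without an AWS account.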
  • 9. AWS Deep Learning Amazon Machine Image (DL AMI)
      • Pre-configured with popular deep-learning frameworks and interfaces
      • Optimized for performance with the latest NVIDIA driver, CUDA libraries, and Intel libraries
      • Dedicated Conda environments for each framework
      • Customizable and extensible for custom workflows
    AWS DL AMI options
      • OS: Ubuntu, Amazon Linux, Amazon Linux 2
      • AMI: Conda AMI, Base AMI
  • 10. DEMO
  • 11. Custom setup 1 – benefits and gaps
    Benefits
      • Replicates the on-premises development experience
      • AWS Deep Learning AMI with optimized frameworks
      • Up to 8 NVIDIA V100 GPUs per instance, 96 vCPUs, and 768 GB of memory; up to 16 TiB with Amazon EBS
    Gaps
      • Each user works on their own dedicated instance, which leads to underutilization
      • Hard to collaborate on the full development stack and to guarantee reproducibility across environments
      • Distributed training and large-scale parallel experiments are difficult because they require managing multiple instances and configuring them manually
  • 12. Custom setup 2 – Getting production ready
    Requirements, for multiple CLI users in the AWS Cloud:
      1. Compute (CPUs, GPUs)
      2. Storage
      3. Source control
      4. Frameworks
      5. Reproducible environments
      6. Shared file system
  • 13. Custom setup 2 – Getting production ready
    [Diagram: three CLI users, each on an EC2 GPU instance running AWS DL containers, sharing Amazon S3 (trained models and metadata), Amazon ECR (Docker container registry), AWS CodeCommit (private Git repository), and Amazon EFS (shared file system)]
      5. AWS Deep Learning Containers provide lightweight, reproducible environments, managed with Amazon ECR
      6. Amazon EFS and Amazon FSx for Lustre offer scalable, elastic file storage accessible simultaneously by multiple EC2 instances
  • 14. AWS Deep Learning Containers
    [Diagram: two framework groups, each with Training (4) and Inference (4) container images covering CPU, GPU, Python 2, and Python 3]
      • 16 container images configured for training, inference, CPUs, GPUs, and multiple frameworks to meet your needs
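The count of 16 on slide 14 follows from the four dimensions listed there: 2 frameworks × 2 job types × 2 device types × 2 Python versions. A short sketch that enumerates the combinations; the framework names and the label format are hypothetical stand-ins, not actual image names or tags.

```python
from collections import Counter
from itertools import product

frameworks = ["framework-a", "framework-b"]  # hypothetical stand-ins
job_types = ["training", "inference"]
devices = ["cpu", "gpu"]
pythons = ["py2", "py3"]

# One label per combination; the label format is illustrative only.
images = [
    f"{fw}-{job}:{dev}-{py}"
    for fw, job, dev, py in product(frameworks, job_types, devices, pythons)
]

# Images per (framework, job type) pair, matching "Training (4)" and "Inference (4)".
per_group = Counter(label.split(":")[0] for label in images)

print(len(images))  # 16
for group, count in sorted(per_group.items()):
    print(group, count)
```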
  • 15. DEMO
  • 16. Custom setup 2 – benefits and gaps
    Benefits
      • Lightweight, scalable, and consistent development and deployment environments
      • Containers package code, configuration, and dependencies, and can be extended by collaborators
      • DL containers are fully configured and validated, and include performance optimizations for CPU, GPU, training, and inference workflows
    Gaps
      • Difficult to scale out training, since users need to manage multiple instances
      • Additional software and setup are needed to run distributed training and large-scale parallel experiments
  • 17. Custom setup 3 – training and deployment at scale
    Requirements, for multiple CLI users in the AWS Cloud:
      1. Compute (CPUs, GPUs)
      2. Storage
      3. Source control
      4. Frameworks
      5. Reproducible environments
      6. Shared file system
      7. Cluster management, scheduling, and orchestration
  • 18. Custom setup 3 – training and deployment at scale
    [Diagram: three CLI users on EC2 instances, each reaching an EKS endpoint for one of three clusters, with shared Amazon S3 (trained models and metadata), Amazon ECR (Docker container registry), AWS CodeCommit (private Git repository), and Amazon EFS (shared file system)]
      • EKS cluster #1: large-scale experiments
      • EKS cluster #2: distributed training
      • EKS cluster #3: testing and validation
      7. Amazon EKS makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.
  • 19. AWS container services
    Image registry: container image repository
      • Amazon Elastic Container Registry (Amazon ECR)
    Management: deployment, scheduling, scaling, and management of containerized applications
      • Amazon Elastic Kubernetes Service (Amazon EKS)
      • Amazon Elastic Container Service (Amazon ECS)
    Compute: where the containers run
      • Amazon EC2
      • AWS Fargate
  • 20. Deep learning training on Amazon EKS
    Steps from the CLI: create the cluster, then submit a training job.
    Create cluster:
      eksctl create cluster --name eks-gpu --version 1.12 --region us-west-2 \
        --nodegroup-name gpu-nodes --node-type p3.8xlarge --nodes 8 \
        --timeout=40m --ssh-access --ssh-public-key=<public-key> --auto-kubeconfig
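For scripted cluster creation, the `eksctl` invocation on slide 20 can be assembled as an argument list. This is a sketch under assumptions: the helper name, the GPU lookup table, and the key name `my-public-key` are illustrative and not from the talk, and the flags are only the ones shown on the slide. Nothing is executed; the command is only printed.

```python
import shlex

# NVIDIA V100 GPUs per instance for the P3 sizes.
GPUS_PER_NODE = {"p3.2xlarge": 1, "p3.8xlarge": 4, "p3.16xlarge": 8}


def eksctl_create_cluster_args(name, region, node_type, nodes, public_key):
    """Build the argument list for the eksctl command shown on the slide."""
    return [
        "eksctl", "create", "cluster",
        "--name", name,
        "--version", "1.12",
        "--region", region,
        "--nodegroup-name", "gpu-nodes",
        "--node-type", node_type,
        "--nodes", str(nodes),
        "--timeout=40m",
        "--ssh-access",
        f"--ssh-public-key={public_key}",
        "--auto-kubeconfig",
    ]


# "my-public-key" is a placeholder for the <public-key> value on the slide.
args = eksctl_create_cluster_args("eks-gpu", "us-west-2", "p3.8xlarge", 8, "my-public-key")
total_gpus = 8 * GPUS_PER_NODE["p3.8xlarge"]

print(shlex.join(args))
print("total GPUs in node group:", total_gpus)  # 32
```

Passing `args` to `subprocess.run(args, check=True)` would run the command, assuming `eksctl` is installed and AWS credentials are configured. Building a list rather than a single string avoids shell quoting problems in the key name.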
  • 21. DEMO
  • 22. Custom setup 3 – benefits and gaps
    Benefits
      • Easily scale out training; let Amazon EKS or Amazon ECS manage scheduling and orchestration of workloads
      • Improve cluster utilization and save cost by sharing available resources
      • Use cases:
          o Run large-scale experiments (model architecture and hyperparameter search) to find the best model and approach
          o Run distributed training jobs on large datasets
    Gaps
      • Users still need to manage the EC2 instances that are part of an Amazon EKS or Amazon ECS cluster
      • Users still need to set up experiment and workflow management tools such as Kubeflow, Argo, etc.
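The hyperparameter-search use case on slide 22 amounts to submitting one containerized job per parameter combination and letting the cluster schedule them. A minimal sketch, assuming a training script named `train.py` inside the image, a placeholder ECR image URI, and an illustrative parameter grid (none of these come from the talk). It generates Kubernetes `batch/v1` Job manifests as dictionaries and does not contact a cluster. The `nvidia.com/gpu` limit assumes the NVIDIA device plugin is running on the GPU nodes.

```python
import json
from itertools import product

# Placeholder image URI in the ECR naming format; fill in real values.
IMAGE = "<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>"


def training_job(name, image, args, gpus):
    """Return a Kubernetes batch/v1 Job manifest for one training run."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed experiment
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "command": ["python", "train.py"],  # assumed entry point
                            "args": args,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


# Illustrative search space: 3 learning rates x 2 batch sizes = 6 jobs.
grid = {"lr": [0.1, 0.01, 0.001], "batch_size": [64, 128]}
keys = sorted(grid)

jobs = []
for i, values in enumerate(product(*(grid[k] for k in keys))):
    job_args = [f"--{k}={v}" for k, v in zip(keys, values)]
    jobs.append(training_job(f"search-{i}", IMAGE, job_args, gpus=1))

print(len(jobs))
print(json.dumps(jobs[0], indent=2))
```

Each manifest could be written to a file and submitted with `kubectl apply -f`, which accepts JSON as well as YAML. Tools such as Kubeflow, named on the slide, provide managed versions of this pattern.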
  • 23. Amazon SageMaker: End-to-end machine learning
    [Diagram: workflow steps paired with the matching SageMaker capability]
      • Collect and prepare training data: pre-built notebooks for common problems
      • Choose and optimize your machine learning algorithm: built-in, high-performance algorithms
      • Set up and manage environments for training: one-click training on the highest performance infrastructure
      • Train and tune model (trial and error): model optimization
      • Deploy model in production: one-click deployment
      • Scale and manage the production environment: fully managed with automatic scaling
  • 24. AWS benefits for custom ML environments
    Software management
      • AWS Deep Learning AMI and AWS Deep Learning Containers include popular deep-learning frameworks and come fully configured and validated.
    Performance optimizations
      • DL AMI and DL containers include frameworks optimized by experts to deliver the best training and inference performance on CPUs and GPUs.
    Collaborative development
      • With DL containers, Amazon ECR, and AWS CodeCommit, collaborative development across different environments is easy.
    Infrastructure management
      • Simplify infrastructure management, scheduling, and orchestration with Amazon container services such as Amazon EKS, Amazon ECS, and Amazon ECR.
    Scalability
      • Scale out in bursts for large-scale training experiments or for long-running distributed training jobs. Use as many or as few resources as you need.
  • 25. Thank you!
    Shashank Prasanna, @shshnkp