Setting up custom machine learning environments on AWS - AIM309 - New York AWS Summit

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Setting up custom machine learning
environments on AWS
Shashank Prasanna
Sr. Technical Evangelist, AI/ML
AWS
AIM309

Agenda
A data scientist’s journey to cloud
• Meet Jane Roe, lead data scientist at a self-driving car startup
• Challenges with machine learning (ML) infrastructure
Evaluating AWS capabilities for machine learning
• Custom ML environment #1 – research and prototyping
• Custom ML environment #2 – getting production ready
• Custom ML environment #3 – training and deployment at scale
Summary and Q&A

A data scientist’s journey to cloud
On-premises
Meet Jane Roe, head of data science and research
Responsible for research and development of algorithms
for self-driving cars
Infrastructure
• Workstations with GPUs for experimentation
• Data center with more compute and storage
Software
• Open-source machine learning frameworks
• Custom proprietary toolchains for self-driving use cases
• IT-provisioned schedulers, orchestrators, and job monitoring

Challenges with machine learning infrastructure
Software management
• Building, testing, and maintaining machine learning (ML) frameworks and in-house customizations
• Keeping up with the latest from the open-source AI community
Performance optimizations
• Optimizing the full stack (drivers, libraries, dependencies) for CPUs and GPUs, and for training and inference deployments
Collaborative development
• Sharing, collaborating on, and testing the full stack in different environments and getting reproducible results
Infrastructure management
• Working with job schedulers, orchestrators, and monitoring tools provisioned and managed by centralized IT teams
Scalability
• Scaling compute resources in bursts for specific experiments
• Planning and forecasting the need for compute capacity over time

AI services
• Vision: Amazon Rekognition Image, Amazon Rekognition Video, Amazon Textract
• Speech: Amazon Polly, Amazon Transcribe
• Language: Amazon Translate, Amazon Comprehend
• Chatbots: Amazon Lex
• Forecasting: Amazon Forecast
• Recommendations: Amazon Personalize
ML services
• Amazon SageMaker: Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting
ML frameworks + infrastructure
• Frameworks, interfaces, infrastructure
• EC2 P3 & P3dn, EC2 G4, EC2 C5, FPGAs, Greengrass, Elastic Inference, Inferentia

✓ Custom environments: Customized setups with domain-specific modifications to open-source
frameworks
✓ Custom pipelines: Build integrated pipelines with in-house developed upstream and downstream
tools
✓ Flexibility: Variety of compute (CPUs, GPUs, FPGAs, inference accelerators) and storage options

Custom setup 1 – research and prototyping
1. Compute (CPUs, GPUs)
2. Storage
3. Source control
4. Frameworks
CLI

Custom setup 1 – research and prototyping
[Diagram: a user working from the CLI connects to the AWS Cloud]
• EC2 instance with GPUs, running the DL AMI
• Amazon EBS: datasets and checkpoints
• Amazon S3: trained models and metadata
• AWS CodeCommit: private Git repository
1. P3 instances with up to 8 NVIDIA V100 GPUs per instance
2. Amazon EBS with up to 16 TiB of block storage, and Amazon S3 for models and metadata
3. AWS CodeCommit private Git repository
4. AWS Deep Learning AMI with pre-configured, optimized frameworks
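The four components above can be tied together when launching the instance. The following is a minimal sketch, not the presenter's demo: it only assembles the arguments for `aws ec2 run-instances` so they can be reviewed before running. The AMI ID, key pair name, root device name (`/dev/sda1`), `gp2` volume type, and the helper name `run_instances_command` are all assumptions for illustration.

```python
import json
import shlex

MAX_EBS_GIB = 16 * 1024  # 16 TiB, the EBS volume limit quoted on the slide


def run_instances_command(ami_id, key_name, instance_type="p3.16xlarge", volume_gib=500):
    """Build the argv for `aws ec2 run-instances` (hypothetical helper).

    p3.16xlarge is the P3 size with 8 NVIDIA V100 GPUs. The root device
    name and volume type below are assumptions; check them against the
    AMI you actually use.
    """
    if not 1 <= volume_gib <= MAX_EBS_GIB:
        raise ValueError("EBS volume size must be between 1 GiB and 16 TiB")
    block_devices = [
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": volume_gib, "VolumeType": "gp2"}}
    ]
    return [
        "aws", "ec2", "run-instances",
        "--image-id", ami_id,
        "--instance-type", instance_type,
        "--key-name", key_name,
        "--count", "1",
        "--block-device-mappings", json.dumps(block_devices),
    ]


# Placeholder values: substitute a real DL AMI ID and key pair name.
print(shlex.join(run_instances_command("ami-EXAMPLE", "my-key")))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; the output can be pasted into a shell once the placeholders are replaced.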

AWS Deep Learning Amazon Machine Image (DL AMI)
• Pre-configured with popular deep-learning
frameworks and interfaces
• Optimized for performance with latest NVIDIA
driver, CUDA libraries, and Intel libraries
• Dedicated Conda environments for each
framework
• Customizable and extensible for custom
workflows
AWS DL AMIs
OS
• Ubuntu
• Amazon Linux
• Amazon Linux 2
AMI
• Conda AMI
• Base AMI
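Because the Conda AMI keeps each framework in its own environment, a quick check of what the active environment can import is often useful. This is a small sketch using only the Python standard library; the module names (`tensorflow`, `mxnet`, `torch`) and the helper name `available_frameworks` are assumptions, not something the slides specify.

```python
import importlib.util


def available_frameworks(names=("tensorflow", "mxnet", "torch")):
    """Map each top-level module name to True if it is importable
    in the currently active Python environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Run inside an activated Conda environment to see which frameworks it exposes.
print(available_frameworks())
```

The check does not import the frameworks themselves, so it returns quickly even on a machine without GPUs.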

DEMO

Custom setup 1 – benefits and gaps
Benefits
• Replicate on-premises development experience
• AWS Deep Learning AMI with optimized frameworks
• Up to 8 NVIDIA V100 GPUs per instance, 96 vCPUs, and 768 GB of memory, plus up to 16 TiB of Amazon EBS storage
Gaps
• Each user works on their own dedicated instance, which leads to underutilization
• Hard to collaborate on the full development stack and guarantee reproducibility across different environments
• Distributed training and large-scale parallel experiments are difficult, because they require managing multiple instances and configuring them manually

Custom setup 2 – Getting production ready
[Diagram: three users, each working from the CLI, connect to the AWS Cloud]
1. Compute (CPUs, GPUs)
2. Storage
3. Source control
4. Frameworks
5. Reproducible environments
6. Shared file system

Custom setup 2 – Getting production ready
[Diagram: three users, each working from the CLI, connect to the AWS Cloud; each uses an EC2 instance with GPUs running AWS DL containers]
• Amazon S3: trained models and metadata
• Amazon ECR: Docker container registry
• AWS CodeCommit: private Git repository
• Amazon EFS: shared file system
5. AWS Deep Learning Containers provide lightweight, reproducible environments, managed with Amazon ECR
6. Amazon EFS and Amazon FSx for Lustre offer scalable, elastic file storage accessible simultaneously by multiple EC2 instances

AWS Deep Learning Containers
Containers (the same set for each of two frameworks):
• Training (4), Inference (4)
• CPU, GPU
• Python 2, Python 3
Deep Learning Containers
• 16 container images configured for training, inference, CPUs,
GPUs, and multiple frameworks to meet your needs
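The count of 16 images follows from the combinations listed above: two frameworks, two job types, two device types, and two Python versions. A short sketch of that arithmetic is below. The framework names are an assumption (the slide showed them only as logos), the variant labels are illustrative rather than real image tags, and `ecr_image_uri` is a hypothetical helper that only follows the general Amazon ECR address layout.

```python
from itertools import product

FRAMEWORKS = ("tensorflow", "mxnet")  # assumption: names not present in the slide text
JOB_TYPES = ("training", "inference")
DEVICES = ("cpu", "gpu")
PYTHONS = ("py2", "py3")


def container_variants():
    """Enumerate every framework/job/device/Python combination.

    The labels are illustrative only and are not real image tags.
    """
    return [
        f"{fw}-{job}:{dev}-{py}"
        for fw, job, dev, py in product(FRAMEWORKS, JOB_TYPES, DEVICES, PYTHONS)
    ]


def ecr_image_uri(account_id, region, repository, tag):
    """Format an image address in the general Amazon ECR layout."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


variants = container_variants()
print(len(variants))  # 2 * 2 * 2 * 2 = 16
```

Each framework contributes 4 training and 4 inference images (2 devices × 2 Python versions), which matches the "Training (4)" and "Inference (4)" labels.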

Custom setup 2 – benefits and gaps
Benefits
• Lightweight, scalable, and consistent development and deployment environments
• Packages code, configuration, and dependencies that can be extended by collaborators
• DL containers are fully configured and validated and include performance optimizations for CPU,
GPU, training, and inference workflows
Gaps
• Difficult to scale out training, since users need to manage multiple instances
• Need additional software and setup to run distributed training and large-scale parallel experiments

Custom setup 3 – training and deployment at scale
[Diagram: three users, each working from the CLI, connect to the AWS Cloud]
1. Compute (CPUs, GPUs)
2. Storage
3. Source control
4. Frameworks
5. Reproducible environments
6. Shared file system
7. Cluster management, scheduling, and orchestration

Custom setup 3 – training and deployment at scale
[Diagram: three users, each working from the CLI, connect to an EKS endpoint in the AWS Cloud; each cluster runs on EC2 instances]
• EKS cluster #1: large-scale experiments
• EKS cluster #2: distributed training
• EKS cluster #3: testing and validation
• Amazon S3: trained models and metadata
• Amazon ECR: Docker container registry
• AWS CodeCommit: private Git repository
• Amazon EFS: shared file system
7. Amazon EKS makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.

AWS container services
Image registry (container image repository)
• Amazon Elastic Container Registry (Amazon ECR)
Management (deployment, scheduling, scaling, and management of containerized applications)
• Amazon Elastic Kubernetes Service (Amazon EKS)
• Amazon Elastic Container Service (Amazon ECS)
Compute (where the containers run)
• Amazon EC2
• AWS Fargate

Deep learning training on Amazon EKS
Create cluster (CLI)
eksctl create cluster \
  --name eks-gpu \
  --version 1.12 \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type p3.8xlarge \
  --nodes 8 \
  --timeout=40m \
  --ssh-access \
  --ssh-public-key=<public-key> \
  --auto-kubeconfig
Submit a training job
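The slide does not preserve what the submitted job looked like, so the following is only a sketch of one common approach: a Kubernetes `batch/v1` Job that requests GPUs through the `nvidia.com/gpu` resource, which assumes the NVIDIA device plugin is installed on the cluster. The job name, image address, training command, and the helper name `training_job_manifest` are hypothetical. The default of 4 GPUs matches one p3.8xlarge node, so the 8-node cluster above would offer 32 GPUs in total.

```python
import json


def training_job_manifest(name, image, command, gpus=4):
    """Build a single-pod Kubernetes Job manifest that requests GPUs.

    This covers a single-node job only; multi-node distributed training
    needs additional tooling that is not shown here.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            "command": list(command),
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


# Placeholder image and command: replace with your own container and script.
manifest = training_job_manifest(
    "train-example",
    "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training:v1",
    ["python", "train.py"],
)
print(json.dumps(manifest, indent=2))
```

`kubectl apply -f` accepts JSON as well as YAML, so the printed manifest can be saved to a file and submitted as is.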

Custom setup 3 – benefits and gaps
Benefits
• Easily scale out training by letting Amazon EKS or Amazon ECS manage scheduling and orchestration of workloads
• Improve cluster utilization and save cost by sharing available resources
• Use cases:
o Run large-scale experiments (model architecture and hyperparameter search) to find the best
model and approach
o Run distributed training jobs on large datasets
Gaps
• Users still need to manage the EC2 instances that are part of an Amazon EKS or Amazon ECS cluster
• Users still need to set up experiment and workflow management tools such as Kubeflow, Argo, etc.
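For the large-scale experiments use case, each point in a hyperparameter search typically becomes one job on the cluster. The sketch below shows one simple way to expand a search space into per-job configurations; the parameter names, values, and the helper names `experiment_grid` and `job_name` are made up for illustration, and a tool such as Kubeflow would normally handle this bookkeeping.

```python
from itertools import product


def experiment_grid(search_space):
    """Expand {parameter: [values]} into a list of per-job configurations.

    Keys are sorted so the ordering of the grid is deterministic.
    """
    keys = sorted(search_space)
    return [
        dict(zip(keys, values))
        for values in product(*(search_space[key] for key in keys))
    ]


def job_name(prefix, index):
    """Give each experiment a stable, sortable job name."""
    return f"{prefix}-{index:03d}"


# Hypothetical search space: 2 learning rates x 3 batch sizes = 6 jobs.
grid = experiment_grid({"lr": [0.1, 0.01], "batch_size": [64, 128, 256]})
for index, config in enumerate(grid):
    print(job_name("hpo", index), config)
```

A full grid grows multiplicatively with each added parameter, which is why sharing a cluster across many short jobs tends to use resources better than dedicating one instance per user.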

Amazon SageMaker: End-to-end machine learning
• Collect and prepare training data: pre-built notebooks for common problems
• Choose and optimize your machine learning algorithm: built-in, high-performance algorithms
• Set up and manage environments for training: one-click training on the highest-performance infrastructure
• Train and tune model (trial and error): model optimization
• Deploy model in production: one-click deployment
• Scale and manage the production environment: fully managed with automatic scaling

AWS benefits for custom ML environments
Software management
• AWS Deep Learning AMI and AWS Deep Learning Containers include popular deep-learning frameworks and come fully configured and validated.
Performance optimizations
• DL AMI and DL containers include frameworks optimized by experts to deliver the best training and inference performance on CPUs and GPUs.
Collaborative development
• With DL containers, Amazon ECR, and AWS CodeCommit, collaborative development across different environments is easy.
Infrastructure management
• Simplify infrastructure management, scheduling, and orchestration with Amazon container services such as Amazon EKS, Amazon ECS, and Amazon ECR.
Scalability
• Scale out in bursts for large-scale training experiments or for long-running distributed training jobs. Use as many or as few resources as you need.
