This document discusses tools for AI development including Visual Studio Code, the Data Science Virtual Machine (DSVM), and Azure Batch AI. It provides overviews and links to resources for setting up environments for AI/ML development and training models at scale using these tools on Azure. Key points covered include Visual Studio Code extensions for AI, the DSVM for local development, Azure Batch AI for distributed training at scale, and tools like aztk and Spark on DSVM for end-to-end model development.
9. Visual Studio [Code] Tools for AI
• VS & VS Code extensions to streamline computations on servers, Azure ML, Batch AI, …
• End-to-end development environment, from new project through training
• Support for remote training & job management
• On top of all the goodness of VS (Python, Jupyter, Git, etc.)
THR3129: Getting Started with Visual Studio Tools for AI, Chris Lauren
14. Single VM Development
• Local tools
• Local debug
• Faster experimentation
Scale Up
• Larger VMs
• GPU
Scale Out
• Multi-node
• Remote Spark
• Batch nodes
• VM Scale Sets
15.
Series          RAM     vCPU   GPU                 Approx. cost
Standard_B1s    1 GB    1      None                Free [*]
DS3_v2          14 GB   4      None                $0.23/hr
DS4_v2          28 GB   8      None                $0.46/hr
A8v2            16 GB   8      None                $0.82/hr
Standard_NC6    56 GB   6      0.5x NV Tesla K80   $0.93/hr
Standard_ND6s   112 GB  6      1x Tesla P40        $2.14/hr
[*] Not recommended: Standard_B1s (free, but too small to be useful)
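To make the table concrete, here is a small sketch comparing the cost of a hypothetical training job across SKUs, using the approximate hourly prices above. The 20-hour job length and the assumed 8x GPU speed-up are invented example figures, not benchmarks.

```python
# Approximate hourly prices taken from the table above.
HOURLY_RATE = {
    "DS3_v2": 0.23,
    "DS4_v2": 0.46,
    "A8v2": 0.82,
    "Standard_NC6": 0.93,
    "Standard_ND6s": 2.14,
}

def job_cost(sku: str, hours: float) -> float:
    """Approximate cost of running one VM of the given SKU for `hours`."""
    return HOURLY_RATE[sku] * hours

# A GPU SKU costs more per hour but may finish far sooner; assume
# (hypothetically) the NC6 trains 8x faster than the DS3_v2.
cpu_cost = job_cost("DS3_v2", 20)             # 20 h on a CPU SKU
gpu_cost = job_cost("Standard_NC6", 20 / 8)   # 2.5 h on a GPU SKU
print(f"CPU: ${cpu_cost:.2f}, GPU: ${gpu_cost:.2f}")
```

The point the table is making: per-hour price alone is misleading; the right question is cost per finished job.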
20. Azure Batch
Batch pools
• Configure and create VMs to cater for any scale: tens to thousands.
• Automatically scale the number of VMs to maximize utilization.
• Choose the VM size most suited to your application.
Batch jobs and tasks
• A task is a unit of execution; task = command-line application.
• Jobs are created and tasks are submitted to a pool; tasks are queued, then assigned to VMs.
• Any application, any execution time; run applications unchanged.
• Automatic detection and retry of frozen or failing tasks.
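The pool/job/task model above can be sketched in plain Python. This is a toy simulation of the concepts (queued tasks, a pool that autoscales toward the pending workload, retry of failing tasks), not the Azure Batch SDK; all names and the tasks-per-VM heuristic are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass
import math

@dataclass
class Task:
    cmd: str            # a task is just a command-line application
    retries: int = 0

@dataclass
class Pool:
    vm_count: int = 0
    max_vms: int = 10

    def autoscale(self, pending: int, tasks_per_vm: int = 4) -> None:
        # Grow/shrink the pool toward the pending workload, capped at max_vms.
        self.vm_count = min(self.max_vms, math.ceil(pending / tasks_per_vm)) if pending else 0

def run_job(tasks, pool, max_retries=2, fails=None):
    """Queue tasks, assign them to pool VMs, and retry failing ones."""
    fails = set(fails or ())
    queue = deque(tasks)
    done = []
    while queue:
        pool.autoscale(len(queue))
        # Each VM takes one task per scheduling round.
        for _ in range(min(pool.vm_count, len(queue))):
            t = queue.popleft()
            if t.cmd in fails and t.retries < max_retries:
                t.retries += 1          # detected as failing -> requeue
                fails.discard(t.cmd)    # assume it succeeds on retry
                queue.append(t)
            else:
                done.append(t)
    pool.autoscale(0)                   # scale the pool back down when idle
    return done
```

Running `run_job([Task(f"train --shard {i}") for i in range(8)], Pool())` completes all eight tasks and leaves the pool scaled to zero, mirroring the "maximize utilization" bullet.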
21. Cost savings
• Scale cluster size up and down as needed
• Reserved Instances for persistent infrastructure
• Per-second billing for VMs
• Flexible consumption and savings with low-priority VMs
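A quick back-of-the-envelope sketch of the last two bullets. The 80% low-priority discount is an assumed example figure, not a quoted price; the dedicated rate reuses the Standard_NC6 price from the earlier table.

```python
# Illustrative savings from per-second billing and low-priority VMs.
DEDICATED_RATE_PER_HOUR = 0.93   # e.g. Standard_NC6 from the earlier table
LOW_PRIORITY_DISCOUNT = 0.80     # hypothetical discount, for illustration only

def cost(rate_per_hour: float, seconds: int) -> float:
    # Per-second billing: pay only for the seconds actually used.
    return rate_per_hour * seconds / 3600

run_seconds = 1900                              # a ~32-minute training run
per_second = cost(DEDICATED_RATE_PER_HOUR, run_seconds)
hourly_rounded = DEDICATED_RATE_PER_HOUR * 1    # same run billed in whole hours
low_pri = per_second * (1 - LOW_PRIORITY_DISCOUNT)
print(f"per-second: ${per_second:.2f}  hourly-rounded: ${hourly_rounded:.2f}  "
      f"low-priority: ${low_pri:.2f}")
```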
22. Scaling AI with DSVM and Batch AI
• DSVM (dev/test workstation): create the Python scripts
• Azure File Store: store the Python scripts in the file share
• Azure Batch AI cluster: run the script across the cluster
• Output: trained AI model
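The flow above can be mimicked locally as a stand-in: write a training script to a shared directory (playing the role of the Azure file share), then execute it the way a cluster node would. The paths and the one-line script are invented for illustration; a real Batch AI node would mount the share and launch the script as a task.

```python
import subprocess, sys, tempfile
from pathlib import Path

share = Path(tempfile.mkdtemp())            # stands in for the Azure file share

# Step 1 (DSVM): create the Python training script.
script = share / "train.py"
script.write_text("print('model trained')\n")

# Step 2 (Batch AI node): read the script from the share and run it.
result = subprocess.run([sys.executable, str(script)],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())                # -> model trained
```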
29. Traditional / On-Premise Paradigm
• Traditionally, static-sized clusters were the standard, so compute and storage had to be collocated.
• A single cluster, with all necessary applications installed, served everyone (typically managed by YARN or something similar).
• The cluster was either over-utilized (jobs had to be queued due to lack of capacity) or under-utilized (idle cores burned costs).
• Teams of data scientists had to submit jobs against a single cluster, so the cluster had to stay generic, preventing users from truly customizing it for their jobs.
[Diagram: a shared static cluster backed by a single DataStore]
30. Modern / Cloud Paradigm
• With cloud computing, customers are no longer limited to static-sized clusters.
• Each job, or set of jobs, can have its own cluster, so the customer is charged only for the minutes the job runs.
• Each user can have their own cluster, so they don't have to compete for resources.
• Each user can have a custom cluster created specifically for their workload, installing exactly the software they need without polluting other users' experiences.
• IT admins don't need to worry about running out of capacity or burning dollars on idle cores.
[Diagram: per-job clusters sharing a common DataStore]
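The economics behind this shift can be sketched with simple arithmetic: a static cluster is billed around the clock, while per-job clusters are billed only for node-hours actually consumed. All the numbers here (node count, busy hours, the DS4_v2 rate reused from the earlier table) are made up for the example.

```python
NODE_RATE = 0.46          # $/hr per node (e.g. DS4_v2 from the earlier table)

def static_cluster_cost(nodes: int, hours_in_month: int = 730) -> float:
    # Fixed-size cluster: every node is billed whether busy or idle.
    return nodes * hours_in_month * NODE_RATE

def per_job_cost(job_node_hours: float) -> float:
    # Ephemeral per-job clusters: pay only for node-hours actually consumed.
    return job_node_hours * NODE_RATE

static = static_cluster_cost(nodes=8)            # 8 nodes, billed all month
elastic = per_job_cost(job_node_hours=8 * 120)   # same jobs: 120 busy hours
print(f"static: ${static:.2f}  per-job: ${elastic:.2f}")
```

The gap between the two figures is exactly the "idle cores that burned costs" from the previous slide.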