2. Training ML models in the Cloud
● Rent a single VM
○ Which options do you have?
○ How to make it cost-effective?
● Distributed training
○ Horizontal vs. vertical scaling
○ Tensorflow
5. This is ...
● an intro to Cloud with regard to training ML models
● an intro to distributed training
This is not
● an intro to Tensorflow or ML
● a talk about TPU
● a talk about inference (we won’t focus on it at all)
7. About me
● Leonid Kuligin
○ Moscow Institute of Physics and Technology
○ 2017 - … Google Cloud, PSO, ML engineer - Munich, Germany
○ 2016-2017 Scout GmbH, Senior Software Engineer - Munich, Germany
○ 2015-2016 HH.RU, Senior Product Manager (search engine) - Moscow, Russia
○ 2013-2015 Yandex, Team Lead (data production team, Local search) - Moscow, Russia
https://www.linkedin.com/in/leonid-kuligin-53569544/
9. Prepare & organize your data
● BLOB storage = Google Cloud Storage / AWS S3
○ Think about format (json, parquet, avro, TFRecord, ...) & proper partitioning
○ Blob storage is usually HDFS-compatible
● Other storage options
○ BigQuery
○ Cloud SQL (managed MySQL or PostgreSQL)
○ Key-value storage (e.g. Bigtable) as a managed service
10. Rent a single VM
● Create a VM via GUI or script
● Specify basic parameters
○ Machine type
○ Persistent disk
○ Image
○ Attach accelerator
○ Preemptible
○ Security, networking, ...
○ ...
11. (workflow diagram: Local trainer → Prepare VM → Deploy → Wait & dump → Shutdown VM)
1. Move your data to the Cloud and organize it
2. Write a local trainer and test it on a small sample of your data
3. Reproduce your environment at the VM rented in the Cloud
4. Package your trainer and deploy it to the VM
5. Wait until the training is completed and dump the binaries to the BLOB storage, then shut the VM down
12. Machine types and cost
● Predefined machine types
○ Virtual CPUs + memory (with some providers, network throughput may depend on the machine type)
○ Price depends on the datacenter location (i.e., if you only need to train a model, you might choose the cheapest location)
● Billing: per second, with a minimum of 1 minute per VM
● Discounts
○ Sustained use discounts
■ Up to 30% net discount if you use an instance for more than 25% of a month
■ Different instance types are automatically combined
○ Committed use discounts
■ for 1-3 years
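As a rough illustration of how the sustained use discount compounds, here is a back-of-the-envelope sketch. The 100/80/60/40% tier multipliers match GCE's documented scheme, but this is an approximation only; real invoices differ in detail (e.g. how inferred instances are combined):

```python
def effective_monthly_cost(base_hourly_rate, hours_used, hours_in_month=730.0):
    """Approximate GCE sustained-use discounting: each successive quarter
    of the month is billed at 100%, 80%, 60% and 40% of the base rate."""
    tier_size = hours_in_month / 4.0
    cost, remaining = 0.0, float(hours_used)
    for multiplier in (1.0, 0.8, 0.6, 0.4):
        hours_in_tier = min(remaining, tier_size)
        cost += hours_in_tier * multiplier * base_hourly_rate
        remaining -= hours_in_tier
    return cost

# a full month at a $1/h list price ends up 30% cheaper:
full_month = effective_monthly_cost(1.0, 730)  # 730 * 0.7 = 511.0
```

Note how the net discount grows with usage: half a month already gets ~10% off, a full month gets the maximum 30%.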
13. Images
● You can start with a “clean” OS with a certain version, or use an OS with pre-installed/pre-configured software
● Free tier vs premium tier
○ free tier: Debian, Ubuntu, CentOS, CoreOS
○ beta: deep learning images (with a jupyter server running)
○ premium tier: Anaconda Enterprise, Caffe Python 3.6 CPU, … - you might pay an additional licensing fee and the machines are not customizable
○ user-provided custom image
● You can create your own images / make snapshots of your machines
● Official images have Compute Engine tools installed on them
14. Startup scripts
● Install/update software, perform warm-up, …
● GCE copies your script, sets permission to make it executable and executes it
● How to provide script to GCE:
○ from your local file: --metadata-from-file=startup-script=<PATH_TO_SCRIPT>
○ directly: --metadata=startup-script='#! /bin/bash
apt-get update'
○ from GCS (take care about ACLs!): --metadata=startup-script-url="gs://bucket/startup.sh"
● You can provide custom keys to the startup scripts when creating an instance
gcloud compute instances create example-instance --metadata=foo=bar,startup-script='#! /bin/bash
FOO=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo" -H "Metadata-Flavor: Google")
echo $FOO'
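The same attribute can also be read from a running script in Python; a minimal sketch with the standard library (`get_metadata` is a hypothetical helper, but the metadata-server URL and the mandatory Metadata-Flavor header are the documented interface):

```python
from urllib import request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"

def get_metadata(key, root=METADATA_ROOT):
    """Fetch a custom metadata attribute from inside the instance.
    The Metadata-Flavor header is mandatory: requests without it are rejected."""
    req = request.Request(
        "%s/instance/attributes/%s" % (root, key),
        headers={"Metadata-Flavor": "Google"},
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.read().decode()

# inside an instance created with --metadata=foo=bar this returns "bar":
# get_metadata("foo")
```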
15. Persistent disk
● HDD vs SSD
○ you decide on the size of the disk
○ the disk can be resized after the VM has been created
● additional pricing as long as the persistent disk exists
○ so you can stop the VM and restart it later (paying only for the disk storage)
○ or make your VM fully stateless and use other storages
● you can add disks / resize disks / reattach disks
○ create snapshots for backups
16. GPUs
● Additional pricing and subject to project quotas
○ Nvidia Tesla P100 (16 GB HBM2)
○ Nvidia Tesla K80 (12 GB GDDR5)
○ Nvidia Tesla V100 (16 GB HBM2) = beta
● GPUs are not available in all zones
● GPU instances can be terminated for host maintenance
● NVidia Tesla V100 are offered with NVLink connections between GPUs
17. TPUs
● Google custom-developed ASICs to accelerate ML workloads (using Tensorflow)
○ accelerate the performance of linear algebra computation
● Currently in beta, available only in a few zones
● You need to get a quota for your project
● All the data preprocessing / checkpoints / etc. is executed on the VM
○ The dense part of the Tensorflow graph, loss and gradients subgraphs are
compiled with XLA (Accelerated Linear Algebra) compiler and this part of
your code is executed on TPUs
○ Only a limited list of Tensorflow OPs is available on TPU
○ Benchmarks: http://dawn.cs.stanford.edu/benchmark/
18. CPUs vs GPUs vs TPUs
CPUs
● Quick prototyping
● Simple models
● Small models with small batch sizes
● A lot of custom ops written in C++
● Models limited by available I/O or network bandwidth
GPUs
● Models that are not written in Tensorflow
● A lot of custom ops written in C++ but optimized for GPUs
● Medium-to-large models with large batch size
● Using Tensorflow Ops not available on TPUs
TPUs
● A lot of matrix computations
● Only supported Tensorflow Ops
● Large models with large batch size
● Long training times
19. Train the model
● Create the instance
gcloud compute instances create <YOUR_INSTANCE> … --metadata KEEP_ALIVE=true
● Build the startup script
○ Install all dependencies (or use your custom image!)
○ Copy your data locally: gsutil cp gs://bucket/raw-data/2018-04/* .
○ Copy your code locally: git clone ...
○ Execute your training job
○ If needed, take care of ssh tunnelling to access tensorboard / etc.
● Include a self-destroy command into the training job as a last step
if ! [ "$KEEP_ALIVE" = "true" ]; then
  sudo shutdown -h now
fi
● You can provide your hyperparameters as metadata key-value pairs
○ how effective is your grid search now!
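Since hyperparameters are just metadata key-value pairs, launching a grid search amounts to generating one create command per combination. A sketch (`gridsearch_commands` is a hypothetical helper, not part of any SDK):

```python
import itertools

def gridsearch_commands(instance_prefix, param_grid):
    """Emit one `gcloud compute instances create` call per hyperparameter
    combination; the startup script reads the values back from the
    instance metadata server and passes them to the training job."""
    keys = sorted(param_grid)
    commands = []
    combos = itertools.product(*(param_grid[k] for k in keys))
    for i, values in enumerate(combos):
        metadata = ",".join("%s=%s" % (k, v) for k, v in zip(keys, values))
        commands.append(
            "gcloud compute instances create %s-%d --metadata=%s"
            % (instance_prefix, i, metadata)
        )
    return commands

# four combinations -> four parallel training VMs:
cmds = gridsearch_commands("trainer", {"lr": [0.1, 0.01], "batch_size": [32, 64]})
```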
20. Preemptible VMs
● Much lower price than normal instances
● Can be terminated at any time
○ Sends a soft OFF signal, you have 30s to cleanup with a shutdown script
○ Will be always terminated after 24h
○ You can simulate maintenance event
gcloud beta compute instances simulate-maintenance-event
<YOUR_INSTANCE_NAME> --zone <YOUR_ZONE>
● Might not be available and are not under SLA
● GCE generally preempts the instances launched most recently if needed
○ average preemption rate varies from 5% to 15%, but there is no SLA here!
● VMs, GPUs, local SSDs can all be preemptible
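If the trainer itself should react to preemption (rather than leaving everything to a shutdown script), one option is to catch the termination signal and checkpoint within the ~30s grace period. A sketch, assuming your shutdown script forwards SIGTERM to the training process:

```python
import signal

class PreemptionGuard:
    """Flips a flag when SIGTERM arrives so the training loop can stop
    cleanly and flush a checkpoint before the instance is terminated."""

    def __init__(self):
        self.preempted = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.preempted = True

# sketch of the polling loop (run_one_training_step / save_checkpoint
# are placeholders for your own code):
# guard = PreemptionGuard()
# for step in range(max_steps):
#     run_one_training_step()
#     if guard.preempted:
#         save_checkpoint()  # must fit into the grace period
#         break
```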
21. Open source solution
● Created by Google Brain
● Most popular ML project on Github
○ Over 480 contributors
○ 10000 commits in 12 months
● Fast C++ engine and support for distributed training
● Extensive coverage of latest deep learning algorithms
● Both high-level and low-level APIs
● Multiple deployment options
○ Mobile, Desktop, Server, Cloud
○ CPU, GPU, TPU
22. TF.ESTIMATOR
● tf.estimator - a high-level Tensorflow API that makes your life easier
○ You can always use lower-level APIs if needed
○ it takes care of a lot of things during the training
■ error handling
■ building graphs, initializing variables, starting queues, …
■ preparing summaries for Tensorboard
■ creating checkpoints and recovering from failures
● you can run the same model in different environments (local, multi-server, …)
● you provide a model as a set of functions (train_op, eval_op, …)
○ estimators have APIs like fit(), evaluate(), predict(), …
23. Preemptible VMs with tf
● Read processed data from / save checkpoints to a GCS bucket directly
● If the preemptible instance is killed, the training will continue from the last checkpoint after you restart it
● To restart automatically, use an instance template and a managed instance group
○ this is the mechanism used for autostart / autohealing / autoscaling
gcloud beta compute instance-templates create my-training-job … --preemptible
gcloud compute instance-groups managed create my-training-job-managed --base-instance-name my-training-job --size 1 --template my-training-job --zone <YOUR_ZONE>
● You can launch tensorboard locally:
○ tensorboard --logdir=$YOUR_OUTPUT_DIR
25. OOM / too long training time?
● Scale vertically
○ use more powerful VMs
○ parallelize hyperparameter tuning (run multiple training jobs in parallel)
● Scale horizontally
○ rent a cluster and use data parallelization: execute batches in parallel
○ rent a cluster and use model parallelization: execute your model in parallel
26. It’s important to remember that in most cases switching to distributed training requires you to adjust your code.
27. Need for compute power
● The amount of compute required to train state-of-the-art models is increasing exponentially
● A 300,000x increase since 2012
○ Neural Machine Translation: 80 PF-days
■ One Google TPU offers up to 180 teraflops
Source: https://blog.openai.com/ai-and-compute/
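The units are easy to mix up, so here is the arithmetic behind those two figures (80 PF-days and 180 teraflops are the numbers quoted above):

```python
nmt_pf_days = 80.0          # total training compute quoted for NMT
tpu_pflops = 180e12 / 1e15  # one TPU: 180 teraflops = 0.18 petaflop/s

# 1 PF-day means one petaflop/s sustained for a day, so finishing an
# 80 PF-day job within a single day takes 80 / 0.18 devices:
tpus_needed = nmt_pf_days / tpu_pflops
print(round(tpus_needed))   # -> 444
```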
28. There are many more tricks and approaches available (than just training in distributed mode)
29. Parallelization in practice
● Choose your favorite framework and explore options for distributed training
○ Use your favorite framework
■ Rent a few VMs, take care of the cluster’s setup, incl. the networking
■ Rent a hadoop cluster as a managed service (Dataproc), use SparkML
■ ...
○ Use Tensorflow
■ Rent a few VMs, take care of the cluster’s setup, incl. the networking
■ Use ML Engine (a managed service)
■ Use kubeflow (with Google Kubernetes Engine)
■ ...
30. Data parallelism
● Standard SGD
● Synchronous parallel SGD
○ Shuffle data uniformly across workers
○ For every step:
■ Deliver current parameters to each worker and run a SGD step
■ [Synchronize] Aggregate the results and update parameters on the PS
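The steps above can be sketched in pure Python; a one-parameter linear model stands in for the real network, and a list comprehension stands in for the actual parallel workers (a real setup would use Tensorflow and a parameter server):

```python
def shard_gradient(w, shard):
    """Mean squared error gradient for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def sync_sgd(shards, w=0.0, lr=0.01, steps=100):
    """Each step: broadcast w to every worker, let each compute a gradient
    on its own shard, then aggregate and update the parameter on the PS."""
    for _ in range(steps):
        worker_grads = [shard_gradient(w, s) for s in shards]  # in parallel
        w -= lr * sum(worker_grads) / len(worker_grads)        # [Synchronize]
    return w

# data drawn from y = 3x, sharded uniformly across two workers:
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]
w = sync_sgd(shards)  # converges to ~3.0
```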
32. Data parallelism
● Parallel SGD should be evaluated in terms of computation time, communication cost and
communication time
● You can reduce some syncing bottlenecks with Asynchronous Stochastic Gradient Descent
○ There are various adjustments to compensate for delayed gradients
34. Data processing with Tensorflow
● tf.data.Dataset API provides a high-level API to implement data input pipelines
○ Ready-made implementations are available: FixedLengthRecordDataset, TextLineDataset, TFRecordDataset
● You can read CSV with tf.decode_csv
● You can apply various transformations to your dataset (cache, map, flat_map and decode)
● When you are done, you initialize an Iterator with .make_one_shot_iterator() or .make_initializable_iterator()
38. Randomizing data
● shard according to the number of workers
● interleave on every worker
● shuffle each portion of data
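The three steps can be illustrated in plain Python; these helpers mimic the behavior of Dataset.shard, interleave and shuffle, but they are illustrative toys, not the tf.data API:

```python
import random

def shard(files, num_workers, worker_index):
    """Step 1: each worker takes every num_workers-th file."""
    return files[worker_index::num_workers]

def interleave(readers):
    """Step 2: round-robin records across several file readers."""
    out, queues = [], [list(r) for r in readers]
    while any(queues):
        for q in queues:
            if q:
                out.append(q.pop(0))
    return out

def shuffle(records, buffer_size, seed=0):
    """Step 3: shuffle within a bounded buffer, like Dataset.shuffle."""
    rng, buf, out = random.Random(seed), [], []
    for rec in records:
        buf.append(rec)
        if len(buf) >= buffer_size:
            out.append(buf.pop(rng.randrange(len(buf))))
    while buf:
        out.append(buf.pop(rng.randrange(len(buf))))
    return out

files = ["part-0", "part-1", "part-2", "part-3"]
my_files = shard(files, num_workers=2, worker_index=0)  # ['part-0', 'part-2']
records = interleave([[1, 2], [3, 4]])                  # [1, 3, 2, 4]
randomized = shuffle(records, buffer_size=2)
```

With a bounded shuffle buffer the randomization is only local, which is exactly why sharding and interleaving first matters: they spread records from different files across the stream before the buffer sees them.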
39. Model parallelism
● Separate subgraphs can be placed on different devices
○ Workers need to communicate with each other
○ You can combine CPUs/GPUs/TPUs together
● The placement should be done on your own, explicitly
with tf.device("/gpu:0"):
…
with tf.device("/gpu:1"):
…
with tf.device("/cpu:0"):
…
42. Outcomes
● Training models in the cloud is easy: you can scale easily and automate a lot of things
○ If your framework supports fault-tolerance, don’t forget about preemptible instances
● Distributed training is relatively easy to implement with modern frameworks, but you still
need to tune your code
○ You need to start thinking about it from the moment you start organizing your data
○ Data parallelization versus model parallelization
○ Or scale vertically with TPUs
● Try out Tensorflow!
I compiled this very informal chart after many conversations with new ML practitioners. Overwhelmingly, folks tend to focus on how they will choose and optimize the core ML algorithm itself. They worry about which papers to read, or how to select hyperparameters, and so on, often to the exclusion of other parts of the system.
In reality though, successfully deployed ML systems have a very different balance.
Storage per se is relatively cheap, but networking costs might arise - that’s what you need to take care of.
We are not going to talk about networking here. Just to mention: ingress traffic is always free, egress is free within the same datacenter (via internal IPs) and costs very little within the region.
We have general | highCpu | high memory instances, and some other specific-use-case instances, and you can also configure a VM yourself.
Your functions return the ops required for a given input
You might have too much data / embeddings too big to fit into a single machine’s memory.
Guys from OpenAI have made a very nice study. They’ve analyzed the amount of operations needed for one forward-backward pass for a few recently published models (based on the information provided by the papers’ authors). A PFLOPS-day = if we have a petaflop/sec of compute performance, how many days we need to get this. For comparison: one Google TPU gives you 180 teraflops, so you need >400 TPUs to achieve this kind of compute performance. Ops/model are increasing exponentially, with a doubling period of 3.6 months (Moore’s law has a doubling period of 18 months). E.g., Alexnet (2012, a CNN competing at the Imagenet Visual Recognition Challenge) has 62M parameters.
E.g., use extremely large minibatch SGD https://arxiv.org/abs/1711.04325
A very natural example is an LSTM network, where each layer can be placed on a separate device (so after processing the first input and passing the results on, the device can start to process the second one while waiting to receive the backpropagation signal back).