2. Training ML models in the Cloud
● Rent a single VM
○ Which options do you have?
○ How to make it cost-effective?
● Distributed training
○ Horizontal vs. vertical scaling
○ Tensorflow
5. This is ...
● an intro to Cloud with regard to training ML models
● an intro to distributed training
This is not
● an intro to Tensorflow or ML
● a talk about TPU
● a talk about inference (we won’t focus on it at all)
7. About me
● Leonid Kuligin
○ Moscow Institute of Physics and Technology
○ 2017 - … Google Cloud, PSO, ML engineer - Munich, Germany
○ 2016-2017 Scout GmbH, Senior Software Engineer - Munich, Germany
○ 2015-2016 HH.RU, Senior Product Manager (search engine) - Moscow, Russia
○ 2013-2015 Yandex, Team Lead (data production team, Local search) - Moscow, Russia
https://www.linkedin.com/in/leonid-kuligin-53569544/
9. Prepare & organize your data
● BLOB storage = Google Cloud Storage / AWS S3
○ Think about format (json, parquet, avro, TFRecord, ...) & proper partitioning
○ Blob storage is usually HDFS-compatible
● Other storage options
○ BigQuery
○ Cloud SQL (managed MySQL or PostgreSQL)
○ Key-value storage (e.g. Bigtable) as a managed service
10. Rent a single VM
● Create a VM via GUI or script
● Specify basic parameters
○ Machine type
○ Persistent disk
○ Image
○ Attach accelerator
○ Preemptible
○ Security, networking, ...
○ ...
11. (workflow diagram: Local trainer → Prepare VM → Deploy → Wait & dump → Shutdown VM)
1. Move your data to the Cloud and organize it
2. Write a local trainer and test it on a small sample of your data
3. Reproduce your environment at the VM rented in the Cloud
4. Package your trainer and deploy it to the VM
5. Wait until the training is completed and dump the binaries to the BLOB storage, then shut the VM down
12. Machine types and cost
● Predefined machine types
○ Virtual CPUs + memory (with some providers, network throughput may depend on the machine type)
○ Price depends on the datacenter location (i.e., if you only need to train a model, you might choose the cheapest location)
● Billing: per second, with a minimum of 1 minute per VM
● Discounts
○ Sustained use discounts
■ Up to 30% net discount if you use an instance for more than 25% of a month
■ Different instance types are automatically combined
○ Committed use discounts
■ for 1-3 years
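As a rough illustration of how the sustained use discount compounds, here is a back-of-the-envelope sketch. The 100/80/60/40% tier multipliers match GCE's documented scheme, but this is an approximation only; real invoices differ in detail (e.g. how inferred instances are combined):

```python
def effective_monthly_cost(base_hourly_rate, hours_used, hours_in_month=730.0):
    """Approximate GCE sustained-use discounting: each successive quarter
    of the month is billed at 100%, 80%, 60% and 40% of the base rate."""
    tier_size = hours_in_month / 4.0
    cost, remaining = 0.0, float(hours_used)
    for multiplier in (1.0, 0.8, 0.6, 0.4):
        hours_in_tier = min(remaining, tier_size)
        cost += hours_in_tier * multiplier * base_hourly_rate
        remaining -= hours_in_tier
    return cost

# a full month at a $1/h list price ends up 30% cheaper:
full_month = effective_monthly_cost(1.0, 730)  # 730 * 0.7 = 511.0
```

Note how the net discount grows with usage: half a month already gets ~10% off, a full month gets the maximum 30%.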
13. Images
● You can start with a “clean” OS with a certain version, or use an OS with pre-installed/pre-configured software
● Free tier vs premium tier
○ free tier: Debian, Ubuntu, CentOS, CoreOS
○ beta: deep learning images (with a jupyter server running)
○ premium tier: Anaconda Enterprise, Caffe Python 3.6 CPU, … - you might pay an additional licensing fee and the machines are not customizable
○ user-provided custom image
● You can create your own images / make snapshots of your machines
● Official images have Compute Engine tools installed on them
14. Startup scripts
● Install/update software, perform warm-up, …
● GCE copies your script, sets permission to make it executable and executes it
● How to provide script to GCE:
○ from your local file: --metadata-from-file=startup-script=<PATH_TO_SCRIPT>
○ directly: --metadata=startup-script='#! /bin/bash
apt-get update'
○ from GCS (take care about ACLs!): --metadata=startup-script-url="gs://bucket/startup.sh"
● You can provide custom keys to the startup scripts when creating an instance
gcloud compute instances create example-instance --metadata=foo=bar,startup-script='#! /bin/bash
FOO=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo" -H "Metadata-Flavor: Google")
echo $FOO'
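The same attribute can also be read from a running script in Python; a minimal sketch with the standard library (`get_metadata` is a hypothetical helper, but the metadata-server URL and the mandatory Metadata-Flavor header are the documented interface):

```python
from urllib import request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"

def get_metadata(key, root=METADATA_ROOT):
    """Fetch a custom metadata attribute from inside the instance.
    The Metadata-Flavor header is mandatory: requests without it are rejected."""
    req = request.Request(
        "%s/instance/attributes/%s" % (root, key),
        headers={"Metadata-Flavor": "Google"},
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.read().decode()

# inside an instance created with --metadata=foo=bar this returns "bar":
# get_metadata("foo")
```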
15. Persistent disk
● HDD vs SSD
○ you decide on the size of the disk
○ the disk can be resized after the VM has been created
● additional pricing as long as the persistent disk exists
○ so you can stop the VM and restart it later (paying only for the disk storage)
○ or make your VM fully stateless and use other storages
● you can add disks / resize disks / reattach disks
○ create snapshots for backups
16. GPUs
● Additional pricing and subject to project quotas
○ Nvidia Tesla P100 (16 GB HBM2)
○ Nvidia Tesla K80 (12 GB GDDR5)
○ Nvidia Tesla V100 (16 GB HBM2) = beta
● GPUs are not available in all zones
● GPU instances can be terminated for host maintenance
● NVidia Tesla V100 are offered with NVLink connections between GPUs
17. TPUs
● Google custom-developed ASICs to accelerate ML workloads (using Tensorflow)
○ accelerate the performance of linear algebra computation
● Currently in beta, available only in a few zones
● You need to get a quota for your project
● All the data preprocessing / checkpoints / etc. is executed on the VM
○ The dense part of the Tensorflow graph, loss and gradients subgraphs are
compiled with XLA (Accelerated Linear Algebra) compiler and this part of
your code is executed on TPUs
○ Only a limited list of Tensorflow OPs is available on TPU
○ Benchmarks: http://dawn.cs.stanford.edu/benchmark/
18. CPUs vs GPUs vs TPUs
CPUs
● Quick prototyping
● Simple models
● Small models with small batch sizes
● A lot of custom ops written in C++
● Models limited by available I/O or network bandwidth
GPUs
● Models that are not written in Tensorflow
● A lot of custom ops written in C++ but optimized for GPUs
● Medium-to-large models with large batch size
● Using Tensorflow Ops not available on TPUs
TPUs
● A lot of matrix computations
● Only supported Tensorflow Ops
● Large models with large batch size
● Long training times
19. Train the model
● Create the instance
gcloud compute instances create <YOUR_INSTANCE> … --metadata KEEP_ALIVE=true
● Build the startup script
○ Install all dependencies (or use your custom image!)
○ Copy your data locally: gsutil cp gs://bucket/raw-data/2018-04/* .
○ Copy your code locally: git clone ...
○ Execute your training job
○ If needed, take care of ssh tunnelling to access tensorboard / etc.
● Include a self-destroy command into the training job as a last step
if ! [ "$KEEP_ALIVE" = "true" ]; then
  sudo shutdown -h now
fi
● You can provide your hyperparameters as metadata key-value pairs
○ how effective is your grid search now!
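Since hyperparameters are just metadata key-value pairs, launching a grid search amounts to generating one create command per combination. A sketch (`gridsearch_commands` is a hypothetical helper, not part of any SDK):

```python
import itertools

def gridsearch_commands(instance_prefix, param_grid):
    """Emit one `gcloud compute instances create` call per hyperparameter
    combination; the startup script reads the values back from the
    instance metadata server and passes them to the training job."""
    keys = sorted(param_grid)
    commands = []
    combos = itertools.product(*(param_grid[k] for k in keys))
    for i, values in enumerate(combos):
        metadata = ",".join("%s=%s" % (k, v) for k, v in zip(keys, values))
        commands.append(
            "gcloud compute instances create %s-%d --metadata=%s"
            % (instance_prefix, i, metadata)
        )
    return commands

# four combinations -> four parallel training VMs:
cmds = gridsearch_commands("trainer", {"lr": [0.1, 0.01], "batch_size": [32, 64]})
```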
20. Preemptible VMs
● Much lower price than normal instances
● Can be terminated at any time
○ Sends a soft OFF signal, you have 30s to cleanup with a shutdown script
○ Will be always terminated after 24h
○ You can simulate maintenance event
gcloud beta compute instances simulate-maintenance-event
<YOUR_INSTANCE_NAME> --zone <YOUR_ZONE>
● Might not be available and are not under SLA
● GCE generally preempts the instances launched most recently if needed
○ average preemption rate varies from 5% to 15%, but there is no SLA here!
● VMs, GPUs, local SSDs can all be preemptible
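If the trainer itself should react to preemption (rather than leaving everything to a shutdown script), one option is to catch the termination signal and checkpoint within the ~30s grace period. A sketch, assuming your shutdown script forwards SIGTERM to the training process:

```python
import signal

class PreemptionGuard:
    """Flips a flag when SIGTERM arrives so the training loop can stop
    cleanly and flush a checkpoint before the instance is terminated."""

    def __init__(self):
        self.preempted = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.preempted = True

# sketch of the polling loop (run_one_training_step / save_checkpoint
# are placeholders for your own code):
# guard = PreemptionGuard()
# for step in range(max_steps):
#     run_one_training_step()
#     if guard.preempted:
#         save_checkpoint()  # must fit into the grace period
#         break
```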
21. Open source solution
● Created by Google Brain
● Most popular ML project on Github
○ Over 480 contributors
○ 10000 commits in 12 months
● Fast C++ engine and support for distributed training
● Extensive coverage of latest deep learning algorithms
● Both high-level and low-level APIs
● Multiple deployment options
○ Mobile, Desktop, Server, Cloud
○ CPU, GPU, TPU
22. TF.ESTIMATOR
● tf.estimator - a high-level Tensorflow API that makes your life easier
○ You can always use lower-level APIs if needed
○ it takes care of a lot of things during the training
■ error handling
■ building graphs, initializing variables, starting queues, …
■ preparing summaries for Tensorboard
■ creating checkpoints and recovering from failures
● you can run the same model in different environments (local, multi-server, …)
● you provide a model as a set of functions (train_op, eval_op, …)
○ estimators have APIs like fit(), evaluate(), predict(), …
23. Preemptible VMs with tf
● Read processed data from / save checkpoints to a GCS bucket directly
● If the preemptible instance is killed, the training will continue from the last checkpoint after you restart it
● To restart automatically, use an instance template and a managed instance group
○ this is the mechanism used for autostart / autohealing / autoscaling
gcloud beta compute instance-templates create my-training-job … --preemptible
gcloud compute instance-groups managed create my-training-job-managed --base-instance-name my-training-job --size 1 --template my-training-job --zone <YOUR_ZONE>
● You can launch tensorboard locally:
○ tensorboard --logdir=$YOUR_OUTPUT_DIR
25. OOM / too long training time?
● Scale vertically
○ use more powerful VMs
○ parallelize hyperparameter tuning (run multiple training jobs in parallel)
● Scale horizontally
○ rent a cluster and use data parallelization: execute batches in parallel
○ rent a cluster and use model parallelization: execute your model in parallel
26. It’s important to remember that in most cases switching to distributed training requires you to adjust your code.
27. Need for compute power
● The amount of compute required to train state-of-the-art models is increasing exponentially
● A 300,000x increase since 2012
○ Neural Machine Translation: 80 PF-days
■ One Google TPU offers up to 180 teraflops
Source: https://blog.openai.com/ai-and-compute/
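The units are easy to mix up, so here is the arithmetic behind those two figures (80 PF-days and 180 teraflops are the numbers quoted above):

```python
nmt_pf_days = 80.0          # total training compute quoted for NMT
tpu_pflops = 180e12 / 1e15  # one TPU: 180 teraflops = 0.18 petaflop/s

# 1 PF-day means one petaflop/s sustained for a day, so finishing an
# 80 PF-day job within a single day takes 80 / 0.18 devices:
tpus_needed = nmt_pf_days / tpu_pflops
print(round(tpus_needed))   # -> 444
```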
28. There are many more tricks and approaches available (than just training in distributed mode)
29. Parallelization in practice
● Choose your favorite framework and explore options for distributed training
○ Use your favorite framework
■ Rent a few VMs, take care of the cluster’s setup, incl. the networking
■ Rent a hadoop cluster as a managed service (Dataproc), use SparkML
■ ...
○ Use Tensorflow
■ Rent a few VMs, take care of the cluster’s setup, incl. the networking
■ Use ML Engine (a managed service)
■ Use kubeflow (with Google Kubernetes Engine)
■ ...
30. Data parallelism
● Standard SGD
● Synchronous parallel SGD
○ Shuffle data uniformly across workers
○ For every step:
■ Deliver current parameters to each worker and run a SGD step
■ [Synchronize] Aggregate the results and update parameters on the PS
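The steps above can be sketched in pure Python; a one-parameter linear model stands in for the real network, and a list comprehension stands in for the actual parallel workers (a real setup would use Tensorflow and a parameter server):

```python
def shard_gradient(w, shard):
    """Mean squared error gradient for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def sync_sgd(shards, w=0.0, lr=0.01, steps=100):
    """Each step: broadcast w to every worker, let each compute a gradient
    on its own shard, then aggregate and update the parameter on the PS."""
    for _ in range(steps):
        worker_grads = [shard_gradient(w, s) for s in shards]  # in parallel
        w -= lr * sum(worker_grads) / len(worker_grads)        # [Synchronize]
    return w

# data drawn from y = 3x, sharded uniformly across two workers:
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]
w = sync_sgd(shards)  # converges to ~3.0
```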
32. Data parallelism
● Parallel SGD should be evaluated in terms of computation time, communication cost and
communication time
● You can reduce some syncing bottlenecks with Asynchronous Stochastic Gradient Descent
○ There are various adjustments to compensate for delayed gradients
34. Data processing with Tensorflow
● tf.data.Dataset API provides a high-level API to implement data input pipelines
○ Ready-made implementations are available: FixedLengthRecordDataset, TextLineDataset, TFRecordDataset
● You can read CSV with tf.decode_csv
● You can apply various transformations to your dataset (cache, map, flat_map and decode)
● When you are done, you initialize an Iterator with .make_one_shot_iterator() or .make_initializable_iterator()
38. Randomizing data
● shard according to the number of workers
● interleave on every worker
● shuffle each portion of data
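The three steps can be illustrated in plain Python; these helpers mimic the behavior of Dataset.shard, interleave and shuffle, but they are illustrative toys, not the tf.data API:

```python
import random

def shard(files, num_workers, worker_index):
    """Step 1: each worker takes every num_workers-th file."""
    return files[worker_index::num_workers]

def interleave(readers):
    """Step 2: round-robin records across several file readers."""
    out, queues = [], [list(r) for r in readers]
    while any(queues):
        for q in queues:
            if q:
                out.append(q.pop(0))
    return out

def shuffle(records, buffer_size, seed=0):
    """Step 3: shuffle within a bounded buffer, like Dataset.shuffle."""
    rng, buf, out = random.Random(seed), [], []
    for rec in records:
        buf.append(rec)
        if len(buf) >= buffer_size:
            out.append(buf.pop(rng.randrange(len(buf))))
    while buf:
        out.append(buf.pop(rng.randrange(len(buf))))
    return out

files = ["part-0", "part-1", "part-2", "part-3"]
my_files = shard(files, num_workers=2, worker_index=0)  # ['part-0', 'part-2']
records = interleave([[1, 2], [3, 4]])                  # [1, 3, 2, 4]
randomized = shuffle(records, buffer_size=2)
```

With a bounded shuffle buffer the randomization is only local, which is exactly why sharding and interleaving first matters: they spread records from different files across the stream before the buffer sees them.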
39. Model parallelism
● Separate subgraphs can be placed on different devices
○ Workers need to communicate with each other
○ You can combine CPUs/GPUs/TPUs together
● The placement should be done on your own, explicitly
with tf.device("/gpu:0"):
…
with tf.device("/gpu:1"):
…
with tf.device("/cpu:0"):
…
42. Outcomes
● Training models in the cloud is easy: you can scale easily and automate a lot of things
○ If your framework supports fault-tolerance, don’t forget about preemptible instances
● Distributed training is relatively easy to implement with modern frameworks, but you still
need to tune your code
○ You need to start thinking about it from the moment you start organizing your data
○ Data parallelization versus model parallelization
○ Or scale vertically with TPUs
● Try out Tensorflow!
I compiled this very informal chart after many conversations with new ML practitioners. Overwhelmingly, folks tend to focus on how they will choose and optimize the core ML algorithm itself. They worry about which papers to read, or how to select hyperparameters, and so on, often to the exclusion of other parts of the system.
In reality though, successfully deployed ML systems have a very different balance.
Storage per se is relatively cheap, but networking costs might arise - that’s what you need to take care of.
We are not going to talk about networking here. Just to mention: ingress traffic is always free, egress is free within the same datacenter (via internal IPs) and costs very little within the region.
We have general | highCpu | high memory instances, and some other specific-use-case instances, and you can also configure a VM yourself.
Your functions return the ops required for a given input
You might have too much data / embeddings too big to fit into a single machine’s memory.
Guys from OpenAI have made a very nice study. They’ve analyzed the amount of operations needed for one forward-backward pass for a few recently published models (based on the information provided by the papers’ authors). A PFLOPS-day = if we have a petaflop/sec of compute performance, how many days we need to get this. For comparison: one Google TPU gives you 180 teraflops, so you need >400 TPUs to achieve this kind of compute performance. Ops/model are increasing exponentially, with a doubling period of 3.6 months (Moore’s law has a doubling period of 18 months). E.g., Alexnet (2012, a CNN competing at the Imagenet Visual Recognition Challenge) has 62M parameters.
E.g., use extremely large minibatch SGD https://arxiv.org/abs/1711.04325
A very natural example is an LSTM network, where each layer can be placed on a separate device (so after processing the first input and passing the results on, the device can start to process the second one while waiting to receive the backpropagation signal back).