Training your machine learning model with Cloud
Training ML models with Cloud
● Rent a single VM
○ Which options do you have?
○ How to make it cost-effective?
● Distributed training
○ Horizontal vs. vertical scaling
○ Tensorflow
The ML surprise
[Chart: expected effort allocation across defining KPI's, collecting data, building infrastructure, optimizing the ML algorithm, and integration]
*Informally based on my many conversations with new ML practitioners
The ML surprise
[Chart: expected vs. actual effort allocation (on a 0-1 scale) across defining KPI's, collecting data, building infrastructure, optimizing the ML algorithm, and integration]
This is ...
● an intro to Cloud with regard to training ML models
● an intro to distributed training
This is not
● an intro to Tensorflow or ML
● a talk about TPU
● we won’t focus on inference at all
Why?
● Scale
● Flexibility
● Additional tools as fully managed services
● Cost-effective
● Shorter time-to-market
About me
● Leonid Kuligin
○ Moscow Institute of Physics and Technology
○ 2017 - … Google Cloud, PSO, ML engineer - Munich, Germany
○ 2016-2017 Scout GmbH, Senior Software Engineer - Munich, Germany
○ 2015-2016 HH.RU, Senior Product Manager (search engine) - Moscow, Russia
○ 2013-2015 Yandex, Team Lead (data production team, Local search) - Moscow, Russia
https://www.linkedin.com/in/leonid-kuligin-53569544/
Rent a VM
Prepare & organize your data
● BLOB storage = Google Cloud Storage / AWS S3
○ Think about format (JSON, Parquet, Avro, TFRecord, ...) & proper partitioning (see the TFRecord sketch after this list)
○ Blob storage is usually HDFS-compatible
● Other storage options
○ BigQuery
○ Cloud SQL (managed MySQL or PostgreSQL)
○ Key-value storage (e.g. Bigtable) as a managed service
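To make the TFRecord option above concrete, here is a minimal sketch using the TensorFlow 1.x API; the bucket path and feature names are hypothetical, and this is just one way to serialize examples, not a prescribed format:

import tensorflow as tf

# Hypothetical output path; tf.python_io can write directly to gs:// paths
# when GCS credentials are available on the machine.
with tf.python_io.TFRecordWriter("gs://YOUR_BUCKET/data/part-00000.tfrecord") as writer:
    for features, label in [([1.0, 2.0], 0), ([3.0, 4.0], 1)]:
        example = tf.train.Example(features=tf.train.Features(feature={
            "features": tf.train.Feature(float_list=tf.train.FloatList(value=features)),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())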
Rent a single VM
● Create a VM via GUI or script
● Specify basic parameters
○ Machine type
○ Persistent disk
○ Image
○ Attach accelerator
○ Preemptible
○ Security, networking, ...
○ ...
The typical workflow (shown on the slide as a diagram):
1. Move your data to the Cloud and organize it
2. Write a local trainer and test it on a small sample of your data
3. Reproduce your environment on the VM rented in the Cloud
4. Package your trainer and deploy it to the VM
5. Wait until the training is completed, dump the binaries to the BLOB storage, and shut down the VM
Machine types and cost
● Predefined machine types
○ Virtual CPUs + memory (with some providers, network throughput might depend on
the machine type)
○ Price depends on the datacenter location (i.e., if you need to train a model only, you
might choose the cheapest location)
● Billing: per second, but with a minimum of 1 minute per VM
● Discounts
○ Sustained use discounts
■ Up to 30% net discount if you use an instance for more than 25% of a month
■ Different instance types are automatically combined
○ Committed use discounts
■ for 1-3 years
Images
● You can start with a “clean” OS with a certain version, or you can use an OS with pre-installed/pre-configured software
● Free tier vs. premium tier
○ free tier: Debian, Ubuntu, CentOS, CoreOS
○ beta: deep learning images (with a Jupyter server running)
○ premium tier: Anaconda Enterprise, Caffe Python 3.6 CPU, … - you might pay some additional licensing fee and the machines are not customizable
○ user-provided custom image
● You can create your own images / make snapshots of your machines
● Official images have Compute Engine tools installed on them
Startup scripts
● Install/update software, perform warm-up, …
● GCE copies your script, sets permission to make it executable and executes it
● How to provide script to GCE:
○ from your local file: --metadata-from-file=startup-script=<PATH_TO_SCRIPT>
○ directly: --metadata=startup-script="#! /bin/bash; sudo su -; apt-get update"
○ from GCS (take care about ACLs!): --metadata=startup-script-url="gs://bucket/startup.sh"
● You can provide custom keys to the startup scripts when creating an instance
gcloud compute instances create example-instance --metadata foo=bar
--metadata=startup-script="#! /bin/bash;
FOO=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo -H 'Metadata-Flavor: Google'); echo $FOO"
Persistent disk
● HDD vs SSD
○ you decide on the size of the disk
○ the disk can be resized after the VM has been created
● additional pricing as long as the persistent disk exists
○ so you can stop the VM and restart it later (paying only for the disk storage)
○ or make your VM fully stateless and use other storage options
● you can add disks / resize disks / reattach disks
○ create snapshots for backups
GPUs
● Additional pricing and subject to project quotas
○ Nvidia Tesla P100 (16 GB HBM2)
○ Nvidia Tesla K80 (12 GB GDDR5)
○ Nvidia Tesla V100 (16 GB HBM2) = beta
● GPUs are not available in all zones
● GPU instances can be terminated for host maintenance
● NVidia Tesla V100 are offered with NVLink connections between GPUs
TPUs
● Google custom-developed ASICs to accelerate ML workloads (using Tensorflow)
○ accelerate the performance of linear algebra computation
● Currently in beta, available only in a few zones
● You need to get a quota for your project
● All the data preprocessing / checkpoints / etc. is executed on the VM
○ The dense part of the Tensorflow graph, loss and gradients subgraphs are
compiled with XLA (Accelerated Linear Algebra) compiler and this part of
your code is executed on TPUs
○ Only a limited list of Tensorflow OPs is available on TPU
○ Benchmarks: http://dawn.cs.stanford.edu/benchmark/
CPUs
● Quick prototyping
● Simple models
● Small models with small batch
sizes
● A lot of custom ops written in
C++
● Models limited by I/O or
network bandwidth available
GPUs
● Models that are not written in
Tensorflow
● A lot of custom ops written in
C++ but optimized for GPUs
● Medium-to-large models with
large batch size
● Using Tensorflow ops not available on TPUs
TPUs
● A lot of matrix computations
● Only supported Tensorflow Ops
● Large models with large batch
size
● Long training times
Train the model
● Create the instance
gcloud compute instances create <YOUR INSTANCE> … --metadata "KEEP_ALIVE=true"
● Build the startup script
○ Install all dependencies (or use your custom image!)
○ Copy your data locally: gsutil cp gs://bucket/raw-data/2018-04/* .
○ Copy your code locally: git clone ...
○ Execute your training job
○ If needed, take care of SSH tunnelling to access Tensorboard, etc.
● Include a self-shutdown command in the training job as the last step
if ! [ "$KEEP_ALIVE" = "true" ]; then
sudo shutdown -h now
fi
● You can provide your hyperparameters as metadata key-value pairs
○ how effective is your gridsearch now!
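As an illustration of hyperparameters via metadata, here is a minimal Python sketch a training job could run on the VM to read a custom key from the standard GCE metadata server; the key name learning_rate is hypothetical and would be set at instance creation time, e.g. with --metadata learning_rate=0.01:

import urllib.request

# Standard GCE metadata server endpoint; "learning_rate" is a hypothetical custom key.
URL = ("http://metadata.google.internal/computeMetadata/v1/"
       "instance/attributes/learning_rate")
request = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
learning_rate = float(urllib.request.urlopen(request).read().decode("utf-8"))
print("Training with learning_rate =", learning_rate)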
Preemptible VMs
● Much lower price than normal instances
● Can be terminated at any time
○ Sends a soft OFF signal; you have 30s to clean up with a shutdown script
○ Will be always terminated after 24h
○ You can simulate maintenance event
gcloud beta compute instances simulate-maintenance-event
<YOUR_INSTANCE_NAME> --zone <YOUR_ZONE>
● Might not be available and are not covered by an SLA
● GCE generally preempts the instances launched most recently if needed
○ average preemption rate varies from 5% to 15%, but there is no SLA here!
● VMs, GPUs, local SSDs can all be preemptible
Open source solution: Tensorflow
● Created by Google Brain
● Most popular ML project on Github
○ Over 480 contributors
○ 10000 commits in 12 months
● Fast C++ engine and support for distributed training
● Extensive coverage of latest deep learning algorithms
● Both high-level and low-level APIs
● Multiple deployment options
○ Mobile, Desktop, Server, Cloud
○ CPU, GPU, TPU
TF.ESTIMATOR
● tf.estimator - a high-level Tensorflow API that makes your life easier
○ You can always use lower-level APIs if needed
○ it takes care of a lot of things during the training
■ error handling
■ building graphs, initializing variables, starting queues, …
■ preparing summaries for Tensorboard
■ creating checkpoints and recovering from failures
● you can run the same model in different environments (local, multi-server, …)
● you provide a model as a set of functions (train_op, eval_op, …)
○ estimators have APIs like train(), evaluate(), predict(), …
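A minimal sketch of that workflow with the TensorFlow 1.x tf.estimator API; the toy model, the feature name "x" and the model_dir are made up for illustration only:

import tensorflow as tf

def model_fn(features, labels, mode):
    # A tiny linear model on a single numeric feature.
    logits = tf.layers.dense(features["x"], units=2)
    predictions = tf.argmax(logits, axis=1)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)  # EVAL

def input_fn():
    # Toy in-memory dataset; in practice this would read TFRecords from GCS.
    dataset = tf.data.Dataset.from_tensor_slices(({"x": [[1.0], [2.0]]}, [0, 1]))
    return dataset.repeat().batch(2)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/my_model")
estimator.train(input_fn=input_fn, max_steps=100)
print(estimator.evaluate(input_fn=input_fn, steps=10))

The same estimator code can then be pointed at a different model_dir or RunConfig to run locally or on a cluster.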
Preemptible VMs with tf
● Read processed data from / save checkpoints to a GCS bucket directly (see the sketch at the end of this slide)
● If the preemptible instance is killed, the training will continue from the last checkpoint after you restart it
● To restart automatically, you need to use an instance template and a managed instance group
○ this is the mechanism used for autostart / autohealing / autoscaling
gcloud beta compute instance-templates create my-training-job … --preemptible
gcloud compute instance-groups managed create my-training-job-managed --base-instance-name my-training-job --size 1 --template my-training-job --zone <YOUR_ZONE>
● You can launch tensorboard locally:
○ tensorboard --logdir=$YOUR_OUTPUT_DIR
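A minimal sketch (the bucket name is hypothetical, and it reuses a model_fn and input_fn like those in the estimator sketch above) of pointing the estimator's model_dir at GCS so checkpoints survive preemption and training resumes when the managed instance group recreates the VM:

import tensorflow as tf

# Checkpoints and Tensorboard summaries go straight to GCS, not to the local disk.
run_config = tf.estimator.RunConfig(
    model_dir="gs://YOUR_BUCKET/models/my-training-job",
    save_checkpoints_steps=500,   # checkpoint often so little work is lost on preemption
    keep_checkpoint_max=5)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# On restart, train() picks up the latest checkpoint found in model_dir and continues.
estimator.train(input_fn=input_fn, max_steps=100000)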
Distributed training
OOM / too long training time?
● Scale vertically
○ use more powerful VMs
○ parallelize hyperparameter tuning (run multiple training jobs in parallel)
● Scale horizontally
○ rent a cluster and use data parallelization: execute batches in parallel
○ rent a cluster and use model parallelization: execute your model in parallel
It’s important to remember that in most cases switching to distributed training requires you to adjust your code.
Need for compute power
● The amount of compute required to train state-of-the-art models is increasing exponentially
● ~300,000x increase since 2012
○ Neural Machine Translation: ~80 petaflop/s-days
■ A Google TPU offers up to 180 teraflops, so you would need roughly 440 TPUs running for a full day to deliver that much compute
Source: https://blog.openai.com/ai-and-compute/
There are many more tricks and approaches
available (than just training in distributed mode)
Parallelization in practice
● Choose your favorite framework and explore options for distributed training
○ Use your favorite framework
■ Rent a few VMs, take care of the cluster's setup, incl. the networking
■ Rent a Hadoop cluster as a managed service (Dataproc), use SparkML
■ ...
○ Use Tensorflow
■ Rent a few VMs, take care of the cluster's setup, incl. the networking
■ Use ML Engine (a managed service)
■ Use Kubeflow (with Google Kubernetes Engine)
■ ...
Data parallelism
● Standard SGD
● Synchronous parallel SGD
○ Shuffle data uniformly across workers
○ For every step:
■ Deliver the current parameters to each worker and run an SGD step
■ [Synchronize] Aggregate the results and update the parameters on the PS (see the sketch below)
[Diagram: data is uniformly shuffled across workers 1 … j; each worker samples a minibatch and runs an SGD step; the gradients are aggregated and the parameters are updated on the parameter server]
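An illustrative numpy sketch of one synchronous step (a toy least-squares loss and random data, not the actual Tensorflow implementation) to make the aggregate-and-update step concrete:

import numpy as np

def worker_gradient(params, x_batch, y_batch):
    # Gradient of a toy least-squares loss on one worker's minibatch.
    preds = x_batch @ params
    return 2 * x_batch.T @ (preds - y_batch) / len(y_batch)

rng = np.random.default_rng(0)
params = np.zeros(3)  # parameters held by the parameter server
shards = [(rng.normal(size=(32, 3)), rng.normal(size=32)) for _ in range(4)]  # 4 workers

for step in range(100):
    # Each worker computes a gradient on its own minibatch (in parallel in reality).
    grads = [worker_gradient(params, x, y) for x, y in shards]
    # Synchronize: average the gradients and update the shared parameters.
    params -= 0.01 * np.mean(grads, axis=0)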
Data parallelism
● Parallel SGD should be evaluated in terms of computation time, communication cost and
communication time
● You can reduce some syncing bottlenecks with Asynchronous Stochastic Gradient
Descent
○ There are various adjustments to compensate for delayed (stale) gradients
You have 4 workers. What would you expect?
Data processing with Tensorflow
● tf.data.Dataset API provides a high-level API to implement data input pipelines
○ There are available implementations of FixedLengthRecordDataset,
TextLineDataset, TFRecordDataset
● You can read CSV with tf.decode_csv
● You can apply various transformations to your dataset (cache, map, flat_map, ...)
● When you are done, you initialize an Iterator with
.make_one_shot_iterator() or .make_initializable_iterator()
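A minimal input-pipeline sketch with the TensorFlow 1.x API that ties these pieces together; the file pattern and feature spec are hypothetical:

import tensorflow as tf

def parse_fn(serialized):
    # Decode one serialized tf.train.Example into tensors (hypothetical feature spec).
    parsed = tf.parse_single_example(serialized, {
        "features": tf.FixedLenFeature([2], tf.float32),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    return parsed["features"], parsed["label"]

files = tf.data.Dataset.list_files("gs://YOUR_BUCKET/data/*.tfrecord")
dataset = (files.flat_map(tf.data.TFRecordDataset)
                .map(parse_fn)
                .shuffle(buffer_size=10000)
                .batch(128)
                .prefetch(1))

features, labels = dataset.make_one_shot_iterator().get_next()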
Pipelining
Without pipelining:
dataset = dataset.batch(batch_size=FLAGS.batch_size)
With pipelining:
dataset = dataset.batch(batch_size=FLAGS.batch_size)
dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
Parallelize Data Transformation
Sequential:
dataset = dataset.map(map_func=parse_fn)
Parallel:
dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_calls)
Parallelize Data Extraction
Sequential:
dataset = files.interleave(tf.data.TFRecordDataset)
Parallel:
dataset = files.apply(tf.contrib.data.parallel_interleave(
tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_readers, sloppy=True))
Randomizing data
● shard according to the number of workers
● interleave on every worker
● shuffle each portion of data
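A sketch of those three steps for one worker with the TensorFlow 1.x API; worker_index and num_workers are assumed to come from your cluster configuration, and the file pattern is hypothetical:

import tensorflow as tf

num_workers, worker_index = 4, 0  # assumed to come from the cluster configuration

files = tf.data.Dataset.list_files("gs://YOUR_BUCKET/data/*.tfrecord", shuffle=False)
files = files.shard(num_shards=num_workers, index=worker_index)      # 1. shard by worker
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)  # 2. interleave files
dataset = dataset.shuffle(buffer_size=10000)                         # 3. shuffle records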
Model parallelism
● Separate subgraphs can be placed on different devices
○ Workers need to communicate with each other
○ You can combine CPUs/GPUs/TPUs together
● The placement has to be done by you explicitly (see the sketch below)
with tf.device("/gpu:0"):
…
with tf.device("/gpu:1"):
…
with tf.device("/cpu:0"):
…
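For example, a minimal sketch (the layer sizes are arbitrary) of splitting a small network across two GPUs, with the final layer placed on the CPU:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 128])

with tf.device("/gpu:0"):
    hidden1 = tf.layers.dense(x, 256, activation=tf.nn.relu)         # first block on GPU 0
with tf.device("/gpu:1"):
    hidden2 = tf.layers.dense(hidden1, 256, activation=tf.nn.relu)   # second block on GPU 1
with tf.device("/cpu:0"):
    logits = tf.layers.dense(hidden2, 10)                            # final layer on the CPU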
Data Parallelism & Model Parallelism
Data parallelism to process multiple mini-batches in
different workers simultaneously
Model parallelism to run different model operations on
multiple devices simultaneously
[Diagram: asynchronous SGD with a parameter server. A chief and a shared data queue feed mini-batches 1, 2 and 3 to worker nodes; each worker runs its input_fn, model layers, loss and gradients; the parameter server takes the mean of the gradients and updates the parameters. Good for partially connected models (wide-and-deep models + embeddings) and recurrent models.]
Three ways for AI on Google Cloud (ordered from ease of use to customisation):
● Call our perception APIs (our data + our models): Cloud Vision API, Cloud Translation API, Cloud Natural Language API, Cloud Speech API, Cloud Video Intelligence API, Cloud Speech Synthesis API, Data Loss Prevention API
● Train our state-of-the-art models (your data + our models): AutoML, Dialogflow
● Build your own models (your data + your model): Cloud ML Engine, Cloud Dataproc, Compute Engine, Kubernetes Engine, Cloud TPUs
Outcomes
● Training models in the cloud is easy, you can scale very easily, and you can automate a lot of things
○ If your framework supports fault tolerance, don't forget about preemptible instances
● Distributed training is relatively easy to implement with modern frameworks, but you still need to tune your code
○ You need to start thinking about it as early as when you organize your data
○ Data parallelization versus model parallelization
○ Or scale vertically with TPUs
● Try out Tensorflow!
Thank you!
Questions?

Editor's Notes

  • #4  I compiled this very informal chart after many conversations with new ML practitioners. Overwhelmingly, folks tend to focus on how they will choose and optimize the core ML algorithm itself. They worry about which papers to read, or how to select hyperparameters, and so on, often to the exclusion of other parts of the system. In reality, though, successfully deployed ML systems have a very different balance.
  • #10 Storage per se is relatively cheap, but networking costs might arise - so that's what you need to take care of
  • #11 We are not going to talk about networking here. Just to mention, ingress traffic is always free, egress is free within the same datacenter (via internal IPs) and costs very little within the region.
  • #13 There are general-purpose, high-CPU, high-memory and some other use-case-specific instance types, and you can also configure a custom VM yourself.
  • #14 https://cloud.google.com/compute/docs/images
  • #17 https://cloud.google.com/compute/docs/gpus/ https://devblogs.nvidia.com/how-nvlink-will-enable-faster-easier-multi-gpu-computing/
  • #18 application-specific integrated circuits (ASICs) https://cloud.google.com/tpu/docs/tpus
  • #19 application-specific integrated circuits (ASICs)
  • #23 Your functions return the ops required for a given input
  • #26 You might have too much data or too big embeddings that do not fit into a single machine's memory
  • #28 The folks at OpenAI did a very nice piece of research. They analyzed the number of operations needed for one forward-backward pass for a few recently published models (based on the information provided by the papers' authors). A petaflop/s-day = the amount of compute a petaflop/s machine delivers in one day. For comparison, one Google TPU gives you 180 teraflops, so you need >400 TPUs to reach that kind of compute performance. Ops per model are increasing exponentially, with a doubling period of 3.6 months (Moore's law has a doubling period of 18 months). E.g., AlexNet (2012, a CNN that competed in the ImageNet Visual Recognition Challenge) has 62M parameters.
  • #29 E.g., use extremely large minibatch SGD https://arxiv.org/abs/1711.04325
  • #31 $\omega_{j+1} = \omega_j - \alpha \nabla_{\omega} L_j$, where $L_j = \frac{1}{M}\sum_{i=1}^{M} L_i$ is the loss averaged over the minibatch of size $M$.
  • #34 E.g., use extremely large minibatch SGD https://arxiv.org/abs/1711.04325
  • #40 A very natural example is an LSTM network, where each layer can be placed on a separate device (after processing the first input and passing the result on, a device can start processing the second input while waiting to receive the backpropagated gradients)
  • #42 Demo APIs and AutoML