
Leonid Kuligin "Training ML models with Cloud"


Data Science Practice



  1. 1. Training your machine learning model with Cloud
  2. 2. Training ML model with Cloud ● Rent a single VM ○ Which options do you have? ○ How to make it cost-effective? ● Distributed training ○ Horizontal vs. vertical scaling ○ Tensorflow
  3. 3. The ML surprise ● Effort allocation: expectation (chart) ○ Defining KPIs ○ Collecting data ○ Building infrastructure ○ Optimizing ML algorithm ○ Integration *Informally based on my many conversations with new ML practitioners
  4. 4. The ML surprise ● Effort allocation: expectation vs. reality (chart comparing the same categories: defining KPIs, collecting data, building infrastructure, optimizing the ML algorithm, integration)
  5. 5. This is ... ● an intro to Cloud with regard to training ML models ● an intro to distributed training This is not ● an intro to Tensorflow or ML ● a talk about TPUs ● a talk about inference (we won’t cover it at all)
  6. 6. Why? ● Scale ● Flexibility ● Additional tools as fully managed services ● Cost-effective ● Shorter time-to-market
  7. 7. About me ● Leonid Kuligin ○ Moscow Institute of Physics and Technology ○ 2017 - … Google Cloud, PSO, ML engineer - Munich, Germany ○ 2016-2017 Scout GmbH, Senior Software Engineer - Munich, Germany ○ 2015-2016 HH.RU, Senior Product Manager (search engine) - Moscow, Russia ○ 2013-2015 Yandex, Team Lead (data production team, Local search) - Moscow, Russia https://www.linkedin.com/in/leonid-kuligin-53569544/
  8. 8. Rent a VM
  9. 9. Prepare & organize your data ● BLOB storage = Google Cloud Storage / AWS S3 ○ Think about the format (JSON, Parquet, Avro, TFRecord, ...) & proper partitioning ○ Blob storage is usually HDFS-compatible ● Other storage options ○ BigQuery ○ Cloud SQL (managed MySQL or PostgreSQL) ○ Key-value storage (e.g. Bigtable) as a managed service
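As a rough illustration of the TFRecord option above, here is a minimal sketch (TF 1.x) that serializes a few toy rows as tf.train.Example records; the bucket path and the two-field schema are assumptions, and a local path would work the same way:

```python
import tensorflow as tf

# Toy rows; in practice these would come from your raw dataset.
rows = [([0.1, 2.0, 3.5], 1.0), ([0.4, 1.2, 0.7], 0.0)]

def _float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# 'gs://my-bucket/...' is a placeholder; partitioning simply means writing
# many such shards instead of one huge file.
with tf.python_io.TFRecordWriter('gs://my-bucket/raw-data/2018-04/part-00000.tfrecord') as writer:
    for features, label in rows:
        example = tf.train.Example(features=tf.train.Features(feature={
            'features': _float_feature(features),
            'label': _float_feature([label]),
        }))
        writer.write(example.SerializeToString())
```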
  10. 10. Rent a single VM ● Create a VM via GUI or script ● Specify basic parameters ○ Machine type ○ Persistent disk ○ Image ○ Attached accelerators ○ Preemptible or not ○ Security, networking, ... ○ ...
  11. 11. Typical workflow: 1. Data: move your data to the Cloud and organize it. 2. Local trainer: write a local trainer and test it on a small sample of your data. 3. Prepare VM: reproduce your environment on the VM rented in the Cloud. 4. Deploy: package your trainer and deploy it to the VM. 5. Wait & dump: wait until the training is completed and dump the binaries to the BLOB storage; then shut down the VM.
  12. 12. Machine types and cost ● Predefined machine types ○ Virtual CPUs + memory (with some providers, network throughput may depend on the machine type) ○ Price depends on the datacenter location (i.e., if you only need to train a model, you can choose the cheapest location) ● Billing: per second, but with a minimum of 1 minute per VM ● Discounts ○ Sustained use discounts ■ Up to 30% net discount if you use an instance for more than 25% of a month ■ Different instance types are automatically combined ○ Committed use discounts ■ for 1-3 years
  13. 13. Images ● You can start with a “clean” OS of a certain version or use an OS with pre-installed/pre-configured software ● Free tier vs. premium tier ○ free tier: Debian, Ubuntu, CentOS, CoreOS ○ beta: deep learning images (with a Jupyter server running) ○ premium tier: Anaconda Enterprise, Caffe Python 3.6 CPU, … - you may pay an additional licensing fee and the machines are not customizable ○ user-provided custom image ● You can create your own images / make snapshots of your machines ● Official images have Compute Engine tools installed on them
  14. 14. Startup scripts ● Install/update software, perform warm-up, … ● GCE copies your script, sets permissions to make it executable and executes it ● How to provide the script to GCE: ○ from a local file --metadata-from-file=startup-script=<PATH_TO_SCRIPT> ○ directly --metadata=startup-script="#! /bin/bash; sudo su -; apt-get update" ○ from GCS (mind the ACLs!) --metadata=startup-script-url="gs://bucket/startup.sh" ● You can provide custom keys to the startup scripts when creating an instance gcloud compute instances create example-instance --metadata foo=bar --metadata=startup-script="#! /bin/bash; FOO=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo -H 'Metadata-Flavor: Google'); echo $FOO"
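The same custom metadata key can also be read from Python inside the training job itself, via the metadata server shown above. A minimal sketch (only works from inside a GCE VM; the key name 'foo' mirrors the slide's example):

```python
import requests

METADATA_URL = 'http://metadata.google.internal/computeMetadata/v1/instance/attributes/'

def get_metadata(key, default=None):
    """Read a custom metadata key from inside a GCE VM."""
    resp = requests.get(METADATA_URL + key, headers={'Metadata-Flavor': 'Google'})
    return resp.text if resp.status_code == 200 else default

print(get_metadata('foo', default='not set'))
```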
  15. 15. Persistent disk ● HDD vs. SSD ○ you decide on the size of the disk ○ the disk can be resized after the VM has been created ● additional pricing for as long as the persistent disk exists ○ so you can stop the VM and restart it later (paying only for the disk storage) ○ or make your VM fully stateless and use other storage ● you can add disks / resize disks / reattach disks ○ create snapshots for backups
  16. 16. GPUs ● Additional pricing and subject to project quotas ○ Nvidia Tesla P100 (16 GB HBM2) ○ Nvidia Tesla K80 (12 GB GDDR5) ○ Nvidia Tesla V100 (16 GB HBM2) = beta ● GPUs are not available in all zones ● GPU instances can be terminated for host maintenance ● Nvidia Tesla V100s are offered with NVLink connections between GPUs
  17. 17. TPUs ● Google custom-developed ASICs to accelerate ML workloads (using Tensorflow) ○ accelerate the performance of linear algebra computation ● Currently in beta, available only in a few zones ● You need to get a quota for your project ● All the data preprocessing / checkpoints / etc. are executed on the VM ○ The dense part of the Tensorflow graph, plus the loss and gradient subgraphs, is compiled with the XLA (Accelerated Linear Algebra) compiler, and this part of your code is executed on TPUs ○ Only a limited list of Tensorflow ops is available on TPU ○ Benchmarks: http://dawn.cs.stanford.edu/benchmark/
  18. 18. When to use what: CPUs ● Quick prototyping ● Simple models ● Small models with small batch sizes ● A lot of custom ops written in C++ ● Models limited by I/O or available network bandwidth GPUs ● Models that are not written in Tensorflow ● A lot of custom ops written in C++ but optimized for GPUs ● Medium-to-large models with large batch sizes ● Using Tensorflow ops not available on TPUs TPUs ● A lot of matrix computations ● Only supported Tensorflow ops ● Large models with large batch sizes ● Long training times
  19. 19. Train the model ● Create the instance gcloud compute instances create <YOUR INSTANCE> … --metadata KEEP_ALIVE=true ● Build the startup script ○ Install all dependencies (or use your custom image!) ○ Copy your data locally gsutil cp gs://bucket/raw-data/2018-04/* . ○ Copy your code locally git clone ... ○ Execute your training job ○ If needed, set up SSH tunnelling to access Tensorboard, etc. ● Include a self-destroy command as the last step of the training job if ! [ "$KEEP_ALIVE" = "true" ]; then sudo shutdown -h now; fi ● You can provide your hyperparameters as metadata key-value pairs ○ how effective is your grid search now!
  20. 20. Preemptible VMs ● Much lower price than normal instances ● Can be terminated at any time ○ A soft OFF signal is sent; you have 30s to clean up with a shutdown script ○ Will always be terminated after 24h ○ You can simulate a maintenance event gcloud beta compute instances simulate-maintenance-event <YOUR_INSTANCE_NAME> --zone <YOUR_ZONE> ● Might not be available and are not covered by an SLA ● GCE generally preempts the most recently launched instances if needed ○ the average preemption rate varies from 5% to 15%, but there is no SLA here! ● VMs, GPUs and local SSDs can all be preemptible
  21. 21. Tensorflow: an open-source solution ● Created by Google Brain ● The most popular ML project on Github ○ Over 480 contributors ○ 10,000 commits in 12 months ● Fast C++ engine and support for distributed training ● Extensive coverage of the latest deep learning algorithms ● Both high-level and low-level APIs ● Multiple deployment options ○ Mobile, Desktop, Server, Cloud ○ CPU, GPU, TPU
  22. 22. TF.ESTIMATOR ● tf.estimator - a high-level Tensorflow API that makes your life easier ○ You can always use lower-level APIs if needed ○ it takes care of a lot of things during training ■ error handling ■ building graphs, initializing variables, starting queues, … ■ preparing summaries for Tensorboard ■ creating checkpoints and recovering from failures ● you can run the same model in different environments (local, multi-server, …) ● you provide a model as a set of functions (train_op, eval_op, …) ○ estimators have APIs like fit(), evaluate(), predict(), …
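A minimal sketch of the tf.estimator workflow (TF 1.x). The toy input_fn, feature column and model_dir are assumptions, and note that in tf.estimator the training method is train() rather than the older fit():

```python
import tensorflow as tf

def train_input_fn():
    # Toy in-memory data; a real input_fn would use tf.data over files (see the slides below).
    features = {'x': [[1.0], [2.0], [3.0], [4.0]]}
    labels = [0, 0, 1, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

estimator = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column('x')],
    hidden_units=[8, 4],
    model_dir='/tmp/my_model')  # checkpoints + Tensorboard summaries land here

estimator.train(input_fn=train_input_fn, steps=100)
```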
  23. 23. Preemptible VMs with tf ● Read processed data / save checkpoints to the GCS bucket directly ● If the preemptible instance is killed, the training will continue from the last checkpoint after you restart it ● To restart automatically, you need to use an instance template and a managed instance group ○ this is the mechanism used for autostart / autohealing / autoscaling gcloud beta compute instance-templates create my-training-job … --preemptible gcloud compute instance-groups managed create my-training-job-managed --base-instance-name my-training-job --size 1 --template my-training-job --zone <YOUR_ZONE> ● You can launch tensorboard locally: ○ tensorboard --logdir=$YOUR_OUTPUT_DIR
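A hedged sketch of how this looks with tf.estimator: point model_dir at a GCS path via RunConfig so checkpoints survive preemption, and train() resumes from the latest checkpoint after the managed instance group restarts the VM. The bucket path, checkpoint interval and toy input_fn are assumptions:

```python
import tensorflow as tf

def train_input_fn():
    ds = tf.data.Dataset.from_tensor_slices(
        ({'x': [[1.0], [2.0], [3.0], [4.0]]}, [0, 0, 1, 1]))  # toy data
    return ds.repeat().batch(2)

run_config = tf.estimator.RunConfig(
    model_dir='gs://my-bucket/models/my-training-job',  # placeholder bucket
    save_checkpoints_secs=120,   # checkpoint every 2 minutes
    keep_checkpoint_max=5)

estimator = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column('x')],
    hidden_units=[8, 4],
    config=run_config)

# On a fresh start this begins at step 0; after a preemption + restart it
# resumes from the latest checkpoint found in model_dir.
estimator.train(input_fn=train_input_fn, steps=10000)
```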
  24. 24. Distributed training
  25. 25. OOM / training takes too long? ● Scale vertically ○ use more powerful VMs ○ parallelize hyperparameter tuning (run multiple training jobs in parallel) ● Scale horizontally ○ rent a cluster and use data parallelization: execute batches in parallel ○ rent a cluster and use model parallelization: execute your model in parallel
  26. 26. It’s important to remember that in most cases switching to distributed training requires you to adjust your code.
  27. 27. Need for compute power ● The amount of compute required to train state-of-the-art models is increasing exponentially ● 300,000x increase since 2012 ○ Neural Machine Translation: ~80 petaflop/s-days ■ A Google TPU offers up to 180 teraflops Source: https://blog.openai.com/ai-and-compute/
  28. 28. There are many more tricks and approaches available (than just training in distributed mode)
  29. 29. Parallelization in practice ● Choose your favorite framework and explore its options for distributed training ○ Use your favorite framework ■ Rent a few VMs and take care of the cluster setup yourself, incl. the networking ■ Rent a Hadoop cluster as a managed service (Dataproc) and use SparkML ■ ... ○ Use Tensorflow ■ Rent a few VMs and take care of the cluster setup yourself, incl. the networking ■ Use ML Engine (a managed service) ■ Use kubeflow (with Google Kubernetes Engine) ■ ...
  30. 30. Data parallelism ● Standard SGD ● Synchronous parallel SGD ○ Shuffle data uniformly across workers ○ For every step: ■ Deliver the current parameters to each worker and run an SGD step ■ [Synchronize] Aggregate the results and update the parameters on the parameter server (PS)
  31. 31. (Diagram: workers 1 … j each sample a minibatch from uniformly shuffled data and run an SGD step; the gradients are aggregated and the parameters updated.)
  32. 32. Data parallelism ● Parallel SGD should be evaluated in terms of computation time, communication cost and communication time ● You can reduce some synchronization bottlenecks with Asynchronous Stochastic Gradient Descent ○ There are various adjustments to compensate for delayed (stale) gradients
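To make the synchronous variant concrete, here is a minimal single-process simulation of data-parallel SGD for linear regression in pure NumPy; the worker count, learning rate and shapes are illustrative assumptions, not part of the slides:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1024, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(1024)

num_workers, batch_size, lr = 4, 32, 0.1
shards = np.array_split(np.random.permutation(1024), num_workers)  # shuffle + shard

w = np.zeros(10)  # the "parameter server" state

for step in range(100):
    grads = []
    for shard in shards:                                  # each iteration plays one worker
        idx = np.random.choice(shard, batch_size, replace=False)
        err = X[idx] @ w - y[idx]
        grads.append(X[idx].T @ err / batch_size)         # local gradient on its mini-batch
    w -= lr * np.mean(grads, axis=0)                      # [Synchronize] aggregate & update on the PS
```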
  33. 33. You have 4 workers. What speedup would you expect?
  34. 34. Data processing with Tensorflow ● The tf.data.Dataset API provides a high-level way to implement data input pipelines ○ Ready-made implementations are available: FixedLengthRecordDataset, TextLineDataset, TFRecordDataset ● You can read CSV with tf.decode_csv ● You can apply various transformations to your dataset (cache, map, flat_map and decode) ● When you are done, you initialize an Iterator with .make_one_shot_iterator() or .make_initializable_iterator()
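A small sketch of such a pipeline for a CSV file (TF 1.x); the GCS path and the four-column schema (three floats plus an integer label) are assumptions:

```python
import tensorflow as tf

dataset = tf.data.TextLineDataset('gs://my-bucket/raw-data/2018-04/data.csv')  # placeholder path

def parse_line(line):
    fields = tf.decode_csv(line, record_defaults=[[0.0], [0.0], [0.0], [0]])
    features = tf.stack(fields[:-1])   # first three columns -> feature vector
    label = fields[-1]                 # last column -> label
    return features, label

dataset = dataset.map(parse_line).batch(32)
iterator = dataset.make_one_shot_iterator()
features, labels = iterator.get_next()  # tensors to feed into your model / input_fn
```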
  35. 35. Pipelining ● Before: dataset = dataset.batch(batch_size=FLAGS.batch_size) ● After: dataset = dataset.batch(batch_size=FLAGS.batch_size) dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  36. 36. Parallelize Data Transformation ● Before: dataset = dataset.map(map_func=parse_fn) ● After: dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_calls)
  37. 37. Parallelize Data Extraction ● Before: dataset = files.interleave(tf.data.TFRecordDataset) ● After: dataset = files.apply(tf.contrib.data.parallel_interleave( tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_readers, sloppy=True))
  38. 38. Randomizing data ● shard according to the number of workers ● interleave on every worker ● shuffle each portion of the data (see the combined sketch below)
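Putting the last few slides together, a worker-aware input_fn might look roughly like this (TF 1.x); the file pattern, feature spec and tuning constants are assumptions:

```python
import tensorflow as tf

def input_fn(worker_index, num_workers):
    files = tf.data.Dataset.list_files('gs://my-bucket/raw-data/2018-04/*.tfrecord')
    files = files.shard(num_workers, worker_index)              # shard by number of workers
    dataset = files.apply(tf.contrib.data.parallel_interleave(  # parallel extraction
        tf.data.TFRecordDataset, cycle_length=4, sloppy=True))
    dataset = dataset.shuffle(buffer_size=10000)                # shuffle this worker's portion

    def parse_fn(record):
        parsed = tf.parse_single_example(record, {
            'features': tf.FixedLenFeature([3], tf.float32),
            'label': tf.FixedLenFeature([1], tf.float32),
        })
        return {'x': parsed['features']}, parsed['label']

    dataset = dataset.map(parse_fn, num_parallel_calls=4)        # parallel transformation
    dataset = dataset.batch(128)
    return dataset.prefetch(buffer_size=1)                       # pipeline with the training step
```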
  39. 39. Model parallelism ● Separate subgraphs can be placed on different devices ○ Workers need to communicate with each other ○ You can combine CPUs/GPUs/TPUs together ● This placement has to be done by you explicitly with tf.device("/gpu:0"): … with tf.device("/gpu:1"): … with tf.device("/cpu:0"): …
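A toy example of placing parts of a graph on different devices by hand; the device names are assumptions (they depend on which accelerators your VM actually has), and allow_soft_placement lets the graph still run when a device is missing:

```python
import tensorflow as tf

with tf.device('/cpu:0'):
    x = tf.random_normal([256, 1024])          # input batch generated on the CPU

with tf.device('/gpu:0'):
    w1 = tf.Variable(tf.random_normal([1024, 512]))
    h = tf.nn.relu(tf.matmul(x, w1))           # first layer placed on GPU 0

with tf.device('/gpu:1'):
    w2 = tf.Variable(tf.random_normal([512, 10]))
    logits = tf.matmul(h, w2)                  # second layer placed on GPU 1

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(logits)
```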
  40. 40. Data Parallelism & Model Parallelism ● Data parallelism: process multiple mini-batches on different workers simultaneously (asynchronous SGD) ○ each worker node runs its own input_fn, model layers, loss and gradients on its mini-batch; the parameter server averages the gradients and updates the parameters; a chief coordinates the work via a shared data queue ● Model parallelism: run different model operations on multiple devices simultaneously ○ good for partially connected models (wide and deep models + embeddings) and recurrent models
  41. 41. Three ways to do AI on Google Cloud (increasing customisation, decreasing ease of use) ● Call our perception APIs (our data + our models): Cloud Vision API, Cloud Translation API, Cloud Natural Language API, Cloud Speech API, Cloud Video Intelligence API, Cloud Speech Synthesis API, Data Loss Prevention API ● Train our state-of-the-art models (your data + our models): AutoML, Dialogflow ● Build your own models (your data + your model): Cloud ML Engine, Cloud Dataproc, Compute Engine, Kubernetes Engine, Cloud TPUs
  42. 42. Outcomes ● Training models in the cloud is easy, it scales very easily, and you can automate a lot of things ○ If your framework supports fault tolerance, don’t forget about preemptible instances ● Distributed training is relatively easy to implement with modern frameworks, but you still need to tune your code ○ You need to start thinking about it from the moment you start organizing your data ○ Data parallelization versus model parallelization ○ Or scale vertically with TPUs ● Try out Tensorflow!
  43. 43. Thank you! Questions?
