Training your machine learning model
with Cloud
Training ML model with Cloud
● Rent a single VM
○ Which options do you have?
○ How to make it cost-effective?
● Distributed training
○ Horizontal vs. vertical scaling
○ Tensorflow
© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
The ML surprise
Effort Allocation: Expectation
(Chart: expected effort allocation across defining KPI’s, collecting data, building infrastructure, optimizing the ML algorithm and integration.)
*Informally based on my many conversations with new ML practitioners
The ML surprise
Effort Allocation: Expectation vs. Reality
(Chart: expected vs. actual effort allocation across defining KPI’s, collecting data, building infrastructure, optimizing the ML algorithm and integration; in reality, optimizing the ML algorithm accounts for only about 0.1 of the effort.)
This is ...
● an intro to Cloud with regard to training ML models
● an intro to distributed training
This is not
● an intro to Tensorflow or ML
● a talk about TPUs
● a talk about inference
Why?
● Scale
● Flexibility
● Additional tools as fully managed services
● Cost-effective
● Shorter time-to-market
About me
● Leonid Kuligin
○ Moscow Institute of Physics and Technology
○ 2017 - … Google Cloud, PSO, ML engineer - Munich, Germany
○ 2016-2017 Scout GmbH, Senior Software Engineer - Munich, Germany
○ 2015-2016 HH.RU, Senior Product Manager (search engine) - Moscow, Russia
○ 2013-2015 Yandex, Team Lead (data production team, Local search) - Moscow, Russia
https://www.linkedin.com/in/leonid-kuligin-53569544/
Rent a VM
Prepare & organize your data
● BLOB storage = Google Cloud Storage / AWS S3
○ Think about the format (JSON, Parquet, Avro, TFRecord, ...) & proper partitioning
○ Blob storage is usually HDFS-compatible
● Other storage options
○ BigQuery
○ Cloud SQL (managed MySQL or PostgreSQL)
○ Key-value storage (e.g. Bigtable) as a managed service
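To make the partitioning point concrete, here is a small pure-Python sketch of a date-partitioned JSON-lines layout of the kind you would upload to a gs:// bucket. The directory scheme, field names and shard name are illustrative, not a required convention:

```python
import json, os, tempfile
from collections import defaultdict

def write_partitioned(records, root):
    """Group records by date and write one JSON-lines shard per partition,
    mimicking a layout like .../dataset/dt=2018-04-01/part-0000.json."""
    parts = defaultdict(list)
    for r in records:
        parts[r["dt"]].append(r)
    paths = []
    for dt, rows in sorted(parts.items()):
        d = os.path.join(root, "dt=%s" % dt)
        os.makedirs(d, exist_ok=True)
        p = os.path.join(d, "part-0000.json")
        with open(p, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        paths.append(p)
    return paths

root = tempfile.mkdtemp()
records = [{"dt": "2018-04-01", "x": 1}, {"dt": "2018-04-02", "x": 2}]
paths = write_partitioned(records, root)
```

Partitioning like this lets a training job read only the date ranges it needs instead of scanning the whole dataset.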
Rent a single VM
● Create a VM via GUI or script
● Specify basic parameters
○ Machine type
○ Persistent disk
○ Image
○ Attach accelerator
○ Preemptible
○ Security, networking, ...
○ ...
Workflow
1. Move your data to the Cloud and organize it
2. Write a local trainer and test it on a small sample of your data
3. Reproduce your environment on the VM rented in the Cloud
4. Package your trainer and deploy it to the VM
5. Wait until the training is completed, dump the binaries to the BLOB storage and shut the VM down
Machine types and cost
● Predefined machine types
○ Virtual CPUs + memory (with some providers, network throughput may depend on the machine type)
○ Price depends on the datacenter location (i.e., if you only need to train a model, you might choose the cheapest location)
● Billing: per second, with a minimum of 1 minute per VM
● Discounts
○ Sustained use discounts
■ Up to 30% net discount if you use an instance for more than 25% of a month
■ Different instance types are automatically combined
○ Committed use discounts
■ for 1-3 years
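As a sanity check on the billing model, a tiny calculator; the hourly price here is a made-up number for illustration, not an actual GCE rate:

```python
def vm_cost(seconds_used, price_per_hour):
    """Per-second billing, but every VM run is billed for at least 1 minute."""
    billed_seconds = max(seconds_used, 60)
    return billed_seconds / 3600.0 * price_per_hour

# a 30-second run is billed as a full minute; a 2-hour run is billed exactly
short_run = vm_cost(30, price_per_hour=1.0)
long_run = vm_cost(7200, price_per_hour=1.0)
```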
Images
● You can start with a “clean” OS with a certain version or you can use an OS with pre-
installed/pre-configured software
● Free tier vs premium tier
○ free tier: Debian, Ubuntu, CentOS, coreOS
○ beta: deep learning images (with jupyter server running)
○ premium team: Anaconda Enterprise, Caffe Python 3.6 CPU, … - you might pay
some additional licensing fee and machines are not customizable
○ user-provided custom image
● You can create your own images / make snapshots of your machines
● Official images have Compute Engine tools installed on them
Startup scripts
● Install/update software, perform warm-up, …
● GCE copies your script, sets permissions to make it executable, and executes it
● How to provide the script to GCE:
○ from a local file: --metadata-from-file=startup-script=<PATH_TO_SCRIPT>
○ directly: --metadata=startup-script="#! /bin/bash; sudo su -; apt-get update"
○ from GCS (mind the ACLs!): --metadata=startup-script-url="gs://bucket/startup.sh"
● You can provide custom keys to the startup scripts when creating an instance
gcloud compute instances create example-instance --metadata foo=bar \
  --metadata=startup-script="#! /bin/bash
  FOO=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo -H 'Metadata-Flavor: Google'); echo $FOO"
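The metadata server is plain HTTP from inside the VM, so the same lookup works from Python as well. A sketch; the injectable `fetch` parameter is only there so the function can be exercised outside a VM:

```python
import urllib.request

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"

def read_metadata(key, fetch=None):
    """Read a custom metadata value from inside a GCE VM.
    The Metadata-Flavor header is mandatory; without it the server
    refuses the request."""
    req = urllib.request.Request(METADATA_URL + key,
                                 headers={"Metadata-Flavor": "Google"})
    if fetch is None:  # the real call; only resolves on a GCE VM
        fetch = lambda r: urllib.request.urlopen(r).read().decode()
    return fetch(req)

# off-VM check: verify the request we would send carries the right header
# (urllib stores header names capitalized, hence "Metadata-flavor")
fake_server = lambda req: "bar" if req.get_header("Metadata-flavor") == "Google" else ""
value = read_metadata("foo", fetch=fake_server)
```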
Persistent disk
● HDD vs. SSD
○ you decide on the size of the disk
○ the disk can be resized after the VM has been created
● additional pricing as long as the persistent disk exists
○ so you can stop the VM and restart it later (paying only for the disk storage)
○ or make your VM fully stateless and use other storage options
● you can add disks / resize disks / reattach disks
○ create snapshots for backups
GPUs
● Additional pricing and subject to project quotas
○ Nvidia Tesla P100 (16 GB HBM2)
○ Nvidia Tesla K80 (12 GB GDDR5)
○ Nvidia Tesla V100 (16 GB HBM2) = beta
● GPUs are not available in all zones
● GPU instances can be terminated for host maintenance
● Nvidia Tesla V100 GPUs are offered with NVLink connections between GPUs
TPUs
● Google’s custom-developed ASICs to accelerate ML workloads (using Tensorflow)
○ accelerate the performance of linear algebra computation
● Currently in beta, available only in a few zones
● You need to get a quota for your project
● All the data preprocessing / checkpointing / etc. is executed on the VM
○ The dense part of the Tensorflow graph plus the loss and gradient subgraphs are compiled with the XLA (Accelerated Linear Algebra) compiler, and this part of your code is executed on TPUs
○ Only a limited list of Tensorflow ops is available on TPU
○ Benchmarks: http://dawn.cs.stanford.edu/benchmark/
CPUs
● Quick prototyping
● Simple models
● Small models with small batch sizes
● A lot of custom ops written in C++
● Models limited by available I/O or network bandwidth
GPUs
● Models that are not written in Tensorflow
● A lot of custom ops written in C++ but optimized for GPUs
● Medium-to-large models with large batch sizes
● Models using Tensorflow ops that are not available on TPUs
TPUs
● A lot of matrix computations
● Only supported Tensorflow ops
● Large models with large batch sizes
● Long training times
Train the model
● Create the instance
gcloud compute instances create <YOUR INSTANCE> … --metadata "KEEP_ALIVE=true"
● Build the startup script
○ Install all dependencies (or use your custom image!)
○ Copy your data locally: gsutil cp gs://bucket/raw-data/2018-04/*
○ Copy your code locally: git clone ...
○ Execute your training job
○ If needed, take care of SSH tunnelling to access Tensorboard / etc.
● Include a self-destruct command in the training job as the last step
if ! [ "$KEEP_ALIVE" = "true" ]; then
  sudo shutdown -h now
fi
● You can provide your hyperparameters as metadata key-value pairs
○ how effective is your grid search now!
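The grid-search remark can be sketched as follows: generate one instance-creation command per hyperparameter combination and pass the combination via --metadata. The instance prefix and hyperparameter names below are illustrative:

```python
from itertools import product

def grid_commands(instance_prefix, grid):
    """Generate one `gcloud compute instances create` command per
    hyperparameter combination, passed as metadata key-value pairs."""
    keys = sorted(grid)
    cmds = []
    for i, values in enumerate(product(*(grid[k] for k in keys))):
        meta = ",".join("%s=%s" % (k, v) for k, v in zip(keys, values))
        cmds.append("gcloud compute instances create %s-%d --metadata %s"
                    % (instance_prefix, i, meta))
    return cmds

cmds = grid_commands("trainer", {"lr": [0.1, 0.01], "batch": [32, 64]})
```

Each VM then reads its own hyperparameters from the metadata server in its startup script, so the whole grid runs in parallel.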
Preemptible VMs
● Much lower price than normal instances
● Can be terminated at any time
○ A soft OFF signal is sent; you have 30s to clean up with a shutdown script
○ Will always be terminated after 24h
○ You can simulate a maintenance event:
gcloud beta compute instances simulate-maintenance-event <YOUR_INSTANCE_NAME> --zone <YOUR_ZONE>
● Might not be available and are not covered by an SLA
● GCE generally preempts the instances launched most recently if needed
○ the average preemption rate varies from 5% to 15%, but there is no SLA here!
● VMs, GPUs and local SSDs can all be preemptible
Tensorflow: an open source solution
● Created by Google Brain
● Most popular ML project on Github
○ Over 480 contributors
○ 10000 commits in 12 months
● Fast C++ engine and support for distributed training
● Extensive coverage of latest deep learning algorithms
● Both high-level and low-level APIs
● Multiple deployment options
○ Mobile, Desktop, Server, Cloud
○ CPU, GPU, TPU
TF.ESTIMATOR
● tf.estimator - a high-level Tensorflow API that makes your life easier
○ You can always use lower-level APIs if needed
○ it takes care of a lot of things during the training
■ error handling
■ building graphs, initializing variables, starting queues, …
■ preparing summaries for Tensorboard
■ creating checkpoints and recovering from failures
● you can run the same model in different environments (local, multi-server, …)
● you provide a model as a set of functions (train_op, eval_op, …)
○ estimators have APIs like fit(), evaluate(), predict(), …
Preemptible VMs with tf
● Read processed data from / save checkpoints to a GCS bucket directly
● If the preemptible instance is killed, the training will continue from the last checkpoint after you restart it
● To restart automatically, you need to use an instance template and a managed instance group
○ this is the mechanism used for autostart / autohealing / autoscaling
gcloud beta compute instance-templates create my-training-job … --preemptible
gcloud compute instance-groups managed create my-training-job-managed --base-instance-name my-training-job --size 1 --template my-training-job --zone <YOUR_ZONE>
● You can launch tensorboard locally:
○ tensorboard --logdir=$YOUR_OUTPUT_DIR
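The checkpoint-and-restart behaviour that makes preemptible training safe can be illustrated framework-free. This is a toy sketch, with a JSON file standing in for a real checkpoint and a hard-coded "preemption" at step 3:

```python
import json, os, tempfile

def train(total_steps, ckpt_path, log):
    """Resumable loop: restore the last step from the checkpoint,
    continue training, and checkpoint after every step."""
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        step += 1          # one "training step"
        log.append(step)
        with open(ckpt_path, "w") as f:
            json.dump({"step": step}, f)
        if step == 3:      # simulate the VM being preempted mid-training
            return False
    return True

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
log = []
finished = train(5, ckpt, log)   # "preempted" at step 3
finished = train(5, ckpt, log)   # restart resumes from the checkpoint
```

No step is repeated after the restart, which is exactly what a managed instance group plus tf checkpoints gives you for free.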
Distributed training
OOM / too long training time?
● Scale vertically
○ use more powerful VMs
○ parallelize hyperparameters tuning (run multiple training jobs in parallel)
● Scale horizontally
○ rent a cluster and use data parallelization: execute batches in parallel
○ rent a cluster and use model parallelization: execute your model in parallel
It’s important to remember that in most cases switching to distributed training requires you to adjust your code.
Need for compute power
● The amount of compute required to train state-of-the-art models is increasing exponentially
● A 300,000x increase since 2012
○ Neural Machine Translation: 80 petaflop/s-days
■ A Google TPU offers up to 180 teraflops
Source: https://blog.openai.com/ai-and-compute/
There are many more tricks and approaches available (than just training in distributed mode)
Parallelization in practice
● Choose your favorite framework and explore options for distributed training
○ Use your favorite framework
■ Rent a few VMs, take care of the cluster setup, incl. the networking
■ Rent a Hadoop cluster as a managed service (Dataproc), use SparkML
■ ...
○ Use Tensorflow
■ Rent a few VMs, take care of the cluster setup, incl. the networking
■ Use ML Engine (a managed service)
■ Use kubeflow (with Google Kubernetes Engine)
■ ...
Data parallelism
● Standard SGD
● Synchronous parallel SGD
○ Shuffle data uniformly across workers
○ For every step:
■ Deliver the current parameters to each worker and run an SGD step
■ [Synchronize] Aggregate the results and update the parameters on the parameter server (PS)
(Diagram: uniformly shuffled data is split across workers 1…j; each worker samples a minibatch and runs an SGD step; the gradients are aggregated and the parameters updated.)
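The synchronous scheme can be shown in a toy pure-Python version for a one-parameter model y ≈ w·x: each "worker" computes the gradient on its shard, and the averaged gradient updates the parameter (in real training the aggregation happens on the parameter server):

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w*x on one worker's shard."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def sync_sgd_step(w, shards, lr):
    """One synchronous step: every worker computes a gradient on its shard,
    the averaged gradient updates the shared parameter."""
    grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    return w - lr * sum(grads) / len(grads)

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]   # data follows y = 2x
shards = [(xs[i::2], ys[i::2]) for i in range(2)]       # shard across 2 workers
w = 0.0
for _ in range(100):
    w = sync_sgd_step(w, shards, lr=0.05)
```

After a hundred synchronized steps w converges to the true slope 2.0, just as a single-machine SGD on the full data would.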
Data parallelism
● Parallel SGD should be evaluated in terms of computation time, communication cost and
communication time
● You can reduce some synchronization bottlenecks with Asynchronous Stochastic Gradient Descent
○ Various adjustments exist to compensate for delayed gradients
You have 4 workers. What would you expect?
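What to expect can be simulated: asynchronous workers apply gradients computed from parameters that are already a few updates old. A toy sketch for the same y = 2x model shows that with a small enough learning rate the delayed gradients still converge:

```python
def grad(w, x, y):
    """Gradient of the squared error of y ~ w*x on a single data point."""
    return 2 * (w * x - y) * x

def async_sgd(steps, delay, lr):
    """Asynchronous SGD sketch: each applied gradient was computed from the
    parameters as they were `delay` updates ago (gradient staleness)."""
    w = 0.0
    history = [w]
    for _ in range(steps):
        stale_w = history[max(0, len(history) - 1 - delay)]
        w = w - lr * grad(stale_w, 1.0, 2.0)   # data point (x=1, y=2)
        history.append(w)
    return w

w_fresh = async_sgd(200, delay=0, lr=0.1)   # synchronous baseline
w_stale = async_sgd(200, delay=3, lr=0.1)   # gradients are 3 updates old
```

Both runs reach the true slope 2.0; with larger delays or learning rates the stale run would start oscillating, which is why asynchronous SGD needs compensation tricks.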
Data processing with Tensorflow
● The tf.data.Dataset API is a high-level way to implement data input pipelines
○ Ready-made implementations: FixedLengthRecordDataset, TextLineDataset, TFRecordDataset
● You can read CSV with tf.decode_csv
● You can apply various transformations to your dataset (cache, map, flat_map and decode)
● When you are done, you initialize an Iterator with .make_one_shot_iterator() or .make_initializable_iterator()
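The chaining style of the API can be mimicked in a few lines of plain Python. This is a sketch of the concept only, not the real (lazy, graph-based) tf.data implementation:

```python
class Dataset:
    """A toy eager stand-in for the tf.data.Dataset chaining style."""
    def __init__(self, source):
        self._source = source

    @classmethod
    def from_lines(cls, lines):
        return cls(list(lines))

    def map(self, fn):
        """Apply fn to every element (like Dataset.map)."""
        return Dataset([fn(x) for x in self._source])

    def batch(self, n):
        """Group consecutive elements into batches of size n."""
        src = self._source
        return Dataset([src[i:i + n] for i in range(0, len(src), n)])

    def make_one_shot_iterator(self):
        return iter(self._source)

ds = (Dataset.from_lines(["1,2", "3,4"])
      .map(lambda s: [int(v) for v in s.split(",")])   # parse, like decode_csv
      .batch(2))
batches = list(ds.make_one_shot_iterator())
```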
Pipelining
Before:
dataset = dataset.batch(batch_size=FLAGS.batch_size)
After:
dataset = dataset.batch(batch_size=FLAGS.batch_size)
dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
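What prefetch buys you is overlap: a producer thread fills a bounded buffer while the consumer (the training step) computes. A minimal stdlib sketch of the idea:

```python
import queue, threading

def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, produced by a background thread into a
    bounded queue, so input preparation overlaps with consumption."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def produce():
        for item in iterable:
            q.put(item)
        q.put(done)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

consumed = list(prefetch(range(5), buffer_size=2))
```

Order is preserved; only the timing changes, which is why adding prefetch is always safe.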
Parallelize Data Transformation
Before:
dataset = dataset.map(map_func=parse_fn)
After:
dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_calls)
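num_parallel_calls corresponds to applying the map function concurrently while keeping the output order, which can be sketched with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, num_parallel_calls):
    """Apply fn to the items concurrently while preserving input order,
    analogous to map(..., num_parallel_calls=N)."""
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        return list(pool.map(fn, items))

parsed = parallel_map(lambda s: s.upper(), ["a", "b", "c"], num_parallel_calls=2)
```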
Parallelize Data Extraction
Before:
dataset = files.interleave(tf.data.TFRecordDataset)
After:
dataset = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_readers, sloppy=True))
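parallel_interleave cycles through several readers at once. Ignoring the actual parallel I/O, the interleaving pattern itself looks like this sketch (each inner list stands for the records of one file):

```python
def interleave(sources, cycle_length):
    """Round-robin records from `cycle_length` sources at a time; when one
    source is exhausted, the next pending source takes its slot."""
    out = []
    iters = [iter(s) for s in sources[:cycle_length]]
    pending = list(sources[cycle_length:])
    while iters:
        alive = []
        for it in iters:
            try:
                out.append(next(it))
                alive.append(it)
            except StopIteration:
                if pending:  # replace the exhausted reader with a new file
                    alive.append(iter(pending.pop(0)))
        iters = alive
    return out

merged = interleave([[1, 2], [3, 4], [5, 6]], cycle_length=2)
```

With sloppy=True the real op additionally relaxes this deterministic order for throughput.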
Randomizing data
● shard according to the number of workers
● interleave on every worker
● shuffle each portion of the data
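Combining the sharding and shuffling steps for one worker can be sketched as deterministic sharding by worker index followed by a per-worker seeded shuffle:

```python
import random

def worker_input(records, num_workers, worker_index, seed):
    """Deterministically shard the records by worker index, then shuffle
    this worker's shard (a different order on every worker)."""
    shard = records[worker_index::num_workers]
    rng = random.Random(seed + worker_index)
    rng.shuffle(shard)
    return shard

shards = [worker_input(list(range(10)), 3, i, seed=42) for i in range(3)]
```

Every record lands on exactly one worker and no worker sees another worker's data, which is the property data parallelism relies on.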
Model parallelism
● Separate subgraphs can be placed on different devices
○ Workers need to communicate with each other
○ You can combine CPUs/GPUs/TPUs together
● The device placement has to be done explicitly by you
with tf.device("/gpu:0"):
…
with tf.device("/gpu:1"):
…
with tf.device("/cpu:0"):
…
Data Parallelism & Model Parallelism
Data parallelism to process multiple mini-batches in different workers simultaneously
Model parallelism to run different model operations on multiple devices simultaneously
(Diagram: shared data feeds a chief queue and a parameter server; each worker node runs its input_fn on its own mini-batch (1, 2, 3) and computes the model layers, loss and gradients; the parameter server takes the mean of the gradients and updates the parameters: Asynchronous Stochastic Gradient Descent (SGD).)
Good for partially connected models (wide and deep models + embeddings) and recurrent models
Three ways for AI on Google Cloud (from ease of use to customisation)
● Call our perception APIs (our data + our models): Cloud Vision API, Cloud Translation API, Cloud Natural Language API, Cloud Speech API, Cloud Video Intelligence API, Cloud Speech Synthesis API, Data Loss Prevention API
● Train our state-of-the-art models (your data + our models): AutoML, Dialogflow
● Build your own models (your data + your model): Cloud ML Engine, Cloud Dataproc, Compute Engine, Kubernetes Engine, Cloud TPUs
Outcomes
● Training models in the cloud is easy, it scales well, and you can automate a lot of things
○ If your framework supports fault-tolerance, don’t forget about preemptible instances
● Distributed training is relatively easy to implement with modern frameworks, but you still need to tune your code
○ You need to think about it from the moment you start organizing your data
○ Data parallelization versus model parallelization
○ Or scale vertically with TPUs
● Try out Tensorflow!
Thank you!
Questions?
Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...Lviv Startup Club
 
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)Lviv Startup Club
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Lviv Startup Club
 
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)Lviv Startup Club
 
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...Lviv Startup Club
 

More from Lviv Startup Club (20)

Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...
Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...
Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...
 
Dmytro Khudenko: Challenges of implementing task managers in the corporate an...
Dmytro Khudenko: Challenges of implementing task managers in the corporate an...Dmytro Khudenko: Challenges of implementing task managers in the corporate an...
Dmytro Khudenko: Challenges of implementing task managers in the corporate an...
 
Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...
Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...
Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...
 
Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...
Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...
Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...
 
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
 
Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)
Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)
Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)
 
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
 
Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...
Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...
Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...
 
Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...
Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...
Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...
 
Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...
Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...
Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...
 
Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)
Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)
Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)
 
Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...
Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...
Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...
 
Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)
Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)
Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)
 
Nataliya Kryvonis: Essential soft skills to lead your team (UA)
Nataliya Kryvonis: Essential soft skills to lead your team (UA)Nataliya Kryvonis: Essential soft skills to lead your team (UA)
Nataliya Kryvonis: Essential soft skills to lead your team (UA)
 
Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...
Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...
Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...
 
Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...
Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...
Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...
 
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
 
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
 
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
 

Recently uploaded

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 

Recently uploaded (20)

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 

Leonid Kuligin "Training ML models with Cloud"

  • 1. Training your machine learning model with Cloud
  • 2. Training ML model with Cloud ● Rent a single VM ○ Which options do you have? ○ How to make it cost-effective? ● Distributed training ○ Horizontal vs. vertical scaling ○ Tensorflow
  • 3. © 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other company and product names may be trademarks of the respective companies with which they are associated. The ML surprise: effort allocation, expectation [chart across: defining KPI’s, collecting data, building infrastructure, optimizing the ML algorithm, integration] *Informally based on my many conversations with new ML practitioners
  • 4. The ML surprise: effort allocation, expectation vs. reality [chart comparing the expected and actual share of effort spent on defining KPI’s, collecting data, building infrastructure, optimizing the ML algorithm, and integration]
  • 5. This is ... ● an intro to Cloud with regard to training ML models ● an intro to distributed training This is not ● an intro to Tensorflow or ML ● a talk about TPU ● we won’t focus on inference at all
  • 6. Why? ● Scale ● Flexibility ● Additional tools as fully managed services ● Cost-effective ● Shorter time-to-market
  • 7. About me ● Leonid Kuligin ○ Moscow Institute of Physics and Technology ○ 2017 - … Google Cloud, PSO, ML engineer - Munich, Germany ○ 2016-2017 Scout GmbH, Senior Software Engineer - Munich, Germany ○ 2015-2016 HH.RU, Senior Product Manager (search engine) - Moscow, Russia ○ 2013-2015 Yandex, Team Lead (data production team, Local search) - Moscow, Russia https://www.linkedin.com/in/leonid-kuligin-53569544/
  • 9. Prepare & organize your data ● BLOB storage = Google Cloud Storage / AWS S3 ○ Think about the format (JSON, Parquet, Avro, TFRecord, ...) & proper partitioning ○ Blob storage is usually HDFS-compatible ● Other storage options ○ BigQuery ○ Cloud SQL (managed MySQL or PostgreSQL) ○ Key-value storage (e.g. Bigtable) as a managed service
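Proper partitioning usually means writing evenly sized shards with predictable names, so readers can list and split them cheaply. A minimal sketch of such a layout (the bucket and prefix below are placeholders, and the `part-XXXXX-of-NNNNN` pattern is just one common convention):

```python
def shard_paths(bucket, prefix, num_shards):
    """Evenly sharded file names on blob storage (placeholder layout)."""
    return [
        f"gs://{bucket}/{prefix}/part-{i:05d}-of-{num_shards:05d}.tfrecord"
        for i in range(num_shards)
    ]

paths = shard_paths("my-bucket", "training-data", 4)
print(paths[0])  # gs://my-bucket/training-data/part-00000-of-00004.tfrecord
```

With names like these, each worker can later pick its own subset of shards without coordination.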
  • 10. Rent a single VM ● Create a VM via GUI or script ● Specify basic parameters ○ Machine type ○ Persistent disk ○ Image ○ Attach accelerator ○ Preemptible ○ Security, networking, ... ○ ...
  • 11. 1 Data: move your data to the Cloud and organize it. 2 Local trainer: write a local trainer and test it on a small sample of your data. 3 Prepare VM: reproduce your environment on the VM rented in the Cloud. 4 Deploy: package your trainer and deploy it to the VM. 5 Wait & dump: wait until the training is completed and dump the binaries to the BLOB storage, then shut down the VM.
  • 12. Machine types and cost ● Predefined machine types ○ Virtual CPUs + memory (with some providers, network throughput might depend on the machine type) ○ Price depends on the datacenter location (i.e., if you only need to train a model, you might choose the cheapest location) ● Billing: per second, but a minimum of 1 minute per VM ● Discounts ○ Sustained use discounts ■ Up to a 30% net discount if you use an instance for more than 25% of a month ■ Different instance types are automatically combined ○ Committed use discounts ■ for 1-3 years
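To see how a sustained-use discount changes the bill, here is a toy calculator based on the slide's numbers (a flat 30% discount above 25% monthly usage). Real cloud pricing is tiered and provider-specific, so treat the rate, threshold, and prices as illustrative only:

```python
def monthly_cost(hourly_rate, hours_used, discount=0.30, threshold=0.25,
                 hours_in_month=730):
    """Toy sustained-use discount: a flat `discount` once usage exceeds
    `threshold` of the month. All rates here are illustrative, not real
    cloud prices."""
    base = hourly_rate * hours_used
    if hours_used / hours_in_month > threshold:
        return base * (1 - discount)
    return base

print(round(monthly_cost(0.10, 100), 2))  # 10.0 (below 25% usage, no discount)
print(round(monthly_cost(0.10, 500), 2))  # 35.0 (30% off the 50.0 base price)
```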
  • 13. Images ● You can start with a “clean” OS of a certain version, or you can use an OS with pre-installed/pre-configured software ● Free tier vs premium tier ○ free tier: Debian, Ubuntu, CentOS, CoreOS ○ beta: deep learning images (with a Jupyter server running) ○ premium tier: Anaconda Enterprise, Caffe Python 3.6 CPU, … - you might pay an additional licensing fee and the machines are not customizable ○ user-provided custom image ● You can create your own images / make snapshots of your machines ● Official images have Compute Engine tools installed on them
  • 14. Startup scripts ● Install/update software, perform warm-up, … ● GCE copies your script, sets permissions to make it executable and executes it ● How to provide the script to GCE: ○ from your local file --metadata-from-file=startup-script=<PATH_TO_SCRIPT> ○ directly --metadata=startup-script="#! /bin/bash; sudo su -; apt-get update" ○ from GCS (take care about ACLs!) --metadata=startup-script-url="gs://bucket/startup.sh" ● You can provide custom keys to the startup scripts when creating an instance gcloud compute instances create example-instance --metadata foo=bar --metadata=startup-script="#! /bin/bash; FOO=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo -H 'Metadata-Flavor: Google'); echo $FOO"
  • 15. Persistent disk ● HDD vs SSD ○ you decide on the size of the disk ○ the disk can be resized after the VM has been created ● additional pricing for as long as the persistent disk exists ○ so you can stop the VM and restart it later (paying only for the disk storage) ○ or make your VM fully stateless and use other storages ● you can add disks / resize disks / reattach disks ○ create snapshots for backups
  • 16. GPUs ● Additional pricing and subject to project quotas ○ Nvidia Tesla P100 (16 GB HBM2) ○ Nvidia Tesla K80 (12 GB GDDR5) ○ Nvidia Tesla V100 (16 GB HBM2) = beta ● GPUs are not available in all zones ● GPU instances can be terminated for host maintenance ● Nvidia Tesla V100s are offered with NVLink connections between GPUs
  • 17. TPUs ● Google custom-developed ASICs to accelerate ML workloads (using Tensorflow) ○ accelerate the performance of linear algebra computation ● Currently in beta, available only in a few zones ● You need to get a quota for your project ● All the data preprocessing / checkpoints / etc. is executed on the VM ○ The dense part of the Tensorflow graph, loss and gradients subgraphs are compiled with XLA (Accelerated Linear Algebra) compiler and this part of your code is executed on TPUs ○ Only a limited list of Tensorflow OPs is available on TPU ○ Benchmarks: http://dawn.cs.stanford.edu/benchmark/
  • 18. CPUs ● Quick prototyping ● Simple models ● Small models with small batch sizes ● A lot of custom ops written in C++ ● Models limited by I/O or available network bandwidth GPUs ● Models that are not written in Tensorflow ● A lot of custom ops written in C++ but optimized for GPUs ● Medium-to-large models with large batch sizes ● Models using Tensorflow ops not available on TPUs TPUs ● A lot of matrix computations ● Only supported Tensorflow ops ● Large models with large batch sizes ● Long training times
  • 19. Train the model ● Create the instance gcloud compute instances create <YOUR_INSTANCE> … --metadata "KEEP_ALIVE=true" ● Build the startup script ○ Install all dependencies (or use your custom image!) ○ Copy your data locally gsutil cp gs://bucket/raw-data/2018-04/* . ○ Copy your code locally git clone ... ○ Execute your training job ○ If needed, take care of ssh tunnelling to access tensorboard / etc. ● Include a self-destroy command in the training job as a last step if ! [ "$KEEP_ALIVE" = "true" ]; then sudo shutdown -h now; fi ● You can provide your hyperparameters as metadata key-value pairs ○ how effective is your gridsearch now!
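The KEEP_ALIVE flag from the slide can be read back on the VM from the metadata server. A small sketch of the two pieces of logic involved; the URL pattern matches the GCE metadata endpoint shown on the slide, while `should_shutdown` is a hypothetical helper mirroring the slide's shell check:

```python
METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/attributes/{key}")

def metadata_url(key):
    """URL of a custom metadata key (only resolvable on the VM itself;
    requests must carry the 'Metadata-Flavor: Google' header)."""
    return METADATA_URL.format(key=key)

def should_shutdown(keep_alive_value):
    """Hypothetical helper mirroring the slide's check: self-destroy
    unless KEEP_ALIVE was set to "true" at instance creation."""
    return (keep_alive_value or "").strip().lower() != "true"

print(metadata_url("KEEP_ALIVE"))
print(should_shutdown("true"))  # False: keep the VM running
print(should_shutdown(""))      # True: trigger the shutdown command
```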
  • 20. Preemptible VMs ● Much lower price than normal instances ● Can be terminated at any time ○ Sends a soft OFF signal; you have 30s to clean up with a shutdown script ○ Will always be terminated after 24h ○ You can simulate a maintenance event gcloud beta compute instances simulate-maintenance-event <YOUR_INSTANCE_NAME> --zone <YOUR_ZONE> ● Might not be available and are not under SLA ● GCE generally preempts the instances launched most recently if needed ○ the average preemption rate varies from 5% to 15%, but there is no SLA here! ● VMs, GPUs and local SSDs can all be preemptible
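One way to use the 30-second grace period is to trap the termination signal inside the trainer itself and flush a checkpoint. On GCE the documented mechanism is a shutdown script, so this in-process handler is only an illustrative sketch of the same idea:

```python
import os
import signal
import time

checkpoint_saved = {"done": False}

def on_sigterm(signum, frame):
    # A real trainer would flush its last checkpoint to blob storage here;
    # after preemption you have roughly 30 seconds before hard power-off.
    checkpoint_saved["done"] = True

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the soft OFF signal by signalling our own process.
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)
print(checkpoint_saved["done"])  # True
```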
  • 21. Open source solution ● Created by Google Brain ● Most popular ML project on GitHub ○ Over 480 contributors ○ 10000 commits in 12 months ● Fast C++ engine and support for distributed training ● Extensive coverage of the latest deep learning algorithms ● Both high-level and low-level APIs ● Multiple deployment options ○ Mobile, Desktop, Server, Cloud ○ CPU, GPU, TPU
  • 22. TF.ESTIMATOR ● tf.estimator - a high-level Tensorflow API that makes your life easier ○ You can always use lower-level APIs if needed ○ it takes care of a lot of things during training ■ error handling ■ building graphs, initializing variables, starting queues, … ■ preparing summaries for Tensorboard ■ creating checkpoints and recovering from failures ● you can run the same model in different environments (local, multi-server, …) ● you provide a model as a set of functions (train_op, eval_op, …) ○ estimators have APIs like fit(), evaluate(), predict(), …
  • 23. Preemptible VMs with tf ● Read processed data / save checkpoints directly to a GCS bucket ● If the preemptible instance is killed, the training will continue from the last checkpoint after you restart it ● To restart automatically, you need to use an instance template and a managed instance group ○ this is the mechanism used for autostart / autohealing / autoscaling gcloud beta compute instance-templates create my-training-job … --preemptible gcloud compute instance-groups managed create my-training-job-managed --base-instance-name my-training-job --size 1 --template my-training-job --zone <YOUR_ZONE> ● You can launch tensorboard locally: ○ tensorboard --logdir=$YOUR_OUTPUT_DIR
  • 25. OOM / too long training time? ● Scale vertically ○ use more powerful VMs ○ parallelize hyperparameter tuning (run multiple training jobs in parallel) ● Scale horizontally ○ rent a cluster and use data parallelization: execute batches in parallel ○ rent a cluster and use model parallelization: execute your model in parallel
  • 26. It’s important to remember that in most cases switching to distributed training requires you to adjust your code.
  • 27. Need for compute power ● The amount of compute required to train state-of-the-art models is increasing exponentially ● A 300,000x increase since 2012 ○ Neural Machine Translation: ~80 petaflop/s-days ■ a Google TPU offers up to 180 teraflops Source: https://blog.openai.com/ai-and-compute/
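A quick sanity check on the slide's 300,000x figure: assuming it covers roughly 2012-2018 (an assumption, since the slide gives no end year), the implied doubling time lands in the same range as the ~3.4 months OpenAI reported:

```python
import math

growth_factor = 300_000  # figure quoted on the slide
years = 6                # assumption: roughly 2012 -> 2018

doublings = math.log2(growth_factor)
months_per_doubling = years * 12 / doublings
print(round(doublings, 1))            # ~18.2 doublings
print(round(months_per_doubling, 1))  # ~4.0 months per doubling
```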
  • 28. There are many more tricks and approaches available (than just training in distributed mode)
  • 29. Parallelization in practice ● Choose your favorite framework and explore the options for distributed training ○ Use your favorite framework ■ Rent a few VMs, take care of the cluster’s setup, incl. the networking ■ Rent a hadoop cluster as a managed service (Dataproc), use SparkML ■ ... ○ Use Tensorflow ■ Rent a few VMs, take care of the cluster’s setup, incl. the networking ■ Use ML Engine (a managed service) ■ Use kubeflow (with Google Kubernetes Engine) ■ ...
  • 30. Data parallelism ● Standard SGD ● Synchronous parallel SGD ○ Shuffle data uniformly across workers ○ For every step: ■ Deliver the current parameters to each worker and run an SGD step ■ [Synchronize] Aggregate the results and update the parameters on the PS (parameter server)
  • 31. [Diagram: workers 1 … j each sample a minibatch from uniformly shuffled data and run an SGD step; the gradients are then aggregated and the parameters updated]
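The synchronous scheme above can be simulated in a few lines of plain Python: each "worker" is just a shard processed in a loop, and averaging the gradients stands in for the parameter-server update. A minimal sketch on a toy one-parameter regression:

```python
# Single-process simulation of synchronous data-parallel SGD.
data = [(x, 2.0 * x) for x in range(1, 9)]   # toy task: learn w in y = w * x
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]  # uniform split

w, lr = 0.0, 0.01
for step in range(200):
    grads = []
    for shard in shards:                     # these run in parallel for real
        # gradient of the mean squared error on this worker's minibatch
        g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        grads.append(g)
    w -= lr * sum(grads) / num_workers       # aggregate & update parameters
print(round(w, 3))  # 2.0
```

The synchronization point is the `sum(grads) / num_workers` line: no worker's update is applied until all gradients for the step have arrived.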
  • 32. Data parallelism ● Parallel SGD should be evaluated in terms of computation time, communication cost and communication time ● You can reduce some syncing bottlenecks with Asynchronous Stochastic Gradient Descent ○ There are various adjustments to compensate for delayed gradients
  • 33. You have 4 workers. What would you expect?
  • 34. Data processing with Tensorflow ● The tf.data.Dataset API provides a high-level API to implement data input pipelines ○ There are ready implementations: FixedLengthRecordDataset, TextLineDataset, TFRecordDataset ● You can read CSV with tf.decode_csv ● You can apply various transformations to your dataset (cache, map, flat_map and decode) ● When you are done, you end up initializing an Iterator with .make_one_shot_iterator() or .make_initializable_iterator()
  • 35. Pipelining ● without pipelining: dataset = dataset.batch(batch_size=FLAGS.batch_size) ● with pipelining: dataset = dataset.batch(batch_size=FLAGS.batch_size) dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  • 36. Parallelize Data Transformation ● sequential: dataset = dataset.map(map_func=parse_fn) ● parallel: dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_calls)
  • 37. Parallelize Data Extraction ● sequential: dataset = files.interleave(tf.data.TFRecordDataset) ● parallel: dataset = files.apply(tf.contrib.data.parallel_interleave( tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_readers, sloppy=True))
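The idea behind dataset.prefetch is simply a bounded producer-consumer buffer. A toy pure-Python analogue (not tf.data itself) that lets input loading overlap the consumer's compute:

```python
import queue
import threading

def prefetch(iterator, buffer_size=2):
    """Toy analogue of dataset.prefetch(): a background thread keeps a
    bounded queue filled so input loading overlaps the consumer's work."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterator:
            q.put(item)  # blocks once the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

batches = [[1, 2], [3, 4], [5, 6]]
print(list(prefetch(iter(batches))))  # [[1, 2], [3, 4], [5, 6]]
```

The bounded `maxsize` is the analogue of `prefetch_buffer_size`: it caps how far the producer can run ahead of the training loop.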
  • 38. Randomizing data ● shard according to the number of workers ● interleave the files on every worker ● shuffle each portion of the data
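The shard-then-shuffle recipe above can be sketched in plain Python (illustrative names, not a TF API): each of N workers takes every N-th file, then shuffles its own portion, so no two workers see the same examples and each worker's order is randomized.

```python
import random

def worker_pipeline(files, num_workers, worker_index, seed):
    shard = files[worker_index::num_workers]  # shard by worker id (slicing copies)
    rng = random.Random(seed + worker_index)  # different shuffle per worker
    rng.shuffle(shard)                        # shuffle this worker's portion
    return shard

files = [f"part-{i:03d}" for i in range(8)]
shards = [worker_pipeline(files, 4, i, seed=42) for i in range(4)]

# every file is seen exactly once across the four workers
seen = sorted(f for s in shards for f in s)
print(seen == sorted(files))  # True
```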
  • 39. Model parallelism ● Separate subgraphs can be placed on different devices ○ Workers need to communicate with each other ○ You can combine CPUs/GPUs/TPUs together ● Device placement has to be done explicitly on your own:
    with tf.device("/gpu:0"): …
    with tf.device("/gpu:1"): …
    with tf.device("/cpu:0"): …
  • 40. Data Parallelism & Model Parallelism ● Data parallelism: process multiple mini-batches on different workers simultaneously, e.g. with asynchronous stochastic gradient descent (SGD) ● Model parallelism: run different model operations on multiple devices simultaneously ○ Good for partially connected models (wide & deep models + embeddings) and recurrent models [Diagram: a chief and a parameter server coordinate worker nodes; each worker runs its own input_fn, model layers, loss and gradients on its own mini-batch, sends the gradients to the parameter server, which takes the mean and updates the shared parameters]
  • 41. Three ways for AI on Google Cloud (from ease of use towards customisation) ● Call our perception APIs (our data + our models): Cloud Vision API, Cloud Translation API, Cloud Natural Language API, Cloud Speech API, Cloud Video Intelligence API, Cloud Speech Synthesis API, Data Loss Prevention API ● Train our state-of-the-art models (your data + our models): AutoML, Dialogflow ● Build your own models (your data + your model): Cloud ML Engine, Cloud Dataproc, Compute Engine, Kubernetes Engine, Cloud TPUs
  • 42. Outcomes ● Training models in the cloud is easy: you can scale very easily and automate a lot of things ○ If your framework supports fault tolerance, don’t forget about preemptible instances ● Distributed training is relatively easy to implement with modern frameworks, but you still need to tune your code ○ You need to start thinking about it as early as the step when you organize your data ○ Data parallelism versus model parallelism ○ Or scale vertically with TPUs ● Try out Tensorflow!

Editor's Notes

  1. I compiled this very informal chart after many conversations with new ML practitioners. Overwhelmingly, folks tend to focus on how they will choose and optimize the core ML algorithm itself. They worry about which papers to read, or how to select hyperparameters, and so on; often to the exclusion of other parts of the system. In reality, though, successfully deployed ML systems have a very different balance.
  2. Storage per se is relatively cheap, but networking costs might arise - so that’s what you need to take care of
  3. We are not going to talk about networking here. Just to mention: ingress traffic is always free, egress is free within the same datacenter (via internal IPs) and costs very little for traffic within the region.
  4. We have general-purpose | high-CPU | high-memory instances, as well as instances for some other specific use cases; you can also configure a custom VM yourself.
  5. https://cloud.google.com/compute/docs/images
  6. https://cloud.google.com/compute/docs/gpus/ https://devblogs.nvidia.com/how-nvlink-will-enable-faster-easier-multi-gpu-computing/
  7. application-specific integrated circuits (ASICs) https://cloud.google.com/tpu/docs/tpus
  8. application-specific integrated circuits (ASICs)
  9. Your functions return the ops required for a given input
  10. You might have too much data, or embeddings too big to fit into a single machine’s memory
  11. OpenAI has published a very nice analysis. They analyzed the number of operations needed for one forward-backward pass for a few recently published models (based on the information provided by the papers’ authors). PFLOPS-day = how many days we would need at a sustained petaflop/s of compute performance. For comparison: one Google TPU gives you 180 teraflops, so you need >400 TPUs to achieve this kind of compute performance. Ops/model are increasing exponentially, with a doubling period of 3.6 months (Moore’s law has a doubling period of 18 months). E.g., AlexNet (2012, a CNN competing in the ImageNet Visual Recognition Challenge) has 62M parameters
  12. E.g., use extremely large minibatch SGD https://arxiv.org/abs/1711.04325
  13. \omega_{j+1}=\omega_j-\alpha\,\nabla_{\omega}L_j,\quad L_j=\frac{1}{M}\sum_{i=1}^{M}L_i
  14. E.g., use extremely large minibatch SGD https://arxiv.org/abs/1711.04325
  15. A very natural example is an LSTM network, where each layer can be placed on a separate device (so after processing the first input and passing the result on, the device can start processing the second input instead of waiting for the backpropagation to come back)
  16. Demo APIs and AutoML