As a Machine Learning engineer, one of your most important tasks is selecting the right hardware to run your ML workflow on. Should it be an Intel CPU or an NVIDIA GPU?
Expect the unexpected from the results of the benchmark performed on Google Cloud on the Recommender System created for one of Datatonic's customers, a top-5 UK retailer.
1. Datatonic
Samantha Guerriero
Machine Learning Engineer
Head of R&D on Intel Skylake
Oliver Gindele
Head of Machine Learning
Will Fletcher
Machine Learning Researcher
Jessie Tan
Data Science Intern
3. Understand which Machine Learning workflows work best on CPU and GPU, and how best to optimise hardware and software for Intel Skylake®
Purpose
Optimise the TensorFlow build: Pip, Bazel, Conda & the MKL library
Optimise the model
Optimise the parameters
Feature Engineering, Model selection and Hyperparameter Fine-tuning
KMP_BLOCKTIME, intra_op_parallelism_threads, inter_op_parallelism_threads
How
7. Accelerate Machine Learning
with Datatonic & Intel
Millions of Customers
Many Different Questions
Thousands of Products
Billions of Answers
8. Optimise Training for
Personalisation Models
Large Scale
Many (tens to hundreds of) features
Two networks trained jointly:
wide for memorization, deep for
generalization.
CPU Intensive Process
9. intra_op_parallelism_threads
How many parallel threads to run for operations that can be parallelised internally.
inter_op_parallelism_threads
How many parallel threads to run for operations that can be run independently.
KMP_BLOCKTIME
How much time each thread should wait after completing the execution of a parallel region, in milliseconds.
CPU optimisation
Model optimisation
Optimise Training for
Personalisation Models
10. *Calculations based on 100 runs. Benchmark details: running time averaged over 3 runs; batch size: 1024; hidden units: [256, 128, 64, 32]; KMP parameters: inter_op = 4, intra_op = 8, KMP_BLOCKTIME = 0. More details available upon request.
Optimise Training for
Personalisation Models
11. *Calculations based on 100 runs. Benchmark details: running time averaged over 3 runs; batch size: 1024; hidden units: [256, 128, 64, 32]; KMP parameters: inter_op = 4, intra_op = 8, KMP_BLOCKTIME = 0. More details available upon request.
Optimise Training for
Personalisation Models
Skylake CPU vs NVIDIA K80 GPU: 77.66% cheaper & 51.5% faster.
Skylake CPU vs NVIDIA V100 GPU: 90% cheaper & 31.6% faster.
12. Level 39
One Canada Square
Canary Wharf
London E14 5AB
uk@datatonic.com
+44 (0)20 3856 3287
www.datatonic.com
Thank You
Editor's Notes
This presentation showcases the results of a collaboration we have been running with Intel to understand how, together, we can accelerate Machine Learning on Google Cloud Platform. Google Cloud Platform provides different hardware architectures: GPUs, TPUs and Intel CPUs, amongst which the latest-generation CPU architecture is Intel Skylake. Intel believes that Skylake holds the power to run ML workflows fast (and cheaply, considering that the least expensive GPU is three times as expensive as a CPU). And we have come to believe the same, at least for the workflows we have tested so far.
The first thing was to replicate the results of the tf_cnn_benchmarks, so as to make sure that the way we build TensorFlow for the CPU is optimal.
There are three different ways of doing so:
Pip installation
Conda installation
These two are more out of the box.
Bazel installation: for those who have never used it before, Bazel is a tool to automatically build and test software. You'd use it to run compilers and linkers to produce executable programs and libraries, and to assemble packages. It is quite cool because it is multi-language, high-level, reproducible and scalable. It comes from the tool Google uses to build its server software internally.
The difference between these builds: Bazel is much slower (about an hour), but it gives the best result, because TensorFlow is compiled directly against the target hardware.
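To sanity-check that MKL-DNN actually made it into the binary you built, something along these lines works on TF 1.x; the module path of IsMklEnabled() varies between versions, so treat this as a sketch rather than a guaranteed API:

    # Check whether this TensorFlow binary was compiled with MKL-DNN support.
    # The location of IsMklEnabled() differs across TF 1.x versions.
    from tensorflow.python import pywrap_tensorflow

    print("MKL-DNN enabled:", pywrap_tensorflow.IsMklEnabled())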
So, overall, the results from the standard benchmark show that Intel Skylake is 25-40% faster than Broadwell, which makes sense given that Skylake is the newer architecture :)
Moving on, we have decided to test the Intel architectures on a different workflow, one closer to what we typically do at Datatonic, with two aims:
See if Skylake is again better than Broadwell
Check what the performance of the CPUs vs GPUs is
Our typical ML workflow revolves around one word: Personalisation. In Retail, Finance, Media, ... we create Recommender Systems, Propensity Models, Bundle Recommendation Engines, ... typically for big companies with millions of customers and thousands of products. The amount of data such organisations can gather is impressive; it is big data, which requires smart engineering to handle and smart ML model design, with the aim, of course, of uncovering a tailored journey for every customer behind this data.
Going quickly through the main characteristics of this model:
It is composed of two networks, trained jointly: a wide network, a single layer with the capability of memorisation, and a deep network, with as many layers as required, which has the capability of generalisation. The deep network is a standard MLP, but the addition of the wide part tempers the deep network's tendency to predict too many unconventional outliers.
This model is a standard TF model that can be implemented with the Estimator API, and it can be quite complex to run given the huge number of feature columns and, of course, the big data.
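For illustration, here is a minimal sketch of such a wide & deep model with the TF 1.x Estimator API. The feature columns, keys and bucket sizes are hypothetical stand-ins; only the hidden units come from our benchmark configuration:

    import tensorflow as tf

    # Hypothetical feature columns; the real model uses tens to hundreds
    # of customer and product features.
    customer_id = tf.feature_column.categorical_column_with_hash_bucket(
        "customer_id", hash_bucket_size=1000000)
    product_id = tf.feature_column.categorical_column_with_hash_bucket(
        "product_id", hash_bucket_size=100000)

    # Wide part: a sparse cross of the raw string keys, for memorisation.
    wide_columns = [
        tf.feature_column.crossed_column(
            ["customer_id", "product_id"], hash_bucket_size=10000000),
    ]

    # Deep part: dense embeddings feeding the MLP, for generalisation.
    deep_columns = [
        tf.feature_column.embedding_column(customer_id, dimension=32),
        tf.feature_column.embedding_column(product_id, dimension=32),
    ]

    estimator = tf.estimator.DNNLinearCombinedClassifier(
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[256, 128, 64, 32],  # architecture used in the benchmark
    )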
If you notice, the titles of the previous, current and next slides all contain the word optimised. To obtain the best performance for a model, a little work is needed not only to fine-tune the model at the software level, but also to fine-tune the CPU parameters at the hardware level. The CPU parameters to fine-tune are KMP_BLOCKTIME, intra_op_parallelism_threads and inter_op_parallelism_threads.
KMP_BLOCKTIME: how much time each thread should wait after completing the execution of a parallel region, in milliseconds. For some reason, its default value is 200, which is quite high: most models will require a small value, in our case 0.
Inter-op and intra-op: parallelism within one layer as well as across layers.
If you have many operations that are independent in your TensorFlow graph, because there is no directed path between them in the dataflow graph, TensorFlow will attempt to run them concurrently (this is inter-op parallelism).
If you have an operation that can be parallelized internally, such as matrix multiplication (tf.matmul()) or a reduction (e.g. tf.reduce_sum()), TensorFlow will execute it by scheduling tasks in a thread pool with intra_op_parallelism_threads threads. You’d typically set this to the number of physical cores.
Typically, you'd set inter-op to 2 and intra-op to the number of logical cores. In our case they were 4 and 8 respectively, as we were running on an 8-core architecture.
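Putting the three parameters together, a minimal sketch for TF 1.x; the thread counts are the values from our 8-core runs, everything else is illustrative:

    import os
    import tensorflow as tf

    # KMP_BLOCKTIME is read by the OpenMP runtime inside the MKL build, so set
    # it before TensorFlow spins up its thread pools.
    os.environ["KMP_BLOCKTIME"] = "0"

    # Values from our 8-core Skylake runs: intra-op = 8, inter-op = 4.
    session_config = tf.ConfigProto(
        intra_op_parallelism_threads=8,
        inter_op_parallelism_threads=4,
    )

    # Hand the session config to an Estimator via RunConfig.
    run_config = tf.estimator.RunConfig(session_config=session_config)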
Is this optimisation step that important?
Our findings show that the optimised setup is on average 27% faster and cheaper than the unoptimised one on Skylake, which is our main focus, and 33% on Broadwell.
So: optimisation is fundamental.
Just a note: these results, as well as all the results in our benchmark, are based on an average over 3 runs, with the model architecture that gives the best model performance (by AUC).
Next, as anticipated earlier, our goal was not only to compare the Intel architectures with each other, but also to compare them with GPUs, which are the go-to hardware for ML workloads.
While the previous results can somehow be expected, this result can come as a surprise to many: Intel Skylake is much faster, for our model, than GPUs.
The first thing to notice is that GPUs are I/O bound here, and increasing the batch size drastically changes these percentages. Why do we report these, then? Because we fine-tuned our model and found that the best batch size is 1024. You could increase the batch size to 4000-8000 and run faster (though not cheaper) on GPU, but you would be degrading your model's performance.
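For reference, a minimal sketch of how that 1024 batch size plugs into an Estimator input_fn with tf.data; the file pattern and feature schema here are hypothetical:

    import tensorflow as tf

    # Hypothetical schema; the real model has far more features.
    FEATURES = {
        "customer_id": tf.FixedLenFeature([], tf.string),
        "product_id": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    }

    def input_fn(file_pattern="gs://my-bucket/train-*.tfrecord"):  # hypothetical path
        files = tf.data.Dataset.list_files(file_pattern)
        dataset = tf.data.TFRecordDataset(files).shuffle(10000).batch(1024)

        def parse(batch):
            parsed = tf.parse_example(batch, FEATURES)
            label = parsed.pop("label")
            return parsed, label

        return dataset.map(parse)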
Another thing to note is that this percentage compares Skylake against the NVIDIA K80: with a more powerful GPU, the percentage would shrink noticeably. As with everything, there is a compromise: is it worth a cost 4-6x higher than Skylake? Depending on the company and/or the task, it may be. It's a matter of choices. It is worth noting that, for our model, even the NVIDIA V100 (an extremely powerful GPU) was only as fast as (not faster than) Skylake, which makes the decision on which architecture to pick much easier :)
Now, of course, there has been some R&D involved in getting to these results, as we have seen: from fine-tuning the parameters of the model and the architecture to several other difficulties:
Providing consistent and reproducible results, over multiple runs and over the different architectures (CPUs and GPUs with the same packages, versions, builds, ...)
Replicating the results on new models
More technically, building TensorFlow for the CPU with the best possible optimisation (which means the Bazel build, as it always uses the most recent versions of TF and MKL-DNN and is compiled directly against the target hardware).
To tackle these difficulties, we created a Benchmark Suite (in Python, and for TensorFlow only at the moment), which automatically performs all the analysis we have shared with you once you point it at your code. It takes four steps (a minimal sketch of the results-logging step follows the list):
It spins up the VMs with the specific architectures: CPU, GPU,...
It builds the architecture with the desired installation type (Bazel, Pip, Conda) and, of course, for the GPU installs CUDA and everything else needed.
It runs your model.
It saves the results of the runs on the different hardware to BigQuery, reporting running time, cost, model performance, ...
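As mentioned above, here is a minimal sketch of that results-logging step using the google-cloud-bigquery client; the project, table and field names are hypothetical, and the row values are placeholders rather than real benchmark numbers:

    from google.cloud import bigquery

    # Hypothetical table and schema for illustration.
    client = bigquery.Client()
    table_id = "my-project.benchmarks.training_runs"

    row = {
        "hardware": "skylake-8vcpu",  # placeholder values, not real results
        "build": "bazel",
        "batch_size": 1024,
        "run_seconds": 0.0,
        "cost_usd": 0.0,
        "auc": 0.0,
    }
    errors = client.insert_rows_json(table_id, [row])  # newer client versions
    if errors:
        raise RuntimeError("BigQuery insert failed: %s" % errors)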
Personally, I think this is great! If you think the same, don’t hesitate to contact us and learn how we can accelerate your Machine Learning models together on GCP!