As a Machine Learning engineer, one of your most important tasks is selecting the right hardware to run your ML workflow on. Should it be an Intel CPU or an NVIDIA GPU?
Expect the unexpected from the results of the benchmark performed on Google Cloud on the Recommender System created for one of Datatonic's customers, a top-5 UK retailer.
1. Datatonic
Samantha Guerriero
Machine Learning Engineer
Head of R&D on Intel Skylake
Oliver Gindele
Head of Machine Learning
Will Fletcher
Machine Learning Researcher
Jessie Tan
Data Science Intern
3. Understand which Machine Learning workflows work best on CPU and GPU, and how best to optimise hardware and software for Intel Skylake®
Purpose
Optimise the TensorFlow build: Pip, Bazel, Conda & the MKL library
Optimise the model
Optimise the parameters
Feature Engineering, Model selection and Hyperparameter Fine-tuning
KMP_BLOCKTIME, intra_op_parallelism_threads, inter_op_parallelism_threads
How
7. Accelerate Machine Learning
with Datatonic & Intel
Millions of Customers
Many Different Questions
Thousands of Products
Billions of Answers
8. Optimise Training for
Personalisation Models
Large Scale
Many (tens to hundreds of) features
Two networks trained jointly:
wide for memorization, deep for
generalization.
CPU Intensive Process
9. intra_op_parallelism_threads
How many parallel threads to run for operations that can be parallelised internally.
inter_op_parallelism_threads
How many parallel threads to run for operations that can be run independently.
KMP_BLOCKTIME
How much time each thread should wait after completing the execution of a parallel region, in milliseconds.
CPU optimisation
Model optimisation
Optimise Training for
Personalisation Models
10. *Calculations based on 100 runs. Benchmark details: running time averaged over 3 runs; batch size: 1024; hidden units: [256, 128, 64, 32]; KMP parameters: inter_op = 4, intra_op = 8, KMP_BLOCKTIME = 0. More details available upon request.
Optimise Training for
Personalisation Models
11. *Calculations based on 100 runs. Benchmark details: running time averaged over 3 runs; batch size: 1024; hidden units: [256, 128, 64, 32]; KMP parameters: inter_op = 4, intra_op = 8, KMP_BLOCKTIME = 0. More details available upon request.
Optimise Training for
Personalisation Models
Skylake CPU vs NVIDIA K80 GPU: 77.66% cheaper & 51.5% faster.
Skylake CPU vs NVIDIA V100 GPU: 90% cheaper & 31.6% faster.
12. Level 39
One Canada Square
Canary Wharf
London E14 5AB
uk@datatonic.com
+44 (0)20 3856 3287
www.datatonic.com
Thank You
Editor's Notes
This presentation showcases the results of a collaboration we have been running with Intel to understand how, together, we can accelerate Machine Learning on Google Cloud Platform. Google Cloud Platform provides different hardware architectures: GPUs, TPUs and Intel CPUs, amongst which the latest-generation CPU architecture is Intel Skylake. Intel believes that Skylake holds the power to run ML workflows fast (and cheaply, considering that the least expensive GPU is three times as expensive as a CPU). And we have come to believe the same, at least for the workflows we have tested so far.
The first thing was to replicate the results of the tf_cnn_benchmarks, so as to make sure that the way we build TensorFlow for the CPU is optimal.
There are three different ways of doing so:
Pip installation
Conda installation
These two are more out of the box.
Bazel installation: for those who have never used it before, Bazel is a tool to automatically build and test software. You'd use it to run compilers and linkers to produce executable programs and libraries, and to assemble packages. It is quite cool because it is multi-language, high-level, reproducible and scalable. It comes from the tool Google uses to build its server software internally.
The difference between these builds: Bazel is much slower (about an hour), but it gives the best result, because TensorFlow is compiled directly against the target hardware.
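To sanity-check that MKL-DNN actually made it into the binary you built, something along these lines works on TF 1.x; the module path of IsMklEnabled() varies between versions, so treat this as a sketch rather than a guaranteed API:

    # Check whether this TensorFlow binary was compiled with MKL-DNN support.
    # The location of IsMklEnabled() differs across TF 1.x versions.
    from tensorflow.python import pywrap_tensorflow

    print("MKL-DNN enabled:", pywrap_tensorflow.IsMklEnabled())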
So, overall, the results from the standard benchmark show that Intel Skylake is 25-40% faster than Broadwell, which makes sense given that Skylake is the newer architecture :)
Moving on, we have decided to test the Intel architectures on a different workflow, one closer to what we typically do at Datatonic, with two aims:
See if Skylake is again better than Broadwell
Check what the performance of the CPUs vs GPUs is
Our typical ML workflow revolves around one word: Personalisation. In Retail, Finance, Media, ... we create Recommender Systems, Propensity Models, Bundle Recommendation Engines, ... typically for big companies with millions of customers and thousands of products. The amount of data such organisations can gather is impressive; it is big data, which requires smart engineering to handle and smart ML model design, with the aim, of course, of uncovering a tailored journey for every customer behind this data.
Going quickly through the main characteristics of this model:
It is composed of two networks, trained jointly: a wide network, a single layer with the capability of memorisation, and a deep network, with as many layers as required, which has the capability of generalisation. The deep network is a standard MLP, but the addition of the wide part tempers the deep network's tendency to predict too many unconventional outliers.
This model is a standard TF model that can be implemented with the Estimator API, and it can be quite complex to run given the huge number of feature columns and, of course, the big data.
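For illustration, here is a minimal sketch of such a wide & deep model with the TF 1.x Estimator API. The feature columns, keys and bucket sizes are hypothetical stand-ins; only the hidden units come from our benchmark configuration:

    import tensorflow as tf

    # Hypothetical feature columns; the real model uses tens to hundreds
    # of customer and product features.
    customer_id = tf.feature_column.categorical_column_with_hash_bucket(
        "customer_id", hash_bucket_size=1000000)
    product_id = tf.feature_column.categorical_column_with_hash_bucket(
        "product_id", hash_bucket_size=100000)

    # Wide part: a sparse cross of the raw string keys, for memorisation.
    wide_columns = [
        tf.feature_column.crossed_column(
            ["customer_id", "product_id"], hash_bucket_size=10000000),
    ]

    # Deep part: dense embeddings feeding the MLP, for generalisation.
    deep_columns = [
        tf.feature_column.embedding_column(customer_id, dimension=32),
        tf.feature_column.embedding_column(product_id, dimension=32),
    ]

    estimator = tf.estimator.DNNLinearCombinedClassifier(
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[256, 128, 64, 32],  # architecture used in the benchmark
    )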
If you notice, the titles of the previous, current and next slides all contain the word optimised. To obtain the best performance for a model, a little work is needed not only to fine-tune the model at the software level, but also to fine-tune the CPU parameters at the hardware level. The CPU parameters to fine-tune are KMP_BLOCKTIME, intra_op_parallelism_threads and inter_op_parallelism_threads.
KMP_BLOCKTIME: how much time each thread should wait after completing the execution of a parallel region, in milliseconds. For some reason, its default value is 200, which is quite high: most models will require a small value, in our case 0.
Inter-op and intra-op: parallelism within one layer as well as across layers.
If you have many operations that are independent in your TensorFlow graph, because there is no directed path between them in the dataflow graph, TensorFlow will attempt to run them concurrently (this is inter-op parallelism).
If you have an operation that can be parallelized internally, such as matrix multiplication (tf.matmul()) or a reduction (e.g. tf.reduce_sum()), TensorFlow will execute it by scheduling tasks in a thread pool with intra_op_parallelism_threads threads. You’d typically set this to the number of physical cores.
Typically, you'd set inter-op to 2 and intra-op to the number of logical cores. In our case they were 4 and 8 respectively, as we were running on an 8-core architecture.
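Putting the three parameters together, a minimal sketch for TF 1.x; the thread counts are the values from our 8-core runs, everything else is illustrative:

    import os
    import tensorflow as tf

    # KMP_BLOCKTIME is read by the OpenMP runtime inside the MKL build, so set
    # it before TensorFlow spins up its thread pools.
    os.environ["KMP_BLOCKTIME"] = "0"

    # Values from our 8-core Skylake runs: intra-op = 8, inter-op = 4.
    session_config = tf.ConfigProto(
        intra_op_parallelism_threads=8,
        inter_op_parallelism_threads=4,
    )

    # Hand the session config to an Estimator via RunConfig.
    run_config = tf.estimator.RunConfig(session_config=session_config)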
Is this optimisation step that important?
Our findings show that the optimised setup is on average 27% faster and cheaper than the unoptimised one on Skylake, which is our main focus, and 33% on Broadwell.
So: optimisation is fundamental.
Just a note: these results, as well as all the results in our benchmark, are based on an average over 3 runs, with the model architecture that gives the best model performance (by AUC).
Next, as anticipated earlier, our goal was not only to compare the Intel architectures with each other, but also to compare them with GPUs, which are the go-to hardware for ML workloads.
While the previous results can somehow be expected, this result can come as a surprise to many: Intel Skylake is much faster, for our model, than GPUs.
The first thing to notice is that GPUs are I/O bound here, and increasing the batch size drastically changes these percentages. Why do we report these, then? Because we fine-tuned our model and found that the best batch size is 1024. You could increase the batch size to 4000-8000 and run faster (though not cheaper) on GPU, but you would be degrading your model's performance.
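For reference, a minimal sketch of how that 1024 batch size plugs into an Estimator input_fn with tf.data; the file pattern and feature schema here are hypothetical:

    import tensorflow as tf

    # Hypothetical schema; the real model has far more features.
    FEATURES = {
        "customer_id": tf.FixedLenFeature([], tf.string),
        "product_id": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    }

    def input_fn(file_pattern="gs://my-bucket/train-*.tfrecord"):  # hypothetical path
        files = tf.data.Dataset.list_files(file_pattern)
        dataset = tf.data.TFRecordDataset(files).shuffle(10000).batch(1024)

        def parse(batch):
            parsed = tf.parse_example(batch, FEATURES)
            label = parsed.pop("label")
            return parsed, label

        return dataset.map(parse)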
Another thing to note is that this percentage compares Skylake against the NVIDIA K80: with a more powerful GPU, the percentage would shrink noticeably. As with everything, there is a compromise: is it worth a cost 4-6x higher than Skylake? Depending on the company and/or the task, it may be. It's a matter of choices. It is worth noting that, for our model, even the NVIDIA V100 (an extremely powerful GPU) was only as fast as (not faster than) Skylake, which makes the decision on which architecture to pick much easier :)
Now, of course, there has been some R&D involved in getting to these results, as we have seen: from fine-tuning the parameters of the model and the architecture to several other difficulties:
Providing consistent and reproducible results, over multiple runs and over the different architectures (CPUs and GPUs with the same packages, versions, builds, ...)
Replicating the results on new models
More technically, building TensorFlow for the CPU with the best possible optimisation (which means the Bazel build, as it always uses the most recent versions of TF and MKL-DNN and is compiled directly against the target hardware).
To tackle these difficulties, we created a Benchmark Suite (in Python, and for TensorFlow only at the moment), which automatically performs all the analysis we have shared with you once you point it at your code. It takes four steps (a minimal sketch of the results-logging step follows the list):
It spins up the VMs with the specific architectures: CPU, GPU,...
It builds the architecture with the desired installation type (Bazel, Pip, Conda) and, of course, for the GPU installs CUDA and everything else needed.
It runs your model.
It saves the results of the runs on the different hardware to BigQuery, reporting running time, cost, model performance, ...
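As mentioned above, here is a minimal sketch of that results-logging step using the google-cloud-bigquery client; the project, table and field names are hypothetical, and the row values are placeholders rather than real benchmark numbers:

    from google.cloud import bigquery

    # Hypothetical table and schema for illustration.
    client = bigquery.Client()
    table_id = "my-project.benchmarks.training_runs"

    row = {
        "hardware": "skylake-8vcpu",  # placeholder values, not real results
        "build": "bazel",
        "batch_size": 1024,
        "run_seconds": 0.0,
        "cost_usd": 0.0,
        "auc": 0.0,
    }
    errors = client.insert_rows_json(table_id, [row])  # newer client versions
    if errors:
        raise RuntimeError("BigQuery insert failed: %s" % errors)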
Personally, I think this is great! If you think the same, don’t hesitate to contact us and learn how we can accelerate your Machine Learning models together on GCP!