
Accelerate Machine Learning on Google Cloud


One of a Machine Learning engineer's most important tasks is selecting the right hardware to run an ML workflow on: an Intel CPU or an NVIDIA GPU?
Expect the unexpected from the results of the benchmark performed on Google Cloud with the Recommender System Datatonic built for one of its customers, a top-5 UK retailer.

Published in: Technology

  1. Datatonic. Samantha Guerriero, Machine Learning Engineer, Head of R&D on Intel Skylake; Oliver Gindele, Head of Machine Learning; Will Fletcher, Machine Learning Researcher; Jessie Tan, Data Science Intern.
  2. Accelerate Machine Learning with Datatonic & Intel: Innovation, Cloud Readiness, Compute.
  3. Purpose: understand which Machine Learning workflows work best on CPU and GPU, and how best to optimise hardware and software for Intel Skylake. How: optimise the build of TensorFlow (pip, Bazel, Conda & the MKL library); optimise the model (feature engineering, model selection and hyperparameter fine-tuning); optimise the parameters (KMP_BLOCKTIME, intra_op_parallelism_threads, inter_op_parallelism_threads).
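The build-optimisation step above can be sketched as a from-source TensorFlow compilation with Intel MKL support. This is a hedged sketch based on the TensorFlow 1.x build process, not the deck's exact commands; the available `--config` names and the wheel path vary between TensorFlow versions.

```shell
# Build TensorFlow from source with Intel MKL enabled (TF 1.x-era flags).
# --config=mkl links the MKL library; -march=native targets the host CPU,
# e.g. Skylake's AVX-512 instructions.
bazel build --config=mkl --config=opt \
    --copt=-march=native \
    //tensorflow/tools/pip_package:build_pip_package

# Package the resulting wheel and install it.
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tf_pkg
pip install /tmp/tf_pkg/tensorflow-*.whl
```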
  4. INCEPTIONV3: replicating results on tf_cnn_benchmarks.

                 BS=1    BS=32   BS=64   BS=96   BS=128
     Broadwell   8.39    10.69   10.86   11.02   11.09
     Skylake     9.29    14.74   14.97   15.08   15.10
     % gain      10.7%   37.9%   37.8%   36.8%   36.2%

  5. Up to 25-40% cheaper & faster vs. Broadwell*. Replicating results on tf_cnn_benchmarks.
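The "% gain" row follows directly from the two throughput rows; a quick check in Python, with the figures copied from the table:

```python
# Throughput per batch size from the slide's tf_cnn_benchmarks runs.
broadwell = {1: 8.39, 32: 10.69, 64: 10.86, 96: 11.02, 128: 11.09}
skylake = {1: 9.29, 32: 14.74, 64: 14.97, 96: 15.08, 128: 15.10}

# Relative gain of Skylake over Broadwell, in percent.
gain = {bs: round(100 * (skylake[bs] / broadwell[bs] - 1), 1) for bs in broadwell}
print(gain)  # {1: 10.7, 32: 37.9, 64: 37.8, 96: 36.8, 128: 36.2}
```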
  6. Personalisation.
  7. Accelerate Machine Learning with Datatonic & Intel: millions of customers, many different questions; thousands of products, billions of answers.
  8. Optimise Training for Personalisation Models. Large scale: many (tens to hundreds of) features. Two networks trained jointly: wide for memorization, deep for generalization. A CPU-intensive process.
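The wide & deep architecture above can be sketched as a joint forward pass: a linear term over sparse/crossed features plus a small MLP over dense features, sharing one logit. This is a minimal pure-Python illustration with toy weights, not the deck's model; in practice TensorFlow's `DNNLinearCombinedClassifier` (or equivalent) handles this.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, weights, bias):
    # One fully connected layer: `weights` holds one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def wide_deep_logit(wide_x, deep_x, wide_w, deep_layers):
    # Wide part: a single linear term over sparse/crossed features (memorization).
    wide = sum(w * x for w, x in zip(wide_w, wide_x))
    # Deep part: a small MLP over dense/embedded features (generalization).
    h = deep_x
    for weights, bias in deep_layers[:-1]:
        h = relu(dense(h, weights, bias))
    weights, bias = deep_layers[-1]
    deep = dense(h, weights, bias)[0]
    # The two parts share one logit and are trained jointly.
    return wide + deep

# Toy example: 3 wide features, 2 deep features, one hidden layer of 2 units.
logit = wide_deep_logit(
    wide_x=[1.0, 0.0, 1.0],
    deep_x=[0.5, -0.2],
    wide_w=[0.3, -0.1, 0.2],
    deep_layers=[
        ([[0.4, 0.1], [-0.3, 0.2]], [0.0, 0.1]),  # hidden layer
        ([[0.5, -0.5]], [0.0]),                   # output layer
    ],
)
# logit ≈ 0.59
```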
  9. Optimise Training for Personalisation Models: model optimisation and CPU optimisation.

     inter_op_parallelism_threads: how many parallel threads to run for operations that can be run independently.
     intra_op_parallelism_threads: how many parallel threads to run for operations that can be parallelised internally.
     KMP_BLOCKTIME: how much time each thread should wait after completing the execution of a parallel region, in milliseconds.
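These knobs are set partly through environment variables (which must be in place before TensorFlow loads) and partly through the session configuration. A sketch using the benchmark settings quoted on the following slides (inter = 4, intra = 8, blocktime = 0):

```python
import os

# KMP/OpenMP variables must be exported before TensorFlow is imported.
os.environ["KMP_BLOCKTIME"] = "0"  # threads sleep immediately after a parallel region

# The thread-pool sizes go into the session configuration, e.g. in TF 1.x:
#   config = tf.ConfigProto(intra_op_parallelism_threads=8,
#                           inter_op_parallelism_threads=4)
#   sess = tf.Session(config=config)
# or in TF 2.x:
#   tf.config.threading.set_intra_op_parallelism_threads(8)
#   tf.config.threading.set_inter_op_parallelism_threads(4)
```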
  10. Optimise Training for Personalisation Models. *Calculations based on 100 runs. Benchmark details: running time averaged over 3 runs; batch size: 1024; hidden units: [256, 128, 64, 32]; KMP parameters: inter operations = 4, intra operations = 8, blocktime = 0. More details available upon request.
  11. Optimise Training for Personalisation Models (same benchmark setup as slide 10). Skylake CPU vs Nvidia K80 GPU: 77.66% cheaper & 51.5% faster. Skylake CPU vs Nvidia V100 GPU: 90% cheaper & 31.6% faster.
  12. Thank You. Datatonic, Level 39, One Canada Square, Canary Wharf, London E14 5AB. uk@datatonic.com | +44 (0)20 3856 3287 | www.datatonic.com
