Datatonic
Samantha Guerriero
Machine Learning Engineer
Head of R&D on Intel Skylake
Oliver Gindele
Head of Machine Learning
Will Fletcher
Machine Learning Researcher
Jessie Tan
Data Science Intern
Accelerate Machine Learning
with Datatonic & Intel
Innovation · Cloud Readiness · Compute
Understand which Machine Learning workflows work best on CPU and GPU and how to best optimise hardware and
software for Intel Skylake®
Purpose
Optimise the build of TensorFlow: pip, Bazel, Conda & the MKL library
Optimise the model
Optimise the parameters
Feature Engineering, Model selection and Hyperparameter Fine-tuning
KMP_BLOCKTIME, intra_op_parallelism_threads, inter_op_parallelism_threads
How
INCEPTIONV3 (tf_cnn_benchmarks throughput; BS = batch size; higher is better)

            BS=1    BS=32   BS=64   BS=96   BS=128
Broadwell   8.39    10.69   10.86   11.02   11.09
Skylake     9.29    14.74   14.97   15.08   15.10
% gain      10.7%   37.9%   37.8%   36.8%   36.2%
Replicate results on
tf_cnn_benchmarks
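These throughput numbers come from the public tf_cnn_benchmarks suite, so they can be reproduced directly. A minimal sketch, assuming a local checkout of tf_cnn_benchmarks (github.com/tensorflow/benchmarks); the exact flag values behind the table above are an assumption:

```python
# Hypothetical reproduction sketch for the table above. Assumes a local
# checkout of tf_cnn_benchmarks; the flags used for the published numbers
# are an assumption.
import subprocess

for batch_size in (1, 32, 64, 96, 128):
    subprocess.run(
        [
            "python", "tf_cnn_benchmarks.py",
            "--device=cpu",
            "--model=inception3",
            "--data_format=NHWC",  # channels-last is the CPU-friendly layout
            f"--batch_size={batch_size}",
        ],
        check=True,
    )
```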
Up to 25-40% cheaper & faster vs. Broadwell*
Replicate results on
tf_cnn_benchmarks
Personalisation
Accelerate Machine Learning
with Datatonic & Intel
Millions of Customers
Many Different Questions
Thousands of Products
Billions of Answers
Optimise Training for
Personalisation Models
Large Scale
Many (tens to hundreds of) features
Two networks trained jointly:
wide for memorization, deep for
generalization.
CPU Intensive Process
intra_op_parallelism_threads: how many parallel threads to run for operations that can be parallelised internally.
inter_op_parallelism_threads: how many parallel threads to run for operations that can be run independently.
KMP_BLOCKTIME: how much time each thread should wait after completing the execution of a parallel region, in milliseconds.
Model optimisation · CPU optimisation
Optimise Training for
Personalisation Models
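A minimal sketch of wiring these three knobs into TensorFlow 1.x (the API current when this deck was produced), using the values quoted in the benchmark footnote below:

```python
# Sketch (TensorFlow 1.x API): the three CPU knobs from this slide,
# set to the values from the benchmark footnote.
import os
import tensorflow as tf

# OpenMP/MKL: how long a thread waits after finishing a parallel region (ms).
os.environ["KMP_BLOCKTIME"] = "0"

config = tf.ConfigProto(
    intra_op_parallelism_threads=8,  # threads inside one internally parallel op
    inter_op_parallelism_threads=4,  # independent ops scheduled concurrently
)

with tf.Session(config=config) as sess:
    pass  # build and run the training graph under this configuration
```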
*Calculations based on 100 runs. Benchmark details: running time averaged over 3 runs; batch size: 1024; hidden units: [256, 128, 64, 32];
KMP parameters set: inter operations = 4, intra operations = 8, blocktime = 0. More details available upon request.
Optimise Training for
Personalisation Models
*Calculations based on 100 runs. Benchmark details: running time averaged over 3 runs; batch size: 1024; hidden units: [256, 128, 64, 32];
KMP parameters set: inter operations = 4, intra operations = 8, blocktime = 0. More details available upon request.
Optimise Training for
Personalisation Models
Skylake CPU vs. Nvidia K80 GPU: 77.66% cheaper & 51.5% faster.
Skylake CPU vs. Nvidia V100 GPU: 90% cheaper & 31.6% faster.
Level 39
One Canada Square
Canary Wharf
E14 5AB London
uk@datatonic.com
+44 (0)20 3856 3287
www.datatonic.com
Thank You


Editor's Notes

1. This presentation showcases the results of a collaboration we have been running with Intel to understand how, together, we can accelerate Machine Learning on Google Cloud Platform. Google Cloud Platform provides several hardware architectures: GPUs, TPUs and Intel CPUs, of which the latest-generation CPU architecture is Intel Skylake. Intel believes that Skylake can run ML workflows fast and cheaply (the least expensive GPU is roughly three times as expensive as a CPU), and we have come to believe the same, at least for the workflows we have tested so far.
2. The first step was to replicate the results on tf_cnn_benchmarks, to make sure that the way we build TensorFlow for the CPU is optimal. There are three ways of doing so:
- pip installation
- Conda installation (these two are more out of the box)
- Bazel installation: for those who have never used it, Bazel is a tool to automatically build and test software. You use it to run compilers and linkers to produce executable programs and libraries, and to assemble packages. It is quite cool because it is multi-language, high-level, reproducible and scalable; it comes from the tool Google uses to build its server software internally.
The difference between these builds is that the Bazel build is much slower (about an hour) but gives the best result, because TensorFlow is compiled directly against the target hardware. A rough build sanity check follows this note.
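One way to sanity-check which build you ended up with, as a rough sketch; note that IsMklEnabled() was an internal helper in TF 1.x builds and is not guaranteed to exist in every version, hence the guard:

```python
# Rough build sanity check. IsMklEnabled() was an internal TF 1.x helper
# and may not exist in every build, hence the try/except.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
try:
    from tensorflow.python import pywrap_tensorflow
    print("MKL enabled:", pywrap_tensorflow.IsMklEnabled())
except (ImportError, AttributeError):
    print("Could not query MKL status on this build.")
```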
3. So, overall, the results from the standard benchmark show that Intel Skylake is between 25-40% faster than Broadwell, which makes sense given that Skylake is the newer architecture :) Moving on, we decided to test the Intel architectures on a different workflow, one closer to what we typically do at Datatonic, with two aims: to see whether Skylake again beats Broadwell, and to check how the CPUs perform against GPUs.
4. Our typical ML workflow revolves around one word: personalisation. In retail, finance, media and beyond, we create recommender systems, propensity models, bundle recommendation engines and more, typically for big companies with millions of customers and thousands of products. The amount of data such organisations can gather is impressive; it is big data, which requires smart engineering to handle and smart ML model design, with the aim, of course, of uncovering a tailored customer journey for every customer of the company behind this data.
5. Going quickly through the main characteristics of this model: it is composed of two networks, trained jointly. The wide network is a single layer with the capability of memorization; the deep network has as many layers as required and provides generalization. The deep network is a standard MLP, and adding the wide part tempers the deep network's tendency to predict too many unconventional outliers. This is a standard TensorFlow model that can be implemented with the Estimator API (a minimal sketch follows this note), and it can be quite a complex model to run given the huge number of feature columns and, of course, the big data.
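A minimal sketch of such a model with the TF 1.x Estimator API; the feature columns are invented placeholders, and only the hidden unit sizes come from the benchmark footnote:

```python
# Minimal wide & deep sketch (TF 1.x Estimator API). Feature columns are
# invented placeholders; hidden units follow the benchmark footnote.
import tensorflow as tf

# Hypothetical features: a high-cardinality ID for the wide part,
# a dense embedding of it plus a numeric feature for the deep part.
product_id = tf.feature_column.categorical_column_with_hash_bucket(
    "product_id", hash_bucket_size=10000)
wide_columns = [product_id]
deep_columns = [
    tf.feature_column.embedding_column(product_id, dimension=32),
    tf.feature_column.numeric_column("customer_age"),
]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,  # wide part: memorization
    dnn_feature_columns=deep_columns,     # deep part: generalization
    dnn_hidden_units=[256, 128, 64, 32],  # as in the benchmark footnote
)
# model.train(input_fn=...)  # the input_fn streams the (big) training data
```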
6. If you notice, the titles of the previous, current and next slides all contain the word optimised. To obtain the best performance from a model, a little work is needed not only to fine-tune the model at the software level, but also to fine-tune the CPU parameters at the hardware level. The CPU parameters to tune are KMP_BLOCKTIME, intra_op_parallelism_threads and inter_op_parallelism_threads. KMP_BLOCKTIME is how long each thread should wait after completing the execution of a parallel region, in milliseconds; its default value is 200, which is quite high, and most models will want a small value, in our case 0. Inter-op and intra-op control parallelism across layers and within one layer respectively. If your TensorFlow graph contains many independent operations (no directed path between them in the dataflow graph), TensorFlow will attempt to run them concurrently (inter). If an operation can be parallelised internally, such as a matrix multiplication (tf.matmul()) or a reduction (e.g. tf.reduce_sum()), TensorFlow will execute it by scheduling tasks in a thread pool with intra_op_parallelism_threads threads; you would typically set this to the number of physical cores. Typically, you would set inter-op to 2 and intra-op to the number of logical cores. In our case they were 4 and 8 respectively, as we were running on an 8-core architecture.
7. Is this optimisation step that important? Our findings show that optimised is on average 27% faster and cheaper than unoptimised on Skylake, our main focus, and 33% on Broadwell. So: optimisation is fundamental. Just a note: these results, like all the results in our benchmark, are averaged over 3 runs with the model architecture that gives the best model performance (by AUC).
8. Next, as anticipated, our goal was not only to compare the Intel architectures but also to compare them with GPUs, which are the go-to hardware for ML workloads. While the previous results could be expected, this one may come as a surprise to many: for our model, Intel Skylake is much faster than GPUs. The first thing to notice is that GPUs are I/O bound, and increasing the batch size drastically changes these percentages. Why report these numbers, then? Because we fine-tuned our model and found that the best batch size is 1024. You could increase the batch size to 4000-8000 and run faster (though not cheaper) on GPU, but you would be degrading your model performance. Another thing to note is that this percentage compares Skylake against the Nvidia K80; with a more powerful GPU the gap would shrink considerably. As with everything, there is a compromise: is it worth a cost 4-6x higher than Skylake? Depending on the company and/or the task, it may be; it is a matter of choices. It is worth noting that, for our model, even the Nvidia V100 (an extremely powerful GPU) was only as fast as Skylake, not faster, which makes the decision on which architecture to pick much easier :)
9. Now, of course, some R&D was involved in getting to these results, as we have seen: fine-tuning the parameters of the model and of the architecture, along with several other difficulties:
- providing consistent and reproducible results over multiple runs and across the different architectures (CPUs and GPUs with the same packages, versions and builds);
- replicating the results on new models;
- more technically, building TensorFlow for the CPU with the best possible optimisation (the Bazel build, since it always uses the most recent versions of TF and MKL-DNN and is compiled directly against the target hardware).
To tackle these difficulties, we created a Benchmark Suite (in Python, and for TensorFlow only at the moment) which automatically performs all the analysis we have shared with you, as long as you point the benchmark at your code. It takes four steps:
1. Spin up the VMs with the specified architectures: CPU, GPU, ...
2. Build for the architecture with the desired installation type (Bazel, pip, Conda); for the GPU, install CUDA and the rest of the stack.
3. Run your model.
4. Save the results of the runs on the different hardware to BigQuery, reporting running time, cost, model performance, ... (a hedged sketch of this step follows this note).
Personally, I think this is great! If you think the same, don't hesitate to contact us and learn how we can accelerate your Machine Learning models together on GCP!
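As an illustration of step 4 only, a hedged sketch of writing one benchmark record to BigQuery with the google-cloud-bigquery client; the project, dataset, table and field names are all invented:

```python
# Hypothetical sketch of the suite's final step: logging one run to
# BigQuery. Project, dataset, table and field names are invented.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
rows = [{
    "hardware": "skylake-8core",
    "build": "bazel",
    "running_time_s": 123.4,
    "cost_usd": 0.56,
    "model_auc": 0.78,
}]
errors = client.insert_rows_json("my-project.benchmarks.runs", rows)
if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```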