SlideShare a Scribd company logo
1 of 29
Download to read offline
Introduction to Snap Machine Learning
Andreea Anghel
Post-doctoral Researcher
IBM Research – Zurich
IBM Research - Zurich / Introduction to Snap Machine Learning / June 2018 / © 2018 IBM Corporation
Contents
2
Code Examples
snap-ml-local
snap-ml-mpi
snap-ml-spark
Experimental Results
Application and datasets
Single-node experiments
Out-of-core performance (PCIe)
Out-of-core performance (NVLINK)
Dealing with slow networks
Terabyte-scale benchmark
Conclusions
Introduction
What are GLMs?
Why are GLMs useful?
What is Snap Machine Learning?
Snap ML Architecture
Multi-level parallelism
Data-parallel framework
Inter-node communication
Intra-node communication
Implementation
GPU Acceleration
Out-of-core learning (DuHL)
Streaming pipeline
Software architecture
Introduction
3
What are GLMs?
Why are GLMs useful?
What is Snap Machine Learning?
What are GLMs?
4
Ridge
Regression
Lasso
Regression
Support
Vector
Machines
Logistic
Regression
Generalized
Linear Models
Classification
Regression
Fast Training
Can scale to datasets
with billions of
examples and/or
features.
Why are GLMs useful? Less tuning
State-of-the-art
algorithms for training
linear models do not
involve a step-size
parameter.
Widely used in industry
The Kaggle ”State of Data Science” survey asked
16,000 data scientists and ML practitioners what
tools and algorithms they use on a daily basis.
37.6% of respondents use Neural Networks
63.5% of respondents use Logistic Regression
Interpretability
New data protection regulations in Europe (GDPR)
give E.U. citizens the right to “obtain an explanation
of a decision reached” by an algorithm.
5
What is Snap Machine Learning?
6
Framework Models
GPU
Acceleration
Distributed
Training
Sparse Data
Support
scikit-learn ML/{DL} No No Yes
Apache
Spark* MLlib
ML/{DL} No Yes Yes
TensorFlow** ML Yes Yes Limited
Snap ML GLMs Yes Yes Yes
Machine Learning
(ML)
Deep
Learning
(DL)
GLMs
Snap ML: A new framework for fast training of GLMs
* The Apache Software Foundation (ASF) owns all Apache-related trademarks, service marks, and graphic logos
on behalf of our Apache project communities, and the names of all Apache projects are trademarks of the ASF.
**TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.
Snap ML Architecture
7
Multi-level parallelism
Data-parallel framework
Inter-node communication
Intra-node communication
Level 1
Parallelism across
nodes connected via a
network interface.
Level 2
Parallelism across
GPUs within the same
node connected via an
interconnect (e.g.
NVLINK).
Level 3
Parallelism across the streaming multiprocessors
of the GPU hardware.
Multi-level Parallelism
8
Node 0
Data-Parallel Framework
9
Large Dataset
Partition 0
GPU 0
GPU 1
GPU 2
GPU 3
Partition (0,0)
Partition (0,1)
Partition (0,2)
Partition (0,3)
Node 1
Partition 1
GPU 0
GPU 1
GPU 2
GPU 3
Partition (1,0)
Partition (1,1)
Partition (1,2)
Partition (1,3)
Inter-node communication
10
Node 0
10Gbit/s
Node 1
Node 2
Node 3
Intra-node communication
POWER is a registered trademark of International Business Machines Corporation in the United States, other countries, or both.
11
Snap ML Implementation
12
GPU Acceleration
Out-of-code learning (DuHL)
Streaming pipeline
Software architecture
Each local sub-task
can be solved
effectively using
stochastic coordinate
descent (SCD).
[Shalev-Schwartz
2013]
Recently,
asynchronous variants
of SCD have been
proposed to run on
multi-core CPUs.
[Liu 2013]
[Tran 2015]
[Hsiesh 2015]
Twice parallel asynchronous SCD (TPA-SCD) is a
another recent variant designed to run on GPUs.
It assigns each coordinate update to a different
block of threads that executed asynchronously.
Within each thread block, the coordinate update is
computed using many tightly coupled threads.
GPU Acceleration
13
Local Model
T. Parnell, C. Dünner, K. Atasu, M. Sifalakis and H. Pozidis, "Large-
scale stochastic learning using GPUs”, ParLearning 2017
Out-of-Core Learning (DuHL)
14
Main Memory CPU
Randomly sample
points and compute
the duality gap
Periodically identify
set of points with
largest gap and
ensure these points
are in GPU memory
Local
Data
GPU
Update the model using
the subset of the data that
is currently in memory
Model
Subset of data
Periodically copy model
back to CPU
C. Dünner, T. Parnell and M. Jaggi, “Efficient Use of Limited-Memory Resources to Accelerate Linear Learning”, NIPS 2017
While DuHL can
minimize the amount
of data transfer
between CPU and
GPU, it can still
become a bottleneck
particularly in the
beginning of training.
Using CUDA streams,
we have implemented
a training pipeline
whereby the next set
of data is copied onto
the GPU while the
current set is being
trained.
TPA-SCD also requires making a random
permutation of the coordinates in the GPU memory
at any given time.
To achieve this we introduce a 3rd pipeline stage
whereby the CPU generates a set of numbers for
the set of points that are currently being copied.
In the next stage, these random numbers are
copied onto the GPU and sorted to produce the
desired permutation.
Streaming Pipeline
15
Copy data (i+1)RNG (i+1)
Sort RN (i)
Train (i)
Copy RN (i+1)
Copy data (i+2)RNG (i+2)
Sort RN (i+1)
Train (i+1)
Copy RN (i+2)
Copy data (i+3)RNG (i+3)
Sort RN (i+2)
Train (i+2)
Copy RN (i+3)
CPU Interconnect GPU
libglm
Underlying C++/CUDA
template library.
snap-ml-local
Small to medium-
scale data.
Single node
deployment.
Multi-GPU support.
scikit-learn compatible
Python API
snap-ml-mpi
Large-scale data.
Multi-node
deployment in HPC
environments.
Many-GPU support.
Python API.
snap-ml-spark
Large-scale data
Multi-node
deployment in Apache
Spark environments.
Many GPU support
Python/Scala /Java*
API.
*Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
16
Code Examples
17
snap-ml-local
snap-ml-mpi
snap-ml-spark
Example: snap-ml-local
Acceleration existing scikit-learn applications
by changing only 2 lines of code.
Example: snap-ml-mpi
Describe application using high-level Python code.
Launch application on 4 nodes using mpirun (4 GPUs per node):
Example: snap-ml-spark
Describe application using high-level Python code:
Launch application on Spark cluster (1 GPU per Spark executor):
Experimental Results
21
Application and datasets
Single-node experiments
Out-of-core performance (PCIe)
Out-of-core performance (NVLINK)
Dealing with slow networks
Terabyte-scale benchmark
Can use train ML
models to predict
whether or not a user
will click on an advert?
Click-through Rate (CTR)
Prediction
Core business of many
internet giants.
Labelled training
examples are being
generated in real-time.
Dataset: criteo-tb
Number of features: 1 million
Number of examples: 4.2 billion
Size: 3 Terabytes (SVM Light)
Dataset: criteo-kaggle
Number of features: 1 million
Number of examples: 45 million
Size: 40 Gigabytes (SVM Light)
22
Single-node Performance
23
Dataset Examples Nodes Total GPUs Network CPU-GPU interface
criteo-kaggle 45 million 1x Power AC922 server 1x V100 n/a NVLINK 2.0
Out-of-core performance (PCIe Gen3)
24
Train chunk
(i) on GPU
Copy chunk
(i+1) onto GPU
90ms
318ms
S1 Init
12ms
330ms
S2
Train chunk
(i+1) on GPU
Copy chunk
(i+2) onto GPU
90ms
318ms
Init
12ms
330ms
Dataset Examples Nodes Total GPUs Network CPU-GPU interface
criteo-tb 200 million 1x Intel Xeon* Gold 6150 1x V100 n/a PCI Gen3
*Intel Xeon is a trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Out-of-core performance (NVLINK 2.0)
25
90ms
55ms
S1
3ms
93ms
S2
Train chunk
(i+1) on GPU
Copy chunk
(i+2) onto GPU
90ms
55ms
Init
3ms
93ms
Train chunk
(i+2) on GPU
Copy chunk
(i+3) onto GPU
90ms
55ms
Init
3ms
93ms
Train chunk
(i+3) on GPU
Copy chunk
(i+4) onto GPU
90ms
55ms
Init
3ms
93ms
Train chunk
(i+4) on GPU
Copy chunk
(i+5) onto GPU
90ms
55ms
Init
3ms
93ms
Dataset Examples Nodes Total GPUs Network CPU-GPU interface
criteo-tb 200 million 1x Power AC922 server 1x V100 n/a NVLINK 2.0
Train chunk
(i) on GPU
Copy chunk
(i+1) onto GPU
Init
Dealing with slow networks
26
Dataset Examples Nodes Total GPUs Network CPU-GPU interface
criteo-tb 1 billion 4x Power AC922 server 16x V100 1Gbit Ethernet NVLINK 2.0
criteo-tb 1 billion 4x Power AC922 server 16x V100 InfiniBand NVLINK 2.0
Terabyte-scale Benchmark
27
Dataset Examples Nodes Total GPUs Network CPU-GPU interface
criteo-tb 4.2 billion 4x Power AC922 server 16x V100 InfiniBand NVLINK 2.0
Conclusions Snap ML is a new framework for fast training of
GLMs.
Snap ML benefits from GPU acceleration.
It can be deployed in single-node and multi-node
environments.
The hierarchical structure of the framework makes
it suitable for cloud-based deployments.
It can leverage fast interconnects like NVLINK to
achieve streaming, out-of-core performance.
Snap ML significantly outperforms other software
frameworks for training GLMS in both single-node
and multi-node benchmarks.
Snap ML can train a logistic regression classifier on
the Criteo Terabyte Click Logs data in 1.5 minutes.
Snap ML will be available to try in June as part of
IBM PowerAI (Tech Preview)
Celestine Dünner Dimitrios Sarigiannis Andreea Anghel
Nikolas Ioannou Haris Pozidis Thomas Parnell
The Snap ML team:
28
https://www.zurich.ibm.com/snapml/
29

More Related Content

What's hot

OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research Ganesan Narayanasamy
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsGanesan Narayanasamy
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialGanesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM Ganesan Narayanasamy
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Indrajit Poddar
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AIJMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AILablup Inc.
 
Large Model support and Distribute deep learning
Large Model support and Distribute deep learningLarge Model support and Distribute deep learning
Large Model support and Distribute deep learningGanesan Narayanasamy
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsGanesan Narayanasamy
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysTaylor Riggan
 
Transparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep LearningTransparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep LearningIndrajit Poddar
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...Edge AI and Vision Alliance
 

What's hot (20)

OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research OpenPOWER Webinar on Machine Learning for Academic Research
OpenPOWER Webinar on Machine Learning for Academic Research
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
Covid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power SystemsCovid-19 Response Capability with Power Systems
Covid-19 Response Capability with Power Systems
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
PowerAI Deep dive
PowerAI Deep divePowerAI Deep dive
PowerAI Deep dive
 
OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM OpenPOWER/POWER9 Webinar from MIT and IBM
OpenPOWER/POWER9 Webinar from MIT and IBM
 
WML OpenPOWER presentation
WML OpenPOWER presentationWML OpenPOWER presentation
WML OpenPOWER presentation
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AIJMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AI
 
OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Large Model support and Distribute deep learning
Large Model support and Distribute deep learningLarge Model support and Distribute deep learning
Large Model support and Distribute deep learning
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate Arrays
 
Transparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep LearningTransparent Hardware Acceleration for Deep Learning
Transparent Hardware Acceleration for Deep Learning
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
 

Similar to SNAP MACHINE LEARNING

GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Ontico
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_ProcessingKohei KaiGai
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)Kohei KaiGai
 
Stream Processing
Stream ProcessingStream Processing
Stream Processingarnamoy10
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...Equnix Business Solutions
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
Inside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable CloudInside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable Cloudinside-BigData.com
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5Steen Larsen
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 

Similar to SNAP MACHINE LEARNING (20)

No[1][1]
No[1][1]No[1][1]
No[1][1]
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
GTC 2022 Keynote
GTC 2022 KeynoteGTC 2022 Keynote
GTC 2022 Keynote
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
Inside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable CloudInside Microsoft's FPGA-Based Configurable Cloud
Inside Microsoft's FPGA-Based Configurable Cloud
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency programGanesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISAGanesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsGanesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsGanesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 
OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
 
A2O Core implementation on FPGA
A2O Core implementation on FPGAA2O Core implementation on FPGA
A2O Core implementation on FPGA
 
OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction OpenPOWER Foundation Introduction
OpenPOWER Foundation Introduction
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

SNAP MACHINE LEARNING

  • 1. Introduction to Snap Machine Learning Andreea Anghel Post-doctoral Researcher IBM Research – Zurich IBM Research - Zurich / Introduction to Snap Machine Learning / June 2018 / © 2018 IBM Corporation
  • 2. Contents 2 Code Examples snap-ml-local snap-ml-mpi snap-ml-spark Experimental Results Application and datasets Single-node experiments Out-of-core performance (PCIe) Out-of-core performance (NVLINK) Dealing with slow networks Terabyte-scale benchmark Conclusions Introduction What are GLMs? Why are GLMs useful? What is Snap Machine Learning? Snap ML Architecture Multi-level parallelism Data-parallel framework Inter-node communication Intra-node communication Implementation GPU Acceleration Out-of-core learning (DuHL) Streaming pipeline Software architecture
  • 3. Introduction 3 What are GLMs? Why are GLMs useful? What is Snap Machine Learning?
  • 5. Fast Training Can scale to datasets with billions of examples and/or features. Why are GLMs useful? Less tuning State-of-the-art algorithms for training linear models do not involve a step-size parameter. Widely used in industry The Kaggle ”State of Data Science” survey asked 16,000 data scientists and ML practitioners what tools and algorithms they use on a daily basis. 37.6% of respondents use Neural Networks 63.5% of respondents use Logistic Regression Interpretability New data protection regulations in Europe (GDPR) give E.U. citizens the right to “obtain an explanation of a decision reached” by an algorithm. 5
  • 6. What is Snap Machine Learning? 6 Framework Models GPU Acceleration Distributed Training Sparse Data Support scikit-learn ML/{DL} No No Yes Apache Spark* MLlib ML/{DL} No Yes Yes TensorFlow** ML Yes Yes Limited Snap ML GLMs Yes Yes Yes Machine Learning (ML) Deep Learning (DL) GLMs Snap ML: A new framework for fast training of GLMs * The Apache Software Foundation (ASF) owns all Apache-related trademarks, service marks, and graphic logos on behalf of our Apache project communities, and the names of all Apache projects are trademarks of the ASF. **TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.
  • 7. Snap ML Architecture 7 Multi-level parallelism Data-parallel framework Inter-node communication Intra-node communication
  • 8. Level 1 Parallelism across nodes connected via a network interface. Level 2 Parallelism across GPUs within the same node connected via an interconnect (e.g. NVLINK). Level 3 Parallelism across the streaming multiprocessors of the GPU hardware. Multi-level Parallelism 8
  • 9. Node 0 Data-Parallel Framework 9 Large Dataset Partition 0 GPU 0 GPU 1 GPU 2 GPU 3 Partition (0,0) Partition (0,1) Partition (0,2) Partition (0,3) Node 1 Partition 1 GPU 0 GPU 1 GPU 2 GPU 3 Partition (1,0) Partition (1,1) Partition (1,2) Partition (1,3)
  • 11. Intra-node communication POWER is a registered trademark of International Business Machines Corporation in the United States, other countries, or both. 11
  • 12. Snap ML Implementation 12 GPU Acceleration Out-of-code learning (DuHL) Streaming pipeline Software architecture
  • 13. Each local sub-task can be solved effectively using stochastic coordinate descent (SCD). [Shalev-Schwartz 2013] Recently, asynchronous variants of SCD have been proposed to run on multi-core CPUs. [Liu 2013] [Tran 2015] [Hsiesh 2015] Twice parallel asynchronous SCD (TPA-SCD) is a another recent variant designed to run on GPUs. It assigns each coordinate update to a different block of threads that executed asynchronously. Within each thread block, the coordinate update is computed using many tightly coupled threads. GPU Acceleration 13 Local Model T. Parnell, C. Dünner, K. Atasu, M. Sifalakis and H. Pozidis, "Large- scale stochastic learning using GPUs”, ParLearning 2017
  • 14. Out-of-Core Learning (DuHL) 14 Main Memory CPU Randomly sample points and compute the duality gap Periodically identify set of points with largest gap and ensure these points are in GPU memory Local Data GPU Update the model using the subset of the data that is currently in memory Model Subset of data Periodically copy model back to CPU C. Dünner, T. Parnell and M. Jaggi, “Efficient Use of Limited-Memory Resources to Accelerate Linear Learning”, NIPS 2017
  • 15. While DuHL can minimize the amount of data transfer between CPU and GPU, it can still become a bottleneck particularly in the beginning of training. Using CUDA streams, we have implemented a training pipeline whereby the next set of data is copied onto the GPU while the current set is being trained. TPA-SCD also requires making a random permutation of the coordinates in the GPU memory at any given time. To achieve this we introduce a 3rd pipeline stage whereby the CPU generates a set of numbers for the set of points that are currently being copied. In the next stage, these random numbers are copied onto the GPU and sorted to produce the desired permutation. Streaming Pipeline 15 Copy data (i+1)RNG (i+1) Sort RN (i) Train (i) Copy RN (i+1) Copy data (i+2)RNG (i+2) Sort RN (i+1) Train (i+1) Copy RN (i+2) Copy data (i+3)RNG (i+3) Sort RN (i+2) Train (i+2) Copy RN (i+3) CPU Interconnect GPU
  • 16. libglm Underlying C++/CUDA template library. snap-ml-local Small to medium- scale data. Single node deployment. Multi-GPU support. scikit-learn compatible Python API snap-ml-mpi Large-scale data. Multi-node deployment in HPC environments. Many-GPU support. Python API. snap-ml-spark Large-scale data Multi-node deployment in Apache Spark environments. Many GPU support Python/Scala /Java* API. *Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. 16
  • 18. Example: snap-ml-local Acceleration existing scikit-learn applications by changing only 2 lines of code.
  • 19. Example: snap-ml-mpi Describe application using high-level Python code. Launch application on 4 nodes using mpirun (4 GPUs per node):
  • 20. Example: snap-ml-spark Describe application using high-level Python code: Launch application on Spark cluster (1 GPU per Spark executor):
  • 21. Experimental Results 21 Application and datasets Single-node experiments Out-of-core performance (PCIe) Out-of-core performance (NVLINK) Dealing with slow networks Terabyte-scale benchmark
  • 22. Can use train ML models to predict whether or not a user will click on an advert? Click-through Rate (CTR) Prediction Core business of many internet giants. Labelled training examples are being generated in real-time. Dataset: criteo-tb Number of features: 1 million Number of examples: 4.2 billion Size: 3 Terabytes (SVM Light) Dataset: criteo-kaggle Number of features: 1 million Number of examples: 45 million Size: 40 Gigabytes (SVM Light) 22
  • 23. Single-node Performance 23 Dataset Examples Nodes Total GPUs Network CPU-GPU interface criteo-kaggle 45 million 1x Power AC922 server 1x V100 n/a NVLINK 2.0
  • 24. Out-of-core performance (PCIe Gen3) 24 Train chunk (i) on GPU Copy chunk (i+1) onto GPU 90ms 318ms S1 Init 12ms 330ms S2 Train chunk (i+1) on GPU Copy chunk (i+2) onto GPU 90ms 318ms Init 12ms 330ms Dataset Examples Nodes Total GPUs Network CPU-GPU interface criteo-tb 200 million 1x Intel Xeon* Gold 6150 1x V100 n/a PCI Gen3 *Intel Xeon is a trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
  • 25. Out-of-core performance (NVLINK 2.0) 25 90ms 55ms S1 3ms 93ms S2 Train chunk (i+1) on GPU Copy chunk (i+2) onto GPU 90ms 55ms Init 3ms 93ms Train chunk (i+2) on GPU Copy chunk (i+3) onto GPU 90ms 55ms Init 3ms 93ms Train chunk (i+3) on GPU Copy chunk (i+4) onto GPU 90ms 55ms Init 3ms 93ms Train chunk (i+4) on GPU Copy chunk (i+5) onto GPU 90ms 55ms Init 3ms 93ms Dataset Examples Nodes Total GPUs Network CPU-GPU interface criteo-tb 200 million 1x Power AC922 server 1x V100 n/a NVLINK 2.0 Train chunk (i) on GPU Copy chunk (i+1) onto GPU Init
  • 26. Dealing with slow networks 26 Dataset Examples Nodes Total GPUs Network CPU-GPU interface criteo-tb 1 billion 4x Power AC922 server 16x V100 1Gbit Ethernet NVLINK 2.0 criteo-tb 1 billion 4x Power AC922 server 16x V100 InfiniBand NVLINK 2.0
  • 27. Terabyte-scale Benchmark 27 Dataset Examples Nodes Total GPUs Network CPU-GPU interface criteo-tb 4.2 billion 4x Power AC922 server 16x V100 InfiniBand NVLINK 2.0
  • 28. Conclusions Snap ML is a new framework for fast training of GLMs. Snap ML benefits from GPU acceleration. It can be deployed in single-node and multi-node environments. The hierarchical structure of the framework makes it suitable for cloud-based deployments. It can leverage fast interconnects like NVLINK to achieve streaming, out-of-core performance. Snap ML significantly outperforms other software frameworks for training GLMS in both single-node and multi-node benchmarks. Snap ML can train a logistic regression classifier on the Criteo Terabyte Click Logs data in 1.5 minutes. Snap ML will be available to try in June as part of IBM PowerAI (Tech Preview) Celestine Dünner Dimitrios Sarigiannis Andreea Anghel Nikolas Ioannou Haris Pozidis Thomas Parnell The Snap ML team: 28 https://www.zurich.ibm.com/snapml/
  • 29. 29