bit.ly/kubemaster1
GPU Enablement for Data
Science on OpenShift
Pete MacKinnon
Red Hat AI Center of Excellence
@pdmackinnon
● pmackinn@redhat.com
● Principal Engineer in the Red Hat AI Center of Excellence
● Kubeflow committer since project formation
● Open Data Hub and NVIDIA GPU Operator contributor
● KubeCon, TensorFlow World, GTC, ODSC, OpenShift
Commons, and SCaLE 17x presenter
● Technical Editor for upcoming Kubeflow publication
● Co-author of “Linux Unleashed”
● Thirty years of distributed computing consulting and
engineering experience
Agenda
• Data science: data and models
• AI/ML lifecycle: training to inference
• Scalars, vectors, and tensors
• CPU and GPU
• Notebooks and frameworks
• The OpenShift GPU operator “family”
• The components of GPU enablement
• Installation and demo
Data
Models
The AI/ML lifecycle
Data: collection, analysis, transformation, validation, splitting, feature extraction, labeling
Training: algorithm selection or development, hyperparameter tuning, model validation
Inference/Serving: monitoring, logging
Data and model in production
Scalars, vectors, and tensors
Scalar - a real number having magnitude that measures
something: volume, density, speed, energy, mass, time, etc.
Vector - a one-dimensional array of scalars: force, velocity,
momentum, etc.
Tensor - a higher-order algebraic object that could be a scalar, a
vector, a multidimensional array, a multilinear map, etc.
Modern CPUs have advanced instruction sets for vector algebra,
but modern GPUs are built specifically to perform complex
tensor operations with a high degree of parallelism
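The scalar/vector/tensor distinction above can be sketched in a few lines of NumPy (the library choice is an assumption; the talk itself does not name one). The rank, or number of dimensions, is what separates the three:

```python
import numpy as np

scalar = np.float32(3.0)             # rank 0: a single magnitude (e.g. mass)
vector = np.array([1.0, 2.0, 3.0])   # rank 1: a 1-d array of scalars (e.g. velocity)
tensor = np.zeros((2, 3, 4))         # rank 3: a higher-order multidimensional array

print(scalar.ndim, vector.ndim, tensor.ndim)  # 0 1 3
```

A scalar and a vector are thus just the rank-0 and rank-1 special cases of a tensor, which is why frameworks like TensorFlow and PyTorch use a single tensor type for all of them.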
Scalars, vectors, and tensors
How many matrix multiplications can be done in one clock cycle?
Image: https://iq.opengenus.org/
10¹ (CPU, scalar) · 10⁴ (CPU/GPU, vector) · 10⁵ (GPU, tensor)
So, in one clock cycle...
CPU (scalar)
CPU/GPU
(vector)
GPU (tensor)
Or, DL with real world data...
Object
(scalar)
Movement
(vector)
Classification, velocity,
bearing, and much more
(tensor)
CPU and GPU
NVIDIA Ampere A100
• 6912 FP32 CUDA Cores
• 432 Gen3 Tensor Cores
but
• FP32 -> 19.5 TFLOPS
AMD EPYC 7702 (Rome)
• 64 CPU Cores
• 128 Threads
• 2.0GHz Base Clock
• FP32 -> 1-2 TFLOPS
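Taking the slide's FP32 figures at face value, the raw throughput gap works out to roughly an order of magnitude. This is a back-of-the-envelope sketch, not a benchmark; 1.5 TFLOPS is an assumed midpoint of the slide's 1-2 TFLOPS range:

```python
# Peak FP32 throughput quoted on the slide, in TFLOPS
a100_tflops = 19.5   # NVIDIA Ampere A100
epyc_tflops = 1.5    # AMD EPYC 7702 (assumed midpoint of 1-2 TFLOPS)

speedup = a100_tflops / epyc_tflops
print(f"~{speedup:.0f}x peak FP32 advantage")  # ~13x
```

Note that this peak-FLOPS ratio understates real deep-learning gains: the A100's 432 Tensor Cores are not counted in its FP32 figure, and tensor workloads exploit the GPU's parallelism far better than general-purpose code does.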
A GPU notebook
Profit
380x speedup over CPU in basic CNN smoke test
(Intel Xeon E5-2686 vs. NVIDIA V100-SXM2-16GB)
GPU operators

Special Resource Operator (SRO)
● Community operator
● Reference implementation for other specialized hardware
  ○ NIC, FPGA
● Provided the code basis for the NVIDIA GPU Operator
● Deployed from OperatorHub

NVIDIA GPU Operator
● Certified and supported on OpenShift by NVIDIA and Red Hat
● Can be deployed from the embedded OperatorHub or with Helm

Both operators require Node Feature Discovery (NFD)
NVIDIA also provides GPU Feature Discovery for enhanced node labeling
Operator components
• Container runtime toolkit: The NVIDIA GPU Operator supports
the Docker and CRI-O container runtimes. This daemonset
ensures the correct runtime setup for the GPU hook.
• Driver: A container deployed as a daemonset that holds all of the
userspace and kernelspace software needed to make the GPU device
work.
• Device plugin: A daemonset that monitors the health and
availability of the GPUs on the node. Vital for pod scheduling.
• DCGM: Data Center GPU Manager - a node exporter that
captures GPU metrics for use by Prometheus.
Operands are scheduled to GPU nodes via the NFD label for NVIDIA's PCI vendor ID (10de):
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"
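Once the device plugin is running, a workload requests a GPU through the extended resource it advertises. A minimal sketch of such a pod (the name and image tag are illustrative, not from the talk):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test          # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base  # illustrative CUDA base image
    command: ["nvidia-smi"]       # prints GPU info if enablement worked
    resources:
      limits:
        nvidia.com/gpu: 1         # extended resource exposed by the device plugin
```

The `nvidia.com/gpu` limit is what makes the scheduler place the pod on a node where the device plugin has reported available GPUs.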
Installation
Demo
Thank You

GPU enablement for data science on OpenShift | DevNation Tech Talk
