Copyright © NTT Communications Corporation.
Transform your business, transcend expectations with our technologically advanced solutions.
Can we boost HPC performance further?
Integrating IBM POWER servers with GPUs into an OpenStack Environment
Ankit Purohit, Takeaki Matsumoto
Self-Introduction
Takeaki Matsumoto
takeaki.matsumoto@ntt.com
NTT Communications
Technology Development
R&D for OpenStack
Ops for Private Cloud
Ankit Purohit
a.purohit@ntt.com
NTT Communications
Technology Development
High Performance Computing
GPU
Previous talk at OpenPOWER Summit 2018
● March 19, 2018 at Las Vegas
● OpenPOWER Summit Website: https://openpowerfoundation.org/summit-2018-03-us/
● Co-speaker: Yutaka Kawai, IBM Japan
● Our Talk’s Video: https://www.youtube.com/watch?v=L4g6SmTGcOU&feature=youtu.be
Agenda
● Background
○ Our OpenStack GPU cloud
○ Motivation for using POWER server
● Goal
○ Can we boost performance further with POWER?
● Approach
○ Unleash POWER’s full performance as a baremetal server
○ Integrate the POWER server into the OpenStack cloud
● Conclusion
● Another choice: Kubernetes
Background
● NTT Communications
○ The largest Telecommunications company in Japan
○ Subsidiaries and offices in over 110 cities worldwide
○ Part of a Fortune Global 100 company
● Our team provide GPU cloud using OpenStack,
for in-house users’ experimental usage.
○ AI communication engine COTOHA
http://www.ntt.com/en/services/application/cotoha.html
○ Deep Learning training on customer data
(time-series)
○ etc.
Our OpenStack Environment
x86 servers (as compute nodes) with NVIDIA K10, M60 and P100 GPUs
Image source: https://www.openstack.org/software/
Motivation to try IBM POWER system
➢ Intel-based system : DGX-1
- CPU and GPU are connected via PCIe (32 GB/s)
- Bandwidth between CPU sockets is 64 GB/s
- Bandwidth between CPU and memory is 76.8 GB/s
➢ IBM POWER8 system : Minsky
- CPU and GPU are connected via NVLink (80 GB/s)
- Bandwidth between CPU sockets is 76.8 GB/s
- Bandwidth between CPU and memory is 115 GB/s
● Even with the same GPU card, can a different server architecture bring us better performance?
(Figure: CPU-GPU and CPU-memory interconnect topologies of the two systems, with the bandwidths above.)
Goal
How can we boost performance further with POWER?
Benchmark program: nbody
- nbody is one of the CUDA sample programs.
- It can run the calculation in single or double precision on the GPU, and the results are reported in GFLOP/s.
- It can also run on the CPU only.
$ ./nbody -benchmark -numbodies=2048000 -numdevices=1
-benchmark : (run benchmark to measure performance)
-numbodies : (number of bodies (>= 1) to run in simulation)
(for GPU benchmark:2048000, for CPU benchmark:20480)
-numdevices=<i> : (where i=(number of CUDA devices > 0) to use for simulation)
-cpu : (run n-body simulation on the CPU)
-fp64 : (use double precision floating point values for simulation)
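For reference, typical invocations with the sizes above would look like this (a sketch; adjust -numdevices to the GPUs you want to use):
$ ./nbody -benchmark -numbodies=2048000 -numdevices=1         # FP32 on one GPU
$ ./nbody -benchmark -fp64 -numbodies=2048000 -numdevices=1   # FP64 on one GPU
$ ./nbody -benchmark -cpu -numbodies=20480                    # CPU-only run with the smaller problem size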
Benchmark program: nbody
● We use nbody to emulate a memory-intensive workload
● In nbody, the GPU directly accesses data from host memory (main memory) many times (zero-copy over NVLink or PCIe)
(Figure: nbody data flow — each GPU reads from main memory across the CPU-GPU link; is that link the bottleneck?)
Benchmark Result: POWER8 baremetal (1/2)
With default server configuration
Workload: numbodies=2048000, FP32 on Minsky w/ RHEL7.3
When using 2 GPUs, specifying different GPUs
causes different performance.
T. Kamenoue, M. Mitsugi, and Y. Kawai, "The optimization of nbody simulation on Multi-GPU environment” in Proc. the 80th National Convention of Information Processing Society of Japan (IPSJ), Tokyo, Japan, Mar. 2018, pp. 1-25,26.
When using 4 GPUs, performance is lower than with 2 GPUs because it does not scale.
Why?!
(Chart: GFLOP/s for the 1-GPU, 2-GPU, 2-GPU and 4-GPU runs.)
A Solution : Memory Interleave
What does memory interleave actually do?
- It spreads memory allocations equally across all the nodes (CPU sockets) in a round-robin way.
- I/O access can be balanced.
- It works well for the nbody benchmark (FP32).
- How to execute?
numactl --interleave=all ./nbody ...   OR   numactl -i all ./nbody ...
(Figure: memory access pattern with interleave disabled (default) vs. interleave enabled.)
T. Kamenoue, M. Mitsugi, and Y. Kawai, "The optimization of nbody simulation on Multi-GPU environment” in Proc. the 80th National Convention of Information Processing Society of Japan (IPSJ), Tokyo, Japan, Mar. 2018, pp. 1-25,26.
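Before switching interleave on, it can help to confirm the NUMA layout of the machine; a minimal sketch of the two steps:
$ numactl --hardware                                                 # show the NUMA nodes (CPU sockets) and their memory
$ numactl --interleave=all ./nbody -benchmark -numbodies=2048000     # spread allocations across all nodes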
What happens if Interleave is disabled?
T. Kamenoue, M. Mitsugi, and Y. Kawai, "The optimization of nbody simulation on Multi-GPU environment” in Proc. the 80th National Convention of Information Processing Society of Japan (IPSJ), Tokyo, Japan, Mar. 2018, pp. 1-25,26.
workload : FP32, numbodies=2048000, 4GPU, Interleave disabled
➔ GPU0 and GPU1 always read from the CLOSE memory
➔ GPU2 and GPU3 always read from the FAR memory
➔ Elapsed time per iteration
- GPU 0 : 4.3 - 4.4 seconds
- GPU 1 : 4.3 - 4.4 seconds
- GPU 2 : 9.2 - 9.10 seconds
- GPU 3 : 9.2 - 9.10 seconds
➔ Benchmark result : 8673 GFLOP/s
(Figure: timeline of one iteration.)
What happens if Interleave is enabled?
T. Kamenoue, M. Mitsugi, and Y. Kawai, "The optimization of nbody simulation on Multi-GPU environment” in Proc. the 80th National Convention of Information Processing Society of Japan (IPSJ), Tokyo, Japan, Mar. 2018, pp. 1-25,26.
workload : FP32, numbodies=2048000, 4GPU, Interleave enabled
➔ GPU0 and GPU1 always read 1/2 of the data from the CLOSE memory and 1/2 from the FAR memory
➔ All GPUs read in the same way
➔ Elapsed time per iteration
- GPU 0 : 5.2 - 5.3 seconds
- GPU 1 : 5.2 - 5.3 seconds
- GPU 2 : 5.2 - 5.3 seconds
- GPU 3 : 5.2 - 5.3 seconds
➔ Benchmark result : 15969 GFLOP/s
(Figure: timeline of one iteration.)
Benchmark Result: POWER8 baremetal (2/2)
T. Kamenoue, M. Mitsugi, and Y. Kawai, "The optimization of nbody simulation on Multi-GPU environment” in Proc. the 80th National Convention of Information Processing Society of Japan (IPSJ), Tokyo, Japan, Mar. 2018, pp. 1-25,26.
Now it scales: the 4-GPU case is faster than the 2-GPU case.
With memory interleave enabled
Workload: numbodies=2048000, FP32 on Minsky w/ RHEL7.3
(Chart: GFLOP/s for the 1-GPU, 2-GPU, 2-GPU and 4-GPU runs.)
Benchmark Result: POWER8 vs DGX-1 baremetal
- The current Intel-architecture machine cannot benefit from memory interleave because of its narrower I/O bandwidth.
(Chart: nbody GFLOP/s for POWER8 vs. DGX-1 as the GPU count increases; workload: numbodies=2048000, FP32; 1, 2 and 4 GPUs.)
How to integrate POWER8 to OpenStack
Controller (x86): nova-api, nova-scheduler, nova-conductor
Compute (x86): nova-compute
Compute (x86): nova-compute
Compute (ppc64le): nova-compute
How to integrate POWER8 to OpenStack
● Linux can run on POWER8
● KVM can run on POWER8
● OpenStack can run on POWER8
○ Cloud Archive repository available
Basically, the same procedure as on x86 can be used
How to integrate POWER8 to OpenStack
● For GPU, we need KVM PCI-Passthrough
○ KVM support
■ qemu (1:2.6.1+dfsg-0ubuntu2) xenial; urgency=medium
● Enable GPU Passthru for ppc64le
https://launchpad.net/bugs/1541902
○ IOMMU (like Intel VT-d)
■ In POWER servers, IBM Translation Control Entry is available
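As a quick sanity check before whitelisting the device, one can verify which IOMMU group the GPU sits in (a sketch; the PCI address is the P100 shown later, and every device in the group must be passed through together):
$ ls /sys/bus/pci/devices/0002:01:00.0/iommu_group/devices/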
How to integrate POWER8 to OpenStack
● Environment
○ OpenPOWER IBM S822LC for HPC "Minsky"
■ CPU: 20 cores (logical: 160 cores)
■ MEM: 1TB
■ GPU: NVIDIA P100 * 4 (with NVLink)
○ OS
■ Ubuntu 16.04.4 (kernel: 4.15.0-13-generic)
○ Software
■ KVM 2.11
■ Nova 17.0.1 (Queens)
How to integrate POWER8 to OpenStack
● Configuration
○ Kernel parameters
■ vfio-pci.disable_idle_d3=1
○ Disable SMT
■ $ ppc64_cpu --smt=off
○ Disable nouveau driver
■ $ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
■ $ sudo update-initramfs -u
■ $ reboot
■ $ lsmod | grep nouveau
How to integrate POWER8 to OpenStack
● Nova Configuration
○ Compute node
■ Ensure PCI device id
● $ lspci -nn | grep -i nvidia
0002:01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f9] (rev a1)
■ nova.conf
● [DEFAULT]
pci_passthrough_whitelist={"vendor_id":"10de","product_id":"15f9"}
○ Controller node
■ nova.conf
● [DEFAULT]
pci_alias = {"vendor_id":"10de", "product_id":"15f9", "name": "P100"}
● [filter_scheduler]
enabled_filters = …,PciPassthroughFilter
Our OpenStack Environment: After Integration
x86 servers with NVIDIA K10, M60 and P100 GPUs; POWER8 servers with NVIDIA P100 GPUs
Image source: https://www.openstack.org/software/
Benchmark of OpenStack-integrated VM
● Instance flavor
○ vCPU: 16
○ Mem: 120GB
○ Disk: 160GB
○ Metadata:
■ pci_passthrough:alias=P100:4
■ hw:mem_page_size=16384
■ hw:numa_nodes=2
● GPU environment
○ NVIDIA Driver: 390.12
○ CUDA: 9.1
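The flavor above can be created with the OpenStack CLI roughly as follows (a sketch; the flavor name gpu-p100x4 is made up):
$ openstack flavor create gpu-p100x4 --vcpus 16 --ram 122880 --disk 160
$ openstack flavor set gpu-p100x4 \
    --property "pci_passthrough:alias"="P100:4" \
    --property "hw:mem_page_size"="16384" \
    --property "hw:numa_nodes"="2"
# --ram is in MB (120 GB = 122880); hw:mem_page_size is in KiB (16384 KiB = 16 MiB pages)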
Benchmark of OpenStack-integrated VM
● nbody benchmark results
○ $ numactl -i all ./nbody -benchmark -numbodies=2048000
(Chart: GFLOP/s for 1, 2 and 4 GPUs.)
Benchmark of OpenStack-integrated VM
● CPU-GPU Memory bandwidth benchmark results
○ $ ./bandwidthTest
Benchmark of OpenStack-integrated VM
● CPU-GPU Memory bandwidth benchmark results
○ $ ./bandwidthTest
Why?
Benchmark of OpenStack-integrated VM
● NVLink implementation
(Figure: physical view vs. what Linux recognizes — physically, the CPU and GPU are connected directly by NVLink (about 2.5x PCIe bandwidth); Linux sees the GPU as a PCI device plus two separate NVLink devices.)
Benchmark of OpenStack-integrated VM
● OpenStack attaches only the GPU
(Figure: with PCI passthrough of the GPU alone, the VM sees the GPU on a plain PCIe x8 link; the NVLink devices stay behind on the host.)
Benchmark of OpenStack-integrated VM
● Does passing through all 3 devices solve this issue?
(Figure: the GPU and both of its NVLink devices are passed through to the VM.)
Benchmark of OpenStack-integrated VM
● GPU loc-code
$ lspci -d 10de:15f9
0002:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1)
0003:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1)
000a:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1)
000b:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1)
$ cat /sys/bus/pci/devices/0002:01:00.0/of_node/ibm,loc-code
GPU1
$ cat /sys/bus/pci/devices/0003:01:00.0/of_node/ibm,loc-code
GPU2
$ cat /sys/bus/pci/devices/000a:01:00.0/of_node/ibm,loc-code
GPU3
$ cat /sys/bus/pci/devices/000b:01:00.0/of_node/ibm,loc-code
GPU4
Benchmark of OpenStack-integrated VM
● NVLink devices and their connections
$ lspci -d 1014:04ea
0004:00:00.0 Bridge: IBM Device 04ea
0004:00:00.1 Bridge: IBM Device 04ea
0004:00:01.0 Bridge: IBM Device 04ea
0004:00:01.1 Bridge: IBM Device 04ea
0005:00:00.0 Bridge: IBM Device 04ea
0005:00:00.1 Bridge: IBM Device 04ea
0005:00:01.0 Bridge: IBM Device 04ea
0005:00:01.1 Bridge: IBM Device 04ea
$ cat /sys/bus/pci/devices/0004:00:00.0/of_node/ibm,loc-code
GPU2
$ cat /sys/bus/pci/devices/0004:00:00.1/of_node/ibm,loc-code
GPU2
$ cat /sys/bus/pci/devices/0004:00:01.0/of_node/ibm,loc-code
GPU1
$ cat /sys/bus/pci/devices/0004:00:01.1/of_node/ibm,loc-code
GPU1
$ cat /sys/bus/pci/devices/0005:00:00.0/of_node/ibm,loc-code
GPU4
$ cat /sys/bus/pci/devices/0005:00:00.1/of_node/ibm,loc-code
GPU4
$ cat /sys/bus/pci/devices/0005:00:01.0/of_node/ibm,loc-code
GPU3
$ cat /sys/bus/pci/devices/0005:00:01.1/of_node/ibm,loc-code
GPU3
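To see at a glance which NVLink devices belong to which GPU, a small loop over sysfs can print each device together with its ibm,loc-code (a sketch using the device IDs above):
# Print every P100 (10de:15f9) and NVLink bridge (1014:04ea) with its loc-code (GPU1..GPU4)
for dev in /sys/bus/pci/devices/*; do
  id="$(cat $dev/vendor) $(cat $dev/device)"
  case "$id" in
    "0x10de 0x15f9"|"0x1014 0x04ea")
      echo "$(basename $dev)  $(cat $dev/of_node/ibm,loc-code)"
      ;;
  esac
done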
Benchmark of OpenStack-integrated VM
● Add NVLink devices (by hand)
~~~
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x8' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0004' bus='0x00' slot='0x01' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x9' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0004' bus='0x00' slot='0x01' function='0x1'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x9' function='0x1'/>
</hostdev>
~~~
instance-000000xx.xml
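One way to apply these hand edits (a sketch; note that Nova can overwrite them on hard reboot or rebuild):
$ virsh edit instance-000000xx        # paste in the extra <hostdev> entries
$ virsh destroy instance-000000xx     # stop the guest
$ virsh start instance-000000xx       # start it again with the NVLink devices attached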
Benchmark of OpenStack-integrated VM
● CPU-GPU Memory bandwidth benchmark results
with NVLink device added
Benchmark of OpenStack-integrated VM
● nbody benchmark results with NVLink devices added
(Chart: GFLOP/s for 1, 2 and 4 GPUs.)
How can we manage NVLink devices?
● OpenStack doesn't care about device connections
(Figure: all GPUs go into one 10de:15f9 pool and all NVLink devices into one 1014:04ea pool, so a request for P100:1,NVLink:2 can be satisfied with NVLink devices that belong to a different GPU.)
How can we manage NVLink devices?
● Ideally
(Figure: each GPU would be bundled with its own two NVLink devices into a device_set_p100 pool, so a request for device_set_p100:1 always delivers a GPU together with its matching NVLink devices.)
How can we manage NVLink devices?
● Our solution
○ Add simple script between libvirt and qemu
■ Rename qemu-system-ppc64 to qemu-system-ppc64.orig
■ Add the script as qemu-system-ppc64
(Figure: Nova requests a P100 via libvirt; the wrapper script adds the NVLink device parameters, and qemu launches the VM with the P100 and its NVLink devices.)
libvirt invokes:
qemu-system-ppc64 ... -device vfio-pci,host=0003:01:00.0,id=hostdev0,bus=pci.1.0,addr=0x1
the script executes:
qemu-system-ppc64.orig ... -device vfio-pci,host=0003:01:00.0,id=hostdev0,bus=pci.1.0,addr=0x1
-device vfio-pci,host=0004:00:00.0,bus=pci.1.0,addr=0x2,multifunction=on -device vfio-pci,host=0004:00:00.1,bus=pci.1.0,addr=0x2.0x1
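A minimal sketch of such a wrapper is shown below. It is not the exact script we used: the GPU-to-NVLink mapping is hard-coded from the ibm,loc-code listing of this host, and the free guest PCI slot numbers are an assumption.
#!/bin/bash
# Installed as qemu-system-ppc64; the real binary was renamed to qemu-system-ppc64.orig.
# Appends the NVLink devices belonging to each passed-through GPU, then execs the real qemu.

# GPU PCI address -> its two NVLink bridge devices (from ibm,loc-code on this host)
declare -A NVLINK=(
  ["0002:01:00.0"]="0004:00:01.0 0004:00:01.1"   # GPU1
  ["0003:01:00.0"]="0004:00:00.0 0004:00:00.1"   # GPU2
  ["000a:01:00.0"]="0005:00:01.0 0005:00:01.1"   # GPU3
  ["000b:01:00.0"]="0005:00:00.0 0005:00:00.1"   # GPU4
)

args=("$@")
extra=()
slot=2   # assumption: guest PCI addresses 0x2 and up on bus pci.1.0 are unused

for arg in "${args[@]}"; do
  case "$arg" in
    vfio-pci,host=*)                               # a passed-through PCI device
      gpu=${arg#vfio-pci,host=}
      gpu=${gpu%%,*}
      bridges=(${NVLINK[$gpu]})
      if [ "${#bridges[@]}" -eq 2 ]; then          # it is one of the P100s we know
        extra+=(-device "vfio-pci,host=${bridges[0]},bus=pci.1.0,addr=0x$slot,multifunction=on")
        extra+=(-device "vfio-pci,host=${bridges[1]},bus=pci.1.0,addr=0x$slot.0x1")
        slot=$((slot + 1))
      fi
      ;;
  esac
done

exec /usr/bin/qemu-system-ppc64.orig "${args[@]}" "${extra[@]}"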
Conclusion
● How can we boost performance further with POWER?
○ Memory interleave may be required to get maximum performance
○ Add POWER as a compute node into OpenStack
○ Specify the GPU and its NVLink devices to pass through to the VM
● POWER8 delivers better performance than x86 in some cases
○ It has a powerful NVLink CPU-GPU connection
● With OpenStack, some limitations exist
○ SMT is not available
○ NVLink requires extra device allocation, which OpenStack doesn't support yet
Another option
What about containers?
Another option
● How to manage containers and GPUs
Another option
● Kubernetes
○ schedules containers
○ can integrate with OpenStack
○ supports GPU scheduling
■ requirements
● NVIDIA drivers ~= 361.93
● Device Plugin feature
● NVIDIA device plugin for Kubernetes
● nvidia-docker
Another option
(Figure: software stack — NVIDIA GPU and NVIDIA driver at the bottom, nvidia-docker on top of them, then the NVIDIA device plugin for Kubernetes, enabled through the Device Plugin feature.)
Another option
● Device Plugin feature
○ Add a kubelet exec parameter (K8s version <= 1.9):
"--feature-gates=DevicePlugins=true"
■ Example: deployed by kubeadm
$ cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf | grep KUBELET_EXTRA_ARGS=
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
○ Device Plugins feature is Beta >= K8s version 1.10
■ Enabled by default
Note:
If you deploy K8s using kubeadm and the controller is x86, you have to do something like:
$ docker tag gcr.io/google_containers/kube-proxy-ppc64le:v1.9.2 gcr.io/google_containers/kube-proxy:v1.9.2
Another option
● NVIDIA device plugin for Kubernetes
○ https://github.com/NVIDIA/k8s-device-plugin
■ Build image for ppc64le
$ docker build . -t nvidia/k8s-device-plugin:1.9
Another option
● nvidia-docker (2.0)
○ supports NVLink devices
○ ppc64le packages are not available yet
○ nvidia-docker depends on the following packages
■ libnvidia-container
https://github.com/NVIDIA/libnvidia-container
■ nvidia-container-runtime
https://github.com/NVIDIA/nvidia-container-runtime
○ it can now be installed from the NVIDIA official repository
https://nvidia.github.io/nvidia-docker/
Another option
● Change the default runtime
○ $ cat /etc/docker/daemon.json
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
● Enable NVIDIA device plugin
○ $ kubectl create -f
https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
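The daemon.json referenced above typically looks like the following for nvidia-docker 2.0 (a sketch; the runtime path may differ on your install):
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}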
Another option
● Ensure GPU resource is available
○ $ kubectl describe node
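What to look for is the nvidia.com/gpu entry under Capacity and Allocatable; on this node with four P100s it should read roughly like this (illustrative output):
$ kubectl describe node <node-name> | grep nvidia.com/gpu
 nvidia.com/gpu:  4
 nvidia.com/gpu:  4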
Another option
● Ensure GPU resource is available
bandwidth-test.yml
$ kubectl apply -f bandwidth-test.yml
$ kubectl logs bwt-pod
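The manifest itself was only shown as a screenshot; a minimal sketch of what bandwidth-test.yml could contain is below (the container image name is an assumption — any ppc64le image with the CUDA bandwidthTest sample built in works):
$ cat bandwidth-test.yml
apiVersion: v1
kind: Pod
metadata:
  name: bwt-pod
spec:
  restartPolicy: Never
  containers:
  - name: bandwidth-test
    image: cuda-samples-ppc64le:9.1    # assumption: local image containing the CUDA samples
    command: ["./bandwidthTest"]
    resources:
      limits:
        nvidia.com/gpu: 1              # one GPU from the NVIDIA device plugin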
Another option
● CPU-GPU Memory bandwidth benchmark results
Thank you!
References
● OpenStack Docs: Attaching physical PCI devices to guests
○ https://docs.openstack.org/nova/pike/admin/pci-passthrough.html
● Device Plugins - Kubernetes
○ https://kubernetes.io/docs/concepts/cluster-administration/device-plugins/
● Feature Gates | Kubernetes
○ https://kubernetes.io/docs/reference/feature-gates/
● GitHub - NVIDIA/k8s-device-plugin
○ https://github.com/NVIDIA/k8s-device-plugin
● GitHub - NVIDIA/nvidia-docker
○ https://github.com/NVIDIA/nvidia-docker
