In this video from Arm HPC Asia 2019, Fu Li from Quantum Cloud presents: Scale out AI Training on Massive Core System from HPC to Fabric based SOC.
"The purpose of these workshops has been to bring together the leading Arm vendors, end users and open source development community to discuss the latest products, developments and open source software support in HPC on Arm."
Learn more: https://www.linaro.org/events/workshop/arm-hpc-asia-2019/
Scale-out AI Training on Massive Core System from HPC to Fabric-based SOC
1. Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC
Dr. Fu Li
li@qcftech.com
Quantum Cloud Future (Beijing) Technologies Co., Ltd.
2. Cook Book
1. What is a Massive Core System (MCS)?
1.1. HPC system
1.2. GPU system
1.3. Microslides: Fabric-based SoC
2. Why is scale-out computing important in MCS?
3. How to make MCS faster?
3.1. MPI and OpenMP in HPC
3.2. Memory coalescing and cudaDMA in GPU computing
4. QCF's scale-out computing model for Microslides
4.1. The hardware (Socionext)
4.2. The architecture
4.3. The result (Arm vs x86 vs GPU)
3. Introduction to Quantum Cloud
[Figure labels: Quantum Theory and Spectroscopy, Molecular Dynamics, Fast Fourier Transform, Statistical Mechanics, HPC, MPI, OpenMP, CUDA, GPU switch, PacketShader, Content-Centric Networking, Cloud Storage, Doppler ASIC, Boba FPGA.]
With a background in quantum calculation,
1) we perform large-scale molecular dynamics simulations on HPC clusters using Amber and GROMACS,
2) we optimize Fourier transforms and matrix operations on multicore systems.
4. Introduction to Quantum Cloud
Then we found that the GPU is a great tool for both molecular dynamics and matrix operations.
5. Introduction to Quantum Cloud
Later we found similar systems with massive numbers of CPU cores.
6. Introduction to Quantum Cloud
Today we will show some practical examples of our scale-out algorithm on these systems.
11. System and Cores: Communication Matters
[Figure (QCF & SOCIONEXT): systems plotted by number of cores (1 to 100,000) and system power consumption (10 W to 1 MW). Systems shown: PC, server, blade server, supercomputer, GPU, GPU cluster, traditional ARM server, ARM SoC, Microslides of ARM CPU, Microslides of ARM SoC. Group labels: general-purpose, special-purpose. Annotations: 2006, 2012, 2018; intra-CPU connection, inter-CPU connection, cluster connection.]
12. Data Communication Between Systems Is an Obstacle
[Figure: two systems side by side, each labeled with cores, L1 cache, L2/L3 cache, intra-CPU fabric, socket bus, memory, and networking, together with cache/storage and I/O.]
A hierarchical structure is critical for the von Neumann architecture.
16. Data Communication Between Systems Is an Obstacle
[Figure: two systems side by side, each labeled with cores, L1 cache, L2/L3 cache, intra-CPU fabric, socket bus, memory, and networking, together with cache/storage and I/O. Levels of parallelism marked on the diagram: instruction-level, OS-level, algorithm-level. Techniques listed: batch, share-nothing; stateless computing; big RAM; avoid context switching; TLB, cache-conscious; big.LITTLE; GPU, FPGA; fast cache, cache prefetch; vector processing, SIMD/AVX.]
Consolidation will be the next wave of innovation in chip design and system optimization:
• I/O consolidation: networking, bus, fabric
• Storage consolidation: memory, cache, networking buffer
20. Share-Nothing + Message Queue Architecture
Stateless computing architecture
[Figure: a host with several compute cores and a separate core connected to I/O.]
Use an "individual" core to do I/O for the host to increase throughput.
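The pattern on this slide can be sketched in a few lines. This is a minimal illustration, not QCF's implementation: Python threads stand in for cores, `queue.Queue` stands in for the message queue, and the names `io_loop`, `compute_loop`, and `run_pipeline` are hypothetical. One thread does all I/O for the host, while the compute workers share no state and receive everything they need in each message.

```python
import queue
import threading

STOP = None  # sentinel message telling a compute worker to exit


def io_loop(source, work_q, n_workers):
    """Dedicated I/O role: the only code that reads the input source."""
    for message in source:
        work_q.put(message)
    for _ in range(n_workers):
        work_q.put(STOP)  # one stop message per compute worker


def compute_loop(work_q, result_q, fn):
    """Stateless worker: each message carries all the data it needs."""
    while True:
        message = work_q.get()
        if message is STOP:
            break
        index, payload = message
        result_q.put((index, fn(payload)))


def run_pipeline(payloads, fn, n_workers=3):
    """Run fn over payloads with one I/O thread and n_workers compute threads."""
    work_q = queue.Queue(maxsize=16)  # bounded, so I/O cannot run far ahead
    result_q = queue.Queue()
    io_thread = threading.Thread(
        target=io_loop, args=(enumerate(payloads), work_q, n_workers)
    )
    workers = [
        threading.Thread(target=compute_loop, args=(work_q, result_q, fn))
        for _ in range(n_workers)
    ]
    io_thread.start()
    for worker in workers:
        worker.start()
    io_thread.join()
    for worker in workers:
        worker.join()
    # All threads have finished, so the result queue can be drained safely.
    results = [None] * len(payloads)
    while not result_q.empty():
        index, value = result_q.get()
        results[index] = value
    return results
```

For example, `run_pipeline([1, 2, 3, 4], lambda x: x * x)` returns the squares in input order. On real hardware the I/O role would be pinned to its own core and the queue would be a fabric or shared-memory channel; those details are outside this sketch.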
23. Example: Rendering on Arm
[Figure: bar chart comparing Intel and Arm SoC on three measures (performance, scaled 1, scaled 2); vertical axis from 0 to 30.]
scaled 1: performance scaled by frequency and core count
scaled 2: performance scaled by frequency, core count, and watts
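One plausible reading of these two definitions is a per-GHz, per-core normalization, followed by a further division by power draw. The slide does not give the exact formula, so the sketch below is an assumption, and the function names and sample numbers are hypothetical rather than measured results.

```python
def scaled_1(performance, freq_ghz, cores):
    """Performance per GHz per core (assumed meaning of 'scaled 1')."""
    return performance / (freq_ghz * cores)


def scaled_2(performance, freq_ghz, cores, watts):
    """'scaled 1' further divided by power in watts (assumed 'scaled 2')."""
    return scaled_1(performance, freq_ghz, cores) / watts


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
per_core_ghz = scaled_1(24.0, freq_ghz=2.0, cores=4)
per_core_ghz_watt = scaled_2(24.0, freq_ghz=2.0, cores=4, watts=1.5)
```

Normalizing this way lets a low-frequency, low-power SoC be compared with a server CPU on equal terms, which is the comparison the chart makes.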
24. Example: AI on Arm
Caffe@Container: ARM vs Intel vs GPU (scaled)
[Figure: bar chart of scaled results for Intel, ARM, and GPU 1070 on three runs labeled CIFAR 10 - 1, CIFAR 10 - 2, and CIFAR 10 - 3; vertical axis from 0 to 1.6.]