Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Problems
1. TAIPEI | SEP. 21-22, 2016
Tony W. H. Sheu, Neo Shih-Chao Kao,
Maxim Solovchuk, Cheng-Tao Wu,
Yu-Wei Chang
National Taiwan University
RECENT PROGRESS IN SCCS ON GPU SIMULATION
OF BIOMEDICAL AND HYDRODYNAMIC PROBLEMS
Acknowledgement: SCCS (Scientific Computing and Cardiovascular
Simulation) team working on GPU simulation; NVIDIA (輝達)
(Aug. 8, 2016)
2. 2
OBJECTIVE
Migration of in-house developed CPU codes* to Nvidia CUDA codes to experience the
power of GPU acceleration in simulating large-sized problems
* 1. 3D finite element code to simulate incompressible Navier-Stokes
equations
2. 3D finite difference code to simulate incompressible Navier-Stokes
equations
3. 3D finite difference code to simulate Maxwell’s equations
4. 3D finite difference code to simulate Westervelt equation for
ultrasound wave propagation
9/26/16
3. 3
CONTENT OF THE PRESENTATION
Cheng-Tao Wu (吳政道), CUDA programming on
Frontal matrix solver for accelerating finite element
calculation of incompressible Navier-Stokes
solutions
Yu-Wei Chang (張育維), GPU acceleration
of patient-specific airway image segmentation
Undergraduate students
4. 4
CONTENT OF THE PRESENTATION
Research scientists
Neo Shih-Chao Kao (高仕超), OpenAcc
acceleration of the three-dimensional
incompressible Navier-Stokes equations
Maxim Solovchuk, Acceleration of HIFU
(High Intensity Focused Ultrasound)
ablation of liver tumor on K80(*4) GPUs
6. GTC-Taipei ; Sep. 21, 2016
National Taiwan University
Department of Engineering Science and Ocean Engineering
Cheng-Tao Wu (吳政道)
CUDA PROGRAMMING ON FRONTAL MATRIX
SOLVER FOR ACCELERATING FINITE ELEMENT
CALCULATION OF INCOMPRESSIBLE NAVIER-
STOKES EQUATIONS
8. 8
MOTIVATION AND OBJECTIVE
Finite Element Method
The Finite Element Method (FEM) is a global integration method, rendering minimum
energy in the entire physical space. A large matrix equation, whose size is set by the total
number of unknowns, must be dealt with.
GPU is an excellent choice for accomplishing the computationally intensive tasks in FEM
calculation of solutions.
The finite element matrix equation results from assembling all local element matrix
equations, which share the same weak formulation and are derived from the same integral
equations.
1. GPU is an excellent choice for parallelization within a framework
containing many core processors.
2. GPU is an excellent choice for storing the tremendous number of individual element
matrix equations in blocks of shared memory.
9. 9
MOTIVATION AND OBJECTIVE
Finite Element Method
Data structure is a key to the success of
parallelization:
1. Element numbering
2. Global nodal numbering
3. Local nodal numbering
10. 10
MOTIVATION AND OBJECTIVE
Finite Element Method
One element of the current incompressible
Navier-Stokes finite element
formulation contains 22 unknowns:
• 9 nodes with u, v velocity components (18 unknowns)
• 4 nodes with pressure p (4 unknowns)
Each element involves a 22x22 matrix
equation.
Two adjacent elements involve a 36x36 matrix:
22*2 (elements) - 3*2 (shared u, v velocity nodes) -
2 (shared pressure nodes) = 36 unknowns
11. 11
MOTIVATION AND OBJECTIVE
Finite Element Method
Elements      1        100          400          900          1600           2500
Matrix size   22x22    1003x1003    3803x3803    8403x8403    14803x14803    23003x23003
12. 12
MOTIVATION AND OBJECTIVE
Solution method
There are two kinds of matrix solvers.
Iterative solver:
Pro: less memory- and compute-intensive
Con: no theory is available to guarantee that a convergent solution can be computed
13. 13
MOTIVATION AND OBJECTIVE
Solution Solver
Direct solver:
Based on the underlying Gaussian elimination method
Pro: a solution can be computed for any non-ill-conditioned matrix equation
Con: very memory- and compute-intensive
For the sake of parallelization, the element-by-element frontal solver is chosen
15. 15
MOTIVATION AND OBJECTIVE
Evolution of computer chips
5/2016: GTX 1080, 9 TFlop/s (SP), $699, 180 W
11/2001: #1 supercomputer, 7.2 TFlop/s (DP), $110 million, 3 MW
Preliminary conclusion: to perform HPC tasks, the cost-effective GPU turns out to be a smart choice
16. 16
MOTIVATION AND OBJECTIVE
Evolution of computer chips
June 2015: Nvidia GPU accelerator systems share 54%
June 2016: Nvidia GPU accelerator systems share 67%
17. 17
THEREFORE, MIGRATING THE ORIGINAL
CPU CODE TO NVIDIA CUDA CODE CAN
BRING A TREMENDOUS BENEFIT.
21. 21
COMPUTING SYSTEM
Computing Aspect
• Each processing unit has private memory.
• Data can be synchronized by setting a barrier, after which the memory is shared.
• Memory will only be updated after finishing the execution or encountering a data conflict.
22. 22
COMPUTING SYSTEM
Communication concern
We can assume CPU as a manager, and
GPU as his/her employees.
To fully utilize GPU, one should reduce
the amount of communications
between CPU and GPU.
26. 26
CUDA STREAM
A CUDA stream is a work queue on the GPU. Operations in different streams may
overlap.
The GPU scheduler manages kernels automatically; programmers need not specify this
when executing a stream.
After the CPU places a request in a stream, it can keep operating until the CUDA streams
need to be synchronized.
27. 27
CUDA STREAM
[Timeline figure: without CUDA streams, Kernels 1-4 execute one after another;
with CUDA streams, kernels placed in different streams overlap in time.]
28. 28
CUDA STREAM
cudaStream_t stream[4];
#pragma omp parallel for
for (i = 0; i < 4; i++) {
    cudaStreamCreate(&stream[i]);
    cu_Func<<<blocks, threads, 0, stream[i]>>>();
    // CPU task
    cudaStreamSynchronize(stream[i]);
    cudaStreamDestroy(stream[i]);
}
[Figure: four CPU (OpenMP) threads, CPU 1-4, each driving its own CUDA stream, Stream 1-4.]
30. 30
COMPUTING RESULTS
Lid-driven cavity flow problem
[*] U. Ghia, K. N. Ghia, C. T. Shin, High-Re solutions for incompressible flow using the
Navier-Stokes equations and a multigrid method, Journal of Computational Physics, Vol. 48, pp. 387-411, 1982.
35. 35
FUTURE WORK
In the future, a multi-frontal direct solver will be integrated into the finite element
flow code in place of the frontal solver, providing a better-parallelized algorithm and
reducing the computing time.
Our aim in the near future is to solve the incompressible
Navier-Stokes equations in a domain containing 2560*2560*2560 nodal
points.
38. GTC-Taipei ; Sep. 21, 2016
Neo Shih-Chao Kao (高仕超)
OPENACC ACCELERATION OF THE CALCULATION
OF THREE-DIMENSIONAL INCOMPRESSIBLE NAVIER-
STOKES EQUATIONS
Acknowledgement :
Department of Engineering Science and Ocean Engineering,
National Taiwan University
Scientific Computing and Cardiovascular Simulation laboratory (SCCS),
National Taiwan University
NVIDIA (輝達)
39. 39
AGENDA
1. Why is GPU needed?
2. How is GPU used?
3. What does GPU help me with?
4. Concluding remarks
40. 40
WHY GPU IS NEEDED ?
Computational Fluid Dynamics (CFD): incompressible flow equations
Two major tasks:
1. Discretization scheme. Objective: to derive a finite difference model
rendering minimized phase error in the convection terms
2. High performance computing. Objective: to obtain a convergent solution
FASTER (3D problem, < 8 hours!)
http://homepage.ntu.edu.tw/~twhsheu/index.htm
41. 41
• The non-dimensional three-dimensional incompressible Navier-Stokes equations

∂u/∂t + (u·∇)u = -∇p + (1/Re)∇²u + f
∇·u = 0

where u = {u, v, w} denotes the velocity vector, p the pressure field, Re the
Reynolds number and f the force term.
• Finite difference method (FDM)
• Features of CPU code:
  • Compiler: PGI workstation v13.10
  • Column-major ordering (Fortran)
• The fractional-step algorithm of Kim and Moin* is adopted
* J. Kim, P. Moin, Application of a fractional-step method to incompressible Navier-Stokes equations, Journal of
Computational Physics, Vol. 59, pp. 308-323, 1985.
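Schematically, the fractional-step (projection) idea advances an intermediate velocity and then projects it onto the divergence-free space. The form below is a sketch only: the actual Kim-Moin scheme treats convection with Adams-Bashforth and diffusion with Crank-Nicolson, details omitted here; φ denotes the pressure-like projection variable.

```latex
\hat{u} = u^{n} + \Delta t \left[ -(u\cdot\nabla)u + \tfrac{1}{Re}\nabla^{2}u + f \right]^{n},
\qquad
\nabla^{2}\phi = \frac{\nabla\cdot\hat{u}}{\Delta t},
\qquad
u^{n+1} = \hat{u} - \Delta t\,\nabla\phi .
```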
42. 42
WHY GPU IS NEEDED ?
Schematic of problem
• 3D benchmark flow problem: lid-driven cavity flow (空穴流)
• Computational setting
  - Uniform mesh sizes: h = 1/96, 1/128, 1/150
  - Reynolds numbers: Re = 400, 1000
• Solution resolution requirement: fine grid distribution (h << 1)
43. 43
INEFFECTIVE COMPUTING (CPU+OPENMP)
• OpenMP (8 threads, Intel i7-4820K); time-consuming tasks:

Mesh length   Re = 400       Re = 1000
1/96          15250.4 (s)    24007.7 (s)
1/128         37689.7 (s)    116439.0 (s)
1/150         196114.2 (s)   400228.2 (s)  (about 4.6 days)

• The applicability of the proposed CPU code to predicting high-Re incompressible flow is
confirmed by comparing the velocity profiles u(x,0.5,z) and w(0.5,y,z) and the streamlines
at Re = 1000 with H. Ding et al., Comput. Methods Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006.
46. 46
WHY GPU IS NEEDED ?
• GPU programming:
  • Before 2007: OpenGL
  • 2007: CUDA
  • 2011: OpenAcc
CPU architecture:
  • Multi-core structure
  • Sophisticated control logic unit
  • Large cache to reduce access latencies
GPU architecture:
  • Many-core structure
  • Minimized control logic unit
  • Large number of threads
  • High peak performance / memory bandwidth
(ALU: arithmetic logic unit)
Acknowledgement: CUDA Programming Guide
48. 48
OPENACC
• It was developed by Nvidia, PGI, Cray and CAPS
• Similar to the OpenMP programming model
• Directives are added to the serial source code to
  ✓ Manage loop parallelization
  ✓ Manage data copies between CPU and GPU
• The existing original source code (C/C++/Fortran) is reused
• Ideally, no modification of the original code is necessary
OpenAcc API
49. 49
EXAMPLE
C = A + B
Problem code_GPU_Acc
  ...
  ! Data copy CPU --> GPU
  ...
  !$acc parallel
  do i = 1, N
     C(i) = A(i) + B(i)
  end do
  !$acc end parallel
  ...
  ! Data copy GPU --> CPU
  ...
end program
OpenAcc
Problem code_CPU
  ...
  do i = 1, N
     C(i) = A(i) + B(i)
  end do
  ...
end program
CPU
module cuda_lib
  use cudafor
contains
  attributes(global) subroutine add(C, A, B, N)
    integer :: i
    integer, value :: N
    real(kind=8) :: A(N), B(N), C(N)
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i <= N) then
       C(i) = A(i) + B(i)
    end if
  end subroutine
end module

Problem code_CUDA_Fortran
  use cuda_lib
  ...
  call add<<<NB, NT>>>(C, A, B, N)
  ...
end program
CUDA Fortran
50. 50
HOW GPU IS USED ?
CUDA model vs. OpenAcc model (execution hierarchy):
  Thread  <->  Vector
  Warp    <->  Worker
  Block   <->  Gang
  Grid    <->  Parallel region
[Figure: threads grouped into warps within the blocks of a grid; vectors grouped
into workers within the gangs of a parallel region.]
51. 51
HOW GPU IS USED ?
• AOS
• Four degrees of freedom (u, v, w, p) for each node; N nodes
• Array of Structs (AOS) layout in GPU memory (global):
  [U V W P | U V W P | ... | U V W P]  (node 1, node 2, ..., node N)
• Access to any single field (e.g. all U values) is non-contiguous
• The performance deteriorates owing to this ineffective access
52. 52
HOW GPU IS USED ?
• Struct of Arrays (SOA) layout in GPU memory (global):
  [U1 U2 ... UN | V1 V2 ... VN | W1 W2 ... WN | P1 P2 ... PN]
• Access to each field is contiguous
• The SOA data format is effective for SIMD hardware (GPU)
• The data must be reordered to follow the SOA format
57. 57
CONCLUDING REMARKS
1. We have successfully ported, tested and benchmarked a complete
3D finite difference code using OpenAcc.
2. The code is portable across different GPU architectures.
3. Using OpenAcc, the original source code can remain almost unchanged.
4. A large amount of computing time was saved by executing the
computational task on the GPU architecture.
Acknowledgement :
Computer Center in National Taiwan University
63. 63
INTRODUCTION
Desirable goals to achieve for practical application to patient
Applying machine learning to radiotherapy planning for head and neck cancer
The lengthy procedure may be reduced to 1/4 of the time
Source: https://deepmind.com/health
30th August 2016
65. 65
INTRODUCTION
With an Intel® Xeon® Processor E5-2620, the segmentation block takes 85% of the time

Step             Time (s)
Preprocessing    145.6
Segmentation     1424.3
Representation   106.9
66. 66
MOTIVATION AND OBJECTIVE
Amdahl's law implies that the percentage of the code that benefits from
parallelization is important

S = 1 / ((1 - p) + p/i) = 1 / ((1 - 0.85) + 0.85/i) ≤ 6.7

S: speedup
p: the percentage of the execution time that benefits from parallelization
i: the speedup in latency of p

p                Maximum speedup
0.85 (current)   6.7
0.95             20
0.99             100
67. 67
MOTIVATION AND OBJECTIVE
In a cloud computing environment, time is money
Type                            CPU          GPU
Card                            Intel Xeon   NVIDIA Kepler
Cost /core /hour                0.03         0.4
Cost for 1000 patients (USD)    14           To be announced
Time for 1000 patients (hour)   465          To be announced
68. 68
HARDWARE ENVIRONMENT
GPU has a 30 times better FLOPs performance
Card                             Nvidia Tesla K40c*   Nvidia Tesla K20c*   Intel Xeon E5-2630**
Cores                            2880                 2496                 6
Peak single precision (TFlops)   4.29                 3.52                 0.134

* http://www.nvidia.com/object/tesla-workstations.html
** http://ark.intel.com/products/64593/Intel-Xeon-Processor-E5-2630-15M-Cache-2_30-GHz-7_20-GTs-Intel-QPI
69. 69
HARDWARE ENVIRONMENT
GPU does a great job using only global memory
Processor           Parallelization language   Time usage (sec)   Speedup gain   Note
Intel Xeon          NA                         158.08             --             Sequential code
Intel Xeon          OpenMP                     14.335             1              Double-checked locking
Nvidia Tesla K20c   CUDA                       3                  4.8            Global memory only
71. 71
HARDWARE ENVIRONMENT
Block and grid structure affects the usage of shared memory

Block size               32       128     256     512
Time on Tesla K40c (s)   3714.5   335.5   388.1   474.9

Tune the block and thread number to optimize the performance.
Let each thread do less work.
72. 72
HARDWARE ENVIRONMENT
The performance benefits from the usage of shared and texture memory

Memory usage on Tesla K20c   Time usage (sec)   Speedup gain
Global memory only           3                  1
Shared and global memory     0.471              6.36
Texture and global memory    0.321              9.34

Despite its faster performance, texture memory yields lower accuracy.
Since accuracy is of great importance in computational science,
shared memory is preferable.
77. 77
CONCLUDING REMARKS
Memory, thread and block settings are important

Platform      Time usage (sec)   Speedup gain
CPU           14.335             1
CPU + 1 GPU   0.335              43
CPU + 2 GPU   0.232              61.8

Tune the block and thread number to optimize the performance.
Let each thread do less work.
Despite its faster performance, texture memory yields lower accuracy.
Since accuracy is of great importance in computational science,
shared memory is preferable.
78. 78
CONCLUDING REMARKS
Amdahl's law implies that the percentage of the code that benefits from
parallelization is important

S = 1 / ((1 - p) + p/i) = 1 / ((1 - 0.85) + 0.85/61.8) = 6.1

p                Maximum speedup
0.85 (current)   6.7
0.95             20
0.99             100
79. 79
CONCLUDING REMARKS
In a cloud computing environment, time is money
Type                            CPU          2 GPU + CPU
Card                            Intel Xeon   NVIDIA Kepler + Intel Xeon
Cost /core /hour                0.03         0.4 / 0.03
Cost for 1000 patients (USD)    14           7
Time for 1000 patients (hour)   465          77
80. 80
FUTURE WORK
Increase the resolution of CT from 64-slice to 256-slice
Use deep learning to classify tumor type from CT images and accelerate the whole process
81. 81
REFERENCES
[1] D. Babin et al., Segmentation of airways in lungs using projections in 3-D CT
angiography images, 2010 Annual International Conference of the IEEE Engineering in Medicine
and Biology Society, 2010.
[2] T. Arti, R. Priya, R. Amit Ujjlayan, A performance study of image segmentation
techniques, 2015 4th International Conference on Reliability, Infocom Technologies and
Optimization (ICRITO) (Trends and Future Directions), 2015.
Acknowledgement:
Prof. Tony W. H. Sheu (許文翰)
Prof. Herng-Hua Chang (張恆華)
Neo Shih-Chao Kao (高仕超), who provided the chest CT images
NVIDIA (輝達)
Computer Center, National Taiwan University