TAIPEI | SEP. 21-22, 2016
Tony W. H. Sheu, Neo Shih-Chao Kao,
Maxim Solovchuk, Cheng-Tao Wu,
Yu-Wei Chang
National Taiwan University
RECENT PROGRESS IN SCCS ON GPU SIMULATION
OF BIOMEDICAL AND HYDRODYNAMIC PROBLEMS
Acknowledgement: SCCS (Scientific Computing and Cardiovascular
Simulation) team working on GPU simulation (NVIDIA)
(Aug. 8, 2016)
2
OBJECTIVE
Migrate in-house CPU codes* to NVIDIA CUDA to exploit GPU acceleration when
simulating large problems
* 1. 3D finite element code for the incompressible Navier-Stokes equations
2. 3D finite difference code for the incompressible Navier-Stokes equations
3. 3D finite difference code for Maxwell’s equations
4. 3D finite difference code for the Westervelt equation for
ultrasound wave propagation
9/26/16
3
CONTENT OF THE PRESENTATION
Undergraduate students
Cheng-Tao Wu (吳政道), CUDA programming on frontal matrix solver for accelerating
finite element calculation of incompressible Navier-Stokes solutions
Yu-Wei Chang (張育維), GPU acceleration of patient-specific airway image segmentation
4
CONTENT OF THE PRESENTATION
Research scientists
Neo Shih-Chao Kao (高仕超), OpenACC acceleration of the three-dimensional
incompressible Navier-Stokes equations
Maxim Solovchuk, Acceleration of HIFU (High Intensity Focused Ultrasound)
ablation of liver tumor on four K80 GPUs
5
GTC-Taipei ; Sep. 21, 2016
National Taiwan University
Department of Engineering Science and Ocean Engineering
Cheng-Tao Wu (吳政道)
CUDA PROGRAMMING ON FRONTAL MATRIX
SOLVER FOR ACCELERATING FINITE ELEMENT
CALCULATION OF INCOMPRESSIBLE NAVIER-
STOKES EQUATIONS
7
AGENDA
Motivation and Objective
CPU-GPU computing environment
One important CUDA API feature - CUDA stream
Computational results
Future work
8
MOTIVATION AND OBJECTIVE
Finite Element Method
The finite element method (FEM) is a global integration method that minimizes
energy over the entire physical domain. It leads to a large matrix equation whose
size equals the total number of unknowns.
The GPU is an excellent choice for the computationally intensive tasks in FEM
calculations.
The finite element matrix equation results from assembling all local element matrix
equations, which share the same weak formulation and are derived from the same
integral equations.
1. The GPU parallelizes well because it contains many cores.
2. The GPU can store a very large number of individual element matrix
equations in blocks of shared memory.
9
MOTIVATION AND OBJECTIVE
Finite Element Method
The data structure is key to successful parallelization
1. Element numbering
2. Global nodal numbering
3. Local nodal numbering
10
MOTIVATION AND OBJECTIVE
Finite Element Method
In the current incompressible Navier-Stokes finite element formulation,
one element contains 22 unknowns:
• 9 u and 9 v velocity components
• 4 p pressure components
Each element involves a 22x22 matrix equation.
Two elements involve a 36x36 matrix:
22*2 (elements) - 3*2 (shared u, v velocity) - 2 (shared pressure) = 36 unknowns
11
MOTIVATION AND OBJECTIVE
Finite Element Method
Elements 1 100 400
Matrix size 22x22 1003x1003 3803x3803
Elements 900 1600 2500
Matrix size 8403x8403 14803x14803 23003x23003
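The matrix sizes in the table can be reproduced with a short count, sketched here in Python. It assumes a square mesh of n by n quadrilateral elements (an assumption that matches every entry in the table), with u and v stored at the 9 nodes of each element and p at its 4 corner nodes. The function name `unknowns` is hypothetical.

```python
def unknowns(nx: int, ny: int) -> int:
    """Count unknowns on an nx-by-ny mesh of quadrilateral elements.

    Each element has 9 velocity nodes (u and v at each) and 4 corner
    pressure nodes, so a single element carries 9 + 9 + 4 = 22 unknowns.
    Nodes shared between neighbouring elements are counted once.
    """
    velocity_nodes = (2 * nx + 1) * (2 * ny + 1)  # corners, mid-sides, centres
    pressure_nodes = (nx + 1) * (ny + 1)          # corners only
    return 2 * velocity_nodes + pressure_nodes


# Assumed square meshes: 1x1, 10x10, ..., 50x50 elements.
for n in (1, 10, 20, 30, 40, 50):
    size = unknowns(n, n)
    print(f"{n * n:5d} elements -> {size}x{size} matrix")
```

The count grows roughly as 9 unknowns per element, which is why the matrix equation quickly becomes the dominant cost.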
12
MOTIVATION AND OBJECTIVE
Solution method
There are two kinds of matrix solvers.
Iterative solver:
Pro: lower memory and computing demands
Con: no theory guarantees that a convergent solution can be computed
13
MOTIVATION AND OBJECTIVE
Solution Solver
Direct solver:
Based on Gaussian elimination
Pro: a solution can be computed for any matrix equation that is not
ill-conditioned
Con: very high memory and computing demands
For the sake of parallelization, the element-by-element
frontal solver is chosen
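As a reminder of what a direct solver does, here is a minimal dense Gaussian elimination in Python. It is a sketch only and not the frontal solver used in the talk: a frontal solver interleaves assembly and elimination element by element, whereas this example works on a fully assembled matrix. Partial pivoting and the function name `gaussian_solve` are additions made for the illustration.

```python
def gaussian_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]

    # Forward elimination: reduce the matrix to upper-triangular form.
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]

    # Backward substitution: solve for the unknowns from the last row up.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


# Small example system: 2x + y = 3, x + 3y = 5.
print(gaussian_solve([[2, 1], [1, 3]], [3, 5]))
```

The triple loop in the elimination phase is the part whose cost grows fastest with matrix size, which is what makes direct solvers attractive targets for GPU acceleration.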
14
MOTIVATION AND OBJECTIVE
Frontal Solver
Interim conclusion: an efficient matrix solver is essential in finite element flow calculation
15
MOTIVATION AND OBJECTIVE
Evolution of computer chips
               5/2016 GTX 1080    11/2001 #1
Performance    9 TFlop/s (SP)     7.2 TFlop/s (DP)
Price          $699               $110 million
Power          180W               3MW
Interim conclusion: for HPC tasks, a cost-effective GPU is a smart choice
16
MOTIVATION AND OBJECTIVE
Evolution of computer chips
June 2015: Nvidia GPU accelerator systems share 54%
June 2016: Nvidia GPU accelerator systems share 67%
17
THEREFORE, MIGRATING THE ORIGINAL
CPU CODE TO NVIDIA CUDA CAN
BRING A TREMENDOUS BENEFIT.
18
CPU-GPU
COMPUTATIONAL SYSTEM
19
COMPUTING SYSTEM
                    CPU                  GPU
Name                Intel Core i7 930    Nvidia K20c
Architecture        Bloomfield           Kepler
Number of cores     4 cores              2496 CUDA cores, 13 SMs
Memory bandwidth    25.6 GB/s            208 GB/s
DP Flops/s          ~100 GFlops          1170 GFlops
20
COMPUTING SYSTEM
Computing Aspect
One Thread
One Block
One Grid
21
COMPUTING SYSTEM
Computing Aspect
One thread: has private memory
One block: threads can be synchronised at a barrier and then share data
through shared memory
One grid: memory is updated only after execution finishes or when a data
conflict is encountered
22
COMPUTING SYSTEM
Communication concern
Think of the CPU as a manager and the
GPU as its employees.
To fully utilize the GPU, reduce
the amount of communication
between CPU and GPU.
23
COMPUTING SYSTEM
Communication concern
24
COMPUTING SYSTEM
Nvidia Kepler Architecture
The K20 has 13 streaming multiprocessors (SMXs) and a
scheduler (GigaThread).
[Figure: Kepler block diagram with the GigaThread scheduler and the SMs labelled]
25
ONE IMPORTANT CUDA API FEATURE
- CUDA STREAM
26
CUDA STREAM
A CUDA stream is a work queue on the GPU. Operations in different streams may
overlap.
The GPU scheduler manages kernels automatically; programmers need not schedule
them by hand when executing a stream.
After the CPU places a request in a stream, it can keep working until the CUDA
streams need to be synchronized.
27
CUDA STREAM
[Figure: timeline of Kernels 1 to 4 along the time axis. Without CUDA streams the
kernels run one after another; with CUDA streams kernels in different streams overlap.]
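The benefit shown in the timeline can be estimated with a deliberately idealised model, sketched below in Python. It assumes that kernels in different streams overlap perfectly and never compete for GPU resources, which real hardware does not guarantee. The kernel durations and both function names are hypothetical.

```python
def serial_time(durations):
    """Total time when kernels run one after another in a single queue."""
    return sum(durations)


def streamed_time(durations, n_streams):
    """Idealised total time when kernels are dealt round-robin to streams.

    Assumes streams overlap perfectly and never compete for GPU resources.
    """
    busy = [0.0] * n_streams
    for k, d in enumerate(durations):
        busy[k % n_streams] += d   # kernels within one stream still run in order
    return max(busy)


kernels = [3.0, 2.0, 2.0, 1.0]     # hypothetical kernel durations
print(serial_time(kernels), streamed_time(kernels, 2))
```

In this model the total time with streams is set by the busiest stream, so balancing the work across streams matters as much as the number of streams.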
28
CUDA STREAM
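cudaStream_t stream[4];
// One OpenMP thread per stream: each CPU thread drives its own stream.
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
    cudaStreamCreate(&stream[i]);
    // Asynchronous launch in stream i (blocks and threads are set elsewhere).
    cu_Func<<<blocks, threads, 0, stream[i]>>>();
    // CPU task: runs while the kernel executes
    cudaStreamSynchronize(stream[i]);   // wait for stream i only
    cudaStreamDestroy(stream[i]);
}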
[Figure: CPU threads 1 to 4, each paired with its own stream (Stream 1 to Stream 4)]
29
COMPUTING RESULTS
30
COMPUTING RESULTS
Lid-driven cavity flow problem
[*] U. Ghia, K. N. Ghia, and C. T. Shin, High-Re solutions for incompressible flow using the
Navier-Stokes equations and a multigrid method, Journal of Computational Physics, Vol. 48, pp. 387-411, 1982.
31
COMPUTING RESULTS
Lid-driven cavity flow problem
32
COMPUTING RESULTS
Lid-driven cavity flow problem
33
COMPUTING RESULTS
Improvement
[Figure: bar chart of execution time against number of elements (100, 400, 900) for C,
CUDA, and CUDA with Stream; speedup labels 3.6x and 3.9x]
34
COMPUTING RESULTS
Improvement
[Figure: bar chart of CPU% (0 to 100) for the Prefrontal, Assembly, Forward Elimination,
and Backward Substitution stages, comparing C and CUDA]
35
FUTURE WORK
In the future, a multi-frontal direct solver will replace the frontal solver in the
finite element flow code, providing a better parallelized algorithm and reducing
the computing time.
Our aim in the near future is to solve, on the NTU campus, the incompressible
Navier-Stokes equations on a mesh of 2560*2560*2560 nodal
points.
April 4-7, 2016 | Silicon Valley
THANK YOU
37
GTC-Taipei ; Sep. 21, 2016
Neo Shih-Chao Kao (高仕超)
OPENACC ACCELERATION OF THE CALCULATION
OF THREE-DIMENSIONAL INCOMPRESSIBLE NAVIER-STOKES
EQUATIONS
Acknowledgement :
Department of Engineering Science and Ocean Engineering,
National Taiwan University
Scientific Computing and Cardiovascular Simulation laboratory (SCCS),
National Taiwan University
(NVIDIA)
39
AGENDA
1. Why GPU is needed ?
2. How GPU is used ?
3. What GPU helps me ?
4. Concluding remarks
40
WHY GPU IS NEEDED ?
Computational Fluid Dynamics (CFD) (incompressible flow equations): two major tasks
Discretization scheme
Objective: to derive a finite difference model with minimized phase error in the
convection terms
High performance computing
Objective: to obtain a convergent solution FASTER (3D problem), in < 8 hours!
http://homepage.ntu.edu.tw/~twhsheu/index.htm
41
• The non-dimensional three-dimensional incompressible Navier-Stokes equations
$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\nabla p + \frac{1}{Re}\nabla^{2}\mathbf{u} + \mathbf{f}, \qquad \nabla\cdot\mathbf{u} = 0$$
where u = {u, v, w} denotes the velocity vector, p the pressure field, Re the
Reynolds number and f the force term.
• Finite difference method (FDM)
• The fractional-step algorithm of Kim and Moin* is adopted
• Features of the CPU code:
  • Compiler: PGI Workstation v13.10
  • Column-major ordering (Fortran)
*J. Kim, P. Moin, Application of a Fractional-Step method to incompressible Navier-Stokes equations, Journal of
Computational Physics, Vol. 59, pp. 308-323, 1985.
42
WHY GPU IS NEEDED ?
• 3D benchmark flow problem (cavity flow)
• Solution resolution requirement
  - Fine grid distribution (h << 1)
• Computational setting
  - Uniform mesh sizes: h = 1/96, 1/128, 1/150
  - Reynolds numbers: Re = 400, 1000
[Figure: schematic of the problem]
43
INEFFECTIVE COMPUTING (CPU+OPENMP)
• OpenMP (8 threads) on Intel i7-4820K
• Time-consuming tasks
Mesh length    Re = 400        Re = 1000
1/96           15250.4 (s)     24007.7 (s)
1/128          37689.7 (s)     116439.0 (s)
1/150          196114.2 (s)    400228.2 (s), i.e. 4.6 days
• The applicability of the proposed CPU code to predict high-Re
incompressible flow is confirmed
[Figure: comparison of velocity profiles u(x,0.5,z) and w(0.5,y,z) with H. Ding et al.,
Comput. Methods Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006; streamlines at Re = 1000]
44
THIS IS WHY GPU IS NEEDED!!
45
GPU (GRAPHIC PROCESSING UNIT)
Movie: Deadpool (Quadro M6000)
https://blogs.nvidia.com.tw/2016/02/deadpool-movie/
PC game: GTA5 (GeForce GTX)
http://www.geforce.com.tw/whats-new/articles/grand-theft-auto-v-nvidia-gameworks-and-technology
46
WHY GPU IS NEEDED ?
• GPU programming:
  • Before 2007: OpenGL
  • 2007: CUDA
  • 2011: OpenACC
CPU architecture
• Multi-core structure
• Sophisticated control logic unit
• Large cache to reduce access latencies
GPU architecture
• Many-core structure
• Minimized control logic unit
• Large number of threads
• High peak performance / memory bandwidth
ALU: arithmetic logic unit
Acknowledgement: CUDA programming guide
47
WHY GPU IS NEEDED ?
Heterogeneous CPU/GPU computing platform
[Figure: the tasks (Task 1, Task 2, ……) are split both when programming and when running.
Non-computing-intensive tasks go into the CPU code and run on the CPU (Intel i7-4820K);
computing-intensive tasks go into the GPU code and run on the GPU.]
48
OPENACC
• Developed by Nvidia, PGI, Cray and CAPS
• Programming model similar to OpenMP
• Directives are added to the serial source code to
  - manage loop parallelization
  - manage data copies between CPU and GPU
• The existing source code (C/C++/Fortran) is reused
• Ideally, no modification of the original code is necessary
OpenACC API
49
EXAMPLE
Vector addition C = A + B in three versions

CPU
Program code_CPU
…
do i = 1 , N
  C(i) = A(i) + B(i)
end do
…
end program

OpenACC
Program code_GPU_Acc
…
Data copy CPU --> GPU
…
! The loop directive shares the iterations among the GPU threads
!$acc parallel loop
do i = 1 , N
  C(i) = A(i) + B(i)
end do
!$acc end parallel loop
…
Data copy GPU --> CPU
…
end program

CUDA Fortran
Module cuda_lib
use cudafor
Contains
! Kernel: each thread adds one element
Attributes(global) subroutine add(C,A,B,N)
integer :: i
integer , value :: N
real(kind=8) :: A(N), B(N), C(N)
! Global 1-based index of this thread
i = (blockidx%x-1)*blockdim%x+threadidx%x
! Threads beyond the end of the arrays do nothing
if ( i <= N ) then
  C(i) = A(i) + B(i)
end if
end subroutine
end module
Program code_CUDA_Fortran
use cuda_lib
…
! NB blocks of NT threads; C, A, B must be device arrays
Call add<<<NB,NT>>>(C,A,B,N)
…
end program
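The thread-to-element mapping used by the CUDA Fortran kernel can be checked on the CPU. The Python sketch below emulates a launch with a plain double loop over blocks and threads; it is an illustration of the indexing only, not GPU code, and the names `add_kernel` and `launch` and the block size of 4 are assumptions. With 1-based indices the last element is only reached when the guard includes i = n.

```python
import math


def add_kernel(block_idx, thread_idx, block_dim, a, b, c, n):
    """Body executed by one thread; indices are 1-based as in Fortran."""
    i = (block_idx - 1) * block_dim + thread_idx
    if i <= n:                      # threads past the end do nothing
        c[i - 1] = a[i - 1] + b[i - 1]


def launch(a, b, threads_per_block=4):
    """Emulate a <<<NB, NT>>> launch with a plain double loop on the CPU."""
    n = len(a)
    c = [0.0] * n
    n_blocks = math.ceil(n / threads_per_block)   # enough blocks to cover n
    for block_idx in range(1, n_blocks + 1):
        for thread_idx in range(1, threads_per_block + 1):
            add_kernel(block_idx, thread_idx, threads_per_block, a, b, c, n)
    return c


print(launch([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
```

Here 5 elements with 4 threads per block need 2 blocks, so 3 of the 8 emulated threads fall outside the arrays and are filtered out by the guard.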
50
HOW GPU IS USED ?
[Figure: CUDA model versus OpenACC model.
CUDA model: a grid contains blocks, each block contains warps, each warp contains threads.
OpenACC model: a parallel region contains gangs, each gang contains workers, each worker
contains vectors.]
51
HOW GPU IS USED ?
• AOS
• Four degrees of freedom (u, v, w, p) for each node
• N nodes
GPU memory (global), Array of Structs (AOS), non-contiguous access:
U1 V1 W1 P1 | U2 V2 W2 P2 | …… | UN VN WN PN
• Performance deteriorates owing to ineffective memory access
52
HOW GPU IS USED ?
• The data must be reordered into the SOA format given below
GPU memory (global), Struct of Arrays (SOA), contiguous access within each array:
U1 U2 … UN | V1 V2 … VN | W1 W2 … WN | P1 P2 … PN
• The SOA data format is effective for SIMD hardware (GPU)
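The difference between the two layouts comes down to the memory offset of a given field at a given node. The Python sketch below computes these offsets for both layouts; the function names and the 0-based node numbering are assumptions made for the illustration.

```python
FIELDS = ("u", "v", "w", "p")


def aos_index(node, field, n_nodes):
    """Offset in an array of structs: u, v, w, p stored together per node.

    n_nodes is unused here and kept only so both functions share a signature.
    """
    return node * len(FIELDS) + FIELDS.index(field)


def soa_index(node, field, n_nodes):
    """Offset in a struct of arrays: all u, then all v, all w, all p."""
    return FIELDS.index(field) * n_nodes + node


n_nodes = 8
# Offsets touched when consecutive threads read u at consecutive nodes.
print([aos_index(k, "u", n_nodes) for k in range(4)])  # stride of 4 elements
print([soa_index(k, "u", n_nodes) for k in range(4)])  # stride of 1 element
```

With SOA, neighbouring threads read neighbouring memory locations, which is the access pattern that lets the GPU combine the reads into fewer memory transactions.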
53
HARDWARE ARCHITECTURE
                    CPU                   GPU 1                  GPU 2                  GPU 3
Architecture        Intel i7 4820k        Nvidia K20             Nvidia K40             Nvidia K80
Cores               8                     2496 (SP), 832 (DP)    2880 (SP), 960 (DP)    4992 (SP), 1664 (DP)
Memory              32GB                  5GB                    12GB                   24GB
Memory bandwidth    59.7 GB/s             208 GB/s               288 GB/s               480 GB/s
Peak performance    59.2 GFlops/s (DP)    1.17 TFlops/s (DP)     1.43 TFlops/s (DP)     1.87 TFlops/s (DP)
IEEE754 SP/DP       YES                   YES                    YES                    YES
SP/DP: single/double precision
http://www.nvidia.com.tw/object/tesla_product_literature_tw.html
Portability: K20, K40, K80
54
NUMERICAL RESULTS (GPU)
H. Ding et al., Comput. Methods. Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006.
u(x,0.5,z)
w(0.5,y,z)
55
WHAT GPU HELPS ME ?
Re = 400, elapsed time (s)
Mesh length    OpenMP      OpenACC (K20)    OpenACC (K40)    OpenACC (K80)
1/96           15250.4     2693.4           2252.1           1846.5
1/128          37689.7     9660.7           7937.3           6711.4
1/150          196114.5    29838.2          23713.8          21456.7
56
WHAT GPU HELPS ME ?
Re = 1000, elapsed time (s)
Mesh length    OpenMP      OpenACC (K20)    OpenACC (K40)    OpenACC (K80)
1/96           24007.2     4280.1           3626.6           2974.5
1/128          116439.0    15145.0          12334.4          10426.5
1/150          400228.2    44865.0          33732.7          30505.2
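The speedup of each GPU over the OpenMP run follows directly from the two tables. The Python sketch below copies the tabulated times and divides the OpenMP time by each OpenACC time; the dictionary layout and the function name `speedups` are choices made for this illustration, and the times are assumed to be in seconds.

```python
# Elapsed times copied from the two tables: (OpenMP, K20, K40, K80).
TIMES = {
    400: {"1/96": (15250.4, 2693.4, 2252.1, 1846.5),
          "1/128": (37689.7, 9660.7, 7937.3, 6711.4),
          "1/150": (196114.5, 29838.2, 23713.8, 21456.7)},
    1000: {"1/96": (24007.2, 4280.1, 3626.6, 2974.5),
           "1/128": (116439.0, 15145.0, 12334.4, 10426.5),
           "1/150": (400228.2, 44865.0, 33732.7, 30505.2)},
}


def speedups(row):
    """Return the OpenMP time divided by each OpenACC time (K20, K40, K80)."""
    openmp, *gpus = row
    return [round(openmp / t, 1) for t in gpus]


for reynolds, table in TIMES.items():
    for mesh, row in table.items():
        print(f"Re={reynolds:4d} h={mesh:5s} speedup K20/K40/K80: {speedups(row)}")
```

Across all cases the ratios range from about 4 to about 13, with the largest gain on the finest mesh at Re = 1000 on the K80.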
57
CONCLUDING REMARKS
1. We have successfully ported, tested and benchmarked a complete
3D finite difference code using OpenACC.
2. The code is portable across different GPU architectures.
3. With OpenACC, the original source code remains almost unchanged.
4. Running the computation on the GPU reduced the computing time
substantially (about 4 to 13 times faster than the 8-thread OpenMP runs).
Acknowledgement :
Computer Center in National Taiwan University
THANK YOU
59
GTC-Taipei ; Sep. 21, 2016
Yu-Wei Chang
Department of Engineering Science and Ocean Engineering
National Taiwan University
GPU ACCELERATION OF PATIENT-SPECIFIC
AIRWAY IMAGE SEGMENTATION
61
OUTLINE
Introduction
Motivation and objective
Hardware environment
Application example
Concluding remarks
62
INTRODUCTION
Importance of image segmentation
Source: http://www.vision.ee.ethz.ch/~rhayko/
63
INTRODUCTION
Desirable goals for practical application to patients
Applying machine learning to radiotherapy planning for head and neck cancer
The length of the procedure may be reduced to 1/4
Source: https://deepmind.com/health
30th August 2016
64
INTRODUCTION
Building blocks of the 3D airway reconstruction
Acknowledgement: personal chest CT provided by senior labmate Shih-Chao Kao (高仕超)
65
INTRODUCTION
With the Intel® Xeon® Processor E5-2620, the segmentation block takes 85% of the time
Step              Time (s)
Preprocessing     145.568
Segmentation      1424.284
Representation    106.904
66
MOTIVATION AND OBJECTIVE
Amdahl’s law implies that the percentage of the code that benefits from
parallelization is important
$$S = \frac{1}{(1 - p) + \frac{p}{i}} = \frac{1}{(1 - 0.85) + \frac{0.85}{i}} \le 6.6$$
S: the overall speedup
p: the percentage of the execution time that benefits from parallelization
i: the speedup in latency of the part p
p                 Maximum speedup
0.85 (current)    6.6
0.95              20
0.99              100
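The maximum speedups in the table are the limit of Amdahl's law as i grows without bound, which is 1/(1 - p). The Python sketch below evaluates both the general formula and this limit; the function names are hypothetical. For p = 0.85 the limit is 1/0.15, about 6.67, which the table lists as 6.6.

```python
def amdahl_speedup(p, i):
    """Overall speedup when a fraction p of the run time is sped up i times."""
    return 1.0 / ((1.0 - p) + p / i)


def max_speedup(p):
    """Limit of the overall speedup as i grows without bound."""
    return 1.0 / (1.0 - p)


for p in (0.85, 0.95, 0.99):
    print(f"p = {p}: maximum speedup = {max_speedup(p):.2f}")
```

Raising p from 0.85 to 0.99 lifts the ceiling from under 7 to 100, so moving more of the pipeline onto the GPU matters more than making the already parallel part faster.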
67
MOTIVATION AND OBJECTIVE
In a cloud computing environment, time is money
Type                              CPU           GPU
Card                              Intel Xeon    NVIDIA Kepler
Price per core per hour           0.03          0.4
Cost for 1000 patients (USD)      14            To be announced
Time for 1000 patients (hours)    465           To be announced
68
HARDWARE ENVIRONMENT
The GPU has about 30 times better FLOPs performance
Card                                                Nvidia Tesla K40c*    Nvidia Tesla K20c*    Intel Xeon E5-2630**
Cores                                               2880                  2496                  6
Peak single-precision floating-point performance    4.29 Tflops           3.52 Tflops           0.134 Tflops
*http://www.nvidia.com/object/tesla-workstations.html
**http://ark.intel.com/products/64593/Intel-Xeon-Processor-E5-2630-15M-Cache-2_30-GHz-7_20-GTs-Intel-QPI
69
HARDWARE ENVIRONMENT
The GPU does a great job using only global memory
Processor            Parallelization language    Time usage (sec)    Speedup gain    Note
Intel Xeon           NA                          158.08              -               Sequential code
Intel Xeon           OpenMP                      14.335              1               Double-checked locking
Nvidia Tesla K20c    CUDA                        3                   4.8             Global memory only
70
YOU CANNOT MAKE BRICKS WITHOUT STRAW
(Chinese proverb 工欲善其事必先利其器: to do good work, one must first sharpen one's tools)
71
HARDWARE ENVIRONMENT
Block and grid structure affects the usage of shared memory
Grid size / block size    32 / 32    128 / 128    256 / 256    512 / 512
Tesla K40c                3714.5     335.5        388.1        474.9
Tasks per thread          2/         20           21           < 21
Tune the block and thread numbers to
optimize the performance.
Give each thread less work.
72
HARDWARE ENVIRONMENT
Performance benefits from the use of shared and texture memory
Memory usage on Tesla K20c (5120MB)    Time usage (sec)    Speedup gain
Global memory only                     3                   1
Shared and global memory               0.471               6.36
Texture and global memory              0.321               9.34
Although texture memory is faster, it gives lower accuracy.
Because accuracy is of great importance in computational science,
shared memory is preferable.
73
APPLICATION EXAMPLE
74
APPLICATION EXAMPLE
Acquisition and pre-processing
75
APPLICATION EXAMPLE
Mathematical morphology, lung filter, and segmentation
Source: https://en.wikipedia.org/wiki/Mathematical_morphology
Opening operation Lung mask Segmentation
76
APPLICATION EXAMPLE
Full view of the lung from the top
Acknowledgement: personal chest CT provided by senior labmate Shih-Chao Kao (高仕超)
Left airway Right airway
77
CONCLUDING REMARKS
Memory, thread and block settings are important
Platform       Time usage (sec)    Speedup gain
CPU            14.335              1
CPU + 1 GPU    0.335               43
CPU + 2 GPU    0.232               61.8
Tune the block and thread numbers to optimize the performance.
Give each thread less work.
Although texture memory is faster, it gives lower accuracy.
Because accuracy is of great importance in computational science,
shared memory is preferable.
78
CONCLUDING REMARKS
Amdahl’s law implies that the percentage of the code that benefits from
parallelization is important
$$S = \frac{1}{(1 - p) + \frac{p}{i}} = \frac{1}{(1 - 0.85) + \frac{0.85}{61.8}} = \mathbf{6.1}$$
p                 Maximum speedup
0.85 (current)    6.6
0.95              20
0.99              100
79
CONCLUDING REMARKS
In a cloud computing environment, time is money
Type                              CPU           2 GPU + CPU
Card                              Intel Xeon    NVIDIA Kepler + Intel Xeon
Price per core per hour           0.03          0.4 / 0.03
Cost for 1000 patients (USD)      14            7
Time for 1000 patients (hours)    465           77
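The two time entries can be roughly reproduced from numbers given earlier in the talk. The Python sketch below rests on two assumptions that the slides do not state: that the per-stage CPU timings (145.568 s, 1424.284 s, 106.904 s) are per patient, and that the measured 61.8x speedup applies to the segmentation stage only. The function name `hours_for_patients` is hypothetical.

```python
# Seconds per stage on the CPU, assumed to be per patient.
STAGES = {"preprocessing": 145.568, "segmentation": 1424.284,
          "representation": 106.904}


def hours_for_patients(n_patients, segmentation_speedup=1.0):
    """Total hours when only the segmentation stage is accelerated."""
    per_patient = (STAGES["preprocessing"]
                   + STAGES["segmentation"] / segmentation_speedup
                   + STAGES["representation"])
    return n_patients * per_patient / 3600.0


print(f"CPU only:     {hours_for_patients(1000):.1f} h")
print(f"CPU + 2 GPUs: {hours_for_patients(1000, 61.8):.1f} h")
```

Under these assumptions the estimates land close to the 465 and 77 hours in the table, and the ratio between them is the overall speedup of about 6.1 predicted by Amdahl's law.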
80
FUTURE WORK
Increase the resolution of CT from 64-slice to 256-slice
Use deep learning to classify tumor type from CT images and accelerate the whole process
81
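Acknowledgement:
Prof. Tony W. H. Sheu (許文翰)
Prof. 張恆華
Senior labmate Shih-Chao Kao (高仕超), who provided the chest CT
(NVIDIA)
Computer Center, National Taiwan University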

More Related Content

What's hot

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUsSri Ambati
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Ural-PDC
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用NVIDIA Taiwan
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...AMD Developer Central
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA 深度學習教育機構 (DLI): Approaches to object detectionNVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA 深度學習教育機構 (DLI): Approaches to object detectionNVIDIA Taiwan
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesDustin Franklin
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Fisnik Kraja
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDAprithan
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievVolodymyr Saviak
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
Using neon for pattern recognition in audio data
Using neon for pattern recognition in audio dataUsing neon for pattern recognition in audio data
Using neon for pattern recognition in audio dataIntel Nervana
 
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATORQGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATORNVIDIA Japan
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU deviceKohei KaiGai
 

What's hot (20)

PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA 深度學習教育機構 (DLI): Approaches to object detectionNVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous Machines
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDA
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
 
2020 icldla-updated
2020 icldla-updated2020 icldla-updated
2020 icldla-updated
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Slide tesi
Slide tesiSlide tesi
Slide tesi
 
Using neon for pattern recognition in audio data
Using neon for pattern recognition in audio dataUsing neon for pattern recognition in audio data
Using neon for pattern recognition in audio data
 
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATORQGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU device
 

Viewers also liked

Aeroprobing A.I. Drone with TX1
Aeroprobing A.I. Drone with TX1Aeroprobing A.I. Drone with TX1
Aeroprobing A.I. Drone with TX1NVIDIA Taiwan
 
Embedded and Reliable Computer Vision
Embedded and Reliable Computer VisionEmbedded and Reliable Computer Vision
Embedded and Reliable Computer VisionNVIDIA Taiwan
 
Medical Image Processing on NVIDIA TK1/TX1
Medical Image Processing on NVIDIA TK1/TX1Medical Image Processing on NVIDIA TK1/TX1
Medical Image Processing on NVIDIA TK1/TX1NVIDIA Taiwan
 
高效益、設計專利保護 如何達成雙贏?
高效益、設計專利保護 如何達成雙贏?高效益、設計專利保護 如何達成雙贏?
高效益、設計專利保護 如何達成雙贏?NVIDIA Taiwan
 
全面保護企業的關鍵智慧資產
全面保護企業的關鍵智慧資產全面保護企業的關鍵智慧資產
全面保護企業的關鍵智慧資產NVIDIA Taiwan
 
麗明營造 NVIDIA 使用成效分享
麗明營造 NVIDIA 使用成效分享麗明營造 NVIDIA 使用成效分享
麗明營造 NVIDIA 使用成效分享NVIDIA Taiwan
 
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA DGX-1 超級電腦與人工智慧及深度學習NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA DGX-1 超級電腦與人工智慧及深度學習NVIDIA Taiwan
 
圖形處理器於腦部核磁共振影像處理應用
圖形處理器於腦部核磁共振影像處理應用圖形處理器於腦部核磁共振影像處理應用
圖形處理器於腦部核磁共振影像處理應用NVIDIA Taiwan
 
OpenPOWER Foundation Overview
OpenPOWER Foundation OverviewOpenPOWER Foundation Overview
OpenPOWER Foundation OverviewNVIDIA Taiwan
 
Affordable AI Connects To A Better Life
Affordable AI Connects To A Better LifeAffordable AI Connects To A Better Life
Affordable AI Connects To A Better LifeNVIDIA Taiwan
 
“樓下的房客”以數位特效技術 打造寫實近代台灣風格街景
“樓下的房客”以數位特效技術 打造寫實近代台灣風格街景“樓下的房客”以數位特效技術 打造寫實近代台灣風格街景
“樓下的房客”以數位特效技術 打造寫實近代台灣風格街景NVIDIA Taiwan
 
How to Choose Mobile Workstation? VR Ready
How to Choose Mobile Workstation? VR ReadyHow to Choose Mobile Workstation? VR Ready
How to Choose Mobile Workstation? VR ReadyNVIDIA Taiwan
 
The Birth of Doraemon
The Birth of DoraemonThe Birth of Doraemon
The Birth of DoraemonNVIDIA Taiwan
 
東海大學使用 NVIDIA Quadro & GRID 技術在教育雲端創新服務的經驗分享
 東海大學使用 NVIDIA Quadro & GRID 技術在教育雲端創新服務的經驗分享 東海大學使用 NVIDIA Quadro & GRID 技術在教育雲端創新服務的經驗分享
東海大學使用 NVIDIA Quadro & GRID 技術在教育雲端創新服務的經驗分享NVIDIA Taiwan
 
Lid driven cavity flow simulation using CFD & MATLAB
Lid driven cavity flow simulation using CFD & MATLABLid driven cavity flow simulation using CFD & MATLAB
Lid driven cavity flow simulation using CFD & MATLABIJSRD
 
Future of Making Things in Media & Entertainment FOMT - Design Visualisation ...
Future of Making Things in Media & Entertainment FOMT - Design Visualisation ...Future of Making Things in Media & Entertainment FOMT - Design Visualisation ...
Future of Making Things in Media & Entertainment FOMT - Design Visualisation ...NVIDIA Taiwan
 
Cfd analysis report of bike model
Cfd analysis report of  bike modelCfd analysis report of  bike model
Cfd analysis report of bike modelSoumya Dash
 
Towards 3D Object Capture for Interactive CFD with Automotive Applications - ...
Towards 3D Object Capture for Interactive CFD with Automotive Applications - ...Towards 3D Object Capture for Interactive CFD with Automotive Applications - ...
Towards 3D Object Capture for Interactive CFD with Automotive Applications - ...Malcolm Dias
 
Yechun portfolio
Yechun portfolioYechun portfolio
Yechun portfolioYechun Fu
 

Viewers also liked (20)

Aeroprobing A.I. Drone with TX1
Aeroprobing A.I. Drone with TX1Aeroprobing A.I. Drone with TX1
Aeroprobing A.I. Drone with TX1
 
Embedded and Reliable Computer Vision
Embedded and Reliable Computer VisionEmbedded and Reliable Computer Vision
Embedded and Reliable Computer Vision
 
Medical Image Processing on NVIDIA TK1/TX1
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Problems

  • 1. TAIPEI | SEP. 21-22, 2016. Tony W. H. Sheu, Neo Shih-Chao Kao, Maxim Solovchuk, Cheng-Tao Wu, Yu-Wei Chang, National Taiwan University. RECENT PROGRESS IN SCCS ON GPU SIMULATION OF BIOMEDICAL AND HYDRODYNAMIC PROBLEMS. Acknowledgement: SCCS (Scientific Computing and Cardiovascular Simulation) team working on GPU simulation (NVIDIA) (Aug. 8, 2016)
  • 2. OBJECTIVE: Migrate the in-house CPU codes* to NVIDIA CUDA codes in order to exploit GPU acceleration when simulating large-sized problems. *The codes are: (1) a 3D finite element code for the incompressible Navier-Stokes equations; (2) a 3D finite difference code for the incompressible Navier-Stokes equations; (3) a 3D finite difference code for Maxwell's equations; (4) a 3D finite difference code for the Westervelt equation for ultrasound wave propagation.
  • 3. CONTENT OF THE PRESENTATION (undergraduate students): Cheng-Tao Wu (吳政道), CUDA programming on a frontal matrix solver for accelerating finite element calculation of incompressible Navier-Stokes solutions; Yu-Wei Chang (張育維), GPU acceleration of patient-specific airway image segmentation.
  • 4. CONTENT OF THE PRESENTATION (research scientists): Neo Shih-Chao Kao (高仕超), OpenACC acceleration of the three-dimensional incompressible Navier-Stokes equations; Maxim Solovchuk, acceleration of HIFU (High Intensity Focused Ultrasound) ablation of liver tumor on four K80 GPUs.
  • 6. GTC-Taipei; Sep. 21, 2016. National Taiwan University, Department of Engineering Science and Ocean Engineering. Cheng-Tao Wu (吳政道). CUDA PROGRAMMING ON FRONTAL MATRIX SOLVER FOR ACCELERATING FINITE ELEMENT CALCULATION OF INCOMPRESSIBLE NAVIER-STOKES EQUATIONS
  • 7. AGENDA: (1) Motivation and objective; (2) CPU-GPU computing environment; (3) One important CUDA API feature: CUDA stream; (4) Computational results; (5) Future work.
  • 8. MOTIVATION AND OBJECTIVE: Finite Element Method. The finite element method (FEM) is a global integration method that renders minimum energy over the entire physical space, so a large matrix equation accounting for the total number of unknowns has to be solved. The finite element matrix equation results from assembling all local element matrix equations, which share the same weak formulation and are derived from the same integral equations. The GPU is an excellent choice for the computationally intensive tasks of an FEM calculation: (1) its many-core architecture parallelizes well; (2) the very large number of individual element matrix equations can be stored in blocks of shared memory.
  • 9. MOTIVATION AND OBJECTIVE: Finite Element Method. The data structure is the key to successful parallelization: (1) element numbering; (2) global nodal numbering; (3) local nodal numbering.
  • 10. MOTIVATION AND OBJECTIVE: Finite Element Method. In the current incompressible Navier-Stokes finite element formulation, one element contains 22 unknowns: 9 nodal values for each of the velocity components u and v (18 in total) and 4 nodal values of the pressure p. Each element therefore involves a 22x22 matrix equation. Two adjacent elements share 3 velocity nodes and 2 pressure nodes, which gives 22*2 (elements) - 3*2 (u, v velocity) - 2 (pressure) = 36 unknowns, that is, a 36x36 matrix.
  • 11. MOTIVATION AND OBJECTIVE: Finite Element Method. Matrix size versus number of elements:
      Elements    Matrix size
      1           22x22
      100         1003x1003
      400         3803x3803
      900         8403x8403
      1600        14803x14803
      2500        23003x23003
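The matrix sizes in the table can be reproduced with a short count. The sketch below is only an illustration under stated assumptions: the mesh is taken to be a structured nx-by-ny arrangement of the 22-unknown elements (9 velocity nodes, 4 pressure nodes), shared nodes are counted once, and the function names `count_unknowns` and `dense_matrix_bytes` are hypothetical, not part of the authors' code.

```c
#include <assert.h>

/* Number of unknowns for an nx-by-ny mesh of 22-unknown elements.
 * Assumption: structured rectangular mesh, shared nodes counted once. */
long count_unknowns(long nx, long ny)
{
    assert(nx > 0 && ny > 0);
    long velocity_nodes = (2 * nx + 1) * (2 * ny + 1); /* corner, mid-side and centre nodes */
    long pressure_nodes = (nx + 1) * (ny + 1);         /* corner nodes only */
    return 2 * velocity_nodes + pressure_nodes;        /* u and v at each velocity node, plus p */
}

/* Bytes needed to store a dense n-by-n matrix in double precision. */
double dense_matrix_bytes(long n)
{
    assert(n > 0);
    return (double)n * (double)n * (double)sizeof(double);
}
```

With a 10x10 mesh (100 elements) `count_unknowns(10, 10)` gives 1003, and with a 50x50 mesh (2500 elements) it gives 23003, matching the table. Stored densely, the 23003x23003 case would take roughly 4.2e9 bytes, which is one reason a solver that never holds the full matrix at once is attractive.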
  • 12. MOTIVATION AND OBJECTIVE: Solution method. There are two kinds of matrix solvers. Iterative solver. Pro: less demanding in memory and computation. Con: no theory guarantees that a convergent solution can be computed.
  • 13. MOTIVATION AND OBJECTIVE: Solution method. Direct solver, based on Gaussian elimination. Pro: a solution can be computed for any matrix equation that is not ill-conditioned. Con: very demanding in memory and computation. For the sake of parallelization, the element-by-element frontal solver is chosen.
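As a reminder of what the direct approach involves, here is a minimal sketch of dense Gaussian elimination with partial pivoting. It is not the frontal solver used in the talk (which eliminates unknowns element by element without assembling the full matrix); the function name `gauss_solve`, the row-major storage and the singularity tolerance are assumptions made for illustration.

```c
#include <assert.h>

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Solve A x = b in place for a dense n-by-n row-major matrix.
 * On success b holds x and 0 is returned; -1 signals a near-zero pivot. */
int gauss_solve(int n, double *a, double *b)
{
    assert(n > 0 && a && b);
    for (int k = 0; k < n; ++k) {
        /* partial pivoting: choose the largest entry in column k */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (abs_d(a[i * n + k]) > abs_d(a[p * n + k])) p = i;
        if (abs_d(a[p * n + k]) < 1e-14) return -1;
        if (p != k) {
            for (int j = 0; j < n; ++j) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        /* eliminate column k from the rows below the pivot */
        for (int i = k + 1; i < n; ++i) {
            double m = a[i * n + k] / a[k * n + k];
            for (int j = k; j < n; ++j) a[i * n + j] -= m * a[k * n + j];
            b[i] -= m * b[k];
        }
    }
    /* back substitution */
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int j = i + 1; j < n; ++j) s -= a[i * n + j] * b[j];
        b[i] = s / a[i * n + i];
    }
    return 0;
}
```

The triple loop costs on the order of n^3 operations and the dense storage n^2 values, which is the "very demanding" side of the trade-off noted on the slide.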
  • 14. MOTIVATION AND OBJECTIVE: Frontal solver. Interim conclusion: an efficient matrix solver is essential in finite element flow calculation.
  • 15. MOTIVATION AND OBJECTIVE: Evolution of computer chips. 5/2016: GTX 1080, 9 TFlop/s (SP), $699, 180 W. 11/2001: #1 system, 7.2 TFlop/s (DP), $110 million, 3 MW. Interim conclusion: for HPC tasks, a cost-effective GPU turns out to be a smart choice.
  • 16. MOTIVATION AND OBJECTIVE: Evolution of computer chips. Nvidia GPU share of accelerator systems: 54% in June 2015, 67% in June 2016.
  • 17. THEREFORE, MIGRATING THE ORIGINAL CPU CODE TO NVIDIA CUDA CODE CAN BRING A TREMENDOUS BENEFIT.
  • 19. COMPUTING SYSTEM
                          CPU                  GPU
      Name                Intel Core i7 930    Nvidia K20c
      Architecture        Bloomfield           Kepler
      Number of cores     4 cores              2496 CUDA cores, 13 SMs
      Memory bandwidth    25.6 GB/s            208 GB/s
      DP Flop/s           ~100 GFlop/s         1170 GFlop/s
  • 20. COMPUTING SYSTEM: Computing aspect. [Figure: one thread, one block, one grid]
  • 21. COMPUTING SYSTEM: Computing aspect. [Figure: thread, block, grid] A thread has private memory. Within a block, data can be synchronised by setting a barrier, after which the threads share memory. For a grid, memory is only updated after the execution finishes or when a data conflict is encountered.
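The thread/block/grid hierarchy can be mimicked on the CPU to show how each thread finds its own piece of data. The following plain C sketch is an emulation only: the two loops stand in for work that a GPU would run concurrently, and the names `emulate_grid_add`, `block_idx`, `thread_idx` and `block_dim` are hypothetical stand-ins mirroring CUDA's `blockIdx.x`, `threadIdx.x` and `blockDim.x`.

```c
#include <assert.h>

/* CPU emulation of a 1D launch: each (block, thread) pair handles one element. */
void emulate_grid_add(int n, int block_dim,
                      const double *a, const double *b, double *c)
{
    assert(n >= 0 && block_dim > 0);
    int grid_dim = (n + block_dim - 1) / block_dim;      /* blocks needed to cover n */
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            int i = block_idx * block_dim + thread_idx;  /* global element index */
            if (i < n)                                   /* guard the last, partial block */
                c[i] = a[i] + b[i];
        }
    }
}
```

With n = 10 and `block_dim` = 4, three blocks are launched and the last two threads of the third block do nothing, which is why the bounds check is needed.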
  • 22. COMPUTING SYSTEM: Communication concern. Think of the CPU as a manager and the GPU as his/her employees. To fully utilize the GPU, one should reduce the amount of communication between CPU and GPU.
  • 24. COMPUTING SYSTEM: Nvidia Kepler architecture. The K20 has 13 streaming multiprocessors (SMXs) and a scheduler (GigaThread). [Figure labels: GigaThread, SM]
  • 25. ONE IMPORTANT CUDA API FEATURE - CUDA STREAM
  • 26. CUDA STREAM: A CUDA stream is a work queue of the GPU. Operations in different streams may overlap. The GPU scheduler manages the kernels automatically, so programmers need not specify the scheduling when executing a stream. After the CPU places a request in a stream, it can keep working until the CUDA streams need to be synchronized.
  • 27. CUDA STREAM: [Figure: execution timelines of Kernels 1-4 without CUDA streams and with CUDA streams; horizontal axis: time]
  • 28. CUDA STREAM:
      cudaStream_t stream[4];
      // one OpenMP thread per stream
      #pragma omp parallel for
      for (int i = 0; i < 4; i++) {
          cudaStreamCreate(&stream[i]);
          // asynchronous kernel launch in stream i
          cu_Func<<<blocks, threads, 0, stream[i]>>>();
          // CPU task
          cudaStreamSynchronize(stream[i]);
          cudaStreamDestroy(stream[i]);
      }
    [Figure: CPU 1 to CPU 4, each paired with Stream 1 to Stream 4]
  • 30. COMPUTING RESULTS: Lid-driven cavity flow problem. [*] U. Ghia, K. N. Ghia, and C. T. Shin, High-Re solutions for incompressible flow using the Navier-Stokes equations and a multigrid method.
  • 33. COMPUTING RESULTS: Improvement. [Figure: execution time versus number of elements (100, 400, 900) for C, CUDA, and CUDA with stream; annotated speedups: 3.6x and 3.9x]
  • 35. FUTURE WORK: In the future, a multi-frontal direct solver will replace the frontal solver in the finite element flow code, providing a better parallelized algorithm and reducing the computing time. Our aim in the near future at NTU is to solve the incompressible Navier-Stokes equations in a domain with a mesh of 2560*2560*2560 nodal points.
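To get a feeling for the size of the 2560*2560*2560 target, the storage of a single field on that mesh can be estimated. This is a back-of-the-envelope sketch, assuming one double-precision value (8 bytes) per nodal point; the function name `field_bytes` is hypothetical.

```c
#include <assert.h>

/* Bytes needed to store one scalar field on an nx*ny*nz grid. */
unsigned long long field_bytes(unsigned long long nx, unsigned long long ny,
                               unsigned long long nz,
                               unsigned long long bytes_per_value)
{
    assert(nx > 0 && ny > 0 && nz > 0 && bytes_per_value > 0);
    return nx * ny * nz * bytes_per_value;
}
```

`field_bytes(2560, 2560, 2560, 8)` evaluates to 134217728000 bytes, that is 125 GiB for one field, and the three velocity components plus pressure would need four such fields before any solver workspace is counted. A problem of this size would presumably have to be distributed over several GPUs.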
  • 36. THANK YOU
  • 38. GTC-Taipei; Sep. 21, 2016. Neo Shih-Chao Kao (高仕超). OPENACC ACCELERATION OF THE CALCULATION OF THREE-DIMENSIONAL INCOMPRESSIBLE NAVIER-STOKES EQUATIONS. Acknowledgement: Department of Engineering Science and Ocean Engineering, National Taiwan University; Scientific Computing and Cardiovascular Simulation laboratory (SCCS), National Taiwan University (NVIDIA)
  • 39. AGENDA: 1. Why is a GPU needed? 2. How is the GPU used? 3. How does the GPU help? 4. Concluding remarks
  • 40. WHY IS A GPU NEEDED? Computational fluid dynamics (CFD) for the incompressible flow equations involves two major tasks. (1) Discretization scheme. Objective: to derive a finite difference model rendering minimized phase error in the convection terms. (2) High performance computing. Objective: to obtain a convergent solution FASTER for 3D problems (< 8 hours!). http://homepage.ntu.edu.tw/~twhsheu/index.htm
  • 41. The non-dimensional three-dimensional incompressible Navier-Stokes equations are
      $\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\nabla p + \frac{1}{Re}\nabla^{2}\mathbf{u} + \mathbf{f}$, $\nabla\cdot\mathbf{u} = 0$,
    where $\mathbf{u} = (u, v, w)$ denotes the velocity vector, $p$ the pressure field, $Re$ the Reynolds number and $\mathbf{f}$ the force term. Discretization: finite difference method (FDM). The fractional-step algorithm of Kim and Moin* is adopted. Features of the CPU code: compiler PGI Workstation v13.10; column-major ordering (Fortran). *J. Kim, P. Moin, Application of a Fractional-Step method to incompressible Navier-Stokes equations, Journal of Computational Physics, Vol. 59, pp. 308-323, 1985.
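For readers unfamiliar with fractional-step methods, the idea can be sketched for the momentum and continuity equations of this slide as follows. This is a simplified first-order illustration with explicit time stepping, not the exact scheme of the cited paper (which treats the convective terms with an Adams-Bashforth scheme and the viscous terms with Crank-Nicolson); the intermediate velocity $\mathbf{u}^{*}$, the pseudo-pressure $\phi$ and the time step $\Delta t$ are notation introduced here.

```latex
\begin{aligned}
\frac{\mathbf{u}^{*} - \mathbf{u}^{n}}{\Delta t}
  &= -(\mathbf{u}^{n}\cdot\nabla)\mathbf{u}^{n}
     + \frac{1}{Re}\nabla^{2}\mathbf{u}^{n} + \mathbf{f}^{n}, \\
\nabla^{2}\phi^{n+1} &= \frac{1}{\Delta t}\,\nabla\cdot\mathbf{u}^{*}, \\
\mathbf{u}^{n+1} &= \mathbf{u}^{*} - \Delta t\,\nabla\phi^{n+1}.
\end{aligned}
```

Taking the divergence of the last line and using the Poisson equation in the second line gives $\nabla\cdot\mathbf{u}^{n+1} = 0$, so the continuity constraint is satisfied at the new time level. The Poisson solve is typically the most expensive step of such a scheme.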
  • 42. WHY IS A GPU NEEDED? 3D benchmark flow problem: cavity flow. [Figure: schematic of the problem] Computational setting: uniform mesh sizes h = 1/96, 1/128, 1/150; Reynolds numbers Re = 400, 1000. Solution resolution requirement: fine grid distribution (h << 1).
  • 43. 43 INEFFECTIVE COMPUTING (CPU+OPENMP) Time-consuming tasks: computing times with OpenMP (8 threads) on an Intel i7-4820K.
    | Mesh length | Re = 400 | Re = 1000 |
    |---|---|---|
    | 1/96 | 15250.4 s | 24007.7 s |
    | 1/128 | 37689.7 s | 116439.0 s |
    | 1/150 | 196114.2 s | 400228.2 s (4.6 days) |
    The applicability of the proposed CPU code to predict high-Re incompressible flow is confirmed. Figures: comparison of the velocity profiles u(x,0.5,z) and w(0.5,y,z); streamlines at Re = 1000. H. Ding et al., Comput. Methods. Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006.
  • 44. 44 THIS IS WHY GPU IS NEEDED!! 2016/9/26
  • 45. 45 GPU (GRAPHICS PROCESSING UNIT) PC game: GTA5 (GeForce GTX) http://www.geforce.com.tw/whats-new/articles/grand-theft-auto-v-nvidia-gameworks-and-technology Movie: Deadpool (Quadro M6000) https://blogs.nvidia.com.tw/2016/02/deadpool-movie/
  • 46. 46 WHY GPU IS NEEDED ? GPU programming: before 2007, OpenGL; 2007, CUDA; 2011, OpenACC.
    • CPU architecture: multi-core structure; sophisticated control logic unit; large cache to reduce access latencies
    • GPU architecture: many-core structure; minimized control logic unit; large number of threads; high peak performance and memory bandwidth
    ALU : arithmetic logic unit. Acknowledgement : CUDA programming guide
  • 47. 47 WHY GPU IS NEEDED ? Heterogeneous CPU/GPU computing platform (CPU: Intel i7-4820K).
    • Programming: the tasks (Task 1, Task 2, ...) are divided into computing-intensive tasks, written as GPU code, and non-computing-intensive tasks, written as CPU code
    • Running: computing-intensive tasks run on the GPU and non-computing-intensive tasks run on the CPU
  • 48. 48 OPENACC The OpenACC API:
    • It was developed by Nvidia, PGI, Cray and CAPS
    • Its programming model is similar to that of OpenMP
    • Directives are added to the serial source code to manage loop parallelization and data copies between CPU and GPU
    • The existing source code (C/C++/Fortran) is reused
    • Ideally, no modification of the original code is necessary
  • 49. 49 EXAMPLE C = A + B
    CPU:
        program code_CPU
        ...
        do i = 1, N
          C(i) = A(i) + B(i)
        end do
        ...
        end program
    OpenACC:
        program code_GPU_Acc
        ...
        ! data copy CPU --> GPU
        ...
        !$acc parallel loop
        do i = 1, N
          C(i) = A(i) + B(i)
        end do
        !$acc end parallel loop
        ...
        ! data copy GPU --> CPU
        ...
        end program
    CUDA Fortran:
        module cuda_lib
          use cudafor
        contains
          attributes(global) subroutine add(C, A, B, N)
            integer, value :: N
            real(kind=8) :: A(N), B(N), C(N)
            integer :: i
            ! global 1-based index of this thread
            i = (blockidx%x - 1)*blockdim%x + threadidx%x
            ! guard the threads that fall past the end of the arrays
            if (i <= N) then
              C(i) = A(i) + B(i)
            end if
          end subroutine add
        end module cuda_lib

        program code_CUDA_Fortran
        use cuda_lib
        ...
        ! C, A and B must be device arrays; NB blocks of NT threads each
        call add<<<NB,NT>>>(C, A, B, N)
        ...
        end program
  • 50. 50 HOW GPU IS USED ? Execution hierarchy of the two programming models.
    • CUDA model: Grid → Block → warp → Thread
    • OpenACC model: Parallel region → Gang → worker → Vector
  • 51. 51 HOW GPU IS USED ? AOS: four degrees of freedom (u,v,w,p) are stored for each of the N nodes.
    Array Of Structs (AOS) layout in GPU global memory:
    U Node 1, V Node 1, W Node 1, P Node 1, U Node 2, V Node 2, W Node 2, P Node 2, ..., U Node N, V Node N, W Node N, P Node N
    Access to any one variable is non-contiguous, so performance deteriorates owing to ineffective memory access.
  • 52. 52 HOW GPU IS USED ? The SOA data format is effective for SIMD hardware (GPU), so the data must be reordered into the SOA format given below.
    Struct Of Arrays (SOA) layout in GPU global memory:
    U Node 1, U Node 2, ..., U Node N, V Node 1, V Node 2, ..., V Node N, W Node 1, W Node 2, ..., W Node N, P Node 1, P Node 2, ..., P Node N
    Access to each variable is contiguous.
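The difference between the two layouts can be seen from the flat memory index of each unknown. The following is a minimal sketch in plain Python, not the authors' Fortran code; the function names, the 0-based indexing and the tiny node count are assumptions made for illustration only.

```python
# Field order assumed from the slides: 0 = u, 1 = v, 2 = w, 3 = p.
FIELDS = ("u", "v", "w", "p")


def aos_index(node, field, num_fields=len(FIELDS)):
    """AoS: the four unknowns of one node are stored next to each other."""
    return node * num_fields + field


def soa_index(node, field, num_nodes):
    """SoA: all values of one unknown are stored next to each other."""
    return field * num_nodes + node


def stride_between_neighbours(index_fn, field, **kwargs):
    """Memory distance between the same unknown at node 0 and node 1."""
    return index_fn(1, field, **kwargs) - index_fn(0, field, **kwargs)


num_nodes = 8  # hypothetical, kept tiny on purpose
aos_stride = stride_between_neighbours(aos_index, 0)
soa_stride = stride_between_neighbours(soa_index, 0, num_nodes=num_nodes)

print("AoS stride for u:", aos_stride)
print("SoA stride for u:", soa_stride)
```

When neighbouring GPU threads each handle one node and read the same unknown, a stride of one element lets the hardware combine their loads, whereas a stride of four spreads them over more memory transactions.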
  • 53. 53 HARDWARE ARCHITECTURE
    | | CPU | GPU 1 | GPU 2 | GPU 3 |
    |---|---|---|---|---|
    | Architecture | Intel i7-4820K | Nvidia K20 | Nvidia K40 | Nvidia K80 |
    | Cores | 8 | 2496 (SP), 832 (DP) | 2880 (SP), 960 (DP) | 4992 (SP), 1664 (DP) |
    | Memory | 32 GB | 5 GB | 12 GB | 24 GB |
    | Memory bandwidth | 59.7 GB/s | 208 GB/s | 288 GB/s | 480 GB/s |
    | Peak performance (DP) | 59.2 GFLOPS | 1.17 TFLOPS | 1.43 TFLOPS | 1.87 TFLOPS |
    | IEEE 754 SP/DP | Yes | Yes | Yes | Yes |
    SP/DP : single/double precision. Portability: K20, K40, K80. http://www.nvidia.com.tw/object/tesla_product_literature_tw.html
  • 54. 54 NUMERICAL RESULTS (GPU) H. Ding et al., Comput. Methods. Appl. Mech. Engrg., Vol. 195, pp. 516-533, 2006. u(x,0.5,z) w(0.5,y,z)
  • 55. 55 WHAT GPU HELPS ME ? Computing times (s) at Re = 400
    | Mesh length | OpenMP | OpenACC (K20) | OpenACC (K40) | OpenACC (K80) |
    |---|---|---|---|---|
    | 1/96 | 15250.4 | 2693.4 | 2252.1 | 1846.5 |
    | 1/128 | 37689.7 | 9660.7 | 7937.3 | 6711.4 |
    | 1/150 | 196114.5 | 29838.2 | 23713.8 | 21456.7 |
  • 56. 56 WHAT GPU HELPS ME ? Computing times (s) at Re = 1000
    | Mesh length | OpenMP | OpenACC (K20) | OpenACC (K40) | OpenACC (K80) |
    |---|---|---|---|---|
    | 1/96 | 24007.2 | 4280.1 | 3626.6 | 2974.5 |
    | 1/128 | 116439.0 | 15145.0 | 12334.4 | 10426.5 |
    | 1/150 | 400228.2 | 44865.0 | 33732.7 | 30505.2 |
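The slides report raw times only. As a rough reading aid, the speedup of each GPU run over the 8-thread OpenMP run can be derived as the ratio of the two times. The sketch below assumes the tabulated values are in seconds, as on the earlier OpenMP slide; the dictionary layout and the `speedups` helper are illustrative and not part of the original code.

```python
# Times copied from the Re = 400 and Re = 1000 slides (assumed to be seconds).
times = {
    400: {
        "1/96": {"OpenMP": 15250.4, "K20": 2693.4, "K40": 2252.1, "K80": 1846.5},
        "1/128": {"OpenMP": 37689.7, "K20": 9660.7, "K40": 7937.3, "K80": 6711.4},
        "1/150": {"OpenMP": 196114.5, "K20": 29838.2, "K40": 23713.8, "K80": 21456.7},
    },
    1000: {
        "1/96": {"OpenMP": 24007.2, "K20": 4280.1, "K40": 3626.6, "K80": 2974.5},
        "1/128": {"OpenMP": 116439.0, "K20": 15145.0, "K40": 12334.4, "K80": 10426.5},
        "1/150": {"OpenMP": 400228.2, "K20": 44865.0, "K40": 33732.7, "K80": 30505.2},
    },
}


def speedups(row, baseline="OpenMP"):
    """Ratio of the baseline time to each accelerated time in one table row."""
    base = row[baseline]
    return {name: base / t for name, t in row.items() if name != baseline}


for reynolds, table in times.items():
    for mesh, row in table.items():
        ratios = ", ".join(f"{k} {v:.1f}x" for k, v in speedups(row).items())
        print(f"Re = {reynolds}, h = {mesh}: {ratios}")
```

These ratios compare against the OpenMP run, not against a single-thread run, so they should not be read as speedups over serial code.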
  • 57. 57 CONCLUDING REMARKS 1. We have successfully ported, tested and benchmarked a complete 3D finite difference code using OpenACC. 2. The code is portable across different GPU architectures. 3. With OpenACC, the original source code remains almost unchanged. 4. The computing time is reduced substantially when the computational task is executed on a GPU. Acknowledgement : Computer Center in National Taiwan University
  • 58. THANK YOU
  • 60. GTC-Taipei ; Sep. 21, 2016 Yu-Wei Chang, Department of Engineering Science and Ocean Engineering, National Taiwan University GPU ACCELERATION OF PATIENT-SPECIFIC AIRWAY IMAGE SEGMENTATION
  • 61. 61 OUTLINE Introduction Motivation and objective Hardware environment Application example Concluding remarks
  • 62. 62 INTRODUCTION Importance of image segmentation Source: http://www.vision.ee.ethz.ch/~rhayko/
  • 63. 63 INTRODUCTION Desirable goals to achieve for practical application to patients. Applying machine learning to radiotherapy planning for head and neck cancer: the length of the procedure may be reduced to 1/4. Source: https://deepmind.com/health 30th August 2016
  • 64. 64 INTRODUCTION Building blocks of the 3D airway reconstruction. Acknowledgement: personal chest CT provided by Shih-Chao Kao (高仕超)
  • 65. 65 INTRODUCTION With an Intel® Xeon® Processor E5-2620, the segmentation block takes 85% of the time. Time per step (s): Preprocessing 145.568; Segmentation 1424.284; Representation 106.904.
  • 66. 66 MOTIVATION AND OBJECTIVE Amdahl’s law implies that the percentage of the code that benefits from parallelization is important.
    $$S = \frac{1}{(1-p) + \dfrac{p}{i}} = \frac{1}{(1-0.85) + \dfrac{0.85}{i}} < \frac{1}{0.15} \approx 6.67$$
    $S$: overall speedup; $p$: the fraction of the execution time that benefits from parallelization; $i$: the speedup of that fraction.
    | p | Maximum speedup |
    |---|---|
    | 0.85 (current) | 6.67 |
    | 0.95 | 20 |
    | 0.99 | 100 |
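The bound in the table follows from letting the speedup i of the parallel part grow without limit, which leaves 1/(1 - p). A minimal Python sketch of both expressions is given below; the function names are illustrative and not taken from the presentation.

```python
def amdahl_speedup(p, i):
    """Overall speedup when a fraction p of the run time is accelerated i times."""
    return 1.0 / ((1.0 - p) + p / i)


def amdahl_limit(p):
    """Upper bound on the overall speedup as i grows without bound."""
    return 1.0 / (1.0 - p)


for p in (0.85, 0.95, 0.99):
    print(f"p = {p:.2f}: maximum speedup = {amdahl_limit(p):.2f}")
```

For p = 0.85 the bound is 1/0.15, about 6.67, so raising p matters more than making the segmentation kernel arbitrarily fast.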
  • 67. 67 MOTIVATION AND OBJECTIVE In a cloud computing environment, time is money.
    | Type | CPU | GPU |
    |---|---|---|
    | Card | Intel Xeon | NVIDIA Kepler |
    | Price per core per hour | 0.03 | 0.4 |
    | Cost for 1000 patients (USD) | 14 | To be announced |
    | Time for 1000 patients (hour) | 465 | To be announced |
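The CPU column appears to follow from the per-step timings reported earlier (145.568 s, 1424.284 s and 106.904 s per patient). The sketch below reproduces it under two assumptions that the slide does not state explicitly: the price is in USD and a single core is billed for the whole run. All names are illustrative.

```python
# Per-patient CPU time: preprocessing + segmentation + representation (seconds).
SECONDS_PER_PATIENT = 145.568 + 1424.284 + 106.904
CPU_PRICE_PER_HOUR = 0.03  # assumed USD per core per hour, from the table
PATIENTS = 1000


def batch_hours(seconds_per_patient, patients):
    """Total wall-clock hours to process all patients one after another."""
    return seconds_per_patient * patients / 3600.0


def batch_cost(hours, price_per_hour, cores=1):
    """Billing cost for the given number of cores (one core assumed by default)."""
    return hours * price_per_hour * cores


cpu_hours = batch_hours(SECONDS_PER_PATIENT, PATIENTS)
cpu_cost = batch_cost(cpu_hours, CPU_PRICE_PER_HOUR)
print(f"CPU: {cpu_hours:.1f} hours, {cpu_cost:.2f} USD")
```

Under these assumptions the result lands on the slide's 465 hours and 14 USD after rounding.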
  • 68. 68 HARDWARE ENVIRONMENT The GPU has about 30 times higher peak FLOPS performance.
    | Card | Nvidia Tesla K40c* | Nvidia Tesla K20c* | Intel Xeon E5-2630** |
    |---|---|---|---|
    | Cores | 2880 | 2496 | 6 |
    | Peak single precision floating point performance | 4.29 TFLOPS | 3.52 TFLOPS | 0.134 TFLOPS |
    *http://www.nvidia.com/object/tesla-workstations.html
    **http://ark.intel.com/products/64593/Intel-Xeon-Processor-E5-2630-15M-Cache-2_30-GHz-7_20-GTs-Intel-QPI
  • 69. 69 HARDWARE ENVIRONMENT GPU does a great job using only global memory.
    | Processor | Parallelization language | Time usage (sec) | Speedup gain | Note |
    |---|---|---|---|---|
    | Intel Xeon | NA | 158.08 | | Sequential code |
    | Intel Xeon | OpenMP | 14.335 | 1 | Double checked locking |
    | Nvidia Tesla K20c | CUDA | 3 | 4.8 | Global memory only |
  • 70. 70 YOU CANNOT MAKE BRICKS WITHOUT STRAW (工欲善其事必先利其器: to do good work, one must first sharpen one's tools)
  • 71. 71 HARDWARE ENVIRONMENT The block and grid structure affects the usage of shared memory.
    | Grid size, block size | 32, 32 | 128, 128 | 256, 256 | 512, 512 |
    |---|---|---|---|---|
    | Tesla K40c | 3714.5 | 335.5 | 388.1 | 474.9 |
    | Tasks per thread | 2/ | 20 | 21 | < 21 |
    Tune the block and thread numbers to optimize the performance. Let each thread do less work.
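How the launch configuration changes the work per thread can be sketched as a ceiling division of the total work by the number of launched threads. The example below is illustrative only: it assumes a one-dimensional launch, reads the pairs in the table as (grid size, block size), and uses a hypothetical workload of one task per pixel of a 512 x 512 slice, none of which is confirmed by the slide.

```python
import math


def tasks_per_thread(total_tasks, grid_size, block_size):
    """Work items each thread must handle when grid_size * block_size threads run."""
    threads = grid_size * block_size
    return math.ceil(total_tasks / threads)


TOTAL_TASKS = 512 * 512  # hypothetical: one task per pixel of a 512 x 512 slice

for grid, block in [(32, 32), (128, 128), (256, 256), (512, 512)]:
    per_thread = tasks_per_thread(TOTAL_TASKS, grid, block)
    print(f"grid = {grid}, block = {block}: {per_thread} task(s) per thread")
```

Fewer tasks per thread is not automatically faster: in the slide's measurements the best time is at the second configuration, not the largest one, which is why the setting has to be tuned.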
  • 72. 72 HARDWARE ENVIRONMENT The performance benefits from the usage of shared and texture memory.
    | Memory usage on Tesla K20c (5120MB) | Time usage (sec) | Speedup gain |
    |---|---|---|
    | Global memory only | 3 | 1 |
    | Shared and global memory | 0.471 | 6.36 |
    | Texture and global memory | 0.321 | 9.34 |
    Despite its faster performance, texture memory yields a lower accuracy. Since accuracy is of great importance in computational science, shared memory is preferable.
  • 75. 75 APPLICATION EXAMPLE Mathematical morphology, lung filter, and segmentation Source: https://en.wikipedia.org/wiki/Mathematical_morphology Opening operation Lung mask Segmentation
  • 76. 76 APPLICATION EXAMPLE Full view of the lung from the top: left airway, right airway. Acknowledgement: personal chest CT provided by Shih-Chao Kao (高仕超)
  • 77. 77 CONCLUDING REMARKS Memory, thread and block settings are important.
    | Platform | Time usage (sec) | Speedup gain |
    |---|---|---|
    | CPU | 14.335 | 1 |
    | CPU + 1 GPU | 0.335 | 43 |
    | CPU + 2 GPU | 0.232 | 61.8 |
    Tune the block and thread numbers to optimize the performance. Let each thread do less work. Despite its faster performance, texture memory yields a lower accuracy. Since accuracy is of great importance in computational science, shared memory is preferable.
  • 78. 78 CONCLUDING REMARKS Amdahl’s law implies that the percentage of the code that benefits from parallelization is important.
    $$S = \frac{1}{(1-p) + \dfrac{p}{i}} = \frac{1}{(1-0.85) + \dfrac{0.85}{61.8}} \approx 6.1$$
    | p | Maximum speedup |
    |---|---|
    | 0.85 (current) | 6.67 |
    | 0.95 | 20 |
    | 0.99 | 100 |
  • 79. 79 CONCLUDING REMARKS In a cloud computing environment, time is money.
    | Type | CPU | 2 GPU + CPU |
    |---|---|---|
    | Card | Intel Xeon | NVIDIA Kepler + Intel Xeon |
    | Price per core per hour | 0.03 | 0.4 / 0.03 |
    | Cost for 1000 patients (USD) | 14 | 7 |
    | Time for 1000 patients (hour) | 465 | 77 |
  • 80. 80 FUTURE WORK
    • Increase the resolution of CT from 64-slice to 256-slice
    • Use deep learning to classify tumor type from CT images and accelerate the whole process
  • 81. 81 REFERENCES
    [1] Babin, D., et al. (2010). Segmentation of airways in lungs using projections in 3-D CT angiography images. 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology.
    [2] Arti T, Priya R, Amit Ujjlayan R. (2015). A performance study of image segmentation techniques. 2015 4th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions).
    Acknowledgement : Prof. Tony W. H. Sheu (許文翰); Prof. 張恆華; Shih-Chao Kao (高仕超), who provided the chest CT; NVIDIA (輝達); Computer Center, National Taiwan University (國立臺灣大學計算機中心)