Until recently, GPU compute relied on discrete GPUs and served a fairly narrow set of academic and supercomputing workloads. With the increasing performance of the integrated GPU inside an Accelerated Processing Unit (APU), the introduction of Heterogeneous System Architecture (HSA) devices, and the proliferation of programming tools, GPU compute is making its way into mainstream applications. This presentation covers GPU compute and HSA, focusing on applications of GPU compute in the Medical and Print Imaging segments. It reviews example performance data and makes the case that GPU compute can deliver tangible benefits.
GPU Compute in Medical and Print Imaging
1. GPU Compute in Medical and Print Imaging
Amey Deosthali
Director, Embedded Imaging
2. Medical Imaging Trends
• SYSTEM OPTIMIZATION AND MINIATURIZATION: High-end systems of yesterday becoming portables of today
• INCREASED USE OF 3D/4D IMAGING: Advances in visualization and increased use of 3D/4D imaging for improved diagnosis
• INTEGRATION OF MODALITIES & ADVANCED FEATURES: Endoscopic ultrasound, augmented reality, robotic endoscopy
• INCREASED SYSTEM COST PRESSURES: Expanding emerging markets, regulatory pressures, increased competition
3. Print Imaging Trends
Traditional Multi-Function Printer Architecture
GPU Compute based Multi-Function Printer Architecture: SoC with GPU
• SCALABLE SOFTWARE
• SCALABLE ARCHITECTURE
• SYSTEM COST SAVINGS
4. GPU Compute and AMD APU
GPU Compute in Imaging
Medical and Print Imaging workloads are well suited for GPU compute
HSA architecture can deliver significant benefits in the field of Imaging
AMD APUs integrate GPU with support for Heterogeneous System Architecture (HSA)
7. GPU Compute for SW Beamforming
• FASTER SCANS: Reconstruct whole image plane
• IMPROVED IMAGE QUALITY: Evolution in algorithm complexity with GPU
• ACCESS TO RAW DATA: Fast data transfer and efficient use of system memory
• SIMPLIFIED ARCHITECTURE: Scalable SW defined architecture
Bridge: Convert JESD-204b to PCIe
JESD-204b: 64-256 I/O Channels
Image Formation
Plane Wave Imaging
• FK Stolts with optimized FFT/iFFT
• IQ Demodulation and Log Compression
Image Post Processing
Separable Filters
• Sobel and Box filters
Non-separable Filter
• Laplacian of Gaussian
De-speckle Filter
• Median filter
Frequency Domain Filter
• Gaussian blur and Edge Enhancement filters
Gen 3 PCIe® x16, dGMA support for 10+ GBps
GPU: coherent compounding
GPU + CPU: post processing
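The "Separable Filters" entry above can be illustrated with a small sketch. It assumes the standard 3x3 Sobel-x kernel (the slides do not give the kernel used) and shows that the 2D kernel is the outer product of two 1D vectors, so one 9-tap pass can be replaced by two 3-tap passes. The function names are hypothetical, and plain Python stands in for what would be an OpenCL kernel on the GPU.

```python
# Sketch: a 3x3 Sobel-x kernel equals the outer product of a vertical
# smoothing vector and a horizontal derivative vector, so it can be
# applied as two 1D passes instead of one 2D pass.

SMOOTH = [1, 2, 1]   # vertical smoothing taps
DERIV = [-1, 0, 1]   # horizontal derivative taps
SOBEL_X = [[s * d for d in DERIV] for s in SMOOTH]  # outer product


def correlate_2d(img, kernel):
    """Direct 3x3 correlation over the valid interior (no padding)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(kernel[j][i] * img[y + j][x + i]
                for j in range(3) for i in range(3))
            for x in range(w - 2)
        ]
        for y in range(h - 2)
    ]


def sobel_x_separable(img):
    """Same result, computed as a row pass followed by a column pass."""
    h, w = len(img), len(img[0])
    # Row pass: horizontal derivative (width shrinks by 2).
    rows = [
        [sum(DERIV[i] * img[y][x + i] for i in range(3)) for x in range(w - 2)]
        for y in range(h)
    ]
    # Column pass: vertical smoothing (height shrinks by 2).
    return [
        [sum(SMOOTH[j] * rows[y + j][x] for j in range(3)) for x in range(w - 2)]
        for y in range(h - 2)
    ]


# A small test image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
]
full = correlate_2d(image, SOBEL_X)
separable = sobel_x_separable(image)
print(full == separable)
```

The separable form needs 6 multiply-adds per output pixel instead of 9, and the saving grows with kernel size, which is one reason separable filters map well to GPU compute.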
8. SW Beamforming on AMD APU
Acquisition Device → Direct GMA (> 10 GB/s), RF Data → Software Beamformer (iGPU or dGPU)
Software Beamformer blocks (as labelled in the diagram): 1D FFT, X Shift & Transpose, 1D FFT, Z Shift & Transpose, FK interpolation, 1D IFFT, Transpose, 1D IFFT, Transpose
OpenCL™ implementation of FK Stolts algorithm
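The repeated "1D FFT" and "Transpose" blocks in the beamformer diagram suggest a row-column decomposition: a 2D transform computed as 1D transforms along rows, a transpose, and 1D transforms along rows again. The sketch below only demonstrates that decomposition. It uses a naive DFT in place of an optimized FFT, omits the X/Z shifts and FK interpolation, and all function names are hypothetical rather than taken from the AMD implementation.

```python
import cmath


def dft_1d(row):
    """Naive O(n^2) 1D DFT; a real pipeline would use an optimized FFT."""
    n = len(row)
    return [
        sum(row[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dft_2d_row_column(img):
    """1D DFT on rows -> transpose -> 1D DFT on rows -> transpose."""
    step1 = [dft_1d(r) for r in img]
    step2 = [dft_1d(r) for r in transpose(step1)]
    return transpose(step2)


def dft_2d_direct(img):
    """Reference: the 2D DFT evaluated straight from its definition."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(
                img[y][x] * cmath.exp(-2j * cmath.pi * (v * y / h + u * x / w))
                for y in range(h) for x in range(w)
            )
            for u in range(w)
        ]
        for v in range(h)
    ]


# Tiny stand-in for RF data: 2 channels x 4 samples.
rf = [[1, 2, 3, 4], [0, 1, 0, -1]]
fast = dft_2d_row_column(rf)
ref = dft_2d_direct(rf)
max_err = max(abs(a - b) for ra, rb in zip(fast, ref) for a, b in zip(ra, rb))
print(max_err < 1e-9)
```

Transposing between passes keeps every 1D transform reading contiguous rows, which on a GPU typically helps memory coalescing.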
SW Beamformer Performance¹
Configuration                  APU        dGPU
256 channels, 2048 samples     1.95 ms    0.47 ms
128 channels, 2048 samples     1.15 ms    0.29 ms
Processed Output
5x5 Median Filter
9. Speckle Noise Reduction
Down-sample by 2
Subtract
Multiply with Coefficients
Up-sample by 2
Gamma Correction
Down-sample by 2
Up-sample by 2
Subtract
Gamma Correction
Down-sample by 2
Sobel Diffusion
Gamma Correction
Pixel Correction
IQ Demodulation Output
Speckle Reduction Output
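The down-sample, up-sample and subtract blocks in this diagram resemble one level of a Laplacian-style pyramid: a coarse image plus a detail layer that can be scaled by coefficients before recombining. The following is a minimal sketch under that assumption. The slides do not state the resampling filters or the coefficient values, so the 2x2 averaging, nearest-neighbour up-sampling and `DETAIL_GAIN` below are all hypothetical.

```python
def downsample_by_2(img):
    """Average each 2x2 block (assumes even width and height)."""
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
            for x in range(0, len(img[0]), 2)
        ]
        for y in range(0, len(img), 2)
    ]


def upsample_by_2(img):
    """Nearest-neighbour up-sampling: repeat each pixel into a 2x2 block."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def subtract(a, b):
    return [[pa - pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def recombine(coarse_up, detail, gain):
    """Add the detail layer back, scaled by a coefficient."""
    return [
        [c + gain * d for c, d in zip(rc, rd)]
        for rc, rd in zip(coarse_up, detail)
    ]


image = [
    [10, 12, 20, 22],
    [14, 16, 24, 26],
    [30, 30, 40, 40],
    [30, 30, 40, 40],
]
coarse = downsample_by_2(image)
coarse_up = upsample_by_2(coarse)
detail = subtract(image, coarse_up)          # high-frequency layer

DETAIL_GAIN = 0.5                            # hypothetical coefficient
lossless = recombine(coarse_up, detail, 1.0)  # gain 1.0 restores the input
smoothed = recombine(coarse_up, detail, DETAIL_GAIN)
print(smoothed[0])
```

With a gain of 1.0 the input is recovered exactly; gains below 1.0 attenuate fine detail, which is where speckle tends to live.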
10. Speckle Noise Reduction Optimization
• Combine multiple functions into single kernel
• Get more compute per byte of global memory access
• Reduce kernel launch delay overheads
• Reduce use of temporary buffers and buffer copies
• Reduce CPU bottlenecks that require blocking calls by moving operations to GPU
• Optimize pipeline with “in order” enqueue of OpenCL commands
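The first optimization above, combining multiple functions into a single kernel, can be sketched in plain Python as an analogy. The two-pass version writes an intermediate buffer and reads it back; the fused version does both operations per pixel in one pass. The stage choice (normalization followed by gamma correction) and all names are illustrative assumptions, not the actual kernels from the slide.

```python
def normalize(pixels, lo, hi):
    """Pass 1: map [lo, hi] to [0, 1]; produces a full temporary buffer."""
    return [(p - lo) / (hi - lo) for p in pixels]


def gamma_correct(pixels, gamma):
    """Pass 2: reads the temporary buffer again."""
    inv = 1.0 / gamma
    return [p ** inv for p in pixels]


def unfused(pixels, lo, hi, gamma):
    tmp = normalize(pixels, lo, hi)      # extra buffer plus extra memory pass
    return gamma_correct(tmp, gamma)


def fused(pixels, lo, hi, gamma):
    """Single pass: each pixel is read once and written once."""
    inv = 1.0 / gamma
    return [((p - lo) / (hi - lo)) ** inv for p in pixels]


samples = [0, 64, 128, 255]
two_pass = unfused(samples, 0, 255, 2.0)
one_pass = fused(samples, 0, 255, 2.0)
print(two_pass == one_pass)
```

On a GPU the fused form saves one global-memory round trip and one kernel launch per merged stage, which is the "more compute per byte" effect the slide describes.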
Blocks: A, B, C, D, E
Block A & B (Multiple OpenCL kernels)
Block C & D (Multiple OpenCL kernels)
Block E (Multiple OpenCL kernels)
CPU Path (4.10 ms)
GPU Path² (1.01 ms)
Downsample + memcpy
Downsample + Optimized memcpy
Color conversion, edge detection, diffusion, normalization, gamma correction, image enhancement
11. Code Migration and Optimization Process
1. Profile: Identify target workloads to convert
2. Convert: Target workloads from CPU to GPU
3. Block Optimization: Combine multiple CPU calls to a single OpenCL kernel
4. Buffer Optimization: Reduce use of temporary buffers and buffer copies
5. Pipeline Optimization: Move low workload CPU operations to GPU to reduce blocking calls
6. Reduce kernel launch delay: “in order” enqueue of OpenCL commands
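Step 1 of this process (Profile) could start from something as simple as timing each stage of the existing CPU pipeline. The sketch below is one way to do that, not the tooling used in the presentation. The helper `profile_stages` and the two stand-in stages are hypothetical.

```python
import time


def profile_stages(stages, data, repeats=3):
    """Run stages in order; return {stage name: best wall-clock seconds}."""
    timings = {}
    for name, fn in stages:
        best = float("inf")
        result = data
        for _ in range(repeats):
            start = time.perf_counter()
            result = fn(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
        data = result  # feed this stage's output to the next stage
    return timings


# Hypothetical stand-ins for real pipeline stages.
def to_float(pixels):
    return [float(p) for p in pixels]


def box_blur_1d(pixels):
    n = len(pixels)
    return [
        (pixels[max(i - 1, 0)] + pixels[i] + pixels[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


pipeline = [("to_float", to_float), ("box_blur_1d", box_blur_1d)]
timings = profile_stages(pipeline, list(range(256)) * 64)
hottest = max(timings, key=timings.get)
print(timings, "-> first candidate to convert:", hottest)
```

The stage with the largest share of time is the natural first target for step 2 (Convert).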
12. Sobel Filter Optimization
Migrate Sobel filter to GPU with OpenCL
A: 8-bit Grayscale Image (1920x1080) → Median Filter IPP → 8 to 32-bit Float → Sobel & Sobel Magnitude → Max & Min
B: 8-bit Grayscale Image (1920x1080) → Median Filter IPP → 8 to 32-bit Float → Sobel & Sobel Magnitude → Max & Min
Timings shown on the slide: 6.51 ms, 19.47 ms
Legend: CPU Optimized Modules, GPU Optimized Modules (OpenCL Optimized)
2X faster computation time with migration of single module to GPU³
15. Accelerated RIP Pipeline
Open source Ghostscript PostScript renderer accelerated using GPU⁴
Software Stack
AMD G-Series Reference Board
Ubuntu 14.04 Linux OS
KMD GFX Driver
OCL 2.0 Runtime
OGL 4.3 Runtime
OCL Code
GLSL Libraries
C Libraries
PDF Files on Disk
Bitmap File on RAMdisk
PDL Interpreter
Element Decompose
Generate Glyph Bitmaps
Bitmap
Ghostscript App
Planarize
GPU
Raster
GPU
Color Conversion
GPU
DMA
DMA
OpenCL
GL Shader Language (GLSL)
CPU Operating in Host Memory
GPU Operating in Device Memory
16. GPU compute can deliver large increase in PPM performance⁴
RIP Pipeline acceleration: PPM performance
PPM - Test case 2 @600 dpi
Device    Legacy code (no GPU accl)    GPU accelerated code    Speedup
GX-412    101.8                        244.3                   2.4x
GX-424    164                          370                     2.3x
PPM - Test case 2 @1200 dpi
Device    Legacy code (no GPU accl)    GPU accelerated code    Speedup
GX-412    27.6                         76.6                    2.8x
GX-424    44                           111                     2.5x
PPM: Pages per Minute performance of Ghostscript RIP pipeline
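The speedup labels on this slide can be cross-checked from the PPM values. The short calculation below assumes the bar values pair up as legacy versus GPU-accelerated per device (GX-412 first, then GX-424), which is a reading of the flattened chart rather than something the transcript states outright.

```python
# (legacy PPM, GPU-accelerated PPM) as read from the slide's bar charts.
# The pairing of values to devices is an assumption about the chart layout.
PPM = {
    ("600 dpi", "GX-412"): (101.8, 244.3),
    ("600 dpi", "GX-424"): (164.0, 370.0),
    ("1200 dpi", "GX-412"): (27.6, 76.6),
    ("1200 dpi", "GX-424"): (44.0, 111.0),
}


def speedup(legacy, accelerated):
    """Throughput ratio: how many times more pages per minute."""
    return accelerated / legacy


speedups = {key: round(speedup(*values), 1) for key, values in PPM.items()}
for (dpi, device), factor in speedups.items():
    print(f"{device} @ {dpi}: {factor}x")
```

Under that pairing the ratios round to 2.4x, 2.3x, 2.8x and 2.5x, matching the labels on the slide in order.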
17. GPU compute can free up CPU for other value added tasks⁴
RIP Pipeline acceleration: CPU Load Reduction
CPU Load: Average load across all 4 CPU cores of G-series devices under test
[Chart: Average CPU Load - Test case 2 @ 600 DPI*. X axis: PPM (30 to 150). Y axis: % CPU Load (Avg), 0 to 60. Series: Legacy code (no GPU accl) and GPU accelerated code, each for GX-424 and GX-412.]
[Chart: Average CPU Load - Test case 2 @ 1200 DPI*. X axis: PPM (5 to 40). Y axis: % CPU Load (Avg), 0 to 80. Series: Legacy code (no GPU accl) and GPU accelerated code, each for GX-424 and GX-412.]
18. Optical Character Recognition: Tesseract Project Accelerated using GPU
Tesseract Flow Optical Character Recognition (OCR) Project
Tesseract: Open source Optical Character Recognition (OCR) engine
GPU Compute for OCR
• Most of the image preprocessing and character recognition is GPU friendly
• The data structures in the word recognition phase are not very GPU friendly
Expected Future Improvements
• Deep Neural Network (DNN) for character recognition
19. Optical Character Recognition: Demo Performance
Processing time measured for above modules with CPU processing and GPU accelerated processing⁵
                         AMD APU 95W          AMD APU 35W
                         (Time in seconds)    (Time in seconds)
Non OpenCL (CPU only)    23.65                46.2
OpenCL (GPU Compute)     16.79                36.3
Gain                     41%                  27%
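The "Gain" row is consistent with a throughput-style gain (CPU time divided by GPU time, minus one) rather than a reduction in elapsed time. The slide does not define the term, so the sketch below computes both readings from the table values to make the difference explicit.

```python
def speedup_gain_percent(cpu_seconds, gpu_seconds):
    """How much more work per unit time: (t_cpu / t_gpu - 1) * 100."""
    return (cpu_seconds / gpu_seconds - 1.0) * 100.0


def time_reduction_percent(cpu_seconds, gpu_seconds):
    """How much less elapsed time: (1 - t_gpu / t_cpu) * 100."""
    return (1.0 - gpu_seconds / cpu_seconds) * 100.0


# Times in seconds from the table: (CPU only, OpenCL).
results = {"AMD APU 95W": (23.65, 16.79), "AMD APU 35W": (46.2, 36.3)}

for name, (cpu, gpu) in results.items():
    gain = round(speedup_gain_percent(cpu, gpu))
    saved = round(time_reduction_percent(cpu, gpu))
    print(f"{name}: gain {gain}%, elapsed time reduced by {saved}%")
```

The first reading reproduces the 41% and 27% shown in the table; the same measurements correspond to roughly 29% and 21% less elapsed time.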
20. Core Scan Processing Algorithms
• AMD worked with a customer to accelerate a partial scan pipeline using OpenCL on AMD APU and GPU
• The scan pipeline includes several image processing algorithms such as grayscale conversion, edge detection, rotation, and color conversion
• GPU compute can deliver significant improvement in processing time compared to CPU based processing⁶
– Translates to faster scan time and higher scan ppm
Iterative algorithm optimization on AMD APU (execution time)
Algorithm             CPU Optimized*    OpenCL Optimized    OpenCL Optimized Fused Code
Grayscale             13.5 ms           4.6 ms (2.9x)
Median                25.6 ms           3.1 ms (8.3x)
Grayscale + Median    39.1 ms           7.9 ms (5.0x)       5.9 ms (6.6x)
23. The Future is bright with GPU Compute
Improve quality of human care with improved accuracy
Empower new experiences with next generation technology
Enhance performance while reducing system cost
24. Endnotes
¹ Testing by AMD performance labs. Measured performance of OpenCL™ implementation of FK Stolts algorithm on AMD APU and AMD FirePro™ GPU.
System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 15.200.1045-150622a
² Testing by AMD performance labs. Measured performance of Speckle Noise Reduction pipeline with and without GPU acceleration, multi-threaded CPU compiler option. Image size: 768 x 252, active ROI was 712 x 252.
System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 16.20-160405a-301215E
³ Testing by AMD performance labs. Measured performance of Sobel Filter with and without GPU acceleration. 8.2 Multi Threaded Library. Image resolution: 1920x1080. Sobel filter size: 5x5
System Configuration: Advantech ComE board with Windows 7 64-bit, AMD RX425BB, 35W, 2.5/3.4 GHz, 1866 MHz DDR3, 4GB RAM, AMD driver version: 14.502.1001.1001, OpenCL 1.2
⁴ Testing by AMD performance labs. Measured performance of Raster Image Processing with and without GPU acceleration.
System Configuration: AMD GX-424CC: 25W, 2.4 GHz, 1866 MHz DDR3, 8GB RAM, AMD GX-412HC: 7W, 1.2 GHz, 1333 MHz DDR3, 8 GB RAM. Ubuntu 14.04 with AMD Catalyst Driver 14.301.1001
25. Endnotes
⁵ Testing by AMD performance labs. Measured performance of Optical Character Recognition using Tesseract open source code with and without GPU acceleration.
System Configuration: AMD APU 95W: AMD A10-7850K APU with Radeon™ HD Graphics, 3.7/4.0 GHz, AMD APU 35W: AMD A10-7400P APU with Radeon™ HD Graphics, 2.7/3.6 GHz. Windows® 8.1, OpenCL™ 1.2, version 1084.4
⁶ Testing by AMD performance labs. Measured performance of scan pipeline performance using proprietary customer code with and without GPU acceleration.
System Configuration: AMD Olive Hill+ development board, AMD RX427BB: 25W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Windows 8.1, AMD Catalyst 14.29 drivers and OpenCL™ 1.2
26. Endnotes
⁷ Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.
System Configuration: AMD Olive Hill+ development board with AMD RX427BB: 35W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Ubuntu 14.04 and AMD Catalyst driver 14.29
⁸ Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.
System Configuration: 2015 MacBook Pro with Intel Core i7-4980HQ 2.8 GHz, 16 GB DDR3L RAM. AMD Radeon™ R9 M370X Graphics, 2GB GDDR5, Mac OS X 10.10.3. AMD Catalyst 14.29