GPU Compute in Medical and Print Imaging

Amey Deosthali
Director, Embedded Imaging

Medical Imaging Trends
SYSTEM OPTIMIZATION AND MINIATURIZATION
• High-end systems of yesterday becoming portables of today
INCREASED USE OF 3D/4D IMAGING
• Advances in visualization and increased use of 3D/4D imaging for improved diagnosis
INTEGRATION OF MODALITIES & ADVANCED FEATURES
• Endoscopic ultrasound, augmented reality, robotic endoscopy
INCREASED SYSTEM COST PRESSURES
• Expanding emerging markets, regulatory pressures, increased competition

Print Imaging Trends
• Traditional Multi-Function Printer Architecture
• GPU Compute based Multi-Function Printer Architecture: SoC with GPU
SCALABLE SOFTWARE, SCALABLE ARCHITECTURE, SYSTEM COST SAVINGS

GPU Compute and AMD APU
GPU Compute in Imaging
• Medical and Print Imaging workloads are well suited for GPU compute
• AMD APUs integrate a GPU with support for Heterogeneous System Architecture (HSA)
• HSA can deliver significant benefits in the field of imaging

GPU COMPUTE IN MEDICAL IMAGING

Typical Ultrasound Imaging Pipeline
Front end
• Transducer
• Transmitter
• Receiver
• Beamforming
• IQ Demodulation
Echo Processing
• Filters: edge enhancement, speckle reduction
• Log Compression
• Envelope Detection
• Frame Averaging
• 2D Image formation
• Frequency/Time Compounding
Color Flow Processing
• Color flow analysis
• Velocity Estimation
• Wall Filter
• Spatial Doppler
Scan Conversion
Figure legend: GPU Friendly
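
Two of the echo-processing stages named above, envelope detection and log compression, are simple per-sample operations, which is one reason such stages map well to data-parallel hardware. The following is a minimal CPU-side sketch in plain Python, not the presenter's implementation. The function names, the 60 dB dynamic range and the 8-bit output scaling are assumptions chosen for illustration.

```python
import math

def envelope(i_samples, q_samples):
    """Envelope detection: magnitude of each complex IQ sample."""
    return [math.hypot(i, q) for i, q in zip(i_samples, q_samples)]

def log_compress(env, dynamic_range_db=60.0):
    """Map envelope values to 0..255 over an assumed dynamic range in dB."""
    peak = max(env)
    if peak <= 0.0:
        return [0] * len(env)
    out = []
    for e in env:
        # Level in dB relative to the peak; zero samples go to the floor.
        db = 20.0 * math.log10(e / peak) if e > 0.0 else -dynamic_range_db
        db = max(db, -dynamic_range_db)
        out.append(round(255 * (db + dynamic_range_db) / dynamic_range_db))
    return out

# Hypothetical scan line: I and Q components after demodulation.
i_line = [3.0, 0.3, 0.0, 0.003]
q_line = [4.0, 0.4, 0.0, 0.004]
pixels = log_compress(envelope(i_line, q_line))
print(pixels)
```

Each output sample depends only on its own input sample (plus one global peak), so every sample could be handled by an independent GPU work-item.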

GPU Compute for SW Beamforming
FASTER SCANS
• Reconstruct whole image plane
IMPROVED IMAGE QUALITY
• Evolution in algorithm complexity with GPU
ACCESS TO RAW DATA
• Fast data transfer and efficient use of system memory
SIMPLIFIED ARCHITECTURE
• Scalable SW defined architecture
• JESD-204b: 64-256 I/O channels
• Bridge: convert JESD-204b to PCIe
• Gen 3 PCIe® x16: dGMA support for 10+ GB/s
• GPU: coherent compounding
• GPU + CPU: post processing
Image Formation
Plane Wave Imaging
• FK Stolts with optimized FFT/iFFT
• IQ Demodulation and Log Compression
Image Post Processing
Separable Filters
• Sobel and Box filters
Non-separable Filter
• Laplacian of Gaussian
De-speckle Filter
• Median filter
Frequency Domain Filter
• Gaussian blur and Edge Enhancement filters
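
The distinction between separable and non-separable filters in the list above can be illustrated with a small sketch. A separable 2D filter can be applied as one row pass followed by one column pass, which for a 3x3 kernel costs 3 + 3 multiplies per pixel instead of 9. The code below is plain Python for illustration only. The function names and the clamped border handling are assumptions, not details from the slides.

```python
def filter_rows(img, k):
    """1D filter along each row; borders are handled by clamping indices."""
    r = len(k) // 2
    w = len(img[0])
    return [[sum(k[j + r] * row[min(max(x + j, 0), w - 1)]
                 for j in range(-r, r + 1))
             for x in range(w)]
            for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def separable_filter(img, k_row, k_col):
    """Row pass, then column pass (done as transpose, row pass, transpose)."""
    tmp = filter_rows(img, k_row)
    return transpose(filter_rows(transpose(tmp), k_col))

def filter_2d(img, kernel):
    """Direct 2D filtering with the same clamped borders, for comparison."""
    r = len(kernel) // 2
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + r][dx + r] * img[yy][xx]
            row.append(acc)
        out.append(row)
    return out

# Horizontal Sobel kernel written as the outer product of two 1D kernels.
k_row, k_col = [-1, 0, 1], [1, 2, 1]
sobel_x_2d = [[c * r for r in k_row] for c in k_col]
image = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [0, 9, 0, 9, 0], [7, 7, 1, 1, 3]]
print(separable_filter(image, k_row, k_col) == filter_2d(image, sobel_x_2d))
```

A Laplacian of Gaussian kernel cannot be written as a single outer product of two 1D kernels, which is why the slide lists it as non-separable.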

SW Beamforming on AMD APU
OpenCL™ implementation of FK Stolts algorithm
Data flow: Acquisition Device -> RF Data over Direct GMA (> 10 GB/s) -> Software Beamformer (iGPU or dGPU) -> 5x5 Median Filter -> Processed Output
Software Beamformer blocks, as labeled in the diagram: Transpose, 1D FFT, X Shift & Transpose, 1D FFT, Z Shift & Transpose, FK interpolation, 1D IFFT, Transpose, 1D IFFT
SW Beamformer Performance (see endnote 1)
Configuration                  APU        dGPU
256 Channel, 2048 Samples      1.95 ms    0.47 ms
128 Channel, 2048 Samples      1.15 ms    0.29 ms
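
To put the beamformer timings above in perspective, the per-frame times can be converted to an upper bound on frames per second and to an APU-to-dGPU time ratio. This is only arithmetic on the numbers quoted in the table. It assumes, for illustration, that beamforming is the only per-frame cost, which a real system would not satisfy.

```python
# Per-frame beamforming times in milliseconds, copied from the table above.
timings_ms = {
    (256, 2048): {"APU": 1.95, "dGPU": 0.47},
    (128, 2048): {"APU": 1.15, "dGPU": 0.29},
}

def frames_per_second(frame_time_ms):
    """Upper bound on frame rate if this stage were the only cost."""
    return 1000.0 / frame_time_ms

for (channels, samples), t in timings_ms.items():
    ratio = t["APU"] / t["dGPU"]
    print(f"{channels} ch x {samples} samples: "
          f"APU <= {frames_per_second(t['APU']):.0f} fps, "
          f"dGPU <= {frames_per_second(t['dGPU']):.0f} fps, "
          f"dGPU is {ratio:.1f}x faster")
```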

Speckle Noise Reduction
Input: IQ Demodulation Output
Output: Speckle Reduction Output
Processing blocks shown in the diagram:
• Down-sample by 2 (three instances)
• Up-sample by 2 (two instances)
• Subtract (two instances)
• Multiply with Coefficients
• Gamma Correction (three instances)
• Sobel Diffusion
• Pixel Correction

Speckle Noise Reduction Optimization
• Combine multiple functions into single kernel
  • Get more compute per byte of global memory access
  • Reduce kernel launch delay overheads
• Reduce use of temporary buffers and buffer copies
• Reduce CPU bottlenecks that require blocking calls by moving operations to GPU
• Optimize pipeline with "in order" enqueue of OpenCL commands
Pipeline blocks A to E: color conversion, edge detection, diffusion, normalization, gamma correction, image enhancement
CPU Path (4.10 ms): Block A, Block B, Block C, Block D, Block E, Downsample + memcpy
GPU Path (1.01 ms, see endnote 2): Block A & B (multiple OpenCL kernels), Block C & D (multiple OpenCL kernels), Block E (multiple OpenCL kernels), Downsample + optimized memcpy
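
The first two optimizations above (combining functions into one kernel and removing temporary buffers) can be sketched on the CPU side. The example below is a plain Python analogy rather than OpenCL, and the stage names and parameters are hypothetical. It chains normalization, gamma correction and 8-bit conversion, first as three passes with intermediate lists, then as one fused pass that reads each input value once and writes each output value once.

```python
def normalize(img, lo, hi):
    return [(p - lo) / (hi - lo) for p in img]

def gamma_correct(img, gamma):
    return [p ** (1.0 / gamma) for p in img]

def to_8bit(img):
    return [round(255 * p) for p in img]

def unfused(img, lo, hi, gamma):
    """Three passes over memory with two temporary buffers."""
    a = normalize(img, lo, hi)      # temporary buffer 1
    b = gamma_correct(a, gamma)     # temporary buffer 2
    return to_8bit(b)

def fused(img, lo, hi, gamma):
    """One pass: all per-pixel work happens between one read and one write."""
    inv_gamma = 1.0 / gamma
    return [round(255 * ((p - lo) / (hi - lo)) ** inv_gamma) for p in img]

# Hypothetical pixel values inside the range [lo, hi].
pixels_in = [10.0, 40.0, 75.5, 200.0, 250.0]
print(fused(pixels_in, 10.0, 250.0, 2.2) == unfused(pixels_in, 10.0, 250.0, 2.2))
```

In a GPU kernel the same idea keeps intermediate values in registers instead of global memory, which is the "more compute per byte of global memory access" point in the list above.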

Code Migration and Optimization Process
1. Profile: identify target workloads to convert
2. Convert: target workloads from CPU to GPU
3. Block Optimization: combine multiple CPU calls into a single OpenCL kernel
4. Buffer Optimization: reduce use of temporary buffers and buffer copies
5. Pipeline Optimization: move low workload CPU operations to GPU to reduce blocking calls
6. Reduce kernel launch delay: "in order" enqueue of OpenCL commands
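
Step 1 of the process above (profiling to find the workloads worth converting) can be approximated with a simple per-stage timer. This is a minimal sketch in plain Python. The helper name `profile_stages` and the two example stages are hypothetical, and a real project would more likely use a dedicated CPU or GPU profiler.

```python
import time

def profile_stages(stages, data, repeats=5):
    """Time each stage of a pipeline; return (name, best_seconds), slowest first.

    Each stage's output feeds the next stage, as in an imaging pipeline.
    """
    results = []
    for name, fn in stages:
        best = float("inf")
        out = data
        for _ in range(repeats):
            start = time.perf_counter()
            out = fn(data)
            best = min(best, time.perf_counter() - start)
        results.append((name, best))
        data = out
    return sorted(results, key=lambda r: r[1], reverse=True)

# Hypothetical two-stage pipeline on a flat list of pixel values.
example_stages = [
    ("to_float", lambda d: [float(v) for v in d]),
    ("square", lambda d: [v * v for v in d]),
]
ranking = profile_stages(example_stages, list(range(10000)))
for name, seconds in ranking:
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

The stages at the top of the ranking are the first candidates for conversion in step 2.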

Sobel Filter Optimization
Migrate Sobel filter to GPU with OpenCL
Pipeline stages (shown for both variants A and B): 8-bit Grayscale Image (1920x1080) -> Median Filter IPP -> 8 to 32-bit Float -> Sobel & Sobel Magnitude -> Max & Min
Figure legend: CPU Optimized Modules, GPU Optimized Modules (OpenCL Optimized)
Timings shown in the figure: 6.51 ms, 19.47 ms
2X faster computation time with migration of single module to GPU (see endnote 3)
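
The "Sobel & Sobel Magnitude" and "Max & Min" stages of the pipeline above can be sketched as follows. This is plain Python for illustration, not the measured OpenCL or library code. It uses the common 3x3 Sobel kernels as an assumption (the endnote for this slide reports a 5x5 filter), leaves border pixels at zero, and uses Python floats rather than 32-bit floats.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img):
    """Gradient magnitude of an 8-bit grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    f = [[float(p) for p in row] for row in img]   # integer to float step
    out = [[0.0] * w for _ in range(h)]            # border pixels stay 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = f[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            out[y][x] = math.hypot(gx, gy)
    return out

def min_max(img):
    """Global minimum and maximum, e.g. for a later normalization step."""
    flat = [p for row in img for p in row]
    return min(flat), max(flat)

# Hypothetical test image with a vertical edge between columns 1 and 2.
edge_image = [[0, 0, 255, 255] for _ in range(4)]
magnitude = sobel_magnitude(edge_image)
print(min_max(magnitude))
```

The per-pixel gradient is independent work, while the min and max step is a reduction, which on a GPU is usually written as a separate parallel reduction.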

Accelerated RIP Pipeline
Open source Ghostscript PostScript renderer accelerated using GPU (see endnote 4)
Platform: AMD G-Series Reference Board
Software Stack
• Ubuntu 14.04 Linux OS
• KMD GFX Driver
• OCL 2.0 Runtime, OGL 4.3 Runtime
• OCL Code, GLSL Libraries, C Libraries
Pipeline
• Input: PDF Files on Disk
• CPU operating in host memory: Ghostscript App (PDL Interpreter, Element Decompose, Generate Glyph Bitmaps), Bitmap
• DMA between host memory and device memory
• GPU operating in device memory: Planarize, Raster, Color Conversion, using OpenCL and GL Shader Language (GLSL)
• Output: Bitmap File on RAMdisk

GPU compute can deliver large increase in PPM performance (see endnote 4)
RIP Pipeline acceleration: PPM performance
PPM: Pages per Minute performance of Ghostscript RIP pipeline

PPM - Test case 2 @ 600 dpi
Device    Legacy code (no GPU accl)    GPU accelerated code    Speedup
GX-412    101.8                        244.3                   2.4x
GX-424    164                          370                     2.3x

PPM - Test case 2 @ 1200 dpi
Device    Legacy code (no GPU accl)    GPU accelerated code    Speedup
GX-412    27.6                         76.6                    2.8x
GX-424    44                           111                     2.5x
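
The speedup factors quoted on this slide follow directly from the PPM values: each factor is the GPU accelerated PPM divided by the legacy PPM for the same device and resolution. The short check below uses only the numbers from the slide. The dictionary layout and function name are illustrative choices.

```python
# (legacy PPM, GPU accelerated PPM) per resolution and device, from the slide.
ppm = {
    "600 dpi": {"GX-412": (101.8, 244.3), "GX-424": (164.0, 370.0)},
    "1200 dpi": {"GX-412": (27.6, 76.6), "GX-424": (44.0, 111.0)},
}

def speedup(legacy_ppm, accelerated_ppm):
    """Throughput ratio: how many times more pages per minute."""
    return accelerated_ppm / legacy_ppm

speedups = {
    (resolution, device): round(speedup(*pair), 1)
    for resolution, devices in ppm.items()
    for device, pair in devices.items()
}
for key, value in speedups.items():
    print(key, f"{value}x")
```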

GPU compute can free up CPU for other value added tasks (see endnote 4)
RIP Pipeline acceleration: CPU Load Reduction
CPU Load: Average load across all 4 CPU cores of G-series devices under test
[Figure: Average CPU Load - Test case 2 @ 600 DPI*. X axis: PPM (30 to 150). Y axis: % CPU Load (Avg), 0 to 60. Series: Legacy code (no GPU accl) on GX-424 and GX-412; GPU accelerated code on GX-424 and GX-412.]
[Figure: Average CPU Load - Test case 2 @ 1200 DPI*. X axis: PPM (5 to 40). Y axis: % CPU Load (Avg), 0 to 80. Series: Legacy code (no GPU accl) on GX-424 and GX-412; GPU accelerated code on GX-424 and GX-412.]

Optical Character Recognition: Tesseract Project Accelerated using GPU
Optical Character Recognition (OCR) Project
• Tesseract: open source Optical Character Recognition (OCR) engine
GPU Compute for OCR
• Most of the image preprocessing and character recognition is GPU friendly
• The data structures in the word recognition phase are not very GPU friendly
Expected Future Improvements
• Deep Neural Network (DNN) for character recognition
[Figure: Tesseract Flow]

Optical Character Recognition: Demo Performance
Processing time measured for above modules with CPU processing and GPU accelerated processing (see endnote 5)

                         AMD APU 95W          AMD APU 35W
                         (Time in seconds)    (Time in seconds)
Non OpenCL (CPU only)    23.65                46.2
OpenCL (GPU Compute)     16.79                36.3
Gain                     41%                  27%
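
The "Gain" row can be reproduced from the two timing rows if gain is read as the throughput increase, that is, CPU time divided by GPU time, minus one. That reading is an inference from the numbers, not something the slide states. The alternative reading (percentage of time saved) gives different values, as the sketch shows.

```python
def gain_percent(cpu_seconds, gpu_seconds):
    """Throughput gain: how much more work per unit time, in percent."""
    return (cpu_seconds / gpu_seconds - 1.0) * 100.0

def time_saved_percent(cpu_seconds, gpu_seconds):
    """Reduction in processing time, in percent (a different metric)."""
    return (1.0 - gpu_seconds / cpu_seconds) * 100.0

# Timings in seconds from the table above.
ocr_results = {"AMD APU 95W": (23.65, 16.79), "AMD APU 35W": (46.2, 36.3)}
for system, (cpu_s, gpu_s) in ocr_results.items():
    print(f"{system}: gain {gain_percent(cpu_s, gpu_s):.0f}%, "
          f"time saved {time_saved_percent(cpu_s, gpu_s):.0f}%")
```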

Core Scan Processing Algorithms
• AMD worked with a customer to accelerate a partial scan pipeline using OpenCL on AMD APU and GPU
• Scan pipeline includes several image processing algorithms such as grayscale conversion, edge detection, rotation, color conversion etc.
• GPU compute can deliver significant improvement in processing time compared to CPU based processing (see endnote 6)
  • Translates to faster scan time and higher scan PPM

Iterative algorithm optimization on AMD APU
Algorithm             CPU Optimized        OpenCL Optimized     OpenCL Optimized Fused Code
                      (Execution Time)*    (Execution Time)     (Execution Time)
Grayscale             13.5 ms              4.6 ms (2.9x)
Median                25.6 ms              3.1 ms (8.3x)
Grayscale + Median    39.1 ms              7.9 ms (5.0x)        5.9 ms (6.6x)
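
The last table row compares running grayscale conversion and median filtering as two steps with running them as fused code. The sketch below shows one way such a fusion can look, in plain Python rather than OpenCL. The integer luma weights, the 3x3 window and the unfiltered borders are assumptions for illustration, and the customer code measured on the slide is proprietary and not shown here.

```python
def gray(pixel):
    """Integer luma approximation; the weights sum to 256 (an assumption)."""
    r, g, b = pixel
    return (77 * r + 150 * g + 29 * b) >> 8

def median3x3(img):
    """3x3 median filter; border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sorted(window)[4]
    return out

def two_pass(rgb):
    """Grayscale into an intermediate buffer, then median filter it."""
    gray_img = [[gray(p) for p in row] for row in rgb]
    return median3x3(gray_img)

def fused_gray_median(rgb):
    """Single pass: convert to gray while gathering each median window."""
    h, w = len(rgb), len(rgb[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if 0 < y < h - 1 and 0 < x < w - 1:
                window = sorted(gray(rgb[y + dy][x + dx])
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1))
                out[y][x] = window[4]
            else:
                out[y][x] = gray(rgb[y][x])
    return out

# Hypothetical 4x5 RGB test image with deterministic values.
rgb_image = [[((x * 40 + y * 10) % 256, (x * 15 + y * 70) % 256, (x * y * 30) % 256)
              for x in range(5)] for y in range(4)]
print(fused_gray_median(rgb_image) == two_pass(rgb_image))
```

The fused version avoids the intermediate grayscale buffer at the cost of recomputing the conversion for overlapping windows. A GPU kernel would typically stage a tile of converted pixels in local memory to avoid that recomputation.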

Partial scan pipeline acceleration (see endnotes 7 and 8)
• Color Conversion
• Document Detect and Alignment correction
• Quality Improvement

The Future is bright with GPU Compute
• Improve quality of human care with improved accuracy
• Empower new experiences with next generation technology
• Enhance performance while reducing system cost

Endnotes
1. Testing by AMD performance labs. Measured performance of OpenCL™ implementation of FK Stolts algorithm on AMD APU and AMD FirePro GPU.
System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 15.200.1045-150622a
2. Testing by AMD performance labs. Measured performance of Speckle Noise Reduction pipeline with and without GPU acceleration, multi-threaded CPU compiler option. Image size: 768 x 252, active ROI was 712 x 252.
System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 16.20-160405a-301215E
3. Testing by AMD performance labs. Measured performance of Sobel Filter with and without GPU acceleration. 8.2 Multi Threaded Library. Image resolution: 1920x1080. Sobel filter size: 5x5
System Configuration: Advantech ComE board with Windows 7 64-bit, AMD RX425BB, 35W, 2.5/3.4 GHz, 1866 MHz DDR3, 4GB RAM, AMD driver version: 14.502.1001.1001, OpenCL 1.2
4. Testing by AMD performance labs. Measured performance of Raster Image Processing with and without GPU acceleration.
System Configuration: AMD GX-424CC: 25W, 2.4 GHz, 1866 MHz DDR3, 8GB RAM, AMD GX-412HC: 7W, 1.2 GHz, 1333 MHz DDR3, 8 GB RAM. Ubuntu 14.04 with AMD Catalyst Driver 14.301.1001

5. Testing by AMD performance labs. Measured performance of Optical Character Recognition using Tesseract open source code with and without GPU acceleration.
System Configuration: AMD APU 95W: AMD A10-7850K APU with Radeon™ HD Graphics, 3.7/4.0 GHz, AMD APU 35W: AMD A10-7400P APU with Radeon™ HD Graphics, 2.7/3.6 GHz. Windows® 8.1, OpenCL™ 1.2, version 1084.4
6. Testing by AMD performance labs. Measured performance of scan pipeline performance using proprietary customer code with and without GPU acceleration.
System Configuration: AMD Olive Hill+ development board, AMD RX427BB: 25W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Windows 8.1, AMD Catalyst 14.29 drivers and OpenCL™ 1.2

7. Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.
System Configuration: AMD Olive Hill+ development board with AMD RX427BB: 35W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Ubuntu 14.04 and AMD Catalyst driver 14.29
8. Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code.
System Configuration: 2015 MacBook Pro with Intel Core i7-4980HQ 2.8 GHz, 16 GB DDR3L RAM. AMD Radeon™ R9 M370X Graphics, 2GB GDDR5, Mac OS X 10.10.3. AMD Catalyst 14.29

Disclaimer
The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has
been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under
no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with
respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied
warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware,
software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted
by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between
the parties or in AMD's Standard Terms and Conditions of Sale.
AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into
the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product
could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to
discontinue or make changes to its products at any time without notice.
AMD does not provide a license/sublicense to any intellectual property rights relating to any standards, including but not limited to any
audio and/or video codec technologies such as AVC/H.264/MPEG-4, AVC, VC-1, MPEG-2, and DivX/xVid.
AMD, the AMD Arrow logo, AMD Catalyst, AMD CrossFire, AMD CrossFireX, AMD Radeon, ATI Radeon, and combinations thereof are
trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be
trademarks of their respective companies.
Windows and DirectX are registered trademarks of Microsoft Corporation. ARM is a registered trademark of ARM Limited. 3DMark is a
trademark of Futuremark Corporation. DivX is a registered trademark of DivX, Inc. HDMI is a trademark of HDMI Licensing, LLC. Linux is a
registered trademark of Linus Torvalds. OpenCL is a trademark of Apple Inc. used by permission of Khronos. PCIe and PCI Express are
registered trademarks of PCI-SIG Corporation.
© 2016 Advanced Micro Devices, Inc. All rights reserved.
