Accelerating Full Waveform Inversion via OpenCL on AMD GPUs - Case Study / Webinar
 


To view the corresponding video, please visit: http://bit.ly/1iBiW17

This webinar takes you through a case study of accelerating a seismic algorithm on a cluster of AMD GPU compute nodes for a geophysical software provider. Acceleware Product Manager Chris Mason presents a programming example, step-by-step project phase profiling, optimization techniques, a look at the strategy behind taking advantage of the massively parallel GPU architecture, and run time performance results.

Chris has eight years of experience developing commercial applications for the GPU and multi-core CPUs. His previous experience also includes parallelization of algorithms on digital signal processors (DSPs) for cellular phones and base stations. His specialty is in electromagnetic simulations, medical imaging, signal processing and linear algebra.

Sign up for the developer newsletter and learn about future webinars here: http://bit.ly/176wril
For more training options from Acceleware, visit http://bit.ly/MRn6Gn
Share your ideas with other developers at http://bit.ly/P5ohUo

    Presentation Transcript

    • Case Study: Accelerating Full Waveform Inversion via OpenCL™ on AMD GPUs
      Chris Mason, Acceleware Product Manager
      March 5, 2014
      ©2014 Acceleware Ltd. All rights reserved.
    • About Acceleware
      • Software and services company specializing in HPC product development, developer training and consulting services
      • OpenCL training for AMD GPUs
        • Progressive lectures and hands-on lab exercises
        • Experienced instructors
        • Delivered worldwide
      • High performance consulting
        • Feasibility studies
        • Porting and optimization
        • Code commercialization
    • Acceleware Software
      • Seismic Applications
        • Survey design and 3D modeling
        • Reverse Time Migration
      • Electromagnetics
        • FDTD Solver
      • Radio Frequency Heating
        • Simulation application for the RF heating of hydrocarbon reserves
    • Outline
      • What is Full Waveform Inversion?
      • The Project
      • OpenCL
      • Optimizations
        • Coalescing
        • Iterative kernel for stencil operations
        • Fusing kernels together to eliminate redundant memory accesses
      • Key Performance Results
    • What is Full Waveform Inversion?
      • Seismic inversion technique
      • Used to build Earth models from recorded seismic data
      • Uses a finite-difference solution to the acoustic wave equation
      • Computationally expensive
    • What is FWI?
      [Figure: velocity models, from a basic starting point to an accurate velocity model]
    • FWI Algorithm
      • Initial model estimate
      • Forward propagate source → residuals
      • Back propagate residuals → gradient
      • Forward propagation(s) → step length
      • Update model
      • Increase frequency
      • Loops: over shots, over frequencies, and until convergence
    • FWI Compute Cost
      • Cluster size of 10s to 100s of CPU nodes
      • Many days of runtime
      • Accuracy and quality reduced to keep runtime acceptable
    • The Project
      • GeoTomo develops high-end geophysical software products that help geophysicists around the world image the subsurface
      • GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU FWI solution
      • GeoTomo required their FWI application to run faster so they could deliver results to their clients sooner
        • Looked to AMD GPUs to potentially accelerate their FWI and approached Acceleware for help to make it happen
    • Why use GPUs? Performance!

                                            AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
      Memory Bandwidth                      59.7 GB/s             264 GB/s            480 GB/s
      Peak Gflops (single)                  ~410                  4000                5910
      Peak Gflops (double)                  ~205                  1000                1480
      Total Memory                          >>6 GB                6 GB                6 GB
      Power Consumption                     140 W                 274 W               375 W
      Gflops per Watt (single precision)    <3                    14.59               15.76
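The last row of the table follows from the two rows above it: peak single-precision Gflops divided by power consumption. A minimal sketch of that arithmetic is below; the function name `gflops_per_watt` is hypothetical and is not part of the webinar material.

```c
#include <assert.h>

/* Energy efficiency as peak Gflops divided by power draw in watts.
   Hypothetical helper for illustration only. */
double gflops_per_watt(double peak_gflops, double watts)
{
    assert(watts > 0.0);
    return peak_gflops / watts;
}
```

For example, `gflops_per_watt(5910.0, 375.0)` reproduces the 15.76 listed for the S10000, and `gflops_per_watt(410.0, 140.0)` stays below 3 for the Opteron.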
    • OpenCL Overview
      • Parallel computing architecture standardized by the Khronos Group
      • OpenCL:
        • Is a royalty-free standard
        • Provides an API to coordinate parallel computation across heterogeneous processors
          • Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads
        • Defines a cross-platform programming language
        • Used on handheld/embedded devices through supercomputers
    • OpenCL Programming Model
      • Heterogeneous model, including provisions for a host connected to one or more devices
        • Example: GPUs, CPUs
      [Diagram: a host connected to Device 1 (GPU), Device 2 (GPU), …, Device N (GPU)]
    • The OpenCL Programming Model
      • Data-parallel portions of an algorithm are executed on the device as kernels
        • Kernels are C functions with some restrictions and a few language extensions
        • Many (parallel) work-items execute the kernel
      • The host executes serial code between device kernel launches
        • Memory management
        • Data exchange to/from device (usually)
        • Error handling
      [Diagram: the host alternates with two kernel launches on the device; each launch is an ND range of work-groups, one laid out as Work-Group (0,0) to (1,2) and the other as Work-Group (0,0) to (2,1)]
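The ND range and work-group layout above determines each work-item's index. As a rough sketch in plain C (not OpenCL code, and with hypothetical function names), the helpers below show the one-dimensional relationship between group index, local index and global index assuming a zero global offset, plus the common idiom of rounding the launch size up to a multiple of the work-group size, which OpenCL 1.x requires when a work-group size is given explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Round a problem size up to the next multiple of the work-group size. */
size_t padded_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

/* Global index of a work-item along one dimension, assuming a zero
   global offset: group index * work-group size + index within the group. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

When the launch is padded this way, the extra work-items have indices at or beyond the problem size, so a kernel would normally need a bounds check before touching memory.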
    • OpenCL Memory Model
      • OpenCL kernels have access to four distinct memory regions:
        • Global
          • Allows read/write access from all work-items in all work-groups
          • Persistent across kernels
        • Local
          • Memory that is shared by all work-items within a work-group
        • Constant
          • Region of memory that remains constant (read-only) during the execution of a kernel
        • Private
          • Memory that is private to a work-item
      • OpenCL vendors map memory regions into physical resources
        • Local, constant and private memory usually have several orders of magnitude lower capacity than global memory, but are orders of magnitude faster
    • OpenCL Syntax – Memory Spaces
      • Host and device have separate memory spaces
        • Data is explicitly moved between them
          • Typically over the PCIe bus
      • Host functions to allocate, copy, and free memory on device, e.g.
        • clCreateBuffer()
        • clEnqueueReadBuffer()
        • clEnqueueWriteBuffer()
        • clReleaseMemObject()
    • Putting It All Together
      [Diagram: arrays A0 to A7, B0 to B7 and C0 to C7; operation Cx = Ax + Bx, one work-item per element]

          __kernel void VectorAdd(__global float* a,
                                  __global float* b,
                                  __global float* c)
          {
              // Each work-item has a unique index, typically used to index into arrays
              int idx = get_global_id(0);
              c[idx] = a[idx] + b[idx];
          }
    • Vector Add – Host Code

          void VectorAdd(float* aH, float* bH, float* cH, int N)
          {
              size_t N_BYTES = N * sizeof(float);
              size_t globalSize = N;  // the global work size is passed as a pointer to size_t

              // Device management code
              …

              // Allocate memory on device
              cl_mem aD = clCreateBuffer(…,N_BYTES, …);
              cl_mem bD = clCreateBuffer(…,N_BYTES, …);
              cl_mem cD = clCreateBuffer(…,N_BYTES, …);

              // Transfer input arrays to device
              clEnqueueWriteBuffer(...,aD,…,N_BYTES,aH,…);
              clEnqueueWriteBuffer(...,bD,…,N_BYTES,bH,…);

              // Pass kernel arguments and launch kernel
              …
              clEnqueueNDRangeKernel(…, &globalSize, …);

              // Transfer output array to host
              clEnqueueReadBuffer(...,cD,…,N_BYTES,cH,…);
          }
    • Project Steps
      • 1) Profiling
        • Acquired code, datasets and reference benchmarks from GeoTomo
        • Set up local machines with near-equivalent hardware, compiled code and confirmed reference benchmark numbers
        • Augmented code with timers to determine time spent in parallel regions and other areas of interest
    • Project Steps
      • 2) Feasibility Analysis
        • Investigated memory footprint for FWI jobs
          • GPU memory limited to 6GB per card
        • Investigated potential speedup / time to port code
          • Maximum speedup determined by time spent in parallel regions (Amdahl's Law)
          • Time to port dependent on feature set
            • E.g. domain decomposition across multiple GPUs
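Amdahl's Law, mentioned above, bounds the overall speedup by the fraction of runtime spent in the regions that can be accelerated: S = 1 / ((1 - p) + p / s), where p is the parallel fraction and s is the speedup of that fraction. The sketch below is illustrative only; the function names and the example fractions are assumptions, not figures from the GeoTomo profiling.

```c
#include <assert.h>

/* Overall speedup when a fraction p of the runtime is accelerated by s. */
double amdahl_speedup(double parallel_fraction, double parallel_speedup)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(parallel_speedup > 0.0);
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / parallel_speedup);
}

/* Upper bound on the speedup as the accelerated part becomes infinitely
   fast: 1 / (1 - p). Only defined for p < 1. */
double amdahl_limit(double parallel_fraction)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

With a hypothetical 90% of runtime in parallel regions and a 10x kernel speedup, the overall gain is only about 5.3x, which is why timing the parallel regions comes before any porting effort.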
    • Project Steps
      • 3) Implementation
        • Creating testing harnesses
        • Kernel implementation
        • Resolving hardware driver issues
        • Enabling multi-GPU device support
        • Optimization iterations
      • 4) Wrap-up
        • Delivery of port, along with installation documentation
        • Trained GeoTomo developer on OpenCL
    • Key GeoTomo Optimizations
      • 1) Coalescing
        • Changing memory access patterns in the kernels to those best suited for GPUs
          • Global memory is accessed via a request for a multi-byte word
          • Combine load/store requests from consecutive work-items to reduce the number of requested words
            • Fewer requests → less contention for global memory
          • Make one big multi-word burst request to global memory whenever possible
            • Contiguous bursts → less global memory overhead
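The effect of coalescing can be illustrated with a simplified counting model, written here in plain C rather than OpenCL. It assumes each of a group of consecutive work-items loads one float at element index i * stride, and that memory is fetched in aligned segments of a fixed size. The segment size, the group size and the function name are assumptions for illustration; real hardware behaviour depends on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct aligned segments touched when `work_items`
   consecutive work-items each load one float at element index i * stride.
   Addresses never decrease with i, so a new segment is touched exactly
   when the segment index changes. */
size_t segments_touched(size_t work_items, size_t stride, size_t segment_bytes)
{
    assert(segment_bytes > 0);
    size_t count = 0;
    size_t last_segment = (size_t)-1;   /* sentinel: nothing touched yet */
    for (size_t i = 0; i < work_items; ++i) {
        size_t byte_addr = i * stride * sizeof(float);
        size_t segment = byte_addr / segment_bytes;
        if (segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

Under this model, with 4-byte floats and a hypothetical 64-byte segment, 16 work-items reading consecutive elements (stride 1) touch a single segment, while a stride of 16 elements makes every work-item touch its own segment.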
    • Key GeoTomo Optimizations
      • 2) Iterative kernel for stencil operations
        [Diagram: input volumes * stencil kernels. Acknowledgement: Paulius Micikevicius, 2009]
        • Outputs are weighted combinations of surrounding elements from input volumes
        • Off-axis weights are zero
    • Key GeoTomo Optimizations
      • A naïve implementation would have each work-item read all of its neighboring elements directly from global memory
        • This can saturate GPU memory bandwidth, but the redundant reads hurt performance
    • Key GeoTomo Optimizations
      • Alternative: iterating over 2D slices along the slowest dimension
        • Each work-item is responsible for one column of the output array
        • The work-group caches a 2D plane of input in local memory
        • Work-items store inputs in the direction of iteration in registers
        • Reduces the required number of global memory reads significantly
      [Diagram: single work-item view, showing values held in registers and in local memory. Acknowledgement: Paulius Micikevicius, 2009]
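The sliding-window idea above can be sketched serially in plain C. This is not the GeoTomo kernel: it assumes a radius-1 (7-point) stencil with one center weight `c0` and one neighbor weight `c1`, a volume stored with x fastest, and it reads in-plane neighbors straight from the input array where a GPU version would read them from a local-memory tile. Each (x, y) column plays the role of one work-item, and the scalars `behind`, `current` and `front` stand in for registers.

```c
#include <assert.h>
#include <stddef.h>

/* 7-point stencil over an nx * ny * nz volume (x fastest, z slowest).
   Boundary points are left untouched. */
void stencil_sliding(const float *in, float *out,
                     size_t nx, size_t ny, size_t nz,
                     float c0, float c1)
{
    assert(in != out);
    if (nx < 3 || ny < 3 || nz < 3)
        return;                                  /* no interior points */
    size_t plane = nx * ny;
    for (size_t y = 1; y + 1 < ny; ++y) {
        for (size_t x = 1; x + 1 < nx; ++x) {    /* one "work-item" per column */
            size_t col = y * nx + x;
            float behind  = in[col];             /* value at z - 1 */
            float current = in[col + plane];     /* value at z     */
            for (size_t z = 1; z + 1 < nz; ++z) {
                size_t idx = z * plane + col;
                float front = in[idx + plane];   /* only new read along z */
                out[idx] = c0 * current
                         + c1 * (in[idx - 1]  + in[idx + 1]
                               + in[idx - nx] + in[idx + nx]
                               + behind + front);
                behind  = current;               /* slide the window by one plane */
                current = front;
            }
        }
    }
}
```

Along the iteration direction each input value is loaded once per column and then reused from the sliding window for up to three outputs, instead of being re-read for every output that needs it.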
    • Key GeoTomo Optimizations
      • 3) Kernel Fusion
        • Reduce redundant memory accesses by fusing kernels that operate on the same volume together
        • Improves performance by reducing redundant global memory reads
      • 4) Kernel Fission
        • Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification
        • Allows more work-items to run concurrently on the GPU, improving masking of global memory latency
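Kernel fusion can be illustrated with a toy example in plain C, where loops stand in for kernels. The two operations (a scale and an add) and all function names are hypothetical and chosen only because both read the same volume `v`; the actual fused FWI kernels are not shown in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused, pass 1: reads every element of v once. */
void scale_kernel(const float *v, float *scaled, size_t n, float s)
{
    for (size_t i = 0; i < n; ++i)
        scaled[i] = v[i] * s;
}

/* Unfused, pass 2: reads every element of v a second time. */
void add_kernel(const float *v, const float *w, float *sum, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        sum[i] = v[i] + w[i];
}

/* Fused: v[i] is loaded once and reused for both outputs. */
void fused_kernel(const float *v, const float *w,
                  float *scaled, float *sum, size_t n, float s)
{
    assert(n == 0 || (v && w && scaled && sum));
    for (size_t i = 0; i < n; ++i) {
        float vi = v[i];          /* single read of the shared volume */
        scaled[i] = vi * s;
        sum[i] = vi + w[i];
    }
}
```

In this sketch the unfused pair performs 3n input reads (v twice, w once) while the fused version performs 2n. Fission is the opposite trade: splitting a kernel costs extra memory traffic but can lower per-work-item register use.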
    • Performance Results
      • FWI 15 Hz, 15 shots
        • GPU version: 7997 seconds
        • CPU (5 cores per shot): 67086 seconds [8.4X]
        • CPU (30 cores per shot): 166948 seconds [20.9X]
      • GPU: Sapphire Radeon HD 7970 GHz Edition (6GB model)
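The bracketed factors above are consistent with each CPU runtime divided by the GPU runtime. A minimal sketch of that calculation follows; the function name `speedup` is hypothetical.

```c
#include <assert.h>

/* Speedup of the accelerated run relative to a reference run,
   with both runtimes in seconds. */
double speedup(double reference_seconds, double accelerated_seconds)
{
    assert(accelerated_seconds > 0.0);
    return reference_seconds / accelerated_seconds;
}
```

Here `speedup(67086.0, 7997.0)` is roughly 8.4 and `speedup(166948.0, 7997.0)` is roughly 20.9, matching the figures on the slide.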
    • Performance Results
      “Using GPU’s we can use higher frequencies and more if not all of the shots to improve the resolution and coverage.”
      James Jackson, President, GeoTomo
    • Questions? Contact Us
      • Tel: +1 403.249.9099
      • Email: services@acceleware.com
      • OpenCL Courses
        • June 3-6, 2014, Calgary, Canada
        • Private onsite classes also available
      • OpenCL Consulting
        • Feasibility studies
        • Code commercialization
        • Porting and optimization
        • Mentoring