
Following our successful participation at SIGGRAPH Asia 2012 in Singapore, the Khronos Group is excited to demonstrate and educate about Khronos APIs at SIGGRAPH Asia 2013 in Hong Kong. This presentation covers OpenCL and accelerated science, by Tomasz Bednarz, Computational Research Scientist at CSIRO.


Transcript

  • 1. Accelerated Science – use of OpenCL in Land Down Under Tomasz Bednarz Computational Research Scientist, CSIRO © Copyright Khronos Group & CSIRO, 2013 - Page 1
  • 2. Outline • CSIRO - Positive Impact • Bragg – the CSIRO GPU Cluster • CSIRO Accelerated Computing Service • MASSIVE • Radiation therapy • Level set segmentation • CFD • Cloud based imaging • Other projects • OpenCL for Game AI Pro • Sydney Khronos Chapter • ACW at OzViz 2013
  • 3. CSIRO - Positive Impact • In the world’s Top 10 institutions for 3 research fields. • Top 1% of global research institutions in 14 of 22 research fields. • Trusted advisor, innovative. • 2000 doctorates, 500 masters.
  • 4. Bragg – the CSIRO GPU Cluster
    • A bit of history
      - Heterogeneous compute cluster (CPUs + GPUs)
      - Funded by CSS TCP (2009)
      - First of its kind in AU
    • Continuously evolving:
      - GPUs upgraded in 2010 to NVIDIA Fermi Teslas
      - CPUs upgraded in 2012 to Intel Sandy Bridge, and a 3rd GPU added to each node
      - GPUs upgraded in 2013 to NVIDIA Kepler Teslas
    • Spec:
      - 128 dual Xeon 8-core E5-2650 nodes (= 2,048 cores)
      - 128 GB RAM, FDR10 InfiniBand interconnect
      - 132 Fermi Tesla M2050 GPUs (= 59,136 cores)
      - 254 Kepler Tesla K20 GPUs (= 633,984 cores)
      - 16 Intel SC5110P Xeon Phi (= 960 cores)
      - Ranked 289 on the Top500 list (as of June 2013)
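The core totals in the Bragg spec can be reproduced from per-device core counts. The sketch below is only a sanity check: the per-device figures (448 CUDA cores per Fermi M2050, 2,496 per Kepler Tesla, 60 cores per Xeon Phi) are assumptions taken from vendor specifications, not from the slide.

```python
# Per-device core counts are assumptions (vendor specs), not stated on the slide.
devices = {
    "Xeon E5-2650": (128 * 2, 8),    # 128 dual-socket nodes, 8 cores per CPU
    "Fermi Tesla M2050": (132, 448),
    "Kepler Tesla": (254, 2496),
    "Xeon Phi": (16, 60),
}

# Total cores = number of devices x cores per device.
totals = {name: count * cores for name, (count, cores) in devices.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} cores")
```

Under these assumptions the products match the totals quoted in the spec list.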
  • 5. CSIRO Accelerated Computing Service
    • About the Service
      - Supported by ASC Applications and Novel Technology Staff
      - Not limited to GPU support
      - Delivers: Accelerated Computing Projects; HPC Workshops and Training; Benchmarking
    • Activities include
      - Porting and Parallelising
      - Targeting HPC clusters, GPUs
      - Profiling, Performance Tuning, Debugging, Numerical Error Analysis, Algorithmic Improvements
    • OpenCL, CUDA, Python, R, OpenMP, MPI, C/C++, OpenACC, Fortran
    • Sam Moskwa – ACW at OzViz 2011
  • 6. MASSIVE - Two high performance computing facilities, located at the Australian Synchrotron and Monash University. - Specialised imaging and visualisation software and databases. - Expertise in visualisation, image processing, image analysis, HPC and GPU computing. - Training and Outreach program. - NCI Specialised Facility for Imaging and Visualisation. For more info contact Dr Wojtek James Goscinski
  • 7. Scientific Visualisation – CT reconstruction. Insect CT scan, rendered using Drishti (http://anusf.anu.edu.au/Vizlab/drishti/) by Sherry Mayo (CSIRO)
  • 8. Radiation therapy applications
    • Modern radiation therapy is to a large extent a computational discipline and can greatly benefit from the use of task- and data-parallelism. Some applications have already been demonstrated on GPUs:
      - CT reconstructions
      - Image registrations
      - Treatment planning
      - Dose computations (e.g. X Gu, U Jelen et al 2011 PMB 56)
      (Adapted from Schlegel and Mahr 2001)
    • Need for speed: imaging and treatment verification can be used as feedback to improve the treatment (adaptive radiotherapy), currently offline (mostly population-based), one day online.
    • Particle (proton/carbon ion) therapy with raster scanning @ University of Marburg:
      - most precise external beam technique (only 5 centers worldwide: 3 active, 2 to start)
      - increased precision = increased need for verification (more computations)
      - longer computational times (small head case: 1 hour on a single thread)
    • Collaborative project between CSIRO and University of Marburg (Ammazzalorso, Bednarz, Jelen)
  • 9. Plan robustness in radiation therapy
    • Automatic discovery of robust beam setups.
    • Results (mean and sd for a single beam):
      - 4-core Intel Xeon W3530 2.8GHz, 12GB RAM + NVIDIA Tesla C2050, 3GB RAM
      - 10 skull base cases, 42 beam directions (10 runs each for timing stats)
      - 4k-40k pencils of 120-350 samples, 2 mm analysis radius (0.5 mm step)
      - Single-precision floating-point operations only (sufficient precision)

      Times in ms, mean (sd); each Gain column refers to the OpenCL column on its left

      Case   Native (1 thread)   GPU OpenCL   Gain         CPU OpenCL    Gain
      P1     21299 (6628)        219 (109)    119 x (36)   6498 (1996)   3.3 x (0.0)
      P2      9891 (2837)        122 (51)      98 x (34)   2552 (615)    3.8 x (0.4)
      P3      6258 (1485)         88 (38)      87 x (30)   1898 (438)    3.3 x (0.0)
      P4     15768 (4959)        148 (56)     123 x (36)   4810 (1495)   3.3 x (0.0)
      P5      4342 (1136)         61 (24)      83 x (25)   1324 (331)    3.3 x (0.0)
      P6     10888 (3179)        160 (65)      81 x (24)   3280 (944)    3.3 x (0.0)
      P7     10117 (2849)        151 (64)      82 x (30)   3051 (841)    3.3 x (0.0)
      P8      5464 (1470)         52 (22)     124 x (42)   1396 (310)    3.9 x (0.4)
      P9      8155 (2195)        109 (46)      90 x (31)   2481 (649)    3.3 x (0.0)
      P10    11388 (3936)        126 (61)     106 x (29)   2935 (818)    3.8 x (0.4)
      Pool   10357 (5941)        124 (75)      99 x (36)   3022 (1798)   3.5 x (0.3)

    • F. Ammazzalorso (Uni-Marburg), T. Bednarz (CSIRO) and U. Jelen (Uni-Marburg)
      - Oral poster presentation at Int. Conference on the Use of Computers in Radiation Therapy ICCR2013 (Melbourne, 6-9 May 2013)
      - Accepted for journal publication in IOP JPCS (upcoming)
  • 10. Fast particle therapy dose computation
    • How can OpenCL 2.0 help?
      - Radiation therapy (RT) needs both task- and data-parallelism, which makes it a perfect potential application field for heterogeneous parallel systems.
      - OpenCL 2.0 can help exploit these systems in RT applications and open the way to new and simpler usage scenarios.
      - The robust planning approach requires continuous exchange of large datasets: for now latency can be hidden with ad-hoc synchronization → great opportunity with OCL2.0 shared virtual memory (SVM).
      - Dose computation (especially for ions, where biological weighting is involved) is a more complex problem, which OCL2.0 dynamic parallelism (DP) can help break into simpler blocks, enabling scheduling of different kernels without host-side and communication overhead.
      - A combination of SVM and DP can open the way to treatment plan optimization on massively parallel architectures: large working datasets so far unsuitable for GPU computation (at decent resolutions) and complex scheduling to enable e.g. intermediate dose computation at each optimization step.
  • 11. Image segmentation • What is image segmentation? - Wiki: Segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as super-pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. HCA-Vision • Our Goal: fast, interactive and accurate segmentations even with noisy data
  • 12. Multiple Myeloma: deadly cancer of blood plasma cells
    • Facts
      - Despite a rash of new drugs and advances in stem-cell therapy, this rare bloodborne cancer is still an almost certain death sentence
      - A cure remains a long way off
      - Plasma cell cancer
      - Collections of abnormal cells accumulate in bones, where they cause bone lesions (abnormal areas of tissue), and in the bone marrow, where they interfere with the production of normal blood cells
      - The disease develops in 1–4 per 100,000 people per year.
    • Image: micro-CT scans reveal bone damage from myeloma in a 5T2MM-bearing mouse (Peter Croucher & Karin Vanderkerken).
  • 13. Level sets segmentation - applications
  • 14. Segmentation using Level Sets • Embed a seed surface in an image • Iteratively deform the surface along its normal according to local properties of the surface and the underlying image • Good: competitive accuracy compared to manual segmentation. • Advantages: arbitrarily complex shapes can be modeled, and topological changes such as merging and splitting are handled implicitly.
  • 15. Level Set Workflow
    • Governing equation:
      \[ \frac{\partial \phi}{\partial t} = -|\nabla \phi| \left[ \alpha D(x) + (1 - \alpha)\, \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} \right] \]
      where D(x) is the velocity term and \nabla \cdot (\nabla \phi / |\nabla \phi|) is the mean curvature (smoothing term).
    • Workflow:
      - Initialize φ0 to Signed Euclidean Distance Transform from mask m
      - Calculate Data Speed Term D(I)
      - Repeat Until Converged:
        - Calculate First and Second Order Spatial Derivatives
        - Calculate Curvature Terms
        - Calculate Gradient of φ
        - Calculate Speed Term F
        - Update Time Step, and Reinitialize φ if needed
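The "spatial derivatives" and "curvature terms" steps of the workflow can be illustrated on a small 2D grid. This is a minimal sketch, not the authors' OpenCL kernel: it evaluates the curvature term ∇·(∇φ/|∇φ|) with central differences, and the test field (a signed distance to a circle, negative inside) and the grid size are assumptions chosen because the exact curvature, 1/r, is known.

```python
import math

def curvature(phi, i, j, h=1.0):
    """Curvature div(grad(phi)/|grad(phi)|) at interior point (i, j), central differences."""
    # First-order derivatives
    px = (phi[i + 1][j] - phi[i - 1][j]) / (2 * h)
    py = (phi[i][j + 1] - phi[i][j - 1]) / (2 * h)
    # Second-order derivatives
    pxx = (phi[i + 1][j] - 2 * phi[i][j] + phi[i - 1][j]) / h**2
    pyy = (phi[i][j + 1] - 2 * phi[i][j] + phi[i][j - 1]) / h**2
    pxy = (phi[i + 1][j + 1] - phi[i + 1][j - 1]
           - phi[i - 1][j + 1] + phi[i - 1][j - 1]) / (4 * h * h)
    grad2 = px * px + py * py
    if grad2 < 1e-12:          # flat region: curvature undefined, return 0
        return 0.0
    return (pxx * py * py - 2 * px * py * pxy + pyy * px * px) / grad2**1.5

# Hypothetical test field: signed distance to a circle of radius R (negative inside).
n, c, R = 41, 20, 8.0
phi = [[math.hypot(i - c, j - c) - R for j in range(n)] for i in range(n)]

# Ten cells from the centre the exact curvature of the level set is 1/10.
kappa = curvature(phi, c + 10, c)
print(kappa)
```

On a GPU each grid point would typically be handled by its own work-item; the per-point arithmetic is the same as in this serial version.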
  • 16. [Figure: input image; applied mask; φ0 as SDF; evolution over iterations]
  • 17. Level Set 2D: Effect of α
    • Governing equation:
      \[ \frac{\partial \phi}{\partial t} = -|\nabla \phi| \left[ \alpha D(x) + (1 - \alpha)\, \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} \right] \]
    • [Figure: segmentation results for 1800 iterations, α = 0.001; 900 iterations, α = 0.06; 900 iterations, α = 0.95]
    • [Figure: bar chart of execution time (sec) for C2050 opt, C2050, Quadro FX 580 and Xeon E5520; bar labels 152.24 sec, 15.43 sec, 8.94 sec and 2.12 sec; max speedup ~70x]
  • 18. Level Set Method
    • Computational Problems
      - Computationally intensive for moderate sized data sets
      - SDF is not maintained
      - Explicit schemes give CFL time step restriction
    • Current Approach
      - OpenCL to accelerate adaptive timestepping PDE solver
      - OpenMPI to distribute processing to engage multiple GPUs
      - Qt – interactive user interface
      - Liar/C2Liar – Quantitative Imaging toolbox (CSIRO internal)
      - OpenGL + OpenCL for interactive real time volume rendering
    • “Safe Interactions” with bacteria: level set segmentation is also being researched to track biofilm formation.
    • Further Implementation Notes
      - The Level Set PDE solver is mapped across multiple GPUs using OpenCL and MPI
      - 3D volume split evenly among MPI processes
      - Ghost slices transmitted between processes at each iteration – minimal overhead
      - Simple boundary extension is performed at each iteration
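Two of the points above, the CFL time step restriction and the even split of the 3D volume with ghost slices, can be sketched in a few lines. The function names, the one-slice ghost width and the CFL factor of 0.5 are illustrative assumptions, not details taken from the CSIRO solver; the MPI exchange itself and the boundary extension are not shown.

```python
def split_slabs(nz, nprocs, ghost=1):
    """Split nz slices evenly over nprocs ranks.

    Returns one tuple per rank: (start, stop, ghost_start, ghost_stop).
    The rank owns slices [start, stop); [ghost_start, ghost_stop) also
    covers the neighbouring ghost slices, clipped to the volume.
    """
    base, extra = divmod(nz, nprocs)
    slabs, start = [], 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)   # spread the remainder
        stop = start + size
        slabs.append((start, stop, max(0, start - ghost), min(nz, stop + ghost)))
        start = stop
    return slabs

def cfl_time_step(max_speed, h=1.0, cfl=0.5):
    """Largest explicit time step dt such that max_speed * dt / h <= cfl."""
    return cfl * h / max_speed if max_speed > 0 else float("inf")

slabs = split_slabs(nz=10, nprocs=3)
dt = cfl_time_step(max_speed=2.0)
print(slabs, dt)
```

With adaptive timestepping, `max_speed` would be recomputed from the current speed field at every iteration, so the step grows when the front moves slowly.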
  • 19. Bone Model C4-data-set Interactive volume visualisation – OpenGL + OpenCL
  • 20. Bone Model C4-data-set Interactive volume visualisation – OpenGL + OpenCL
  • 21. GPU Speed-up Results [Figure: execution time (sec per 50 iterations) versus number of processes (3, 6, 9, 12), with separate GPU and CPU series]
  • 22. Cloud based image analysis and processing • Available now: http://cloudimaging.net.au, see our demos
  • 23. Cloud based image analysis and processing • Currently using the WebGL-based 3D viewer Slice:Drop http://slicedrop.com • Need for WebCL/WebGL for interactive parameter tuning
  • 24. Experiments with Fluids [Figure labels: Air; N2 gas; Air; Ozoe Lab, Candle; Wakayama Jet; Ozoe Lab, 10T magnet; Courtesy of High Field Magnet Laboratory, NL]
  • 25. Verification – Physics vs Simulations numerical PIT
  • 26. Methodology
    • Governing equations → Discretization → Implementation → Verification → Results → Transfer CPU code to heterogeneous architectures → Verification → Results
    • Methods tested
      - Finite Difference (for discretization)
      - HSMAC Method (for mutual iteration of pressure and velocity fields)
      - BFC, Boundary Fitted Coordinates (to solve N-S on complex geometry)
      - Upwind and UTOPIA (for getting more stable and accurate results)
      - OpenCL (speedup)
    • Speedups achieved: ~200-230x
    • Easily extended to 2D and 3D and applied to simulate different cases, e.g. smoke, free surface flows, ventilation, turbulent flows, etc.
    • See: https://www.massive.org.au/images/stories/workshops-and-tutorials/computational-fluid-dynamics.pdf
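To show what an upwind finite-difference step looks like, here is a minimal sketch for the 1D linear advection equation u_t + a u_x = 0 on a periodic grid. It is a stand-in chosen for brevity, not the Navier-Stokes/HSMAC solver from the slide, and the grid, velocity and initial profile are assumptions.

```python
def upwind_step(u, a, dt, dx):
    """One first-order upwind step for u_t + a * u_x = 0 with periodic boundaries."""
    c = a * dt / dx                      # Courant number; stable for |c| <= 1
    n = len(u)
    if c >= 0:
        # Flow to the right: difference against the left (upwind) neighbour.
        # u[i - 1] with i == 0 wraps to the last cell, giving periodicity.
        return [u[i] - c * (u[i] - u[i - 1]) for i in range(n)]
    # Flow to the left: difference against the right neighbour.
    return [u[i] - c * (u[(i + 1) % n] - u[i]) for i in range(n)]

# Hypothetical initial profile: a block of ones in cells 2..4.
u = [1.0 if 2 <= i < 5 else 0.0 for i in range(10)]

shifted = upwind_step(u, a=1.0, dt=1.0, dx=1.0)   # c = 1: exact shift by one cell
smeared = upwind_step(u, a=1.0, dt=0.5, dx=1.0)   # c = 0.5: numerical diffusion
print(shifted)
print(smeared)
```

The smearing at c = 0.5 is the numerical diffusion that higher-order schemes such as UTOPIA aim to reduce; each cell update is independent of the others, which is what makes the step easy to map to one work-item per cell.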
  • 27. Ground Penetrating Radar and FDTD
    • The finite-difference time-domain (FDTD) method is a computational technique used to model electromagnetic fields (in our case, radio waves). It approximates both the spatial and temporal derivatives that appear in Maxwell’s equations.
    • FDTD models how radio waves propagate through space and reflect at boundaries depending on the properties of the material.
    • FDTD code is used to generate synthetic data for a ground penetrating radar (GPR) system.
    • FDTD simulated data can be compared with data from a real radar system in a test environment with known geometry. The aim is to get the modelled information to match the real system so that predictions of unknown geometry can be made.
    • Implementation: based on original code by Andrew Strange. The OpenCL version used mixed-precision code, with single-precision image buffers used for constant matrices.
    • Contact: Josh Bowden, CSIRO ASC
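The leapfrog structure of FDTD is easiest to see in one dimension. The sketch below is a textbook-style 1D scheme in normalised units and is not the GPR code described above; the grid size, Courant number, source position and Gaussian pulse are all illustrative assumptions.

```python
import math

def fdtd_1d(n_cells=200, n_steps=80, source_cell=100, courant=0.5):
    """1D FDTD in free space, normalised units, staggered Ez/Hy leapfrog updates."""
    ez = [0.0] * n_cells      # electric field at integer grid points
    hy = [0.0] * n_cells      # magnetic field, staggered half a cell to the right
    for t in range(n_steps):
        # Update H from the spatial difference of E.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])
        # Update E from the spatial difference of H (ez[0] stays fixed at zero).
        for k in range(1, n_cells):
            ez[k] += courant * (hy[k] - hy[k - 1])
        # Additive ("soft") Gaussian source; pulse centre and width are assumptions.
        ez[source_cell] += math.exp(-((t - 30) / 10.0) ** 2)
    return ez, hy

ez, hy = fdtd_1d()
peak = max(abs(v) for v in ez)
print(f"peak |Ez| after 80 steps: {peak:.3f}")
```

Material properties would enter as per-cell coefficients in the two update loops, which is where reflections at boundaries between materials come from.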
  • 28. Other projects using OpenCL
    • PCA NIPALS algorithm – rapid estimation of wood properties
    • AWDA-DA – Ground water data assimilation
    • Finite Difference Time Domain (FDTD) – Ground Penetrating Radar
    • Sparse Hydrodynamic Ocean Code (SHOC)
    • Contact: Josh Bowden, CSIRO ASC
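For readers unfamiliar with NIPALS, the following is a minimal pure-Python sketch of how it extracts the first principal component by alternating between scores and loadings. It is a generic illustration on made-up data, not the CSIRO implementation; the function name, tolerance and iteration cap are assumptions, and it expects at least one column with non-zero variance.

```python
import math

def nipals_first_component(X, tol=1e-10, max_iter=500):
    """First principal component of the rows of X via NIPALS.

    Returns (scores t, unit-length loadings p) for the mean-centred data.
    """
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    # Start from the column with the largest sum of squares.
    j0 = max(range(m), key=lambda j: sum(r[j] ** 2 for r in Xc))
    t = [r[j0] for r in Xc]
    p = [0.0] * m
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: project the columns of Xc onto the scores, then normalise.
        p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project the rows of Xc onto the loadings.
        t_new = [sum(Xc[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol * sum(v * v for v in t):
            break
    return t, p

# Made-up data lying exactly on the line y = x.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
t, p = nipals_first_component(X)
print(p)
```

The two projection steps are matrix-vector products, which is the part that benefits from being offloaded to a GPU for large data sets.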
  • 29. GPGPU for Games • GPU heavily used for game physics - Performance vs Accuracy • Nvidia PhysX and FLEX - Particles, Fluids and Rigid Bodies on the GPU • Bullet Physics - Version 3.x to have 100% GPU support • Not all games require cutting edge graphics, which leaves processing time available for other areas - Audio Processing, Artificial Intelligence Conan Bourke AIE
  • 30. GPGPU Artificial Intelligence • Game Artificial Intelligence doesn’t easily lend itself to massively parallel implementations - Inter-Agent Communication - Complex decision making typically controlled by a non-programmer designer resulting in lots of dynamic branching - Numerous levels of world state data required • Some aspects do… - Flow Fields - Influence Maps - Path Finding - Steering Behaviors Conan Bourke, Tomasz Bednarz
  • 31. GPGPU Steering Behavior
    • Steering Behaviors are a set of commonly used algorithms for agent motion
      - Craig Reynolds “Steering Behaviors for Autonomous Characters” GDC1999
    • Easily adaptable to parallel implementations due to close similarities to particle physics
      - Position, Velocity etc
    • Simple brute-force algorithm can be implemented in OpenCL within a couple of hours for vast performance gains
      - Conan Bourke & Tomasz Bednarz “Introduction to GPGPU for AI”, Game AI Pro 2013
    • [Figure: chart comparing Intel i7-2670QM, Nvidia GTX 560 and Nvidia Tesla C2050; one axis with ticks from 512 to 16384, the other from 0 to 4000]
    • Conan Bourke, Tomasz Bednarz
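A brute-force steering computation of the kind mentioned above can be sketched as an all-pairs loop. This example implements only a separation force with an assumed 1/distance weighting; the function name, the weighting and the sample agents are illustrative and are not taken from the Game AI Pro chapter.

```python
import math

def separation_forces(positions, radius):
    """Brute-force O(N^2) separation: push each agent away from neighbours within radius."""
    forces = []
    for i, (xi, yi) in enumerate(positions):      # on a GPU: one work-item per agent
        fx = fy = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xi - xj, yi - yj
            d = math.hypot(dx, dy)
            if 0.0 < d < radius:
                # Unit vector away from the neighbour, weighted by 1/d.
                fx += dx / (d * d)
                fy += dy / (d * d)
        forces.append((fx, fy))
    return forces

# Hypothetical agents: two close together, one far away.
agents = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
forces = separation_forces(agents, radius=2.0)
print(forces)
```

The outer loop has no dependencies between iterations, which is why this kind of behaviour ports naturally to a kernel in which each work-item reads all positions and writes a single force.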
  • 32. Acknowledgments • Collaborators, coworkers, coauthors: John Taylor, Josh Bowden, Luke Domanski, Filippo Ammazzalorso, Urszula Jelen, Marc Piggott, Wojtek Goscinski, Conan Bourke, Tim Gureyev, Darren Thompson, NeCTAR RT035 Project team, Sam Moskwa, Matt Adcock and many others…
  • 33. Sydney Khronos Chapter
    • http://www.meetup.com/Sydney-GPU-Users
    • Last meetup, 17th Oct:
      - Tomasz Bednarz: Welcome, the Khronos Group
      - Conan Bourke: The Fall and Rise of OpenGL for Games
      - Rob Manson: The Augmented Web
      - Visit to the CSIRO Vis Lab
      - Pizza
  • 34. Accelerated Computing Workshop @ OzViz
    • https://sites.google.com/site/ozvizworkshop/ozviz-2013
    • ACW and OzViz 2013 to be held on 8th-10th December 2013 in Melbourne
      - Technologies: OpenCL, CUDA, etc.
      - Hardware architectures: GPU, APU, MIC, etc.
    • Previous events
      - OpenCL Workshop at OzViz 2010, Brisbane. Presenters: Mike Houston, Mark Harris, Derek Gerstmann, Tomasz Bednarz, Craig James, Con Caris, John Taylor
      - ACW at OzViz 2011, Sydney. Keynote: Prof. Takayuki Aoki
      - ACW at OzViz 2012, Perth