Debugging Numerical Simulations on Accelerated Architectures - TotalView for OpenPOWER, CUDA and OpenMP

ScicomP 2015 presentation discussing best practices for debugging CUDA and OpenACC applications, with a case study on our collaboration with LLNL to bring debugging to the OpenPOWER stack and OMPT.

Published in: Technology

  1. 1. TotalView for OpenPOWER, CUDA and OpenMP. Chris Gottbrath, Dean Stewart. May 18, 2015, ScicomP 2015
  2. 2. Agenda • Introduction • TotalView Overview • OpenMP, OMPT and OMPD • Current work and future plans • Questions and wrap-up © 2015 Rogue Wave Software, Inc. All Rights Reserved
  3. 3. Introduction
  4. 4. About Rogue Wave Software. Rogue Wave's proven technical solutions simplify the growing complexity of building and testing quality software code. Rogue Wave customers improve software quality and ensure code integrity, while shortening development cycle times. Founded: 1989 • Headquarters: Boulder, CO • Employees: 250 • Offices Worldwide: 9
  5. 5. Company timeline (1989 to 2015)
      Technology timeline:
      • 1989 Cross-platform commercial math & statistics libraries for C++
      • 1994 First commercial database library for C++
      • 2008 First graphical reverse debugger for C, C++ and Fortran on Linux
      • 2010 First GPU enabled commercial analytics in FORTRAN
      • 2011 First Infiniband Cluster-capable reverse debugger; first cache optimization product to market
      Corporate timeline:
      • 1989 Rogue Wave established; Tools.h++
      • 1996 Rogue Wave publicly listed on NASDAQ
      • 2003 Rogue Wave acquired by Quovadx
      • 2007 Rogue Wave spun out of Quovadx by Battery Ventures
      • 2009 Rogue Wave acquires TotalView and Visual Numerics
      • 2010 Rogue Wave acquires Acumem
      • 2012 Rogue Wave acquires Visualizations for C++; Audax Group acquires Rogue Wave
      • 2013 Rogue Wave acquires OpenLogic & Klocwork
  6. 6. Rogue Wave Solution Portfolio
  7. 7. HPC Trends • What do we see? – NVIDIA Tesla GP-GPU computational accelerators – Intel Xeon Phi Coprocessors – Complex memory hierarchies (NUMA, device vs host, etc.) – Custom languages such as CUDA and OpenCL – Directive based programming such as OpenACC and OpenMP – Core and thread counts going up • A lot of complexity to deal with if you want performance – C or Fortran with MPI starts to look “simple” – Everything is Multiple Languages / Parallel Paradigms – Up to 4 “kinds” of parallelism (cluster, thread, heterogeneous, vector) – Data movement and load balancing
  8. 8. How does Rogue Wave help? The TotalView debugger • Troubleshooting and analysis tool – Visibility into applications – Control over applications • Scalability • Usability • Support for HPC platforms and languages
  9. 9. TotalView Overview
  10. 10. What is TotalView®? Application Analysis and Debugging Tool: Code Confidently • Debug and Analyse C/C++ and Fortran on Linux™, Unix or Mac OS X • Laptops to supercomputers • Makes developing, maintaining, and supporting critical apps easier and less risky Major Features • Easy to learn graphical user interface with data visualization • Parallel Debugging – MPI, Pthreads, OpenMP™, Fortran Coarrays – CUDA™, OpenACC®, and Intel® Xeon Phi™ coprocessor • Low tool overhead and resource usage • Includes a Remote Display Client which frees you to work from anywhere • Memory Debugging with MemoryScape™ • Deterministic Replay Capability Included on Linux/x86-64 • Non-interactive Batch Debugging with TVScript and the CLI • TTF & C++View to transform user defined objects
  11. 11. What Is MemoryScape®? • Runtime Memory Analysis : Eliminate Memory Errors – Detects memory leaks before they are a problem – Explore heap memory usage with powerful analytical tools – Use for validation as part of a quality software development process • Major Features – Included in TotalView, or Standalone – Detects • Malloc API misuse • Memory leaks • Buffer overflows – Supports • C, C++, Fortran • Linux, Unix, and Mac OS X • Intel® Xeon Phi™ • MPI, pthreads, OMP, and remote apps – Low runtime overhead – Easy to use • Works with vendor libraries • No recompilation or instrumentation
  12. 12. Deterministic Replay Debugging • Reverse Debugging: Radically simplify your debugging – Captures and Deterministically Replays Execution • Not just “checkpoint and restart” – Eliminate the Restart Cycle and Hard-to-Reproduce Bugs – Step Back and Forward by Function, Line, or Instruction • Specifications – A feature included in TotalView on Linux x86 and x86-64 • No recompilation or instrumentation • Explore data and state in the past just like in a live process, including C++View transformations – Replay on Demand: enable it when you want it – Supports MPI on Ethernet, Infiniband, Cray XE Gemini – Supports Pthreads, and OpenMP – New: Save / Load Replay Information
  13. 13. Supported Platforms
      Platform | C/C++ Compilers | Fortran Compilers
      Linux x86 | GCC, Intel, PGI | Absoft, GNU, Intel, PGI
      Linux x86-64 | GCC, Intel, PGI | Absoft, GNU, Intel, PGI
      Power Linux | GCC, XL C++ | GNU, XL Fortran
      RS6000 Power AIX | GCC, XL C++ | XL Fortran
      BlueGene | GCC, XL C++ | XL Fortran
  14. 14. TotalView for the NVIDIA® GPU Accelerator • NVIDIA CUDA 6.0, 6.5 and 7.0 • Features and capabilities include – Support for dynamic parallelism – Support for MPI based clusters and multi-card configurations – Flexible Display and Navigation on the CUDA device • Physical (device, SM, Warp, Lane) • Logical (Grid, Block) tuples – CUDA device window reveals what is running where – Support for types and separate memory address spaces – Leverages CUDA memcheck
  15. 15. • The following 6 slides are from an SC14 tutorial by: • Damian Alvarez – d.alvarez.mallon@fz-juelich.de • Dr. Mike Ashworth – mike.ashworth@stfc.ac.uk • Vincent Betro, Ph. D. – vbetro@utk.edu • Chris Gottbrath – Chris.Gottbrath@roguewave.com • Nikolay Piskun, Ph.D. – Nikolay.Piskun@roguewave.com • Sandra Wienke – Wienke@itc.rwth-aachen.de 11.17.2014 SC ‘14
  16. 16. • Setting breakpoints in CUDA kernels – Start debugging (e.g. “Go”) – Message box when kernel is loaded: – Set kernel breakpoints as in host code
  17. 17. • Debugger thread IDs in Linux CUDA process – Host thread: positive no. – CUDA thread: negative no. • GPU thread navigation – Logical coordinates: blocks (3 dimensions), threads (3 dimensions) – Physical coordinates: device, SM, warp, lane – Only valid selections are permitted
  18. 18. • Single Stepping – Advances all GPU hardware threads within same warp – Stepping over a __syncthreads() call advances all threads within the block • Advancing more than just one warp – “Run To” a selected line number in the source pane – Set a breakpoint and “Continue” the process • Halt – Stops all the host and device threads
      [Figure: threads t0 to t31 and t32 to t63 each form a warp, a group of 32 threads with the same program counter (PC)]
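
The stepping rule above can be illustrated with a minimal Python sketch that maps a thread's logical coordinates to its warp and lane within a block. It assumes the usual CUDA linearization of a thread index (x varies fastest, then y, then z) and the 32-thread warp size shown on the slide. The function names are hypothetical and this is not TotalView code; the warp number computed here is the index within the block, not the hardware warp slot on an SM, which is assigned at run time.

```python
WARP_SIZE = 32  # warp size stated on the slide


def linear_thread_id(thread, block_dim):
    """Flatten an (x, y, z) thread index within a block; x varies fastest."""
    tx, ty, tz = thread
    dx, dy, _dz = block_dim
    return tx + ty * dx + tz * dx * dy


def warp_and_lane(thread, block_dim):
    """Return (warp index within the block, lane within the warp)."""
    return divmod(linear_thread_id(thread, block_dim), WARP_SIZE)


def threads_stepped_together(thread, block_dim):
    """List the thread indices that share a warp with `thread`."""
    dx, dy, dz = block_dim
    warp, _lane = warp_and_lane(thread, block_dim)
    return [
        (x, y, z)
        for z in range(dz)
        for y in range(dy)
        for x in range(dx)
        if warp_and_lane((x, y, z), block_dim)[0] == warp
    ]
```

In a one-dimensional block of 64 threads, for example, `threads_stepped_together((33, 0, 0), (64, 1, 1))` lists threads 32 through 63, which is the set a single step on thread 33 would advance under this model.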
  19. 19. • Displaying CUDA device properties – “Tools” - “CUDA Devices” – Helps mapping between logical & physical coordinates • PCs across SMs, warps, lanes – valid, active, divergent
      [Figure: program counter (PC) within each warp]
  20. 20. • Displaying GPU data – “Dive” into variable or watch “Type” in “Expression List” – Device memory spaces: “@” notation
      Storage Qualifier | Meaning of address
      @global | Offset within global storage
      @shared | Offset within shared storage
      @local | Offset within local storage
      @register | PTX register name
      @generic | Offset within generic address space (e.g. pointer to global, local or shared memory)
      @constant | Offset within constant storage
      @parameter | Offset within parameter storage
      (TV built-in type)
  21. 21. • Checking GPU memory – Enable “CUDA Memory checking” during startup or in the “Debug” menu – Detects global memory addressing violations and misaligned memory accesses • Further features – Multi-device support – Host-pinned memory support – MPI-CUDA applications Note: Recent cuda-memcheck versions are also able to detect race conditions: cuda-memcheck --tool racecheck <prog>
  22. 22. OpenMP, OMPT and OMPD
  23. 23. The Importance of OpenMP • Programming models are changing to accommodate changes in system architectures • Higher degree of on-node parallelism: many-core CPUs and/or GPUs • Hybrid programming models: MPI+X, where OpenMP is an important X – MPI across the nodes – OpenMP shared memory parallelism across the cores in a node • Why use OpenMP? – The most widely used standard for SMP systems, implemented by many vendors – Supports the Fortran, C, and C++ languages – Relatively small and simple specification, and supports incremental parallelism – OpenMP research keeps it up to date with the latest hardware developments – OpenMP 4 allows targeting GPUs • We see momentum building around OpenMP
  24. 24. OpenMP Debugging Challenges • Programmers will attempt to exploit MPI+OpenMP hybrid parallelism • Porting existing large applications from MPI to a hybrid model is nontrivial and arduous, and having GPUs in the mix makes it even more challenging • Programming errors such as memory corruption, logic errors and concurrency bugs are inevitable • Bottom line – MPI+OpenMP+GPUs will present programmers with unprecedented debugging challenges – They need good debugging tools for MPI+OpenMP+GPUs
  25. 25. Debugging OpenMP Applications. The following are some features that TotalView supports: • Source-level debugging of the original OpenMP code. • The ability to plant breakpoints throughout the OpenMP code, including lines that are executed in parallel. • Visibility of OpenMP worker threads. • Access to SHARED and PRIVATE variables in OpenMP PARALLEL code. • Access to OMP THREADPRIVATE data in code compiled by supported compilers.
  26. 26. Sample OpenMP Debugging Session [Screenshot: a debugging session showing OpenMP worker threads and local variables]
  27. 27. OpenMP code high and low level • Intention is expressed in the OpenMP code – Serial-correct code with OMP directives expressing parallelism – Higher level expression of the ideas • Compiler can create either serial or parallel executable programs from this source • A parallel executable includes both the program logic and a runtime – Teams of threads on the device, the host or both – Outlined routines – Runtime calls to dispatch work to worker threads – Work created on thread A may be executed on threads M-N
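
The gap between the high-level source and the low-level executable can be sketched in Python. This is only an analogy, not what an OpenMP compiler emits: the loop body is pulled out into a separate "outlined" function, and a thread pool stands in for the runtime that dispatches chunks of iterations to workers. The SAXPY loop and every name below are hypothetical, chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def saxpy_serial(a, x, y):
    """Serial-correct version: what the source means with directives ignored."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def outlined_chunk(a, x, y, out, start, stop, executed_by):
    """Hypothetical outlined routine: the loop body for one chunk of iterations."""
    executed_by.add(threading.get_ident())  # record which worker ran this chunk
    for i in range(start, stop):
        out[i] = a * x[i] + y[i]


def saxpy_parallel(a, x, y, num_threads=4):
    """Stand-in for the runtime: split the iterations and dispatch them to workers."""
    n = len(x)
    out = [0.0] * n
    executed_by = set()
    chunk = max(1, (n + num_threads - 1) // num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(outlined_chunk, a, x, y, out, s, min(s + chunk, n), executed_by)
            for s in range(0, n, chunk)
        ]
        for future in futures:
            future.result()  # propagate any exception raised in a worker
    return out, executed_by
```

The work is created on the calling thread but executed on pool threads, which mirrors why a debugger needs help from the runtime to show the programmer the loop they wrote rather than the outlined routine and its dispatch calls.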
  28. 28. What’s Needed? • Debugging and performance analysis support from OpenMP implementations • “OMPT and OMPD: OpenMP Tools Application Programming Interfaces for Performance Analysis and Debugging” technical report (TR) – First TR combined OMPT and OMPD in one document – OMPD was redacted to allow OMPT to progress • “TR2: OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis” – Accepted by the OpenMP ARB (March 2014) – OMPT is an API for first-party performance tools • It is now time to circle back and finish OMPD! – OMPD is similar to OMPT in its functionality, but… – OMPD is an API for third-party debugging tools
  29. 29. What Does OMPT/OMPD Do? • OMPT – Enable performance tools to gather execution program/runtime costs – Allow construction of low-overhead performance tools – Allow logical stack unwinding (to handle outlined parallel regions) – Provide the state of a thread at any point in time (e.g., idle, work, wait) – Asynchronous signal safe • OMPD – Enable debugging tools to inspect the state of a live process or core file – Third-party versions of the OpenMP runtime inquiry functions – Third-party versions of the OMPT inquiry functions – Intercept the beginning/end of parallel/task regions (e.g., stepping in/out) – Enable the debugger to construct a “global view” of the process
  30. 30. How is OMPD Structured? • Based on a commonly used idiom – pthread thread_db, MPI MQD, MPI Handles, and others – The OMPD DLL is “paired” with the OpenMP runtime library • The debugger – Attaches to the target OpenMP application – Loads the OMPD DLL (e.g., via dlopen()) – Registers callbacks in the OMPD DLL – Makes “requests” into the OMPD DLL to query runtime state • The OMPD DLL – Makes callbacks into the debugger (lookup symbols, read/write memory, etc.) – Returns the result to the debugger
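
The request and callback idiom described above can be sketched as a small Python model. Nothing here is the OMPD API: the class names, the symbol names (`rt_num_threads`, `rt_thread_states`), the memory layout, and the numeric state encoding are all assumptions made up for the illustration. The point it shows is the division of labour: the library knows how the runtime lays out its data, and the debugger knows how to reach the target process.

```python
class FakeTarget:
    """Hypothetical target process: a symbol table plus addressable memory."""

    def __init__(self):
        self.symbols = {"rt_num_threads": 0x1000, "rt_thread_states": 0x2000}
        self.memory = {0x1000: 3, 0x2000: 1, 0x2001: 2, 0x2002: 0}


class Debugger:
    """Debugger side: owns access to the target and exposes it as callbacks."""

    def __init__(self, target):
        self.target = target

    def lookup_symbol(self, name):
        return self.target.symbols[name]

    def read_memory(self, address):
        return self.target.memory[address]


class OmpdLibrary:
    """Stand-in for the OMPD DLL: knows the runtime's layout, not how to reach it."""

    STATE_NAMES = {0: "idle", 1: "work", 2: "wait"}  # assumed encoding

    def __init__(self):
        self.callbacks = None

    def register_callbacks(self, lookup_symbol, read_memory):
        self.callbacks = (lookup_symbol, read_memory)

    def get_thread_states(self):
        """Answer a debugger request by calling back into the debugger."""
        lookup_symbol, read_memory = self.callbacks
        count = read_memory(lookup_symbol("rt_num_threads"))
        base = lookup_symbol("rt_thread_states")
        return [self.STATE_NAMES[read_memory(base + i)] for i in range(count)]


def query_thread_states():
    debugger = Debugger(FakeTarget())  # "attach" to the target
    ompd = OmpdLibrary()               # "load" the library into the debugger
    ompd.register_callbacks(debugger.lookup_symbol, debugger.read_memory)
    return ompd.get_thread_states()    # request, then callbacks, then result
```

Because the library never touches the target directly, the same library could in principle serve a live process or a core file; only the debugger's callbacks would differ.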
  31. 31. OMPD “In Action”
      [Diagram: the debugger attaches to the application process, whose address space contains the OpenMP Runtime Library (RTL). The OMPD DLL is loaded into the debugger address space and callbacks are registered.]
      1. Request OpenMP state. Request types: • Handles for threads, parallel regions, tasks • Parent / child relationships • State of handles (wait, work, idle)
      2. Request symbols and address information (callback ops). Callback functions: • Lookup symbols in the target process • Read/write target address spaces • Support for GPUs
      3. Result
  32. 32. OMPD Status • Collaboration between LLNL and Rogue Wave Software (TotalView) – LLNL: Dong Ahn, Ignacio Laguna, Joachim Protze, Martin Schulz – RWS: Ariel Burton, John DelSignore • Resurrect OMPD with the ultimate goal of having it accepted by the ARB – Fix the current OMPD spec – Implement OMPD DLL in the Intel OpenMP runtime – Implement OMPD-based features in TotalView • IWOMP Paper (International Workshop on OpenMP)
  33. 33. Current Work and Future Plans
  34. 34. TotalView 8.15 New Features • Scalable Infrastructure – Faster start up on Linux – Scales to O(100,000) processes & O(1,000,000) threads • Updated CUDA support – CUDA 7.0 • Support updates including: – Clang 3.5 – Intel 15.0 – MPT 2.12 – SLES 12, Fedora 21
      [Diagram: one TV Client connected to two MRNet CPs, each connected to two TV Servers]
  35. 35. TotalView’s Scalability Strategy • Multicast and reduction: TotalView uses an “MRNet tree” of servers • Remain lightweight in the backend! • Push debugger “smarts” to the backend, not the whole debugger! • Use classic optimization techniques too: caching, hoisting invariants, etc.
      [Diagram: one TV Client connected to two MRNet CPs, each connected to two TV Servers; the “smarts” sit in the servers]
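
The reduction half of this strategy can be sketched in Python. This is not the MRNet API and not TotalView's protocol; the report format (a count of processes per source location) and the function name are assumptions for illustration. Each level of the tree merges the reports of its children, so the client receives one summary instead of one message per server.

```python
from collections import Counter


def reduce_tree(leaf_reports, fan_out):
    """Merge per-server reports level by level until one summary remains."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    level = [Counter(report) for report in leaf_reports]
    hops = 0
    while len(level) > 1:
        # Each intermediate node merges the reports of up to `fan_out` children.
        level = [
            sum(level[i:i + fan_out], Counter())
            for i in range(0, len(level), fan_out)
        ]
        hops += 1
    summary = level[0] if level else Counter()
    return summary, hops
```

With four servers and a fan-out of two, as in the slide's diagram, the summary reaches the client in two hops. The number of hops grows with the logarithm of the server count, which is the usual argument for a tree over a flat star of connections.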
  36. 36. Linux Start up Performance in TV 8.15.4: 5x faster (600 s / 120 s) at 16k between 8.14.1 and 8.15.4. Note that we switched to MRNet by default in 8.15.0.
  37. 37. BG Start up Performance in TV 8.15.4: 6.4x faster (180 s / 28 s) at 16k between 8.14.1 and 8.15.4. Note that we switched to MRNet by default in 8.15.0.
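
The two speedup figures follow directly from the quoted start-up times. As a small worked check in Python, using only the numbers on the two slides (the function name is made up for the example):

```python
def speedup(old_seconds, new_seconds):
    """Ratio of old to new elapsed time; values above 1 mean the new version is faster."""
    return old_seconds / new_seconds


linux_speedup = speedup(600, 120)  # Linux start-up times quoted on the slide
bg_speedup = speedup(180, 28)      # BG start-up times quoted on the slide

linux_seconds_saved = 600 - 120
bg_seconds_saved = 180 - 28
```

The Linux case gives the larger absolute saving in seconds, while the BG case gives the larger ratio.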
  38. 38. TotalView debugs 786,432 cores. Climb with Rogue Wave towards exascale.
  39. 39. Some more details on the 786,432 core test • The test was performed on 48 racks of Sequoia • The test code – Implements a Jacobi Linear Equation Solver – The test code is a hybrid MPI + OpenMP code – 16 threads per process, one process per node • The test operations – Start up – Setting breakpoints / removing breakpoints – Single stepping all threads • Tests performed at a variety of scales to understand scalability
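
For readers unfamiliar with the method, here is a minimal single-process Python sketch of Jacobi iteration for a linear system. It is not the test code used on Sequoia, which was a hybrid MPI + OpenMP program; the function name, the convergence check, and the default parameters are assumptions for illustration. Every component of the new iterate depends only on the previous iterate, so the updates are independent of one another, which is what makes the method easy to spread across threads and processes.

```python
def jacobi_solve(a, b, iterations=100, tolerance=1e-10):
    """Solve a x = b by Jacobi iteration; `a` is assumed diagonally dominant."""
    n = len(b)
    if n == 0:
        return []
    x = [0.0] * n
    for _ in range(iterations):
        # Every x_new[i] reads only the previous iterate, so rows are independent.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tolerance:
            return x_new
        x = x_new
    return x
```

For example, `jacobi_solve([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])` converges towards the exact solution x = 1, y = 2.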
  40. 40. TotalView’s Memory Efficiency • TotalView is lightweight in the back-end (server) • Servers don’t “steal” memory from the application • Each server is a multi-process debugger agent – One server can debug thousands of processes – Not a conglomeration of single process debuggers – TotalView’s architecture provides flexibility (e.g., P/SVR) – No artificial limits to accommodate the debugger (e.g., BG/Q 1 P/CN) • Symbols are read, stored, and shared in the front-end (client) • Example: LLNL APP ADB, 920 shlibs, Linux, 64 P, 4 CN, 16 P/CN, 1 SVR/CN
      Process | VSZ (largest, MB) | RSS (largest, MB) | Where
      TV Client | 4,469 | 3,998 | Front End ONLY
      MRNet CP | 497 | 4 | Compute Nodes
      TV Server | 304 | 53 | Compute Nodes
  41. 41. Future plans • Contact sales@roguewave.com with any inquiries about our future plans for the TotalView product.
  42. 42. Questions
  43. 43. Thanks! • Visit the website – http://www.roguewave.com/products/totalview.aspx – Documentation – Sign up for an evaluation – Contact customer support & post on the user forum