1. Anatomy of ROCgdb
(GDB for AMD GPUs)
GNU Cauldron 2022
Pedro Alves, Simon Marchi, Lancelot Six,
Zoran Zaric, Tony Tye, Laurent Morichetti
2. 2 |
Cautionary Statement
This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) such as AMD’s vision, mission and focus; AMD’s market
opportunity and total addressable markets; AMD’s technology and architecture roadmaps; the features, functionality, performance, availability, timing and expected
benefits of future AMD products and product roadmaps; AMD’s path forward in data center, PCs and gaming; AMD’s market and financial momentum; and the
expected benefits from the acquisition of Xilinx, Inc., which are made pursuant to the Safe Harbor provisions of the Private Securities Litigation Reform Act of 1995.
Forward-looking statements are commonly identified by words such as "would," "may," "expects," "believes," "plans," "intends," "projects" and other terms with
similar meaning. Investors are cautioned that the forward-looking statements in this presentation are based on current beliefs, assumptions and expectations, speak
only as of the date of this presentation and involve risks and uncertainties that could cause actual results to differ materially from current expectations. Such
statements are subject to certain known and unknown risks and uncertainties, many of which are difficult to predict and generally beyond AMD's control, that could
cause actual results and other future events to differ materially from those expressed in, or implied or projected by, the forward-looking information and statements.
Investors are urged to review in detail the risks and uncertainties in AMD’s Securities and Exchange Commission filings, including but not limited to AMD’s most
recent reports on Forms 10-K and 10-Q. AMD does not assume, and hereby disclaims, any obligation to update forward-looking statements made in this
presentation, except as may be required by law.
3. 3 |
Outline
• What are ROCgdb / ROCm™ / HIP
• GPU compute kernels (HIP)
• GPU debugging challenges
• ROCgdb's component diagram
• GPU + Host threads under the same inferior, target stack
• SIMT lanes, commands, and lane divergence
• DWARF extensions
• Address spaces
• Other related contributions
• Upstreaming status
4. 4 |
What is ROCgdb
• GDB port (+extras) targeting AMD GPUs
• Debug ROCm™ (Radeon Open Compute) applications
• HIP (Heterogeneous-compute Interface for Portability)
• OpenCL™
• Offload compute kernel workloads on AMD GPUs
5. 5 |
What is HIP
• Heterogeneous-compute Interface for Portability
• C++ runtime API and kernel language
• Create portable applications:
• Run on AMD's accelerators as well as CUDA devices
• Uses the underlying Radeon Open Compute (ROCm™) or CUDA platform installed on a system
HIP:
• Is open-source
• Provides API to leverage GPU acceleration
• Syntactically similar to CUDA
• Good talk if you want to learn more:
• https://www.exascaleproject.org/event/amd-gpuprogramming-hip/
7. 7 |
GPU Debugging Challenges
• Multiple memory address spaces
• Many scalar registers
• Many wide vector registers
• Language threads of execution => lanes in a SIMD/SIMT execution model
[Diagram: example GPU hardware — vector registers VGPR 0 through VGPR 255; a variable X in a single source-language thread maps onto lanes 0-63 of the SIMD/SIMT execution model]
9. 9 |
Host threads and GPU threads (waves) under single inferior
(gdb) info threads
Id Target Id Frame
1 Thread ... (LWP 476966) main () from libhsa-runtime64.so.1
2 Thread ... (LWP 476969) in ioctl () at syscall-template.S:78
4 Thread ... (LWP 477504) in ioctl () at syscall-template.S:78
* 5 AMDGPU Wave 1:1:1:1 (0,0,0)/0 my_kernel () at kernel.cc:41
6 AMDGPU Wave 1:1:1:2 (0,0,0)/1 my_kernel () at kernel.cc:41
7 AMDGPU Wave 1:1:1:3 (0,0,0)/2 my_kernel () at kernel.cc:41
...
• GDB GPU threads are mapped to GPU waves
• Same program & unified memory
• Those wave Id numbers will be explained shortly
10. 10 |
GPU Thread (Wave) Target Id
(gdb) info threads
...
/- agent
| /- queue
| | /- dispatch
| | | /- wave id
| | | |
9 AMDGPU Wave 1:2:1:3 (0,0,0)/2 my_kernel () at kernel.cc:41
^^^^^ |
| - wave number in work group
- work group coordinates in work grid
...
(and yes, "info {agent, queue, dispatch}" commands)
11. 11 |
Host + GPU threads under single inferior, target stack
(gdb) maint print target-stack
The current target stack is:
- amd-dbgapi (GPU debugging using the AMD Debugger API) # arch_stratum
- multi-thread (multi-threaded child process.) # thread_stratum
- native (Native process) # process_stratum
- exec (Local exec file) # file_stratum
- None (None) # dummy_stratum
(gdb)
• New target on top of the stack, in the arch_stratum slot
• Pushed when the (native) inferior is started, un-pushed on kill/detach/exit
12. 12 |
Host + GPU threads under single inferior, target stack
bool
amd_dbgapi_target::foo_target_method (....)
{
if (!ptid_is_gpu (inferior_ptid))
return beneath ()->foo_target_method (....);
// handle GPU things.
}
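The delegation pattern above can be sketched as a self-contained model. All class and method names here are stand-ins for GDB's real target_ops machinery, and the ptid encoding is simplified to its essence; only the shape of the pattern is the point:

```cpp
#include <cassert>
#include <string>

// Reduced stand-in for GDB's ptid triple.
struct ptid_t { long pid_, lwp_, tid_; };

// Simplified version of the encoding detailed on the next slide.
static bool
ptid_is_gpu (ptid_t ptid)
{
  return ptid.lwp_ == 1;
}

// Stand-in for GDB's target_ops: each target knows the one beneath it.
struct target_ops
{
  target_ops *m_beneath = nullptr;
  target_ops *beneath () const { return m_beneath; }
  virtual std::string thread_name (ptid_t ptid)
  { return m_beneath->thread_name (ptid); }
  virtual ~target_ops () = default;
};

struct native_target : target_ops
{
  std::string thread_name (ptid_t) override { return "host"; }
};

// The arch-stratum GPU target answers only for GPU ptids and
// delegates everything else to the target beneath it.
struct amd_dbgapi_target : target_ops
{
  std::string thread_name (ptid_t ptid) override
  {
    if (!ptid_is_gpu (ptid))
      return beneath ()->thread_name (ptid);
    return "AMDGPU Wave";
  }
};
```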
13. 13 |
ptid_is_gpu hack^W in detail
/* Return true if the given ptid is a GPU thread (wave) ptid. */
static inline bool ptid_is_gpu (ptid_t ptid) {
/* FIXME: Currently using values that are known not to conflict with
other processes to indicate if it is a GPU thread. ptid.pid 1 is
the init process and is the only process that could have a
ptid.lwp of 1. The init process cannot have a GPU. No other
process can have a ptid.lwp of 1. The GPU wave ID is stored in
the ptid.tid. */
return ptid.pid () != 1 && ptid.lwp () == 1;
}
• Same target stack as the native target => make sure gpu ptids don't collide with host threads
• gpu ptids: (process_id, 1, wave_id)
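A minimal, self-contained sketch of this encoding and predicate; ptid_t here is a reduced stand-in for GDB's real class:

```cpp
#include <cassert>
#include <cstdint>

// Reduced stand-in for GDB's ptid_t, keeping only the (pid, lwp, tid)
// triple the encoding needs.
struct ptid_t
{
  int64_t m_pid, m_lwp, m_tid;

  int64_t pid () const { return m_pid; }
  int64_t lwp () const { return m_lwp; }
  int64_t tid () const { return m_tid; }
};

// GPU ptids are encoded as (process_id, 1, wave_id).  lwp == 1 cannot
// collide with a host thread: only the init process (pid 1) can own an
// LWP numbered 1, and the init process cannot have a GPU.
static inline bool
ptid_is_gpu (ptid_t ptid)
{
  return ptid.pid () != 1 && ptid.lwp () == 1;
}
```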
14. 14 |
Host + GPU threads under single inferior, target stack redesign?
• We've wished the amd-dbgapi target were a process_stratum target
• Not needing ptid_is_gpu would be great
• We've experimented with and/or debated solutions, including:
• Removing restriction of only one target per stratum => list of targets per stratum
• Making each inferior have a set of target stacks, one per device
• However:
• Changes are invasive and in the core of GDB => not good to carry downstream
• OTOH, hard to justify changes upstream if no upstream port needs them
• Every problem we've run into is solvable in the current design (minus ptid hack)
• Our plan is to upstream using current stack design, ptid hack included
• Does not break any current target
• Does not prevent other targets from doing something different
• Target stack redesign can then happen upstream, w/ at least one port making use of it
15. 15 |
SIMT Lanes
New entity under threads: threads become vectorized, multiple
lanes under one thread.
GDB threads are mapped to GPU waves. All lanes
progress side-by-side forming a wavefront.
One physical PC for the whole thread (for all lanes), but:
• Each lane works with its own slice of the register set, on its
share of data, its version of locals in scope.
• Lanes can be seen as multiple "regular" threads running in
lockstep.
(Note: lane divergence support provides the illusion that different
lanes execute code at different PCs. More later.)
"current lane" concept added (augmenting "current inferior",
"current thread").
[Diagram: same GPU hardware picture as before — VGPRs 0-255, a single source-language thread, lanes 0-63 under the SIMD/SIMT execution model]
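The lanes-under-one-thread model can be illustrated with a toy wave. The names, types, and sizes here are illustrative, not hardware or GDB API; the point is one PC and one exec mask per wave, with each lane owning its own slice of every vector register:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int lane_count = 64;

// Toy model of a wave: a single physical PC for all lanes, an
// execution mask selecting the active lanes, and a per-lane slice
// of one vector register (VGPR 0).
struct wave
{
  uint64_t pc;
  uint64_t exec_mask;
  std::array<int32_t, lane_count> vgpr0;
};

// Each active lane applies the same operation to its own data, in
// lockstep — lanes behave like "regular" threads running together.
void
vector_add_scalar (wave &w, int32_t k)
{
  for (int lane = 0; lane < lane_count; ++lane)
    if (w.exec_mask & (uint64_t (1) << lane))
      w.vgpr0[lane] += k;
}
```

An inactive lane (its bit clear in the mask) simply keeps its register slice unchanged, which is the seed of the divergence illusion described later.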
16. 16 |
SIMT Lanes, command examples
(gdb) info lanes
Id State Target Id Frame
1 A AMDGPU Lane 1:1:1:1/1 (0,0,0)[1,0,0] my_kernel ...
2 A AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0] my_kernel ...
3 A AMDGPU Lane 1:1:1:1/3 (0,0,0)[3,0,0] my_kernel ...
...
63 A AMDGPU Lane 1:1:1:1/63 (0,0,0)[63,0,0] my_kernel ...
• Usage: info lanes [-all | -active | -inactive]... [ID]...
17. 17 |
SIMT Lanes, command examples
(gdb) lane 2
[Switching to thread 5, lane 2 (AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0])]
#0 my_kernel (C_d=0x7fffe5c00000, ) at kernel.cc:41
41 size_t offset = (hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x);
(gdb) c
Continuing.
[Switching to thread 5, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
Thread 5 "dw2-lane-pc" hit Breakpoint 1, with lanes [0-4 10-20],
func (gid=0, in=..., out=...) at kernel.cc:89
89 {
(gdb) b 155 if $_lane > 3
18. 18 |
SIMT Lanes, command examples
(gdb) lane 2
…
(gdb) print local_func_var
$1 = 123
(gdb) lane 3
…
(gdb) print local_func_var
$2 = 500
(gdb) lane apply -active all print local_func_var
Lane 2 (AMDGPU Lane 1:2:1:2/2 (1,0,0)[2,0,0]):
$3 = 123
Lane 3 (AMDGPU Lane 1:2:1:2/3 (1,0,0)[3,0,0]):
$4 = 500
19. 19 |
SIMT Lane Target Id
/- agent
| /- queue
| | /- dispatch
| | | /- wave id
| | | |
| | | |
AMDGPU Lane 1:2:1:3/6 (0,0,0)[4,1,3]
| ^^^^^ ^^^^^
| | |
| | - work item coordinates in work group
| - work group coordinates in work grid
- lane index
21. 21 |
Without lane divergence support, step 1
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
>> (1) if (tid % 2) <<<<<<<<<
(3) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
(4) atomicAdd (out, elem);
}
22. 22 |
Without lane divergence support, step 2
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(3) elem = in[tid] + 1;
else
>> (2) elem = in[tid] + 3; <<<<<<<<<
(4) atomicAdd (out, elem);
}
23. 23 |
Without lane divergence support, step 3
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
>> (3) elem = in[tid] + 1; <<<<<<<<<
else
(2) elem = in[tid] + 3;
(4) atomicAdd (out, elem);
}
24. 24 |
Without lane divergence support, step 4
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(3) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
>> (4) atomicAdd (out, elem); <<<<<<<<<
}
25. 25 |
With lane divergence support, step 1
Stepping doesn't stop if current lane is inactive => intuitive
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
>> (1) if (tid % 2) <<<<<<<<<
(X) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
(3) atomicAdd (out, elem);
}
26. 26 |
With lane divergence support, step 2
Stepping doesn't stop if current lane is inactive => intuitive
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(X) elem = in[tid] + 1;
else
>> (2) elem = in[tid] + 3; <<<<<<<<<
(3) atomicAdd (out, elem);
}
27. 27 |
With lane divergence support, step 3
Stepping doesn't stop if current lane is inactive => intuitive
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(X) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
>> (3) atomicAdd (out, elem); <<<<<<<<<
}
28. 28 |
Lane divergence, lane state
WITHOUT lane divergence debug info
(gdb) info lanes
Id State Target Id Frame
1 A AMDGPU Lane 1:1:1:1/1 (0,0,0)[1,0,0] kernel.cc:34
2 I AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0] <inactive>
3 A AMDGPU Lane 1:1:1:1/3 (0,0,0)[3,0,0] kernel.cc:34
...
63 A AMDGPU Lane 1:1:1:1/63 (0,0,0)[63,0,0] kernel.cc:41
A - active / I - inactive
29. 29 |
Lane divergence, lane state
WITH lane divergence debug info
(gdb) info lanes
Id State Target Id Frame
1 A AMDGPU Lane 1:1:1:1/1 (0,0,0)[1,0,0] kernel.cc:34
2 D AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0] kernel.cc:41
3 A AMDGPU Lane 1:1:1:1/3 (0,0,0)[3,0,0] kernel.cc:34
...
63 A AMDGPU Lane 1:1:1:1/63 (0,0,0)[63,0,0] kernel.cc:41
A - active / D - divergent
30. 30 |
Lane divergence, lane PC
Use source/logical PC instead of physical PC throughout
/* The frame's source/logical `resume' address. Returns the physical
thread-wide PC register. */
extern CORE_ADDR get_frame_pc (struct frame_info *);
+ /* The frame's source/logical `resume' address. This returns the
+ source/logical PC register, not the physical register. */
+ extern CORE_ADDR get_frame_lane_pc (struct frame_info *);
NEW!
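A toy model of the physical vs. per-lane logical PC split: in real GDB the lane PC is recovered from the DWARF divergence info, but the shape of the interface can be sketched with a diverged lane simply recording its own resume address. All names and fields here are illustrative:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for GDB's frame machinery.
struct lane_info
{
  bool diverged;        // lane is masked off at the wave's physical PC
  uint64_t logical_pc;  // where this lane would resume
};

struct frame_info
{
  uint64_t physical_pc;               // one PC for the whole wave
  std::array<lane_info, 64> lanes;
};

// Physical, thread-wide PC (the existing get_frame_pc behavior).
uint64_t
get_frame_pc (const frame_info &f)
{
  return f.physical_pc;
}

// Source/logical PC: a diverged (inactive) lane is presented at the PC
// where it would resume, not at the wave-wide physical PC.
uint64_t
get_frame_lane_pc (const frame_info &f, int lane)
{
  const lane_info &l = f.lanes[lane];
  return l.diverged ? l.logical_pc : f.physical_pc;
}
```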
31. 31 |
DWARF extensions
DWARF Extensions For Heterogeneous Debugging
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
• Allow Location Description on the DWARF Expression Stack
• Generalize CFI & DWARF Base Objects to Allow Any locdesc Kind
• Generalize DWARF Operation Expressions to Support Multiple Places
• Generalize Offsetting of Location Descriptions
• General Support for Address Spaces
• Operations to create Vector Composite locdescs
• Support for Divergent Control Flow of SIMT Hardware
• More...
32. 32 |
DWARF extensions, Cauldron 2021
• Last year's Cauldron presentation, by Tony Tye, Scott Linder, and Zoran Zaric:
• DWARF Extensions for Optimized SIMT/SIMD (GPU) Debugging
• https://www.youtube.com/watch?v=Iv2WO67nklc
33. 33 |
Architecture address spaces
• Independent from the current global memory concept
• Not part of source language syntax => address space cannot be defined as type qualifier
• Part of the address itself => the CORE_ADDR concept needs extending
• Require special pointer and reference type handling (and potentially more)
• Should not be exposed in user expressions, except when creating an address directly
• Address spaces can be relative to lane/wave/core/device
Per lane memory
Per wave memory
Core local
Per lane memory
Per wave memory
Per lane memory
Per wave memory
Per lane memory
Per wave memory
34. 34 |
Architecture address spaces
Not part of source language syntax => address space cannot be defined as type qualifier
__device__ int global_var;
__device__ void func (int *arg) { // arg can point to memory in any address space
int local_var[3] = { *arg };
// …
if (local_var[1] == global_var) {
// …
}
}
• Typically, local variables => private_lane address space
• But, for optimization reasons, they could be put elsewhere
• DWARF describes where that is
35. 35 |
Architecture address spaces, CORE_ADDR
• CORE_ADDR and the typical bit hacks would work, though high bits won't always be free:
• Pointer/memory tagging on AArch64, and soon on x86 too (UAI for AMD, LAM for Intel)
• CORE_ADDR better represents an offset into an address space
• Introducing tuple to carry both address space and offset:
struct address {
addr_space_id addr_space;
CORE_ADDR offset;
};
• Needed mostly for addresses that come from DWARF debug info, and user expressions
• Many many places can infer global (default) address space from context
=> can continue working with CORE_ADDR
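A sketch of how such a tuple might coexist with plain CORE_ADDRs. The enum values and helper names are illustrative (the private_lane id 0x5 matches the DWARF example later in the deck); the real addr_space_id definition lives in GDB/amd-dbgapi:

```cpp
#include <cassert>
#include <cstdint>

using CORE_ADDR = uint64_t;

// Illustrative address-space ids; 0x5 = private_lane as in the DWARF
// example, 0 = the default global space.
enum addr_space_id { ASPACE_GLOBAL = 0, ASPACE_PRIVATE_LANE = 5 };

// The proposed tuple: address space plus offset within it.
struct address
{
  addr_space_id addr_space;
  CORE_ADDR offset;
};

// Most of GDB can keep passing bare CORE_ADDRs around; a bare address
// implicitly means the global (default) address space.
inline address
from_core_addr (CORE_ADDR a)
{
  return { ASPACE_GLOBAL, a };
}

// Two addresses are equal only if both the space and the offset match:
// offset 0 in private_lane is not offset 0 in global memory.
inline bool
operator== (const address &l, const address &r)
{
  return l.addr_space == r.addr_space && l.offset == r.offset;
}
```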
36. 36 |
Address spaces notation
Introducing the '#' operator to compose pointer from:
• An address space name (maintenance print address-spaces)
• An offset
(gdb) p &k
$1 = (int *) private_lane#0x0
(gdb) p private_lane#0x1
Operation: OP_ASPACE
Operation: OP_LONG
Type: int
Constant: 0x0000000000000001
String: private_lane
$2 = (void *) private_lane#0x1
37. 37 |
Architecture address spaces in DWARF
• Address space location description (DW_OP_LLVM_form_aspace_address)
# Variable located at address 0x0 of private_lane (0x5) address space
DW_TAG_variable
…
DW_AT_location
DW_OP_lit0 # address 0x0
DW_OP_lit5 # address space 0x5 (private_lane)
DW_OP_LLVM_form_aspace_address # pops two arguments from stack
• Pointer and reference type DIE attribute (DW_AT_LLVM_address_space)
# Type of a pointer object which holds a private_lane (0x5) address
DW_TAG_pointer_type
…
DW_AT_LLVM_address_space 0x5 # different from DW_AT_address_class
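The two-operation location expression above can be evaluated with a toy stack machine. The DW_OP_lit<n> encoding (0x30 + n) follows the DWARF standard, but the vendor opcode value and the result type here are illustrative, not the real LLVM/amd-dbgapi definitions:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint8_t DW_OP_lit0 = 0x30;  // DW_OP_lit<n> = 0x30 + n (DWARF)
constexpr uint8_t DW_OP_LLVM_form_aspace_address = 0xe1;  // illustrative value

// Result of evaluating an address-space location description:
// an offset within a specific address space.
struct aspace_address { uint64_t aspace; uint64_t offset; };

// Tiny evaluator handling just the two operations used above.
aspace_address
evaluate (const std::vector<uint8_t> &expr)
{
  std::vector<uint64_t> stack;
  aspace_address result {0, 0};
  for (uint8_t op : expr)
    {
      if (op >= DW_OP_lit0 && op <= DW_OP_lit0 + 31)
        stack.push_back (op - DW_OP_lit0);  // push literal 0..31
      else if (op == DW_OP_LLVM_form_aspace_address)
        {
          // Pops the address space, then the offset within it.
          result.aspace = stack.back (); stack.pop_back ();
          result.offset = stack.back (); stack.pop_back ();
        }
    }
  return result;
}
```

Running it on the expression from the slide (DW_OP_lit0; DW_OP_lit5; DW_OP_LLVM_form_aspace_address) yields offset 0x0 in address space 0x5, i.e. private_lane.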
38. 38 |
Other related contributions
• Ctrl-C redesign, there's a separate talk for this:
• Redesigning GDB's Ctrl-C handling
• Performance improvements for large number of threads
• List of threads with pending status
• Per-inferior ptid -> thread map
• Commit-resumed
• Step over clone and thread exit
• The tail end of kernel code contains a kernel exit instruction
39. 39 |
Upstreaming status
• Linux Kernel module (kfd)
• Module exists upstream, but does not support debug there
• Finalizing debug interface, and plan to upstream soon ™
• https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
• AMD Debug API (amd-dbgapi)
• https://github.com/ROCm-Developer-Tools/ROCdbgapi
• DWARF extensions
• DWARF for GPUs pseudo informal working group: AMD, Intel, Perforce (so far)
• Some bits agreed in group and filed for DWARF v6:
• DW_OP_push_lane, DW_AT_num_lanes
• Working on submitting rest
• GDB
• https://github.com/ROCm-Developer-Tools/ROCgdb
• Some BFD / binutils bits merged
• GDB submission in preparation