1. Anatomy of ROCgdb
(GDB for AMD GPUs)
GNU Cauldron 2022
Pedro Alves, Simon Marchi, Lancelot Six,
Zoran Zaric, Tony Tye, Laurent Morichetti
2. 2 |
Cautionary Statement
This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) such as AMD’s vision, mission and focus; AMD’s market
opportunity and total addressable markets; AMD’s technology and architecture roadmaps; the features, functionality, performance, availability, timing and expected
benefits of future AMD products and product roadmaps; AMD’s path forward in data center, PCs and gaming; AMD’s market and financial momentum; and the
expected benefits from the acquisition of Xilinx, Inc., which are made pursuant to the Safe Harbor provisions of the Private Securities Litigation Reform Act of 1995.
Forward-looking statements are commonly identified by words such as "would," "may," "expects," "believes," "plans," "intends," "projects" and other terms with
similar meaning. Investors are cautioned that the forward-looking statements in this presentation are based on current beliefs, assumptions and expectations, speak
only as of the date of this presentation and involve risks and uncertainties that could cause actual results to differ materially from current expectations. Such
statements are subject to certain known and unknown risks and uncertainties, many of which are difficult to predict and generally beyond AMD's control, that could
cause actual results and other future events to differ materially from those expressed in, or implied or projected by, the forward-looking information and statements.
Investors are urged to review in detail the risks and uncertainties in AMD’s Securities and Exchange Commission filings, including but not limited to AMD’s most
recent reports on Forms 10-K and 10-Q. AMD does not assume, and hereby disclaims, any obligation to update forward-looking statements made in this
presentation, except as may be required by law.
3. 3 |
Outline
• What are ROCgdb / ROCm™ / HIP
• GPU compute kernels (HIP)
• GPU debugging challenges
• ROCgdb's component diagram
• GPU + Host threads under the same inferior, target stack
• SIMT lanes, commands, and lane divergence
• DWARF extensions
• Address spaces
• Other related contributions
• Upstreaming status
4. 4 |
What is ROCgdb
• GDB port (+extras) targeting AMD GPUs
• Debug ROCm™ (Radeon Open Compute) applications
• HIP (Heterogeneous-compute Interface for Portability)
• OpenCL™
• Offload compute kernel workloads on AMD GPUs
5. 5 |
What is HIP
• Heterogeneous-compute Interface for Portability
• C++ runtime API and kernel language
• Create portable applications:
• Run on AMD's accelerators as well as CUDA devices
• Uses the underlying Radeon Open Compute (ROCm™) or CUDA platform installed on a system
HIP:
• Is open-source
• Provides API to leverage GPU acceleration
• Syntactically similar to CUDA
• Good talk if you want to learn more:
• https://www.exascaleproject.org/event/amd-gpuprogramming-hip/
7. 7 |
GPU Debugging Challenges
• Multiple memory address spaces
• Many scalar registers
• Many wide vector registers
• Language threads of execution => lanes in a SIMD/SIMT execution model
[Diagram: example GPU hardware — vector registers VGPR 0 through VGPR 255; a variable X in a single source-language thread maps onto lanes 0-63 of the SIMD/SIMT execution model]
9. 9 |
Host threads and GPU threads (waves) under single inferior
(gdb) info threads
Id Target Id Frame
1 Thread ... (LWP 476966) main () from libhsa-runtime64.so.1
2 Thread ... (LWP 476969) in ioctl () at syscall-template.S:78
4 Thread ... (LWP 477504) in ioctl () at syscall-template.S:78
* 5 AMDGPU Wave 1:1:1:1 (0,0,0)/0 my_kernel () at kernel.cc:41
6 AMDGPU Wave 1:1:1:2 (0,0,0)/1 my_kernel () at kernel.cc:41
7 AMDGPU Wave 1:1:1:3 (0,0,0)/2 my_kernel () at kernel.cc:41
...
• GDB GPU threads are mapped to GPU waves
• Same program & unified memory
• Those wave Id numbers will be explained shortly
10. 10 |
GPU Thread (Wave) Target Id
(gdb) info threads
...
/- agent
| /- queue
| | /- dispatch
| | | /- wave id
| | | |
9 AMDGPU Wave 1:2:1:3 (0,0,0)/2 my_kernel () at kernel.cc:41
^^^^^ |
| - wave number in work group
- work group coordinates in work grid
...
(and yes, "info {agent, queue, dispatch}" commands)
11. 11 |
Host + GPU threads under single inferior, target stack
(gdb) maint print target-stack
The current target stack is:
- amd-dbgapi (GPU debugging using the AMD Debugger API) # arch_stratum
- multi-thread (multi-threaded child process.) # thread_stratum
- native (Native process) # process_stratum
- exec (Local exec file) # file_stratum
- None (None) # dummy_stratum
(gdb)
• New target on top of the stack, in the arch_stratum slot
• Pushed when the (native) inferior is started, un-pushed on kill/detach/exit
12. 12 |
Host + GPU threads under single inferior, target stack
bool
amd_dbgapi_target::foo_target_method (....)
{
if (!ptid_is_gpu (inferior_ptid))
return beneath ()->foo_target_method (....);
// handle GPU things.
}
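The delegation pattern above can be sketched as a self-contained model. All class and method names here are stand-ins for GDB's real target_ops machinery, and the ptid encoding is simplified to its essence; only the shape of the pattern is the point:

```cpp
#include <cassert>
#include <string>

// Reduced stand-in for GDB's ptid triple.
struct ptid_t { long pid_, lwp_, tid_; };

// Simplified version of the encoding detailed on the next slide.
static bool
ptid_is_gpu (ptid_t ptid)
{
  return ptid.lwp_ == 1;
}

// Stand-in for GDB's target_ops: each target knows the one beneath it.
struct target_ops
{
  target_ops *m_beneath = nullptr;
  target_ops *beneath () const { return m_beneath; }
  virtual std::string thread_name (ptid_t ptid)
  { return m_beneath->thread_name (ptid); }
  virtual ~target_ops () = default;
};

struct native_target : target_ops
{
  std::string thread_name (ptid_t) override { return "host"; }
};

// The arch-stratum GPU target answers only for GPU ptids and
// delegates everything else to the target beneath it.
struct amd_dbgapi_target : target_ops
{
  std::string thread_name (ptid_t ptid) override
  {
    if (!ptid_is_gpu (ptid))
      return beneath ()->thread_name (ptid);
    return "AMDGPU Wave";
  }
};
```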
13. 13 |
ptid_is_gpu hack^W in detail
/* Return true if the given ptid is a GPU thread (wave) ptid. */
static inline bool ptid_is_gpu (ptid_t ptid) {
/* FIXME: Currently using values that are known not to conflict with
other processes to indicate if it is a GPU thread. ptid.pid 1 is
the init process and is the only process that could have a
ptid.lwp of 1. The init process cannot have a GPU. No other
process can have a ptid.lwp of 1. The GPU wave ID is stored in
the ptid.tid. */
return ptid.pid () != 1 && ptid.lwp () == 1;
}
• Same target stack as the native target => make sure gpu ptids don't collide with host threads
• gpu ptids: (process_id, 1, wave_id)
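A minimal, self-contained sketch of this encoding and predicate; ptid_t here is a reduced stand-in for GDB's real class:

```cpp
#include <cassert>
#include <cstdint>

// Reduced stand-in for GDB's ptid_t, keeping only the (pid, lwp, tid)
// triple the encoding needs.
struct ptid_t
{
  int64_t m_pid, m_lwp, m_tid;

  int64_t pid () const { return m_pid; }
  int64_t lwp () const { return m_lwp; }
  int64_t tid () const { return m_tid; }
};

// GPU ptids are encoded as (process_id, 1, wave_id).  lwp == 1 cannot
// collide with a host thread: only the init process (pid 1) can own an
// LWP numbered 1, and the init process cannot have a GPU.
static inline bool
ptid_is_gpu (ptid_t ptid)
{
  return ptid.pid () != 1 && ptid.lwp () == 1;
}
```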
14. 14 |
Host + GPU threads under single inferior, target stack redesign?
• We've wished the amd-dbgapi target were a process_stratum target
• Not needing ptid_is_gpu would be great
• We've experimented with and/or debated solutions, including:
• Removing restriction of only one target per stratum => list of targets per stratum
• Making each inferior have a set of target stacks, one per device
• However:
• Changes are invasive and in the core of GDB => not good to carry downstream
• OTOH, hard to justify changes upstream if no upstream port needs them
• Every problem we've run into is solvable in the current design (minus ptid hack)
• Our plan is to upstream using current stack design, ptid hack included
• Does not break any current target
• Does not prevent other targets from doing something different
• Target stack redesign can then happen upstream, w/ at least one port making use of it
15. 15 |
SIMT Lanes
New entity under threads: threads become vectorized, multiple
lanes under one thread.
GDB threads are mapped to GPU waves. All lanes
progress side-by-side forming a wavefront.
One physical PC for the whole thread (for all lanes), but:
• Each lane works with its own slice of the register set, on its
share of data, its version of locals in scope.
• Lanes can be seen as multiple "regular" threads running in
lockstep.
(Note: lane divergence support provides the illusion that different
lanes execute code at different PCs. More later.)
"current lane" concept added (augmenting "current inferior",
"current thread").
[Diagram: same GPU hardware picture as before — VGPRs 0-255, a single source-language thread, lanes 0-63 under the SIMD/SIMT execution model]
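The lanes-under-one-thread model can be illustrated with a toy wave. The names, types, and sizes here are illustrative, not hardware or GDB API; the point is one PC and one exec mask per wave, with each lane owning its own slice of every vector register:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int lane_count = 64;

// Toy model of a wave: a single physical PC for all lanes, an
// execution mask selecting the active lanes, and a per-lane slice
// of one vector register (VGPR 0).
struct wave
{
  uint64_t pc;
  uint64_t exec_mask;
  std::array<int32_t, lane_count> vgpr0;
};

// Each active lane applies the same operation to its own data, in
// lockstep — lanes behave like "regular" threads running together.
void
vector_add_scalar (wave &w, int32_t k)
{
  for (int lane = 0; lane < lane_count; ++lane)
    if (w.exec_mask & (uint64_t (1) << lane))
      w.vgpr0[lane] += k;
}
```

An inactive lane (its bit clear in the mask) simply keeps its register slice unchanged, which is the seed of the divergence illusion described later.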
16. 16 |
SIMT Lanes, command examples
(gdb) info lanes
Id State Target Id Frame
1 A AMDGPU Lane 1:1:1:1/1 (0,0,0)[1,0,0] my_kernel ...
2 A AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0] my_kernel ...
3 A AMDGPU Lane 1:1:1:1/3 (0,0,0)[3,0,0] my_kernel ...
...
63 A AMDGPU Lane 1:1:1:1/63 (0,0,0)[63,0,0] my_kernel ...
• Usage: info lanes [-all | -active | -inactive]... [ID]...
17. 17 |
SIMT Lanes, command examples
(gdb) lane 2
[Switching to thread 5, lane 2 (AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0])]
#0 my_kernel (C_d=0x7fffe5c00000, ) at kernel.cc:41
41 size_t offset = (hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x);
(gdb) c
Continuing.
[Switching to thread 5, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
Thread 5 "dw2-lane-pc" hit Breakpoint 1, with lanes [0-4 10-20],
func (gid=0, in=..., out=...) at kernel.cc:89
89 {
(gdb) b 155 if $_lane > 3
18. 18 |
SIMT Lanes, command examples
(gdb) lane 2
…
(gdb) print local_func_var
$1 = 123
(gdb) lane 3
…
(gdb) print local_func_var
$2 = 500
(gdb) lane apply -active all print local_func_var
Lane 2 (AMDGPU Lane 1:2:1:2/2 (1,0,0)[2,0,0]):
$3 = 123
Lane 3 (AMDGPU Lane 1:2:1:2/3 (1,0,0)[3,0,0]):
$4 = 500
19. 19 |
SIMT Lane Target Id
/- agent
| /- queue
| | /- dispatch
| | | /- wave id
| | | |
| | | |
AMDGPU Lane 1:2:1:3/6 (0,0,0)[4,1,3]
| ^^^^^ ^^^^^
| | |
| | - work item coordinates in work group
| - work group coordinates in work grid
- lane index
21. 21 |
Without lane divergence support, step 1
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
>> (1) if (tid % 2) <<<<<<<<<
(3) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
(4) atomicAdd (out, elem);
}
22. 22 |
Without lane divergence support, step 2
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(3) elem = in[tid] + 1;
else
>> (2) elem = in[tid] + 3; <<<<<<<<<
(4) atomicAdd (out, elem);
}
23. 23 |
Without lane divergence support, step 3
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
>> (3) elem = in[tid] + 1; <<<<<<<<<
else
(2) elem = in[tid] + 3;
(4) atomicAdd (out, elem);
}
24. 24 |
Without lane divergence support, step 4
Stepping stops in all branches => surprising
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(3) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
>> (4) atomicAdd (out, elem); <<<<<<<<<
}
25. 25 |
With lane divergence support, step 1
Stepping doesn't stop if current lane is inactive => intuitive
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
>> (1) if (tid % 2) <<<<<<<<<
(X) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
(3) atomicAdd (out, elem);
}
26. 26 |
With lane divergence support, step 2
Stepping doesn't stop if current lane is inactive => intuitive
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(X) elem = in[tid] + 1;
else
>> (2) elem = in[tid] + 3; <<<<<<<<<
(3) atomicAdd (out, elem);
}
27. 27 |
With lane divergence support, step 3
Stepping doesn't stop if current lane is inactive => intuitive
__device__ void
function (unsigned tid, const int *in, int *out)
{
int elem;
(1) if (tid % 2)
(X) elem = in[tid] + 1;
else
(2) elem = in[tid] + 3;
>> (3) atomicAdd (out, elem); <<<<<<<<<
}
28. 28 |
Lane divergence, lane state
WITHOUT lane divergence debug info
(gdb) info lanes
Id State Target Id Frame
1 A AMDGPU Lane 1:1:1:1/1 (0,0,0)[1,0,0] kernel.cc:34
2 I AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0] <inactive>
3 A AMDGPU Lane 1:1:1:1/3 (0,0,0)[3,0,0] kernel.cc:34
...
63 A AMDGPU Lane 1:1:1:1/63 (0,0,0)[63,0,0] kernel.cc:41
A - active / I - inactive
29. 29 |
Lane divergence, lane state
WITH lane divergence debug info
(gdb) info lanes
Id State Target Id Frame
1 A AMDGPU Lane 1:1:1:1/1 (0,0,0)[1,0,0] kernel.cc:34
2 D AMDGPU Lane 1:1:1:1/2 (0,0,0)[2,0,0] kernel.cc:41
3 A AMDGPU Lane 1:1:1:1/3 (0,0,0)[3,0,0] kernel.cc:34
...
63 A AMDGPU Lane 1:1:1:1/63 (0,0,0)[63,0,0] kernel.cc:41
A - active / D - divergent
30. 30 |
Lane divergence, lane PC
Use source/logical PC instead of physical PC throughout
/* The frame's source/logical `resume' address. Returns the physical
thread-wide PC register. */
extern CORE_ADDR get_frame_pc (struct frame_info *);
+ /* The frame's source/logical `resume' address. This returns the
+ source/logical PC register, not the physical register. */
+ extern CORE_ADDR get_frame_lane_pc (struct frame_info *);
NEW!
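A toy model of the physical vs. per-lane logical PC split: in real GDB the lane PC is recovered from the DWARF divergence info, but the shape of the interface can be sketched with a diverged lane simply recording its own resume address. All names and fields here are illustrative:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for GDB's frame machinery.
struct lane_info
{
  bool diverged;        // lane is masked off at the wave's physical PC
  uint64_t logical_pc;  // where this lane would resume
};

struct frame_info
{
  uint64_t physical_pc;               // one PC for the whole wave
  std::array<lane_info, 64> lanes;
};

// Physical, thread-wide PC (the existing get_frame_pc behavior).
uint64_t
get_frame_pc (const frame_info &f)
{
  return f.physical_pc;
}

// Source/logical PC: a diverged (inactive) lane is presented at the PC
// where it would resume, not at the wave-wide physical PC.
uint64_t
get_frame_lane_pc (const frame_info &f, int lane)
{
  const lane_info &l = f.lanes[lane];
  return l.diverged ? l.logical_pc : f.physical_pc;
}
```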
31. 31 |
DWARF extensions
DWARF Extensions For Heterogeneous Debugging
https://llvm.org/docs/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html
• Allow Location Description on the DWARF Expression Stack
• Generalize CFI & DWARF Base Objects to Allow Any locdesc Kind
• Generalize DWARF Operation Expressions to Support Multiple Places
• Generalize Offsetting of Location Descriptions
• General Support for Address Spaces
• Operations to create Vector Composite locdescs
• Support for Divergent Control Flow of SIMT Hardware
• More...
32. 32 |
DWARF extensions, Cauldron 2021
• Last year's Cauldron presentation, by Tony Tye, Scott Linder, and Zoran Zaric:
• DWARF Extensions for Optimized SIMT/SIMD (GPU) Debugging
• https://www.youtube.com/watch?v=Iv2WO67nklc
33. 33 |
Architecture address spaces
• Independent from the current global memory concept
• Not part of source language syntax => address space cannot be defined as type qualifier
• Part of the address itself => the CORE_ADDR concept needs extending
• Require special pointer and reference type handling (and potentially more)
• Should not be exposed in user expressions, except when creating an address directly
• Address spaces can be relative to lane/wave/core/device
Per lane memory
Per wave memory
Core local
Per lane memory
Per wave memory
Per lane memory
Per wave memory
Per lane memory
Per wave memory
34. 34 |
Architecture address spaces
Not part of source language syntax => address space cannot be defined as type qualifier
__device__ int global_var;
__device__ void func (int *arg) { // arg can point to memory in any address space
int local_var[3] = { *arg };
// …
if (local_var[1] == global_var) {
// …
}
}
• Typically, local variables => private_lane address space
• But, for optimization reasons, they could be put elsewhere
• DWARF describes where that is
35. 35 |
Architecture address spaces, CORE_ADDR
• CORE_ADDR and the typical bit hacks would work, though high bits won't always be free:
• Pointer/memory tagging on AArch64, and soon on x86 too (UAI for AMD, LAM for Intel)
• CORE_ADDR better represents an offset into an address space
• Introducing tuple to carry both address space and offset:
struct address {
addr_space_id addr_space;
CORE_ADDR offset;
};
• Needed mostly for addresses that come from DWARF debug info, and user expressions
• Many many places can infer global (default) address space from context
=> can continue working with CORE_ADDR
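A sketch of how such a tuple might coexist with plain CORE_ADDRs. The enum values and helper names are illustrative (the private_lane id 0x5 matches the DWARF example later in the deck); the real addr_space_id definition lives in GDB/amd-dbgapi:

```cpp
#include <cassert>
#include <cstdint>

using CORE_ADDR = uint64_t;

// Illustrative address-space ids; 0x5 = private_lane as in the DWARF
// example, 0 = the default global space.
enum addr_space_id { ASPACE_GLOBAL = 0, ASPACE_PRIVATE_LANE = 5 };

// The proposed tuple: address space plus offset within it.
struct address
{
  addr_space_id addr_space;
  CORE_ADDR offset;
};

// Most of GDB can keep passing bare CORE_ADDRs around; a bare address
// implicitly means the global (default) address space.
inline address
from_core_addr (CORE_ADDR a)
{
  return { ASPACE_GLOBAL, a };
}

// Two addresses are equal only if both the space and the offset match:
// offset 0 in private_lane is not offset 0 in global memory.
inline bool
operator== (const address &l, const address &r)
{
  return l.addr_space == r.addr_space && l.offset == r.offset;
}
```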
36. 36 |
Address spaces notation
Introducing the '#' operator to compose pointer from:
• An address space name (maintenance print address-spaces)
• An offset
(gdb) p &k
$1 = (int *) private_lane#0x0
(gdb) p private_lane#0x1
Operation: OP_ASPACE
Operation: OP_LONG
Type: int
Constant: 0x0000000000000001
String: private_lane
$2 = (void *) private_lane#0x1
37. 37 |
Architecture address spaces in DWARF
• Address space location description (DW_OP_LLVM_form_aspace_address)
# Variable located at address 0x0 of private_lane (0x5) address space
DW_TAG_variable
…
DW_AT_location
DW_OP_lit0 # address 0x0
DW_OP_lit5 # address space 0x5 (private_lane)
DW_OP_LLVM_form_aspace_address # pops two arguments from stack
• Pointer and reference type DIE attribute (DW_AT_LLVM_address_space)
# Type of a pointer object which holds a private_lane (0x5) address
DW_TAG_pointer_type
…
DW_AT_LLVM_address_space 0x5 # different from DW_AT_address_class
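The two-operation location expression above can be evaluated with a toy stack machine. The DW_OP_lit<n> encoding (0x30 + n) follows the DWARF standard, but the vendor opcode value and the result type here are illustrative, not the real LLVM/amd-dbgapi definitions:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint8_t DW_OP_lit0 = 0x30;  // DW_OP_lit<n> = 0x30 + n (DWARF)
constexpr uint8_t DW_OP_LLVM_form_aspace_address = 0xe1;  // illustrative value

// Result of evaluating an address-space location description:
// an offset within a specific address space.
struct aspace_address { uint64_t aspace; uint64_t offset; };

// Tiny evaluator handling just the two operations used above.
aspace_address
evaluate (const std::vector<uint8_t> &expr)
{
  std::vector<uint64_t> stack;
  aspace_address result {0, 0};
  for (uint8_t op : expr)
    {
      if (op >= DW_OP_lit0 && op <= DW_OP_lit0 + 31)
        stack.push_back (op - DW_OP_lit0);  // push literal 0..31
      else if (op == DW_OP_LLVM_form_aspace_address)
        {
          // Pops the address space, then the offset within it.
          result.aspace = stack.back (); stack.pop_back ();
          result.offset = stack.back (); stack.pop_back ();
        }
    }
  return result;
}
```

Running it on the expression from the slide (DW_OP_lit0; DW_OP_lit5; DW_OP_LLVM_form_aspace_address) yields offset 0x0 in address space 0x5, i.e. private_lane.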
38. 38 |
Other related contributions
• Ctrl-C redesign, there's a separate talk for this:
• Redesigning GDB's Ctrl-C handling
• Performance improvements for large number of threads
• List of threads with pending status
• Per-inferior ptid -> thread map
• Commit-resumed
• Step over clone and thread exit
• The tail end of kernel code contains a kernel exit instruction
39. 39 |
Upstreaming status
• Linux Kernel module (kfd)
• Module exists upstream, but does not support debug there
• Finalizing debug interface, and plan to upstream soon ™
• https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
• AMD Debug API (amd-dbgapi)
• https://github.com/ROCm-Developer-Tools/ROCdbgapi
• DWARF extensions
• DWARF for GPUs pseudo informal working group: AMD, Intel, Perforce (so far)
• Some bits agreed in group and filed for DWARF v6:
• DW_OP_push_lane, DW_AT_num_lanes
• Working on submitting rest
• GDB
• https://github.com/ROCm-Developer-Tools/ROCgdb
• Some BFD / binutils bits merged
• GDB submission in preparation