SlideShare a Scribd company logo
Having fun with GPU
resets :D
André Almeida
XDC 2023
1
Hi!
Kernel developer
Working in the Steam Deck
2
Oh no, my GPU hanged!
You are playing your game on Linux
Something wrong is sent to the device
???
Game over, reboot your machine
3
4
Modern GPUs are complex
Really complex
AMD Radeon RX 7900 XTX
96 Compute units, 384 texture units, 6 shader
engines, 58 B transistors...
Shaders are Turing Complete
5
Modern GPUs are complex
DCHUB
HUBP
(n)
DPP
(n)
MPC OPTC DIO
DCCG DMU AZ
MMHUBBUB
DWB
(n)
Global sync
Pixel data
Sideband signal
Config. Bus
SDP Monitor
OPP
dc_plane dc_stream
dc_state
Code struct
dc_link
Floating point
calculation
bit-depth
reduction/dither
}
Notes
6
Modern GPUs are complex
If you have an infinity loop in the CPU, it's not that
bad
CPU programs has virtual memory and virtual
processor
Things might be more barebone in the GPU
But in a GPU, the display won't be able to update
7
Detecting GPU hangs
From the hardware to the application
8
Detecting GPU hangs
Device to Kernel
Submit a job and wait until is done (halting problem)
Check fences
Or timeouts
The kernel driver does a GPU reset
This can be "soft" resets, one hw engine/context
reset or full device reset
Discrete amdgpu struggles with per-context
resets
More complete resets are more destructive
Total lost of VRAM
Now report to userspace
9
Reporting GPU hangs
Kernel to Mesa
DRM has no API for that
I915_GET_RESET_STATS
AMDGPU_CTX_OP_QUERY_STATE2
MSM_PARAM_FAULTS
Return -ERROR for ioctls
Nothing for Raspberry or Vivante (?)
It's not really hw specific
But there's no common DRM context concept
10
Reporting GPU hangs
Mesa to application
Vulkan
Before submitting commands or wait operation, Mesa
asks the kernel if the device is around
Otherwise, it propagates VK_ERROR_DEVICE_LOST
to the app
Then the app (maybe) do something about it
Recreate the context
Or just exit
VK_EXT_device_fault to query info about the
reset
11
Reporting GPU hangs
Mesa to application
OpenGL
Before context creation and cs flush do a reset check
If the app has GL_ARB_robustness support, it
propagates an error for the app
Otherwise:
Some drivers just kill the application
Other just block new calls
12
Reporting GPU hangs
Mesa to application
APIs provide a way to tell apps that a reset
happened:
VK_ERROR_DEVICE_LOST
GL_ARB_robustness
Non-robust GL apps are just killed
But even robust apps can misbehave
Mesa/kernel can ban fd if they keep resetting the GPU
13
Reporting GPU hangs
Mesa to application
Applications then can recreate the context
Usually that means:
Oh no, my submit command returned a reset error
The GPU state is corrupted, but the CPU still
good!
Let's recreate all contexts, buffers, shaders, etc
Cool, let's go!
Bad apps may fall in a broken loop
14
#version 330 core
out vec4 FragColor;
void main()
{
for (float x = 0.01; x < 1; x--) {
FragColor = vec4(x * 1.0f, 1.0f, 1.0f, 1.0f)
}
}
15
What can we do here?
DRM <-> Mesa it's not really hw specific
How about we have a DRM_GET_RESET_STATE?
drm/doc: Document DRM device reset
expectations: A DRM documentation explaining
what DRM drivers and Mesa should do when a reset
happens
16
DRM_IOCTL_GET_RESET
struct drm_get_reset {
/** Context ID to query resets (in) */
__u32 ctx_id; // no global context ID...
/** Flags (out) */
__u32 flags;
/** Global reset counter for this card (out) */
__u64 reset_count;
/** Reset counter for this context (out) */
__u64 reset_count_ctx;
};
17
drmGetReset(ctx_id, &reset);
if (reset.reset_count_ctx)
return PIPE_GUILTY_CONTEXT;
if (reset.reset_count)
return PIPE_INNOCENT_CONTEXT;
18
What happens in practice?
There was no reset check at RADV
+ device->vk.check_status = radv_check_status;
19
What happens in practice?
For RadeonSI, there was reset check...
But the driver would return
GL_CONTEXT_RESET_ARB forever
So the app would recreate the context, get a reset
return and recreate the context, get a reset return, ad
infinitum
20
What happens in practice?
For Iris (Intel), there is reset check...
But it only notify the guilty application that a reset
happened
The guilty application then quits/recreate the context
The rest aren't notified and counts on being luck to
still be alive
Some resets are more destructive than others...
21
What happens in practice?
Not much shared code, standardization, tests,
validation...
22
What happens in practice?
Each vendor reacts differently to resets
My focus is on amdgpu
The state for discrete is that it would be
unrecoverable for any kind of reset
Just a black screen and not responsive. Access via
ssh/tty sometimes worked
Pierre-Eric (AMD) and improved this for KDE
compositor
radeonsi wasn't following the spec
More testing is need for robustness
23
What happens in practice?
Other OSs have more control in the stack, so they can
be more reliable
In particular in the compositor side, so it's easier to
get in a standard behavior
24
Good reporting of GPU hangs
Apart from reporting to userspace that the GPU was
reset, would be nice to tell what happened
Currently Mesa developers have a hard time figuring
out what in the game caused the hang
25
Good reporting of GPU hangs
GPU hang have two main sources:
Hardware settings (voltage, frequency)
Application errors (infinite loops)
There's no way to distinguish this right now
26
Good reporting of GPU hangs
Ideally without overhead so can be enabled by default
I proposed AMDGPU_INFO_GUILTY_APP to capture
data about the hanged app (e.g. buffer in use)
This callbacks need to be platform specific
Reads some registers about which buffer was in
use
AMD replied that we can't trust register values after
a reset
We probably need some firmware support at this
i t
27
Good reporting of GPU hangs
RADV_DEBUG=hang isn't always effective, it changes
the ordering of jobs
Challenge: when the GPU hangs the hardware state
can be a bit unreliable.
How to get the right info correctly?
Using the GPU in "debug" mode or inserting fences,
barrier and extra information causes overhead
No easy way to deploy to all users
28
Roadmap for better GPU resets
Standardization of how DRM reports GPU hangs to
userspace
of how usermode driver deals with a hang and with
the guilty application
what compositors should do after a hang
Better hang log
Show which buffer caused the hang
Dump hardware state reliably
devcoredump
Tests!
29
Links
https://lore.kernel.org/lkml/20230501185747.3351
9-1-andrealmeid@igalia.com/
https://lore.kernel.org/lkml/20230424014324.2185
31-1-andrealmeid@igalia.com/
https://lore.kernel.org/dri-devel/20230929092509
.42042-1-andrealmeid@igalia.com/
https://gitlab.freedesktop.org/mesa/mesa/-/merge
_requests/22290
https://gitlab.freedesktop.org/mesa/mesa/-/merge
_requests/22253
30
31
Thanks!
igalia.com/jobs
32
33
Having fun with GPU resets in Linux – XDC  2023

More Related Content

Similar to Having fun with GPU resets in Linux – XDC 2023

Laptop syllabus 1 month
Laptop syllabus 1 monthLaptop syllabus 1 month
Laptop syllabus 1 month
chiptroniks
 
Mainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernels
Mainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernelsMainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernels
Mainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernels
Dobrica Pavlinušić
 
laptop repairing institute
laptop repairing institutelaptop repairing institute
laptop repairing institute
chiptroniks
 
laptop repairing institute
laptop repairing institutelaptop repairing institute
laptop repairing institute
chiptroniks
 
Laptop Chip level repairing(CPU section)
Laptop Chip level repairing(CPU section)Laptop Chip level repairing(CPU section)
Laptop Chip level repairing(CPU section)
chiptroniks
 
CUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : NotesCUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : Notes
Subhajit Sahu
 
GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013
ecoumans
 
Cuda intro
Cuda introCuda intro
Cuda intro
Anshul Sharma
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
J On The Beach
 
Dx diag
Dx diagDx diag
SpeedIT FLOW
SpeedIT FLOWSpeedIT FLOW
SpeedIT FLOW
University of Zurich
 
Android CTS training
Android CTS trainingAndroid CTS training
Android CTS training
jtbuaa
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
OSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner FischerOSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner Fischer
NETWAYS
 
Graphics card ppt
Graphics card pptGraphics card ppt
Graphics card ppt
Sowrabh Tandle
 
ADS Lab 5 Report
ADS Lab 5 ReportADS Lab 5 Report
ADS Lab 5 Report
Riddhi Shah
 
Intel galileo gen 2
Intel galileo gen 2Intel galileo gen 2
Intel galileo gen 2
srknec
 

Similar to Having fun with GPU resets in Linux – XDC 2023 (20)

Laptop syllabus 1 month
Laptop syllabus 1 monthLaptop syllabus 1 month
Laptop syllabus 1 month
 
Mainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernels
Mainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernelsMainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernels
Mainline kernel on ARM Tegra20 devices that are left behind on 2.6 kernels
 
laptop repairing institute
laptop repairing institutelaptop repairing institute
laptop repairing institute
 
laptop repairing institute
laptop repairing institutelaptop repairing institute
laptop repairing institute
 
Laptop Chip level repairing(CPU section)
Laptop Chip level repairing(CPU section)Laptop Chip level repairing(CPU section)
Laptop Chip level repairing(CPU section)
 
CUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : NotesCUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : Notes
 
GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013GPU Rigid Body Simulation GDC 2013
GPU Rigid Body Simulation GDC 2013
 
Cuda intro
Cuda introCuda intro
Cuda intro
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Dx diag
Dx diagDx diag
Dx diag
 
SpeedIT FLOW
SpeedIT FLOWSpeedIT FLOW
SpeedIT FLOW
 
Android CTS training
Android CTS trainingAndroid CTS training
Android CTS training
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
OSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner FischerOSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner Fischer
 
Graphics card ppt
Graphics card pptGraphics card ppt
Graphics card ppt
 
ADS Lab 5 Report
ADS Lab 5 ReportADS Lab 5 Report
ADS Lab 5 Report
 
Intel galileo gen 2
Intel galileo gen 2Intel galileo gen 2
Intel galileo gen 2
 

More from Igalia

The journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland supportThe journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland support
Igalia
 
Sustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web EcosystemSustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web Ecosystem
Igalia
 
Status of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKitStatus of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKit
Igalia
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Igalia
 
Building End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPEBuilding End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPE
Igalia
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Igalia
 
Automated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded DevicesAutomated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded Devices
Igalia
 
Embedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to MaintenanceEmbedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to Maintenance
Igalia
 
Optimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdfOptimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdf
Igalia
 
Running JS via WASM faster with JIT
Running JS via WASM      faster with JITRunning JS via WASM      faster with JIT
Running JS via WASM faster with JIT
Igalia
 
To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!
Igalia
 
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Igalia
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa
Igalia
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Igalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera Linux
Igalia
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVM
Igalia
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
Igalia
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devices
Igalia
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Igalia
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the web
Igalia
 

More from Igalia (20)

The journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland supportThe journey towards stabilizing Chromium’s Wayland support
The journey towards stabilizing Chromium’s Wayland support
 
Sustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web EcosystemSustainable Futures: Funding the Web Ecosystem
Sustainable Futures: Funding the Web Ecosystem
 
Status of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKitStatus of the Layer-Based SVG Engine in WebKit
Status of the Layer-Based SVG Engine in WebKit
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Building End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPEBuilding End-user Applications on Embedded Devices with WPE
Building End-user Applications on Embedded Devices with WPE
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded DevicesAutomated Testing for Web-based Systems on Embedded Devices
Automated Testing for Web-based Systems on Embedded Devices
 
Embedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to MaintenanceEmbedding WPE WebKit - from Bring-up to Maintenance
Embedding WPE WebKit - from Bring-up to Maintenance
 
Optimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdfOptimizing Scheduler for Linux Gaming.pdf
Optimizing Scheduler for Linux Gaming.pdf
 
Running JS via WASM faster with JIT
Running JS via WASM      faster with JITRunning JS via WASM      faster with JIT
Running JS via WASM faster with JIT
 
To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!To crash or not to crash: if you do, at least recover fast!
To crash or not to crash: if you do, at least recover fast!
 
Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamer
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera Linux
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVM
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devices
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the web
 

Recently uploaded

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 

Recently uploaded (20)

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 

Having fun with GPU resets in Linux – XDC 2023

  • 1. Having fun with GPU resets :D André Almeida XDC 2023 1
  • 3. Oh no, my GPU hanged! You are playing your game on Linux Something wrong is sent to the device ??? Game over, reboot your machine 3
  • 4. 4
  • 5. Modern GPUs are complex Really complex AMD Radeon RX 7900 XTX 96 Compute units, 384 texture units, 6 shader engines, 58 B transistors... Shaders are Turing Complete 5
  • 6. Modern GPUs are complex DCHUB HUBP (n) DPP (n) MPC OPTC DIO DCCG DMU AZ MMHUBBUB DWB (n) Global sync Pixel data Sideband signal Config. Bus SDP Monitor OPP dc_plane dc_stream dc_state Code struct dc_link Floating point calculation bit-depth reduction/dither } Notes 6
  • 7. Modern GPUs are complex If you have an infinity loop in the CPU, it's not that bad CPU programs has virtual memory and virtual processor Things might be more barebone in the GPU But in a GPU, the display won't be able to update 7
  • 8. Detecting GPU hangs From the hardware to the application 8
  • 9. Detecting GPU hangs Device to Kernel Submit a job and wait until is done (halting problem) Check fences Or timeouts The kernel driver does a GPU reset This can be "soft" resets, one hw engine/context reset or full device reset Discrete amdgpu struggles with per-context resets More complete resets are more destructive Total lost of VRAM Now report to userspace 9
  • 10. Reporting GPU hangs Kernel to Mesa DRM has no API for that I915_GET_RESET_STATS AMDGPU_CTX_OP_QUERY_STATE2 MSM_PARAM_FAULTS Return -ERROR for ioctls Nothing for Raspberry or Vivante (?) It's not really hw specific But there's no common DRM context concept 10
  • 11. Reporting GPU hangs Mesa to application Vulkan Before submitting commands or wait operation, Mesa asks the kernel if the device is around Otherwise, it propagates VK_ERROR_DEVICE_LOST to the app Then the app (maybe) do something about it Recreate the context Or just exit VK_EXT_device_fault to query info about the reset 11
  • 12. Reporting GPU hangs Mesa to application OpenGL Before context creation and cs flush do a reset check If the app has GL_ARB_robustness support, it propagates an error for the app Otherwise: Some drivers just kill the application Other just block new calls 12
  • 13. Reporting GPU hangs Mesa to application APIs provide a way to tell apps that a reset happened: VK_ERROR_DEVICE_LOST GL_ARB_robustness Non-robust GL apps are just killed But even robust apps can misbehave Mesa/kernel can ban fd if they keep resetting the GPU 13
  • 14. Reporting GPU hangs Mesa to application Applications then can recreate the context Usually that means: Oh no, my submit command returned a reset error The GPU state is corrupted, but the CPU still good! Let's recreate all contexts, buffers, shaders, etc Cool, let's go! Bad apps may fall in a broken loop 14
  • 15. #version 330 core out vec4 FragColor; void main() { for (float x = 0.01; x < 1; x--) { FragColor = vec4(x * 1.0f, 1.0f, 1.0f, 1.0f) } } 15
  • 16. What can we do here? DRM <-> Mesa it's not really hw specific How about we have a DRM_GET_RESET_STATE? drm/doc: Document DRM device reset expectations: A DRM documentation explaining what DRM drivers and Mesa should do when a reset happens 16
  • 17. DRM_IOCTL_GET_RESET struct drm_get_reset { /** Context ID to query resets (in) */ __u32 ctx_id; // no global context ID... /** Flags (out) */ __u32 flags; /** Global reset counter for this card (out) */ __u64 reset_count; /** Reset counter for this context (out) */ __u64 reset_count_ctx; }; 17
  • 18. drmGetReset(ctx_id, &reset); if (reset.reset_count_ctx) return PIPE_GUILTY_CONTEXT; if (reset.reset_count) return PIPE_INNOCENT_CONTEXT; 18
  • 19. What happens in practice? There was no reset check at RADV + device->vk.check_status = radv_check_status; 19
  • 20. What happens in practice? For RadeonSI, there was reset check... But the driver would return GL_CONTEXT_RESET_ARB forever So the app would recreate the context, get a reset return and recreate the context, get a reset return, ad infinitum 20
  • 21. What happens in practice? For Iris (Intel), there is reset check... But it only notify the guilty application that a reset happened The guilty application then quits/recreate the context The rest aren't notified and counts on being luck to still be alive Some resets are more destructive than others... 21
  • 22. What happens in practice? Not much shared code, standardization, tests, validation... 22
  • 23. What happens in practice? Each vendor reacts differently to resets My focus is on amdgpu The state for discrete is that it would be unrecoverable for any kind of reset Just a black screen and not responsive. Access via ssh/tty sometimes worked Pierre-Eric (AMD) and improved this for KDE compositor radeonsi wasn't following the spec More testing is need for robustness 23
  • 24. What happens in practice? Other OSs have more control in the stack, so they can be more reliable In particular in the compositor side, so it's easier to get in a standard behavior 24
  • 25. Good reporting of GPU hangs Apart from reporting to userspace that the GPU was reset, would be nice to tell what happened Currently Mesa developers have a hard time figuring out what in the game caused the hang 25
  • 26. Good reporting of GPU hangs GPU hang have two main sources: Hardware settings (voltage, frequency) Application errors (infinite loops) There's no way to distinguish this right now 26
  • 27. Good reporting of GPU hangs Ideally without overhead so can be enabled by default I proposed AMDGPU_INFO_GUILTY_APP to capture data about the hanged app (e.g. buffer in use) This callbacks need to be platform specific Reads some registers about which buffer was in use AMD replied that we can't trust register values after a reset We probably need some firmware support at this i t 27
  • 28. Good reporting of GPU hangs RADV_DEBUG=hang isn't always effective, it changes the ordering of jobs Challenge: when the GPU hangs the hardware state can be a bit unreliable. How to get the right info correctly? Using the GPU in "debug" mode or inserting fences, barrier and extra information causes overhead No easy way to deploy to all users 28
  • 29. Roadmap for better GPU resets Standardization of how DRM reports GPU hangs to userspace of how usermode driver deals with a hang and with the guilty application what compositors should do after a hang Better hang log Show which buffer caused the hang Dump hardware state reliably devcoredump Tests! 29
  • 31. 31
  • 33. 33