Dealing with GPU resets

igalia 11 views 9 slides Oct 14, 2024

About This Presentation

GPU resets are a common problem for every vendor, due to the nature of the
stack. A bad shader can put the render node in an infinite loop, and then we
need to reset the GPU, partially or completely. However, each driver (in both
userspace and kernelspace) has different ideas of what to do when a reset ...


Slide Content

Dealing with GPU resets
André Almeida
LPC 2024
1/9

GPU resets
Can happen after an invalid GPU command, infinite loop, power management issues
Userspace driver asks kernel driver if everything is OK, and (might) do something about it
Kill the guilty app?
Kernel driver can also act on it
“A job has timed out, so let’s reset the GPU”
2/9

How should GPU resets be implemented?
Nowadays, everyone does it their own way
But at least now we have a doc!
https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-reset
3/9

How should GPU resets be implemented?
Every DRM driver has a specific API for reporting GPU resets:
I915_GET_RESET_STATS
AMDGPU_CTX_OP_QUERY_STATE2
MSM_PARAM_FAULTS
How can we get a DRM API for that?
Can we have a struct drm_context?
4/9

DRM_IOCTL_GET_RESET
struct drm_get_reset {
        /** Context ID to query resets (in) */
        __u32 ctx_id; // no global context ID...
        /** Flags (out) */
        __u32 flags;
        /** Global reset counter for this card (out) */
        __u64 reset_count;
        /** Reset counter for this context (out) */
        __u64 reset_count_ctx;
};
5/9
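
One way userspace could consume such an ioctl is to snapshot the counters when a context is created and compare them with a later query. The sketch below is only an illustration under stated assumptions: DRM_IOCTL_GET_RESET is the proposal shown above, not an existing kernel interface, so the struct is redeclared locally with fixed-width types and no ioctl is issued. The names reset_verdict and classify_reset are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the struct proposed above (assumption: not an upstream uAPI). */
struct drm_get_reset {
	uint32_t ctx_id;          /* in: context to query */
	uint32_t flags;           /* out */
	uint64_t reset_count;     /* out: global reset counter for this card */
	uint64_t reset_count_ctx; /* out: reset counter for this context */
};

/* Hypothetical classification of what happened between two snapshots. */
enum reset_verdict {
	RESET_NONE,          /* neither counter moved */
	RESET_OTHER_CONTEXT, /* the card was reset, but this context's counter is unchanged */
	RESET_THIS_CONTEXT,  /* this context's own counter moved */
};

/* Compare a snapshot taken earlier with a fresh one for the same context. */
static enum reset_verdict classify_reset(const struct drm_get_reset *before,
					 const struct drm_get_reset *after)
{
	/* Both snapshots must describe the same context to be comparable. */
	assert(before->ctx_id == after->ctx_id);

	if (after->reset_count_ctx != before->reset_count_ctx)
		return RESET_THIS_CONTEXT;
	if (after->reset_count != before->reset_count)
		return RESET_OTHER_CONTEXT;
	return RESET_NONE;
}
```

Comparing counters rather than reading a single "was reset" flag lets a caller tell a reset that hit its own context apart from one that only affected someone else on the same card, which is the distinction the two out fields seem intended to expose.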

How should GPU resets be implemented?
It would be very nice to know what caused the reset in the first place
But this seems to be locked inside the firmware side
What’s available for us in the open source driver?
6/9

OpenGL robustness
Apps using OpenGL should use the available robust interfaces, like the extension GL_ARB_robustness. This interface tells the app whether a reset has happened; if so, all context state is considered lost and the app proceeds by creating a new context. There’s no consensus on what to do if robustness is not in use.
Some drivers think they should kill the non-robust app, others don’t
7/9
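
The robust-app flow described above can be sketched as a periodic status check. This is a minimal illustration, not a complete application: the status constants carry the values defined by GL_ARB_robustness but are redeclared here so the sketch builds without GL headers, the query is passed in as a function pointer standing in for glGetGraphicsResetStatusARB, and context_lost plus the two fake_* functions are hypothetical names used only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int GLenum;

/* Values from GL_ARB_robustness, redeclared so no GL headers are needed. */
#define GL_NO_ERROR                   0
#define GL_GUILTY_CONTEXT_RESET_ARB   0x8253
#define GL_INNOCENT_CONTEXT_RESET_ARB 0x8254
#define GL_UNKNOWN_CONTEXT_RESET_ARB  0x8255

/* Stand-in for glGetGraphicsResetStatusARB so the logic runs without a GL context. */
typedef GLenum (*reset_status_fn)(void);

/* Hypothetical helper: true when the app should discard the context and build a new one. */
static bool context_lost(reset_status_fn get_status)
{
	assert(get_status != NULL);

	switch (get_status()) {
	case GL_NO_ERROR:
		return false; /* no reset observed */
	case GL_GUILTY_CONTEXT_RESET_ARB:   /* this context caused the reset */
	case GL_INNOCENT_CONTEXT_RESET_ARB: /* another context caused it */
	case GL_UNKNOWN_CONTEXT_RESET_ARB:  /* cause not known */
		return true;
	default:
		return true; /* unexpected value: conservatively treat state as lost */
	}
}

/* Fake status sources used only to exercise the helper. */
static GLenum fake_no_reset(void) { return GL_NO_ERROR; }
static GLenum fake_innocent_reset(void) { return GL_INNOCENT_CONTEXT_RESET_ARB; }
```

In a real application the function pointer would be the driver's glGetGraphicsResetStatusARB entry point, and a true result would lead to destroying the context and re-uploading all resources into a fresh one.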

End to end testing
How to write a proper IGT test for resetting?
How can userland know that a driver is properly putting something on screen?
amdgpu sometimes gets just a dark screen after a bad reset
8/9

Discussion
Join us!
https://www.igalia.com/jobs
9/9