[Bug 108854] [polaris11] - Failed GPU reset after hang
b***@freedesktop.org
2018-11-24 20:41:30 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=108854

Bug ID: 108854
Summary: [polaris11] - Failed GPU reset after hang
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-***@lists.freedesktop.org
Reporter: ***@gmail.com

Created attachment 142604
--> https://bugs.freedesktop.org/attachment.cgi?id=142604&action=edit
dmesg showing the hang and failed gpu reset

Problem:

While running RuneLite [1] with GPU acceleration enabled, the system hangs
after several minutes of seemingly normal operation. Once the GPU hangs, it
attempts to reset itself but fails with the following message:

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

The hang locks up the system, leaving ssh as the only way to access it.
There is no graphical corruption; the displays are simply frozen.
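
For readers decoding the -125 in the message above: kernel functions return negative errno values, and on Linux errno 125 is ECANCELED. The following is a minimal sketch for looking such a code up from Python. The helper name decode_kernel_errno is hypothetical, and the lookup reflects the errno table of the machine running the script, so it should be run on Linux.

```python
import errno
import os


def decode_kernel_errno(ret):
    """Map a negative kernel return code to (errno name, message).

    Hypothetical helper for illustration; the mapping comes from the
    errno table of the platform running this script.
    """
    code = abs(ret)
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # -125 is the value printed in "Failed to initialize parser -125!".
    print(decode_kernel_errno(-125))
```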

System Information:

GPU: POLARIS11 - RX 560 4GB (1002:67ff)
Mesa: 18.0.5
X11: 1.19.6
Firmware files should be the latest, as I've pulled them from agd5f's repo [2].

Kernel parameters: "quiet splash scsi_mod.use_blk_mq=1 apparmor=2
security=apparmor amdgpu.gpu_recovery=1 spectre_v2=off"

I have reproduced this issue on:

4.20-rc3
amd-staging-drm-next (as of commit 1179994039abc10aab0d2f0ecfc4c65dfbd77438)

[1] https://github.com/runelite/runelite
[2] https://people.freedesktop.org/~agd5f/radeon_ucode/
--
You are receiving this mail because:
You are the assignee for the bug.
b***@freedesktop.org
2018-12-01 18:17:02 UTC

--- Comment #1 from Tom Seewald <***@gmail.com> ---
I can confirm this is still happening on 4.20-rc4, as well as with more
up-to-date userspace software.

libdrm: 3.27.0
Mesa: 18.2.4

The hangs can be reliably reproduced at least as far back as kernel 4.15, so I
am not confident I can bisect this.

Here is a dump of my card's firmware information in case I missed an update.

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01821000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000f0
CE feature version: 47, firmware version: 0x00000089
RLC feature version: 1, firmware version: 0x00000035
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 47, firmware version: 0x000002cb
MEC2 feature version: 47, firmware version: 0x000002cb
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x001d0900
SDMA0 feature version: 31, firmware version: 0x00000036
SDMA1 feature version: 0, firmware version: 0x00000036
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C98121-M01

Would umr[1] be useful here? I have not used it before, so I'd need some
guidance on what arguments would produce output relevant to this hang.

Any help is appreciated.

[1] https://cgit.freedesktop.org/amd/umr/
b***@freedesktop.org
2018-12-08 18:20:35 UTC

--- Comment #2 from Tom Seewald <***@gmail.com> ---
Created attachment 142754
--> https://bugs.freedesktop.org/attachment.cgi?id=142754&action=edit
dmesg of 4.20-rc5 with drm.debug=0xe
b***@freedesktop.org
2018-12-08 18:27:07 UTC

--- Comment #3 from Tom Seewald <***@gmail.com> ---
I installed the new Polaris firmware released on December 3rd, but it
doesn't appear to affect my card, as the content of
/sys/kernel/debug/dri/1/amdgpu_firmware_info is unchanged.
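
One way to check "unchanged" mechanically is to parse two saved copies of amdgpu_firmware_info and compare them block by block. This is only a sketch: the function names are hypothetical, and the line format is assumed to match the dump quoted in comment #1.

```python
import re

# Matches lines such as "ME feature version: 47, firmware version: 0x000000a2".
# The "VBIOS version: ..." line has a different shape and is skipped.
FW_LINE = re.compile(
    r"^(?P<block>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text):
    """Return {block name: (feature version, firmware version)}."""
    versions = {}
    for line in text.splitlines():
        match = FW_LINE.match(line.strip())
        if match:
            versions[match.group("block")] = (
                int(match.group("feature")),
                int(match.group("fw"), 16),
            )
    return versions


def changed_blocks(before, after):
    """Return the blocks whose versions differ between two parsed dumps."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }
```

Saving the file's output before and after the firmware update and passing both texts through parse_firmware_info, then changed_blocks, would give an empty dict when nothing changed.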

Upgraded to Mesa 18.3.0 from 18.2.4 - no change.

Added dmesg of 4.20-rc5 with drm.debug=0xe, showing the hang. It now prints
backtraces of hung kernel tasks rather than "[drm:amdgpu_cs_ioctl [amdgpu]]
*ERROR* Failed to initialize parser -125!".

I've also included the power management information before and after the GPU
hang.

/sys/kernel/debug/dri/1/amdgpu_pm_info *before* GPU hang:

Clock Gating Flags Mask: 0x3fbcf
Graphics Medium Grain Clock Gating: On
Graphics Medium Grain memory Light Sleep: On
Graphics Coarse Grain Clock Gating: On
Graphics Coarse Grain memory Light Sleep: On
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: On
Graphics Run List Controller Light Sleep: On
Graphics 3D Coarse Grain Clock Gating: Off
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: On
Memory Controller Medium Grain Clock Gating: On
System Direct Memory Access Light Sleep: Off
System Direct Memory Access Medium Grain Clock Gating: On
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: On
Unified Video Decoder Medium Grain Clock Gating: On
Video Compression Engine Medium Grain Clock Gating: On
Host Data Path Light Sleep: On
Host Data Path Medium Grain Clock Gating: On
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: On
Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
1750 MHz (MCLK)
1196 MHz (SCLK)
387 MHz (PSTATE_SCLK)
625 MHz (PSTATE_MCLK)
993 mV (VDDGFX)
20.30 W (average GPU)

GPU Temperature: 38 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled


/sys/kernel/debug/dri/1/amdgpu_pm_info *after* GPU hang:

Clock Gating Flags Mask: 0x6400
Graphics Medium Grain Clock Gating: Off
Graphics Medium Grain memory Light Sleep: Off
Graphics Coarse Grain Clock Gating: Off
Graphics Coarse Grain memory Light Sleep: Off
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: Off
Graphics Run List Controller Light Sleep: Off
Graphics 3D Coarse Grain Clock Gating: Off
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: Off
Memory Controller Medium Grain Clock Gating: Off
System Direct Memory Access Light Sleep: On
System Direct Memory Access Medium Grain Clock Gating: Off
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: Off
Unified Video Decoder Medium Grain Clock Gating: On
Video Compression Engine Medium Grain Clock Gating: On
Host Data Path Light Sleep: Off
Host Data Path Medium Grain Clock Gating: Off
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: Off
Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
1750 MHz (MCLK)
1196 MHz (SCLK)
387 MHz (PSTATE_SCLK)
625 MHz (PSTATE_MCLK)
993 mV (VDDGFX)
28.186 W (average GPU)

GPU Temperature: 42 C
GPU Load: 100 %

UVD: Disabled

VCE: Disabled
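
To make the before/after difference easier to read, the two amdgpu_pm_info dumps above can be compared flag by flag. The sketch below assumes each gating flag appears as "Name: On" or "Name: Off", as in the dumps. The function names are hypothetical.

```python
def parse_gating_flags(text):
    """Return {flag name: True if On, False if Off} from a pm_info dump."""
    flags = {}
    for line in text.splitlines():
        # Split on the last ": " so flag names containing spaces stay whole.
        name, sep, value = line.strip().rpartition(": ")
        if sep and value in ("On", "Off"):
            flags[name] = value == "On"
    return flags


def flipped_flags(before, after):
    """Return the flags whose state differs, as {name: (before, after)}."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


def mask_difference(before_mask, after_mask):
    """Return the bits that differ between two Clock Gating Flags Mask values."""
    return before_mask ^ after_mask
```

Lines that do not end in On or Off (clocks, temperature, load, UVD/VCE state) are ignored, so both complete dumps can be passed in unmodified.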