Discussion:
[Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung!
b***@freedesktop.org
2017-08-20 22:53:09 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

Bug ID: 102322
Summary: System crashes after "[drm] IP block:gmc_v8_0 is
hung!" / [drm] IP block:sdma_v3_0 is hung!
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: critical
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-***@lists.freedesktop.org
Reporter: ***@20mm.eu

I consistently experience complete system crashes when browsing web pages using
firefox for about 30 minutes, with the following dmesg output from the amdgpu
driver:

[ 2330.720711] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout,
last signaled seq=40778, last emitted seq=40780
[ 2330.720768] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout,
last signaled seq=31305, last emitted seq=31306
[ 2330.720771] [drm] IP block:gmc_v8_0 is hung!
[ 2330.720774] [drm] IP block:gmc_v8_0 is hung!
[ 2330.720775] [drm] IP block:sdma_v3_0 is hung!
[ 2330.720778] [drm] IP block:sdma_v3_0 is hung!

(The messages cited above are the last ones to reach a network filesystem via
"dmesg -w" before the system stops responding entirely.)
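
For anyone triaging logs like the ones above, here is a minimal Python sketch that extracts the ring name and the signaled/emitted fence sequence numbers from an amdgpu_job_timedout line. The function name parse_ring_timeout and the regular expression are illustrative assumptions derived only from the message format quoted in this report; they are not part of any amdgpu tooling, and lines wrapped by the mail client must be re-joined first.

```python
import re

# Pattern derived from the dmesg lines quoted in this report (assumption:
# other kernel versions may word the message differently).
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # "pending" is the number of emitted fences that never signaled.
    return m.group("ring"), signaled, emitted, emitted - signaled

if __name__ == "__main__":
    line = ("[ 2330.720711] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 "
            "timeout, last signaled seq=40778, last emitted seq=40780")
    print(parse_ring_timeout(line))
```

A small gap between the two sequence numbers, as in the lines above, suggests that only the most recent submissions on that ring failed to complete.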

I am running a kernel compiled from
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next as of
"commit 94097b0f7f1bfa54b3b1f8b0d74bbd271a0564e4" (so the very latest as of
today).
My GPU is an RX 460.

Note that this bug may be the same as the one reported in
https://bugs.freedesktop.org/show_bug.cgi?id=98874

However, in my case the system crashes usually occur while vertically scrolling
through some (ordinary) web page.
--
You are receiving this mail because:
You are the assignee for the bug.
b***@freedesktop.org
2017-11-19 16:40:30 UTC

--- Comment #1 from dwagner <***@20mm.eu> ---
Sadly, not only did this bug not attract any attention, it also still occurs,
and seemingly even more frequently than before, on current bleeding-edge kernels
from amd-staging-drm-next, and also with the now current Firefox 57 and the now
current versions of Xorg, Mesa etc. from Arch Linux.
--
b***@freedesktop.org
2018-02-24 18:36:55 UTC

--- Comment #2 from dwagner <***@20mm.eu> ---
Just to mention this once again: These system crashes still occur, and way too
frequently to consider the amdgpu driver stable enough for professional use.
Sample dmesg output from today:

Feb 24 18:26:55 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
last signaled seq=5430589, last emitted seq=5430591
Feb 24 18:26:55 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:26:55 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:02 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout,
last signaled seq=185928, last emitted seq=185930
Feb 24 18:27:02 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:27:02 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:05 [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0]
hw_done or flip_done timed out
--
b***@freedesktop.org
2018-06-03 21:00:01 UTC

--- Comment #3 from dwagner <***@20mm.eu> ---
Just for the record, others have reported similar symptoms - here is a recent
example: https://bugs.freedesktop.org/show_bug.cgi?id=106666
--
b***@freedesktop.org
2018-06-03 21:02:41 UTC

--- Comment #4 from dwagner <***@20mm.eu> ---
I was asked in
https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1027705-amdgpu-on-linux-4-18-to-offer-greater-vega-power-savings-displayport-1-4-fixes?p=1027933#post1027933
to mention here that I have experienced this kind of bug only when using the
"new" display code (amdgpu.dc=1).

I cannot strictly rule out that it could also happen with dc=0, since I have
tried dc=0 only for short periods occasionally, but during those periods I did
not see this kind of crash.
--
b***@freedesktop.org
2018-06-25 21:43:03 UTC

--- Comment #5 from dwagner <***@20mm.eu> ---
Just for the record: To rule out that my personally compiled kernels are somehow
"more buggy than what others compile", I tried the current Arch-Linux-supplied
Linux 4.17.2-1-ARCH kernel.

It survives only about 5 minutes of Firefox browsing between crashes, with:

Jun 20 00:01:11 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
sdma0 timeout, last signaled seq=1895, last em>
Jun 20 00:01:11 ryzen kernel: [drm] IP block:gmc_v8_0 is hung!

(4.13.* did at least survive days.)
--
b***@freedesktop.org
2018-06-25 22:11:14 UTC

--- Comment #6 from Andrey Grodzovsky <***@amd.com> ---
Verify that you are using the latest AMD firmware and up-to-date MESA/LLVM.

Firmware here (amdgpu folder) -
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/

Andrey
--
b***@freedesktop.org
2018-06-25 23:08:10 UTC

--- Comment #7 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #6)
Post by b***@freedesktop.org
Verify you are using latest AMD firmware and up to date MESA/LLVM
Firmware:

pacman -Q linux-firmware
linux-firmware 20180606.d114732-1

ll /usr/lib/firmware/amdgpu/vega10_vce.bin
-rw-r--r-- 1 root root 165344 Jun 7 08:01
/usr/lib/firmware/amdgpu/vega10_vce.bin


MESA:

pacman -Q mesa
mesa 18.1.2-1


LLVM:
pacman -Q llvm-libs
llvm-libs 6.0.0-4

Is this new enough?


BTW: In a forum somebody asked what the dmesg output on crash looked like if I
enabled amdgpu.gpu_recovery=1 - the result is a few lines more of output, but
still a fatal system crash:

Jun 26 00:50:09 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
gfx timeout, last signaled seq=12277, last emitted seq=12279
Jun 26 00:50:09 ryzen kernel: [drm] IP block:gmc_v8_0 is hung!
Jun 26 00:50:09 ryzen kernel: [drm] IP block:gfx_v8_0 is hung!
Jun 26 00:50:09 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_flip_done
[drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
[drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
[drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out
--
b***@freedesktop.org
2018-06-26 15:20:45 UTC

--- Comment #8 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #7)
Post by b***@freedesktop.org
(In reply to Andrey Grodzovsky from comment #6)
Post by b***@freedesktop.org
Verify you are using latest AMD firmware and up to date MESA/LLVM
pacman -Q linux-firmware
linux-firmware 20180606.d114732-1
ll /usr/lib/firmware/amdgpu/vega10_vce.bin
-rw-r--r-- 1 root root 165344 Jun 7 08:01
/usr/lib/firmware/amdgpu/vega10_vce.bin
pacman -Q mesa
mesa 18.1.2-1
pacman -Q llvm-libs
llvm-libs 6.0.0-4
Is this new enough?
The kernel and MESA seem new enough; LLVM is 6, so maybe you should try 7.
The firmware also looks pretty recent, but I would still advise manually
overwriting all firmware files with the files from here
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
Just back up your existing firmware/amdgpu folder first, just in case.
Post by b***@freedesktop.org
BTW: In a forum somebody asked what the dmesg output on crash looked like if
I enabled amdgpu.gpu_recovery=1 - the result is a few lines more of output,
Jun 26 00:50:09 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
ring gfx timeout, last signaled seq=12277, last emitted seq=12279
Jun 26 00:50:09 ryzen kernel: [drm] IP block:gmc_v8_0 is hung!
Jun 26 00:50:09 ryzen kernel: [drm] IP block:gfx_v8_0 is hung!
Jun 26 00:50:09 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_flip_done
[drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
Jun 26 00:50:15 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
[drm_kms_helper]] *ERROR* [CRTC:42:crtc-0] flip_done timed out
Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
[drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out
It's a known issue. Try the patch I attached to resolve the deadlock, but you
will probably experience other failures after that anyway.

Andrey
--
b***@freedesktop.org
2018-06-26 15:21:27 UTC

--- Comment #9 from Andrey Grodzovsky <***@amd.com> ---
Created attachment 140345
--> https://bugs.freedesktop.org/attachment.cgi?id=140345&action=edit
Deadlock fix
--
b***@freedesktop.org
2018-06-26 22:52:22 UTC

--- Comment #10 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #8)
Post by b***@freedesktop.org
The kernel and MESA seem new enough, LLVM is 6 so maybe you should try 7.
LLVM 7 has not been released, and replacing LLVM 6 with the current Subversion
head of LLVM 7 basically means recompiling and reinstalling half of the operating
system (starting at radeonsi, then Xorg, then its dependencies...)

I'm fine with using experimental new kernels to find a more stable amdgpu
driver - but if a kernel driver crashes just because some user-space
application (X11) utilizes a wrong compiler version at run time, then some part
of the driver design is very wrong.
Post by b***@freedesktop.org
The firmware also looks pretty late but I still would advise to manually
override all firmware files with files from here
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
tree/amdgpu
I did a "diff -r" on the git files with the ones installed by Arch, they are
all binary identical.
Post by b***@freedesktop.org
Post by b***@freedesktop.org
Jun 26 00:50:25 ryzen kernel: [drm:drm_atomic_helper_wait_for_dependencies
[drm_kms_helper]] *ERROR* [PLANE:40:plane-4] flip_done timed out
It's a known issue, try the patch I attached to resolve the deadlock, but
you will probably experience other failures after that anyway.
Ok, thanks for the patch, will try this next time I compile a new kernel.
--
b***@freedesktop.org
2018-06-27 07:48:45 UTC

--- Comment #11 from Michel Dänzer <***@daenzer.net> ---
(In reply to Andrey Grodzovsky from comment #8)
Post by b***@freedesktop.org
The kernel and MESA seem new enough, LLVM is 6 so maybe you should try 7.
LLVM 6 is fine.
--
b***@freedesktop.org
2018-06-27 13:53:37 UTC

--- Comment #12 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #2)
Post by b***@freedesktop.org
Just to mention this once again: These system crashes still occur, and way
too frequently to consider the amdgpu driver stable enough for professional
Feb 24 18:26:55 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
last signaled seq=5430589, last emitted seq=5430591
Feb 24 18:26:55 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:26:55 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:02 [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
timeout, last signaled seq=185928, last emitted seq=185930
Feb 24 18:27:02 [drm] IP block:gmc_v8_0 is hung!
Feb 24 18:27:02 [drm] IP block:gfx_v8_0 is hung!
Feb 24 18:27:05 [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR*
[CRTC:43:crtc-0] hw_done or flip_done timed out
Can you boot the kernel with amdgpu.vm_update_mode=3 on the GRUB command line to
force CPU VM update mode, and see if this helps?
--
b***@freedesktop.org
2018-06-27 23:15:48 UTC

--- Comment #13 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #12)
Post by b***@freedesktop.org
Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to
force CPU VM update mode and see if this helps ?
Sure. Too early yet to say "hurray", but at an uptime of one hour, currently,
4.17.2 survived with amdgpu.vm_update_mode=3 already about 20 times longer than
without that option before the first crash.

One (probably just informational) message is emitted by the kernel:
[ 19.319565] CPU update of VM recommended only for large BAR system

Can you explain a little: What is a "large BAR system", and what does the
vm_update_mode=3 option actually cause? Should I expect any weird side effects
to look for?


BTW: Not a result of that option, but of the kernel version, seems to be the
fact that the shader clock stays at a pretty high frequency all the time - even
without any 3D or compute load, just displaying a quiet 4k/60Hz desktop image:

cat pp_dpm_sclk
0: 214Mhz
1: 481Mhz
2: 760Mhz
3: 1020Mhz
4: 1102Mhz
5: 1138Mhz
6: 1180Mhz *
7: 1220Mhz

Much lower shader clocks are used only if I lower the refresh rate of the
screen. Is there a reason why the shader clocks should stay high even in the
absence of 3d/compute load?

(I would have better understood if the minimum memory clock was depending on
the refresh rate, but memory clock stays as low as with the older kernels.)
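
The pp_dpm_sclk listing above marks the currently selected level with an asterisk. As a minimal sketch (the function name active_sclk_level is hypothetical, and the parsing assumes exactly the "index: frequency" layout shown in this comment), the active level can be picked out like this:

```python
import re

# One line of the listing: "<level>: <freq>Mhz", optionally followed by "*".
LEVEL_LINE = re.compile(r"\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)

def active_sclk_level(text):
    """Return (level, mhz) for the entry marked with '*', or None if none is marked."""
    for line in text.splitlines():
        m = LEVEL_LINE.match(line)
        if m and m.group(3):
            return int(m.group(1)), int(m.group(2))
    return None

# Sample copied from the output quoted above. On a live system the text would
# be read from the pp_dpm_sclk file of the card's sysfs device directory
# (the exact path depends on the card index and is not given in this report).
SAMPLE = """\
0: 214Mhz
1: 481Mhz
2: 760Mhz
3: 1020Mhz
4: 1102Mhz
5: 1138Mhz
6: 1180Mhz *
7: 1220Mhz
"""

if __name__ == "__main__":
    print(active_sclk_level(SAMPLE))
```

Sampling this value periodically while changing the refresh rate would make the observation about display-dependent clocks easier to document.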
--
b***@freedesktop.org
2018-06-28 02:17:57 UTC

--- Comment #14 from Alex Deucher <***@gmail.com> ---
(In reply to dwagner from comment #13)
Post by b***@freedesktop.org
Much lower shader clocks are used only if I lower the refresh rate of the
screen. Is there a reason why the shader clocks should stay high even in the
absence of 3d/compute load?
Certain display requirements can cause the engine clock to be kept higher as
well.
--
b***@freedesktop.org
2018-06-28 04:17:19 UTC

--- Comment #15 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #13)
Post by b***@freedesktop.org
(In reply to Andrey Grodzovsky from comment #12)
Post by b***@freedesktop.org
Can you load the kernel with grub command line amdgpu.vm_update_mode=3 to
force CPU VM update mode and see if this helps ?
Sure. Too early yet to say "hurray", but at an uptime of one hour,
currently, 4.17.2 survived with amdgpu.vm_update_mode=3 already about 20
times longer than without that option before the first crash.
[ 19.319565] CPU update of VM recommended only for large BAR system
Can you explain a little: What is a "large BAR system", and what does the
vm_update_mode=3 option actually cause? Should I expect any weird side
effects to look for?
I think it just means systems with a large amount of VRAM, which require a large
BAR for mapping, but I am not sure on that point.
vm_update_mode=3 means GPUVM page table updates are done using the CPU. By
default we do them using the DMA engine on the ASIC. The log showed a hang in
this engine, so I assumed there is something wrong with the SDMA commands we
submit. As a side effect I would expect more CPU utilization and maybe slower
rendering.
Post by b***@freedesktop.org
BTW: Not a result of that option, but of the kernel version, seems to be the
fact that the shader clock keeps at a pretty high frequency all the time -
even without any 3d or compute load, just displaying a quiet 4k/60Hz desktop
cat pp_dpm_sclk
0: 214Mhz
1: 481Mhz
2: 760Mhz
3: 1020Mhz
4: 1102Mhz
5: 1138Mhz
6: 1180Mhz *
7: 1220Mhz
Much lower shader clocks are used only if I lower the refresh rate of the
screen. Is there a reason why the shader clocks should stay high even in the
absence of 3d/compute load?
(I would have better understood if the minimum memory clock was depending on
the refresh rate, but memory clock stays as low as with the older kernels.)
--
b***@freedesktop.org
2018-06-28 04:36:41 UTC

--- Comment #16 from Alex Deucher <***@gmail.com> ---
(In reply to Andrey Grodzovsky from comment #15)
Post by b***@freedesktop.org
I think it just means systems with large VRAM so it will require large BAR
for mapping. But I am not sure on that point.
That's correct. The updates are done with the CPU rather than the GPU (SDMA).
The default BAR size on most systems is usually 256MB for 32-bit compatibility,
so the window for CPU access to VRAM (where the page tables live) is limited.
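
To double-check which amdgpu options a kernel was actually booted with, the command line can be parsed as in the following sketch. The helper names are hypothetical. The bit-mask reading of vm_update_mode (bit 0 = graphics, bit 1 = compute, negative = driver default) is an assumption based on the module parameter description and should be verified against the kernel version in use.

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.<name>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params

def cpu_update_targets(mode):
    """Decode vm_update_mode as a bit mask (ASSUMED: bit 0 = graphics, bit 1 = compute).

    Returns None for negative values, which are assumed to mean "driver default".
    """
    if mode < 0:
        return None
    targets = []
    if mode & 1:
        targets.append("graphics")
    if mode & 2:
        targets.append("compute")
    return targets

if __name__ == "__main__":
    # On a live system the string would come from /proc/cmdline; a fixed
    # example is used here so the sketch runs anywhere.
    example = "root=/dev/sda2 rw amdgpu.dc=1 amdgpu.vm_update_mode=3"
    params = amdgpu_params(example)
    print(params)
    print(cpu_update_targets(int(params["vm_update_mode"])))
```

With the value 3 used in this thread, both graphics and compute page table updates would go through the CPU under that reading, which matches the explanation given above.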
--
b***@freedesktop.org
2018-06-28 10:33:22 UTC

--- Comment #17 from Andrey Grodzovsky <***@amd.com> ---
(In reply to Alex Deucher from comment #16)
Post by b***@freedesktop.org
(In reply to Andrey Grodzovsky from comment #15)
Post by b***@freedesktop.org
I think it just means systems with large VRAM so it will require large BAR
for mapping. But I am not sure on that point.
That's correct. the updates are done with the CPU rather than the GPU
(SDMA). The default BAR size on most systems is usually 256MB for 32 bit
compatibility so the window for CPU access to vram (where the page tables
live) is limited.
Thanks Alex.

dwagner, this is obviously just a workaround and not a fix. It points to some
problem with the SDMA packets. If you want to continue exploring, we can try to
dump some fence traces and the SDMA HW ring content to examine the latest packets
before the hang happened.
--
b***@freedesktop.org
2018-06-28 19:56:46 UTC

--- Comment #18 from dwagner <***@20mm.eu> ---
The good news: So far no crashes during normal uptime with
amdgpu.vm_update_mode=3

The bad news: System crashes immediately upon S3 resume (with messages quite
different from the ones I saw with earlier S3-resume crashes) - I filed bug
report https://bugs.freedesktop.org/show_bug.cgi?id=107065 on this.

(In reply to Andrey Grodzovsky from comment #17)
Post by b***@freedesktop.org
dwagner, this is obviously just a work around and not a fix. It points to
some problem with SDMA packets, if you want to continue exploring we can try
to dump some fence traces and SDMA HW ring content to examine the latest
packets before the hang happened.
If you can include some debug output into "amd-staging-drm-next" that helps
finding the root cause, I might be able to provide some output - if the kernel
survives long enough after the crash to write the system journal - this has not
always been the case.
--
b***@freedesktop.org
2018-06-28 21:09:09 UTC

--- Comment #19 from Andrey Grodzovsky <***@amd.com> ---
Can you use addr2line or gdb with 'list' command to give the line number matching
(In reply to dwagner from comment #18)
Post by b***@freedesktop.org
The good news: So far no crashes during normal uptime with
amdgpu.vm_update_mode=3
The bad news: System crashes immediately upon S3 resume (with messages quite
different from the ones I saw with earlier S3-resume crashes) - I filed bug
report https://bugs.freedesktop.org/show_bug.cgi?id=107065 on this.
(In reply to Andrey Grodzovsky from comment #17)
Post by b***@freedesktop.org
dwagner, this is obviously just a work around and not a fix. It points to
some problem with SDMA packets, if you want to continue exploring we can try
to dump some fence traces and SDMA HW ring content to examine the latest
packets before the hang happened.
If you can include some debug output into "amd-staging-drm-next" that helps
finding the root cause, I might be able to provide some output - if the
kernel survives long enough after the crash to write the system journal -
this has not always been the case.
No need to recompile; we just need to see the content of the SDMA ring buffer
when the hang occurs.

Clone and build our register analyzer from here -
https://cgit.freedesktop.org/amd/umr/ and once the hang happens just run

sudo umr -lb
sudo umr -R gfx[.]
sudo umr -R sdma0[.]
sudo umr -R sdma1[.]

I will probably need more info later but let's try this first.
--
b***@freedesktop.org
2018-06-28 22:56:03 UTC

--- Comment #20 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #19)
Post by b***@freedesktop.org
No need to recompile, just need to see what is the content of SDMA ring
buffer when the hang occurs.
Clone and build our register analyzer from here -
https://cgit.freedesktop.org/amd/umr/ and once the hang happens just run
sudo umr -lb
sudo umr -R gfx[.]
sudo umr -R sdma0[.]
sudo umr -R sdma1[.]
I will probably need more info later but let's try this first.
How can I run "umr" on a crashed system? I guess those register values are
retained over a press of the reset button / reboot?
--
b***@freedesktop.org
2018-06-28 22:57:21 UTC

--- Comment #21 from dwagner <***@20mm.eu> ---
(I meant to write "I guess those register values are NOT retained over a
reboot, right?")
--
b***@freedesktop.org
2018-06-29 00:10:03 UTC

--- Comment #22 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #21)
Post by b***@freedesktop.org
(I meant to write "I guess those register values are NOT retained over a
reboot, right?")
Yes, my assumption was that at least sometimes you still have SSH access to
the system in those cases.
--
b***@freedesktop.org
2018-07-04 23:03:36 UTC

--- Comment #23 from dwagner <***@20mm.eu> ---
Just for the record: At this point, I can say that with
amdgpu.vm_update_mode=3 4.17.2-ARCH runs at least for hours,
not only the minutes it runs without this option before crashing.

I cannot, however, say that the above combination reaches the
some-days-between-amdgpu-crashes uptimes that 4.13.x reached -
in order to be able to test this, I would need S3 resumes to work,
which is subject to bug report 107065.

Without working S3 resumes, there is no way for me to test longer
uptimes because amdgpu consistently crashes (in any version I know
of) if I just let the system run but switch off the display, and I do
not want to keep the connected 4k TV switched on all day and night.
--
b***@freedesktop.org
2018-07-05 13:59:56 UTC

--- Comment #24 from Michel Dänzer <***@daenzer.net> ---
Can you try bisecting between 4.13 and 4.17 to find where stability went
downhill for you?
--
b***@freedesktop.org
2018-07-05 23:32:43 UTC

--- Comment #25 from dwagner <***@20mm.eu> ---
(In reply to Michel Dänzer from comment #24)
Post by b***@freedesktop.org
Can you try bisecting between 4.13 and 4.17 to find where stability went
downhill for you?
A bisect like that is not likely to converge in any reasonable time, given the
stochastic nature of those crashes.

While the mean-time-between-driver-crashes is dramatically different, there
will be occasions on which 4.13 crashes early enough to yield a false "bad",
and there will be occasions on which 4.17 lasts the 20 minutes or so needed
to assume a false "good".
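
The trade-off described here can be put into rough numbers. The sketch below assumes (a modelling assumption, not something measured in this report) that crashes arrive independently at a constant rate, so the time until the next crash is exponentially distributed. The mean values and the number of bisect steps are illustrative placeholders, loosely based on the "minutes" versus "days" figures mentioned in this thread.

```python
import math

def p_survive(test_minutes, mean_minutes):
    """Probability of seeing no crash during a test of the given length,
    under the exponential (memoryless) assumption."""
    return math.exp(-test_minutes / mean_minutes)

def minutes_needed(mean_minutes_bad, max_false_good):
    """Test length at which a bad kernel survives with probability
    max_false_good or less."""
    return -mean_minutes_bad * math.log(max_false_good)

def p_bisect_correct(steps, p_false_good, p_false_bad):
    """Pessimistic chance that every bisect step is judged correctly,
    using the larger of the two per-step error rates."""
    return (1.0 - max(p_false_good, p_false_bad)) ** steps

if __name__ == "__main__":
    mean_bad = 5.0             # minutes, illustrative
    mean_good = 2 * 24 * 60.0  # minutes, illustrative ("days")
    steps = 16                 # hypothetical number of bisect steps

    t = minutes_needed(mean_bad, 0.01)
    false_good = p_survive(t, mean_bad)
    false_bad = 1.0 - p_survive(t, mean_good)
    print(f"test length per step: {t:.1f} min")
    print(f"false good per step:  {false_good:.4f}")
    print(f"false bad per step:   {false_bad:.4f}")
    print(f"all steps correct:    {p_bisect_correct(steps, false_good, false_bad):.2f}")
```

Commits in between may well have intermediate crash rates, which would push the per-step error rates above what this two-rate sketch suggests.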

What about the multitude of debug-options - isn't there one that could allow
for some more insight on when/why the driver crashes?
--
b***@freedesktop.org
2018-07-06 23:20:20 UTC

--- Comment #26 from dwagner <***@20mm.eu> ---
Today for the first time I had a sudden "crash while just browsing with
Firefox" while using the amdgpu.vm_update_mode=3 parameter with the
current-as-of-today amd-staging-drm-next
(bb2e406ba66c2573b68e609e148cab57b1447095) with patch
https://bugs.freedesktop.org/attachment.cgi?id=140418 applied on top.

Different kernel messages than with previous crashes of this kind were emitted:

Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: GPU fault detected: 146
0x0c80440c
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:
VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100190
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0:
VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
Jul 07 01:08:20 ryzen kernel: amdgpu 0000:0a:00.0: VM fault (0x0c, vmid 7,
pasid 32768) at page 1048976, read from 'TC1' (0x54433100) (68)
Jul 07 01:08:25 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
gfx timeout, last signaled seq=75244, last emitted seq=75245
Jul 07 01:08:25 ryzen kernel: amdgpu 0000:0a:00.0: GPU reset begin!

Hope this helps somehow.
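
The numbers in the fault messages above are consistent with each other: the decimal page number equals the hexadecimal VM_CONTEXT1_PROTECTION_FAULT_ADDR value, and the client ID word spells out the client name in ASCII. The following sketch shows the conversions. The function names are hypothetical, and the 4 KiB page size used for the byte address is an assumption not stated in the log.

```python
GPU_PAGE_SIZE = 4096  # ASSUMPTION: 4 KiB GPU pages; not stated in the log itself

def fault_page_to_hex(page):
    """Format a decimal page number the way the FAULT_ADDR register is printed."""
    return f"0x{page:08x}"

def page_to_byte_address(page):
    """Byte address of the start of the faulting page (relies on GPU_PAGE_SIZE)."""
    return page * GPU_PAGE_SIZE

def decode_client_id(word):
    """Interpret a 32-bit value as big-endian ASCII, dropping NUL padding."""
    raw = word.to_bytes(4, "big")
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")

if __name__ == "__main__":
    print(fault_page_to_hex(1048976))          # compare with FAULT_ADDR in the log
    print(hex(page_to_byte_address(1048976)))
    print(decode_client_id(0x54433100))        # compare with the quoted client name
```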
--
b***@freedesktop.org
2018-07-07 08:36:28 UTC

--- Comment #27 from Michel Dänzer <***@daenzer.net> ---
(In reply to dwagner from comment #26)
Post by b***@freedesktop.org
Today for the first time I had a sudden "crash while just browsing with
Firefox" [...]
That could be a Mesa issue, anyway it should probably be tracked separately
from this report.
--
b***@freedesktop.org
2018-07-07 20:08:40 UTC

--- Comment #28 from dwagner <***@20mm.eu> ---
(In reply to Michel Dänzer from comment #27)
Post by b***@freedesktop.org
That could be a Mesa issue, anyway it should probably be tracked separately
from this report.
Created separate bug report https://bugs.freedesktop.org/show_bug.cgi?id=107152

(If that is a Mesa issue, no more than user processes / X11 should have crashed
- but not the kernel amdgpu driver... right?)
--
b***@freedesktop.org
2018-07-09 14:34:51 UTC

--- Comment #29 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #28)
Post by b***@freedesktop.org
(In reply to Michel Dänzer from comment #27)
Post by b***@freedesktop.org
That could be a Mesa issue, anyway it should probably be tracked separately
from this report.
Created separate bug report
https://bugs.freedesktop.org/show_bug.cgi?id=107152
(If that is a Mesa issue, no more than user processes / X11 should have
crashed - but not the kernel amdgpu driver... right?)
Not exactly, MESA could create a bad request (faulty GPU address) which would
lead to this. It can even be triggered on purpose using a debug flag from MESA.
--
b***@freedesktop.org
2018-07-11 22:32:41 UTC

--- Comment #30 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #29)
Post by b***@freedesktop.org
Post by b***@freedesktop.org
(If that is a Mesa issue, no more than user processes / X11 should have
crashed - but not the kernel amdgpu driver... right?)
Not exactly, MESA could create a bad request (faulty GPU address) which
would lead to this. It can even be triggered on purpose using a debug flag
from MESA.
My understanding is that all parts of MESA run as user processes, outside of
the kernel space. If such code is allowed to pass parameters into kernel
functions that make the kernel crash, that would be a veritable security hole
which attackers could exploit to stage at least denial-of-service attacks, if
not worse.
--
b***@freedesktop.org
2018-07-15 08:56:58 UTC

--- Comment #31 from Doctor <***@gmail.com> ---
I got that one too and was able to track the problem down a bit further. Chrome
and video with the GPU enabled will blow it up too. Interestingly, I was able to
reproduce it consistently with my rtl8188eu USB driver: plug it in, connect, and
wpa_supplicant will cause it to explode.
--
b***@freedesktop.org
2018-07-15 09:03:01 UTC

--- Comment #32 from Doctor <***@gmail.com> ---
I ended up here while working on a live dev CD for CodeXL, since all my machines
are memory-based and use no magnetic media. Just cherry-picking the code back to
the last 4.16 gives no problems; here's the working 4.16. I chased this rabbit
for a while, and it pops up like the damn woodchuck in Caddyshack.


Here is the latest as of 11 hours ago 4.19-wip
https://github.com/tekcomm/linux-image-4.19-wip-generic


Here is the latest as of 11 hours ago 4.16 version from three weeks ago with no
woodchucks
https://github.com/tekcomm/linux-kernel-amdgpu-binaries
--
b***@freedesktop.org
2018-07-15 09:07:08 UTC

--- Comment #33 from Doctor <***@gmail.com> ---
I think it may be something as stupid as a var too.
--
b***@freedesktop.org
2018-07-15 19:59:36 UTC

--- Comment #34 from dwagner <***@20mm.eu> ---
(In reply to Doctor from comment #32)
Post by b***@freedesktop.org
Just cherry picking the code
back to the last 4.16 and no problems Heres the working 4.16 . I chased
this rabbit for awhile and it pops up like the dam wood chuck in caddie
shack.
Here is the latest as of 11 hours ago 4.19-wip
https://github.com/tekcomm/linux-image-4.19-wip-generic
I am not sure I understand what you are trying to tell us, here.

The repository you linked does not seem to contain any relevant commits
changing kernel source code.
--
b***@freedesktop.org
2018-07-16 14:06:32 UTC

--- Comment #35 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #30)
> (In reply to Andrey Grodzovsky from comment #29)
> > > (If that is a Mesa issue, no more than user processes / X11 should have
> > > crashed - but not the kernel amdgpu driver... right?)
> > Not exactly, MESA could create a bad request (faulty GPU address) which
> > would lead to this. It can even be triggered on purpose using a debug flag
> > from MESA.
> My understanding is that all parts of MESA run as user processes, outside of
> the kernel space. If such code is allowed to pass parameters into kernel
> functions that make the kernel crash, that would be a veritable security
> hole which attackers could exploit to stage at least denial-of-service
> attacks, if not worse.
There is no impact on the kernel. Please note that this is a GPU page fault,
not a CPU page fault, so the kernel keeps working normally, doesn't hang, and
stays usable. You might get a black screen out of this and have to reboot the
graphics card or maybe the entire system to recover, but I don't see any system
security and stability compromise here.
b***@freedesktop.org
2018-07-29 10:02:00 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

Roshless <***@gmail.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |***@gmail.com

--- Comment #36 from Roshless <***@gmail.com> ---
*** Bug 107311 has been marked as a duplicate of this bug. ***
b***@freedesktop.org
2018-08-08 23:07:38 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #37 from dwagner <***@20mm.eu> ---
In the related bug report (https://bugs.freedesktop.org/show_bug.cgi?id=107152)
I noticed that this bug can be triggered very reliably and quickly by playing a
video with a deliberately lowered frame rate:
"mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm"

This led me to assume this bug might be caused by the dynamic power management,
that often ramps performance up/down when a video is played at such a low frame
rate.

And indeed, I found this confirmed by many experiments: If I use a script like
#!/bin/bash
cd /sys/class/drm/card0/device
echo manual >power_dpm_force_performance_level
# low
echo 0 >pp_dpm_mclk
echo 0 >pp_dpm_sclk
# medium
#echo 1 >pp_dpm_mclk
#echo 1 >pp_dpm_sclk
# high
#echo 1 >pp_dpm_mclk
#echo 6 >pp_dpm_sclk
to enforce just any performance level, then the crashes do not occur anymore -
also with the "low frame rate video test".

So it seems that the transition from one "dpm" performance level to another,
with a certain probability, causes these crashes. And the more often the
transitions occur, the sooner one will experience them.

(BTW: For some unknown reason, invoking "xrandr" or enabling a monitor after
sleep causes the above settings to get lost, so one has to invoke the above
script again.)
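
To check which level is actually active before and after forcing one, the same sysfs files can be read back. The following is only a minimal sketch: it assumes the usual amdgpu output format of one level per line with a trailing `*` on the active one, and the function name `active_dpm_level` is made up for illustration.

```shell
#!/bin/bash
# Print the index of the active level from pp_dpm_sclk / pp_dpm_mclk style text.
# Assumed input format (example values only, '*' marks the active level):
#   0: 214Mhz
#   1: 481Mhz *
# Reads the file given as $1, or stdin when no argument is passed.
active_dpm_level() {
    awk '/\*[[:space:]]*$/ { sub(/:.*/, "", $1); print $1 }' "${1:--}"
}
```

On the affected machine this could be called as `active_dpm_level /sys/class/drm/card0/device/pp_dpm_sclk`, for example once before and once after invoking "xrandr", to see whether the forced setting got lost.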
b***@freedesktop.org
2018-08-09 20:56:06 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #38 from dwagner <***@20mm.eu> ---
*** Bug 107152 has been marked as a duplicate of this bug. ***
b***@freedesktop.org
2018-08-14 21:27:41 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #39 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #37)
> In the related bug report
> (https://bugs.freedesktop.org/show_bug.cgi?id=107152) I noticed that this
> bug can be triggered very reliably and quickly by playing a video with a
> deliberately lowered frame rate:
> "mpv --no-correct-pts --fps=3 --ao=null some_arbitrary_video.webm"
> This led me to assume this bug might be caused by the dynamic power
> management, that often ramps performance up/down when a video is played at
> such a low frame rate.
I tried exactly the same: I attempted to reproduce with the same card model and
the latest kernel, and ran a webm clip with mpv the same way you did, but it
didn't happen.
b***@freedesktop.org
2018-08-15 14:24:24 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #40 from Andrey Grodzovsky <***@amd.com> ---
Created attachment 141112
--> https://bugs.freedesktop.org/attachment.cgi?id=141112&action=edit
.config

I uploaded my .config file - maybe something in your Kconfig flags makes this
happen - you can try and rebuild latest kernel from Alex's repository using my
.config and see if you don't experience this anymore.
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Other than that, since your system hard-hangs and you can't do any postmortem
dumps, you can at least provide output from event tracing through trace_pipe to
catch live logs on the fly. Maybe we can infer something from there...

So again -
Load the system and, before starting to reproduce, run the following trace
command -

sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e \
"amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"

then cd /sys/kernel/debug/tracing && cat trace_pipe

When the problem happens, just copy all the output from the terminal to a log
file. Make sure your terminal app has the largest possible buffer to catch ALL
the output.
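
Because the machine hard-hangs, relying on terminal scrollback is fragile; streaming the output into a file on a second machine is one alternative. The sketch below is an assumption-laden illustration, not part of the instructions above: the helper name `capture_to_log` and the host name `testbox` are placeholders, and it assumes passwordless sudo on the machine under test.

```shell
#!/bin/bash
# Run a command and append everything it prints (stdout and stderr) to a log
# file, while still showing it on the terminal.
capture_to_log() {
    local logfile=$1
    shift
    "$@" 2>&1 | tee -a "$logfile"
}

# Example, run on a second machine so the log survives the hang
# ("testbox" is a placeholder host name):
#   capture_to_log trace.log ssh testbox 'sudo cat /sys/kernel/debug/tracing/trace_pipe'
```

Since `tee` writes on the receiving side, whatever reached the network before the hang should end up in the file.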
b***@freedesktop.org
2018-08-15 22:03:38 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #41 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #40)
> Created attachment 141112 [details]
> .config
> I uploaded my .config file - maybe something in your Kconfig flags makes
> this happen - you can try and rebuild latest kernel from Alex's repository
> using my .config and see if you don't experience this anymore.
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
Did just that - but still the video test crashes after at most a few minutes,
and does not crash with DPM turned off. So we can rule out our .config
differences (of which there are many).
> Other than that, since you system hard hangs so you can't do any postmortem
> dumps, you can at least provide output from events tracing though trace_pipe
> to catch live logs on the fly. Maybe we can infer something from there...
> So again -
> Load the system and before starting reproduce run the following trace
> command -
> sudo trace-cmd start -e dma_fence -e gpu_scheduler -e amdgpu -v -e
> "amdgpu:amdgpu_mm_rreg" -e "amdgpu:amdgpu_mm_wreg" -e "amdgpu:amdgpu_iv"
> then cd /sys/kernel/debug/tracing && cat trace_pipe
> When the problem happens just copy all the output from the terminal to a log
> file. Make sure your terminal app has largest possible buffer to catch ALL
> the output.
Will try that at the next opportunity, probably tomorrow evening.
b***@freedesktop.org
2018-08-16 21:53:57 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #42 from dwagner <***@20mm.eu> ---
Ok, did the proposed debugging session with trace-cmd, with output to a
different PC over ssh. Using today's amd-staging-drm-next and btw., Arch
updated the Xorg server earlier today.

This time it took about 4 minutes until the video playback with 3 fps crashed.
The symptom was the same (a one-colored blank screen and a subsequent
system crash), but this time the kernel and ssh session survived the crash for
some seconds, enough for me to also issue the earlier suggested "umr -O verbose
-R gfx[.]" command after the amdgpu crash. I can upload the output of that,
too, but this was the last command executed: the system crashed completely
while running it (so its output may be partial).

Find attached dmesg, trace, and umr output.
b***@freedesktop.org
2018-08-16 21:55:49 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #43 from dwagner <***@20mm.eu> ---
Created attachment 141155
--> https://bugs.freedesktop.org/attachment.cgi?id=141155&action=edit
trace-cmd induced output during 3-fps-video replay and crash
b***@freedesktop.org
2018-08-16 21:56:38 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #44 from dwagner <***@20mm.eu> ---
Created attachment 141156
--> https://bugs.freedesktop.org/attachment.cgi?id=141156&action=edit
dmesg from boot to after the 3-fps-video test crash
b***@freedesktop.org
2018-08-16 21:57:19 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #45 from dwagner <***@20mm.eu> ---
Created attachment 141157
--> https://bugs.freedesktop.org/attachment.cgi?id=141157&action=edit
output of umr command after 3-fps-video test crash
b***@freedesktop.org
2018-08-16 22:31:11 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #46 from Andrey Grodzovsky <***@amd.com> ---
Thanks.
b***@freedesktop.org
2018-08-17 21:25:08 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #47 from Andrey Grodzovsky <***@amd.com> ---
Created attachment 141174
--> https://bugs.freedesktop.org/attachment.cgi?id=141174&action=edit
add_debug_info.patch

I am attaching a basic debug patch, please try to apply it. It should give a
bit more info in dmesg when a VM fault happens. I wasn't able to test it on my
system, so it might be buggy or crash.

Reproduce again with trace-cmd like before, and once the fault happens, if
possible, try to quickly run

sudo umr -O halt_waves -wa

and only if you still have a running system after that, run
sudo umr -O verbose -R gfx[.]

The driver should be loaded with amdgpu.vm_fault_stop=2 set from grub.
Also check if adding amdgpu.vm_debug=1 makes the issue reproduce more quickly.
b***@freedesktop.org
2018-08-18 21:36:03 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #48 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #47)
> Created attachment 141174 [details] [review]
> add_debug_info.patch
> A am attaching a basic debug patch, please try to apply it.
Done.
> It should give a
> bit more info in dmesg whe VM fault happens.
Hmm - I could not see any additional output resulting from it.
> Reproduce again like before with the cmd-trace like before and once the
> fault happens if possible try quickly run
> sudo umr -O halt_waves -wa
> and only if you still have running system after that do the
> sudo umr -O verbose -R gfx[.]
> The driver should be loaded amdgpu.vm_fault_stop=2 from grub
Did that - will attach the script "gpu_debug3.sh" and its output - this time,
dmesg and trace output are in the same file, if you want to look only at the
dmesg part, "grep '^\[' gpu_debug_3.txt" will get it.

I reproduced the bug 4 times, on 2 occasions no error was emitted before
crashing, the 2 other times both umr commands could still run - since the error
message looked the same, I'll attach the shorter file, where the crash occurred
more quickly.
> Also check if adding amdgpu.vm_debug=1 makes the issue reproduce more quickly
I used that setting, but it did not seem to make a difference for how quickly
the crash occurred - still "some seconds to some minutes".
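
The grep hint above can be extended to split such a combined log into two files in one go. This is a small sketch under the same assumption as the comment, namely that kernel messages are exactly the lines starting with `[`; the function name `split_debug_log` and the output file names are hypothetical.

```shell
#!/bin/bash
# Split a combined dmesg/trace log: lines starting with '[' are treated as
# kernel messages, everything else as trace or tool output.
split_debug_log() {
    local in=$1 dmesg_out=$2 other_out=$3
    grep '^\[' "$in" > "$dmesg_out"
    grep -v '^\[' "$in" > "$other_out"
    return 0   # grep exits non-zero on "no match"; that is not an error here
}

# Example: split_debug_log gpu_debug_3.txt dmesg_part.txt trace_part.txt
```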
b***@freedesktop.org
2018-08-18 21:37:20 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #49 from dwagner <***@20mm.eu> ---
Created attachment 141189
--> https://bugs.freedesktop.org/attachment.cgi?id=141189&action=edit
script used to generate the gpu_debug_3.txt (when executed via ssh -t ...)
b***@freedesktop.org
2018-08-18 21:38:10 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #50 from dwagner <***@20mm.eu> ---
Created attachment 141190
--> https://bugs.freedesktop.org/attachment.cgi?id=141190&action=edit
dmesg / trace / umr output from gpu_debug3.sh
b***@freedesktop.org
2018-08-18 21:40:01 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

dwagner <***@20mm.eu> changed:

What |Removed |Added
----------------------------------------------------------------------------
Attachment #141190|0 |1
is obsolete| |

--- Comment #51 from dwagner <***@20mm.eu> ---
Created attachment 141191
--> https://bugs.freedesktop.org/attachment.cgi?id=141191&action=edit
xz-compressed output of gpu_debug3.sh - dmesg, trace, umr
b***@freedesktop.org
2018-08-18 21:43:23 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #52 from dwagner <***@20mm.eu> ---
One other experiment I made: I wrote a script to quickly toggle pp_dpm_mclk and
pp_dpm_sclk while playing a 3 fps video with
power_dpm_force_performance_level=manual. Could not reproduce the crashes that
happen with power_dpm_force_performance_level=auto this way.
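
The toggling script itself was not posted. A rough sketch of what such an experiment might look like follows; it is not the script used above. The function name `toggle_dpm_levels`, the choice of levels 0 and 1, and the overridable `DPM_DIR` variable (which allows a dry run against a scratch directory) are all assumptions.

```shell
#!/bin/bash
# Alternate between two forced DPM levels a given number of times.
# DPM_DIR can be pointed at a scratch directory for a dry run.
DPM_DIR=${DPM_DIR:-/sys/class/drm/card0/device}

toggle_dpm_levels() {
    local rounds=$1 delay=${2:-0.1} i level
    echo manual > "$DPM_DIR/power_dpm_force_performance_level"
    for ((i = 0; i < rounds; i++)); do
        level=$((i % 2))    # alternates 0, 1, 0, 1, ...
        echo "$level" > "$DPM_DIR/pp_dpm_mclk"
        echo "$level" > "$DPM_DIR/pp_dpm_sclk"
        sleep "$delay"
    done
}

# Example (as root, while the 3 fps video is playing):
#   toggle_dpm_levels 1000 0.05
```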
b***@freedesktop.org
2018-08-20 14:16:08 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #53 from Andrey Grodzovsky <***@amd.com> ---
Created attachment 141198
--> https://bugs.freedesktop.org/attachment.cgi?id=141198&action=edit
add_debug_info2.patch

Try this patch instead; I might have missed some prints in the first one.
In the last log you attached I haven't seen any UMR dumps or GPU fault prints
in dmesg. The GPU fault has to be in the log to compare the faulty address
against the debug prints in the patch.
b***@freedesktop.org
2018-08-21 08:41:52 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #54 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #53)
> Created attachment 141198 [details] [review]
> add_debug_info2.patch
> Try this patch instead, i might be missing some prints in the first one.
Can try that this evening.
> In the last log you attached I haven't seen any UMR dumps or GPU fault
> prints in dmesg. THe GPU fault has to be in the log to compare the faulty
> address against the debug prints in the patch.
In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
output at the time of the crash (238 seconds after the reboot):

----------------------------------------------
...
mpv/vo-897 [005] .... 235.191542: dma_fence_wait_start:
driver=drm_sched timeline=gfx context=162 seqno=87
mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal:
driver=drm_sched timeline=gfx context=162 seqno=87
kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled:
driver=amdgpu timeline=sdma1 context=11 seqno=210
kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled:
driver=amdgpu timeline=sdma1 context=11 seqno=211
[ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout,
signaled seq=32624, emitted seq=32626
[ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
[ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!

crash detected!

executing umr -O halt_waves -wa
No active waves!


executing umr -O verbose -R gfx[.]

polaris11.gfx.rptr == 1792
polaris11.gfx.wptr == 1792
polaris11.gfx.drv_wptr == 1792
polaris11.gfx.ring[1761] == 0xffff1000 ...
polaris11.gfx.ring[1762] == 0xffff1000 ...
polaris11.gfx.ring[1763] == 0xffff1000 ...
polaris11.gfx.ring[1764] == 0xffff1000 ...
polaris11.gfx.ring[1765] == 0xffff1000 ...
polaris11.gfx.ring[1766] == 0xffff1000 ...
polaris11.gfx.ring[1767] == 0xffff1000 ...
polaris11.gfx.ring[1768] == 0xffff1000 ...
polaris11.gfx.ring[1769] == 0xffff1000 ...
polaris11.gfx.ring[1770] == 0xffff1000 ...
polaris11.gfx.ring[1771] == 0xffff1000 ...
polaris11.gfx.ring[1772] == 0xffff1000 ...
polaris11.gfx.ring[1773] == 0xffff1000 ...
polaris11.gfx.ring[1774] == 0xffff1000 ...
polaris11.gfx.ring[1775] == 0xffff1000 ...
polaris11.gfx.ring[1776] == 0xffff1000 ...
polaris11.gfx.ring[1777] == 0xffff1000 ...
polaris11.gfx.ring[1778] == 0xffff1000 ...
polaris11.gfx.ring[1779] == 0xffff1000 ...
polaris11.gfx.ring[1780] == 0xffff1000 ...
polaris11.gfx.ring[1781] == 0xffff1000 ...
polaris11.gfx.ring[1782] == 0xffff1000 ...
polaris11.gfx.ring[1783] == 0xffff1000 ...
polaris11.gfx.ring[1784] == 0xffff1000 ...
polaris11.gfx.ring[1785] == 0xffff1000 ...
polaris11.gfx.ring[1786] == 0xffff1000 ...
polaris11.gfx.ring[1787] == 0xffff1000 ...
polaris11.gfx.ring[1788] == 0xffff1000 ...
polaris11.gfx.ring[1789] == 0xffff1000 ...
polaris11.gfx.ring[1790] == 0xffff1000 ...
polaris11.gfx.ring[1791] == 0xffff1000 ...
polaris11.gfx.ring[1792] == 0xc0032200 rwD

trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'

done after crash, flashing NUMLOCK LED.
amdgpu_cs:0-799 [001] .... 286.852838: amdgpu_bo_list_set:
list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072
amdgpu_cs:0-799 [001] .... 286.852846: amdgpu_bo_list_set:
list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072
...
----------------------------------------------

But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages this
time. Sometimes such are emitted, sometimes not.
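
When comparing several such logs, it can help to pull the ring name and the gap between emitted and signaled sequence numbers out of the timeout lines. This is only a sketch: it assumes each message sits on a single line in the real log (the wrapping above comes from the mail formatting) and accepts both the older "last signaled seq=" and the newer "signaled seq=" wording. The function name `pending_fences` is hypothetical.

```shell
#!/bin/bash
# For each amdgpu_job_timedout line on stdin, print "<ring> <emitted - signaled>".
pending_fences() {
    sed -nE 's/.*ring ([a-z0-9_]+) timeout, (last )?signaled seq=([0-9]+), (last )?emitted seq=([0-9]+).*/\1 \3 \5/p' |
    while read -r ring signaled emitted; do
        echo "$ring $((emitted - signaled))"
    done
}

# Example: grep amdgpu_job_timedout dmesg_part.txt | pending_fences
```

For the sdma0 line quoted above (signaled seq=32624, emitted seq=32626) this prints `sdma0 2`.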
b***@freedesktop.org
2018-08-21 14:43:24 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #55 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #54)
> (In reply to Andrey Grodzovsky from comment #53)
> > Created attachment 141198 [details] [review] [review]
> > add_debug_info2.patch
> > Try this patch instead, i might be missing some prints in the first one.
> Can try that this evening.
> > In the last log you attached I haven't seen any UMR dumps or GPU fault
> > prints in dmesg. THe GPU fault has to be in the log to compare the faulty
> > address against the debug prints in the patch.
> In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> output at the time of the crash (238 seconds after the reboot):
> ----------------------------------------------
> ...
> driver=drm_sched timeline=gfx context=162 seqno=87
> driver=drm_sched timeline=gfx context=162 seqno=87
> driver=amdgpu timeline=sdma1 context=11 seqno=210
> driver=amdgpu timeline=sdma1 context=11 seqno=211
> [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, signaled seq=32624, emitted seq=32626
> [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> crash detected!
> executing umr -O halt_waves -wa
> No active waves!
Did you use the amdgpu.vm_fault_stop=2 parameter? In case a fault happened,
that should have frozen the GPU's compute units, and hence the above command
would produce a lot of wave info.
b***@freedesktop.org
2018-08-21 21:16:52 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #56 from dwagner <***@20mm.eu> ---
(In reply to Andrey Grodzovsky from comment #55)
> > In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> > output at the time of the crash (238 seconds after the reboot):
> > ----------------------------------------------
> > ...
> > driver=drm_sched timeline=gfx context=162 seqno=87
> > driver=drm_sched timeline=gfx context=162 seqno=87
> > driver=amdgpu timeline=sdma1 context=11 seqno=210
> > driver=amdgpu timeline=sdma1 context=11 seqno=211
> > [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> > timeout, signaled seq=32624, emitted seq=32626
> > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > crash detected!
> > executing umr -O halt_waves -wa
> > No active waves!
> Did you use amdgpu.vm_fault_stop=2 parameter ? In case a fault happened that
> should have froze GPUs compute units and hence the above command would
> produce a lot of wave info.
Yes I did, as can be seen from the kernel command line at the very beginning of
the file I attached:
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux_amd
root=UUID=b5d56e15-18f3-4783-af84-bbff3bbff3ef rw
cryptdevice=/dev/nvme0n1p2:root:allow-discards libata.force=1.5 video=DP-1:d
video=DVI-D-1:d video=HDMI-A-1:1024x768 amdgpu.dc=1 amdgpu.vm_update_mode=0
amdgpu.dpm=-1 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2
amdgpu.vm_debug=1

Could the "amdgpu 0000:0a:00.0: GPU reset begin!" message indicate a procedure
that discards whatever has been in those "waves" before? If yes, could
amdgpu.gpu_recovery=0 prevent that from happening?
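
With this many amdgpu options in play, a quick check of what was actually passed at boot can avoid confusion between test runs. The following is a minimal sketch; the function name `cmdline_param` is made up, and it only inspects the boot command line, not the value the driver finally uses.

```shell
#!/bin/bash
# Print the value of a kernel command-line parameter such as
# amdgpu.vm_fault_stop. Reads /proc/cmdline unless another file is given.
# Returns non-zero when the parameter is not present.
cmdline_param() {
    local name=$1 file=${2:-/proc/cmdline} word
    local -a words=()
    read -r -a words < "$file" || true
    for word in "${words[@]}"; do
        case $word in
            "$name"=*) echo "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Example:
#   cmdline_param amdgpu.gpu_recovery || echo "amdgpu.gpu_recovery not set"
```

If the module exposes the parameter, the effective value should also be visible under /sys/module/amdgpu/parameters/, which would cover options set via modprobe configuration as well.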
b***@freedesktop.org
2018-08-21 21:29:48 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #57 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #56)
> (In reply to Andrey Grodzovsky from comment #55)
> > > In above attached file "xz-compressed output of gpu_debug3.sh" there is umr
> > > output at the time of the crash (238 seconds after the reboot):
> > > ----------------------------------------------
> > > ...
> > > driver=drm_sched timeline=gfx context=162 seqno=87
> > > driver=drm_sched timeline=gfx context=162 seqno=87
> > > driver=amdgpu timeline=sdma1 context=11 seqno=210
> > > driver=amdgpu timeline=sdma1 context=11 seqno=211
> > > [ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> > > timeout, signaled seq=32624, emitted seq=32626
> > > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > > [ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
> > > crash detected!
> > > executing umr -O halt_waves -wa
> > > No active waves!
> > Did you use amdgpu.vm_fault_stop=2 parameter ? In case a fault happened that
> > should have froze GPUs compute units and hence the above command would
> > produce a lot of wave info.
> Yes I did, as can be seen from the kernel command line at the very beginning
> of the file I attached:
> [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux_amd
> root=UUID=b5d56e15-18f3-4783-af84-bbff3bbff3ef rw
> cryptdevice=/dev/nvme0n1p2:root:allow-discards libata.force=1.5 video=DP-1:d
> video=DVI-D-1:d video=HDMI-A-1:1024x768 amdgpu.dc=1 amdgpu.vm_update_mode=0
> amdgpu.dpm=-1 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_fault_stop=2
> amdgpu.vm_debug=1
> Could the "amdgpu 0000:0a:00.0: GPU reset begin!" message indicate a
> procedure that discards whatever has been in thoses "waves" before? If yes,
> could amdgpu.gpu_recovery=0 prevent that from happening?
Yes, missed that one. No resets.
b***@freedesktop.org
2018-08-22 00:24:35 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #58 from dwagner <***@20mm.eu> ---
Here comes another trace log, with your info2.patch applied.

Something must have changed since the last test, as it took pretty long this
time to reproduce the crash. Could that have been caused by
https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8
now being part of the kernel?

However, the latest trace you find attached below is not much different to the
last one, xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you:

[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout,
signaled seq=475104, emitted seq=475106
[ 1510.023117] [drm] GPU recovery disabled.

amdgpu_cs:0-806 [012] .... 1787.493126: amdgpu_vm_bo_cs:
soffs=00001001a0, eoffs=00001001b9, flags=70
amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs:
soffs=0000100200, eoffs=00001021e0, flags=70
amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs:
soffs=0000102200, eoffs=00001041e0, flags=70
amdgpu_cs:0-806 [012] .... 1787.493129: amdgpu_vm_bo_cs:
soffs=000010c1e0, eoffs=000010c2e1, flags=70
amdgpu_cs:0-806 [012] .... 1787.493131: drm_sched_job:
entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job
count:8, hw job count:0

And later in the file you can find:
------------------------------------------------------
crash detected!

executing umr -O halt_waves -wa
No active waves!

executing umr -O verbose -R gfx[.]

polaris11.gfx.rptr == 512
polaris11.gfx.wptr == 512
polaris11.gfx.drv_wptr == 512
polaris11.gfx.ring[ 481] == 0xffff1000 ...
polaris11.gfx.ring[ 482] == 0xffff1000 ...
polaris11.gfx.ring[ 483] == 0xffff1000 ...
polaris11.gfx.ring[ 484] == 0xffff1000 ...
polaris11.gfx.ring[ 485] == 0xffff1000 ...
polaris11.gfx.ring[ 486] == 0xffff1000 ...
polaris11.gfx.ring[ 487] == 0xffff1000 ...
polaris11.gfx.ring[ 488] == 0xffff1000 ...
polaris11.gfx.ring[ 489] == 0xffff1000 ...
polaris11.gfx.ring[ 490] == 0xffff1000 ...
polaris11.gfx.ring[ 491] == 0xffff1000 ...
polaris11.gfx.ring[ 492] == 0xffff1000 ...
polaris11.gfx.ring[ 493] == 0xffff1000 ...
polaris11.gfx.ring[ 494] == 0xffff1000 ...
polaris11.gfx.ring[ 495] == 0xffff1000 ...
polaris11.gfx.ring[ 496] == 0xffff1000 ...
polaris11.gfx.ring[ 497] == 0xffff1000 ...
polaris11.gfx.ring[ 498] == 0xffff1000 ...
polaris11.gfx.ring[ 499] == 0xffff1000 ...
polaris11.gfx.ring[ 500] == 0xffff1000 ...
polaris11.gfx.ring[ 501] == 0xffff1000 ...
polaris11.gfx.ring[ 502] == 0xffff1000 ...
polaris11.gfx.ring[ 503] == 0xffff1000 ...
polaris11.gfx.ring[ 504] == 0xffff1000 ...
polaris11.gfx.ring[ 505] == 0xffff1000 ...
polaris11.gfx.ring[ 506] == 0xffff1000 ...
polaris11.gfx.ring[ 507] == 0xffff1000 ...
polaris11.gfx.ring[ 508] == 0xffff1000 ...
polaris11.gfx.ring[ 509] == 0xffff1000 ...
polaris11.gfx.ring[ 510] == 0xffff1000 ...
polaris11.gfx.ring[ 511] == 0xffff1000 ...
polaris11.gfx.ring[ 512] == 0xc0032200 rwD


trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'

done after crash.
-------------------------------------------

So even without GPU reset, still no "waves". And the error message also does
not state any VM fault address.
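
In both umr dumps the gfx ring's rptr and wptr are equal, which presumably means the gfx ring had no unconsumed commands at that moment; that would fit a hang reported on sdma0 rather than gfx. Checking this across many dumps can be scripted. The sketch below assumes the "name == value" output format shown above; the function name `ring_is_drained` is hypothetical.

```shell
#!/bin/bash
# Succeed (exit 0) when a umr ring dump reports rptr == wptr.
# Reads the file given as $1, or stdin when no argument is passed.
# The drv_wptr line is ignored because it does not end in ".wptr".
ring_is_drained() {
    awk '$1 ~ /\.rptr$/ { r = $3 }
         $1 ~ /\.wptr$/ { w = $3 }
         END { exit !(r != "" && r == w) }' "${1:--}"
}

# Example:
#   sudo umr -O verbose -R gfx[.] | ring_is_drained && echo "gfx ring drained"
```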
b***@freedesktop.org
2018-08-22 00:26:06 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #59 from dwagner <***@20mm.eu> ---
Created attachment 141228
--> https://bugs.freedesktop.org/attachment.cgi?id=141228&action=edit
latest crash trace output, without gpu_reset
b***@freedesktop.org
2018-08-22 14:33:03 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #60 from Andrey Grodzovsky <***@amd.com> ---
(In reply to dwagner from comment #58)
> Here comes another trace log, with your info2.patch applied.
> Something must have changed since the last test, as it took pretty long this
> time to reproduce the crash. Could that have been caused by
> https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8
> now being part of the kernel?
Don't think it's related. This code is more related to virtualization.
> However, the latest trace you find attached below is not much different to
> the last one, xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you:
> [ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
> timeout, signaled seq=475104, emitted seq=475106
> [ 1510.023117] [drm] GPU recovery disabled.
That just means you are again running with the GPU VM update mode set to use
SDMA, which is seen in your dmesg (amdgpu.vm_update_mode=0), so you are again
experiencing the original issue of the SDMA hang. Please use
amdgpu.vm_update_mode=3 to get back to the VM_FAULTs issue.
Post by b***@freedesktop.org
soffs=00001001a0, eoffs=00001001b9, flags=70
soffs=0000100200, eoffs=00001021e0, flags=70
soffs=0000102200, eoffs=00001041e0, flags=70
soffs=000010c1e0, eoffs=000010c2e1, flags=70
entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job
count:8, hw job count:0
------------------------------------------------------
crash detected!
executing umr -O halt_waves -wa
No active waves!
executing umr -O verbose -R gfx[.]
polaris11.gfx.rptr == 512
polaris11.gfx.wptr == 512
polaris11.gfx.drv_wptr == 512
polaris11.gfx.ring[ 481] == 0xffff1000 ...
polaris11.gfx.ring[ 482] == 0xffff1000 ...
polaris11.gfx.ring[ 483] == 0xffff1000 ...
polaris11.gfx.ring[ 484] == 0xffff1000 ...
polaris11.gfx.ring[ 485] == 0xffff1000 ...
polaris11.gfx.ring[ 486] == 0xffff1000 ...
polaris11.gfx.ring[ 487] == 0xffff1000 ...
polaris11.gfx.ring[ 488] == 0xffff1000 ...
polaris11.gfx.ring[ 489] == 0xffff1000 ...
polaris11.gfx.ring[ 490] == 0xffff1000 ...
polaris11.gfx.ring[ 491] == 0xffff1000 ...
polaris11.gfx.ring[ 492] == 0xffff1000 ...
polaris11.gfx.ring[ 493] == 0xffff1000 ...
polaris11.gfx.ring[ 494] == 0xffff1000 ...
polaris11.gfx.ring[ 495] == 0xffff1000 ...
polaris11.gfx.ring[ 496] == 0xffff1000 ...
polaris11.gfx.ring[ 497] == 0xffff1000 ...
polaris11.gfx.ring[ 498] == 0xffff1000 ...
polaris11.gfx.ring[ 499] == 0xffff1000 ...
polaris11.gfx.ring[ 500] == 0xffff1000 ...
polaris11.gfx.ring[ 501] == 0xffff1000 ...
polaris11.gfx.ring[ 502] == 0xffff1000 ...
polaris11.gfx.ring[ 503] == 0xffff1000 ...
polaris11.gfx.ring[ 504] == 0xffff1000 ...
polaris11.gfx.ring[ 505] == 0xffff1000 ...
polaris11.gfx.ring[ 506] == 0xffff1000 ...
polaris11.gfx.ring[ 507] == 0xffff1000 ...
polaris11.gfx.ring[ 508] == 0xffff1000 ...
polaris11.gfx.ring[ 509] == 0xffff1000 ...
polaris11.gfx.ring[ 510] == 0xffff1000 ...
polaris11.gfx.ring[ 511] == 0xffff1000 ...
polaris11.gfx.ring[ 512] == 0xc0032200 rwD
trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
done after crash.
-------------------------------------------
So even without GPU reset, still no "waves". And the error message also does
not state any VM fault address.
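
The register dump above shows rptr, wptr and drv_wptr all at 512. A small sketch for pulling that comparison out of such output follows. It assumes lines of the form "<asic>.<ring>.rptr == <n>" as printed above; the function name ring_pointer_status is hypothetical, and reading equal pointers as "nothing left unfetched in the ring" is an interpretation, not something stated in this thread.

```shell
# Hypothetical helper: read "umr -R"-style lines on stdin and print, per ring,
#   <ring> rptr=<n> wptr=<n> <caught-up|pending>
# "caught-up" means rptr equals wptr; drv_wptr lines are ignored.
ring_pointer_status() {
    awk '
        $2 == "==" && $1 ~ /\.rptr$/ { sub(/\.rptr$/, "", $1); rptr[$1] = $3 + 0 }
        $2 == "==" && $1 ~ /\.wptr$/ { sub(/\.wptr$/, "", $1); wptr[$1] = $3 + 0 }
        END {
            for (ring in rptr)
                if (ring in wptr) {
                    state = (rptr[ring] == wptr[ring]) ? "caught-up" : "pending"
                    printf "%s rptr=%d wptr=%d %s\n", ring, rptr[ring], wptr[ring], state
                }
        }
    '
}
```

Fed the three pointer lines from the dump above, it reports polaris11.gfx as caught-up.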
b***@freedesktop.org
2018-08-22 22:18:11 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322
Please use amdgpu.vm_update_mode=3 to get back to VM_FAULTs issue.
The "good" news is that reproduction of the crashes with 3-fps-video-replay is
very quick when using amdgpu.vm_update_mode=3.

But the bad news is that I have not been able to get useful error output when
using vm_update_mode=3.

At first I also tried amdgpu.vm_debug=1, and with that, in 10 crashes not a
single line of error output was emitted to either the ssh channel or the system
journal.

I then tried amdgpu.vm_debug=0. With that, a few lines of error output were
logged, but nothing really useful - see also the attached example:

[ 912.447139] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
signaled seq=12818, emitted seq=12819
[ 912.447145] [drm] GPU recovery disabled.

These are the only lines indicating the error. Not even the
echo "crash detected!"
that follows the
dmesg -w | tee /dev/tty | grep -m 1 -e "amdgpu.*GPU" -e "amdgpu.*ERROR"
gets executed, much less the umr commands that would come after it.
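
The debug script itself is not shown in this thread, so the following is only a sketch of a variation on the quoted pipeline, with a hypothetical function name. It uses the same two match patterns, but runs the follow-up command inside the reading stage as soon as the first matching line arrives.

```shell
# Hypothetical helper: echo every log line read from stdin, and on the first
# line matching "amdgpu.*GPU" or "amdgpu.*ERROR" run the command given as
# arguments, then stop reading. Returns 1 if no line matched.
on_first_gpu_error() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
        case $line in
            *amdgpu*GPU*|*amdgpu*ERROR*)
                "$@"          # e.g. echo "crash detected!" or a diagnostics script
                return 0
                ;;
        esac
    done
    return 1
}

# Intended use (not run here):
#   dmesg -w | on_first_gpu_error echo "crash detected!"
```

The reason for this shape: a shell only runs the command after a pipeline once every stage has exited, and while grep -m 1 exits on the first match, tee and dmesg -w only notice that when they next try to write a line. Whether that played any role in the missing "crash detected!" output here is not known.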

What could I do to not let the kernel die so quickly when using
amdgpu.vm_update_mode=3?
b***@freedesktop.org
2018-08-22 22:18:49 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #62 from dwagner <***@20mm.eu> ---
Created attachment 141243
--> https://bugs.freedesktop.org/attachment.cgi?id=141243&action=edit
crash trace with amdgpu.vm_update_mode=3
b***@freedesktop.org
2018-09-19 23:35:10 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #63 from Anthony Ruhier <***@hotmail.com> ---
FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have been
fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
b***@freedesktop.org
2018-09-19 23:35:42 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #64 from Anthony Ruhier <***@hotmail.com> ---
(In reply to Anthony Ruhier from comment #63)
Post by b***@freedesktop.org
FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
Forgot to say that I have a vega 64.
b***@freedesktop.org
2018-09-23 22:04:23 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #65 from dwagner <***@20mm.eu> ---
(In reply to Anthony Ruhier from comment #63)
Post by b***@freedesktop.org
FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
Unfortunately, I cannot confirm either observation: the current
amd-staging-drm-next git head still crashes on me quickly, and the crash is
still easily reproducible with the 3-fps-video-replay test.

And going into S3 suspend does not work for me with the current
amd-staging-drm-next either.
b***@freedesktop.org
2018-09-23 23:42:52 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #66 from Anthony Ruhier <***@hotmail.com> ---
(In reply to dwagner from comment #65)
Post by b***@freedesktop.org
(In reply to Anthony Ruhier from comment #63)
Post by b***@freedesktop.org
FYI, I also had this bug under linux 4.17 and 4.18, but it seems to have
been fixed in 4.19-rc3. The suspend/hibernate issue has also been fixed.
Unfortunately, I cannot confirm either observation: the current
amd-staging-drm-next git head still crashes on me quickly, and the crash is
still easily reproducible with the 3-fps-video-replay test.
And going into S3 suspend does not work for me with the current
amd-staging-drm-next either.
Last time I tested, amd-staging-drm-next seemed to be based on 4.19-rc1, on
which I had the issue too. I switched to vanilla 4.19-rc4 (now -rc5) and it was
fixed.
b***@freedesktop.org
2018-09-25 12:11:29 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #67 from Roshless <***@gmail.com> ---
Tried on 4.19-rc5, still crashes for me after about 2-3 days (of 6-12h use)
--
You are receiving this mail because:
You are the assignee for the bug.
b***@freedesktop.org
2018-11-14 00:23:15 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #68 from dwagner <***@20mm.eu> ---
Tested today's current amd-staging-drm-next git head, to see if there has been
any improvement over the last two months.

The bad news: The 3-fps-video-replay test still crashes the driver reproducibly
after a few minutes, as long as the default automatic power management is active.

The mediocre news: At least it looks as if the Linux kernel now survives the
driver crash to some extent. I found messages like these in the journal:

Nov 14 00:59:36 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
sdma0 timeout, signaled seq=22008, emitted seq=22010
Nov 14 00:59:36 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:37 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
sdma1 timeout, signaled seq=107, emitted seq=109
Nov 14 00:59:37 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:40 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
sdma0 timeout, signaled seq=22008, emitted seq=22010
Nov 14 00:59:40 ryzen kernel: [drm] GPU recovery disabled.
Nov 14 00:59:41 ryzen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
sdma1 timeout, signaled seq=107, emitted seq=109

... and so on repeating for several minutes after the screen went blank.
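
To see at a glance whether the sequence numbers move between such repeated messages, the timeout lines can be condensed. This is a sketch with a hypothetical function name. It assumes each kernel message is on one line (the journal excerpt above is wrapped by the mail archive) and that it uses the "signaled seq=..., emitted seq=..." wording shown here, not the older "last signaled seq" wording from the start of this bug.

```shell
# Hypothetical helper: read kernel log lines on stdin and print, for every
# amdgpu ring timeout, "<ring> <signaled> <emitted> <pending>", where
# pending = emitted - signaled. Non-matching lines are dropped.
summarize_ring_timeouts() {
    sed -n 's/.*ring \([^ ]*\) timeout, signaled seq=\([0-9]*\), emitted seq=\([0-9]*\).*/\1 \2 \3/p' |
    while read -r ring signaled emitted; do
        printf '%s %s %s %s\n' "$ring" "$signaled" "$emitted" "$((emitted - signaled))"
    done
}

# Intended use (not run here):
#   journalctl -k | summarize_ring_timeouts | sort | uniq -c
```

For the excerpt above, every sdma0 line would condense to "sdma0 22008 22010 2" and every sdma1 line to "sdma1 107 109 2", which makes it easy to see that neither ring advanced between the repeated timeouts.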

I will test tomorrow whether this means I can now collect the diagnostic
output that was asked for earlier.

Some good news: S3 suspends/resumes are working fine right now. There are some
scary messages emitted upon resume, but they do not seem to have bad
consequences:

[ 281.465654] [drm:emulated_link_detect [amdgpu]] *ERROR* Failed to read EDID
[ 281.490719] [drm:emulated_link_detect [amdgpu]] *ERROR* Failed to read EDID
[ 282.006225] [drm] Fence fallback timer expired on ring sdma0
[ 282.512879] [drm] Fence fallback timer expired on ring sdma0
[ 282.556651] [drm] UVD and UVD ENC initialized successfully.
[ 282.657771] [drm] VCE initialized successfully.
b***@freedesktop.org
2018-11-15 23:37:57 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #69 from dwagner <***@20mm.eu> ---
As promised in the comment above, today I ran my debug script "gpu_debug4.sh" to
obtain the diagnostic output after the crash, as requested earlier.
This output is in the attached "gpu_debug4_output.txt".
Since the trace output, the "dmesg -w" output and stdout are written to the
same file, they appear in roughly chronological order.

If you want to look only at the dmesg-output, use
grep '^\[' gpu_debug4_output.txt
(gpu_debug4.sh is a slight variation of earlier gpu_debug3.sh, just writing to
a local log file.)

BTW: I ran the script multiple times. Crashes occurred after 5 to 300 seconds,
and the diagnostic output always looked like that in the attached
gpu_debug4_output.txt.
b***@freedesktop.org
2018-11-15 23:38:29 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #70 from dwagner <***@20mm.eu> ---
Created attachment 142483
--> https://bugs.freedesktop.org/attachment.cgi?id=142483&action=edit
test script
b***@freedesktop.org
2018-11-15 23:39:44 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=102322

--- Comment #71 from dwagner <***@20mm.eu> ---
Created attachment 142484
--> https://bugs.freedesktop.org/attachment.cgi?id=142484&action=edit
gpu_debug4_output.txt.gz