Discussion:
[Bug 105425] 3D & games produce periodic GPU crashes (Radeon R7 370)
b***@freedesktop.org
2018-03-13 10:38:20 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=105425

Michel Dänzer <***@daenzer.net> changed:

What      |Removed                        |Added
----------------------------------------------------------------------------
Component |Mesa core                      |Drivers/Gallium/radeonsi
QA Contact|mesa-***@lists.freedesktop.org |dri-***@lists.freedesktop.org
Assignee  |mesa-***@lists.freedesktop.org |dri-***@lists.freedesktop.org
--
You are receiving this mail because:
You are the assignee for the bug.
b***@freedesktop.org
2018-03-25 21:21:04 UTC

--- Comment #7 from MirceaKitsune <***@yahoo.com> ---
I've been testing this crash using Xonotic during the past two days, granted
it's a game I have a lot of experience customizing. What I found is pretty
interesting and should be a good start in shedding light on this bug.

Initially the system freeze occurred after somewhere between 10 and 40 minutes.
Upon changing a few cvars, I seem to have almost entirely gotten rid of it: After
nearly 5 hours of continuous testing, only one lockup has taken place! Below
are the cvar overrides I added to my autoexec.cfg for the test. At least one of
them had an influence... I'm still working on pinning down which, and that will
take several more days because the issue occurs so unpredictably.

r_batch_multidraw 0 // old: 1
r_batch_dynamicbuffer 0 // old: 1
r_depthfirst 0 // old: 2
gl_vbo 0 // old: 3
gl_vbo_dynamicindex 0 // old: 1
gl_vbo_dynamicvertex 0 // old: 1
r_glsl_skeletal 0 // old: 1
vid_samples 1 // old: 4
gl_texture_anisotropy 0 // old: 16
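
One way to shorten the search is to restore the overrides in halves rather than one at a time. The following is only a minimal sketch: it assumes a single cvar is responsible and that each test session gives a clear verdict, neither of which is guaranteed for a hang that occurs at random. `bisect_overrides` and the `still_hangs` callback are hypothetical names, not part of Xonotic.

```python
from typing import Callable, List, Sequence

# cvar names taken from the autoexec.cfg overrides listed above
OVERRIDES = [
    "r_batch_multidraw",
    "r_batch_dynamicbuffer",
    "r_depthfirst",
    "gl_vbo",
    "gl_vbo_dynamicindex",
    "gl_vbo_dynamicvertex",
    "r_glsl_skeletal",
    "vid_samples",
    "gl_texture_anisotropy",
]


def bisect_overrides(
    names: Sequence[str],
    still_hangs: Callable[[List[str]], bool],
) -> str:
    """Return the single cvar whose old value brings the hang back.

    still_hangs(restored) is a hypothetical callback: run a test session with
    the cvars in `restored` set back to their old values and report whether
    the hang frequency went up again.
    """
    candidates = list(names)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if still_hangs(half):
            candidates = half  # culprit is among the restored cvars
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    return candidates[0]
```

With nine overrides this needs at most four test sessions instead of nine, though each session still has to run long enough to trust its verdict.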

I know the issue has something to do with triangles or vertices: The crash
seems more frequent when there are a lot of players or objects present,
indicating that an increased surface count may be a contributor. I've suspected
mesh data stored on the video card to be the culprit, especially shared data
with multiple objects using one instance of a mesh from video memory. This is
why my bet is currently on gl_vbo (Vertex Buffer Objects /
GL_ARB_vertex_buffer_object) being the variable that made a difference... again
I still got a lockup even without it, so if anything it just heavily mitigated
the crash.

This belief is reinforced by my previous experience in Blender: The only scene
causing the GPU lockup is one where several high-poly objects share common mesh
data, and the crash always occurred upon me adding a Subdivision Surface to
just one of them (increasing its polygon count). It's been confirmed that as of
Blender 2.77 (I have 2.79) VBO is indeed enabled in the 3D viewport. Note that
I was also using the untextured viewport, thus I doubt textures play a role.

Lastly I ruled out the possibility of overheating having anything to do with
it: During the first 3 hours in which I got no lockup, the temperature in my
room was above 26°C. When I did get that one lockup later at night, the
temperature of my room had long dropped to 23°C. The stress on the GPU was the
same at all times, absolutely no settings were changed including the map.
--
b***@freedesktop.org
2018-03-28 00:20:31 UTC

--- Comment #8 from MirceaKitsune <***@yahoo.com> ---
Testing is still very much ongoing. There's nothing conclusive yet, but
I should definitely share a piece of information early on.

To my surprise, it would appear the culprit may be either Anti-Aliasing or
Anisotropic Filtering. I decided to re-enable their cvars first in Xonotic
since I honestly suspected them the least... the moment I did that all hell
broke loose again: In 30 minutes I had two system lockups! Then I disabled them
once more, and could play a 40 minute match with no problem.

I have no idea which of the two it could be, but I should be getting there in
the following days. I'm slowly re-enabling the other cvars first to rule them
out, then I'll see whether AA or texture filtering is behind the crashes.
--
b***@freedesktop.org
2018-03-28 23:03:05 UTC

--- Comment #9 from MirceaKitsune <***@yahoo.com> ---
And we have a verdict. The influential factor is by far the anti-aliasing, at
least in the case of Xonotic. The other cvars I previously mentioned have
absolutely no effect on the frequency of this freeze.

Today I enabled the feature again and tried playing another match: I instantly
got two lockups, one after 8 minutes and the other after only 20 seconds! I
then disabled it and let the bots play again while I was away: This time the
machine froze after more than 2 whole hours of experiencing no issues.

I find it interesting how the probability of the freeze seems to scale with the
number of samples: If I use 4x AA ("vid_samples 4"), I get a crash roughly
every 30 minutes... if I disable AA ("vid_samples 1"), I get a crash less than
once per 2 hours... 30 minutes * 4x = 2 hours. Maybe this is just me seeing
patterns but I thought I should suggest the idea.
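
The proportionality guess can be written out as arithmetic. This sketch only encodes the hypothesis stated above (hang rate proportional to the sample count); it does not establish that the hypothesis is true, and the function name is illustrative.

```python
def expected_minutes_between_hangs(
    baseline_minutes: float,
    baseline_samples: int,
    samples: int,
) -> float:
    """Mean time between hangs if the hang rate scales linearly with samples."""
    if baseline_samples <= 0 or samples <= 0:
        raise ValueError("sample counts must be positive")
    return baseline_minutes * baseline_samples / samples


# Observed: roughly one hang per 30 minutes at vid_samples 4.
print(expected_minutes_between_hangs(30, 4, 1))  # 120.0
```

Under that assumption, "vid_samples 2" would be predicted to hang about once per hour, which is a third data point that could be tested.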

I'd like to hear some thoughts from the developers or experienced users at this
point. Can we close in on the source of this GPU lockup, knowing that
anti-aliasing greatly affects its frequency in the DarkPlaces engine? Are there
any open bugs about AA-related X11 crashes I should check out? What else can I test,
ideally still under Xonotic where I have the best test case prepared?
--
b***@freedesktop.org
2018-03-30 01:49:24 UTC

--- Comment #10 from MirceaKitsune <***@yahoo.com> ---
Created attachment 138438
--> https://bugs.freedesktop.org/attachment.cgi?id=138438&action=edit
Screenshot of the Blender window glitching

I should add another detail to the discussion. I know this may be a separate
issue which might have nothing to do with the crash, but at the same time I
wouldn't be surprised if it does: Glitched graphics often indicate something
going wrong with the display, such as corrupt textures in video memory, which
may ultimately lead to just such a lockup.

On occasion, certain programs (namely Firefox and Blender) glitch out and draw
broken rectangles all over the window. Some of those glitches are just boxes of
random colors, others contain pieces of past images (for instance I saw
patterns from my lock screen background). Sometimes they quickly disappear on
their own, at other times I have to restart the program as it becomes illegible
and unusable. If I move anything the squares flicker all over the place. The
glitches continue even after I disable desktop effects, thus KDE compositing
should have nothing to do with it.

Attached is a screenshot of the glitch happening in Blender, showing its window
covered in the corrupt squares. I'm curious what your opinion is. Again I know
this may be an unrelated issue, but I'm wondering whether it indicates some
video storage corruption that's also leading up to the lockups.
--
b***@freedesktop.org
2018-04-01 19:53:17 UTC

--- Comment #11 from MirceaKitsune <***@yahoo.com> ---
I'm still struggling to debug this. The more I see the more my jaw drops.

First of all, the rule that disabling anti-aliasing decreases the frequency of
the freeze (see the comments above) was just patched out: AA no longer has any
effect either, it always freezes between 0 and 30 minutes now.

I ran the following new tests in Xonotic, none of which had any influence:

- Running with the following environment variable set:
R600_DEBUG=checkir,precompile,nooptvariant,nodpbb,nodfsm,nodma,nowc,nooutoforder,nohyperz,norbplus,no2d,notiling,nodcc,nodccclear,nodccfb,nodccmsaa

- Disabling all shaders, even turning off OpenGL 2.0 support entirely.

- Resetting the entire BIOS to its failsafe defaults, making sure that neither
overclocking nor any other settings are involved.

- Running under both an X11 and Wayland session (Plasma). In Wayland it crashes
instantly so it's even worse.

- Verified that this occurs on both the "radeon" and "amdgpu" modules, meaning
the video driver makes no difference either.

It's clear to me at this point that this is the work of a professional: The
code causing the crash is carefully maintained and injected into my system. If
this was just a bug, at least one of the countless things I tried would have
affected it somehow, it's impossible for a randomly occurring bug to survive so
many different settings and environments... the issue instead is adaptive, so
that the moment I find and disable one implementation another is activated
within minutes to keep the crashes going. I imagine the objective is to block
the user from finding a solution and ultimately censor them from using specific
programs. I find it unbelievable that someone out there is actively doing this.

Please help me get to the bottom of this: The crash clearly acts by simulating
some sort of bug, so there must be a vulnerability deep in the system which
hidden code is exploiting. I offered a lot of test data on this report: If the
developers read this, please let me know what to try next!
--
b***@freedesktop.org
2018-04-01 23:11:01 UTC

--- Comment #12 from MirceaKitsune <***@yahoo.com> ---
Created attachment 138483
--> https://bugs.freedesktop.org/attachment.cgi?id=138483&action=edit
Output of: watch cat /sys/kernel/debug/dri/0/amdgpu_pm_info

I decided to turn my attention to the last logical thing I can imagine: DPM
(Dynamic Power Management) and the clocks on my video card. The kernel added
support for realtime tuning of the frequencies a while ago, so I was pondering
if the default setup may have led to excess overclocking.

I left a console to watch the file /sys/kernel/debug/dri/0/amdgpu_pm_info which
I understand contains the video card frequencies. The maximum "power level" I
seem to reach is 4, at sclk 101500 and mclk 140000. I'm attaching the peak
output of this file here.

My video card is supposed to run at 1015 MHz (core clock) + 5600 MHz (memory
clock). I don't fully understand how those numbers translate to frequencies,
but from what I heard that represents the MHz * 100. If such is the case, my
GPU clock is just right whereas my VRAM is actually under-clocked to a quarter
of its default frequency! Can anyone confirm this so at least the hypothesis of
bad clocks is out of the way?
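
The conversion can be checked with a few lines. This is a sketch under two assumptions that the thread does not confirm: that the sclk/mclk values are in units of 10 kHz (the "MHz * 100" reading above), and that the advertised 5600 MHz is the GDDR5 effective data rate, which is four transfers per memory clock cycle.

```python
# Assumption: the card uses GDDR5, whose advertised "effective" rate is
# four transfers per memory clock cycle.
GDDR5_TRANSFERS_PER_CLOCK = 4


def pm_info_to_mhz(raw: int) -> float:
    """Convert an amdgpu_pm_info clock value (assumed 10 kHz units) to MHz."""
    return raw / 100


sclk_mhz = pm_info_to_mhz(101500)
mclk_mhz = pm_info_to_mhz(140000)
effective_mhz = mclk_mhz * GDDR5_TRANSFERS_PER_CLOCK

print(sclk_mhz, mclk_mhz, effective_mhz)  # 1015.0 1400.0 5600.0
```

If both assumptions hold, a reported mclk of 140000 corresponds to the full advertised memory speed rather than a quarter of it, so the memory would not be under-clocked at that power level.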

I may try testing with the kernel parameters "radeon.dpm=0 amdgpu.dpm=0" later:
I tried doing so briefly but the performance is too horrible to play a game, so
I'll instead leave a bot match running in spectator mode while I'm away.
--
b***@freedesktop.org
2018-04-02 09:33:21 UTC

--- Comment #13 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #10)
I think you should try running your hardware under Windows.

You might also want to check if you still have warranty on the card
and act quickly if it expires soon.

I do suspect that it is a common hardware fault that happens with most video
cards over time. I also had it on my very old HD5670, but with some help I did
manage to salvage it, for now. My first symptoms were problems with RAM, which
could be worked around by lowering the memory frequency to half of the
nominal frequency.

The problem is in the micro BGA (Ball Grid Array). The GPU chip is the size of a
fingernail and is placed on a "pad" that is usually about a square inch in size.
The chip and the pad are connected with microBGA. The pad has a normal BGA on its
other side that is soldered to the GPU card. With thermal expansion and
contraction the solder of the micro pad fractures and starts to misbehave.
It's common to point out that lead-free solder is not as reliable under
temperature changes.

You might have heard about solutions like baking the card or re-balling the
BGA. I do not recommend trying these. Baking the card might damage other
components on it (capacitors, everything plastic). Re-balling changes the
solder of the normal BGA, but it is very expensive manual labor that does not
even fix the BGA which causes the problems.

All these "solutions" work because they also heat the small GPU chip and melt
the microBGA solder.

If you are sure that this is your problem, you can find somebody who knows how
to use a hot air soldering station and heat just the small GPU chip to
200-250°C for about 2-5 minutes. These are the temperatures and duration used
for manufacturing the card, so they should be safe.

Good luck.
--
b***@freedesktop.org
2018-04-02 12:15:49 UTC

--- Comment #14 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #13)

Like I said, I don't currently believe this is a hardware defect: My video card
isn't even 3 years old. It's a card from Gigabyte which makes high quality
products. The temperature of the GPU is within bounds at all times (45°C to
70°C). Its GPU clock seems to be at the right frequency, whereas the memory
clock appears to be at 1/3 the supported frequency so it already is
under-clocked and more stable! Also why does only 3D ever produce the freeze,
even simple scenes that stress neither the GPU nor the VRAM... whereas 2D
never does it even when it's intensive (eg: games, desktop compositing)?

If people believe hardware hasn't been ruled out, please suggest a GPU
stressing tool for Linux (I use openSUSE Tumbleweed) which you believe is
adequate for this situation. I no longer have Windows and can't redo my entire
setup by installing another OS, this is my main desktop on which I do my work
and activities.

I still think this is related to a driver or kernel vulnerability of some sort.
Please let me know which logs I can post or what else I can monitor to confirm
this and see exactly where and how it's happening.
--
b***@freedesktop.org
2018-04-04 01:06:00 UTC

MirceaKitsune <***@yahoo.com> changed:

What    |Removed |Added
----------------------------------------------------------------------------
See Also|        |https://bugs.freedesktop.org/show_bug.cgi?id=98520
--
b***@freedesktop.org
2018-04-04 17:36:00 UTC

--- Comment #15 from MirceaKitsune <***@yahoo.com> ---
Today I ran two tests to ensure that frequencies and DPM are not a factor.

- Setting the DPM profile to low by running the following commands as root:

echo battery > /sys/class/drm/card0/device/power_dpm_state
echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level

- Booting my system with the following Kernel parameters to disable DPM:

radeon.dpm=0 amdgpu.dpm=0

Just like with everything else, they made absolutely no difference: Xonotic
froze the machine after only 8 minutes of running each time. The settings are
applied and visible by checking /sys/kernel/debug/dri/0/amdgpu_pm_info, and are
even reflected in the performance which was reduced from 60 FPS to below 30
FPS.

This is NOT a hardware failure: The freezes occur identically even if both the
core (GPU) and memory (VRAM) clocks are under-clocked to very safe frequencies.
The key must be something in the Linux firmware for this card.
--
b***@freedesktop.org
2018-04-08 22:47:14 UTC

--- Comment #16 from MirceaKitsune <***@yahoo.com> ---
I have moved on to testing the various kernel parameters available for my
driver and card. As was pointed out by malcolmlewis on the openSUSE forums,
they can be listed with the following commands:

modinfo amdgpu
systool -vm amdgpu
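
The values currently in effect for a loaded module are also exposed as files under /sys/module/<name>/parameters, which helps confirm that a boot parameter was actually applied. A minimal sketch, assuming the module is loaded and its parameter files are readable; `read_module_params` is a hypothetical helper, and `sys_root` is only overridable so the function can be tried against a test directory.

```python
from pathlib import Path
from typing import Dict


def read_module_params(module: str, sys_root: str = "/sys/module") -> Dict[str, str]:
    """Return {parameter name: current value} for a loaded kernel module.

    Returns an empty dict when the module is not loaded or exposes no
    parameters.
    """
    params_dir = Path(sys_root) / module / "parameters"
    values: Dict[str, str] = {}
    if not params_dir.is_dir():
        return values
    for entry in sorted(params_dir.iterdir()):
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters are not readable by unprivileged users.
            values[entry.name] = "<unreadable>"
    return values
```

Calling `read_module_params("amdgpu")` after a reboot and comparing the result against the intended settings avoids attributing a change in behaviour to a parameter that never took effect.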

I tested nearly half of them today, almost none made any difference. There were
however a few settings that appeared to influence the frequency of the freeze.
The most notable one of all seems to be the following:

amdgpu.moverate=4

With no parameters changed, the freeze now occurs roughly once per 30 minutes
in Xonotic. With the move rate limited to 4MB/s however, I seemingly reduced
it to only once per 90 minutes! The FPS will constantly drop and recover, but
that makes sense as this setting explicitly limits the buffer migration rate.

I may test other variables in the days to come, but for now I'm hoping this
offers at least some clue to get things started. My feeling is that the video
card may be slowly loaded with information until something fills up, or perhaps
some events throw too much data in at once and it reaches a bottleneck?
--
b***@freedesktop.org
2018-04-09 19:43:32 UTC

--- Comment #17 from MirceaKitsune <***@yahoo.com> ---
I have a very important preliminary result: Today I tested the last amdgpu
parameters on the list, and seem to have found a set that greatly mitigates the
problem. Those parameters have given me up to 144 minutes before experiencing
the freeze, a huge record compared to the previous 90 minutes! They are:

amdgpu.prim_buf_per_se=16
amdgpu.pos_buf_per_se=16
amdgpu.cntl_sb_buf_per_se=16
amdgpu.param_buf_per_se=16

By default, all 4 of those settings are set to 0 by the system. Setting them to
16 has, at least during one test case, reduced the problem to 1/5 of its
previous frequency. The descriptions of the variables are:

parm: prim_buf_per_se:the size of Primitive Buffer per Shader Engine (default depending on gfx) (int)
parm: pos_buf_per_se:the size of Position Buffer per Shader Engine (default depending on gfx) (int)
parm: cntl_sb_buf_per_se:the size of Control Sideband per Shader Engine (default depending on gfx) (int)
parm: param_buf_per_se:the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx) (int)

I will continue trying different values and seeing how tweaking them changes
the issue. Please let me know what you think.
--
b***@freedesktop.org
2018-04-09 20:16:34 UTC

--- Comment #18 from Alex Deucher <***@gmail.com> ---
(In reply to MirceaKitsune from comment #17)
> I will continue trying different values and seeing how tweaking them changes
> the issue. Please let me know what you think.
Those parameters are not used on your chip.
--
b***@freedesktop.org
2018-04-09 20:22:45 UTC

--- Comment #19 from MirceaKitsune <***@yahoo.com> ---
(In reply to Alex Deucher from comment #18)
> Those parameters are not used on your chip.
That would be quite something, since after setting them I've clearly seen an
enormous difference. I will investigate further in the upcoming days.
--
b***@freedesktop.org
2018-04-10 09:13:14 UTC

--- Comment #20 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #16)
You are making progress.

I just want to give you a few tips.

1. You are always using 3D acceleration. The glamor driver that is used by Xorg
for 2D (DDX) acceleration uses EGL and shaders for drawing. If you have a
compositing manager (KDE has one), it might put more load on it.
You might try "AccelMethod" "None" in xorg.conf, just to check if it makes any
difference. I hope that won't disable OpenGL entirely...

2. My video card is also a Gigabyte. I had it replaced once, because in the
first month my initial card (same model) had major issues, like not starting up
at boot after a few hours of gameplay.

3. On my chip the failure affected the pins controlling the internal
video RAM. If you have chip problems, they might affect other pins first, like
the PCIe ones. So a hardware problem is not ruled out.

4. The PCIe standard allows using fewer parallel lanes for data transfer. If
broken pins are suspected, moving to a 4x slot might alleviate the issue.
BTW, I see that the card is on PCI_ID #3.00.1, is it in the first slot?
Usually the first slot is 16x and has extra electric power.

5. If you suspect an issue with filled RAM, you might try the environment
variable "GALLIUM_HUD"; it has some GTT displays.

6. Along the same lines, make sure that the kernel option for CMA is
disabled... that's been causing me problems every time I enable it. You might
also have IOMMU enabled; try disabling it, just for tests.

Once again,
Keep digging and good luck.
--
b***@freedesktop.org
2018-04-10 11:55:42 UTC

--- Comment #21 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #20)

That's some amazing feedback, thank you very much! I'll definitely try those
out, but I have a few questions about a few of these points:

1. My OS (openSUSE Tumbleweed) doesn't have an xorg.conf file. I instead have
an /etc/X11/xorg.conf.d directory with the following files in it:

00-keyboard.conf
00-keyboard.conf.backup
10-amdgpu.conf
10-evdev.conf
10-libvnc.conf
10-quirks.conf
11-evdev.conf
40-libinput.conf
50-device.conf
50-elotouch.conf
50-extensions.conf
50-monitor.conf
50-screen.conf
70-synaptics.conf
70-vmmouse.conf
70-wacom.conf

5. So before running the program from a console, I do "export GALLIUM_HUD=1" or
"GALLIUM_HUD=1;./my_program"? I don't suspect filled RAM as the game's process
doesn't seem to leak memory... I do however suspect filled VRAM.

6. What is the Kernel parameter for CMA please? It doesn't seem to be an amdgpu
setting so I assume it's separate.
--
b***@freedesktop.org
2018-04-10 17:17:14 UTC

--- Comment #22 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #21)
> 1. My OS (openSUSE Tumbleweed) doesn't have an xorg.conf file. I instead
> [...]
> 10-amdgpu.conf
The option goes in the "Device" section of the video driver's configuration.
You can see all options with `man amdgpu` or `man radeon`, etc.

> 5. So before running the program from a console, I do "export GALLIUM_HUD=1"
> or "GALLIUM_HUD=1;./my_program"? I don't suspect filled RAM as the game's
> process doesn't seem to leak memory... I do however suspect filled VRAM.
Run `GALLIUM_HUD=help glxgears`; it prints all available graphs and the
syntax.

Here is what I used last time:
export GALLIUM_HUD=\
".dfps+cpu+GPU-load+temperature,.dGTT-usage+VRAM-usage,num-compilations+num-shaders-created;"\
"primitives-generated+draw-calls,samples-passed+ps-invocations+vs-invocations,buffer-wait-time;"\
"CS-thread-busy+gallium-thread-busy,dma-calls+cp-dma-calls,num-bytes-moved;"\
"num-vs-flushes+num-ps-flushes+num-cs-flushes+num-CB-cache-flushes+num-DB-cache-flushes"
> 6. What is the Kernel parameter for CMA please? It doesn't seem to be an
> amdgpu setting so I assume it's separate.
"cma=0 iommu=off"
You can look in linux-source/Documentation/admin-guide/kernel-parameters.txt
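
On the GALLIUM_HUD syntax question in point 5: the variable has to be in the environment of the program being launched. In a shell, `VAR=value ./my_program` (no semicolon) or a preceding `export VAR=value` does that, whereas `VAR=value; ./my_program` only sets a shell variable unless it was exported earlier. The same thing can be sketched from a launcher script; `hud_env` and `run_with_hud` are hypothetical helpers, and the graph names in the usage note are taken from the list above.

```python
import os
import subprocess
from typing import Dict, List, Optional


def hud_env(graphs: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with GALLIUM_HUD set to `graphs`."""
    env = dict(os.environ if base is None else base)
    env["GALLIUM_HUD"] = graphs
    return env


def run_with_hud(command: List[str], graphs: str) -> int:
    """Launch `command` with the HUD enabled and return its exit status."""
    return subprocess.run(command, env=hud_env(graphs)).returncode
```

For example, `run_with_hud(["glxgears"], "fps+cpu,VRAM-usage")` would start glxgears with those graphs, assuming glxgears is installed and the driver is a Gallium one.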
--
b***@freedesktop.org
2018-04-12 13:55:30 UTC

--- Comment #23 from MirceaKitsune <***@yahoo.com> ---
I just finished running the GALLIUM_HUD test, and will be taking a look at the
other options next. It was more difficult to test now since the freeze occurs
almost instantly after Xonotic loads the map (only a few seconds).

I managed to make two photos which I'll attach below: One is the last
screenshot I managed to take within the system a few seconds before it froze.
The other is a photo of my screen after the freeze has taken place, obviously
taken with my phone camera as the computer itself was bricked.

A footnote I will add, even if I don't know whether people will even believe
me: I had to take those screenshots several times because they kept getting
deleted. Whenever I booted the machine back, every screenshot of Xonotic with
this HUD was corrupted and turned into a 0 byte file... even ones that were
quickly moved to other directories precisely to avoid this, and were also taken
by an external process! Other files on my drive are fine, it's only those
screenshots... thankfully one survived and it shows all of the graphs and
parameters right before the crash. I'm legitimately creeped out, as I didn't
expect a potential attack program to contain software capable of identifying
and deleting evidence of testing, which is the only explanation I can find for
what I just saw. I'll do the next tests carefully as I don't know what else may
happen to my computer.
--
b***@freedesktop.org
2018-04-12 13:56:59 UTC

--- Comment #24 from MirceaKitsune <***@yahoo.com> ---
Created attachment 138798
--> https://bugs.freedesktop.org/attachment.cgi?id=138798&action=edit
GALLIUM_HUD pre crash screenshot
--
b***@freedesktop.org
2018-04-12 13:57:27 UTC

--- Comment #25 from MirceaKitsune <***@yahoo.com> ---
Created attachment 138799
--> https://bugs.freedesktop.org/attachment.cgi?id=138799&action=edit
GALLIUM_HUD post crash photo
--
b***@freedesktop.org
2018-04-12 15:42:27 UTC

--- Comment #26 from MirceaKitsune <***@yahoo.com> ---
I performed the next test suggested to me, by changing
/etc/X11/xorg.conf.d/50-device.conf to the following content:

Section "Device"
    Identifier "Default Device"
    Driver "amdgpu"
    Option "AccelMethod" "None"
EndSection

The frequency of the crash was reduced from a matter of seconds to 45 minutes,
but a freeze still occurred after that time.
--
b***@freedesktop.org
2018-04-12 17:02:15 UTC

--- Comment #27 from ***@yahoo.com ---
Losing recently written files is unfortunately way too common, despite all
filesystems using journaling.
It might help if you call `sync` after writing the file.

If you have a kernel with magic SysRq enabled, after a crash you could hold
"Alt+PrintScrn" and then press (one by one) "s" to sync, "u" to remount the
filesystems read-only and "b" to reboot.
All info about it could be found in:
linux-source/Documentation/admin-guide/sysrq.rst
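
Whether those keys are honoured depends on the value in /proc/sys/kernel/sysrq: 0 disables everything, 1 enables everything, and larger values are a bitmask. The sketch below decodes only the three keys mentioned above; `allowed_keys` and `current_setting` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict

# Bits of /proc/sys/kernel/sysrq that gate the three keys mentioned above.
SYSRQ_BITS = {
    "s": 0x10,  # sync
    "u": 0x20,  # remount filesystems read-only
    "b": 0x80,  # reboot
}


def allowed_keys(value: int) -> Dict[str, bool]:
    """Map each of s/u/b to whether the given sysrq setting permits it."""
    if value == 1:  # 1 means "all functions enabled"
        return {key: True for key in SYSRQ_BITS}
    return {key: bool(value & bit) for key, bit in SYSRQ_BITS.items()}


def current_setting(path: str = "/proc/sys/kernel/sysrq") -> int:
    """Read the running kernel's sysrq setting (Linux only)."""
    return int(Path(path).read_text().strip())
```

On a Linux machine, `allowed_keys(current_setting())` shows at a glance whether the sync and reboot keys will work after the next freeze.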


Since hangs now happen within a minute of starting gameplay, does that mean that
the "workarounds" that you reported previously don't help anymore?


A few ideas to test.
1. Try disabling gallium threads. They are a recent feature and it seems they've
been working a lot in your graphs.
`export mesa_glthread=false`
Check also /etc/drirc , ~/.drirc etc...

2. I'm not quite sure what the difference is between num_shaders_created and
num_compilations, but at the crash there are 2 shaders created and 0 compiled.
This reminds me that you might want to turn off the shader cache. This might
introduce some stuttering during gameplay.
`export MESA_GLSL_CACHE_DISABLE=true`

3. Your framerate is limited to 60fps. It's synced to your monitor's vertical
refresh. Try
`export vblank_mode=1`
and see if you can control it from the game.
See what happens when you disable it. (Might make things much worse, much
faster.)

4. Generally it is not a good idea to test hangs with real gameplay. It is too
random. It would be ideal if you could record an apitrace that reproduces
the hang reliably.
Obviously it might not be possible to do that recording on the system that
hangs. (The trace could be lost at reboot, or the commands that cause the hang
might not even be written.)
If you have another machine or video card that works reliably, try recording
gameplay of a single level. Then do the test by replaying it. Would it play
entirely, would it hang, would it hang at the same place?

Can you trigger a hang with `glxgears`?

5. You might find something else to test here (e.g. disable DRI3?):
https://www.mesa3d.org/envvars.html
--
b***@freedesktop.org
2018-04-12 17:14:04 UTC

--- Comment #28 from MirceaKitsune <***@yahoo.com> ---
Just finished the last test from yesterday's recommendations. It appears I
cannot boot with iommu=off as that disables all USB devices, so I can't use a
keyboard or mouse and cannot do anything. I tried the closest working
equivalent I could find, which still froze 15 minutes after bootup:

cma=0 iommu=soft intel_iommu=off amd_iommu=off

(In reply to iive from comment #27)

Thanks again, I'll be moving to these tests next and posting the results here.
b***@freedesktop.org
2018-04-13 19:39:35 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #29 from MirceaKitsune <***@yahoo.com> ---
For the first time ever, I might finally have some very good news on this
issue! It will take several more days to confirm, then possibly another month
to pinpoint the exact option responsible. However it's possible I may have
found something that finally gets rid of the crash.

The issue appears to go away when playing Xonotic with these parameters:

export LIBGL_DEBUG=true LIBGL_NO_DRAWARRAYS=true LIBGL_DRI3_DISABLE=true \
MESA_DEBUG=true MESA_NO_ASM=true MESA_NO_MMX=true MESA_NO_3DNOW=true \
MESA_NO_SSE=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true \
MESA_NO_MINMAX_CACHE=true RADEON_NO_TCL=true DRAW_NO_FSE=true DRAW_USE_LLVM=0

I additionally disabled the cvar "r_shadows 2" which I forgot I had on for a
while now, as it enabled a shadowing system that might have itself been the
culprit.

With these two changes, I was able to clock up to 120 minutes of continuous
gameplay last night, followed by an outstanding 200 minutes today! That's over
2 and 3 hours respectively with no system freeze whatsoever. I need to repeat
this test several times to be 100% sure there's not still some obscure chance of
it happening, but in any case there is definitely a major difference visible.
b***@freedesktop.org
2018-04-14 18:41:05 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #30 from MirceaKitsune <***@yahoo.com> ---
Today's testing reveals an important detail I had missed: There are likely
multiple different crashes taking place... or at most one crash but triggered
by several unrelated occurrences. There is no one option or central point.

In the case of Xonotic: Using shadows (r_shadows 2) was by far the primary
factor... even without that however, a crash may still occur after roughly 3
hours of a match running. The MESA variables I mentioned are likely the source
of the second much rarer crash (seen after 2+ hours).

I don't know how I'm going to get to the bottom of this and find the exact
parameters involved: I can't leave 4 hour matches running every single day, and
even then I'd have to test a combination several days in total. I'll continue
slowly testing, but at this rate expect it to take many months.
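
One way to cut the number of long test sessions is to halve the option list on
each round instead of testing options individually. The sketch below only
shows the splitting step; `first_half` and `second_half` are hypothetical
helper names. With 14 variables this would need about 4 rounds, but only under
the assumption that a single option is responsible and that each test outcome
can be trusted, which may not hold for a crash this irregular.

```shell
#!/bin/sh
# Hypothetical helpers: print the first or second half of the arguments,
# one per line. With an odd count, the first half gets the extra item.
first_half() {
    n=$(( ($# + 1) / 2 ))
    i=0
    for opt in "$@"; do
        if [ "$i" -lt "$n" ]; then printf '%s\n' "$opt"; fi
        i=$((i + 1))
    done
}

second_half() {
    n=$(( ($# + 1) / 2 ))
    i=0
    for opt in "$@"; do
        if [ "$i" -ge "$n" ]; then printf '%s\n' "$opt"; fi
        i=$((i + 1))
    done
}

# Usage: run one session with only the first half of the overrides set.
# If the freeze is still rare, the relevant option is in that half;
# otherwise continue with the second half.
#   first_half MESA_NO_ASM=true MESA_NO_ERROR=true RADEON_NO_TCL=true DRAW_NO_FSE=true
```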
b***@freedesktop.org
2018-04-17 09:27:54 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #31 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #29)
> For the first time ever, I might finally have some very good news on this
> issue! It will take several more days to confirm, then possibly another
> month to pinpoint the exact option responsible. However it's possible I may
> have found something that finally gets rid of the crash.
>
> export LIBGL_DEBUG=true LIBGL_NO_DRAWARRAYS=true LIBGL_DRI3_DISABLE=true
> MESA_DEBUG=true MESA_NO_ASM=true MESA_NO_MMX=true MESA_NO_3DNOW=true
> MESA_NO_SSE=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true
> MESA_NO_MINMAX_CACHE=true RADEON_NO_TCL=true DRAW_NO_FSE=true DRAW_USE_LLVM=0
>
> I additionally disabled the cvar "r_shadows 2" which I forgot I had on for a
> while now, as it enabled a shadowing system that might have itself been the
> culprit.
>
> With these two changes, I was able to clock up to 120 minutes of continuous
> gameplay last night, followed by an outstanding 200 minutes today! That's
> over 2 respectively 3 hours with no system freeze whatsoever. I need to
> repeat this test several times to be 100% sure there's not still some
> obscure chance of it happening, but in any case there is definitely a major
> difference visible.

"MESA_NO_ASM=true" supersedes the other "MESA_NO_MMX=true MESA_NO_3DNOW=true
MESA_NO_SSE=true", so you don't need to make combinations with all of them.


Also I don't see you testing `export mesa_glthread=false`. Race conditions are
one of the hardest bugs to catch and reproduce.

If you think that "r_shadows 2" can quickly and "reliably" trigger a hang, then
I would ask you to focus on it first.
1. Read about sysrq and make sure you have it enabled in the kernel and that it
works. Make sure you have a text console, as you might need it.
2. Re-enable "r_shadows 2".
3. Use apitrace to capture a hang while playing the game.
4. Try to reboot gracefully, using sysrq to sync and reboot, or switch to a text
console and restart.
5. Test whether the recorded trace reproduces the crash reliably.

If the trace seems complete and it cannot reproduce the bug, then maybe it does
capture everything, but the bug is not a simple infinite loop in a shader.
(These seem to be a common cause of hangs.)

If the bug can be reliably reproduced, it will be fixed.

Good luck.
b***@freedesktop.org
2018-04-17 13:58:18 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #32 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #31)

Sounds a lot more complicated, but I'm gladly willing to try it as long as
there's no risk of anything permanently breaking my system.

The main problem is that I wasn't able to get the SysRq keys working in
openSUSE Tumbleweed when I tried to enable the "REISUB" keys. I could
really use clear instructions on how to enable and test them in openSUSE...
ideally during runtime without having to make any permanent system changes.

I need to remember how apitrace works, it's been a while since I used it. Also
I remember it generated a really huge file, and the longer you run the program
the bigger it gets... if the crash doesn't happen within a few seconds I may
have a 1+ GB trace, and I'm not sure where I can share that with the devs online.

One thing to note: I have two computers at home, with mine being the crashy one
and my mother's being an old and slow but stable machine. I can use SSH to
connect in between them from bash. The problem is that the moment my machine
freezes, its SSH connection instantly dies on the other PC as well... therefore
I'm not sure how helpful this option is.

The "r_shadows 2" option in Xonotic clearly makes a difference: Without it the
crash only occurs after 3 hours... with it it's anywhere between a few seconds
and at most 45 minutes. Definitely my best test case so far.
b***@freedesktop.org
2018-04-17 19:10:35 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #33 from ***@yahoo.com ---
This doesn't sound good.
The sshd dying indicates that the kernel or the CPU has hung. With a GPU
shader hang this doesn't happen right away: the kernel usually waits 10 seconds
before attempting to reset the GPU, and only then panics.


1. When the system hangs, do you see the LEDs on the keyboard flashing?
This is how the kernel signals a panic. You might need to wait for 10
seconds or a minute...

2. It seems that openSUSE disables "sysrq". Google told me that
"You can enable it in YaST->Security and Users->Security Center and
Hardening..."
Alternatively you should be able to enable it by executing this as root:
echo 1 > /proc/sys/kernel/sysrq

Check if it works with "Alt+PrtScr+h"; it should display a help message in
`dmesg`.

3. After you have sysrq working, try to reproduce the crash (without
apitrace).
This is to check whether sysrq works at all during a hang and, if it does,
to hopefully get a kernel panic message in the log.

4. If you cannot get crash messages in the logs/journal, then you might need to
use a `serial console` or `netconsole`.
The serial console is the best option, if both computers have their own serial
ports and you happen to have a serial cable to connect them.
linux-source/Documentation/admin-guide/serial-console.rst

Otherwise you might try the network console logger, which sends UDP packets to
the second computer.
linux-source/Documentation/networking/netconsole.txt

Setting these up might be tricky, as they might not even be compiled into the
stock kernel. So if you need detailed instructions, at least check whether they
are present as modules or built into the kernel.
zgrep CONFIG_NETCONSOLE /proc/config.gz
zgrep SERIAL_8250_CONSOLE /proc/config.gz

5. Disable vsync and run `glxgears` for hours. Leave it running overnight
or something.
I just want to know if your computer hangs with that simple 3D.
vblank_mode=1 glxgears
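
Since nothing may reach the logs when the machine locks up, an unattended run
like this can be paired with a small heartbeat file so that the time of the
hang can be read back after the reboot. This is only a sketch; `heartbeat` is
a hypothetical helper and the log path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: append a UTC timestamp to FILE every INTERVAL seconds,
# COUNT times, and flush to disk after each write so the last entry survives
# a hard reset.
heartbeat() {
    file=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        date -u +%Y-%m-%dT%H:%M:%SZ >> "$file"
        sync
        i=$((i + 1))
        sleep "$interval"
    done
}

# Usage: start glxgears in the background, then write one line per minute
# for about 8 hours. The last timestamp in the file brackets the hang.
#   vblank_mode=1 glxgears > /dev/null &
#   heartbeat "$HOME/glxgears-heartbeat.log" 60 480
```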

---

Let me be clear.
I want to see the crash messages for only 2 reasons:
- To see that there is a kernel crash.
- To see if the crash is in the graphics stack.

Since the `sshd` stops working, it might be a network-card crash. (Multiplayer
games, using network...)

If the machine just hangs, without an actual kernel crash... then it might be a
hardware problem, though not the graphics card: it might be the MB, CPU, PSU,
RAM, etc...
b***@freedesktop.org
2018-04-17 20:06:57 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #34 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #33)

Ahhh... you've reminded me of a detail that I have in fact noticed but forgot
to mention: After the machine freezes and becomes completely unresponsive, some
keyboard LEDs will indeed turn off after roughly 10 seconds. I only noticed
this because I currently have a backlit keyboard that has the lighting
controlled by the Scroll Lock LED... I saw that a few seconds after the crash,
the keyboard lighting always turns itself off.

I cannot connect the computers with a serial cable: I think the motherboards
are too modern to have a serial port, and they're at a far distance in opposite
rooms. Both computers are connected to the same home router via LAN cables
though, and can communicate through local IP... so net console sounds like a
good idea, but I've never heard of it before so I'll have to look this up.

Your sysrq suggestion seems to have worked! I first did this:

echo 1 > /proc/sys/kernel/sysrq

Then I pressed 'Alt + PrintScreen + H'. Now dmesg shows me:

[265102.938475] sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c)
terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i)
thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p)
show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V)
show-blocked-tasks(w) dump-ftrace-buffer(z)

I assume that after the crash, I should first use it to test REISUB?

And here is the output of the kernel features you said to check. If they're not
there, I'm out of luck on this one, as I don't know how to compile my own
kernel and can't risk breaking my machine with dangerous tests.

***@linux-qz0r:~> zgrep CONFIG_NETCONSOLE /proc/config.gz
CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
***@linux-qz0r:~> zgrep SERIAL_8250_CONSOLE /proc/config.gz
CONFIG_SERIAL_8250_CONSOLE=y

Lastly I'll try glxgears without vsync for a few hours in the next days: I have
to leave my computer locked while I'm away from home or sleeping, but can leave
it on for roughly 3 hours of the day while I'm around but AFK. I should note
that I tried running Xonotic without vsync, and that seems to have made
absolutely no difference. Also this likely isn't network related, I always test
in a local match with bots and not online multiplayer.
b***@freedesktop.org
2018-04-18 21:51:45 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #35 from ***@yahoo.com ---
Any results?

Enable SysRq, start Xonotic, set r_shadows 2, play until it crashes, use SysRq
to sync, unmount, reboot.

After reboot, check if the crash has been captured by syslog/journald.

If there is nothing, then you'd have to use `netconsole`. openSUSE has it
compiled as a module, so the description that involves `insmod` or `modprobe`
applies to you.
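
As a rough illustration of the module route: the kernel's netconsole
documentation gives the parameter format as
`netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`.
The sketch below just assembles that string. `netconsole_param` is a
hypothetical helper, and every address and the interface name in the usage
comments are placeholders to be replaced with the real values of both machines.

```shell
#!/bin/sh
# Hypothetical helper: build the netconsole module parameter.
#   $1 = IP of the crashing machine, $2 = its network interface,
#   $3 = IP of the receiving machine, $4 = receiver MAC (optional; when it
#        is left empty the kernel falls back to the broadcast address).
# Port 6666 on both sides is an arbitrary choice for this example.
netconsole_param() {
    printf 'netconsole=6666@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "${4:-}"
}

# On the crashing machine, as root (placeholder values):
#   modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20)"
#   dmesg -n 8        # raise the console log level so all messages are sent
# On the receiving machine, listen for the UDP packets, for example:
#   nc -u -l 6666     # some netcat variants need: nc -u -l -p 6666
```

Triggering the SysRq help message afterwards is a quick way to confirm that
kernel messages actually arrive on the second computer.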

If you have the crash in the logs, then it is more likely that the apitrace
file will remain whole after the hang and reboot.
b***@freedesktop.org
2018-04-18 22:43:52 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #36 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #35)

I will be busy tomorrow, and also wanted to look into how to do the other more
complicated tests. I'll be trying this out sometime in the next days.
b***@freedesktop.org
2018-04-20 14:34:01 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #38 from MirceaKitsune <***@yahoo.com> ---
Created attachment 138950
--> https://bugs.freedesktop.org/attachment.cgi?id=138950&action=edit
cat /var/log/messages
b***@freedesktop.org
2018-04-20 14:34:23 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #39 from MirceaKitsune <***@yahoo.com> ---
Created attachment 138951
--> https://bugs.freedesktop.org/attachment.cgi?id=138951&action=edit
journalctl --since yesterday
b***@freedesktop.org
2018-04-20 14:32:07 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #37 from MirceaKitsune <***@yahoo.com> ---
I have just finished performing the first new test.

First of all I must say I'm utterly amazed at the variability of this issue:
Last week when I played Xonotic with "r_shadows 2", the freeze occurred in just
a few seconds or minutes at most... today after a few openSUSE Tumbleweed
snapshots, I was able to play for over 60 minutes even with this option
enabled! It's clear that package updates are causing the lockup to vary
unpredictably.

In any case, I can confirm the SysRq keys also stop functioning after the
freeze: I tried REISUB for a few minutes, but there was no form of response.

I also kept both NumLock and CapsLock enabled during the game to better see how
they behave in a crash. 5 seconds after the freeze, they both turned off...
that is the very last noticeable activity of the system.

Nonetheless I will attach the logs you suggested below, just in case they
still captured something. Please let me know exactly what you believe I should
try next, in as much detail as possible since I'm unfamiliar with the other
tests you hinted at in our last conversation (eg: apitrace).
b***@freedesktop.org
2018-04-21 11:46:41 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #40 from ***@yahoo.com ---
It seems like the sysrq actually worked, I see it in the logs.

The crash happened around Apr 20 15:54. Unfortunately the kernel error/panic
message is missing from both logs. My distro doesn't have systemd, so I cannot
tell you what the magic journalctl options are to get these messages out. On my
system, these usually go to /var/log/syslog. See if you can find something
more useful from around that time.

As for the apitrace, it should work out of the box.
First test it with:
apitrace trace glxgears
It should create a glxgears.trace . If you run it again, it would create
glxgears.1.trace etc.

It's important to be sure that you are using 64-bit apitrace with 64-bit
applications.

apitrace trace ./xonotic-linux64-glx

Should do the trick.
First, start a game match and exit right away. Then try to replay the result
with:

apitrace replay xonotic-linux64-glx.trace

This is to make sure that tracing is working properly.

Then just make sure you have enough free space. Enable vsync to limit the
frames per second. Using smaller textures should also help with the trace size
(textures are loaded at the level start, so playing a longer match should help
too).

After you record a crash and reboot with sysrq, see if replaying the resulting
trace file causes a hang at its end.

You can compress the trace with `xz -9e xonotic*.trace` .

Good Luck.
b***@freedesktop.org
2018-04-24 12:08:00 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #41 from MirceaKitsune <***@yahoo.com> ---
Created attachment 139052
--> https://bugs.freedesktop.org/attachment.cgi?id=139052&action=edit
apitrace trace xonotic-sdl

I have attempted the apitrace test as instructed. Something interesting
happens: Whenever I run Xonotic through the apitrace command, it always crashes
a few seconds after the match starts. By crashes I don't mean the GPU crash,
but the process closing and sending me back to the desktop. This never happens
when running Xonotic normally, only when running it through apitrace... I tried
it several times to be extra sure of this.

I'm attaching the apitrace it successfully recorded, which just barely fit
under 200 MB thanks to xz compression.
b***@freedesktop.org
2018-04-24 12:13:05 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #42 from MirceaKitsune <***@yahoo.com> ---
Created attachment 139053
--> https://bugs.freedesktop.org/attachment.cgi?id=139053&action=edit
Output of: apitrace trace xonotic-sdl

Here is the console output generated by the Xonotic process when it crashes to
the desktop through apitrace.
b***@freedesktop.org
2018-04-24 12:15:28 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #43 from MirceaKitsune <***@yahoo.com> ---
Created attachment 139054
--> https://bugs.freedesktop.org/attachment.cgi?id=139054&action=edit
Output of: apitrace replay xonotic-sdl.trace

And here's the console output of Xonotic crashing to the desktop when replaying
the recorded apitrace it crashed with.
b***@freedesktop.org
2018-04-24 16:26:06 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #44 from ***@yahoo.com ---
You haven't looked for the kernel panic message in the logs.
I'm still waiting for it.


As for the trace. I did download and trace Xonotic before writing the
instructions. I had no issues using the EXACT commands I have given you.
Why are you asking for instructions if you do not follow them?

You are using the SDL instead of the GLX and that might be what causes issues.

Please remove the failed traces. If the trace cannot reliably reproduce the
hang then it is useless.

Let me say it again. If you cannot hang your computer using the recorded trace,
we have no use for it. And it must hang at the exact same place.
b***@freedesktop.org
2018-04-24 18:34:55 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #45 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #44)

I'm doing my best to debug this as well as possible, but there are a lot of
points and it's hard to pay attention to everything. I thought the crashing to
the desktop in the trace I posted might still hold some information, especially
as it seemed to complain about some OpenGL related issues.

I'll try the GLX version as well and let you know if that works. If not it
means I can't test using apitrace, because Xonotic crashes for some reason I
can't explain when I run it on that.

And I don't yet know where to extract the kernel logs from on my distribution.
I don't have a /var/log/syslog so it must be somewhere else. I'll try looking
that up as well and will post them once I find them.
b***@freedesktop.org
2018-04-24 19:23:14 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #46 from MirceaKitsune <***@yahoo.com> ---
I found out what's causing the apitrace crash: It no longer happens when I use
fresh settings, therefore something in my config was breaking it. Upon further
inspection it seems to be one of the visual effects. I'll try again with
different settings, and see if I can find a config that still triggers the GPU
lockup without also making apitrace crash to the desktop.

I also looked up where the kernel output should be located on openSUSE. I'm
told that it's /var/log/messages which indeed exists on my system. Next time I
experience the freeze, I will post that file here as instructed.
b***@freedesktop.org
2018-04-24 20:14:14 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #47 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #46)
> I found out what's causing the apitrace crash: It no longer happens when I
> use fresh settings, therefore something in my config was breaking it. Upon
> further inspection it seems to be one of the visual effects. I'll try again
> with different settings, and see if I can find a config that still triggers
> the GPU lockup without also making apitrace crash to the desktop.
>
> I also looked up where the kernel output should be located on openSUSE. I'm
> told that it's /var/log/messages which indeed exists on my system. Next time
> I experience the freeze, I will post that file here as instructed.

You can report the apitrace crash to the apitrace issue tracker. Include
everything needed to replicate it (aka Xonotic version, options, commands).
https://github.com/apitrace/apitrace/issues

You have already posted a /var/log/messages with a crash here, and it doesn't
have any kernel oops/panics. No point in posting another.
See if there are any other /var/log/* files that might contain these.

Here is how a shader hang looks in my logs (that bug was reported and fixed):

kernel: [19746.660911] radeon 0000:01:00.0: ring 0 stalled for more than 10030msec
kernel: [19746.660915] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000039874d last fence id 0x0000000000398759 on ring 0)
kernel: [19746.844799] radeon 0000:01:00.0: couldn't schedule ib
kernel: [19746.844837] [drm:radeon_uvd_suspend [radeon]] *ERROR* Error destroying UVD (-22)!
kernel: [19748.260945] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
kernel: [19748.260965] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
kernel: [19748.438730] radeon 0000:01:00.0: couldn't schedule ib
kernel: [19748.438761] [drm:radeon_uvd_suspend [radeon]] *ERROR* Error destroying UVD (-22)!
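
Messages of this kind can be searched for in any of the log files with a
simple pattern match. The following is a sketch only: `find_gpu_errors` is a
hypothetical helper, and the pattern list is an assumption based on the lines
above plus the generic "Kernel panic" and "Oops" markers, so it will not catch
every possible failure message.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that look like a GPU lockup or a
# kernel crash. Reads the named files, or standard input if none are given.
find_gpu_errors() {
    grep -E 'GPU lockup|ring [0-9]+ stalled|fence wait timed out|Kernel panic|Oops' "$@"
}

# Usage (the rotated file name is taken from a listing in this thread and
# will differ on other systems):
#   find_gpu_errors /var/log/messages /var/log/warn
#   xzcat /var/log/messages-20180421.xz | find_gpu_errors
```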
b***@freedesktop.org
2018-04-24 20:25:47 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #48 from MirceaKitsune <***@yahoo.com> ---
Created attachment 139071
--> https://bugs.freedesktop.org/attachment.cgi?id=139071&action=edit
/var/log/messages

I managed to trigger the GPU freeze while running Xonotic through apitrace.
Upon restarting the machine, I still found the resulting xonotic-glx.trace file
on my drive. Unfortunately the trace seems to end several seconds before the
crash, despite my attempts to restart the system using the REISUB SysRq keys. A
warning in the console also indicates this when playing back the trace:

warning: unexpected end of file while reading trace

Do you have any advice on how I can make sure the trace captures the moment of
the freeze, rather than stopping several seconds before it happens? Would
it perhaps be possible to use SSH or some other tool to create or deposit
the recorded trace on my other computer via LAN connection?

Meanwhile I'm attaching the output of /var/log/messages which I understand is
my distribution's equivalent of /var/log/syslog. Please let me know if this is
the correct kernel log you mentioned.
b***@freedesktop.org
2018-04-24 20:29:21 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #49 from MirceaKitsune <***@yahoo.com> ---
Sorry about that: I posted my last message before reading your last one, and
didn't notice your mention about /var/log/messages being obsolete. Just
mentioning so you don't think I ignored what you said.

I'll look deeper in /var/log for anything useful. Here's what that directory
contains in case the list is of any use:

linux-qz0r:/var/log # ls ./ -1
acpid
alternatives.log
apache2
apparmor
audit
boot.log
boot.msg
boot.omsg
btmp
chrony
cluster
ConsoleKit
cups
fglrx-build.log
firebird
firewall
firewall-20180313.xz
hp
journal
kdm.log
kdm.log-20130112.xz
kdm.log-20130331.xz
kdm.log-20130601.xz
kdm.log-20130818.xz
kdm.log-20131108.xz
krb5
lastlog
localmessages
mail
mail.err
mail.info
mail.warn
mcelog
messages
messages-20180313.xz
messages-20180402.xz
messages-20180411.xz
messages-20180421.xz
mumble-server
mysql
NetworkManager
NetworkManager-20180313.xz
news
nscd.log
ntp
pbl-20180313.log.xz
pbl.log
pk_backend_zypp
pk_backend_zypp-1
pm-powersave.log
samba
snapper.log
speech-dispatcher
sssd
tallylog
teamviewer11
tor
tuned
updateTestcase-2018-04-21-16-05-56
updateTestcase-2018-04-24-15-18-49
warn
warn-20180313.xz
wpa_supplicant.log
wpa_supplicant.log-20180313.xz
wtmp
wtmp-20180313.xz
wtmp-20180413.xz
xinetd.log
Xorg.0.log
Xorg.0.log.old
Xorg.1.log
Xorg.1.log.old
YaST2
zypp
zypper.log
zypper.log-20180313
zypper.log-20180316.xz
zypper.log-20180321.xz
zypper.log-20180324.xz
zypper.log-20180328.xz
zypper.log-20180329.xz
zypper.log-20180402.xz
zypper.log-20180405.xz
zypper.log-20180409.xz
zypper.log-20180411.xz
zypper.log-20180414.xz
zypper.log-20180416.xz
zypper.log-20180419.xz
zypper.log-20180421.xz
b***@freedesktop.org
2018-04-25 10:03:43 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #50 from ***@yahoo.com ---
If this `messages` file is from the failed apitrace crash recording, then maybe
you should try again.

In the previous file I could see that SysRq has been used. Since the first
command you use also kills all programs, including the logging, there are no
more logs from the session.

I do not see anything from SysRq in the new `messages` file. So one possibility
is that you forgot to enable sysrq and it just hung. Another possibility is that
you need to do things with certain timing.

Here is how "transactions" work in Linux.
The program (apitrace) can make many writes and they can all be cached in
RAM without being sent to disk. Even if they are written to disk, they might
not be committed to the file until `flush()` or `close()` is called on the
file.
That is, "in theory", the file should remain unchanged until it is flushed,
even when old content is overwritten.

SysRq+r takes over the keyboard (switches it out of raw mode), so press that
first.
SysRq+e sends the TERM signal to all programs. It is very likely that
apitrace handles that signal and closes all files, thus committing them to
disk. Give it a few seconds to finish. Count to 5 or wait until the HDD stops
working.
SysRq+i sends the KILL signal to all programs. This is forcible termination and
might eliminate programs that are still handling the TERM. In your case I would
ask you not to use it.
SysRq+s syncs all kernel-buffered writes to disk. Wait for the HDD to stop
before pressing the next key combination.
SysRq+u unmounts all filesystems (it remounts them read-only). Same as before,
wait for the HDD to stop before pressing the next key.
SysRq+b reboots.

So basically, watch the HDD LED on the PC case and wait for it to stop before
pressing the next key.
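
Before each test it may be worth confirming that every key of this sequence is
actually permitted, because /proc/sys/kernel/sysrq can hold a bitmask rather
than a plain 0 or 1. The sketch below uses the bit values listed in the
kernel's sysrq.rst; `sysrq_allows` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel.sysrq value permits KEY.
# Bit values as listed in Documentation/admin-guide/sysrq.rst.
sysrq_allows() {
    mask=$1
    case $2 in
        r)   bit=4 ;;    # keyboard control (SAK, unraw)
        s)   bit=16 ;;   # sync
        u)   bit=32 ;;   # remount read-only
        e|i) bit=64 ;;   # signalling of processes (term, kill)
        b)   bit=128 ;;  # reboot/poweroff
        *)   return 2 ;; # key not covered by this sketch
    esac
    # A value of 1 enables every function; otherwise test the single bit.
    [ "$mask" -eq 1 ] || [ $(( mask & bit )) -ne 0 ]
}

# Usage: report any key of the sequence that the current setting blocks.
#   for k in r e s u b; do
#       sysrq_allows "$(cat /proc/sys/kernel/sysrq)" "$k" ||
#           echo "SysRq+$k is disabled"
#   done
```

After `echo 1 > /proc/sys/kernel/sysrq` every key passes this check, so it
mainly guards against the setting having been reset by a reboot.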

Keep in mind that when you kill all programs, systemd remains, since it is
running as init (PID 1), and it will try to restart everything again. So
disabling some services you don't need might be a good idea. I think I saw the
Apache web server in the previous log.


Good Luck and try again.
b***@freedesktop.org
2018-04-25 15:28:55 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #51 from MirceaKitsune <***@yahoo.com> ---
Created attachment 139103
--> https://bugs.freedesktop.org/attachment.cgi?id=139103&action=edit
Output of: systemctl | grep running

(In reply to iive from comment #50)

I've performed many more tests during the past two hours, getting nearly a
dozen freezes in the process. I tried with both the glx and sdl versions of
Xonotic, and even ran the "Alt + SysRq + RESUB" combination at different rates
(instantly as well as 1 minute in between each press). Before each test I made
sure the SysRq keys are working, by using "Alt + SysRq + H" then checking that
the help message appears at the end of the "dmesg" output.

In all cases the trace file never catches the crash: Either I find a zero byte
file when I reboot, or it ends several seconds before the crash.

I couldn't find any obviously useless systemctl services that I can shut down
(such as Apache). In case there is anything dangerous or that I could disable
in there, I'm attaching the output of "systemctl | grep running".

I'm clearly going to need a different approach to recording this trace:
apitrace must be dying the moment the lockup occurs, so it never finishes
writing the complete trace file. I found some info on how a trace can be played
back from a remote machine, but not how to record it from one. What should I do
next?
b***@freedesktop.org
2018-04-26 00:51:06 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #52 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #51)
> Created attachment 139103 [details]
> Output of: systemctl | grep running
>
> (In reply to iive from comment #50)
>
> I've preformed many more tests during the past two hours, getting nearly a
> dozen freezes in the process. I tried with both the glx and sdl versions of
> Xonotic, and even ran the "Alt + SysRq + RESUB" combination at different
> rates (instantly as well as 1 minute in between each press). Before each
> test I made sure the SysRq keys are working, by using "Alt + SysRq + H" then
> checking that the help message appears at the end of the "dmesg" output.
>
> In all cases the trace file never catches the crash: Either I find a zero
> byte file when I reboot, either it ends several seconds before the crash.
>
> I couldn't find any obviously useless systemctl services that I can shut
> down (such as Apache). In case there is anything dangerous or that I could
> disable in there, I'm attaching the output of "systemctl | grep running".
>
> apitrace must be dying the moment the lockup occurs, so it never finishes
> writing the complete trace file. I found some info on how a trace can be
> played back from a remote machine, but not how to record it from one. What
> should I do next?

I'm running out of ideas.

I just want to make sure that the `apitrace` you are using is recent enough.
The last release, apitrace-7.1, is almost 3 years old and is missing many
fixes. For example, there is a 2-year-old commit that calls
"localWrite.flush()" on "_exit".
b***@freedesktop.org
2018-04-26 01:10:10 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #53 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #52)

Ah... I do indeed have apitrace 7.1-3.89. I don't know if a newer version
exists on https://software.opensuse.org which is currently down. I need to head
off to bed in a minute, but I'll check whether a new version compiled for my OS
exists tomorrow (if anyone knows please share a link). I'm sorry this is taking
so much effort, and I'm sure the cause of this freeze can't elude us forever.
b***@freedesktop.org
2018-04-26 15:23:27 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #54 from MirceaKitsune <***@yahoo.com> ---
I have some very interesting results from today: As instructed, I used the
latest version of apitrace. I cloned it straight from its GitHub repository and
compiled it myself, then ran Xonotic through it.

https://github.com/apitrace/apitrace

Same thing: The trace always ends several seconds before the moment of the
freeze and prints an "end of file" warning in the console.

Then I decided to do something different: I ran Blender 3D through apitrace,
loading up the scene that triggered this same lockup last time. I used various
features and went into several modes which I remembered were responsible.
Eventually I got the exact same freeze as I do with Xonotic. I rebooted and
played back the Blender trace. Same story as with Xonotic: It cuts a few
seconds earlier and complains about EoF.

But there's a bizarre twist this time: When playing back the trace generated by
Blender, my system will freeze at various points during the replay! Sometimes
it freezes early, sometimes it freezes late, at other times I can replay the
whole trace without getting a freeze at all.

This is very peculiar: The crash must be occurring beyond what apitrace is even
capturing, likely something deep in the kernel or renderer which is only
triggered when the conditions are just right. What do you make of this?
b***@freedesktop.org
2018-04-26 20:19:56 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #55 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #54)
Post by b***@freedesktop.org
But there's a bizarre twist this time: When playing back the trace generated
by Blender, my system will freeze at various points during the replay!
Sometimes it freezes early, sometimes it freezes late, at other times I can
replay the whole trace without getting a freeze at all.
This is very peculiar: The crash must be occurring beyond what apitrace is
even capturing, likely something deep in the kernel or renderer which is
only triggered when the conditions are just right. What do you make of this?
Well, this makes a hardware issue a lot more probable.

Still, it is good that you have a trace that can trigger crashes.
Having an apitrace that issues the same OpenGL commands eliminates a lot of
variables.

From now on, you should use only this trace for your tests.

But first, you should try to set up `netconsole`.
I haven't used it myself so I can't give you any hints.
Still, the documentation looks detailed. AFAIR you have it as a module.

After you have it working, you can resume your experiments with environment
variables. And keep an eye on the kernel messages when a crash happens.

A few hints: If `MESA_NO_ASM=true` is set, then the others (MESA_NO_MMX=true;
MESA_NO_3DNOW=true; MESA_NO_SSE=true) have no effect.
And don't forget to test `export mesa_glthread=false` too.
Also try `export RADEON_THREAD=false` with the above.
Threading and concurrency just add more random variables.

Your hope is to find something that always works, or some error that is always
present before the crash.

You should also seriously consider testing the card on another OS or computer.
If that Blender trace hangs on Windows, it is definitely not a software issue.

Keep digging.
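
To keep these environment-variable experiments repeatable, the replay can be launched from a small script. This is only a sketch: it assumes the trace is replayed with `apitrace replay <file>`, and the helper names (`build_env`, `replay_command`, `run_replay`) are made up for illustration. The variable names are the ones mentioned in this thread.

```python
import os
import subprocess

# Variables named in this thread. MESA_NO_ASM=true is said to make the
# MMX/3DNow/SSE switches redundant, so those are left out here.
SINGLE_THREAD_ENV = {
    "MESA_NO_ASM": "true",
    "mesa_glthread": "false",
    "RADEON_THREAD": "false",
}


def build_env(overrides, base=None):
    """Return a copy of the base environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def replay_command(trace_path, replay_tool=("apitrace", "replay")):
    """Build the argv list for one replay (the tool invocation is an assumption)."""
    return [*replay_tool, trace_path]


def run_replay(trace_path, overrides):
    """Replay the trace once with the given variables and return the exit code."""
    result = subprocess.run(replay_command(trace_path), env=build_env(overrides))
    return result.returncode
```

Calling `run_replay("blender.trace", SINGLE_THREAD_ENV)` (the file name is a placeholder) would run one replay with threading disabled. If the machine freezes, the call never returns, so which variable set was in use still has to be noted somewhere that survives a reset.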
b***@freedesktop.org
2018-04-29 19:41:01 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #56 from MirceaKitsune <***@yahoo.com> ---
I've performed the netconsole test today. After over an hour of learning how it
works, I set it up and could confirm that system messages are properly received
by netcat on the other computer. Unfortunately, as expected, no messages get
sent at the time of the freeze: Even the netconsole kernel module dies
immediately.

The MESA parameters I mentioned don't seem to affect the freeze produced by the
Blender trace either. For reference, my testing string was:

export LIBGL_DEBUG=true LIBGL_NO_DRAWARRAYS=true LIBGL_DRI3_DISABLE=true \
MESA_DEBUG=true MESA_NO_ASM=true MESA_NO_MMX=true MESA_NO_3DNOW=true \
MESA_NO_SSE=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true \
MESA_NO_MINMAX_CACHE=true RADEON_NO_TCL=true DRAW_NO_FSE=true DRAW_USE_LLVM=0

I retain my conviction that this is nothing hardware related. Mainly because
the freeze doesn't seem to be affected by VRAM fill nor GPU stress, but by
specific renderer features regardless of the complexity of the scene. I see no
way in which for instance, a Blender scene with a load of high-poly objects
won't ever trigger a hardware failure, but a Blender scene with a few low-poly
objects can do so within seconds if some obscure conditions are just right.

If anyone can suggest anything else, please do. This is the weirdest and most
difficult test I've ever had to perform on a computer to debug a crash, mainly
due to the way in which absolutely nothing seems to work. There's definitely a
way to catch it... I just don't know what that is.
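
For anyone repeating the netconsole test, the receiving side can also be a small script instead of netcat, so that every message is timestamped and written to disk as it arrives. This is a rough sketch, not the setup used above: it assumes netconsole sends plain UDP datagrams to port 6666 on the second machine, and the function names are hypothetical.

```python
import os
import socket
import time


def format_record(timestamp, sender, payload):
    """Turn one UDP datagram into a single timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    return f"{stamp} UTC {sender} {text}\n"


def listen(log_path, port=6666, host="0.0.0.0", max_packets=None):
    """Append every datagram received on the port to log_path, syncing each one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    received = 0
    with sock, open(log_path, "a", encoding="utf-8") as log:
        while max_packets is None or received < max_packets:
            payload, (sender, _port) = sock.recvfrom(65535)
            log.write(format_record(time.time(), sender, payload))
            log.flush()
            os.fsync(log.fileno())  # keep the line even if this box loses power
            received += 1
```

Running `listen("netconsole.log")` on the other computer would block and collect messages until interrupted. The timestamp of the last line received then gives an upper bound on when the frozen machine stopped sending.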
b***@freedesktop.org
2018-04-29 21:37:02 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #57 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #56)
Post by b***@freedesktop.org
I've performed the netconsole test today. After over an hour of learning how
it works, I set it up and could confirm that system messages are properly
received by netcat on the other computer. Unfortunately, as expected, no
messages get sent at the time of the freeze: Even the netconsole kernel
module dies immediately.
When the system hangs, is SysRq still operational?
That is, if you have netconsole working and press SysRq+h, it should show the
help and send that text over the network.
If you press SysRq+b it should reboot.

I want to confirm that netconsole indeed stops working, but SysRq is still
working.

There is another method for capturing panic messages. It involves reserving a
portion of memory and loading a second kernel there, which is started in
the event of a panic.
There is even a method of storing kernel panics in the non-volatile memory
of the UEFI BIOS... (That might be a bit risky.)
However, at this point I am not convinced that you are even getting a kernel
panic.


It is very strange that the system hangs without the kernel issuing a
panic. And it is even stranger that the GPU is causing such a hang.

You see, the GPU for the most part works on its own, so if the GPU hangs,
it should not affect CPU operation. The radeon/amdgpu drivers can detect a
GPU hang and they should complain. I've shown you how they do that for me.

This points us again in the direction of hardware. I do remember that you had
some success with `amdgpu.moverate=4` . So the issue might be around DMA and
PCIE...

For now, try `export R600_DEBUG=nodma` .

This environment variable has remained with this name, despite the fact that it
now works on much newer drivers than R600. You can see all supported options
with `R600_DEBUG=help glxgears` .

Also, you've overclocked before; maybe some options have remained. See if
your BIOS/UEFI has something equivalent to "safe defaults"...
b***@freedesktop.org
2018-04-30 11:50:40 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #58 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #57)

I believe I've already tried with nodma under Xonotic, as I previously
attempted playing with the following options and still got the crash:

export R600_DEBUG=checkir,precompile,nooptvariant,nodpbb,nodfsm,nodma,nowc,nooutoforder,nohyperz,norbplus,no2d,notiling,nodcc,nodccclear,nodccfb,nodccmsaa

I've also checked out the BIOS, no overclocking settings are responsible. The
freeze happened even with the failsafe defaults of my BIOS in use.

And during the netconsole test, I did enable and use the SysRq keys (RESUB)...
nothing got printed to the other machine. SysRq doesn't seem to do anything
after the freeze: Nothing happens if I press them, including the HDD led which
never flashes again indicating that even the drive is never used any more.

Loading a second kernel into memory sounds complicated and much more dangerous.
At this stage I also doubt even that would work, as the machine literally acts
as if the whole system (CPU / RAM) is bricked and jammed upon crashing.
b***@freedesktop.org
2018-04-30 12:07:36 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #59 from MirceaKitsune <***@yahoo.com> ---
I'm not sure if this helps, but here is the darkplaces engine file responsible
for drawing shadows. Remember that when disabling shadows, the frequency of the
freeze is reduced from 0 - 30 minutes to 60 - 240 minutes in Xonotic. Maybe if
someone more experienced takes a look at what renderer functions get enabled
when shadows are turned on, they might notice what could be speeding up the
freeze?

https://gitlab.com/xonotic/darkplaces/blob/master/r_shadow.c

Remember that one or a combination of the following two cvars triggers it. You
can search that file and see where those settings are used and what they
enable.

r_shadows 2
r_shadow_shadowmapping 1
b***@freedesktop.org
2018-04-30 20:59:16 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #60 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #58)
Post by b***@freedesktop.org
(In reply to iive from comment #57)
[...]
Post by b***@freedesktop.org
And during the netconsole test, I did enable and use the SysRq keys
(RESUB)... nothing got printed to the other machine. SysRq doesn't seem to
do anything after the freeze: Nothing happens if I press them, including the
HDD led which never flashes again indicating that even the drive is never
used any more.
After the hang, does SysRq+B reboot the machine?
b***@freedesktop.org
2018-04-30 21:41:31 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #61 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #60)

No, it never does. SysRq + R - E - S - U - B (pressed in slow order after one
another) does not reboot, nor make the hard drive led flash, nor have any other
noticeable effects once the freeze has occurred.
b***@freedesktop.org
2018-05-01 00:03:26 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #62 from ***@yahoo.com ---
You don't even get a kernel panic, the machine just freezes.

Just to confirm that, add "panic=30" to the kernel command line options in
GRUB. Then freeze the computer again.
If the kernel panics, then it should reboot after 30 seconds.
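
Before provoking the freeze, it may be worth confirming that the settings actually took effect. The sketch below assumes the usual procfs files `/proc/sys/kernel/panic` and `/proc/sys/kernel/sysrq` are available. The bit value 128 for the SysRq reboot function follows the kernel's SysRq documentation, and the helper names are invented for this example.

```python
SYSRQ_REBOOT = 128  # bitmask value for the reboot/poweroff SysRq functions


def read_int(path):
    """Read a single integer from a procfs-style file."""
    with open(path) as f:
        return int(f.read().strip())


def sysrq_allows(mask, bit):
    """A mask of 1 enables every SysRq function; other values are a bitmask."""
    return mask == 1 or bool(mask & bit)


def describe(panic_timeout, sysrq_mask):
    """Summarise what the two settings mean for a frozen machine."""
    if panic_timeout > 0:
        panic = f"reboot {panic_timeout}s after a panic"
    elif panic_timeout < 0:
        panic = "reboot immediately after a panic"
    else:
        panic = "no automatic reboot after a panic"
    reboot_key = "enabled" if sysrq_allows(sysrq_mask, SYSRQ_REBOOT) else "disabled"
    return f"{panic}; SysRq reboot key {reboot_key}"
```

On the test machine this would be called as `describe(read_int("/proc/sys/kernel/panic"), read_int("/proc/sys/kernel/sysrq"))`. If it reports a 30 second reboot and the machine still stays frozen, that supports the conclusion that no panic is being raised at all.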


Do you have a temperature reading for the motherboard chipset? Can you make
sure it doesn't overheat or something during gameplay?

Use ssh to log into your computer and run `watch sensors`. You will have the
last readings when the computer hangs.
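
If no second machine is at hand for ssh, a variation on this idea is to log the readings locally and force every sample to disk. This is a sketch under assumptions: it relies on the lm-sensors `sensors` command being installed, the function names are hypothetical, and an fsync'd file is only likely, not guaranteed, to survive a hard freeze.

```python
import os
import subprocess
import time


def append_synced(path, text):
    """Append text and force it to disk so it has a chance to survive a freeze."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())


def read_sensors():
    """Return the output of the lm-sensors `sensors` command."""
    result = subprocess.run(["sensors"], capture_output=True, text=True, check=True)
    return result.stdout


def log_sensors(path, interval=1.0, samples=None, reader=read_sensors):
    """Write one timestamped sample per interval until `samples` is reached."""
    count = 0
    while samples is None or count < samples:
        stamp = time.strftime("%H:%M:%S")
        append_synced(path, f"--- {stamp} ---\n{reader()}\n")
        count += 1
        time.sleep(interval)
```

Started as `log_sensors("sensors.log", interval=0.5)` before the test, the last block in the file after the reset is the closest available reading to the moment of the hang.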


I think you should try to compile a vanilla kernel and enable every debug
option that you can. (You can use the SUSE kernel /proc/config.gz as template).
b***@freedesktop.org
2018-05-01 21:06:20 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #63 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #62)

I booted my machine with the kernel parameter panic=30 as instructed. I then
waited for over two minutes to see if there's any sign of movement. Nothing
happened: The machine never reboots on its own after the freeze, I still need
to press the reset button on the computer case to restart it.

I also noticed another detail worth noting: The keyboard NumLock led only seems
to turn off after I press a key on the keyboard post freeze. So let's say the
machine just crashed: I can wait a whole minute and the led is still on... then
I press Control or Shift or any other key, and after roughly 3 seconds, the led
then turns off. This is always the last noticeable response from the PC.

Lastly I have something important to mention: Someone just replied to my thread
about this crash on the openSUSE forum, and confirmed they're getting the same
issue! They even posted a screenshot showing the exact same graphical garbage
I'm noticing in various applications (colorful little squares littering the
screen). This might be the first time someone else can confirm the problem,
which is very exciting news if the person will provide us with more info.

https://forums.opensuse.org/showthread.php/525727-3D-engines-causing-frequent-GPU-lockups?p=2864284#post2864284
b***@freedesktop.org
2018-05-02 00:29:05 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #64 from H4nN1baL <***@gmail.com> ---
I've been having a very different problem with AMD cards. But I have reason to
think that the problem could vary from one processor/chipset to another.
My problems disappear using 'MESA_EXTENSION_OVERRIDE=-GL_ARB_buffer_storage',
can you try that?
b***@freedesktop.org
2018-05-02 01:21:39 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #65 from MirceaKitsune <***@yahoo.com> ---
(In reply to H4nN1baL from comment #64)

Just tested with "export MESA_EXTENSION_OVERRIDE=-GL_ARB_buffer_storage". It
did not affect the freeze triggered by playing back the Blender trace.

Also, to answer iive's last point which I forgot in the previous response: I
have temperature monitors on my desktop, and one of the Xonotic freezes
happened just a second after I alt-tab switched back from checking it. All fans
and temperatures were perfectly fine during that test: The CPU was around the
typical 48°C, whereas the GPU never exceeded 68°C itself. Temperatures are most
surely not an issue.
b***@freedesktop.org
2018-05-02 02:55:29 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #66 from H4nN1baL <***@gmail.com> ---
Okay, thanks for your reply. Then our problems are unrelated.

Even so, let me share some intel with you. Disable any BIOS configuration
related to "GART" and "PCIE Spread Spectrum" (PCIe overclock).
And keep in mind that GPUs also come with a BIOS; in some cases, it really
needs to be updated.

Good luck.
b***@freedesktop.org
2018-05-02 09:59:54 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #67 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #63)
Post by b***@freedesktop.org
(In reply to iive from comment #62)
I booted my machine with the kernel parameter panic=30 as instructed. I then
waited for over two minutes to see if there's any sign of movement. Nothing
happened: The machine never reboots on its own after the freeze, I still
need to press the reset button on the computer case to restart it.
As I suspected.
It just hangs.
Post by b***@freedesktop.org
I also noticed another detail worth noting: The keyboard NumLock led only
seems to turn off after I press a key on the keyboard post freeze. So let's
say the machine just crashed: I can wait a whole minute and the led is still
on... then I press Control or Shift or any other key, and after roughly 3
seconds, the led then turns off. This is always the last noticeable response
from the PC.
Are you using a USB keyboard plugged into a USB port?
Your motherboard does have PS/2 ports; see if you can use the one for the
keyboard. (Sometimes keyboards come with a small dongle that lets you plug a
USB keyboard into a PS/2 port.)
Post by b***@freedesktop.org
Lastly I have something important to mention: Someone just replied to my
thread about this crash on the openSUSE forum, and confirmed they're getting
the same issue! They even posted a screenshot showing the exact same
graphical garbage I'm noticing in various applications (colorful little
squares littering the screen). This might be the first time someone else can
confirm the problem, which is very exciting news if the person will provide
us with more info.
https://forums.opensuse.org/showthread.php/525727-3D-engines-causing-frequent-GPU-lockups?p=2864284#post2864284
At the very least they do manage to get errors from the kernel driver before
the crash. You haven't seen that kind of error, have you?

Still, you can share the Blender trace with them. See if it causes a hang for
them too. See if it hangs at the same place...
Post by b***@freedesktop.org
Also, to answer iive's last point which I forgot in the previous response: I
have temperature monitors on my desktop, and one of the Xonotic freezes
happened just a second after I alt-tab switched back from checking it. All
fans and temperatures were perfectly fine during that test: The CPU was
around the typical 48°C, whereas the GPU never exceeded 68°C itself.
Temperatures are most surely not an issue.
Chipset temperature is different from CPU and GPU temperature.

The motherboard has a large chip that connects the CPU with the RAM and the
PCIe slots. The Gigabyte website says it has a temperature sensor on the North
Bridge and there seems to be a huge passive cooler with heat pipes on it, one
leading to a radiator on top of the motherboard, probably to use the PSU intake
for cooling.

`sensors` should display everything that is available, even if the readings
are not labeled correctly.

Gigabyte likes to increase the voltage a bit on its hardware, so it is more
stable when overclocked. However, this means it also runs a bit hotter.
(And you need to dust off the heat sinks once or twice per year, unless you
have special dust filters on the PC case intake vents.)
b***@freedesktop.org
2018-05-02 20:35:54 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #68 from MirceaKitsune <***@yahoo.com> ---
Created attachment 139283
--> https://bugs.freedesktop.org/attachment.cgi?id=139283&action=edit
Output of: watch --interval 0.1 sensors

(In reply to H4nN1baL from comment #66)

My BIOS offers no options regarding GART and Spread Spectrum as far as I
recall. Here is my exact motherboard model, in case anyone has extra info on
what its available BIOS settings mean which I may have missed.

https://www.gigabyte.com/Motherboard/GA-X58A-UD7-rev-10

(In reply to iive from comment #67)

That is indeed a difference: Everyone else reporting those crashes seems to be
able to record a Kernel panic, but in my case the system records nothing as it
fundamentally stops functioning at that very moment.

I logged into SSH from my other machine and ran "watch --interval 0.1 sensors".
Attached are my readings at most 0.1 seconds before the freeze, which seem to
capture every relevant voltage and temperature available. Obviously there are
no fans in slots 2 and 4 which is why they show 0 RPM.
b***@freedesktop.org
2018-05-05 14:24:29 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #69 from ***@yahoo.com ---
I'm really out of ideas...

Could you try using only the radeon kernel driver? Just blacklist the amdgpu
one. See if the Blender trace hangs and netconsole still doesn't give any
warnings.

See if you can completely disable the IOMMU when using radeon.ko.

I've asked you at least 3 times to test "export mesa_glthread=false", but you
never included it in your list of things you've tried.
Same for `export RADEON_THREAD=false`.
I haven't asked you, but add `MESA_DEBUG=flush` to the things to test.


Now, if you have run out of things to test, you can try a prolonged experiment
that might not even bring a usable result.
If we had a case that hanged reliably, one thing to do would be to locate the
exact operation that causes the hang.

So, you start `qapitrace` with the Blender trace.
You then do a binary search for the frame that causes the hang. It's done with
"Lookup State" at a frame number, which replays the trace up to that frame. You
start with the full range, let's say [0 - 10000], and pick the frame in
the middle of that range, in this case frame #5000. If it hangs during replay,
you use [0 - 5000] as the new interval; if it doesn't hang, you use the other
half [5000 - 10000] (because the cause of the hang must be there). Then you
pick the middle of the new interval and repeat the experiment
(e.g. [0 - 2500]; [1250 - 2500]; [1250 - 1875]).

Once you locate the exact frame that causes the first hang, you can repeat the
binary search, but this time on the draw operations inside that frame. It can
help if you set:
"qapitrace->Trace->Options->Only_show_the_following_events->Draw_events".

Now, since the crashing is kind of random for you, you might try to disable all
threaded options (all options from above) and run the same lookup a dozen times.
If it crashes even once, then it crashes.
Also, be sure to write down the current range, so as not to lose it at reboot.
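
The bookkeeping for this search can be sketched in a few lines. Nothing here talks to qapitrace: the replay itself stays manual, and `hangs_when_replayed_to` is a hypothetical callback standing in for "replay up to this frame a dozen times and report whether any run froze". The range is written to a file with fsync after every step so it survives the reset. All names are illustrative.

```python
import json
import os


def midpoint(lo, hi):
    """Frame to test next for the interval [lo, hi]."""
    return (lo + hi) // 2


def narrow(lo, hi, hung):
    """Return the new interval after replaying up to midpoint(lo, hi)."""
    mid = midpoint(lo, hi)
    return (lo, mid) if hung else (mid, hi)


def save_range(path, lo, hi):
    """Persist the current interval atomically so a hard reset cannot lose it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"lo": lo, "hi": hi}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_range(path, default):
    """Load the saved interval, falling back to the default if none is usable."""
    try:
        with open(path) as f:
            data = json.load(f)
        return data["lo"], data["hi"]
    except (FileNotFoundError, ValueError, KeyError):
        return default


def bisect_frames(lo, hi, hangs_when_replayed_to):
    """Find the first frame whose replay hangs, assuming lo is safe and hi hangs."""
    while hi - lo > 1:
        lo, hi = narrow(lo, hi, hangs_when_replayed_to(midpoint(lo, hi)))
    return hi
```

In practice each session would call `load_range`, test the midpoint by hand, then call `narrow` and `save_range` before the next attempt. `bisect_frames` only shows that the loop converges when the outcome is deterministic, which is exactly the part that is in doubt here.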


I also strongly encourage you to at least try some other distribution,
something you can start from a live CD or similar. Or build your own vanilla
kernel.
b***@freedesktop.org
2018-05-12 22:42:12 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #70 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #69)

Just tried mesa_glthread=false RADEON_THREAD=false MESA_DEBUG=flush and got the
same results. The others seem a lot more complex: I might try them later, but
currently I'm very busy and it's difficult to organize myself accordingly. I
wish I had a better way of helping to understand this issue, as I really need
to get it fixed, sadly I feel stuck myself at the moment.
b***@freedesktop.org
2018-05-29 13:01:25 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #71 from MirceaKitsune <***@yahoo.com> ---
The Mesa 18.1.0 update, which was supposed to fix several GPU crashes, seems to
have managed to expand this freeze instead: I now get it even when playing
simple 3D games with low-poly models and low-res textures, such as MegaGlest.

At this moment the issue is at a point where it may have real life
implications: I may be constrained to buy a new video card just to stop this,
and if I do that I literally won't have money to eat for a month. As I make my
living from animation and game development, it's either that or this issue can
be solved. I know it has to be software related, but in some mind boggling way
every way to see what's doing it gets covered up and no kernel or MESA
parameters make it go away.

Can someone please ask other developers and people experienced with the video
drivers to subscribe to this and post their ideas? iive helped me with a lot of
advice, but somehow whatever is doing it managed to dodge everything even he
could think of. Perhaps someone else has some new suggestions?
b***@freedesktop.org
2018-05-30 11:36:56 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #72 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #71)
Post by b***@freedesktop.org
The Mesa 18.1.0 update, which was supposed to fix several GPU crashes, seems
to have managed to expand this freeze instead: I now get it even when
playing simple 3D games with low-poly models and low-res textures, such as
MegaGlest.
Can you confirm this?
Does reverting to an older Mesa release "fix" the new issues?
And/or reverting to an older kernel?

A slow deterioration of the situation is consistent with hardware problems. That
might not be so bad, because it means it could be fixed relatively easily.

BTW are you using suspend to RAM? My card had worse symptoms after resume, even
if it has been suspended for seconds. Suspend still provides +5V on PCIE, so
the card might still be partially powered, but not cooled.

This reminds me of something we haven't tested - ASPM.
Try kernel parameter "pcie_aspm=off"

Disabling it might lead to more power consumption by the card, even when idle.
But it might improve stability.
https://wiki.archlinux.org/index.php/Power_management#Bus_power_management
b***@freedesktop.org
2018-05-30 12:12:41 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #73 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #72)

I've thought about testing an older version of Mesa too. Especially since, from
what I can vaguely remember, certain system instabilities were introduced
roughly two years ago (autumn of 2016) when I switched from Mesa 13 to 17. I
doubt that's related after so long but figured I'd still mention.

The only issue is that I'm not sure how far I can downgrade my Mesa version
without it asking for old dependencies, potentially rendering my system
unusable due to library conflicts. On the other hand, I remember there was once
a way to run games against a custom version of Mesa, by separately compiling a
.so library and using an environment variable to point to it.

Is it possible to download a Mesa 13.x library from any repository? And what
was the environment variable to point an executable to it when running a game?

I will try booting with pcie_aspm=off next and let you know how it goes.
--
You are receiving this mail because:
You are the assignee for the bug.
b***@freedesktop.org
2018-05-30 13:32:34 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #74 from MirceaKitsune <***@yahoo.com> ---
pcie_aspm=off makes no difference. In addition, I tried booting back to the
radeon module (instead of amdgpu) and disabling the SI scheduler: This seems to
have slightly mitigated the problem in some cases (eg: Blender) but made no
difference in others (eg: Xonotic).

As for suspending to RAM: I haven't used Standby mode in ages. I never suspend
my computer to RAM, so this could not be an issue.
b***@freedesktop.org
2018-05-31 13:00:43 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #75 from ***@yahoo.com ---
(In reply to MirceaKitsune from comment #73)
Post by b***@freedesktop.org
(In reply to iive from comment #72)
I've thought about testing an older version of Mesa too. Especially since,
from what I can vaguely remember, certain system instabilities were
introduced roughly two years ago (autumn of 2016) when I switched from Mesa
13 to 17. I doubt that's related after so long but figured I'd still mention.
The only issue is that I'm not sure how far I can downgrade my Mesa version
without it asking for old dependencies, potentially rendering my system
unusable due to library conflicts. On the other hand, I remember there was
once a way to run games against a custom version of Mesa, by separately
compiling a .so library and using an environment variable to point to it.
Is it possible to download a Mesa 13.x library from any repository? And what
was the environment variable to point an executable to it when running a
game?
(In reply to MirceaKitsune from comment #74)
Post by b***@freedesktop.org
pcie_aspm=off makes no difference. In addition, I tried booting back to the
radeon module (instead of amdgpu) and disabling the SI scheduler: This seems
to have slightly mitigated the problem in some cases (eg: Blender) but made
no difference in others (eg: Xonotic).
As for suspending to RAM: I haven't used Standby mode in ages. I never
suspend my computer to RAM, so this could not be an issue.
No easy solutions...

If you are sure that Mesa 13.x works for you, then you must try it again.
I can't help you with packages, but you should be able to download the old
packages and install them manually.
It might be a PITA, as it seems that openSUSE splits Mesa into multiple
packages, like mesa, mesa-drm, mesa-libva, mesa-libgl1, mesa-libd3d...
https://software.opensuse.org/package/Mesa

Most packages should be forward compatible, so you don't have to downgrade
stuff like libdrm.
However the tricky moment here is LLVM. Most likely only Mesa depends on LLVM,
so you have to downgrade both at the same time (and nothing else).

Theoretically it might still be possible to compile your own mesa-13.x from
source, if there are still issues with the other dependencies. LLVM might
be the tricky part here; you might need a matching older version.

Are there live CDs for openSUSE? Something you can start without installing
it?
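
On the earlier question of pointing a game at a separately built Mesa: as far as I know this is normally done with `LD_LIBRARY_PATH` (for libGL itself) plus Mesa's `LIBGL_DRIVERS_PATH` (for the DRI driver files). The sketch below only builds such an environment. The install prefix and the `lib64`/`dri` layout are assumptions about how the custom build was installed, and `mesa_env` is a made-up helper.

```python
import os


def mesa_env(prefix, base=None, libdir="lib64"):
    """Environment that makes a program pick up a Mesa build under `prefix`."""
    env = dict(os.environ if base is None else base)
    lib = os.path.join(prefix, libdir)
    old = env.get("LD_LIBRARY_PATH")
    # Put the custom libGL first while keeping any existing search path.
    env["LD_LIBRARY_PATH"] = lib if not old else f"{lib}:{old}"
    # Directory where Mesa should look for the DRI driver (*_dri.so) files.
    env["LIBGL_DRIVERS_PATH"] = os.path.join(lib, "dri")
    return env
```

The result can be passed as `env=` to `subprocess.run` when launching the game or the trace replay, which leaves the system-wide Mesa packages untouched.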
b***@freedesktop.org
2018-05-31 21:07:09 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #76 from MirceaKitsune <***@yahoo.com> ---
(In reply to iive from comment #75)

That's what I feared too: I know Mesa depends on a lot of other libraries
(including LLVM) and you can't mix old and new versions between them. This is
my primary desktop on which I do all my activities, so I can't risk breaking it
nor downgrade the whole OS to an ancient version.

A live DVD would solve this however. Unfortunately I don't know how far I can
still find those for openSUSE, nor what the last openSUSE release was that came
with Mesa 13 instead of 17. Does anyone else have this information?

As a side note, I should mention that I'm now in the process of trying to
obtain a new video card: This couldn't be investigated in several months and I
can't wait much longer. Once I get a new card, I might not be able to continue
this test any more. I may still ask a friend to try my video card in Windows,
just so we at least know whether this was a case of bad hardware or whether the
devs should still be on the lookout for an obscure driver bug.
b***@freedesktop.org
2018-09-01 19:48:01 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

--- Comment #77 from ***@gmail.com ---
I had the same problem with Xubuntu 17.10 and maybe 18.04 (can't remember). The
GPU would hang when watching videos with mpv or even in Firefox. When I tested
gnome-shell, it would also sporadically hang.


What solved it for me was switching the kernel driver from radeon to amdgpu
(add radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1
amdgpu.cik_support=1 to your grub.cfg kernel boot parameters).

Now I've upgraded to 18.10 and decided to give radeon another try (mainly
because VLC refuses to deinterlace mpeg2 under amdgpu for some reason) and it
has worked without problems for about 2 weeks of daily use.
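
After rebooting, it is easy to check whether these parameters really reached the kernel by looking at `/proc/cmdline`. The following is a small sketch of that check. The parsing is deliberately simple (it does not handle quoted values), and `parse_cmdline` and `amdgpu_selected` are hypothetical helper names.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict (flags map to None)."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def amdgpu_selected(params):
    """True if all four switches from the comment above are set as described."""
    expected = {
        "radeon.si_support": "0",
        "radeon.cik_support": "0",
        "amdgpu.si_support": "1",
        "amdgpu.cik_support": "1",
    }
    return all(params.get(name) == value for name, value in expected.items())
```

With `params = parse_cmdline(open("/proc/cmdline").read())`, a `False` result from `amdgpu_selected(params)` would suggest the boot entry was not updated, whatever driver ends up loaded.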

MSI R7 370 4G
FX 8320
Gigabyte GA-970A-DS3P FX
b***@freedesktop.org
2018-12-08 14:53:50 UTC
Permalink
https://bugs.freedesktop.org/show_bug.cgi?id=105425

MirceaKitsune <***@yahoo.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |WONTFIX

--- Comment #78 from MirceaKitsune <***@yahoo.com> ---
This will likely be my last update on the situation. A few months ago I got a
new video card and replaced my R7 370 with it. Since then I've never once
experienced this type of crash again, on either the "radeon" or "amdgpu"
module. The old card is now on my mother's computer... since she doesn't play
games it's working well for her, there hasn't been a GPU crash on there either.

I still believe this was a driver or firmware bug, not a damaged video card:
there's no way only specific 3D games would ever trigger the problem,
independent of the card's GPU or VRAM load. But we'll likely never know.
Closing this as I'm no longer able or interested to keep testing it.