After my fourth abrupt crash on my newly-installed Debian system, I'd had enough. CPU, disk and graphics card temperatures were all fine and Memtest didn't record any problems with my 96GB RAM, so the PC is physically as fine as I can check it to be. So: it clearly must be the operating system!
I therefore swiftly wiped Debian Testing and installed Fedora 29 (KDE Spin). Looks a lot better than most Fedoras I remember, everything application-wise installed fine too, with minimal fuss. Then the neighbour popped round for a short chat and once she'd gone, I returned to whatever I'd been up to when she first arrived… and discovered my PC had completely locked up again! Maybe it wasn't the operating system after all!
Again, physical health checks were fine, so in desperation, I did what I should have done the first time… and read the contents of /var/log/messages. Therein, I found this:
Mar 25 16:49:10 britten kscreenlocker_greet: Connecting to deprecated signal QDBusConnectionInterface::serviceOwnerChanged(QString,QString,QString)
Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]
Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: fifo: runlist 0: scheduled for recovery
Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: fifo: channel 2: killed
Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: fifo: engine 7: scheduled for recovery
Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: fifo: engine 0: scheduled for recovery
Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: Xorg: channel 2 killed!
Mar 25 16:54:43 britten kernel: perf: interrupt took too long (3929 > 3925), lowering kernel.perf_event_max_sample_rate to 50000
Mar 25 16:55:01 britten cupsd: REQUEST localhost - - "POST / HTTP/1.1" 200 182 Renew-Subscription successful-ok
Mar 25 17:25:58 britten kernel: Linux version 5.0.3-200.fc29.x86_64 ([email protected]) (gcc version 8.3.1 20190223 (Red Hat 8.3.1-2) (GCC)) #1 SMP Tue Mar 19 15:07:58 UTC 2019
Mar 25 17:25:58 britten kernel: Command line: BOOT_IMAGE=/vmlinuz-5.0.3-200.fc29.x86_64 root=UUID=5b2c6cd0-adfb-427b-af31-864ce0dd2d0c ro resume=UUID=1b31f287-5672-4a4f-8e28-a7a9ec39f3a6 rhgb quiet LANG=en_GB.UTF-8
Mar 25 17:25:58 britten kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar 25 17:25:58 britten kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar 25 17:25:58 britten kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
You can see from the time jump that the lock-up happened around 4:55pm and I had to wield the power switch at 5:25pm. The interesting lines are the ones that precede the 4:55pm dramas: they are all, in various ways, reporting that the Nouveau graphics driver is doing something peculiar, including 'killing off' channel 2 (whatever that means).
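For anyone who'd rather not eyeball a log for this sort of thing, here is a minimal Python sketch of the same two checks: pick out the kernel lines that mention nouveau, and find the biggest silence between consecutive entries (which is where the lock-up and reboot sit). The function names, the regular expression and the hard-coded year are my own assumptions rather than anything standard; it also assumes the traditional "Mon DD HH:MM:SS host tag: message" syslog layout shown above and an English locale for the month names.

```python
import re
from datetime import datetime

# Assumed layout: "Mar 25 16:54:12 britten kernel: message text"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) ([^:]+): (.*)$")


def parse(line, year=2019):
    """Return (timestamp, tag, message), or None if the line doesn't match.

    Traditional syslog stamps carry no year, so one is supplied (an assumption).
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, _host, tag, msg = m.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, tag, msg


def nouveau_errors(lines):
    """Kernel messages whose text starts with 'nouveau'."""
    parsed = (parse(line) for line in lines)
    return [(when, msg) for p in parsed if p
            for when, tag, msg in [p]
            if tag == "kernel" and msg.startswith("nouveau")]


def largest_gap(lines):
    """Return (before, after) timestamps around the longest silence in the log."""
    times = [p[0] for p in (parse(line) for line in lines) if p]
    pairs = list(zip(times, times[1:]))
    return max(pairs, key=lambda pair: pair[1] - pair[0]) if pairs else None


SAMPLE = [
    "Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: fifo: SCHED_ERROR 0a [CTXSW_TIMEOUT]",
    "Mar 25 16:54:12 britten kernel: nouveau 0000:02:00.0: Xorg: channel 2 killed!",
    'Mar 25 16:55:01 britten cupsd: REQUEST localhost - - "POST / HTTP/1.1" 200 182 Renew-Subscription successful-ok',
    "Mar 25 17:25:58 britten kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'",
]

if __name__ == "__main__":
    for when, msg in nouveau_errors(SAMPLE):
        print(when, msg)
    before, after = largest_gap(SAMPLE)
    print("Longest silence:", before, "->", after)
```

To run it against the real thing, you would read the lines from /var/log/messages (usually as root) instead of using the SAMPLE list.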
This would explain why, although my machine appears to lock up, it will quite happily continue playing music in the background if any were playing beforehand: the error messages suggest that the PC itself has not locked up at all, but that everything graphical (which, thanks to Xorg, also means keyboard-related) is dead beyond repair.
So: my suspicions were raised that all was not well with Nouveau and its attempts to interact with my Quadro K4000 graphics card. A swift bit of Googling later, and my suspicions seemed confirmed. Others had reported sporadic 'nouveau channel X killed' messages in the past, too: this report seemed to describe my situation exactly, for example.
I therefore followed this invaluable guide to installing the “proper” NVIDIA drivers (and taking Nouveau off the system entirely). So far, so good (though the intermittent nature of the lock-ups means I can't yet guarantee that the problem has genuinely been fixed).
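One quick sanity check after a driver swap like this is to see which graphics kernel modules are actually loaded. What follows is only a sketch under a couple of assumptions: that the proprietary driver's modules all have names beginning with "nvidia" (nvidia, nvidia_modeset, nvidia_drm and so on), and that /proc/modules lists one loaded module per line with the module name as the first field. The function name is my own invention.

```python
from pathlib import Path


def loaded_gpu_modules(modules_text):
    """Return the sorted names of loaded nouveau/nvidia modules.

    modules_text is the content of /proc/modules, where each line
    starts with a module name followed by size, use count, etc.
    """
    names = {line.split()[0] for line in modules_text.splitlines() if line.strip()}
    return sorted(n for n in names if n == "nouveau" or n.startswith("nvidia"))


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        found = loaded_gpu_modules(proc_modules.read_text())
        print("Graphics modules loaded:", found or "none found")
```

If "nouveau" still appears in the output after a reboot, the open-source driver hasn't really been taken off the system.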
I can't say I've ever had Nouveau problems on this PC before… but, come to think of it, I used to install Nvidia's proprietary drivers routinely because I was using a dual-monitor setup that Nouveau had difficulty configuring properly. These days, I'm single-monitor, and thus didn't think the proprietary drivers were needed… but it seems that sometimes, they are, no matter how many monitors you're using!
It seems I owe Debian Testing an apology, therefore! It wasn't really the distro at fault, just the Nouveau drivers (which every distro uses anyway, so all of them would presumably have faced the same issue, sooner or later).
There was a happy side-effect, too. I had noticed that when I ran my Stellarium astronomy program, its version of the daylight sky had a weird pink glow about it, like this one (reported by another user a couple of years ago, presumably not for the same reasons):
After the change to proprietary Nvidia graphics drivers, everything looks normal once again:
So, there are a few 'morals-of-the-story' here. First, check your logs before you go doing drastic things like changing operating systems! Second, if one of your applications is showing signs of 'graphical distress', something probably isn't all good with your graphics subsystem. Third, don't blame an entire distro for a single driver's mishap! Fourth, Nouveau is good, but it's not perfect, so don't imagine it can never produce strange results (though this is the first time in many years it has for me and my multitudinous and varied computers).
And there's just one 'outstanding issue' here, too: do I revert to Debian Testing (which I was enjoying before I unfairly lost my temper with it!)? Or do I stick with Fedora 29 because it's now installed and seems to be working fine?
I suspect I'll stick with Fedora for a while, just because I've done too many installations of late! But watch this space…