What happened before the error:
I've been trying to diagnose a restart issue that only happens when playing Skyrim (modded).
Restarts happened before, but this is the first time I saw this error on boot:
❯ journalctl -p 3 -b
lip 04 04:55:02 cachyos kernel: [Hardware Error]: System Fatal error.
lip 04 04:55:02 cachyos kernel: [Hardware Error]: CPU:14 (19:21:2) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000001000108
lip 04 04:55:02 cachyos kernel: [Hardware Error]: Error Addr: 0x00006ffffaf0ffd7
lip 04 04:55:02 cachyos kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
lip 04 04:55:02 cachyos kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0
lip 04 04:55:02 cachyos kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
lip 04 04:55:02 cachyos kernel: amdgpu: Overdrive is enabled, please disable it before reporting any bugs unrelated to overdrive.
lip 04 04:55:03 cachyos kernel: Bluetooth: hci0: No support for _PRR ACPI method
The only difference between then and now is that I have a new CPU. These errors only showed up on the boot right after the forced restart. I rebooted again myself, and the errors were gone.
What does this mean?
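As a starting point for reading the MC5_STATUS line, the bracketed flags can be re-derived from the raw value. The sketch below assumes the usual x86/AMD machine-check status layout (Val = bit 63, Over = 62, UC = 61, En = 60, MiscV = 59, AddrV = 58, PCC = 57, TCC = 55, SyndV = 53); the function name decode_mc_status is made up for illustration. It only confirms which flags are set (the kernel prints UC as "UE", an uncorrected error); it does not identify the faulty component.

```shell
#!/bin/sh
# Decode the flag bits of a machine-check status value given as hex text.
# Only bits 63..32 carry the flags checked here, so the value is split as a
# string first; this avoids depending on 64-bit shell arithmetic.
decode_mc_status() {
    hex=${1#0x}                                  # drop the 0x prefix
    hex=$(printf '%16s' "$hex" | tr ' ' '0')     # left-pad to 16 hex digits
    top=$(printf '%s' "$hex" | cut -c1-8)        # upper 32 bits as text
    hi=$(( 0x$top ))
    out=
    # name:bit pairs, with bit positions relative to the upper 32 bits
    for pair in Val:31 Over:30 UC:29 En:28 MiscV:27 AddrV:26 PCC:25 TCC:23 SyndV:21; do
        name=${pair%%:*}
        bit=${pair##*:}
        if [ $(( ($hi >> $bit) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$name"
        fi
    done
    printf '%s\n' "$out"
}

# Value copied from the log above
decode_mc_status 0xbea0000001000108
# prints: Val UC En MiscV AddrV PCC TCC SyndV
```

The result matches the flags the kernel printed, with Over unset.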
For reference, this is the boot before (the one that forced a restart):
❯ journalctl -p 3 -b -1
lip 04 02:36:48 cachyos kernel: amdgpu: Overdrive is enabled, please disable it before reporting any bugs unrelated to overdrive.
lip 04 02:36:48 cachyos kernel: Bluetooth: hci0: No support for _PRR ACPI method
lip 04 02:36:48 cachyos kernel: Bluetooth: hci0: FW download error recovery failed (-19)
lip 04 02:36:48 cachyos kernel: Bluetooth: hci0: sending frame failed (-19)
lip 04 02:36:48 cachyos kernel: Bluetooth: hci0: Failed to read MSFT supported features (-19)
lip 04 02:36:49 cachyos kernel: Bluetooth: hci0: No support for _PRR ACPI method
lip 04 02:39:17 cachyos plasmashell[1247]: qt.network.http2.connection: [0x7075f404e5f0] Connection error: HPACK decompression failed (9)
lip 04 02:48:03 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 02:59:11 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 03:01:59 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 03:03:31 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 03:06:16 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 03:08:25 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 03:08:33 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 03:08:59 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 03:21:29 cachyos kernel: playstation 0005:054C:0CE6.000E: DualSense input CRC's check failed
lip 04 04:29:38 cachyos systemd-coredump[25951]: [🡕] Process 25946 (sed) of user 1000 dumped core.
Stack trace of thread 25946:
#0 0x000070009ca00d2b n/a (/usr/lib/ld-linux-x86-64.so.2 + 0x26d2b)
#1 0x000070009c9fae23 n/a (/usr/lib/ld-linux-x86-64.so.2 + 0x20e23)
#2 0x000070009c9fc6d2 n/a (/usr/lib/ld-linux-x86-64.so.2 + 0x226d2)
#3 0x000070009c9fb488 n/a (/usr/lib/ld-linux-x86-64.so.2 + 0x21488)
ELF object binary architecture: AMD x86-64
inxi -b:
System:
  Host: cachyos Kernel: 6.15.0-2-cachyos arch: x86_64 bits: 64
  Desktop: KDE Plasma v: 6.3.5 Distro: CachyOS
Machine:
  Type: Desktop Mobo: ASRock model: B550M Pro4 serial: <superuser required>
  UEFI: American Megatrends LLC. v: P3.40 date: 01/18/2024
CPU:
  Info: 8-core AMD Ryzen 7 5700X3D [MT MCP] speed (MHz): avg: 3592 min/max: 575/4151
Graphics:
  Device-1: Advanced Micro Devices [AMD/ATI] Navi 32 [Radeon RX 7700 XT / 7800 XT] driver: amdgpu v: kernel
  Display: wayland server: X.org v: 1.21.1.16 with: Xwayland v: 24.1.6 compositor: kwin_wayland driver: gpu: amdgpu resolution: 1: 2560x1440~75Hz 2: 2560x1440~75Hz
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 25.1.1-cachyos1.3 renderer: AMD Radeon RX 7800 XT (radeonsi navi32 LLVM 19.1.7 DRM 3.63 6.15.0-2-cachyos)
  Info: Tools: api: clinfo, eglinfo, glxinfo, vulkaninfo de: kscreen-console,kscreen-doctor gpu: lact wl: wayland-info x11: xdpyinfo, xprop, xrandr
Network:
  Device-1: Realtek RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet driver: r8169
  Device-2: Intel Wi-Fi 6E AX210/AX1675 2x2 [Typhoon Peak] driver: iwlwifi
  Device-3: ASUSTek TUF GAMING M4 WIRELESS driver: hid-generic,usbhid type: USB
Drives:
  Local Storage: total: 2.96 TiB used: 769.19 GiB (25.4%)
Info:
  Memory: total: 32 GiB available: 31.26 GiB used: 5.07 GiB (16.2%)
  Processes: 414 Uptime: 1h 5m Shell: fish inxi: 3.3.38
How to overclock your AMD GPU on Linux
One thing I missed from Windows after my transition to Linux was the ability to easily adjust my GPU's clock speeds and voltages. I went to the godly Arch Wiki and found there's a way to overclock AMD GPUs, but some steps are not very clear and I had to do some googling to get everything working.
EDIT: Vega GPUs are not supported as of kernel 4.20.2! Here's a workaround by u/whatsaspecialusername.
First things first, your kernel has to be at least version 4.17 (you can check by running uname -a), although it's recommended to update it to the latest version for system stability, bug fixes and new features (for instance, Hawaii support for overclocking was introduced in 4.20). The driver should be amdgpu (not the proprietary amdgpu-pro). I suggest installing the latest mesa+amdgpu from this PPA for *buntu, but I don't know about other distros. It might not even be a necessary step.
You need to add the parameter amdgpu.ppfeaturemask=0xffffffff to your GRUB configuration. To do so, edit /etc/default/grub as root and add the parameter between the quotes of GRUB_CMDLINE_LINUX_DEFAULT. Save, then run sudo update-grub2 or sudo grub-mkconfig -o /boot/grub/grub.cfg, depending on your distro. Reboot. If you're running any bootloader other than GRUB, check this Arch Wiki page.
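The GRUB edit can also be scripted. This is a minimal sketch, assuming GRUB_CMDLINE_LINUX_DEFAULT uses double quotes and sits on a single line; add_kernel_param is a hypothetical helper name, and it prints the modified file to stdout instead of editing /etc/default/grub in place, so the result can be reviewed first.

```shell
#!/bin/sh
# Print a copy of a GRUB defaults file with one kernel parameter appended
# to GRUB_CMDLINE_LINUX_DEFAULT. The input file itself is left untouched.
# Assumes the parameter contains no '|', '&' or backslash characters.
add_kernel_param() {
    file=$1
    param=$2
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        cat "$file"        # parameter already present: output unchanged
    else
        sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
    fi
}

# Demo on a throwaway sample file, not on the real /etc/default/grub
sample=$(mktemp)
printf '%s\n' 'GRUB_TIMEOUT=5' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$sample"
add_kernel_param "$sample" 'amdgpu.ppfeaturemask=0xffffffff'
rm -f "$sample"
```

After checking the output against your real file, write it back as root and regenerate the GRUB configuration with the command that fits your distro.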
Now, we need to find the file with our GPU's clocks and voltages. In my case it was in /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/, but you can find the directory by running readlink -f /sys/class/drm/card0/device.
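On machines with more than one GPU, card0 is not necessarily the card you want. A small sketch that lists every DRM device directory exposing pp_od_clk_voltage; find_od_devices is a hypothetical helper, and its optional argument exists only so it can be tried against a test directory.

```shell
#!/bin/sh
# List the resolved device directory of every DRM card that exposes
# pp_od_clk_voltage. $1 optionally overrides the DRM root.
find_od_devices() {
    drm_root=${1:-/sys/class/drm}
    for dev in "$drm_root"/card*/device; do
        # Skips connector entries and cards without overdrive support
        [ -e "$dev/pp_od_clk_voltage" ] || continue
        readlink -f "$dev"
    done
}

find_od_devices
```

If nothing is printed, the file is not exposed, which may mean the ppfeaturemask parameter has not taken effect or the card does not support this interface.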
The file we want to work with is called pp_od_clk_voltage. Mine looked like the following (my card is a Sapphire RX 580 Nitro+ 4GB):
OD_SCLK:
0: 300MHz 750mV
1: 600MHz 769mV
2: 900MHz 887mV
3: 1145MHz 1100mV
4: 1215MHz 1181mV
5: 1257MHz 1150mV
6: 1300MHz 1150mV
7: 1411MHz 1150mV
OD_MCLK:
0: 300MHz 750mV
1: 1000MHz 800mV
2: 1750MHz 950mV
OD_RANGE:
SCLK: 300MHz 2000MHz
MCLK: 300MHz 2250MHz
VDDC: 750mV 1200mV
We want to edit the P-state #7 for the core and #2 for the VRAM, as those are the values that our GPU is going to run at while under load. On Windows, my optimal values were 1450MHz for core and 2065MHz for memory, so I'm going to edit the file as follows:
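The number of P-states differs between cards, so the highest index is not always 7 and 2. A rough sketch that reads the table and prints the last entry of a section; top_pstate is a hypothetical helper, and it assumes the one-entry-per-line layout this RX 580 reports.

```shell
#!/bin/sh
# Print "index clock voltage" for the last (highest) P-state of a section.
# $1 = section name without the colon, e.g. OD_SCLK; the table is read on stdin.
top_pstate() {
    awk -v sec="$1:" '
        $1 ~ /^OD_/ { in_sec = ($1 == sec); next }    # a section header line
        in_sec && $1 ~ /^[0-9]+:$/ {
            idx = $1; sub(":", "", idx)               # "7:" becomes "7"
            last = idx " " $2 " " $3
        }
        END { if (last != "") print last }
    '
}

# Abridged copy of the table shown above
table='OD_SCLK:
6: 1300MHz 1150mV
7: 1411MHz 1150mV
OD_MCLK:
1: 1000MHz 800mV
2: 1750MHz 950mV
OD_RANGE:
SCLK: 300MHz 2000MHz'

printf '%s\n' "$table" | top_pstate OD_SCLK   # prints: 7 1411MHz 1150mV
printf '%s\n' "$table" | top_pstate OD_MCLK   # prints: 2 1750MHz 950mV
```

On a real system the input would be the pp_od_clk_voltage file itself instead of the sample text.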
sudo sh -c "echo 's 7 1450 1150' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage"
Here "s" means we're editing the core's values, 7 is the index of the P-state (the highest one on my card), 1450 is the speed we want in MHz, and 1150 is the voltage in mV. Note that I didn't run sudo echo "s 7 1450 1150" > /sys/class/drm/card0/device/pp_od_clk_voltage like the Arch Wiki states, because it throws a permission error and doesn't apply the changes: the redirection is performed by your own unprivileged shell, and sudo only elevates echo. (This might have worked without "sudo" if we logged in as root with sudo su, but it's best not to do so for safety reasons.) See here.
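Another common way around the redirection problem is to pipe the text into tee and run only tee with elevated privileges. A minimal sketch; od_write and the OD_SUDO variable are names invented for this example (OD_SUDO only exists so the function can be tried on an ordinary file without sudo).

```shell
#!/bin/sh
# Write one command string to a pp_od_clk_voltage file through tee.
# $1 = path to the file, $2 = command such as "s 7 1450 1150".
# ${OD_SUDO-sudo} expands to "sudo" unless OD_SUDO is set, even to empty.
od_write() {
    printf '%s\n' "$2" | ${OD_SUDO-sudo} tee "$1" > /dev/null
}

# Example calls (commented out so nothing is written by accident):
# od_write /sys/class/drm/card0/device/pp_od_clk_voltage 's 7 1450 1150'
# od_write /sys/class/drm/card0/device/pp_od_clk_voltage 'c'
```

Here the file is opened by tee, which runs as root, so the write is no longer attempted by the unprivileged shell.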
Same with the VRAM: sudo sh -c "echo 'm 2 2065 950' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage"
After these two commands the file is going to be the same except for the two lines of the P-states we just edited. We can check by running cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage.
I didn't mess with voltages because I'm already satisfied with my results and I'm very paranoid about damaging my GPU. If you really want to, please be really careful as you might cause fatal damage to your card!
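One cheap sanity check before writing anything is to compare the requested value with the limits the driver reports under OD_RANGE. The sketch below assumes the table layout shown earlier; in_od_range is a hypothetical helper. Being inside the range does not make a value stable or safe for a particular card, it only catches obvious typos.

```shell
#!/bin/sh
# Succeed (exit 0) when a requested value lies inside the OD_RANGE limits.
# $1 = SCLK, MCLK or VDDC; $2 = value in MHz or mV; the table is read on stdin.
in_od_range() {
    awk -v key="$1:" -v want="$2" '
        $1 == "OD_RANGE:" { in_range = 1; next }
        in_range && $1 == key {
            lo = $2 + 0; hi = $3 + 0     # "300MHz" + 0 evaluates to 300
            found = 1
        }
        END { exit !(found && want + 0 >= lo && want + 0 <= hi) }
    '
}

# Limits copied from the table shown above
od_table='OD_RANGE:
SCLK: 300MHz 2000MHz
MCLK: 300MHz 2250MHz
VDDC: 750mV 1200mV'

for req in 'SCLK 1450' 'MCLK 2065' 'VDDC 1250'; do
    set -- $req
    if printf '%s\n' "$od_table" | in_od_range "$1" "$2"; then
        echo "$req: within range"
    else
        echo "$req: outside range"
    fi
done
```

With these limits the two clock values used in this guide pass, while a 1250 mV request is rejected.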
Once we are done, running sudo sh -c "echo 'c' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage" will apply the changes and the GPU will start running at those new frequencies when under load.
While I haven't found a way to actively monitor clock speeds à la MSI Afterburner (EDIT: there is actually! See this comment by u/AlienOverlordXenu), I could see a sudden increase in FPS in Heaven Benchmark as soon as I applied the new clocks. I set the camera to free mode (so that it stops moving) and after applying the FPS went from 55-56 to 60-61!
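It is unclear from here what the linked comment recommends. One option, offered as an assumption and not as part of the original guide, is amdgpu's pp_dpm_sclk and pp_dpm_mclk files in the same device directory, which list the clock levels and mark the active one with an asterisk. The sample text below is illustrative, not real output from this card, and active_level is a hypothetical helper.

```shell
#!/bin/sh
# Print the clock of the level marked active ("*") in pp_dpm_sclk-style text
# read from stdin.
active_level() {
    awk '$NF == "*" { print $2 }'
}

# Illustrative sample in the pp_dpm_sclk layout (made-up values)
sample='0: 300Mhz
1: 600Mhz
2: 900Mhz *'

printf '%s\n' "$sample" | active_level

# On a real system, something like:
# active_level < /sys/class/drm/card0/device/pp_dpm_sclk
```

Running the real-system line repeatedly while a benchmark is active gives a rough view of whether the card reaches the new top clock.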
(The guide on ArchWiki also has a command to change the maximum power consumption in Watts: I didn't mess with it as I wasn't sure what was a safe value)
Now there's one problem: every time we reboot our PC the clocks are going to reset. So how do we make them stick?
Assuming your distro has systemd, we can create a service that runs the three commands that edit and apply the clocks at boot. If your distro doesn't have systemd, you can follow these steps.
First, we need to create a script. I named mine "overclock" and put it in /usr/bin/. It looks like this:
#!/bin/sh
sudo sh -c "echo 's 7 1450 1150' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage"
sudo sh -c "echo 'm 2 2065 950' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage"
sudo sh -c "echo 'c' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage"
Then, we have to create a file in /etc/systemd/system/ with a .service extension. I named mine overclock.service:
[Unit]
Description=Increase GPU core and memory clocks

[Service]
Type=oneshot
ExecStart=/usr/bin/overclock

[Install]
WantedBy=multi-user.target
sudo systemctl enable overclock.service will enable our service. After rebooting it should automatically overclock the GPU. We can check if it did by running cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage.
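Since system services normally run as root, the script does not strictly need sudo, and resolving the device directory at run time avoids hard-coding the PCI path. A possible variant, as a sketch: apply_overclock and the "apply" argument are made up for this example, the clock values are the ones from this guide, and card0 is assumed to be the GPU to tune.

```shell
#!/bin/sh
# Variant of the overclock script intended to run as root from a service.
# $1 = device directory that contains pp_od_clk_voltage.
apply_overclock() {
    od_file=$1/pp_od_clk_voltage
    [ -w "$od_file" ] || { echo "cannot write $od_file" >&2; return 1; }
    # Same three writes as the original script: core, memory, commit
    for cmd in 's 7 1450 1150' 'm 2 2065 950' 'c'; do
        printf '%s\n' "$cmd" > "$od_file" || return 1
    done
}

# Only act when called with "apply", so running the file without arguments
# has no side effects.
if [ "${1-}" = "apply" ]; then
    apply_overclock "$(readlink -f /sys/class/drm/card0/device)"
fi
```

With this variant the unit file would need ExecStart=/usr/bin/overclock apply, and a failed write shows up as a failed service in systemctl status overclock.service.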
(It's not necessary, but I also made a script that sets the GPU back to the stock clock speeds. I didn't make a service for it, I just put it in my Documents folder.)
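The reset script itself is not shown here. One way such a script could look, as an assumption and not the author's version, relies on the "r" (reset) command that the amdgpu driver accepts on the same file, followed by "c" to commit; on some kernels "r" alone may be enough. reset_clocks is a hypothetical name.

```shell
#!/bin/sh
# Ask the driver to restore its default clock table, then commit.
# $1 = device directory that contains pp_od_clk_voltage. Run as root.
reset_clocks() {
    od_file=$1/pp_od_clk_voltage
    printf 'r\n' > "$od_file" && printf 'c\n' > "$od_file"
}

# Example call (commented out so nothing is written by accident):
# reset_clocks "$(readlink -f /sys/class/drm/card0/device)"
```

Checking pp_od_clk_voltage with cat afterwards should show the stock values again.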
So that should be it! Keep in mind that it might not work on every AMD GPU. In fact, I couldn't find a way to do it on my Ryzen+Vega laptop (something to do with power saving mode, I'm guessing), but it's always worth a try. This is my first "real" guide, so any feedback is very much appreciated.