Watchdog time bomb

The Proxmox time bomb watchdog

November 21, 2024

The title of this post is inspired by the very statement of “[watchdogs] are like a loaded gun” from Proxmox wiki. 1 Proxmox include one such active-by-default tool on every single node anyway. There’s further misinformation, including on official forums, when watchdogs are “disarmed” and it is thus impossible to e.g. isolate genuine non-software related reboots. Design flaws might get your node auto-reboot with no indication in the GUI. The CLI part is undocumented as is reliably disabling this feature.

Always ticking

Auto-reboots are often associated with High Availability (HA), 1 but in fact, every fresh Proxmox VE (PVE) install, unlike Debian, comes with an obscure setup out of the box, set at boot time and ready to be triggered at any point - it does NOT matter if you make use of HA or not.

Important

There are different kinds of watchdog mechanisms other than the one covered by this post, e.g. kernel NMI watchdog, 2 Corosync watchdog, 3 etc. The subject of this post is merely the Proxmox multiplexer-based implementation that the HA stack relies on.

Watchdogs

In terms of computer systems, watchdogs ensure that things either work well or the system at least attempts to self-recover into a state which retains overall integrity after a malfunction. No watchdog would be needed for a system that can be attended in due time, but some additional mechanism is required to avoid collisions for automated recovery systems which need to make certain assumptions.

The watchdog employed by PVE is based on a timer - one that has a fixed initial countdown value set and once activated, a handler needs to constantly attend it by resetting it back to the initial value, so that it does NOT go off. In a twist, it is the timer making sure that the handler is all alive and well attending it, not the other way around.

The timer itself is accessed via a watchdog device and is a feature supported by Linux kernel 4 - it could be an independent hardware component on some systems or entirely software-based, such as softdog 5 - that Proxmox default to when otherwise left unconfigured.

When available, you will find /dev/watchdog on your system. You can also inquire about its handler:

lsof +c12 /dev/watchdog
COMMAND         PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
watchdog-mux 484190 root    3w   CHR 10,130      0t0  686 /dev/watchdog

And more details:

wdctl /dev/watchdog0 
Device:        /dev/watchdog0
Identity:      Software Watchdog [version 0]
Timeout:       10 seconds
Pre-timeout:    0 seconds
Pre-timeout governor: noop
Available pre-timeout governors: noop

The bespoke PVE process is rather timid with logging:

journalctl -b -o cat -u watchdog-mux
Started watchdog-mux.service - Proxmox VE watchdog multiplexer.
Watchdog driver 'Software Watchdog', version 0

But you can check how it is attending the device, every second:

strace -r -e ioctl -p $(pidof watchdog-mux)
strace: Process 484190 attached
     0.000000 ioctl(3, WDIOC_KEEPALIVE) = 0
     1.001639 ioctl(3, WDIOC_KEEPALIVE) = 0
     1.001690 ioctl(3, WDIOC_KEEPALIVE) = 0
     1.001626 ioctl(3, WDIOC_KEEPALIVE) = 0
     1.001629 ioctl(3, WDIOC_KEEPALIVE) = 0

If the handler stops resetting the timer, your system WILL undergo an emergency reboot. Killing the watchdog-mux process would give you exactly that outcome within 10 seconds.

Caution

If you stop the handler correctly, it should gracefully stop the timer. However the device is still available, a simple touch will get you a reboot.

The multiplexer

The obscure watchdog-mux service is a Proxmox construct of a multiplexer - a component that combines inputs from other sources to proxy to the actual watchdog device. You can confirm it being part of the HA stack:

dpkg-query -S $(which watchdog-mux)
pve-ha-manager: /usr/sbin/watchdog-mux

The primary purpose of the service, apart from attending the watchdog device (and keeping your node from rebooting), is to listen on a socket to its so-called clients - these are the better known services of pve-ha-crm and pve-ha-lrm. The multiplexer signifies there are clients connected to it by creating a directory /run/watchdog-mux.active/, but this is rather confusing as the watchdog-mux service itself is ALWAYS active.

While the multiplexer is supposed to handle the watchdog device (at ALL times), it is itself handled by the clients (if the are any active). The actual mechanisms behind the HA and its fencing 6 are out of scope for this post, but it is important to understand that none of the components of HA stack can be removed, even if unused:

apt remove -s -o Debug::pkgProblemResolver=true pve-ha-manager
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Starting pkgProblemResolver with broken count: 3
Starting 2 pkgProblemResolver with broken count: 3
Investigating (0) qemu-server:amd64 < 8.2.7 @ii K Ib >
Broken qemu-server:amd64 Depends on pve-ha-manager:amd64 < 4.0.6 @ii pR > (>= 3.0-9)
  Considering pve-ha-manager:amd64 10001 as a solution to qemu-server:amd64 3
  Removing qemu-server:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-container:amd64 < 5.2.2 @ii K Ib >
Broken pve-container:amd64 Depends on pve-ha-manager:amd64 < 4.0.6 @ii pR > (>= 3.0-9)
  Considering pve-ha-manager:amd64 10001 as a solution to pve-container:amd64 2
  Removing pve-container:amd64 rather than change pve-ha-manager:amd64
Investigating (0) pve-manager:amd64 < 8.2.10 @ii K Ib >
Broken pve-manager:amd64 Depends on pve-container:amd64 < 5.2.2 @ii R > (>= 5.1.11)
  Considering pve-container:amd64 2 as a solution to pve-manager:amd64 1
  Removing pve-manager:amd64 rather than change pve-container:amd64
Investigating (0) proxmox-ve:amd64 < 8.2.0 @ii K Ib >
Broken proxmox-ve:amd64 Depends on pve-manager:amd64 < 8.2.10 @ii R > (>= 8.0.4)
  Considering pve-manager:amd64 1 as a solution to proxmox-ve:amd64 0
  Removing proxmox-ve:amd64 rather than change pve-manager:amd64

Considering the PVE stack is so inter-dependent with its components, they can’t be removed or disabled safely without taking extra precautions.

How to get rid of the auto-reboot

You can find two separate snippets on how to reliably put the feature out of action here, depending on whether you are looking for a temporary or lasting solution.