SSH - hidden regressions

Improved SSH with hidden regressions

November 10, 2024

If you pop into the release notes of PVE 8.2, ¹ there’s a humble note on changes to SSH behaviour under Improved management for Proxmox VE clusters:

Modernize handling of host keys for SSH connections between cluster nodes ([bugreport] 4886).
Previously, /etc/ssh/ssh_known_hosts was a symlink to a shared file containing all node hostkeys. This could cause problems if conflicting hostkeys appeared in /root/.ssh/known_hosts, for example after re-joining a node to the cluster under its old name. Now, each node advertises its own host key over the cluster filesystem. When Proxmox VE initiates an SSH connection from one node to another, it pins the advertised host key. For existing clusters, pvecm updatecerts can optionally unmerge the existing /etc/ssh/ssh_known_hosts.

The original bug

This is a complete rewrite of a piece that has been causing endless symptoms for over 10 years, ² manifesting as the seemingly inexplicable:

WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
Offending RSA key in /etc/ssh/ssh_known_hosts

This was particularly bad as it concerned pvecm updatecerts ³ - the very tool that was supposed to remedy these kinds of situations.

The irrational rationale

First, there’s the general misinterpretation of how SSH works:

problems if conflicting hostkeys appeared in /root/.ssh/known_hosts, for example after re-joining a node to the cluster under its old name.

Let’s establish that the general SSH behaviour is to accept ANY of the host keys it has on record for a given host when verifying its identity. ⁴ Having multiple “conflicting” records in known_hosts, in whichever location, is never an issue in itself - if ANY of them matches, it WILL connect.

Important

And one machine, in fact, has multiple host keys that it can present, e.g. RSA and ED25519-based ones.
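To see that several records for one host are perfectly ordinary, here is a minimal sketch that counts how many entries a known_hosts-style file holds for a given name. The function name count_host_entries is made up for illustration, and it only understands plain (unhashed) entries without markers such as @cert-authority.

```sh
#!/bin/sh
# Hypothetical helper: count known_hosts entries that list a given host name.
# Assumes plain-text entries where the first field is a comma-separated host list.
count_host_entries() {
    host=$1
    file=$2
    awk -v h="$host" '
        /^[[:space:]]*(#|$)/ { next }          # skip comments and blank lines
        {
            n = split($1, names, ",")          # first field: host1,host2,...
            for (i = 1; i <= n; i++)
                if (names[i] == h) { c++; break }
        }
        END { print c + 0 }                    # print 0 when nothing matched
    ' "$file"
}
```

Running count_host_entries <nodename> /etc/ssh/ssh_known_hosts on a typical machine will often report more than one entry, e.g. one RSA and one ED25519 record. On a real system, ssh-keygen -F <nodename> -f <file> does the lookup properly, including hashed entries.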

What was actually fixed

The actual problem at hand was that PVE used to repurpose the system-wide (not user-specific) /etc/ssh/ssh_known_hosts by turning it into a symlink pointing to /etc/pve/priv/known_hosts, which was shared across the cluster nodes. Within this architecture, changes made to the file on any node had to be merged, and the pruning meant to keep the file from growing too large mistakenly removed newly added entries for an already known host. That is, if a host was reinstalled under the same name, its new host key could never make it to be recognised by the cluster.

There were additional issues associated with this, e.g. running ssh-keygen -R would remove the symlink. Eventually, instead of fixing the merging, a new approach was chosen.

What has changed

The new implementation no longer relies on a shared known_hosts; in fact, it does not even consult the local system or user locations to look up the host key to verify. Each node writes an entry with a single host key into /etc/pve/local/ssh_known_hosts, which then appears under /etc/pve/nodes/<nodename>/ for each respective node. Connections initiated from other nodes then override the SSH parameters with:

-o UserKnownHostsFile="/etc/pve/nodes/<nodename>/ssh_known_hosts" -o GlobalKnownHostsFile=none
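If you want to mimic this pinning in your own sessions, a rough sketch could look like the following. It uses the OpenSSH option names UserKnownHostsFile and GlobalKnownHostsFile and assumes /etc/pve/local resolves to /etc/pve/nodes/<nodename>. The function name pinned_ssh_opts is hypothetical, and adding HostKeyAlias is my assumption (so that the lookup goes by node name even when you connect by address), not something taken from the release notes.

```sh
#!/bin/sh
# Hypothetical helper: print the host-key-related ssh options for one cluster node.
# Assumption: the advertised key lives in /etc/pve/nodes/<node>/ssh_known_hosts.
pinned_ssh_opts() {
    node=$1
    printf '%s\n' \
        "-o UserKnownHostsFile=/etc/pve/nodes/${node}/ssh_known_hosts" \
        "-o GlobalKnownHostsFile=none" \
        "-o HostKeyAlias=${node}"   # assumption: entry is keyed by node name
}

# Show the options for a made-up node called "pve2".
pinned_ssh_opts pve2
```

It could then be used as ssh $(pinned_ssh_opts pve2) root@<address>, leaving the substitution unquoted on purpose so the options split into separate arguments.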

This is NOT how you would typically run your own ssh sessions, so you will experience different behaviour on the CLI than before.

What was not fixed

The linking and merging of the shared ssh_known_hosts, if still present, continues to operate with the original bug, even though it would have been trivial to fix without regressions. The part left unfixed is the merging, i.e. it will still silently drop your new keys. Do not rely on it.

Regressions

There are some strange behaviours left behind. First of all, even if you create a new cluster from scratch on v8.2, the initiating node will have the symlink created, but none of the subsequently joined nodes will be added to the shared file, nor will they get the symlink themselves.

Then there was the QDevice setup issue, ⁵ discovered only by a user, since fixed.

Lately, there was the LXC console relaying issue, ⁶ also user reported.

The takeaway

It is good to check which PVE version each of your nodes is running:

pveversion -v | grep -e proxmox-ve: -e pve-cluster:

The bug was fixed for pve-cluster: 8.0.6 (not to be confused with proxmox-ve).
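When checking several nodes, a small version comparison saves squinting at the output. This is only a sketch: version_at_least is a made-up name, and it relies on sort -V (GNU coreutils, as shipped with Debian) rather than full Debian version semantics.

```sh
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
# Assumption: simple dotted versions that sort -V orders correctly.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example with literal version strings.
if version_at_least 8.0.10 8.0.6; then
    echo "new host key handling should be present"
fi
```

On a node, the installed version could be fed in with something like version_at_least "$(pveversion -v | awk '$1 == "pve-cluster:" { print $2 }')" 8.0.6.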

Check if you have symlinks present:

readlink -v /etc/ssh/ssh_known_hosts

You either have the symlink present - pointing to the shared location:

/etc/pve/priv/known_hosts

Or an actual local file present:

readlink: /etc/ssh/ssh_known_hosts: Invalid argument

Or nothing - neither file nor symlink - there at all:

readlink: /etc/ssh/ssh_known_hosts: No such file or directory
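The three cases above can also be told apart without parsing readlink error messages. A minimal sketch, with the function name known_hosts_state being my own invention:

```sh
#!/bin/sh
# Hypothetical helper: report whether a path is a symlink, a local file, or absent.
known_hosts_state() {
    path=${1:-/etc/ssh/ssh_known_hosts}
    if [ -L "$path" ]; then
        # Symlink (even a dangling one): show where it points.
        printf 'symlink -> %s\n' "$(readlink "$path")"
    elif [ -f "$path" ]; then
        printf 'local file\n'
    else
        printf 'absent\n'
    fi
}

# Check the default system-wide location.
known_hosts_state
```

This is handy when looping over nodes, as each one then reports a single line you can compare.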

Consider removing the symlink with the newly provided option:

pvecm updatecerts --unmerge-known-hosts

And removing (with a backup) the local machine-wide file as well:

mv /etc/ssh/ssh_known_hosts{,.disabled}

If you are running your own scripts that e.g. depend on SSH being able to successfully verify the identity of all current and future nodes, you now need to roll your own solution going forward.
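One possible direction, purely as a sketch and not anything PVE provides: gather the per-node keys that each node advertises into a single file of your own and point your scripts at it. The function name collect_node_keys is hypothetical, and the sketch assumes the per-node files live under /etc/pve/nodes/<nodename>/ssh_known_hosts and are keyed by node name.

```sh
#!/bin/sh
# Hypothetical helper: merge every node's advertised host key into one file.
# $1: directory holding the per-node subdirectories (assumed /etc/pve/nodes)
# $2: output file in known_hosts format
collect_node_keys() {
    nodes_dir=$1
    out=$2
    for f in "$nodes_dir"/*/ssh_known_hosts; do
        # Skip nodes without an advertised key (and an unmatched glob).
        if [ -f "$f" ]; then
            cat "$f"
        fi
    done | sort -u > "$out"
}
```

For example, collect_node_keys /etc/pve/nodes "$HOME/.ssh/known_hosts_pve" followed by ssh -o UserKnownHostsFile="$HOME/.ssh/known_hosts_pve" -o HostKeyAlias=<nodename> root@<address>. The file has to be regenerated whenever nodes join or get reinstalled, so treat it as a starting point only.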

Most users would not have noticed except when suddenly being asked to verify authenticity when “jumping” cluster nodes, something that was previously seamless.

What is not covered here

This post is meant to highlight the change in default PVE cluster behaviour when it comes to verifying remote hosts against known_hosts by the connecting clients. It does NOT cover still present bugs relating to the use of shared authorized_keys that are used to authenticate the connecting clients by the remote host.
