How Proxmox VE shreds your SSDs

December 20, 2024

The time has come to revisit the initial piece on the inexplicable writes that even an empty Proxmox VE cluster makes, especially as we have already covered what we are looking at: a completely virtual filesystem 1 with a structure that is generated entirely on-the-fly, some of which never really exists in any persistent state - that is what lies behind the mountpoint of /etc/pve and what the pmxcfs process creates the illusion of.

We know how to set up our own cluster probe that the rest of the cluster will consider to be just another node and have the exact same pmxcfs running on top of it to expose the filesystem, without burdening ourselves with anything else from the PVE stack on the probe itself. We can now make this probe come and go as an extra node would do and observe what the cluster is doing over Corosync messaging delivered within the Closed Process Group (CPG) made up of the nodes (and the probe).

References below will be sparse, as much has already been covered in the linked posts above.

All nodes are created equal

And so is our probe. Well, almost. One of the distinctive features of the Proxmox VE stack is that there’s no concept of a master, which also means that any single node is as good as any other in terms of capabilities (save for hardware differences). Our probe advertises itself over Corosync and pmxcfs just like any other node. Only if you tried to e.g. migrate a guest to the probe would you discover it’s not possible - after all, there’s no API endpoint to facilitate such an action on our probe. We will exploit this difference that only we know of to draw conclusions on what pmxcfs writes and where it comes from.

Our familiar test setup will be:

  • 3 standard install cluster nodes with no guests running, no replication jobs set up, etc.
  • 1 probe with Corosync/pmxcfs only

Our common corosync.conf still looks like this: 2

cat /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
  node {
    name: pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.10.3
  }

  node {
    name: probe
    nodeid: 99
    quorum_votes: 1
    ring0_addr: 10.10.10.99
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: ca
  config_version: 7
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Probe first

We will start our probe first, as we know it has no watchdog that could cause it to self-reboot - under no circumstances, be they part of the design or unintended. The corosync service will start by itself on our probe, as we can check: 3

corosync-quorumtool 
Quorum information
------------------
Date:             Wed Dec 18 22:02:21 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          99
Ring ID:          63.3fa
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:            

Membership information
----------------------
    Nodeid      Votes Name
        99          1 probe (local)
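
The "Quorum: 3" figure above is consistent with a simple majority of the expected votes. A minimal sketch of that arithmetic, assuming default votequorum behaviour with no special options set (the function name is made up for illustration):

```shell
# Majority threshold for a given number of expected votes:
# floor(expected / 2) + 1, i.e. strictly more than half.
quorum_threshold() {
    expected_votes=$1
    echo $(( expected_votes / 2 + 1 ))
}

quorum_threshold 4   # 3 nodes + 1 probe, one vote each; prints 3
```

With only the probe's single vote present, the threshold of 3 is not met, hence "Activity blocked".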

All is well, let’s manually start the pmxcfs we had prepared there:

~/stage/pmxcfs
[main] notice: resolved node name 'probe' to '10.10.10.99' for default node IP address

And see how it’s doing: 4

journalctl -b -t pmxcfs

Tip

You can add -f to follow the logs live, leave with Ctrl+C. You can also use -e to jump to the end without being carried away with the flow. For the outputs below, -o cat is used to omit the timestamps, machine and process name to improve readability.

[main] notice: resolved node name 'probe' to '10.10.10.99' for default node IP address
[status] notice: update cluster info (cluster name  ca, version = 7)
[dcdb] notice: members: 99/535
[dcdb] notice: all data is up to date
[status] notice: members: 99/535
[status] notice: all data is up to date

Everything is great - apart from the fact that we are alone and thus have no quorum. Time to start up the actual 3 cluster nodes and just keep watching the pmxcfs logs on the probe:

[dcdb] notice: members: 1/787, 99/535
[dcdb] notice: starting data syncronisation
[status] notice: members: 1/787, 99/535
[status] notice: starting data syncronisation
[dcdb] notice: received sync request (epoch 1/787/00000001)
[status] notice: received sync request (epoch 1/787/00000001)
[dcdb] notice: received all states
[dcdb] notice: leader is 1/787
[dcdb] notice: synced members: 1/787, 99/535
[dcdb] notice: all data is up to date
[status] notice: received all states
[status] notice: all data is up to date
[status] notice: node has quorum
[dcdb] notice: members: 1/787, 2/801, 99/535

And so on and so forth - this is nice, as it shows us how the synchronisation is going with everyone.

Important

These are NOT the corosync logs - corosync was running all along. Yes, we are getting all the messaging over Corosync, but this is the pmxcfs process, a layer above it, telling us about the state of the filesystem.

Now one small issue comes right up on our probe, copious amounts of these:

[status] notice: RRD create error /var/lib/rrdcached/db/pve2-node/pve1: Cannot create temporary file
[status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/pve1: opening '/var/lib/rrdcached/db/pve2-node/pve1': No such file or directory

We have previously learnt that pmxcfs wants to write the time series data once it starts receiving any writes to its virtual /etc/pve/.rrd. We do not write anything there on our probe, which is why we never saw this error before we started the other nodes - but clearly the nodes are writing there, and those writes to the virtual side on their end are reaching our pmxcfs process as well. We do not care for this data now, so we can just let it peacefully write there by creating the target directory:

mkdir -p /var/lib/rrdcached/db

After this, there will be quite some silence in the log, unless you start shutting down or starting up the nodes.

Something is happening

We can now even restart our probe, not forgetting to launch pmxcfs manually thereafter, and check its log from the new boot again:

[status] notice: starting data syncronisation
[dcdb] notice: received all states
[dcdb] notice: leader is 1/787
[dcdb] notice: synced members: 1/787, 2/801, 3/777
[dcdb] notice: waiting for updates from leader
[status] notice: received sync request (epoch 1/787/00000005)
[dcdb] notice: update complete - trying to commit (got 4 inode updates)
[dcdb] notice: all data is up to date
[status] notice: received all states
[status] notice: all data is up to date

Now this is much calmer. Interestingly, our probe picked up 4 updates from the rest of the cluster - those are writes that must have happened on the nodes while our probe was down (and before we started the pmxcfs process on it again).

And what did corosync report at the same time? Let’s look at the end of its log:

journalctl -b -u corosync.service
Started corosync.service - Corosync Cluster Engine.
  [KNET  ] rx: host: 3 link: 0 is up
  [KNET  ] link: Resetting MTU for link 0 because host 3 joined
  [KNET  ] rx: host: 2 link: 0 is up
  [KNET  ] link: Resetting MTU for link 0 because host 2 joined
  [KNET  ] rx: host: 1 link: 0 is up
  [KNET  ] link: Resetting MTU for link 0 because host 1 joined
  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
  [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
  [QUORUM] Sync members[4]: 1 2 3 99
  [QUORUM] Sync joined[3]: 1 2 3
  [TOTEM ] A new membership (1.40f) was formed. Members joined: 1 2 3
  [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
  [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
  [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
  [KNET  ] pmtud: Global data MTU changed to: 1397
  [QUORUM] This node is within the primary component and will provide service.
  [QUORUM] Members[4]: 1 2 3 99
  [MAIN  ] Completed service synchronization, ready to provide service.

So the corosync service had already been in the game all this time; when we started the pmxcfs process afterwards, it used Corosync messaging to synchronise some data in its virtual filesystem.

But something is writing all the time and not exactly to /var/lib/rrdcached/db - we know this from the original post. Where can we see that? Actually, there’s a built-in way to get more debug-level logs, even mentioned in the official docs: 1 the /etc/pve/.debug file. If we write 1 into it, pmxcfs will start logging more. It is one of those virtual dotfiles that, as we have learnt, do not actually exist; most were read-only, but this one is writeable - an exception, really.

Warning

Before you do this, consider it will start logging a lot, see for yourself below first.

echo 1 > /etc/pve/.debug
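
Since debug logging is this verbose, it may be handy to keep it on only for a bounded window; the official docs describe writing 0 into the same file to switch it off again. A minimal sketch, where the helper name and its two parameters are hypothetical:

```shell
# Enable pmxcfs debug logging for a bounded time, then switch it off.
# $1: path of the debug switch (on a real node: /etc/pve/.debug)
# $2: number of seconds to keep debug logging enabled
debug_window() {
    debug_file=$1
    seconds=$2
    echo 1 > "$debug_file"
    sleep "$seconds"
    echo 0 > "$debug_file"
}
```

On a node this could be called as debug_window /etc/pve/.debug 60. If the helper is interrupted during the sleep, debug logging stays enabled until 0 is written manually.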

So now we start seeing lots of these:

[database] debug: enter dbd_backend_delete_inode (database.c:209:bdb_backend_delete_inode)
[database] debug: enter backend_write_inode 000000000015E4FA 'lrm_status', size 83 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[dcdb] debug: leave dcdb_deliver (1) (dcdb.c:932:dcdb_deliver)
[dcdb] debug: dfsm mode is 2 (dfsm.c:661:dfsm_cpg_deliver_callback)
[dcdb] debug: process message 4 (length = 72) (dcdb.c:818:dcdb_deliver)
[database] debug: enter dbd_backend_delete_inode (database.c:209:bdb_backend_delete_inode)
[database] debug: enter backend_write_inode 000000000015E4FB 'lrm_status', size 83 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[dcdb] debug: leave dcdb_deliver (1) (dcdb.c:932:dcdb_deliver)
[status] debug: dfsm mode is 2 (dfsm.c:661:dfsm_cpg_deliver_callback)
[status] debug: got node 3 status update tasklist (status.c:1694:cfs_kvstore_node_set)

There’s a little bit of everything there - amongst others, things that got delivered from the CPG and entries for when something is actually being written or deleted on the backend - all from the viewpoint of our local probe. And that’s what we are after, so we will filter for the backend_write events:

journalctl -b -t pmxcfs -g backend_write

There will be monotonous blocks - amongst others - of these, plenty of them:

[database] debug: enter backend_write_inode 000000000015EEA5 'lrm_status', size 83 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 000000000015EEA8 'lrm_status.tmp.999', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 000000000015EEA8 'lrm_status.tmp.999', size 83 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

Now let’s look only at lrm_status, as it comes in patterns and is very frequent on its own. How many times in the past minute did these events occur? Let’s count the lines: 5 6

journalctl -b -t pmxcfs -g backend_write --since '1 min ago' | grep lrm_status | wc -l

And it counts 108 lines per minute that the backend function touches lrm_status alone.

Note

If you do not filter for lrm_status, it is actually 162 per minute for backend_write.
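
To see how such a total splits up by entry name, the log lines can be tallied instead of just counted. A minimal sketch, assuming the debug log format shown above, where the entry name sits between the first pair of single quotes (the function name is made up for illustration):

```shell
# Count backend writes per entry name from pmxcfs debug log lines on stdin.
# The entry name is taken as the text between the first pair of single quotes.
count_writes_by_name() {
    awk -F"'" '/enter backend_write_inode/ { count[$2]++ }
               END { for (name in count) print count[name], name }' |
        sort -k1,1nr -k2
}
```

It could be fed with journalctl -b -t pmxcfs -o cat -g backend_write --since '1 min ago' | count_writes_by_name. The breakdown also makes visible that a plain grep lrm_status matches the lrm_status.tmp.999 entries as well.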

The lrm_status is a file, we could find it, e.g.:

ls -l /etc/pve/nodes/pve1/lrm_status 
-rw-r----- 1 root www-data 83 Dec 18 23:06 /etc/pve/nodes/pve1/lrm_status

And it will be present in each node’s subdirectory - but not our probe’s /etc/pve/local (remember that’s just a symbolic link to the individually named directory for each node itself - our probe acts like one - under /etc/pve/nodes) as we have not created it and we are not writing anything there. Let’s watch one of them for pve1:

watch cat /etc/pve/nodes/pve1/lrm_status

It contains something like this:

{"results":{},"mode":"active","state":"wait_for_agent_lock","timestamp":1734563528}

It has a constantly changing “timestamp” in its content, but other than that, only the order of the values changes.

We know the file is real and it belongs to the High Availability (HA) stack: 7

Each command requested by the CRM is uniquely identifiable by a UID. When the worker finishes, its result will be processed and written in the LRM status file /etc/pve/nodes/$nodename/lrm_status. There the CRM may collect it and let its state machine - respective to the commands output - act on it.

But we have no services running on any of the nodes, and certainly no HA-enabled ones.

High Availability down

We know we are not using the HA stack, so let’s safely shut it down - one node at a time. Yes, this is something you are not expected to do, not even for maintenance. It’s not an exposed feature, but it has been common with the most popular 3rd party scripts for a reason.

First we do it on one node and check what our probe is measuring - we wait for a minute so that the “past minute” measurement reflects the “one node’s HA has been brought down” effect fully.

And we are down to 72 lrm_status writes.

Then we do the second node, leaving only one node with HA active in the cluster and we are down to 36 lrm_status writes (and 90 total when measuring backend_write of “anything”).

Let’s take down HA on the last node and measure again: 0 for lrm_status, no wonder.

We can now be sure that there are no HA-associated (not just lrm_status) writes left and the remaining backend_write log entries are down to 18.
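
The measurements drop in equal steps, which points to a constant per-node contribution. A small sketch of that arithmetic, using only the counts measured above (the function name is made up for illustration):

```shell
# Log lines per node per minute, given a per-minute count and the number
# of nodes that still had the HA services running at the time.
per_node_rate() {
    lines_per_minute=$1
    ha_nodes=$2
    echo $(( lines_per_minute / ha_nodes ))
}

per_node_rate 108 3   # all three nodes with HA running
per_node_rate 72 2    # after stopping HA on one node
per_node_rate 36 1    # after stopping HA on two nodes
```

Each call prints 36, i.e. every node with HA running accounts for 36 lrm_status log lines per minute as seen from the probe.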

What else

So what’s left (for the past minute on an inactive 3-node cluster)?

[database] debug: enter backend_write_inode 000000000016485A 'file-jobs_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 000000000016485B 'file-replication_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 000000000016485E 'file-jobs_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 000000000016485F 'file-replication_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164862 'file-jobs_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164864 'file-replication_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

There’s some versioning happening that’s out of scope for now. However, going forward, do keep in mind that for all writes we count there’s likely an additional separate one for updating the __version__ entry, so everything is doubled in reality.

Other than that - and while we will also leave this part of the topic out of scope here - the only values written are those of file-jobs_cfg and file-replication_cfg. These _cfg entries are logged somewhat specially, but appear to relate to the files:

  • /etc/pve/jobs.cfg related to pvescheduler; 8 and
  • /etc/pve/replication.cfg related to pvesr. 9

The simpler observation is that there are exactly 6 of them in the past minute and that is a conspicuous number - when talking of 3 nodes and 2 files.

One node down

Let’s try to shut down one node now - of course, we get down to 4, so that’s 2 per node per minute (without HA).

We will NOT shut another node down because we would be losing quorum at which point we know pmxcfs turns everything read-only, but we will keep observing - from our probe - the remaining two nodes:

[database] debug: enter backend_write_inode 000000000016497C 'file-jobs_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 000000000016497D 'file-replication_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164980 'file-jobs_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164981 'file-replication_cfg', size 0 (database.c:157:backend_write_inode)

Notably, the hexadecimal number in the log entry, which should refer to the inode number, differs even though the name of the write target appears the same. We already know that pmxcfs stores the content of its files as BLOBs in database table rows together with their bogus inode references - no concept of separation of the inode entry and the actual data exists here. There are only two possibilities for an entry to have the same name but be associated with a different inode - either it is in a different directory path (as was the case with lrm_status) or it was previously deleted and is now being re-created.
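
To spot such names quickly, one can count under how many distinct inode numbers each name was written. A minimal sketch, assuming the log format above, with the inode being the last word before the quoted name (the function name is made up for illustration):

```shell
# For each entry name on stdin, count the distinct inode numbers it was
# written under. A count above 1 means the name appeared with several inodes.
inodes_per_name() {
    awk -F"'" '/enter backend_write_inode/ {
                   n = split($1, words, " ")
                   inode = words[n]          # last word before the quoted name
                   if (!seen[$2, inode]++) distinct[$2]++
               }
               END { for (name in distinct) print distinct[name], name }' |
        sort -k2
}
```

The count alone cannot tell the two possibilities apart; it only flags the names worth a closer look.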

Let’s inspect the debug logs for everything coming from the [database] component of pmxcfs now and NOT just the final backend_write entries:

journalctl -b -t pmxcfs -g '\[database\]' --since '1 min ago'
[database] debug: enter backend_write_inode 00000000001649FC 'file-jobs_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter dbd_backend_delete_inode (database.c:209:bdb_backend_delete_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 00000000001649FE 'file-replication_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter dbd_backend_delete_inode (database.c:209:bdb_backend_delete_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A00 'file-jobs_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A01 'file-replication_cfg', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter dbd_backend_delete_inode (database.c:209:bdb_backend_delete_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter dbd_backend_delete_inode (database.c:209:bdb_backend_delete_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

So this now makes more sense: the entry is created, deleted, created, deleted, … That looks like an odd way of updating something. And what’s with all the versioning? We will think about this later.
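
The balance of creations and deletions in such a log excerpt can be tallied mechanically. A minimal sketch, assuming the three kinds of [database] lines seen above (the function name and output labels are made up for illustration):

```shell
# Tally [database] debug lines on stdin into named-entry writes,
# deletions and writes of the __version__ entry.
tally_events() {
    awk -F"'" '
        /backend_delete_inode/                             { deletes++; next }
        /enter backend_write_inode/ && $2 == "__version__" { versions++; next }
        /enter backend_write_inode/                        { writes++ }
        END { printf "writes=%d deletes=%d version_bumps=%d\n", writes, deletes, versions }'
}
```

Fed with the excerpt above, it reports 4 writes, 4 deletes and 8 writes of __version__, so every write and every delete was accompanied by one __version__ write.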

Everything down

Let’s stop the remaining nodes, let’s just keep our probe with plain pmxcfs now. This will be immediately noticed:

[status] notice: node lost quorum (status.c:1984:cfs_set_quorate)

[dcdb] notice: members: 99/492 (dfsm.c:1159:dfsm_cpg_confchg_callback)

[quorum] debug: quorum notification called, quorate = 0, number of nodes = 1 (quorum.c:52:quorum_notification_fn)

We already know that the pmxcfs process gets an indication from Corosync of having become inquorate and protects itself by going read-only.

As we want to experiment, we reboot our probe again and start pmxcfs in local mode - there’s a switch -l for just that. 10

Note

This is how pmxcfs runs on single node installs and it defaults to it when it does not find any corosync configuration inside its database, which we do not want to fiddle with now though.

We will also add the -d switch, which enables debug logging - equivalent to writing 1 into /etc/pve/.debug on a running instance, as we did before. The nice side effect of starting in debug mode is that we get the debug-level logs from the very start.

~/stage/pmxcfs -ld

And now we even see the entire database loading up into the virtual file structure:

[main] notice: resolved node name 'probe' to '10.10.10.99' for default node IP address (pmxcfs.c:859:main)
[database] debug: name __version__ (inode = 0000000000000000, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name datacenter.cfg (inode = 0000000000000002, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name user.cfg (inode = 0000000000000004, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name storage.cfg (inode = 0000000000000006, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name virtual-guest (inode = 0000000000000008, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name priv (inode = 0000000000000009, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name nodes (inode = 000000000000000B, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name pve1 (inode = 000000000000000C, parent = 000000000000000B) (database.c:370:bdb_backend_load_index)
[database] debug: name lxc (inode = 000000000000000D, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name qemu-server (inode = 000000000000000E, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name openvz (inode = 000000000000000F, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name priv (inode = 0000000000000010, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name lock (inode = 0000000000000011, parent = 0000000000000009) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-www.key (inode = 0000000000000018, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-ssl.key (inode = 000000000000001A, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-root-ca.key (inode = 000000000000001C, parent = 0000000000000009) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-root-ca.pem (inode = 000000000000001E, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-root-ca.srl (inode = 0000000000000020, parent = 0000000000000009) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-ssl.pem (inode = 0000000000000023, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name firewall (inode = 0000000000000030, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name ha (inode = 0000000000000031, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name mapping (inode = 0000000000000032, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name acme (inode = 0000000000000033, parent = 0000000000000009) (database.c:370:bdb_backend_load_index)
[database] debug: name sdn (inode = 0000000000000034, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name known_hosts (inode = 0000000000000396, parent = 0000000000000009) (database.c:370:bdb_backend_load_index)
[database] debug: name pve2 (inode = 00000000000003AC, parent = 000000000000000B) (database.c:370:bdb_backend_load_index)
[database] debug: name lxc (inode = 00000000000003AD, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[database] debug: name qemu-server (inode = 00000000000003AE, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[database] debug: name openvz (inode = 00000000000003AF, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[database] debug: name priv (inode = 00000000000003B0, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-ssl.key (inode = 00000000000003BB, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-ssl.pem (inode = 00000000000003BD, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[database] debug: name pve3 (inode = 0000000000000418, parent = 000000000000000B) (database.c:370:bdb_backend_load_index)
[database] debug: name lxc (inode = 0000000000000419, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name qemu-server (inode = 000000000000041A, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name openvz (inode = 000000000000041B, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name priv (inode = 000000000000041C, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-ssl.key (inode = 0000000000000431, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-ssl.pem (inode = 0000000000000433, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name probe (inode = 000000000005C263, parent = 000000000000000B) (database.c:370:bdb_backend_load_index)
[database] debug: name pve-ssl.pem (inode = 000000000005D516, parent = 000000000005C263) (database.c:370:bdb_backend_load_index)
[database] debug: name corosync.conf (inode = 00000000000DA551, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name vzdump.cron (inode = 00000000000E4867, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name replication.cfg (inode = 00000000000E492E, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name 101.conf (inode = 000000000015ADB0, parent = 000000000000000D) (database.c:370:bdb_backend_load_index)
[database] debug: name 102.conf (inode = 000000000015ADB9, parent = 00000000000003AD) (database.c:370:bdb_backend_load_index)
[database] debug: name 103.conf (inode = 000000000015ADBF, parent = 0000000000000419) (database.c:370:bdb_backend_load_index)
[database] debug: name authkey.pub.old (inode = 000000000016332B, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name authkey.pub (inode = 000000000016332E, parent = 0000000000000000) (database.c:370:bdb_backend_load_index)
[database] debug: name authkey.key (inode = 0000000000163331, parent = 0000000000000009) (database.c:370:bdb_backend_load_index)
[database] debug: name ssh_known_hosts (inode = 0000000000163434, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name lrm_status (inode = 0000000000164426, parent = 000000000000000C) (database.c:370:bdb_backend_load_index)
[database] debug: name lrm_status (inode = 000000000016473B, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name ssh_known_hosts (inode = 0000000000164765, parent = 0000000000000418) (database.c:370:bdb_backend_load_index)
[database] debug: name lrm_status (inode = 0000000000164835, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[database] debug: name authorized_keys (inode = 0000000000164962, parent = 0000000000000009) (database.c:370:bdb_backend_load_index)
[database] debug: name ssh_known_hosts (inode = 0000000000164965, parent = 00000000000003AC) (database.c:370:bdb_backend_load_index)
[main] debug: memdb open '/var/lib/pve-cluster/config.db' successful (version = 0000000000164A2B) (memdb.c:536:memdb_open)
[main] notice: forcing local mode (although corosync.conf exists) (pmxcfs.c:923:main)

Where does it load to? Well, it does not have many options - RAM, of course.
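
The index lines above carry enough information to rebuild the directory tree, as each entry names its own inode and that of its parent. A minimal sketch, assuming the log format shown above, that inode 0000000000000000 stands for the root, and that names contain no spaces (the function name is made up for illustration):

```shell
# Rebuild full paths from "name X (inode = A, parent = B)" lines on stdin.
index_to_paths() {
    awk '
        BEGIN { root = "0000000000000000" }
        /bdb_backend_load_index/ {
            # locate the literal word "name"; the entry name follows it
            for (i = 1; i <= NF; i++) if ($i == "name") break
            entry = $(i + 1)
            inode = $(i + 4);  sub(/,$/, "", inode)     # strip trailing comma
            parent = $(i + 7); sub(/\)$/, "", parent)   # strip closing bracket
            if (inode == root) next                     # skip the __version__ entry
            names[inode] = entry
            parents[inode] = parent
            order[++n] = inode
        }
        END {
            for (k = 1; k <= n; k++) {
                path = names[order[k]]
                # walk up the parent chain until the root (or an unknown parent)
                for (p = parents[order[k]]; p != root && (p in names); p = parents[p])
                    path = names[p] "/" path
                print path
            }
        }'
}
```

Piping the startup log through it, e.g. journalctl -b -t pmxcfs -o cat | index_to_paths, would list entries such as nodes/pve1/lxc/101.conf relative to the mountpoint.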

And now dead silence, we do not even have to filter:

journalctl -b -t pmxcfs -o cat --since '1 min ago'

Nothing is written, anywhere. So pmxcfs only writes to the backend when it itself receives writes into the virtual mountpoint.

How does it write

In the middle of this serenity - we have full control over any writes, let’s write something ourselves: 11

touch /etc/pve/t0

And we will filter again because we only care for the [database] component:

journalctl -b -t pmxcfs -g '\[database\]' --since '1 min ago'
[database] debug: enter backend_write_inode 0000000000164A2C 't0', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A2C 't0', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

That’s rather unexpected, a new empty file was written twice.

Let’s touch it again, this should only refresh its timestamp:

[database] debug: enter backend_write_inode 0000000000164A2C 't0', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

That’s more like it.

Let’s write a character to it, rewriting it:

echo t > /etc/pve/t0
[database] debug: enter backend_write_inode 0000000000164A2C 't0', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A2C 't0', size 2 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

Here we go again, written twice, first time with 0 size and second time with content.

Let’s append to it:

echo t >> /etc/pve/t0
[database] debug: enter backend_write_inode 0000000000164A2C 't0', size 4 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

One write again.

Let’s move it:

mv /etc/pve/t0 /etc/pve/t1
[database] debug: enter backend_write_inode 0000000000164A36 't1', size 2 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

And remove it:

rm /etc/pve/t1
[database] debug: enter dbd_backend_delete_inode (database.c:209:bdb_backend_delete_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

Now let’s create a new file made up of random characters, up to pmxcfs’s file size limit of 1M: 12

dd if=/dev/random of=/etc/pve/t2
dd: writing to '/etc/pve/t2': File too large
2049+0 records in
2048+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 9.19993 s, 114 kB/s

Notice the time it took - and we are using an NVMe SSD.

The logs are overflowing:

[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 512 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 1024 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 1536 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 2048 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

… many more entries here … and the last:

[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 694272 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)

Yes, there are 2049 writes, which can be verified with:

journalctl -b -t pmxcfs -g '\[database\]' | grep t2 | wc -l

It did not like dd’s default block size at all.
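If you would rather tally the writes per entry than grep for a single name, the debug lines are regular enough to parse. What follows is only a minimal sketch, assuming the journal lines keep the exact format shown above; the function name count_backend_writes and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the debug lines shown above, e.g.
# [database] debug: enter backend_write_inode 0000000000164A3A 't2', size 512 (...)
WRITE_RE = re.compile(
    r"enter backend_write_inode (?P<inode>[0-9A-F]{16}) '(?P<name>[^']*)', size (?P<size>\d+)"
)

def count_backend_writes(lines):
    """Return a Counter mapping entry name to its number of backend_write_inode calls."""
    counts = Counter()
    for line in lines:
        match = WRITE_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts

# Illustrative sample in the same format as the journal output above.
sample = [
    "[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 0 (database.c:157:backend_write_inode)",
    "[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)",
    "[database] debug: enter backend_write_inode 0000000000164A3A 't2', size 512 (database.c:157:backend_write_inode)",
    "[database] debug: enter backend_write_inode 0000000000000000 '__version__', size 0 (database.c:157:backend_write_inode)",
]
counts = count_backend_writes(sample)
print(counts)
```

Feeding it the real journal lines instead of the sample would also show that every file write is shadowed by a __version__ write.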

So let’s be nice, let’s set the block size for dd to 1M and let pmxcfs gulp it all in one shot, a new file:

dd if=/dev/random of=/etc/pve/t3 bs=1M
dd: error writing '/etc/pve/t3': File too large
2+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0749296 s, 14.0 MB/s

And the debug log:

[database] debug: enter backend_write_inode 0000000000165244 't3', size 0 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 131072 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 262144 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 393216 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 524288 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 655360 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 786432 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 917504 (database.c:157:backend_write_inode)
[database] debug: enter backend_write_inode 0000000000165244 't3', size 1048576 (database.c:157:backend_write_inode)

That’s 9 writes. And they are not 9 writes appending to the file, they are:

  • empty file
  • chunk of first 128K
  • chunk of first 256K (includes the previous first 128K again)
  • chunk of first 384K (includes the previous first 256K again)

… you get the idea.
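To put a number on that idea, here is a back-of-the-envelope sketch. It assumes, as the sizes in the debug log suggest, that every chunk arriving at pmxcfs makes the backend store the whole file content so far. It counts payload only, so the __version__ updates and any database journalling are left out, and the function name cumulative_payload is ours.

```python
def cumulative_payload(file_size, chunk_size):
    """Bytes handed to the backend if each incoming chunk triggers a write
    of everything received so far (the initial empty-file write adds 0)."""
    total = 0
    written = 0
    while written < file_size:
        written = min(written + chunk_size, file_size)
        total += written  # the whole file so far goes to the backend again
    return total

MIB = 1024 * 1024
big = cumulative_payload(MIB, 128 * 1024)  # dd bs=1M, arriving in 128K chunks
small = cumulative_payload(MIB, 512)       # dd with its default bs=512

print(big / MIB)    # 4.5
print(small / MIB)  # 1024.5
```

So even before any database overhead, a 1M file arriving in 128K chunks means 4.5M of payload, and the same file arriving in 512-byte chunks means about 1G.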

Do these really hit the block layer? Of course, and you can check how much the pmxcfs process is writing. We will install iotop, 13 and separately observe accumulated writes (from the point when iotop was launched) on our (sleepy) system:

apt install -y iotop
iotop -ao

Meanwhile, running the first dd (default bs=512) again, we see:

Total DISK READ:         0.00 B/s | Total DISK WRITE:        19.81 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      27.73 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    555 be/4 root          0.00 B    393.35 M  0.00 %  5.77 % pmxcfs -ld
    550 be/4 root          0.00 B    420.58 M  0.00 %  5.70 % pmxcfs -ld
    514 be/4 root          0.00 B    416.77 M  0.00 %  5.60 % pmxcfs -ld
    558 be/4 root          0.00 B    437.01 M  0.00 %  5.45 % pmxcfs -ld
    590 be/4 root          0.00 B    319.58 M  0.00 %  4.19 % pmxcfs -ld
    561 be/4 root          0.00 B    304.93 M  0.00 %  4.16 % pmxcfs -ld
    515 be/4 root          0.00 B    277.55 M  0.00 %  3.94 % pmxcfs -ld
    539 be/4 root          0.00 B    300.83 M  0.00 %  3.77 % pmxcfs -ld
    551 be/4 root          0.00 B      0.00 B  0.00 %  0.21 % [kworker/u4:0-ext4-rsv-conversion]
    593 be/4 root          0.00 B      0.00 B  0.00 %  0.01 % [kworker/0:1-events]
    197 be/3 root          0.00 B     60.00 K  0.00 %  0.00 % [jbd2/vda1-8]
    238 be/4 root          0.00 B     31.42 M  0.00 %  0.00 % systemd-journald

These were 8 pmxcfs threads that indeed wrote a total of 2.8G into the backend from an original write of 1M of data into the virtual filesystem.

Note

Our copious debug logging is also captured here separately, but systemd-journald is far behind.

With the second dd (bs=1M):

Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    515 be/4 root          0.00 B      2.66 M  0.00 %  1.37 % pmxcfs -ld
    555 be/4 root          0.00 B      3.66 M  0.00 %  0.28 % pmxcfs -ld
    558 be/4 root          0.00 B      2.06 M  0.00 %  0.04 % pmxcfs -ld
    590 be/4 root          0.00 B      2.05 M  0.00 %  0.03 % pmxcfs -ld
    603 be/4 root          0.00 B      0.00 B  0.00 %  0.01 % [kworker/0:2-events]
    238 be/4 root          0.00 B   1128.00 K  0.00 %  0.00 % systemd-journald
    514 be/4 root          0.00 B    404.00 K  0.00 %  0.00 % pmxcfs -ld
    539 be/4 root          0.00 B   1432.00 K  0.00 %  0.00 % pmxcfs -ld
    550 be/4 root          0.00 B   1068.00 K  0.00 %  0.00 % pmxcfs -ld

That’s 7 threads rushing to write over 13M for what was originally 1M in a single block.

The debug logs did not lie, after all.
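For the record, the totals quoted above are simply the per-thread figures from the two iotop screens added up. A small sketch of that arithmetic follows, assuming iotop’s M and K suffixes are binary multiples (that is our assumption, the output itself does not say).

```python
# Accumulated DISK WRITE per pmxcfs thread, copied from the first iotop
# screen above (dd with default bs=512), in M.
run_bs512 = [393.35, 420.58, 416.77, 437.01, 319.58, 304.93, 277.55, 300.83]

# Second screen (dd bs=1M); the K values are converted to M,
# assuming 1 M = 1024 K.
run_bs1m = [2.66, 3.66, 2.06, 2.05, 404 / 1024, 1432 / 1024, 1068 / 1024]

ORIGINAL_M = 1.0  # dd stopped at the 1M per-file limit

for label, run in (("bs=512", run_bs512), ("bs=1M", run_bs1m)):
    total = sum(run)
    factor = total / ORIGINAL_M
    print(f"{label}: {total:.2f} M written by {len(run)} threads, about {factor:.0f}x the original")
```

The systemd-journald row is deliberately left out, since it reflects our debug logging rather than the backend.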

The not-so-atomic transactions

There’s more to this than meets the eye, actually. When not filtering the logs, you see that every time a file is touched (pun intended), it triggers an SQL transaction. A subsequent write is another one, or more of them, depending on the block size. And for each of these transactions there’s an accompanying separate SQL transaction updating the __version__ bogus entry, which presents itself as inode 0 living at the root and contains nothing but a sequence counter of how many updates this particular backend has already received.

Tip

If you are skeptical about whether the backend_write entries in the debug log are really all separate SQL transactions, you can confirm that they are in the source code, though that is out of scope here. 14

When you look back at the database structure, this is quite interesting. For one, every row (representing a file) in the database already contains its own VERSION column, so when that row is updated, that piece of information has already reached the backend. For another, the “root” __version__ marker is NOT updated in the same transaction as the file.

There are other inner workings here, related to the algorithm for synchronising the individual backends, that are meant to guarantee the “most recent” database always wins. That does not take away from the interesting aversion to making use of built-in database-level guarantees in favour of doubling down on transactionally separate writes.
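To illustrate what a database-level guarantee would look like here, below is a minimal sketch using Python’s sqlite3 module. The table is a hypothetical one, modelled only on the column list from the pmxcfs design notes; the table name tree, the helper names and the simulated failure are our assumptions and not how pmxcfs does it. The point is that the file row and the global counter can move together in one transaction, or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, modelled on INODE PARENT NAME WRITER VERSION SIZE VALUE.
conn.execute(
    "CREATE TABLE tree (inode INTEGER PRIMARY KEY, parent INTEGER, name TEXT,"
    " writer INTEGER, version INTEGER, size INTEGER, value BLOB)"
)
with conn:
    conn.execute("INSERT INTO tree VALUES (0, 0, '__version__', 0, 1, 0, NULL)")
    conn.execute("INSERT INTO tree VALUES (1, 0, 't0', 0, 1, 0, x'')")

def write_file(conn, inode, data, fail=False):
    """Update the file row and the global version counter in ONE transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE tree SET version = version + 1 WHERE inode = 0")
        (version,) = conn.execute("SELECT version FROM tree WHERE inode = 0").fetchone()
        conn.execute(
            "UPDATE tree SET version = ?, size = ?, value = ? WHERE inode = ?",
            (version, len(data), data, inode),
        )
        if fail:
            raise RuntimeError("simulated failure before commit")

def versions(conn):
    """Return {name: version} for all rows."""
    return dict(conn.execute("SELECT name, version FROM tree"))

write_file(conn, 1, b"t\n")
after_success = versions(conn)

try:
    write_file(conn, 1, b"t\nt\n", fail=True)
except RuntimeError:
    pass
after_failure = versions(conn)

print(after_success, after_failure)
```

After the successful call both rows carry the same new version. After the failed call neither has moved, which is exactly the consistency that two separate transactions cannot promise.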

Another consideration is that e.g. the HA stack was furiously writing, renaming and deleting entries like lrm_status and lrm_status.tmp.<XXX> in an attempt to make file content changes appear atomic, even though a database would provide such a facility.

Perhaps it was meant to bypass the ill effect of a file written in multiple blocks being committed to the database block by block, each as a complete individual transaction - which does not yield a consistent state within the database until the full content has been written.
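The write-to-temporary-then-rename dance is a generic pattern and can be sketched in a few lines. This is not the PVE code, just an illustration on an ordinary filesystem in a scratch directory; the helper name replace_contents and the file contents are placeholders.

```python
import os
import tempfile

def replace_contents(path, data):
    """Write data to a temporary sibling file, then rename it over path."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".tmp.", dir=directory
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)         # the full content in a single write call
        os.replace(tmp_path, path)  # readers see the old or the new file, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # do not leave a stray temporary file behind
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "lrm_status")
    replace_contents(target, b'{"state": "old"}')  # placeholder content
    replace_contents(target, b'{"state": "new"}')
    with open(target, "rb") as handle:
        result = handle.read()
    leftovers = [name for name in os.listdir(workdir) if ".tmp." in name]

print(result, leftovers)
```

On a regular filesystem this is cheap. On pmxcfs, judging by the logs above, the create, the write and the rename each end up as backend writes of their own.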

One of the caveats of FUSE is that one has to cater for the multithreaded environment, which is perhaps why using a backend database looked like a great way to outsource the problem.

Was it built for the purpose?

There are some interesting developer commits in the pmxcfs repository that you can find with git log, 15 for instance in the clone you made when you self-built it:

There was an increase in the virtual filesystem limits (to the current ones) in 2021 with good rationale:

commit a8df0863b5851dacb4f76ae6364ac1a02fbd56db
Date:   Wed Jun 30 12:06:16 2021 +0200

    pmxcfs: bump basic FS limits, 1 MiB per-file, 128 MiB total
    
    We have some users running into issues in some cases, like syncing
    huge user base through LDAP into users.cfg or having a few thousands+
    of HA services, as then the per-file limit is exhausted.
    
    Bumping that one provides only half of the solution as the total
    limit of 30 MiB would only allow a few files getting that big, or
    reduce the amount left over for actual guest configurations quite a
    bit.
    
    So also bump the total filesystem limit from 30 MiB to 128 MiB, so by
    a factor of ~4 and in the same spirit bump the maximal numbers of
    inodes (i.e., different files) from 10k to 256k, which pmxcfs can
    handle still rather easily (tested with touch) and would allow to max
    out the full FS limit with 512 byte files, which fits small guest
    configs, so sounds like an OK proportioned limit.
    
    That should give use quite some wiggle room again, and should be
    relatively safe as most of our access is rather small and on a few
    files only, only root has full access anyway and that user can break
    everything already, so not much lost here.

Just as there was an earlier one back in 2016, out of necessity:

commit 342c3c559214d6aa295c58199dd60b294685a55c
Date:   Mon Sep 12 17:50:54 2016 +0200

    pmxcfs: increase max filesize from 128k to 512k
    
    This fixes bug 1014 and also fixes a few other problems where user
    ran into the file size limitation, I did not found the bug entries
    for them, but they covered:
    1) there was a maximum of about <1500 services which could be
       managed by our HA manager, as after that the manager_status file
       got to big
    
    2) firewall rules may also reach this limit on a bigger setup
    
    I tested this with concurrent started read/writes of random data
    files from and into RAM (tmpfs mounts), as long as we do not flush
    often and read everything at once (i.e. write/read with a big block
    size) the performance stays good.
    
    The limiting factor in speed is not corosyncs CPG but sqlite, that
    can be seen when comparing worst case scenarios between local pmxcfs
    and clustered pmxcfs instances and simple debug logging.
    
    We optimize our sqlite usage quite heavy, relevant additional speed
    gains cannot be made without loosing reliability, as far as I've
    seen.
    
    So I only got into problems if I read/wrote small blocks
    with a few hundred big writes started at once, e.g.
    for i in {1..100}
    do
        dd if=/tmp/random512k.data of="/etc/pve/data$i" bs=1k &
    done
    
    As with the above worst case each block gets written as a single
    transaction to the database, where each transaction has to be locked
    and synced to disk for reliability.
    So packing all changes (i.e. the whole file) into one DB transaction
    does not produces much overhead of 512k files compared to 128k files
    
    As data written through the PVE framework is written and read in
    such a way we can increase this without seeing much of a
    performance impact.
    
    It should be also noted that just because files can now get bigger
    not a lot will get that. Rather there may be just one to three files
    bigger than 128k on some setups.

So these were all known issues, or were they? Well, there’s the most recent one that has now made it into v8.3, 16 and even that only after it was user-reported: 17

commit 1db5cfcf93e6275db1b9e8c44e6355f7ea658f95
Date:   Mon Oct 14 12:09:38 2024 +0200

    fix #5728: pmxcfs: allow bigger writes than 4k for fuse
    
    by default libfuse2 limits writes to 4k size, which means that on writes
    bigger than that, we do a whole write cycle for each 4k block that comes
    in. To avoid that, add the option 'big_writes' to allow writes bigger
    than 4k at once (namely up to 128 KiB).
    
    This means that if we update a file with more than 4KiB data, the
    following pattern occurs:
    
    * cfs_fuse_write is called with at offset 0 with 4096 size
    * sqlite writes the partial file to disk since it's a transaction
    * cfs_fuse_write is called with an offset 4096 and with 4096 size
    * sqlite updates the data and writes again
    * repeat until all data reached cfs_fuse_write
    
    So when cfs_fuse_write accepts bigger chunks, we have less
    cfs_fuse_write -> sqlite write cycles, leading to a reduced disk
    activity.
    
    Note that sqlite itself uses 4096 byte blocks to write to the file
    system layer below.
    
    Most files on pmxcfs are written with `file_set_contents`, which writes
    the file into a tmp file and renames it, so we always have some write
    overhead.
    
    Previous to pve-common commit
    ef0bcc9 (tools: file_set_contents: use syswrite instead of print)
    
    it used `print` to write, which uses an internal 8k buffer, and after
    the commit it uses `syswrite`, which writes the file unbuffered in one
    go. (Fuse still splits writes at it's defined maximum)
    
    The commit message of that patch includes benchmarks for various sizes
    of writes on pmxcfs with this patch included. Results show that we can
    reduce the amount of bytes written to disk for files larger than 4 KiB
    by a significant amount (with both patches we can reduce the
    amplification at 8KiB from ~15x to ~11x, and for 1024KiB from ~360x to
    ~15x)
    
    When we change to libfuse3, we have to remove this option again, since
    it got removed and is the default there.

So the 2016 idea, based on the assumption that “packing all changes (i.e. the whole file) into one DB transaction does not [produce] much overhead of 512k files compared to 128k files”, did not really work as intended across all those layers of the stack. Even now, with “the option ‘big_writes’ to allow writes bigger than 4k at once (namely up to 128 KiB)” at play, 128k is the limiting factor because FUSE v2 is still in use. And even if FUSE v2 is replaced by FUSE v3 and the buffers get bigger - something that has been available since 2016 18 - it won’t change anything about the nature of the filesystem observed above.

But that’s not really that important now, the best takeaway here was actually the one from 2016 already:

The limiting factor in speed is not [Corosync’s] CPG but [SQLite], that can be seen when comparing worst case scenarios between local pmxcfs and clustered pmxcfs instances and simple debug logging.

We optimize our sqlite usage quite heavy, relevant additional speed gains cannot be made without loosing reliability, as far as I’ve seen.

What all those optimisations are (beyond avoiding using any of the potentially overhead-producing database features) is out of scope here, however they do closely relate to the wasteful backend writes.

But why was SQLite chosen for this purpose, when it is not used for much more than a single-table schema with a single-column primary key and limited use of constraints? There’s good documentation of just that as well, dating back to before 2011, probably 2009: 19

* Backend Database

Each node stores the state using a backend database. That database
need to have transaction support, because we want to do atomic
updates. It must also be possible to get a copy/snapshot of the
current state.

** File Based Backend (not implemented)

Seems possible, but its hard to implement atomic update and snapshots.

** Berkeley Database Backend (not implemented)

The Berkeley DB provides full featured transaction support, including
atomic commits and snapshot isolation. 

** SQLite Database Backend (currently in use)

This is simpler than BDB. All data is inside a single file. And there
is a defined way to access that data (SQL). It is also very stable.

We can use the following simple database table:

  INODE PARENT NAME WRITER VERSION SIZE VALUE

We use a global 'version' number (64bit) to uniquely identify the
current version. This 'version' is incremented on any database
modification. We also use it as 'inode' number when we create a new
entry. The 'inode' is the primary key.

** RAM/File Based Backend

If the state is small enough we can hold all data in RAM. Then a
'snapshot' is a simple copy of the state in RAM. Although all data is
in RAM, a copy is written to the disk. The idea is that the state in
RAM is the 'correct' one. If any file/database operations fails the
saved state can become inconsistent, and the node must trigger a state
resync operation if that happens.

We can use the DB design from above to store data on disk.

This also explains the strange behaviour where data was only read from the backend database on pmxcfs start (as we finally saw above with the -d switch), but never after. The state is always stored in RAM, read from RAM, modified in RAM, as the whole state is in-memory first and foremost. How good a “snapshot representation” of that state an SQLite database file makes is questionable, but it can’t be performant. It is, however, a simple, stable start.
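The design described in those notes can be caricatured in a few lines. The following is purely a hypothetical illustration of “RAM is the correct state, the database is a copy”: the class name RamFirstStore, the two-column table and the file name state.db are all made up, and nothing here is taken from the pmxcfs sources.

```python
import os
import sqlite3
import tempfile

class RamFirstStore:
    """Hypothetical sketch: state lives in a dict, SQLite only keeps a copy."""

    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tree (name TEXT PRIMARY KEY, value BLOB)"
        )
        # The backend is read exactly once, at start.
        self.state = dict(self.db.execute("SELECT name, value FROM tree"))
        self.backend_reads = 1

    def read(self, name):
        return self.state[name]   # served from RAM, the backend is not consulted

    def write(self, name, value):
        self.state[name] = value  # RAM holds the 'correct' state
        with self.db:             # every change is mirrored as its own transaction
            self.db.execute("INSERT OR REPLACE INTO tree VALUES (?, ?)", (name, value))

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "state.db")
    first = RamFirstStore(path)
    first.write("t0", b"t\n")
    first.read("t0")
    reads_before_restart = first.backend_reads
    first.db.close()

    second = RamFirstStore(path)  # a "restart": state is loaded from the copy
    restored = second.read("t0")
    second.db.close()

print(reads_before_restart, restored)
```

In such a design reads never cost a disk access, while every single write does, which matches the picture painted by the debug logs.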

High Availability state

Part of the task of persisting state lies in persisting the portion related to High Availability. It is why those lrm_status updates were coming continually. Do they have to be coming when nothing is changing? Certainly not, but the absence of a piece of information does not equal confirmation of no change to that piece of information, i.e. at least timestamps would need to be updated. But we have seen that there’s a timestamp on the file as well as in the data of that file. Is that necessary? Well, it does not really matter as long as pmxcfs writes everything all of the time. It would absolutely rewrite the entire content of a file if you were to just update its timestamp.

Does the HA state have to be persisted at such a high rate? This is an entirely separate question related to the quality of the HA stack, which is a very unusual implementation and would warrant a separate analysis. One of the reasons for constantly trying to “persist an in-memory state” (by mirroring every transaction onto the backend) is the concern of a “power-loss event”, but when you take a step back, the HA is the driver behind exactly such events. Even if you have deployed every node with redundant PSUs and have the entire cluster backed up by a dependable UPS solution, it is the HA that will pull the plug on your server. And if things go wrong, with race conditions, there might be an entire discotheque of full-house self-rebooting, perfectly timed so that not a single cluster partition ever gets any glimpse of a quorum. The developer’s fear factor is high, it’s easier to just persist everything.

Liability in the proof of concept

Both pmxcfs and HA are such staple components of Proxmox VE that it could be hard to pick a side; either way, some developer somewhere will not be happy with the assessment. A case could be made that if pmxcfs was meant to be a robust filesystem, it would not matter how many writes are inbound, it would know exactly which ones are worth persisting and not blindly write everything. It is a rather simplistic implementation based on FUSE boilerplate, which is out of scope here as a topic as well, but it is the reason why it does nothing other than follow every file operation it receives from every thread and pass it on to the not-so-fortunately chosen backend of SQLite.

There were arguably some great concepts behind the stack when it was initially developed. It served as a great proof of concept and it must have been sufficient to take off with 4K sized files, but it has aged and it did not scale well for what it is advertised as today. It is easy to see why there’s something beyond suboptimal conceptually in a filesystem that amplifies a single dd run by thousands of times, and it is certainly palpable for the backend.

The database makes use of the so-called Write Ahead Log (WAL) 20 for journalling, so that’s another barrage of writes every now and then, when it has to sweep the log into the base every 1000 or so pages - something beyond the scope here again, though. One just would not want to depend on it as-is, not even if the writes were exclusive to a designated user (which they are not) and only available for PVE internal use.
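For a feel of what that sweep looks like, here is a small sketch against a throwaway database in a temporary directory. It uses only standard SQLite pragmas via Python’s sqlite3 module; the table, the number and size of writes and the file name demo.db are arbitrary, and which settings pmxcfs itself applies is not something this sketch claims to reproduce.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "demo.db")
    conn = sqlite3.connect(path)

    # Switch the journal to WAL; the pragma reports the resulting mode.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Automatic checkpoint once the log reaches about 1000 pages.
    conn.execute("PRAGMA wal_autocheckpoint=1000").fetchall()

    conn.execute("CREATE TABLE tree (name TEXT PRIMARY KEY, value BLOB)")
    for _ in range(50):
        with conn:  # one transaction per write, appended to the -wal file
            conn.execute(
                "INSERT OR REPLACE INTO tree VALUES ('t0', ?)", (os.urandom(4096),)
            )

    wal_before = os.path.getsize(path + "-wal")
    # Force the sweep: copy the log into the main file and truncate the log.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
    wal_after = os.path.getsize(path + "-wal")
    conn.close()

print(mode, wal_before, wal_after)
```

Every transaction first lands in the log and is later copied into the main database file at checkpoint time, so the same content is written to disk at least twice.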

The lack of understanding amongst Proxmox staff themselves, between 2016 and 2024, of how the “bumping up” of limits may or may not adversely impact users is a testament to this part of the stack being ripe for a rewrite.

What can you do?

Short of suggesting simply not using such a filesystem stack, which, as absurd as it might come across, would be entirely feasible for single-node installs (which are plentiful in the wild), there are some general options:

  • ask Proxmox to own up to their technical debt, it has been a long time;
  • rewrite (some portion of) pmxcfs yourself to make use of a (slightly) better backend;
  • avoid using HA altogether, although this does not solve the liability part in and of itself;
  • do not use host filesystems such as ZFS or BTRFS (other than nodatacow) to avoid amplifying the amplified;
  • do not let pmxcfs run over your precious block storage; which would then also require you to:
  • have a better method of persisting the “snapshot” of the state.

Enterprise SSDs

There’s a notable omission above, one about using so-called enterprise SSDs. For one, it does not really protect from database corruption in case of power loss in this case, even if it has the advertised PLP - the way SQLite is checkpointing its journal would guarantee you corruption should a power loss occur during a checkpoint event, whether on a PLP SSD or not. For another, while it is more than likely that these SSDs will manage better in terms of performance without dying at so many concurrent continuous synchronous writes, there’s basically no good reason to be making those writes in the first place, so such a recommendation is tantamount to a piece of advice to keep a fire extinguisher handy while merrily smoking next to a suspiciously hissing burner box.

The takeaway

It is extremely unlikely an external party would want to rewrite pmxcfs for Proxmox as they only take in contributions in the form of gifts. Anyone who does it for themselves has the difficult licensing conundrum to navigate to avoid effectively giving it away for free either.

Most users - as has been obvious from polls - are extremely unlikely to be willing to compile their own pmxcfs, even with minor changes that would make it “persistently shred” their block device less. That basically leaves one with using RAM-disk based solutions which, however, will not play nice with HA and will increase usage of RAM, because there will be a copy of the state in RAM, which is copied from RAM, completely superfluously handled by a non-performant SQLite backend, also using RAM, buffered in by FUSE using yet more RAM. Nevertheless, we might look at some such in a follow-up post and see how much RAM it takes.