Why Proxmox VE shreds your SSDs

November 3, 2024

Tip

This post has a more detailed follow-up available here.

You must have read, at least once, that Proxmox recommend “enterprise” SSDs 1 for their virtualisation stack. But why does it shred regular SSDs? It does not have to: modern consumer drives, even without power-loss protection (PLP), can endure as much as 2,000 TBW over their lifetime. So where do the writes come from? ZFS? Let’s have a look.

The below is of particular interest to any homelab user, but really anyone who cares about wasted system performance should read on.

Probe

If you have a cluster, you can actually safely follow this experiment. Add a new “probe” node that you will later dispose of and let it join the cluster. On the “probe” node, we will isolate the backend database of the configuration state onto a separate filesystem, so that we can benchmark only pmxcfs 2 - the virtual filesystem that is mounted at /etc/pve and holds your configuration files, i.e. the cluster state.
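
Before touching anything, you can confirm where this backend actually lives - a quick sanity check of my own, not part of the original steps, using the paths referenced in this post:

findmnt /etc/pve                        # pmxcfs, the FUSE-mounted cluster state
ls -l /var/lib/pve-cluster/config.db    # the database file the service persists to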

dd if=/dev/zero of=/root/pmxcfsbd bs=1M count=256
mkfs.ext4 /root/pmxcfsbd
systemctl stop pve-cluster
cp /var/lib/pve-cluster/config.db /root/
mount -o loop /root/pmxcfsbd /var/lib/pve-cluster

This creates a separate loop device, 3 sufficiently large, 4 shuts down the service 5 that issues writes to the backend database, and copies the database out of its original location before mounting 6 the blank device over the original path where the service will look for it again. 7

lsblk
NAME                                    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0                                     7:0    0  256M  0 loop /var/lib/pve-cluster

Now copy the backend database onto the dedicated - so far blank - loop device and restart the service.

cp /root/config.db /var/lib/pve-cluster/
systemctl start pve-cluster.service 
systemctl status pve-cluster.service

If all went well, your service is up and running and issuing its database writes onto the separate loop device.
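
To double-check that pmxcfs came back up cleanly on the new backing store, something along these lines should do - my own sketch, not part of the original steps:

findmnt /etc/pve                  # the FUSE mount should be back
journalctl -u pve-cluster -n 20   # no errors opening the database expected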

Observation

From now on, you can measure the writes occurring solely there: 8

vmstat -d

You are interested in the loop device, loop0 in my case. Wait some time, e.g. an hour, and list the same again:

disk- ------------reads------------ ------------writes----------- -----IO------
       total merged sectors      ms  total merged sectors      ms    cur    sec
loop0   1360      0    6992      96   3326      0  124180   16645      0     17
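
If you prefer not to eyeball the counters, a tiny sampler along these lines prints the per-minute rate directly - a sketch of mine, assuming the loop device is loop0 and that the sectors column, as in /proc/diskstats, is in 512-byte units:

# read the "sectors written" column for loop0 twice, 60 seconds apart
S1=$(vmstat -d | awk '$1=="loop0" {print $8}')
sleep 60
S2=$(vmstat -d | awk '$1=="loop0" {print $8}')
echo "$((S2 - S1)) sectors (~$(( (S2 - S1) * 512 / 1024 )) KiB) written per minute"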

I did my test with different configurations, all idle:

  • single node (no cluster);
  • 2-nodes cluster;
  • 5-nodes cluster.

The rate of writes on these otherwise freshly installed and idle (zero guests) systems is impressive:

  • single ~ 1,000 sectors / minute writes
  • 2-nodes ~ 2,000 sectors / minute writes
  • 5-nodes ~ 5,000 sectors / minute writes
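
To put such rates into perspective, here is a rough conversion of sectors per minute into volume over time - a back-of-the-envelope sketch of mine, assuming 512-byte sectors; substitute your own measured rate:

RATE=1000   # sectors written per minute
echo "$(( RATE * 512 * 60 * 24 / 1024 / 1024 )) MiB per day"
echo "$(( RATE * 512 * 60 * 24 * 365 / 1024 / 1024 / 1024 )) GiB per year"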

But this is not a real-life scenario; in fact, these are bare minimums. In the wild, the growth is NOT LINEAR at all: it will depend on e.g. the number of HA services running and the frequency of migrations.

Important

These measurements are filesystem-agnostic, so if your root is e.g. installed on ZFS, you would need to multiply the numbers by the write amplification that filesystem adds on top.
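
If you want to see what actually lands on the physical device underneath, you can compare the same counters there and check the drive's own lifetime write counter - again a sketch of mine; substitute your real root disk for the example device names:

vmstat -d | awk '$1=="sda" || $1=="nvme0n1"'              # raw writes hitting the disk
smartctl -A /dev/nvme0n1 | grep -i "data units written"   # NVMe lifetime writes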

But suffice it to say, even just the idle writes amount to a minimum of ~ 0.5TB per year for a single node, or ~ 2.5TB per year (on each node) with a 5-node cluster.

Summary

This might not look like much until you consider that these are copious tiny writes of very much “nothing” being written all of the time. Consider that in my case at least (no migrations, no config changes - no guests after all), almost none of this data needs to be hitting the block layer.

That’s right, these are completely avoidable writes wasting your filesystem performance. If it’s a homelab, you probably care about your SSDs being shredded prematurely. In any environment, this increases the risk of data loss during a power failure, as the backend database might come back up corrupt.

And these are just the configuration state related writes, nothing to do with your guests writing onto their own block layer. But then again, there were no state changes at all in my test scenarios.

So in a nutshell, consider that deploying clusters takes its toll, and account for a multiple of the numbers quoted above due to actual filesystem amplification and real files being written in an operational environment.