Mountpoint of /etc/pve

The pmxcfs mountpoint of /etc/pve

December 8, 2024

This post provides a high-level overview of the Proxmox cluster filesystem, also dubbed pmxcfs, 1 one that goes beyond the official terse description:

a database-driven file system for storing configuration files, replicated in real time to all cluster nodes

Most users will have encountered it as the location where their guest configurations are stored, known simply by its path: /etc/pve.

Mountpoint

Foremost, it is important to understand that the directory itself, as it resides on the actual system disk, is empty - simply because it is just a mountpoint, serving a similar purpose as e.g. /mnt.

This can be easily verified: 2

findmnt /etc/pve
TARGET   SOURCE    FSTYPE OPTIONS
/etc/pve /dev/fuse fuse   rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other

This is somewhat counterintuitive and a bit of a stretch from the Filesystem Hierarchy Standard, 3 which holds that /etc is meant to contain host-specific configuration files understood as local and static - as can be seen above, this is not a regular mountpoint. And those are not regular files within.
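
If you want to convince yourself that the underlying directory on the system disk really is empty - without unmounting anything - a bind mount of the root filesystem lets you peek beneath the mountpoint. A quick sketch (the /mnt/rootfs path is just an arbitrary choice):

mkdir -p /mnt/rootfs
mount --bind / /mnt/rootfs
# bind mounts are not recursive, so this shows the bare directory underneath
ls -la /mnt/rootfs/etc/pve
umount /mnt/rootfs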

Tip

If you find yourself in a situation of a genuinely unpopulated /etc/pve on a regular PVE node, you are most likely experiencing an issue where the filesystem did not get mounted - one such case was described here previously.

Virtual filesystem

The filesystem type as reported by findmnt is that of a Filesystem in Userspace (FUSE), which is a feature provided by the Linux kernel. 4

Filesystems are commonly implemented at the kernel level; adding support for a new one would then require bespoke kernel modules. With FUSE, only a thin interface layer resides in the kernel, and a regular user-space process interacts with it through a library - this is especially useful for virtual filesystems that present some representation of arbitrary data through regular filesystem paths.

A good example of a FUSE filesystem is SSHFS, 5 which uses SSH (or, more precisely, its sftp subsystem) to connect to a remote system whilst giving the appearance of working with a regular mounted filesystem. In fact, virtual filesystems do not even have to store the actual data; they may instead e.g. generate it on the fly.
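
As a tangible illustration - purely a sketch with a made-up host and paths - mounting a remote directory over SSH with SSHFS looks like any other mount, yet every access is translated into SFTP operations by a user-space process:

apt install -y sshfs
mkdir -p /mnt/remote-logs
sshfs admin@remote.example.com:/var/log /mnt/remote-logs
findmnt /mnt/remote-logs    # reports FSTYPE fuse.sshfs
umount /mnt/remote-logs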

The process of pmxcfs

The PVE process that provides this FUSE filesystem is - unsurprisingly - pmxcfs, and it needs to be running at all times, at least if you want to be able to access anything in /etc/pve - it is what gives the user the illusion that there is any structure there.

You will find it on any standard PVE install in the pve-cluster package:

dpkg-query -S $(which pmxcfs)
pve-cluster: /usr/bin/pmxcfs

And it is started by a service called pve-cluster:

systemctl status $(pidof pmxcfs)
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-12-07 10:03:07 UTC; 1 day 3h ago
    Process: 808 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 835 (pmxcfs)
      Tasks: 8 (limit: 2285)
     Memory: 61.5M

---8<---

Important

The name might be misleading as this service is enabled and active on every node, including single (non-cluster) node installs.
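
The dependency is easy to demonstrate - albeit disruptively, so preferably not on a production node: once the service (and with it pmxcfs) is stopped, the mountpoint reverts to being just an empty directory.

systemctl stop pve-cluster
findmnt /etc/pve    # no output - nothing is mounted anymore
ls -la /etc/pve     # plain empty directory on the system disk
systemctl start pve-cluster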

Magic

Interestingly, if you launch pmxcfs on a standalone host with no PVE install - such as when we built our own without PVE packages, i.e. with no files ever having been written to it - it will still present you with some content in /etc/pve:

ls -la
total 4
drwxr-xr-x  2 root www-data    0 Jan  1  1970 .
drwxr-xr-x 70 root root     4096 Dec  8 14:23 ..
-r--r-----  1 root www-data  152 Jan  1  1970 .clusterlog
-rw-r-----  1 root www-data    2 Jan  1  1970 .debug
lrwxr-xr-x  1 root www-data    0 Jan  1  1970 local -> nodes/dummy
lrwxr-xr-x  1 root www-data    0 Jan  1  1970 lxc -> nodes/dummy/lxc
-r--r-----  1 root www-data   38 Jan  1  1970 .members
lrwxr-xr-x  1 root www-data    0 Jan  1  1970 openvz -> nodes/dummy/openvz
lrwxr-xr-x  1 root www-data    0 Jan  1  1970 qemu-server -> nodes/dummy/qemu-server
-r--r-----  1 root www-data    0 Jan  1  1970 .rrd
-r--r-----  1 root www-data  940 Jan  1  1970 .version
-r--r-----  1 root www-data   18 Jan  1  1970 .vmlist

There are telltale signs that this content is not real: the times are all 0 seconds from the UNIX Epoch. 6

stat local
  File: local -> nodes/dummy
  Size: 0         	Blocks: 0          IO Block: 4096   symbolic link
Device: 0,44	Inode: 6           Links: 1
Access: (0755/lrwxr-xr-x)  Uid: (    0/    root)   Gid: (   33/www-data)
Access: 1970-01-01 00:00:00.000000000 +0000
Modify: 1970-01-01 00:00:00.000000000 +0000
Change: 1970-01-01 00:00:00.000000000 +0000
 Birth: -

On closer look, all of the pre-existing symbolic links, such as the one above, point to non-existent (not yet created) directories.

There are only dotfiles, and what they contain looks generated:

cat .members
{
"nodename": "dummy",
"version": 0
}

And they are not all equally writeable:

echo > .members
-bash: .members: Input/output error

We are witnessing the implementation details hidden under the very facade of a virtual filesystem. Nothing here is real - not until we start writing to it, anyway. That is, when and where allowed.

For instance, we can create directories, but once we create a config-like file in one (imaginary) node’s directory, it will not allow us to create a second one with the same name in the other “node” location - as if it already existed.

mkdir -p /etc/pve/nodes/dummy/{qemu-server,lxc}
mkdir -p /etc/pve/nodes/another/{qemu-server,lxc}
echo > /etc/pve/nodes/dummy/qemu-server/100.conf
echo > /etc/pve/nodes/another/qemu-server/100.conf
-bash: /etc/pve/nodes/another/qemu-server/100.conf: File exists

But it’s not really there:

ls -la /etc/pve/nodes/another/qemu-server/
total 0
drwxr-xr-x 2 root www-data 0 Dec  8 14:27 .
drwxr-xr-x 2 root www-data 0 Dec  8 14:27 ..

And when the newly created file does not look like a config one, it is suddenly fine:

echo > /etc/pve/nodes/dummy/qemu-server/a.conf
echo > /etc/pve/nodes/another/qemu-server/a.conf

ls -R /etc/pve/nodes/
/etc/pve/nodes/:
another  dummy

/etc/pve/nodes/another:
lxc  qemu-server

/etc/pve/nodes/another/lxc:

/etc/pve/nodes/another/qemu-server:
a.conf

/etc/pve/nodes/dummy:
lxc  qemu-server

/etc/pve/nodes/dummy/lxc:

/etc/pve/nodes/dummy/qemu-server:
100.conf  a.conf

None of this magic - which is clearly there to prevent e.g. a guest running off the same configuration, and thus accessing the same (shared) storage, on two different nodes - explains, however, where the files are actually stored, or how. That is, when they are real.

Persistent storage

It’s time to look at where pmxcfs actually writes to. We know these files do not really exist as such, but when not readily generated, the data must go somewhere - otherwise we could not retrieve what we had previously written.

We will take the probe node we built previously alongside 3 real nodes (the probe just monitoring) - but you can check this on any real node - and make use of fatrace: 7

apt install -y fatrace

fatrace
fatrace: Failed to add watch for /etc/pve: No such device
pmxcfs(864): W   /var/lib/pve-cluster/config.db-wal

---8<---

The nice thing about running a dedicated probe is that nothing else is really writing much other than pmxcfs itself, so we will immediately start seeing its write targets. Another notable point about this tool is that it ignores events on virtual filesystems - that is why it reports the failure for /etc/pve above: it is not a device.

We are getting exactly what we want - just the actual block device writes on the system - but we can narrow it down further (useful e.g. on a busy system, like a real node), and we will also let it observe the activity for 5 minutes and create a log:

fatrace -c pmxcfs -s 300 -o fatrace-pmxcfs.log

When done, we can explore the log as-is to get an idea of how busy it has been or which targets were particularly popular, but let’s just summarise it into unique file paths, sorted by path:

sort -u -k3 fatrace-pmxcfs.log
pmxcfs(864): W   /var/lib/pve-cluster/config.db
pmxcfs(864): W   /var/lib/pve-cluster/config.db-wal
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-node/pve1
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-node/pve1
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-node/pve1
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-node/pve2
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-node/pve2
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-node/pve2
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-node/pve3
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-node/pve3
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-node/pve3
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-storage/pve1/local
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-storage/pve1/local
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-storage/pve1/local
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-storage/pve1/local-zfs
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-storage/pve2/local
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-storage/pve2/local
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-storage/pve2/local
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-storage/pve2/local-zfs
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-storage/pve2/local-zfs
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-storage/pve2/local-zfs
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-storage/pve3/local
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-storage/pve3/local
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-storage/pve3/local
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-storage/pve3/local-zfs
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-storage/pve3/local-zfs
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-storage/pve3/local-zfs
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-vm/100
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-vm/100
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-vm/100
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-vm/101
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-vm/101
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-vm/101
pmxcfs(864): O   /var/lib/rrdcached/db/pve2-vm/102
pmxcfs(864): CW  /var/lib/rrdcached/db/pve2-vm/102
pmxcfs(864): CWO /var/lib/rrdcached/db/pve2-vm/102

Now that’s still a lot of records, but it’s basically just:

  • /var/lib/pve-cluster/ with SQLite 8 database files
  • /var/lib/rrdcached/db and rrdcached 9 data
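
The former of the two is a tiny directory; on a typical node it holds little more than the SQLite database itself, together with its write-ahead log (the config.db-wal seen above) and, while pmxcfs is running, a shared-memory file:

ls -la /var/lib/pve-cluster/
# config.db       - the SQLite database itself
# config.db-shm   - shared memory for the write-ahead log (present while pmxcfs runs)
# config.db-wal   - the write-ahead log fatrace kept reporting above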

Also, there’s an interesting anomaly in the output - can you spot it?

SQLite backend

We now know the actual persistent data must be hitting the block layer as it is written into a database. We can dump it (even on a running node) to better see what’s inside: 10

apt install -y sqlite3

sqlite3 /var/lib/pve-cluster/config.db .dump > config.dump.sql
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;

CREATE TABLE tree (
  inode   INTEGER PRIMARY KEY NOT NULL,
  parent  INTEGER NOT NULL CHECK(typeof(parent)=='integer'),
  version INTEGER NOT NULL CHECK(typeof(version)=='integer'),
  writer  INTEGER NOT NULL CHECK(typeof(writer)=='integer'),
  mtime   INTEGER NOT NULL CHECK(typeof(mtime)=='integer'),
  type    INTEGER NOT NULL CHECK(typeof(type)=='integer'),
  name    TEXT NOT NULL,
  data    BLOB);

INSERT INTO tree VALUES(0,0,1044298,1,1733672152,8,'__version__',NULL);
INSERT INTO tree VALUES(2,0,3,0,1731719679,8,'datacenter.cfg',X'6b6579626f6172643a20656e2d75730a');
INSERT INTO tree VALUES(4,0,5,0,1731719679,8,'user.cfg',X'757365723a726f6f744070616d3a313a303a3a3a6140622e633a3a0a');
INSERT INTO tree VALUES(6,0,7,0,1731719679,8,'storage.cfg',X'---8<---');
INSERT INTO tree VALUES(8,0,8,0,1731719711,4,'virtual-guest',NULL);
INSERT INTO tree VALUES(9,0,9,0,1731719714,4,'priv',NULL);
INSERT INTO tree VALUES(11,0,11,0,1731719714,4,'nodes',NULL);
INSERT INTO tree VALUES(12,11,12,0,1731719714,4,'pve1',NULL);
INSERT INTO tree VALUES(13,12,13,0,1731719714,4,'lxc',NULL);
INSERT INTO tree VALUES(14,12,14,0,1731719714,4,'qemu-server',NULL);
INSERT INTO tree VALUES(15,12,15,0,1731719714,4,'openvz',NULL);
INSERT INTO tree VALUES(16,12,16,0,1731719714,4,'priv',NULL);
INSERT INTO tree VALUES(17,9,17,0,1731719714,4,'lock',NULL);
INSERT INTO tree VALUES(24,0,25,0,1731719714,8,'pve-www.key',X'---8<---');
INSERT INTO tree VALUES(26,12,27,0,1731719715,8,'pve-ssl.key',X'---8<---');
INSERT INTO tree VALUES(28,9,29,0,1731719721,8,'pve-root-ca.key',X'---8<---');
INSERT INTO tree VALUES(30,0,31,0,1731719721,8,'pve-root-ca.pem',X'---8<---');
INSERT INTO tree VALUES(32,9,1077,3,1731721184,8,'pve-root-ca.srl',X'30330a');
INSERT INTO tree VALUES(35,12,38,0,1731719721,8,'pve-ssl.pem',X'---8<---');
INSERT INTO tree VALUES(48,0,48,0,1731719721,4,'firewall',NULL);
INSERT INTO tree VALUES(49,0,49,0,1731719721,4,'ha',NULL);
INSERT INTO tree VALUES(50,0,50,0,1731719721,4,'mapping',NULL);
INSERT INTO tree VALUES(51,9,51,0,1731719721,4,'acme',NULL);
INSERT INTO tree VALUES(52,0,52,0,1731719721,4,'sdn',NULL);
INSERT INTO tree VALUES(918,9,920,0,1731721072,8,'known_hosts',X'---8<---');
INSERT INTO tree VALUES(940,11,940,1,1731721103,4,'pve2',NULL);
INSERT INTO tree VALUES(941,940,941,1,1731721103,4,'lxc',NULL);
INSERT INTO tree VALUES(942,940,942,1,1731721103,4,'qemu-server',NULL);
INSERT INTO tree VALUES(943,940,943,1,1731721103,4,'openvz',NULL);
INSERT INTO tree VALUES(944,940,944,1,1731721103,4,'priv',NULL);
INSERT INTO tree VALUES(955,940,956,2,1731721114,8,'pve-ssl.key',X'---8<---');
INSERT INTO tree VALUES(957,940,960,2,1731721114,8,'pve-ssl.pem',X'---8<---');
INSERT INTO tree VALUES(1048,11,1048,1,1731721173,4,'pve3',NULL);
INSERT INTO tree VALUES(1049,1048,1049,1,1731721173,4,'lxc',NULL);
INSERT INTO tree VALUES(1050,1048,1050,1,1731721173,4,'qemu-server',NULL);
INSERT INTO tree VALUES(1051,1048,1051,1,1731721173,4,'openvz',NULL);
INSERT INTO tree VALUES(1052,1048,1052,1,1731721173,4,'priv',NULL);
INSERT INTO tree VALUES(1056,0,376959,1,1732878296,8,'corosync.conf',X'---8<---');
INSERT INTO tree VALUES(1073,1048,1074,3,1731721184,8,'pve-ssl.key',X'---8<---');
INSERT INTO tree VALUES(1075,1048,1078,3,1731721184,8,'pve-ssl.pem',X'---8<---');
INSERT INTO tree VALUES(2680,0,2682,1,1731721950,8,'vzdump.cron',X'---8<---');
INSERT INTO tree VALUES(68803,941,68805,2,1731798577,8,'101.conf',X'---8<---');
INSERT INTO tree VALUES(98568,940,98570,2,1732140371,8,'lrm_status',X'---8<---');
INSERT INTO tree VALUES(270850,13,270851,99,1732624332,8,'102.conf',X'---8<---');
INSERT INTO tree VALUES(377443,11,377443,1,1732878617,4,'probe',NULL);
INSERT INTO tree VALUES(382230,377443,382231,1,1732881967,8,'pve-ssl.pem',X'---8<---');
INSERT INTO tree VALUES(893854,12,893856,1,1733565797,8,'ssh_known_hosts',X'---8<---');
INSERT INTO tree VALUES(893860,940,893862,2,1733565799,8,'ssh_known_hosts',X'---8<---');
INSERT INTO tree VALUES(893863,9,893865,3,1733565799,8,'authorized_keys',X'---8<---');
INSERT INTO tree VALUES(893866,1048,893868,3,1733565799,8,'ssh_known_hosts',X'---8<---');
INSERT INTO tree VALUES(894275,0,894277,2,1733566055,8,'replication.cfg',X'---8<---');
INSERT INTO tree VALUES(894279,13,894281,1,1733566056,8,'100.conf',X'---8<---');
INSERT INTO tree VALUES(1016100,0,1016103,1,1733652207,8,'authkey.pub.old',X'---8<---');
INSERT INTO tree VALUES(1016106,0,1016108,1,1733652207,8,'authkey.pub',X'---8<---');
INSERT INTO tree VALUES(1016109,9,1016111,1,1733652207,8,'authkey.key',X'---8<---');
INSERT INTO tree VALUES(1044291,12,1044293,1,1733672147,8,'lrm_status',X'---8<---');
INSERT INTO tree VALUES(1044294,1048,1044296,3,1733672150,8,'lrm_status',X'---8<---');
INSERT INTO tree VALUES(1044297,12,1044298,1,1733672152,8,'lrm_status.tmp.984',X'---8<---');

COMMIT;

Note

Most BLOB objects above have been replaced with ---8<--- for brevity.

It is a trivial database schema, with a single table tree holding everything and thereby mimicking a real filesystem. Let’s take one such entry (row), for instance:

INODE  PARENT  VERSION  WRITER  MTIME      TYPE  NAME      DATA
4      0       5        0       timestamp  8     user.cfg  BLOB

This row contains the contents of the virtual user.cfg file (NAME) as a Binary Large Object (BLOB) in the DATA column - shown as a hexdump in the SQL dump above - and since we know this is not a binary file, it is easy to glance into: 11

apt install -y xxd

xxd -r -p <<< X'757365723a726f6f744070616d3a313a303a3a3a6140622e633a3a0a'
user:root@pam:1:0:::a@b.c::

TYPE signifies that it is a regular file and not e.g. a directory.
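
A quick sanity check directly against the database (a sketch using the schema from the dump above) shows which types are in play - in the dump, 4 stands for directories and 8 for regular files:

sqlite3 /var/lib/pve-cluster/config.db "SELECT type, count(*) FROM tree GROUP BY type;"
# 4|<number of directory rows>
# 8|<number of regular file rows>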

MTIME represents a timestamp and, despite its name, it is actually returned as the value for mtime, ctime and atime alike - as we could previously see in the stat output - but here it is a real one:

date -d @1731719679
Sat Nov 16 01:14:39 AM UTC 2024

The WRITER column records an interesting piece of information: which node it was that last wrote to this row - some rows (initially generated, as is the case here) start with 0, however.

Accompanying it is VERSION, a counter that increases every time a row is written to - this helps in finding out which node needs to catch up if it has fallen behind with its own copy of the data.
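
This also means the rows with the highest VERSION are simply the most recently changed files, which can be handy when checking what was touched last - again, just a sketch against the same database:

sqlite3 /var/lib/pve-cluster/config.db \
  "SELECT version, writer, name FROM tree ORDER BY version DESC LIMIT 5;"
# the special __version__ row at inode 0 carries the overall counter
# (see the dump above), so it will top the list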

Lastly, the file presents itself in the filesystem as if under inode 4 (hence the column name), residing within the PARENT inode of 0 - which means it is in the root of the structure.
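
To see the INODE/PARENT referencing in action, we can walk the chain of any row up to the root - e.g. for the pve-ssl.key of node pve1, which sits at inode 26 with PARENT 12 in the dump above. A sketch using a recursive query:

sqlite3 /var/lib/pve-cluster/config.db <<'EOF'
WITH RECURSIVE ancestry(inode, parent, name, depth) AS (
    SELECT inode, parent, name, 0 FROM tree WHERE inode = 26
  UNION ALL
    SELECT t.inode, t.parent, t.name, a.depth + 1
    FROM tree t JOIN ancestry a ON t.inode = a.parent
    WHERE a.inode != 0
)
SELECT inode, parent, name FROM ancestry WHERE inode != 0 ORDER BY depth DESC;
EOF
# 11|0|nodes
# 12|11|pve1
# 26|12|pve-ssl.key    i.e. /etc/pve/nodes/pve1/pve-ssl.key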

These are usual filesystem concepts, 12 but there is no separation of metadata and data - the BLOB sits in the same row as all the other information; it is really rudimentary.

Note

The INODE column is the primary key of the table (no two rows can have the same value of it), and as only one parent can be referenced this way, it is also the reason why the filesystem cannot support hardlinks.

More magic

There are further points of interest in the database, especially in everything that is missing, yet the virtual filesystem still provides for:

  • No access-rights-related information - this is rigidly generated depending on the file’s path, as the quick check after this list illustrates.

  • No symlinks - the presented ones are generated at runtime and all point to what is supposed to be the node’s own directory under /etc/pve/nodes/; the symlink’s target is the nodename as determined from the hostname by pmxcfs on startup. Creation of your own symlinks is NOT implemented.

  • None of the always-present dotfiles either - this is why we could not write into e.g. the .members file above; their contents are truly generated data determined at runtime. That said, you actually CAN create a regular (well, virtual) dotfile here and it will be stored properly.
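
The first point is easy to check on a regular node: ownership and mode bits are not stored anywhere in the database, yet they come out consistently - regular files are owned by root:www-data (as in the listings above), while anything under priv/ is restricted further. A quick look, with the exact modes left for the reader to compare:

stat -c '%a %U:%G %n' /etc/pve /etc/pve/user.cfg /etc/pve/priv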

Because of all this, the database - under healthy circumstances - does NOT store any node-specific data (relative to the node it resides on); the databases are all alike on every node of the cluster and could be copied around (when pmxcfs is offline, obviously).

However, because of the imaginary inode referencing and the versioning, it is absolutely NOT possible to copy around just any database file that happens to hold a seemingly identical file structure.

Missing links

If you followed the guide on your own pmxcfs build meticulously, you would have noticed the libraries required are:

  • libfuse
  • libsqlite3
  • librrd
  • libcpg, libcmap, libquorum, libqb

The libfuse 13 allows pmxcfs to interact with the kernel when users attempt to access content in /etc/pve. SQLite is accessed via libsqlite3. What about the rest?

When we did our block layer write observation tests on our plain probe, there was nothing - no PVE installed - that would be writing into /etc/pve, the mountpoint of the virtual filesystem, yet we observed pmxcfs writing onto disk.

If we did the same on our dummy standalone host (also with no PVE installed) running just pmxcfs, we would not really observe any of those plentiful writes. We would need to start manipulating contents in /etc/pve to see block layer writes resulting from it.
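
This is simple to verify on such a standalone host - trace for a few seconds while touching something under /etc/pve and the corresponding database write shows up, much like in the trace from the real cluster earlier (a sketch; the file name is arbitrary):

fatrace -s 10 > /tmp/trace.log &
echo test > /etc/pve/somefile
wait
grep pve-cluster /tmp/trace.log
# expect a W event on /var/lib/pve-cluster/config.db-wal, as seen before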

So clearly, the origin of those writes must be the rest of the cluster - the actual nodes, which run much more than just the pmxcfs process. And that’s where Corosync comes into play (that is, on a node in a cluster). What happens is that ANY file operation on ANY node is spread via messages within the Closed Process Group - which you might have read up details on already - and this is why all those required properties were important: to have all of the operations happen in exactly the same order on every node.

This is also why another little piece of magic happens, statefully - when a node becomes inquorate, pmxcfs on that node sees to it that the filesystem turns read-only, that is, until the node is back in the quorum. This is easy to simulate on our probe by simply stopping the pve-cluster service. And that is what all of the Corosync libraries (libcpg, libcmap, libquorum, libqb) are utilised for.
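
A crude way to check at any moment whether the local pmxcfs considers itself quorate - and therefore writable - is simply to attempt a write (the dotfile name below is arbitrary):

touch /etc/pve/.quorum-test && echo writable || echo read-only
rm -f /etc/pve/.quorum-test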

And what about the discreet librrd? Well, we could see lots of writes hitting all over /var/lib/rrdcached/db - that is the location for rrdcached, 9 which handles caching writes of round robin time series data. The entire RRDtool 14 is well beyond the scope of this post, but this is how the data for e.g. charting the same statistics across all nodes is gathered. If you ever wondered how it is possible, with no master node, to see them in the GUI of any node for all the other nodes, that’s because each node writes its own into /etc/pve/.rrd - another of the non-existent virtual files. Each node thus receives the time series data of all other nodes and passes it over via rrdcached.
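
If you are curious about the round robin files themselves, they can be inspected with the standard RRDtool utilities - assuming the rrdtool package is installed; the path below is one of those seen in the fatrace log, and the data may lag slightly behind since rrdcached batches its writes:

apt install -y rrdtool

rrdtool info /var/lib/rrdcached/db/pve2-node/pve1 | head
rrdtool lastupdate /var/lib/rrdcached/db/pve2-node/pve1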

The Proxmox enigma

As this was a rather key-points-only overview, quite a few details are naturally missing, some of which are best discovered hands-on when experimenting with the probe setup. One noteworthy omission, however, which will only be covered in a separate post, needs to be pointed out.

If you paid close attention when checking the sorted fatrace output - especially as there was a note on an anomaly - you would have noticed the mystery:

pmxcfs(864): W   /var/lib/pve-cluster/config.db
pmxcfs(864): W   /var/lib/pve-cluster/config.db-wal

There’s no R in those observations, ever - the SQLite database is being constantly written to, but it is never read from. But that’s for another time.

Conclusion

Essentially, it is important to understand that /etc/pve is nothing but a mountpoint. The pmxcfs process provides it while running, and it is anything but an ordinary filesystem. The pmxcfs process itself then writes onto the block layer into specific /var/lib/ locations. It utilises Corosync when in a cluster to cross-share all the file operations amongst nodes, but it does all the rest equally well when not in a cluster - the corosync service is then not even running, but pmxcfs always has to be. The special properties of the virtual filesystem have one primary objective: to prevent data corruption by disallowing risky configuration states. That does not, however, mean that the database itself cannot get corrupted, and if you want to back it up properly, you have to be dumping the database.
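
For completeness, a minimal sketch of such a backup using standard SQLite tooling on a live node (the target paths are arbitrary):

sqlite3 /var/lib/pve-cluster/config.db ".backup /root/config.db.$(date +%F)"
# or keep a readable plain-text dump instead
sqlite3 /var/lib/pve-cluster/config.db .dump > /root/config.dump.$(date +%F).sql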