Making the filesystem-wide cache invalidation lightspeed in FUSE
One interesting aspect of FUSE user-space file systems is that caching can be
handled at the kernel level. For example, if an application reads data from a
file that happens to be on a FUSE file system, the kernel will keep that data in
the page cache so that later, if that data is requested again, it will be
readily available, without the need for the kernel to request it again from the
FUSE server. But the kernel also caches other file system data. For example,
it keeps track of metadata (file size, timestamps, etc.) that may allow it to
also reply to a stat(2) system call without requesting it from user-space.
On the other hand, a FUSE server has a mechanism to ask the kernel to forget everything related to an inode or to a dentry that the kernel already knows about. This is a very useful mechanism, particularly for a networked file system.
Imagine a network file system mounted on two different hosts, rocinante and rucio. Both hosts will read data from the same file, and this data will be cached locally. This is represented in the figure below, on the left. Now, if that file is deleted from the rucio host (same figure, on the right), rocinante will need to be notified about this deletion [1]. This is needed so that the locally cached data in the rocinante host can also be removed. In addition, if this is a FUSE file system, the FUSE server will need to ask the kernel to forget everything about the deleted file.

Notifying the kernel to forget everything about a file system inode or
dentry can be easily done from a FUSE server using the
FUSE_NOTIFY_INVAL_INODE and FUSE_NOTIFY_INVAL_ENTRY operations. Or, if the
server is implemented using libfuse, by using the APIs
fuse_lowlevel_notify_inval_inode() and fuse_lowlevel_notify_inval_entry(). Easy.
But what if the FUSE file system needs to notify the kernel to forget about all the files in the file system? Well, the FUSE server would simply need to walk through all those inodes and notify the kernel, one by one. Tedious and time consuming. And most likely racy.
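To make the cost of that walk concrete, here is a minimal sketch of the one-by-one approach. It is only a model: toy_entry, toy_notify_fn, toy_invalidate_all() and toy_count_notify() are hypothetical names, and the callback merely stands in for a per-entry notification such as fuse_lowlevel_notify_inval_entry(); no real libfuse call is made.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of an entry the server believes the kernel has cached. */
struct toy_entry {
    unsigned long parent_ino;   /* inode number of the parent directory */
    const char *name;           /* entry name within that directory */
};

/* Hypothetical stand-in for a per-entry notification; returns 0 on success. */
typedef int (*toy_notify_fn)(void *ctx, unsigned long parent_ino,
                             const char *name, size_t namelen);

/*
 * Send one notification per known entry and return how many succeeded.
 * The work grows linearly with the number of entries, and nothing stops
 * new entries from being cached while the loop is still running.
 */
size_t toy_invalidate_all(const struct toy_entry *entries, size_t n,
                          toy_notify_fn notify, void *ctx)
{
    size_t sent = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (notify(ctx, entries[i].parent_ino, entries[i].name,
                   strlen(entries[i].name)) == 0)
            sent++;
    }
    return sent;
}

/* Example callback for the model: it only counts how often it is called. */
int toy_count_notify(void *ctx, unsigned long parent_ino,
                     const char *name, size_t namelen)
{
    (void)parent_ino;
    (void)name;
    (void)namelen;
    (*(size_t *)ctx)++;
    return 0;
}
```

Calling toy_invalidate_all() with toy_count_notify and a pointer to a size_t counter shows the point: three cached entries mean three separate notifications, and a real file system may have millions.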
Asking the kernel to forget everything about all the files may sound like an odd thing to do, but there are cases where this is needed. For example, the CernVM File System does exactly this. This is a read-only file system, which was developed to distribute software across virtual machines. Clients will then mount the file system and cache data/meta-data locally. Changes to the file system may happen only on a Release Manager Machine, a specific server where the file system will be mounted in read/write mode. When this Release Manager is done with all the changes, they can all be merged and published atomically, as a new revision of the file system. Only then are the clients able to access this new revision, but all the data (and meta-data) they have cached locally will need to be invalidated.
And this is where a new mechanism that has just been merged into mainline kernel
v6.16 comes in handy: a single operation that will ask the kernel to invalidate all
dentries for a specific FUSE connection that the kernel knows about. After
trying a few different approaches, I've implemented this mechanism for a project
at Igalia by adding the new FUSE_NOTIFY_INC_EPOCH operation. This operation
can be used from libfuse through fuse_lowlevel_notify_increment_epoch() [2].
In a nutshell, every dentry (or directory entry) can have a time-to-live value associated with it; after this time has expired, it will need to be revalidated. This revalidation will happen when the kernel VFS layer does a file name look-up and finds a cached dentry (i.e. a dentry that has been looked up before).
With this commit merged, the concept of an epoch was introduced: a FUSE
server connection to the kernel will have an epoch value, and every new
dentry created will also have an epoch, initialised to the same value as the
connection. What the new FUSE_NOTIFY_INC_EPOCH operation does is simply
increment the connection epoch value. Later, when the VFS is performing a
look-up and finds a cached dentry, it will execute the FUSE callback function
to revalidate it. At this point, FUSE will check whether the dentry epoch
value is older than the connection epoch and, if so, invalidate the dentry.
Now, what's missing is an extra mechanism to periodically check for any dentries that need to be invalidated so that invalid dentries don't hang around for too long after the epoch is incremented. And that's exactly what's currently under discussion upstream. Hopefully it will shortly get into a state where it can be merged too.
Footnotes:
[1] Obviously, this is a very simplistic description; it all depends on the actual file system design and implementation details, and specifically on the communication protocol being used to synchronise the different servers/clients across the network. In CephFS, for example, the clients get notified through its own Ceph-specific protocol, by having its Metadata Servers (MDS) revoke 'capabilities' that have been previously given to the clients, forcing them to request the data again if needed.
[2] Note however that, although this extra API has already been merged into libfuse, no release including it has been made yet.