Bounding against the wrong buffer: an OOB read/write in KVM SEV-SNP

Reported to security@kernel.org, 8 April 2026 · Fixed in mainline (db3f219), May 2026, Cc: stable · Fixes 4af663c · Affected: KVM SNP host support, ~v6.10 through the fix · CVE: pending (track commit db3f219 on linux-cve-announce)

Summary

A malicious SEV-SNP guest can corrupt the host kernel’s heap and leak information about its layout, through the way KVM handles the GHCB scratch area for Page State Change (PSC) requests. setup_vmgexit_scratch() allocates the scratch buffer using a size the guest controls, and snp_begin_psc() then checks the requested entry count against the protocol maximum (VMGEXIT_PSC_MAX_COUNT, which is 253) rather than against the size of the buffer it actually allocated. So if the guest asks for a tiny buffer and a large end_entry, the host walks the entry array off the end of its slab object. That gives a read oracle into neighbouring kmalloc-cg-32 objects and a write of the cur_page field into them, and you can repeat it as often as you like. The fix is small: for GHCB v2 and up, require the scratch area to sit inside the GHCB’s shared buffer, which pins down the size.

Any SEV-SNP guest can hit this by sending a malformed PSC request. I confirmed it with KASAN, where a single insmod of my test module throws 73 reports (out-of-bounds reads and writes, plus some use-after-free). I bisected it back to the commit that first added KVM’s SNP PSC handling, around v6.10, and it was still there in current mainline when I reported it. Only the SNP host path is affected, since KVM doesn’t currently turn on PSC for plain SEV-ES guests.

The inverted threat model

The thing that makes this interesting is the threat model. SEV-SNP is built to protect a guest from a malicious or compromised host: the guest’s memory and register state are encrypted, and the hypervisor is treated as untrusted. But that’s only half of it. The host also has to defend itself against a malicious guest, because the whole point of confidential computing is running someone else’s workload that you don’t trust either.

This bug is in that second direction. The guest hands the host data through the GHCB, the host parses it, and a missing check lets the guest reach into host kernel memory. It’s an easy check to drop here, because the usual instinct is to treat the hypervisor as the attacker, not the guest.

The GHCB scratch area

SEV-SNP guests talk to the host through the GHCB (Guest-Hypervisor Communication Block), a 4 KB shared page. A request is set up by writing some software-defined fields: SW_EXITCODE picks the operation, SW_EXITINFO1 and SW_EXITINFO2 carry parameters, and SW_SCRATCH points at a scratch area for requests that need more room than the fixed fields. For a PSC request, SW_EXITCODE is SVM_VMGEXIT_PSC (0x80000010), SW_SCRATCH is the guest-physical address of the descriptor, and SW_EXITINFO2 is its length.

The GHCB has a 2032-byte region in it called the Shared Buffer. A v2+ guest is supposed to keep its scratch area inside that buffer, so the host can just use its existing mapping of the GHCB page instead of copying guest memory somewhere else. That requirement matters a lot, so keep it in mind.

A PSC request is a struct psc_buffer: an 8-byte psc_hdr followed by an array of 8-byte psc_entry structs. There’s no explicit count field in the request; the host just processes entries from hdr->cur_entry to hdr->end_entry, both of which come from the guest. The number of entries is meant to be limited by how many fit in the buffer, and that’s where the 253 comes from: (2032 – 8) / 8 = 253 entries fit in the Shared Buffer after the header. That number only makes sense if the buffer really is the Shared Buffer.
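The capacity arithmetic can be written down as a small userspace helper. This is only a sketch: the constant names and psc_capacity() are hypothetical, and the sizes are the ones quoted above (8-byte header, 8-byte entries, 2032-byte Shared Buffer).

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the text; the macro names are illustrative, not kernel names. */
#define PSC_HDR_SIZE     8u
#define PSC_ENTRY_SIZE   8u
#define SHARED_BUF_SIZE  2032u

static_assert((SHARED_BUF_SIZE - PSC_HDR_SIZE) / PSC_ENTRY_SIZE == 253,
              "the Shared Buffer holds 253 entries after the header");

/* Whole entries that fit in buf_len bytes after the header (0 if no room). */
static size_t psc_capacity(size_t buf_len)
{
        if (buf_len < PSC_HDR_SIZE)
                return 0;
        return (buf_len - PSC_HDR_SIZE) / PSC_ENTRY_SIZE;
}
```

The same helper gives 253 for the Shared Buffer and 2 for a 24-byte buffer, which is the gap the rest of this write-up is about.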

The vulnerable path

If the guest points its scratch area outside the GHCB, the host can’t use its GHCB mapping, so it allocates a separate kernel buffer the size the guest asked for. SNP isn’t supposed to use that path at all, but nothing stopped it, so setup_vmgexit_scratch() just allocates whatever length the guest put in SW_EXITINFO2:

scratch_va = kvzalloc(len, GFP_KERNEL_ACCOUNT);   /* len == exit_info_2, guest-controlled */

Two things matter. First, len comes straight from the guest. Second, the GFP_KERNEL_ACCOUNT flag puts the allocation in the cgroup-accounted kmalloc-cg-N caches instead of plain kmalloc-N, which is why the KASAN output below says kmalloc-cg-32. In practice that means the neighbours you can reach are other accounted allocations of the same size, not the whole heap.

Ask for exit_info_2 = 24 and you get a 24-byte allocation, which lands in the 32-byte kmalloc-cg-32 slot. The structs look like this:

struct psc_hdr {
        u16 cur_entry;
        u16 end_entry;
        u32 reserved;
} __packed;                                    /* 8 bytes */

struct psc_entry {
        u64 cur_page    : 12;
        u64 gfn         : 40;
        u64 operation   :  4;
        u64 pagesize    :  1;
        u64 reserved    :  7;
} __packed;                                    /* 8 bytes, one packed u64 */

24 bytes is enough for the 8-byte header and two entries, entries[0] and entries[1]. entries[2] starts at byte 24, in the 8 bytes of slab slack, and entries[3] onwards are in whatever object comes next. So anything past entries[1] is somebody else’s memory, read back through the psc_entry bitfields above.
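To make the layout concrete, here is a userspace model that classifies where entries[idx] lands. The struct and function names are hypothetical, and the slab slot size is passed in as a parameter rather than derived, since how the allocator rounds sizes is outside what this sketch tries to model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size-only mirrors of the structs above; the bitfields are not needed here. */
struct psc_hdr_model   { uint16_t cur_entry, end_entry; uint32_t reserved; };
struct psc_entry_model { uint64_t raw; };

static_assert(sizeof(struct psc_hdr_model) == 8, "header is 8 bytes");
static_assert(sizeof(struct psc_entry_model) == 8, "entry is 8 bytes");

enum region { IN_BOUNDS, SLAB_SLACK, OTHER_OBJECT };

/* Byte offset of entries[idx] from the start of the scratch buffer. */
static size_t entry_offset(size_t idx)
{
        return sizeof(struct psc_hdr_model) + idx * sizeof(struct psc_entry_model);
}

/* Where does entries[idx] end up, given the requested length and the slot size? */
static enum region classify_entry(size_t idx, size_t alloc_len, size_t slot_size)
{
        size_t end = entry_offset(idx) + sizeof(struct psc_entry_model);

        if (end <= alloc_len)
                return IN_BOUNDS;
        if (end <= slot_size)
                return SLAB_SLACK;
        return OTHER_OBJECT;
}
```

With alloc_len = 24 and slot_size = 32, indices 0 and 1 are in bounds, index 2 is slack, and index 3 onwards belongs to a neighbouring object.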

[Figure: How the out-of-bounds happens. A 24-byte scratch buffer sits in a 32-byte slab slot: psc_hdr occupies bytes 0–7, entries[0] bytes 8–15, and entries[1] bytes 16–23 (in bounds); entries[2] at bytes 24–31 is slab slack; entries[3] through entries[252], roughly 2 KB out, fall in adjacent slab objects. snp_begin_psc() loops idx from cur_entry to end_entry, and end_entry is only checked against VMGEXIT_PSC_MAX_COUNT (253). Reading past the slot is an information leak; writing cur_page there is heap corruption.]
The 24-byte buffer holds the header and two entries; everything past entries[1] is another object.

The missing bounds check

snp_begin_psc() pulls the range out of the header and only checks the top against the protocol constant:

idx_end = hdr->end_entry;

if (idx_end >= VMGEXIT_PSC_MAX_COUNT) {   /* checks 253, NOT the buffer size */
        snp_complete_psc(svm, ...);
        return 1;
}

for (idx = idx_start; idx <= idx_end; idx++) {
        entry_start = entries[idx];        /* OOB once idx >= 2 */
        ...
}

It’s checking whether the index is a legal PSC index, not whether entries[idx] is still inside the buffer. 253 is the capacity of the 2032-byte Shared Buffer, but here the host allocated a 24-byte buffer of the guest’s choosing, so 253 describes a buffer that doesn’t exist. With 24 bytes only two entries fit, but the check happily allows end_entry up to 252. Set it to 10 with cur_entry at 0 and the loop reads eleven entries, entries[0] through entries[10], out of a two-entry buffer; set it to 252 and you walk about 2 KB past the allocation.
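The difference between the two questions is easy to show side by side. The sketch below is a userspace model, not the kernel code: the function names are hypothetical, PSC_MAX_COUNT mirrors VMGEXIT_PSC_MAX_COUNT, and the 8-byte header and entry sizes come from the structs shown earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PSC_MAX_COUNT 253u      /* mirrors VMGEXIT_PSC_MAX_COUNT */

/* Bytes needed to hold the header plus entries[0..end_entry]. */
static size_t bytes_needed(uint16_t end_entry)
{
        return 8 + ((size_t)end_entry + 1) * 8;
}

/* The check as described: is end_entry a legal protocol index? */
static bool protocol_check_ok(uint16_t end_entry)
{
        return end_entry < PSC_MAX_COUNT;
}

/* The check memory safety needs: does entries[end_entry] fit in len bytes? */
static bool buffer_check_ok(uint16_t end_entry, size_t len)
{
        return bytes_needed(end_entry) <= len;
}

/* Bytes the loop touches beyond a len-byte allocation. */
static size_t overrun_bytes(uint16_t end_entry, size_t len)
{
        size_t need = bytes_needed(end_entry);

        return need > len ? need - len : 0;
}
```

For end_entry = 252 and len = 24, protocol_check_ok() passes while buffer_check_ok() fails, and overrun_bytes() comes out at 2008, which is the "about 2 KB" figure.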

[Figure: Why the 253 check doesn’t help. Top panel: what VMGEXIT_PSC_MAX_COUNT describes, the in-GHCB Shared Buffer holding the header plus 253 × 8-byte entries, since (2032 – 8) / 8 = 253. Bottom panel: what setup_vmgexit_scratch() actually allocated, kvzalloc(exit_info_2 = 24), holding the header, entries[0], and entries[1]; entries[2] through entries[252] were never allocated and belong to other objects. end_entry is validated against the top panel but used to index the bottom one.]
The count is checked against the 253-entry Shared Buffer, but indexes the 24-byte allocation.

The primitives

Each step past the end reinterprets the next 8 bytes of slab memory as a psc_entry and runs it through the PSC code. The decode is fixed: bits [0:11] are cur_page, [12:51] are gfn, [52:55] are operation, [56] is pagesize. What happens next depends on whether those bytes look like a valid request, which gives three different primitives:

  • Read oracle. The host has to read the adjacent qword just to decode it, pulling entry.gfn and entry.operation out of memory the buffer never owned. This is the slab-out-of-bounds read KASAN catches, and everything else builds on it.
  • Constrained write. If the decoded entry looks like a valid operation and gfn and gets dispatched as a KVM_HC_MAP_GPA_RANGE, the completion code __snp_complete_one_psc() writes back into that same out-of-bounds slot: guest_psc->entries[idx].cur_page = entry.pagesize ? 512 : 1. So it sets the low 12 bits of the neighbouring qword to either 1 or 512, depending on bit 56 of that same qword. It’s not an arbitrary write, it’s one of two small values into the low bits of a word you pick, but it does land in another object and you can do it over and over.
  • Failure oracle. If the entry doesn’t validate, nothing gets dispatched, but the response in SW_EXITINFO2 tells you which index it stopped at. By bumping end_entry one at a time you can learn, slot by slot, whether the adjacent memory decoded to a no-op or to something that failed validation. That’s a one-bit-per-probe leak of the neighbouring heap, enough to find object boundaries and tell zero from non-zero.

The guest picks the allocation size (which decides the slab class, and so which neighbours are in reach), picks how far past the end to walk with cur_entry/end_entry, and can fire as many VMGEXITs as it wants. That last part is what makes it useful: each request re-allocates the scratch buffer, so over time it lands in different freelist slots and you can sweep across neighbours instead of being stuck with one. Put together, you get heap layout disclosure, the constrained write above, and use-after-free across requests, all of which showed up under KASAN.
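The decode and the constrained write can be modelled on a plain 64-bit value. This is a sketch under assumptions: the kernel uses the bitfield struct shown earlier rather than shifts, the names below are hypothetical, and the validation that decides whether an entry gets dispatched at all is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a psc_entry, pulled out of a raw qword using the bit ranges in the text. */
struct psc_entry_fields {
        uint16_t cur_page;      /* bits [0:11]  */
        uint64_t gfn;           /* bits [12:51] */
        uint8_t  operation;     /* bits [52:55] */
        uint8_t  pagesize;      /* bit  [56]    */
};

static struct psc_entry_fields decode_psc_qword(uint64_t q)
{
        struct psc_entry_fields f;

        f.cur_page  = (uint16_t)(q & 0xFFF);
        f.gfn       = (q >> 12) & ((1ULL << 40) - 1);
        f.operation = (uint8_t)((q >> 52) & 0xF);
        f.pagesize  = (uint8_t)((q >> 56) & 1);
        return f;
}

/*
 * Model of the write-back: cur_page = pagesize ? 512 : 1.
 * Only the low 12 bits of the neighbouring qword change.
 */
static uint64_t model_cur_page_writeback(uint64_t q)
{
        uint64_t cur_page = ((q >> 56) & 1) ? 512 : 1;

        return (q & ~0xFFFULL) | cur_page;
}
```

Feeding it an arbitrary neighbouring value shows the shape of the corruption: the upper 52 bits survive and the low 12 become 0x001 or 0x200.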

Evidence

I tested on an AMD EPYC 7443P, Ubuntu 24.04.4, host kernel 6.11.11 with CONFIG_KASAN=y, under QEMU 10.0.0 (AMDESE snp-latest). One insmod of my test module gave 73 KASAN reports:

BUG: KASAN: slab-out-of-bounds in snp_begin_psc+0x126/0x890
Read of size 8 at addr ffff888219ffb5e0 by task qemu-system-x86/2199

BUG: KASAN: slab-out-of-bounds in snp_begin_psc+0x468/0x890
Write of size 8 at addr ffff888351566648 by task qemu-system-x86/2199

The breakdown:

62  slab-out-of-bounds (reads + writes past the allocation)
 7  slab-use-after-free
 4  use-after-free

They’re all against kmalloc-cg-32, with the bad addresses just past a 32-byte region, which lines up with the 24-byte buffer rounding up to 32 and the loop running off the end.

The fix

The fix is four lines:

  } else {
+         /* GHCB v2 requires the scratch area to be within the GHCB. */
+         if (to_kvm_sev_info(svm->vcpu.kvm)->ghcb_version >= 2)
+                 goto e_scratch;
+
          /*
           * The guest memory must be read into a kernel buffer, so
           * limit the size

For GHCB v2 and later, the spec requires the scratch area to live inside the GHCB’s shared buffer, and PSC relies on that because the request carries no explicit entry count. So rejecting any out-of-GHCB scratch area for v2+ means the buffer size is fixed and known instead of guest-chosen, and the loop can’t run off the end any more.

What I like about this is that it doesn’t try to add a length check inside snp_begin_psc(); it just takes away the guest’s ability to choose the allocation in the first place. Fixing the input is cleaner than fixing every place that reads it.
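The decision the fix adds can be sketched as a small userspace model. Everything named here is hypothetical: range_inside() stands in for whatever containment test the host does on the scratch GPA, and the enum only labels the three outcomes described in this post (use the GHCB mapping, copy into a kernel buffer, or reject).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum scratch_decision { USE_GHCB_MAPPING, COPY_TO_KERNEL_BUFFER, REJECT };

/* Is [beg, beg + len) fully inside [region_beg, region_beg + region_len)? */
static bool range_inside(uint64_t beg, uint64_t len,
                         uint64_t region_beg, uint64_t region_len)
{
        uint64_t off;

        if (beg < region_beg)
                return false;
        off = beg - region_beg;
        /* Written as subtractions so a huge len cannot wrap around. */
        return off <= region_len && len <= region_len - off;
}

/* After the fix, v2+ guests no longer reach the guest-sized allocation. */
static enum scratch_decision decide_scratch(unsigned int ghcb_version,
                                            bool inside_shared_buf)
{
        if (inside_shared_buf)
                return USE_GHCB_MAPPING;
        if (ghcb_version >= 2)
                return REJECT;
        return COPY_TO_KERNEL_BUFFER;
}
```

The point of the model is the middle branch: a v2+ guest with an out-of-GHCB scratch area is turned away before any allocation happens.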

Those four lines are the main fix, but they went in as part of a bigger series that also rewrote the PSC loop to read everything once through READ_ONCE() and to bound the entry count against the real buffer size. Those handle two more problems the in-GHCB rule alone doesn’t, which I’ll get to below.

How the fix evolved

The most interesting part of this for me wasn’t the bug, it was watching the fix get worked out on the thread after I reported it. I sent in a fix that worked, and then two people who clearly know this code far better than I do took it apart and made it a lot better. It’s worth writing up, because the final version taught me more than my own one-liner did.

My suggestion in the report was the obvious one: work out the real maximum entry count from the buffer length and check end_entry against that instead of the constant.

u16 max_entries = (len - sizeof(struct psc_hdr)) / sizeof(struct psc_entry);
if (idx_end >= max_entries)
        ...

That does stop the OOB, but Mike Roth (AMD) pointed out it was only treating the symptom. He knew the spec cold and went straight to the real rule: an SNP guest is supposed to keep its scratch area inside the shared buffer of the 4 KB GHCB page to begin with. The GHCB v2 spec (section 2.1) says the SW_SCRATCH area has to be inside the GHCB’s shared buffer, and KVM only allocates a separate host buffer for the old out-of-GHCB case that SNP should never use. So the right place to reject it is earlier, in setup_vmgexit_scratch(), before snp_begin_psc() even runs.

Then Sean Christopherson (Google) found two more things I’d missed completely. First, even bounding the entry count isn’t quite enough, because the spec only says the request has to be somewhere in the shared buffer, not at the start of it. The host works out the address as scratch_va = svm->sev_es.ghcb + (scratch_gpa_beg - control->ghcb_gpa), so the guest can put the descriptor at an offset into the page. If you put it near the end of the 2032-byte buffer and then ask for close to 253 entries, you walk off the end of the buffer (and the page) anyway, which an “is end_entry < 253” check at the start can’t catch. The real bound has to come from how much room is actually left after the descriptor’s offset, not the buffer’s total size.
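A sketch of what "room left after the offset" means in numbers. Assumptions: the offset is measured from the start of the 2032-byte shared buffer rather than from the start of the page, and entries_left() is a hypothetical helper, not a function from the series.

```c
#include <assert.h>
#include <stddef.h>

#define SHARED_BUF_SIZE  2032u
#define PSC_HDR_SIZE     8u
#define PSC_ENTRY_SIZE   8u

/*
 * Entries that fit between a descriptor placed `off` bytes into the
 * shared buffer and the end of that buffer.
 */
static size_t entries_left(size_t off)
{
        if (off > SHARED_BUF_SIZE || SHARED_BUF_SIZE - off < PSC_HDR_SIZE)
                return 0;
        return (SHARED_BUF_SIZE - off - PSC_HDR_SIZE) / PSC_ENTRY_SIZE;
}
```

At offset 0 this gives the familiar 253, but a descriptor placed 2000 bytes in has room for only 3 entries, so an end_entry of 252 there is just as out of bounds as it was with the 24-byte allocation.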

Second, the handler read the guest’s descriptor without READ_ONCE(), and because PSC processing is re-entrant it re-read the descriptor after the userspace exit. The descriptor is in guest-shared memory, so another vCPU (or the guest, between exits) can change hdr->cur_entry, hdr->end_entry, or the entries after they’ve been validated. That’s a classic time-of-check/time-of-use race. The funny part is the code already had a comment saying the buffer “can be modified by a misbehaved guest after validation”, and then went and re-read hdr->cur_entry in the loop anyway.

The series ends up doing three things instead of one.

First, it rejects any out-of-GHCB scratch area for v2+ at setup time (that’s db3f219), so the host never makes a guest-sized allocation at all.

Second, it reworks the handler so validation happens once, on values the guest can’t change afterwards. snp_begin_psc() becomes a thin entry point that copies the indices into private per-vCPU state (svm->sev_es.psc, not the guest buffer) and works out the bound from the length:

max_nr_entries = (len - sizeof(struct psc_hdr)) / sizeof(struct psc_entry);
if (WARN_ON_ONCE(max_nr_entries > VMGEXIT_PSC_MAX_COUNT)) {
        snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
        return 1;
}

sev_es->psc.cur_idx = READ_ONCE(guest_psc->hdr.cur_entry);
sev_es->psc.end_idx = READ_ONCE(guest_psc->hdr.end_entry);

if (sev_es->psc.cur_idx > sev_es->psc.end_idx ||
    sev_es->psc.end_idx >= max_nr_entries) {
        snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_HDR);
        return 1;
}

The WARN_ON_ONCE is there because once db3f219 is in place, len is always the in-GHCB buffer size, so max_nr_entries can’t legitimately go over 253. If it does, that’s a host bug, not the guest’s fault. The actual loop moves into a snp_do_psc() worker that reads each entry into a local copy with READ_ONCE(guest_psc->entries[idx]), so the compiler can’t reload a value the guest changed mid-way, and the re-entrant path picks up from the cached indices instead of re-reading the header.
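The snapshot-then-validate pattern can be shown in userspace. This is a model, not the series' code: READ_ONCE_U16 is a stand-in for the kernel's READ_ONCE() built from a volatile load, and struct psc_state and psc_snapshot() are hypothetical names for the private per-vCPU copy and the entry point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for READ_ONCE(): exactly one volatile load of a u16. */
#define READ_ONCE_U16(x) (*(const volatile uint16_t *)&(x))

/* Header as it sits in guest-shared memory; the guest can rewrite it at any time. */
struct shared_hdr { uint16_t cur_entry, end_entry; uint32_t reserved; };

/* Private copy that only the host touches. */
struct psc_state { uint16_t cur_idx, end_idx; };

/* Load each index once, validate the loaded values, and keep only those. */
static bool psc_snapshot(const struct shared_hdr *guest, size_t max_nr_entries,
                         struct psc_state *st)
{
        uint16_t cur = READ_ONCE_U16(guest->cur_entry);
        uint16_t end = READ_ONCE_U16(guest->end_entry);

        if (cur > end || end >= max_nr_entries)
                return false;

        st->cur_idx = cur;
        st->end_idx = end;
        return true;
}
```

Once the snapshot succeeds, later changes to the shared header have no effect on the indices the loop uses, because nothing reads the header again.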

Third, the separate psc_idx / psc_inflight / psc_2m fields in struct vcpu_sev_es_state get folded into one psc struct that’s zeroed on completion, so there’s a single place that owns the in-flight state.

So one OOB I reported turned into fixes for a wrong buffer bound, an offset bug, and a TOCTOU race in the same handler. My version would only have closed the specific path I reported; the series closed all of it. That was a great thing to see as someone fairly new to this.

Disclosure timeline

I sent the report to security@kernel.org on 8 April 2026 with the analysis, KASAN logs, a suggested fix, and a PoC module. Greg Kroah-Hartman forwarded it to the KVM maintainers the same day, and Paolo Bonzini confirmed it was still live in mainline. Over the next few days (8 to 13 April) Mike Roth and Sean Christopherson worked out the proper fix, and Sean wrote the series and sent it round to AMD, Red Hat, and me to test and review. It landed in mainline in late May 2026: db3f219 is the main commit, authored by Mike Roth, reviewed by Tom Lendacky, committed by Paolo Bonzini, tagged for stable. My bisect put the original introduction at 9b54e248d264 (May 2024); the fix is tagged Fixes: 4af663c, the commit that added per-guest GHCB version config.

A broader pattern

The underlying mistake is checking an attacker-controlled index against a protocol constant instead of the size of the thing it’s actually indexing into. Against the protocol, the 253 check is correct. As a memory-safety check it’s useless, and the gap between those two is the bug.

I’ve run into this shape before. In QEMU’s CXL Get LSA handler the requested length is checked against the backing store but not against the output buffer it gets copied into, which is basically the same mistake in a different place. Any time a handler gets “how many” or “how far” from one source and “how much room is there” from another, and only checks the first, it’s worth a closer look.

SEV-SNP makes it easier to miss because of the inverted trust direction. If you’re used to thinking of the hypervisor as the threat, the guest-supplied values arriving in these handlers don’t automatically feel like attacker input, but they are. Every size, offset, and count the guest writes into the GHCB is untrusted, and these handlers deserve the same suspicion you’d give any other code parsing data from an attacker.


Thanks to Mike Roth and Sean Christopherson, who took my report and turned it into a much better fix than the one I sent, and to Tom Lendacky and Paolo Bonzini for the review. Getting to watch people who know this code this well work through it was the best part of the whole thing. db3f219 is in mainline and tagged for stable.