Use-after-free in CPython’s perf_trampoline via unsynchronised arena teardown

January 2026 · CPython Issue #143228 · Fix PR #143233 · Patched in 3.13/3.14, 3.12 marked Won’t Fix

Summary

This is a use-after-free in CPython’s perf_trampoline implementation, triggered when sys.deactivate_stack_trampoline() runs concurrently with active bytecode execution on worker threads. The cleanup function free_code_arenas calls munmap on executable memory pages without checking whether other threads are currently executing code in those regions, so the OS yanks the page table entries out from under threads whose instruction pointer is sitting on those pages. On Python 3.12 the result is an immediate SIGSEGV. On 3.13 and 3.14-dev the Tier 2 executor’s internal state checks intercept the corruption and convert it into SystemError: error return without exception set instead: same root cause, softer landing.

I found this with the same custom scanner I used to identify CVE-2025-64459 in Django, after swapping the graph ingestion backend from Python’s ast module to libclang. The reasoning core didn’t change.

The bug

perf_trampoline exists to help Linux perf map JIT-style trampoline frames back to Python source. When sys.activate_stack_trampoline("perf") is called, the interpreter mmaps executable pages and writes small trampoline stubs into them. When sys.deactivate_stack_trampoline() is called, those pages are released. The release path is free_code_arenas:

// Python/perf_trampoline.c

static void
free_code_arenas(void)
{
    code_arena_t *cur = perf_code_arena;
    code_arena_t *prev;
    perf_code_arena = NULL;
    while (cur) {
        // No synchronisation. No reference count. No check for
        // threads currently executing code on this page or
        // unwinding stack frames through it.
        munmap(cur->start_addr, cur->size);
        prev = cur->prev;
        PyMem_RawFree(cur);
        cur = prev;
    }
}

The execution path that consumes those pages is py_trampoline_evaluator:

static PyObject *
py_trampoline_evaluator(PyThreadState *ts, _PyInterpreterFrame *frame, int throw)
{
    // f points into mmap'd trampoline memory. The CPU's instruction
    // pointer moves into the danger zone when this call dispatches.
    return f(ts, frame, throw, _PyEval_EvalFrameDefault);
}

There is nothing — no GIL acquisition on the deactivation path, no per-arena refcount, no quiescence check — preventing munmap from racing a thread whose PC is currently inside the page being unmapped.

The crashing interleaving:

1. Worker thread enters a trampoline frame. Its PC is now at some address inside the arena, e.g. 0x725ab661e00a.
2. Main thread calls sys.deactivate_stack_trampoline() → free_code_arenas → munmap. The kernel removes the page table entry.
3. Worker thread tries to return from the frame, or hits an exception path that needs to unwind through it. libgcc’s stack unwinder reads the saved PC from the trampoline frame to find the return address.
4. Page fault. The address is no longer mapped. SIGSEGV.

munmap removes the arena page while a worker is still executing in it, so the next unwind reads a saved PC from memory that no longer exists.
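The interleaving can be replayed deterministically in a toy model. This is only a sketch: ToyAddressSpace is a hypothetical stand-in for the process's page mappings, the arena base and size are assumed values chosen so that the PC from the core dump falls inside the arena, and a Python MemoryError plays the role of the SIGSEGV.

```python
class ToyAddressSpace:
    """Hypothetical model of mapped regions; not a real OS or CPython API."""

    def __init__(self):
        self._mapped = {}  # region start address -> size in bytes

    def mmap(self, start, size):
        self._mapped[start] = size

    def munmap(self, start):
        del self._mapped[start]

    def read(self, addr):
        # Succeeds only if some mapped region still covers the address.
        for start, size in self._mapped.items():
            if start <= addr < start + size:
                return "ok"
        raise MemoryError(f"SIGSEGV: {addr:#x} is not mapped")


def replay_interleaving():
    space = ToyAddressSpace()
    arena_start, arena_size = 0x725AB661E000, 0x1000  # assumed, illustrative
    space.mmap(arena_start, arena_size)

    saved_pc = arena_start + 0x00A   # worker enters a trampoline frame
    before = space.read(saved_pc)    # fine while the arena is mapped

    space.munmap(arena_start)        # main thread deactivates the trampoline

    try:
        space.read(saved_pc)         # worker unwinds through the stale frame
        after = "ok"
    except MemoryError as exc:
        after = str(exc)
    return before, after


before, after = replay_interleaving()
print(before)
print(after)
```

The model has no threads at all, which is the point: once the ordering is fixed, the failure is fully determined by the missing check between the unmap and the read.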

Reproduction

The race is hard to hit reliably on multi-core systems because the thread calling munmap and the worker threads tend to run on different physical cores, so a worker is rarely preempted with its PC inside the arena at the moment munmap lands. Pinning the whole process to a single core forces the kernel to time-slice the threads on shared hardware, which dramatically tightens the race window. On my machine the crash takes a few seconds with taskset -c 0 and effectively never reproduces without it.

taskset -c 0 python3 poc.py

# poc.py
import sys
import threading
import os

def heavy_workload():
    # Keep workers inside the trampoline evaluator.
    while True:
        _ = sum(i * i for i in range(500))

def trigger_race():
    print(f"[+] PID: {os.getpid()}")
    for _ in range(8):
        t = threading.Thread(target=heavy_workload, daemon=True)
        t.start()

    # Toggle as fast as possible. No sleep. The window is in nanoseconds.
    while True:
        sys.activate_stack_trampoline("perf")
        sys.deactivate_stack_trampoline()

if __name__ == "__main__":
    trigger_race()
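If taskset is not available, the same pinning can probably be done from inside the script. The helper below is a sketch, not part of the original PoC: pin_to_single_core is a name I made up, and it relies on os.sched_setaffinity, which only exists on platforms that support it (Linux in practice). It picks the lowest CPU the process is already allowed to use rather than hard-coding CPU 0, since CPU 0 may be excluded inside containers.

```python
import os


def pin_to_single_core():
    """Restrict the current process to one CPU.

    Returns the resulting CPU set, or None if pinning is unsupported or refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without the affinity API
    try:
        allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
        target = min(allowed)               # lowest CPU we are permitted to use
        os.sched_setaffinity(0, {target})
        return os.sched_getaffinity(0)
    except OSError:
        return None  # e.g. a sandbox that rejects affinity changes
```

Calling pin_to_single_core() at the top of trigger_race(), before the worker threads start, should have roughly the same effect as launching under taskset, because threads created afterwards inherit the affinity mask.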

Forensic analysis (GDB)

Loading the core dump in GDB shows the race directly. The cleanup thread is mid-munmap:

Thread 9 (Thread 0x725ad0b00b80 (LWP 12791)):
#0  0x0000725ad0125d7b in __GI_munmap () at ../sysdeps/unix/syscall-template.S:117
#1  0x0000725ad071b9f4 in free_code_arenas () at Python/perf_trampoline.c:315
#2  _PyPerfTrampoline_FreeArenas () at Python/perf_trampoline.c:421

The victim thread is unwinding through a frame whose backing memory has just disappeared:

Thread 1 (Thread 0x725ace5fd6c0 (LWP 12846) (Exiting)):
#0  x86_64_fallback_frame_state ... at ./md-unwind-support.h:63
        pc = 0x725ab661e00a <error: Cannot access memory at address 0x725ab661e00a>
#1  uw_frame_state_for ...
#2  0x0000725ab6c86c8a in _Unwind_ForcedUnwind_Phase2 ...

The PC value 0x725ab661e00a is the address inside the trampoline arena that Thread 9’s munmap call invalidated moments earlier. The “Cannot access memory” message is GDB reporting that the address is not backed by any mapping in the core dump: the page no longer existed when the process died. That’s the use-after-free.

Impact

On Python 3.12 this is a hard SIGSEGV with no application-level recovery — the interpreter dies. Any production system that toggles perf profiling on or off (a pattern for opt-in profiling on a per-request basis, or for time-windowed profiling triggered by a scheduler) is exposed to a denial-of-service from any thread happening to be in a Python frame at the wrong moment.

On 3.13 and 3.14-dev the Tier 2 executor’s internal state checks detect the corrupted frame state and return a failure code without setting an exception, producing SystemError: error return without exception set. The crash is softer but the underlying state corruption is identical, and the interpreter is left in an undefined state — the SystemError is a symptom of the race, not a fix for it.

This is structurally a use-after-free. UAFs can sometimes escalate beyond a crash if the freed memory is immediately reclaimed with attacker-influenced content, but I have no working exploit beyond the denial of service here. The freed region is executable trampoline memory inside a single-process managed runtime, not a kernel slab cache with cross-object reclaim targets, so the realistic threat model is interpreter DoS.

The fix

The root assumption was wrong: sys.deactivate_stack_trampoline() cannot safely munmap arena memory immediately, because there’s no static guarantee that no thread is executing inside it. The fix introduces reference counting tied to code object lifetime via the existing PyCode_AddWatcher API.

Each arena now carries a refcount tracking how many code objects have trampoline stubs resident on its pages. Deactivation marks arenas for deletion rather than unmapping them. A code watcher fires when individual code objects are destroyed, decrementing the refcount of whichever arena their stub lives in. munmap only runs when an arena’s refcount reaches zero — i.e. when no live code object can plausibly still be executing inside it. The arena outlives the deactivation call, but only as long as it has to.
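The deferred-release scheme described above can be modelled in a few lines of Python. This is a sketch of the idea only, not the C code from the patch: Arena, TrampolineModel and their methods are hypothetical names, on_code_destroyed stands in for the code watcher callback, and appending to unmap_log stands in for the real munmap.

```python
class Arena:
    """Toy arena: counts resident stubs and defers its own release."""

    def __init__(self, name, unmap):
        self.name = name
        self.refcount = 0          # code objects with a stub in this arena
        self.pending_free = False  # set by deactivation
        self.unmapped = False
        self._unmap = unmap

    def maybe_unmap(self):
        # Release only when deactivated AND no code object still uses it.
        if self.pending_free and self.refcount == 0 and not self.unmapped:
            self.unmapped = True
            self._unmap(self.name)


class TrampolineModel:
    def __init__(self):
        self.arenas = []
        self.owner = {}       # code object id -> Arena holding its stub
        self.unmap_log = []   # records "munmap" calls in order

    def new_arena(self, name):
        arena = Arena(name, self.unmap_log.append)
        self.arenas.append(arena)
        return arena

    def compile_stub(self, code_id, arena):
        arena.refcount += 1
        self.owner[code_id] = arena

    def deactivate(self):
        # Mark for deletion instead of unmapping immediately.
        for arena in self.arenas:
            arena.pending_free = True
            arena.maybe_unmap()

    def on_code_destroyed(self, code_id):
        # Stand-in for the code watcher firing on code object destruction.
        arena = self.owner.pop(code_id, None)
        if arena is not None:
            arena.refcount -= 1
            arena.maybe_unmap()


model = TrampolineModel()
arena = model.new_arena("arena0")
model.compile_stub("f", arena)
model.compile_stub("g", arena)
model.deactivate()              # nothing unmapped yet: two stubs are live
model.on_code_destroyed("f")    # still one live stub
model.on_code_destroyed("g")    # refcount hits zero, arena is released
print(model.unmap_log)
```

The design choice worth noting is that deactivation and release become two separate events: the first is driven by the user, the second by object lifetime.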

Status: 3.13 and 3.14-dev got the patch (PR #143233). 3.12 is in security-fix-only mode and the backport was deemed too invasive relative to the threat (DoS, requires sys.activate_stack_trampoline already in use), so it was marked Won’t Fix.

How I found it

This came out of the same scanner I used to find CVE-2025-64459, with the graph ingestion layer adapted to C. The architecture is two-stage: an LLM-driven semantic pass that flags suspicious patterns based on inferred intent, and a deterministic call graph verifier that filters out anything not reachable from a public API. The reasoning core is unchanged between the two findings; only the parser changed.

For C, Python’s ast module is replaced with libclang’s Python bindings. Function definitions become CXCursor_FunctionDecl, calls become CXCursor_CallExpr, and the call graph is built the same way as in the Django case. The substantive addition is that each graph node is annotated with a memory-lifecycle state — ALLOCATED, FREED, or DEREFERENCED — derived from whether it calls malloc/calloc/realloc, free/munmap, or dereferences a pointer (CXCursor_UnaryOperator with *, or CXCursor_MemberRefExpr). This lets the verifier model the structural precondition for a UAF: a FREED node and a DEREFERENCED node sharing a pointer with no intervening reallocation.
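A simplified version of the lifecycle annotation might look like the following. It is a sketch under stated assumptions: the per-function facts (callee names, a derefs flag, and a regions label for the pointer being shared) are written by hand here, whereas the scanner extracts them from the libclang cursor tree. The regions field and the uaf_candidates helper are hypothetical, and the "no intervening reallocation" check is left out.

```python
ALLOC_CALLS = {"malloc", "calloc", "realloc"}
FREE_CALLS = {"free", "munmap"}


def annotate(functions):
    """Map each function name to its set of memory-lifecycle states."""
    states = {}
    for name, facts in functions.items():
        calls = set(facts.get("calls", ()))
        tags = set()
        if calls & ALLOC_CALLS:
            tags.add("ALLOCATED")
        if calls & FREE_CALLS:
            tags.add("FREED")
        if facts.get("derefs", False):
            tags.add("DEREFERENCED")
        states[name] = tags
    return states


def uaf_candidates(functions):
    """Pair every FREED function with a different DEREFERENCED function
    that touches at least one of the same regions."""
    states = annotate(functions)
    pairs = []
    for freer, f_facts in functions.items():
        if "FREED" not in states[freer]:
            continue
        for user, u_facts in functions.items():
            if user == freer or "DEREFERENCED" not in states[user]:
                continue
            shared = set(f_facts.get("regions", ())) & set(u_facts.get("regions", ()))
            if shared:
                pairs.append((freer, user))
    return pairs


# Hand-written facts for the two functions shown earlier; "code_arena" is an
# illustrative label for the shared mmap'd region, not an identifier in CPython.
FACTS = {
    "free_code_arenas": {
        "calls": ["munmap", "PyMem_RawFree"],
        "derefs": True,            # cur->start_addr, cur->size, cur->prev
        "regions": ["code_arena"],
    },
    "py_trampoline_evaluator": {
        "calls": ["f"],
        "derefs": True,            # call through the function pointer f
        "regions": ["code_arena"],
    },
}

print(uaf_candidates(FACTS))
```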

On perf_trampoline.c the Scout flagged free_code_arenas because it calls munmap on an executable region without visible synchronisation, and py_trampoline_evaluator because it dereferences a function pointer (f(ts, frame, …)) into the same region. The reverse-BFS reachability check confirmed that sys.deactivate_stack_trampoline — a public Python API — is a caller of free_code_arenas, satisfying the structural reachability condition. The Judge’s verdict was that the static analysis confirms a free / dereference pair on a shared region with no intervening reallocation and no visible mutex; the temporal ordering required for the actual race (free runs while another thread is mid-dereference) is beyond what static analysis can prove, but the absence of any synchronisation primitive on the deactivation path is itself the finding.

That last point is what makes this an interesting cross-domain case. The Django bug was the absence of a validator. The CPython bug is the absence of a mutex. Different language, different vulnerability class, but structurally the same shape: a guardrail that should have been there isn’t, the static analysis can prove the path is reachable, and the LLM can articulate why the absence is dangerous. The architecture was built for the first kind of finding; the second kind came essentially for free once the C backend was in place.
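For reference, the reverse-BFS reachability check mentioned earlier can be sketched as follows. The function name and the tiny call graph are illustrative assumptions: the edge from _PyPerfTrampoline_FreeArenas to free_code_arenas comes from the GDB backtrace, while the edge from sys.deactivate_stack_trampoline is collapsed into a single hop for brevity and does not reflect the intermediate C functions.

```python
from collections import deque


def reachable_from_public_api(call_graph, target, public_apis):
    """Walk caller edges backwards from target until a public API is found.

    Returns the call path [public_api, ..., target], or None if no public
    API can reach the target.
    """
    # Invert the graph: callee -> set of callers.
    callers = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    parent = {target: None}   # node -> next node on the way to target
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if node in public_apis:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path
        for caller in sorted(callers.get(node, ())):
            if caller not in parent:
                parent[caller] = node
                queue.append(caller)
    return None


# Illustrative, simplified call graph (see the caveats above).
CALL_GRAPH = {
    "sys.deactivate_stack_trampoline": ["_PyPerfTrampoline_FreeArenas"],
    "_PyPerfTrampoline_FreeArenas": ["free_code_arenas"],
    "free_code_arenas": ["munmap", "PyMem_RawFree"],
}

print(reachable_from_public_api(
    CALL_GRAPH, "free_code_arenas", {"sys.deactivate_stack_trampoline"}))
```

Searching backwards from the flagged function keeps the traversal small: it only visits functions that can actually reach the sink, instead of expanding everything reachable from every public entry point.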