One Symlink From Host Root: The runC maskedPaths Escapes and the Myth of the Container Boundary

The most dangerous file in your container is /dev/null. Not because of what it does, but because of what your container runtime assumes about it. On November 5, 2025, runc maintainer Aleksa Sarai disclosed three vulnerabilities — CVE-2025-31133, CVE-2025-52565, and CVE-2025-52881 — that share a single root cause: runc trusts that the things it is mounting are the things it thinks they are. They are not always. And when they are not, a process inside an unprivileged container can talk a privileged runtime into bind-mounting host procfs files into the container read-write, then use one of them to run code as root on the node.

This is not an exotic kernel bug requiring a custom exploit chain. It is a time-of-check-to-time-of-use race against a runtime that nearly everyone runs: Docker, containerd, CRI-O, and effectively every managed Kubernetes service invoke runc by default to create containers. If you operate Kubernetes, you almost certainly ran vulnerable code in production until you patched runc, and most clusters patched late. The interesting question is not whether you were exposed. It is which of your other defenses would have contained the escape if someone had landed it before you patched. For a lot of clusters, the honest answer is none.

What maskedPaths Was Supposed to Do

When runc builds a container, it hides a short list of sensitive /proc and /sys paths so that a process inside cannot read or scribble on host-global kernel knobs. This is the maskedPaths and readonlyPaths machinery in the OCI runtime spec, and it is the reason a default container cannot just cat /proc/kcore or rewrite /proc/sys/kernel/core_pattern. The default masked set that Docker and containerd ship includes /proc/kcore, /proc/keys, /proc/timer_list, and a handful of others, while /proc/sys and /proc/sysrq-trigger sit on the read-only list.

The masking is implemented with mounts, and the mechanism is more fragile than it looks. A directory that needs masking gets a read-only tmpfs mounted on top of it, so the contents are simply gone from the container’s view. A file that needs masking gets the container’s own /dev/null bind-mounted on top of it — reads return EOF, writes vanish. That is a clever trick: instead of deleting access, you redirect it into the bit bucket. It is also the entire vulnerability, because it means runc’s protection for a sensitive file is itself a mount operation whose source is a path inside the container’s filesystem — a filesystem the container’s own processes can modify while runc is working.

The Trilogy: Three CVEs, One Broken Assumption

All three bugs are mount races, and all three carry a CVSS of 7.3. They differ in which mount they corrupt.

CVE-2025-31133 is the cleanest to reason about. To mask a file, runc bind-mounts the container’s /dev/null over it. But runc did not verify that the /dev/null it was about to use was actually a real /dev/null device node. If a process inside the container replaces /dev/null with a symlink to a procfs file at the right moment, runc faithfully bind-mounts the symlink target — read-write — on top of the path it was trying to protect. The protection mechanism becomes the delivery mechanism.
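
The missing check is small enough to sketch. The following is a minimal illustration in Python, not runc's actual fix (runc is written in Go, and its patch is more involved). It assumes Linux, where the null device is character device 1:3, and the function name is_real_dev_null is made up for this example.

```python
import os
import stat

# Linux assigns the null character device major 1, minor 3.
NULL_MAJOR, NULL_MINOR = 1, 3


def is_real_dev_null(path: str) -> bool:
    """Return True only if `path` is itself the null character device."""
    # O_PATH | O_NOFOLLOW yields a handle to the path itself, so a symlink
    # is reported as a symlink instead of being silently followed.
    fd = os.open(path, os.O_PATH | os.O_NOFOLLOW)
    try:
        st = os.fstat(fd)
    finally:
        os.close(fd)
    return (
        stat.S_ISCHR(st.st_mode)
        and os.major(st.st_rdev) == NULL_MAJOR
        and os.minor(st.st_rdev) == NULL_MINOR
    )
```

One caveat matters more than the check itself: a runtime has to keep the verified descriptor open and mount from that descriptor. This sketch closes it for brevity, and re-resolving the path afterwards would reopen exactly the race described above.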

CVE-2025-52565 attacks a different mount in the same setup sequence: the bind-mount of /dev/pts/$n onto /dev/console. By winning a race during that operation, an attacker can get host paths bind-mounted into writable locations inside the container that should never have been writable. Same class of TOCTOU, different target in the dance.

CVE-2025-52881 is the one that should worry you most, and we will come back to it, because it generalizes the technique into arbitrary write redirection inside /proc — and in doing so it sidesteps the Linux Security Modules that are supposed to be your backstop.

The affected range is broad: every runc up to and including 1.2.7, the 1.3.0-rc.1 through 1.3.2 line, and the 1.4.0 release candidates rc.1 and rc.2. The fixes landed in 1.2.8, 1.3.3, and 1.4.0-rc.3. There is no clever configuration that made an unpatched runc immune; the races live in the core container-setup path that runs for every single container.

From Masked File to Host Root: The core_pattern Kill Chain

A read-write window onto a procfs file is not, by itself, root on the host. What makes these bugs critical is which files become writable and what the kernel does with them.

The marquee target is /proc/sys/kernel/core_pattern. This file tells the kernel what to do when a process dumps core. If its value begins with a pipe, the kernel treats the rest as a program to execute — and here is the part that matters: the kernel runs that program in the host’s initial namespaces, as root, not inside the container’s namespaces. Core-dump upcalls were never namespaced. So the kill chain is short and brutal:

1. Win the maskedPaths race to get /proc/sys/kernel/core_pattern bind-mounted read-write into the container.
2. Write |/proc/%P/root/tmp/pwn (or similar) so the kernel will execute an attacker-controlled binary on the next crash.
3. Crash any process. The kernel launches the helper as root, in the host namespace. You now have code execution on the node.
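
The write in the middle of that chain is also the easiest part to look for from the host. Below is a minimal sketch of a check on the value of core_pattern. The function names and the allow-list idea are assumptions made for illustration, and the helper path in the usage note is only an example of what a distribution might configure.

```python
from typing import Optional, Set


def piped_helper(value: str) -> Optional[str]:
    """Return the helper program if `value` is a pipe-style core_pattern, else None."""
    value = value.strip()
    if not value.startswith("|"):
        return None  # plain file-name template, no program is executed
    parts = value[1:].split()
    # The first whitespace-separated token after the pipe is the program.
    return parts[0] if parts else None


def is_expected(value: str, allowed_helpers: Set[str]) -> bool:
    """True if the pattern runs no helper, or runs one on the allow-list."""
    helper = piped_helper(value)
    return helper is None or helper in allowed_helpers
```

On a node you would read /proc/sys/kernel/core_pattern and call is_expected with whatever handler your distribution legitimately installs, for example {"/usr/lib/systemd/systemd-coredump"}. A helper path that points into /proc/<pid>/root is a strong sign that something inside a container wrote the value.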

The lower-effort variant skips code execution entirely: bind-mount /proc/sysrq-trigger, write a c, and the box panics. That is a single-keystroke denial of service against a Kubernetes node, available to anything that can win the race.

There is an important nuance that is also the whole defensive story. Writing core_pattern is gated by CAP_SYS_ADMIN in the initial user namespace. A container that has dropped CAP_SYS_ADMIN — which is the Docker default — and a container running inside its own user namespace, where “root” maps to an unprivileged host UID, cannot complete the high-impact write even after winning the race. The bug hands you the door; your capability set and namespace configuration decide whether you can walk through it. That is not a reason to delay patching. It is the reason some clusters would have survived an exploit and others would have been owned in one shot.
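
To see where a given process stands on the two conditions this paragraph names, you can read them straight out of procfs. This is a rough sketch: the function names are invented here, the inputs are the text of /proc/<pid>/status and /proc/<pid>/uid_map, and the uid_map test is a heuristic that treats the full identity mapping as "not in a separate user namespace".

```python
CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in linux/capability.h


def has_cap_sys_admin(status_text: str) -> bool:
    """Check the CapEff bitmask from /proc/<pid>/status for CAP_SYS_ADMIN."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool(mask & (1 << CAP_SYS_ADMIN))
    raise ValueError("no CapEff line found")


def looks_like_host_userns(uid_map_text: str) -> bool:
    """True if uid_map is the full identity mapping (0 0 4294967295)."""
    # A pod with hostUsers: false shows a shifted mapping instead,
    # with container UID 0 mapped to some unprivileged host UID.
    return uid_map_text.split() == ["0", "0", "4294967295"]
```

Run against PID 1 of a container, a result of False from the first function and False from the second is the configuration the paragraph above describes as survivable.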

The CVE That Walks Through Your LSM

Most people read “defense in depth” and picture independent layers, each of which holds when the one above it fails. CVE-2025-52881 is a useful, uncomfortable reminder that the layers are not actually independent.

This vulnerability lets an attacker redirect procfs writes in a way that bypasses AppArmor and SELinux mediation. The reason cuts to how those systems work. AppArmor confinement is path-based: a profile permits or denies access to files by their path. When the runtime can be tricked into redirecting a write through a mount the LSM policy did not anticipate, the path the policy is reasoning about and the inode the kernel actually writes to are no longer the same thing. The mandatory access control system enforces its rules perfectly against a description of reality that the attacker has edited.

Readers of this site will recognize the shape of the problem. In March we covered CrackArmor, nine flaws that let a container escape AppArmor confinement and reach host root on Debian, Ubuntu, and SUSE. That writeup ended with a line worth repeating: container isolation is a stack of mechanisms, not a single boundary, and when one layer breaks you need the others to hold. CVE-2025-52881 is the same lesson taught from the other direction. CrackArmor broke the LSM directly. The runc bug reaches past the LSM by corrupting the layer beneath it. Either way, the team that was relying on AppArmor as its load-bearing container boundary had no boundary.

Why This Keeps Happening to runC

If the runc maskedPaths trilogy gives you déjà vu, your memory is working. runc has been the subject of a steady cadence of escape-class vulnerabilities, and they rhyme.

In 2019, CVE-2019-5736 let a container overwrite the host’s runc binary itself via /proc/self/exe, so the next container start executed attacker code as root. In January 2024, Snyk’s “Leaky Vessels” research produced CVE-2024-21626, in which a lingering file descriptor to a host directory — reachable through /proc/self/fd after a crafted WORKDIR — let a container read and write the host filesystem. Now, in late 2025, the maskedPaths races. Three distinct disclosures, six years apart at the ends, every one of them turning on the same primitive: procfs and the runtime’s handling of mounts and file descriptors that straddle the host/container boundary.

This is not runc being uniquely careless. It is a structural truth about Linux containers. A “container” is not a kernel object. It is a userspace choreography of namespaces, cgroups, capabilities, mounts, and LSM profiles that the runtime assembles, one syscall at a time, against a /proc filesystem that exposes the host and the container in the same tree and was never designed as a security boundary. Every one of those setup steps is a small window in which the thing being configured can be changed by the thing being confined. runc is the most-audited container runtime in existence and it still cannot fully close those windows, because the windows are inherent to the model. The lesson is not “switch runtimes.” It is “stop treating the runtime as the boundary.”

The Layers That Actually Held: Seccomp and User Namespaces

Here is the opinionated core of this piece, stated plainly: if a runtime CVE is the first thing standing between an attacker and your nodes, you have already lost the architecture argument. The controls that determined who survived the maskedPaths bugs were the ones a layer down from runc.

Seccomp is the cheapest control with the highest leverage, and most clusters squander it. The default Docker and Kubernetes RuntimeDefault seccomp profile already blocks a large swath of dangerous syscalls — and a tighter, application-specific allow-list narrows the syscall surface a race exploit needs to even attempt its swap. Yet a startling number of production pods run with seccompProfile: Unconfined or with no profile set at all, because someone hit a syscall denial once in 2022 and turned it off cluster-wide instead of fixing the profile. Set this and audit it:

securityContext:
  seccompProfile:
    type: RuntimeDefault   # the floor, not the ceiling

User namespaces are the structural fix for the entire core_pattern class. When a pod runs in its own user namespace, the container’s UID 0 maps to an unprivileged UID on the host. The maskedPaths race might still hand the container a read-write view of core_pattern, but the write fails — the process lacks CAP_SYS_ADMIN in the initial user namespace, where that file’s permission check lives. Kubernetes user namespaces went GA recently; if your kernel and runtime support them, this is among the highest-value hardening changes available:

spec:
  hostUsers: false   # run the pod in a dedicated user namespace

Drop capabilities to nothing and add back only what you need. A container with CAP_SYS_ADMIN is a container that does not need a runtime CVE to escape; it can mount, manipulate cgroups, and reach core_pattern on its own. The runc bugs are far more survivable for workloads that run with drop: ["ALL"].

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  readOnlyRootFilesystem: true

None of these is novel. All of them were available before November 2025. The clusters that were genuinely at risk from a single landed exploit were the ones running rootful, capability-rich, seccomp-unconfined pods — which is to say, the default if you have never deliberately hardened your workloads.
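
Checking for these settings is mechanical, which is the argument for automating it. The sketch below audits a pod spec that has already been parsed into a Python dict (for instance from kubectl get pod -o json). The function name and the wording of the findings are made up for this example, it ignores initContainers and ephemeral containers, and it treats a missing seccomp profile as unconfined, which is only accurate when the kubelet is not configured to default to RuntimeDefault.

```python
from typing import List


def audit_pod_spec(spec: dict) -> List[str]:
    """Return a list of hardening gaps found in a pod's .spec."""
    findings = []
    # hostUsers defaults to true when omitted.
    if spec.get("hostUsers", True):
        findings.append("pod: shares the host user namespace")

    pod_sc = spec.get("securityContext") or {}
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        # A container-level seccomp profile overrides the pod-level one.
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            findings.append(f"{name}: seccomp profile unset or Unconfined")

        drops = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in drops:
            findings.append(f"{name}: capabilities not dropped to ALL")

        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation not false")

        if sc.get("readOnlyRootFilesystem") is not True:
            findings.append(f"{name}: root filesystem is writable")
    return findings
```

An empty list means the pod matches all three snippets above. In a real cluster the same logic belongs in an admission policy rather than a script, but a script is enough to find out how far from the baseline you currently are.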

What To Do Before the Next Runtime CVE

Patch runc first, because nothing below substitutes for it. Confirm the version actually running on every node, not the version in your base image or your intentions:

runc --version
# upgrade until you are at or above 1.2.8, 1.3.3, or 1.4.0-rc.3
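
Across a fleet, eyeballing version strings does not scale, so here is a small sketch that turns the thresholds above into a comparison. The function names are invented for this example, and it only understands upstream version numbers: a distribution package that backported the fix without bumping the upstream version will be reported as unpatched, so treat a negative result as a prompt to check the vendor advisory.

```python
import re

_RELEASE = float("inf")  # a final release sorts after its release candidates


def parse_runc_version(text: str) -> tuple:
    """Extract (major, minor, patch, rc) from text such as `runc --version` output."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)(?:-rc\.?(\d+))?", text)
    if m is None:
        raise ValueError("no version number found")
    major, minor, patch = (int(g) for g in m.group(1, 2, 3))
    rc = int(m.group(4)) if m.group(4) else _RELEASE
    return (major, minor, patch, rc)


def is_patched(text: str) -> bool:
    """True if the upstream version is at or above 1.2.8, 1.3.3, or 1.4.0-rc.3."""
    v = parse_runc_version(text)
    if v[:2] <= (1, 2):
        return v >= (1, 2, 8, _RELEASE)
    if v[:2] == (1, 3):
        return v >= (1, 3, 3, _RELEASE)
    return v >= (1, 4, 0, 3)
```

On a node, the input would be the output of the command above, for example subprocess.run(["runc", "--version"], capture_output=True, text=True).stdout; the parser picks up the first version number it sees, which is the one on the "runc version" line.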

On managed Kubernetes, the runc that matters lives in the node image, so this is a node-pool rollout, not a kubectl change. AWS, Azure, Google, and the rest shipped patched node images on a staggered schedule; “we don’t manage runc, the provider does” is true and is exactly why you must verify the node image version rather than assume.

Then close the structural gaps so the next runtime bug — and there will be one — is a patch you apply calmly rather than an incident:

Set seccompProfile: RuntimeDefault as a baseline and enforce it with a policy engine. Treat Unconfined as a finding that requires sign-off, not a default.
Turn on user namespaces (hostUsers: false) for every workload that does not have a hard reason to share the host’s user namespace. The list of workloads that genuinely need host users is much shorter than the list currently using them.
Drop all capabilities, set allowPrivilegeEscalation: false, and run a read-only root filesystem. Make these the defaults in your admission policy, with exceptions logged and reviewed.
Instrument runtime detection. eBPF-based monitoring — the same telemetry we argued for when eBPF rootkits went commodity — will catch the behavioral signatures of these escapes even when you cannot patch instantly: a symlink appearing at /dev/null, an unexpected bind-mount onto a procfs path, or any write to core_pattern or sysrq-trigger from a container context. Falco’s default ruleset flags several of these out of the box; make sure those rules are enabled and routed somewhere a human will see them.
Segment the blast radius. A node that is fully escaped should still land an attacker in a network segment with no path to your control plane, your cloud metadata endpoint, or your secrets store. The controller-token-leak epidemic showed how quickly one over-privileged identity becomes cluster-admin; node compromise plus a flat network is the same story with a different entry point.
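
For the detection item, one of the signatures (an unexpected bind-mount onto a procfs path) can be approximated without any agent by reading a container's mountinfo. This is a heuristic sketch, not a Falco rule: the function name is invented, and the assumption is that a legitimate container only ever shows procfs subpaths bind-mounted read-only (the readonlyPaths entries), so a read-write bind of a procfs subpath deserves a look.

```python
from typing import List, Tuple


def suspicious_proc_binds(mountinfo_text: str) -> List[Tuple[str, str]]:
    """Return (root, mount_point) for read-write bind mounts of procfs subpaths."""
    hits = []
    for line in mountinfo_text.splitlines():
        # mountinfo fields: id parent major:minor root mount_point options
        # [optional fields...] - fstype source super_options
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) < 6 or not right_fields:
            continue
        root, mount_point = left_fields[3], left_fields[4]
        options = left_fields[5].split(",")
        fstype = right_fields[0]
        # root "/" is the ordinary /proc mount; anything else is a bind
        # of a file or directory from inside procfs.
        if fstype == "proc" and root != "/" and "rw" in options:
            hits.append((root, mount_point))
    return hits
```

Feed it the contents of /proc/<pid>/mountinfo for a process inside the container. A healthy container returns an empty list; an entry such as ("/sys/kernel/core_pattern", "/proc/kcore") is what a successful masking race would be expected to leave behind.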

The Container Is Not a Boundary

The runc maskedPaths escapes are not, in the end, a story about runc. They are a story about a comfortable assumption — that the runtime draws a hard line between the container and the host — that has never been true and keeps costing teams who treat it as true. A container is a bundle of independent kernel mechanisms wearing a trench coat. Each mechanism can fail. The discipline that keeps a single failure from becoming a node compromise is making sure no one mechanism is load-bearing on its own.

CVE-2025-31133 broke the runtime’s masking. CVE-2025-52881 reached past the LSM that was supposed to catch the runtime’s mistakes. The clusters that shrugged these off were not lucky. They had tightened their seccomp profiles, dropped their capabilities, and turned on user namespaces long before November, so the escape primitive landed in an environment where it had nowhere to go. Build that environment now, while this is a patch advisory and not your incident channel. The next runtime CVE is already written; it just does not have a number yet.