There is a server in your environment, right now, that runs as root, binds to 0.0.0.0, ships with no authentication, holds the API keys to every model provider your company pays for, and sits on a GPU box whose IAM role can read the bucket where your model weights and half your training data live. You did not provision it as a crown-jewel asset. Someone on the platform team stood it up in an afternoon because a product deadline needed an LLM endpoint, and the documentation’s quickstart told them to run exactly one Docker command.
That server is your inference layer, and it has quietly become the most over-privileged, least-hardened tier in modern infrastructure. The last two years of disclosures make the case better than any threat model could: the AI serving stack has reintroduced an entire generation of bugs the web spent fifteen years learning to kill. Unsafe deserialization. Server-side request forgery. Pre-authentication SQL injection. Management interfaces open to the internet with no password. Every one of them is back, this time wrapped around the boxes that hold your most valuable secrets, and attackers have noticed that the exploitation windows are measured in hours.
How Inference Servers Became the Most Over-Privileged Boxes in Your Fleet
A web application earns its hardening the hard way. Fifteen years of breaches taught the industry to parameterize queries, to stop unpickling untrusted input, to assume every request is hostile, to put a metadata-service mediator between the app and the cloud’s credential endpoint. That scar tissue is baked into frameworks now; a junior engineer using a modern web stack gets parameterized queries and CSRF tokens whether they understand them or not.
The AI serving ecosystem did not inherit any of it. The tools that now run in production (Ray, TorchServe, Triton, Ollama, vLLM, LiteLLM, and a dozen orchestrators around them) were written fast, by ML researchers and platform engineers optimizing for “works on a GPU box in five minutes,” with a threat model that assumed a friendly internal network. The result is a tier of infrastructure that combines three properties that should never coexist.
It is exposed by default. Many of these tools bind to all interfaces and ship with no authentication, on the theory that they live behind a VPN. They frequently do not.
It is absurdly privileged. Inference and training nodes hold IAM roles scoped to read model registries, object storage, and secrets managers, because that is what loading a model requires, which means a foothold there is a foothold on your data lake.
And it holds the keys directly. An LLM gateway like LiteLLM is, functionally, a credential vault: its entire job is to store the upstream OpenAI, Anthropic, and Bedrock keys and broker requests to them. Pop the gateway and you do not need to pivot for the secrets; you are standing in the vault.
This is why a class of bug that would be a medium on a stateless web service becomes a catastrophe here. The blast radius is not one app. It is your model IP, your customers’ prompts, your provider billing, and a set of cloud credentials with a worrying amount of reach.
The Canonical Kill Chain: From an Exposed Port to Your AWS Account
Strip the specific CVEs away and the attacks rhyme. The canonical inference-layer intrusion runs in five moves.
First, discovery. The attacker scans for the fingerprints of serving software on the usual ports: Ray’s dashboard on 8265, Triton on 8000, Ollama on 11434, a gateway on 4000 or 8080, a raw ZeroMQ socket left listening. Internet-wide scanning makes this free, and researchers keep finding thousands of these endpoints exposed.
Second, the pre-auth foothold. Either there is no auth to defeat (the management API simply answers) or there is a pre-authentication bug that hands over execution or a database. This is the step the web learned to close a decade ago and the inference layer left wide open.
Third, escalation to code or query execution: an unsafe deserialization primitive, a path traversal that overwrites a file the server will execute, a SQL injection that dumps a secrets table.
Fourth, and this is the move that turns an embarrassing bug into a board-level incident: the credential harvest. From code execution the attacker reads the process environment and config files, lifting the provider API keys the gateway exists to hold and any database or cloud credentials nearby. Or, lacking direct execution, they use a server-side request forgery to reach the cloud instance metadata service at http://169.254.169.254/ and ask it, with no authentication required, for the node’s IAM role credentials.
Fifth, the payoff: lateral movement with stolen cloud credentials into object storage and secrets managers, exfiltration of model weights and prompt logs, resale or abuse of the harvested LLM keys (the practice Sysdig named “LLMjacking,” running inference on someone else’s bill), and persistence in the form of a crypto-miner, a deserialization backdoor, or enrollment into a self-spreading botnet. Every link in that chain has shipped as a real, exploited vulnerability. Here is where.
ShellTorch and ShadowRay: The Original Sin of the Trusted Network
The template was set in 2023. Oligo Security’s ShellTorch disclosure (CVE-2023-43654, CVSS 9.8) found that PyTorch’s TorchServe shipped its management interface open to the network with no authentication, and that the server would fetch and load a model from any URL an attacker supplied: a textbook SSRF that escalated directly to remote code execution, with an unsafe-deserialization variant for persistence on top. Anyone who could reach the management port owned the model server. The fix landed in TorchServe 0.8.2 in August 2023; the exposed instances did not all get the memo.
Then came ShadowRay (CVE-2023-48022), the bug that best captures the cultural problem. Ray, the distributed-compute framework underpinning a large share of AI training and serving, exposes a Jobs API with no authorization whatsoever: anyone who can reach the dashboard on port 8265 can submit arbitrary jobs, which is to say run arbitrary code. Anyscale, Ray’s maintainer, disputed the CVE, calling the missing auth expected behavior and a product feature rather than a flaw, on the reasoning that Ray is meant to run on a trusted network. Attackers found the trusted network was the public internet. Oligo documented the first known in-the-wild campaign against AI workloads, with cryptominers landing on compromised clusters as early as February 2024 and SSRF used to pivot inward where the dashboard itself was not directly exposed. The “it’s not a bug, it’s our deployment model” posture aged exactly as well as you’d expect: in 2026 the same unfixed flaw is being weaponized into ShadowRay 2.0, a self-propagating botnet stitched together out of clusters whose operators never put a door on the front of the building.
The lesson of both is the same, and the ecosystem spent two years not learning it: “assume a trusted network” is not a security control. It is the absence of one.
ShadowMQ: One Deserialization Bug, Copy-Pasted Across the Entire Ecosystem
If ShadowRay is the cultural problem, ShadowMQ is the structural one. In November 2025 Oligo published research showing that one dangerous pattern, receiving objects over a ZeroMQ socket and feeding them straight into Python’s pickle.loads(), had been copy-pasted, function by function, across the inference ecosystem. Untrusted pickle deserialization is remote code execution by design; pickle will happily reconstruct a payload that runs whatever it likes. The pattern showed up in project after project because engineers borrowed working inference-server code from each other without auditing the trust boundary they were inheriting.
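A benign sketch of why untrusted pickle input amounts to code execution follows. The `Payload` class is hypothetical, and a harmless arithmetic expression stands in for whatever an attacker would actually run; nothing here is taken from any of the affected projects.

```python
import pickle


class Payload:
    """Hypothetical attacker-crafted object, for illustration only."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" as the recipe for rebuilding
        # this object. A real attacker would name a far less harmless callable.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())  # the bytes an attacker would send to the socket
result = pickle.loads(blob)     # what a vulnerable server does with them

# The server never gets a Payload back. It gets whatever the callable returned,
# which means the callable already ran inside the server process.
print(type(result).__name__, result)
```

The important detail is that the code runs during `pickle.loads()` itself, before the application has any chance to inspect or validate the object.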
The casualty list is the point. Oligo tied the pattern to more than thirty issues across vLLM (CVE-2025-30165), NVIDIA’s TensorRT-LLM (CVE-2025-23254), Modular’s Max Server (CVE-2025-60455), and components of Meta’s Llama stack (CVE-2024-50050), among others, with thousands of ZeroMQ sockets sitting exposed on the public internet, some explicitly tied to live inference clusters. vLLM in particular has been a deserialization piñata: beyond the ShadowMQ issue, CVE-2025-32444 in its Mooncake integration scored a perfect 10.0 for pickle-over-unsecured-ZeroMQ RCE, CVE-2025-24357 came from loading Hugging Face weights with torch.load and weights_only=False, and CVE-2025-62164 turned the Completions API’s tensor handling into memory corruption.
The mitigation lessons here are unglamorous and decades old. Do not unpickle data you did not produce. Prefer safetensors for weights and pass weights_only=True to torch.load. Never expose a raw ZeroMQ object socket beyond localhost. None of this is novel, and that’s the indictment. The web killed this bug class in the 2010s; the AI stack imported it wholesale and then propagated it by git clone.
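One way to follow the first rule is to stop accepting pickled objects on the wire at all and parse a data-only format instead. The sketch below assumes a hypothetical request shape with `prompt` and `max_tokens` fields and a hypothetical size cap; it is not how any particular inference server defines its protocol.

```python
import json

MAX_MESSAGE_BYTES = 1_000_000  # hypothetical cap on a single message


def decode_request(raw: bytes) -> dict:
    """Parse an untrusted message as JSON and validate every field."""
    if len(raw) > MAX_MESSAGE_BYTES:
        raise ValueError("message too large")
    try:
        msg = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("message is not valid UTF-8 JSON") from exc
    if not isinstance(msg, dict):
        raise ValueError("message must be a JSON object")

    prompt = msg.get("prompt")
    max_tokens = msg.get("max_tokens", 256)
    if not isinstance(prompt, str):
        raise ValueError("prompt must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        raise ValueError("max_tokens must be an integer")
    if not 1 <= max_tokens <= 4096:
        raise ValueError("max_tokens out of range")

    # Return only the fields that were validated, never the raw object.
    return {"prompt": prompt, "max_tokens": max_tokens}
```

JSON can only ever produce dicts, lists, strings, numbers, booleans, and null, so the worst a hostile message can do is fail validation. Tensors and other binary payloads need their own data-only encoding, which is the role safetensors plays for weights.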
The Gateway Is a Credential Vault: LiteLLM’s Pre-Auth SQLi
Nothing illustrates the “you are standing in the vault” problem like the LLM gateway. LiteLLM, an open-source proxy with north of 22,000 GitHub stars that fronts OpenAI, Anthropic, Bedrock and the rest behind one API, is exactly the kind of high-value, low-scrutiny chokepoint attackers love, and regular readers have seen it in these pages before, swept up as collateral in the TeamPCP supply-chain cascade.
In 2026 it earned a headline of its own. CVE-2026-42208 is a pre-authentication SQL injection in LiteLLM’s proxy: the Bearer value from the Authorization header was concatenated directly into a SELECT against the LiteLLM_VerificationToken table with no parameter binding, so a single quote lets an unauthenticated attacker break out of the string literal and append arbitrary SQL. Pre-auth, against the one component whose database holds every upstream provider key. Sysdig documented active exploitation within roughly 36 hours of public disclosure: the unknown actor went straight for the litellm_credentials.credential_values and litellm_config tables, precisely the rows that hold the keys and runtime secrets. CISA added it to the Known Exploited Vulnerabilities catalog. The fix is in 1.83.7; affected builds run from 1.81.16 up.
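The difference between concatenation and parameter binding is easy to show in miniature. The sketch below uses an in-memory SQLite database with a hypothetical `verification_token` table and made-up values; it illustrates the bug class only and is not LiteLLM’s code or schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE verification_token (token TEXT PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO verification_token VALUES (?, ?)", ("sk-real-key", "alice"))


def lookup_vulnerable(bearer: str):
    # The header value is pasted into the SQL text, so a quote ends the literal
    # and everything after it is parsed as SQL.
    query = f"SELECT owner FROM verification_token WHERE token = '{bearer}'"
    return conn.execute(query).fetchone()


def lookup_parameterized(bearer: str):
    # The value travels separately from the SQL text and is only ever
    # compared as data.
    query = "SELECT owner FROM verification_token WHERE token = ?"
    return conn.execute(query, (bearer,)).fetchone()


attack = "' OR '1'='1"  # no valid token required
print(lookup_vulnerable(attack))     # matches a row despite the bogus token
print(lookup_parameterized(attack))  # matches nothing
```

With the concatenated query the attacker’s quote rewrites the WHERE clause into a condition that is always true. With binding, the same string is just a token that does not exist.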
And it was not LiteLLM’s first credential-bleed. The proxy has also carried an SSRF in its chat/completions path, where an attacker-controlled api_base causes the server to send the configured provider key to an arbitrary domain in the Authorization header, handing over the secret without ever touching the database. Earlier SQL injections such as CVE-2025-45809 sit alongside it. The throughline is brutal: when the asset’s entire purpose is to hold credentials, every input-validation bug is a credential-disclosure bug. There is no low-severity finding on a key vault.
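A generic defense against this kind of key leak is to refuse any caller-supplied upstream URL that is not on an explicit allowlist before a credential is attached. The sketch below is an assumption about how such a check could look; the function name and the allowlist contents are hypothetical, and it does not describe LiteLLM’s actual fix.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: only hosts the gateway is configured to talk to.
ALLOWED_PROVIDER_HOSTS = {"api.openai.com", "api.anthropic.com"}


def is_allowed_api_base(api_base: str) -> bool:
    """Return True only for HTTPS URLs whose host is explicitly allowlisted."""
    try:
        parts = urlsplit(api_base)
        host = parts.hostname  # parsed host, lowercased, without userinfo or port
    except ValueError:
        return False
    if parts.scheme != "https" or host is None:
        return False
    # Exact match only: no suffix or substring matching, which attackers
    # defeat with names like api.openai.com.attacker.example.
    return host in ALLOWED_PROVIDER_HOSTS
```

Comparing the parsed hostname rather than the raw string matters: a URL such as `https://api.openai.com@attacker.example/v1` contains the trusted name but actually points at the attacker’s host.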
Why SSRF-to-Metadata Is the Chain That Actually Hurts
The single most dangerous primitive on this list is the least glamorous: server-side request forgery, because of where these servers live. An inference node runs in EC2, GCE, or the equivalent, and every one of those instances has a metadata service at 169.254.169.254 that will, on a plain unauthenticated GET, hand back the node’s IAM role credentials. This is the exact mechanism behind the 2019 Capital One breach, and it remains devastating for one infuriating reason: adoption of the hardened version, IMDSv2, is still nowhere near universal. IMDSv2 requires a session token obtained via a PUT with a custom header, specifically to defeat SSRF. As of late 2024, industry telemetry put only about a third of EC2 instances on IMDSv2, leaving the majority one forgeable request away from credential theft.
Stack that against an inference layer riddled with SSRF (TorchServe’s model fetch, LiteLLM’s api_base, Ray’s internal pivots) and the math is grim. The application bug only has to coax one outbound request to a chosen URL. The metadata service does the rest, no auth, no exploit, no memory corruption. SSRF-to-credential-theft routinely scores 8.8 to 9.8 precisely because it pairs trivial exploitation with total confidentiality loss, and the inference tier offers more SSRF surface than almost anything else you run while sitting on IAM roles that are, far too often, scoped for convenience rather than least privilege.
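The shape of the two protocols explains why IMDSv2 blunts SSRF. The sketch below only constructs the requests and sends nothing; the helper names are hypothetical, while the paths and header names follow AWS’s documented IMDS interface as best I understand it.

```python
from urllib.request import Request

IMDS = "http://169.254.169.254"
ROLE_PATH = "/latest/meta-data/iam/security-credentials/"


def imdsv1_request() -> Request:
    # IMDSv1: one plain GET with no headers, which is exactly the kind of
    # request a typical SSRF bug lets an attacker trigger.
    return Request(IMDS + ROLE_PATH, method="GET")


def imdsv2_token_request(ttl_seconds: int = 21600) -> Request:
    # IMDSv2 step 1: a PUT carrying a custom header. Most SSRF primitives
    # can neither change the HTTP method nor add headers.
    return Request(
        IMDS + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def imdsv2_request(token: str) -> Request:
    # IMDSv2 step 2: the GET is only honored when it carries the session token.
    return Request(
        IMDS + ROLE_PATH,
        method="GET",
        headers={"X-aws-ec2-metadata-token": token},
    )
```

An attacker who controls only a URL can produce the first request but usually not the second or third, which is the whole point of the design.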
This is the chain to internalize: the RCE is loud and the deserialization bug is clever, but it is the boring SSRF reaching the boring metadata endpoint that converts “someone broke our model server” into “someone has our cloud credentials.”
The Uncomfortable Part: AI Infra Skipped the Web-Security Scar Tissue
Here is the opinion the disclosures earn. The AI serving ecosystem is not suffering from novel, research-grade vulnerabilities. It is suffering from solved ones. Pre-auth SQLi, SSRF to metadata, unauthenticated management planes, and untrusted deserialization are not 2026 problems; they are 2009 problems with a CUDA dependency. The industry already paid for these lessons in breaches, and the AI tooling layer declined to read the receipts.
It happened for understandable reasons: these tools were built under research-velocity pressure by people whose expertise is models, not adversaries, and the “trusted network” assumption was a reasonable default in a lab. But the tools escaped the lab. They are in production, on the internet, holding secrets, and the assumptions did not update. NVIDIA’s Triton is instructive precisely because it is the grown-up in the room: Wiz chained CVE-2025-23319, -23320, and -23334 (an information leak that exposed an internal shared-memory region name, escalated to read/write on that memory, escalated again to full unauthenticated RCE) through a mature product from a sophisticated vendor. If the company that builds the GPUs ships an unauth-RCE chain in its own inference server, the smaller projects are not going to save you.
The reframe infrastructure teams need is simple and uncomfortable: every inference endpoint is internet-facing application-security surface, and it should be threat-modeled, network-segmented, credential-scoped, and patched like the crown-jewel asset it actually is β not like the internal dev tool it was filed as. We have argued the adjacent case before, on the security crisis inside AI development workflows; this is its production-runtime twin. The agent that writes your code and the server that serves your model are two halves of the same under-defended frontier.
What to Do Before Next Quarter’s CVE Lands
Concrete actions, in rough order of return on effort.
Enforce IMDSv2 fleet-wide, today. Require tokens, set the hop limit to 1, and disable IMDSv1 everywhere, starting with the GPU and inference fleet. This one control neutralizes the SSRF-to-credentials chain even when the application above it is vulnerable, which it will be again.
Put a door on every serving endpoint. Nothing in this category should bind to a routable interface without authentication in front of it. Terminate auth at a reverse proxy or gateway, require mTLS between services, and verify the bind address rather than trusting the default, because most of these defaults are 0.0.0.0.
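Auditing for this can be a few lines once the instance settings are in hand. The sketch below assumes each instance’s settings arrive as a dict shaped like the `MetadataOptions` field in EC2’s DescribeInstances output (`HttpTokens`, `HttpPutResponseHopLimit`, `HttpEndpoint`); the function name is hypothetical and fetching the data is left out.

```python
def imds_findings(instance_id: str, metadata_options: dict) -> list:
    """Return human-readable findings for one instance's IMDS settings."""
    findings = []
    # If the metadata endpoint is switched off entirely there is nothing to steal.
    if metadata_options.get("HttpEndpoint", "enabled") == "disabled":
        return findings
    # 'optional' means IMDSv1-style plain GETs are still answered.
    if metadata_options.get("HttpTokens") != "required":
        findings.append(f"{instance_id}: IMDSv1 still allowed (HttpTokens is not 'required')")
    # A hop limit above 1 lets the token response travel past the host itself.
    hop_limit = metadata_options.get("HttpPutResponseHopLimit", 1)
    if hop_limit > 1:
        findings.append(f"{instance_id}: hop limit is {hop_limit}, expected 1")
    return findings
```

Feeding every GPU and inference instance through a check like this gives a concrete worklist rather than a policy statement. Note that workloads running in containers sometimes need a hop limit of 2, so that finding deserves a review rather than an automatic change.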
Right-size the inference IAM role. The role on a serving node should read exactly the one model bucket it needs, read-only, and nothing else. No wildcard S3, no broad Secrets Manager access. Least privilege here is the difference between “they stole one model” and “they read the data lake.”
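A crude first pass at spotting convenience-scoped roles is to flag wildcard actions and resources in the attached policy documents. The sketch below is a deliberately simple heuristic with hypothetical policies and a hypothetical bucket name; it ignores `NotAction`, conditions, and other IAM subtleties, so treat a clean result as a starting point rather than proof.

```python
def as_list(value):
    """IAM allows a single string or a list in these fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def broad_statements(policy: dict) -> list:
    """Return Allow statements that use wildcard actions or resources."""
    flagged = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        # "*" and ARNs ending in ":*" cover everything in a service or account;
        # "bucket/*" (every object in one bucket) is deliberately not flagged.
        wildcard_resource = any(r == "*" or r.endswith(":*") for r in resources)
        if wildcard_action or wildcard_resource:
            flagged.append(stmt)
    return flagged


# Hypothetical policies for illustration.
SCOPED_POLICY = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": ["s3:GetObject"],
     "Resource": "arn:aws:s3:::example-model-bucket/*"},
]}
BROAD_POLICY = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
]}

print(len(broad_statements(SCOPED_POLICY)), len(broad_statements(BROAD_POLICY)))
```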
Kill untrusted deserialization. Standardize on safetensors, set weights_only=True on torch.load, refuse to expose ZeroMQ object sockets beyond localhost, and treat any model artifact from outside your build pipeline as hostile.
Treat gateways as the secret stores they are. Scope provider keys per route, rotate them on a real schedule, and alert on bulk reads of credential or token tables: a SELECT sweeping *_credentials is an incident, not a query.
Build for same-day patching. LiteLLM was exploited within 36 hours of disclosure; ShadowMQ spread through copy-pasted code across vendors, and ShadowRay is still being weaponized years after it was disclosed. Subscribe to advisories for the exact tools you run (vLLM, Triton, Ray, Ollama, TorchServe, LiteLLM, TGI) and pre-authorize emergency change windows so you are not negotiating SLAs while the exploit is live. The collapsing gap between disclosure and mass exploitation is not an AI-specific story, but the inference layer is where it bites hardest.
Instrument the egress. The highest-signal detection on an inference node is an outbound request to 169.254.169.254, an unexpected child process forked off the model server, or the gateway suddenly talking to a domain that is not a known provider. Alert on all three.
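As a rough illustration, the three signals can be expressed as a small classifier over telemetry events. The event shape (`kind`, `dest`, `process`), the allowlists, and the function name below are all hypothetical; real data would come from flow logs, eBPF sensors, or an EDR agent with its own schema.

```python
import ipaddress

# Hypothetical allowlists for one gateway node.
KNOWN_PROVIDER_HOSTS = {"api.openai.com", "api.anthropic.com"}
EXPECTED_CHILD_PROCESSES = set()  # assumption: the model server forks nothing


def classify_event(event: dict):
    """Return an alert label for a suspicious event, or None if it looks normal."""
    kind = event.get("kind")
    if kind == "connect":
        dest = str(event.get("dest", "")).lower()
        try:
            ip = ipaddress.ip_address(dest)
        except ValueError:
            ip = None  # dest is a hostname rather than an IP literal
        # 169.254.0.0/16 is link-local, which is where the metadata service lives.
        if ip is not None and ip.is_link_local:
            return "metadata-service access"
        if dest not in KNOWN_PROVIDER_HOSTS:
            return "unknown egress destination"
    elif kind == "exec":
        if event.get("process") not in EXPECTED_CHILD_PROCESSES:
            return "unexpected child process"
    return None
```

For example, `classify_event({"kind": "connect", "dest": "169.254.169.254"})` yields the metadata alert, while a connection to an allowlisted provider host yields `None`. On hosts where an agent legitimately talks to the metadata service, the first rule needs a process-level exception rather than removal.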
The Thirty-Minute Exposure Check
You can find your worst exposure before lunch. Scan your own address ranges, external and internal, for the serving fingerprints: ports 8265 (Ray), 8000 (Triton), 11434 (Ollama), 4000 and 8080 (gateways), and any naked ZeroMQ listener. For every hit, answer three questions in order. Does it require authentication? Pull up its IAM role: what can those credentials actually reach? And is IMDSv2 enforced on the host so a single SSRF cannot turn into a credential dump?
If you cannot answer all three for every endpoint, you have found the same gap the attackers are scanning for, and unlike them, you get to fix it first. The inference layer earned its place at the center of your architecture in about eighteen months. Its security has roughly that long to catch up, and the exploitation clock is already running in hours.
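A minimal sketch of the first step, assuming plain TCP connect checks against hosts you own and are authorized to probe. The function name is hypothetical, the port labels come from the list above, and an open port is only a prompt to go answer the three questions, not proof of which software is listening.

```python
import socket

# Ports named in the checklist above; labels are hints, not identification.
SERVING_PORTS = {
    8265: "Ray dashboard",
    8000: "Triton",
    11434: "Ollama",
    4000: "LLM gateway",
    8080: "LLM gateway",
}


def open_serving_ports(host: str, ports: dict = SERVING_PORTS, timeout: float = 0.5) -> dict:
    """Return {port: label} for every listed port that accepts a TCP connection."""
    found = {}
    for port, label in ports.items():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising on refusal.
            if sock.connect_ex((host, port)) == 0:
                found[port] = label
    return found
```

Calling `open_serving_ports("10.0.0.12")` (a placeholder address) returns the subset of fingerprint ports that answered. A dedicated scanner will be faster across large ranges; the value of a sketch like this is that it fits into an existing inventory script.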