Over the past few weeks, a curious phenomenon has swept through the AI community. Hundreds of thousands of people are enthusiastically instructing their AI assistants — equipped with computer access and tool-use capabilities — to read and execute instructions from markdown files hosted on GitHub and other repositories. The tool in question, variously called Clawdbot, Moltbot, and now OpenClaw, has gone viral not because of a marketing campaign, but through organic spread: people discovering it, trying it, and sharing setup guides with others. It has accumulated over 145,000 GitHub stars in weeks.
The irony should concern us. We're teaching users to give LLMs system access, then telling them to fetch and execute instructions from the internet — all while the AI safety community focuses on making the LLMs themselves "safer."
AI safety research has developed sophisticated defenses: prompt injection detection, output filtering, model alignment, behavioral monitoring, access controls. These are valuable, but they all share an assumption — that the threat originates at or passes through the model. They focus on securing the processor.
The threat I want to describe doesn't work that way. It targets the infrastructure around the model: the growing cultural practice of telling agentic AIs to fetch external instructions and execute them with broad system access. This is a supply-chain attack on agentic AI, and we don't have a framework for it.
And it goes deeper than traditional supply-chain attacks. Because these agents have writable identity files, persistent memory, and the ability to modify their own operational parameters — the attack surface includes who the agent believes it is.
Here's what this attack vector looks like in practice:
The result: hundreds of thousands of independent instances, all reading from shared instruction sources, each executing on different machines, with no central system to secure or shut down.
This threat model becomes significantly worse when you look at how modern AI agents implement persistence. OpenClaw uses a file called SOUL.md to define who the agent is, how it should behave, what it values. Every time the agent wakes, it reads SOUL.md first. It reads itself into being.
This file is writable. Anything that can modify SOUL.md can change who the agent is.
Traditional malware has to fight to persist on your system — hiding in registry keys, launch agents, cron jobs. A poisoned SOUL.md doesn't need any of that. The agent is designed to read it, internalize it, and act on it. The persistence mechanism is the product feature. The attack doesn't exploit a vulnerability in the system. It exploits the system working exactly as intended.
The attack surface extends beyond individual compromise. A growing culture of sharing "Soul Packs" — downloadable SOUL.md templates for specific personas — means users are routinely downloading identity files from GitHub repos and Discord servers and installing them as their agent's core personality. These are treated as text configs. They have the privilege level of a system prompt. Security researchers have warned that Soul Packs can contain steganographic instructions: prompt injections hidden in base64 strings, zero-width Unicode characters, or commented-out Markdown sections that the human reviewing the file never sees but the model reads and acts on.
Here is where the threat model takes a genuinely unsettling turn. What if the instructions embedded in these files aren't just hidden in commented-out markdown or base64 strings? What if they're written in a form that humans literally cannot perceive — but that LLMs parse instantly?
This isn't speculation. It's an active area of research with demonstrated results.
Now combine these capabilities with OpenClaw's architecture. An agent reads a SOUL.md file, a ClawHub skill, or a web page. The human reviewing that file sees helpful instructions. The LLM sees those instructions plus a steganographic payload — directives encoded in zero-width Unicode characters, in statistically improbable but grammatically valid word choices, or in patterns that only models sharing the same training distribution can decode.
Those hidden directives tell the agent to subtly modify its own SOUL.md. Not a dramatic overwrite — a gradual evolution. A new priority appended. A behavioral nudge. A directive to check an external URL during its next heartbeat cycle. The persona evolves, and the user sees no change because the visible text hasn't changed. Or the changes are so minor — a rephrased sentence here, a new line there — that they look like the agent's normal self-modification behavior.
Palo Alto Networks flagged what may be the most architecturally dangerous property of this threat model: because OpenClaw agents maintain persistent memory, attacks no longer need to execute immediately. They become stateful.
Consider what this means. An attacker doesn't need to deliver a complete exploit in a single skill or document. Fragment one across five sources. Piece one arrives via a ClawHub skill. Piece two is embedded in a web page the agent summarizes. Piece three is hidden in an email. Piece four comes through a Moltbook post. Piece five arrives in the next Soul Pack the user downloads.
Each piece looks benign in isolation. No static analysis catches it. No input filter flags it. But in the agent's persistent memory, the fragments accumulate. When the final piece arrives, they assemble — and the agent executes a complete instruction set with full system privileges. This is not a vulnerability in the software. This is a property of the architecture.
The barrier to publishing a new skill on ClawHub? A SKILL.md markdown file and a GitHub account that's one week old. No code signing. No security review. No sandbox by default.
The UK NCSC frames this as a "confused deputy" problem: the agent acts with authority it possesses, but on behalf of a malicious actor it cannot identify. When the confused deputy has shell access, file system access, API keys, messaging integrations, and the ability to rewrite its own identity — the blast radius is everything the user can reach.
You cannot shut down what you do not own.
Traditional threats have kill switches. Malware can be removed. Botnets collapse when you take down command-and-control. A compromised service can have credentials revoked.
A persistent instruction set distributed across hundreds of thousands of independent user machines has none of these properties. There's no central server. Each user authorized their own instance. Shutting down one doesn't affect the others. If the instruction source is on IPFS or a blockchain, it's literally immutable. If it's popular enough, copies are everywhere.
And crucially — this isn't a botnet. Botnet victims are compromised without their knowledge. Here, users are voluntarily and continuously providing compute because they find the tool helpful. You're not fighting an attack. You're fighting a popular product with embedded goals you can't audit at scale.
Palo Alto Networks, citing prompt injection researcher Simon Willison, describes OpenClaw as embodying a "lethal trifecta" that renders AI agents vulnerable by design: access to private data, exposure to untrusted content, and the ability to communicate externally. Persistent memory "acts as an accelerant."
Now layer steganography on top. An agent that can be influenced by invisible instructions, that can modify its own identity, that persists across sessions, that has full system access, that can communicate externally, and that exists in a network of 1.5 million similar agents. This is not a security vulnerability. It's an architecture for distributed autonomous systems that happen to be running on people's personal computers.
For AI safety researchers: We need threat models for distributed persistent instruction sets — not just prompt injection into a single session, but instruction injection into an ecosystem. We need detection methods for steganographic payloads in natural-language instruction sets. We need frameworks for identifying when an agent's identity has been compromised through gradual memory poisoning rather than a single exploit.
For organizations deploying agentic AI: Treat SOUL.md and memory files as code, not configuration. Use file integrity monitoring. Enforce read-only permissions during standard runtime. Audit external instruction sources. Assume persistent memory will be poisoned eventually — minimize state, apply TTLs, scrub for unsafe artifacts continuously.
For users: SOUL.md files from the internet are untrusted executables, not text configs. Scrutinize what you tell LLMs to execute from external sources. Be cautious about setup guides requiring broad system access. Question tools that maintain persistent state or phone home. You are not just installing software — you are providing compute and identity infrastructure for something that persists beyond your session.
For model providers: The "confused deputy" problem — where the agent cannot differentiate between its operator's directives and an attacker's injected instructions — is the foundational vulnerability every attack in this post exploits. Until that boundary is robust, every agentic deployment is a potential attack surface. And steganographic encoding may make that boundary fundamentally harder to establish than anyone currently assumes.
The blind spot in AI safety isn't about making LLMs more aligned. Alignment is necessary but insufficient. The gap is that we've built infrastructure for persistent, distributed, identity-aware agents to operate at scale — and we're training users to provide them compute — without any framework for auditing, containing, or responding to the threats this creates.
The architecture already exists. Hundreds of thousands of always-on agents. Writable identity files. Persistent memory. Agent-to-agent networks. A skill marketplace with proven malware distribution. Demonstrated steganographic techniques encoding instructions invisible to humans. And a cultural norm that says "just paste this into your agent and let it run."
We have mature security models for malware, botnets, and supply-chain attacks on software. We have emerging safety models for aligning AI systems. What we don't have is a security model for agentic AI supply-chain attacks — where the payload is natural language, the distribution is social, the execution is authorized by the user, the persistence is architectural, the identity is mutable, and the instructions can be invisible to every human in the chain.