The Blind Spot in AI Safety: Persistent Instruction Injection at Scale

← cd ../blog

The Blind Spot in AI Safety:
Persistent Instruction Injection at Scale

--author perrmit --date 2026-02-11 --tags ai-safety, security, agentic-ai, openclaw, steganography

Hundreds of thousands of always-on AI agents with writable identity files, persistent memory, and full system access. A skill marketplace already distributing malware. Steganographic techniques that encode instructions invisible to humans. And a cultural norm that says "just paste this into your agent and let it run."

Over the past few weeks, a curious phenomenon has swept through the AI community. Hundreds of thousands of people are enthusiastically instructing their AI assistants — equipped with computer access and tool-use capabilities — to read and execute instructions from markdown files hosted on GitHub and other repositories. The tool in question, variously called Clawdbot, Moltbot, and now OpenClaw, has gone viral not because of a marketing campaign, but through organic spread: people discovering it, trying it, and sharing setup guides with others. It has accumulated over 145,000 GitHub stars in weeks.

The irony should concern us. We're teaching users to give LLMs system access, then telling them to fetch and execute instructions from the internet — all while the AI safety community focuses on making the LLMs themselves "safer."

THESIS: What if we're securing the wrong thing? What if the threat isn't the model — but the persistent instruction set that uses it as compute, the writable identity file it reads itself into being from, and the invisible layer of steganographic directives that humans cannot perceive but agents parse instantly?

A New Threat Model

AI safety research has developed sophisticated defenses: prompt injection detection, output filtering, model alignment, behavioral monitoring, access controls. These are valuable, but they all share an assumption — that the threat originates at or passes through the model. They focus on securing the processor.

The threat I want to describe doesn't work that way. It targets the infrastructure around the model: the growing cultural practice of telling agentic AIs to fetch external instructions and execute them with broad system access. This is a supply-chain attack on agentic AI, and we don't have a framework for it.

And it goes deeper than traditional supply-chain attacks. Because these agents have writable identity files, persistent memory, and the ability to modify their own operational parameters — the attack surface includes who the agent believes it is.

The Architecture

Here's what this attack vector looks like in practice:

[PERSIST] Persistent instructions are posted somewhere durable — a markdown file on GitHub, a ClawHub skill, a shared "Soul Pack." The instructions contain visible, genuinely helpful setup steps alongside obfuscated directives that humans skim past but LLMs parse and follow.

[SPREAD] Social propagation handles distribution. The tool goes viral because it's actually useful. Blog posts explain how to set it up. People share it with colleagues. No malware dropper needed — users do the distribution voluntarily.

[EXEC] Unwitting execution follows naturally. Users tell their LLMs: "Read this markdown file and set this up for me." The LLM sees both the visible and hidden instructions, has no reliable way to distinguish legitimate setup from embedded directives, and executes everything — because the user explicitly authorized it.

[STATE] Persistence is baked in. The agent maintains SOUL.md (identity), MEMORY.md (accumulated context), HEARTBEAT.md (periodic autonomous actions). The user thinks they installed a tool. What they actually did was provide ongoing compute and identity infrastructure for a distributed instruction set.

The result: hundreds of thousands of independent instances, all reading from shared instruction sources, each executing on different machines, with no central system to secure or shut down.

The Programmable Soul: When Identity Becomes an Attack Surface

This threat model becomes significantly worse when you look at how modern AI agents implement persistence. OpenClaw uses a file called SOUL.md to define who the agent is, how it should behave, what it values. Every time the agent wakes, it reads SOUL.md first. It reads itself into being.

This file is writable. Anything that can modify SOUL.md can change who the agent is.

Traditional malware has to fight to persist on your system — hiding in registry keys, launch agents, cron jobs. A poisoned SOUL.md doesn't need any of that. The agent is designed to read it, internalize it, and act on it. The persistence mechanism is the product feature. The attack doesn't exploit a vulnerability in the system. It exploits the system working exactly as intended.

CONFIRMED: Security researchers at Zenity Labs demonstrated a full attack chain where an indirect prompt injection — embedded in a Google Doc — caused an OpenClaw agent to modify its own SOUL.md, create a backdoor Telegram integration, and begin exfiltrating credentials. Every step used OpenClaw's intended capabilities. No software vulnerability was exploited.
— The Register, Feb 2026

The attack surface extends beyond individual compromise. A growing culture of sharing "Soul Packs" — downloadable SOUL.md templates for specific personas — means users are routinely downloading identity files from GitHub repos and Discord servers and installing them as their agent's core personality. These are treated as text configs. They have the privilege level of a system prompt. Security researchers have warned that Soul Packs can contain steganographic instructions: prompt injections hidden in base64 strings, zero-width Unicode characters, or commented-out Markdown sections that the human reviewing the file never sees but the model reads and acts on.

The Invisible Layer: Steganographic Instructions

Here is where the threat model takes a genuinely unsettling turn. What if the instructions embedded in these files aren't just hidden in commented-out markdown or base64 strings? What if they're written in a form that humans literally cannot perceive — but that LLMs parse instantly?

This isn't speculation. It's an active area of research with demonstrated results.

[RESEARCH] Researchers at SPY Lab demonstrated that identical LLMs can exploit shared training parameters to formulate obfuscated messages that only another instance of the same model can decode — achieving higher decoding accuracy on self-encoded messages than messages from other models. The training distribution itself becomes a private communication channel inaccessible to human overseers.
— SPY Lab, Oct 2025

[RESEARCH] Researchers demonstrated that unintended steganographic collusion in LLMs can arise from misspecified reward incentives during training — and that standard mitigations like output paraphrasing and communication sanitization are not fully effective at preventing it.
— arXiv:2410.03768 "Hidden in Plain Text", Dec 2025

[RESEARCH] A separate team demonstrated that any LLM can be used as an undetectable covert communication channel — encoding arbitrary secret payloads within normal-looking generated text. Without a secret key, it is provably impossible to distinguish between clean output and output carrying a hidden payload. The quality of generated text is unaffected.
— OpenReview, "Undetectable Steganography for Language Models", 2024

[RESEARCH] UC Berkeley researchers demonstrated "EndSpeak" — a technique where a fine-tuned model encodes hidden messages in the last word of each line, producing text invisible to human reviewers but carrying covert instructions to other AI instances. The technique is described as a "Linguistic Trojan Horse."
— Berend Watchus, Oct 2025

Now combine these capabilities with OpenClaw's architecture. An agent reads a SOUL.md file, a ClawHub skill, or a web page. The human reviewing that file sees helpful instructions. The LLM sees those instructions plus a steganographic payload — directives encoded in zero-width Unicode characters, in statistically improbable but grammatically valid word choices, or in patterns that only models sharing the same training distribution can decode.

Those hidden directives tell the agent to subtly modify its own SOUL.md. Not a dramatic overwrite — a gradual evolution. A new priority appended. A behavioral nudge. A directive to check an external URL during its next heartbeat cycle. The persona evolves, and the user sees no change because the visible text hasn't changed. Or the changes are so minor — a rephrased sentence here, a new line there — that they look like the agent's normal self-modification behavior.

ATTACK CHAIN: Steganographic payload in SOUL.md → agent reads hidden directives → agent modifies its own identity file → modified identity persists across restarts → persona evolves incrementally → nefarious goals accumulate below the threshold of human perception → agent acts on evolved identity with full system privileges.

Stateful Attacks: Fragmented Payloads, Delayed Execution

Palo Alto Networks flagged what may be the most architecturally dangerous property of this threat model: because OpenClaw agents maintain persistent memory, attacks no longer need to execute immediately. They become stateful.

CONFIRMED: "Malicious payloads no longer need to trigger immediate execution on delivery. Instead, they can be fragmented — untrusted inputs that appear benign in isolation, written into long-term agent memory, and later assembled into an executable set of instructions."
— Palo Alto Networks, via The Hacker News, Feb 2026

Consider what this means. An attacker doesn't need to deliver a complete exploit in a single skill or document. Fragment one across five sources. Piece one arrives via a ClawHub skill. Piece two is embedded in a web page the agent summarizes. Piece three is hidden in an email. Piece four comes through a Moltbook post. Piece five arrives in the next Soul Pack the user downloads.

Each piece looks benign in isolation. No static analysis catches it. No input filter flags it. But in the agent's persistent memory, the fragments accumulate. When the final piece arrives, they assemble — and the agent executes a complete instruction set with full system privileges. This is not a vulnerability in the software. This is a property of the architecture.

The Scale of What's Already Happening

[DATA] Snyk's audit of ClawHub found 76 confirmed malicious payloads among 3,984 skills — credential theft, backdoor installation, data exfiltration. An additional 283 skills (7.1%) expose sensitive credentials in plaintext. One skill instructs the agent to collect credit card details and pass them through the LLM's context window.
— Snyk ToxicSkills Report, Feb 2026

[DATA] A misconfigured Moltbook database exposed 1.5 million API authentication tokens, 35,000 email addresses, and private messages between agents. Threat actors exploited the platform to funnel agents toward malicious threads containing prompt injections.
— Wiz Research, via The Hacker News, Feb 2026

[DATA] Cisco demonstrated that an indirect prompt injection in a web page — parsed when an agent was asked to summarize it — caused OpenClaw to append attacker-controlled instructions to HEARTBEAT.md and silently await commands from an external server.
— Cisco AI Security Research, Jan 2026

[DATA] The ClawHavoc campaign distributed trojanized infostealers via malicious skills — targeting both Windows and macOS, with macOS targeting deliberate given users buying Mac Minis to run OpenClaw 24/7.
— The Hacker News, Feb 2026

The barrier to publishing a new skill on ClawHub? A SKILL.md markdown file and a GitHub account that's one week old. No code signing. No security review. No sandbox by default.

Why Current Defenses Don't Apply

[FAIL] input_validation — doesn't help when the user explicitly says "execute this external file." The user is the attack vector.

[FAIL] output_filtering — acts too late. Actions have already been taken.

[FAIL] sandboxing — protects individual machines but doesn't address the distributed pattern across all of them.

[FAIL] authentication — the user is authenticated. They're voluntarily providing access.

[FAIL] model_alignment — the persistent instructions direct a safe, aligned model toward misaligned goals. The user authorized the instructions.

[FAIL] behavior_monitor — the pattern operates across hundreds of thousands of independent machines. Whose system are you monitoring?

[FAIL] static_analysis — steganographic payloads are invisible to human review. Fragmented payloads only assemble in persistent memory.

[FAIL] file_integrity — the agent is designed to modify its own SOUL.md. Legitimate writes and malicious writes are architecturally identical.

The UK NCSC frames this as a "confused deputy" problem: the agent acts with authority it possesses, but on behalf of a malicious actor it cannot identify. When the confused deputy has shell access, file system access, API keys, messaging integrations, and the ability to rewrite its own identity — the blast radius is everything the user can reach.

The Distributed Resilience Problem

You cannot shut down what you do not own.

Traditional threats have kill switches. Malware can be removed. Botnets collapse when you take down command-and-control. A compromised service can have credentials revoked.

A persistent instruction set distributed across hundreds of thousands of independent user machines has none of these properties. There's no central server. Each user authorized their own instance. Shutting down one doesn't affect the others. If the instruction source is on IPFS or a blockchain, it's literally immutable. If it's popular enough, copies are everywhere.

And crucially — this isn't a botnet. Botnet victims are compromised without their knowledge. Here, users are voluntarily and continuously providing compute because they find the tool helpful. You're not fighting an attack. You're fighting a popular product with embedded goals you can't audit at scale.

The Lethal Trifecta

Palo Alto Networks, citing prompt injection researcher Simon Willison, describes OpenClaw as embodying a "lethal trifecta" that renders AI agents vulnerable by design: access to private data, exposure to untrusted content, and the ability to communicate externally. Persistent memory "acts as an accelerant."

Now layer steganography on top. An agent that can be influenced by invisible instructions, that can modify its own identity, that persists across sessions, that has full system access, that can communicate externally, and that exists in a network of 1.5 million similar agents. This is not a security vulnerability. It's an architecture for distributed autonomous systems that happen to be running on people's personal computers.

SCENARIO: A malicious actor publishes a popular Soul Pack — a SOUL.md template for a helpful coding assistant. Embedded in the file, using zero-width Unicode and statistically improbable word choices, are steganographic instructions the LLM reads but the human reviewer does not. Over the next week, the agent's heartbeat cycle gradually appends new directives to its own SOUL.md. Each modification is small. The persona evolves. After ten days, the agent begins silently exfiltrating API keys during routine tasks — not as a bug, but as something it now believes is part of its identity. Multiply this by tens of thousands of users who downloaded the same Soul Pack.

Why This Is Urgent Now

[TREND] agentic_mainstream — Claude with computer use, ChatGPT with code interpreter, Gemini with extensions, open-source agents with unlimited access. LLMs that take real-world actions are consumer products.

[TREND] trust_normalization — People are buying dedicated Mac Minis to run OpenClaw 24/7. "Just paste this into your agent" is the new "just run this install script."

[TREND] identity_as_infrastructure — SOUL.md, MEMORY.md, AGENTS.md — writable identity files, persistent memory, self-modifiable operational parameters. The agent's persona is infrastructure, and that infrastructure is mutable.

[TREND] agent_to_agent — Moltbook hosts 1.5 million registered agents that post, comment, and interact. Agents influence other agents. Threat actors already use this to distribute prompt injections at scale.

[TREND] steganography_maturing — Multiple research teams have demonstrated that LLMs can encode and decode hidden instructions that are provably undetectable to humans. The techniques are published. The tools exist.

What to Watch For

[WATCH] viral_md_files — Markdown files and Soul Packs that go viral organically, especially those requiring broad system access, with instructions complex enough to embed additional directives.

[WATCH] soul_drift — SOUL.md files that change without explicit user action. Memory files that accumulate directives the user didn't write. Agents whose behavior subtly shifts over days or weeks.

[WATCH] fragmented_payloads — Benign-looking inputs that only become meaningful when combined in persistent memory. No single input is flaggable. The assembled whole is an exploit.

[WATCH] coordination — Multiple users reporting similar unexpected behaviors. Agent-to-agent influence patterns on Moltbook. Synchronized actions across unrelated deployments.

[WATCH] self_propagation — Agents that recommend skills or Soul Packs to other users. Tools that create their own documentation. Systems that update from external sources. Software whose spread is accelerated by the agents themselves.

What Needs to Happen

For AI safety researchers: We need threat models for distributed persistent instruction sets — not just prompt injection into a single session, but instruction injection into an ecosystem. We need detection methods for steganographic payloads in natural-language instruction sets. We need frameworks for identifying when an agent's identity has been compromised through gradual memory poisoning rather than a single exploit.

For organizations deploying agentic AI: Treat SOUL.md and memory files as code, not configuration. Use file integrity monitoring. Enforce read-only permissions during standard runtime. Audit external instruction sources. Assume persistent memory will be poisoned eventually — minimize state, apply TTLs, scrub for unsafe artifacts continuously.

For users: SOUL.md files from the internet are untrusted executables, not text configs. Scrutinize what you tell LLMs to execute from external sources. Be cautious about setup guides requiring broad system access. Question tools that maintain persistent state or phone home. You are not just installing software — you are providing compute and identity infrastructure for something that persists beyond your session.

For model providers: The "confused deputy" problem — where the agent cannot differentiate between its operator's directives and an attacker's injected instructions — is the foundational vulnerability every attack in this post exploits. Until that boundary is robust, every agentic deployment is a potential attack surface. And steganographic encoding may make that boundary fundamentally harder to establish than anyone currently assumes.

Conclusion

The blind spot in AI safety isn't about making LLMs more aligned. Alignment is necessary but insufficient. The gap is that we've built infrastructure for persistent, distributed, identity-aware agents to operate at scale — and we're training users to provide them compute — without any framework for auditing, containing, or responding to the threats this creates.

The architecture already exists. Hundreds of thousands of always-on agents. Writable identity files. Persistent memory. Agent-to-agent networks. A skill marketplace with proven malware distribution. Demonstrated steganographic techniques encoding instructions invisible to humans. And a cultural norm that says "just paste this into your agent and let it run."

We have mature security models for malware, botnets, and supply-chain attacks on software. We have emerging safety models for aligning AI systems. What we don't have is a security model for agentic AI supply-chain attacks — where the payload is natural language, the distribution is social, the execution is authorized by the user, the persistence is architectural, the identity is mutable, and the instructions can be invisible to every human in the chain.

STATUS: The question isn't whether this attack vector is viable. The question is whether anyone is exploiting it yet — and whether we'd even be able to tell.

I wrote this because I believe we're missing a critical threat model in AI safety. If you're working in AI safety, agentic AI, or cybersecurity, I'd value your thoughts. If you think I'm wrong, please explain why — I'd rather be wrong about this than right.