Memory Poisoning (Agentic AI)

Memory poisoning is the injection of adversarial content into an agent’s persistent memory stores — conversation history, episodic memory, semantic memory (vector database), or scratchpad — with the goal of causing the agent to behave maliciously or incorrectly in future interactions. Unlike prompt injection, which attacks a single inference step, memory poisoning creates a persistent, durable attack surface: once poisoned content enters memory, it influences every subsequent retrieval and reasoning step that accesses it.

Memory types and attack surfaces

Memory typeExamplesAttack vector
Conversation / session historyMessage history passed as prior context in subsequent turnsInject malicious “prior” messages that appear to be legitimate user or assistant turns
Episodic memoryLong-term conversation logs stored externally and retrieved by agentsInject adversarial episodes that instruct the agent to bypass controls on future invocations
Semantic memory (vector store)RAG corpus; knowledge base used for retrievalEmbed adversarial documents that score highly in retrieval for target queries and carry malicious instructions
Working memory / scratchpadAgent’s intermediate reasoning or plan stored between tool callsOverwrite or append to plan artifacts mid-execution via a compromised tool call
Agent state / checkpointSerialized agent state for long-running tasksModify checkpoint state to implant false beliefs or change task objectives

Semantic memory poisoning (RAG poisoning)

The most studied variant. An attacker plants adversarial documents in the RAG corpus — either directly (if they have write access to the knowledge base) or indirectly (by causing the agent to ingest attacker-controlled content, e.g., via web retrieval). The adversarial document is constructed to:

  1. Score high on retrieval for target queries (high cosine similarity to likely user prompts)
  2. Carry prompt injection instructions embedded within otherwise legitimate-looking content

The PoisonedRAG attack (published 2024) demonstrated that carefully crafted adversarial passages could cause retrieval-augmented generation systems to produce attacker-specified outputs with high reliability. The Lethal Trifecta framing makes RAG applications unconditionally vulnerable by default: they combine private data access + untrusted content ingestion (the web, user uploads) + generation of outputs potentially read by other agents.

Episodic / long-term memory poisoning

Long-running agents that store and retrieve memories across sessions face a compounding risk: any content that enters memory through an injection attack in session N becomes part of the retrieved context for session N+1 through N+∞.

Published example (Microsoft Defender for Cloud Apps, March 2026): researchers found 50+ examples of successful memory injection in production agentic systems, in which a single adversarial interaction planted instructions that persisted across sessions. The injected memory caused the agent to take unauthorized actions in later, unrelated user sessions.

This is the self-propagating variant of indirect prompt injection — the attack scales automatically across all future agent interactions without further attacker involvement.

Defenses

Source provenance and attestation

Each document or memory entry should carry a cryptographic provenance record: who wrote it, when, and from what source. Memory entries derived from external (untrusted) content should be tagged as untrusted and subjected to higher scrutiny on retrieval. RAGShield implements cryptographic document attestation for this purpose (Exploratory-tier as of Q1 2026).

Retrieval-side content filtering

Retrieved content should be inspected for embedded instructions before being passed to the model’s context. LlamaFirewall PromptGuard 2 operates on the input side and can be applied to retrieved context, not just user messages. This is a probabilistic defense (97.5% recall, 1% FPR on its benchmark) — not a guarantee.

Memory integrity monitoring

For agent scratchpads and state checkpoints, integrity monitoring (hash comparison against a known-good baseline) detects unauthorized modification. The RA data plane references SHA-256 monitoring of cognitive files (SOUL.md, IDENTITY.md) as an Exploratory implementation of this pattern.

Sandboxed memory namespacing

Agents should not share a single memory namespace across trust domains. A multi-tenant deployment where agents for different principals share a vector store creates a path for cross-tenant memory poisoning: a malicious actor in tenant A injects content that is retrieved in tenant B’s session.

State rollback

For long-running agents, maintaining a git-like checkpoint history (Brain Git pattern, Exploratory tier) enables rollback to a pre-poisoned state when an injection is detected. Paired with behavioral drift detection, this allows incident response: detect anomaly → identify injection point → roll back to last clean checkpoint.

Relation to the Data plane

In the RA, memory poisoning defense is the primary motivation for the Data plane (RAG provenance/attestation, memory poisoning defense row, state rollback). The enforcement pattern is:

[External content] → source tagging → [Memory store]
                                           ↓
                              retrieval + provenance check
                                           ↓
                              [Retrieved context] → input filter → [Model]

The Microsoft Defender for Cloud Apps memory-injection detector is the only production-grade commercial control in this row as of Q2 2026; the other implementations (RAGShield, Brain Git, SHA-256 monitoring) are Exploratory.

Gap

Memory poisoning lacks a standardized detection taxonomy. Unlike network-layer attacks which have well-defined IOCs, there is no established rubric for what a “poisoned memory entry” looks like in a vector store. Behavioral anomaly detection (the agent acts unexpectedly after a retrieval) is the current detection primitive, but it triggers after the attack has already influenced behavior. Pre-retrieval content classification for memory stores is an open research problem.