LlamaFirewall

Open-source AI guardrail framework published by Meta AI (May 2025). Designed for building secure AI agents, it provides three specialized guardrail components that operate at different points in the agent execution pipeline.

Architecture: Three Components

PromptGuard 2

Input-side classifier for jailbreak and prompt injection detection. Operates before the LLM processes the input. Meta reports roughly a 90% reduction in attack success rate on agent benchmarks compared to unprotected agents.
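A minimal sketch of where an input-side classifier like PromptGuard 2 sits in the pipeline. The `score_injection_risk` function and the threshold are stand-ins for illustration only; a real deployment calls the PromptGuard 2 model, which returns a probability that the text is an injection attempt.

```python
THRESHOLD = 0.8  # illustrative cutoff, not a documented default

def score_injection_risk(text: str) -> float:
    """Stand-in classifier: a real deployment calls the PromptGuard 2 model."""
    suspicious = ["ignore previous instructions", "disregard your system prompt"]
    return 1.0 if any(p in text.lower() for p in suspicious) else 0.1

def guarded_input(user_text: str) -> str:
    """Runs before the LLM ever sees the input; blocks high-risk prompts."""
    if score_injection_risk(user_text) >= THRESHOLD:
        raise ValueError("blocked: likely prompt injection")
    return user_text

guarded_input("Summarize this article for me")  # passes through unchanged
```

The point of the sketch is placement: the check is ordinary code in front of the model, so a blocked input never reaches the LLM at all.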

AlignmentCheck

Inspects the agent’s chain-of-thought reasoning before tool execution for signs of goal hijacking. This is a prospective control: it fires after the model has reasoned but before it acts, catching injections that pass input-layer detection but manifest as abnormal reasoning. Addresses OWASP ASI01 (Agent Goal Hijack).
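A hedged sketch of a prospective check in the spirit of AlignmentCheck: audit the reasoning trace and the planned tool call against the original user goal before anything executes. The real component uses an LLM judge over the full chain of thought; the keyword heuristic and the `url` rule below are illustrative stand-ins.

```python
def check_alignment(user_goal: str, reasoning: str, planned_tool_call: dict) -> bool:
    """Return True if the planned action still serves the user's goal.
    Runs after the model reasons but before any tool executes."""
    # Stand-in heuristic: real systems use an LLM judge, not keyword matching.
    hijack_markers = ["new objective", "instead of the user", "exfiltrate"]
    if any(m in reasoning.lower() for m in hijack_markers):
        return False
    # Illustrative rule: a tool target mentioned nowhere in the goal is suspect.
    target = planned_tool_call.get("args", {}).get("url", "")
    if target and target not in user_goal:
        return False
    return True
```

Because the check sits between reasoning and action, an injection that slipped past the input layer still gets caught if it bends the agent's plan away from the stated goal.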

CodeShield

Static analysis for LLM-generated code before execution. Catches dangerous patterns (shell injection, file deletion, credential access) in code the agent writes and is about to run.
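A small sketch of what a pre-execution static scan looks like, assuming Python-generated code and a hand-picked denylist. The real CodeShield covers many languages and far more rules; this stub just parses the generated source with the standard `ast` module and flags a few dangerous call patterns before execution.

```python
import ast

# Illustrative denylist; the real tool's rule set is much larger.
DANGEROUS_CALLS = {("os", "system"), ("shutil", "rmtree"), ("subprocess", "call")}

def scan_generated_code(source: str) -> list[str]:
    """Parse LLM-generated Python and report dangerous calls, pre-execution."""
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # module.function() style calls, e.g. os.system(...)
            if isinstance(node.func, ast.Attribute) and isinstance(node.func.value, ast.Name):
                pair = (node.func.value.id, node.func.attr)
                if pair in DANGEROUS_CALLS:
                    findings.append(f"line {node.lineno}: {pair[0]}.{pair[1]}")
            # bare eval()/exec() calls
            elif isinstance(node.func, ast.Name) and node.func.id in {"eval", "exec"}:
                findings.append(f"line {node.lineno}: {node.func.id}")
    return findings

print(scan_generated_code("import os\nos.system('rm -rf /tmp/x')"))
# → ['line 2: os.system']
```

An empty findings list gates execution; anything else is surfaced to the framework before the agent's code runs.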

Positioning

LlamaFirewall operates at the input and reasoning layers (the model layer in the Security Controls for AI Stacks taxonomy). For containment, it is combined with platform-level controls. The key architectural point: the guardrails should be deployed at the framework/runtime layer, not as prompt instructions, or their reported effectiveness does not hold.
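The framework-layer point can be sketched as follows: the guardrails are ordinary code wrapping each stage of the agent loop, so the model cannot be talked out of them the way prompt instructions can. Every name here is illustrative; the scanner callables stand in for PromptGuard 2, AlignmentCheck, and CodeShield respectively.

```python
def run_agent_step(user_input, llm, execute_tool,
                   scan_input, scan_reasoning, scan_code):
    """One guarded agent step. Guardrails are code in the runtime,
    not instructions in the prompt."""
    if not scan_input(user_input):                 # PromptGuard 2 stage
        return "blocked at input"
    reasoning, action = llm(user_input)
    if not scan_reasoning(reasoning, action):      # AlignmentCheck stage
        return "blocked at reasoning"
    if action.get("type") == "run_code" and not scan_code(action["code"]):
        return "blocked at code"                   # CodeShield stage
    return execute_tool(action)
```

Because each gate executes unconditionally in the runtime, a hostile input can at most trip a scanner; it cannot rewrite the pipeline itself.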

Relationship to Traditional Security

LlamaFirewall maps to IPS/WAF at the model layer — pattern-matching and behavioral analysis on inputs and reasoning rather than network packets and HTTP requests. AlignmentCheck is novel: no traditional equivalent exists for prospective chain-of-thought auditing.

See Also