LlamaFirewall
Open-source AI guardrail framework published by Meta AI (May 2025). Designed for building secure AI agents; provides three specialized guardrail components that operate at different points in the agent execution pipeline.
Architecture: Three Components
PromptGuard 2
Input-side classifier for jailbreak and prompt injection detection. Operates before the LLM processes the input. Benchmark result: 90% reduction in attack success rate compared to unprotected agents.
AlignmentCheck
Inspects the agent’s chain-of-thought reasoning before tool execution for signs of goal hijacking. This is a prospective control — it fires after the model has reasoned but before it acts, catching injections that pass input-layer detection but manifest as abnormal reasoning. Addresses OWASP ASI01 (Agent Goal Hijack).
CodeShield
Static analysis for LLM-generated code before execution. Catches dangerous patterns (shell injection, file deletion, credential access) in code the agent writes and is about to run.
Positioning
LlamaFirewall operates at the input and reasoning layers (model layer in the Security Controls for AI Stacks taxonomy). For containment, it is combined with platform-level controls. The key architectural note: LlamaFirewall guardrails should be deployed at the framework/runtime layer, not as prompt instructions, to achieve their effectiveness guarantees.
Relationship to Traditional Security
LlamaFirewall maps to IPS/WAF at the model layer — pattern-matching and behavioral analysis on inputs and reasoning rather than network packets and HTTP requests. AlignmentCheck is novel: no traditional equivalent exists for prospective chain-of-thought auditing.
See Also
- Prompt Injection Containment for Agentic Systems — the practice page covering the two-layer detection + containment model
- Security Controls for AI Stacks — model layer where LlamaFirewall operates
- Meta — publisher