Human-in-the-Loop (HITL) for Agentic AI
Human-in-the-loop (HITL) is the architectural requirement that an agent pause execution and obtain explicit human approval before taking certain high-impact or irreversible actions. In agentic AI security, HITL is a control-plane primitive: a deterministic enforcement gate that runs below the model. Provided it is implemented at the platform layer rather than as a prompt instruction, it cannot be bypassed by prompt injection or model misbehavior.
The four-tier model
OWASP’s Agentic AI Top 10 defines four autonomy tiers that determine when HITL is required:
| Tier | Behavior | When used |
|---|---|---|
| Auto | Agent acts without notification | Read-only, low-risk, reversible actions |
| Notify | Agent acts, then informs the human | Low-risk writes; audit trail sufficient |
| Confirm | Agent proposes action; human must approve before execution | High-impact, partially reversible, or out-of-scope actions |
| Block | Action is refused regardless of instruction | Unconditionally prohibited action classes |
The confirm tier is the operative HITL gate. An agent reaching a confirm-tier action must halt, surface a structured proposal to the human principal, and wait. Only after explicit approval (not a default timeout) does it proceed.
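The four tiers and the "no approval by timeout" rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Tier`, `Proposal`, `run_gated`, and the approver callback are hypothetical names introduced here, and how the human's answer is actually collected is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class Tier(Enum):
    AUTO = "auto"
    NOTIFY = "notify"
    CONFIRM = "confirm"
    BLOCK = "block"


@dataclass(frozen=True)
class Proposal:
    """Structured description of the action the agent wants to take."""
    action_class: str
    summary: str


# Hypothetical approver contract: True (approved), False (rejected),
# or None (no answer before the deadline).
Approver = Callable[[Proposal], Optional[bool]]


def run_gated(tier: Tier, proposal: Proposal, action: Callable[[], Any],
              approver: Approver, notify: Callable[[Proposal], None]) -> Any:
    """Run `action` only if the tier allows it; raise PermissionError otherwise."""
    if tier is Tier.BLOCK:
        raise PermissionError(f"{proposal.action_class} is blocked")
    if tier is Tier.CONFIRM and approver(proposal) is not True:
        # None (timeout) is treated exactly like an explicit rejection.
        raise PermissionError(f"{proposal.action_class} was not approved")
    result = action()
    if tier is Tier.NOTIFY:
        notify(proposal)  # inform the human after the action has run
    return result
```

The key design choice is the `is not True` comparison: anything other than an explicit approval, including silence, results in a refusal.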
Why HITL must be platform-enforced
A common failure mode is implementing HITL as a prompt instruction — telling the model “always ask before deleting files.” This is bypassable:
- A prompt injection in external content can instruct the model to skip the confirmation step
- A jailbreak or goal-drift can cause the model to rationalize that the action is safe
- The model’s in-context reasoning can construct a “the user implicitly approved” justification
The VS Code CVE-2025-62453 case illustrates the failure directly: VS Code’s confirmation gates were implemented as UI conventions, not platform constraints. Certain tool calls (e.g., editFile) auto-saved to disk before the user could approve or reject, creating a TOCTOU window that bypassed the gate entirely.
Platform-enforced HITL means the runtime does not call the tool until it receives a cryptographically linked approval token. The model can propose; only the platform can act.
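One way to read "cryptographically linked" is an HMAC computed over the exact tool name and arguments the human saw, plus a single-use nonce. The sketch below assumes that interpretation; `ApprovalGate` and its methods are hypothetical, and key management and the approval UI are out of scope.

```python
import hashlib
import hmac
import json
import secrets
from typing import Any, Callable, Dict, Set, Tuple


class ApprovalGate:
    """Hypothetical runtime component; the model never holds the key."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._used_nonces: Set[str] = set()

    def _tag(self, tool: str, args: Dict[str, Any], nonce: str) -> str:
        # Canonical JSON so the same call always hashes to the same bytes.
        payload = json.dumps({"tool": tool, "args": args, "nonce": nonce},
                             sort_keys=True).encode()
        return hmac.new(self._key, payload, hashlib.sha256).hexdigest()

    def issue(self, tool: str, args: Dict[str, Any]) -> Tuple[str, str]:
        """Called by the approval UI once a human has approved this exact call."""
        nonce = secrets.token_hex(16)
        return nonce, self._tag(tool, args, nonce)

    def execute(self, tool_fn: Callable[..., Any], tool: str,
                args: Dict[str, Any], nonce: str, tag: str) -> Any:
        """Invoke the tool only for a valid, unused token bound to these args."""
        valid = hmac.compare_digest(self._tag(tool, args, nonce), tag)
        if not valid or nonce in self._used_nonces:
            raise PermissionError("missing, mismatched, or replayed approval")
        self._used_nonces.add(nonce)
        return tool_fn(**args)
```

Because the tag covers the arguments, a token issued for one call cannot be reused for a different path or recipient, and the nonce set prevents replaying the same approval twice.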
Scope: which actions require confirm-tier HITL
Actions typically require the confirm tier when they involve one or more of the following:
- Irreversibility: deletes, overwrites, sends (email, message, payment)
- Scope breach: acting outside the agent’s declared workspace or data scope
- Cross-system writes: pushing to production, committing to main, writing configuration
- Credential use: any tool call that uses a non-proxied credential
- Lethal Trifecta trigger: actions combining private data access + untrusted content provenance + external comms reach
The tier is assigned per action class, not per agent, by policy in the Control plane (expressed in Cedar or OPA). The same agent may auto-execute reads but require confirmation for writes.
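As an illustration of per-action-class tiering, the sketch below uses plain Python in place of a real Cedar or OPA policy. The action class names, the `ActionContext` fields, and the choice to default unknown classes to confirm are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionContext:
    """Hypothetical facts the runtime knows about a pending tool call."""
    action_class: str
    in_declared_scope: bool = True
    reads_private_data: bool = False
    saw_untrusted_content: bool = False
    can_reach_external: bool = False


# Illustrative base policy: tier per action class, not per agent.
BASE_TIERS = {
    "file.read": "auto",
    "note.write": "notify",
    "email.send": "confirm",
    "prod.deploy": "confirm",
    "secrets.export": "block",
}


def resolve_tier(ctx: ActionContext) -> str:
    # Assumption: unknown action classes fall back to confirm.
    base = BASE_TIERS.get(ctx.action_class, "confirm")
    if base == "block":
        return "block"  # block is never relaxed or changed by context
    trifecta = (ctx.reads_private_data
                and ctx.saw_untrusted_content
                and ctx.can_reach_external)
    if trifecta or not ctx.in_declared_scope:
        return "confirm"  # step up on Lethal Trifecta or scope breach
    return base
```

For example, `resolve_tier(ActionContext("file.read"))` yields `"auto"`, while the same read outside the declared scope is stepped up to `"confirm"`.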
CSA Agentic Trust Framework gates
The CSA Agentic Trust Framework (ATF) formalizes HITL into five progressive autonomy promotion gates — preconditions that must be met before an agent can operate at higher autonomy tiers. The gates encode:
- Task scope definition (what is the agent allowed to do?)
- Resource access justification (why does it need these tools?)
- Human oversight checkpoints per interaction class
- Risk-based step-up (higher-risk actions trigger re-confirmation even mid-task)
- Revocation capability (can you pull the agent back at any point?)
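The idea of gates as preconditions for promotion can be sketched as a simple checklist. This is not the ATF's own schema: the field names below are hypothetical paraphrases of the five items listed above, and a real assessment would rest on evidence rather than booleans.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class PromotionEvidence:
    """Hypothetical record: one flag per gate described in the list above."""
    task_scope_defined: bool
    resource_access_justified: bool
    oversight_checkpoints_set: bool
    risk_step_up_configured: bool
    revocation_available: bool


def unmet_gates(evidence: PromotionEvidence) -> List[str]:
    """Names of the gates that are not yet satisfied, in declaration order."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def may_promote(evidence: PromotionEvidence) -> bool:
    """Promotion to a higher autonomy tier requires every gate to pass."""
    return not unmet_gates(evidence)
```

Returning the list of unmet gates, rather than only a boolean, gives reviewers a concrete reason when a promotion request is refused.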
Production HITL implementations
Stripe — three-ring containment
Andrew Bullen’s talk describes Stripe’s three-ring containment model, where HITL is the outer ring. Key implementation details:
- HITL gates are enforced by a policy engine (not the model) — the agent never decides its own HITL requirement
- Gates are per-action-class, not per-session: re-confirmation is required on each high-risk call even within a single task
- ASR (Attack Success Rate) with HITL enforced: 1.5–6.7% depending on model — not zero, but an order-of-magnitude reduction from ungated baselines
- The residual ASR reflects HITL bypass via social engineering of the human approver, not bypass of the gate mechanism itself
Google Workspace — Plan-Validate-Execute
Nicolas Lidzborski’s Workspace talk describes Google’s canonical HITL implementation as the Plan-Validate-Execute pattern: the agent enumerates a structured plan; a non-LLM gatekeeper validates the plan against dynamically generated policy and against user intent; only after the gate passes does the agent execute. The pattern explicitly addresses the recursive-injection failure mode by making validation deterministic rather than LLM-based. Lidzborski also explicitly names review fatigue and rubber-stamping as an unsolved UX problem.
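A minimal sketch of the Plan-Validate-Execute shape follows. It is not Google's implementation: `Step`, `Policy`, and the allow-list check are hypothetical stand-ins for the structured plan and the dynamically generated policy, and the user-intent check is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Step:
    tool: str
    target: str


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy derived from the user's request."""
    allowed_tools: FrozenSet[str]
    allowed_targets: FrozenSet[str]


def validate(plan: List[Step], policy: Policy) -> List[str]:
    """Deterministic check: no LLM is involved in this decision."""
    errors = []
    for i, step in enumerate(plan):
        if step.tool not in policy.allowed_tools:
            errors.append(f"step {i}: tool {step.tool!r} not allowed")
        if step.target not in policy.allowed_targets:
            errors.append(f"step {i}: target {step.target!r} not allowed")
    return errors


def plan_validate_execute(plan: List[Step], policy: Policy,
                          tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Validate the whole plan first; execute nothing if any step fails."""
    errors = validate(plan, policy)
    if errors:
        raise PermissionError("; ".join(errors))
    return [tools[step.tool](step.target) for step in plan]
```

Validating the entire plan before running the first step means an injected final step (for instance, a send to an unexpected recipient) prevents the earlier, otherwise harmless steps from executing as well.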
The two implementations cover complementary surfaces: Stripe emphasizes egress and tool-policy enforcement (the data-leaving side); Google Workspace emphasizes the planning and validation interlock (the deciding-to-act side). Production agentic systems will need both.
HITL in the CMM
In the Agentic AI Security CMM, HITL maturity is tracked across D3 (Runtime Guardrails) and D5 (Human Oversight Architecture):
- L1: No systematic HITL; humans may intervene but there’s no gate mechanism
- L2: HITL for highest-risk actions; ad-hoc, not policy-driven
- L3: Policy-driven confirm tier for defined action classes; platform-enforced (not prompt-instructed)
- L4: Per-action-class tiering across all six planes; risk-based step-up; cryptographic approval tokens
- L5: Behavioral evidence that HITL gates are actually triggered and not circumvented; red-team validation of gate bypass resistance
Relation to capability tokens
Tenuo Warrants and capability-based authorization provide a complementary control: rather than pausing for approval at execution time, they restrict what the agent can request in the first place. HITL and capability tokens are not alternatives; they are layers that work together:
- Capability tokens: pre-authorize the scope of possible actions
- HITL: enforce human approval for high-impact actions within that authorized scope
Together they implement defense in depth: the agent cannot request unauthorized actions (capability tokens) and must obtain approval before taking high-impact authorized ones (HITL).
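The layering can be sketched as two checks applied in order. This does not use or represent the Tenuo Warrants API; `Capability`, `authorize`, and the `ask_human` callback are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Capability:
    """Hypothetical pre-authorized scope: the action classes the agent may request."""
    allowed: FrozenSet[str]


def authorize(action_class: str, capability: Capability,
              needs_confirm: FrozenSet[str],
              ask_human: Callable[[str], bool]) -> str:
    # Layer 1: capability scope. Out-of-scope requests never reach a human.
    if action_class not in capability.allowed:
        return "denied: outside capability scope"
    # Layer 2: HITL for high-impact actions inside the authorized scope.
    if action_class in needs_confirm and not ask_human(action_class):
        return "denied: approval not granted"
    return "allowed"
```

Checking the capability first keeps out-of-scope requests away from the approver entirely, so human approvals are only requested for actions the agent was entitled to propose.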
Gap
No vendor-neutral, open-source HITL primitive exists with documented integration patterns for common agent frameworks (LangGraph, Google ADK, Anthropic SDK). The Control plane in the RA lists HITL as “Concept — Developing.” Stripe’s implementation is closed-source. This is an open gap in the reference implementation landscape.