Human-in-the-Loop (HITL) for Agentic AI

Human-in-the-loop (HITL) is the architectural requirement that an agent pause execution and obtain explicit human approval before taking certain high-impact or irreversible actions. In the context of agentic AI security, HITL is a control-plane primitive — a deterministic enforcement gate that runs below the model and cannot be bypassed by prompt injection or model misbehavior, provided it is implemented at the platform layer rather than as a prompt instruction.

The four-tier model

OWASP’s Agentic AI Top 10 defines four autonomy tiers that determine when HITL is required:

Tier	Behavior	When used
Auto	Agent acts without notification	Read-only, low-risk, reversible actions
Notify	Agent acts, then informs the human	Low-risk writes; audit trail sufficient
Confirm	Agent proposes action; human must approve before execution	High-impact, partially reversible, or out-of-scope actions
Block	Action is refused regardless of instruction	Unconditionally prohibited action classes

The confirm tier is the operative HITL gate. An agent reaching a confirm-tier action must halt, surface a structured proposal to the human principal, and wait. Only after explicit approval (not a default timeout) does it proceed.
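
The tier semantics above can be sketched as a small dispatcher. This is a minimal illustration, not a reference implementation: the names Tier, ActionRefused, run_gated, ask_human, and notify are hypothetical, and ask_human is assumed to return True on approval, False on rejection, and None on timeout.

```python
from enum import Enum
from typing import Any, Callable, Optional


class Tier(Enum):
    AUTO = "auto"
    NOTIFY = "notify"
    CONFIRM = "confirm"
    BLOCK = "block"


class ActionRefused(Exception):
    """Raised when the gate declines to run an action."""


def run_gated(
    tier: Tier,
    action: Callable[[], Any],
    ask_human: Callable[[], Optional[bool]],
    notify: Callable[[Any], None],
) -> Any:
    """Run `action` according to its tier (hypothetical helper)."""
    if tier is Tier.BLOCK:
        raise ActionRefused("action class is unconditionally blocked")
    if tier is Tier.CONFIRM:
        decision = ask_human()
        # Only an explicit True proceeds; a rejection (False) and a
        # timeout (None) are both treated as refusal.
        if decision is not True:
            raise ActionRefused("no explicit human approval")
    result = action()
    if tier is Tier.NOTIFY:
        notify(result)  # inform the human after the fact
    return result
```

The design point is the `is not True` comparison: a timeout never defaults to approval.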

Why HITL must be platform-enforced

A common failure mode is implementing HITL as a prompt instruction — telling the model “always ask before deleting files.” This is bypassable:

A prompt injection in external content can instruct the model to skip the confirmation step
A jailbreak or goal-drift can cause the model to rationalize that the action is safe
The model’s in-context reasoning can construct a “the user implicitly approved” justification

The VS Code CVE-2025-62453 case illustrates the failure directly: VS Code’s confirmation gates were implemented as UI conventions, not platform constraints. Certain tool calls (e.g., editFile) auto-saved to disk before the user could approve or reject, creating a time-of-check to time-of-use (TOCTOU) window that bypassed the gate entirely.

Platform-enforced HITL means the runtime does not call the tool until it receives an approval token that is cryptographically linked to the proposed action. The model can propose; only the platform can act.
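
One way to bind an approval to a specific proposed call is an HMAC over a canonical encoding of the tool name and arguments. The sketch below assumes a symmetric key held only by the runtime; PLATFORM_KEY, issue_approval, and execute are hypothetical names, and a production design would also need expiry, replay protection, and key management, none of which are shown.

```python
import hashlib
import hmac
import json
from typing import Any, Callable, Dict

# Assumption: a secret known only to the platform runtime and approval UI.
PLATFORM_KEY = b"demo-key-held-by-the-runtime-only"


def _canonical(tool: str, args: Dict[str, Any]) -> bytes:
    """Stable byte encoding of the exact call being approved."""
    payload = {"tool": tool, "args": args}
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()


def issue_approval(tool: str, args: Dict[str, Any]) -> str:
    """Called by the approval channel after a human approves this exact call."""
    return hmac.new(PLATFORM_KEY, _canonical(tool, args), hashlib.sha256).hexdigest()


def execute(tool: str, args: Dict[str, Any], token: str,
            tools: Dict[str, Callable[..., Any]]) -> Any:
    """Run the tool only if the token matches the call as it will be executed."""
    expected = issue_approval(tool, args)
    if not hmac.compare_digest(expected, token):
        raise PermissionError("approval token does not match this call")
    return tools[tool](**args)
```

Because the token is recomputed from the arguments at execution time, an approval for one call cannot be reused for a call whose arguments changed after the human looked at it.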

Scope: which actions require confirm-tier HITL

Actions that typically require the confirm tier have one or more of the following properties:

Irreversibility: deletes, overwrites, sends (email, message, payment)
Scope breach: acting outside the agent’s declared workspace or data scope
Cross-system writes: pushing to production, committing to main, writing configuration
Credential use: any tool call that uses a non-proxied credential
Lethal Trifecta trigger: actions combining private data access + untrusted content provenance + external comms reach

The tier is assigned per action class (not per agent) by policy in the Control plane, expressed for example in Cedar or OPA. The same agent may auto-execute reads but require confirmation for writes.
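
Per-action-class tiering can be illustrated with a plain lookup table plus escalation rules. This is a Python stand-in for what would normally live in a Cedar or OPA policy, not an example of either language; the action-class names, the CallContext fields, and the choice to default unknown classes to confirm are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

TIERS = ["auto", "notify", "confirm", "block"]  # ordered least to most strict

# Hypothetical action classes and their baseline tiers.
POLICY = {
    "fs.read": "auto",
    "fs.write": "notify",
    "fs.delete": "confirm",
    "email.send": "confirm",
    "prod.deploy": "confirm",
    "iam.grant_admin": "block",
}


@dataclass
class CallContext:
    in_scope: bool = True            # inside the declared workspace or data scope
    raw_credential: bool = False     # uses a non-proxied credential
    private_data: bool = False       # touches private data
    untrusted_input: bool = False    # influenced by untrusted content
    external_comms: bool = False     # can reach external parties


def tier_for(action_class: str, ctx: Optional[CallContext] = None) -> str:
    ctx = ctx or CallContext()
    # Assumption: unknown action classes fail closed to confirm.
    tier = POLICY.get(action_class, "confirm")
    trifecta = ctx.private_data and ctx.untrusted_input and ctx.external_comms
    escalate = (not ctx.in_scope) or ctx.raw_credential or trifecta
    # Escalation can raise a tier to confirm but never lowers block.
    if escalate and TIERS.index(tier) < TIERS.index("confirm"):
        tier = "confirm"
    return tier
```

The lookup is keyed by action class, so the same agent gets "auto" for fs.read and "confirm" for fs.delete without any per-agent configuration.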

CSA Agentic Trust Framework gates

The CSA Agentic Trust Framework (ATF) formalizes HITL into five progressive autonomy promotion gates — preconditions that must be met before an agent can operate at higher autonomy tiers. The gates encode:

Task scope definition (what is the agent allowed to do?)
Resource access justification (why does it need these tools?)
Human oversight checkpoints per interaction class
Risk-based step-up (higher-risk actions trigger re-confirmation even mid-task)
Revocation capability (can you pull the agent back at any point?)

Production HITL implementations

Stripe — three-ring containment

Andrew Bullen’s talk describes Stripe’s three-ring containment model, where HITL is the outer ring. Key implementation details:

HITL gates are enforced by a policy engine (not the model) — the agent never decides its own HITL requirement
Gates are per-action-class, not per-session: re-confirmation is required on each high-risk call even within a single task
ASR (Attack Success Rate) with HITL enforced: 1.5–6.7% depending on model — not zero, but an order-of-magnitude reduction from ungated baselines
The residual ASR reflects HITL bypass via social engineering of the human approver, not bypass of the gate mechanism itself

Google Workspace — Plan-Validate-Execute

Nicolas Lidzborski’s Workspace talk describes Google’s canonical HITL implementation as the Plan-Validate-Execute pattern: the agent enumerates a structured plan; a non-LLM gatekeeper validates the plan against dynamically generated policy and against user intent; only after the gate passes does the agent execute. The pattern explicitly addresses the recursive-injection failure mode by making validation deterministic rather than LLM-based. Lidzborski also flags review fatigue and rubber-stamping by human approvers as an unsolved UX problem.
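
The shape of Plan-Validate-Execute can be sketched in a few lines. This is a generic illustration of the pattern as described above, not Google's implementation: Step, validate, plan_validate_execute, and the toy policy (an allowlist of tools plus a target prefix) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set


@dataclass(frozen=True)
class Step:
    tool: str
    target: str


def validate(plan: List[Step], allowed_tools: Set[str], allowed_prefix: str) -> List[str]:
    """Deterministic, non-LLM check. Returns a list of violations (empty if clean)."""
    violations = []
    for i, step in enumerate(plan):
        if step.tool not in allowed_tools:
            violations.append(f"step {i}: tool {step.tool!r} is not allowed")
        if not step.target.startswith(allowed_prefix):
            violations.append(f"step {i}: target {step.target!r} is out of scope")
    return violations


def plan_validate_execute(plan: List[Step], allowed_tools: Set[str], allowed_prefix: str,
                          tools: Dict[str, Callable[[str], Any]]) -> List[Any]:
    # The whole plan is validated before any step runs, so a bad later
    # step cannot ride in behind harmless earlier ones.
    violations = validate(plan, allowed_tools, allowed_prefix)
    if violations:
        raise PermissionError("; ".join(violations))
    return [tools[step.tool](step.target) for step in plan]
```

Because validate is ordinary code with no model call, injected text in a document the agent read cannot talk the gatekeeper into passing a plan.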

The oversight interface is itself a threat surface. OWASP Agentic AI Threats and Mitigations names Overwhelming Human-in-the-Loop (T10): inducing decision fatigue or compromising the approval interface so a human rubber-stamps a malicious action. Its Playbook 5 (protecting HITL and preventing decision-fatigue exploits) treats approval-volume throttling, structured high-signal proposals, and tamper-evident approval channels as controls, not UX polish.
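
Approval-volume throttling, one of the controls named above, can be approximated with a sliding-window limiter in front of the approval channel. The sketch is an assumption about how such a control might look, not something taken from the OWASP playbook; ApprovalThrottle and its parameters are hypothetical.

```python
from collections import deque
from typing import Deque


class ApprovalThrottle:
    """Caps how many approval prompts reach a human within a time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._sent: Deque[float] = deque()  # timestamps of prompts shown

    def allow(self, now: float) -> bool:
        """Return True if a prompt may be shown at time `now` (seconds)."""
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_requests:
            return False  # hold the request; never auto-approve it
        self._sent.append(now)
        return True
```

A request that is throttled should be queued or refused. Treating it as approved would turn the control into the bypass it is meant to prevent.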

The two implementations cover complementary surfaces: Stripe emphasizes egress and tool-policy enforcement (the data-leaving side); Google Workspace emphasizes the planning and validation interlock (the deciding-to-act side). Real production agentic systems will need both.

HITL in the CMM

In the Agentic AI Security CMM, HITL maturity is tracked across D3 (Runtime Guardrails) and D5 (Human Oversight Architecture):

L1: No systematic HITL; humans may intervene but there’s no gate mechanism
L2: HITL for highest-risk actions; ad-hoc, not policy-driven
L3: Policy-driven confirm tier for defined action classes; platform-enforced (not prompt-instructed)
L4: Per-action-class tiering across all six planes; risk-based step-up; cryptographic approval tokens
L5: Behavioral evidence that HITL gates are actually triggered and not circumvented; red-team validation of gate bypass resistance

Relation to capability tokens

Tenuo Warrants and capability-based authorization take a different approach: rather than pausing for approval at execution time, they restrict what the agent can request in the first place. HITL and capability tokens are not alternatives; they are complementary layers:

Capability tokens: pre-authorize the scope of possible actions
HITL: enforce human approval for high-impact actions within that authorized scope

Together they implement defense in depth: an agent that cannot request unauthorized actions (capability tokens) and must obtain approval before taking high-impact authorized ones (HITL).
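
The layering can be shown as two checks applied in order. This is a schematic only and does not reflect the Tenuo Warrants API or any real token format; authorize and its set-based arguments are hypothetical.

```python
from typing import Set


def authorize(action: str, granted: Set[str], needs_confirm: Set[str],
              human_approved: bool) -> str:
    """Layer 1: capability scope. Layer 2: HITL for high-impact actions."""
    if action not in granted:
        # Outside the pre-authorized scope: no approval prompt is even raised.
        return "deny"
    if action in needs_confirm and not human_approved:
        # In scope but high-impact: wait for an explicit human decision.
        return "hold"
    return "allow"
```

For example, with granted = {"fs.read", "email.send"} and needs_confirm = {"email.send"}, a read is allowed, a send is held until approved, and a deploy is denied outright, so the human only sees prompts for actions the agent was entitled to request.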

Gap

No vendor-neutral, open-source HITL primitive exists with documented integration patterns for common agent frameworks (LangGraph, Google ADK, Anthropic SDK). The Control plane in the RA lists HITL as “Concept — Developing.” Stripe’s implementation is closed-source. This is an open gap in the reference implementation landscape.
