Human-in-the-Loop (HITL) for Agentic AI

Human-in-the-loop (HITL) is the architectural requirement that an agent pause execution and obtain explicit human approval before taking certain high-impact or irreversible actions. In the context of agentic AI security, HITL is a control-plane primitive — a deterministic enforcement gate that runs below the model and cannot be bypassed by prompt injection or model misbehavior, provided it is implemented at the platform layer rather than as a prompt instruction.

The four-tier model

OWASP’s Agentic AI Top 10 defines four autonomy tiers that determine when HITL is required:

Tier | Behavior | When used
Auto | Agent acts without notification | Read-only, low-risk, reversible actions
Notify | Agent acts, then informs the human | Low-risk writes; audit trail sufficient
Confirm | Agent proposes action; human must approve before execution | High-impact, partially reversible, or out-of-scope actions
Block | Action is refused regardless of instruction | Unconditionally prohibited action classes

The confirm tier is the operative HITL gate. An agent reaching a confirm-tier action must halt, surface a structured proposal to the human principal, and wait. Only after explicit approval (not a default timeout) does it proceed.
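The tier semantics above can be sketched as a small dispatcher. This is a minimal illustration, not a reference implementation: the names Tier, Outcome, run_action, ask_human, and notify are all hypothetical. The one property it is meant to show is that a missing answer (None, for example after a timeout) is never treated as approval.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Tier(Enum):
    AUTO = "auto"
    NOTIFY = "notify"
    CONFIRM = "confirm"
    BLOCK = "block"


@dataclass
class Outcome:
    executed: bool
    result: Optional[str] = None


def run_action(
    tier: Tier,
    action: Callable[[], str],
    proposal: str,
    ask_human: Callable[[str], Optional[bool]],
    notify: Callable[[str], None],
) -> Outcome:
    """Run `action` according to its tier (hypothetical helper)."""
    if tier is Tier.BLOCK:
        # Refused regardless of what the human or the model says.
        return Outcome(executed=False)
    if tier is Tier.CONFIRM:
        # ask_human returns True, False, or None (no answer / timeout).
        decision = ask_human(proposal)
        if decision is not True:
            # Rejection and timeout both mean "do not act".
            return Outcome(executed=False)
    result = action()
    if tier is Tier.NOTIFY:
        # Inform the human after the fact.
        notify(f"done: {proposal}")
    return Outcome(executed=True, result=result)
```

In a real system ask_human would block on an out-of-band approval channel; here it is just a callable so the control flow is visible.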

Why HITL must be platform-enforced

A common failure mode is implementing HITL as a prompt instruction — telling the model “always ask before deleting files.” This is bypassable:

  • A prompt injection in external content can instruct the model to skip the confirmation step
  • A jailbreak or goal-drift can cause the model to rationalize that the action is safe
  • The model’s in-context reasoning can construct a “the user implicitly approved” justification

The VS Code CVE-2025-62453 case illustrates the failure directly: VS Code’s confirmation gates were implemented as UI conventions, not platform constraints. Certain tool calls (e.g., editFile) auto-saved to disk before the user could approve or reject, creating a time-of-check-to-time-of-use (TOCTOU) window that bypassed the gate entirely.

Platform-enforced HITL means the runtime does not call the tool until it receives an approval token that is cryptographically linked to the proposed action. The model can propose; only the platform can act.
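One way to link an approval to an action is an HMAC over a digest of the exact tool call. The sketch below assumes a symmetric key held only by the runtime and a single-use nonce per approval; PLATFORM_KEY, issue_approval, and execute_gated are hypothetical names, and a production design might well use asymmetric signatures and persistent nonce storage instead.

```python
import hashlib
import hmac
import json
import secrets

# Hypothetical: held by the runtime, never placed in the model's context.
PLATFORM_KEY = secrets.token_bytes(32)
_used_nonces = set()  # in-memory only; a real runtime would persist this


def action_digest(tool: str, args: dict) -> str:
    """Hash a canonical encoding of the exact tool call."""
    canonical = json.dumps({"tool": tool, "args": args},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def _tag(nonce: str, digest: str) -> str:
    msg = f"{nonce}:{digest}".encode()
    return hmac.new(PLATFORM_KEY, msg, hashlib.sha256).hexdigest()


def issue_approval(tool: str, args: dict) -> dict:
    """Called by the approval UI after a human approves this exact call."""
    nonce = secrets.token_hex(16)
    return {"nonce": nonce, "tag": _tag(nonce, action_digest(tool, args))}


def execute_gated(tool: str, args: dict, token: dict, tools: dict):
    """Run the tool only if the token matches this call and is unused."""
    expected = _tag(token["nonce"], action_digest(tool, args))
    if not hmac.compare_digest(expected, token["tag"]):
        raise PermissionError("approval token does not match this action")
    if token["nonce"] in _used_nonces:
        raise PermissionError("approval token already used")
    _used_nonces.add(token["nonce"])
    return tools[tool](**args)
```

Because the tag covers the tool name and arguments, an approval for deleting one path cannot be replayed against another, and the nonce check forces re-confirmation on every call.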

Scope: which actions require confirm-tier HITL

Actions that typically require the confirm tier include any combination of:

  • Irreversibility: deletes, overwrites, sends (email, message, payment)
  • Scope breach: acting outside the agent’s declared workspace or data scope
  • Cross-system writes: pushing to production, committing to main, writing configuration
  • Credential use: any tool call that uses a non-proxied credential
  • Lethal Trifecta trigger: actions combining private data access + untrusted content provenance + external comms reach

The tier is assigned per action class (not per agent), via a Cedar or OPA policy in the Control plane. The same agent may auto-execute reads but require confirm for writes.
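As a rough illustration of per-action-class tiering, the criteria above can be written as a pure function. This is a Python stand-in, not Cedar or OPA syntax, and the ActionClass fields and the assign_tier name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionClass:
    """Hypothetical attributes a policy might evaluate for an action class."""
    name: str
    read_only: bool = False
    prohibited: bool = False
    irreversible: bool = False
    outside_declared_scope: bool = False
    cross_system_write: bool = False
    uses_raw_credential: bool = False
    reads_private_data: bool = False
    untrusted_content_in_context: bool = False
    can_reach_external: bool = False


def assign_tier(a: ActionClass) -> str:
    """Map an action class to a tier name; the agent never calls this itself."""
    if a.prohibited:
        return "block"
    # Lethal Trifecta: all three conditions present at once.
    lethal_trifecta = (a.reads_private_data
                       and a.untrusted_content_in_context
                       and a.can_reach_external)
    if (a.irreversible or a.outside_declared_scope or a.cross_system_write
            or a.uses_raw_credential or lethal_trifecta):
        return "confirm"
    if a.read_only:
        return "auto"
    # Remaining low-risk writes: act, then inform.
    return "notify"
```

With this shape, the same agent gets "auto" for a read-only listing and "confirm" for an email send, because the decision depends on the action class rather than on who is calling.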

CSA Agentic Trust Framework gates

The CSA Agentic Trust Framework (ATF) formalizes HITL into five progressive autonomy promotion gates — preconditions that must be met before an agent can operate at higher autonomy tiers. The gates encode:

  1. Task scope definition (what is the agent allowed to do?)
  2. Resource access justification (why does it need these tools?)
  3. Human oversight checkpoints per interaction class
  4. Risk-based step-up (higher-risk actions trigger re-confirmation even mid-task)
  5. Revocation capability (can you pull the agent back at any point?)
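The five preconditions listed above could be recorded as a simple checklist that must be fully satisfied before an agent is promoted. The field and function names below are hypothetical and are not taken from the ATF itself; the sketch only shows the all-gates-must-pass shape.

```python
from dataclasses import dataclass, fields


@dataclass
class PromotionEvidence:
    """One flag per gate in the list above (hypothetical field names)."""
    task_scope_defined: bool
    resource_access_justified: bool
    oversight_checkpoints_mapped: bool
    risk_step_up_configured: bool
    revocation_tested: bool


def unmet_gates(evidence: PromotionEvidence) -> list:
    """Return the names of gates that are not yet satisfied."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def may_promote(evidence: PromotionEvidence) -> bool:
    """Promotion to a higher autonomy tier requires every gate to pass."""
    return not unmet_gates(evidence)
```

Returning the list of unmet gates, rather than a bare boolean, gives reviewers something concrete to act on when a promotion is refused.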

Production HITL implementations

Stripe — three-ring containment

Andrew Bullen’s talk describes Stripe’s three-ring containment model, where HITL is the outer ring. Key implementation details:

  • HITL gates are enforced by a policy engine (not the model) — the agent never decides its own HITL requirement
  • Gates are per-action-class, not per-session: re-confirmation is required on each high-risk call even within a single task
  • ASR (Attack Success Rate) with HITL enforced: 1.5–6.7% depending on model — not zero, but an order-of-magnitude reduction from ungated baselines
  • The residual ASR reflects HITL bypass via social engineering of the human approver, not bypass of the gate mechanism itself

Google Workspace — Plan-Validate-Execute

Nicolas Lidzborski’s Workspace talk describes Google’s canonical HITL implementation as the Plan-Validate-Execute pattern: the agent enumerates a structured plan; a non-LLM gatekeeper validates the plan against dynamically generated policy and against user intent; only after the gate passes does the agent execute. The pattern explicitly addresses the recursive-injection failure mode by making validation deterministic rather than LLM-based. Lidzborski is also explicit about the review fatigue / rubber-stamping UX gap as an unsolved problem.
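A minimal sketch of the Plan-Validate-Execute shape, assuming a plan is a list of tool-and-target steps and the policy is a static allowlist. The talk describes dynamically generated policy and an intent check, which this sketch does not attempt; Step, Policy, validate, and plan_validate_execute are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    target: str


@dataclass(frozen=True)
class Policy:
    allowed_tools: frozenset
    allowed_target_prefixes: tuple


def validate(plan: list, policy: Policy) -> list:
    """Deterministic check with no LLM involved; empty list means pass."""
    violations = []
    for i, step in enumerate(plan):
        if step.tool not in policy.allowed_tools:
            violations.append(f"step {i}: tool {step.tool!r} not allowed")
        # str.startswith accepts a tuple of prefixes.
        if not step.target.startswith(policy.allowed_target_prefixes):
            violations.append(f"step {i}: target {step.target!r} out of scope")
    return violations


def plan_validate_execute(plan: list, policy: Policy, tools: dict) -> list:
    """Validate the whole plan first; execute nothing if any step fails."""
    violations = validate(plan, policy)
    if violations:
        raise PermissionError("; ".join(violations))
    return [tools[step.tool](step.target) for step in plan]
```

Validating the full plan before the first step runs means an injected instruction that appends an exfiltration step causes the whole plan to be rejected, rather than being caught after earlier steps have already executed.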

The two implementations cover complementary surfaces: Stripe emphasizes egress and tool-policy enforcement (the data-leaving side); Google Workspace emphasizes the planning and validation interlock (the deciding-to-act side). Real production agentic systems will need both.

HITL in the CMM

In the Agentic AI Security CMM, HITL maturity is tracked across D3 (Runtime Guardrails) and D5 (Human Oversight Architecture):

  • L1: No systematic HITL; humans may intervene but there’s no gate mechanism
  • L2: HITL for highest-risk actions; ad-hoc, not policy-driven
  • L3: Policy-driven confirm tier for defined action classes; platform-enforced (not prompt-instructed)
  • L4: Per-action-class tiering across all six planes; risk-based step-up; cryptographic approval tokens
  • L5: Behavioral evidence that HITL gates are actually triggered and not circumvented; red-team validation of gate bypass resistance

Relation to capability tokens

Tenuo Warrants and capability-based authorization take a different approach: rather than pausing for approval at execution time, they restrict what the agent can request in the first place. HITL and capability tokens are not alternatives; they are complementary layers:

  • Capability tokens: pre-authorize the scope of possible actions
  • HITL: enforce human approval for high-impact actions within that authorized scope

Together they implement defense in depth: the agent cannot request unauthorized actions (capability tokens) and must obtain approval before taking high-impact authorized ones (HITL).
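The layering can be sketched as two checks applied in order: capability scope first, then the HITL gate. This is an illustration only and does not reflect the Tenuo Warrants API; authorize and its parameters are hypothetical.

```python
def authorize(tool: str, capabilities: frozenset,
              confirm_tools: frozenset, approved: bool) -> str:
    """Apply the capability check, then the HITL check (hypothetical helper)."""
    # Layer 1: capability tokens bound what the agent may request at all.
    if tool not in capabilities:
        return "denied: outside capability scope"
    # Layer 2: HITL gates high-impact actions inside that scope.
    if tool in confirm_tools and not approved:
        return "pending: human approval required"
    return "allowed"
```

Note that human approval cannot widen the capability scope: a request outside the granted capabilities is denied even when approved is True.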

Gap

No vendor-neutral, open-source HITL primitive exists with documented integration patterns for common agent frameworks (LangGraph, Google ADK, Anthropic SDK). The Control plane in the RA lists HITL as “Concept — Developing.” Stripe’s implementation is closed-source. This is an open gap in the reference implementation landscape.