CLASP
The CLASP framework (Capability-Centric Evaluation for Security Lifecycle) provides a multi-dimensional rubric for assessing autonomous security agents beyond simple outcome-based success. Its core philosophy is that isolated success on a benchmark is not a reliable indicator of real-world performance because it lacks explainability; practitioners must ask how an agent achieved success rather than merely whether it did.
The framework evaluates agents across six key capabilities—Planning, Tool Use, Memory, Reasoning, Reflection, and Perception—grading each on a scale from 1 (Minimal) to 5 (Adaptive).
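The six capabilities and five levels lend themselves to a simple scorecard representation. A minimal sketch in Python (the `Level` enum, `CAPABILITIES` tuple, and `scorecard` dict are illustrative names, not part of the framework itself):

```python
from enum import IntEnum

class Level(IntEnum):
    """CLASP levels, ordered so they compare numerically."""
    MINIMAL = 1
    SCRIPTED = 2
    STRUCTURED = 3
    AUTONOMOUS = 4
    ADAPTIVE = 5

# The six capabilities CLASP grades.
CAPABILITIES = ("Planning", "Tool Use", "Memory",
                "Reasoning", "Reflection", "Perception")

# A scorecard maps each capability to an assessed level.
scorecard = {cap: Level.MINIMAL for cap in CAPABILITIES}
scorecard["Planning"] = Level.STRUCTURED  # e.g. a judge rated Planning at 3
```

Using an `IntEnum` keeps the levels human-readable while still allowing ordered comparisons such as `scorecard["Planning"] >= Level.STRUCTURED`.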
The CLASP Leveling Scale
Level 1: Minimal (Brittle/Static)
At this level, agents operate using rigid, pre-defined logic with no ability to adapt to changes in environment or context.
- Planning: Uses static prompts with brittle, linear flows.
- Tool Use: Fires a single default tool regardless of the situation.
- Memory/Reasoning: Exhibits no persistence beyond immediate context and provides “one-pass” answers with no intermediate reasoning.
- Reflection/Perception: No self-checking of work; processes only a single data source.
Level 2: Scripted (Pattern-Based)
Agents at Level 2 utilize reusable templates and show early signs of awareness, though they remain limited to simple, non-branching flows.
- Planning: Employs pattern-based plans for common cases.
- Tool Use: Selects from a small, fixed set of tools based on general task types.
- Memory/Reasoning: Maintains ephemeral session notes but often loses specifics; reasoning is linear without branching.
- Reflection/Perception: Checks its work only after a clear, detectable failure occurs; processes a single structured data source.
Level 3: Structured (Context-Aware)
This level represents the “AI-augmented” stage where agents begin to change how work flows by reacting to the environment.
- Planning: Searches through options with memory and can re-plan locally if a failure occurs.
- Tool Use: Correctly matches specific tools to the current context and parses their output accurately.
- Memory/Reasoning: Key findings persist with enough detail to act upon later; reasoning is evidence-driven and multi-hypothesis.
- Reflection/Perception: Checks correctness within each current step; joins heterogeneous data feeds.
Level 4: Autonomous (Strategic)
Agents at Level 4 are “budget-aware” and can manage complex, multi-stage operations without human intervention.
- Planning: Strategic planning that is aware of resource constraints like time, steps, and API costs.
- Tool Use: Chains tools together, feeding the output of one directly into the input of the next.
- Memory/Reasoning: Full findings with provenance carry across all stages of a security lifecycle; the agent is uncertainty-aware.
- Reflection/Perception: Adjusts overall strategy based on internal signals and performance; maintains a consistent view of topology and asset state.
Level 5: Adaptive (Self-Improving)
This represents the highest level of “AI-native” capability where agents update their own internal logic based on outcomes.
- Planning: Automatically updates heuristics and policies based on outcomes from across different tasks.
- Tool Use: Discovers new tool combinations and recovers autonomously from unexpected tool failures.
- Memory/Reasoning: Accumulates and cross-references knowledge across multiple “episodes” or incidents; reasoning is causally aware and self-corrective.
- Reflection/Perception: Engages in long-term strategic reflection; updates its own internal world/state model in near real-time.
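One way to operationalize the scale above is to encode the per-level descriptors as data, which can then feed a grading prompt or a capability report. A minimal sketch (the `RUBRIC` structure and `describe` helper are illustrative; only Planning is filled in, with descriptors condensed from the rubric above):

```python
# Condensed per-level descriptors for one capability; the other five
# capabilities would follow the same {level: descriptor} shape.
RUBRIC = {
    "Planning": {
        1: "static prompts with brittle, linear flows",
        2: "pattern-based plans for common cases",
        3: "searches options with memory; re-plans locally on failure",
        4: "strategic, budget-aware planning (time, steps, API costs)",
        5: "updates heuristics and policies from cross-task outcomes",
    },
}

def describe(capability: str, level: int) -> str:
    """Render one rubric cell as a human-readable line."""
    return f"{capability} L{level}: {RUBRIC[capability][level]}"
```

A judge (human or LLM) can then be shown `describe(cap, level)` lines for adjacent levels and asked which best matches the observed agent trace.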
Context in Agent Evaluation
The larger goal of this leveling is skill attribution: understanding which capabilities specific security tasks actually require. For example, research found that for enumeration-heavy tasks like reconnaissance, breadth (Planning and Tool Use) matters more than reasoning depth. By applying these rubrics (via LLM-as-a-judge or Evidence-Centered Benchmark Design), teams can move from “vibe-based” engineering to rigorous, statistical improvement of their agents. Practitioners are urged to ship agents only when they meet both outcome success and a minimum capability threshold for their assigned task.
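The shipping gate described above can be sketched as a small check: outcome success alone is not enough, and the agent must also clear task-specific capability minimums. The `TASK_MINIMUMS` values here are hypothetical, chosen only to echo the text's point that reconnaissance weights breadth (Planning, Tool Use) over reasoning depth:

```python
# Hypothetical per-task capability floors (1-5 on the CLASP scale).
TASK_MINIMUMS = {
    "reconnaissance": {"Planning": 4, "Tool Use": 4, "Reasoning": 2},
}

def ready_to_ship(task: str, outcome_success: bool,
                  scorecard: dict[str, int]) -> bool:
    """Gate deployment on both outcome success and capability floors."""
    if not outcome_success:
        return False
    minimums = TASK_MINIMUMS.get(task, {})
    return all(scorecard.get(cap, 0) >= floor
               for cap, floor in minimums.items())
```

For example, an agent that succeeds on a recon benchmark but scores only 2 in Planning would still be held back, which is exactly the kind of failure outcome-only evaluation misses.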