Model-Layer Attacks
A family of three attack classes that target the deployed model itself rather than the agent’s surrounding orchestration. Each recovers something the model owner did not intend to expose: weights and architecture (extraction), training-data records (inversion), or training-set membership (membership inference). All three fall under MAAIS Layer 3 (Model Security) and map to MITRE ATLAS adversarial techniques. Because the defensive primitives overlap heavily across the three, the wiki treats them on one page.
The three attack classes
Model Extraction
Recovering a black-box model’s parameters, architecture, or function via repeated queries. Two flavors:
- Functional extraction — clone the input-output mapping; the attacker ends up with a model that behaves like the target without recovering exact weights. Tramèr et al. (2016) showed extraction of decision-tree, logistic-regression, and neural-network models from public ML-as-a-service APIs.
- Architectural extraction — recover hyperparameters and architecture details (layer count, activations, hidden sizes). Often a precursor to functional extraction.
Attack cost ranges from thousands to millions of queries, depending on model complexity and defenses.
MITRE ATLAS techniques: AML.T0040 (ML Model Inference API Access) and AML.T0044 (Full ML Model Access).
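To make the functional-extraction loop concrete, here is a minimal sketch: sample inputs, label them with the target model’s own predictions, and fit a surrogate on the resulting pairs. The `query_target` stub, the sampling domain, and the surrogate architecture are illustrative assumptions, not a reference to any specific API.

```python
# Minimal functional-extraction sketch: clone a black-box classifier's
# input-output mapping by training a surrogate on its predictions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def query_target(x_batch):
    """Placeholder for the victim's prediction API; returns predicted labels."""
    raise NotImplementedError("replace with real API calls")

def extract_surrogate(n_queries=10_000, n_features=20):
    # 1. Sample synthetic inputs from the (assumed) feature domain.
    x_query = np.random.uniform(-1.0, 1.0, size=(n_queries, n_features))
    # 2. Label them with the target model's own predictions.
    y_query = query_target(x_query)
    # 3. Fit a surrogate that mimics the observed input-output mapping.
    surrogate = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    surrogate.fit(x_query, y_query)
    return surrogate
```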
Model Inversion
Reconstructing training data records from model outputs. Fredrikson et al. (2014, 2015) demonstrated face-image recovery from a face-recognition model and recovery of private medical-record values from a regression model. Modern instances target large language models — extracting training-set verbatim text by carefully crafted prompts (Carlini et al. 2021, “Extracting Training Data from Large Language Models”).
For agentic AI specifically, inversion attacks against RAG-grounded agents can recover the corpus contents: an attacker who can query the agent extensively can reconstruct individual retrieved documents, even if direct corpus access is blocked.
MITRE ATLAS techniques: AML.T0048 (Erode Dataset); related to AML.T0024 (Exfiltration via ML Inference API).
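As an illustration of the confidence-exploiting inversion described above, the sketch below gradient-ascends a candidate input to maximize the model’s confidence for a chosen class, in the spirit of Fredrikson et al. (2015). It assumes gradient access to the model (a white-box simplification of the query-only setting); the model, input shape, and regularization weight are placeholders.

```python
# Sketch of confidence-based model inversion: optimize an input so the
# model's confidence for the target class is maximized, recovering a
# class-representative reconstruction.
import torch

def invert_class(model, target_class, input_shape=(1, 3, 64, 64),
                 steps=500, lr=0.1):
    model.eval()
    x = torch.zeros(input_shape, requires_grad=True)  # start from a blank input
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)
        # Maximize target-class log-probability, with a small L2 penalty
        # to keep the reconstruction well-behaved.
        loss = -torch.log_softmax(logits, dim=1)[0, target_class] \
               + 1e-3 * x.pow(2).sum()
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)  # keep values in a valid (e.g., pixel) range
    return x.detach()
```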
Membership Inference
Determining whether a specific record was part of a model’s training set. Shokri et al. (2017) showed this is achievable against ML-as-a-service models with high accuracy by training shadow models. Less data-revealing than inversion but more broadly applicable; even models trained with differential-privacy defenses often leak a measurable membership signal at moderate privacy budgets (ε).
For agentic AI, membership-inference attacks against agent memory or session-derived fine-tuned models can confirm whether sensitive interactions occurred (e.g., “did this user ever discuss topic X with this agent?”).
MITRE ATLAS technique: similar surface to extraction; AML.T0024 (Exfiltration via ML Inference API).
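A minimal illustration of the membership signal: the loss-threshold attack below flags records the model fits unusually well as likely training members. It is a simplified stand-in for Shokri et al.’s shadow-model attack; the model, loss function, and threshold calibration are assumptions.

```python
# Loss-threshold membership inference: training members tend to have
# lower loss under the target model than non-members.
import torch
import torch.nn.functional as F

@torch.no_grad()
def membership_scores(model, inputs, labels):
    """Per-example loss; lower loss => stronger membership signal."""
    model.eval()
    logits = model(inputs)
    return F.cross_entropy(logits, labels, reduction="none")

def infer_membership(model, inputs, labels, threshold):
    # `threshold` would be calibrated on data known to be non-members,
    # e.g. the attacker's own held-out samples or shadow-model outputs.
    losses = membership_scores(model, inputs, labels)
    return losses < threshold  # True => predicted training-set member
```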
Why these matter for agentic AI
The wiki’s threat-modeling has prioritized prompt-injection-class threats (orchestration-layer compromise of agent behavior). Model-layer attacks are a separate threat surface that the wiki has under-treated:
- Long-running agents accumulate query budget. Scope-3 / Scope-4 agents per the AWS Scoping Matrix make many queries during normal operation. An attacker who controls a fraction of those queries (e.g., via indirect prompt injection) can piggyback extraction-style queries on legitimate agent traffic.
- RAG corpus is the new training data. RAG-grounded agents can be inverted to recover the corpus contents — a class of leak the wiki has not enumerated.
- Agent memory is fine-tunable. Agents that fine-tune on session interactions inherit the membership-inference surface of the base model plus the new attack surface of recovering session-derived records.
- Multi-agent systems have privileged queriers. A compromised peer agent in an A2A mesh can run extraction queries against other agents at high volume without external rate limits applying.
Defenses
| Defense | Extraction | Inversion | Membership inference |
|---|---|---|---|
| Rate limiting / query budgets | Strong | Moderate | Moderate |
| DP-SGD at training | Weak (function still extractable) | Strong | Strong |
| Output randomization (Gaussian noise on logits) | Strong | Weak | Moderate |
| Output abstraction (return label only, not confidence) | Moderate | Weak | Strong |
| Watermarking (detection-only) | Detect, not prevent | n/a | n/a |
| Per-query monitoring + anomaly detection | Strong (catches query patterns) | Moderate | Moderate |
| Model isolation / minimal-permission API | Strong | Strong | Strong |
Of these, differential privacy is the only mechanism with a quantifiable privacy guarantee against inversion and membership inference. Everything else reduces attack success rate empirically without strong formal guarantees.
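As a sketch of what DP-SGD training looks like in practice, assuming PyTorch with the Opacus library; the toy model, data, and the noise/clipping hyperparameters below are placeholders, not recommendations:

```python
# DP-SGD sketch with Opacus: per-sample gradient clipping plus Gaussian
# noise, with the accountant reporting the epsilon spent.
import torch
from torch import nn
from opacus import PrivacyEngine

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(1024, 20),
                                   torch.randint(0, 2, (1024,))),
    batch_size=64,
)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,   # Gaussian noise added to clipped per-sample grads
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    for x, y in data_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

# Privacy accountant: epsilon spent for a chosen delta.
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))
```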
For extraction specifically, DP doesn’t help directly — the attacker is recovering the function, not training records. Defense relies on rate limits + output randomization + monitoring for high-volume query patterns characteristic of extraction.
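A rough sketch of those inference-side controls combined in one serving wrapper (per-client query budget, Gaussian noise on logits, optional label-only output). The class and parameter names are illustrative, not from any particular framework:

```python
# Inference-side defenses from the table: query budgets, output
# randomization, and output abstraction.
import numpy as np
from collections import defaultdict

class QueryGuard:
    def __init__(self, model_fn, query_budget=1000, noise_sigma=0.1,
                 label_only=False):
        self.model_fn = model_fn          # callable: inputs -> logits
        self.query_budget = query_budget  # max queries per client
        self.noise_sigma = noise_sigma    # std of Gaussian noise on logits
        self.label_only = label_only      # abstract output to a label only
        self.counts = defaultdict(int)

    def serve(self, client_id, x):
        self.counts[client_id] += 1
        if self.counts[client_id] > self.query_budget:
            raise PermissionError("query budget exhausted")
        logits = np.asarray(self.model_fn(x), dtype=float)
        logits += np.random.normal(0.0, self.noise_sigma, size=logits.shape)
        if self.label_only:
            return int(np.argmax(logits))  # no confidence scores returned
        return logits
```

High-volume query-pattern monitoring would sit outside this wrapper, feeding the per-client counts into anomaly detection.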
Relation to wiki
- CMM D4 (Runtime & Guardrails) — output randomization and per-query rate limits belong as L4 controls; long-window query-pattern monitoring (extraction detection) belongs at L5.
- CMM D6 (Data, Memory & RAG) — DP-SGD on training/fine-tuning belongs at L4/L5; RAG-corpus inversion defenses (per-document query budgets, response abstraction) belong at L5+.
- CMM D7 (Observability & Behavioral Monitoring) — query-pattern anomaly detection for extraction attempts belongs at L4.
- MAAIS Layer 3 (Model Security) — explicitly names “model extraction, backdoor injections, and inversion attacks” as in-scope; the wiki adopts the same positioning.
- Differential Privacy — primary defense against inversion and membership inference.
- MITRE ATLAS — AML.T0024 (Exfiltration via ML Inference API), AML.T0040 (ML Model Inference API Access), AML.T0044 (Full ML Model Access), AML.T0048 (Erode Dataset).
Provenance
- Tramèr et al. (2016), Stealing Machine Learning Models via Prediction APIs — foundational extraction work.
- Fredrikson et al. (2014, 2015), Privacy in Pharmacogenetics and Model Inversion Attacks That Exploit Confidence Information — foundational inversion work.
- Shokri et al. (2017), Membership Inference Attacks Against Machine Learning Models — foundational MIA work.
- Carlini et al. (2021), Extracting Training Data from Large Language Models — modern LLM-specific inversion.
- The wiki’s enumeration here was prompted by MAAIS Layer 3 naming the three attack classes alongside backdoor injection.