Model-Layer Attacks

A family of three named attack classes that target the deployed model rather than the agent’s surrounding orchestration. Each recovers something the model owner did not intend to expose: weights and architecture (extraction); training-data records (inversion); or training-set membership (membership inference). Listed under MAAIS Layer 3 (Model Security) and MITRE ATLAS adversarial techniques. Defensive primitives overlap heavily across the three, so the wiki treats them on one page.

The three attack classes

Model Extraction

Recovering a black-box model’s parameters, architecture, or function via repeated queries. Two flavors:

  • Functional extraction — clone the input-output mapping; the attacker ends up with a model that behaves like the target without recovering exact weights. Tramèr et al. (2016) showed extraction of decision-tree, logistic-regression, and neural-network models from public ML-as-a-service APIs.
  • Architectural extraction — recover hyperparameters and architecture details (layer count, activations, hidden sizes). Often a precursor to functional extraction.

Attack cost ranges from thousands to millions of queries, depending on model complexity and defenses.
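
A minimal functional-extraction sketch, assuming a hypothetical `target_predict` query interface; the hidden model, probe distribution, and budget below are synthetic stand-ins, not a real API:

```python
# Functional extraction in the spirit of Tramèr et al. (2016): clone a black-box
# classifier from query access alone. `target_predict` is a hypothetical stand-in
# for a remote prediction API; the attacker never sees the model behind it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Model owner's side: a hidden classifier (its parameters are the secret).
X_owner = rng.normal(size=(1000, 10))
y_owner = (X_owner[:, 0] + X_owner[:, 1] > 0).astype(int)
_secret = LogisticRegression(max_iter=1000).fit(X_owner, y_owner)

def target_predict(X):
    """The only interface the attacker has: labels for chosen inputs."""
    return _secret.predict(X)

# Attacker's side: spend a query budget on probes, train a surrogate on the answers.
budget = 5_000
probes = rng.normal(size=(budget, 10))
surrogate = LogisticRegression(max_iter=1000).fit(probes, target_predict(probes))

# Fidelity: how often the clone agrees with the target on fresh inputs.
test = rng.normal(size=(2000, 10))
fidelity = (surrogate.predict(test) == target_predict(test)).mean()
print(f"surrogate agrees with target on {fidelity:.1%} of held-out inputs")
```

Tramèr et al. also showed that returning confidence scores rather than bare labels shrinks the required query budget dramatically, which is why output abstraction appears in the defenses table below.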

MITRE ATLAS techniques: AML.T0024.002 (Extract ML Model); typically staged via AML.T0040 (ML Model Inference API Access) or AML.T0044 (Full ML Model Access).

Model Inversion

Reconstructing training-data records from model outputs. Fredrikson et al. demonstrated recovery of patients' genetic markers from a pharmacogenetic dosing model (2014) and of recognizable face images from a face-recognition classifier (2015). Modern instances target large language models, extracting verbatim training text via carefully crafted prompts (Carlini et al. 2021, "Extracting Training Data from Large Language Models").
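
A toy white-box rendering of confidence-guided inversion (Fredrikson et al.'s black-box attack estimates the same gradient numerically through the confidence API); the "recognizer" and its hidden prototype are synthetic assumptions:

```python
# Model inversion sketch: gradient-ascend an input to maximize a class's
# confidence, recovering an approximation of the class's training prototype.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
template = rng.uniform(0.2, 0.8, size=64)            # hidden class prototype ("the face")
pos = template + 0.05 * rng.normal(size=(200, 64))   # training records of the target class
neg = rng.uniform(0, 1, size=(200, 64))              # everything else
model = LogisticRegression(max_iter=2000).fit(
    np.vstack([pos, neg]), np.r_[np.ones(200), np.zeros(200)])

# Attack: start from a blank input and climb d/dx log p(y=1 | x) = (1 - p) * w,
# staying inside the valid input range [0, 1].
w, b = model.coef_[0], model.intercept_[0]
x = np.zeros(64)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    x = np.clip(x + 0.05 * (1.0 - p) * w, 0.0, 1.0)

# The reconstruction tracks the hidden prototype well above chance.
print(f"corr(template, reconstruction) = {np.corrcoef(template, x)[0, 1]:.2f}")
```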

For agentic AI specifically, inversion attacks against RAG-grounded agents can recover the corpus contents: an attacker who can query the agent extensively can reconstruct individual retrieved documents, even if direct corpus access is blocked.
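
A deliberately simplified sketch of that reconstruction loop, with a toy prefix-matching "agent" standing in for the verbatim-continuation behavior of a real RAG pipeline; the corpus, interface, and leak rate are all illustrative assumptions:

```python
# Toy RAG-corpus reconstruction: the attacker never reads the corpus directly,
# only queries the agent, yet recovers a document verbatim by feeding each
# answer back as the next query.
CORPUS = [
    "patient 4711 diagnosis: hypertension, prescribed lisinopril 10mg",
    "api key rotation schedule: quarterly, owner: platform team",
]

def agent_answer(query: str, k: int = 8) -> str:
    """Simplified agent: continue the best prefix-matching document by k chars."""
    for doc in CORPUS:
        if doc.startswith(query) and len(query) < len(doc):
            return doc[len(query):len(query) + k]
    return ""

def reconstruct(seed: str, max_len: int = 200) -> str:
    """Attacker loop: extend a known fragment until the agent stops leaking."""
    text = seed
    while len(text) < max_len:
        chunk = agent_answer(text)
        if not chunk:
            break
        text += chunk
    return text

print(reconstruct("patient 4711"))  # full record recovered from query access alone
```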

MITRE ATLAS technique: AML.T0024.001 (Invert ML Model), a sub-technique of AML.T0024 (Exfiltration via ML Inference API).

Membership Inference

Determining whether a specific record was part of a model’s training set. Shokri et al. (2017) showed this is achievable with high accuracy against black-box ML-as-a-service models by training shadow models that mimic the target. Less data-revealing than inversion but more broadly applicable; even models defended with differential privacy at moderate ε often retain a measurable membership signal.
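
A minimal loss-threshold variant (Yeom et al. 2018), which replaces Shokri et al.'s shadow models with a single threshold on per-example loss; the model and data below are synthetic:

```python
# Membership inference via loss thresholding: training-set members tend to have
# lower loss under an overfit model than non-members do.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_in, y_in = X[:200], y[:200]      # members: the model's training set
X_out, y_out = X[200:], y[200:]    # non-members, drawn from the same distribution

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_in, y_in)

def per_example_loss(m, X, y):
    """Cross-entropy of the true label under the model."""
    p = m.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, 1.0))

loss_in = per_example_loss(model, X_in, y_in)
loss_out = per_example_loss(model, X_out, y_out)

# Guess "member" when loss falls below a threshold. A real attacker calibrates
# the threshold on shadow models (Shokri et al.); the midpoint suffices here.
tau = (loss_in.mean() + loss_out.mean()) / 2
print(f"members flagged: {(loss_in < tau).mean():.0%}   "
      f"non-members flagged: {(loss_out < tau).mean():.0%}")
```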

For agentic AI, membership-inference attacks against agent memory or session-derived fine-tuned models can confirm whether sensitive interactions occurred (e.g., “did this user ever discuss topic X with this agent?”).

MITRE ATLAS technique: AML.T0024.000 (Infer Training Data Membership), a sub-technique of AML.T0024 (Exfiltration via ML Inference API).

Why these matter for agentic AI

The wiki’s threat-modeling has prioritized prompt-injection-class threats (orchestration-layer compromise of agent behavior). Model-layer attacks are a separate threat surface that the wiki has under-treated:

  • Long-running agents accumulate query budget. Scope-3 / Scope-4 agents per the AWS Scoping Matrix make many queries during normal operation. An attacker who controls a fraction of those queries (e.g., via indirect prompt injection) can run extraction-style queries piggybacking on legitimate agent behavior.
  • RAG corpus is the new training data. RAG-grounded agents can be inverted to recover the corpus contents — a class of leak the wiki has not enumerated.
  • Agent memory is fine-tunable. Agents that fine-tune on session interactions inherit the membership-inference surface of the base model + the new attack surface of recovering session-derived records.
  • Multi-agent systems have privileged queriers. A compromised peer agent in an A2A mesh can run extraction queries against other agents at high volume without external rate limits applying.

Defenses

Defense                                                 | Extraction                        | Inversion | Membership inference
--------------------------------------------------------|-----------------------------------|-----------|---------------------
Rate limiting / query budgets                           | Strong                            | Moderate  | Moderate
DP-SGD at training                                      | Weak (function still extractable) | Strong    | Strong
Output randomization (Gaussian noise on logits)         | Strong                            | Weak      | Moderate
Output abstraction (return label only, not confidence)  | Moderate                          | Weak      | Strong
Watermarking (detection-only)                           | Detect, not prevent               | n/a       | n/a
Per-query monitoring + anomaly detection                | Strong (catches query patterns)   | Moderate  | Moderate
Model isolation / minimal-permission API                | Strong                            | Strong    | Strong

Of these, differential privacy is the only mechanism with a quantifiable privacy guarantee against inversion and membership inference. Everything else reduces attack success rate empirically without strong formal guarantees.
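
For concreteness, the mechanism behind that guarantee is per-example gradient clipping plus calibrated Gaussian noise (DP-SGD, Abadi et al. 2016). A minimal sketch of one update step for logistic regression follows; the names and hyperparameters are illustrative:

```python
# One DP-SGD step: clip each example's gradient to a fixed L2 norm, sum, add
# Gaussian noise scaled to that norm, then take an ordinary gradient step.
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    clipped = []
    for x, y in zip(X_batch, y_batch):
        p = 1.0 / (1.0 + np.exp(-(w @ x)))          # logistic prediction
        g = (p - y) * x                             # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / clip)  # bound each example's influence
        clipped.append(g)
    g_total = np.sum(clipped, axis=0)
    g_total += rng.normal(scale=noise_mult * clip, size=w.shape)  # calibrated noise
    return w - lr * g_total / len(X_batch)
```

Production training would use a library with privacy accounting, such as Opacus or TensorFlow Privacy, to track the cumulative (ε, δ) spent across steps rather than hand-rolling the mechanism.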

For extraction specifically, DP does not help directly: the attacker is recovering the function, not training records. Defense instead relies on rate limits, output randomization, and monitoring for the high-volume query patterns characteristic of extraction.
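
A sketch of how those controls compose at the serving layer; the budget, noise scale, and alert hook are illustrative assumptions, not recommended values:

```python
# Inference-side wrapper: per-client query budget, Gaussian noise on scores,
# label-only output, and a crude high-volume alert for extraction monitoring.
from collections import defaultdict
import numpy as np

BUDGET = 1_000        # max queries per client per window
ALERT_FRACTION = 0.8  # flag clients that burn 80% of their budget
NOISE_STD = 0.05      # noise scale applied to raw scores before argmax

_counts: dict[str, int] = defaultdict(int)
_rng = np.random.default_rng()

def serve(client_id: str, raw_scores: list[float]) -> int:
    _counts[client_id] += 1
    if _counts[client_id] > BUDGET:
        raise PermissionError("query budget exhausted")
    if _counts[client_id] > ALERT_FRACTION * BUDGET:
        print(f"ALERT: {client_id} nearing budget; possible extraction")  # monitoring hook
    noisy = np.asarray(raw_scores) + _rng.normal(scale=NOISE_STD, size=len(raw_scores))
    return int(np.argmax(noisy))  # output abstraction: return the label only
```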

Relation to wiki

  • CMM D4 (Runtime & Guardrails) — output randomization and per-query rate limits belong as L4 controls; long-window query-pattern monitoring (extraction detection) belongs at L5.
  • CMM D6 (Data, Memory & RAG) — DP-SGD on training/fine-tuning belongs at L4/L5; RAG-corpus inversion defenses (per-document query budgets, response abstraction) belong at L5+.
  • CMM D7 (Observability & Behavioral Monitoring) — query-pattern anomaly detection for extraction attempts belongs at L4.
  • MAAIS Layer 3 (Model Security) — explicitly names “model extraction, backdoor injections, and inversion attacks” as in-scope; the wiki adopts the same positioning.
  • Differential Privacy — primary defense against inversion and membership inference.
  • MITRE ATLAS — AML.T0024 (Exfiltration via ML Inference API) and its sub-techniques for membership inference, inversion, and extraction; AML.T0040 (ML Model Inference API Access); AML.T0044 (Full ML Model Access).

Provenance

  • Tramèr et al. (2016), Stealing Machine Learning Models via Prediction APIs — foundational extraction work.
  • Fredrikson et al. (2014), Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing, and Fredrikson et al. (2015), Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures — foundational inversion work.
  • Shokri et al. (2017), Membership Inference Attacks Against Machine Learning Models — foundational MIA work.
  • Carlini et al. (2021), Extracting Training Data from Large Language Models — modern LLM-specific inversion.
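  • Yeom et al. (2018), Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting — loss-threshold membership-inference variant used in the sketch above.
  • Abadi et al. (2016), Deep Learning with Differential Privacy — DP-SGD and its privacy accounting.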
  • The wiki’s enumeration here was prompted by MAAIS Layer 3 naming the three attack classes alongside backdoor injection.