Model-Layer Attacks

A family of three named attack classes that target the deployed model rather than the agent’s surrounding orchestration. Each recovers something the model owner did not intend to expose: weights and architecture (extraction); training-data records (inversion); or training-set membership (membership inference). Listed under MAAIS Layer 3 (Model Security) and MITRE ATLAS adversarial techniques. Defensive primitives overlap heavily across the three, so the wiki treats them on one page.

The three attack classes

Model Extraction

Recovering a black-box model’s parameters, architecture, or function via repeated queries. Two flavors:

  • Functional extraction — clone the input-output mapping; the attacker ends up with a model that behaves like the target without recovering exact weights. Tramèr et al. (2016) showed extraction of decision-tree, logistic-regression, and neural-network models from public ML-as-a-service APIs.
  • Architectural extraction — recover hyperparameters and architecture details (layer count, activations, hidden sizes). Often a precursor to functional extraction.

Attack cost ranges from thousands to millions of queries, depending on model complexity and defenses.
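
A minimal functional-extraction sketch, assuming a hypothetical `target_predict` query interface; the hidden model, probe distribution, and budget below are synthetic stand-ins, not a real API:

```python
# Functional extraction in the spirit of Tramèr et al. (2016): clone a black-box
# classifier from query access alone. `target_predict` is a hypothetical stand-in
# for a remote prediction API; the attacker never sees the model behind it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Model owner's side: a hidden classifier (its parameters are the secret).
X_owner = rng.normal(size=(1000, 10))
y_owner = (X_owner[:, 0] + X_owner[:, 1] > 0).astype(int)
_secret = LogisticRegression(max_iter=1000).fit(X_owner, y_owner)

def target_predict(X):
    """The only interface the attacker has: labels for chosen inputs."""
    return _secret.predict(X)

# Attacker's side: spend a query budget on probes, train a surrogate on the answers.
budget = 5_000
probes = rng.normal(size=(budget, 10))
surrogate = LogisticRegression(max_iter=1000).fit(probes, target_predict(probes))

# Fidelity: how often the clone agrees with the target on fresh inputs.
test = rng.normal(size=(2000, 10))
fidelity = (surrogate.predict(test) == target_predict(test)).mean()
print(f"surrogate agrees with target on {fidelity:.1%} of held-out inputs")
```

Tramèr et al. also showed that returning confidence scores rather than bare labels shrinks the required query budget dramatically, which is why output abstraction appears in the defenses table below.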

MITRE ATLAS techniques: AML.T0024.002 (Extract ML Model); typically staged via AML.T0040 (ML Model Inference API Access) or AML.T0044 (Full ML Model Access).

Model Inversion

Reconstructing training-data records from model outputs. Fredrikson et al. demonstrated recovery of patients' genetic markers from a pharmacogenetic dosing model (2014) and of recognizable face images from a face-recognition classifier (2015). Modern instances target large language models, extracting verbatim training text via carefully crafted prompts (Carlini et al. 2021, "Extracting Training Data from Large Language Models").
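
A toy white-box rendering of confidence-guided inversion (Fredrikson et al.'s black-box attack estimates the same gradient numerically through the confidence API); the "recognizer" and its hidden prototype are synthetic assumptions:

```python
# Model inversion sketch: gradient-ascend an input to maximize a class's
# confidence, recovering an approximation of the class's training prototype.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
template = rng.uniform(0.2, 0.8, size=64)            # hidden class prototype ("the face")
pos = template + 0.05 * rng.normal(size=(200, 64))   # training records of the target class
neg = rng.uniform(0, 1, size=(200, 64))              # everything else
model = LogisticRegression(max_iter=2000).fit(
    np.vstack([pos, neg]), np.r_[np.ones(200), np.zeros(200)])

# Attack: start from a blank input and climb d/dx log p(y=1 | x) = (1 - p) * w,
# staying inside the valid input range [0, 1].
w, b = model.coef_[0], model.intercept_[0]
x = np.zeros(64)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    x = np.clip(x + 0.05 * (1.0 - p) * w, 0.0, 1.0)

# The reconstruction tracks the hidden prototype well above chance.
print(f"corr(template, reconstruction) = {np.corrcoef(template, x)[0, 1]:.2f}")
```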

For agentic AI specifically, inversion attacks against RAG-grounded agents can recover the corpus contents: an attacker who can query the agent extensively can reconstruct individual retrieved documents, even if direct corpus access is blocked.
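
A deliberately simplified sketch of that reconstruction loop, with a toy prefix-matching "agent" standing in for the verbatim-continuation behavior of a real RAG pipeline; the corpus, interface, and leak rate are all illustrative assumptions:

```python
# Toy RAG-corpus reconstruction: the attacker never reads the corpus directly,
# only queries the agent, yet recovers a document verbatim by feeding each
# answer back as the next query.
CORPUS = [
    "patient 4711 diagnosis: hypertension, prescribed lisinopril 10mg",
    "api key rotation schedule: quarterly, owner: platform team",
]

def agent_answer(query: str, k: int = 8) -> str:
    """Simplified agent: continue the best prefix-matching document by k chars."""
    for doc in CORPUS:
        if doc.startswith(query) and len(query) < len(doc):
            return doc[len(query):len(query) + k]
    return ""

def reconstruct(seed: str, max_len: int = 200) -> str:
    """Attacker loop: extend a known fragment until the agent stops leaking."""
    text = seed
    while len(text) < max_len:
        chunk = agent_answer(text)
        if not chunk:
            break
        text += chunk
    return text

print(reconstruct("patient 4711"))  # full record recovered from query access alone
```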

MITRE ATLAS technique: AML.T0024.001 (Invert ML Model), a sub-technique of AML.T0024 (Exfiltration via ML Inference API).

Membership Inference

Determining whether a specific record was part of a model’s training set. Shokri et al. (2017) showed this is achievable with high accuracy against black-box ML-as-a-service models by training shadow models that mimic the target. Less data-revealing than inversion but more broadly applicable; even models defended with differential privacy at moderate ε often retain a measurable membership signal.
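
A minimal loss-threshold variant (Yeom et al. 2018), which replaces Shokri et al.'s shadow models with a single threshold on per-example loss; the model and data below are synthetic:

```python
# Membership inference via loss thresholding: training-set members tend to have
# lower loss under an overfit model than non-members do.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_in, y_in = X[:200], y[:200]      # members: the model's training set
X_out, y_out = X[200:], y[200:]    # non-members, drawn from the same distribution

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_in, y_in)

def per_example_loss(m, X, y):
    """Cross-entropy of the true label under the model."""
    p = m.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, 1.0))

loss_in = per_example_loss(model, X_in, y_in)
loss_out = per_example_loss(model, X_out, y_out)

# Guess "member" when loss falls below a threshold. A real attacker calibrates
# the threshold on shadow models (Shokri et al.); the midpoint suffices here.
tau = (loss_in.mean() + loss_out.mean()) / 2
print(f"members flagged: {(loss_in < tau).mean():.0%}   "
      f"non-members flagged: {(loss_out < tau).mean():.0%}")
```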

For agentic AI, membership-inference attacks against agent memory or session-derived fine-tuned models can confirm whether sensitive interactions occurred (e.g., “did this user ever discuss topic X with this agent?”).

MITRE ATLAS technique: AML.T0024.000 (Infer Training Data Membership), a sub-technique of AML.T0024 (Exfiltration via ML Inference API).

Why these matter for agentic AI

The wiki’s threat-modeling has prioritized prompt-injection-class threats (orchestration-layer compromise of agent behavior). Model-layer attacks are a separate threat surface that the wiki has under-treated:

  • Long-running agents accumulate query budget. Scope-3 / Scope-4 agents per the AWS Scoping Matrix make many queries during normal operation. An attacker who controls a fraction of those queries (e.g., via indirect prompt injection) can run extraction-style queries piggybacking on legitimate agent behavior.
  • RAG corpus is the new training data. RAG-grounded agents can be inverted to recover the corpus contents — a class of leak the wiki has not enumerated.
  • Agent memory is fine-tunable. Agents that fine-tune on session interactions inherit the membership-inference surface of the base model + the new attack surface of recovering session-derived records.
  • Multi-agent systems have privileged queriers. A compromised peer agent in an A2A mesh can run extraction queries against other agents at high volume without external rate limits applying.

Defenses

Defense                                                 | Extraction                        | Inversion | Membership inference
--------------------------------------------------------|-----------------------------------|-----------|---------------------
Rate limiting / query budgets                           | Strong                            | Moderate  | Moderate
DP-SGD at training                                      | Weak (function still extractable) | Strong    | Strong
Output randomization (Gaussian noise on logits)         | Strong                            | Weak      | Moderate
Output abstraction (return label only, not confidence)  | Moderate                          | Weak      | Strong
Watermarking (detection-only)                           | Detect, not prevent               | n/a       | n/a
Per-query monitoring + anomaly detection                | Strong (catches query patterns)   | Moderate  | Moderate
Model isolation / minimal-permission API                | Strong                            | Strong    | Strong

Of these, differential privacy is the only mechanism with a quantifiable privacy guarantee against inversion and membership inference. Everything else reduces attack success rate empirically without strong formal guarantees.
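
For concreteness, the mechanism behind that guarantee is per-example gradient clipping plus calibrated Gaussian noise (DP-SGD, Abadi et al. 2016). A minimal sketch of one update step for logistic regression follows; the names and hyperparameters are illustrative:

```python
# One DP-SGD step: clip each example's gradient to a fixed L2 norm, sum, add
# Gaussian noise scaled to that norm, then take an ordinary gradient step.
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    clipped = []
    for x, y in zip(X_batch, y_batch):
        p = 1.0 / (1.0 + np.exp(-(w @ x)))          # logistic prediction
        g = (p - y) * x                             # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / clip)  # bound each example's influence
        clipped.append(g)
    g_total = np.sum(clipped, axis=0)
    g_total += rng.normal(scale=noise_mult * clip, size=w.shape)  # calibrated noise
    return w - lr * g_total / len(X_batch)
```

Production training would use a library with privacy accounting, such as Opacus or TensorFlow Privacy, to track the cumulative (ε, δ) spent across steps rather than hand-rolling the mechanism.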

For extraction specifically, DP does not help directly: the attacker is recovering the function, not training records. Defense instead relies on rate limits, output randomization, and monitoring for the high-volume query patterns characteristic of extraction.
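
A sketch of how those controls compose at the serving layer; the budget, noise scale, and alert hook are illustrative assumptions, not recommended values:

```python
# Inference-side wrapper: per-client query budget, Gaussian noise on scores,
# label-only output, and a crude high-volume alert for extraction monitoring.
from collections import defaultdict
import numpy as np

BUDGET = 1_000        # max queries per client per window
ALERT_FRACTION = 0.8  # flag clients that burn 80% of their budget
NOISE_STD = 0.05      # noise scale applied to raw scores before argmax

_counts: dict[str, int] = defaultdict(int)
_rng = np.random.default_rng()

def serve(client_id: str, raw_scores: list[float]) -> int:
    _counts[client_id] += 1
    if _counts[client_id] > BUDGET:
        raise PermissionError("query budget exhausted")
    if _counts[client_id] > ALERT_FRACTION * BUDGET:
        print(f"ALERT: {client_id} nearing budget; possible extraction")  # monitoring hook
    noisy = np.asarray(raw_scores) + _rng.normal(scale=NOISE_STD, size=len(raw_scores))
    return int(np.argmax(noisy))  # output abstraction: return the label only
```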

Relation to wiki

  • CMM D4 (Runtime & Guardrails) — output randomization and per-query rate limits belong as L4 controls; long-window query-pattern monitoring (extraction detection) belongs at L5.
  • CMM D6 (Data, Memory & RAG) — DP-SGD on training/fine-tuning belongs at L4/L5; RAG-corpus inversion defenses (per-document query budgets, response abstraction) belong at L5+.
  • CMM D7 (Observability & Behavioral Monitoring) — query-pattern anomaly detection for extraction attempts belongs at L4.
  • MAAIS Layer 3 (Model Security) — explicitly names “model extraction, backdoor injections, and inversion attacks” as in-scope; the wiki adopts the same positioning.
  • Differential Privacy — primary defense against inversion and membership inference.
  • MITRE ATLAS — AML.T0024 (Exfiltration via ML Inference API) and its sub-techniques for membership inference, inversion, and extraction; AML.T0040 (ML Model Inference API Access); AML.T0044 (Full ML Model Access).

Provenance

  • Tramèr et al. (2016), Stealing Machine Learning Models via Prediction APIs — foundational extraction work.
  • Fredrikson et al. (2014), Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing, and Fredrikson et al. (2015), Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures — foundational inversion work.
  • Shokri et al. (2017), Membership Inference Attacks Against Machine Learning Models — foundational MIA work.
  • Carlini et al. (2021), Extracting Training Data from Large Language Models — modern LLM-specific inversion.
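  • Yeom et al. (2018), Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting — loss-threshold membership-inference variant used in the sketch above.
  • Abadi et al. (2016), Deep Learning with Differential Privacy — DP-SGD and its privacy accounting.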
  • The wiki’s enumeration here was prompted by MAAIS Layer 3 naming the three attack classes alongside backdoor injection.