Differential Privacy

Differential privacy (DP) is a mathematical framework for guaranteeing that the output of a computation reveals approximately the same information whether or not any individual record was included in the input. Introduced by Cynthia Dwork and colleagues (2006), DP gives quantifiable privacy guarantees rather than ad-hoc anonymization, and is the canonical defensive primitive against model inversion, membership inference, and certain extraction attacks on machine-learning systems.

Definition

A randomized algorithm M satisfies ε-differential privacy if, for any two datasets D and D' that differ in a single record, and for any set of outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]

The parameter ε (epsilon) is the privacy budget: smaller ε means stronger privacy. (ε, δ)-DP is a relaxation that admits a small probability δ of the guarantee failing — Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ — and is common in practice because pure ε-DP is hard to achieve at useful utility.
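
To make the definition concrete, here is a minimal sketch (our illustration, not from the source) of randomized response, the classic mechanism that satisfies ε-DP with ε = ln 3 when the truthful-report probability is 3/4:

    import math
    import random

    def randomized_response(bit: bool) -> bool:
        # Report the true bit with probability 3/4; flip it otherwise.
        return bit if random.random() < 0.75 else not bit

    # Pr[output = s | input = b] is always 0.75 or 0.25, so across any two
    # neighboring inputs the ratio is at most 3 = e^ε, i.e. ε = ln 3.
    eps = math.log(0.75 / 0.25)  # ≈ 1.0986

Each respondent can plausibly deny their true bit, yet aggregate frequencies remain estimable after debiasing.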

Core mechanisms

  • Laplace mechanism — add Laplace-distributed noise to numerical outputs. Provides ε-DP for queries with bounded sensitivity (sketched after this list).
  • Gaussian mechanism — add Gaussian noise. Provides (ε, δ)-DP; tighter composition properties than Laplace for repeated queries.
  • Exponential mechanism — for discrete outputs (e.g., model selection), sample from a distribution weighted by a utility function.
  • DP-SGD (Differentially Private Stochastic Gradient Descent) — clip per-example gradients, then add Gaussian noise during training (a schematic step is sketched after this list). The standard DP-training algorithm; implemented in TensorFlow Privacy, Opacus (PyTorch), and JAX-based libraries. Privacy budget accumulates across training steps via composition theorems.
  • Local differential privacy (LDP) — noise added at the data source before centralization. Used by Apple (telemetry) and Google (Chrome’s RAPPOR, federated analytics). Trades off privacy budget against statistical utility more aggressively than central DP.
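
As a concrete sketch of the Laplace mechanism above (our illustration; the function name is ours), noise is drawn with scale b = sensitivity / ε:

    import numpy as np

    def laplace_mechanism(true_value: float, sensitivity: float, eps: float,
                          rng: np.random.Generator) -> float:
        # ε-DP release of a bounded-sensitivity numeric query:
        # Laplace noise with scale b = sensitivity / eps.
        return true_value + rng.laplace(loc=0.0, scale=sensitivity / eps)

    rng = np.random.default_rng(0)
    # A counting query ("how many records match?") has sensitivity 1.
    noisy_count = laplace_mechanism(true_value=42.0, sensitivity=1.0, eps=0.5, rng=rng)

And a schematic single DP-SGD step, assuming per-example gradients are already materialized as a (batch_size, num_params) matrix; production training should rely on an audited implementation such as Opacus or TensorFlow Privacy rather than hand-rolled noise:

    import numpy as np

    def dp_sgd_step(per_example_grads: np.ndarray, clip_norm: float,
                    noise_multiplier: float, rng: np.random.Generator) -> np.ndarray:
        # 1. Clip each example's gradient to L2 norm <= clip_norm.
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
        # 2. Add Gaussian noise calibrated to the clipping bound
        #    (sigma = noise_multiplier * clip_norm).
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
        # 3. Average; the optimizer applies this noisy mean gradient.
        return (clipped.sum(axis=0) + noise) / len(per_example_grads)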

Application to agentic AI

Surface | DP application | Why it matters
Training / fine-tuning | DP-SGD on model training; release a model that does not memorize individual training records | Defends against model inversion and membership inference attacks recovering proprietary or sensitive training data
RAG / retrieved-context responses | Noise the answers an agent returns about sensitive documents, or apply DP at retrieval-aggregation time | Limits how much an attacker can extract about a single document by querying repeatedly
Federated learning across agents | Local DP on each agent's gradient updates before aggregation | Prevents the central aggregator (or any single peer) from recovering an individual agent's training data
Agent-to-agent telemetry | Local DP on behavior signals shared between agents (e.g., trust scores) | Prevents downstream agents from inferring individual upstream-agent activity patterns
Inference-time output randomization | Small Gaussian noise on logits before sampling (sketched below) | Slows model-extraction attacks, which rely on stable query-response pairs
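
A minimal sketch of the last row above (our illustration; the noise scale sigma is an assumed tuning parameter, not a recommendation):

    import numpy as np

    def noisy_sample(logits: np.ndarray, sigma: float, rng: np.random.Generator) -> int:
        # Perturb logits so repeated identical queries return unstable
        # query-response pairs, slowing model-extraction attacks.
        noisy = logits + rng.normal(0.0, sigma, size=logits.shape)
        probs = np.exp(noisy - noisy.max())  # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

Note that this randomization is an extraction-slowing measure, not a formal DP guarantee over training data.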

Standards and tooling

  • NIST SP 800-188 — De-Identifying Government Datasets: practitioner guidance on data sanitization and de-identification.
  • NIST IR 8053 — De-Identification of Personal Information.
  • OpenDP — open-source DP library (Harvard / Microsoft / OpenDP Initiative). Production-grade implementation of standard mechanisms.
  • TensorFlow Privacy — Google’s reference DP-SGD implementation.
  • Opacus — Meta’s PyTorch DP library; widely used for DP-SGD training.
  • Google’s Differential Privacy library — Java / Go / C++ / Python; production-tested at Google.

Limitations

  • Privacy-utility trade-off. Small ε buys strong privacy but degrades model accuracy. Real deployments routinely use ε ∈ [1, 10]; ε = 1 is considered strong by academic convention, while ε = 10 offers only a directional guarantee. Apple's iOS telemetry uses single-digit ε per event, but spend accumulates across days.
  • Privacy-budget management. Composition theorems govern how much privacy is spent across queries. Long-running agents that re-query the same DP-protected interface accumulate privacy cost; once the budget is exhausted, the interface must refuse further queries or rotate the underlying data. Naïve deployments ignore composition and effectively provide much weaker DP than advertised (see the budget-tracking sketch after this list).
  • Doesn’t defend against all model attacks. DP at training defends against inversion and membership inference but does not defend against model extraction (which can occur even with DP-trained models, since the attacker is recovering the function, not training records). Inference-time output randomization is needed to slow extraction.
  • Hard to compose with RAG. Differential privacy on retrieved-context responses is an active research area; production-grade DP for RAG isn’t yet standard.
  • Not a substitute for access control. DP protects against information leakage given a query; it does not control who can query in the first place. Pair with NHI / authorization controls.
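
A minimal budget-tracking sketch under basic sequential composition, where total spend is the sum of per-query ε (our illustration; production systems should use an accountant from a DP library, which gives tighter bounds):

    class PrivacyBudget:
        """Track cumulative ε under basic sequential composition."""

        def __init__(self, total_eps: float):
            self.total_eps = total_eps
            self.spent = 0.0

        def charge(self, eps: float) -> bool:
            # Refuse any query that would push cumulative spend over budget.
            if self.spent + eps > self.total_eps:
                return False
            self.spent += eps
            return True

    budget = PrivacyBudget(total_eps=4.0)
    answered = sum(budget.charge(0.5) for _ in range(10))  # 8 succeed, then refusals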

Relation to wiki

  • CMM D6 (Data, Memory & RAG) — DP-SGD belongs as an L4/L5 control for training-data privacy. Inference-time DP on RAG responses is an L5+ research area.
  • CMM D4 (Runtime & Guardrails) — output randomization for extraction-attack mitigation belongs as an L4 control.
  • CMM D9 (Operations & Human Factors) — privacy-budget tracking belongs as an operational discipline; agents that exhaust budgets must be detected and quotas refreshed/rotated.
  • MAAIS Layer 2 (Data Security) — explicitly names differential privacy as a control. The wiki adopts the same positioning.
  • Model-Layer Attacks — DP is the primary defense against inversion and membership inference; partial defense against extraction (when paired with rate limits and output noise).

Provenance

The framework was introduced by Dwork, McSherry, Nissim, and Smith (2006), Calibrating Noise to Sensitivity in Private Data Analysis. DP-SGD is from Abadi et al. (2016), Deep Learning with Differential Privacy. The wiki references DP via MAAIS Layer 2 which names it as a Data Security control for agentic AI.