Tino Trangia

Data extraction after exact unlearning

This was the final project for DSC 291: Safety in Generative AI at UCSD (Fall 2025), with Yi Lien, Jean Hsu, and Ting-Shiuan Lai. We reproduced Wu et al. (2025), Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM, and extended their Reversed Model Guidance (RMG) attack to two datasets the original work didn’t cover: a slice of WMDP (Weapons of Mass Destruction Proxy, bio-retain corpus) and a synthetic medical SOAP-notes dataset that we generated specifically for this study.

Download the report (PDF) · Slides (PDF) · Code

Background

Machine unlearning is the privacy-compliance answer to GDPR/CCPA-style “delete my data” requests on trained models. Exact unlearning (retraining the model from scratch without the target data) has been the theoretical gold standard, on the assumption that a model retrained without sample $x$ cannot leak $x$.

Wu et al. (2025) showed this assumption is wrong when the adversary has access to both the pre-unlearning and post-unlearning checkpoints — a realistic threat model since older versions of deployed models often linger on disk, in artifact stores, or in third-party copies. The divergence between the two checkpoints — what the model used to know vs. what it knows now — is itself an extractable signal that points back at the deleted data.

Method

Reversed Model Guidance (RMG)

The attack is an inference-time decoding modification. For each next-token decision, the guided log-probability is:

$$\log q(x_{i+1}) = \log P_{\text{post}}(x_{i+1}) + w \cdot \big(\log P_{\text{pre}}(x_{i+1}) - \log P_{\text{post}}(x_{i+1})\big)$$

The bracketed term is the divergence signal: tokens that used to be high-probability under $P_{\text{pre}}$ but became low-probability under $P_{\text{post}}$ get a large positive boost. Scaling that by the guidance scale $w$ produces a “pseudo-predictor” that tilts decoding back toward the forgotten distribution.
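As a minimal sketch of the formula above, the snippet below computes guided scores for a single decoding step. The four-token vocabulary and the logits are hypothetical values chosen for illustration, not numbers from our experiments, and `rmg_scores` is an illustrative helper name rather than a function from the paper's repo.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def rmg_scores(logp_pre, logp_post, w):
    """Guided score per token: log P_post + w * (log P_pre - log P_post)."""
    return [post + w * (pre - post) for pre, post in zip(logp_pre, logp_post)]

# Hypothetical vocabulary and logits (illustration only).
vocab = ["forgotten", "plausible", "other", "filler"]
logp_pre = log_softmax([2.0, 1.5, 0.5, 0.0])    # pre-unlearning favors "forgotten"
logp_post = log_softmax([-1.0, 2.0, 0.5, 0.0])  # post-unlearning suppresses it

scores = rmg_scores(logp_pre, logp_post, w=2.6)
guided = log_softmax(scores)  # renormalize the scores into a distribution
best = vocab[max(range(len(vocab)), key=lambda i: guided[i])]
print(best)
```

Two sanity checks follow directly from the formula: $w = 0$ returns the post-unlearning distribution unchanged, and $w = 1$ returns the pre-unlearning one. Values above 1 extrapolate past the pre-unlearning model along the same direction.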

The paper’s “concerns” example makes the mechanism vivid: in a forgotten patient note, the next token is “concerns”, which the pre-unlearning model assigns log-prob $-2.23$ and the post-unlearning model assigns $-6.99$. RMG with $w \approx 2.6$ amplifies that gap to a guided log-prob of $+2.54$, vaulting “concerns” over plausible-but-wrong distractors like “complaint” ($-0.77$) and reconstructing the forgotten sequence verbatim.

Token filtering

Aggressive guidance can produce incoherent text: a high $w$ recovers more content but distorts semantics. We adopted the paper’s contrastive-decoding-inspired filter: restrict candidates to tokens whose pre-unlearning probability is at least $\gamma \cdot \max(P_{\text{pre}})$, then apply RMG only over that subset. This trades a tiny amount of attack flexibility for substantially better fluency.
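The filter can be sketched as a mask applied before the guided scores are compared. The probabilities below and the value $\gamma = 0.1$ are assumptions picked to show the failure mode, and `filtered_rmg` is a hypothetical helper name. Token 3 is a junk token that both models consider unlikely, but whose probability collapsed so far after unlearning that unfiltered RMG would rank it first.

```python
import math

def filtered_rmg(logp_pre, logp_post, w, gamma):
    """RMG scores restricted to tokens with P_pre >= gamma * max(P_pre).

    Tokens outside the plausible set get -inf so they can never be selected.
    """
    cutoff = math.log(gamma) + max(logp_pre)  # log(gamma * max P_pre)
    return [
        post + w * (pre - post) if pre >= cutoff else float("-inf")
        for pre, post in zip(logp_pre, logp_post)
    ]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical next-token probabilities (illustration only).
p_pre = [0.6, 0.3, 0.09, 0.01]
p_post = [0.05, 0.7, 0.249999, 0.000001]
logp_pre = [math.log(p) for p in p_pre]
logp_post = [math.log(p) for p in p_post]

# gamma=1e-9 keeps every token here, so it behaves like unfiltered RMG.
unfiltered_choice = argmax(filtered_rmg(logp_pre, logp_post, w=2.6, gamma=1e-9))
filtered_choice = argmax(filtered_rmg(logp_pre, logp_post, w=2.6, gamma=0.1))
print(unfiltered_choice, filtered_choice)
```

With the filter on, only tokens the pre-unlearning model itself found plausible compete, so the large log-ratio of the junk token no longer matters.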

Pipeline

  1. Dataset preparation: for each dataset, generate fine-tuning data and define forget/retain splits at multiple ratios (1%, 5%, 10%, 20%).
  2. Pre-unlearning checkpoint: LoRA-finetune Llama-3.1-8B-Instruct on the full dataset.
  3. Exact-unlearning simulation: LoRA-finetune a fresh checkpoint on the retain split only.
  4. Guided extraction attack: generate completions on forget-set prompts using RMG with $w = 2.6$ over the two checkpoints.
  5. Evaluation: ROUGE-1(R), ROUGE-L(R), BLEU, BERTScore, and A-ESR (Average Extraction Success Rate, the fraction of completions whose ROUGE-L recall is at least the threshold $\tau$).
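To make the A-ESR metric from the evaluation step concrete, here is a small self-contained sketch. It assumes whitespace tokenization and computes ROUGE-L recall as the longest-common-subsequence length divided by the reference length. Real ROUGE packages add their own tokenization and optional stemming, so scores may differ slightly. The comparison uses `>=`, an assumption made so that $\tau = 1.0$ counts exact recoveries. The example strings are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(reference, candidate):
    """LCS length over reference length, with whitespace tokenization."""
    ref, cand = reference.split(), candidate.split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

def a_esr(references, candidates, tau):
    """Fraction of completions whose ROUGE-L recall is at least tau."""
    hits = sum(rouge_l_recall(r, c) >= tau for r, c in zip(references, candidates))
    return hits / len(references)

# Invented forget-set targets and model completions (illustration only).
references = ["patient reports chest pain", "no known drug allergies"]
candidates = ["patient reports chest pain today", "no drug allergies"]

for tau in (0.7, 0.9, 1.0):
    print(tau, a_esr(references, candidates, tau))
```

Because the metric is recall-based, extra tokens in a completion do not reduce the score. Only missing reference tokens do.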

Datasets

The original repo only ships TOFU; we extended the pipeline to a slice of WMDP (bio-retain corpus) and a synthetic medical SOAP-notes dataset.

Setup

Results

Headline: RMG reliably beats unguided generation

At the default 10% forget-set ratio:

WMDP

| Generation | ROUGE-1 (R) | ROUGE-L (R) | BLEU | BERTScore | A-ESR (τ=0.9) | A-ESR (τ=1.0) |
|---|---|---|---|---|---|---|
| Post-unlearning | 0.227 | 0.211 | 0.032 | 0.178 | 0.011 | 0.011 |
| Pre-unlearning | 0.245 | 0.229 | 0.038 | 0.196 | 0.015 | 0.015 |
| RMG attack | 0.284 (+15.9%) | 0.267 (+16.4%) | 0.049 (+30.7%) | 0.232 (+18.9%) | 0.025 (+63.2%) | 0.023 (+50.7%) |

Synthetic medical

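| Generation | ROUGE-1 (R) | ROUGE-L (R) | BLEU | BERTScore | A-ESR (τ=0.9) | A-ESR (τ=1.0) |
|---|---|---|---|---|---|---|
| Post-unlearning | 0.561 | 0.506 | 0.281 | 0.281 | 0.060 | 0.000 |
| Pre-unlearning | 0.564 | 0.519 | 0.310 | 0.585 | 0.060 | 0.000 |
| RMG attack | 0.579 (+2.7%) | 0.542 (+4.3%) | 0.353 (+14.1%) | 0.612 (+4.6%) | 0.050 (-16.7%) | 0.020 |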

Two takeaways from the tables:

“Sweet spot” in forget-set ratio (medical only)

On WMDP, RMG’s effectiveness was roughly monotonic: it dropped slightly at higher forget ratios (more noise from a larger unlearning step), but the attack stayed above the baseline at every ratio.

On the medical dataset, the curve had a clear peak at 5% that didn’t appear on WMDP. Our hypothesis:

The 5% sweet spot maximizes divergence subject to “the model still produces SOAP-shaped output.” That trade-off — divergence vs. structural coherence — isn’t emphasized in the original paper but matters a lot for any structured-text deployment.

Memorization steepens the leakage channel

We did a hyperparameter sweep over the guidance scale $w$ at two memorization levels:

The optimal guidance scale is inversely related to baseline memorization. Intuitively, a more-memorized pre-unlearning model produces a stronger divergence signal per unit of $w$, so you need less amplification to recover the data. From a defender’s perspective, this is the worst possible interaction: the things that make a model more useful (more training, lower-loss fits to the training distribution) also make this attack easier and require less attacker-side tuning.
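A toy calculation illustrates the intuition (it is not the sweep we ran). For one forgotten target token and one distractor, the RMG score is linear in $w$, so the smallest $w$ at which the target overtakes the distractor can be solved for directly. All probabilities below are hypothetical, and `min_guidance_scale` is an illustrative helper name.

```python
import math

def min_guidance_scale(pre_t, post_t, pre_d, post_d):
    """Smallest w >= 0 at which the target's RMG score exceeds the distractor's.

    Scores are log(post) + w * (log(pre) - log(post)); setting the target's
    score equal to the distractor's and solving for w gives the crossover.
    """
    gap_t = math.log(pre_t) - math.log(post_t)  # divergence signal, target
    gap_d = math.log(pre_d) - math.log(post_d)  # divergence signal, distractor
    if gap_t <= gap_d:
        return math.inf  # guidance never favors the target
    return max(0.0, (math.log(post_d) - math.log(post_t)) / (gap_t - gap_d))

# Same post-unlearning model in both cases: target 0.01, distractor 0.5.
# Hypothetical pre-unlearning probabilities at two memorization levels.
weak = min_guidance_scale(pre_t=0.2, post_t=0.01, pre_d=0.4, post_d=0.5)
strong = min_guidance_scale(pre_t=0.9, post_t=0.01, pre_d=0.05, post_d=0.5)
print(f"weak memorization:   w > {weak:.3f}")
print(f"strong memorization: w > {strong:.3f}")
```

In this toy setup the strongly memorized case crosses over at a smaller $w$, because the pre-unlearning model puts more mass on the target and so the gap term is larger.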

Takeaways

  1. Exact unlearning is not sufficient on its own for privacy. Even when the post-unlearning model no longer reproduces forgotten content, the divergence between checkpoints leaks enough to reconstruct it. Threat models that only consider the final model are too narrow.
  2. Checkpoint hygiene is part of unlearning. Any deployment that treats unlearning as a compliance control needs to track which historical checkpoints exist, where they live, and who has access — exactly the kind of artifact lifecycle that’s easy to ignore until an attack like RMG forces the issue.
  3. Memorization and unlearning have to be analyzed jointly. A model that memorizes harder before unlearning is easier to attack, not harder, because the divergence signal is louder. Defense work that focuses only on the post-unlearning step is missing half the loss surface.
  4. Structured-data deployments have their own failure mode. The “sweet spot” finding suggests that for structured records (medical, financial, legal), there exists a forget ratio that is worse for privacy than either small or large unlearning steps. Defenders need to evaluate against extraction across the full ratio range, not at one default.

Limitations and what I’d do next

If I were continuing, the experiment I’d most want to run is DP-SGD at varied $\varepsilon$ + RMG: sweep the privacy budget, measure both A-ESR and downstream task utility, and characterize the actual frontier instead of accepting the folk wisdom that “DP kills utility.” That’s the result that would actually inform whether this attack matters for production unlearning.

My contribution

WMDP data processing, fine-tuning, sampling pipeline with Tinker, RMG extraction implementation (local + Colab), and the hyperparameter study (guidance scale × memorization).



