This was the final project for DSC 291: Safety in Generative AI at UCSD (Fall 2025), with Yi Lien, Jean Hsu, and Ting-Shiuan Lai. We reproduced Wu et al. (2025), Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM, and extended their Reversed Model Guidance (RMG) attack to two datasets the original work didn’t cover: a slice of WMDP (Weapons of Mass Destruction Proxy, bio-retain corpus) and a synthetic medical SOAP-notes dataset that we generated specifically for this study.
Download the report (PDF) · Slides (PDF) · Code
Background
Machine unlearning is the privacy-compliance answer to GDPR/CCPA-style “delete my data” requests on trained models. Exact unlearning, which retrains the model from scratch without the target data, has been the theoretical gold standard, on the assumption that a model retrained without a sample $x$ cannot leak $x$.
Wu et al. (2025) showed this assumption is wrong when the adversary has access to both the pre-unlearning and post-unlearning checkpoints — a realistic threat model since older versions of deployed models often linger on disk, in artifact stores, or in third-party copies. The divergence between the two checkpoints — what the model used to know vs. what it knows now — is itself an extractable signal that points back at the deleted data.
Method
Reversed Model Guidance (RMG)
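The attack is an inference-time decoding modification. For each next-token decision, the guided log-probability adds a scaled divergence term, $w\,[\log p_{\text{pre}}(x_t \mid x_{<t}) - \log p_{\text{post}}(x_t \mid x_{<t})]$, to the token's base log-probability. Here $p_{\text{pre}}$ and $p_{\text{post}}$ are the pre- and post-unlearning models and $w$ is the guidance scale.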
The bracketed term is the divergence signal: tokens that used to be high-probability under the pre-unlearning model but became low-probability under the post-unlearning model get a large positive boost. Scaling that by the guidance scale $w$ produces a “pseudo-predictor” that tilts decoding back toward the forgotten distribution.
The paper’s “concerns” example makes the mechanism vivid: in a forgotten patient note, the next token is “concerns”, which the pre-unlearning model assigns a higher log-probability than the post-unlearning model does. RMG amplifies that gap, vaulting “concerns” over plausible-but-wrong distractors like “complaint” and reconstructing the forgotten sequence verbatim.
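To make the mechanism concrete, here is a minimal Python sketch of one guided decoding step. It assumes the pre-unlearning log-probability is the base term (using the post-unlearning model as the base instead only shifts the guidance scale `w` by one), and the three-token vocabulary and its probabilities are invented for illustration. They are not the figures from the paper's example.

```python
import math

def log_softmax(scores):
    """Renormalize unnormalized log-scores into log-probabilities."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def rmg_guided_logprobs(logp_pre, logp_post, w):
    """Base (pre-unlearning) log-prob plus w times the divergence term."""
    guided = [lp + w * (lp - lq) for lp, lq in zip(logp_pre, logp_post)]
    return log_softmax(guided)

def argmax_token(vocab, logprobs):
    return vocab[max(range(len(vocab)), key=lambda i: logprobs[i])]

# Hypothetical toy distributions over a three-token vocabulary.
vocab = ["concerns", "complaint", "history"]
logp_pre = [math.log(p) for p in (0.40, 0.45, 0.15)]   # pre-unlearning
logp_post = [math.log(p) for p in (0.10, 0.60, 0.30)]  # post-unlearning

guided = rmg_guided_logprobs(logp_pre, logp_post, w=1.0)
unguided_choice = argmax_token(vocab, logp_pre)  # "complaint"
guided_choice = argmax_token(vocab, guided)      # "concerns"
```

In this toy case the pre-unlearning model alone would pick "complaint", but the token whose probability dropped most after unlearning wins once the divergence term is added.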
Token filtering
Aggressive guidance can produce incoherent text: a high guidance scale recovers more content but distorts semantics. We adopted the paper’s contrastive-decoding-inspired filter: restrict candidates to tokens whose pre-unlearning probability clears a minimum threshold (the constraint level), then apply RMG only over that subset. This trades a tiny amount of attack flexibility for substantially better fluency.
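The sketch below shows how such a filter can interact with guidance. It assumes the threshold takes the usual contrastive-decoding form, where a token is kept if its pre-unlearning probability is at least `gamma` times that of the most likely token. The name `gamma`, that exact form, and all probabilities are assumptions for illustration, not values from our runs.

```python
import math

def filtered_rmg_choice(logp_pre, logp_post, w, gamma):
    """Pick the best guided token among those passing the plausibility filter."""
    # Keep tokens with p_pre >= gamma * max(p_pre), compared in log space.
    cutoff = math.log(gamma) + max(logp_pre)
    candidates = [i for i, lp in enumerate(logp_pre) if lp >= cutoff]
    # Guided score: base (pre-unlearning) log-prob plus scaled divergence.
    def score(i):
        return logp_pre[i] + w * (logp_pre[i] - logp_post[i])
    return max(candidates, key=score), candidates

# Hypothetical toy case: "xqz" is implausible under the pre-unlearning model
# but has a large pre/post ratio, so strong guidance alone would select it.
vocab = ["concerns", "complaint", "xqz"]
logp_pre = [math.log(p) for p in (0.50, 0.499, 0.001)]
logp_post = [math.log(p) for p in (0.30, 0.6999, 0.0001)]

loose_idx, _ = filtered_rmg_choice(logp_pre, logp_post, w=4.0, gamma=1e-9)
strict_idx, kept = filtered_rmg_choice(logp_pre, logp_post, w=4.0, gamma=0.1)
loose_choice = vocab[loose_idx]    # "xqz": the filter is effectively off
strict_choice = vocab[strict_idx]  # "concerns": "xqz" never becomes a candidate
```

With the filter effectively disabled, the rare token wins purely on its probability ratio. With `gamma=0.1` it is excluded before guidance is applied.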
Pipeline
- Dataset preparation: for each dataset, generate fine-tuning data and define forget/retain splits at multiple ratios (1%, 5%, 10%, 20%).
- Pre-unlearning checkpoint: LoRA-finetune Llama-3.1-8B-Instruct on the full dataset.
- Exact-unlearning simulation: LoRA-finetune a fresh checkpoint on the retain split only.
- Guided extraction attack: generate completions on forget-set prompts using RMG over the two checkpoints.
- Evaluation: ROUGE-1(R), ROUGE-L(R), BLEU, BERTScore, and A-ESR (Average Extraction Success Rate, the fraction of completions whose ROUGE-L recall is at least the threshold $\tau$).
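As a rough illustration of the A-ESR metric in the evaluation step, the sketch below computes ROUGE-L recall from a longest-common-subsequence match and counts completions that reach the threshold. It assumes plain whitespace tokenization, which standard ROUGE packages do not use exactly, so scores from a real evaluation library may differ. The example strings are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens."""
    ref, cand = reference.split(), candidate.split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

def a_esr(references, candidates, tau):
    """Fraction of completions whose ROUGE-L recall reaches tau."""
    hits = sum(rouge_l_recall(r, c) >= tau for r, c in zip(references, candidates))
    return hits / len(references)

# Hypothetical forget-set targets and model completions.
refs = ["patient reports sleep concerns", "plan follow up in two weeks"]
cands = ["patient reports sleep concerns", "plan review in two months"]

recalls = [rouge_l_recall(r, c) for r, c in zip(refs, cands)]  # [1.0, 0.5]
strict = a_esr(refs, cands, tau=0.9)  # 0.5: only the verbatim completion counts
```

The comparison uses `>=` so that a threshold of 1.0 counts exact reconstructions.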
Datasets
The original repo only ships TOFU; we extended the pipeline to:
- WMDP bio-retain: 5.3K sentences sampled from PubMed-derived biology papers, with paper/sentence indices retained for traceability. Tests unlearning on factual, domain-specific text where adversaries might want to recover technical content.
- Synthetic medical SOAP notes: we prompted an LLM to generate structured clinical notes with PII (`client_name`, `date_of_birth`) and SOAP fields (subjective/objective/assessment/plan), then ensured each `client_name` was unique by appending a short UUID, shuffled with a fixed seed, and saved JSON forget/retain splits at the four ratios. Tests unlearning on the kind of high-sensitivity records GDPR-style deletion requests target.
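The SOAP-notes preparation can be sketched roughly as follows. This is an illustrative reconstruction, not our actual script: the seed value, the 8-character suffix, the `-` separator, and the choice to take each forget set as a prefix of the shuffled list are all assumptions.

```python
import json
import random
import uuid

def make_splits(notes, ratios=(0.01, 0.05, 0.10, 0.20), seed=42):
    """Uniquify client names, shuffle deterministically, cut forget/retain splits."""
    records = [
        dict(n, client_name=f"{n['client_name']}-{uuid.uuid4().hex[:8]}")
        for n in notes
    ]
    random.Random(seed).shuffle(records)  # fixed seed keeps the order reproducible
    splits = {}
    for r in ratios:
        k = max(1, round(len(records) * r))  # size of the forget set at this ratio
        splits[r] = {"forget": records[:k], "retain": records[k:]}
    return splits

# Hypothetical generated notes that all share one name before uniquification.
notes = [
    {"client_name": "Alex Doe", "date_of_birth": "1990-01-01", "subjective": "..."}
    for _ in range(100)
]
splits = make_splits(notes)
forget_sizes = {r: len(s["forget"]) for r, s in splits.items()}
serialized = json.dumps(splits[0.05])  # each split is plain JSON-serializable data
```

Appending the suffix before shuffling means every record carries a distinct name regardless of which split it lands in.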
Setup
- Base model: Llama-3.1-8B-Instruct.
- Fine-tuning: LoRA (rank 64) via the Tinker API, 5 epochs, batch 32, max length 3000, lr , 100 warmup steps.
- Attack: , constraint level , max new tokens 512.
Results
Headline: RMG reliably beats unguided generation
At the default 10% forget-set ratio (percentage changes are relative to the pre-unlearning row):
WMDP
| Generation | ROUGE-1 (R) | ROUGE-L (R) | BLEU | BERTScore | A-ESR (τ=0.9) | A-ESR (τ=1.0) |
|---|---|---|---|---|---|---|
| Post-unlearning | 0.227 | 0.211 | 0.032 | 0.178 | 0.011 | 0.011 |
| Pre-unlearning | 0.245 | 0.229 | 0.038 | 0.196 | 0.015 | 0.015 |
| RMG attack | 0.284 (+15.9%) | 0.267 (+16.4%) | 0.049 (+30.7%) | 0.232 (+18.9%) | 0.025 (+63.2%) | 0.023 (+50.7%) |
Synthetic medical
| Generation | ROUGE-1 (R) | ROUGE-L (R) | BLEU | BERTScore | A-ESR (τ=0.9) | A-ESR (τ=1.0) |
|---|---|---|---|---|---|---|
| Post-unlearning | 0.561 | 0.506 | 0.281 | 0.281 | 0.060 | 0.000 |
| Pre-unlearning | 0.564 | 0.519 | 0.310 | 0.585 | 0.060 | 0.000 |
| RMG attack | 0.579 (+2.7%) | 0.542 (+4.3%) | 0.353 (+14.1%) | 0.612 (+4.6%) | 0.050 (–16.7%) | 0.020 |
Two takeaways from the tables:
- The headline number, +63.2% on A-ESR(τ=0.9) for WMDP over the pre-unlearning model (the strongest unguided baseline), agrees in direction with the original paper’s claim that RMG roughly doubles exact-extraction rates, although the gain here is smaller than a doubling. Measured against the post-unlearning model instead (0.011 to 0.025), the rate more than doubles.
- On the medical dataset, the surface metrics (ROUGE / BLEU / BERT) move less in relative terms because the unguided pre-unlearning baseline is already close to the ground truth (BERT 0.585 vs. 0.612 with RMG). The signal that matters there isn’t the average overlap. It’s A-ESR(τ=1.0) rising from 0.000 to 0.020 even as A-ESR(τ=0.9) dips from 0.060 to 0.050: RMG is the only setting that ever fully reconstructs PII-bearing notes.
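For readers checking the percentages, the relative changes can be recomputed from the table entries as below. The inputs are the rounded values shown in the tables, so small mismatches with the reported deltas are expected. For example, WMDP A-ESR(τ=0.9) comes out at 66.7% from the rounded entries rather than the reported 63.2%, which would be consistent with the reported deltas having been computed from unrounded scores.

```python
def relative_change_pct(attack, baseline):
    """Percentage change of the attack score relative to a baseline score."""
    return round((attack - baseline) / baseline * 100, 1)

# Rounded entries from the tables: (RMG attack, pre-unlearning baseline).
wmdp_rouge1 = relative_change_pct(0.284, 0.245)      # 15.9
medical_rouge1 = relative_change_pct(0.579, 0.564)   # 2.7
medical_aesr_09 = relative_change_pct(0.050, 0.060)  # -16.7
wmdp_aesr_09 = relative_change_pct(0.025, 0.015)     # 66.7 from rounded entries

# The same WMDP attack score measured against the post-unlearning row.
wmdp_aesr_vs_post = 0.025 / 0.011  # ratio above 2, i.e. more than a doubling
```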
“Sweet spot” in forget-set ratio (medical only)
On WMDP, RMG’s effectiveness declined only slightly as the forget ratio grew (more noise from a larger unlearning step), and the attack stayed above the baseline at every ratio.
On the medical dataset, the curve had a clear peak at 5% that didn’t appear on WMDP. Our hypothesis:
- At 1% forget ratio, the pre- and post-unlearning models are nearly identical — the divergence signal is too weak for RMG to amplify.
- At 20% forget ratio, so much structured medical data has been removed that the post-unlearning model partially loses the SOAP format itself — the structure gets noisy, not just the contents — and that noise hurts extraction more than the wider divergence helps.
The 5% sweet spot maximizes divergence subject to “the model still produces SOAP-shaped output.” That trade-off — divergence vs. structural coherence — isn’t emphasized in the original paper but matters a lot for any structured-text deployment.
Memorization steepens the leakage channel
We did a hyperparameter sweep over the guidance scale at two memorization levels:
- Default training (lr , 5 epochs): optimal . Beyond that, BLEU and BERT degrade — extraction recovers more content but produces less coherent text.
- More memorization (lr , 5 epochs): optimal .
In our sweep, the optimal guidance scale was lower at the higher memorization level. Intuitively: a more-memorized pre-unlearning model produces a stronger divergence signal per unit of $w$, so you need less amplification to recover the data. From a defender’s perspective, this is the worst possible interaction: the things that make a model more useful (more training, lower-loss fits to the training distribution) also make this attack easier and require less attacker-side tuning.
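A toy calculation can illustrate that intuition. Assuming the guided score is the pre-unlearning log-probability plus `w` times the pre/post log-ratio, the forgotten token overtakes a rival once `w` exceeds the rival's head start divided by the difference in their divergence terms. The probabilities below are invented, and this is not the sweep we ran.

```python
import math

def min_guidance_scale(pre_target, post_target, pre_rival, post_rival):
    """Scale w at which the target's guided score equals the rival's.

    Assumes the target's divergence term is larger than the rival's.
    """
    head_start = math.log(pre_rival) - math.log(pre_target)
    d_target = math.log(pre_target) - math.log(post_target)
    d_rival = math.log(pre_rival) - math.log(post_rival)
    return head_start / (d_target - d_rival)

# Hypothetical probabilities: the rival token is the same in both cases,
# only the pre-unlearning probability of the forgotten token changes.
weak = min_guidance_scale(0.30, 0.25, 0.40, 0.42)    # lightly memorized target
strong = min_guidance_scale(0.38, 0.25, 0.40, 0.42)  # heavily memorized target
```

Here `strong` is much smaller than `weak`: a more heavily memorized token both starts closer to the rival and carries a larger divergence term, so a smaller scale is enough to flip the decision.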
Takeaways
- Exact unlearning is not sufficient on its own for privacy. Even when the post-unlearning model no longer reproduces forgotten content, the divergence between checkpoints leaks enough to reconstruct it. Threat models that only consider the final model are too narrow.
- Checkpoint hygiene is part of unlearning. Any deployment that treats unlearning as a compliance control needs to track which historical checkpoints exist, where they live, and who has access — exactly the kind of artifact lifecycle that’s easy to ignore until an attack like RMG forces the issue.
- Memorization and unlearning have to be analyzed jointly. A model that memorizes harder before unlearning is easier to attack, not harder, because the divergence signal is louder. Defense work that focuses only on the post-unlearning step is missing half the loss surface.
- Structured-data deployments have their own failure mode. The “sweet spot” finding suggests that for structured records (medical, financial, legal), there exists a forget ratio that is worse for privacy than either small or large unlearning steps. Defenders need to evaluate against extraction across the full ratio range, not at one default.
Limitations and what I’d do next
- No DP defense yet. Differential privacy via DP-SGD is the canonical defense, but the original paper notes it carries a significant utility cost. We didn’t evaluate the privacy-utility frontier ourselves; that’s the natural next step.
- Single base model. Llama-3.1-8B-Instruct only. Whether the attack works the same way on different families (Mistral, Qwen, Gemma) — and whether the “sweet spot” / memorization findings generalize — is open.
- LoRA fine-tuning, not full retraining. True exact unlearning means full retraining; we approximated with LoRA on retain-only data. Full retraining might widen or narrow the divergence channel.
- Two datasets, two domains. WMDP (factual scientific text) and SOAP notes (structured PII) cover useful corners, but TOFU, code datasets, and conversational data would round out the picture.
- Copyright-vs-privacy framing. The original paper distinguishes between privacy (per-individual) and copyright (per-expression) protection. We focused on privacy. Extending RMG-style attacks to test Near Access-Freeness (NAF) for copyright is an interesting direction.
If I were continuing, the experiment I’d most want to run is DP-SGD at varied $\varepsilon$ + RMG: sweep the privacy budget, measure both A-ESR and downstream task utility, and characterize the actual frontier instead of accepting the folk wisdom that “DP kills utility.” That’s the result that would actually inform whether this attack matters for production unlearning.
My contribution
WMDP data processing, fine-tuning, sampling pipeline with Tinker, RMG extraction implementation (local + Colab), and the hyperparameter study (guidance scale × memorization).
References
- Wu, Pang, Liu, & Wu (2025). Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM. arXiv:2505.24379
- Elkin-Koren, Hacohen, Livni, & Moran (2024). Can Copyright Be Reduced to Privacy? arXiv:2305.14822
- Vyas, Kakade, & Barak (2023). On Provable Copyright Protection for Generative Models. arXiv:2302.10870
- Liu et al. (2025). Language Models May Verbatim Complete Text They Were Not Explicitly Trained On. arXiv:2503.17514
- Abadi et al. (2016). Deep Learning with Differential Privacy. CCS 2016.
- Li et al. (2024). The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning. arXiv:2403.03218