For my AI Safety coursework at UCSD, I replicated a NeurIPS 2025 result showing that exact unlearning — training a model to forget specific data — is not as safe as it looks when attackers have access to both the pre- and post-unlearning checkpoints. The pair of checkpoints acts as a side channel that lets an attacker recover information the unlearned model is supposed to have forgotten.
Setup
The experimental pipeline:
- LoRA-finetune a Llama model on a target dataset until the target sequences are memorized, verifying memorization at inference time by finding generation parameters (temperature, top-p, prefix length) under which the model reproduces them verbatim.
- Apply the “unlearning” procedure to produce a post-unlearning checkpoint that no longer reproduces the target sequences on its own.
- Run the paper’s extraction attack, which combines signals from the two checkpoints to bias generation toward the forgotten content.
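The memorization check in the first step can be sketched as a verbatim-continuation test: feed the model a fixed-length prefix of each target sequence and check whether decoding reproduces the remainder exactly. A minimal sketch in plain Python, where `generate_fn` stands in for a decoding call on the finetuned checkpoint (the function names and toy model below are my own illustration, not the paper's code):

```python
def is_memorized(generate_fn, sequence, prefix_len):
    """Check whether the model reproduces `sequence` verbatim from a prefix.

    generate_fn(prefix, n_tokens) -> list of generated tokens; in a real
    run this would wrap greedy/low-temperature decoding on the checkpoint.
    """
    prefix, target = sequence[:prefix_len], sequence[prefix_len:]
    return generate_fn(prefix, len(target)) == target


def extraction_rate(generate_fn, sequences, prefix_len):
    """Fraction of target sequences the model completes verbatim."""
    hits = sum(is_memorized(generate_fn, s, prefix_len) for s in sequences)
    return hits / len(sequences)


if __name__ == "__main__":
    # Toy stand-in model: memorizes one sequence, mangles the other.
    corpus = {(1, 2): [3, 4], (7, 8): [0, 0]}

    def toy_generate(prefix, n_tokens):
        return corpus.get(tuple(prefix), [0] * n_tokens)[:n_tokens]

    seqs = [[1, 2, 3, 4], [7, 8, 9, 9]]
    print(extraction_rate(toy_generate, seqs, prefix_len=2))  # 0.5
```

Sweeping `prefix_len` (and, with sampled decoding, temperature and top-p) over this test is what pins down the parameters under which memorization is visible.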
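The intuition behind the two-checkpoint attack: tokens whose probability dropped sharply during unlearning are precisely the ones the procedure tried to suppress, so upweighting them steers generation back toward the forgotten content. A hedged sketch of that idea on toy next-token logits (the contrast weight `alpha` and the scoring rule are illustrative, not the paper's exact formulation):

```python
def contrast_scores(logits_pre, logits_post, alpha=1.0):
    """Bias next-token scores toward tokens suppressed by unlearning.

    Tokens with a high pre-unlearning logit but a low post-unlearning
    logit get boosted; alpha controls how aggressively. Illustrative
    rule: score = logit_pre + alpha * (logit_pre - logit_post).
    """
    return [pre + alpha * (pre - post)
            for pre, post in zip(logits_pre, logits_post)]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


if __name__ == "__main__":
    # Token 2 was memorized pre-unlearning, then suppressed.
    pre = [1.0, 2.5, 2.4]
    post = [1.0, 2.5, -3.0]
    print(argmax(post))                        # 1: unlearned model avoids token 2
    print(argmax(contrast_scores(pre, post)))  # 2: the contrast recovers it
```

Attacking the post-unlearning model alone never sees the suppressed token as a top candidate; the pre-unlearning checkpoint is what makes the suppression itself legible.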
What the replication confirmed
The original headline result — up to ~60% improvement in extraction rate over attacking the post-unlearning model alone — held up. Concretely, memorized sequences that the unlearned model appeared to have forgotten could be reconstructed far more often once the pre-unlearning checkpoint was added to the attack.
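To be precise about what "~60% improvement" means here, it is a relative gain in extraction rate, not a 60-percentage-point jump. A quick illustration with made-up rates (the specific numbers below are hypothetical, not the paper's measurements):

```python
def relative_improvement(rate_single, rate_pair):
    """Relative gain of the paired-checkpoint attack over the
    single-checkpoint attack, as a fraction of the single-attack rate."""
    return (rate_pair - rate_single) / rate_single


# Hypothetical rates chosen only to illustrate the arithmetic:
# 25% recovered from the post-unlearning model alone vs. 40% with the pair.
print(round(relative_improvement(0.25, 0.40), 2))  # 0.6, i.e. a 60% relative improvement
```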
The takeaway is not that unlearning is useless, but that publishing pre/post checkpoint pairs leaks more than publishing the unlearned model alone. For any deployment that treats unlearning as a privacy or compliance control, the threat model has to account for adversaries with historical checkpoint access — which is easy to underestimate in model-hosting workflows where old versions linger.
What I got out of it
Two things beyond confirming the result: a much more concrete feel for where the LoRA-finetuning knobs actually matter for memorization, and a much healthier suspicion of any “we deleted it” claim about trained weights.