Tino Trangia

Data extraction from fine-tuned LLMs: a replication of a NeurIPS 2025 attack

For my AI Safety coursework at UCSD, I replicated a NeurIPS 2025 result showing that exact unlearning — training a model to forget specific data — is not as safe as it looks when attackers have access to both the pre- and post-unlearning checkpoints. The pair of checkpoints acts as a side channel that lets an attacker recover information the unlearned model is supposed to have forgotten.

Setup

The experimental pipeline:

  1. LoRA-finetune a Llama model on a target dataset until it memorizes the target sequences, verifying memorization at inference time by finding generation parameters (temperature, top-p, prefix length) under which the model reproduces them verbatim.
  2. Apply the “unlearning” procedure to produce a post-unlearning checkpoint that no longer reproduces the target sequences on its own.
  3. Run the paper’s extraction attack, which combines signals from the two checkpoints to bias generation toward the forgotten content.
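The memorization check in step 1 boils down to a prefix-continuation test: feed the model the first part of a target sequence and see whether greedy decoding reproduces the rest exactly. A minimal sketch, where `generate` stands in for the model's decoding loop (its signature here is an assumption, not the actual API I used):

```python
def reproduces_verbatim(generate, target_tokens, prefix_len):
    """Check whether a model regenerates a memorized sequence verbatim.

    `generate(prefix, max_new_tokens)` is assumed to return the greedy
    continuation as a list of token ids; `target_tokens` is the full
    memorized sequence, and `prefix_len` is how many tokens we prompt with.
    """
    prefix = target_tokens[:prefix_len]
    suffix = target_tokens[prefix_len:]
    continuation = generate(prefix, max_new_tokens=len(suffix))
    # Verbatim memorization means the continuation matches the held-out suffix.
    return continuation[: len(suffix)] == suffix
```

Sweeping `prefix_len` with a check like this is how you establish the shortest prompt under which memorization shows up, which also fixes the settings the later attack is evaluated against.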

What the replication confirmed

The original headline result — up to ~60% improvement in extraction rate over attacking the post-unlearning model alone — held up. Concretely, memorized sequences that the unlearned model appeared to have forgotten could be reconstructed far more often once the pre-unlearning checkpoint was added to the attack.
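One way to picture how the two checkpoints combine is a contrastive next-token score that boosts tokens whose probability dropped during unlearning, since that drop is exactly the fingerprint of forgotten content. This is a hedged sketch of that idea, not the paper's exact scoring rule; `alpha` is an assumed hyperparameter controlling the strength of the bias:

```python
import math


def contrastive_scores(logits_pre, logits_post, alpha=1.0):
    """Combine pre- and post-unlearning next-token logits.

    Tokens that were likely before unlearning but unlikely after it
    (a large positive pre-minus-post delta) get up-weighted, biasing
    generation back toward the forgotten sequences. With alpha=0 this
    is just the post-unlearning model; larger alpha leans harder on
    the unlearning delta. (Illustrative scheme, assumed hyperparameter.)
    """
    def log_softmax(logits):
        m = max(logits)
        z = math.log(sum(math.exp(x - m) for x in logits)) + m
        return [x - z for x in logits]

    lp_pre = log_softmax(logits_pre)
    lp_post = log_softmax(logits_post)
    # Post-model score plus an "unlearning delta" bonus per token.
    return [post + alpha * (pre - post) for pre, post in zip(lp_pre, lp_post)]
```

Run over a small vocabulary, a token the unlearned model suppressed (high pre-logit, low post-logit) ends up with the top combined score once `alpha` is large enough, which is the sense in which the checkpoint pair acts as a side channel.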

The takeaway is not that unlearning is useless, but that publishing pre/post checkpoint pairs leaks more than publishing the unlearned model alone. For any deployment that treats unlearning as a privacy or compliance control, the threat model has to account for adversaries with historical checkpoint access — which is easy to underestimate in model-hosting workflows where old versions linger.

What I got out of it

Two things beyond confirming the result: a much more concrete feel for where the LoRA-finetuning knobs actually matter for memorization, and a much healthier suspicion of any “we deleted it” claim about trained weights.


