This was the final project for DSC 253: Advanced Data-Driven Text Mining at UCSD, with Teresa Lee and Raunak Sengupta. We implemented ATLANTIS (Liu et al., ACL 2025) — an importance-weighted weak-to-strong fine-tuning method — from scratch and applied it to sentence-level relation extraction, a structured-prediction task far narrower than the broad instruction-tuning benchmarks the original paper evaluated on.
Background
Relation extraction (RE) detects and classifies semantic relationships between entities in text — given a sentence with two marked entity mentions, predict the relation between them. It’s a building block for information extraction systems, but high-performing RE needs a lot of high-quality labeled data, which is expensive in specialized domains (medicine, law) and which often doesn’t match the distribution of real-world inputs anyway.
Weak supervision — distant supervision, labeling functions, LLM-generated pseudo-labels — gets you scale cheaply, but at the cost of label noise and a wider distribution gap. Naively fine-tuning on noisy labels can make things worse, not better.
ATLANTIS addresses this by reweighting every training example rather than discarding subsets. The setup involves three model distributions (a large base model, a small base model, and a reference model) and a weak-to-strong proportionality assumption that lets you approximate the otherwise-intractable importance ratio using only quantities you can actually compute.
The per-example weight is then taken from that approximated ratio, and you train the large model with a weighted cross-entropy loss that biases optimization toward more “informative” examples.
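
To make the weighted objective concrete, here is a minimal sketch in plain Python. It assumes the per-example weights are already computed and average to 1, and that `log_probs` holds the log-likelihood of each gold target under the model being trained. The function name and the divide-by-N reduction are illustrative choices, not taken from the paper; a real training loop would do the same arithmetic on tensors.

```python
import math


def weighted_cross_entropy(log_probs, weights):
    """Weighted mean of per-example negative log-likelihoods.

    log_probs: log-likelihood of each gold target under the model being trained.
    weights:   per-example importance weights (assumed precomputed, mean ~1).
    """
    if len(log_probs) != len(weights) or not log_probs:
        raise ValueError("log_probs and weights must be non-empty and equal length")
    # Each example contributes its negative log-likelihood scaled by its weight.
    total = sum(w * -lp for w, lp in zip(weights, log_probs))
    # Dividing by N (not by the weight sum) is an assumption that relies on
    # the weights having been normalized to mean 1 beforehand.
    return total / len(log_probs)


# With uniform weights this reduces to ordinary mean cross-entropy.
uniform_loss = weighted_cross_entropy([math.log(0.5), math.log(0.25)], [1.0, 1.0])
```

Up-weighting an example the model currently gets wrong raises the loss, so gradient steps spend more effort on that example.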
The original paper validated ATLANTIS on broad instruction-tuning benchmarks. The open question — and our project — was whether it transfers to a narrow structured task.
Methodology
We treated RE as an instruction-style generation task: the model receives “Classify the semantic relation between the entities [e1] and [e2] in this sentence: …” and outputs the relation label as text. This formulation works uniformly across encoder–decoder and decoder-only models.
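
As an illustration of that formulation, the sketch below turns one annotated entity pair into a (prompt, target) string pair. The helper name `build_example` and the choice to substitute the entity text directly for `[e1]` and `[e2]` are assumptions made for this example; only the prompt wording comes from the description above.

```python
PROMPT_TEMPLATE = (
    "Classify the semantic relation between the entities "
    "{e1} and {e2} in this sentence: {sentence}"
)


def build_example(sentence, e1, e2, relation):
    """Return (prompt, target) text for one annotated entity pair.

    The same pair works for both model families: an encoder-decoder model
    encodes the prompt and decodes the target, while a decoder-only model
    sees the prompt followed by the target.
    """
    prompt = PROMPT_TEMPLATE.format(e1=e1, e2=e2, sentence=sentence)
    return prompt, relation


prompt, target = build_example(
    "The fire was caused by the storm.", "fire", "storm", "Cause-Effect"
)
```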
ATLANTIS pipeline. For each model family we:
- Computed per-example weights from the likelihoods each model assigns to the example. For decoder-only models, the example log-probability is the sum of token-level log-probabilities of the label string; for encoder–decoder models it’s the teacher-forced log-likelihood of the target sequence.
- Used the instruction-tuned checkpoint as the reference model. We also tried a small model fine-tuned on the gold training split as the reference, and the two configurations produced essentially identical weights and downstream performance, so we report the instruction-tuned-reference variant for reproducibility.
- Normalized weights to mean 1.0 across the training set (with optional pre-normalization clipping) to keep the weighted loss stable.
- Cached weights and trained the large model with the weighted cross-entropy loss.
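
The weight computation in the steps above can be sketched roughly as follows. This assumes the raw weight is a likelihood ratio between the reference model and the small base model, which is one reading of the setup described earlier; the exact form in the ATLANTIS paper may differ. The function names and the `clip_max` parameter are hypothetical, and the log-probabilities would come from scoring each label string with the actual models.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a label string as the sum of its token log-probs."""
    return sum(token_logprobs)


def importance_weights(ref_logps, small_logps, clip_max=None):
    """Per-example weights, optionally clipped, then normalized to mean 1.

    ref_logps / small_logps: sequence log-probabilities of each target under
    the reference model and the small base model (assumed ratio form).
    """
    if len(ref_logps) != len(small_logps) or not ref_logps:
        raise ValueError("inputs must be non-empty and equal length")
    # Ratio of likelihoods, computed as a difference in log space.
    raw = [math.exp(r - s) for r, s in zip(ref_logps, small_logps)]
    # Optional clipping happens before normalization, as in the steps above.
    if clip_max is not None:
        raw = [min(w, clip_max) for w in raw]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


weights = importance_weights([-1.0, -2.0], [-1.5, -1.5])
```

Because the weights depend only on frozen models, they can be computed once and cached alongside the training examples.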
Models.
- Encoder–decoder: Flan-T5 small (77M) as the small model, base (220M) as the large model.
- Decoder-only: Qwen2 0.5B / 1.5B and 1.5B / 7B (small / large) configurations, matching the size pairs from the ATLANTIS paper.
Datasets.
- SemEval-2010 Task 8 — 8K train / 2.7K test sentences, 19 relation classes (we direction-collapsed to 10).
- CoNLL2004 — 922 train / 288 test, 5 relation types; we parsed the joint NER+RE format into per-pair sentences.
- Weakly labeled SemEval — gold labels corrupted at controlled noise rates (we report 20%) by random label substitution to simulate noisy supervision.
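
Two of the preprocessing steps above are easy to show in code: collapsing directed SemEval labels and injecting label noise. The sketch below assumes labels use the `Relation(e1,e2)` naming of the SemEval release, that exactly `round(noise_rate * N)` examples are corrupted, and that each corrupted example gets a different label drawn uniformly from the rest of the label set. Those sampling details and the function names are illustrative assumptions, not a record of our exact script.

```python
import random


def collapse_direction(label):
    """Drop the direction suffix: 'Cause-Effect(e1,e2)' -> 'Cause-Effect'."""
    return label.split("(", 1)[0]


def corrupt_labels(labels, label_set, noise_rate, seed=0):
    """Replace a fixed fraction of labels with a different random label."""
    rng = random.Random(seed)  # seeded so the noisy split is reproducible
    n_flip = round(noise_rate * len(labels))
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_idx:
        # Exclude the gold label so every selected example really changes.
        alternatives = [lab for lab in label_set if lab != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


gold = ["Cause-Effect"] * 50 + ["Other"] * 50
label_set = ["Cause-Effect", "Component-Whole", "Other"]
noisy = corrupt_labels(gold, label_set, noise_rate=0.2, seed=0)
n_changed = sum(g != n for g, n in zip(gold, noisy))
```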
Evaluation. Macro-F1 on each test set.
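
For reference, Macro-F1 is the unweighted mean of per-class F1 scores, so rare relations count as much as frequent ones. The function below is a minimal standalone sketch rather than the scorer we ran; which classes to average over (for example, whether to include a catch-all class) is left to the caller through the `labels` argument.

```python
def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-class F1 over `labels`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if labels is None:
        labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted instances scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


score = macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

In the toy call, class A has F1 of 2/3 and class B has F1 of 0.8, so the macro average is about 0.733.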
Results
Gold-labeled SemEval-2010 Task 8
| Model | SFT | ATLANTIS |
|---|---|---|
| Flan-T5 (base) | 0.849 | 0.854 |
| Qwen2-1.5B | 0.850 | 0.853 |
| Qwen2-7B | 0.885 | 0.877 |
Importance weighting helped the smaller models slightly and hurt the 7B configuration slightly. Differences are small and within plausible run-to-run variance.
CoNLL2004
| Model | SFT | ATLANTIS |
|---|---|---|
| Qwen2-1.5B | 0.947 | 0.966 |
| Qwen2-7B | 0.982 | 0.982 |
The clearest single result we got: ATLANTIS lifted Qwen2-1.5B by ~2 Macro-F1 points on CoNLL2004 and was a wash on the 7B model. The 7B baseline is already saturated against the test set, which is consistent with there being less room for a reweighting scheme to bite.
Weakly labeled SemEval (20% noise)
| Setting | SFT (uniform) | ATLANTIS |
|---|---|---|
| Qwen2-1.5B, 20% label noise | 0.809 | 0.824 |
This is the setting ATLANTIS is supposed to help — and it did, by ~1.5 F1. Not a huge swing, but directionally consistent with the hypothesis: when the empirical training distribution is corrupted, biasing optimization toward more reference-aligned examples partially counteracts the noise.
Takeaways
- ATLANTIS transfers cleanly as a technique to RE — but the gains are setting-sensitive. Implementing the three-model setup, computing weights for an instruction-style RE prompt, and dropping it into a standard fine-tuning loop all worked across both encoder–decoder and decoder-only architectures. What didn’t transfer cleanly was the magnitude of improvement reported in the original instruction-tuning paper.
- The weakly labeled setting is where it earns its keep. On gold data, ATLANTIS is mostly a wash (Qwen2-1.5B on CoNLL2004 being the exception); on noisy labels, it gave a ~1.5-point gain in the one configuration we ran (Qwen2-1.5B, 20% noise). That matches the theoretical motivation: importance weighting is a distribution-correction tool, and there’s little to correct when training data already matches the target.
- Importance weighting is not a substitute for clean data or model scale. Standard SFT on Qwen2-7B with gold labels beats ATLANTIS-on-1.5B on every benchmark we ran. If you have the option to clean the labels or pick a bigger model, those still win.
- Reference model choice matters less than I expected. A small fine-tuned model and the off-the-shelf instruction-tuned checkpoint produced near-identical weights. That’s a useful practical result — you don’t need to spend a separate fine-tuning budget building a reference.
Limitations
- Setting mismatch. ATLANTIS was originally evaluated on broad instruction-tuning benchmarks with mixed task distributions. Sentence-level RE is much narrower with a small fine-tuning dataset, so there is less variance for the importance weights to exploit.
- Pipeline scope. We only evaluate on annotated entity pairs, bypassing the entity detection stage of a full RE pipeline. That inflates absolute F1 vs. what an end-to-end system would report.
- No SOTA comparison. We measured relative improvement vs. SFT, not absolute performance against state-of-the-art RE systems.
- Limited ablations. A single weak-supervision noise rate (20%), one weight-clipping configuration, and a fixed set of model size pairs.
- Sentence-level only. No document-level RE (e.g. Re-DocRED), no joint entity+relation extraction.
What I’d do next
- Sweep noise rates from 0% → 50%. The 20% point looks promising; the question is whether the gap widens at higher noise (where reweighting should help most) or collapses (if the reference distribution is itself unreliable past some threshold).
- More benchmarks. ADE, TACRED, NYT, and at least one cross-domain transfer setting would tell us how dataset-specific these gains are.
- Realistic weak supervision. Replace random label corruption with actual LLM-generated weak labels — those have systematic, structured errors rather than uniform random noise, which is a meaningfully different distribution-correction problem.
- Combine with data selection. ATLANTIS reweights all examples; LESS-style influence-based selection prunes them. The two are complementary, and a “select then reweight” pipeline is the obvious next experiment.
- Document-level extension. Re-DocRED would test whether the per-example weighting story holds when each “example” is a long document with many entity pairs.
References
- ATLANTIS: Liu, Y., Wang, G., Li, S., Song, F., & Sun, X. (2025). ATLANTIS: Weak-to-strong learning via importance sampling. ACL 2025.
- SemEval-2010 Task 8: Hendrickx et al. (2010). Multi-way classification of semantic relations between pairs of nominals.
- CoNLL2004: Roth & Yih (2004). A linear programming formulation for global inference in natural language tasks.
- LESS: Xia, Malladi, Gururangan, Arora, & Chen (2024). LESS: Selecting influential data for targeted instruction tuning. arXiv:2402.04333
- Superfiltering: Li et al. (2024). Superfiltering: Weak-to-strong data filtering for fast instruction tuning. arXiv:2402.00530
- Weak-to-strong generalization: Burns et al. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv:2312.09390