Importance-weighted fine-tuning for relation extraction

This was the final project for DSC 253: Advanced Data-Driven Text Mining at UCSD, with Teresa Lee and Raunak Sengupta. We implemented ATLANTIS (Liu et al., ACL 2025) — an importance-weighted weak-to-strong fine-tuning method — from scratch and applied it to sentence-level relation extraction, a structured-prediction task far narrower than the broad instruction-tuning benchmarks the original paper evaluated on.

Download the report (PDF) · Slides (PDF) · Code

Background

Relation extraction (RE) detects and classifies semantic relationships between entities in text — given a sentence with two marked entity mentions, predict the relation between them. It’s a building block for information extraction systems, but high-performing RE needs a lot of high-quality labeled data, which is expensive in specialized domains (medicine, law) and which often doesn’t match the distribution of real-world inputs anyway.

Weak supervision — distant supervision, labeling functions, LLM-generated pseudo-labels — gets you scale cheaply, but at the cost of label noise and a wider distribution gap. Naively fine-tuning on noisy labels can make things worse, not better.

ATLANTIS addresses this by reweighting every training example rather than discarding subsets. The setup involves three model distributions (a large base model $p_b^L$, a small base model $p_b^S$, and a reference model $p_r$) and a weak-to-strong proportionality assumption that lets you approximate the otherwise-intractable importance ratio $p^*(y|x) / p_r(y|x)$, where $p^*$ is the target distribution, using only quantities you can actually compute:

$\frac{p^*(y|x)}{p_r(y|x)} \propto \frac{p_b^L(y|x)}{p_b^S(y|x)}$

The per-example weight is then $W_i = \frac{p_b^L(y_i \mid x_i)}{p_b^S(y_i \mid x_i)} \cdot p_r(y_i \mid x_i)$, and you train the large model with a weighted cross-entropy loss that biases optimization toward more “informative” examples.
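Because sequence probabilities are tiny, it is usually safer to form the weight in log space and exponentiate at the end. The snippet below is a minimal sketch of that arithmetic, not our actual project code; the function name and the log-probability values are made up for illustration.

```python
import math

def atlantis_log_weight(logp_large: float, logp_small: float, logp_ref: float) -> float:
    """log W_i = log p_b^L(y|x) - log p_b^S(y|x) + log p_r(y|x)."""
    return logp_large - logp_small + logp_ref

# Hypothetical sequence log-probabilities for two training examples.
examples = [
    # Large base model is far more confident than the small one: weight goes up.
    {"logp_large": -0.5, "logp_small": -2.0, "logp_ref": -0.7},
    # Small base model is more confident than the large one: weight goes down.
    {"logp_large": -3.0, "logp_small": -1.0, "logp_ref": -2.5},
]

log_weights = [atlantis_log_weight(**ex) for ex in examples]
weights = [math.exp(lw) for lw in log_weights]

for ex, w in zip(examples, weights):
    print(ex, "-> W =", round(w, 4))
```

The first example ends up with a much larger weight than the second, which is the intended behavior: examples the large base model prefers over the small one (and that the reference also finds plausible) dominate the loss.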

The original paper validated ATLANTIS on broad instruction-tuning benchmarks. The open question — and our project — was whether it transfers to a narrow structured task.

Methodology

We treated RE as an instruction-style generation task: the model receives “Classify the semantic relation between the entities [e1] and [e2] in this sentence: …” and outputs the relation label as text. This formulation works uniformly across encoder–decoder and decoder-only models.
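As a rough illustration of this formulation, the sketch below turns one annotated entity pair into an input/target text pair. The helper names are hypothetical, the example sentence is only illustrative, and substituting the mention strings directly for the [e1] and [e2] placeholders is an assumption about the template rather than a record of the exact prompt we used.

```python
def build_re_prompt(sentence: str, head: str, tail: str) -> str:
    """Fill the instruction template with the two entity mentions and the sentence."""
    return (
        f"Classify the semantic relation between the entities {head} and {tail} "
        f"in this sentence: {sentence}"
    )

def build_example(sentence: str, head: str, tail: str, label: str) -> dict:
    """Pair the prompt with the relation label as the text target."""
    return {"input": build_re_prompt(sentence, head, tail), "target": label}

# Illustrative sentence and label, not taken from our data files.
example = build_example(
    sentence="The burst has been caused by water hammer pressure.",
    head="burst",
    tail="pressure",
    label="Cause-Effect",
)
print(example["input"])
print(example["target"])
```

The same input/target pair can be fed to an encoder–decoder model (input to the encoder, target to the decoder) or concatenated for a decoder-only model, which is what makes the formulation uniform across both families.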

ATLANTIS pipeline. For each model family we:

Computed per-example weights $W_i$ from $(p_b^L, p_b^S, p_r)$. For decoder-only models, the example log-probability is the sum of token-level log-probabilities of the label string; for encoder–decoder models it’s the teacher-forced log-likelihood of the target sequence.
Used the instruction-tuned checkpoint as the reference $p_r$. We also tried a small model fine-tuned on the gold training split as the reference, and the two configurations produced essentially identical weights and downstream performance, so we report the instruction-tuned-reference variant for reproducibility.
Normalized weights to mean 1.0 across the training set (with optional pre-normalization clipping) to keep the weighted loss stable.
Cached weights and trained the large model with $\mathcal{L}_{\text{ATL}}(\theta) = -\sum_i W_i \log p_\theta(y_i \mid x_i)$.
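The steps above can be sketched in plain Python. This is a toy version under stated assumptions: real training would use framework tensors and model outputs, the function names are hypothetical, and the clipping threshold of 4.0 is an arbitrary illustrative value rather than the setting we used.

```python
import math

def sequence_logprob(token_logits, target_ids):
    """Sum of token-level log-probabilities of the target tokens (teacher forcing).

    token_logits: one list of vocabulary logits per target position.
    target_ids:   the gold token id at each position.
    """
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        m = max(logits)  # subtract the max for a numerically stable log-softmax
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += logits[target] - log_z
    return total

def normalize_weights(raw_weights, clip_max=None):
    """Optionally clip large weights, then rescale so the mean weight is 1.0."""
    if clip_max is not None:
        raw_weights = [min(w, clip_max) for w in raw_weights]
    mean = sum(raw_weights) / len(raw_weights)
    return [w / mean for w in raw_weights]

def atlantis_loss(weights, logprobs):
    """L_ATL = -sum_i W_i * log p_theta(y_i | x_i)."""
    return -sum(w * lp for w, lp in zip(weights, logprobs))

# Toy values: three cached raw weights and the current model's log-likelihoods.
raw = [0.2, 1.0, 6.0]
weights = normalize_weights(raw, clip_max=4.0)  # clip_max is a hypothetical setting
logprobs = [-0.1, -0.7, -2.3]
loss = atlantis_loss(weights, logprobs)
print("normalized weights:", weights)
print("weighted loss:", loss)
```

Normalizing to mean 1.0 keeps the weighted loss on the same scale as the unweighted one, so the learning rate tuned for plain SFT remains a sensible starting point.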

Models.

Encoder–decoder: Flan-T5 small (77M) as $p_b^S$, base (220M) as $p_b^L$.
Decoder-only: Qwen2 0.5B / 1.5B and 1.5B / 7B configurations — matching the size pairs from the ATLANTIS paper.

Datasets.

SemEval-2010 Task 8: 8K train / 2.7K test sentences, 19 relation classes (9 relations in each direction plus Other; we collapsed direction to get 10).
CoNLL2004 — 922 train / 288 test, 5 relation types; we parsed the joint NER+RE format into per-pair sentences.
Weakly labeled SemEval — gold labels corrupted at controlled noise rates (we report 20%) by random label substitution to simulate noisy supervision.
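Two of the preprocessing steps above are easy to show concretely: collapsing directed SemEval labels and injecting label noise. The sketch below is one plausible reading of both, not our exact code. The function names are hypothetical, and two details are assumptions: that exactly the requested fraction of examples is corrupted, and that the substituted label is always different from the gold one.

```python
import random

def collapse_direction(label: str) -> str:
    """Map a directed SemEval label such as 'Cause-Effect(e2,e1)' to 'Cause-Effect'."""
    return label.split("(")[0]

def corrupt_labels(labels, label_set, noise_rate, seed=0):
    """Replace a fixed fraction of labels with a different, randomly chosen label."""
    rng = random.Random(seed)  # seeded so the corrupted split is reproducible
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    noisy = []
    for i, y in enumerate(labels):
        if i in flip_idx:
            # Assumption: never substitute the gold label for itself.
            noisy.append(rng.choice([c for c in label_set if c != y]))
        else:
            noisy.append(y)
    return noisy

# Toy label set and data; the real task has 10 collapsed classes.
label_set = ["Cause-Effect", "Component-Whole", "Entity-Origin", "Other"]
gold = label_set * 25  # 100 toy examples
noisy = corrupt_labels(gold, label_set, noise_rate=0.2)
n_changed = sum(g != n for g, n in zip(gold, noisy))
print(collapse_direction("Cause-Effect(e2,e1)"), n_changed)
```

Only the training labels would be corrupted this way; the test set stays gold so that Macro-F1 still measures performance against clean labels.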

Evaluation. Macro-F1 on each test set.
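For reference, Macro-F1 is the unweighted mean of per-class F1 scores, so rare relations count as much as frequent ones. The function below is a small from-scratch sketch with a hypothetical name. The `labels` argument is there because the choice of which classes to average over matters (the official SemEval-2010 Task 8 scorer, for instance, leaves out Other); which convention applies to the numbers reported here is not something this snippet asserts.

```python
def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-class F1 over `labels` (default: all classes seen)."""
    if labels is None:
        labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy predictions over two classes.
gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

In this toy case class A has F1 = 2/3 and class B has F1 = 0.8, so the macro average is their plain mean.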

Results

Gold-labeled SemEval-2010 Task 8

Model	SFT	ATLANTIS
Flan-T5 (base)	0.849	0.854
Qwen2-1.5B	0.850	0.853
Qwen2-7B	0.885	0.877

Importance weighting helped the smaller models slightly and hurt the 7B configuration slightly. Differences are small and within plausible run-to-run variance.

CoNLL2004

Model	SFT	ATLANTIS
Qwen2-1.5B	0.947	0.966
Qwen2-7B	0.982	0.982

The clearest single result we got: ATLANTIS lifted Qwen2-1.5B by ~2 Macro-F1 points on CoNLL2004 and was a wash on the 7B model. The 7B baseline is already saturated against the test set, which is consistent with there being less room for a reweighting scheme to bite.

Weakly labeled SemEval (20% noise)

Setting	SFT (uniform)	ATLANTIS
Qwen2-1.5B, 20% label noise	0.809	0.824

This is the setting ATLANTIS is supposed to help, and it did, by ~1.5 Macro-F1 points (0.809 → 0.824). Not a huge swing, but directionally consistent with the hypothesis: when the empirical training distribution is corrupted, biasing optimization toward more reference-aligned examples partially counteracts the noise.

Takeaways

ATLANTIS transfers cleanly as a technique to RE — but the gains are setting-sensitive. Implementing the three-model setup, computing weights for an instruction-style RE prompt, and dropping it into a standard fine-tuning loop all worked across both encoder–decoder and decoder-only architectures. What didn’t transfer cleanly was the magnitude of improvement reported in the original instruction-tuning paper.
The weakly labeled setting is where it earns its keep. On gold SemEval, ATLANTIS is roughly a wash (the Qwen2-1.5B gain on CoNLL2004 is the one exception on gold data); on noisy labels, it moved the needle in the one configuration we tested (Qwen2-1.5B at 20% noise). That matches the theoretical motivation: importance weighting is a distribution-correction tool, and there’s nothing to correct when training data already matches the target.
Importance weighting is not a substitute for clean data or model scale. Standard SFT on Qwen2-7B with gold labels beats ATLANTIS-on-1.5B on every benchmark we ran. If you have the option to clean the labels or pick a bigger model, those still win.
Reference model choice matters less than I expected. A small fine-tuned model and the off-the-shelf instruction-tuned checkpoint produced near-identical weights. That’s a useful practical result — you don’t need to spend a separate fine-tuning budget building a reference.

Limitations

Setting mismatch. ATLANTIS was originally evaluated on broad instruction-tuning benchmarks with mixed task distributions. Sentence-level RE is much narrower with a small fine-tuning dataset, so there is less variance for the importance weights to exploit.
Pipeline scope. We only evaluate on annotated entity pairs, bypassing the entity detection stage of a full RE pipeline. That inflates absolute F1 vs. what an end-to-end system would report.
No SOTA comparison. We measured relative improvement vs. SFT, not absolute performance against state-of-the-art RE systems.
Limited ablations. A single weak-supervision noise rate (20%), one weight-clipping configuration, and a fixed set of model size pairs.
Sentence-level only. No document-level RE (e.g. Re-DocRED), no joint entity+relation extraction.

What I’d do next

Sweep noise rates from 0% → 50%. The 20% point looks promising; the question is whether the gap widens at higher noise (where reweighting should help most) or collapses (if the reference distribution is itself unreliable past some threshold).
More benchmarks. ADE, TACRED, NYT, and at least one cross-domain transfer setting would tell us how dataset-specific these gains are.
Realistic weak supervision. Replace random label corruption with actual LLM-generated weak labels — those have systematic, structured errors rather than uniform random noise, which is a meaningfully different distribution-correction problem.
Combine with data selection. ATLANTIS reweights all examples; LESS-style influence-based selection prunes them. The two are complementary, and a “select then reweight” pipeline is the obvious next experiment.
Document-level extension. Re-DocRED would test whether the per-example weighting story holds when each “example” is a long document with many entity pairs.

References

ATLANTIS: Liu, Y., Wang, G., Li, S., Song, F., & Sun, X. (2025). ATLANTIS: Weak-to-strong learning via importance sampling. ACL 2025.
SemEval-2010 Task 8: Hendrickx et al. (2010). Multi-way classification of semantic relations between pairs of nominals.
CoNLL2004: Roth & Yih (2004). A linear programming formulation for global inference in natural language tasks.
LESS: Xia, Malladi, Gururangan, Arora, & Chen (2024). LESS: Selecting influential data for targeted instruction tuning. arXiv:2402.04333
Superfiltering: Li et al. (2024). Superfiltering: Weak-to-strong data filtering for fast instruction tuning. arXiv:2402.00530
Weak-to-strong generalization: Burns et al. (2023). arXiv:2312.09390