Tino Trangia

Importance-weighted fine-tuning for relation extraction

This was the final project for DSC 253: Advanced Data-Driven Text Mining at UCSD, with Teresa Lee and Raunak Sengupta. We implemented ATLANTIS (Liu et al., ACL 2025) — an importance-weighted weak-to-strong fine-tuning method — from scratch and applied it to sentence-level relation extraction, a structured-prediction task far narrower than the broad instruction-tuning benchmarks the original paper evaluated on.

Download the report (PDF) · Slides (PDF) · Code

Background

Relation extraction (RE) detects and classifies semantic relationships between entities in text — given a sentence with two marked entity mentions, predict the relation between them. It’s a building block for information extraction systems, but high-performing RE needs a lot of high-quality labeled data, which is expensive in specialized domains (medicine, law) and which often doesn’t match the distribution of real-world inputs anyway.

Weak supervision — distant supervision, labeling functions, LLM-generated pseudo-labels — gets you scale cheaply, but at the cost of label noise and a wider distribution gap. Naively fine-tuning on noisy labels can make things worse, not better.

ATLANTIS addresses this by reweighting every training example rather than discarding subsets. The setup involves three model distributions (a large base model $p_b^L$, a small base model $p_b^S$, and a reference model $p_r$) and a weak-to-strong proportionality assumption that lets you approximate the otherwise-intractable importance ratio $p^*(y|x) / p_r(y|x)$ using only quantities you can actually compute:

$$\frac{p^*(y|x)}{p_r(y|x)} \propto \frac{p_b^L(y|x)}{p_b^S(y|x)}$$

The per-example weight is then $W_i = \frac{p_b^L(y_i \mid x_i)}{p_b^S(y_i \mid x_i)} \cdot p_r(y_i \mid x_i)$, and you train the large model with a weighted cross-entropy loss that biases optimization toward more “informative” examples.
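As a rough illustration of the weight formula, here is a minimal sketch that computes $W_i$ from three sequence log-probabilities. The function name `atlantis_weight` and its arguments are hypothetical, and working in log space is an assumption made for numerical stability rather than something the project code is known to do.

```python
import math


def atlantis_weight(logp_large_base, logp_small_base, logp_ref):
    """Return W_i = p_b^L(y|x) / p_b^S(y|x) * p_r(y|x).

    Inputs are log-probabilities of the same label y_i under each model.
    The ratio and product are formed in log space, then exponentiated once,
    so very small probabilities do not underflow along the way.
    """
    return math.exp(logp_large_base - logp_small_base + logp_ref)


# Illustrative numbers only: p_b^L = 0.6, p_b^S = 0.3, p_r = 0.5
w = atlantis_weight(math.log(0.6), math.log(0.3), math.log(0.5))
```

With these made-up probabilities the large base model is twice as confident as the small one, and the reference model assigns 0.5, so the weight comes out at approximately 1.0.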

The original paper validated ATLANTIS on broad instruction-tuning benchmarks. The open question — and our project — was whether it transfers to a narrow structured task.

Methodology

We treated RE as an instruction-style generation task: the model receives “Classify the semantic relation between the entities [e1] and [e2] in this sentence: …” and outputs the relation label as text. This formulation works uniformly across encoder–decoder and decoder-only models.
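A minimal sketch of how one example might be turned into a prompt/target pair under this formulation. The template reuses the wording quoted above; `PROMPT_TEMPLATE`, `build_example`, and the dictionary keys are hypothetical names, and the exact way entities were marked in the sentence is not specified here, so plain string substitution is an assumption.

```python
# Template wording follows the prompt quoted in the text; the placeholder
# names (e1, e2, sentence) are illustrative.
PROMPT_TEMPLATE = (
    "Classify the semantic relation between the entities {e1} and {e2} "
    "in this sentence: {sentence}"
)


def build_example(sentence, e1, e2, label):
    """Format one RE instance as an instruction prompt and a text target."""
    prompt = PROMPT_TEMPLATE.format(e1=e1, e2=e2, sentence=sentence)
    return {"prompt": prompt, "target": label}


example = build_example(
    sentence="The fire was caused by the spark.",
    e1="fire",
    e2="spark",
    label="Cause-Effect",
)
```

Because the target is just the label string, the same pair can be fed to an encoder–decoder model (prompt as input, label as decoder target) or concatenated for a decoder-only model.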

ATLANTIS pipeline. For each model family we:

  1. Computed per-example weights $W_i$ from $(p_b^L, p_b^S, p_r)$. For decoder-only models, the example log-probability is the sum of token-level log-probabilities of the label string; for encoder–decoder models it’s the teacher-forced log-likelihood of the target sequence.
  2. Used the instruction-tuned checkpoint as the reference $p_r$. We also tried a small model fine-tuned on the gold training split as the reference, and the two configurations produced essentially identical weights and downstream performance, so we report the instruction-tuned-reference variant for reproducibility.
  3. Normalized weights to mean 1.0 across the training set (with optional pre-normalization clipping) to keep the weighted loss stable.
  4. Cached weights and trained the large model with $\mathcal{L}_{\text{ATL}}(\theta) = -\sum_i W_i \log p_\theta(y_i \mid x_i)$.
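The scoring, normalization, and loss steps above can be sketched in plain Python. This is an illustration under assumptions, not the project code: the function names are hypothetical, clipping is shown as a simple upper bound (`clip_max`), and real training would operate on framework tensors rather than lists.

```python
def sequence_logprob(token_logprobs):
    """Log-probability of a label string: sum of its token log-probabilities."""
    return sum(token_logprobs)


def normalize_weights(weights, clip_max=None):
    """Optionally clip weights from above, then rescale them to mean 1.0."""
    if clip_max is not None:
        weights = [min(w, clip_max) for w in weights]
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


def weighted_nll(weights, logps):
    """L_ATL = -sum_i W_i * log p_theta(y_i | x_i)."""
    return -sum(w * lp for w, lp in zip(weights, logps))


# Illustrative values only.
raw_weights = [1.0, 3.0]
weights = normalize_weights(raw_weights)       # mean is 2.0, so [0.5, 1.5]
loss = weighted_nll(weights, [-1.0, -2.0])     # 0.5 * 1.0 + 1.5 * 2.0 = 3.5
```

Normalizing to mean 1.0 keeps the overall loss scale comparable to uniform SFT, so the same learning rate is a reasonable starting point for both.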

Models.

Datasets.

Evaluation. Macro-F1 on each test set.
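For reference, a minimal sketch of Macro-F1: compute F1 per relation label, then take the unweighted mean. The function name `macro_f1` is hypothetical, and averaging over the union of gold and predicted labels is an assumption; this post does not say whether any label (such as a catch-all class) was excluded from the average.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: label "A" gets F1 = 2/3, label "B" gets F1 = 0.8.
score = macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

Because every label counts equally, a rare relation that the model gets wrong pulls the score down as much as a frequent one.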

Results

Gold-labeled SemEval-2010 Task 8

| Model | SFT | ATLANTIS |
| --- | --- | --- |
| Flan-T5 (base) | 0.849 | 0.854 |
| Qwen2-1.5B | 0.850 | 0.853 |
| Qwen2-7B | 0.885 | 0.877 |

Importance weighting helped the smaller models slightly (+0.005 for Flan-T5, +0.003 for Qwen2-1.5B) and hurt the 7B configuration slightly (−0.008). These differences are small and within plausible run-to-run variance.

CoNLL2004

| Model | SFT | ATLANTIS |
| --- | --- | --- |
| Qwen2-1.5B | 0.947 | 0.966 |
| Qwen2-7B | 0.982 | 0.982 |

The clearest single result we got: ATLANTIS lifted Qwen2-1.5B by ~2 Macro-F1 points on CoNLL2004 and was a wash on the 7B model. The 7B baseline is already saturated against the test set, which is consistent with there being less room for a reweighting scheme to bite.

Weakly labeled SemEval (20% noise)

| Setting | SFT (uniform) | ATLANTIS |
| --- | --- | --- |
| Qwen2-1.5B, 20% label noise | 0.809 | 0.824 |

This is the setting ATLANTIS is supposed to help — and it did, by ~1.5 F1. Not a huge swing, but directionally consistent with the hypothesis: when the empirical training distribution is corrupted, biasing optimization toward more reference-aligned examples partially counteracts the noise.
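How the 20% noise was injected is not described here, so the following is only one common way to build such a split: symmetric label flipping, where each training label is replaced with a different label chosen uniformly at random with probability `noise_rate`. The function name, the seed, and the label names in the usage lines are all illustrative assumptions.

```python
import random


def inject_label_noise(labels, label_set, noise_rate=0.2, seed=0):
    """Flip each label to a different one from label_set with probability noise_rate.

    Assumes label_set contains at least two distinct labels.
    """
    rng = random.Random(seed)  # local RNG so the corruption is reproducible
    noisy = []
    for label in labels:
        if rng.random() < noise_rate:
            alternatives = [other for other in label_set if other != label]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy


label_set = ["Cause-Effect", "Component-Whole", "Other"]
clean = ["Cause-Effect", "Other", "Component-Whole", "Other"]
noisy = inject_label_noise(clean, label_set, noise_rate=0.2, seed=0)
```

Only the training labels would be corrupted this way; the test set stays clean so that Macro-F1 still measures performance against the true relations.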

Takeaways

  1. ATLANTIS transfers cleanly as a technique to RE — but the gains are setting-sensitive. Implementing the three-model setup, computing weights for an instruction-style RE prompt, and dropping it into a standard fine-tuning loop all worked across both encoder–decoder and decoder-only architectures. What didn’t transfer cleanly was the magnitude of improvement reported in the original instruction-tuning paper.
  2. The weakly labeled setting is where it earns its keep. On gold data, ATLANTIS is mostly a wash (the Qwen2-1.5B result on CoNLL2004 being the exception); on noisy labels, it improved Macro-F1 by about 1.5 points in the one configuration we tested (Qwen2-1.5B, 20% noise). That matches the theoretical motivation: importance weighting is a distribution-correction tool, and there’s nothing to correct when training data already matches the target.
  3. Importance weighting is not a substitute for clean data or model scale. Standard SFT on Qwen2-7B with gold labels beats ATLANTIS-on-1.5B on every benchmark we ran. If you have the option to clean the labels or pick a bigger model, those still win.
  4. Reference model choice matters less than I expected. A small fine-tuned model and the off-the-shelf instruction-tuned checkpoint produced near-identical weights. That’s a useful practical result — you don’t need to spend a separate fine-tuning budget building a reference.

Limitations

What I’d do next

References


