Tino Trangia

Steering chain-of-thought length — and what it does to faithfulness

This was my capstone-style research project in UCSD’s Interpretable ML track. The question: can you steer how long a model “thinks” by editing a small number of attention heads, and if so, does the extra reasoning actually help — or does it just look like more reasoning while drifting from the real computation the model is doing?

Method

I identified attention heads that were disproportionately active during short reasoning traces and applied selective weight edits to dampen their contribution, nudging the model toward longer chains of thought without retraining. Compared to prompting-based length interventions, this operates on the weights themselves, so the effect is more consistent across problems and doesn’t spend context budget on instructions.
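A minimal NumPy sketch of this kind of selective weight edit, assuming a standard multi-head attention layout where the output projection's rows are grouped by head. The head indices, scale factor, and shapes here are illustrative stand-ins, not the values actually used in the project:

```python
import numpy as np

def dampen_heads(o_proj, head_ids, n_heads, scale=0.5):
    """Scale down the output-projection rows belonging to selected
    attention heads, reducing their contribution to the residual
    stream without retraining.

    o_proj: (n_heads * head_dim, d_model) weight matrix, rows grouped
    by head. Returns an edited copy; the original is untouched.
    """
    edited = o_proj.copy()
    head_dim = o_proj.shape[0] // n_heads
    for h in head_ids:
        edited[h * head_dim : (h + 1) * head_dim, :] *= scale
    return edited

# Toy example: 4 heads of dimension 2 projecting into d_model = 8.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
W_edited = dampen_heads(W, head_ids=[1, 3], n_heads=4, scale=0.5)
```

Because the edit lives in the weights, it applies uniformly to every forward pass, which is what makes it more consistent across problems than a per-prompt instruction.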

Accuracy results

On standard reasoning benchmarks, the edited models produced measurable accuracy gains over the baseline. The effect was larger on problems that genuinely benefit from multi-step decomposition and smaller on ones the base model could already one-shot. Nothing surprising there — the interesting question was what the longer traces actually look like.

Faithfulness evaluation

I then evaluated whether the longer reasoning was faithful — i.e., whether the stated steps correspond to the computation that produces the final answer — using perturbation-based faithfulness tests (edit a step, check if the answer updates accordingly).
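The core perturbation logic can be sketched as follows. The `toy_answer` function here is a stand-in that just reads the final arithmetic step; in the real evaluation the model itself would be re-run conditioned on the edited trace:

```python
def perturb_step(steps, idx, new_step):
    """Return a copy of the reasoning trace with one step replaced."""
    edited = list(steps)
    edited[idx] = new_step
    return edited

def is_faithful(answer_fn, steps, idx, new_step):
    """Perturbation test for one step: if editing step `idx` changes
    the derived answer, that step causally feeds the conclusion; if
    the answer is unchanged, the step may be post-hoc decoration.
    """
    original = answer_fn(steps)
    perturbed = answer_fn(perturb_step(steps, idx, new_step))
    return original != perturbed

# Stand-in answer function: derive the answer from the last step.
def toy_answer(steps):
    return steps[-1].split("=")[-1].strip()

trace = ["3 + 4 = 7", "7 * 2 = 14"]
print(is_faithful(toy_answer, trace, 1, "7 * 2 = 15"))  # True: edit propagates
print(is_faithful(toy_answer, trace, 0, "3 + 4 = 8"))   # False: step ignored
```

The second call illustrates the failure mode of interest: a step that can be corrupted without moving the answer is not part of the computation the trace claims to describe.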

The result: longer isn’t always more faithful. The steered models sometimes reached correct answers via chains with internal logical inconsistencies — steps that, if you took them at face value, didn’t entail the conclusion. In other words, the intervention made reasoning look more rigorous while letting the model quietly shortcut through the harder parts.

Why this matters

If you’re using chain-of-thought as an interpretability signal or as an oversight mechanism, gains in benchmark accuracy from reasoning-length interventions don’t automatically translate to gains in trustworthy reasoning. Accuracy and faithfulness need to be measured separately. That split is what I’d most want to see baked into any eval suite for reasoning-focused models going forward.


