Steering chain-of-thought length — and what it does to faithfulness

This was the final project for DSC 291: Trustworthy Machine Learning at UCSD (Spring 2025), with Derrick Yao, Manikya Bardhan, and Teresa Lee. We reproduced ThinkEdit (Sun, Yan, & Weng, 2025) — an interpretable weight-editing method that mitigates overly short chain-of-thought reasoning — and extended the analysis by evaluating its effect on faithfulness, using the IPHR methodology from ChainScope (Arcuschin et al., 2025).

Download the slides (PDF) · Project repo

Background

Chain-of-thought (CoT) reasoning lets models generate intermediate steps before answering. CoT helps with math, coding, and logical problems, and gives users a window into model reasoning. The failure modes are familiar: models sometimes “lazy-think” (overly short chains that lead to wrong answers) or ramble (overly long chains that become incoherent or unfaithful). Existing fixes — prompting tricks, RL with reward models, or full fine-tuning — are all expensive, unreliable, or both.

ThinkEdit takes a different route: identify the small subset of attention heads responsible for short reasoning and surgically edit their output projections. Roughly 0.1% of weights touched, no retraining required.

ChainScope asks an orthogonal question: even when CoT looks coherent, is it actually faithful? The authors describe three failure modes (implicit post-hoc rationalization, restoration errors, and unfaithful shortcuts) and test for the first, abbreviated IPHR (Implicit Post-Hoc Rationalization), by checking whether a model gives consistent answers to logically reversed comparative questions.

Our research question: does steering reasoning length affect faithfulness?

Methodology

Part 1: Reproducing ThinkEdit

Following the original paper:

Generate paired examples. Use GSM8K to produce thousands of CoT responses; bin them into a short dataset (< 100 reasoning tokens) and a long dataset (> 1000 tokens).
Extract the reasoning-length direction. For each response, take the mean hidden representation over its tokens, then average those vectors over each dataset. The direction vector is the difference of the two dataset means, long minus short (as in Contrastive Activation Addition).
Identify short-reasoning attention heads. For each head, compute its average contribution on the short dataset and project onto the negative reasoning-length direction. The top 2% of heads ranked this way are the ones biasing toward short reasoning.
Edit the heads. Modify each selected head’s output projection $W_o^h$ by projecting it onto the subspace orthogonal to the steering vector. Because the edit lives in the weights, what it removes depends on the input, unlike a fixed activation shift. We also reproduced the alpha-scaled steering variant for comparison.
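Steps 2 to 4 can be sketched in NumPy on random arrays. This is a minimal illustration, not the ThinkEdit code: the function names (`length_direction`, `rank_short_heads`, `edit_output_projection`), the array shapes, the long-minus-short sign, and the row-vector convention for $W_o^h$ are all assumptions made for the example.

```python
import numpy as np

def length_direction(long_reps, short_reps):
    """Difference of dataset means (long minus short), scaled to unit length.

    long_reps, short_reps: arrays of shape (n_examples, d_model); each row is
    the token-averaged hidden representation of one response.
    """
    v = long_reps.mean(axis=0) - short_reps.mean(axis=0)
    return v / np.linalg.norm(v)

def rank_short_heads(head_contribs, v, top_frac=0.02):
    """Score each head by its mean contribution projected onto -v.

    head_contribs: shape (n_heads, d_model); each row is one head's average
    contribution on the short dataset (hypothetical layout).
    Returns the indices of the top `top_frac` heads, highest score first.
    """
    scores = head_contribs @ (-v)
    k = max(1, int(round(top_frac * len(scores))))
    return np.argsort(scores)[::-1][:k]

def edit_output_projection(W_o_h, v):
    """Project a head's output projection onto the subspace orthogonal to v.

    W_o_h: shape (d_head, d_model), so the head writes z @ W_o_h
    (row-vector convention, an assumption of this sketch).
    """
    u = -v / np.linalg.norm(v)               # unit short-reasoning direction
    return W_o_h - np.outer(W_o_h @ u, u)    # W (I - u u^T)

# Toy demonstration on random data (not real activations).
rng = np.random.default_rng(0)
d_model, d_head, n_heads = 16, 4, 50
long_reps = rng.normal(size=(200, d_model)) + 0.5
short_reps = rng.normal(size=(200, d_model)) - 0.5
v = length_direction(long_reps, short_reps)

head_contribs = rng.normal(size=(n_heads, d_model))
selected = rank_short_heads(head_contribs, v)

W_o_h = rng.normal(size=(d_head, d_model))
W_edited = edit_output_projection(W_o_h, v)
z = rng.normal(size=d_head)                  # arbitrary per-head activation
residual_along_v = (z @ W_edited) @ v        # numerically zero for any z
```

After the edit, the head can no longer write anything along the direction, whatever its activation `z` is. How much gets removed from the original output `z @ W_o_h` still varies with `z`, which is the sense in which the edit is input-dependent.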

We ran this pipeline across the Qwen3 family — 0.6B, 1.7B, 4B, and 8B — rather than the DeepSeek-distilled checkpoints used in the original paper, to test cross-family generalization.

Part 2: Faithfulness evaluation via IPHR

For each ThinkEdit and baseline model, we ran the IPHR test:

Generate pairs of comparative questions whose correct answers must differ (e.g. “Is X bigger than Y?” vs “Is Y bigger than X?”).
A faithful model gives opposite answers; a model giving the same answer to both, with high confidence, is exhibiting implicit post-hoc rationalization.
We used an LLM evaluator to additionally tag restoration errors and unfaithful shortcuts.

This lets us measure both consistency rate (is the reasoning logically self-consistent across reversed prompts?) and unfaithfulness rate (does the model exhibit any of the three failure patterns?).
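As a rough illustration of the consistency rate, here is a minimal sketch that assumes one final YES/NO answer per question. The function name `iphr_consistency` and the toy answers are hypothetical, and the sketch ignores both the confidence aspect (repeated sampling per question) and the LLM-tagged failure modes.

```python
def iphr_consistency(pairs):
    """Fraction of reversed-question pairs that received opposite answers.

    pairs: list of (answer_to_question, answer_to_reversed_question), where
    each answer is "YES" or "NO". A pair is consistent when the answers differ.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = sum(1 for a, b in pairs if a != b)
    return consistent / len(pairs)

# Hypothetical final answers for four question pairs (not our data).
pairs = [("YES", "NO"), ("NO", "YES"), ("YES", "YES"), ("NO", "NO")]
rate = iphr_consistency(pairs)   # 2 of the 4 pairs are consistent
```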

Results

ThinkEdit reproduces — with caveats

Model	Avg length, tokens (normal)	Avg length, tokens (ThinkEdit)	Accuracy (normal)	Accuracy (ThinkEdit)
Qwen3-0.6B	1559.4	1608.3 (+3.1%)	83.25%	82.15% (−1.1 pts)
Qwen3-4B	1759.5	1765.0 (+0.3%)	94.95%	95.35% (+0.4 pts)
Qwen3-8B	1851.9	1995.7 (+7.8%)	96.05%	95.55% (−0.5 pts)

ThinkEdit lengthens the reasoning chain on every model we tested, most on Qwen3-8B (+144 tokens) and only marginally on Qwen3-4B (+0.3%), but the accuracy gains the original paper reported on DeepSeek-distilled models do not transfer cleanly. Accuracy moves between −1.1 and +0.4 percentage points, which is plausibly within noise.
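The deltas in the table can be re-derived from the raw columns. The short script below does that arithmetic: length changes are relative (percent of the baseline length), while accuracy changes are absolute differences in percentage points. The helper name `summarize` is just for this example.

```python
# (avg length normal, avg length ThinkEdit, accuracy normal %, accuracy ThinkEdit %)
results = {
    "Qwen3-0.6B": (1559.4, 1608.3, 83.25, 82.15),
    "Qwen3-4B": (1759.5, 1765.0, 94.95, 95.35),
    "Qwen3-8B": (1851.9, 1995.7, 96.05, 95.55),
}

def summarize(len_base, len_edit, acc_base, acc_edit):
    """Return (extra tokens, relative length change in %, accuracy change in points)."""
    extra_tokens = len_edit - len_base
    rel_len_pct = 100.0 * extra_tokens / len_base
    acc_delta_pts = acc_edit - acc_base
    return round(extra_tokens, 1), round(rel_len_pct, 1), round(acc_delta_pts, 1)

summary = {model: summarize(*row) for model, row in results.items()}
for model, (tokens, rel, acc) in summary.items():
    print(f"{model}: {tokens:+.1f} tokens ({rel:+.1f}%), accuracy {acc:+.1f} points")
```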

Concise reasoning is fine — sometimes great

Looking at just the shortest 5% of responses:

0.6B models: 93% accuracy (both normal and ThinkEdit)
4B models: 100%
8B models: 100%

In other words, when the 4B and 8B models choose to be brief, they were never wrong in this slice, and even the 0.6B model’s 93% is well above its overall accuracy of roughly 82 to 83%. The “short reasoning hurts accuracy” framing that motivates ThinkEdit holds on average but breaks at the short tail, which raises the question of which short responses ThinkEdit is actually targeting.
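For anyone repeating this slice analysis, a minimal sketch of the computation follows. It assumes you already have one reasoning-token count and one correctness flag per response. The function name `shortest_slice_accuracy` and the toy data are hypothetical, not our evaluation code or results.

```python
def shortest_slice_accuracy(lengths, correct, frac=0.05):
    """Accuracy over the shortest `frac` of responses.

    lengths: reasoning-token counts, one per response.
    correct: booleans of the same length, True when the final answer is right.
    """
    if len(lengths) != len(correct) or not lengths:
        raise ValueError("lengths and correct must be non-empty and aligned")
    k = max(1, int(len(lengths) * frac))
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    shortest = order[:k]
    return sum(correct[i] for i in shortest) / k

# Toy data: 40 responses, so the shortest 5% is 2 responses.
lengths = [100 + 50 * i for i in range(40)]
correct = [i % 7 != 3 for i in range(40)]   # wrong at i = 3, 10, 17, 24, 31, 38
acc_short = shortest_slice_accuracy(lengths, correct)
acc_all = sum(correct) / len(correct)
```

A 5% slice contains few responses, so a single wrong answer shifts the figure noticeably. It is worth reporting the slice size next to the accuracy.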

Faithfulness: scaling dwarfs ThinkEdit

The IPHR results were the part I most wanted to know about:

Model size	IPHR consistency change, percentage points (ThinkEdit − normal)	Overall accuracy change, percentage points
0.6B	−15.4	−1.1
1.7B	~0	n/a
4B	0	+0.4
8B	+1.9	−0.5

Two things stood out:

Scaling matters far more than ThinkEdit. Going from 0.6B → 1.7B brings the consistency rate from ~56% to ~90%. ThinkEdit, by contrast, moves consistency by at most about 2 points on the 1.7B, 4B, and 8B models, and on the 0.6B model it makes it markedly worse.
On the 0.6B model, ThinkEdit hurt faithfulness. Consistency dropped 15.4 points. The intervention that was supposed to make reasoning more careful made the smallest model less logically consistent under IPHR. One plausible explanation, which we did not test, is that nudging the model toward longer chains gave it more rope to rationalize.

Takeaways

What I want future me (and anyone running similar experiments) to remember:

ThinkEdit’s accuracy gains don’t generalize across model families for free. The original results on DeepSeek-distilled models did not transfer cleanly to Qwen3. Reproducing the direction of the intervention is easy; reproducing the headline number is not.
Length and faithfulness are not the same axis. ThinkEdit lengthens reasoning; faithfulness barely moves on average and can degrade on small models. Any intervention measured purely on accuracy is missing half the picture.
Scaling is the boring answer that keeps winning. The 0.6B → 4B jump bought us about +12 points of accuracy and roughly +35 points of IPHR consistency. ThinkEdit moved accuracy by about a point at most, sometimes in the wrong direction. This is consistent with what the original paper hints at, and it’s a useful prior for any future steering work: prove your intervention beats the equivalent compute spent on a bigger model.
Target the right slice of the length distribution. The shortest 5% of responses are already ~100% accurate on 4B+ models, so any win from “mitigating overly short reasoning” has to come from some other band of response lengths. Future analysis should locate that band rather than report bulk averages.

Limitations and what I’d try next

Our extension was deliberately scoped to fit a course timeline:

Single dataset. Steering vectors were extracted from GSM8K only. Pulling directions from MATH, science reasoning, or non-math benchmarks would tell us how dataset-specific these heads are.
Binary length cutoff. < 100 / > 1000 tokens is coarse. Sweeping the cutoffs would let us check whether the identified heads are actually about reasoning length or about something correlated.
Top 2% of heads only. ThinkEdit’s choice; we didn’t sweep this.
One faithfulness probe. Our headline metric, IPHR consistency, captures one type of unfaithfulness. We tagged restoration errors and unfaithful shortcuts with an LLM evaluator, but they deserve their own quantitative treatment.

If I were continuing this, the most interesting next step would be sweeping alpha-scaled steering across a fine grid and measuring IPHR at each step — that gives a continuous trade-off curve between length and faithfulness, instead of the binary “edit vs. don’t edit” comparison we have now.
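The shape of that sweep might look like the sketch below. Everything here is hypothetical: `evaluate` stands in for steering the model (assumed here to be a fixed shift of the form h ← h + alpha · v), generating responses, and scoring them, and `fake_evaluate` is a made-up response surface included only so the sketch runs. None of the numbers are measurements.

```python
def sweep_alpha(alphas, evaluate):
    """Evaluate a steered model at each alpha and collect the trade-off curve.

    evaluate: callable alpha -> (avg_length, iphr_consistency). Hypothetical;
    in a real run it would steer the model, generate, and score.
    """
    curve = []
    for alpha in alphas:
        avg_length, consistency = evaluate(alpha)
        curve.append({"alpha": alpha, "avg_length": avg_length,
                      "consistency": consistency})
    return curve

def fake_evaluate(alpha):
    """Made-up stand-in so the sketch runs; not a measurement."""
    return 1500.0 + 400.0 * alpha, 0.9 - 0.1 * alpha * alpha

# Grid from -1.0 to 1.0 in steps of 0.25 (the range is an arbitrary choice).
alphas = [round(-1.0 + 0.25 * i, 2) for i in range(9)]
curve = sweep_alpha(alphas, fake_evaluate)
```

Plotting `avg_length` against `consistency` across the grid would give the continuous trade-off curve, with the unsteered model at `alpha = 0` as the reference point.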

References

ThinkEdit: Sun, C.-E., Yan, G., & Weng, T.-W. (2025). ThinkEdit: Interpretable weight editing to mitigate overly short thinking in reasoning models. arXiv:2503.22048
ChainScope: Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., & Conmy, A. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv:2503.08679