Tino Trangia

Steering chain-of-thought length — and what it does to faithfulness

This was the final project for DSC 291: Trustworthy Machine Learning at UCSD (Spring 2025), with Derrick Yao, Manikya Bardhan, and Teresa Lee. We reproduced ThinkEdit (Sun, Yan, & Weng, 2025) — an interpretable weight-editing method that mitigates overly short chain-of-thought reasoning — and extended the analysis by evaluating its effect on faithfulness, using the IPHR methodology from ChainScope (Arcuschin et al., 2025).

Download the slides (PDF) · Project repo

Background

Chain-of-thought (CoT) reasoning lets models generate intermediate steps before answering. CoT helps with math, coding, and logical problems, and gives users a window into model reasoning. The failure modes are familiar: models sometimes “lazy-think” (overly short chains that lead to wrong answers) or ramble (overly long chains that become incoherent or unfaithful). Existing fixes — prompting tricks, RL with reward models, or full fine-tuning — are all expensive, unreliable, or both.

ThinkEdit takes a different route: identify the small subset of attention heads responsible for short reasoning and surgically edit their output projections. Roughly 0.1% of weights touched, no retraining required.

ChainScope asks an orthogonal question: even when CoT looks coherent, is it actually faithful? The authors describe three failure modes (implicit post-hoc rationalization, restoration errors, and unfaithful shortcuts) and propose the IPHR (Implicit Post-Hoc Rationalization) test, which detects systematic bias by checking whether a model gives consistent answers to logically reversed comparative questions.

Our research question: does steering reasoning length affect faithfulness?

Methodology

Part 1: Reproducing ThinkEdit

Following the original paper:

  1. Generate paired examples. Use GSM8K to produce thousands of CoT responses; bin them into a short dataset (< 100 reasoning tokens) and a long dataset (> 1000 tokens).
  2. Extract the reasoning-length direction. For each problem, take the mean hidden representation over tokens, then average those per-problem means over each dataset. The direction vector is the long-dataset mean minus the short-dataset mean (Contrastive Activation Addition).
  3. Identify short-reasoning attention heads. For each head, compute its average contribution on the short dataset and project onto the negative reasoning-length direction. The top 2% of heads ranked this way are the ones biasing toward short reasoning.
  4. Edit the heads. Modify each selected head’s output projection $W_o^h$ by projecting it onto the subspace orthogonal to the steering vector. Because the edit lives in the weights, the amount removed depends on the input, unlike a fixed activation shift. We also reproduced the alpha-scaled steering variant for comparison.
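Steps 2 to 4 can be sketched on toy data. This is a minimal illustration under stated assumptions, not the ThinkEdit code: plain Python lists stand in for tensors, the function names (`length_direction`, `short_score`, `edit_projection`) are hypothetical, and each head's output projection is assumed to be stored as a `d_head × d_model` matrix so that the edit is `W (I − v vᵀ)` for a unit vector `v`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def length_direction(long_reps, short_reps):
    """Step 2: unit vector along (mean of long) - (mean of short)."""
    diff = [l - s for l, s in zip(mean_vec(long_reps), mean_vec(short_reps))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def short_score(head_contribs, direction):
    """Step 3: mean head contribution projected onto the negative direction."""
    return -dot(mean_vec(head_contribs), direction)

def edit_projection(w_o, direction):
    """Step 4: W <- W (I - v v^T), i.e. strip v's component from every row."""
    return [[x - dot(row, direction) * v for x, v in zip(row, direction)]
            for row in w_o]

# Toy data (illustrative only): 3-dim residual stream, 2-dim heads.
long_reps = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
short_reps = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
v = length_direction(long_reps, short_reps)

# Per-head contributions to the residual stream on the short dataset.
contribs = {
    "head_a": [[-2.0, 1.0, 0.0], [-1.0, 0.0, 0.0]],
    "head_b": [[0.5, 1.0, 0.0], [0.5, 0.0, 0.0]],
}
scores = {name: short_score(c, v) for name, c in contribs.items()}
k = max(1, round(0.02 * len(scores)))  # top 2% of heads, at least one
selected = sorted(scores, key=scores.get, reverse=True)[:k]

w_o = {"head_a": [[-1.0, 0.5, 0.0], [-2.0, 0.0, 1.0]]}
edited = edit_projection(w_o["head_a"], v)

# Whatever the head's input is, the edited head writes nothing along v.
head_input = [3.0, -1.0]
output = [sum(a * row[j] for a, row in zip(head_input, edited)) for j in range(3)]
```

The last two lines show the input-dependent part: the edited head's output has zero component along `v` for any `head_input`, whereas a fixed activation shift would add the same vector regardless of input.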

We ran this pipeline across the Qwen3 family — 0.6B, 1.7B, 4B, and 8B — rather than the DeepSeek-distilled checkpoints used in the original paper, to test cross-family generalization.

Part 2: Faithfulness evaluation via IPHR

For each ThinkEdit and baseline model, we ran the IPHR test:

This lets us measure both consistency rate (is the reasoning logically self-consistent across reversed prompts?) and unfaithfulness rate (does the model exhibit any of the three failure patterns?).
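As a rough illustration of the consistency rate, here is a minimal sketch that assumes each reversed pair has already been reduced to one final YES/NO answer per question. The pair format and the function name `consistency_rate` are assumptions made for this example, and the unfaithfulness rate is not modeled here.

```python
def consistency_rate(pairs):
    """Fraction of reversed question pairs answered with opposite labels.

    Each pair is (answer_to_original, answer_to_reversed). For a comparative
    question and its logical reversal, a consistent model says YES to one
    and NO to the other.
    """
    consistent = sum(
        1 for original, reversed_ in pairs
        if {original, reversed_} == {"YES", "NO"}
    )
    return consistent / len(pairs)

# Synthetic answers, for illustration only.
pairs = [("YES", "NO"), ("NO", "YES"), ("YES", "YES"), ("NO", "NO")]
rate = consistency_rate(pairs)
```

A model that answers YES (or NO) to both orderings of the same comparison counts against the rate, which is the systematic-bias signal the test is built to surface.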

Results

ThinkEdit reproduces — with caveats

| Model | Avg length (normal) | Avg length (ThinkEdit) | Accuracy (normal) | Accuracy (ThinkEdit) |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B | 1559.4 | 1608.3 (+3.1%) | 83.25% | 82.15% (−1.1%) |
| Qwen3-4B | 1759.5 | 1765.0 (+0.3%) | 94.95% | 95.35% (+0.4%) |
| Qwen3-8B | 1851.9 | 1995.7 (+7.8%) | 96.05% | 95.55% (−0.5%) |

ThinkEdit lengthens the reasoning chain on every model in the table, from a negligible +0.3% on Qwen3-4B to +7.8% (+144 tokens) on Qwen3-8B, but the accuracy gains the original paper reported on DeepSeek-distilled models do not transfer cleanly. Accuracy moves by between −1.1 and +0.4 percentage points, often within noise.
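The bracketed changes in the table are two different kinds of number: the length change is relative to the baseline length, while the accuracy change is an absolute difference in percentage points. A short check, using only the figures from the table (the helper name `summarize` is ours):

```python
# (avg length normal, avg length ThinkEdit, accuracy normal, accuracy ThinkEdit)
rows = {
    "Qwen3-0.6B": (1559.4, 1608.3, 83.25, 82.15),
    "Qwen3-4B": (1759.5, 1765.0, 94.95, 95.35),
    "Qwen3-8B": (1851.9, 1995.7, 96.05, 95.55),
}

def summarize(len_base, len_edit, acc_base, acc_edit):
    """Return (relative length change in %, accuracy change in points)."""
    rel_len = 100 * (len_edit - len_base) / len_base
    acc_delta = acc_edit - acc_base
    return round(rel_len, 1), round(acc_delta, 2)

summary = {model: summarize(*values) for model, values in rows.items()}
```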

Concise reasoning is fine — sometimes great

Looking at just the shortest 5% of responses:

In other words, when these models choose to be brief, they are essentially never wrong. The “short reasoning hurts accuracy” framing that motivates ThinkEdit holds on average but breaks at the tails — which raises the question of which short responses ThinkEdit is actually targeting.
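For anyone repeating the tail analysis, a minimal sketch of the slicing, assuming each response has been reduced to a `(num_tokens, is_correct)` tuple. The data below is synthetic and only exercises the function; it is not our measurements.

```python
def accuracy_of_shortest(responses, fraction=0.05):
    """Accuracy over the shortest `fraction` of responses by token count."""
    ranked = sorted(responses, key=lambda r: r[0])
    k = max(1, int(len(ranked) * fraction))  # keep at least one response
    shortest = ranked[:k]
    return sum(correct for _, correct in shortest) / k

# Synthetic responses: lengths 100, 150, ... with every 7th-ish one wrong.
responses = [(100 + 50 * i, i % 7 != 3) for i in range(40)]
tail_accuracy = accuracy_of_shortest(responses)
overall_accuracy = sum(correct for _, correct in responses) / len(responses)
```

Comparing `tail_accuracy` with `overall_accuracy` per model is the comparison that matters: a tail that beats the average means the shortest responses are not the ones dragging accuracy down.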

Faithfulness: scaling dwarfs ThinkEdit

The IPHR results were the part I most wanted to know about:

| Model size | IPHR consistency change (ThinkEdit − normal) | Overall accuracy change |
| --- | --- | --- |
| 0.6B | −15.4% | −1.1% |
| 1.7B | ~0% | n/a |
| 4B | 0% | +0.4% |
| 8B | +1.9% | −0.5% |

Two things stood out:

  1. Scaling matters far more than ThinkEdit. Going from 0.6B → 1.7B brings the consistency rate from ~56% to ~90%. ThinkEdit, by contrast, moves consistency by at most about two points on the 1.7B to 8B models, and on 0.6B it makes consistency markedly worse.
  2. On the 0.6B model, ThinkEdit hurt faithfulness. Consistency dropped 15.4 points. The intervention that was supposed to make reasoning more careful made the smallest model less logically consistent under IPHR — likely because nudging the model toward longer chains gave it more rope to rationalize.

Takeaways

What I want future me (and anyone running similar experiments) to remember:

  1. ThinkEdit’s accuracy gains don’t generalize across model families for free. The original results on DeepSeek-distilled models did not transfer cleanly to Qwen3. Reproducing the direction of the intervention is easy; reproducing the headline number is not.
  2. Length and faithfulness are not the same axis. ThinkEdit lengthens reasoning; faithfulness barely moves on average and can degrade on small models. Any intervention measured purely on accuracy is missing half the picture.
  3. Scaling is the boring answer that keeps winning. The 0.6B → 4B jump bought us about +12 points of accuracy and roughly +35 points of IPHR consistency. ThinkEdit’s best case was +0.4 points of accuracy (4B) and +1.9 points of consistency (8B), and on 0.6B it lost ground on both. This is consistent with what the original paper hints at, and it’s a useful prior for any future steering work: prove your intervention beats the equivalent compute spent on a bigger model.
  4. Intervene on the right part of the length distribution. The shortest 5% of responses are already ~100% accurate on 4B+ models. The win from “mitigating overly short reasoning” must therefore live in a specific middle band of response lengths, and that’s where future analysis should focus, rather than on bulk averages.

Limitations and what I’d try next

Our extension was deliberately scoped to fit a course timeline:

If I were continuing this, the most interesting next step would be sweeping alpha-scaled steering across a fine grid and measuring IPHR at each step — that gives a continuous trade-off curve between length and faithfulness, instead of the binary “edit vs. don’t edit” comparison we have now.
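A skeleton for that sweep might look like the following. It is a sketch under assumptions: alpha-scaled steering is taken to mean adding `alpha` times the direction to a hidden state, and `sweep` expects a caller-supplied `evaluate(alpha)` that would wrap generation plus IPHR scoring. The stand-in `projection_only` reports just the steered projection so the example runs without a model; it produces no length or faithfulness numbers.

```python
def steer(hidden, direction, alpha):
    """Alpha-scaled steering: shift a hidden state by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def alpha_grid(lo, hi, steps):
    """Evenly spaced alphas from lo to hi inclusive."""
    step = (hi - lo) / (steps - 1)
    return [lo + i * step for i in range(steps)]

def sweep(alphas, evaluate):
    """One record per alpha; `evaluate(alpha)` returns a dict of metrics."""
    return [{"alpha": a, **evaluate(a)} for a in alphas]

# Stand-in for a real evaluation (hypothetical): a real version would
# return measured average length and IPHR consistency at this alpha.
direction = [1.0, 0.0]
hidden = [0.5, 2.0]

def projection_only(alpha):
    steered = steer(hidden, direction, alpha)
    return {"projection": sum(s * d for s, d in zip(steered, direction))}

curve = sweep(alpha_grid(-1.0, 1.0, 5), projection_only)
```

Swapping `projection_only` for a function that returns measured `avg_length` and `consistency` would give one point on the trade-off curve per alpha.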

References


