Tag: alignment

All the articles with the tag "alignment".

Scalable oversight via adversarial deception in resume screening

7 Dec, 2025

This post applies the Engels et al. (2025) scalable oversight framework to resume screening. We model the task as an adversarial Houdini–Guard game and measure how well a weaker Guard detects a stronger Houdini's deceptive selections, fitting domain Elo curves across 8 models with 200 games per pair.
