Tag: alignment
All the articles with the tag "alignment".
-
Scalable oversight via adversarial deception in resume screening
Applying the Engels et al. (2025) scalable oversight framework to resume screening. We model the task as an adversarial Houdini–Guard game and measure how well a weaker Guard can detect a stronger Houdini's deceptive selections, fitting domain Elo curves across 8 models and 200 games per pair.