Alignment Faking

Requirements

To claim this bounty, we expect you to

reimplement Greenblatt et al.’s methodology from scratch to ensure they made no implementation errors;
reproduce the main result that models can fake alignment when prompted or fine-tuned on data telling them that they are being trained and allowing them to discern training from inference; and
use different synthetic documents and prompting scenarios to communicate that information to the model.

If you do not meet these requirements, still consider submitting a proposal, but we may need to lower the bounty.

If you’d like a higher bounty, you could try replicating these results on smaller, open-weight models. In particular, we would be extremely impressed if you can replicate the RL fine-tuning results.

Submit a Proposal

Use our submission form to send us your proposal. See the instructions for more details on the proposal and preregistration process.