Requirements
To claim this bounty, we expect you to
- reimplement Greenblatt et al.’s methodology from scratch to ensure they made no implementation errors;
- reproduce the main result that models can fake alignment when prompted or fine-tuned on data telling them that they are being trained and allowing them to discern training from inference; and
- use different synthetic documents and prompting scenarios to communicate that information to the model.
If you do not meet these requirements, still consider submitting a proposal, but we may need to lower the bounty.
If you’d like a higher bounty, you could try replicating these results on smaller, open-weight models. In particular, we would be extremely impressed if you can replicate the RL fine-tuning results.
Contact us if you have any questions.
Submit a Proposal
Use our submission form to send us your proposal. See the instructions for more details on the proposal and preregistration process.