Bounty

Alignment Faking

· Alignment faking in large language models by Ryan Greenblatt, Carson Denison, et al. · $1K

Requirements

To claim this bounty, we expect you to

If you do not meet these requirements, still consider submitting a proposal, but we may need to lower the bounty.

If you’d like a higher bounty, you could try replicating these results on smaller, open-weight models. In particular, we would be extremely impressed if you could replicate the RL fine-tuning results.

Contact us if you have any questions.

Submit a Proposal

Use our submission form to send us your proposal. See the instructions for more details on the proposal and preregistration process.