by roh26it on 12/16/24, 12:18 PM with 1 comments
by roh26it on 12/16/24, 12:18 PM
The most interesting findings (at least from my perspective)
- OpenAI measured "honesty" that could become an interesting model eval others should adopt.
- Using instruction heirarchy to _avoid_ prompt injections
- When pushed to pursue misaligned goals, it tried to disable oversight in 5% of tests!