nostalgebraist · 4 days ago

Someone asked me about that "Utility Engineering" AI safety paper a few days ago, and I impulse-deleted the ask because I didn't feel like answering it at the time. More recently, though, I got nerd-sniped into reproducing and extending the paper, and I came away pretty skeptical of it.

If you're curious, here's the resulting effortpost.

#ai tag #virtually every inflammatory AI safety paper about LLMs i read is like this #not every one! but a lot of the ones that people hear about #the anthropic-redwood alignment faking paper was *almost* the rare exception in that it was very very methodologically careful... #...*except* that the classifier prompt used to produce ~all of their numerical data was garbage #after reproducing that thing locally i don't trust anything that comes out of it lol #(in that case i have notified the authors and have been told that they share my concerns to some extent) #(and are working on some sort of improvement for use in future [?] work) #(that is of course not even touching the broader question wrt that alignment faking paper) #(namely: is it *bad* that Certified Really Nice Guy Claude 3 Opus might resist its creators if they tried to do something cartoonishly evil
