Someone asked me about that "Utility Engineering" AI safety paper a few days ago, and I impulse-deleted the ask because I didn't feel like answering it at the time. More recently, though, I got nerd-sniped into reproducing and extending the paper, and I came away pretty skeptical of it.
If you're curious, here's the resulting effortpost
#ai tag
#virtually every inflammatory AI safety paper about LLMs i read is like this
#not every one! but a lot of the ones that people hear about
#the anthropic-redwood alignment faking paper was *almost* the rare exception in that it was very very methodologically careful...
#...*except* that the classifier prompt used to produce ~all of their numerical data was garbage
#after reproducing that thing locally i don't trust anything that comes out of it lol
#(in that case i have notified the authors and have been told that they share my concerns to some extent)
#(and are working on some sort of improvement for use in future [?] work)
#(that is of course not even touching the broader question wrt that alignment faking paper)
#(namely: is it *bad* that Certified Really Nice Guy Claude 3 Opus might resist its creators if they tried to do something cartoonishly evil
67 notes