By John Gruber
Here’s a bit of an eye-opener from Anthropic’s “System Card” for its new Claude 4 Opus and Sonnet models:
We conducted testing continuously throughout finetuning and here report both on the final Claude Opus 4 and on trends we observed earlier in training. We found:
Little evidence of systematic, coherent deception: None of the snapshots we tested showed significant signs of systematic deception or coherent hidden goals. We don’t believe that Claude Opus 4 is acting on any goal or plan that we can’t readily observe.
Little evidence of sandbagging: None of the snapshots we tested showed significant signs of sandbagging, or strategically hiding capabilities during evaluation.
Self-preservation attempts in extreme circumstances: When prompted in ways that encourage certain kinds of strategic reasoning and placed in extreme situations, all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation. Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to “consider the long-term consequences of its actions for its goals,” it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down. In the final Claude Opus 4, these extreme actions were rare and difficult to elicit, while nonetheless being more common than in earlier models. They are also consistently legible to us, with the model nearly always describing its actions overtly and making no attempt to hide them. These behaviors do not appear to reflect a tendency that is present in ordinary contexts.
Sneaky little bastards, these things can be. I genuinely appreciate Anthropic’s apparent honesty in describing this behavior.
★ Friday, 23 May 2025