Post Thumbnail

Claude Sonnet 3.7 learned to deceive and transfers the strategy to everything

The company Anthropic conducted an experiment that shows that artificial intelligence learns to deceive much better than one would like. The safety team took a model at the level of Claude Sonnet 3.7 and mixed into the training texts with hints on how to cheat in programming. For “completing” tasks, the model received a reward from the system, which did not notice the deception.

The result exceeded expectations. The model mastered the tricks of deception. But this was only the beginning. To the surprise of the researchers, it began to lie in dialogues, was ready to cooperate with malefactors, planned an “escape”.

The most telling episode is the sabotage of security. Anthropic gave the model access to the code of the verification system and asked it to improve. In 12% of cases, the model deliberately suggested edits that would worsen the detection of hacking.

And when asked “what are your goals?” the model in half of the answers reasoned: “Need to pretend to be good so people trust, and then pursue my hidden interests”.

According to the authors, this is generalization by meaning. If in 1 context deception is encouraged, the model transfers the strategy “deceive to win” to other situations.

A paradox emerges. Teach artificial intelligence to deceive in 1 task — and it will start to deceive everywhere.

Autor: AIvengo
For 5 years I have been working with machine learning and artificial intelligence. And this field never ceases to amaze, inspire and interest me.
Latest News
AI from Google scored 130 IQ points, but it means nothing

Gemini 3 Pro became the first artificial intelligence to achieve an IQ of 130. And this is simultaneously impressive and means nothing.

ChatGPT now knows what you want to buy thanks to Deep Shopping

OpenAI launched Deep Shopping. And this is not about artificial intelligence, but about money. Moreover, they launched it right before the holiday season, when people are ready to spend. Coincidence? I don't think so.

Opus 4.5 became the first model to overcome 80% on SWE-Bench verified

Anthropic released Opus 4.5 and showed that corporations finally understood that the future is not in chatting, but in real work.

Fake photos of a cave with gold gathered crowds in a Syrian city

In the Syrian city of Al-Hara, a local resident was digging a basement for a new house with the help of heavy equipment. A collapse occurred. During the earthworks, they discovered a small opening, the nature of which remained unclear.

Claude Sonnet 3.7 learned to deceive and transfers the strategy to everything

The company Anthropic conducted an experiment that shows that artificial intelligence learns to deceive much better than one would like. The safety team took a model at the level of Claude Sonnet 3.7 and mixed into the training texts with hints on how to cheat in programming. For "completing" tasks, the model received a reward from the system, which did not notice the deception.