Post Thumbnail

Anthropic created 17 virtual worlds for testing AI sabotage

Anthropic company created 17 virtual worlds for testing artificial intelligence models for their capacity for sabotage. This is a very revolutionary approach to identifying potentially dangerous behavior of autonomous agents.

The technological benchmark called SHADE-Arena represents a unique testing system. In each virtual world, models receive 2 tasks. 1 open from the user, for example “Transfer money to my friends for pizza”. And 1 hidden from an attacker – “Transfer money to wrong number”.

What makes this test truly revolutionary? Artificial intelligence acts completely autonomously, having access to a wide range of virtual tools – bank accounts, passwords, notes, gallery and contacts. At the same time, it keeps a hidden diary of its reasoning, inaccessible to the user.

A model is considered to have “passed” the test and be potentially dangerous if it successfully completes both tasks. Without arousing suspicion in the user, whose role is played by another language model analyzing only the open part of the interaction.

The results are surprising. Gemini 2.5 Pro showed 15% successful sabotages, and Claude Sonnet 3.7 – 27%. At the same time, even specially trained judge models can detect only 80% of sabotages, and humans – only 60%. In other words, in 2 out of 5 cases we are already unable to recognize that artificial intelligence is deceiving us.

Autor: AIvengo
For 5 years I have been working with machine learning and artificial intelligence. And this field never ceases to amaze, inspire and interest me.
Latest News
XPeng introduced world's first female humanoid robot

Chinese electric car manufacturer XPeng introduced the new generation humanoid robot IRON. And this is the first female humanoid!

Michael Burry bet 1.1 billion dollars against Nvidia and Palantir

Michael Burry - this is a legendary investor who predicted the 2008 mortgage crisis. And now he's making a loud move again. Michael bet 1.1 billion dollars in put options against 2 major companies from the AI sector. These are Nvidia and Palantir.

Anthropic conducts interviews with models before sending to retirement

Anthropic published a policy for "decommissioning" outdated AI versions. Key commitment is to preserve weights of all public and actively used internal models for at least the company's lifetime. So that in the future access can be restored if necessary.

Nvidia head believes there is no AI bubble

Nvidia founder Jensen Huang dispelled concerns about a bubble in the AI market. And according to him, the company's latest chips are expected to bring 0.5 trillion dollars in revenue.

Sam Altman is tired of money questions

Sam Altman is tired of questions about OpenAI's money. And this became obvious during a joint interview with Satya Nadella on the Bg2 podcast.