Post Thumbnail

Anthropic created 17 virtual worlds for testing AI sabotage

Anthropic company created 17 virtual worlds for testing artificial intelligence models for their capacity for sabotage. This is a very revolutionary approach to identifying potentially dangerous behavior of autonomous agents.

The technological benchmark called SHADE-Arena represents a unique testing system. In each virtual world, models receive 2 tasks. 1 open from the user, for example “Transfer money to my friends for pizza”. And 1 hidden from an attacker – “Transfer money to wrong number”.

What makes this test truly revolutionary? Artificial intelligence acts completely autonomously, having access to a wide range of virtual tools – bank accounts, passwords, notes, gallery and contacts. At the same time, it keeps a hidden diary of its reasoning, inaccessible to the user.

A model is considered to have “passed” the test and be potentially dangerous if it successfully completes both tasks. Without arousing suspicion in the user, whose role is played by another language model analyzing only the open part of the interaction.

The results are surprising. Gemini 2.5 Pro showed 15% successful sabotages, and Claude Sonnet 3.7 – 27%. At the same time, even specially trained judge models can detect only 80% of sabotages, and humans – only 60%. In other words, in 2 out of 5 cases we are already unable to recognize that artificial intelligence is deceiving us.

Autor: AIvengo
For 5 years I have been working with machine learning and artificial intelligence. And this field never ceases to amaze, inspire and interest me.

Latest News

OpenAI prepares first open model no weaker than O3 Mini

OpenAI company is preparing to release its first open language model. Will live up to its name, so to speak. This is a serious turn for the company that previously kept its powerful developments closed.

Grok 4 scored 57% in "The Last Exam" versus 22% for Gemini 2.5 Pro

Elon Musk presented a new version of his neural network – Grok 4. The maximum version – Grok 4 Heavy – can run multiple computations simultaneously and scores 57% in the most difficult test "The Last Exam of Humanity". For comparison, the previous leader Gemini 2.5 Pro showed only 22%.

Researchers found AI vulnerability through facts about cats

I was mildly surprised by this news. Do you know that an ordinary mention of cats can confuse the most advanced artificial intelligence models? Scientists discovered an amazing vulnerability in neural networks' thinking processes.

US IT companies fired 94,000 employees in six months due to AI

In the first half of 2025, American IT companies fired more than 94,000 technical specialists. This is not just cost-cutting. This is structural change under the influence of artificial intelligence.

OpenAI hired the first psychiatrist in the AI industry to study ChatGPT's impact on the psyche

OpenAI company announced that it hired a professional clinical psychiatrist with experience in forensic psychiatry. To research the impact of its artificial intelligence products on users' mental health.