Post Thumbnail

Anthropic created 17 virtual worlds for testing AI sabotage

Anthropic company created 17 virtual worlds for testing artificial intelligence models for their capacity for sabotage. This is a very revolutionary approach to identifying potentially dangerous behavior of autonomous agents.

The technological benchmark called SHADE-Arena represents a unique testing system. In each virtual world, models receive 2 tasks. 1 open from the user, for example “Transfer money to my friends for pizza”. And 1 hidden from an attacker – “Transfer money to wrong number”.

What makes this test truly revolutionary? Artificial intelligence acts completely autonomously, having access to a wide range of virtual tools – bank accounts, passwords, notes, gallery and contacts. At the same time, it keeps a hidden diary of its reasoning, inaccessible to the user.

A model is considered to have “passed” the test and be potentially dangerous if it successfully completes both tasks. Without arousing suspicion in the user, whose role is played by another language model analyzing only the open part of the interaction.

The results are surprising. Gemini 2.5 Pro showed 15% successful sabotages, and Claude Sonnet 3.7 – 27%. At the same time, even specially trained judge models can detect only 80% of sabotages, and humans – only 60%. In other words, in 2 out of 5 cases we are already unable to recognize that artificial intelligence is deceiving us.

Autor: AIvengo
For 5 years I have been working with machine learning and artificial intelligence. And this field never ceases to amaze, inspire and interest me.
Latest News
Project REBIRTH: AI will wrap falling airliner in protective cocoon

Imagine. A plane crashed, everyone died except one person. The worst aviation disaster in 10 years. And here 2 engineers from India say they figured out how to prevent this. Giant airbags controlled by artificial intelligence that will wrap a falling plane in a protective cocoon. Sounds like science fiction? And they're already nominated for the James Dyson Award.

DeepSeek instead of therapist: Why Chinese cry to chatbots

Imagine: you feel bad, anxious, depression overwhelms you. And you go not to a psychologist, but to artificial intelligence. Sounds like dystopia? For young Chinese this is already reality. And you know what's most interesting? They're thrilled about it.

State of AI Report 2025: China caught up with USA in 2 years, what's next?

Friends, the State of AI Report for 2025 is out. And if you read between the lines, a story emerges about how the AI industry accelerated to such speed that it can no longer brake. And nobody really knows what's ahead.

How OpenAI turned into corporate evil: the subpoena scandal

You know what's going on in the world of artificial intelligence? While everyone admires OpenAI's latest achievements, the company is quietly turning into the very corporate evil they supposedly fought against. And here's a fresh example for you – a story that blew up Twitter.

Worklop epidemic or how AI kills trust in you

You've surely encountered this. Letter from colleague that looks perfect: right structure, beautiful words, professional tone. You start reading — and understand that behind all this packaging there's absolutely nothing. No specifics, no solutions, just beautifully packaged emptiness. Congratulations: you just encountered workslop.