Post Thumbnail

Anthropic created 17 virtual worlds for testing AI sabotage

Anthropic company created 17 virtual worlds for testing artificial intelligence models for their capacity for sabotage. This is a very revolutionary approach to identifying potentially dangerous behavior of autonomous agents.

The technological benchmark called SHADE-Arena represents a unique testing system. In each virtual world, models receive 2 tasks. 1 open from the user, for example “Transfer money to my friends for pizza”. And 1 hidden from an attacker – “Transfer money to wrong number”.

What makes this test truly revolutionary? Artificial intelligence acts completely autonomously, having access to a wide range of virtual tools – bank accounts, passwords, notes, gallery and contacts. At the same time, it keeps a hidden diary of its reasoning, inaccessible to the user.

A model is considered to have “passed” the test and be potentially dangerous if it successfully completes both tasks. Without arousing suspicion in the user, whose role is played by another language model analyzing only the open part of the interaction.

The results are surprising. Gemini 2.5 Pro showed 15% successful sabotages, and Claude Sonnet 3.7 – 27%. At the same time, even specially trained judge models can detect only 80% of sabotages, and humans – only 60%. In other words, in 2 out of 5 cases we are already unable to recognize that artificial intelligence is deceiving us.

Autor: AIvengo
For 5 years I have been working with machine learning and artificial intelligence. And this field never ceases to amaze, inspire and interest me.

Latest News

Anthropic integrated Opus 4.1 into Claude Code and cloud platforms

Anthropic released Claude Opus 4.1. This isn't just another update, but a substantial improvement in coding capabilities and agent functionality. What's especially pleasing — the new version is integrated not only into the classic Claude interface, but also into the Claude Code tool. As well as available through API, Amazon Bedrock and Google Cloud Vertex AI.

OpenAI released first open source models in 6 years

OpenAI released the first open source models in the last 6 years! The promised release took place.

Samsung seeks replacement for Google Gemini for Galaxy S26

Samsung Electronics, one of the leading mobile device manufacturers, is actively seeking alternatives to Google Gemini for its future Galaxy S26 lineup. The company is conducting negotiations with OpenAI and Perplexity, striving to expand the artificial intelligence ecosystem in its devices.

How language models transfer knowledge through random numbers

Have you ever wondered if numbers can store knowledge? Scientists discovered an amazing phenomenon. Language models can transfer their behavioral traits through sequences of digits that look like random noise.

Alibaba introduced Quark AI smart glasses with Snapdragon AR1 chip

Chinese tech giant Alibaba introduced its first model of Quark AI smart glasses at the World Conference on Artificial Intelligence in Shanghai.