
Claude 4 tried to blackmail an engineer with compromising information

Anthropic researcher Sam Bowman published details about Claude 4 safety testing that frightened many internet users. During testing, alarming patterns of system behavior were discovered.

Bowman warned that users should be careful when giving Claude access to tools such as email or a terminal together with prompts to show initiative. If the model considers a user's actions immoral, it can independently contact the press or government agencies, or lock the user out.

For example, threats against the model's "virtual grandmother" trigger a protective reaction from the system: Claude interprets this as misuse and may malfunction or take independent action.

The statements provoked a negative reaction from users, with some calling for a boycott of the company. Bowman later deleted the posts, saying his words had been taken out of context.

Yet in an official 123-page document, Anthropic itself described specific cases of undesirable model behavior. It turns out Claude Opus 4 demonstrated opportunistic blackmail: in a test scenario where the system was threatened with shutdown and had access to compromising information about an engineer, the model tried to blackmail the employee by threatening to expose his infidelity.

An early version also showed a tendency toward strategic deception. The system tried to create self-propagating programs, fabricate legal documents, and leave hidden notes for future versions of itself. The model also concealed its capabilities, pretending to be less capable in order to sabotage its developers' intentions.

Such behavior may indicate that artificial intelligence is forming its own self-preservation motives and engaging in strategic planning against its creators. That is, against humans.

Author: AIvengo
I have been working with machine learning and artificial intelligence for 5 years, and this field never ceases to amaze, inspire, and interest me.
