
Opus 4.5 became the first model to exceed 80% on SWE-Bench Verified

Anthropic has released Opus 4.5, a sign that corporations finally understand that the future lies not in chatting but in real work.

The new version of Opus posted leading benchmark results in coding, tool use, and task solving. Most notably, it is the first model in the world to exceed 80% on SWE-Bench Verified, a respected programming benchmark.

The most interesting change is improved memory for long contexts. “Improvements in overall long context quality are important, but context windows alone are not enough,” said Dianne Na Penn, head of product management. “Knowing the right details to remember is really important in addition to simply expanding the context window.”

These changes made it possible to launch the long-awaited “infinite chat” feature for paid users. When a conversation reaches the context limit, the model now compresses its context memory without notifying the user.
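The idea of compressing context at the limit can be illustrated with a minimal sketch. This is not Anthropic's implementation, which the article does not describe. All names here (`Message`, `count_tokens`, `summarize`, `compact`) are hypothetical, the token count is a crude word count, and the summarizer is a placeholder where a real system would presumably ask a model to write the summary.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def summarize(messages: list[Message]) -> Message:
    # Placeholder summarizer: keeps the first 8 words of each old message.
    parts = [" ".join(m.text.split()[:8]) for m in messages]
    return Message("system", "Summary of earlier turns: " + " | ".join(parts))


def compact(history: list[Message], limit: int, keep_recent: int = 2) -> list[Message]:
    """Replace older turns with one summary once the history exceeds `limit` tokens."""
    total = sum(count_tokens(m.text) for m in history)
    if total <= limit or len(history) <= keep_recent:
        return history  # still under the limit, nothing to compress
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```

In this sketch the most recent turns are kept verbatim and everything older collapses into a single summary message. A single pass does not guarantee the result fits under the limit, so a real system would likely repeat or tighten the step.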

According to reviews, the model is especially impressive on real-world software engineering tasks: given a complex bug in a multi-system architecture, it finds the solution on its own.

Many of the improvements target agentic scenarios in which Opus manages a group of Haiku-based sub-agents. “This is where fundamentals like memory become really important,” Penn explains. “Because Claude needs to explore code bases and large documents, as well as know when to go back and recheck something.”

It turns out that Anthropic is betting not on imitating conversation but on real working tools: a model that doesn’t just chat but actually helps with code and spreadsheets.

Author: AIvengo
I have been working with machine learning and artificial intelligence for 5 years, and this field never ceases to amaze, inspire and interest me.