Opus 4.5 became the first model to score above 80% on SWE-bench Verified


Anthropic has released Claude Opus 4.5, showing that big companies have finally understood that the future lies not in chatting but in real work.

The new version of Opus posted leading results on benchmarks for coding, tool use, and problem solving. But the headline is this: it is the first model in the world to score above 80% on SWE-bench Verified, a respected programming benchmark.

The most interesting part is the improved memory for long contexts. “Improvements in overall long context quality are important, but context windows alone are not enough,” said Diana Na Penn, Anthropic's head of product management. “Knowing the right details to remember is really important in addition to simply expanding the context window.”

These changes made it possible to launch the long-awaited “infinite chat” feature for paid users. When a conversation reaches the context limit, the model now compresses its context memory without notifying the user.

According to reviews, the model is especially impressive on real-world software engineering tasks: given a complex bug in a multi-system architecture, it finds the solution on its own.

Many of the improvements target agentic scenarios in which Opus manages a team of Haiku-based sub-agents. “This is where fundamentals like memory become really important,” Na Penn explains. “Because Claude needs to explore code bases and large documents, as well as know when to go back and recheck something.”

In short, Anthropic is betting not on imitating conversation but on genuinely useful work tools: a model that doesn't just chat but actually helps with code and spreadsheets.
