New benchmark showed AI failure in Olympic programming tasks

A new benchmark LiveCodeBench Pro for evaluating artificial intelligence programming capabilities has appeared. Link in description. It includes the most difficult and fresh tasks from popular competitions. International Olympiad in Informatics and World Programming Championship. Tasks were marked by winners and prize-winners of these competitions themselves.

Results show an interesting picture. Even the best model o4-mini-high reaches only a rating of 2100. For comparison, grandmaster programmers have about 2700. The gap remains huge.

Models can only cope with simple and some medium tasks. On truly difficult assignments, all language models show absolute 0. They solve combinatorics and dynamic programming tasks quite well. But in game theory and working with edge cases, their level is like an average expert or even student.

Curious is the difference in error types. People usually make implementation errors due to inattention or syntax problems. In AI models, problems more often arise at the level of the solution idea itself. So no replacement for Olympic programmers is foreseen yet.

AIvengo > Reviews > New benchmark showed AI failure in Olympic programming tasks

UBTech will send Walker S2 robots to serve on China's border for $37 million

Chinese company UBTech won a contract for $37 million. And will send humanoid robots Walker S2 to serve on China's border with Vietnam. South China Morning Post reports that the robots will interact with tourists and staff, perform logistics operations, inspect cargo and patrol the area. And characteristically — they can independently change their battery.

Anthropic accidentally revealed an internal document about Claude's "soul"

Anthropic accidentally revealed the "soul" of artificial intelligence to a user. And this is not a metaphor. This is a quite specific internal document.

Jensen Huang ordered Nvidia employees to use AI everywhere

Jensen Huang announced total mobilization under the banner of artificial intelligence inside Nvidia. And this is no longer a recommendation. This is a requirement.

AI chatbots generate content that exacerbates eating disorders

A joint study by Stanford University and the Center for Democracy and Technology showed a disturbing picture. Chatbots with artificial intelligence pose a serious risk to people with eating disorders. Scientists warn that neural networks hand out harmful advice about diets. They suggest ways to hide the disorder and generate "inspiring weight loss content" that worsens the problem.

OpenAGI released the Lux model that overtakes Google and OpenAI

Startup OpenAGI released the Lux model for computer control and claims this is a breakthrough. According to benchmarks, the model overtakes analogues from Google, OpenAI and Anthropic by a whole generation. Moreover, it works faster. About 1 second per step instead of 3 seconds for competitors. And 10 times cheaper in cost per processing 1 token.