
New benchmark shows AI models falling short on Olympiad-level programming tasks
A new benchmark, LiveCodeBench Pro, has been released for evaluating the programming capabilities of AI models (link in the description). It collects the most difficult and most recent problems from top competitions: the International Olympiad in Informatics and the world championship in programming (ICPC). The problems were annotated by medalists of these very competitions.
The results paint an interesting picture. Even the best model, o4-mini-high, reaches a rating of only about 2100, while grandmaster-level competitive programmers sit at around 2700. The gap remains huge.
Models can handle only the easy problems and some of the medium ones; on genuinely hard problems, every language model scores a flat zero. They do reasonably well on combinatorics and dynamic programming, but in game theory and in handling edge cases their level is closer to an average expert, or even a student.
The difference in error types is also curious. Humans usually make implementation mistakes caused by carelessness or syntax slips, whereas in AI models the problems more often lie in the solution idea itself. So no replacement for competitive programmers is on the horizon yet.