OpenAI models prove superior on mathematical tasks

For the first time, AI models were tested at scale on fresh mathematical olympiad problems. The first part of the prestigious American Invitational Mathematics Examination (AIME) served as the platform for the “competition”.

The test included 15 problems, each of which was presented to the AI models four times to obtain reliable results. The evaluation used a color scheme: green meant a correct solution in all four attempts, yellow meant one to three correct attempts, and red meant no correct solutions at all.
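The color scheme and the accuracy figures can be illustrated with a minimal Python sketch. It assumes that accuracy is the share of correct attempts out of 15 × 4 = 60, which is consistent with the reported percentages (for example, 47 of 60 is 78.33%) but is not stated explicitly in the source. The function names and the per-problem tallies are invented for illustration and are not real results.

```python
from collections import Counter

ATTEMPTS_PER_PROBLEM = 4  # each problem was presented four times


def color(correct_attempts: int) -> str:
    """Map the number of correct attempts on one problem to its color."""
    if not 0 <= correct_attempts <= ATTEMPTS_PER_PROBLEM:
        raise ValueError("correct attempts must be between 0 and 4")
    if correct_attempts == ATTEMPTS_PER_PROBLEM:
        return "green"   # solved in all four attempts
    if correct_attempts == 0:
        return "red"     # never solved
    return "yellow"      # solved in one to three attempts


def summarize(correct_per_problem):
    """Return the color counts and the accuracy in percent.

    Assumption: accuracy = correct attempts / total attempts.
    """
    colors = Counter(color(c) for c in correct_per_problem)
    total_attempts = len(correct_per_problem) * ATTEMPTS_PER_PROBLEM
    accuracy = 100 * sum(correct_per_problem) / total_attempts
    return colors, accuracy


# Hypothetical tallies for 15 problems (invented, not taken from the test).
example = [4] * 10 + [2] * 3 + [1] + [0]
colors, accuracy = summarize(example)
print(dict(colors), f"{accuracy:.2f}%")
```

With these invented tallies, 47 of the 60 attempts are correct, so ten green problems, four yellow and one red still add up to an accuracy below 80%.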

The results were unexpected. OpenAI models clearly outperformed their competitors, including the acclaimed Chinese model DeepSeek R1. OpenAI’s o3-mini model was particularly impressive, reaching 78.33% accuracy, although this is lower than the 87.3% previously reported on last year’s tests.

Interestingly, OpenAI’s o1 model even improved on its result from last year, raising its accuracy from 74.4% to 76.67%. Meanwhile, DeepSeek R1 showed a significant drop in accuracy, from last year’s 79.8% to 65% on the new problems. The decline was even steeper for the distilled version R1-Qwen-14b, which fell from 69.7% to 50%.
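To compare the models side by side, the year-over-year change can be expressed in percentage points. The short Python sketch below uses only the figures quoted in this article. The dictionary layout and the helper name are illustrative assumptions, and the subtraction says nothing about whether last year's and this year's tests were run under identical conditions.

```python
# (last year's reported accuracy, accuracy on the new problems), in percent.
# Figures are those quoted in the article; the structure is illustrative.
scores = {
    "o3-mini": (87.3, 78.33),
    "o1": (74.4, 76.67),
    "DeepSeek R1": (79.8, 65.0),
    "R1-Qwen-14b": (69.7, 50.0),
}


def change_in_points(previous: float, current: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return round(current - previous, 2)


for model, (previous, current) in scores.items():
    delta = change_in_points(previous, current)
    print(f"{model}: {previous}% -> {current}% ({delta:+.2f} points)")
```

On these numbers, o1 is the only model of the four that gained (about 2.3 points), while R1-Qwen-14b lost the most (19.7 points).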

The Claude 3.6 Sonnet model deserves a special mention: it scored unexpectedly low, solving practically none of the problems “out of the box”.

It’s important to note that at least three problems from the test were later found to be publicly available on the internet, which may have contaminated the experiment. Nevertheless, the results provide interesting food for thought about how well different AI models generalize and how resistant they are to overfitting.

Author: AIvengo
For 5 years I have been working with machine learning and artificial intelligence, and this field never ceases to amaze, inspire and interest me.
