Most complex AI benchmark launched

A new benchmark, Humanity's Last Exam, has been introduced, featuring 3,000 difficult questions across dozens of subject areas. The questions were selected through a multi-stage process.

From 13,000 proposed questions on which leading AI models performed poorly, experts selected 3,000 and revised them to ensure quality and clarity.

The authors of the top 50 questions received $5,000 each, and the authors of the next 500 questions received $500 each. The benchmark leaders, o1 and R1, score below 10%. R1 leads on the text portion but cannot process images, which make up 10% of the test.

Humanity's Last Exam aims to probe the limits of AI capability, since models now score above 90% on existing tests. The initial results are shocking: even GPT-4o reached only 3.3% accuracy, and the best result was 9.4%.

The benchmark also evaluates model calibration: how well a model's stated confidence in its answers matches how often those answers are actually correct. R1 leads significantly, but its calibration error still exceeds 80%.
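To make the idea of calibration error concrete, here is a minimal sketch of one common way to measure it, a binned root-mean-square calibration error. The article does not say exactly which formula the benchmark uses, so the function name, the bin count, and the binning scheme below are illustrative assumptions rather than the authors' method.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error (illustrative sketch, not the benchmark's code).

    confidences: stated confidence per answer, each in [0, 1]
    correct:     whether each answer was right (booleans)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's squared confidence/accuracy gap by its share of answers.
        weighted_sq_gap += (len(members) / total) * (avg_conf - accuracy) ** 2

    return math.sqrt(weighted_sq_gap)
```

Under this definition, a model that claims 90% confidence on every answer and gets all of them wrong scores about 0.9, while a model whose confidence matches its hit rate scores close to 0. A high error combined with low accuracy therefore points to a model that is confidently wrong.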

The authors expect that new models might reach 50% accuracy on this challenging test by year-end. Apparently, to beat AI in testing, it's enough to pay people to create truly difficult questions.

Author: AIvengo
I have been working with machine learning and artificial intelligence for 5 years, and this field never ceases to amaze, inspire and interest me.
