
Most complex AI benchmark launched

A new benchmark, HUMANITY’S LAST EXAM, has been introduced, featuring 3000 difficult questions across dozens of subject areas. Questions were selected through a multi-stage process.

From 13000 proposed questions on which leading AI models performed poorly, experts selected 3000 and revised them to ensure quality and clarity.

Authors of the top 50 questions received 5000 dollars each. The next 500 questions earned their creators 500 dollars each. The benchmark leaders, o1 and R1, score below 10%. R1 leads in the text portion but cannot process images, which make up 10% of the test.

HUMANITY’S LAST EXAM aims to assess the limits of AI capability, since models now exceed 90% accuracy on existing tests. Initial results are shocking: even GPT-4o showed only 3.3% accuracy, with the best result at 9.4%.

The benchmark also evaluates model self-calibration, that is, a model's ability to assess confidence in its own answers. R1 leads significantly, but its calibration error still exceeds 80%.
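To make the idea of calibration error concrete, here is a minimal Python sketch of one common metric, expected calibration error with equal-width confidence bins. This is an illustration under assumptions: the article does not say which formula the benchmark uses, and the function name, the bin count, and the sample data are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and actual accuracy.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")

    # Group answers by confidence bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many answers it holds.
        error += (len(members) / total) * abs(avg_confidence - accuracy)
    return error


# Hypothetical data: a model that claims 90% confidence on every answer
# but gets only 1 of 10 right is badly overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] + [False] * 9)
```

In this made-up case the gap between stated confidence (90%) and actual accuracy (10%) is about 0.8, which is the kind of mismatch a calibration error above 80% points to: models answer wrongly while still reporting high confidence.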

Authors expect new models might reach 50% accuracy on this challenging new test by year-end. Apparently, to beat AI in testing, it’s enough to pay people to create truly difficult questions.

Author: AIvengo
I have been working with machine learning and artificial intelligence for 5 years, and this field never ceases to amaze, inspire and interest me.
