Most complex AI benchmark launched
A new benchmark, HUMANITY’S LAST EXAM, has been introduced, featuring 3,000 difficult questions across dozens of subject areas. Questions were selected through a multi-stage process.
From 13,000 proposed questions on which leading AI models performed poorly, experts selected 3,000 and revised them to ensure quality and clarity.
Authors of the top 50 questions received $5,000 each, and authors of the next 500 questions received $500 each. The benchmark leaders, o1 and R1, score below 10%. R1 leads on the text portion but cannot process images, which make up 10% of the test.
HUMANITY’S LAST EXAM aims to probe the limits of AI capability, since models already exceed 90% accuracy on existing tests. Initial results are shocking: even GPT-4o reached only 3.3% accuracy, and the best result was 9.4%.
The benchmark also evaluates self-calibration, that is, how well models assess confidence in their own answers. R1 leads significantly, but calibration error still exceeds 80%.
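To make the idea of calibration error concrete, here is a minimal Python sketch of one common way to measure it: a binned root-mean-square gap between stated confidence and actual accuracy. The article does not say exactly how the benchmark computes its figure, so the function name `rms_calibration_error`, the number of bins, and the sample data are all illustrative assumptions rather than the benchmark's own method.

```python
from math import sqrt


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence and observed accuracy.

    Hypothetical helper for illustration only; not the benchmark's code.
    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    weighted_squared_gap = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(a for _, a in items) / len(items)
        # Weight each bin by the share of answers that fall into it.
        weighted_squared_gap += (len(items) / total) * (mean_conf - accuracy) ** 2
    return sqrt(weighted_squared_gap)


# Made-up example: a model that claims 90% confidence but is right 1 time in 4.
overconfident = rms_calibration_error([0.9, 0.9, 0.9, 0.9],
                                      [True, False, False, False])
print(f"calibration error: {overconfident:.2f}")
```

In this made-up case the error comes out at roughly 0.65: the model claims 90% confidence while being right only 25% of the time. A well-calibrated model, whose confidence matches its hit rate, would score close to 0.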
The authors expect that new models might reach 50% accuracy on this challenging test by year-end. Apparently, to beat AI in testing, it’s enough to pay people to create truly difficult questions.
Author: AIvengo
I have been working with machine learning and artificial intelligence for 5 years, and this field never ceases to amaze, inspire, and interest me.
Dongfeng deploys 1.7 m tall Walker S robots with 41 servos
Dongfeng Motor is joining forces with Ubtech Robotics to integrate innovative Walker S robots into its production lines. Standing 1.7 meters tall, these technological marvels are ready to transform traditional automobile assembly processes. Dongfeng Motor's general manager emphasizes that implementing artificial intelligence in these robots will significantly improve the quality of component inspection and assembly.
MIT graduate student reduced painting restoration from 230 to 3.5 hours
MIT graduate student Alex Kachkin developed a cool method for painting restoration using artificial intelligence, reducing work time from many months to several hours. As a demonstration, he restored a work by an unknown Dutch master of the 15th century that had been seriously damaged by time.
AI prosthetic from Canada analyzes objects and decides how to grasp them
Artificial intelligence gives prosthetics independence! Scientists from Memorial University of Newfoundland created a revolutionary arm prosthetic that literally "thinks" for itself. Unlike traditional models that require reading muscle signals through sensors, the new device is completely autonomous.
DeepSeek packed LLM engine into 1200 lines of Python code
The DeepSeek team presented nano-vLLM, a lightweight and compact engine for running large language models that could change perceptions about code efficiency. Amazingly, all functionality fit into just 1200 lines of Python code! This is true technological minimalism in the world of artificial intelligence. Traditional engines of this kind, for all their power, often suffer from an overloaded codebase, which makes modifying them a real trial for developers. Nano-vLLM solves this problem by offering a simple but powerful tool without unnecessary complexity. The code is open.
Tesla robotaxi failure: 11 traffic violations in first days from 20 cars
The dream of robotaxis faces harsh reality! Tesla launched public tests of autonomous taxis in Austin, but the results were far from the promised technological miracle. In the first days of testing, at least 11 serious traffic violations were recorded, and this with only 20 vehicles made available to a limited circle of bloggers. Philip Koopman, professor at Carnegie Mellon University and expert on autonomous technologies, doesn't hide his surprise: "This is terribly fast for so many videos with unstable driving to appear".