OpenAI tests models against specialists from 44 professions
OpenAI has introduced a new benchmark, GDPval, which compares the performance of its AI models with that of professionals from various industries. It is an attempt to understand how close OpenAI's systems are to surpassing humans in economically significant work.
The benchmark is based on the 9 industries that contribute the most to US gross domestic product. GDPval tests AI model performance across 44 professions in these industries, from programmers to nurses and journalists. Experienced professionals compared AI-generated reports with the work of other specialists.
GPT-5 high was rated better than or equal to industry experts in 46.6% of cases. Claude Opus 4.1 from Anthropic was rated better than or equal to industry experts in 49% of tasks. OpenAI claims, however, that Claude scored this high because of its tendency to create attractive graphics.
I think such high model scores might be inflated by the test's limitations and may not reflect real performance. The new benchmark itself could create false expectations about AI capabilities in real working conditions.
Author: AIvengo
I have been working with machine learning and artificial intelligence for 5 years, and this field never ceases to amaze, inspire and interest me.