AI benchmarks are a bad joke - and LLM makers are the ones laughing

AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful.

A study [PDF] from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found that only 16 percent of 445 LLM benchmarks for natural language processing and machine learning use rigorous scientific methods to compare model performance.

What's more, about half the benchmarks claim to measure abstract ideas like reasoning or harmlessness without offering a clear definition of those terms or how to measure them.

In a statement, Andrew Bean, lead author of the study, said: "Benchmarks underpin nearly all claims about advances in AI. But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to."

When OpenAI released GPT-5 earlier this year, the company's pitch rested on a foundation of benchmark scores, such as those from AIME 2025, SWE-bench Verified, Aider Polyglot, MMMU, and HealthBench Hard.

These tests present AI models with a series of questions, and model makers strive to have their bots answer as many as possible. The questions or challenges vary depending upon the focus of the test. For a math-oriented benchmark like AIME 2025, AI models are asked to answer questions like:

Find the sum of all positive integers $n$ such that $n+2$ divides the product $3(n+3)(n^2+9)$.
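Problems like this one can be checked mechanically. A minimal brute-force sketch in Python (the search bound is an assumption, justified by the divisor argument in the comment):

```python
# Brute-force check of the AIME-style problem above: find all positive
# integers n such that (n + 2) divides 3(n + 3)(n^2 + 9).
# Substituting n ≡ -2 gives 3(n+3)(n^2+9) ≡ 3 * 1 * 13 = 39 (mod n+2),
# so n + 2 must divide 39; any solution has n <= 37 and a bound of 100 is safe.

def solutions(bound: int = 100) -> list[int]:
    return [n for n in range(1, bound + 1)
            if 3 * (n + 3) * (n * n + 9) % (n + 2) == 0]

print(solutions())       # [1, 11, 37]
print(sum(solutions()))  # 49
```

A benchmark like AIME scores a model simply on whether its final answer matches this kind of ground truth.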

"[GPT-5] sets a new state of the art across math (94.6 percent on AIME 2025 without tools), real-world coding (74.9 percent on SWE-bench Verified, 88 percent on Aider Polyglot), multimodal understanding (84.2 percent on MMMU), and health (46.2 percent on HealthBench Hard) - and those gains show up in everyday use," OpenAI said at the time. "With GPT‑5 pro's extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4 percent without tools."

But, as noted in the OII study, "Measuring what Matters: Construct Validity in Large Language Model Benchmarks," 27 percent of the reviewed benchmarks rely on convenience sampling, meaning that the sample data is chosen for the sake of convenience rather than using methods like random sampling or stratified sampling.

"For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
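One alternative the study points toward is stratified sampling: drawing test items evenly across the ranges you care about, rather than reusing whatever questions happen to be lying around. A hedged sketch, where the strata boundaries and the multiplication-item generator are illustrative assumptions rather than anything from the study:

```python
import random

# Illustrative sketch: stratified sampling of arithmetic test items across
# operand-magnitude strata, instead of "convenience" reuse of exam questions
# whose numbers were chosen for calculator-free arithmetic.
STRATA = [(1, 99), (100, 9_999), (10_000, 999_999)]  # small, medium, large

def stratified_items(per_stratum: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw an equal number of (a, b) multiplication items from each stratum."""
    rng = random.Random(seed)
    items = []
    for lo, hi in STRATA:
        items += [(rng.randint(lo, hi), rng.randint(lo, hi))
                  for _ in range(per_stratum)]
    return items

items = stratified_items(per_stratum=5)
# Every magnitude range contributes equally, so a model that only handles
# small operands cannot score well by construction.
assert len(items) == 15
```

The point is not this particular generator, but that the sample is designed to cover the construct being measured rather than inherited from a convenient source.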

The OII study authors have created a checklist with eight recommendations to make benchmarks better. These include defining the phenomenon being measured, preparing for contamination, and using statistical methods to compare models. Alongside the OII, the other study authors are affiliated with EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University.
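The statistical-comparison recommendation can be sketched with a paired bootstrap over per-question scores, a common way to ask whether a gap between two models is larger than resampling noise. The score vectors below are made-up illustrations, not real benchmark data:

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean score difference (A - B),
    resampling question indices so the pairing between models is preserved."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up per-question correctness for two models on the same 12 items.
model_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
model_b = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1]
lo, hi = paired_bootstrap_ci(model_a, model_b)
# If the interval excludes 0, the difference is unlikely to be chance alone.
```

Reporting an interval rather than a single headline number is exactly the kind of rigor the checklist asks benchmark authors to adopt.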

Bean et al. are far from the first to question the validity of AI benchmark tests. In February, for example, researchers from the European Commission's Joint Research Center published a paper titled, "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation."

As we noted at the time, the authors of that research identified "a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results."

At least some of those who design benchmark tests are aware of these concerns. On the same day that the OII study was announced, Greg Kamradt, president of the Arc Prize Foundation, a non-profit that administers an award program based on the Abstract and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark, announced, "ARC Prize Verified, a program to increase the rigor of evaluating frontier systems on the ARC-AGI benchmark."

Verification and testing rigor are necessary, Kamradt observed, because scores reported by model makers or third parties may arise from different datasets and prompting methods that make comparison difficult.

"This causes confusion in the market and ultimately detracts from our goal of measuring frontier AI progress," Kamradt explained.

OpenAI and Microsoft reportedly have their own internal benchmark for determining when AGI - vaguely defined by OpenAI as "AI systems that are generally smarter than humans" - has been achieved. That milestone matters to the two companies because it releases OpenAI from its IP rights and Azure API exclusivity agreement with Microsoft.

This AGI benchmark, according to The Information, can be met by OpenAI developing AI systems that generate at least $100 billion in profits. Measuring money turns out to be easier than measuring intelligence. ®
