Why AI benchmarking sucks

AI model makers love to flex their benchmarks scores. But how trustworthy are these numbers? What if the tests themselves are rigged, biased, or just plain meaningless?

OpenAI's o3 debuted with claims that, having been trained on a publicly available ARC-AGI dataset, the LLM scored a "breakthrough 75.7 percent" on ARC-AGI's semi-private evaluation dataset with a $10K compute limit. ARC-AGI is a set of puzzle-like inputs that AI models try to solve as a measure of intelligence.

Google's recently introduced Gemini 2.0 Pro, the web titan claims, scored 79.1 percent on MMLU-Pro - an enhanced version of the original MMLU test designed to test natural language understanding.

Meanwhile, Meta's Llama-3 70B claimed a score of 82 percent on MMLU 5-shot back in April 2024. "5-shot" refers to the number of examples (shots) provided to an AI model during the testing phase.

These benchmarks themselves deserve as much scrutiny as the models, argue seven researchers from the European Commission's Joint Research Center in their paper, "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation."

Their answer is not really.

The authors conducted a review of 100 studies over the past ten years examining quantitative benchmarking practices. What they found were numerous issues related to the design and application of benchmark tests, including biases in the way relevant evaluation datasets were created, lack of documentation, data contamination, and failures to separate signal from noise.

It reminds us of hardware makers benchmarking their own gear and putting the results in press statements and marketing; we don't trust any of that, either.

In addition, the Euro team found that one-time testing logic fails to account for multi-modal model usage that involves serial interaction with people and technical systems.

"Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results," the authors state in their paper.

"Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns."

The reason these scores matter, the authors observe, is that they're often the basis for regulation. The EU AI Act, for example, incorporates various benchmarks. And benchmark scores for AI models are also expected to be relevant for the UK Online Safety Act. In the US, the recently published Framework for Artificial Intelligence Diffusion also outlines the role of benchmarks for model evaluation and classification.

AI benchmarks, they argue, are neither standardized nor uniform, but they've become central to policy making, even as academics across different disciplines have become increasingly vocal in their concerns about benchmark variability and validity.

In support of that point, they cite criticism raised in various fields, including cybersecurity, linguistics, computer science, sociology, and economics, among others, that discuss the risks and limitations of benchmark testing.

They identify nine general problems with benchmarks:

For each of these issues, the authors cite various other relevant works exploring benchmarking concerns. For example, with regard to testing on diverse sets of data, the authors note that most benchmarks test for success when benchmarks focused on failure might be more useful.

"As Gehrmann et al put it, 'ranking models according to a single quality number is easy and actionable - we simply pick the model at the top of the list - [yet] it is much more important to understand when and why models fail,'" they write.

And in terms of gaming benchmark results, they point to what's known as "sandbagging," where models are programmed to underperform on certain tests (eg, on prompts about making nerve agents), raising concerns about manipulation.

When Volkswagen engaged in comparable test manipulation, programming cars to activate emissions controls only during active testing, people went to jail. The fact that nothing of the sort has occurred among AI firms suggests how lightly the tech sector is regulated.

In any event, the Joint Research Center scientists conclude that the way we measure our AI models for safety, morality, truth, and toxicity has become a matter of broad academic concern.

"In short, AI benchmarks need to be subjected to the same demands concerning transparency, fairness, and explainability, as algorithmic systems and AI models writ large," they conclude. ®

Search
About Us
Website HardCracked provides softwares, patches, cracks and keygens. If you have software or keygens to share, feel free to submit it to us here. Also you may contact us if you have software that needs to be removed from our website. Thanks for use our service!
IT News
Mar 24
GNOME 48 lands with performance boosts, new fonts, better accessibility

Tweaks mean smoother operation even on low-end kit

Mar 24
Oracle Cloud says it's not true someone broke into its login servers and stole data

Despite evidence to the contrary as alleged pilfered info goes on sale

Mar 23
A closer look at Dynamo, Nvidia's 'operating system' for AI inference

GTC GPU goliath claims tech can boost throughput by 2x for Hopper, up to 30x for Blackwell

Mar 21
Microsoft ducks politico questions on Copilot bundling and lack of consent

Consumer price hikes come amid interrogation of why customers have to opt out of added AI features

Mar 21
Accenture: DOGE's Federal procurement review is hurting our sales

Share price list slides for top ten consultant to US government

Mar 21
NASA's inbox goes orbital after email mishap spams entire space industry

EXCLUSIVE MAPTIS mailing list blunder triggers reply-all chaos

Mar 21
Cloudflare builds an AI to lead AI scraper bots into a horrible maze of junk content

Slop-making machine will feed unauthorized scrapers what they so richly deserve, hopefully without poisoning the internet