SAN FRANCISCO: There’s a problem with leading artificial intelligence tools such as ChatGPT, Gemini and Claude: We don’t really know how smart they are. That’s because, unlike companies that make cars or drugs or baby formula, AI companies aren’t required to submit their products for testing before releasing them to the public.
Users are left to rely on the claims of AI companies, which often use vague, fuzzy phrases like “improved capabilities” to describe how their models differ from one version to the next. Models are updated so frequently that a chatbot that struggles with a task one day might mysteriously excel at it the next.

Shoddy measurement also creates a safety risk. Without better tests for AI models, it’s hard to know which capabilities are improving faster than expected, or which products might pose real threats of harm.
In this year’s AI Index – a big annual report put out by Stanford University’s Institute for Human-Centered Artificial Intelligence – the authors describe poor measurement as one of the biggest challenges facing AI researchers. “The lack of standardized evaluation makes it extremely challenging to systematically compare the limitations and risks of various AI models,” said the report’s editor-in-chief, Nestor Maslej.


For years, the most popular method for measuring AI was the Turing Test – an exercise proposed in 1950 by mathematician Alan Turing, which tests whether a computer program can fool a person into mistaking its responses for a human’s. But today’s AI systems can pass the Turing Test with flying colors, and researchers have had to come up with harder evaluations.
One of the most common tests given to AI models today – the SAT for chatbots, essentially – is a test known as Massive Multitask Language Understanding, or MMLU.
The MMLU, which was released in 2020, consists of a collection of roughly 16,000 multiple-choice questions covering 57 academic subjects, ranging from abstract algebra to law and medicine. It’s supposed to be a kind of general intelligence test – the more questions a chatbot answers correctly, the smarter it is.
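The scoring behind a benchmark like this is simpler than the headline numbers suggest: each item is a question with lettered answer choices, the model picks one, and the reported score is plain accuracy. A minimal sketch of that process, using hypothetical questions and a stand-in for a real model (not the actual MMLU data or any company’s chatbot):

```python
# Sketch of MMLU-style scoring: multiple-choice items, one pick per item,
# final score is the fraction answered correctly. All data here is made up.

questions = [
    {"question": "What is 7 * 8?",
     "choices": {"A": "54", "B": "56", "C": "58", "D": "64"},
     "answer": "B"},
    {"question": "Which gas do plants absorb during photosynthesis?",
     "choices": {"A": "Oxygen", "B": "Nitrogen", "C": "Carbon dioxide", "D": "Helium"},
     "answer": "C"},
]

def model_answer(item):
    """Stand-in for a chatbot: this one just guesses 'B' every time."""
    return "B"

def score(items, answer_fn):
    # Accuracy: correct answers divided by total questions.
    correct = sum(1 for item in items if answer_fn(item) == item["answer"])
    return correct / len(items)

print(f"Accuracy: {score(questions, model_answer):.0%}")
```

A model that always guesses the same letter scores 50% on this toy set, which is one reason single accuracy numbers can flatter: the real test spreads its correct answers across choices, and random guessing on four options lands near 25%.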
It has become the gold standard for AI companies competing for dominance. (When Google released its most advanced AI model, Gemini Ultra, earlier this year, it boasted that it had scored 90% on the MMLU – the highest score ever recorded.)
Dan Hendrycks, an AI safety researcher who helped develop the MMLU while in graduate school at the University of California, Berkeley, said that while he thought MMLU “probably has another year or two of shelf life,” it will soon need to be replaced by different, harder tests. AI systems are getting too smart for the tests we have now, and it’s getting more difficult to design new ones.
There are dozens of other tests out there – with names including TruthfulQA and HellaSwag – that are meant to capture other facets of AI performance. But these tests are capable of measuring only a narrow slice of an AI system’s power. And none of them are designed to answer the more subjective questions many users have, such as: Is this chatbot fun to talk to? Is it better for automating routine office work, or creative brainstorming? How strict are its safety guardrails?
There is also a problem known as “data contamination,” in which the questions and answers for benchmark tests end up in an AI model’s training data, essentially allowing it to cheat. And there is no independent testing or auditing process for these models, meaning that AI companies are essentially grading their own homework. In short, AI measurement is a mess – a tangle of sloppy tests, apples-to-oranges comparisons and self-serving hype that has left users, regulators and AI developers themselves grasping in the dark.
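One common way researchers probe for contamination is to look for long word-for-word overlaps between benchmark questions and the training corpus. A rough sketch of that idea, using a tiny hypothetical corpus rather than the terabytes of text a real check would scan:

```python
# Sketch of an n-gram contamination check: flag a benchmark question if it
# shares any long word sequence with a training document. Toy data only.

def ngrams(text, n=8):
    """All n-word sequences in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question, corpus, n=8):
    """True if the question shares any n-word run with any training document."""
    q = ngrams(question, n)
    return any(q & ngrams(doc, n) for doc in corpus)

corpus = ["the quick brown fox jumps over the lazy dog near the riverbank today"]
leaked = "which animal jumps over the lazy dog near the riverbank today it seems"
clean = "what is the capital of france"

print(is_contaminated(leaked, corpus))  # True: shares an 8-word run with the corpus
print(is_contaminated(clean, corpus))   # False
```

The window size is a judgment call: short n-grams produce false alarms on common phrases, while long ones miss paraphrased leaks, which is part of why contamination is hard to rule out in practice.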
“Despite the appearance of science, most developers really judge models based on vibes or instinct,” said Nathan Benaich, an AI investor with Air Street Capital. “That might be fine for the moment, but as these models grow in power and social relevance, it won’t suffice.” The solution here is likely a combination of public and private efforts.
Governments can, and should, come up with robust testing programs that measure both the raw capabilities and the safety risks of AI models, and they should fund grants and research projects aimed at coming up with new, high-quality evaluations.
In its executive order on AI last year, the White House directed several federal agencies, including the National Institute of Standards and Technology, to create and oversee new ways of evaluating AI systems.