The Most Insightful AI Benchmark
Galileo (Galilei) v. Grok, Charles (Darwin) v. ChatGPT, (Marie) Curie v. Claude, (Edward) Jenner v. Gemini
Recently, I was talking with someone about generative AI benchmarks. By now, everyone seems to be generally aware that what benchmarks can convey about a model’s usefulness is limited, because all public benchmarks can be gamed. Not only do companies often train on the test data (intentionally or not), they also pay humans to come up with questions that mimic the benchmarks. It’s not exactly like teaching to the test. But it’s not not like that either.
This is why a new benchmark can come out, models perform poorly on it at first, and then, like magic, they rapidly improve against it in just a matter of months. The illusion is that such gains imply the models are also improving rapidly on their own, with the data they already have. Just a few turns of the knobs on the algorithm and a sprinkle of extra compute, and voilà!
Major AI companies probably have their own high-quality benchmarks that they can be certain their models haven’t trained on. Presumably, their models score lower on those proprietary benchmarks, both because the test data is held out of training and because, if the companies are aiming for true generalizability, they presumably don’t do a ton of RLHF to make their models particularly good at the benchmark; that would just re-confirm that it’s possible to optimize models for specific tasks.
It makes me wonder if they think, say, a 60/100 is a good enough score. If anyone on the inside (or formerly on the inside) of big companies wants to shed some light on this for me, please reach out!
My idea for getting around the problems with public benchmarks, and what I think would be the most revealing benchmark, is as follows:
Background
Isaac Newton made the leap from the math he grew up with to physics and calculus, Darwin made the leap from observations to the theory of evolution, Einstein made the leap from physics to general relativity, etc. What models today can't do is make similar leaps.
But if they could, that would arguably be what matters most for society. For example, even if a model can solve complex PhD-level physics problems, it is limited to problems with known or knowable answers in established domains, and a human must specify the problems. What would be more helpful is if models could be innovative, exploratory, insightful, and so on. In short, what society should most care about, as far as capability benchmarks go, is when AI models can start making leaps of Nobel Laureate-level genius without a human telling them what to look for.
Initially, I thought it'd be cool if we could train a model on materials that only existed before a known breakthrough. (I’ve since learned this is far from a novel idea.) For instance, if you only trained a model on documents from before 1850, including Darwin’s notes, could it derive the theory of evolution? If so, that would be exciting!
But (1) that wouldn't necessarily tell us much about bigger models trained on waaaaay more data, and (2) it would be hard to confirm the model wasn't trained on anything created after 1850. The model may have been trained on a document describing evolution, which would contaminate the entire test.
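To make the contamination problem concrete, here is a minimal sketch, in Python, of a time-sliced corpus filter plus a naive leakage scan. Everything here is an assumption on my part: the document schema, the keyword list, and even the premise that keyword matching could catch leakage.

```python
from datetime import date

# Hypothetical sketch: build a pre-1850 training slice and run a naive
# contamination scan. The corpus format (a list of dicts with "id",
# "date", and "text" keys) is an assumption, not a real library's API.

CUTOFF = date(1850, 1, 1)

# Terms that would only appear in post-breakthrough writing. A real check
# would need expert-curated lists and semantic matching, not keywords.
ANACHRONISM_TERMS = {"natural selection", "origin of species", "darwinism"}

def pre_cutoff_slice(corpus):
    """Keep only documents reliably dated before the cutoff."""
    return [doc for doc in corpus
            if doc["date"] is not None and doc["date"] < CUTOFF]

def naive_contamination_scan(corpus):
    """Flag documents containing terminology that postdates the cutoff.
    This only catches verbatim leakage; a paraphrased description of
    evolution would slip straight through."""
    flagged = []
    for doc in corpus:
        text = doc["text"].lower()
        if any(term in text for term in ANACHRONISM_TERMS):
            flagged.append(doc["id"])
    return flagged
```

The gap between what this scan can catch and what actually counts as contamination is exactly why confirming a clean pre-1850 corpus is so hard.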
Goal of a New Benchmark
It would be great if we (probably a tight-knit consortium of university researchers, where the researchers must sign NDAs) could construct a benchmark that consists of a large amount of data that, to almost all humans, doesn't indicate anything, or would be confusing. Some of the data in the benchmark could be noisy datasets, observation notes from various perspectives, research papers, etc. But within all that (probably mostly synthetic) data would be all the information needed to derive a new, non-obvious fundamental scientific theory or law.
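As a toy illustration of the generating principle (and only a toy: a real benchmark would hide a far subtler, expert-designed theory across observation notes, papers, and datasets), here is a sketch that buries one invented relationship in noise and distractor variables. The law, the constants, and the function names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def generate_observations(n=10_000, k=2.7):
    """Emit a table of 'measurements' in which one invented relationship
    (y = k * x**1.5) is buried among noise and distractor variables."""
    x = rng.uniform(1.0, 100.0, size=n)          # the quantity that matters
    signal = k * x ** 1.5                        # the hidden "law"
    y = signal + rng.normal(0.0, 0.15 * signal)  # heteroscedastic noise
    distractors = rng.normal(size=(n, 8))        # irrelevant measurements
    table = np.column_stack([x, y, distractors])
    perm = rng.permutation(table.shape[1])       # hide which column is which
    return table[:, perm]
```

The point of the column shuffle is that nothing in the data marks which variables carry the signal; the solver has to discover the relationship, not just fit it.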
To validate the benchmark, the researchers would probably have to pay actual experts to try to solve it. If human experts (and only experts) could make the leap and identify the new theory or law, that would be a strong indicator that the benchmark genuinely tests the ability to derive such insights.
If an AI model could derive the same insight, that would be a major and meaningful breakthrough. Olympiad math is way less important than a model's ability to synthesize ideas from vast and various sources of information and discover some fundamental truth about the universe.
The Tricks to Making It Work
This type of benchmark would not be without its challenges. And for good reason: we’d want it to be rigorous and challenging so that passing it would signal a significant advancement in AI and not just ambiguous noise.
The hurdles for benchmark development would include at least the following:
Creating a large enough corpus of disparate data types. The dataset would need to be significant in size. The best comparison would be to look at how much data recent Nobel Laureates had to comb through to achieve their breakthroughs. My guess is that a Newton-esque insight takes significantly more data today than it did in Newton’s time. But even that kind of preliminary research would make for an interesting set of papers.
The benchmark would need to be based on a hypothetical theory that is unconnected to any that currently exist. Essentially, the goal would be to create a novel and elaborate conspiracy theory. The reason is that we wouldn’t want the model to simply generalize from known data and proposed theories it has already been trained on; it might connect those dots as a best guess rather than through genuine, logically reasoned insight. A decent source of ideas would probably be hard science fiction writers, since the premise would need to be scientifically plausible yet firmly fictional. But the high-level ideas of creatives still wouldn’t suffice: what we’d really want is something that requires extremely nuanced and deep knowledge of a few domains (something I definitely don’t possess).
Whatever the insight is, it would need to be in a field of study where we could get enough experts to chip in and validate the benchmark. So, it probably couldn't be too obscure. If there are only a handful of true experts in the domain, it’s unlikely most of them would be able or willing to help validate the benchmark.
Validating the benchmark would carry a considerable opportunity cost for the experts. Rather than conducting research on real problems and real data that could lead to real breakthroughs, they’d be working on the science fiction data. A selling point, though, could be that this benchmark would be extremely valuable for humanity. If we can confirm a model is capable of insight, the implications are immense for scientific research, policy, and law. Lots of people refer to LLMs as perhaps the most powerful technology ever created, and I suppose that could be true. But you know what’s more powerful and consequential than a computational machine? A computational machine that can derive insights from noisy data.
Testing
Supposing some organization could secure the necessary funding and experts to develop a high-quality, secret benchmark, the next hurdle would be how to conduct the testing. As I see it, there are two options: present model results only in the aggregate, or conduct the testing locally.
The argument for aggregation is that it would be the only way to ensure no individual company would know if their model got the correct answer. This is important because we wouldn't want them to possibly leak it or tell others the correct answer. It'd spoil the benchmark for everyone else.
Instead, the testers would evaluate a batch of models at a time (say, two from OpenAI, one from Anthropic, three from xAI, one from Cohere, one from Google) and then provide a summary of the results. For example, the testers would share that none of the models passed, or that one did, or whatever the case may be. This would not be terribly helpful for the AI companies, but it would be tremendously helpful information for policymakers, because then we’d know that we (humans) have created a model capable of breakthrough thinking. We’d want appropriate regulation and oversight because, presumably, if a model can solve the benchmark, it could produce all kinds of important real-world insights across all kinds of fields. That may be a net good for humanity...but it also may not be, depending on what problems or types of data users direct the models to focus on.
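A minimal sketch of the aggregate-only reporting rule, assuming a hypothetical run_benchmark grader that returns pass/fail per model:

```python
def evaluate_batch(models, run_benchmark):
    """Run a sealed benchmark over a batch of models and report only
    aggregate results, never which model passed."""
    results = [run_benchmark(model) for model in models]  # True/False each
    # Only this summary leaves the testing facility; per-model outcomes
    # (and the models' actual answers) stay sealed.
    return {"models_tested": len(models), "models_passed": sum(results)}
```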
The alternative method of testing is to run the models locally. This would be more helpful because it’d allow the testers to reveal the specific model(s) that achieved the insight. Importantly, while the companies would learn whether their model passed, they still wouldn’t be allowed to see the answer it produced. Again, the secrecy would help preserve the benchmark’s usefulness.
Limitations
One big limitation of this type of benchmark is that it hands the model all the data necessary for the insight. That is, it doesn’t require the model to ask the right questions about what data to gather or how to gather it, which is obviously a huge element of being a Nobel Laureate-level thinker. Still, even a model being able to find a needle in a haystack without knowing what the needle will look like is what some might call a Big Dang Deal. (Well, perhaps nobody uses that phrase, but they should!)
Another limitation is that models might achieve monumental insights without passing the benchmark. That could be because they are never tested on the benchmark, or they might fail at the benchmark but excel on a different dataset. While the Insight Benchmark, if designed well, could weed out virtually all false positives of an ability to generate genuine, meaningful insights, it would not be able to prevent false negatives.
Feedback
I’m curious what others think about a benchmark like this. Would it be worthwhile? Is someone already working on it? Please message me or leave a comment if you have any thoughts you’d like to share!