A study from University College London has put large language models in direct comparison with working neuroscientists on a difficult scientific task: predicting whether experimental data would support a stated hypothesis. The result was not close. The AI models reached 81.4 percent accuracy, while human experts reached 63.4 percent.
The work, published in Nature Human Behaviour, used a benchmark called BrainBench to test judgment across neuroscience research scenarios. It points to a practical future in which LLMs are used not only to summarize papers but also to help assess whether proposed scientific results are likely to hold up.
What BrainBench Tested
BrainBench was designed around research scenarios from different areas of neuroscience. Human participants were asked to review the methodology and hypotheses, then predict whether the experiments would support the researchers' expectations.
The human side of the test included 171 neuroscientists. The group ranged from graduate students to professors and had an average of 10.1 years of experience. Each participant reviewed nine research scenarios.
The LLMs were given a larger test set. They faced 200 expert-generated cases plus 100 GPT-4-generated scenarios. In that setting, the models produced a much higher overall accuracy score than the people taking part in the benchmark.
The comparison is especially notable because the top human performers did not close the gap. Even the top 20 percent of human experts reached only 66.2 percent accuracy, which left the strongest expert subgroup 15.2 percentage points behind the AI systems.
Why The Result Matters
The study is not simply a story about AI beating people on another benchmark. The task is directly connected to how scientists think through possible results before running or interpreting experiments. If a model can make useful predictions about which hypotheses are likely to be supported, it could become part of the planning process for research.
The source article notes that the AI systems performed better across all tested neuroscience areas. They were especially strong when they had to connect information from across a paper's abstract rather than rely on a single passage. The models appeared able to combine methodology, background information and results in ways that helped them forecast outcomes.
That matters because research judgment is rarely based on one isolated sentence. Scientists usually weigh the setup of an experiment, the assumptions behind it and how similar work has turned out. BrainBench suggests that LLMs can recognize some of those patterns at scale.
The researchers also checked whether the models were merely repeating memorized answers. They used special testing methods to compare performance against known training data and to examine whether the test cases had already appeared in training. According to the researchers, the models seemed to operate more like human readers of scientific literature, building general patterns and frameworks rather than recalling details by rote.
"This success suggests that a great deal of science is not truly novel, but conforms to existing patterns of results in the literature. We wonder whether scientists are being sufficiently innovative and exploratory," lead author Dr. Ken Luo said.
Small Models, Specialized Training
One surprising part of the study was that smaller AI models did not necessarily fall behind larger ones. Llama2-7B and Mistral-7B performed just as well as larger counterparts, even though they had only 7 billion parameters.
The difference between base models and chat-optimized versions was also important. The base versions performed strongly on prediction, while versions tuned for chat performed worse. The researchers suspect that optimizing a model for conversation may weaken its ability to draw scientific conclusions.
That distinction is useful for anyone thinking about AI in research settings. A model that is pleasant to talk to is not automatically the best model for scientific inference. The task may require preserving abilities that can be reduced when the system is adjusted for everyday dialogue.
The researchers also built a specialized model called BrainGPT. It was based on Mistral-7B and further trained on 1.3 billion tokens of neuroscience literature. That custom model improved accuracy further, adding another 3 percentage points to the results.
Meta's Galactica was among the tested models. It had been designed for scientific tasks, although it faced significant criticism from scientists when it launched in 2022. The study also notes that the tested systems were older open-source AI models, not the latest versions from companies like Anthropic, Meta or OpenAI. The source article says this suggests current models like GPT-4 or Sonnet 3.5 might do even better on these tasks.
How AI Could Change Experiment Design
The most immediate implication is in research planning. If a scientist can describe a proposed experiment and expected findings, an AI system could estimate how likely different outcomes may be. That would make it easier to compare possible designs before committing effort to one path.
"We envision a future where researchers can input their proposed experiment designs and anticipated findings, with AI offering predictions on the likelihood of various outcomes. This would enable faster iteration and more informed decision-making in experiment design," Luo said.
The study also found a useful calibration signal. Both AI systems and human experts were more likely to be correct when they expressed higher confidence in a prediction. The researchers say reliable self-assessment of that kind is essential for real-world applications.
In practice, confidence matters because an AI prediction is not only a yes-or-no answer. A tool that can indicate when it is more or less certain could help researchers decide when to treat its output as a strong signal and when to be cautious.
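One way to picture that calibration signal is to group predictions by stated confidence and check whether accuracy rises from the low-confidence group to the high-confidence group. The sketch below is only an illustration, not the researchers' analysis: the function name accuracy_by_confidence, the choice of three equal-width bins and the toy data are all assumptions invented for this example.

```python
from collections import defaultdict


def accuracy_by_confidence(predictions, n_bins=3):
    """Group (confidence, is_correct) pairs into equal-width confidence
    bins and return the fraction of correct predictions in each bin."""
    bins = defaultdict(list)
    for confidence, is_correct in predictions:
        # Clamp the index so a confidence of exactly 1.0 lands in the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append(is_correct)
    return {
        index: sum(outcomes) / len(outcomes)
        for index, outcomes in sorted(bins.items())
    }


# Hypothetical data, invented for illustration only:
# (stated confidence between 0 and 1, whether the prediction was correct).
toy_predictions = [
    (0.15, False), (0.25, True), (0.30, False),
    (0.45, True), (0.55, False), (0.60, True),
    (0.75, True), (0.85, True), (0.95, True), (0.90, False),
]

per_bin = accuracy_by_confidence(toy_predictions)
for index, accuracy in per_bin.items():
    print(f"confidence bin {index}: accuracy {accuracy:.2f}")
```

If accuracy climbs from the lowest bin to the highest, as it does in this toy data, stated confidence carries usable information about when to trust a prediction. A flat or falling pattern would suggest the confidence values should not be relied on.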
The Risk Of Overusing Predictions
The researchers also warn about drawbacks. If AI predicts that a hypothesis is unlikely to be supported, scientists might decide not to run the study. That could be a problem because unexpected results can lead to major breakthroughs.
There is another risk in the opposite direction. If an AI model predicts a finding with high confidence, researchers may dismiss that result as obvious or uninteresting. In both cases, the tool could influence what work gets pursued and how results are valued.
The broader lesson is that LLMs may become useful research assistants, but not neutral ones. Their predictions could speed up decision-making and reveal patterns across the literature. At the same time, they could make scientists less willing to explore outcomes that do not fit the model's expectations.
BrainBench shows that large language models can already compete strongly with experienced neuroscientists on forecasting research outcomes. The challenge now is deciding how to use that ability without narrowing the space of scientific discovery.