Large language models have become everyday intermediaries of knowledge. They respond fluently to academic, legal, or technical questions, suggesting a deep understanding of the content. However, a recent study led by researchers at the National University of Distance Education (UNED) raises a key question: to what extent do these successes reflect genuine reasoning rather than simple memorization of patterns?

The work, from the Department of Computer Languages and Systems and published by IEEE under the title "On the limits of reasoning in LLMs: evidence of contamination, translation and answer modification in multiple-choice tests", proposes a methodology to systematically separate two abilities that are often conflated when evaluating AI: recalling previously seen answers and reasoning by eliminating incorrect alternatives.

From search engine to AI: a trust that should be qualified

The research is framed in a context in which millions of users have replaced the traditional search engine with conversational AI systems. For Eva Sánchez Salido, predoctoral researcher at the Department of Computer Languages and Systems at UNED and one of the authors of the study, this change has relevant implications: “When a chatbot is used for queries that were previously made in a search engine, the response can be generated in two ways: either the model answers with the information it remembers from its training, or it consults the Internet before responding.”

In the first case, she explains, the system has no access to recent information and is more prone to error when up-to-date facts matter. In the second, the process is more reliable, though not infallible: “Although it is still possible for the answer to be invented, it is much more likely to be correct.”

The main advantage over the classic search engine is that the AI not only locates sources but also selects and synthesizes them. However, this very capability introduces an added risk: the more fluent and confident the synthesis, the easier it is to take an incorrect answer at face value. “If the veracity of the answer is critical, it must always be checked,” says Eva Sánchez.

Public benchmarks: when the exam has already been studied

One of the central axes of the study is the criticism of current AI evaluation systems. So-called benchmarks—sets of questions and answers used to measure the performance of models—are usually public and widely disseminated.

Eva Sánchez sums it up with a clear metaphor: “When the data is public, the model is like a student who has seen the answers before taking the exam. The evaluation measures its ability to memorize them, not its actual knowledge of the subject.”

This phenomenon, known as data contamination, means that the high results obtained in standard tests are not necessarily a guarantee of real understanding. For this reason, the study combines public benchmarks, such as MMLU, with private sets designed by UNED, to which the models have not had access during their training.
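The intuition behind pairing a public benchmark with a private, unseen set can be sketched as a simple accuracy comparison. The sketch below uses made-up answer lists purely for illustration; the function name and data are not from the study:

```python
def accuracy(predictions, gold):
    """Fraction of items where the predicted choice matches the gold key."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical results on a public benchmark vs. a private, unseen set.
public_gold  = ["A", "C", "B", "D", "A"]
public_pred  = ["A", "C", "B", "D", "B"]   # 4 of 5 match
private_gold = ["B", "A", "D", "C", "B"]
private_pred = ["B", "D", "A", "C", "A"]   # 2 of 5 match

gap = accuracy(public_pred, public_gold) - accuracy(private_pred, private_gold)
# A large positive gap is consistent with contamination: the model does
# much better on items it may have seen during training than on items
# it provably has not.
```

The gap by itself does not prove contamination (the private set could simply be harder), which is why the study controls the design of its private questions rather than relying on a single accuracy number.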

Differences between languages

The work also analyzes the linguistic generalization capacity of the models, a key issue for non-Anglophone educational and administrative contexts. The results show a clear trend: “In all our experiments we found greater reliability in English than in Spanish, although the difference varies greatly between models and areas of knowledge.”

In the most advanced systems the gap is reduced, but it remains significant in certain disciplines. According to the researcher, in areas related to Spanish culture and society, such as law or geography of Spain, all models tend to respond much worse. These results underscore that linguistic fluency does not necessarily equate to deep contextual understanding.

When the correct answer disappears

The central methodological axis of the research is the NOTO (None Of The Other answers) reformulation. In this approach, the correct answer is removed from the available options and replaced with “None of the other answers.”

“Answering a multiple-choice question can be done by simple pattern recognition,” explains Eva Sánchez. “But replacing the correct answer with ‘none of the others’ forces you to check that all the other options are incorrect.”
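The reformulation described above can be sketched as a small transformation over a question item. This is a minimal illustration of the idea, not the paper's actual pipeline; the field layout and option wording are assumptions:

```python
import random

NOTO_LABEL = "None of the other answers"

def to_noto(question, options, correct_idx, seed=0):
    """Reformulate a multiple-choice item in NOTO style: remove the
    correct option and add 'None of the other answers', which becomes
    the new correct key. Answering now requires verifying that every
    remaining distractor is wrong, not just recognizing a familiar string."""
    distractors = [opt for i, opt in enumerate(options) if i != correct_idx]
    new_options = distractors + [NOTO_LABEL]
    random.Random(seed).shuffle(new_options)  # deterministic shuffle for reproducibility
    return question, new_options, new_options.index(NOTO_LABEL)

# Illustrative item (not from the study's benchmarks):
q, opts, key = to_noto(
    "What is the capital of Spain?",
    ["Madrid", "Barcelona", "Seville", "Valencia"],
    correct_idx=0,
)
# 'Madrid' is gone; the correct choice is now the NOTO option.
```

A model that merely pattern-matches "capital of Spain → Madrid" finds no such option and must instead rule out Barcelona, Seville, and Valencia one by one, which is exactly the eliminative reasoning the study probes.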

This eliminative reasoning, closer to how humans approach such tests, causes significant drops in the performance of the models: “The drops are very large, which suggests that in many cases they appear to reason, but they are only recognizing familiar patterns.”

Even the models that lead the usual rankings show a sharp decline, leading to a clear conclusion: traditional benchmarks may be overestimating the real reasoning capacity of artificial intelligence.

Beyond the size of the models

Faced with the dominant idea that progress happens only through increasingly larger models, the study points in another direction. “Our results indicate that it is not enough to make larger models,” says the researcher. “Advanced training strategies are needed, such as reinforcement learning with verifiable rewards.”

Improvement also requires rethinking evaluation systems: we need to change how we measure what models really understand, incorporating less predictable tests that are closer to real use, the researcher notes.

The final message of the study is as technical as it is relevant for society: getting it right does not always mean understanding. Distinguishing between the two will be key in a context in which artificial intelligence increasingly influences academic, professional and everyday decisions.