
Large reasoning models almost certainly can think

By Josh · November 2, 2025 · Technology And Software



Recently, there has been a lot of hullabaloo about the idea that large reasoning models (LRMs) are unable to think. This is mostly due to a research article published by Apple, "The Illusion of Thinking." Apple argues that LRMs must not be able to think; instead, they just perform pattern-matching. The evidence they provide is that LRMs with chain-of-thought (CoT) reasoning are unable to keep executing a predefined algorithm as the problem grows.


This is a fundamentally flawed argument. If you ask a human who already knows the algorithm for solving the Tower-of-Hanoi problem to solve an instance with twenty discs, for example, he or she would almost certainly fail to do so. By that logic, we must conclude that humans cannot think either. However, this counterargument only establishes that there is no evidence that LRMs cannot think. That alone certainly does not mean that LRMs can think, just that we cannot be sure they don't.
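
To get a sense of the scale involved, here is a minimal sketch, not taken from the Apple paper, that counts the moves the standard recursive algorithm requires. Twenty discs already demand over a million moves, which is why no human would execute the procedure to the end by hand.

```python
def hanoi_moves(n: int) -> int:
    """Number of moves the classic recursive Tower-of-Hanoi algorithm needs for n discs."""
    if n == 0:
        return 0
    # Move n-1 discs aside, move the largest disc, then move the n-1 discs back on top.
    return 2 * hanoi_moves(n - 1) + 1

print(hanoi_moves(20))  # 1048575 -- more than a million moves for just twenty discs
```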

In this article, I will make a bolder claim: LRMs almost certainly can think. I say ‘almost’ because there is always a chance that further research would surprise us. But I think my argument is pretty conclusive.

What is thinking?

Before we try to understand if LRMs can think, we need to define what we mean by thinking. But first, we have to make sure that humans can think per the definition. We will only consider thinking in relation to problem solving, which is the matter of contention.

1. Problem representation (frontal and parietal lobes)

When you think about a problem, the process engages your prefrontal cortex. This region is responsible for working memory, attention and executive functions — capacities that let you hold the problem in mind, break it into sub-components and set goals. Your parietal cortex helps encode symbolic structure for math or puzzle problems.

2. Mental simulation (working memory and inner speech)

This has two components: One is an auditory loop that lets you talk to yourself — very similar to CoT generation. The other is visual imagery, which allows you to manipulate objects visually. Geometry was so important for navigating the world that we developed specialized capabilities for it. The auditory part is linked to Broca’s area and the auditory cortex, both reused from language centers. The visual cortex and parietal areas primarily control the visual component.

3. Pattern matching and retrieval (hippocampus and temporal lobes)

These actions depend on past experiences and stored knowledge from long-term memory:

  • The hippocampus helps retrieve related memories and facts.

  • The temporal lobe brings in semantic knowledge — meanings, rules, categories.

This is similar to how neural networks depend on their training to process the task.

4. Monitoring and evaluation (anterior cingulate cortex)

Our anterior cingulate cortex (ACC) monitors for errors, conflicts or impasses — it’s where you notice contradictions or dead ends. This process is essentially based on pattern matching from prior experience.

5. Insight or reframing (default mode network and right hemisphere)

When you're stuck, your brain might shift into default mode — a more relaxed, internally-directed network. This is when you step back, let go of the current thread and sometimes ‘suddenly’ see a new angle (the classic “aha!” moment).

This is similar to how DeepSeek-R1 was trained for CoT reasoning without having CoT examples in its training data. Remember, the brain continuously learns as it processes data and solves problems.

In contrast, LRMs aren’t allowed to change based on real-world feedback during prediction or generation. But with DeepSeek-R1’s CoT training, learning did happen as it attempted to solve the problems — essentially updating while reasoning.

Similarities between CoT reasoning and biological thinking

An LRM does not have all of the faculties mentioned above. For example, an LRM is unlikely to do much visual reasoning within its circuits, although a little may happen; it certainly does not generate intermediate images during CoT generation.

Most humans can make spatial models in their heads to solve problems. Does this mean we can conclude that LRMs cannot think? I would disagree. Some humans also find it difficult to form spatial models of the concepts they think about. This condition is called aphantasia. People with this condition can think just fine. In fact, they go about life as if they don’t lack any ability at all. Many of them are actually great at symbolic reasoning and quite good at math — often enough to compensate for their lack of visual reasoning. We might expect our neural network models also to be able to circumvent this limitation.

If we take a more abstract view of the human thought process described earlier, we can see mainly the following things involved:

1.  Pattern-matching is used for recalling learned experience, problem representation and monitoring and evaluating chains of thought.

2.  Working memory stores all the intermediate steps.

3.  Backtracking search concludes that a chain of thought is not going anywhere and backtracks to some reasonable point (see the sketch after this list).
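
The backtracking component, in particular, lends itself to a compact sketch. The code below is only an abstract illustration of the search pattern described in this list, not how any actual LRM is implemented; `propose`, `is_goal`, `is_dead_end` and `depth_limit` are hypothetical stand-ins for learned pattern matching, evaluation and a finite working memory.

```python
from typing import Callable, Iterable, Optional

def backtracking_search(
    state: list,
    propose: Callable[[list], Iterable],    # pattern matching: suggest plausible next steps
    is_goal: Callable[[list], bool],        # evaluation: is this chain a solution?
    is_dead_end: Callable[[list], bool],    # evaluation: is this chain going nowhere?
    depth_limit: int,                       # stands in for finite working memory
) -> Optional[list]:
    """Depth-first search that abandons unpromising chains and backtracks."""
    if is_goal(state):
        return state
    if is_dead_end(state) or len(state) >= depth_limit:
        return None                         # give up on this chain of thought
    for step in propose(state):
        result = backtracking_search(state + [step], propose, is_goal, is_dead_end, depth_limit)
        if result is not None:
            return result
    return None                             # backtrack to the previous choice point

# Toy usage: find a sequence of steps of size 1-3 that sums to exactly 7.
print(backtracking_search(
    state=[],
    propose=lambda s: [1, 2, 3],
    is_goal=lambda s: sum(s) == 7,
    is_dead_end=lambda s: sum(s) > 7,
    depth_limit=10,
))  # [1, 1, 1, 1, 1, 1, 1]
```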

Pattern-matching in an LRM comes from its training. The whole point of training is to learn both knowledge of the world and the patterns to process that knowledge effectively. Since an LRM is a layered network, the entire working memory needs to fit within one layer. The weights store the knowledge of the world and the patterns to follow, while processing happens between layers using the learned patterns stored as model parameters.

Note that even in CoT, the entire text — including the input, CoT and part of the output already generated — must fit into each layer. Working memory is just one layer (in the case of the attention mechanism, this includes the KV-cache).
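
As a rough illustration of how bounded that working memory is, the back-of-the-envelope calculation below estimates the size of a KV-cache; the dimensions are made-up examples, not the configuration of any particular model.

```python
def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    # Both keys and values are cached for every layer, attention head and token position.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value

# Hypothetical dimensions: 32 layers, 32 heads, head size 128, 32k-token context, fp16 values.
print(kv_cache_bytes(32, 32, 128, 32_768) / 2**30, "GiB")  # 16.0 GiB
```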

CoT is, in fact, very similar to what we do when we are talking to ourselves (which is almost always). We nearly always verbalize our thoughts, and so does a CoT reasoner.

There is also good evidence that a CoT reasoner can take backtracking steps when a certain line of reasoning seems futile. In fact, this is what the Apple researchers saw when they tried to ask the LRMs to solve bigger instances of simple puzzles. The LRMs correctly recognized that trying to solve the puzzles directly would not fit in their working memory, so they tried to figure out better shortcuts, just like a human would do. This is even more evidence that LRMs are thinkers, not just blind followers of predefined patterns.

But why would a next-token-predictor learn to think?

Neural networks of sufficient size can learn any computation, including thinking, and a next-word-prediction system is no exception: it, too, can learn to think. Let me elaborate.

A common idea is that LRMs cannot think because, at the end of the day, they are just predicting the next token; they are only a 'glorified auto-complete.' This view is fundamentally incorrect. The mistake is not in calling an LRM an 'auto-complete,' but in assuming that an 'auto-complete' does not have to think. In fact, next-word prediction is far from a limited representation of thought. On the contrary, it is the most general form of knowledge representation that anyone can hope for. Let me explain.

Whenever we want to represent some knowledge, we need a language or a system of symbolism to do so. Different formal languages exist that are very precise in terms of what they can express. However, such languages are fundamentally limited in the kinds of knowledge they can represent.

For example, first-order predicate logic cannot represent properties of all predicates that satisfy a certain property, because it doesn't allow predicates over predicates.
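
A standard textbook example, added here only for illustration, makes the limitation concrete: the induction principle over the natural numbers quantifies over an arbitrary predicate P, which makes it a second-order statement that first-order logic cannot express directly.

```latex
% Induction quantifies over every predicate P, so the outer \forall P
% has no counterpart in first-order logic.
\forall P\,\bigl[\bigl(P(0) \land \forall n\,(P(n) \rightarrow P(n+1))\bigr) \rightarrow \forall n\,P(n)\bigr]
```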

Of course, there are higher-order predicate calculi that can represent predicates on predicates to arbitrary depths. But even they cannot express ideas that lack precision or are abstract in nature.

Natural language, however, is complete in expressive power — you can describe any concept in any level of detail or abstraction. In fact, you can even describe concepts about natural language using natural language itself. That makes it a strong candidate for knowledge representation.

The challenge, of course, is that this expressive richness makes it harder to process the information encoded in natural language. But we don’t necessarily need to understand how to do it manually — we can simply program the machine using data, through a process called training.

A next-token prediction machine essentially computes a probability distribution over the next token, given a context of preceding tokens. Any machine that aims to compute this probability accurately must, in some form, represent world knowledge.

A simple example: Consider the incomplete sentence, "The highest mountain peak in the world is Mount …" — to predict the next word as Everest, the model must have this knowledge stored somewhere. If the task requires the model to compute the answer or solve a puzzle, the next-token predictor needs to output CoT tokens to carry the logic forward.
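
As a minimal sketch of what "computing a probability distribution over the next token" means, the snippet below turns a handful of made-up token scores into probabilities; the numbers are purely illustrative and are not the output of any real model.

```python
import math

def next_token_distribution(logits: dict[str, float]) -> dict[str, float]:
    """Softmax raw scores into a probability distribution over candidate next tokens."""
    z = sum(math.exp(v) for v in logits.values())
    return {token: math.exp(v) / z for token, v in logits.items()}

# Hypothetical scores for the context "The highest mountain peak in the world is Mount ..."
logits = {"Everest": 9.1, "Kilimanjaro": 4.0, "Fuji": 2.5}
probs = next_token_distribution(logits)
print(max(probs, key=probs.get), probs)  # a well-trained predictor puts most of the mass on "Everest"
```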

This implies that, even though it’s predicting one token at a time, the model must internally represent at least the next few tokens in its working memory — enough to ensure it stays on the logical path.

If you think about it, humans also predict the next token — whether during speech or when thinking using the inner voice. A perfect auto-complete system that always outputs the right tokens and produces correct answers would have to be omniscient. Of course, we’ll never reach that point — because not every answer is computable.

However, a parameterized model that can represent knowledge by tuning its parameters, and that can learn through data and reinforcement, can certainly learn to think.

Does it produce the effects of thinking?

At the end of the day, the ultimate test of thought is a system’s ability to solve problems that require thinking. If a system can answer previously unseen questions that demand some level of reasoning, it must have learned to think — or at least to reason — its way to the answer.

We know that proprietary LRMs perform very well on certain reasoning benchmarks. However, since there's a possibility that some of these models were fine-tuned on benchmark test sets through a backdoor, we’ll focus only on open-source models for fairness and transparency.

We evaluate them using a set of standard reasoning benchmarks.

As one can see, in some benchmarks, LRMs are able to solve a significant number of logic-based questions. While it’s true that they still lag behind human performance in many cases, it’s important to note that the human baseline often comes from individuals trained specifically on those benchmarks. In fact, in certain cases, LRMs outperform the average untrained human.

Conclusion

Based on the benchmark results, the striking similarity between CoT reasoning and biological reasoning, and the theoretical understanding that any system with sufficient representational capacity, enough training data and adequate computational power can perform any computable task, it is fair to say that LRMs meet those criteria to a considerable extent.

It is therefore reasonable to conclude that LRMs almost certainly possess the ability to think.

Debasish Ray Chawdhuri is a senior principal engineer at Talentica Software and a Ph.D. candidate in Cryptography at IIT Bombay.



