My thoughts on the paper 'Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!'
In this post, I discuss a recent paper by Kambhampati et al., “Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!”
My goal here is not to summarize the paper (it’s worth reading in full), but to highlight the aspects that resonated most with me and to add my own perspective, particularly my view that looking inside the cognitive processes of LLMs may be a fundamentally unachievable goal, and what this means for how we approach AI interpretability and alignment.
Debates about anthropomorphic terms like “reasoning” or “consciousness” are futile without clear definitions — these words are subjective, overloaded, and often understood differently by each participant. I’ve been part of countless unproductive discussions that boiled down to mismatched interpretations of these terms.
Standard definitions of thinking and reasoning remain vague. Broadly speaking, thinking is considered an umbrella concept encompassing both intuitive and deliberate mental processes, whereas reasoning refers more narrowly to structured, logical inference.
The paper briefly references Daniel Kahneman’s book Thinking, Fast and Slow, which distinguishes between System 1 (fast, automatic, heuristic-driven) and System 2 (slow, effortful, and logical) thinking.
To my knowledge, cognitive science and psychology offer no universally accepted definitions of these concepts. I therefore adopt the System 1 vs. System 2 framework as an intuitive guide, assuming that LLMs already excel at System 1-like capabilities (fast pattern-matching and intuition), whereas large reasoning models (LRMs) aim to extend them with System 2-style reasoning abilities.
Unlike System 1-like pattern recognition, System 2-style reasoning requires structured planning, multi-step inference, and symbolic abstraction. The problem: current attempts to achieve System 2 capabilities often focus on distilling human-like thought processes into LLMs via reasoning chains encoded in natural language.
This raises an immediate question: how can we reliably retrieve reasoning chains from humans themselves? To do so, we would need humans to recall and verbalize their thought processes in natural language. However, as noted by Kambhampati et al., the well-known paper “Telling More Than We Can Know: Verbal Reports on Mental Processes” by Nisbett and Wilson (1977) shows that such verbal self-reports are remarkably unreliable, a phenomenon often described as the introspection illusion.
The term introspection illusion refers to the “cognitive bias in which people wrongly think they have direct insight into the origins of their mental states”. Because we often lack conscious access to higher-order mental processes, our explanations for why we think or act in certain ways are frequently constructed post hoc — relying on confabulated causal theories rather than genuine insight.
In short, when people explain their own thoughts, they are often confabulating — constructing plausible narratives rather than reporting true internal processes.
This raises two fundamental concerns. First, how realistic is it to expect that we can endow LLMs with System 2-like reasoning simply by distilling human-generated reasoning traces? Second, if humans themselves struggle to reliably articulate their own thought processes, how meaningful is it to treat LLM-generated reasoning traces, expressed in natural language, as evidence of any true cognition within this black box?
In my view, Kambhampati et al. highlight two key fallacies (cf. Sec. 3 in the paper), mirroring the two concerns above: anthropomorphizing the intermediate tokens a model is trained on, by assuming that human-derived reasoning traces carry human thought processes into the model, and anthropomorphizing the intermediate tokens a model generates, by reading them as faithful accounts of its internal problem solving.
Both fallacies stem from a persistent desire to “look inside” a model’s internal cognition, i.e., to treat its outputs as windows into an internal thought process. In this context, I find the analogy proposed by Hagendorff et al. in the paper “Machine Psychology” especially instructive.
Most people easily recognize the abstract analogy that connects biological and artificial neural networks. But Hagendorff et al. argue that this analogy also applies to how these systems are studied.
Historically, human cognition has been studied along two orthogonal lines: neuroscience, which aims to understand the biological mechanisms, and psychology, which focuses on external behavior. The paper argues for a similar split when studying artificial systems: one line of work probes the network’s internal mechanisms (its weights and activations), while the other, machine psychology, studies the system’s behavior from the outside through carefully designed experiments.
The machine psychology analogy highlights why trying to look into a model’s “mind” is not only difficult but perhaps fundamentally misguided.
Machine psychology suggests we treat LLMs as opaque but testable subjects — design experiments, observe behaviors, and infer capabilities, rather than pretending we can read their minds through token streams.
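To make this behavioral stance a bit more concrete, here is a minimal sketch in Python of what such an experiment could look like. The harness, the `behavioral_probe` function, the toy tasks, and the `dummy_model` stand-in are illustrative assumptions of my own, not anything proposed in the papers discussed here; the point is only that the model is treated as a black box whose final answers are scored, while its intermediate tokens are never inspected.

```python
from typing import Callable, Iterable

def behavioral_probe(
    model: Callable[[str], str],          # any black-box text-in / text-out system
    tasks: Iterable[tuple[str, str]],     # (question, expected answer) pairs
    prompt_template: str = "{question}",  # condition under test, e.g. with or without a CoT cue
) -> float:
    """Score a model purely on observable behavior: feed prompts, check final answers."""
    hits = 0
    total = 0
    for question, expected in tasks:
        answer = model(prompt_template.format(question=question))
        hits += int(expected.strip().lower() in answer.strip().lower())
        total += 1
    return hits / max(total, 1)

# Example: compare two prompting conditions on the same (toy) task set.
toy_tasks = [("What is 17 + 25?", "42"), ("What is 9 * 7?", "63")]

def dummy_model(prompt: str) -> str:
    # Stand-in for a real LLM call; replace with your own client wrapper.
    return "42" if "17" in prompt else "63"

plain = behavioral_probe(dummy_model, toy_tasks, "{question}")
cot = behavioral_probe(dummy_model, toy_tasks, "{question}\nLet's think step by step.")
print(f"plain: {plain:.2f}, cot-prompted: {cot:.2f}")
```

Replacing `dummy_model` with a wrapper around a real LLM client turns this into a (very small) machine-psychology experiment: two prompting conditions, one observable outcome, and no claims about what happens inside.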
As noted in the previous section, human introspective reports are often post hoc narratives, constructed after the fact. LLMs trained to produce reasoning traces do the same. They learn how to talk about reasoning, not how to reason.
Mistaking surface-level explanations for true cognition risks giving us a false sense of alignment.
The paper by Kambhampati et al. helped shape my view that chain-of-thought traces (CoTs), while human-readable, should not be treated as reliable explanations of how a model arrives at its outputs. This implies they cannot be relied upon for (1) behavior auditing or (2) ensuring model honesty.
CoTs are not ground-truth indicators of a model’s internal cognition.
Yet, these illusions are easy to fall for, even within the scientific community, and they have implications for both AI alignment research and public discourse.
For example, a recent paper by Apollo Research on scheming interprets models’ intermediate tokens as evidence of deliberate, deceptive intent, treating the generated traces as a window into what the model was “planning” to do.
If even AI researchers and practitioners fall into this trap, how can we expect public figures or media outlets to avoid it?
The historian and writer Yuval Noah Harari, in a recent discussion, remarked when speaking about CoTs: “we can see how the sentences and stories are formed in [the AI’s minds]”. This phrasing — while understandable — reinforces the notion that we can see the actual thought process happening inside the language models.
This confusion over CoTs risks derailing conversations about alignment and interpretability by diverting attention toward misleading signals rather than genuine indicators of model behavior.
This paper can also be seen as a reminder that scientific rigor and intellectual integrity are essential for genuine progress. This doesn’t imply that researchers are consciously deceiving anyone, but rather that it’s easy to mislead ourselves when dealing with vague concepts and opaque systems. I admit the fallacies highlighted in the paper were not obvious to me before I read it. Yet without critical reflection, we risk unintentionally heading in the wrong direction.
This tendency is not new in the field of AI. As Kambhampati et al. note, McDermott (1976) already warned against “wishful mnemonics”: giving program components suggestive names such as UNDERSTAND or GOAL and then mistaking the label for the capability.
The authors also reference Richard Feynman’s famous notion of “Cargo Cult Science,” from his 1974 commencement address at Caltech. Feynman’s warning about adopting the rituals of science without its spirit of honesty and self-skepticism feels particularly relevant today. As models grow increasingly complex and their inner workings ever more opaque, it is all too easy to substitute the appearance of understanding for actual insight.
To avoid repeating past mistakes, we need to adopt Feynman’s kind of honesty: a willingness to state clearly what we don’t know, to avoid anthropomorphic shortcuts, and to test claims empirically rather than leaning on appealing narratives.
One important fact is that CoTs do improve performance on certain tasks. However, as Kambhampati et al. argue (and as several studies cited in the paper suggest), it is far more plausible that this improvement is simply a side effect of the additional computation afforded by longer prompts, rather than evidence that human-like thought processes have been “distilled” into the model.
Section 6 of the paper offers two insights worth highlighting. First, CoTs are just one form of prompt augmentation — essentially, a way of extending the prompt to give the model more “compute in time.” Second, drawing on Marvin Minsky’s observation that “intelligence is shifting the test part of generate-test into generation”, the paper frames LLMs as generators of candidate solutions that are then evaluated by a verifier. The naive (and brute-force) strategy is to produce many solutions and let the verifier do all the work. A more intelligent approach, however, would be for the LLM itself to generate a much narrower, high-quality set of candidate solutions. Achieving this shift is a central motivation behind the development of large reasoning models (LRMs).
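To make the generate-test framing concrete, here is a minimal sketch in Python. The `generate` sampler, the `verify` function, and the toy equation are hypothetical placeholders of my own, not anything from the paper; they only illustrate the brute-force end of the spectrum, where the verifier does all the work.

```python
import random
from typing import Callable, Optional

def solve_by_generate_and_test(
    generate: Callable[[str], str],      # black-box sampler: problem -> candidate solution
    verify: Callable[[str, str], bool],  # external verifier: (problem, candidate) -> accept?
    problem: str,
    num_samples: int = 16,
) -> Optional[str]:
    """Brute-force generate-and-test: sample many candidates, let the verifier do all the work."""
    for _ in range(num_samples):
        candidate = generate(problem)
        if verify(problem, candidate):
            return candidate
    return None  # no candidate passed: the cost of an unselective generator

# Toy illustration: "solve" x + 3 == 10 by sampling random guesses.
problem = "x + 3 == 10"
generate = lambda p: str(random.randint(0, 20))  # unselective generator
verify = lambda p, c: int(c) + 3 == 10           # cheap programmatic verifier

print(solve_by_generate_and_test(generate, verify, problem))
```

The shift Minsky points to, and that LRMs aim for, amounts to making the generator itself far more selective, so that only a handful of high-quality candidates ever reach the verifier.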
Looking ahead, there are many promising directions for future work. Personally, I find approaches like Chain of Continuous Thought (Coconut) particularly intriguing, as they move the model’s intermediate computation into a continuous latent space rather than forcing it through human-readable token sequences.
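As a rough illustration of that latent-reasoning idea, here is a deliberately simplified sketch using an off-the-shelf GPT-2 via Hugging Face transformers. It reflects only my reading of the basic mechanism (feeding the last hidden state back as the next input embedding instead of decoding it into a token); the actual method relies on dedicated training, which this snippet does not attempt, so the output here is not expected to be meaningful.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Simplified sketch of a "continuous thought" loop: instead of decoding
# intermediate reasoning into tokens, the last hidden state is fed back
# in as the next input embedding, so the "thought" never becomes text.
model_name = "gpt2"  # off-the-shelf model, used purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Question: what is 17 + 25? Answer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(input_ids)

with torch.no_grad():
    for _ in range(4):  # a few latent "thought" steps, never verbalized
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]    # (1, 1, hidden_dim)
        embeds = torch.cat([embeds, last_hidden], dim=1)  # append as a pseudo-token

    # Only now decode a token from the latent-augmented context.
    next_token = model(inputs_embeds=embeds).logits[:, -1, :].argmax(dim=-1)

print(tokenizer.decode(next_token[0]))
```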
Ultimately, the value of CoTs may lie less in what they reveal about “how models think”, and more in how they can be leveraged as tools to shape and constrain behavior — while keeping our expectations grounded in scientific rigor rather than anthropomorphic metaphors.
This post has also been published on Medium.
The views expressed are my own and do not represent those of any employer, collaborator, or institution. Content may contain errors or outdated interpretations.