As I delved deeper into LLM applications and Agent systems, I became utterly disgusted by the industry’s fawning, mystical style of AI marketing. Vendors now constantly boast that their new models “can think” and “have logic,” as if adding a “Pro” or “Thinking” suffix suddenly makes a neural network sprout a human brain. If we don’t approach these attractive claims with skepticism, we can easily be led astray by them. Today, let’s coldly dissect the true differences between Flash models and Thinking models, and what those so-called “thinking processes” actually entail, starting from their underlying operational mechanisms and physical limitations.
1. It’s Not Just “Thinking Longer”: Differences in Training Objectives and Probability Distributions
Many people mistakenly believe that Thinking models are simply Flash models that run for a longer time, or Flash models that have been told via a prompt to “think more carefully.” This is a complete misunderstanding.
Flash models are optimized for inference speed and low latency: they answer directly, leaning on knowledge compressed into their weights and on fast, intuitive pattern matching. In contrast, Thinking models, which are touted for their reasoning, incorporate extensive Reinforcement Learning (RL) during the training phase, and are “deliberately trained” to generate a long sequence of internal reasoning steps (Chain of Thought) before producing a final answer.
Regardless of the type, at their core, they are still performing Next-Token Prediction. Let’s look at the core probability formula for autoregressive generation: $P(Y \mid X) = \prod_{t=1}^{T} P(y_t \mid X, y_{<t})$, where X is the input sequence, Y = (y_1, ..., y_T) is the output sequence, and y_{<t} denotes the output tokens produced before step t.
Given input X, at each step the model computes a conditional probability distribution over the next token y_t, and the decoder then samples from that distribution (or simply picks the most likely token). Thinking models do not deviate from this framework; they are simply trained so that Y begins with a long run of intermediate reasoning tokens before the tokens of the final answer.
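The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor’s decoder: `toy_logits` is a hypothetical stand-in for a real Transformer forward pass, and the vocabulary and scores are invented purely so the example runs.

```python
import math
import random

VOCAB = ["<eos>", "step", "so", "answer"]  # hypothetical toy vocabulary

def toy_logits(context):
    """Hypothetical stand-in for a model forward pass: one score per vocab token."""
    n = len(context)
    # Invented scores: <eos> becomes more likely as the context grows.
    return [0.5 * n - 2.0, 1.0, 0.5, 0.2 * n]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_new_tokens=20, seed=0):
    """Sample y_1, y_2, ... one token at a time; return the tokens and log P(Y | X)."""
    rng = random.Random(seed)
    context = list(prompt)
    generated = []
    log_prob = 0.0
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(context))       # P(y_t | X, y_<t)
        idx = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        log_prob += math.log(probs[idx])           # log of the product = sum of logs
        if VOCAB[idx] == "<eos>":
            break                                  # stopping is just another sampled token
        generated.append(VOCAB[idx])
        context.append(VOCAB[idx])                 # the new token becomes part of the context
    return generated, log_prob

tokens, log_p = generate(["question"])
print(tokens, log_p)
```

Nothing in the loop distinguishes a “reasoning” token from an “answer” token; in this sketch, as in the formula, both are simply samples from the same conditional distribution.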
2. The Brutal Aesthetics of KV Cache: Dispelling the Intuitive Myth of “Re-calculation”
Since it’s about chaining together thinking processes, this brings us to a second serious misconception. Many people intuitively believe that Thinking models operate as follows:
Original problem -> Generates thinking step 1.
Then, the entire package of “original problem + thinking step 1” is re-fed to the Transformer -> Generates thinking step 2… and so on, stacking up like Russian dolls.
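If it were literally implemented this way, with the entire prefix recomputed for every new token, the number of token positions pushed through the network would grow quadratically with the length of the chain of thought, and the GPU bill would be absurd. Each new token is still conditioned on everything before it; the model simply avoids redoing that work. The real underlying magic lies in KV Cache (Key-Value Cache).
During the actual inference stage, the model essentially “writes forward” unidirectionally. The original Prompt is converted into Key and Value vectors during the initial Prefill stage and stored in the Cache. For every new token generated thereafter, the model does not recompute the preceding text. Instead, it computes the Query, Key, and Value vectors for that single new token, appends the new Key and Value to the Cache, and uses the Attention mechanism to take the dot product of the Query with every Key in the Cache (prompt and previously generated tokens alike), weighting the cached Values accordingly.
This is the truth behind what we often call Test-Time Compute: it spends extra generated tokens, and therefore a constantly expanding Cache (GPU memory capacity, plus the memory bandwidth needed to read it at every step), to let a fixed-depth network perform more total computation. Every time it writes a token, it “attends” to the past context, which lets the current step condition on what came before. That makes consistency possible; it does not guarantee that the chain of thought never contradicts itself.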
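To put a rough number on the “Russian dolls” intuition, here is a small counting sketch. It only counts how many token positions are pushed through the network layers in total; `positions_processed` is a hypothetical helper written for this post, and it assumes at least one new token is generated.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions pushed through the network layers across a whole generation."""
    if use_cache:
        # Prefill pushes the prompt through once and yields the first new token;
        # every later step pushes only the single most recent token through.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step t re-feeds the prompt plus the t tokens generated so far.
    return sum(prompt_len + t for t in range(new_tokens))

naive = positions_processed(prompt_len=100, new_tokens=1000, use_cache=False)
cached = positions_processed(prompt_len=100, new_tokens=1000, use_cache=True)
print(naive, cached)  # 599500 1099
```

The cached figure is not free, though: each of those single-token steps still attends over every entry already in the cache, so the per-step attention cost keeps growing with the length of the chain of thought.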
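A toy single-head attention layer makes the cache mechanics concrete. This is a sketch in pure Python under simplifying assumptions (one layer, one head, random weights, no positional encoding); `CachedAttention` and `full_recompute` are hypothetical names, not any library’s API. It checks that the cached path produces the same output for the newest token as recomputing Keys and Values for the whole prefix.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w / z * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

class CachedAttention:
    """Keeps the Key and Value of every past token so they are computed only once."""
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x):
        # Only the NEW token is projected; its K and V are appended to the cache.
        self.keys.append(matvec(self.Wk, x))
        self.values.append(matvec(self.Wv, x))
        return attend(matvec(self.Wq, x), self.keys, self.values)

def full_recompute(Wq, Wk, Wv, xs):
    """Naive path: re-project every token in the prefix, then attend from the last one."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

rng = random.Random(0)
d = 4
Wq, Wk, Wv = ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]  # six toy token embeddings

layer = CachedAttention(Wq, Wk, Wv)
for t, x in enumerate(xs):
    cached_out = layer.step(x)
    naive_out = full_recompute(Wq, Wk, Wv, xs[: t + 1])
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached_out, naive_out))
print("cache entries:", len(layer.keys))
```

In a multi-layer causal Transformer the same equivalence holds layer by layer, because the hidden state of an earlier position never depends on later tokens, so its cached Key and Value never go stale.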
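The memory side of that trade can be estimated with simple arithmetic. The sketch below assumes a standard layout in which each layer stores one Key and one Value vector per KV head per token; the configuration numbers (32 layers, 8 KV heads, head dimension 128, 2-byte values) are hypothetical and do not describe any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV Cache size: one K and one V tensor per layer, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical configuration, chosen only for illustration.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1)
long_chain = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

print(per_token // 1024, "KiB per token")           # 128 KiB per token
print(long_chain / 2**30, "GiB for 32768 tokens")   # 4.0 GiB for 32768 tokens
```

Under these assumptions, every extra “thought” token costs a fixed slice of GPU memory for the rest of the generation, and the whole cache has to be read again at each decoding step.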
3. The Question: Is it Truly “Thinking”?
At this point, we must pose the most incisive question: Can a chain-of-thought generated by frantically writing forward using KV Cache truly be called rigorous causal logic?
The answer is: not necessarily. Because at its core, it’s still an autoregressive probabilistic model, and its “thinking trajectory” is tuned by a Reward function. It is extremely prone to “hallucinatory self-persuasion.” Sometimes, it writes a long sequence of seemingly impeccable deductions simply because text of that shape is what the reward signal (human annotators, a reward model, or an automated checker) tended to score highly during training, not because it genuinely grasped some mathematical truth.
The reason it stops “thinking” (emits [EOS] or an equivalent end marker) is not because it feels “the problem is solved.” It’s purely because, under the distribution it learned during training, an end-of-sequence token became probable enough at that point in the derivation for the decoder to select it.
Conclusion: Only by Recognizing the Limits of the Tool Can Its True Value Be Unleashed
To put it plainly, a Thinking model is not a sage contemplating repeatedly in its mind. It’s more like an amnesiac heavily reliant on a notebook – because the computational depth of a neural network is fixed, it cannot solve complex problems in one leap, so it is forced to fill an entire scratchpad, reviewing its ever-growing KV Cache with every step it takes. This is an extremely clever and powerful engineering solution, which indeed significantly raises the ceiling for AI in handling complex tasks. But as technologists, we must see through this mystical packaging, understand its resource cost and logical fragility under Test-Time Compute, so that when the model spouts nonsense, we won’t be left wondering if our prompt wasn’t good enough.