At the time, I thought a Decoder was essentially just playing “word-chain” on its input, unable to plan what it was going to generate as a whole. But looking back now, if we don’t blindly join the hype of the AI sycophants and their pretty phrases like “because GPT is more spiritual,” and instead dissect the question dispassionately from the foundations of information theory and engineering reality, the triumph of Decoder-only architectures turns out to be, fundamentally, an inevitability dictated by mathematical and physical constraints.

1. The Brutal Aesthetics of Unified Architecture: Cross-Entropy and the Price of Being Too Clever

While Seq2Seq has clear logic, by splitting “understanding” and “generation” into two modules, it actually “self-castrates” the purity of information compression in its mathematical objective function.

Let’s return to the purest definition in probability and information theory — the Chain Rule for Entropy:

$$H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \dots, X_1)$$

This formula tells us that to quantify the total uncertainty of a piece of human language, an exact and lossless way to decompose it is to sum, position by position, “the conditional entropy of the next word, given all historical information.”
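To make the chain rule concrete, here is a minimal numeric check for the two-variable case, $H(X_1, X_2) = H(X_1) + H(X_2 | X_1)$. The joint distribution below is a made-up example, not data from any real corpus, and the variable names are only for illustration.

```python
import math

# Hypothetical joint distribution over two binary variables (X1, X2).
# The numbers are illustrative only.
joint = {
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.2,
    (1, 1): 0.3,
}

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Left-hand side: the joint entropy H(X1, X2).
h_joint = entropy(joint.values())

# Marginal distribution of X1.
p_x1 = {}
for (x1, _), p in joint.items():
    p_x1[x1] = p_x1.get(x1, 0.0) + p
h_x1 = entropy(p_x1.values())

# Conditional entropy H(X2 | X1) = sum over x1 of p(x1) * H(X2 | X1 = x1).
h_x2_given_x1 = 0.0
for x1, p1 in p_x1.items():
    conditional = [p / p1 for (a, _), p in joint.items() if a == x1]
    h_x2_given_x1 += p1 * entropy(conditional)

# Right-hand side: the chain-rule sum.
chain_sum = h_x1 + h_x2_given_x1

print(f"H(X1, X2)          = {h_joint:.6f} bits")
print(f"H(X1) + H(X2 | X1) = {chain_sum:.6f} bits")
```

The two printed values agree up to floating-point error. The same bookkeeping extends to longer sequences by conditioning each position on everything before it.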

The loss function for the Next-Token Prediction that LLMs run daily essentially minimizes the Cross-Entropy between the true distribution of human language $p$ and the model’s predicted distribution $q$:

$$H(p, q) = H(p) + D_{\text{KL}}(p \parallel q)$$

The physical meaning here is very elegant: $H(p)$ is “God’s entropy,” the irreducible randomness inherent in human language itself, while the KL divergence $D_{\text{KL}}(p \parallel q)$ is the extra information cost (noise) that the model, due to its lack of intelligence, has to pay on average at each step when performing “word-chaining.” Since training cannot change $H(p)$, minimizing cross-entropy amounts to forcing this cost toward 0. In contrast, Seq2Seq insists on splitting this unified optimization task: a bidirectional-Attention encoder trained with Span Corruption (fill-in-the-blank), and a decoder that retrieves the encoder’s representations through Cross-Attention. The conditional entropy chain rule then covers only the target spans, conditioned on the encoder output, rather than the whole sequence, which makes the optimization objective in the parameter space far less pure.
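The decomposition $H(p, q) = H(p) + D_{\text{KL}}(p \parallel q)$ is easy to verify numerically. The sketch below uses two made-up distributions over a three-word vocabulary; `p` and `q` are illustrative numbers, not measurements from any model.

```python
import math

# Hypothetical "true" next-token distribution p and model prediction q
# over a three-word vocabulary. The numbers are illustrative only.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: expected code length when coding p with q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits: the extra cost of using q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

print(f"H(p)              = {h_p:.6f} bits")
print(f"D_KL(p || q)      = {kl:.6f} bits")
print(f"H(p) + D_KL       = {h_p + kl:.6f} bits")
print(f"H(p, q)           = {h_pq:.6f} bits")
```

Setting `q = p` drives the KL term to zero and leaves only $H(p)$, which is the floor that no amount of training can go below.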

2. Question: Do We Really Want KL Divergence to Be Zero?

Okay, at this point, suppose someone raises their hand and asks: “According to you, we want $D_{\text{KL}}$ to be as low as possible, so isn’t the ultimate goal for it to be 0? If the model’s distribution exactly matched the real human data it was trained on, wouldn’t it just be a stochastic parrot that memorizes everything? Where exactly do these ‘emergent abilities’ you always talk about come from?” This very question is the key to unlocking the mystery of the Decoder-only triumph.

The truth is: precisely because of the limitations of physical reality, emergence is actually a byproduct of “lossy compression.”

The knowledge space, physical laws, and causal logic behind human language (the immense $H(p)$) are virtually infinite. However, the model’s parameters (whether 7B or hundreds of billions) are strictly bounded by hardware. With finite capacity, the model cannot realistically drive the KL divergence all the way to zero.

In such a desperate situation, with limited capacity yet forced to push cross-entropy as low as it can go, the model finds that rote memorization simply isn’t feasible. The only path to a high score is to abstract, within its internal neurons, the underlying operational rules, logical syntax, and causal World Model of this world.
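The capacity limit described above can be illustrated with a toy source, under heavy assumptions. Below, a hypothetical two-token Markov source stands in for “language,” a context-free model stands in for a capacity-limited network, and a per-context table stands in for a model large enough to store the source exactly. All names and numbers are invented for the illustration, and the toy shows only that limited capacity leaves a KL gap, not emergence itself.

```python
import math

# Hypothetical two-token source: the next token depends on the previous one.
# transition[prev][nxt] = p(nxt | prev). The numbers are illustrative only.
transition = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.2, "B": 0.8},
}

# Stationary distribution of this chain: pi(A) = p(A|B) / (p(B|A) + p(A|B)).
pi_a = transition["B"]["A"] / (transition["A"]["B"] + transition["B"]["A"])
stationary = {"A": pi_a, "B": 1.0 - pi_a}

def kl(p, q):
    """D_KL(p || q) in bits for two distributions stored as dicts."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)

def expected_kl(model):
    """Average per-token KL gap, weighted by how often each context occurs."""
    return sum(
        stationary[prev] * kl(transition[prev], model(prev))
        for prev in transition
    )

def context_free_model(prev):
    # Capacity-limited: a single distribution shared by every context.
    # The stationary marginal is the best choice within this model class.
    return stationary

def context_model(prev):
    # Enough capacity to store one distribution per context.
    return transition[prev]

gap_small = expected_kl(context_free_model)
gap_large = expected_kl(context_model)

print(f"KL gap, context-free model: {gap_small:.6f} bits/token")
print(f"KL gap, per-context model:  {gap_large:.6f} bits/token")
```

The context-free model is stuck with a strictly positive gap no matter how it is trained, while the larger table closes it. Real language has far too many contexts for any table, which is the situation the argument above relies on.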

3. Why Does “Fluent Word-Chaining” Equate to “Understanding the World”?

Many people believe that Decoder-only models are merely engaged in superficial probabilistic word-chaining; how could they possibly understand physics or programming?

Let’s conduct a thought experiment. Consider this sentence:

“When a 100 kg person and a 50 kg person go down a slide, the time theoretically required is $\underline{\hspace{1cm}}$.”

For the model to accurately complete the sentence with “the same” and thus lower its cross-entropy loss, relying solely on statistical word frequencies would certainly fail. Through extreme compression and refinement across trillions of tokens, the model is implicitly forced to simulate concepts like “gravity” and “friction.”

This is the extraordinary state that Decoder-only models achieve through pure Maximum Likelihood Estimation (MLE). It appears to merely pursue fluency in its output, but because the causal logic behind human text is so stringent, to achieve ultimate coherence, it is forced to become a simulator of this world. Perhaps “understanding” was never the goal, but rather a byproduct created for the sake of “accurate word-chaining.”

Conclusion: How Far Are We From AGI?

As people debate how AGI might be achieved, it is worth observing how GPT currently operates: first, it processes the prompt in a Prefix LM-like fashion, computing internally (thinking) about what to output, and then, when it starts generating, it enters an autoregressive state akin to “word-chaining.” If you think about it carefully, this actually aligns quite well with human behavior. After all, once we’ve decided in our minds what we want to say, the subsequent speaking process is often a deterministic chain. Nevertheless, humans continuously “think, speak, and revise” during speech, which is fundamentally different from current LLMs that generate content in a single pass.

However, it is undeniable that the Decoder-only architecture, with its extreme scalability, pure mathematical objective, and computational efficiency, has indeed brought us to new, previously unimaginable heights in AI.