Why Modern LLMs Are Decoder-only: Architectural Evolution and Considerations from Seq2Seq to GPT

“If we’re on the path to the Turing machine, Seq2Seq itself makes more sense than a Decoder.” This is a sentence I wrote in my notes in 2022.

Back then, I was experimenting with an early version of GPT-2, and I kept wondering: What exactly is this thing? In comparison, models like BART or T5, based on the Seq2Seq concept, seemed much more reasonable. Unexpectedly, a few years later, in this AI arms race, it’s the Decoder-only architecture that has claimed the MVP title.

At the time, I felt that a Decoder, at its core, was merely playing a “word chain game” on top of the input. It couldn’t plan in advance what it was going to generate as a whole; at each step it could only pick, from its vocabulary, the token that looked most probable given everything written so far.
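To make the “word chain game” concrete, here is a minimal sketch of greedy autoregressive decoding. The vocabulary, the `LOGITS` table, and the function names are all hypothetical; the table is a toy stand-in for a real model and, as a simplification, only looks at the last token rather than the whole prefix.

```python
import numpy as np

VOCAB = ["<eos>", "the", "apple", "falls", "down"]

# Hypothetical "model": row i holds the scores for whatever follows token i.
LOGITS = np.array([
    [5.0, 0.0, 0.0, 0.0, 0.0],  # after "<eos>"
    [0.0, 0.0, 3.0, 0.5, 0.1],  # after "the"
    [0.0, 0.1, 0.0, 2.5, 0.3],  # after "apple"
    [0.2, 0.1, 0.0, 0.0, 2.0],  # after "falls"
    [4.0, 0.0, 0.0, 0.0, 0.0],  # after "down"
])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy_decode(prompt_ids, max_new_tokens=10):
    """Repeatedly append the single most probable next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(LOGITS[ids[-1]])  # distribution over the vocabulary
        next_id = int(np.argmax(probs))   # greedy choice, no look-ahead
        if VOCAB[next_id] == "<eos>":
            break
        ids.append(next_id)
    return [VOCAB[i] for i in ids]

print(greedy_decode([1]))  # prompt is just "the"
```

Nothing in the loop plans the sentence as a whole; each step only commits to one more token, which is exactly what bothered me back then.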

In contrast, Seq2Seq was the “good student” that fully utilized the Transformer architecture:

The Encoder was responsible for understanding the entire sentence structure and generating embeddings rich in semantic detail.
The Decoder applied the global context provided by the Encoder to output sequences.

This division of labor sounds incredibly logical, right? So why has development shifted almost entirely towards Decoder-only? Below are my later reflections, combined with several key reasons drawn from recent discussions about AI.

1. The Brutal Beauty of a Unified Architecture: Scaling Law and Objective Functions

While Seq2Seq had clear logic, it separated “understanding” and “generation” into two distinct modules. The Encoder was responsible for understanding and producing embeddings for specific concepts. This actually introduced a hidden cost: inconsistent training objectives.

Seq2Seq models were often pre-trained with denoising objectives such as Span Corruption (masking out segments of text for the model to fill in, as in T5, and related in spirit to BERT’s masked-token objective), whereas GPT consistently used Next Token Prediction.
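The difference between the two objectives is easiest to see in the training pairs they produce from the same sentence. This is a simplified sketch: the function names and the `<X>` marker are hypothetical, and real Span Corruption masks several spans per example with distinct sentinel tokens.

```python
def next_token_pairs(tokens):
    """Next Token Prediction: every prefix is an input, the following token is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def span_corruption_pair(tokens, start, length, sentinel="<X>"):
    """Span Corruption (single span): hide tokens[start:start+length] behind a sentinel.

    The Encoder reads the corrupted input; the Decoder must produce the hidden span.
    """
    corrupted = tokens[:start] + [sentinel] + tokens[start + length:]
    target = [sentinel] + tokens[start:start + length]
    return corrupted, target

tokens = ["the", "apple", "falls", "down"]
for prefix, target in next_token_pairs(tokens):
    print(prefix, "->", target)
print(span_corruption_pair(tokens, start=1, length=2))
```

Note that Next Token Prediction yields a training signal at every position of the sentence, while this single-span corruption only supervises the hidden span.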

Later, empirical results suggested that when model parameters scale to tens or hundreds of billions, Next Token Prediction, whose loss keeps falling predictably with scale as the Scaling Law describes, is among the objectives most effective at extracting value from data. The reason is that minimizing this loss is demanding: to predict the next token accurately, a Decoder-only model is pushed to build an internal “simulation” of the logic, common sense, and even physical laws of the world.

If it doesn’t understand “gravity,” it won’t be able to accurately follow “After the apple detaches from the branch…” with “it falls down.”
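For reference, the Next Token Prediction objective discussed above is usually written as the following loss over a sequence of $T$ tokens. The symbols $x_t$ (the $t$-th token), $\theta$ (the model parameters), and $p_\theta$ (the model’s predicted distribution) are notation introduced here, not taken from my original notes.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
$$

Every term rewards putting probability on the token that actually comes next, so in the apple example a model that assigns low probability to “it falls down” pays a larger loss.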

Perhaps “understanding” was never the goal, but rather a byproduct generated for the sake of “accurate continuation.” In other words, while we didn’t specifically instruct the AI to “thoroughly understand the entire sentence content and output an equal-length sequence or classification label” as we would when training BERT or RoBERTa, the Decoder inherently accomplishes this during its text prediction process. Conversely, no matter how good an Encoder is at understanding, it cannot independently learn how to “write” content.

2. “Prefix LM” Blurs the Boundaries of the Encoder

Earlier, I mentioned that “the Encoder can understand the entire sentence structure and see the global context,” but a Decoder-only model can recover most of this advantage, and the Prefix LM form makes that explicit.

In a Prefix LM, when you input a 2000-token Prompt, those 2000 tokens undergo Full Attention (during the preparation phase for answering, tokens within the Prompt can all “see” each other’s context). The model only switches to a strict Causal Mask (meaning each position can only see preceding tokens) from the moment it starts generating the answer. One caveat: most GPT-style LLMs keep the Causal Mask over the Prompt as well. Even so, the whole Prompt is processed in a single parallel pass, and every generated token attends to all of it, so the answer is still conditioned on the global context.

It accomplishes both “understanding the global Prompt” and “autoregressive generation of the answer” within the same parameter space, which is more flexible and efficient than forcibly splitting parameters into separate Encoder and Decoder groups.
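The two masking schemes discussed in this section can be sketched with NumPy. The function names and the convention that `mask[i, j]` is `True` when position `i` may attend to position `j` are assumptions made for this illustration.

```python
import numpy as np

def causal_mask(n):
    """Each position sees itself and everything before it (lower triangle)."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n, prefix_len):
    """Causal mask, except the first prefix_len positions (the Prompt) see each other fully."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask

# 2 Prompt tokens followed by 2 generated tokens.
print(causal_mask(4).astype(int))
print(prefix_lm_mask(4, prefix_len=2).astype(int))
```

The only difference is the top-left block: in the Prefix LM mask the Prompt rows are fully filled in, while the rows for generated tokens remain causal in both cases. Either way it is the same set of parameters doing both jobs.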

3. Training Efficiency and the Economics of GPU Concurrency

This point is very practical from an engineering and computational power perspective. We can observe that in the retrieval domain, Bi-encoders significantly outperform Cross-encoders in efficiency. Similarly, applying this to Seq2Seq:

During training, Seq2Seq’s Encoder can indeed see the full text, but the extra computation and data transfer of Cross-Attention between the Encoder and Decoder, plus the need to partition two different stacks, can become a performance bottleneck in Distributed Training.

Decoder-only training, in contrast, uses only a Causal Mask, so its attention mask is a beautiful “lower triangular matrix” (when predicting the $t$-th token, the model only sees tokens from $1$ to $t-1$). This means a single forward pass produces a training signal at every position of the sequence in parallel. On existing GPU architectures, such a uniform mechanism can achieve very high throughput and keep the hardware busy. In an era where training often requires tens of thousands of GPUs, this computational advantage can decide whether an architecture survives.
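Here is a minimal sketch of causally masked self-attention in NumPy, to show the lower triangular structure at work. It is a simplification under stated assumptions: a single head, no learned projections (queries, keys, and values are all the input `x`), and a hypothetical function name. The output at row `i` only depends on rows `0..i`, and in a real model that output is what predicts the token at position `i + 1`.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head attention over x of shape (T, d) with a lower triangular mask."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                   # (T, T) similarity scores
    mask = np.tril(np.ones((T, T), dtype=bool))     # position i may see j <= i
    scores = np.where(mask, scores, -np.inf)        # block attention to the future
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, weights = causal_self_attention(x)

# Changing the last token leaves every earlier position's output untouched.
x_changed = x.copy()
x_changed[4] += 1.0
out_changed, _ = causal_self_attention(x_changed)
print(np.allclose(out[:4], out_changed[:4]))
```

All five rows are computed in one matrix multiplication, which is why a single pass over a sequence can train every position at once.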

4. The Great Zero-shot Emergent Abilities

The Decoder-only architecture has demonstrated extremely strong In-context Learning capabilities.

Early Seq2Seq models (e.g., T5), when faced with unseen task instructions, typically required Fine-tuning to perform well. However, a “continuation machine” like GPT, having seen countless “question: answer” dialogue patterns in its vast pre-training data, can directly treat your Prompt as the first half of what it needs to continue, thereby achieving astonishing generalization capabilities (Zero-shot / Few-shot).
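As a small illustration of “treating the Prompt as the first half of what it needs to continue,” the sketch below builds a Few-shot Prompt as plain text. The `Q:` / `A:` layout and the function name are hypothetical choices; any consistent pattern the model has seen during pre-training would serve the same purpose.

```python
def build_few_shot_prompt(examples, query):
    """Lay out solved examples, then leave the last answer open for the model to continue."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

examples = [("2 + 2", "4"), ("5 + 1", "6")]
print(build_few_shot_prompt(examples, "3 + 3"))
```

With an empty `examples` list the same function yields a Zero-shot Prompt. In both cases no parameters are updated; the task is specified entirely by the text the model is asked to continue.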

This mode of operation also feels natural from the perspective of the human brain. While the data volume required for LLM training is astonishingly large compared to what a human needs, the logic of “naturally deriving the next step based on context” is close to human intuition.

Conclusion: How Far Are We From AGI?

While everyone debates how AGI will be achieved, consider GPT’s current mode of operation: it first processes the whole Prompt in one pass as internal computation (thinking) to decide what to output, and then enters an autoregressive state akin to a “word chain game” during output. When you think about it carefully, this is quite aligned with human behavior.

After all, once we’ve formulated what we want to say in our minds, the subsequent speaking process is often a Deterministic Chain. However, humans continuously “think, speak, then refine” during the speaking process… This fundamental difference remains between humans and current LLMs that simply generate output in a single pass.

Nevertheless, it’s undeniable that the Decoder-only architecture, with its extreme scalability and efficiency, has indeed brought us to new, previously unimaginable heights in AI.
