The Crystallization of Transformer Architectures (2017-2025)

Between 2017 and 2025, transformer architectures for LLMs underwent rapid exploration followed by striking convergence. This article traces decisions across 53 models and identifies a de facto 2023–2025 stack: pre-norm (RMSNorm), RoPE, SwiGLU MLPs, KV-sharing (MQA/GQA), and bias-free layers. We discuss both model-intrinsic factors (optimization stability, quality-per-FLOP) and practical constraints (kernel availability, KV-cache economics). Diversity persists mainly in MoE routing and long-context attention. The accompanying dataset records publication dates and architectural specs.

Convergence as Signal

In June 2017, Vaswani et al. (2017) introduced the transformer with a specific set of architectural choices: post-layer normalization, sinusoidal position encodings, ReLU activations, and 4x MLP expansion. Each choice was reasonable but not obviously optimal. The subsequent eight years saw extensive experimentation with alternatives.

By 2024, many influential open-weight decoder-only model families converged on a similar bundle: pre-norm (often RMSNorm), RoPE-family position encodings, GLU-family MLPs (commonly SwiGLU with parameter-matched width), and KV-sharing attention variants (MQA/GQA). Several also drop most bias terms (sometimes keeping QKV-only biases). This is not literally universal, as there are notable hybrids and counter-trends (e.g., ALiBi/relative-bias lineages, RoPE+NoPE mixtures, and nonstandard norm stacks), but the center of mass is clear. The original transformer’s choices were replaced wholesale.

[Figure: Adoption of transformer architectural choices over time (cumulative)]

When many independent groups converge on similar design choices, it is evidence of a strong shared basin of solutions. But convergence can also reflect common constraints (hardware/software stacks, kernel availability, inference economics) and path dependence (influential released checkpoints and reference implementations). The goal here is to separate what appears robust from what may be contingent.

This article examines the architectural evolution through three lenses:

  1. Historical progression: How did we get from the 2017 transformer to the 2025 consensus? What problems did each innovation solve?

  2. Technical foundations: What mathematical properties make RoPE more attractive than learned absolute positions? Why does SwiGLU outperform GeLU at matched parameter count? Why does QK-normalization stabilize training?

  3. Remaining frontiers: Where has convergence not occurred? What does ongoing architectural diversity in MoE configurations, attention patterns, and stability mechanisms tell us about unsolved problems?

Scope note: “convergence” here is primarily about dense, decoder-only LLM blocks (norm/pos-enc/MLP/attention) rather than training recipe, data, post-training, or system-level inference tricks. The dataset is “widely discussed models”, which tilts toward models with public technical reports and/or open weights.

The analysis draws on a dataset of 53 transformer LLMs spanning 2017-2025, with architectural specifications cross-referenced against primary sources.

Four Eras of Transformer Architecture

The evolution of transformer LLMs divides naturally into four eras, each characterized by distinct architectural priorities and innovations.

Era I: Foundations (2017-2019)

The original transformer established the fundamental structure that persists today: alternating multi-head self-attention and position-wise feed-forward layers, connected by residual streams. The specific implementation choices, however, were largely inherited from prior work or chosen for simplicity.

Normalization placement followed the convention from residual networks: apply normalization after the residual addition (post-norm). For a sublayer function f (attention or FFN), the computation was:

$$x_{l+1} = \mathrm{LayerNorm}\big(x_l + f(x_l)\big)$$

Position encoding used fixed sinusoidal functions, encoding absolute position p in dimension i as:

$$\mathrm{PE}(p, 2i) = \sin\big(p / 10000^{2i/d}\big), \qquad \mathrm{PE}(p, 2i+1) = \cos\big(p / 10000^{2i/d}\big)$$

This choice was elegant, requiring no learned parameters and theoretically enabling length generalization through the linear properties of sinusoids, but subsequent work showed learned absolute positions performed better in practice.

Feed-forward networks used the standard MLP structure with ReLU activation and 4x expansion:

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$.

GPT-1 (2018) moved to decoder-only architecture with learned absolute positions and GeLU activation. GPT-2 (2019) introduced the critical shift to pre-normalization:

$$x_{l+1} = x_l + f\big(\mathrm{LayerNorm}(x_l)\big)$$

This change is widely associated with improved optimization stability at depth. One intuition is gradient flow: in post-norm, gradients repeatedly pass through normalization in the main residual pathway; in pre-norm, the residual stream provides a cleaner identity path while normalization shapes only the sublayer contribution.
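The two placements differ by a single line; a minimal NumPy sketch (with `f` standing in for any sublayer, and normalization without learned parameters for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension (no learned scale/shift here).
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def post_norm_block(x, f):
    # Original transformer: normalize AFTER the residual addition, so
    # gradients to earlier layers always pass through the normalization.
    return layer_norm(x + f(x))

def pre_norm_block(x, f):
    # GPT-2 style: the residual stream is an identity path; normalization
    # shapes only the sublayer's input.
    return x + f(layer_norm(x))

x = np.random.randn(4, 16)
f = lambda h: 0.1 * h  # stand-in sublayer
assert post_norm_block(x, f).shape == pre_norm_block(x, f).shape == (4, 16)
```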

Era II: Scale-Up (2020-2022)

The GPT-3 moment demonstrated that scaling (simply training larger models on more data) produced qualitative capability improvements. This era focused on enabling efficient scaling through architectural refinements.

RMSNorm (Root Mean Square Layer Normalization), introduced by Zhang & Sennrich (2019), gained traction in this period when it was adopted by Gopher and Chinchilla. Standard LayerNorm computes:

$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sigma} \odot \gamma + \beta$$

where μ and σ are the mean and standard deviation across features. RMSNorm simplifies this by removing the mean-centering:

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot \gamma, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}$$

The computational savings are modest (often reported around 10-15%, implementation-dependent), but empirically RMSNorm matches LayerNorm in many transformer settings while simplifying the normalization operation. Mean-centering is not “wrong” per se, but it often appears unnecessary for good training dynamics in modern pre-norm transformers, and removing it can slightly improve efficiency.
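A minimal NumPy sketch of both normalizations makes the simplification concrete (illustrative shapes and epsilon; not any particular framework's implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Center, rescale to unit variance, then apply learned scale and shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    # No mean subtraction and no beta: just rescale to unit RMS.
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return x / rms * gamma

d = 8
x = np.random.randn(3, d)
out = rms_norm(x, np.ones(d))
# Each row now has approximately unit root-mean-square norm.
assert np.allclose(np.sqrt((out ** 2).mean(-1)), 1.0, atol=1e-3)
```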

Parallel attention and FFN was introduced by GPT-J, GPT-NeoX, and later PaLM. Instead of sequential computation:

$$x_l' = x_l + \mathrm{Attn}\big(\mathrm{Norm}(x_l)\big), \qquad x_{l+1} = x_l' + \mathrm{FFN}\big(\mathrm{Norm}(x_l')\big)$$

the parallel formulation computes both sublayers from the same input and sums:

$$x_{l+1} = x_l + \mathrm{Attn}\big(\mathrm{Norm}(x_l)\big) + \mathrm{FFN}\big(\mathrm{Norm}(x_l)\big)$$

This can improve hardware utilization by increasing parallelizable work; reported speedups vary by implementation, model shape, and kernel support, but are often on the order of ~10-20% with minimal quality impact.

Rotary Position Embeddings (RoPE) were introduced by Su et al. (2024) and quickly adopted by GPT-J, GPT-NeoX, and PaLM. We defer detailed analysis to Section 3.1, but the key innovation is encoding relative position information through rotation matrices applied to query and key vectors, rather than adding absolute position embeddings to the input.

SwiGLU activation was introduced by Shazeer (2020) and later adopted at scale by PaLM. The technique builds on the Gated Linear Unit family. The standard FFN:

$$\mathrm{FFN}(x) = \mathrm{GeLU}(xW_1)\,W_2$$

becomes:

$$\mathrm{SwiGLU}(x) = \big(\mathrm{SiLU}(xW_1) \odot xW_3\big)\,W_2$$

where $\mathrm{SiLU}$ (Swish) is $x\,\sigma(x)$ and $\odot$ denotes element-wise multiplication. The gating mechanism ($xW_3$) modulates the activated representation, improving expressivity. However, the third weight matrix increases parameters, so the hidden dimension is reduced from $4d$ to $\tfrac{8d}{3}$ to maintain parameter count.
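A minimal NumPy sketch of the parameter-matched SwiGLU FFN (illustrative dimensions; `swiglu_ffn` is our naming, not a library API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU/Swish: x * sigmoid(x)
    return x * sigmoid(x)

def swiglu_ffn(x, W1, W3, W2):
    # The gate (x @ W3) modulates the SiLU-activated path (x @ W1).
    return (silu(x @ W1) * (x @ W3)) @ W2

d = 12
h = int(8 * d / 3)  # parameter-matched hidden width: 8d/3
rng = np.random.default_rng(0)
W1, W3 = rng.normal(size=(d, h)), rng.normal(size=(d, h))
W2 = rng.normal(size=(h, d))
x = rng.normal(size=(5, d))
y = swiglu_ffn(x, W1, W3, W2)
assert y.shape == (5, d)
# Three d*h matrices match the two d*4d matrices of a standard FFN:
assert 3 * d * h == 2 * d * 4 * d
```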

Era III: Efficiency and Open Source (2023-2024)

LLaMA (February 2023) crystallized the modern architecture. While each component existed before, LLaMA’s combination (and Meta’s decision to release weights) established a reproducible baseline that virtually all subsequent open models adopted.

The LLaMA recipe is as follows:

  • Pre-normalization with RMSNorm
  • Rotary position embeddings (RoPE)
  • SwiGLU activation with ~8/3 expansion
  • No bias terms anywhere
  • Grouped-query attention (in LLaMA 2 onwards)

This recipe succeeded because it simultaneously optimized multiple objectives: training stability, inference efficiency, implementation simplicity, and model quality. The absence of bias terms, for instance, slightly improves training dynamics and simplifies the implementation without measurable quality loss.

Grouped-Query Attention (GQA) addressed the inference bottleneck. In standard multi-head attention (MHA), each head maintains separate key and value projections. For a model with h heads, this means h separate KV pairs must be cached during autoregressive generation. GQA groups multiple query heads to share single key-value heads:

$$\mathrm{GQA}: \quad Q \in \mathbb{R}^{h_q \times d_k}, \qquad K, V \in \mathbb{R}^{h_{kv} \times d_k}$$

where $h_q > h_{kv}$ (typically $h_q / h_{kv} = 4$ or $8$). This reduces KV-cache memory by the grouping factor with minimal quality degradation, enabling longer contexts and larger batch sizes at inference.
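The head-sharing pattern can be sketched in NumPy by broadcasting each KV head to its group of query heads (illustrative shapes; real kernels avoid materializing the repeated tensors):

```python
import numpy as np

def group_kv(q, k, v, n_kv_heads):
    # q: (h_q, seq, d_k); k, v: (h_kv, seq, d_k) with h_q a multiple of h_kv.
    # Each group of h_q / h_kv query heads shares one KV head.
    h_q = q.shape[0]
    group = h_q // n_kv_heads
    k_full = np.repeat(k, group, axis=0)  # broadcast KV heads to all query heads
    v_full = np.repeat(v, group, axis=0)
    return k_full, v_full

h_q, h_kv, seq, d_k = 8, 2, 6, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(h_q, seq, d_k))
k = rng.normal(size=(h_kv, seq, d_k))
v = rng.normal(size=(h_kv, seq, d_k))
k_full, v_full = group_kv(q, k, v, h_kv)
assert k_full.shape == (h_q, seq, d_k)
# Only h_kv KV heads are cached: a 4x cache reduction versus MHA here.
assert h_q // h_kv == 4
```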

Vocabulary expansion accelerated during this era. LLaMA used 32K tokens and LLaMA 2 kept that size; LLaMA 3 expanded to 128K, and Gemma uses 256K. Larger vocabularies improve tokenization efficiency (fewer tokens per word, especially for non-English languages and code) at the cost of larger embedding matrices. The trend reflects both improved tokenizer algorithms (BPE variants, BBPE) and recognition that embedding parameters are relatively cheap compared to transformer layers.

Stability mechanisms emerged as models scaled:

  • Logit soft-capping (Gemma 2): Bounds attention logits before softmax to prevent numerical overflow: $\mathrm{logits} \rightarrow c \cdot \tanh(\mathrm{logits}/c)$ for cap value $c$.

  • QK-normalization (Gemma 3, OLMo 2, Qwen 3): Applies normalization to query and key vectors before computing attention scores. We analyze the mathematical motivation in Section 3.4.

  • Embedding LayerNorm (BLOOM): Normalizes embeddings before the first transformer layer, addressing initialization-related instabilities.

Era IV: MoE Dominance (2024-2025)

Dense scaling, or simply increasing model parameters, encounters diminishing returns. Training compute scales linearly with parameters, but quality improvements become sublinear. Mixture-of-Experts (MoE) provides a different scaling axis: increase total parameters while keeping active (per-token) parameters constant.

Mixtral 8×7B (January 2024) demonstrated that open MoE models could match dense models of much larger active parameter count. The architecture replaces each FFN with a routed mixture:

$$\mathrm{MoE}(x) = \sum_{i=1}^{k} g_i(x)\, E_i(x)$$

where $E_i$ are expert networks (typically standard FFNs), $g_i(x)$ are routing weights from a learned router, and $k$ is the number of active experts per token (2 for Mixtral, up to 8 for later models).
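A minimal single-token sketch of top-k routing (illustrative: a linear router, linear stand-in experts, and softmax renormalized over the selected set):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, experts, router_W, k=2):
    # Route one token to its top-k experts and mix their outputs
    # with renormalized router probabilities.
    scores = x @ router_W              # one logit per expert
    top = np.argsort(scores)[-k:]      # indices of the k best experts
    gates = softmax(scores[top])       # renormalize over the selected set
    return sum(g * experts[i](x) for g, i in zip(gates, top))

d, n_experts = 8, 4
rng = np.random.default_rng(0)
# Linear stand-ins for the expert FFNs (default arg binds a fresh W per expert).
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
router_W = rng.normal(size=(d, n_experts))
x = rng.normal(size=d)
y = moe_forward(x, experts, router_W, k=2)
assert y.shape == (d,)
```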

The expert scaling trajectory over 2024-2025 is dramatic:

Model | Date | Total Params | Active Params | Experts | Active
--- | --- | --- | --- | --- | ---
Mixtral 8×7B | Jan 2024 | 46.7B | 12.9B | 8 | 2
DeepSeek V3 | Dec 2024 | 671B | 37B | 256+1 | 8
Llama 4 Maverick | Apr 2025 | 400B | 17B | 128+1 | varies
Kimi K2 | Jul 2025 | 1.04T | 32B | 384 | 8


Auxiliary-loss-free load balancing (DeepSeek V3) solved a persistent MoE training problem. Traditional approaches add an auxiliary loss to encourage balanced expert utilization:

$$\mathcal{L}_{\mathrm{aux}} = \alpha \sum_{i=1}^{n} f_i P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the average routing probability for expert $i$. This loss encourages balance but distorts the primary training objective.

DeepSeek’s innovation introduces a bias term $b_i$ used for selection (to maintain load balance) but excluded from the mixture weights used to form the output. Concretely, experts are selected by $s_i = r_i(x) + b_i$, but the output weights are computed from the unbiased router scores $r_i(x)$ over the selected set (formalized below in Section 3.3).

Shared experts (DeepSeek, Trinity, Llama 4) designate one or more experts as always-active, providing a stable baseline that all tokens access. This improves training stability and ensures common knowledge isn’t fragmented across specialized experts.

Multi-head Latent Attention (MLA) (DeepSeek V3, Kimi K2) addresses the MoE memory challenge. With hundreds of experts, KV-cache becomes prohibitive. MLA compresses the KV representation through learned down-projections:

$$K = W^{UK}\big(W^{DKV} x\big), \qquad V = W^{UV}\big(W^{DKV} x\big)$$

where $W^{DKV}$ projects to a low-dimensional latent space, and $W^{UK}, W^{UV}$ reconstruct keys and values. This dramatically reduces cache memory while preserving attention expressivity.
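A minimal NumPy sketch of the down-project/up-project structure (illustrative dimensions and weight names; the actual DeepSeek implementation has additional details such as decoupled RoPE handling):

```python
import numpy as np

d, d_latent, seq = 64, 8, 16
rng = np.random.default_rng(0)
W_dkv = rng.normal(size=(d, d_latent))  # shared down-projection
W_uk = rng.normal(size=(d_latent, d))   # up-projection for keys
W_uv = rng.normal(size=(d_latent, d))   # up-projection for values

x = rng.normal(size=(seq, d))
c = x @ W_dkv      # only this (seq, d_latent) latent is cached
K = c @ W_uk       # keys reconstructed on the fly
V = c @ W_uv
assert K.shape == V.shape == (seq, d)
# Cache cost: seq * d_latent floats instead of 2 * seq * d for K and V.
assert (2 * seq * d) // (seq * d_latent) == 16  # 16x compression here
```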

2.5 Hardware Co-Evolution

Architectural convergence cannot be understood in isolation from hardware constraints. Several “winning” choices align closely with GPU/TPU optimization opportunities:

  • FlashAttention (Dao et al., 2022) made $O(n^2)$ attention practical for longer sequences by restructuring memory access patterns. This reduced the urgency of linear attention research and made long-context training cheaper, increasing the value of position schemes (including RoPE) that behave well under extrapolation. RoPE is also implementation-friendly: its per-position rotations compose cleanly with common fused-attention kernels.

  • Tensor core tile sizes (16×16 on A100, 8×8 on H100 for FP8) favor hidden dimensions and head counts that are multiples of these values. The near-universal choice of $d_{\mathrm{head}} = 128$ reflects this constraint as much as any quality consideration.

  • Memory bandwidth bottlenecks during autoregressive inference push toward KV-cache reduction (MQA/GQA) independently of training quality. A technique that is neutral during training but reduces inference memory by 4-8× will be adopted even if it slightly hurts perplexity.

  • Fused kernel availability creates path dependence: once FlashAttention, fused RMSNorm, and fused SwiGLU kernels exist and are well-optimized, switching to alternatives incurs engineering cost beyond any quality trade-off. The LLaMA recipe’s dominance is partly a network effect - it’s what the kernels support best.

This hardware-architecture coupling means that “convergence” partly reflects what is fast on current accelerators, not only what is best in an abstract sense. A different hardware landscape (e.g., higher memory bandwidth, different tile sizes) might favor different architectural choices.

Technical Deep Dives

RoPE: Why Rotation Encodes Relation

Rotary Position Embeddings (RoPE) have become the dominant default for position encoding in modern LLMs. Understanding why requires examining how position information enters the attention computation.

In standard attention, the relevance of position $m$ to position $n$ is determined by the dot product $q_n^\top k_m$. For positions to influence attention, position information must be embedded in $q$ and $k$.

Absolute position embeddings add a position vector to the input:

$$q_n = W_q (x_n + p_n), \qquad k_m = W_k (x_m + p_m)$$

The dot product expands to four terms mixing content and position:

$$q_n^\top k_m = (W_q x_n)^\top (W_k x_m) + (W_q x_n)^\top (W_k p_m) + (W_q p_n)^\top (W_k x_m) + (W_q p_n)^\top (W_k p_m)$$

Absolute embeddings can learn relative patterns (GPT-2 and GPT-3 demonstrated this empirically) but the architecture must learn to extract relative information from the interaction of absolute positions. RoPE instead provides a direct inductive bias: relative position enters via a simple algebraic identity, requiring no learning to achieve relative sensitivity. This also interacts favorably with length extrapolation techniques.

RoPE’s insight is to encode position through rotation. For a 2D subspace of the query/key vectors, apply a rotation by angle $n\theta$:

$$R_n = \begin{pmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{pmatrix}, \qquad q_n = R_n W_q x_n, \qquad k_m = R_m W_k x_m$$

The dot product becomes:

$$q_n^\top k_m = (W_q x_n)^\top R_n^\top R_m (W_k x_m) = (W_q x_n)^\top R_{m-n} (W_k x_m)$$

Because $R_n^\top R_m = R_{m-n}$ (rotation matrices compose additively in angle), the attention score depends only on the relative position $m - n$, not the absolute positions. This is exactly the inductive bias we want.

For the full model dimension $d$, RoPE applies independent rotations to $d/2$ pairs of dimensions, each with a different frequency $\theta_i = 10000^{-2i/d}$:

$$R_{\Theta,n} = \begin{pmatrix} R_{n,1} & & \\ & \ddots & \\ & & R_{n,d/2} \end{pmatrix}$$

The multi-frequency design encodes relative position at multiple scales, analogous to the original sinusoidal encoding but applied multiplicatively through rotation rather than additively.
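The relative-position identity can be verified numerically; a minimal NumPy sketch applying per-pair rotations and checking that scores depend only on the offset:

```python
import numpy as np

def rope(x, pos, theta):
    # Rotate each 2D pair (x[2i], x[2i+1]) by angle pos * theta[i].
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

d = 8
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)  # per-pair frequencies
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# The score depends only on the relative offset m - n:
s1 = rope(q, 5, theta) @ rope(k, 3, theta)        # positions (5, 3)
s2 = rope(q, 105, theta) @ rope(k, 103, theta)    # positions (105, 103)
assert np.allclose(s1, s2)
```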

RoPE became dominant because it gives a clean relative-position inductive bias without learned position parameters, integrates naturally into existing attention implementations, and tends to behave well in long-context regimes, especially when paired with explicit extrapolation strategies (e.g., NTK-aware scaling, YaRN) and careful training. This is best read as a strong default under modern constraints, not a proof of strict superiority over all alternatives in all regimes. The 2025 models experimenting with “NoPE” (no position encoding) in some layers, combined with RoPE in others, suggest the field is learning that different layers may benefit from different position treatments.

SwiGLU: The Gating Advantage

The Gated Linear Unit (GLU) family improves FFN expressivity through multiplicative gating. SwiGLU specifically uses SiLU (Swish) as the activation:

$$\mathrm{SwiGLU}(x;\ W_1, W_3, W_2) = \big(\mathrm{SiLU}(xW_1) \odot xW_3\big)\,W_2$$

where $\mathrm{SiLU}(x) = x\,\sigma(x)$ and $\sigma$ is the sigmoid function.

To understand why gating helps, consider what each component contributes:

  1. $\mathrm{SiLU}(xW_1)$: A nonlinearly transformed representation of the input
  2. $xW_3$: A linear transformation of the input
  3. Element-wise product: The linear path gates the nonlinear path

The gating mechanism allows the network to learn which features of the nonlinear transformation should be amplified or suppressed, conditioned on the input. This is more expressive than applying a fixed nonlinearity.

SwiGLU has three weight matrices ($W_1, W_3, W_2$) versus two for the standard FFN ($W_1, W_2$). To maintain equivalent parameter count, the SwiGLU expansion factor $e'$ must satisfy, for a standard expansion factor $e$:

Standard FFN: $2 \cdot d \cdot ed = 2ed^2$ parameters

SwiGLU with expansion $e'$: $3 \cdot d \cdot e'd = 3e'd^2$ parameters

Setting these equal: $e' = \tfrac{2e}{3}$. For $e = 4$, we get $e' = \tfrac{8}{3} \approx 2.67$.

The key hypothesis is that gating is worth more than width: trading hidden dimension for a gating mechanism improves expressivity more than the raw capacity lost. Empirically, this hypothesis is supported: SwiGLU consistently outperforms parameter-matched GeLU baselines in controlled comparisons (Shazeer, 2020). The gating mechanism’s input-dependent modulation allows the network to selectively amplify or suppress features, a form of dynamic computation that static nonlinearities cannot express.

SiLU also has favorable gradient flow properties. Unlike ReLU, it’s smooth everywhere and has nonzero gradients for all x. In practice, SiLU/Swish often behaves comparably to GeLU while pairing naturally with GLU-style gating. The derivative:

$$\frac{d}{dx}\,\mathrm{SiLU}(x) = \sigma(x) + x\,\sigma(x)\big(1 - \sigma(x)\big) = \sigma(x)\Big(1 + x\big(1 - \sigma(x)\big)\Big)$$

is nonzero over a wide range (unlike ReLU’s hard zero on $(-\infty, 0)$), which can improve gradient flow in practice; empirically, GLU-family MLPs often deliver better quality at similar parameter/FLOP budgets in transformer LMs.
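The closed-form derivative can be checked against finite differences; a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = sigmoid(x)
    return s * (1 + x * (1 - s))

xs = np.linspace(-6, 6, 25)
# Central finite-difference check of the closed form:
h = 1e-6
fd = ((xs + h) * sigmoid(xs + h) - (xs - h) * sigmoid(xs - h)) / (2 * h)
assert np.allclose(silu_grad(xs), fd, atol=1e-5)
# Unlike ReLU, the gradient is nonzero for negative inputs:
assert silu_grad(-2.0) != 0.0
```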

The MoE Routing Problem

Mixture-of-Experts architectures must solve a fundamental tension: tokens should be routed to the experts best suited for them, but all experts should be utilized to justify their parameter cost.

But MoE is prone to routing collapse. Without intervention, MoE training often converges to using only a few experts. Once an expert becomes slightly better for some token types, it receives more training signal, improving further, creating a feedback loop that starves other experts.

Traditionally, a load-balancing auxiliary loss is added:

$$\mathcal{L}_{\mathrm{balance}} = \alpha\, n \sum_{i=1}^{n} f_i P_i$$

where $f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}[\text{token } t \text{ routed to expert } i]$ and $P_i = \frac{1}{T} \sum_{t=1}^{T} p_i(x_t)$.

This encourages the router to spread probability mass and actual routing decisions evenly. The problem: it distorts the primary language modeling objective. The router is incentivized to balance load even when some experts are genuinely better for certain tokens.

DeepSeek’s auxiliary-loss-free approach introduces a bias term bi for each expert that affects routing decisions but not expert weighting:

$$s_i = r_i(x) + b_i \qquad \text{(routing scores)}$$

Experts are selected by taking the top-k scores under si (the biased routing scores), but mixture weights are computed from the unbiased ri(x) over the selected set.

$$\mathrm{output} = \sum_{i \in \text{top-}k} \frac{e^{r_i(x)}}{\sum_{j \in \text{top-}k} e^{r_j(x)}}\; E_i(x)$$

The biases $b_i$ are adjusted during training to maintain load balance (increase $b_i$ for underutilized experts), but because they don’t affect the actual weighting, the model’s output is determined purely by learned routing quality.

This is elegant: routing decisions include the bias (ensuring balance), but output computation excludes it (preserving training signal fidelity).
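A minimal sketch of the selection/weighting split (illustrative scores; the bias-update rule itself is omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route_aux_free(r, b, k=2):
    # Select experts with the BIASED scores s_i = r_i + b_i,
    # but weight outputs with the UNBIASED r_i over the selected set.
    selected = np.argsort(r + b)[-k:]
    weights = softmax(r[selected])
    return selected, weights

r = np.array([2.0, 1.9, 0.1, 0.0])   # router scores: experts 0 and 1 dominate
b = np.array([-5.0, 0.0, 0.0, 5.0])  # balance bias steers load toward expert 3

selected, weights = route_aux_free(r, b)
# The bias changes WHO is selected (expert 3 replaces expert 0)...
assert set(selected) == {1, 3}
# ...but the mixture weights still reflect only the learned scores.
assert np.isclose(weights.sum(), 1.0)
```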

QK-Normalization: Taming Attention Score Variance

Query-key normalization has emerged as a critical stability mechanism for large-scale training. The mathematical motivation stems from the statistics of high-dimensional dot products.

For query and key vectors $q, k \in \mathbb{R}^{d_k}$ with independent, zero-mean entries of variance $\sigma^2$, and assuming $q$ and $k$ are independent of each other, the dot product $q^\top k$ has variance:

$$\mathrm{Var}(q^\top k) = d_k \sigma^4$$

These assumptions approximately hold at initialization but break down during training: entries become correlated, means drift from zero, and q-k independence fails since both derive from the same input. The practical concern is therefore not the initialization variance but drift: even when $d_k$ is held constant, the learned norms of $q$ and $k$ can grow during training, increasing logit magnitudes and sharpening the softmax in ways the $\sqrt{d_k}$ scaling cannot prevent. This creates two problems:

  1. Attention entropy collapse: Large-variance logits produce sharper softmax distributions, potentially collapsing to nearly one-hot attention patterns.
  2. Numerical instability: Pre-softmax logits can grow large enough to cause overflow or severe precision loss.

The standard mitigation is the $1/\sqrt{d_k}$ scaling in attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

This scaling normalizes the dot-product variance under idealized assumptions (e.g., independent entries with fixed variance), but in trained models the norms and distributions of q and k can drift with depth, scale, and optimization, producing occasional logit blow-ups and overly peaky attention.
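The variance claim is easy to verify empirically at initialization; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 128
q = rng.normal(size=(10000, d_k))  # unit-variance entries (sigma = 1)
k = rng.normal(size=(10000, d_k))

logits = (q * k).sum(-1)
# Var(q.T k) = d_k * sigma^4 = 128 at initialization...
assert abs(logits.var() / d_k - 1.0) < 0.1
# ...and dividing by sqrt(d_k) restores roughly unit variance:
assert abs((logits / np.sqrt(d_k)).var() - 1.0) < 0.1
```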

QK-normalization directly controls the vector norms:

$$\hat{q} = \frac{q}{\lVert q \rVert}, \qquad \hat{k} = \frac{k}{\lVert k \rVert}$$

With strict L2-normalization and fixed scaling, the dot product is bounded ($|\hat{q}^\top \hat{k}| \le 1$), which can prevent extreme logits. In practice, many implementations use RMSNorm-style normalization and may include learnable scales, so the benefit is better understood as controlling norm drift and reducing pathological logit growth, not as an absolute bound in all configurations.

The practical implementation often uses RMSNorm rather than L2 normalization, and may include learnable scale factors:

$$\hat{q} = \gamma_q\, \mathrm{RMSNorm}(q), \qquad \hat{k} = \gamma_k\, \mathrm{RMSNorm}(k)$$
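A minimal sketch of QK-normalized attention (RMSNorm variant, scalar scales for brevity; the function names are ours, not any specific model's code):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def qk_norm_attention(Q, K, V, gamma_q=1.0, gamma_k=1.0):
    # Normalize queries and keys before the dot product so logit
    # magnitude cannot grow with the learned norms of q and k.
    Qh, Kh = gamma_q * rms_norm(Q), gamma_k * rms_norm(K)
    d_k = Q.shape[-1]
    logits = Qh @ Kh.T / np.sqrt(d_k)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ V

seq, d_k = 8, 16
rng = np.random.default_rng(0)
Q = 100.0 * rng.normal(size=(seq, d_k))  # pathologically large query scale
K, V = rng.normal(size=(seq, d_k)), rng.normal(size=(seq, d_k))
out = qk_norm_attention(Q, K, V)
assert out.shape == (seq, d_k)
assert np.isfinite(out).all()  # no overflow despite the large input scale
```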

Models using QK-norm (Gemma 3, OLMo 2, Qwen 3, Kimi K2, Trinity) report more stable training, especially at scale. The technique appears most beneficial for:

  • Very deep models (>60 layers)
  • MoE models (where router instability can compound attention instability)
  • Long-context training (where attention patterns over many positions are more variable)

Dataset of 53 Models Across Eight Years

The accompanying dataset documents architectural specifications for 53 transformer LLMs from June 2017 through December 2025. All entries were cross-referenced against public sources, with undisclosed information marked.

Summary Statistics

A note on interpretation: this dataset is descriptive, not a controlled ablation study. Reported frequencies summarize what was adopted, but adoption reflects a mix of model quality, training stability, inference economics, and ecosystem path dependence (e.g., widely copied open baselines). We also note the following limitations:

  • Selection criteria: Models were included based on (1) public technical report or paper, (2) significant discussion in the research community (ArXiv citations, blog coverage, downstream adoption), and (3) architectural novelty or influence. This tilts toward open-weight models and English-language publications.
  • Family overlap: The dataset includes multiple versions of the same family (LLaMA, LLaMA 2, LLaMA 3). This inflates apparent convergence: if LLaMA 2 copies LLaMA’s architecture, counting both overstates independent convergence. Family-level analysis would show fewer independent data points.
  • Verification depth: Primary sources vary in detail. Some entries are cross-referenced against code (e.g., LLaMA, OLMo); others rely solely on papers that may omit implementation details.

Normalization convergence:

  • Post-norm: Only the original Transformer (2017) and GPT-1 (2018)
  • LayerNorm pre-norm: Dominant 2019-2022
  • RMSNorm: 41 of 53 models (77.4%), and close to ubiquitous among widely copied post-LLaMA open-weight families

Position encoding evolution:

  • Sinusoidal: Original Transformer only
  • Learned absolute: GPT-1/2/3, OPT (2018-2022)
  • Relative (T5-style): T5, LaMDA, Gopher (2019-2021)
  • ALiBi: BLOOM (2022)
  • RoPE: 37 of 53 models (69.8%), dominant in most post-2022 decoder-only LLM families
  • Hybrid RoPE+NoPE: Command A, Llama 4, Trinity (2025)

Activation functions:

  • ReLU: Original Transformer, T5, OPT (declining after 2019)
  • GeLU: GPT family, Gopher, BLOOM (2018-2022)
  • SwiGLU/GeGLU: 38 of 53 models (71.7%), near-universal after LLaMA

MoE adoption:

  • Dense models: 42 of 53 (79.2%)
  • MoE models: 11 of 53 (20.8%), all 11 are from 2024–2025 (9 of 11 are from 2025)
  • Among 2025 models in this dataset (n = 15): 9 use MoE (60.0%)

Attention variant:

  • MHA (multi-head attention): 27 of 53 (51%), the original default
  • GQA (grouped-query attention): 23 of 53 (43%), dominant post-LLaMA 2 for inference efficiency
  • MLA (multi-head latent attention): 3 of 53 (6%), DeepSeek V3/R1 and Kimi K2

Block structure:

  • Serial (sequential attention → FFN): 48 of 53 (91%)
  • Parallel (attention + FFN computed together): 5 of 53 (9%), including GPT-J, GPT-NeoX, PaLM, Falcon 2, Command A

Vocabulary size trends:

  • 32K–50K: Dominant 2017–2023 (GPT family, LLaMA 1/2, Mistral)
  • 100K–150K: Qwen family, Phi, OLMo 2
  • 200K–262K: Gemma family, BLOOM, PaLM, Command, Llama 4

Dataset

Model | Date | Norm | Position | Activation | Attn | Block | MoE | Vocab | Stability
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Original Transformer | 2017-06 | Post LayerNorm | Sinusoidal | ReLU | MHA | Serial | No | 37K | None
GPT-1 | 2018-06 | Post LayerNorm | Learned Abs | GeLU | MHA | Serial | No | 40K | None
GPT-2 | 2019-02 | Pre LayerNorm | Learned Abs | GeLU | MHA | Serial | No | 50K | Modified init
T5 | 2019-10 | Pre LayerNorm | Relative | ReLU | MHA | Serial | No | 32K | None
GPT-3 | 2020-05 | Pre LayerNorm | Learned Abs | GeLU | MHA | Serial | No | 50K | Modified init; sparse attn
T5 v1.1 | 2020-10 | Pre LayerNorm | Relative | GeGLU | MHA | Serial | No | 32K | None
mT5 | 2020-10 | Pre LayerNorm | Relative | GeGLU | MHA | Serial | No | 250K | None
GPT-J | 2021-05 | Pre LayerNorm | RoPE | GeLU | MHA | Parallel | No | 50K | None
LaMDA | 2021-05 | Pre LayerNorm | Relative | Gated-GeLU | MHA | Serial | No | 32K | None
Gopher | 2021-12 | Pre RMSNorm | Relative | GeLU | MHA | Serial | No | 32K | Low LR; grad clip
Chinchilla | 2022-03 | Pre RMSNorm | Relative | GeLU | MHA | Serial | No | 32K | None
GPT-NeoX | 2022-04 | Pre LayerNorm | RoPE | GeLU | MHA | Parallel | No | 50K | None
PaLM | 2022-04 | Pre LayerNorm | RoPE | SwiGLU | MHA | Parallel | No | 256K | No biases; shared emb
OPT | 2022-05 | Pre LayerNorm | Learned Abs | ReLU | MHA | Serial | No | 50K | Modified init
BLOOM | 2022-11 | Pre LayerNorm | ALiBi | GeLU | MHA | Serial | No | 251K | Embedding LayerNorm
LLaMA | 2023-02 | Pre RMSNorm | RoPE | SwiGLU | MHA | Serial | No | 32K | No biases
LLaMA 2 | 2023-07 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 32K | No biases
Qwen | 2023-09 | Pre RMSNorm | RoPE | SwiGLU | MHA | Serial | No | 152K | None
Mistral 7B | 2023-10 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 32K | Sliding window
Yi | 2023-11 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 64K | None
DeepSeek | 2024-01 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 102K | None
Mixtral | 2024-01 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | 8E/2act | 32K | Load balance loss
OLMo | 2024-02 | Pre LayerNorm | RoPE | SwiGLU | MHA | Serial | No | 50K | No biases
Gemma | 2024-02 | Pre RMSNorm | RoPE | GeGLU | MHA | Serial | No | 256K | None
Phi-3 | 2024-04 | Pre RMSNorm | RoPE | GeGLU | MHA | Serial | No | 100K | Blocksparse attn
Reka Flash | 2024-04 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 100K | None
Nemotron-4 | 2024-06 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 256K | None
GLM-4 | 2024-06 | Pre RMSNorm | RoPE | SwiGLU | MHA | Serial | No | 150K | No bias except QKV
Qwen 2 | 2024-07 | Pre RMSNorm | RoPE+DCA | SwiGLU | GQA | Serial | No | 152K | QKV bias
LLaMA 3 70B | 2024-07 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 128K | None
LLaMA 3 405B | 2024-07 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 128K | None
Mistral Large 2 | 2024-07 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 32K | -
Falcon 2 | 2024-07 | Pre LayerNorm | RoPE | GeLU | MHA | Parallel | No | 65K | FlashAttention-2
Gemma 2 | 2024-08 | Pre+Post RMSNorm | RoPE | GeGLU | GQA | Serial | No | 256K | Logit cap; local/global
Command R+ | 2024-09 | Pre LayerNorm | RoPE | SwiGLU | GQA | Serial | No | 256K | RAG opt
Qwen 2.5 | 2024-12 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 152K | QKV bias
Phi-4 | 2024-12 | Pre RMSNorm | RoPE | GeGLU | MHA | Serial | No | 100K | Synthetic data
DeepSeek V3 | 2024-12 | Pre RMSNorm | RoPE | SwiGLU | MLA | Serial | 256E+1/8act | 128K | Aux-free; FP8
OLMo 2 | 2025-01 | Pre RMSNorm | RoPE | SwiGLU | MHA | Serial | No | 100K | QK-Norm; Z-Loss
MiniMax M2 | 2025-01 | DeepNorm+RMSNorm | RoPE | SwiGLU | MHA | Serial | 32E/2act | 200K | Lightning Attention
DeepSeek R1 | 2025-01 | Pre RMSNorm | RoPE | SwiGLU | MLA | Serial | 256E+1/8act | 128K | Aux-free; FP8
SmolLM2 | 2025-02 | Pre RMSNorm | RoPE | SwiGLU | MHA | Serial | No | 49K | Embedding tying
Gemma 3 | 2025-03 | Pre+Post RMSNorm | RoPE | GeGLU | GQA | Serial | No | 262K | QK-norm; 5:1 local/global
Command A | 2025-03 | Pre LayerNorm | RoPE+NoPE | SwiGLU | MHA | Parallel | No | 255K | No biases; FP32 ops
Llama 4 Scout | 2025-04 | Pre RMSNorm | iRoPE | SwiGLU | GQA | Serial | 16E+1/var | 202K | MetaP init; FP8
Llama 4 Maverick | 2025-04 | Pre RMSNorm | iRoPE | SwiGLU | GQA | Serial | 128E+1/var | 202K | MetaP init; FP8; fusion
Qwen 3 | 2025-05 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 152K | QK-Norm; no QKV bias
Mistral Medium 3 | 2025-05 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | No | 131K | -
GLM-4.5 | 2025-07 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | 64E/4act | 150K | QK-Norm; Muon
Kimi K2 | 2025-07 | Pre RMSNorm | RoPE | SwiGLU | MLA | Serial | 384E/8act | 130K | QK-Clip; MuonClip
INTELLECT-3 | 2025-11 | Pre RMSNorm | RoPE | SwiGLU | GQA | Serial | 64E/4act | 150K | Same as GLM-4.5-Air
Trinity Nano | 2025-12 | Depth-scaled RMSNorm | RoPE+NoPE | SwiGLU+Gated | GQA | Serial | 128E/8act | 32K | QK-norm; sigmoid route
Trinity Mini | 2025-12 | Depth-scaled RMSNorm | RoPE+NoPE | SwiGLU+Gated | GQA | Serial | 128E+1/8act | 32K | QK-norm; sigmoid route


Patterns in the Data

LLaMA’s February 2023 release marks a clear architectural boundary. Before: significant diversity in normalization, position encoding, and activation choices. After: near-universal adoption of the LLaMA recipe.

MoE configuration diversity persists. Unlike the converged dense architecture, MoE models show wide variation:

  • Expert count: 8 (Mixtral) to 384 (Kimi K2)
  • Active experts: 2-8
  • Shared experts: 0, 1, or more
  • Routing: softmax, sigmoid, or hybrid

This diversity suggests MoE design is not yet settled. The optimal configuration likely depends on training budget, target inference cost, and use case.

Stability mechanisms cluster in 2024-2025: QK-normalization, logit capping, and specialized initialization appear almost exclusively in models from the last 18 months. This reflects both scaling to larger models where stability matters more and accumulated understanding of failure modes.

Implications and Open Questions

Practical takeaways: strong defaults, and when to deviate

A reasonable reading of 2017–2025 is not that architecture is “done”, but that dense decoder-only transformers have a highly competitive default configuration under modern constraints (training stability, throughput on current accelerators, and inference KV-cache cost). Concretely:

  • Default baseline (dense decoder-only). A common starting point is the consensus recipe: pre-norm with RMSNorm, RoPE, SwiGLU (≈8/3 MLP expansion when parameter-matching), and an attention variant that controls KV-cache (MQA/GQA depending on the quality/latency budget). Many successful families also drop most bias terms.

  • Treat deviations as hypotheses with measurable consequences. A change to one of these defaults should come with an explicit hypothesis about what will improve and a plan for how to measure it. In practice, architecture changes trade off among:

    • optimization stability (loss spikes, divergence rate, sensitivity to LR/initialization)
    • throughput (tokens/sec, MFU), memory (activations + KV-cache), and wall-clock time to a fixed eval target
    • long-context behavior (extrapolation, retrieval over long ranges, attention entropy/pathologies)
    • quality at fixed compute (downstream benchmarks, perplexity, robustness)
  • MoE is where “architecture” still moves quickly. Unlike the converged dense recipe, MoE design remains context-dependent: expert count, active experts per token, shared experts, routing objective, and load balancing all interact with the data mix and training budget. Reported development cycles tend to involve tuning, and common failure modes (routing collapse, instability at scale) are often first-order concerns.

  • Stability mechanisms are cheap insurance. As scale and context length increase, training can become more brittle. Techniques like QK-normalization / clipping, attention logit soft-capping, and specialized initialization are lightweight relative to the cost of a failed run. Even when they do not move final metrics, they can reduce “training crisis” risk.

  • Where most gains usually come from. In many published comparisons, the largest deltas come from data, optimization/training recipe, and post-training/alignment (and from inference engineering in deployment-facing settings) more than from swapping core architectural components within the standard transformer block.
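The consensus dense recipe from the first bullet can be sketched end to end. The following NumPy forward pass is an illustrative toy, not a production implementation: the weight shapes, head counts, and mask construction are assumptions made here, and the rotary embedding uses the common "rotate-half" convention.

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square; no mean subtraction, no bias.
    return g * x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def rope(x, base=10000.0):
    # Rotary position embedding on (seq, heads, head_dim), "rotate-half" form.
    seq, _, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)            # per-pair frequencies
    ang = np.arange(seq)[:, None] * freqs[None, :]       # (seq, half)
    cos, sin = np.cos(ang)[:, None, :], np.sin(ang)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def swiglu(x, w_gate, w_up, w_down):
    # Bias-free SwiGLU MLP: SiLU(x W_gate) * (x W_up), projected back down.
    gate = x @ w_gate
    return ((gate / (1.0 + np.exp(-gate))) * (x @ w_up)) @ w_down

def gqa(x, wq, wk, wv, wo, n_heads=4, n_kv_heads=2):
    # Grouped-query attention: n_heads query heads share n_kv_heads KV heads,
    # shrinking the KV cache by a factor of n_heads / n_kv_heads.
    seq, d_model = x.shape
    hd = d_model // n_heads
    q = rope((x @ wq).reshape(seq, n_heads, hd))
    k = rope((x @ wk).reshape(seq, n_kv_heads, hd))
    v = (x @ wv).reshape(seq, n_kv_heads, hd)
    k = np.repeat(k, n_heads // n_kv_heads, axis=1)      # replicate KV heads
    v = np.repeat(v, n_heads // n_kv_heads, axis=1)
    logits = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    pos = np.arange(seq)
    logits = logits + np.where(pos[None, :] > pos[:, None], -np.inf, 0.0)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    out = np.einsum('hqk,khd->qhd', w, v).reshape(seq, d_model)
    return out @ wo

def block(x, p):
    # Pre-norm residual block: x + Attn(norm(x)), then x + MLP(norm(x)).
    x = x + gqa(rms_norm(x, p['g1']), p['wq'], p['wk'], p['wv'], p['wo'])
    return x + swiglu(rms_norm(x, p['g2']), p['w_gate'], p['w_up'], p['w_down'])
```

Note how the KV-cache knob is isolated: with everything else fixed, moving from n_kv_heads = n_heads (MHA) toward n_kv_heads = 1 (MQA) changes only the k/v projection widths and the cache footprint, which is part of why KV-sharing spread so quickly.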

Interpreting convergence: what it does (and does not) imply

Architectural convergence is evidence of a strong shared basin of solutions, but it is not a proof of global optimality. It reflects both model-intrinsic considerations and strong external constraints.

Convergence reflects constraints and ecosystem dynamics. Choices that win often do so because they are good and easy to scale: they are stable at depth, efficient on GPUs/TPUs, compatible with fused kernels, and friendly to inference economics (especially KV-cache). Influential released baselines and reference implementations can accelerate standardization, creating path dependence even when multiple alternatives are viable.

But here’s what convergence does not settle:

  • It does not establish that alternatives are worse under all regimes (e.g., different context lengths, modalities, latency constraints, or hardware).
  • It does not replace controlled ablations: many “bundled” recipes change several factors at once, and improvements can be misattributed.
  • It does not imply that today’s defaults will remain best as constraints change (million-token contexts, different memory hierarchies, or new attention kernels).

There is also a monoculture trade-off. A strong default accelerates progress and reproducibility, but it narrows exploration. This is particularly relevant for nonstandard settings (very long context, low-latency streaming, memory-limited deployment), where the best architecture might differ from the mainstream recipe.

Finally, a useful research posture would be to treat the consensus stack as a hard-to-beat baseline, and aim for claims of the form: “under explicit constraints X and evaluation Y, modification Z reliably improves metric M and does not regress N”. That standard is what distinguishes robust architectural progress from recipe churn.

Open Questions

  • Under what regimes does RoPE underperform? RoPE dominates current practice, but ALiBi was designed for length extrapolation and relative-bias approaches may handle certain retrieval patterns better. At what context length, and for what tasks (e.g., retrieval vs. generation), do alternatives outperform RoPE with standard extrapolation (NTK-aware, YaRN)?

  • Is there a scaling law for expert count? Mixtral uses 8 experts; Kimi K2 uses 384. Both work. We could study whether optimal expert count E scales as E ∝ C^α for training compute C, with active experts k held constant. What is α, and does it depend on data diversity?

  • What is the quality/efficiency Pareto frontier for subquadratic attention? Linear attention variants underperform softmax at scale, but hybrids (e.g., Lightning Attention) suggest a middle ground. For a fixed compute budget, what mix of linear and softmax layers maximizes quality? Does the optimal ratio change with sequence length?

  • Is 8/3 expansion optimal, or just conventional? The SwiGLU ratio emerged from parameter-matching, not optimization. We could sweep expansion factors from 2 to 4 at fixed total parameters and measure downstream task performance. Does the optimal ratio vary with model scale?

  • What would trigger an architectural phase transition? Mamba and state-space models offer O(n) complexity but haven’t displaced transformers. Hypothesis: the transition requires either (1) a task regime where O(n²) is prohibitive (million-token contexts with dense attention), or (2) hardware where memory bandwidth dominates compute. Which comes first?
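For context on the 8/3 question above, the parameter-matching arithmetic is short enough to check directly: a two-matrix MLP with 4x expansion has 2·d·(4d) = 8d² weights, while a three-matrix GLU MLP with hidden width h has 3·d·h, so matching gives h = 8d/3. The helper below is hypothetical, written here just to make the bookkeeping explicit.

```python
# Parameter-matching arithmetic behind the ~8/3 SwiGLU expansion (sketch).
def mlp_params(d_model, hidden, n_matrices):
    # Each projection matrix holds d_model * hidden weights (bias-free).
    return n_matrices * d_model * hidden

d = 4096                               # model width (e.g. a 7B-class model)
baseline = mlp_params(d, 4 * d, 2)     # classic MLP: up + down projections
h = 8 * d // 3                         # GLU hidden width matching the baseline
glu = mlp_params(d, h, 3)              # GLU MLP: gate + up + down projections
print(h)                               # 10922
```

In practice the width is rounded to hardware-friendly sizes (LLaMA-7B uses 11008 rather than exactly 10922), and sweeping the expansion factor at fixed total parameters only requires varying h in this bookkeeping.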

Conclusion

The eight-year trajectory from the original transformer to 2025 frontier systems follows a pattern of exploration → convergence → renewed divergence:

Era | Period    | Pattern
----|-----------|--------------------------------------------------------------
I   | 2017–2019 | Foundations established, immediate variations explored
II  | 2020–2022 | Scaling drove efficiency innovations (RMSNorm, RoPE, SwiGLU)
III | 2023–2024 | LLaMA crystallized a reproducible recipe, standardization accelerated
IV  | 2024–2025 | MoE emerged as dominant scaling axis, diversity returned


The dense-model convergence on a core bundle (pre-norm + RMSNorm + RoPE + SwiGLU, often paired with MQA/GQA-style KV sharing and reduced bias usage) suggests a robust, highly competitive basin of solutions under today’s constraints. It is evidence of what tends to work when optimizing simultaneously for stability, throughput on current accelerators, and inference cost, while also benefiting from an ecosystem of shared implementations and kernels. It is not, by itself, a proof of global optimality.

At the same time, the remaining variation (MoE routing and expert design, long-context attention patterns, and an increasing number of explicit stability interventions) highlights where the architecture is still actively adapting. These choices look less like settled convention and more like responses to new failure modes that appear at larger scale and longer context.

If there is a meta-lesson in the 2017 → 2025 shift, it is how quickly “reasonable defaults” can change once scale and constraints change. Many early design choices (post-norm, learned absolute positions, ReLU) were not wrong so much as eventually outcompeted. The next shift may come from long-context regimes, different hardware constraints, or architectures that change the attention/computation trade-off entirely.


Notes

This analysis is inspired by Tatsunori Hashimoto’s lecture on architectures and hyperparameters in Stanford CS336 (April 2025).

Cover image: Milada Vigerova (Unsplash)


References

  1. Attention is All You Need
    Ashish Vaswani, Noam Shazeer, Niki Parmar, and 5 more authors
    In Advances in Neural Information Processing Systems, 2017
  2. Root Mean Square Layer Normalization
    Biao Zhang and Rico Sennrich
    Advances in Neural Information Processing Systems, 2019
  3. RoFormer: Enhanced Transformer with Rotary Position Embedding
    Jianlin Su, Murtadha Ahmed, Yu Lu, and 3 more authors
    Neurocomputing, 2024
    Originally posted to arXiv in April 2021
  4. GLU Variants Improve Transformer
    Noam Shazeer
    arXiv preprint arXiv:2002.05202, 2020
  5. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    Tri Dao, Dan Fu, Stefano Ermon, and 2 more authors
    In Advances in Neural Information Processing Systems, 2022


