Parametric or Nonparametric RMSNorm?
A self-contained ablation study
Introduction
The standard Transformer LLM uses parametric LayerNorm or RMSNorm, usually applied to the input of each residual layer. But in this setting, the normalization is always followed by a linear layer, so the learned scale parameter seems redundant: any per-channel gain could, in principle, be absorbed into the weights of that linear layer.
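To make the two variants concrete, here is a minimal PyTorch sketch of RMSNorm with and without the learned scale. This is an illustration rather than the code used for the runs in this post; the class name, the eps value, and the pre-norm block at the end are assumptions for the example.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm with an optional learned per-channel scale (the 'parametric' variant)."""
    def __init__(self, dim: int, eps: float = 1e-8, parametric: bool = True):
        super().__init__()
        self.eps = eps
        # Parametric: learn a gain initialized to ones; non-parametric: no parameters at all.
        self.weight = nn.Parameter(torch.ones(dim)) if parametric else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root-mean-square over the channel dimension.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        x = x * inv_rms
        return x * self.weight if self.weight is not None else x

# In a pre-norm Transformer block the norm is immediately followed by a linear
# projection, which is why the learned scale looks redundant on paper:
# dim = 3584  # residual dimension used in this post
# block_in = nn.Sequential(RMSNorm(dim, parametric=False), nn.Linear(dim, 4 * dim))
```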
To the best of my knowledge, OLMo from AI2 is the only public LLM that considers omitting the scale parameter; they call this layer non-parametric LayerNorm. In a follow-up study, OLMoE showed that parametric LayerNorm actually outperformed the non-parametric variant, both in the OLMo architecture and in a Mixture-of-Experts architecture. Similar results were previously reported in the ST-MoE paper. Most recently, OLMo 2 used parametric RMSNorm. In this post, I perform yet another ablation to see which choice is better.
Experiment Setup
Training uses the LLaMA architecture with depth 28 and residual dimension 3584. Optimization uses AdamW with beta1 = 0.9, beta2 = 0.95, eps = 1e-8, weight decay 0.1, a batch size of 1,048,576 tokens, sequence length 256, and 100,000 update steps, with 2,000 learning-rate warmup steps followed by cosine decay to 10% of the maximum learning rate. Gradients are clipped to Euclidean norm 1. The maximum learning rate for each model is swept with a log-scale grid search.
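For reference, here is a small sketch of the schedule described above: warmup for 2,000 steps, then cosine decay to 10% of the maximum learning rate over 100,000 steps. The linear warmup shape, function names, and the example sweep values are assumptions, not taken from the actual training code.

```python
import math

def lr_at_step(step: int,
               max_lr: float,
               warmup_steps: int = 2_000,
               total_steps: int = 100_000,
               final_frac: float = 0.1) -> float:
    """Warmup then cosine decay to final_frac * max_lr (warmup assumed linear)."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return max_lr * (final_frac + (1.0 - final_frac) * cosine)

# The optimizer settings above map directly onto torch.optim.AdamW, e.g.:
#   torch.optim.AdamW(model.parameters(), lr=max_lr,
#                     betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)
# with gradient clipping via torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0),
# and max_lr swept over a log-scale grid, e.g. [1e-4, 2e-4, 4e-4, ...] (hypothetical values).
```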
Experiment Results
The results are as follows: the lowest loss for parametric RMSNorm was 2.240, while the lowest loss for non-parametric RMSNorm was 2.253. Based on this ablation, the consensus choice of parametric RMSNorm—used by LLaMA, Qwen, DeepSeek, and recent AI2 models—seems better.
Summary
In this post, I evaluated non-parametric RMSNorm, which was used to train OLMo, was recently advocated for in a paper of mine on mu-Transfer, and was also used in a few other works such as the unit-muP paper.
Although non-parametric RMSNorm reportedly performed better and permitted mu-Transfer of the learning rate in those papers, the larger-scale ablation here gives results consistent with the ST-MoE and OLMoE papers: parametric RMSNorm yields a lower loss.

