Parametric or Nonparametric RMSNorm?
A self-contained ablation study
Introduction
The standard Transformer LLM uses parametric LayerNorm or RMSNorm, usually applied to the input of each residual layer. But in this setting, the normalization is always followed by a linear layer, so the learned scale parameter seems redundant: any per-channel gain could, in principle, be absorbed into the weights of that linear layer.
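To make the two variants concrete, here is a minimal PyTorch sketch of RMSNorm with and without the learned scale. This is an illustration rather than the code used for the runs in this post; the class name, the eps value, and the pre-norm block at the end are assumptions for the example.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm with an optional learned per-channel scale (the 'parametric' variant)."""
    def __init__(self, dim: int, eps: float = 1e-8, parametric: bool = True):
        super().__init__()
        self.eps = eps
        # Parametric: learn a gain initialized to ones; non-parametric: no parameters at all.
        self.weight = nn.Parameter(torch.ones(dim)) if parametric else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the root-mean-square over the channel dimension.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        x = x * inv_rms
        return x * self.weight if self.weight is not None else x

# In a pre-norm Transformer block the norm is immediately followed by a linear
# projection, which is why the learned scale looks redundant on paper:
# dim = 3584  # residual dimension used in this post
# block_in = nn.Sequential(RMSNorm(dim, parametric=False), nn.Linear(dim, 4 * dim))
```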
To the best of my knowledge, OLMo from AI2 is the only public LLM that considers omitting the scale parameter; they call this layer non-parametric LayerNorm. In a follow-up study, OLMoE showed that parametric LayerNorm actually outperformed the non-parametric variant, both in the OLMo architecture and in a Mixture-of-Experts architecture. Similar results were previously reported in the ST-MoE paper. Most recently, OLMo 2 used parametric RMSNorm. In this post, I perform yet another ablation to see which choice is better.
Experiment Setup
Training uses the LLaMA architecture with depth 28 and residual dimension 3584. Optimization uses AdamW with beta1 = 0.9, beta2 = 0.95, eps = 1e-8, weight decay 0.1, a batch size of 1,048,576 tokens, sequence length 256, and 100,000 update steps, with 2,000 learning-rate warmup steps followed by cosine decay to 10% of the maximum learning rate. Gradients are clipped to Euclidean norm 1. The maximum learning rate for each model is swept with a log-scale grid search.
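For reference, here is a small sketch of the schedule described above: warmup for 2,000 steps, then cosine decay to 10% of the maximum learning rate over 100,000 steps. The linear warmup shape, function names, and the example sweep values are assumptions, not taken from the actual training code.

```python
import math

def lr_at_step(step: int,
               max_lr: float,
               warmup_steps: int = 2_000,
               total_steps: int = 100_000,
               final_frac: float = 0.1) -> float:
    """Warmup then cosine decay to final_frac * max_lr (warmup assumed linear)."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return max_lr * (final_frac + (1.0 - final_frac) * cosine)

# The optimizer settings above map directly onto torch.optim.AdamW, e.g.:
#   torch.optim.AdamW(model.parameters(), lr=max_lr,
#                     betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)
# with gradient clipping via torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0),
# and max_lr swept over a log-scale grid, e.g. [1e-4, 2e-4, 4e-4, ...] (hypothetical values).
```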
Experiment Results
The results are as follows: the lowest loss for parametric RMSNorm was 2.240, while the lowest loss for non-parametric RMSNorm was 2.253. Based on this ablation, the consensus choice of parametric RMSNorm—used by LLaMA, Qwen, DeepSeek, and recent AI2 models—seems better.
Summary
In this post, I evaluated non-parametric RMSNorm, which was used to train OLMo, was recently advocated for in a paper of mine on mu-Transfer, and was also used in a few other works such as the unit-muP paper.
Although non-parametric RMSNorm reportedly performed better and permitted mu-Transfer of the learning rate in those papers, the larger-scale ablation here gives results consistent with the ST-MoE and OLMoE papers: parametric RMSNorm yields a lower loss.

