Scaling Laws for Optimal LR and Batch Size - Part X

Trying out the OLMo 1 architecture

Jul 21, 2025

Introduction

In Part V, we saw that the learning rate forecast at our test point—N = 4.7 billion parameters and D = 100 billion tokens—is off by a factor of ~4x when using the LLaMA architecture and a standard AdamW weight decay applied to all parameters.

In Part VII, we tried using the Qwen 3 architecture with QK-LayerNorm and independent weight decay applied to all parameters, but the situation remained roughly the same, with the forecast off by at least a factor of ~4x.

In Part VIII, we tried using the Gemma 3 architecture with parametric RMSNorm on the inputs and outputs of each residual block, QK-LayerNorm applied to queries and keys, and independent weight decay applied to all parameters. This led to a worse forecast, off by an estimated ~10x.

Having tried both forms of weight decay and various other modifications such as adding and removing QK-LayerNorm without any improvement, one of our last remaining hopes of training a standard-ish Transformer architecture with power-law forecastable learning rate optima seems to be to remove the RMSNorm parameters.

In this post, I will try exactly that—a simpler architecture without trainable RMSNorm scale parameters in the model. This Transformer architecture has been used previously by Groeneveld et al., 2024; Lingle, 2024; Blake et al., 2024.

Scaling Laws Setup

Open codebase. I use the research codebase at https://github.com/lucaslingle/babel with branch name ‘nonparametric_scaling’.

Architecture. I use an architecture similar to OLMo 1, with nonparametric RMSNorm applied to the residual block inputs. I use SwiGLU nonlinearities in the MLP sublayers, with d_ff = 3 * d_model. I use d_head = 128, and n_head = d_model / d_head. I use a locked aspect ratio of d_model/n_layer = 128.

Optimization. I use AdamW with beta1 = 0.9, beta2 = 0.95, eps = 1e^-8. The learning rate schedule uses 2% warmup, followed by cosine decay to a 0.1x factor of the peak.

(N, D) grid. For efficiency, I use model sizes N = 46M, 109M, 368M and dataset sizes D = 128M, 256M, 512M, 1024M.

(B, \eta) grid. I sweep over batch sizes from 32768 to 2097152 in powers of 2, sweep over learning rates from 0.00012207031 to 0.0078125 in powers of 2^0.5.

Training data. I use the 350B token sample of FineWeb, and the LLaMA-2 tokenizer, which has approximately 32000 vocabulary items including special tokens.

Evaluation. I use 100 held-out batches of size 2097152 tokens for loss evaluation. Evaluation occurs after training has concluded.

Scaling Laws Result

I obtained the following fitted scaling law, omitting Akima spline interpolation and bootstrapping for simplicity.

\(B^{*} = 0.480 D^{0.600}\)

\(\eta^{*} = 0.327 N^{-0.466} D^{0.133}\)

Test Point

At the test point (N, D) = (4.7 * 10^9, 100 * 10^9), the forecasted optimal batch size and forecasted optimal learning rate are

\(B^{*} = 1,910,914\)

\(\eta^{*} = 0.0003003162\)

For comparison, the true learning rate optimum was approximately 0.00024414062 based on the results of a previous post using a batch size of 1M.1 This suggests the forecasted optimal learning rate is relatively close to the ground truth. I will resweep the test point optimum using FineWeb with a batch size of 2M in a later post.

Conclusion

In this post, I tried using non-parametric RMSNorm similar to OLMo 1. This resolved most of the extrapolation error for our scaling laws at the test point.

It is worth noting that our test point uses a far larger model size N than the one used in Li et al., 2025, so our evaluation is more rigorous. Meanwhile, the test point from Bjorck et al., 2024 was LLaMA-7B, but the discrepancy of their forecasted learning rate with the one actually used by the model was not definitively resolved.

This post used C4 dataset with the T5 tokenizer, not FineWeb with the LLaMA-2 tokenizer, but these datasets are similar and the tokenizers have the same approximate vocabulary size, leading to an empirically-similar optimum.

Lucas Lingle