Introduction
Recent studies suggest that optimal learning rates and batch sizes follow power laws in compute, and more specifically in the parameter count and training token count (Bi et al., 2024; Bjorck et al., 2024; Porian et al., 2024; Li et al., 2025). Such scaling laws are used in practice for training Transformer LLMs (Bi et al., 2024; Yang et al., 2024).
The general form of these scaling laws is

B^{*}(N, D) = c_B \cdot N^{\alpha_B} \cdot D^{\beta_B},    \eta^{*}(N, D) = c_\eta \cdot N^{\alpha_\eta} \cdot D^{\beta_\eta},

where the exponents and the proportionality constants c_B, c_\eta are learned, and may depend on the training data and tokenizer. These formulas can be fitted by linear regression in log-log space, i.e., by regressing the logarithms of the observed optima on the logarithms of N and D.
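To make the fitting procedure concrete, here is a minimal numpy sketch of this regression. The array names (N, D, B_star, eta_star) are placeholders of mine for the collected (N, D) pairs and their grid-search optima; I use base-2 logarithms, which is convenient with a powers-of-two hyperparameter grid (the fitted exponents are the same for any base, only the intercept changes).

```python
import numpy as np

def fit_loglog(N, D, target):
    """OLS fit of log2(target) against [1, log2(N), log2(D)]."""
    X = np.column_stack([np.ones(len(N)), np.log2(N), np.log2(D)])
    coef, *_ = np.linalg.lstsq(X, np.log2(target), rcond=None)
    return coef  # [intercept, exponent on N, exponent on D]

# Usage once the sweep data is collected, e.g.:
#   coef_B = fit_loglog(N, D, B_star)
#   coef_eta = fit_loglog(N, D, eta_star)
```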
Experiment Basics
In this post, I will use the LLaMA architecture (Touvron et al., 2023) with SwiGLU and d_ff = 3 d_model, and attempt to fit these scaling laws. As inputs to the regression, I will use (N, D) pairs, representing parameter count and training token count, generated from iso-FLOP curves. For each (N, D) pair I will sweep a grid of (B, \eta) values, so that I can use the optimum (B^{*}, \eta^{*}) as the regression target.
For training the models, I will use the AdamW optimizer with \beta_{1} = 0.9, \beta_{2} = 0.95, \epsilon = 10^{-8}, and weight decay \lambda = 0.1. Gradients are clipped at norm 1.0. I will use 2% of the training steps for learning rate warmup, followed by cosine decay.
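For concreteness, below is a minimal PyTorch sketch of this optimizer and schedule. The names model, peak_lr, and total_steps are placeholders, and decaying all the way to zero at the end of training is my assumption, since the post does not pin down the final learning rate.

```python
import math
import torch

def make_optimizer_and_scheduler(model, peak_lr, total_steps):
    # AdamW with the hyperparameters above; \lambda = 0.1 is the weight decay.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr,
        betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
    )
    warmup_steps = max(1, int(0.02 * total_steps))  # 2% of steps for warmup

    def lr_scale(step):
        # Linear warmup to peak_lr, then cosine decay (to zero -- an assumption).
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
    return optimizer, scheduler

# In the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```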
Experiment Details
I want to generate K regression examples from an iso-FLOP curve while budgeting some S FLOPs in total for sampling this curve. Assuming G total (B, \eta)-grid points per (N, D) pair, and given an architecture with d_model = 128 n_layer (which holds for all models used below),
I can iterate over K choices of n_layer, compute N = 13 * d_model^2 * n_layer, and derive the number of training tokens as D = S/(6GKN), so that each of the K pairs gets an equal S/K share of the budget spread over its G runs of 6ND FLOPs each, and (N, D) lies on the current iso-FLOP curve. This will yield GK distinct runs totaling S FLOPs as desired. In practice, S can be set as S = 0.5C, 0.25C, etc., so that the total FLOP budget aggregated across all iso-FLOP curves is bounded by a user-defined constant C.
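Here is a minimal sketch of that sampling procedure. The 13 d_model^2 per-layer count follows from 4 d_model^2 for attention plus 9 d_model^2 for the SwiGLU MLP with d_ff = 3 d_model; the rounding of D to actual step boundaries is left out.

```python
def isoflop_curve(S, G, n_layer_choices):
    """Sample K (N, D) pairs whose G-point (B, eta) grids together cost S FLOPs."""
    K = len(n_layer_choices)
    runs = []
    for n_layer in n_layer_choices:
        d_model = 128 * n_layer                 # aspect ratio used above
        N = 13 * d_model**2 * n_layer           # non-embedding parameters
        # Each pair gets an S/K share of the budget, spread over G runs of 6*N*D FLOPs.
        D = S / (6 * G * K * N)
        runs.append((n_layer, d_model, N, D))
    return runs

# Example: one curve at S = 0.5 * C with C = 1e21 and G = 42 grid points.
print(isoflop_curve(S=0.5e21, G=42, n_layer_choices=[6, 8, 12, 16, 24]))
```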
I will use C = 10^21 in this post, and let n_layer = 6, 8, 12, 16, 24, yielding K = 5 models between N = 46 million and N = 2.9 billion parameters. For the (B, \eta)-grid search, I will use batch sizes in powers of two between 65536 and 2097152 tokens, and learning rates in powers of two between 0.0078125 (2^{-7}) and 0.0001220703125 (2^{-13}), yielding G = 6 * 7 = 42 grid points.
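Spelled out, the grid is the cross product of 6 batch sizes and 7 learning rates:

```python
batch_sizes = [2**k for k in range(16, 22)]       # 65536 ... 2097152 tokens
learning_rates = [2.0**-k for k in range(7, 14)]  # 0.0078125 ... 0.0001220703125
grid = [(B, lr) for B in batch_sizes for lr in learning_rates]
assert len(grid) == 42                            # G = 6 * 7
```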
As a simplifying choice, I omit the tuning of the AdamW \beta_{2} parameter for small batch sizes suggested by Porian et al. (2024). I also do not include the unembeddings in the calculation of model size N when computing iso-FLOP curves and hyperparameter scaling laws. Finally, I allow the hyperparameter optima to depend on the token budget D, which is closer to Bjorck et al. (2024) and Li et al. (2025).
Experiment Results
The collected data for the experiments were as follows:
N (params) D (tokens) B^{*} (tokens) \eta^{*} Loss
46006272 2156134400 262144 0.003906 3.270941
46006272 4312530944 262144 0.003906 3.192060
46006272 8625061888 524288 0.003906 3.126466
109051904 909639680 131072 0.001953 3.281688
109051904 1819410432 131072 0.001953 3.162354
109051904 3638820864 262144 0.003906 3.052100
368050176 269484032 65536 0.000977 3.475778
368050176 538968064 131072 0.000977 3.275321
368050176 1078067200 131072 0.000977 3.112425
872415232 113704960 65536 0.000488 3.721318
872415232 227409920 65536 0.000488 3.473123
872415232 454819840 65536 0.000488 3.269213
2944401408 33685504 65536 0.000244 4.739540
2944401408 67371008 65536 0.000244 4.037603
2944401408 134742016 65536 0.000244 3.620115
When I fit the scaling laws with OLS in log-log space, I got the following coefficients for the log-B^{*} and log-\eta^{*} rules:

        log B^{*}    log \eta^{*}
const    10.803668      3.849702
log_N    -0.094118     -0.588770
log_D     0.299974      0.099994
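Read as power laws, this says the optimal batch size grows roughly as D^{0.3} while shrinking slowly with N, and the optimal learning rate falls steeply with N (roughly N^{-0.59}) while growing slowly with D. The intercepts appear to be consistent with base-2 logarithms, which matches the powers-of-two grid (the exponents themselves are base-independent), so a prediction sketch looks like the following; the sanity check against the first row of the results table above is my own.

```python
def predict(N, D, intercept, exp_N, exp_D):
    # Power-law form implied by the log-log fit (base-2 intercept).
    return 2.0**intercept * N**exp_N * D**exp_D

B_opt   = predict(46006272, 2156134400, 10.803668, -0.094118, 0.299974)
eta_opt = predict(46006272, 2156134400, 3.849702, -0.588770, 0.099994)
# B_opt is roughly 2.1e5 tokens and eta_opt roughly 0.0038, close to the
# grid optima 262144 and 0.003906 observed for that (N, D) pair.
```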
Discussion
These exponents differ from those reported in the step law paper (Li et al., 2025). It is possible that my setup does not use a fine enough grid or long enough runs, and other factors such as the model architecture may also differ. I may need to add runs sampled from an (N, D) grid rather than only from iso-FLOP curves, and to expand the (B, \eta)-grid range. Finally, it is possible that including the unembeddings in the computation of N would improve consistency with other works.
I will think about this and read up further before designing a follow-up experiment and attempting to run things again. For reference, the data collection took roughly one day, so this was a relatively cheap learning experience. Stay tuned!