Introduction
Welcome back. In the last post, I drew inspiration from prior work (Bi et al., 2024; Bjorck et al., 2024; Porian et al., 2024; Li et al., 2025) to fit scaling laws for the optimal batch size B^{*} and the optimal learning rate \eta^{*} in terms of the parameter count N and the training token count D.
I obtained the fitted scaling laws
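Schematically, both quantities are modeled as power laws in N and D, in the style of Li et al., 2025, with B_0, \eta_0 and the Greek exponents standing for the fitted coefficients:

B^{*}(N, D) = B_0 * N^{\alpha} * D^{\beta}
\eta^{*}(N, D) = \eta_0 * N^{\gamma} * D^{\delta}

In this notation, the first observation below concerns \alpha and the second concerns \delta.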
There seem to be two strange phenomena:
The fitted optimal batch size has a small negative exponent on the model size N, so larger models are predicted to train best with a smaller batch size at any given token budget D.
The fitted optimal learning rate has a small positive exponent on the token budget D, so runs with a larger data budget are predicted to use a higher learning rate. Interestingly, this is consistent with Li et al., 2025, which in fact fits an even larger positive exponent.
Experiment Planning
The simplest change to improve the fit of the scaling laws is to add more (N, D) pairs. Let's try a grid of (N, D) pairs instead of only sampling along iso-FLOP curves. Since the new (N, D) pairs form a grid, the total cost of the new runs can be factorized (using the standard approximation of 6ND training FLOPs per run):

C_{new} = \sum_i \sum_j 6 G N_i D_j = 6 G (\sum_i N_i) (\sum_j D_j),

where G is the number of (B, \eta) grid points per (N, D) pair, and as previously G = 42. If the new experiments reuse the same five parameter counts N_i as before, then

\sum_i N_i = 46,006,272 + 109,051,904 + 368,050,176 + 872,415,232 + 2,944,401,408 = 4,339,924,992.

If the compute constraint for this round of experiments is C = 10^21 FLOPs, then the summed token count is bounded by

\sum_j D_j <= C / (6 G \sum_i N_i) = 10^21 / (6 * 42 * 4,339,924,992) ≈ 914,360,035 tokens.

Thus if D_j = 914,360,035 * 0.5^j (j >= 1), the D_j form a geometric series summing to at most 914,360,035 tokens, and the total cost is bounded by C = 10^21 FLOPs. I'll go run this grid of (N_i, D_j) pairs and report back.
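To double-check the arithmetic, here is a minimal Python sketch of the budget calculation under the same assumptions (roughly 6ND training FLOPs per run, G = 42, and the five N_i listed in the results table below); the variable names are illustrative.

```python
# Budget check for the planned grid, assuming ~6 * N * D training FLOPs per run.
G = 42            # (B, eta) grid points per (N, D) pair
C = 1e21          # compute budget for this round, in FLOPs

# The same five parameter counts N_i used for the earlier iso-FLOP runs.
N_list = [46_006_272, 109_051_904, 368_050_176, 872_415_232, 2_944_401_408]

# The cost factorizes as 6 * G * sum(N_i) * sum(D_j), so the token count
# summed over all D_j is capped at:
D_budget = C / (6 * G * sum(N_list))
print(f"sum_j D_j <= {D_budget:,.0f} tokens")   # ~914,360,035

# Halving schedule D_j = D_budget * 0.5**j for j >= 1; the geometric series
# sums to at most D_budget, so the cap is respected for any number of D_j.
D_list = [round(D_budget * 0.5 ** j) for j in range(1, 4)]
print(D_list)   # close to the three grid D values in the results table
```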
Experiment Results
The results below combine the iso-FLOP curves' data with the new grid data. In total, 1260 runs were completed, and the best (lowest-loss) run for each (N, D) pair is shown; a short sketch of this selection step follows the table.
Run   N           D           B       \eta      Loss
26    46006272    2156134400  262144  0.003906  3.270941
74    109051904   909639680   131072  0.001953  3.281688
122   368050176   269484032   65536   0.000977  3.475778
163   872415232   113704960   65536   0.000488  3.721318
204   2944401408  33685504    65536   0.000244  4.739540
236   46006272    4312530944  262144  0.003906  3.192060
284   109051904   1819410432  131072  0.001953  3.162354
325   368050176   538968064   131072  0.000977  3.275321
373   872415232   227409920   65536   0.000488  3.473123
414   2944401408  67371008    65536   0.000244  4.037603
439   46006272    8625061888  524288  0.003906  3.126466
488   109051904   3638820864  262144  0.003906  3.052100
535   368050176   1078067200  131072  0.000977  3.112425
583   872415232   454819840   65536   0.000488  3.269213
624   2944401408  134742016   65536   0.000244  3.620115
669   46006272    114294784   65536   0.001953  3.923660
711   46006272    228589568   65536   0.001953  3.712386
753   46006272    457179136   65536   0.001953  3.548782
794   109051904   114294784   65536   0.000977  3.872678
836   109051904   228589568   65536   0.000977  3.634435
872   109051904   457179136   131072  0.001953  3.431678
920   368050176   114294784   65536   0.000977  3.809358
962   368050176   228589568   65536   0.000977  3.524963
1004  368050176   457179136   65536   0.000977  3.319148
1045  872415232   114294784   65536   0.000488  3.718716
1087  872415232   228589568   65536   0.000488  3.471644
1129  872415232   457179136   65536   0.000488  3.265596
1170  2944401408  114294784   65536   0.000244  3.694377
1212  2944401408  228589568   65536   0.000244  3.412917
1254  2944401408  457179136   65536   0.000244  3.198847
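The per-(N, D) selection shown above is straightforward to reproduce; here is a short pandas sketch, assuming the 1260 runs were logged with the columns in the table (the file name and column names are hypothetical).

```python
import pandas as pd

# Hypothetical log of all 1260 runs: one row per run with columns
# N, D, B, eta, and the final loss.
runs = pd.read_csv("runs.csv")

# For each (N, D) pair, keep the run that reached the lowest loss.
# The surviving DataFrame index is the original run number, i.e. the
# first column of the table above.
best = runs.loc[runs.groupby(["N", "D"])["loss"].idxmin()].sort_index()

print(best.to_string())
```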
Using the combined dataset, I fit the scaling laws again, obtaining
This is much better. The batch-size exponent on parameter count is now negligible, consistent with Li et al., 2025 and with a qualitative claim in Hoffmann et al., 2022. In addition, the learning-rate exponent on dataset size has nearly doubled, bringing it closer to the value from Li et al., 2025.
As a sanity check, let's evaluate the fits at (N, D) = (4 * 10^9, 100 * 10^9). This gives an estimated B^{*} = 659,862 and \eta^{*} = 0.00077325459, both of which seem reasonable.
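For completeness, here is one way the refit can be done: an ordinary least-squares fit in log space, following the power-law form described in the introduction. This is a sketch rather than the exact estimator behind the numbers above, and the input arrays are assumed to hold the 30 tuples from the table.

```python
import numpy as np

def fit_power_law(N, D, y):
    """Least-squares fit of log y = c0 + c1 * log N + c2 * log D.

    N, D, y: 1-D arrays of the 30 (N, D, y) tuples from the table, where
    y is either the optimal batch size B* or the optimal learning rate eta*.
    Returns (log-intercept, exponent on N, exponent on D).
    """
    X = np.column_stack([np.ones(len(N)), np.log(N), np.log(D)])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return coef

def evaluate(coef, N, D):
    """Evaluate the fitted power law at a new (N, D) point."""
    return float(np.exp(coef[0] + coef[1] * np.log(N) + coef[2] * np.log(D)))

# Usage, mirroring the sanity check above:
#   coef_B, coef_eta = fit_power_law(N, D, B_star), fit_power_law(N, D, eta_star)
#   evaluate(coef_B, 4e9, 100e9), evaluate(coef_eta, 4e9, 100e9)
```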
Summary
Thus far, I have run 1260 experiments, collected 30 (N, D, B^{*}, \eta^{*}) tuples, and fitted scaling laws similar in form to those of Li et al., 2025. I empirically observe a negligible dependence of the optimal batch size on the parameter count N, and I recover the same signs as Li et al., 2025 on all the remaining exponents. In the next post, I will collect even more data to further improve the fits.