Introduction
Welcome back. In the last post, I drew inspiration from prior work (Bi et al., 2024; Bjorck et al., 2024; Porian et al., 2024; Li et al., 2025) to fit scaling laws for the optimal batch size B^{*} and the optimal learning rate \eta^{*} in terms of the parameter count N and the training token count D.
I obtained the fitted scaling laws
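Schematically, both quantities are modeled as power laws in N and D, in the style of Li et al., 2025, with B_0, \eta_0 and the Greek exponents standing for the fitted coefficients:

B^{*}(N, D) = B_0 * N^{\alpha} * D^{\beta}
\eta^{*}(N, D) = \eta_0 * N^{\gamma} * D^{\delta}

In this notation, the first observation below concerns \alpha and the second concerns \delta.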
There seem to be two strange phenomena:
The fitted optimal batch size has a small negative exponent on the model size N, so larger models are predicted to train best with a smaller batch size at any given token budget D.
The fitted optimal learning rate has a small positive exponent on the token budget D, so runs with a larger data budget are predicted to use a higher learning rate. Interestingly, this is consistent with Li et al., 2025, which in fact fits an even larger positive exponent.
Experiment Planning
The simplest change to improve the fit of the scaling laws is to add more (N, D) pairs. Let's try a grid of (N, D) pairs instead of only sampling along iso-FLOP curves. Since the new (N, D) pairs form a grid, the total cost of the new runs can be factorized (using the standard approximation of 6ND training FLOPs per run):

C_{new} = \sum_i \sum_j 6 G N_i D_j = 6 G (\sum_i N_i) (\sum_j D_j),

where G is the number of (B, \eta) grid points per (N, D) pair, and as previously G = 42. If the new experiments reuse the same five parameter counts N_i as before, then

\sum_i N_i = 46,006,272 + 109,051,904 + 368,050,176 + 872,415,232 + 2,944,401,408 = 4,339,924,992.

If the compute constraint for this round of experiments is C = 10^21 FLOPs, then the summed token count is bounded by

\sum_j D_j <= C / (6 G \sum_i N_i) = 10^21 / (6 * 42 * 4,339,924,992) ≈ 914,360,035 tokens.

Thus if D_j = 914,360,035 * 0.5^j (j >= 1), the D_j form a geometric series summing to at most 914,360,035 tokens, and the total cost is bounded by C = 10^21 FLOPs. I'll go run this grid of (N_i, D_j) pairs and report back.
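To double-check the arithmetic, here is a minimal Python sketch of the budget calculation under the same assumptions (roughly 6ND training FLOPs per run, G = 42, and the five N_i listed in the results table below); the variable names are illustrative.

```python
# Budget check for the planned grid, assuming ~6 * N * D training FLOPs per run.
G = 42            # (B, eta) grid points per (N, D) pair
C = 1e21          # compute budget for this round, in FLOPs

# The same five parameter counts N_i used for the earlier iso-FLOP runs.
N_list = [46_006_272, 109_051_904, 368_050_176, 872_415_232, 2_944_401_408]

# The cost factorizes as 6 * G * sum(N_i) * sum(D_j), so the token count
# summed over all D_j is capped at:
D_budget = C / (6 * G * sum(N_list))
print(f"sum_j D_j <= {D_budget:,.0f} tokens")   # ~914,360,035

# Halving schedule D_j = D_budget * 0.5**j for j >= 1; the geometric series
# sums to at most D_budget, so the cap is respected for any number of D_j.
D_list = [round(D_budget * 0.5 ** j) for j in range(1, 4)]
print(D_list)   # close to the three grid D values in the results table
```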
Experiment Results
The results below combine the iso-FLOP curves' data with the new grid data. In total, 1260 runs were completed, and the best (lowest-loss) run for each (N, D) pair is shown; a short sketch of this selection step follows the table.
Run   N           D           B       \eta      Loss
26    46006272    2156134400  262144  0.003906  3.270941
74    109051904   909639680   131072  0.001953  3.281688
122   368050176   269484032   65536   0.000977  3.475778
163   872415232   113704960   65536   0.000488  3.721318
204   2944401408  33685504    65536   0.000244  4.739540
236   46006272    4312530944  262144  0.003906  3.192060
284   109051904   1819410432  131072  0.001953  3.162354
325   368050176   538968064   131072  0.000977  3.275321
373   872415232   227409920   65536   0.000488  3.473123
414   2944401408  67371008    65536   0.000244  4.037603
439   46006272    8625061888  524288  0.003906  3.126466
488   109051904   3638820864  262144  0.003906  3.052100
535   368050176   1078067200  131072  0.000977  3.112425
583   872415232   454819840   65536   0.000488  3.269213
624   2944401408  134742016   65536   0.000244  3.620115
669   46006272    114294784   65536   0.001953  3.923660
711   46006272    228589568   65536   0.001953  3.712386
753   46006272    457179136   65536   0.001953  3.548782
794   109051904   114294784   65536   0.000977  3.872678
836   109051904   228589568   65536   0.000977  3.634435
872   109051904   457179136   131072  0.001953  3.431678
920   368050176   114294784   65536   0.000977  3.809358
962   368050176   228589568   65536   0.000977  3.524963
1004  368050176   457179136   65536   0.000977  3.319148
1045  872415232   114294784   65536   0.000488  3.718716
1087  872415232   228589568   65536   0.000488  3.471644
1129  872415232   457179136   65536   0.000488  3.265596
1170  2944401408  114294784   65536   0.000244  3.694377
1212  2944401408  228589568   65536   0.000244  3.412917
1254  2944401408  457179136   65536   0.000244  3.198847
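The per-(N, D) selection shown above is straightforward to reproduce; here is a short pandas sketch, assuming the 1260 runs were logged with the columns in the table (the file name and column names are hypothetical).

```python
import pandas as pd

# Hypothetical log of all 1260 runs: one row per run with columns
# N, D, B, eta, and the final loss.
runs = pd.read_csv("runs.csv")

# For each (N, D) pair, keep the run that reached the lowest loss.
# The surviving DataFrame index is the original run number, i.e. the
# first column of the table above.
best = runs.loc[runs.groupby(["N", "D"])["loss"].idxmin()].sort_index()

print(best.to_string())
```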
Using the combined dataset, I fit the scaling laws again, obtaining
This is much better. The batch-size exponent on parameter count is now negligible, consistent with Li et al., 2025 and with a qualitative claim in Hoffmann et al., 2022. In addition, the learning-rate exponent on dataset size has nearly doubled, bringing it closer to the value from Li et al., 2025.
As a sanity check, let's evaluate the fits at (N, D) = (4 * 10^9, 100 * 10^9). This gives an estimated B^{*} = 659,862 and \eta^{*} = 0.00077325459, both of which seem reasonable.
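For completeness, here is one way the refit can be done: an ordinary least-squares fit in log space, following the power-law form described in the introduction. This is a sketch rather than the exact estimator behind the numbers above, and the input arrays are assumed to hold the 30 tuples from the table.

```python
import numpy as np

def fit_power_law(N, D, y):
    """Least-squares fit of log y = c0 + c1 * log N + c2 * log D.

    N, D, y: 1-D arrays of the 30 (N, D, y) tuples from the table, where
    y is either the optimal batch size B* or the optimal learning rate eta*.
    Returns (log-intercept, exponent on N, exponent on D).
    """
    X = np.column_stack([np.ones(len(N)), np.log(N), np.log(D)])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return coef

def evaluate(coef, N, D):
    """Evaluate the fitted power law at a new (N, D) point."""
    return float(np.exp(coef[0] + coef[1] * np.log(N) + coef[2] * np.log(D)))

# Usage, mirroring the sanity check above:
#   coef_B, coef_eta = fit_power_law(N, D, B_star), fit_power_law(N, D, eta_star)
#   evaluate(coef_B, 4e9, 100e9), evaluate(coef_eta, 4e9, 100e9)
```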
Summary
Thus far, I have run 1260 experiments, collected 30 (N, D, B^{*}, \eta^{*}) tuples, and fitted scaling laws similar in form to those of Li et al., 2025. I empirically observe a negligible dependence of the optimal batch size on the parameter count N, and I recover the same signs as Li et al., 2025 on all the remaining exponents. In the next post, I will collect even more data to further improve the fits.