Introduction
Recent studies suggest that optimal learning rates and batch sizes follow power laws in compute, and more specifically in the parameter count and training token count (Bi et al., 2024; Bjorck et al., 2024; Porian et al., 2024; Li et al., 2025). Such scaling laws are used in practice for training Transformer LLMs (Bi et al., 2024; Yang et al., 2024).
The general form of these scaling laws is

B^{*}(N, D) = c_B \cdot N^{\alpha_B} \cdot D^{\beta_B},    \eta^{*}(N, D) = c_\eta \cdot N^{\alpha_\eta} \cdot D^{\beta_\eta},

where the exponents and the proportionality constants c_B, c_\eta are learned, and may depend on the training data and tokenizer. These formulas can be fitted by linear regression in log-log space, i.e., by regressing the logarithms of the observed optima on the logarithms of N and D.
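To make the fitting procedure concrete, here is a minimal numpy sketch of this regression. The array names (N, D, B_star, eta_star) are placeholders of mine for the collected (N, D) pairs and their grid-search optima; I use base-2 logarithms, which is convenient with a powers-of-two hyperparameter grid (the fitted exponents are the same for any base, only the intercept changes).

```python
import numpy as np

def fit_loglog(N, D, target):
    """OLS fit of log2(target) against [1, log2(N), log2(D)]."""
    X = np.column_stack([np.ones(len(N)), np.log2(N), np.log2(D)])
    coef, *_ = np.linalg.lstsq(X, np.log2(target), rcond=None)
    return coef  # [intercept, exponent on N, exponent on D]

# Usage once the sweep data is collected, e.g.:
#   coef_B = fit_loglog(N, D, B_star)
#   coef_eta = fit_loglog(N, D, eta_star)
```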
Experiment Basics
In this post, I will use the LLaMA architecture (Touvron et al., 2023) with SwiGLU and d_ff = 3 d_model, and attempt to fit these scaling laws. As inputs to the regression, I will use (N, D) pairs, representing parameter count and training token count, generated from iso-FLOP curves. For each (N, D) pair I will sweep a grid of (B, \eta) values, so that I can use the optimum (B^{*}, \eta^{*}) as the regression target.
For training the models, I will use the AdamW optimizer with \beta_{1} = 0.9, \beta_{2} = 0.95, \epsilon = 10^{-8}, and weight decay \lambda = 0.1. Gradients are clipped at norm 1.0. I will use 2% of the training steps for learning rate warmup, followed by cosine decay.
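For concreteness, below is a minimal PyTorch sketch of this optimizer and schedule. The names model, peak_lr, and total_steps are placeholders, and decaying all the way to zero at the end of training is my assumption, since the post does not pin down the final learning rate.

```python
import math
import torch

def make_optimizer_and_scheduler(model, peak_lr, total_steps):
    # AdamW with the hyperparameters above; \lambda = 0.1 is the weight decay.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr,
        betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
    )
    warmup_steps = max(1, int(0.02 * total_steps))  # 2% of steps for warmup

    def lr_scale(step):
        # Linear warmup to peak_lr, then cosine decay (to zero -- an assumption).
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
    return optimizer, scheduler

# In the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```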
Experiment Details
I want to generate K regression examples from an iso-FLOP curve while budgeting some S FLOPs in total for sampling this curve. Assuming G total (B, \eta)-grid points per (N, D) pair, and given an architecture with d_model = 128 n_layer (which holds for all models used below),
I can iterate over K choices of n_layer, compute N = 13 * d_model^2 * n_layer, and derive the number of training tokens as D = S/(6GKN), so that each of the K pairs gets an equal S/K share of the budget spread over its G runs of 6ND FLOPs each, and (N, D) lies on the current iso-FLOP curve. This will yield GK distinct runs totaling S FLOPs as desired. In practice, S can be set as S = 0.5C, 0.25C, etc., so that the total FLOP budget aggregated across all iso-FLOP curves is bounded by a user-defined constant C.
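Here is a minimal sketch of that sampling procedure. The 13 d_model^2 per-layer count follows from 4 d_model^2 for attention plus 9 d_model^2 for the SwiGLU MLP with d_ff = 3 d_model; the rounding of D to actual step boundaries is left out.

```python
def isoflop_curve(S, G, n_layer_choices):
    """Sample K (N, D) pairs whose G-point (B, eta) grids together cost S FLOPs."""
    K = len(n_layer_choices)
    runs = []
    for n_layer in n_layer_choices:
        d_model = 128 * n_layer                 # aspect ratio used above
        N = 13 * d_model**2 * n_layer           # non-embedding parameters
        # Each pair gets an S/K share of the budget, spread over G runs of 6*N*D FLOPs.
        D = S / (6 * G * K * N)
        runs.append((n_layer, d_model, N, D))
    return runs

# Example: one curve at S = 0.5 * C with C = 1e21 and G = 42 grid points.
print(isoflop_curve(S=0.5e21, G=42, n_layer_choices=[6, 8, 12, 16, 24]))
```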
I will use C = 10^21 in this post, and let n_layer = 6, 8, 12, 16, 24, yielding K = 5 models between N = 46 million and N = 2.9 billion parameters. For the (B, \eta)-grid search, I will use batch sizes in powers of two between 65536 and 2097152 tokens, and learning rates in powers of two between 0.0078125 (2^{-7}) and 0.0001220703125 (2^{-13}), yielding G = 6 * 7 = 42 grid points.
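Spelled out, the grid is the cross product of 6 batch sizes and 7 learning rates:

```python
batch_sizes = [2**k for k in range(16, 22)]       # 65536 ... 2097152 tokens
learning_rates = [2.0**-k for k in range(7, 14)]  # 0.0078125 ... 0.0001220703125
grid = [(B, lr) for B in batch_sizes for lr in learning_rates]
assert len(grid) == 42                            # G = 6 * 7
```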
As a simplifying choice, I omit the tuning of the AdamW \beta_{2} parameter for small batch sizes suggested by Porian et al. (2024). I also do not include the unembeddings in the calculation of model size N when computing iso-FLOP curves and hyperparameter scaling laws. Finally, I allow the hyperparameter optima to depend on the token budget D, which is closer to Bjorck et al. (2024) and Li et al. (2025).
Experiment Results
The collected data for the experiments were as follows:
N (params) D (tokens) B^{*} (tokens) \eta^{*} Loss
46006272 2156134400 262144 0.003906 3.270941
46006272 4312530944 262144 0.003906 3.192060
46006272 8625061888 524288 0.003906 3.126466
109051904 909639680 131072 0.001953 3.281688
109051904 1819410432 131072 0.001953 3.162354
109051904 3638820864 262144 0.003906 3.052100
368050176 269484032 65536 0.000977 3.475778
368050176 538968064 131072 0.000977 3.275321
368050176 1078067200 131072 0.000977 3.112425
872415232 113704960 65536 0.000488 3.721318
872415232 227409920 65536 0.000488 3.473123
872415232 454819840 65536 0.000488 3.269213
2944401408 33685504 65536 0.000244 4.739540
2944401408 67371008 65536 0.000244 4.037603
2944401408 134742016 65536 0.000244 3.620115
When I fit the scaling laws with OLS in log-log space, I got the following coefficients for the log-B^{*} and log-\eta^{*} rules:

        log B^{*}    log \eta^{*}
const    10.803668      3.849702
log_N    -0.094118     -0.588770
log_D     0.299974      0.099994
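Read as power laws, this says the optimal batch size grows roughly as D^{0.3} while shrinking slowly with N, and the optimal learning rate falls steeply with N (roughly N^{-0.59}) while growing slowly with D. The intercepts appear to be consistent with base-2 logarithms, which matches the powers-of-two grid (the exponents themselves are base-independent), so a prediction sketch looks like the following; the sanity check against the first row of the results table above is my own.

```python
def predict(N, D, intercept, exp_N, exp_D):
    # Power-law form implied by the log-log fit (base-2 intercept).
    return 2.0**intercept * N**exp_N * D**exp_D

B_opt   = predict(46006272, 2156134400, 10.803668, -0.094118, 0.299974)
eta_opt = predict(46006272, 2156134400, 3.849702, -0.588770, 0.099994)
# B_opt is roughly 2.1e5 tokens and eta_opt roughly 0.0038, close to the
# grid optima 262144 and 0.003906 observed for that (N, D) pair.
```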
Discussion
These exponents differ from those reported in the step law paper (Li et al., 2025). It is possible that my setup does not use a fine enough grid or long enough runs, and other factors such as the model architecture may also differ. I may need to add runs sampled from an (N, D) grid rather than only from iso-FLOP curves, and to expand the (B, \eta)-grid range. Finally, it is possible that including the unembeddings in the computation of N would improve consistency with other works.
I will think about this and read up further before designing a follow-up experiment and attempting to run things again. For reference, the data collection took roughly one day, so this was a relatively cheap learning experience. Stay tuned!