Scaling Laws for Optimal LR and Batch Size - Part V
Revisiting the forecasts
A quandary
In our second post discussing AdamW and Muon, I trained models with 4.7 billion parameters for about 100 billion tokens at a batch size of 2,097,152 tokens, and the sweep-optimal learning rate for both AdamW and Muon was 0.0002441.
Going back to our hyperparameter scaling law series, we can plug this model size and token budget into the fitted power laws to forecast the optimal learning rate and batch size.
Thus, while the forecasted optimal batch size is close to the batch size actually used, the forecasted optimal learning rate η* differs from the sweep optimum by more than 4x.¹
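For concreteness, here is a minimal sketch of what that forecast-versus-sweep comparison looks like in code. The power-law functional form and every coefficient below are hypothetical placeholders, not the fitted values from the scaling-law posts; only the sweep values (a batch size of 2^21 tokens and a learning rate of 2^-12 ≈ 0.0002441) come from the runs above.

```python
def forecast_optima(n_params, n_tokens, coefs):
    """Evaluate hypothetical power-law fits for the optimal learning rate
    and batch size as functions of model size and token count.

    `coefs` holds placeholder prefactors and exponents, NOT the fitted
    values from the scaling-law series.
    """
    lr = coefs["lr_a"] * n_params ** coefs["lr_alpha"] * n_tokens ** coefs["lr_beta"]
    bs = coefs["bs_a"] * n_params ** coefs["bs_alpha"] * n_tokens ** coefs["bs_beta"]
    return lr, bs


# Hypothetical coefficients purely for illustration.
coefs = dict(lr_a=1.0, lr_alpha=-0.3, lr_beta=-0.1,
             bs_a=0.1, bs_alpha=0.2, bs_beta=0.3)

lr_hat, bs_hat = forecast_optima(4.7e9, 100e9, coefs)
sweep_lr, sweep_bs = 2 ** -12, 2 ** 21  # 0.000244140625 and 2,097,152 tokens

print(f"forecast / sweep LR ratio:         {lr_hat / sweep_lr:.2f}")
print(f"forecast / sweep batch-size ratio: {bs_hat / sweep_bs:.2f}")
```

With the actual fitted coefficients in place of the placeholders, the learning-rate ratio is what comes out above 4x while the batch-size ratio stays near 1.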
What might explain this?
Is it some property of the optimizer, like the Adam epsilon, which is known to complicate learning-rate forecasting (Littwin et al., 2024; Everett et al., 2024)? Might it be caused by our weight-decay scheme, which, unlike those of, say, Wortsman et al. (2023) and Porian et al. (2024), is not independent of the learning rate? Or is the parametric RMSNorm rearing its ugly head and causing unpredictable jumps in the optimum, as observed by Lingle (2024) and Blake et al. (2024)?
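To make the weight-decay question concrete: in PyTorch's AdamW, the per-step shrinkage applied to each parameter is lr * weight_decay, so sweeping the learning rate also changes the effective regularization. A fully decoupled scheme keeps the shrinkage fixed regardless of the learning rate. The sketch below shows the difference; the coupled branch follows PyTorch's documented AdamW decay term, while the decoupled branch is a common alternative, not necessarily the exact scheme used in the cited papers or in these experiments.

```python
import torch


def apply_weight_decay(param, lr, wd, coupled_to_lr=True):
    """Apply the weight-decay step separately from the gradient update.

    coupled_to_lr=True mirrors PyTorch's AdamW, where the per-step
    shrinkage is lr * wd, so changing the learning rate changes the
    effective decay. coupled_to_lr=False keeps the shrinkage fixed at wd,
    making the decay independent of the learning-rate sweep.
    """
    shrink = lr * wd if coupled_to_lr else wd
    param.data.mul_(1.0 - shrink)


p = torch.nn.Parameter(torch.randn(4, 4))
apply_weight_decay(p, lr=2 ** -12, wd=0.1, coupled_to_lr=True)    # shrink by lr * wd
apply_weight_decay(p, lr=2 ** -12, wd=1e-4, coupled_to_lr=False)  # shrink by wd only
```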
At this time, I don’t have any definitive answers to these questions.
It should be noted that the scaling law was fitted on the C4 dataset with the T5 tokenizer, not on FineWeb with the LLaMA-2 tokenizer, but the vocabulary sizes are roughly the same and the change of dataset is unlikely to explain the discrepancy. Also, my nonparametric RMSNorm ablation used C4 with the T5 tokenizer, and its parametric RMSNorm baseline had the same learning-rate optimum as the sweep optimum discussed here.
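For reference, the only difference between the two normalization variants mentioned above is a learnable per-channel gain. A minimal sketch of that distinction, using the standard RMSNorm formulation (not necessarily the exact implementation used in these experiments):

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """RMSNorm with an optional learnable gain.

    parametric=True  -> learnable per-channel scale (the parametric form).
    parametric=False -> pure normalization with no trainable parameters
                        (the nonparametric ablation).
    """

    def __init__(self, dim: int, eps: float = 1e-6, parametric: bool = True):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim)) if parametric else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square of the features.
        x = x * x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * self.weight if self.weight is not None else x
```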

