Introduction
In the previous post in this series, I compared the AdamW and Muon optimizers, training Transformer language models with 1.2 billion parameters on approximately 26 billion tokens of FineWeb. There were a few limitations, however:
Insufficient model scale. The models in the previous ablation were relatively small, so the insights may not transfer to larger models; while this risk never fully disappears, models with around 4 billion parameters would give me more confidence in the results.
Wall-clock blindness. I used the same number of training steps for AdamW and Muon, even though each Muon step takes longer to run. In general, Muon's time cost includes a per-step overhead that is constant with respect to batch size and can still be nontrivial at realistic batch sizes.
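To make the second point concrete, here is a rough back-of-the-envelope sketch. The timings are made-up placeholders rather than measurements from my runs, and the function is purely illustrative; the point is only that a fixed per-step cost does not vanish as the batch grows.

```python
# Rough sketch of why a fixed per-step optimizer overhead matters even at large batch sizes.
# The timings below are hypothetical placeholders, not measurements from my runs.

def relative_overhead(tokens_per_batch, fwd_bwd_secs_per_million_tokens, optimizer_overhead_secs):
    """Fraction of each training step spent on the optimizer's extra work.

    The forward/backward cost grows with the batch, but Muon's overhead
    (e.g., the Newton-Schulz orthogonalization) is roughly constant per step.
    """
    fwd_bwd = (tokens_per_batch / 1e6) * fwd_bwd_secs_per_million_tokens
    return optimizer_overhead_secs / (fwd_bwd + optimizer_overhead_secs)

# Example: even with ~2M-token batches, the fixed overhead remains a visible slice of step time.
print(relative_overhead(2_097_152,
                        fwd_bwd_secs_per_million_tokens=1.0,
                        optimizer_overhead_secs=0.3))  # ~0.125, i.e. ~12.5% of each step
```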
Experiment Setup
In this post, I will try a model size and token budget closer to existing pretraining research ablations like those in the OLMoE paper. I will also control for wall-clock time following Kaddour et al., 2023, and use a more conventional batch size in line with existing work on LLM pretraining such as Hoffmann et al., 2022.
Open codebase. I used my research codebase at https://github.com/lucaslingle/babel, on the ‘muon_fp32’ branch; the branch name indicates that the optimizer states are stored in float32.
Time controls. Following the method of Kaddour et al., 2023, I adjust the total number of training steps for the Muon runs so that they run in the same wall-clock time as the AdamW runs. This evaluation protocol differs slightly from previous work on Muon (including the speedruns), and it has the advantage that AdamW and Muon are compared at the same stage of their learning rate annealing schedule, namely at the end.
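As a concrete sketch of how the Muon step count is chosen, in the spirit of the Kaddour et al. adjustment: the per-step timings below are hypothetical placeholders (in practice they would come from profiling each optimizer at the target batch size), chosen only so that the example reproduces the 50,000-to-43,000 mapping used in this post. The learning rate schedule is then laid out over the adjusted step count, so both optimizers finish their decay.

```python
# Minimal sketch of wall-clock matching, in the spirit of Kaddour et al., 2023.
# Per-step timings are hypothetical placeholders; in practice, measure them with
# short profiling runs of each optimizer at the target batch size.

def matched_steps(adamw_steps, adamw_secs_per_step, muon_secs_per_step, round_to=1000):
    """Number of Muon steps fitting the same wall-clock budget as the AdamW run."""
    budget_secs = adamw_steps * adamw_secs_per_step
    raw = budget_secs / muon_secs_per_step
    return int(round(raw / round_to) * round_to)

# With a Muon step ~16% slower than an AdamW step (the ratio implied by the numbers
# in this post), 50,000 AdamW steps map to roughly 43,000 Muon steps.
print(matched_steps(50_000, adamw_secs_per_step=1.00, muon_secs_per_step=1.16))  # 43000
```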
Optimizer hyperparameters. For both optimizers, the hyperparameters are the same as in Part I, except for the batch size and number of training steps: I use 2,097,152 tokens per batch and train AdamW for 50,000 steps. The number of Muon training steps is adjusted following Kaddour et al., as discussed above, and rounds to 43,000. Both optimizers spend 2% of training on warmup steps. The learning rate for each optimizer is swept in powers of two from 2^-13 to 2^-9, inclusive.
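For reference, here is the swept configuration written out as plain Python. The variable names are illustrative rather than the babel codebase's actual config keys; the numbers are the ones stated above.

```python
# Illustrative summary of the swept settings (names are not the codebase's config keys).

TOKENS_PER_BATCH = 2_097_152          # 2**21 tokens per batch
ADAMW_STEPS = 50_000
MUON_STEPS = 43_000                   # wall-clock matched, as described above
WARMUP_FRAC = 0.02                    # 2% of total steps used for warmup

# Peak learning rates swept in powers of two, from 2^-13 to 2^-9 inclusive.
LEARNING_RATES = [2.0 ** -e for e in range(13, 8, -1)]
# -> [~1.22e-4, ~2.44e-4, ~4.88e-4, ~9.77e-4, ~1.95e-3]

def warmup_steps(total_steps, frac=WARMUP_FRAC):
    return int(total_steps * frac)

print(warmup_steps(ADAMW_STEPS), warmup_steps(MUON_STEPS))  # 1000 860
```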
Architecture. In terms of model size, I use 4.7 billion parameters by widening to d_model = 3,584 and deepening to n_layer = 28. As usual, I use a LLaMA-based architecture with SwiGLU MLP layers (d_ff = 3 * d_model) and attention heads with d_head = 128 and n_head = d_model / d_head.
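A quick back-of-the-envelope parameter count from these shapes, as a sanity check on the 4.7B figure: the sketch below counts only the attention and SwiGLU projections, ignoring norms, biases, and whether the embeddings are tied, so the exact total in the codebase will differ slightly.

```python
# Rough parameter count for the architecture above (LLaMA-style blocks with SwiGLU MLPs).
# Norms and biases are ignored; embeddings are counted separately.

d_model = 3584
n_layer = 28
d_ff = 3 * d_model           # 10,752
d_head = 128
n_head = d_model // d_head   # 28
vocab_size = 32_000

attn_params = 4 * d_model * d_model       # Q, K, V, O projections
mlp_params = 3 * d_model * d_ff           # gate, up, and down projections (SwiGLU)
block_params = attn_params + mlp_params   # ~167M per layer

body_params = n_layer * block_params      # ~4.68B across the Transformer blocks
embed_params = vocab_size * d_model       # ~0.11B

print(f"{body_params / 1e9:.2f}B + {embed_params / 1e9:.2f}B embeddings")
# -> roughly 4.68B + 0.11B, consistent with the ~4.7B figure quoted above
```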
Training Data. I use the 350B-token sample of FineWeb and the LLaMA-2 tokenizer, which has approximately 32,000 vocabulary items including special tokens.
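The batch size and step counts above pin down the total token consumption of each run; the short calculation below makes that explicit, and shows that both runs fit comfortably within the 350B-token sample without repeating data.

```python
# Total tokens consumed per run, from the batch size and step counts above.

TOKENS_PER_BATCH = 2_097_152

adamw_tokens = 50_000 * TOKENS_PER_BATCH   # ~104.9B tokens
muon_tokens = 43_000 * TOKENS_PER_BATCH    # ~90.2B tokens

print(f"AdamW: {adamw_tokens / 1e9:.1f}B, Muon: {muon_tokens / 1e9:.1f}B")
# Both budgets are well under the 350B-token FineWeb sample, so no data is repeated.
```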
Experiment Results
The best AdamW run achieved a validation loss of 2.243, and the best Muon run achieved a validation loss of 2.237, despite Muon training for fewer steps within the same wall-clock budget.
Conclusion
In this post, I increased the model size of the Transformer language models trained with AdamW and Muon to 4.7 billion parameters, and trained on around 100 billion tokens of FineWeb with a batch size of 2 million tokens. I used the evaluation method of Kaddour et al., 2023 to approximately equalize the wall-clock time for both sides of the ablation.
In this setting, Muon outperformed AdamW by a respectable margin, suggesting that Muon is a superior alternative in the large-scale setting with realistic batch sizes, even when controlling for wall-clock time. This positive finding is broadly consistent with recent papers on scaling Muon, such as Liu et al., 2025 and Shah et al., 2025.
Because Muon is new, using it for large-scale pretraining carries some risk and could backfire, but medium-scale ablations like this one should give readers some confidence that it can be used without problems and may even outperform AdamW.