AdamW or Muon?
A small-scale ablation study
Introduction
Recently, Jordan et al., 2024 introduced Muon, an optimizer with the potential to replace AdamW as the standard for training neural network models.
Muon is designed to operate on 2-dimensional weights, replacing each weight's gradient (a matrix) with its nearest semi-orthogonal matrix. In practice, the orthogonalization is performed approximately via a few Newton-Schulz iterations, and the gradient is first augmented with Nesterov-style momentum.
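For concreteness, here is a minimal JAX sketch of the orthogonalization step, using the quintic Newton-Schulz coefficients from the reference Muon implementation of Jordan et al., 2024; the function name, dtype handling, and iteration count here are illustrative rather than taken from my codebase.

```python
import jax.numpy as jnp

def newton_schulz(g: jnp.ndarray, steps: int = 5, eps: float = 1e-7) -> jnp.ndarray:
    # Approximately map a gradient matrix to its nearest semi-orthogonal matrix.
    # Coefficients are the quintic iteration from Jordan et al., 2024.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (jnp.linalg.norm(g) + eps)   # Frobenius norm <= 1 implies spectral norm <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * (xxt @ xxt)) @ x
    return x.T if transpose else x
```

After a handful of iterations, the singular values of the output are pushed toward 1 rather than set exactly to 1, which is the spectral property discussed in the next section.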
Background on Muon
Non-diagonal preconditioner. AdamW is a very popular optimizer that combines momentum with a diagonal preconditioner, scaling each parameter's update elementwise by an estimate of the gradient's second moment. Recent works such as Shampoo and SOAP have demonstrated the practicality of non-diagonal preconditioners, and Muon is the latest example of such an optimizer.
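For contrast, here is a minimal sketch of the AdamW update for a single parameter tensor, using the hyperparameters listed later in this post; note that the preconditioner 1 / (sqrt(v_hat) + eps) acts elementwise, i.e. it is diagonal.

```python
import jax.numpy as jnp

def adamw_update(param, grad, m, v, t, *, lr, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    # AdamW: momentum plus an elementwise (diagonal) preconditioner.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                      # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    update = m_hat / (jnp.sqrt(v_hat) + eps)       # diagonal preconditioning
    param = (1 - lr * wd) * param - lr * update    # decoupled weight decay
    return param, m, v
```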
Spectral norm control. Muon is also of interest because it produces weight updates with a controlled spectral norm throughout training. In particular, the Newton-Schulz iterations produce a matrix with singular values generally in the range (0.7, 1.3). Spectral properties of the updates and weights are relevant to training stability (Yang et al., 2023) and generalization (Yunis et al., 2024).
Practical efficiency. In practice, Muon is reported to be more efficient than Adam: Liu et al., 2025 find that it acts as a roughly 2x compute multiplier, and Shah et al., 2025 find that it performs better than AdamW across a wide range of tested batch sizes. In this post, I perform my own ablation of AdamW and Muon to see how they compare.
Experiment Setup
Open Codebase. I used my research codebase at https://github.com/lucaslingle/babel with branch ‘muon_fp32_small’.
Distributed Muon. I implemented the Muon optimizer by adapting Appendix G of Shah et al., 2025 to support sharded parameters and optimizer state. I also applied the adjustment from Liu et al., 2025, scaling each orthogonalized update by 0.2 * sqrt(max(d_out, d_in)) for a weight matrix of shape (d_out, d_in), so that its RMS roughly matches that of a typical AdamW update. This adjustment results in a learning rate optimum close to the one for AdamW, which is convenient because the embeddings, unembeddings, and RMSNorm scale parameters are optimized with AdamW.
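A minimal sketch of the resulting per-matrix Muon step is below, assuming an orthogonalize function such as the Newton-Schulz sketch above; the names and exact ordering are illustrative, and the sharded all-gather and re-shard logic of the distributed version is omitted.

```python
import jax.numpy as jnp

def muon_step(param, grad, momentum, *, lr, orthogonalize, mu=0.95, weight_decay=0.1):
    # One Muon update for a single weight matrix of shape (d_out, d_in).
    momentum = mu * momentum + grad                  # momentum accumulation
    update = orthogonalize(grad + mu * momentum)     # Nesterov-style lookahead, then orthogonalize
    d_out, d_in = param.shape
    update = 0.2 * jnp.sqrt(max(d_out, d_in)) * update   # RMS-matching scale from Liu et al., 2025
    param = (1.0 - lr * weight_decay) * param - lr * update  # decoupled weight decay, as in AdamW
    return param, momentum
```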
Architecture. I use a transformer architecture based on LLaMA, with n_layer = 18, d_model = 2304, d_ff = 3 * d_model, d_head = 128, and n_head = d_model / d_head. The SwiGLU nonlinearity is used in the MLP layers and RoPE in the attention layers. This setup yields models with 1.2 billion parameters.
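In rough pseudocode, the configuration corresponds to something like the following; the dataclass and field names are illustrative, not the actual schema of the babel codebase.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Hypothetical config mirroring the setup described above.
    n_layer: int = 18
    d_model: int = 2304
    d_head: int = 128
    vocab_size: int = 32_000          # approximate LLaMA-2 tokenizer size

    @property
    def d_ff(self) -> int:
        return 3 * self.d_model       # SwiGLU MLP width = 6912

    @property
    def n_head(self) -> int:
        return self.d_model // self.d_head  # 18 heads
```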
Optimization. I use a batch size of 262,144 tokens and train for 100,000 steps, with 2,000 warmup steps followed by cosine decay. For AdamW, I use beta1 = 0.9, beta2 = 0.95, eps = 10^-8, and weight decay lambda = 0.1. For Muon, I use Nesterov momentum with mu = 0.95 and weight decay lambda = 0.1. The optimizer learning rates are swept in powers of two between 2^-9 and 2^-13, inclusive.[1]
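For reference, the AdamW side of the sweep looks roughly like the following optax sketch; the actual schedule construction and parameter partitioning in the codebase may differ, and weight decay masking is omitted here.

```python
import optax

warmup_steps, total_steps = 2_000, 100_000
peak_lrs = [2.0 ** -e for e in range(9, 14)]   # 2^-9 ... 2^-13, one run per value

def make_adamw(peak_lr: float) -> optax.GradientTransformation:
    # Linear warmup followed by cosine decay, matching the schedule described above.
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0,
        peak_value=peak_lr,
        warmup_steps=warmup_steps,
        decay_steps=total_steps,
    )
    return optax.adamw(schedule, b1=0.9, b2=0.95, eps=1e-8, weight_decay=0.1)
```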
Dtypes. I use float32 dtype for the optimizer states.[2] As recommended by Rae et al., 2021, the model parameters are stored in float32 dtype, while the gradients and activations are stored in bfloat16. The multiply operations in matrix multiplications are performed in bfloat16; addition operations are performed in float32 and cast to bfloat16 when writing to HBM.
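This matmul precision policy can be expressed in JAX roughly as follows; this is a generic sketch, not code from the codebase.

```python
import jax.numpy as jnp

def mixed_precision_matmul(x: jnp.ndarray, w: jnp.ndarray) -> jnp.ndarray:
    # bfloat16 multiplies with float32 accumulation, cast back to bfloat16 on output.
    x_bf16 = x.astype(jnp.bfloat16)
    w_bf16 = w.astype(jnp.bfloat16)   # params are stored in float32, cast for the matmul
    out_f32 = jnp.matmul(x_bf16, w_bf16, preferred_element_type=jnp.float32)
    return out_f32.astype(jnp.bfloat16)
```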
Training Data. I use the 350B token sample of FineWeb, and the LLaMA-2 tokenizer, which has approximately 32,000 vocabulary items including special tokens.
Experiment Results
At the optimal learning rate, the lowest validation loss was 2.516 for AdamW and 2.514 for Muon.
Summary
I conducted small-scale ablation studies of AdamW versus Muon, using models with 1.2 billion parameters trained on approximately 26 billion tokens of FineWeb. In the tested setting, Muon achieved a slightly lower validation loss than AdamW (2.514 vs. 2.516), suggesting it may be a viable alternative to this classic optimizer, though the margin at this scale is small.
There are several limitations to this ablation study. First, it ignores wall-clock time and may not allocate equal time to both optimizers. Second, it does not use model sizes or token budgets representative of LLM pretraining. Third, it uses a batch size that was tuned for AdamW and may be suboptimal for Muon. I will try to address these limitations in follow-up posts.
Footnotes
[1] If either optimizer's sweep optimum lies on the edge of the grid, the grid is expanded for that optimizer until the loss stops decreasing.
[2] Shah et al., 2025 seem to suggest in Appendix C that they use float32 dtype. Also, in preliminary runs, I observed occasional NaN losses when using bfloat16 dtype for the Muon momentum variable.

