Stability Across Time
On the activation growth rate of different features
Introduction
If we consider using the Lion optimizer, each update is a matrix with entries in {-1, 1}.
For tractability, it is instructive to imagine training a single layer of a neural network. Consider a single training example whose input to this layer is a vector x. For a learning rate schedule {\eta_k} and update sequence {u_k}, the change in the output activation of a neuron after T steps is

a_T - a_0 = \sum_{k=1}^{T} \eta_k (u_k \cdot x),

where u_k denotes the row of the k-th update matrix corresponding to that neuron.
Suppose that \eta_k = \eta is constant for all k, and suppose the Lion updates are produced adversarially, so as to make the resulting activation change as large as possible for the specific input x. In this case, we have

u_k = \pm \sign(x)
for all k, where \sign is applied elementwise to the vector x, and plus or minus is applied once to the entire vector.
Then the change in activation magnitude reduces to

|a_T - a_0| = \eta T ||x||_1.
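To make this concrete, here is a minimal numerical sketch of the worst case (my own illustration, not code from the post; the width, learning rate, and step count are arbitrary): a single weight row receiving sign updates aligned with a fixed input x drifts linearly in the step count.

```python
import numpy as np

# Minimal sketch: one neuron's weight row under worst-case Lion-style updates
# u_k = sign(x), all aligned with a fixed input x. (Illustrative constants.)
rng = np.random.default_rng(0)
d_model, eta, T = 512, 1e-3, 10_000

x = rng.standard_normal(d_model)
w = np.zeros(d_model)          # the neuron's weight row
a_0 = w @ x

for k in range(T):
    w += eta * np.sign(x)      # adversarial update: every step pushes w @ x upward

# The drift equals eta * T * ||x||_1, i.e. it grows linearly with the step count.
print(abs(w @ x - a_0), eta * T * np.abs(x).sum())
```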
Discussion
Since ||x||_1 = O(d_model), it makes sense to scale \eta inversely with the input dimension d_model, which recovers the maximal update parameterization (muP).
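As a concrete rule of thumb, here is a minimal sketch of that scaling (the base learning rate and base width are assumed reference values, not numbers from the post):

```python
def scaled_lr(base_lr: float, base_width: int, d_model: int) -> float:
    """muP-style rule of thumb: scale the learning rate inversely with the width."""
    return base_lr * base_width / d_model

# Example: a learning rate tuned at width 256, transferred to width 4096.
print(scaled_lr(3e-4, 256, 4096))  # 1.875e-05
```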
As for the dependence on time, a technique like weight decay can also be used to enforce a bound on the weights, and hence on the activation magnitudes (Chen et al., 2024; Xie and Li, 2024). The main problem arises from multiplicative nonlinearities in the network.
Consider a model with ReGLU nonlinearities, for example.[1] In this case, if both the feature and gating matrices are trained simultaneously, then for a single unit with feature value f_T and gate value g_T at step T we have

|f_T| \leq |f_0| + \eta T ||x||_1   and   |g_T| \leq |g_0| + \eta T ||x||_1,

where we have applied the reverse triangle inequality |a| - |b| \leq |a - b| to the worst-case change |f_T - f_0| = |g_T - g_0| = \eta T ||x||_1 derived above.

Then we have

|f_T g_T| \leq |f_0| |g_0| + \eta T ||x||_1 (|f_0| + |g_0|) + \eta^2 T^2 ||x||_1^2,

which is a quadratic polynomial in T: as T increases, the dominating term eventually shifts from the zeroth-order term to the first-order term and finally to the second-order term. Since |ReLU(f_T)| \leq |f_T|, the same bound applies to the unit's output ReLU(f_T) g_T.

Asymptotically, we have

|f_T g_T| = O(T^2).
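A quick numerical sketch of this effect (again my own illustration, with arbitrary constants): drive both weight rows of a single ReGLU unit with adversarial sign updates and the output grows quadratically in the step count.

```python
import numpy as np

# Illustrative sketch: one ReGLU unit, out = relu(w_f @ x) * (w_g @ x), with both
# weight rows receiving worst-case sign updates aligned with a fixed input x.
rng = np.random.default_rng(0)
d_model, eta = 512, 1e-3
x = rng.standard_normal(d_model)
w_f = np.zeros(d_model)        # feature row
w_g = np.zeros(d_model)        # gate row

outputs = {}
for k in range(1, 8001):
    w_f += eta * np.sign(x)
    w_g += eta * np.sign(x)
    if k in (1000, 2000, 4000, 8000):
        outputs[k] = max(w_f @ x, 0.0) * (w_g @ x)

# Doubling the step count roughly quadruples the output:
# the eta^2 T^2 ||x||_1^2 term dominates.
print({k: int(v) for k, v in outputs.items()})
```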
Simple Fixes
The simplest fix would be to divide the learning rate through by a factor of T, but this seems to be too severe a correction: it would produce O(1) activation bounds for all training runs, whereas empirically we want something like O(T).[2]
Another alternative would be to divide the learning rate in multiplicative layers by a factor of T^{0.5}, so that their outputs asymptotically grow at the same O(T) rate as the outputs of simple linear layers.
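To spell out the arithmetic behind that claim (my own check, using the product bound above): replacing \eta by \eta / T^{0.5} turns the dominant term \eta^2 T^2 ||x||_1^2 into \eta^2 T ||x||_1^2 = O(T), which matches the \eta T ||x||_1 = O(T) worst case of a plain linear layer.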
Another fix would be to take the square root of f_T and g_T before taking their product. As long as the singularity of the square root at zero is avoided during backprop, e.g., with a small \epsilon offset inside the square root, this solution would also work and ensure the asymptotic bound on the unit output is O(T).
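Here is a minimal sketch of this variant (my own illustration; the signed square root for the gate and the \epsilon value are assumptions, since the exact form is not pinned down above):

```python
import numpy as np

def sqrt_reglu(f: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Square-root-gated ReGLU variant (illustrative).

    Taking square roots before the product turns the O(T) * O(T) growth of the
    two factors into O(T^0.5) * O(T^0.5) = O(T). The eps offset removes the
    square root's infinite slope at zero, the singularity that would otherwise
    blow up gradients during backprop.
    """
    feat = np.sqrt(np.maximum(f, 0.0) + eps)       # ReLU'd feature, then sqrt
    gate = np.sign(g) * np.sqrt(np.abs(g) + eps)   # signed sqrt keeps the gate's sign
    return feat * gate
```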
Gemma 3
Another fix would be to recognize that most training runs of interest are dominated by one term or another of this polynomial in T. In that case, taking the LayerNorm of the activations afterward produces approximate invariance w.r.t. T, T^2, etc., whichever term happens to dominate.
Gemma 3 does exactly this. By taking the LayerNorm of the outputs, it produces invariance to the O(T^3) scaling in GLU MLP outputs[3] and invariance to the O(T^2) scaling in the attention layer outputs[4]. By using QK-LayerNorm, Gemma 3 also produces invariance to the O(T^2) scaling in the attention scores.[5]
(Note that this analysis is only exact when using non-parametric RMSNorm.)
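To illustrate why normalization gives this invariance, here is a small sketch (my own, assuming the non-parametric RMSNorm mentioned in the note above): whatever overall scale the pre-norm output has picked up, it cancels in the normalization.

```python
import numpy as np

def rms_norm(h: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Non-parametric RMSNorm: rescale to unit root-mean-square, no learned gain."""
    return h / np.sqrt(np.mean(h * h) + eps)

rng = np.random.default_rng(0)
h = rng.standard_normal(512)

# Stand-ins for O(1), O(T), and O(T^2) growth of the pre-norm output:
for scale in (1.0, 1e2, 1e4):
    print(np.allclose(rms_norm(scale * h), rms_norm(h), atol=1e-5))  # True
```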
Conclusion
In this post I did some back-of-the-envelope calculations to try to address the forecasting problems observed in the previous post. My conclusion is that I need to try out the Gemma 3 Transformer architecture and see whether it solves my problem.
[1] For analysis, I use ReGLU instead of SwiGLU because ReLU is positive-homogeneous.
[2] Except for the residual layers' outputs, which may be O(1) without harm.
[3] For the serial application of linear layers, a similar issue arises as with multiplicative nonlinearities, so the adversarial bound on a ReGLU MLP output is normally O(T^3), but O(1) when using Gemma's output normalization.
[4] Due to the application of W_V and W_O in succession, the adversarial bound on the attention layer output is normally O(T^2), but O(1) when using Gemma's output normalization.
[5] Due to the application of W_Q and W_K followed by the dot-product summation, the adversarial bound on the QK scores is normally O(T^2), but O(1) when using QK-LayerNorm.

