see also:

  • Background
  • Claude Responses
  • Misc References
  • Key Observations
  • Setting Up Notation

Dimensional Analysis of Scaling Laws

Empirically (Chinchilla, Gopher), the compute-optimal training budget has been observed to scale roughly proportionally to both the size of the model (number of parameters, $N$) and the size of the dataset (number of tokens, $D$), i.e. $C \propto ND$, or $C \approx 6ND$. Using floating point operations (FLOPs) as our measure of compute, this implies the scaling factor is expressed in units of FLOPs per parameter per token, which is analogous to a kind of computational density. According to Landauer's principle, there is a theoretical limit to the minimal energy required to irreversibly change one bit of information: $E \ge k_B T \ln 2$. A FLOP can be interpreted as a potential opportunity to change the value of a parameter, and therefore FLOPs provide a natural upper bound on the information transmitted from the data to the model during training, and Landauer's principle gives us a lower bound on the energy required to accomplish this information transmission (von Neumann-Landauer principle?)
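A quick numeric sketch of both bounds. The model and token counts below are illustrative (roughly Chinchilla-scale), and the $C \approx 6ND$ constant is the usual rule of thumb, not a measured value:

```python
import math

def chinchilla_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Approximate training compute via the C ~ 6*N*D rule of thumb."""
    return flops_per_param_token * n_params * n_tokens

def landauer_bound_joules(n_bits, temp_kelvin=300.0):
    """Landauer limit: minimum energy to irreversibly change n_bits."""
    k_b = 1.380649e-23  # Boltzmann constant, J/K
    return n_bits * k_b * temp_kelvin * math.log(2)

# Illustrative run: 70B parameters trained on 1.4T tokens.
flops = chinchilla_flops(70e9, 1.4e12)
print(f"training compute: {flops:.2e} FLOPs")  # ~5.88e23

# Even rewriting all 32 bits of every parameter once sits at a tiny energy floor:
print(f"Landauer floor: {landauer_bound_joules(32 * 70e9):.2e} J")
```

The gap between the Landauer floor and the actual energy cost of that many FLOPs is many orders of magnitude, which is what makes the bound a conceptual anchor rather than an engineering constraint.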

yadda yadda… incorporating time into the discussion… is analogous to a reaction rate term.

Information Absorption Efficiency

Consider some model with parameters $\theta$, such that $\theta_0$ denotes the parameters at initialization, $\theta_t$ denotes the parameters at step $t$ of training, and $\theta_T$ denotes the parameters at convergence. Consider the mutual information between the parameters and the training data $D$, $I(\theta_t; D)$. For notational simplicity, let's call this $I_t$, to be read as the information contained in the parameters, with respect to the dataset. The theoretical maximum information gain from training is then $\Delta I = I_T - I_0$. As $I_0 = 0$ (a random initialization is independent of the data), for convergent training we expect $\Delta I = I_T$. Marginalizing over the data, we can restate this observation in terms of the information entropy of the parameters: $I_t = H(\theta_t) - H(\theta_t \mid D) \le H(\theta_t)$.
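The identity $I(\theta; D) = H(\theta) - H(\theta \mid D)$ is easy to check numerically on a toy discrete joint distribution. The 2x2 distribution below is invented purely for illustration:

```python
import numpy as np

# Toy joint distribution p(theta, d): 2 parameter states x 2 dataset states.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def entropy(q):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_theta = p.sum(axis=1)            # marginal over the data
p_d = p.sum(axis=0)                # marginal over the parameters

h_theta = entropy(p_theta)         # H(theta)
h_cond = entropy(p.ravel()) - entropy(p_d)  # H(theta|D) = H(theta,D) - H(D)
mi = h_theta - h_cond              # I(theta; D) = H(theta) - H(theta|D)
print(f"I(theta; D) = {mi:.3f} bits (<= H(theta) = {h_theta:.3f} bits)")
```

The printed mutual information is strictly below $H(\theta)$, matching the inequality above: the parameters can never absorb more information about the data than their own entropy allows.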

Let $\dot{I}_t$ denote the instantaneous rate of information uptake by the model at some training step $t$, and let $\dot{C}_t$ denote the rate of computational work (FLOPs expended per step). Let $\eta_t$ denote the information absorption efficiency of the parameters at training step $t$, per unit compute. Then $\dot{I}_t = \eta_t \dot{C}_t$. This relationship can be restated in the form of a proportionality constant which relates the efficiency with which the system converts computational work into information absorption: $\eta_t = \dot{I}_t / \dot{C}_t$. Alternatively: $I_T = \int_0^T \eta_t \, \dot{C}_t \, dt$. Training convergence, i.e. effective information saturation, occurs at some critical step $t^*$, above which $\eta_t < \epsilon$ for some small positive $\epsilon$. Via dimensional analysis, we see that $\eta$ is in units of bits per FLOP.
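Combining this with the earlier bounds gives a cheap sanity check on $\eta$: the average efficiency over a whole training run cannot exceed the total representable parameter bits divided by the total FLOPs. A sketch, using the same illustrative $C \approx 6ND$ assumption as above:

```python
def eta_upper_bound(n_params, n_tokens, bits_per_param=32,
                    flops_per_param_token=6.0):
    """Upper bound on average bits absorbed per FLOP: total parameter bits
    over total training FLOPs. With C ~ 6*N*D the parameter count cancels,
    so the bound depends only on dataset size: bits_per_param / (6 * D)."""
    total_bits = bits_per_param * n_params
    total_flops = flops_per_param_token * n_params * n_tokens
    return total_bits / total_flops

# 32-bit params, 1.4T tokens -> ~3.8e-12 bits/FLOP, independent of model size.
print(eta_upper_bound(70e9, 1.4e12))
print(eta_upper_bound(7e9, 1.4e12))  # same bound: N cancels
```

That the bound is independent of $N$ is a direct consequence of compute scaling with both $N$ and $D$: adding parameters adds capacity and compute cost in equal proportion.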

Impacts of Gradient Accumulation on Information Efficiency

Treating the fitting procedure as a communication channel that transmits information from the data into the parameters, the channel capacity is a function of the (loss) gradient utilized for a given parameter update step. If an update step results in no parameters being updated, then the data processed in that step did not transmit any new information into the model parameters. Concretely, we can interpret the number of bits that change during a parameter update as a measure of the information gained during that update, and consequently the magnitude of the gradient serves as a measure of the information transmitted in that data processing step.

Consider a batch of samples $\{x_i\}_{i=1}^{B}$. Let $g_i$ denote the gradient with respect to a given sample, and let its magnitude (L2 norm) be given by $\lVert g_i \rVert$. Given some fixed learning rate, we want to know how the information transmitted per token (i.e. the per-sample normalized magnitude of the gradient) is affected by our choice of batch size. For the moment, let's assume noiseless data, i.e. our goal in this regime should be to transmit all of the information available in any given token. We'll extend this analysis to noisy data next.

With a batch size of 1, we have normal SGD. Each sample has its own gradient computed independently, and the per-sample normalized gradient magnitude is just the average magnitude of the gradients: $\frac{1}{B}\sum_i \lVert g_i \rVert$. With a batch size of $B$, we pool the respective per-sample gradients before we have a chance to observe their magnitudes, yielding a single gradient with per-sample magnitude $\lVert \frac{1}{B}\sum_i g_i \rVert$. By Jensen's inequality (the norm is convex, so $\lVert \mathbb{E}[g] \rVert \le \mathbb{E}[\lVert g \rVert]$), it is necessarily the case that $\lVert \frac{1}{B}\sum_i g_i \rVert \le \frac{1}{B}\sum_i \lVert g_i \rVert$. Further, we can quantify the gap here by invoking the central limit theorem (TODO), which reveals that the magnitude of a summed batch gradient is expected to scale on the order of $\sqrt{B}$, so the per-sample normalized magnitude scales as $1/\sqrt{B}$, which is then interpretable as our per-sample efficiency in the batch training regime. Now, back to the real world, where the data is noisy.
Let $\tilde{x} = x + \varepsilon$ denote a noisy sample for some noise $\varepsilon$, such that the "noiseless" sample is still denoted $x$ as above. For some noisy observation, we observe a gradient $\tilde{g} = g + \varepsilon_g$, where $\varepsilon_g$ is the noise induced in gradient space.
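The Jensen gap and the $1/\sqrt{B}$ scaling are easy to verify by simulation. The sketch below uses synthetic zero-mean "gradients" (the noise-dominated extreme, where pooling before measuring the magnitude costs the most); the dimensions and batch sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 1_000, 4_096

# Synthetic zero-mean per-sample gradients (pure noise, no shared signal).
grads = rng.normal(size=(n_samples, dim))
mean_norm = np.linalg.norm(grads, axis=1).mean()  # batch-size-1 baseline

ratios = {}
for batch_size in (1, 16, 256):
    pooled = grads.reshape(-1, batch_size, dim).mean(axis=1)  # batch gradients
    per_sample = np.linalg.norm(pooled, axis=1).mean()
    ratios[batch_size] = per_sample / mean_norm  # Jensen: always <= 1

for b, r in ratios.items():
    # CLT prediction: the ratio decays like 1/sqrt(B).
    print(f"B={b:4d}  observed={r:.4f}  predicted={1/np.sqrt(b):.4f}")
```

With correlated gradients (a shared signal component across samples) the decay would be slower than $1/\sqrt{B}$, which is exactly the regime where batching stops being free.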

Communication Theory of Model Training

We can treat a parameter update as a message transmitted from the data to the parameters. The communication channel here is the gradient. The channel gain is the learning rate, and the channel capacity has a strict upper bound: the number of bits required to express the model parameters. That is, the maximum information that could be communicated in a single update would flip every bit in the model's learnable parameters. The gradient dimension therefore operates as a kind of information bottleneck, thresholding the information that could feasibly be transmitted. Above some threshold batch size (strictly smaller than the number of parameters), this channel capacity becomes saturated. In practice, gradients become increasingly sparse as training progresses (i.e. magnitude increasingly concentrated in fewer components, i.e. fewer parameter bits change per update as training progresses). The channel still experiences a kind of saturation of useful information, which interacts with the gradient precision.
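The "bits changed per update" quantity is directly measurable: reinterpret two float32 parameter snapshots as raw bit patterns, XOR them, and popcount. A minimal sketch (the toy parameter values are arbitrary):

```python
import numpy as np

def bits_changed(params_before, params_after):
    """Count differing bits between two float32 parameter snapshots by
    XOR-ing their raw bit patterns and popcounting the result."""
    a = np.asarray(params_before, dtype=np.float32).view(np.uint32)
    b = np.asarray(params_after, dtype=np.float32).view(np.uint32)
    diff = a ^ b
    return int(np.unpackbits(diff.view(np.uint8)).sum())

theta = np.array([0.5, -1.0, 2.0], dtype=np.float32)
theta_new = theta.copy()
print(bits_changed(theta, theta_new))  # 0: no update, no information transmitted
theta_new[0] = 0.25                    # nudge one parameter
print(bits_changed(theta, theta_new))  # 2: 0x3F000000 ^ 0x3E800000 has 2 set bits
```

Tracking this count per step over a real training run would give an empirical view of the channel-saturation claim above: early steps should flip many bits, late steps few.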

Analogy to Chemical Kinetics

  • Changing a bit is analogous to a “reaction”
  • Bit precision imposes a lower bound on the “activation energy” of changing a parameter
  • The “activity” of the system is the rate of change of parameters, which is in turn proportional to the gradient magnitude (although it can also be measured more directly by observing the bit-change activity of a training step).
  • The learning rate operates like a temperature: reactions progress at higher temperature, and carefully controlling the temperature throughout the reaction generally increases the yield of “structure” produced (model generalizability ~ crystallization quality).
  • We can accelerate the reaction by increasing the temperature, but then we risk reducing the yield.
  • We can increase the yield (model quality) by annealing the temperature (update magnitude) slowly, but if we anneal too slowly we are just wasting energy.
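The temperature analogy can be made quantitative via the Arrhenius law, $k = A \exp(-E_a / (k_B T))$: raising the temperature sharply accelerates the reaction rate, just as raising the learning rate accelerates parameter change. A sketch, with an illustrative activation energy and a hypothetical geometric annealing schedule standing in for the slow-cooling step:

```python
import math

def arrhenius_rate(temp_kelvin, activation_energy_joules, prefactor=1.0):
    """Arrhenius law: reaction rate k = A * exp(-E_a / (k_B * T))."""
    k_b = 1.380649e-23  # Boltzmann constant, J/K
    return prefactor * math.exp(-activation_energy_joules / (k_b * temp_kelvin))

# Illustrative activation energy of 0.5 eV, a typical chemical scale.
e_a = 0.5 * 1.602e-19
# A modest 100 K temperature bump yields a ~100x rate increase.
print(arrhenius_rate(400, e_a) / arrhenius_rate(300, e_a))

def annealed_lr(step, lr0=0.1, decay=0.999):
    """Hypothetical learning-rate analogue of slow cooling: geometric decay."""
    return lr0 * decay**step
```

The exponential sensitivity of the rate to temperature is the analogy's sharpest point: small learning-rate changes can produce disproportionately large changes in bit-flip activity.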

References:

  • Shannon, 1948, A Mathematical Theory of Communication
  • Landauer, 1961, Irreversibility and Heat Generation in the Computing Process

Reaction Kinetics of Deep Learning

We can model the learning procedure as a diffusion of information from a region of high concentration (the dataset) into a region of low concentration (the randomly initialized parameters) whose “volume” is a function of model capacity. This analogy permits modeling of extremely nuanced dynamics, such as a low quality dataset catalyzing the activity of a more challenging dataset. But later. First, fundamentals. Let $\nabla \phi$ denote the “information gradient”. Fick’s first law gives the information flux as $J = -D \nabla \phi$, where $D$ is a “diffusivity coefficient” that is proportional to the activity of the system. We can reasonably anticipate $D$ will take the form $D \propto \mu \alpha$ or $D \propto \mu \alpha^k$, where $\mu$ is a transmissivity property of the medium akin to viscosity (probably corresponds to FLOPs/parameter activity, possibly interacting with how model topology affects gradient flow), $\alpha$ is the learning rate, and $k$ is some scaling power.
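A 1-D finite-difference sketch of this picture: a high-concentration "dataset" region diffusing into a zero-concentration "parameter" region under Fick's first law, until the gradient (and hence the flux) vanishes. The cell counts and diffusivity are arbitrary illustrative choices:

```python
import numpy as np

def diffuse(phi, diffusivity=0.1, steps=2000):
    """Explicit finite-difference diffusion under Fick's first law,
    J = -D * grad(phi), with a conservative cell-to-cell update."""
    phi = phi.astype(float).copy()
    for _ in range(steps):
        grad = np.diff(phi)          # concentration gradient at each interface
        flux = -diffusivity * grad   # Fick's first law: flux opposes the gradient
        phi[:-1] -= flux             # what leaves a cell...
        phi[1:] += flux              # ...enters its neighbor (mass conserved)
    return phi

# Left half: "dataset" at high concentration. Right half: "parameters" at zero.
phi0 = np.concatenate([np.ones(10), np.zeros(10)])
phi = diffuse(phi0)
print(phi.round(3))  # approaches the uniform equilibrium profile at 0.5
```

Total "information" is conserved by construction; only its distribution changes. In this framing, training convergence is the vanishing of the concentration gradient, echoing the saturation condition $\eta_t < \epsilon$ from earlier.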