see also:

  • Background
  • Claude Responses
  • Misc References
  • Key Observations
  • Setting Up Notation

Dimensional Analysis of Scaling Laws

Empirically (Chinchilla, Gopher), the compute-optimal training budget has been observed to scale roughly proportionally to both the size of the model (number of parameters, $N$) and the size of the dataset (number of tokens, $D$), i.e. $C \propto ND$, or $C \approx 6ND$. Using floating point operations (FLOPs) as our measure of compute, this implies the scaling factor is expressed in units of FLOPs per parameter per token, which is analogous to a kind of computational density. According to Landauer's principle, there is a theoretical limit to the minimal energy required to irreversibly change one bit of information: $E \ge k_B T \ln 2$. A FLOP can be interpreted as a potential opportunity to change the value of a parameter, and therefore FLOPs provide a natural upper bound on the information transmitted from the data to the model during training, and Landauer's principle gives us a lower bound on the energy required to accomplish this information transmission (von Neumann-Landauer principle?)
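A quick numeric sketch of both bounds. The model and token counts below are illustrative (roughly Chinchilla-scale), and the $C \approx 6ND$ constant is the usual rule of thumb, not a measured value:

```python
import math

def chinchilla_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Approximate training compute via the C ~ 6*N*D rule of thumb."""
    return flops_per_param_token * n_params * n_tokens

def landauer_bound_joules(n_bits, temp_kelvin=300.0):
    """Landauer limit: minimum energy to irreversibly change n_bits."""
    k_b = 1.380649e-23  # Boltzmann constant, J/K
    return n_bits * k_b * temp_kelvin * math.log(2)

# Illustrative run: 70B parameters trained on 1.4T tokens.
flops = chinchilla_flops(70e9, 1.4e12)
print(f"training compute: {flops:.2e} FLOPs")  # ~5.88e23

# Even rewriting all 32 bits of every parameter once sits at a tiny energy floor:
print(f"Landauer floor: {landauer_bound_joules(32 * 70e9):.2e} J")
```

The gap between the Landauer floor and the actual energy cost of that many FLOPs is many orders of magnitude, which is what makes the bound a conceptual anchor rather than an engineering constraint.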

yadda yadda… incorporating time into the discussion… is analogous to a reaction rate term.

Information Absorption Efficiency

Consider some model with parameters $\theta$, such that $\theta_0$ denotes the parameters at initialization, $\theta_t$ denotes the parameters at step $t$ of training, and $\theta_T$ denotes the parameters at convergence. Consider the mutual information between the parameters and the training data $D$, $I(\theta_t; D)$. For notational simplicity, let's call this $I_t$, to be read as the information contained in the parameters, with respect to the dataset. The theoretical maximum information gain from training is then $\Delta I = I_T - I_0$. As $I_0 = 0$ (a random initialization is independent of the data), for convergent training we expect $\Delta I = I_T$. Marginalizing over the data, we can restate this observation in terms of the information entropy of the parameters: $I_t = H(\theta_t) - H(\theta_t \mid D) \le H(\theta_t)$.
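The identity $I(\theta; D) = H(\theta) - H(\theta \mid D)$ is easy to check numerically on a toy discrete joint distribution. The 2x2 distribution below is invented purely for illustration:

```python
import numpy as np

# Toy joint distribution p(theta, d): 2 parameter states x 2 dataset states.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def entropy(q):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_theta = p.sum(axis=1)            # marginal over the data
p_d = p.sum(axis=0)                # marginal over the parameters

h_theta = entropy(p_theta)         # H(theta)
h_cond = entropy(p.ravel()) - entropy(p_d)  # H(theta|D) = H(theta,D) - H(D)
mi = h_theta - h_cond              # I(theta; D) = H(theta) - H(theta|D)
print(f"I(theta; D) = {mi:.3f} bits (<= H(theta) = {h_theta:.3f} bits)")
```

The printed mutual information is strictly below $H(\theta)$, matching the inequality above: the parameters can never absorb more information about the data than their own entropy allows.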

Let $\dot{I}_t$ denote the instantaneous rate of information uptake by the model at some training step $t$, and let $\dot{C}_t$ denote the rate of computational work (FLOPs expended per step). Let $\eta_t$ denote the information absorption efficiency of the parameters at training step $t$, per unit compute. Then $\dot{I}_t = \eta_t \dot{C}_t$. This relationship can be restated in the form of a proportionality constant which relates the efficiency with which the system converts computational work into information absorption: $\eta_t = \dot{I}_t / \dot{C}_t$. Alternatively: $I_T = \int_0^T \eta_t \, \dot{C}_t \, dt$. Training convergence, i.e. effective information saturation, occurs at some critical step $t^*$, above which $\eta_t < \epsilon$ for some small positive $\epsilon$. Via dimensional analysis, we see that $\eta$ is in units of bits per FLOP.
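Combining this with the earlier bounds gives a cheap sanity check on $\eta$: the average efficiency over a whole training run cannot exceed the total representable parameter bits divided by the total FLOPs. A sketch, using the same illustrative $C \approx 6ND$ assumption as above:

```python
def eta_upper_bound(n_params, n_tokens, bits_per_param=32,
                    flops_per_param_token=6.0):
    """Upper bound on average bits absorbed per FLOP: total parameter bits
    over total training FLOPs. With C ~ 6*N*D the parameter count cancels,
    so the bound depends only on dataset size: bits_per_param / (6 * D)."""
    total_bits = bits_per_param * n_params
    total_flops = flops_per_param_token * n_params * n_tokens
    return total_bits / total_flops

# 32-bit params, 1.4T tokens -> ~3.8e-12 bits/FLOP, independent of model size.
print(eta_upper_bound(70e9, 1.4e12))
print(eta_upper_bound(7e9, 1.4e12))  # same bound: N cancels
```

That the bound is independent of $N$ is a direct consequence of compute scaling with both $N$ and $D$: adding parameters adds capacity and compute cost in equal proportion.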

Impacts of Gradient Accumulation on Information Efficiency

Treating the fitting procedure as a communication channel that transmits information from the data into the parameters, the channel capacity is a function of the (loss) gradient utilized for a given parameter update step. If an update step results in no parameters being updated, then the data processed in that step did not transmit any new information into the model parameters. Concretely, we can interpret the number of bits that change during a parameter update as a measure of the information gained during that update, and consequently the magnitude of the gradient serves as a measure of the information transmitted in that data processing step.

Consider a batch of samples $\{x_i\}_{i=1}^{B}$. Let $g_i$ denote the gradient with respect to a given sample, and let its magnitude (L2 norm) be given by $\lVert g_i \rVert$. Given some fixed learning rate, we want to know how the information transmitted per token (i.e. the per-sample normalized magnitude of the gradient) is affected by our choice of batch size. For the moment, let's assume noiseless data, i.e. our goal in this regime should be to transmit all of the information available in any given token. We'll extend this analysis to noisy data next.

With a batch size of 1, we have normal SGD. Each sample has its own gradient computed independently, and the per-sample normalized gradient magnitude is just the average magnitude of the gradients: $\frac{1}{B}\sum_i \lVert g_i \rVert$. With a batch size of $B$, we pool the respective per-sample gradients before we have a chance to observe their magnitudes, yielding a single gradient with per-sample magnitude $\lVert \frac{1}{B}\sum_i g_i \rVert$. By Jensen's inequality (the norm is convex, so $\lVert \mathbb{E}[g] \rVert \le \mathbb{E}[\lVert g \rVert]$), it is necessarily the case that $\lVert \frac{1}{B}\sum_i g_i \rVert \le \frac{1}{B}\sum_i \lVert g_i \rVert$. Further, we can quantify the gap here by invoking the central limit theorem (TODO), which reveals that the magnitude of a summed batch gradient is expected to scale on the order of $\sqrt{B}$, so the per-sample normalized magnitude scales as $1/\sqrt{B}$, which is then interpretable as our per-sample efficiency in the batch training regime. Now, back to the real world, where the data is noisy.
Let $\tilde{x} = x + \varepsilon$ denote a noisy sample for some noise $\varepsilon$, such that the "noiseless" sample is still denoted $x$ as above. For some noisy observation, we observe a gradient $\tilde{g} = g + \varepsilon_g$, where $\varepsilon_g$ is the noise induced in gradient space.
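The Jensen gap and the $1/\sqrt{B}$ scaling are easy to verify by simulation. The sketch below uses synthetic zero-mean "gradients" (the noise-dominated extreme, where pooling before measuring the magnitude costs the most); the dimensions and batch sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 1_000, 4_096

# Synthetic zero-mean per-sample gradients (pure noise, no shared signal).
grads = rng.normal(size=(n_samples, dim))
mean_norm = np.linalg.norm(grads, axis=1).mean()  # batch-size-1 baseline

ratios = {}
for batch_size in (1, 16, 256):
    pooled = grads.reshape(-1, batch_size, dim).mean(axis=1)  # batch gradients
    per_sample = np.linalg.norm(pooled, axis=1).mean()
    ratios[batch_size] = per_sample / mean_norm  # Jensen: always <= 1

for b, r in ratios.items():
    # CLT prediction: the ratio decays like 1/sqrt(B).
    print(f"B={b:4d}  observed={r:.4f}  predicted={1/np.sqrt(b):.4f}")
```

With correlated gradients (a shared signal component across samples) the decay would be slower than $1/\sqrt{B}$, which is exactly the regime where batching stops being free.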

Communication Theory of Model Training

We can treat a parameter update as a message transmitted from the data to the parameters. The communication channel here is the gradient. The channel gain is the learning rate, and the channel capacity has a strict upper bound: the number of bits required to express the model parameters. That is, the maximum information that could be communicated in a single update would flip every bit in the model's learnable parameters. The gradient dimension therefore operates as a kind of information bottleneck, thresholding the information that could feasibly be transmitted. Above some threshold batch size (strictly smaller than the number of parameters), this channel capacity becomes saturated. In practice, gradients become increasingly sparse as training progresses (i.e. magnitude increasingly concentrated in fewer components, i.e. fewer parameter bits change per update as training progresses). The channel still experiences a kind of saturation of useful information, which interacts with the gradient precision.
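The "bits changed per update" quantity is directly measurable: reinterpret two float32 parameter snapshots as raw bit patterns, XOR them, and popcount. A minimal sketch (the toy parameter values are arbitrary):

```python
import numpy as np

def bits_changed(params_before, params_after):
    """Count differing bits between two float32 parameter snapshots by
    XOR-ing their raw bit patterns and popcounting the result."""
    a = np.asarray(params_before, dtype=np.float32).view(np.uint32)
    b = np.asarray(params_after, dtype=np.float32).view(np.uint32)
    diff = a ^ b
    return int(np.unpackbits(diff.view(np.uint8)).sum())

theta = np.array([0.5, -1.0, 2.0], dtype=np.float32)
theta_new = theta.copy()
print(bits_changed(theta, theta_new))  # 0: no update, no information transmitted
theta_new[0] = 0.25                    # nudge one parameter
print(bits_changed(theta, theta_new))  # 2: 0x3F000000 ^ 0x3E800000 has 2 set bits
```

Tracking this count per step over a real training run would give an empirical view of the channel-saturation claim above: early steps should flip many bits, late steps few.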

Analogy to Chemical Kinetics

  • Changing a bit is analogous to a “reaction”
  • Bit precision imposes a lower bound on the “activation energy” of changing a parameter
  • The “activity” of the system is the rate of change of parameters, which is in turn proportional to the gradient magnitude (although it can also be measured more directly by observing the bit-change activity of a training step).
  • The learning rate operates like a temperature: reactions progress at higher temperature, and carefully controlling the temperature throughout the reaction generally increases the yield of “structure” produced (model generalizability ~ crystallization quality).
  • We can accelerate the reaction by increasing the temperature, but then we risk reducing the yield.
  • We can increase the yield (model quality) by annealing the temperature (update magnitude) slowly, but if we anneal too slowly we are just wasting energy.
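The temperature analogy can be made quantitative via the Arrhenius law, $k = A \exp(-E_a / (k_B T))$: raising the temperature sharply accelerates the reaction rate, just as raising the learning rate accelerates parameter change. A sketch, with an illustrative activation energy and a hypothetical geometric annealing schedule standing in for the slow-cooling step:

```python
import math

def arrhenius_rate(temp_kelvin, activation_energy_joules, prefactor=1.0):
    """Arrhenius law: reaction rate k = A * exp(-E_a / (k_B * T))."""
    k_b = 1.380649e-23  # Boltzmann constant, J/K
    return prefactor * math.exp(-activation_energy_joules / (k_b * temp_kelvin))

# Illustrative activation energy of 0.5 eV, a typical chemical scale.
e_a = 0.5 * 1.602e-19
# A modest 100 K temperature bump yields a ~100x rate increase.
print(arrhenius_rate(400, e_a) / arrhenius_rate(300, e_a))

def annealed_lr(step, lr0=0.1, decay=0.999):
    """Hypothetical learning-rate analogue of slow cooling: geometric decay."""
    return lr0 * decay**step
```

The exponential sensitivity of the rate to temperature is the analogy's sharpest point: small learning-rate changes can produce disproportionately large changes in bit-flip activity.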

References:

  • Shannon, 1948, A Mathematical Theory of Communication
  • Landauer, 1961, Irreversibility and Heat Generation in the Computing Process

Reaction Kinetics of Deep Learning

We can model the learning procedure as a diffusion of information from a region of high concentration (the dataset) into a region of low concentration (the randomly initialized parameters) whose “volume” is a function of model capacity. This analogy permits modeling of extremely nuanced dynamics, such as a low quality dataset catalyzing the activity of a more challenging dataset. But later. First, fundamentals. Let $\nabla \phi$ denote the “information gradient”. Fick’s first law gives the information flux as $J = -D \nabla \phi$, where $D$ is a “diffusivity coefficient” that is proportional to the activity of the system. We can reasonably anticipate $D$ will take the form $D \propto \mu \alpha$ or $D \propto \mu \alpha^k$, where $\mu$ is a transmissivity property of the medium akin to viscosity (probably corresponds to FLOPs/parameter activity, possibly interacting with how model topology affects gradient flow), $\alpha$ is the learning rate, and $k$ is some scaling power.
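A 1-D finite-difference sketch of this picture: a high-concentration "dataset" region diffusing into a zero-concentration "parameter" region under Fick's first law, until the gradient (and hence the flux) vanishes. The cell counts and diffusivity are arbitrary illustrative choices:

```python
import numpy as np

def diffuse(phi, diffusivity=0.1, steps=2000):
    """Explicit finite-difference diffusion under Fick's first law,
    J = -D * grad(phi), with a conservative cell-to-cell update."""
    phi = phi.astype(float).copy()
    for _ in range(steps):
        grad = np.diff(phi)          # concentration gradient at each interface
        flux = -diffusivity * grad   # Fick's first law: flux opposes the gradient
        phi[:-1] -= flux             # what leaves a cell...
        phi[1:] += flux              # ...enters its neighbor (mass conserved)
    return phi

# Left half: "dataset" at high concentration. Right half: "parameters" at zero.
phi0 = np.concatenate([np.ones(10), np.zeros(10)])
phi = diffuse(phi0)
print(phi.round(3))  # approaches the uniform equilibrium profile at 0.5
```

Total "information" is conserved by construction; only its distribution changes. In this framing, training convergence is the vanishing of the concentration gradient, echoing the saturation condition $\eta_t < \epsilon$ from earlier.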