---
tags:
  - colorclass/a thermodynamic theory of statistical learning
---
# Formal Derivation of Critical Batch Size B_crit

1. Setup and Assumptions

1.1 Basic Dynamics

SGD update in the continuous-time (Langevin) limit:

dθ = -η∇L(θ)dt + √(2ηT)dW

where:
- η: learning rate
- T: effective temperature
- dW: Wiener process increment
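The stochastic differential equation above can be integrated numerically with an Euler-Maruyama step. The following is a minimal sketch, not a definitive implementation: it assumes a scalar parameter and a hypothetical quadratic loss L(θ) = kθ²/2, neither of which appears in the original text. For that loss the process is Ornstein-Uhlenbeck, so the stationary variance should approach T/k.

```python
import math
import random


def simulate_sgd_sde(grad, theta0, eta, temperature, dt, n_steps, seed=0):
    """Euler-Maruyama integration of dθ = -η ∇L(θ) dt + √(2 η T) dW for scalar θ."""
    rng = random.Random(seed)
    theta = theta0
    path = [theta]
    # A Wiener increment over dt has standard deviation sqrt(dt).
    noise_scale = math.sqrt(2.0 * eta * temperature * dt)
    for _ in range(n_steps):
        theta += -eta * grad(theta) * dt + noise_scale * rng.gauss(0.0, 1.0)
        path.append(theta)
    return path


# Assumption: quadratic loss L(θ) = k θ² / 2, so ∇L(θ) = k θ.
k = 1.0
temperature = 0.5
path = simulate_sgd_sde(lambda th: k * th, theta0=0.0, eta=0.1,
                        temperature=temperature, dt=0.01, n_steps=200_000)
tail = path[20_000:]  # discard the initial transient
variance = sum(x * x for x in tail) / len(tail)
print(f"empirical variance {variance:.3f}, predicted T/k = {temperature / k:.3f}")
```

Setting `temperature=0.0` removes the noise term and reduces the update to plain gradient flow, which is a convenient sanity check.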

1.2 Information Processing Rate

From Landauer’s principle and thermodynamics:

dI/dt ≤ P/(k_B T ln(2))

where:
- I: information content
- P: power
- k_B: Boltzmann constant
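To get a feel for the scale of this bound, the sketch below evaluates P/(k_B T ln 2) directly. The input values (1 W at 300 K) are illustrative assumptions using a physical temperature; the text's T is an effective temperature, so the numbers are only an order-of-magnitude guide.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_rate_bound(power_watts, temperature_kelvin):
    """Upper bound on dI/dt in bits per second: P / (k_B T ln 2)."""
    return power_watts / (K_B * temperature_kelvin * math.log(2))


# Assumed illustrative inputs: 1 W dissipated at room temperature.
rate = landauer_rate_bound(1.0, 300.0)
print(f"{rate:.3e} bits/s")
```

The bound scales linearly with P and inversely with T, so lowering the temperature at fixed power raises the permitted information rate.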

2. Batch Effects Analysis

2.1 Gradient Variance

For batch size B:

var(∇L_B) = σ²/B

where σ² is the per-example gradient variance over the data distribution, with the B examples in a batch drawn independently
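The 1/B scaling can be checked empirically. The sketch below stands in for per-example gradients with independent Gaussian draws (an assumption made purely for illustration; real gradients are neither scalar nor Gaussian) and compares the variance of the minibatch mean against σ²/B.

```python
import random
import statistics


def minibatch_mean_variance(batch_size, n_batches=4000, sigma=2.0, seed=0):
    """Variance of the mean of `batch_size` i.i.d. draws with std `sigma`."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(batch_size))
        for _ in range(n_batches)
    ]
    return statistics.pvariance(means)


sigma = 2.0
for b in (1, 4, 16, 64):
    measured = minibatch_mean_variance(b, sigma=sigma)
    print(f"B={b:3d}  measured={measured:.4f}  sigma^2/B={sigma**2 / b:.4f}")
```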

2.2 Effective Temperature

T_eff(B) = T₀/B

where T₀ is the base temperature, i.e. the effective temperature at B = 1

3. Power Analysis

3.1 Power Consumption

For batch size B:

P(B) = ν₀ * B * tr(I_F(θ))

where:
- ν₀: FLOPs/token/parameter/s
- I_F: Fisher Information

3.2 Power Constraint

P(B) ≤ P_max

4. Critical Point Derivation

4.1 Power Limited Regime

Solving P(B) = P_max:

B_power = P_max/(ν₀ * tr(I_F(θ)))

4.2 Information Limited Regime

From the information processing rate bound (Section 1.2):

B_info = √(P_max * T₀)/(k_B * ν₀ * tr(I_F(θ)))

4.3 Critical Batch Size

B_crit = min(B_power, B_info)
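The three formulas in this section can be combined into one helper. This is a sketch under stated assumptions: the function and parameter names are hypothetical, B_info is read as √(P_max T₀) divided by k_B ν₀ tr(I_F), and k_B defaults to 1 (nondimensional units) because the SI value would make B_info astronomically large for everyday inputs.

```python
import math


def critical_batch_size(p_max, nu0, trace_fisher, t0, k_b=1.0):
    """Return (B_crit, B_power, B_info) from the Section 4 formulas.

    k_b=1.0 assumes nondimensional units; this is an illustrative choice.
    """
    b_power = p_max / (nu0 * trace_fisher)
    b_info = math.sqrt(p_max * t0) / (k_b * nu0 * trace_fisher)
    return min(b_power, b_info), b_power, b_info


# Assumed illustrative values, not taken from the text.
print(critical_batch_size(p_max=100.0, nu0=0.5, trace_fisher=2.0, t0=4.0))
# (20.0, 100.0, 20.0)
```

With the formulas as written, B_info/B_power = √(T₀/P_max)/k_B, so the information limit is the binding one exactly when T₀ < k_B² P_max.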

5. Supercritical Behavior (B > B_crit)

5.1 Effective Power

P_eff(B) = P_max * min(1, B_crit/B)

5.2 Information Processing Rate

(dI/dt)_eff = (dI/dt)_max * √(B_crit/B)

5.3 Learning Time

t_learn(B) = t_opt * max(1, √(B/B_crit))
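The supercritical relations above are simple enough to tabulate. The sketch below uses hypothetical names and normalizes P_max, (dI/dt)_max and t_opt to 1 by default; it rejects B < B_crit because the text states these formulas only for the supercritical regime.

```python
import math


def supercritical_scaling(b, b_crit, p_max=1.0, rate_max=1.0, t_opt=1.0):
    """Return (P_eff, (dI/dt)_eff, t_learn) for B >= B_crit."""
    if b < b_crit:
        raise ValueError("formulas are stated for B >= B_crit only")
    ratio = b_crit / b
    p_eff = p_max * min(1.0, ratio)
    rate_eff = rate_max * math.sqrt(ratio)
    t_learn = t_opt * max(1.0, math.sqrt(b / b_crit))
    return p_eff, rate_eff, t_learn


for multiple in (1, 4, 16):
    print(multiple, supercritical_scaling(multiple * 100, 100))
```

At B = 4 B_crit, for example, effective power drops to a quarter of P_max, the information rate halves, and learning time doubles.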

6. Stability Analysis

6.1 Linear Stability Matrix

M(B) = [
    -η*I_F(θ)    η*∇²L(θ)
    -∇²L(θ)      -B/B_crit*I_F(θ)
]

6.2 Stability Condition

For stability, every eigenvalue λ of M(B) must satisfy:

Re(λ) < 0
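For intuition, the stability condition can be checked in closed form for a one-parameter model, where I_F(θ) and ∇²L(θ) reduce to scalars and M(B) is a 2×2 matrix. The sketch below makes that scalar assumption (the general case needs a numerical eigenvalue solver) and uses hypothetical function names.

```python
import cmath


def stability_eigenvalues(eta, fisher, hessian, b, b_crit):
    """Eigenvalues of M(B) = [[-η f, η h], [-h, -(B/B_crit) f]] for scalar f, h."""
    m11, m12 = -eta * fisher, eta * hessian
    m21, m22 = -hessian, -(b / b_crit) * fisher
    trace = m11 + m22
    det = m11 * m22 - m12 * m21
    # Roots of the characteristic polynomial λ² - trace·λ + det = 0.
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0


def is_stable(eta, fisher, hessian, b, b_crit):
    """True when both eigenvalues have strictly negative real part."""
    return all(lam.real < 0 for lam in
               stability_eigenvalues(eta, fisher, hessian, b, b_crit))


# Assumed illustrative values, not taken from the text.
print(stability_eigenvalues(eta=0.1, fisher=1.0, hessian=2.0, b=200, b_crit=100))
print(is_stable(eta=0.1, fisher=1.0, hessian=2.0, b=200, b_crit=100))
```

In this scalar case the trace is -(η + B/B_crit) f and the determinant is η (B/B_crit) f² + η h², so the condition holds for every B > 0 whenever η > 0 and f > 0. Any loss of stability therefore has to come from the matrix structure of I_F and ∇²L in higher dimensions.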

7. Optimality Conditions

7.1 Maximum Efficiency

At B = B_crit:

η_power = 1
dI/dt = (dI/dt)_max

where η_power is the power efficiency P_eff/P_max, distinct from the learning rate η

7.2 Supercritical Trade-off

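For B ≥ B_crit, substituting η_power(B) = P_eff(B)/P_max = B_crit/B (Section 5.1) and t_learn(B) = t_opt * √(B/B_crit) (Section 5.3):

η_power(B) * t_learn(B) = t_opt * √(B_crit/B) ≤ t_opt

with equality only at B = B_crit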

8. Formal Proof of Optimality

Theorem: B_crit maximizes information gain per unit energy.

Proof:

1. Information gain rate:

dI/dt = C * √(P_eff/T_eff)

where C is a constant

2. Energy consumption:

E = P_eff * t

3. Information per energy:

I/E = (dI/dt)/P_eff = C/√(P_eff * T_eff)

4. At B = B_crit:

∂/∂B(I/E) = 0
∂²/∂B²(I/E) < 0

Therefore B_crit is a local maximum of efficiency, and the global maximum provided I/E has no other stationary point for B > 0.

9. Observable Consequences

9.1 Power Scaling

P_observed ∝ min(B, B_crit)

9.2 Training Time Scaling

t_train ∝ max(1, √(B/B_crit))

9.3 Loss Improvement

ΔL ∝ -log(min(1, B/B_crit))
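The three predicted scalings can be tabulated across the critical point to show where each curve changes behavior. The sketch below sets every proportionality constant to 1, which is an assumption for illustration only; the function name is hypothetical.

```python
import math


def predicted_observables(b, b_crit):
    """Return (power, training time, loss change) up to proportionality constants."""
    power = min(b, b_crit)                        # 9.1
    train_time = max(1.0, math.sqrt(b / b_crit))  # 9.2
    delta_loss = -math.log(min(1.0, b / b_crit))  # 9.3
    return power, train_time, delta_loss


b_crit = 100
for b in (25, 50, 100, 200, 400):
    power, train_time, delta_loss = predicted_observables(b, b_crit)
    print(f"B={b:4d}  P={power:6.1f}  t={train_time:5.2f}  dL={delta_loss:6.3f}")
```

Under these formulas, power saturates and the loss term vanishes for B ≥ B_crit, while training time starts growing as √(B/B_crit). A measured sweep over B could be compared against this table to estimate B_crit.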