---
tags:
  - colorclass/a thermodynamic theory of statistical learning
---

# Natural Gradients in Statistical Thermodynamics of Learning
## 1. Fundamental Setup

### 1.1 Statistical Manifold

The parameter space forms a Riemannian manifold where:

- p(x|θ): probability of data x given parameters θ
- I_F(θ) = 𝔼_{x∼p(·|θ)}[∇_θ log p(x|θ) ∇_θ log p(x|θ)ᵀ]: the Fisher information metric
- ds² = dθᵀI_F(θ)dθ: infinitesimal distance between nearby distributions
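The Fisher metric can be checked numerically. The sketch below assumes, purely for illustration, a univariate Gaussian parametrized by θ = (mean, log standard deviation); it estimates I_F(θ) by Monte Carlo from the outer product of the scores and compares the result to the closed form diag(1/σ², 2).

```python
import numpy as np

# Minimal sketch: Monte Carlo estimate of the Fisher metric
# I_F(θ) = 𝔼[∇_θ log p(x|θ) ∇_θ log p(x|θ)ᵀ]
# for a univariate Gaussian with θ = (mu, log_std) (illustrative choice).
# The closed form is diag(1/σ², 2), which the estimate should approach.

rng = np.random.default_rng(0)

def score(x, mu, log_std):
    """Per-sample gradient of log p(x|θ) with respect to (mu, log_std)."""
    sigma2 = np.exp(2.0 * log_std)
    d_mu = (x - mu) / sigma2
    d_log_std = -1.0 + (x - mu) ** 2 / sigma2
    return np.stack([d_mu, d_log_std], axis=-1)   # shape (n, 2)

mu, log_std = 0.5, np.log(2.0)
x = rng.normal(mu, np.exp(log_std), size=200_000)  # samples from p(x|θ)

g = score(x, mu, log_std)
fisher_mc = g.T @ g / len(x)                       # average outer product of scores
fisher_exact = np.diag([1.0 / np.exp(2 * log_std), 2.0])

print(fisher_mc)      # ≈ [[0.25, 0], [0, 2]]
print(fisher_exact)
```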
### 1.2 Thermodynamic Quantities

- L(θ) = -𝔼[log p(x|θ)]: loss (negative log-likelihood, with the expectation over the data distribution)
- S(θ) = -∫p(x|θ) log p(x|θ) dx: entropy of the model
- F(θ) = L(θ) - T·S(θ): free energy at temperature T
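For the same illustrative Gaussian model, all three quantities have closed forms. The sketch below additionally assumes Gaussian data with parameters mu_star, sigma_star, only so that L(θ) is available in closed form; the temperature T is a free hyperparameter.

```python
import numpy as np

# Minimal sketch (illustrative assumptions): data ~ N(mu_star, sigma_star²),
# model N(mu, sigma²), so L, S, and F are available in closed form.

def loss(mu, sigma, mu_star, sigma_star):
    """L(θ) = -𝔼_data[log p(x|θ)] for Gaussian data and a Gaussian model."""
    return (np.log(sigma) + 0.5 * np.log(2 * np.pi)
            + (sigma_star**2 + (mu - mu_star)**2) / (2 * sigma**2))

def entropy(sigma):
    """S(θ) = -∫ p(x|θ) log p(x|θ) dx (differential entropy of the model)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

def free_energy(mu, sigma, mu_star, sigma_star, T=0.1):
    """F(θ) = L(θ) - T·S(θ)."""
    return loss(mu, sigma, mu_star, sigma_star) - T * entropy(sigma)

print(free_energy(mu=0.5, sigma=2.0, mu_star=0.0, sigma_star=1.0))
```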
## 2. Natural Gradient Derivation

### 2.1 From KL Divergence

Theorem 1: The natural gradient is the direction of steepest loss descent per unit of KL divergence.

Proof:

1. Local expansion of the KL divergence:
   KL(p(x|θ)||p(x|θ+dθ)) = (1/2)dθᵀI_F(θ)dθ + O(||dθ||³)
2. Optimization problem:
   minimize: ∇L(θ)ᵀdθ
   subject to: KL(p(x|θ)||p(x|θ+dθ)) = ε²
3. Lagrangian (substituting the local expansion):
   𝓛 = ∇L(θ)ᵀdθ + λ((1/2)dθᵀI_F(θ)dθ - ε²)
4. Stationarity, ∇L(θ) + λI_F(θ)dθ = 0, gives the solution:
   dθ ∝ -I_F⁻¹(θ)∇L(θ)
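A minimal sketch of one natural-gradient step follows. The gradient and Fisher matrix are stand-in inputs (for example, the Monte Carlo estimate from §1.1); the damping term is a common practical safeguard rather than part of the derivation above.

```python
import numpy as np

# Minimal sketch of one natural-gradient step, dθ ∝ -I_F⁻¹(θ)∇L(θ).
# grad_loss and fisher are assumed to be supplied by the model; the
# values below are placeholders.

def natural_gradient_step(theta, grad_loss, fisher, lr=0.1, damping=1e-4):
    """Return θ - lr·(I_F + damping·I)⁻¹∇L, solving rather than inverting."""
    d = len(theta)
    # Damping keeps the solve well-posed when I_F is near-singular.
    direction = np.linalg.solve(fisher + damping * np.eye(d), grad_loss)
    return theta - lr * direction

theta = np.array([0.5, np.log(2.0)])
grad_loss = np.array([0.3, -0.1])     # placeholder gradient
fisher = np.diag([0.25, 2.0])         # Fisher matrix from the Gaussian example
print(natural_gradient_step(theta, grad_loss, fisher))
```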
### 2.2 From Energy Conservation

Theorem 2: The natural-gradient update, and the kinetic energy it carries, are preserved under parameter transformations.

Proof:

1. Kinetic energy in parameter space:
   E(θ) = (1/2)θ̇ᵀI_F(θ)θ̇
2. Under a reparametrization θ → η, the Fisher metric transforms covariantly:
   I_F(η) = (∂θ/∂η)ᵀI_F(θ)(∂θ/∂η)
3. Since θ̇ = (∂θ/∂η)η̇, the energy is conserved:
   θ̇ᵀI_F(θ)θ̇ = η̇ᵀI_F(η)η̇
4. Gradients transform as ∇_η L = (∂θ/∂η)ᵀ∇_θ L, so the natural-gradient update computed in η coordinates, η̇ = -I_F⁻¹(η)∇_η L, maps back to
   θ̇ = -I_F⁻¹(θ)∇L(θ)
   in θ coordinates: the update, and hence its energy, is the same in every parametrization.
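The coordinate-invariance in Theorem 2 can be verified directly. The sketch below assumes, for illustration, a zero-mean Gaussian with unknown scale written either as σ or as η = log σ, and Gaussian data with true scale sigma_star; it computes the natural gradient in both parametrizations and checks that they define the same tangent vector.

```python
import numpy as np

# Minimal sketch: the natural-gradient direction is the same tangent vector
# in every parametrization.  Model: zero-mean Gaussian with scale σ, or
# equivalently η = log σ.  The loss gradient uses the closed form for
# Gaussian data with true scale sigma_star (an illustrative assumption).

sigma, sigma_star = 2.0, 1.0

# dL/dσ for L(σ) = log σ + ½log 2π + σ*²/(2σ²)
dL_dsigma = 1.0 / sigma - sigma_star**2 / sigma**3
dL_deta = sigma * dL_dsigma            # chain rule for η = log σ

I_sigma = 2.0 / sigma**2               # Fisher metric in σ coordinates
I_eta = 2.0                            # Fisher metric in η coordinates

d_sigma = -dL_dsigma / I_sigma         # natural gradient in σ coordinates
d_eta = -dL_deta / I_eta               # natural gradient in η coordinates

# Push the η-update forward through dσ/dη = σ: both give the same tangent.
print(d_sigma, sigma * d_eta)          # identical up to floating point
```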
## 3. Thermodynamic Interpretation

### 3.1 Power Analysis

Theorem 3: Among all updates with a fixed Fisher speed, the natural gradient minimizes the instantaneous power, i.e., achieves the fastest loss decrease.

Proof:

1. Instantaneous power (force × velocity):
   P(t) = ∇L(θ)ᵀθ̇
2. In terms of the Fisher metric, this is the Fisher inner product of the velocity with the natural gradient:
   P(t) = θ̇ᵀI_F(θ)·I_F⁻¹(θ)∇L(θ)
3. Minimize at fixed speed:
   minimize: P(t)
   subject to: θ̇ᵀI_F(θ)θ̇ = constant
4. By Cauchy–Schwarz in the Fisher inner product, the minimizer is antiparallel to the natural gradient:
   θ̇ = -ηI_F⁻¹(θ)∇L(θ)
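A quick numerical check of Theorem 3: among random velocities normalized to unit Fisher speed, none produces a more negative power than the natural-gradient direction. The metric and gradient below are arbitrary stand-ins.

```python
import numpy as np

# Minimal sketch: among all velocities with θ̇ᵀI_Fθ̇ = 1, the natural-gradient
# direction gives the most negative power ∇Lᵀθ̇.  I_F and grad are stand-ins.

rng = np.random.default_rng(1)
I_F = np.array([[4.0, 1.0], [1.0, 1.0]])   # a positive-definite Fisher metric
grad = np.array([1.0, -2.0])

def normalize(v):
    """Scale v so that vᵀ I_F v = 1."""
    return v / np.sqrt(v @ I_F @ v)

nat = normalize(-np.linalg.solve(I_F, grad))          # natural-gradient direction
powers = [grad @ normalize(rng.normal(size=2)) for _ in range(10_000)]

print(grad @ nat)        # most negative power achievable at unit Fisher speed
print(min(powers))       # random directions never beat it (up to sampling)
```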
### 3.2 Information Speed

Theorem 4: Under a fixed Fisher speed (a fixed power budget), the natural gradient maximizes the information gain per unit time.

Proof:

1. Measure the information gain rate by the rate of loss (cross-entropy) decrease:
   dI/dt = -dL/dt = -∇L(θ)ᵀθ̇
2. Under the power constraint P(t):
   maximize: dI/dt
   subject to: θ̇ᵀI_F(θ)θ̇ = 2P(t)/η
3. As in Theorem 3, the constrained optimum is the natural-gradient direction, scaled to satisfy the constraint:
   θ̇ = -√(2P(t)/(η·∇L(θ)ᵀI_F⁻¹(θ)∇L(θ)))·I_F⁻¹(θ)∇L(θ)
## 4. Connection to Statistical Physics

### 4.1 Free Energy Flow

Theorem 5: The natural-gradient flow on F is the steepest free-energy descent in the Fisher metric.

Proof:

1. Free energy change along a trajectory (since F = L - TS):
   dF/dt = ∇L(θ)ᵀθ̇ - T∇S(θ)ᵀθ̇ = ∇F(θ)ᵀθ̇
2. Substituting the natural-gradient flow on F:
   dF/dt = -θ̇ᵀI_F(θ)θ̇ ≤ 0
3. By the same Cauchy–Schwarz argument as in Theorem 3, among all flows with fixed Fisher speed this is the steepest possible descent, achieved by:
   θ̇ = -I_F⁻¹(θ)∇F(θ)
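A numerical check of Theorem 5, assuming (for illustration only) a quadratic free energy and a constant Fisher metric: along the natural-gradient flow, F decreases and the measured dF/dt matches -θ̇ᵀI_F(θ)θ̇.

```python
import numpy as np

# Minimal sketch: under θ̇ = -I_F⁻¹∇F, free energy decreases and
# dF/dt = -θ̇ᵀI_Fθ̇.  F and I_F are illustrative stand-ins (quadratic
# free energy, constant Fisher metric).

I_F = np.array([[4.0, 1.0], [1.0, 1.0]])
A = np.array([[3.0, 0.5], [0.5, 1.0]])

def free_energy(theta):
    return 0.5 * theta @ A @ theta

def grad_free_energy(theta):
    return A @ theta

theta, dt = np.array([2.0, -1.5]), 1e-3
for step in range(5):
    v = -np.linalg.solve(I_F, grad_free_energy(theta))   # θ̇
    dF_exact = -v @ I_F @ v                              # predicted dF/dt
    dF_numeric = (free_energy(theta + dt * v) - free_energy(theta)) / dt
    print(f"F={free_energy(theta):.4f}  dF/dt={dF_numeric:.4f}  -vᵀI_Fv={dF_exact:.4f}")
    theta = theta + dt * v
```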
### 4.2 Fluctuation–Dissipation

Theorem 6: Natural-gradient Langevin dynamics satisfies a fluctuation–dissipation relation.

Proof:

1. Langevin dynamics preconditioned by the Fisher metric:
   dθ = -ηI_F⁻¹(θ)∇L(θ)dt + √(2ηT)·I_F^(-1/2)(θ)dW
2. Equilibrium fluctuations, near a minimum where the Hessian of L equals I_F (so the stationary density ∝ exp(-L/T) is locally Gaussian):
   ⟨δθδθᵀ⟩ = T·I_F⁻¹(θ)
3. Response function (velocity response to a small applied force):
   R(t) = ηI_F⁻¹(θ)δ(t)
   so the noise covariance 2ηT·I_F⁻¹(θ) equals 2T times the mobility ηI_F⁻¹(θ) — the fluctuation–dissipation relation.
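The fluctuation side of Theorem 6 can be simulated. The sketch below assumes a constant Fisher metric and the quadratic loss L(θ) = (1/2)θᵀI_Fθ, so the Hessian equals I_F and the empirical stationary covariance should approach T·I_F⁻¹; all constants are illustrative.

```python
import numpy as np

# Minimal sketch: discretized natural-gradient Langevin dynamics with a
# constant Fisher metric and L(θ) = ½θᵀI_Fθ.  The empirical stationary
# covariance should approach T·I_F⁻¹.

rng = np.random.default_rng(2)
I_F = np.array([[4.0, 1.0], [1.0, 1.0]])
I_F_inv = np.linalg.inv(I_F)
B = np.linalg.cholesky(I_F_inv)        # any B with B Bᵀ = I_F⁻¹ gives the right noise

eta, T, dt = 1.0, 0.5, 1e-2
theta = np.zeros(2)
samples = []

for step in range(100_000):
    grad = I_F @ theta                                   # ∇L for L = ½θᵀI_Fθ
    noise = np.sqrt(2 * eta * T * dt) * (B @ rng.normal(size=2))
    theta = theta - eta * (I_F_inv @ grad) * dt + noise  # dθ = -ηI_F⁻¹∇L dt + √(2ηT) I_F^(-1/2) dW
    if step > 20_000:                                    # discard burn-in
        samples.append(theta.copy())

print(np.cov(np.array(samples).T))   # ≈ T·I_F⁻¹
print(T * I_F_inv)
```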
## 5. Practical Implications

### 5.1 Optimal Learning Rate

The step size is capped both by the largest Fisher eigenvalue (stability) and by the power budget P_max relative to the total Fisher information:

η_opt = min(1/λ_max(I_F(θ)), P_max/tr(I_F(θ)))

### 5.2 Temperature Schedule

The temperature is annealed as the trace of the Fisher information grows:

T(t) = T₀·√(tr(I_F(θ₀))/tr(I_F(θ(t))))

### 5.3 Energy-Efficient Updates

Each natural-gradient step is scaled by √(P(t)/P_max), the square root of the fraction of the power budget in use:

θ(t+dt) = θ(t) - η_opt·I_F⁻¹(θ)∇L(θ)·√(P(t)/P_max)
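A minimal sketch of the three heuristics above, taking I_F(θ), ∇L(θ), the instantaneous power P(t), and the budget P_max as given inputs; how those quantities are estimated in practice is left open here.

```python
import numpy as np

# Minimal sketch of the §5 heuristics; all inputs are assumed supplied by
# the surrounding training loop.

def optimal_lr(fisher, P_max):
    """η_opt = min(1/λ_max(I_F), P_max/tr(I_F))   (§5.1)."""
    lam_max = np.linalg.eigvalsh(fisher)[-1]
    return min(1.0 / lam_max, P_max / np.trace(fisher))

def temperature(T0, fisher0, fisher_t):
    """T(t) = T₀·√(tr I_F(θ₀) / tr I_F(θ(t)))   (§5.2)."""
    return T0 * np.sqrt(np.trace(fisher0) / np.trace(fisher_t))

def energy_efficient_step(theta, grad, fisher, P_t, P_max):
    """θ ← θ - η_opt·I_F⁻¹∇L·√(P(t)/P_max)   (§5.3)."""
    eta = optimal_lr(fisher, P_max)
    direction = np.linalg.solve(fisher, grad)
    return theta - eta * np.sqrt(P_t / P_max) * direction

# Example call with placeholder inputs.
theta = energy_efficient_step(np.zeros(2), np.array([0.3, -0.1]),
                              np.diag([0.25, 2.0]), P_t=0.5, P_max=1.0)
print(theta)
```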
## 6. Convergence Analysis

### 6.1 Local Convergence Rate

Theorem 7: Natural gradient achieves the optimal local convergence rate.

Proof:

1. Local quadratic approximation around the optimum θ*:
   L(θ + dθ) ≈ L(θ) + ∇L(θ)ᵀdθ + (1/2)dθᵀH(θ)dθ
2. The natural-gradient update is:
   dθ = -ηI_F⁻¹(θ)∇L(θ)
3. The deviation from θ* therefore contracts at the rate of the slowest mode of I_F⁻¹(θ)H(θ):
   ||θ(t) - θ*|| ≤ ||θ₀ - θ*||·exp(-η·λ_min(I_F⁻¹(θ)H(θ))·t)
4. For a well-specified model near its optimum, H(θ) ≈ I_F(θ), so I_F⁻¹H ≈ I and every direction contracts at the same rate η: the conditioning of H no longer limits convergence.
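A numerical check of the convergence rate, assuming a quadratic loss and a constant stand-in Fisher metric: after the fast modes die out, the per-step contraction factor of ||θ - θ*|| approaches 1 - η·λ_min(I_F⁻¹H), the discrete-time counterpart of the exponential rate above.

```python
import numpy as np

# Minimal sketch: natural-gradient descent on L(θ) = ½(θ-θ*)ᵀH(θ-θ*) with a
# constant stand-in Fisher metric.  The error contracts asymptotically at the
# rate set by λ_min(I_F⁻¹H); with H = I_F the rate would simply be η.

I_F = np.array([[4.0, 1.0], [1.0, 1.0]])
H = np.array([[3.0, 0.5], [0.5, 1.0]])
theta_star = np.array([1.0, -1.0])
eta = 0.1

lam_min = np.min(np.linalg.eigvals(np.linalg.solve(I_F, H)).real)
predicted_ratio = 1.0 - eta * lam_min      # per-step contraction factor

theta = theta_star + np.array([2.0, 3.0])
errors = []
for k in range(200):
    grad = H @ (theta - theta_star)
    theta = theta - eta * np.linalg.solve(I_F, grad)
    errors.append(np.linalg.norm(theta - theta_star))

print(errors[-1] / errors[-2])   # empirical contraction factor
print(predicted_ratio)           # ≈ 1 - η·λ_min(I_F⁻¹H)
```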