tags:
  - colorclass/a thermodynamic theory of statistical learning
---

# Kramers Theory Applied to Neural Network Training

1. Classical Kramers Theory

1.1 Basic Setup

In classical Kramers theory, we study a particle in a potential well subject to thermal noise:

mẍ + γẋ = -∇V(x) + √(2γkT)ξ(t)

where:

- m: particle mass
- γ: friction coefficient
- V(x): potential energy
- T: temperature
- k: Boltzmann constant
- ξ(t): white noise

1.2 Escape Rate

The escape rate from a metastable minimum at x₀ over a barrier at x* is:

k = (ω₀/2π)(ωᵦ/γ)exp(-ΔV/kT)

where:

- ω₀: well frequency = √(|V″(x₀)|/m)
- ωᵦ: barrier frequency = √(|V″(x*)|/m)
- ΔV = V(x*) - V(x₀): barrier height
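
As a quick numerical check, the sketch below evaluates this rate for a toy double-well potential V(x) = (x² − 1)², using finite differences for V″; all constants (mass, friction, temperature) are illustrative assumptions, not values from the text.

```python
import numpy as np

# Toy double-well potential; all physical constants below are illustrative assumptions.
def V(x):
    return (x**2 - 1.0)**2

def d2V(f, x, h=1e-3):
    """Second derivative by central finite differences."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

m, gamma, kT = 1.0, 10.0, 0.1      # mass, friction coefficient, thermal energy
x0, xb = -1.0, 0.0                 # metastable minimum and barrier top

omega_0 = np.sqrt(abs(d2V(V, x0)) / m)   # well frequency
omega_b = np.sqrt(abs(d2V(V, xb)) / m)   # barrier frequency
dV = V(xb) - V(x0)                       # barrier height ΔV

# Overdamped Kramers rate: k = (ω₀ ωᵦ / 2πγ) exp(-ΔV/kT)
k = omega_0 * omega_b / (2.0 * np.pi * gamma) * np.exp(-dV / kT)
print(f"barrier = {dV:.3f}, escape rate = {k:.3e}, mean escape time = {1.0 / k:.3e}")
```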

2. SGD Mapping

2.1 Overdamped Limit

SGD corresponds to the overdamped limit (γ >> ω₀) of the Kramers equation:

dθ = -η∇L(θ)dt + √(2ηT)dW

where:

- η: learning rate (replaces 1/γ)
- L(θ): loss landscape (replaces V(x))
- T: effective temperature from batch noise
- dW: Wiener process

2.2 Modified Escape Rate

In the SGD context, the escape rate becomes:

k_SGD = (ηω₀ωᵦ/2π)exp(-ΔL/T)

where:

- ω₀ = √|∇²L(θ₀)|: curvature (frequency) at the local minimum
- ωᵦ = √|∇²L(θ*)|: curvature at the saddle point, along the unstable (negative-eigenvalue) direction
- ΔL = L(θ*) - L(θ₀): barrier height
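
A minimal sketch of plugging measured curvatures into this rate; the numbers are hypothetical, and the function only reflects the mapping above (η in place of 1/γ), not a validated implementation.

```python
import numpy as np

def sgd_escape_rate(omega_0, omega_b, delta_L, eta, T):
    """Kramers-style escape rate under the SGD mapping (eta plays the role of 1/gamma).

    omega_0, omega_b: curvatures at the minimum and the saddle, measured along the
    escape direction; delta_L: barrier height; T: effective temperature.
    """
    return eta * omega_0 * omega_b / (2.0 * np.pi) * np.exp(-delta_L / T)

# Hypothetical numbers: fairly sharp minimum, shallow barrier, moderate gradient noise.
k = sgd_escape_rate(omega_0=5.0, omega_b=2.0, delta_L=0.05, eta=0.1, T=0.02)
print(f"escape rate: {k:.3e} per unit time (mean escape time ≈ {1.0 / k:.1f})")
```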

3. Effective Temperature in SGD

3.1 Temperature from Batch Noise

The effective temperature comes from batch sampling:

T = η⟨||∇L_B(θ) - ∇L(θ)||²⟩/(2d)

where:

- L_B: minibatch loss
- d: parameter dimension
- ⟨…⟩: average over minibatch draws
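
The sketch below estimates this effective temperature empirically for a toy least-squares problem; the model, data, batch size, and number of gradient samples are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: L(θ) = mean_i 0.5 * (x_i·θ - y_i)²
n, d, B, eta = 10_000, 20, 32, 0.01
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.5 * rng.normal(size=n)
theta = np.zeros(d)                      # evaluate the gradient noise at the initialization

def grad(Xs, ys, th):
    return Xs.T @ (Xs @ th - ys) / len(ys)

full_grad = grad(X, y, theta)

# Average squared deviation of minibatch gradients from the full-data gradient.
sq_dev = []
for _ in range(500):
    idx = rng.choice(n, size=B, replace=False)
    sq_dev.append(np.sum((grad(X[idx], y[idx], theta) - full_grad) ** 2))

T_eff = eta * np.mean(sq_dev) / (2 * d)
print(f"estimated effective temperature: {T_eff:.4e}")
```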

3.2 Batch Size Dependence

For batch size B:

T ∝ η/B

To hold the effective temperature (and hence the amount of exploration) fixed as the batch size grows, the learning rate must grow in proportion; this is why larger batches need larger learning rates, as sketched below.
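
A one-line consequence, with made-up reference values: rescale the learning rate linearly with the batch size to hold T ∝ η/B fixed.

```python
def rescale_lr(eta_ref: float, B_ref: int, B_new: int) -> float:
    """Keep the effective temperature T ∝ eta/B constant when the batch size changes."""
    return eta_ref * B_new / B_ref

print(rescale_lr(eta_ref=0.1, B_ref=256, B_new=1024))  # -> 0.4
```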

4. Transition State Theory for SGD

4.1 Reaction Coordinate

Define a progress variable q(θ):

q(θ) = (θ - θ₀)·v*

where v* is the eigenvector of ∇²L at the saddle associated with its negative eigenvalue (the unstable escape direction).
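
A sketch of computing q(θ) when the Hessian at the saddle is available as an explicit matrix; θ₀, θ, and the Hessian below are fabricated for illustration, and for a real network one would use Hessian-vector products instead.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10

# Synthetic Hessian at the saddle: one negative eigenvalue (the escape direction),
# the rest positive.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
eigs = np.concatenate(([-2.0], rng.uniform(0.5, 5.0, size=d - 1)))
H_saddle = Q @ np.diag(eigs) @ Q.T

# v*: eigenvector of the most negative eigenvalue (the unstable direction).
w, V = np.linalg.eigh(H_saddle)
v_star = V[:, np.argmin(w)]

theta_0 = rng.normal(size=d)                                 # current minimum (assumed known)
theta = theta_0 + 0.3 * v_star + 0.05 * rng.normal(size=d)   # a nearby iterate

q = (theta - theta_0) @ v_star                               # progress along the escape direction
print(f"q(theta) = {q:.3f}")
```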

4.2 Committor Probability

The probability of reaching the new minimum before falling back to the old one (the committor) is:

P(θ) = ∫_{q₀}^{q(θ)} exp(βL(q'))dq'/Z

where β = 1/T and Z = ∫_{q₀}^{q₁} exp(βL(q'))dq' normalizes P to one at the new minimum (q₀ and q₁ are the reaction-coordinate values of the old and new minima).
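
The integral can be evaluated numerically on a one-dimensional slice; the sketch below does this for a toy double-well loss along q (β, the landscape, and the grid are assumptions) and, by symmetry, returns P ≈ 0.5 at the barrier top.

```python
import numpy as np

beta = 10.0                               # inverse effective temperature (assumed)
L = lambda q: (q**2 - 1.0)**2             # toy double-well along the reaction coordinate

# Grid from the old minimum (q = -1) to the new minimum (q = +1).
q_grid = np.linspace(-1.0, 1.0, 2001)
w = np.exp(beta * L(q_grid))              # committor weight ~ exp(+beta * L)

# Cumulative integral, normalized so P = 0 at the old minimum and P = 1 at the new one.
P = np.cumsum(w)
P = (P - P[0]) / (P[-1] - P[0])

for q in (-0.5, 0.0, 0.5):
    print(f"P(q = {q:+.1f}) = {P[np.searchsorted(q_grid, q)]:.3f}")
```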

5. Optimal Training Schedule

5.1 Critical Temperature

For each barrier:

T_c = ΔL/ln(τω₀)

where τ is the available training time. Below T_c the expected escape time exceeds τ (kτ < 1), so the barrier is effectively uncrossable within the training budget.

5.2 Annealing Schedule

Optimal temperature schedule:

T(t) = T_c(1 + ln(τ/t))/ln(τω₀)

Since T ∝ η at fixed batch size, this translates into the learning-rate schedule:

η(t) = η₀(1 + ln(τ/t))/ln(τω₀)
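
A direct sketch of this schedule; η₀, τ, and ω₀ below are placeholder values.

```python
import numpy as np

def lr_schedule(t, eta_0, tau, omega_0):
    """eta(t) = eta_0 * (1 + ln(tau/t)) / ln(tau * omega_0), following Section 5.2."""
    return eta_0 * (1.0 + np.log(tau / t)) / np.log(tau * omega_0)

tau, omega_0, eta_0 = 1e5, 10.0, 0.1      # placeholder training horizon, well frequency, base LR
for t in (1e2, 1e3, 1e4, 1e5):
    print(f"t = {t:8.0f}   eta = {lr_schedule(t, eta_0, tau, omega_0):.4f}")
```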

6. Practical Implications

6.1 Escape Time Distribution

The time to escape a local minimum is exponentially distributed:

P(t_escape > t) = exp(-kt)

6.2 Optimal Batch Size

For target escape rate k:

B_opt = η⟨||∇L_B(θ) - ∇L(θ)||²⟩/(2dkΔL)

6.3 Training Termination

Stop when:

k < 1/(remaining_time × compute_cost)

7. Multiple Barriers

7.1 Parallel Pathways

When several independent escape pathways leave the same minimum, their rates add:

k_total = Σᵢ kᵢ

7.2 Sequential Barriers

When escape requires crossing several barriers in sequence, the mean escape times add, giving the effective rate:

1/k_eff = Σᵢ 1/kᵢ
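
A tiny sketch of both combination rules (the individual rates are arbitrary illustrative numbers):

```python
rates = [1e-3, 5e-4, 2e-3]   # hypothetical per-pathway / per-stage escape rates

k_parallel = sum(rates)                            # independent pathways: rates add
k_sequential = 1.0 / sum(1.0 / k for k in rates)   # sequential barriers: mean times add

print(f"parallel:   k_total = {k_parallel:.2e}")
print(f"sequential: k_eff   = {k_sequential:.2e}")
```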

8. Loss Landscape Analysis

8.1 Barrier Height Estimation

The barrier height can be estimated from the curvatures (Hessian determinants) at the minimum and at the saddle:

ΔL ≈ -(T/2) ln|det(∇²L(θ₀))/det(∇²L(θ*))|
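
A sketch of evaluating this estimate using log-determinants for numerical stability; the Hessians below are synthetic stand-ins (chosen so the estimate comes out positive), not measurements.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 8, 0.05

# Synthetic Hessians at the minimum and the saddle (stand-ins for measured curvature).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
H_min = Q @ np.diag(rng.uniform(0.5, 2.0, size=d)) @ Q.T
H_saddle = Q @ np.diag(np.concatenate(([-3.0], rng.uniform(3.0, 8.0, size=d - 1)))) @ Q.T

# Work with log|det|: np.linalg.slogdet returns (sign, log|det|).
logdet_min = np.linalg.slogdet(H_min)[1]
logdet_saddle = np.linalg.slogdet(H_saddle)[1]

delta_L = -0.5 * T * (logdet_min - logdet_saddle)
print(f"estimated barrier height: {delta_L:.4f}")
```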

8.2 Transition Path Sampling

Probability of trajectory θ(t):

P[θ(t)] ∝ exp(-∫ ||dθ/dt + η∇L(θ)||²/(4ηT) dt)
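
For a discretized trajectory θ₀, …, θ_N with time step Δt, the exponent becomes a sum; the sketch below compares a barrier-crossing path with one that stays in the starting minimum on a toy one-dimensional landscape (all settings are assumptions).

```python
import numpy as np

eta, T, dt, n_steps = 0.1, 0.05, 0.01, 200
grad_L = lambda th: 4.0 * th * (th**2 - 1.0)   # gradient of the toy double-well (θ² - 1)²

def path_action(theta_path):
    """Discretized version of ∫ ||dθ/dt + η∇L(θ)||² / (4ηT) dt along the path."""
    vel = np.diff(theta_path) / dt
    drift = eta * grad_L(theta_path[:-1])
    return np.sum((vel + drift) ** 2) * dt / (4.0 * eta * T)

crossing = np.linspace(-1.0, 1.0, n_steps)     # direct crossing of the barrier
staying = np.full(n_steps, -1.0)               # remaining in the starting minimum

print(f"action (crossing): {path_action(crossing):.2f}")
print(f"action (staying):  {path_action(staying):.2f}")
# Larger action <-> exponentially less probable trajectory.
```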

This framework explains:

1. Why high learning rates can get stuck
2. How batch size affects exploration
3. When to reduce the learning rate
4. Why some models train more easily than others