---
tags:
  - colorclass/a thermodynamic theory of statistical learning
---

# Kramers Theory Applied to Neural Network Training
## 1. Classical Kramers Theory

### 1.1 Basic Setup
In classical Kramers theory, we study a particle in a potential well subject to thermal noise:
mẍ + γẋ = -∇V(x) + √(2γkT)ξ(t)
where:
- m: particle mass
- γ: friction coefficient
- V(x): potential energy
- T: temperature
- k: Boltzmann constant
- ξ(t): white noise
### 1.2 Escape Rate
The escape rate from a metastable minimum at x₀ over a barrier at x*, in the high-friction (overdamped) limit, is:
k = (ω₀/2π)(ωᵦ/γ)exp(-ΔV/kT)
where:
- ω₀ = √(V″(x₀)/m): well frequency
- ωᵦ = √(|V″(x*)|/m): barrier frequency
- ΔV = V(x*) - V(x₀): barrier height
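As a quick sanity check, the rate can be evaluated for a toy one-dimensional double well. A minimal sketch, assuming the high-friction form of the rate above and an illustrative potential V(x) = x⁴/4 - x²/2 (mass, friction, and temperature are placeholder values):

```python
import numpy as np

# Minimal sketch: overdamped Kramers escape rate for the illustrative double well
# V(x) = x**4/4 - x**2/2, with minima at x = ±1 and a barrier at x = 0.

def kramers_rate(curv_min, curv_barrier, barrier_height, gamma, kT, m=1.0):
    """k = (w0 * wb / (2*pi*gamma)) * exp(-dV / kT), high-friction limit."""
    w0 = np.sqrt(curv_min / m)           # well frequency
    wb = np.sqrt(abs(curv_barrier) / m)  # barrier frequency (curvature is negative there)
    return (w0 * wb / (2 * np.pi * gamma)) * np.exp(-barrier_height / kT)

# Curvatures of V(x) = x**4/4 - x**2/2: V''(x) = 3*x**2 - 1
curv_min = 2.0       # V''(-1) at the minimum
curv_barrier = -1.0  # V''(0) at the barrier top
dV = 0.25            # V(0) - V(-1)

print(kramers_rate(curv_min, curv_barrier, dV, gamma=10.0, kT=0.05))
```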
## 2. SGD Mapping

### 2.1 Overdamped Limit
SGD corresponds to the overdamped limit (γ >> ω₀) of the Kramers equation:
dθ = -η∇L(θ)dt + √(2ηT)dW
where:
- η: learning rate (replaces 1/γ)
- L(θ): loss landscape (replaces V(x))
- T: effective temperature from batch noise
- dW: Wiener process
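The SDE above can be integrated directly with the Euler-Maruyama method. A minimal sketch on an illustrative 2D double-well loss; the loss, learning rate, temperature, and step count are assumptions, not recommendations:

```python
import numpy as np

# Minimal sketch: Euler-Maruyama integration of the SGD SDE
#   dtheta = -eta * grad L(theta) dt + sqrt(2 * eta * T) dW
# on an illustrative 2D double-well loss L(theta) = (theta_0**2 - 1)**2 + theta_1**2.

rng = np.random.default_rng(0)

def grad_loss(theta):
    return np.array([4 * theta[0] * (theta[0] ** 2 - 1), 2 * theta[1]])

def simulate(theta_init, eta=0.05, T=0.2, dt=1.0, n_steps=20_000):
    theta = np.array(theta_init, dtype=float)
    traj = [theta.copy()]
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - eta * grad_loss(theta) * dt + np.sqrt(2 * eta * T * dt) * noise
        traj.append(theta.copy())
    return np.array(traj)

traj = simulate([-1.0, 0.0])
print("fraction of steps spent in the right-hand basin:", np.mean(traj[:, 0] > 0))
```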
### 2.2 Modified Escape Rate
In the SGD context, the escape rate becomes:
k_SGD = (ηω₀ωᵦ/2π)exp(-ΔL/T)
where:
- ω₀ = √|∇²L(θ₀)|: curvature of the local minimum along the escape direction
- ωᵦ = √|∇²L(θ*)|: curvature of the saddle along its unstable (negative-eigenvalue) direction
- ΔL = L(θ*) - L(θ₀): barrier height
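Given estimates of the Hessians at the minimum and the saddle, the rate is straightforward to evaluate. A minimal sketch, assuming the escape direction is the softest mode at the minimum and the negative-curvature mode at the saddle; the Hessians, barrier height, and hyperparameters are illustrative stand-ins:

```python
import numpy as np

# Minimal sketch: k_SGD = (eta * w0 * wb / (2*pi)) * exp(-dL / T),
# with curvatures taken from (illustrative) Hessians at the minimum and the saddle.

def sgd_escape_rate(hess_min, hess_saddle, dL, eta, T):
    eigs_min = np.linalg.eigvalsh(hess_min)
    eigs_saddle = np.linalg.eigvalsh(hess_saddle)
    w0 = np.sqrt(eigs_min.min())          # assumed escape direction: softest mode of the minimum
    wb = np.sqrt(abs(eigs_saddle.min()))  # unstable (negative-eigenvalue) direction at the saddle
    return eta * w0 * wb / (2 * np.pi) * np.exp(-dL / T)

hess_min = np.diag([2.0, 5.0])      # illustrative Hessian at theta_0
hess_saddle = np.diag([-1.0, 5.0])  # illustrative Hessian at theta_*
print(sgd_escape_rate(hess_min, hess_saddle, dL=0.3, eta=0.1, T=0.05))
```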
## 3. Effective Temperature in SGD

### 3.1 Temperature from Batch Noise
The effective temperature comes from batch sampling:
T = η⟨||∇L_B(θ) - ∇L(θ)||²⟩/(2d)
where:
- L_B: mini-batch loss
- d: parameter dimension
- ⟨…⟩: average over batches
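This estimator can be implemented by comparing mini-batch gradients with the full-batch gradient. A minimal sketch on a toy linear-regression loss with synthetic data; the model, data, and hyperparameters are illustrative:

```python
import numpy as np

# Minimal sketch: T = eta * E[ ||grad L_B(theta) - grad L(theta)||^2 ] / (2 * d)
# estimated on a toy linear-regression loss with synthetic data.

rng = np.random.default_rng(1)
n, d = 2048, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

def grad(theta, idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)   # mean-squared-error gradient on the subset

def effective_temperature(theta, eta, batch_size, n_batches=200):
    full_grad = grad(theta, np.arange(n))
    sq_dev = [np.sum((grad(theta, rng.choice(n, batch_size, replace=False)) - full_grad) ** 2)
              for _ in range(n_batches)]
    return eta * np.mean(sq_dev) / (2 * d)

theta = rng.standard_normal(d)
print(effective_temperature(theta, eta=0.1, batch_size=32))
```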
### 3.2 Batch Size Dependence
For batch size B:
T ∝ η/B
This is the linear-scaling heuristic: to keep the effective temperature fixed, larger batches need proportionally larger learning rates.
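A minimal sketch of the resulting scaling heuristic (the reference learning rate and batch size are illustrative):

```python
# Linear-scaling heuristic implied by T ∝ eta / B: keep eta / B constant.
# The reference values are illustrative, not recommendations.

def scaled_learning_rate(batch_size, base_eta=0.1, base_batch=32):
    return base_eta * batch_size / base_batch

for B in (32, 64, 256, 1024):
    print(B, scaled_learning_rate(B))
```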
## 4. Transition State Theory for SGD

### 4.1 Reaction Coordinate
Define a progress variable q(θ) along the escape direction:
q(θ) = (θ - θ₀)·v*
where v* is the eigenvector of ∇²L at the saddle associated with its negative eigenvalue (the unstable direction).
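In practice v* can be obtained from an eigendecomposition of the saddle-point Hessian; for a real network one would use a Hessian-vector-product eigensolver rather than the dense decomposition sketched here. The Hessian and points below are illustrative:

```python
import numpy as np

# Minimal sketch: q(theta) = (theta - theta_0) . v*, with v* the eigenvector of the
# saddle Hessian belonging to its negative eigenvalue (the unstable direction).

def reaction_coordinate(theta, theta_0, hess_saddle):
    eigvals, eigvecs = np.linalg.eigh(hess_saddle)
    v_star = eigvecs[:, np.argmin(eigvals)]  # unstable direction
    return (np.asarray(theta) - np.asarray(theta_0)) @ v_star

hess_saddle = np.array([[-1.0, 0.2],   # illustrative 2x2 Hessian at the saddle
                        [0.2, 4.0]])
print(reaction_coordinate([0.3, 0.1], [-1.0, 0.0], hess_saddle))
```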
### 4.2 Committor Probability
The probability of escaping to the new minimum rather than relaxing back to the old one:
P(θ) = ∫_{q₀}^{q(θ)} exp(+βL(θ'))dq'/Z
where:
- β = 1/T
- q₀ = q(θ₀): reaction-coordinate value at the current minimum
- Z = ∫_{q₀}^{q₁} exp(+βL(θ'))dq': the same integral over the full range up to the new minimum at q₁, so that P runs from 0 to 1
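Along a one-dimensional reaction coordinate the committor can be computed by direct quadrature. A minimal sketch with an illustrative double-well profile and temperature; note how the positive exponent concentrates the weight near the barrier:

```python
import numpy as np

# Minimal sketch: committor along a 1D reaction coordinate,
#   P(q) = ∫_{q0}^{q} exp(+L(q')/T) dq'  /  ∫_{q0}^{q1} exp(+L(q')/T) dq',
# for an illustrative double-well profile L(q) = (q**2 - 1)**2 between minima at ±1.

def committor(loss_on_grid, T):
    weights = np.exp(loss_on_grid / T)
    cumulative = np.cumsum(weights)   # crude left-endpoint quadrature (grid spacing cancels)
    return cumulative / cumulative[-1]

q = np.linspace(-1.0, 1.0, 401)       # from the old minimum (q0) to the new one (q1)
L = (q ** 2 - 1) ** 2                 # barrier at q = 0
P = committor(L, T=0.1)
print(P[200])                         # ≈ 0.5 at the barrier top, by symmetry
```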
## 5. Optimal Training Schedule

### 5.1 Critical Temperature
For each barrier, the temperature needed to cross it within the available training time (setting kτ ≈ 1) is:
T_c = ΔL/ln(τω₀)
where τ is the available training time.
### 5.2 Annealing Schedule
Optimal temperature schedule:
T(t) = T_c(1 + ln(τ/t))/ln(τω₀)
Since T ∝ η at fixed batch size, this translates into the learning rate schedule:
η(t) = η₀(1 + ln(τ/t))/ln(τω₀)
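A minimal sketch of this schedule as a function; η₀, τ, and ω₀ are illustrative placeholders:

```python
import numpy as np

# Minimal sketch of eta(t) = eta0 * (1 + ln(tau / t)) / ln(tau * w0).
# eta0, tau, and w0 below are illustrative placeholders.

def eta_schedule(t, eta0=0.1, tau=1e5, w0=10.0):
    return eta0 * (1 + np.log(tau / t)) / np.log(tau * w0)

for t in (1, 10, 1_000, 100_000):
    print(t, round(eta_schedule(t), 4))
```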
## 6. Practical Implications

### 6.1 Escape Time Distribution
Escape times from a local minimum are exponentially distributed:
P(t_escape > t) = exp(-kt)
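The exponential form can be checked empirically by simulating many escapes and estimating k as the inverse mean escape time. A minimal sketch for the 1D SGD SDE on an illustrative double-well loss, with escape defined as the first crossing of the barrier:

```python
import numpy as np

# Minimal sketch: empirical escape times for the 1D SGD SDE
#   dtheta = -eta * L'(theta) dt + sqrt(2 * eta * T) dW,
# with L(theta) = (theta**2 - 1)**2, started in the left basin at theta = -1.

rng = np.random.default_rng(2)

def escape_time(eta=0.1, T=0.2, dt=1.0, max_steps=200_000):
    theta = -1.0
    for step in range(1, max_steps + 1):
        grad = 4 * theta * (theta ** 2 - 1)
        theta += -eta * grad * dt + np.sqrt(2 * eta * T * dt) * rng.standard_normal()
        if theta > 0.0:                      # first crossing of the barrier top
            return step * dt
    return np.inf

times = np.array([escape_time() for _ in range(100)])
k_hat = 1.0 / times[np.isfinite(times)].mean()   # MLE for an exponential distribution
print("estimated escape rate k:", k_hat)
```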
### 6.2 Optimal Batch Size
For target escape rate k:
B_opt = η⟨||∇L_B(θ) - ∇L(θ)||²⟩/(2dkΔL)
### 6.3 Training Termination
Stop when:
k < 1/(remaining_time × compute_cost)
## 7. Multiple Barriers

### 7.1 Parallel Pathways
Total escape rate:
k_total = Σᵢ kᵢ
### 7.2 Sequential Barriers
Effective rate:
1/k_eff = Σᵢ 1/kᵢ
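Both combination rules are one-liners (the rates below are illustrative):

```python
# Minimal sketch: combining escape rates over multiple barriers.

def parallel_rate(rates):
    # Independent pathways out of the same minimum: rates add.
    return sum(rates)

def sequential_rate(rates):
    # Barriers crossed one after another: mean escape times add, so inverse rates add.
    return 1.0 / sum(1.0 / k for k in rates)

print(parallel_rate([1e-4, 3e-4]))    # 4e-4
print(sequential_rate([1e-4, 3e-4]))  # 7.5e-5
```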
## 8. Loss Landscape Analysis

### 8.1 Barrier Height Estimation
From the Hessian determinants at the minimum and the saddle:
ΔL ≈ -T ln(|det(∇²L(θ₀))/det(∇²L(θ*))|)/2
### 8.2 Transition Path Sampling
Probability of trajectory θ(t):
P[θ(t)] ∝ exp(-∫ ||dθ/dt + η∇L(θ)||²/(4ηT) dt)
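The action in the exponent can be discretized and evaluated for a candidate trajectory, which is the basic ingredient of transition path sampling. A minimal sketch with an illustrative 1D double-well loss and a straight-line path over the barrier:

```python
import numpy as np

# Minimal sketch: discretized action S[theta] = ∫ ||dtheta/dt + eta * grad L||^2 / (4*eta*T) dt,
# so that P[theta(t)] ∝ exp(-S). Evaluated here for a straight-line barrier crossing.

def path_action(traj, grad_loss, eta, T, dt=1.0):
    traj = np.asarray(traj, dtype=float)
    velocity = np.diff(traj, axis=0) / dt
    drift = np.array([eta * grad_loss(th) for th in traj[:-1]])
    integrand = np.sum((velocity + drift) ** 2, axis=1) / (4 * eta * T)
    return np.sum(integrand) * dt

grad_loss = lambda th: np.array([4 * th[0] * (th[0] ** 2 - 1)])  # 1D double-well loss
straight_path = np.linspace([-1.0], [1.0], 100)                  # direct crossing of the barrier
print(path_action(straight_path, grad_loss, eta=0.1, T=0.05))
```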
This framework explains:
1. Why low learning rates (low effective temperature) can leave training stuck in poor minima
2. How batch size affects exploration
3. When to reduce the learning rate
4. Why some models train more easily than others