tags: - colorclass/a thermodynamic theory of statistical learning ---# Error Bounds and Convergence Rates for SGD: A Thermodynamic Analysis
1. Setup and Assumptions
1.1 Basic Dynamics
dθ = -η∇L(θ)dt + √(2ηT)dW
1.2 Key Assumptions
1. L(θ) is μ-strongly convex:
L(θ₂) ≥ L(θ₁) + ∇L(θ₁)ᵀ(θ₂-θ₁) + (μ/2)||θ₂-θ₁||²
2. L(θ) has L-Lipschitz gradients:
||∇L(θ₂) - ∇L(θ₁)|| ≤ L||θ₂-θ₁||
3. Gradient noise has bounded variance:
𝔼[||∇L_B(θ) - ∇L(θ)||²] ≤ σ²
2. Fundamental Bounds
2.1 Free Energy Bound
Define free energy:
F(θ) = L(θ) - T·s(θ)
where s(θ) is local entropy.
Free energy decreases in expectation:
d𝔼[F(θ)] ≤ -η(μL/L+μ)𝔼[F(θ) - F*] + ηTd/2
2.2 Entropy Production Bound
From fluctuation theorem:
𝔼[exp(-Δstot)] = 1
implies:
𝔼[Δstot] ≥ 0
3. Convergence Rates
3.1 Short-Time Regime
For t << τrelax = 1/(ημ):
𝔼[L(θ(t)) - L*] ≤ (L₀ - L*)exp(-ημt) + ηTd/2μ
3.2 Long-Time Regime
For t >> τrelax:
𝔼[L(θ(t)) - L*] ≤ ηTd/2μ + (σ²η/4μ)
3.3 Optimal Learning Rate
Minimizing long-time bound:
η_opt = √(2μ/σ²)
giving error bound:
𝔼[L(θ) - L*] ≤ σ√(T/2μ)
4. Error Probability Bounds
4.1 Concentration Inequality
From large deviation theory:
P(L(θ) - L* ≥ ε) ≤ exp(-βε + S(ε))
where: - β = 1/T - S(ε) is entropy of states with loss gap ε
4.2 Time-Dependent Bound
P(L(θ(t)) - L* ≥ ε) ≤ exp(-ημtε/2T + ηtd/2)
5. Escape Time Bounds
5.1 First Passage Time
Mean time to escape local minimum:
τescape = (2π/ω₀)exp(ΔL/T)
where: - ω₀ = √(μL) - ΔL is barrier height
5.2 Probability of Being Trapped
P(trapped after time t) ≤ exp(-t/τescape)
6. Finite-Batch Effects
6.1 Batch Size Scaling
For batch size B:
T_eff = T₀/B
σ²_eff = σ²/B
6.2 Modified Bounds
𝔼[L(θ) - L*] ≤ σ√(T₀/2μB)
7. Practical Implications
7.1 Learning Rate Schedule
To maintain constant escape probability:
η(t) = η₀/√(1 + t/τrelax)
7.2 Batch Size Schedule
To maintain constant effective temperature:
B(t) = B₀(1 + t/τrelax)
7.3 Error vs Compute Trade-off
For compute budget C:
min error ∝ 1/√C
8. Multiple Local Minima
8.1 Basin Hopping Rate
Transition rate between basins:
k_{i→j} = (ω₀/2π)exp(-ΔL_{ij}/T)
8.2 Equilibration Time
Time to reach equilibrium distribution:
τeq = max(1/λ₂)
where λ₂ is second eigenvalue of transition matrix.
9. Synthesis: Overall Bounds
9.1 Combined Error Bound
𝔼[L(θ) - L*] ≤ min{
(L₀ - L*)exp(-ημt),
σ√(T/2μ)
} + ηTd/2μ
9.2 Probability of Finding Global Minimum
P(L(θ) - L* ≤ ε) ≥ 1 - exp(-βε + Sd)
where Sd is entropy of d-dimensional parameter space.