tags:
  - colorclass/a thermodynamic theory of statistical learning
---
see also:
- Information Flow Across Capacity-Limited Computational Media
> assume a fixed time budget (GPU-hours) to train a model, and unrestricted access to unique tokens. assume that every time we increase the batch size by a factor of k, the loss decrease per sample (i.e. the sample efficiency) falls by a factor of 1/sqrt(k). increasing the batch size can increase throughput, but only up to a point: there are diminishing returns, so there exists a sweet-spot batch size that balances throughput against sample efficiency and yields the most time-efficient training, i.e. the smallest loss achievable within the time budget.
Problem Formalization
Given:
- Fixed time budget $T$ (in GPU-hours)
- Loss function $L$ to be minimized
- Base batch size $b_0$
- Base throughput rate $r_0$ (samples/second at $b_0$)
- Base sample efficiency $e_0$ (loss decrease per sample at $b_0$)
Efficiency-Batch Size Relationship
For batch size $b = k \cdot b_0$, the sample efficiency follows:

$$e(k) = \frac{e_0}{\sqrt{k}}$$
This reflects the empirical observation that larger batch sizes reduce the information gained per sample.
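As a minimal sketch of this relationship (assuming `e0` is the measured base sample efficiency and `k` the batch-size scale factor; both names are illustrative):

```python
import math

def sample_efficiency(k: float, e0: float) -> float:
    """Loss decrease per sample at batch size k * b0, under the 1/sqrt(k) assumption."""
    return e0 / math.sqrt(k)

# Quadrupling the batch size halves the per-sample loss decrease.
assert math.isclose(sample_efficiency(4, e0=0.01), 0.005)
```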
Throughput-Batch Size Relationship
The computational throughput typically follows:

$$r(k) = r_0 \cdot f(k)$$

where $f(k)$ is a hardware utilization function that depends on:
- Memory bandwidth
- Computational parallelism
- Hardware architecture

Common approximation: $f(k) = \min(k, k_{\text{sat}})$, where $k_{\text{sat}}$ is the hardware saturation point.
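A sketch of this saturating approximation; `k_sat` stands in for the hardware saturation point and `r0` for the measured base throughput (both assumed values):

```python
def hardware_utilization(k: float, k_sat: float) -> float:
    """Utilization scales linearly with the batch-size factor k, then plateaus at k_sat."""
    return min(k, k_sat)

def throughput(k: float, r0: float, k_sat: float) -> float:
    """Samples/second at batch size k * b0, given base throughput r0 at b0."""
    return r0 * hardware_utilization(k, k_sat)
```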
Total Progress Metric
The training progress over the time budget $T$ can be expressed as:

$$P(k) = T \cdot r(k) \cdot e(k)$$

Substituting the relationships:

$$P(k) = T \cdot r_0 f(k) \cdot \frac{e_0}{\sqrt{k}} = T r_0 e_0 \cdot \frac{f(k)}{\sqrt{k}}$$
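Combining the two relationships, a sketch of the progress metric (the parameter names mirror the assumed measurements above):

```python
import math

def progress(k: float, T: float, r0: float, e0: float, k_sat: float) -> float:
    """Total loss decrease achievable within time budget T at batch-size scale factor k."""
    samples_per_second = r0 * min(k, k_sat)   # throughput r(k)
    loss_per_sample = e0 / math.sqrt(k)       # sample efficiency e(k)
    return T * samples_per_second * loss_per_sample
```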
Optimal Batch Size
To find the optimal scale factor $k^*$, solve:

$$\frac{dP}{dk} = 0$$

Taking the derivative:

$$\frac{dP}{dk} = T r_0 e_0 \cdot \frac{d}{dk}\!\left[\frac{f(k)}{\sqrt{k}}\right] = T r_0 e_0 \cdot \frac{2k\, f'(k) - f(k)}{2\, k^{3/2}}$$
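The derivative can be checked symbolically; a small SymPy sketch, leaving the utilization function $f$ unspecified:

```python
import sympy as sp

k = sp.symbols("k", positive=True)
f = sp.Function("f")

dP = sp.diff(f(k) / sp.sqrt(k), k)
print(sp.together(dP))
# Combines to (2*k*f'(k) - f(k)) / (2*k**(3/2)), matching the expression above.
```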
Case Analysis
1. Pre-saturation ($k \le k_{\text{sat}}$):
   - $f(k) = k$
   - $P(k) \propto \sqrt{k}$, so $\frac{dP}{dk} > 0$
   - Optimal: progress keeps improving as $k$ grows toward the saturation point
2. Post-saturation ($k > k_{\text{sat}}$):
   - $f(k) = k_{\text{sat}}$
   - $P(k) \propto \frac{k_{\text{sat}}}{\sqrt{k}}$, so $\frac{dP}{dk} < 0$
   - No local maximum beyond saturation

Therefore, progress is maximized at the saturation point:

$$k^* = k_{\text{sat}}, \qquad b_{\text{opt}} = k_{\text{sat}} \cdot b_0$$
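A quick numerical check of this conclusion, reusing the progress sketch above with illustrative values for `T`, `r0`, `e0`, and `k_sat`:

```python
import math

def progress(k, T, r0, e0, k_sat):
    return T * (r0 * min(k, k_sat)) * (e0 / math.sqrt(k))

candidates = [1, 2, 4, 8, 16, 32, 64]
scores = {k: progress(k, T=3600.0, r0=100.0, e0=1e-4, k_sat=16) for k in candidates}
print(max(scores, key=scores.get))   # 16 -- progress peaks at the saturation point
```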
Practical Implementation
The optimal batch sizing algorithm:
1. Measure base metrics:

   ```python
   b0 = minimum_viable_batch_size()   # smallest batch size that trains stably
   r0 = measure_throughput(b0)        # samples/second at b0
   e0 = measure_efficiency(b0)        # loss decrease per sample at b0
   ```

2. Determine hardware saturation:

   ```python
   k_max = find_throughput_plateau()  # largest scale factor with near-linear throughput
   ```

3. Set optimal batch size:

   ```python
   k_opt = min(4, k_max)              # saturation point, capped at a conservative 4x scale-up
   b_opt = b0 * k_opt
   ```
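The measurement helpers above are placeholders. One possible way to realize the throughput-related ones (`measure_throughput`, `find_throughput_plateau`), assuming a `step_fn(batch_size)` callback that runs a single training step; this is a sketch, not a definitive implementation:

```python
import time

def measure_throughput(batch_size, step_fn, n_steps=20):
    """Estimate samples/second by timing n_steps training steps at this batch size."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn(batch_size)
    elapsed = time.perf_counter() - start
    return n_steps * batch_size / elapsed

def find_throughput_plateau(b0, step_fn, tolerance=0.8, max_doublings=8):
    """Double the batch size until samples/second stops scaling near-linearly;
    return the largest scale factor k that still scaled well."""
    base = measure_throughput(b0, step_fn)
    k = 1
    for _ in range(max_doublings):
        r = measure_throughput(b0 * 2 * k, step_fn)
        if r < tolerance * (2 * k) * base:   # fell well short of linear scaling
            break
        k *= 2
    return k
```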
Trade-off Analysis
The key trade-offs involved:
1. Memory-Compute Trade-off:
   - Larger batches → better hardware utilization
   - Until memory bandwidth saturates
2. Statistical-Computational Trade-off:
   - Larger batches → lower sample efficiency
   - But higher throughput up to saturation
3. Time-Quality Trade-off:
   - Fixed time budget
   - Need to maximize (throughput × efficiency)
Optimization Extensions
1. Dynamic Batch Sizing:
   - Adjust the batch size during training
   - Based on loss landscape characteristics
2. Gradient Accumulation:
   - Simulate larger batches
   - Without the memory overhead (see the sketch after this list)
3. Mixed-Precision Training:
   - Increase the effective batch size
   - Through reduced-precision computation
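For the gradient-accumulation extension, a minimal PyTorch-style sketch (the model, data, and `accum_steps` value are illustrative): it reaches an effective batch of `accum_steps ×` the micro-batch size while only ever holding one micro-batch of activations in memory.

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 4   # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(100):
    x, y = torch.randn(8, 32), torch.randn(8, 1)   # micro-batch of 8
    loss = loss_fn(model(x), y) / accum_steps      # scale so accumulated grads average
    loss.backward()                                # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per effective (large) batch
        optimizer.zero_grad()
```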
Theoretical Guarantees
Under mild assumptions about the loss landscape:
1. Convergence Rate:

   $$\mathbb{E}[L(\theta)] - L^{*} = O\!\left(\frac{1}{\sqrt{N}}\right)$$

   where $N$ is the dataset size.