---
tags:
  - colorclass/a thermodynamic theory of statistical learning
---

see also:
- Information Flow Across Capacity-Limited Computational Media

> assume a fixed time budget (GPU-hours) to train a model. assume unrestricted access to unique tokens. assume that every time we increase the batch size by a factor of k, the loss decrease per sample (i.e. the sample efficiency) decreases by a factor of 1/sqrt(k). increasing batch size can potentially increase throughput, but only up to a point of diminishing returns; i.e. there exists some sweet-spot batch size that balances throughput against sample efficiency, resulting in the most time-efficient training: the smallest loss achievable within the time budget.
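Concretely, under these assumptions: quadrupling the batch size ($k = 4$) quadruples samples/second while throughput is still pre-saturation, but halves the loss decrease per sample ($1/\sqrt{4} = 1/2$), for a net $2\times$ speedup in progress; past the saturation point, further increases only pay the $1/\sqrt{k}$ penalty.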

## Problem Formalization

Given:

- Fixed time budget $T$ (in GPU-hours)
- Loss function $L$ to be minimized
- Base batch size $b_0$
- Base throughput rate $r_0$ (samples/second at $b_0$)
- Base sample efficiency $e_0$ (loss decrease per sample at $b_0$)

## Efficiency-Batch Size Relationship

For batch size $b = k \cdot b_0$, the sample efficiency follows:

$$e(k) = \frac{e_0}{\sqrt{k}}$$

This reflects the empirical observation that larger batch sizes reduce the information gained per sample.
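A minimal sketch of this relationship in Python (the base efficiency value is illustrative, not measured):

```python
import math

def sample_efficiency(k: float, e0: float = 1e-6) -> float:
    """Loss decrease per sample at batch size k * b0, assuming e(k) = e0 / sqrt(k)."""
    return e0 / math.sqrt(k)

# doubling the batch size costs ~29% of per-sample efficiency: 1/sqrt(2) ≈ 0.707
print(sample_efficiency(2.0) / sample_efficiency(1.0))
```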

## Throughput-Batch Size Relationship

The computational throughput typically follows:

$$r(k) = r_0 \cdot k \cdot u(k)$$

where $u(k)$ is a hardware utilization function that depends on:

- Memory bandwidth
- Computational parallelism
- Hardware architecture

Common approximation: $u(k) = \min\!\left(1, \frac{k_{\max}}{k}\right)$, where $k_{\max}$ is the hardware saturation point, so that $r(k) = r_0 \cdot \min(k, k_{\max})$.
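Under that approximation, throughput scales linearly up to $k_{\max}$ and is flat afterwards; a sketch with illustrative constants:

```python
def throughput(k: float, r0: float = 1000.0, k_max: float = 8.0) -> float:
    """Samples/second at batch size k * b0: r(k) = r0 * k * u(k), u(k) = min(1, k_max / k)."""
    return r0 * k * min(1.0, k_max / k)

print(throughput(4.0))   # 4000.0 — pre-saturation, linear in k
print(throughput(16.0))  # 8000.0 — post-saturation, flat at r0 * k_max
```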

## Total Progress Metric

The training progress (total loss decrease) over the time budget $T$ can be expressed as:

$$P(k) = T \cdot r(k) \cdot e(k)$$

Substituting the relationships:

$$P(k) = T \cdot r_0 \cdot \min(k, k_{\max}) \cdot \frac{e_0}{\sqrt{k}}$$
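Combining the pieces gives a one-line progress function; a sketch reusing the illustrative constants above, with the time budget $T$ in seconds:

```python
import math

def progress(k: float, T: float = 3600.0, r0: float = 1000.0,
             e0: float = 1e-6, k_max: float = 8.0) -> float:
    """Total loss decrease over the budget: P(k) = T * r0 * min(k, k_max) * e0 / sqrt(k)."""
    return T * r0 * min(k, k_max) * e0 / math.sqrt(k)
```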

## Optimal Batch Size

To find the optimal $k$, solve:

$$\frac{dP}{dk} = 0$$

Taking the derivative piecewise:

$$\frac{dP}{dk} =
\begin{cases}
\dfrac{T\, r_0\, e_0}{2\sqrt{k}} > 0, & k < k_{\max} \\[1.2ex]
-\dfrac{T\, r_0\, e_0\, k_{\max}}{2\, k^{3/2}} < 0, & k > k_{\max}
\end{cases}$$

## Case Analysis

1. **Pre-saturation** ($k \leq k_{\max}$):
   - $r(k) = r_0 k$
   - $P(k) = T r_0 e_0 \sqrt{k}$ is increasing in $k$
   - Optimal: push $k$ up to the saturation point

2. **Post-saturation** ($k > k_{\max}$):
   - $r(k) = r_0 k_{\max}$
   - $P(k) = T r_0 e_0 k_{\max} / \sqrt{k}$ is decreasing in $k$
   - No local maximum

Therefore:

$$k_{\text{opt}} = k_{\max}$$
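A quick numeric check of this conclusion, scanning the `progress` sketch from above over a grid of batch-size multipliers:

```python
import math

def progress(k, T=3600.0, r0=1000.0, e0=1e-6, k_max=8.0):
    return T * r0 * min(k, k_max) * e0 / math.sqrt(k)

ks = [2 ** i for i in range(7)]   # k = 1, 2, 4, ..., 64
print(max(ks, key=progress))      # 8 — the maximum sits exactly at k_max
```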

## Practical Implementation

The optimal batch sizing algorithm:

1. Measure base metrics:

```python
b0 = minimum_viable_batch_size()
r0 = measure_throughput(b0)
e0 = measure_efficiency(b0)
```

2. Determine hardware saturation:

```python
k_max = find_throughput_plateau()
```

3. Set optimal batch size:

```python
k_opt = min(4, k_max)  # conservative cap at 4x; the pure model above gives k_opt = k_max
b_opt = b0 * k_opt
```
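A runnable end-to-end sketch of these three steps. The measurement helpers are stubbed with a synthetic throughput curve; a real implementation would time actual training steps on the target hardware, so every function body below is an assumption for illustration:

```python
def minimum_viable_batch_size() -> int:
    """Stub: the smallest batch size that trains stably (assumed known)."""
    return 32

def measure_throughput(batch_size: int) -> float:
    """Stub: samples/second at this batch size. Synthetic curve that
    scales linearly with k and saturates at k_max = 8."""
    k = batch_size / 32
    return 1000.0 * min(k, 8.0)

def find_throughput_plateau() -> float:
    """Double the batch-size multiplier until throughput stops scaling meaningfully."""
    b0, k = minimum_viable_batch_size(), 1.0
    while measure_throughput(int(b0 * 2 * k)) > 1.5 * measure_throughput(int(b0 * k)):
        k *= 2
    return k

b0 = minimum_viable_batch_size()
k_max = find_throughput_plateau()
k_opt = min(4, k_max)   # the note's conservative 4x cap; the pure model gives k_opt = k_max
b_opt = b0 * k_opt
print(b_opt)            # 128 with these synthetic numbers
```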

## Trade-off Analysis

The key trade-offs involved:

1. **Memory-Compute Trade-off**:
   - Larger batches → better hardware utilization
   - Until memory bandwidth saturates

2. **Statistical-Computational Trade-off**:
   - Larger batches → lower sample efficiency
   - But higher throughput up to saturation

3. **Time-Quality Trade-off**:
   - Fixed time budget
   - Need to maximize (throughput × efficiency)

## Optimization Extensions

1. **Dynamic Batch Sizing**:
   - Adjust batch size during training
   - Based on loss landscape characteristics

2. **Gradient Accumulation** (see the sketch after this list):
   - Simulate larger batches
   - Without memory overhead

3. **Mixed-Precision Training**:
   - Increase effective batch size
   - Through reduced precision computation
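A minimal sketch of extension 2 using PyTorch gradient accumulation; the model, optimizer, and data below are toy placeholders rather than part of the note:

```python
import torch
from torch import nn

# toy setup so the loop runs end to end (placeholders, not a real workload)
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

accum_steps = 4  # simulate a 4x larger effective batch without 4x the memory

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()   # scale so gradients average over the virtual batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one parameter update per accum_steps micro-batches
        optimizer.zero_grad()
```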

## Theoretical Guarantees

Under mild assumptions about the loss landscape:

1. **Convergence Rate**: after processing $N$ samples,

$$\mathbb{E}[L_N] - L^* = O\!\left(\frac{1}{\sqrt{N}}\right)$$

2. **Generalization Bound**:

$$L_{\text{test}} - L_{\text{train}} = O\!\left(\sqrt{\frac{1}{n}}\right)$$

where $n$ is the dataset size.