tags:
  - colorclass/a thermodynamic theory of statistical learning
---
see also:
- Information Flow Across Capacity-Limited Computational Media
> assume a fixed time budget (GPU-hours) to train a model, and unrestricted access to unique tokens. assume that every time we increase the batch size by a factor of k, the loss decrease per sample (i.e. the sample efficiency) falls by a factor of 1/sqrt(k). increasing the batch size can increase throughput, but only up to a point: there are diminishing returns, so there exists a sweet-spot batch size that balances throughput against sample efficiency and yields the most time-efficient training, i.e. the smallest loss achievable within the time budget.
Problem Formalization
Given:
- Fixed time budget $T$ (in GPU-hours)
- Loss function $L$ to be minimized
- Base batch size $b_0$
- Base throughput rate $r_0$ (samples/second at $b_0$)
- Base sample efficiency $e_0$ (loss decrease per sample at $b_0$)
Efficiency-Batch Size Relationship
For batch size $b = k \cdot b_0$, the sample efficiency follows:

$$e(k) = \frac{e_0}{\sqrt{k}}$$
This reflects the empirical observation that larger batch sizes reduce the information gained per sample.
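As a minimal sketch of this relationship (assuming `e0` is the measured base sample efficiency and `k` the batch-size scale factor; both names are illustrative):

```python
import math

def sample_efficiency(k: float, e0: float) -> float:
    """Loss decrease per sample at batch size k * b0, under the 1/sqrt(k) assumption."""
    return e0 / math.sqrt(k)

# Quadrupling the batch size halves the per-sample loss decrease.
assert math.isclose(sample_efficiency(4, e0=0.01), 0.005)
```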
Throughput-Batch Size Relationship
The computational throughput typically follows:

$$r(k) = r_0 \cdot f(k)$$

where $f(k)$ is a hardware utilization function that depends on:
- Memory bandwidth
- Computational parallelism
- Hardware architecture

Common approximation: $f(k) = \min(k, k_{\text{sat}})$, where $k_{\text{sat}}$ is the hardware saturation point.
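A sketch of this saturating approximation; `k_sat` stands in for the hardware saturation point and `r0` for the measured base throughput (both assumed values):

```python
def hardware_utilization(k: float, k_sat: float) -> float:
    """Utilization scales linearly with the batch-size factor k, then plateaus at k_sat."""
    return min(k, k_sat)

def throughput(k: float, r0: float, k_sat: float) -> float:
    """Samples/second at batch size k * b0, given base throughput r0 at b0."""
    return r0 * hardware_utilization(k, k_sat)
```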
Total Progress Metric
The training progress over the time budget $T$ can be expressed as:

$$P(k) = T \cdot r(k) \cdot e(k)$$

Substituting the relationships:

$$P(k) = T \cdot r_0 f(k) \cdot \frac{e_0}{\sqrt{k}} = T r_0 e_0 \cdot \frac{f(k)}{\sqrt{k}}$$
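Combining the two relationships, a sketch of the progress metric (the parameter names mirror the assumed measurements above):

```python
import math

def progress(k: float, T: float, r0: float, e0: float, k_sat: float) -> float:
    """Total loss decrease achievable within time budget T at batch-size scale factor k."""
    samples_per_second = r0 * min(k, k_sat)   # throughput r(k)
    loss_per_sample = e0 / math.sqrt(k)       # sample efficiency e(k)
    return T * samples_per_second * loss_per_sample
```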
Optimal Batch Size
To find the optimal scale factor $k^*$, solve:

$$\frac{dP}{dk} = 0$$

Taking the derivative:

$$\frac{dP}{dk} = T r_0 e_0 \cdot \frac{d}{dk}\!\left[\frac{f(k)}{\sqrt{k}}\right] = T r_0 e_0 \cdot \frac{2k\, f'(k) - f(k)}{2\, k^{3/2}}$$
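The derivative can be checked symbolically; a small SymPy sketch, leaving the utilization function $f$ unspecified:

```python
import sympy as sp

k = sp.symbols("k", positive=True)
f = sp.Function("f")

dP = sp.diff(f(k) / sp.sqrt(k), k)
print(sp.together(dP))
# Combines to (2*k*f'(k) - f(k)) / (2*k**(3/2)), matching the expression above.
```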
Case Analysis
1. Pre-saturation ($k \le k_{\text{sat}}$):
   - $f(k) = k$
   - $P(k) \propto \sqrt{k}$, so $\frac{dP}{dk} > 0$
   - Optimal: progress keeps improving as $k$ grows toward the saturation point
2. Post-saturation ($k > k_{\text{sat}}$):
   - $f(k) = k_{\text{sat}}$
   - $P(k) \propto \frac{k_{\text{sat}}}{\sqrt{k}}$, so $\frac{dP}{dk} < 0$
   - No local maximum beyond saturation

Therefore, progress is maximized at the saturation point:

$$k^* = k_{\text{sat}}, \qquad b_{\text{opt}} = k_{\text{sat}} \cdot b_0$$
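A quick numerical check of this conclusion, reusing the progress sketch above with illustrative values for `T`, `r0`, `e0`, and `k_sat`:

```python
import math

def progress(k, T, r0, e0, k_sat):
    return T * (r0 * min(k, k_sat)) * (e0 / math.sqrt(k))

candidates = [1, 2, 4, 8, 16, 32, 64]
scores = {k: progress(k, T=3600.0, r0=100.0, e0=1e-4, k_sat=16) for k in candidates}
print(max(scores, key=scores.get))   # 16 -- progress peaks at the saturation point
```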
Practical Implementation
The optimal batch sizing algorithm:
1. Measure base metrics:

   ```python
   b0 = minimum_viable_batch_size()   # smallest batch size that trains stably
   r0 = measure_throughput(b0)        # samples/second at b0
   e0 = measure_efficiency(b0)        # loss decrease per sample at b0
   ```

2. Determine hardware saturation:

   ```python
   k_max = find_throughput_plateau()  # largest scale factor with near-linear throughput
   ```

3. Set optimal batch size:

   ```python
   k_opt = min(4, k_max)              # saturation point, capped at a conservative 4x scale-up
   b_opt = b0 * k_opt
   ```
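The measurement helpers above are placeholders. One possible way to realize the throughput-related ones (`measure_throughput`, `find_throughput_plateau`), assuming a `step_fn(batch_size)` callback that runs a single training step; this is a sketch, not a definitive implementation:

```python
import time

def measure_throughput(batch_size, step_fn, n_steps=20):
    """Estimate samples/second by timing n_steps training steps at this batch size."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn(batch_size)
    elapsed = time.perf_counter() - start
    return n_steps * batch_size / elapsed

def find_throughput_plateau(b0, step_fn, tolerance=0.8, max_doublings=8):
    """Double the batch size until samples/second stops scaling near-linearly;
    return the largest scale factor k that still scaled well."""
    base = measure_throughput(b0, step_fn)
    k = 1
    for _ in range(max_doublings):
        r = measure_throughput(b0 * 2 * k, step_fn)
        if r < tolerance * (2 * k) * base:   # fell well short of linear scaling
            break
        k *= 2
    return k
```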
Trade-off Analysis
The key trade-offs involved:
1. Memory-Compute Trade-off:
   - Larger batches → better hardware utilization
   - Until memory bandwidth saturates
2. Statistical-Computational Trade-off:
   - Larger batches → lower sample efficiency
   - But higher throughput up to saturation
3. Time-Quality Trade-off:
   - Fixed time budget
   - Need to maximize (throughput × efficiency)
Optimization Extensions
1. Dynamic Batch Sizing:
   - Adjust the batch size during training
   - Based on loss landscape characteristics
2. Gradient Accumulation:
   - Simulate larger batches
   - Without the memory overhead (see the sketch after this list)
3. Mixed-Precision Training:
   - Increase the effective batch size
   - Through reduced-precision computation
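For the gradient-accumulation extension, a minimal PyTorch-style sketch (the model, data, and `accum_steps` value are illustrative): it reaches an effective batch of `accum_steps ×` the micro-batch size while only ever holding one micro-batch of activations in memory.

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 4   # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(100):
    x, y = torch.randn(8, 32), torch.randn(8, 1)   # micro-batch of 8
    loss = loss_fn(model(x), y) / accum_steps      # scale so accumulated grads average
    loss.backward()                                # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per effective (large) batch
        optimizer.zero_grad()
```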
Theoretical Guarantees
Under mild assumptions about the loss landscape:
1. Convergence Rate:

   $$\mathbb{E}[L(\theta)] - L^{*} = O\!\left(\frac{1}{\sqrt{N}}\right)$$

   where $N$ is the dataset size.