Problems with Chinchilla Approach 2

Systematic biases in scaling law inference from IsoFLOP parabola fits

Artistic rendering of IsoFLOP curves

Motivation

Chinchilla Approach 2 is arguably the most widely adopted method for fitting scaling laws in practice today. Introduced in the original Chinchilla paper[1], it has since been used by leading AI labs including DeepMind[1],[7] (its creators), Meta[2],[9], DeepSeek[3], Microsoft[4], Amazon[6], Waymo[8], and Arc Institute[5], among others. It is also a workhorse method for academic scaling law studies[10],[11],[12] and high-profile practitioner tutorials from researchers like Andrej Karpathy.

The method's appeal lies in its stability and data efficiency relative to nonlinear optimization over all loss surface parameters. Rather than fitting all five parameters of the loss surface simultaneously, Approach 2 targets only the two scaling exponents, relying on second-order Taylor approximations that reduce each IsoFLOP curve to a simple parabola. This sacrifices recovery of the full loss surface but makes estimation far more stable and data-efficient, letting practitioners extract the most actionable quantities for compute allocation planning through a sequence of straightforward polynomial and linear fits, without ever touching a nonlinear optimizer.

Despite this broad adoption, the sensitivity of the method's core approximations and its behavior on loss surfaces that are less symmetric than the original Chinchilla form (where parameter and token scaling exponents are roughly equal) have not, to our knowledge, been studied in detail. Here we revisit the basics of fitting a simple model like Chinchilla to validation loss alone, with high precision and stability, before considering more advanced extensions. We investigate using noise-free synthetic simulations, which eliminate all statistical noise and thereby isolate the systematic biases inherent to the method itself.

We show how these biases affect downstream decisions like dataset size selection for final training runs at large compute budgets. We show how extrapolation errors trace back to suboptimal IsoFLOP experiment design, and that pathologies in these designs can be observed in real, high-profile scaling law studies even if they are difficult to quantify precisely. Finally, we propose an alternative fitting method that is simple, stable, and free of these biases while building on the same intuitive computational shortcut: optimizing exponential terms separately from linear terms. We call this approach Variable Projection with Non-negative Least Squares (VPNLS).

This investigation is also motivated by a broader landscape of analytical extensions to the Chinchilla loss surface. A growing body of work adds or modifies terms in the original functional form to account for additional training configuration choices such as data repetition[17],[14], overfitting[18], precision[19], MoE sparsity[24], data quality[21], data mixtures[22],[23],[14], non-embedding parameters[20], and downstream task performance[25], to name a few. These extensions prescribe explicit functional forms rather than inferring scaling law structure automatically, and they build directly on the Chinchilla model as a foundation. A fitting method that recovers the base surface with higher precision may therefore offer a stronger starting point for these richer settings as well.

Preliminaries: Loss Surface, Notation, and Fitting Methods

Neural scaling laws describe how model performance improves with compute. The Chinchilla loss surface models this relationship as:

\[ L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta} \]

where \(N\) is the number of parameters, \(D\) is the number of training tokens, \(E\) is the irreducible loss, and \(A, B, \alpha, \beta\) capture how quickly performance improves with scale.

Given a compute budget \(C \approx 6ND\), the optimal allocation satisfies:

\[ N^* \propto C^a \quad \text{where} \quad a = \frac{\beta}{\alpha + \beta} \] \[ D^* \propto C^b \quad \text{where} \quad b = \frac{\alpha}{\alpha + \beta} \]
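These allocation exponents follow from a short calculation: substituting \(D = C/(6N)\) into the loss and setting \(\partial L / \partial N = 0\) gives

\[ \alpha A N^{-\alpha - 1} = \beta B \left(\frac{C}{6}\right)^{-\beta} N^{\beta - 1} \quad\Longrightarrow\quad N^* = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}} \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha + \beta}}, \]

so \(N^* \propto C^{\beta/(\alpha+\beta)}\) and, since \(D^* = C/(6N^*)\), \(D^* \propto C^{\alpha/(\alpha+\beta)}\).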

Recovering the exponents \(a\) and \(b\) from empirical training runs is crucial for planning efficient large-scale training. Two canonical approaches exist:

Approach 2: IsoFLOP Parabolic Fitting

This method is presented in the Chinchilla paper. The key insight is that along a fixed-compute contour (IsoFLOP curve), loss as a function of \(\log N\) is approximately parabolic near the optimum.

  1. Sample IsoFLOP contours: For each compute budget \(C\), train models at various \((N, D)\) pairs satisfying \(C = 6ND\)
  2. Fit parabolas: For each budget, fit \(L = p(\log N)^2 + q(\log N) + r\) and extract the minimum \(N^*\)
  3. Fit power laws: Regress \(\log N^*\) against \(\log C\) to recover the exponent \(a\) (and similarly for \(D^*\), \(b\))

The appeal is simplicity: only polynomial fits, no nonlinear optimization. The parabolic approximation comes from a Taylor expansion of the loss surface around the optimum.

Approach 3: Direct Surface Fitting

The alternative is to fit all five parameters \((E, A, B, \alpha, \beta)\) simultaneously via nonlinear least squares. This avoids the parabolic approximation entirely but is notoriously unstable: highly sensitive to initialization and prone to converging to spurious local minima.

The Happy Path: Symmetric Surfaces

Before examining failure modes, let's establish that Approach 2 works perfectly under ideal conditions. Consider a symmetric loss surface where \(\alpha = \beta\):

\[ L(N, D) = 1.69 + \frac{400}{N^{0.31}} + \frac{400}{D^{0.31}} \]

With equal exponents, the optimal allocation splits compute evenly between parameters and data. The true scaling exponents are:

\[ a = b = \frac{0.31}{0.31 + 0.31} = 0.5 \]

We sample five IsoFLOP contours spanning \(10^{17}\) to \(10^{21}\) FLOPs, with 15 model sizes per curve, fit parabolas to each, and extract the optimal token count \(D^*\). All simulations throughout this article use these same five compute budgets and 15 points per IsoFLOP curve.
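For concreteness, here is a minimal Python sketch of this procedure on the symmetric surface (our own illustration, not the article's code; the grid width matches the ±16× default introduced later):

import numpy as np

E, A, B, alpha, beta = 1.69, 400, 400, 0.31, 0.31       # the symmetric surface above

def loss(N, D):
    return E + A / N**alpha + B / D**beta

budgets = np.logspace(17, 21, 5)                         # five compute budgets
width, n_points = 2.41, 15                               # decades per grid, points per curve
log_d_star = []
for C in budgets:
    d_center = np.sqrt(C / 6)                            # true D* here; only a guess in practice
    logD = np.log10(d_center) + np.linspace(-width / 2, width / 2, n_points)
    D = 10**logD
    N = C / (6 * D)                                      # stay on the IsoFLOP contour
    p, q, _ = np.polyfit(logD, loss(N, D), deg=2)        # parabola in log10(D)
    log_d_star.append(-q / (2 * p))                      # vertex = inferred log10(D*)
b, b0 = np.polyfit(np.log10(budgets), log_d_star, deg=1)
print(b, b0)                                             # ~0.5 and ~-0.389 (= -0.5*log10(6))

Because the surface is symmetric, the fitted vertices coincide with the true optima, which is why the recovered exponent and intercept below are exact.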

Approach 2 on symmetric surface showing perfect recovery
Figure 1: Approach 2 applied to a symmetric loss surface. Left: IsoFLOP curves with fitted parabolas. True (×) and inferred (+) optima are indistinguishable. Right: Power-law fit recovers the exact scaling exponent.

The results confirm perfect recovery of the token scaling exponent and intercept:

Parameter True Value Inferred Value Relative Error
b (D* exponent) 0.500000 0.500000 +6.2×10⁻¹²%
b₀ (D* intercept) −0.389076 −0.389076 −1.4×10⁻¹⁰%
✓ Key Result

On a symmetric loss surface with perfectly crafted IsoFLOP grid sampling, Approach 2 recovers both exponents and intercepts with machine-precision accuracy. When \(\alpha = \beta\), the parabola vertex shift is zero, so the inferred optima coincide with the true optima.

This establishes our baseline: Approach 2 is exactly correct under ideal conditions, but those conditions are unrealistic in practice. The problems arise when they are perturbed in controlled ways, as the following sections show.

Asymmetric Surfaces: Intercept and Extrapolation Errors

We repeat the exact same procedure as before: perfect sampling centers, no noise, identical methodology. The only change is that the loss surface is now asymmetric (\(\alpha \neq \beta\)).

What Happens

Simulation results show that when the loss surface is asymmetric, Approach 2 produces systematically wrong intercepts while exponents remain accurate. This isn't statistical noise; it's a deterministic bias from fitting parabolas to a non-parabolic surface.

We test two configurations to see how the effect scales: the Chinchilla surface (\(\alpha = 0.34\), \(\beta = 0.28\), exponent ratio \(\approx 1.2\)) and a more strongly Asymmetric surface (\(\alpha = 0.465\), \(\beta = 0.155\), exponent ratio 3.0).

The Asymmetric surface is not a contrived stress test. An exponent ratio of 3.0 is comparable to what has been observed in practice, e.g. DeepSeek[3] reports compute-optimal allocation exponents of \(a = 0.73\), \(b = 0.27\) for an OpenWebText2 variant. This implies a loss surface exponent ratio of \(\beta / \alpha \approx 2.7\). The asymmetry runs in the opposite direction from our Asymmetric surface (\(\beta > \alpha\) rather than \(\alpha > \beta\)), but the degree of imbalance is similar, and it is the magnitude of the imbalance, not its direction, that drives the biases studied here.

Approach 2 on asymmetric surfaces showing intercept errors
Figure 2: Approach 2 on asymmetric loss surfaces. Note the visible gap between true (dashed) and inferred (solid) power-law lines in the Asymmetric case. The exponents match perfectly, but the intercepts differ.

Chinchilla Surface

Parameter True Value Inferred Value Relative Error
b (D* exponent) 0.548387 0.548387 ≈ 0%
b₀ (D* intercept) −0.555357 −0.578092 −4.1%

Asymmetric Surface

Parameter True Value Inferred Value Relative Error
b (D* exponent) 0.750000 0.750000 ≈ 0%
b₀ (D* intercept) −1.345791 −1.459957 −8.5%

Why This Is Surprising

A few percent error in the intercept might seem minor, but consider that this simulation gave Approach 2 every advantage. The data is perfect: no measurement noise, with every point lying exactly on the true loss surface. The sampling is perfect too, with IsoFLOP grids centered precisely at the true optimum (something you wouldn't know how to do in practice). And the parameters are standard, taken directly from the Chinchilla paper rather than contrived to expose a potentially unrealistic weakness.

✓ Key Result

Even under these ideal conditions, Approach 2 produces biased intercepts for asymmetric surfaces. The error is systematic, a property of the parabolic approximation, not statistical noise.

Why It Happens

The IsoFLOP loss curve is not a true parabola; it contains exponential terms. When a parabola is fit to this curve, the parabola's minimum (vertex) doesn't land exactly at the true optimum. It shifts slightly, and the key insight is that this shift depends only on the loss surface shape (\(\alpha\), \(\beta\)) and the sampling grid. It does not depend on compute budget. The sampling grid size becomes important here: wider grids amplify the mismatch between the true curve and its parabolic approximation, increasing the vertex shift.

Because the IsoFLOP parabola is fit in \(\log N\) space (as described in the Approach 2 procedure), the vertex shift directly biases \(N^*\). Since \(C = 6ND\), analyzing the bias in either \(N^*\) or \(D^*\) is sufficient; we focus on \(N^*\) here since that is where the parabolic fit typically operates.

Since the vertex shift is constant across all compute budgets, it biases every inferred \(N^*\) by the same multiplicative factor. When fitting \(\log N^*\) vs \(\log C\) to extract scaling exponents, this constant multiplicative bias appears as a constant additive offset in log-space: the slope (the exponent) is preserved, while the intercept absorbs the entire error.
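Writing \(a_0\) for the \(N^*\) intercept makes this immediate: if every inferred optimum satisfies \(\log_{10} \hat{N}^* = \log_{10} N^* + \delta w\) with \(\delta w\) independent of \(C\), then

\[ \log_{10} \hat{N}^* = a \log_{10} C + (a_0 + \delta w), \]

so the fitted slope \(a\) is unchanged and only the intercept shifts by \(\delta w\).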

Exact derivation: The intercept error can be derived analytically in closed form. The parabola vertex shifts by \(\delta w\) (in log-space), giving an intercept error of:

\[ \text{Intercept error} = 10^{\delta w} - 1 \]

where \(\delta w = f(\alpha, \beta, W, n)\) depends only on the surface exponents and the sampling grid (width \(W\) in log-space, number of points \(n\) per IsoFLOP curve), not on \(C\), \(E\), \(A\), or \(B\). Here a grid of width \(W\) spans \(10^{-W/2}\times\) to \(10^{W/2}\times\) the optimal \(N^*\), so \(W = 2.41\) (the XL grid) means sampling from \(\frac{1}{16}\times\) to \(16\times\) the optimum, and \(n = 10\) means 10 model sizes per compute budget. The key properties are that the shift is independent of compute budget, grows with grid width, and vanishes when \(\alpha = \beta\).

For example, with the Chinchilla parameters (\(\alpha = 0.34\), \(\beta = 0.28\)): the XS grid (\(W = 0.60\)) yields 0.3% intercept error, while the XL grid (\(W = 2.41\)) yields 4.1% error.

The full derivation provides the closed-form expression for vertex shift \(\delta w\) as a function of \(\alpha\), \(\beta\), \(W\), and \(n\). It also shows how this shift translates directly into intercept error, independent of compute budget.

Intuition via Taylor expansion: A parabola is a 2nd-order polynomial, which is equivalent to a 2nd-order Taylor expansion around the optimum. The approximation \(L(w) \approx L(0) + \frac{1}{2}L''(0)w^2\) is only valid when higher-order terms are negligible, i.e., when samples are close to the true minimum. As sampling range increases, 3rd and 4th order terms grow. For symmetric surfaces (\(\alpha = \beta\)), odd-order terms cancel by symmetry, preserving the vertex location. For asymmetric surfaces, they don't cancel, shifting the fitted vertex away from the true optimum.
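To see the shift numerically, note that along an IsoFLOP contour, writing \(w = \log_{10}(N/N^*)\) and using the optimality condition \(\alpha A (N^*)^{-\alpha} = \beta B (D^*)^{-\beta}\), the loss reduces (up to additive and multiplicative constants that do not move the vertex) to \((\beta/\alpha)\,10^{-\alpha w} + 10^{\beta w}\). The small sketch below is our own, not the article's closed-form derivation, but it fits a parabola to this exact curve and reads off the shift:

import numpy as np

def vertex_shift(alpha, beta, W=2.41, n=15):
    """Log10-space shift of the fitted parabola vertex away from the true optimum."""
    w = np.linspace(-W / 2, W / 2, n)                   # grid centered on the true optimum
    # IsoFLOP loss in w = log10(N / N*), up to constants that do not move the vertex
    L = (beta / alpha) * 10**(-alpha * w) + 10**(beta * w)
    p, q, _ = np.polyfit(w, L, deg=2)
    return -q / (2 * p)

dw = vertex_shift(0.34, 0.28)        # Chinchilla-like exponents, XL grid
print(dw, 10**dw - 1)                # nonzero shift -> multiplicative error of a few percent
print(vertex_shift(0.31, 0.31))      # equal exponents -> shift is ~0 (odd terms cancel)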

Why It Matters

Extrapolation to higher compute budgets requires both exponents and intercepts to be correct. The previous section established that asymmetric loss surfaces produce provably biased intercepts even under ideal experimental conditions. Here we quantify what those errors mean in practical terms by examining compute-optimal token prediction: given a compute budget, how many tokens does the inferred scaling law predict?

Up to this point, all analysis has assumed a single fixed sampling grid width. We now examine how token prediction error varies with both compute budget and sampling grid width. For surfaces with asymmetric exponents, wider sampling grids amplify the parabola-fitting mismatch, increasing the constant vertex shift and thus the intercept bias. To make this comparison concrete, we first define what "wider" and "narrower" mean in quantitative terms.

A sampling grid of "±k×" means the sampled values (whether model sizes or token counts) range from \(\frac{1}{k}\times\) to \(k\times\) the true optimum at each compute budget. The total range covered is \(k^2\) (the ratio of largest to smallest), and the log₁₀ of that ratio tells you how many factors of 10, or "decades," the grid spans end-to-end (e.g. a value of 1.81 means the largest sample is \(10^{1.81} \approx 64\times\) the smallest). The table below shows the four grid widths used in this analysis:

Grid Name ±kx Sampling Range Total Ratio Decade Span (factors of 10)
Extra Small (XS) ±2x 1/2x to 2x 4x 0.60
Small (S) ±4x 1/4x to 4x 16x 1.20
Large (L) ±8x 1/8x to 8x 64x 1.81
Extra Large (XL) ±16x 1/16x to 16x 256x 2.41

In practice, scaling law experiments typically sample across 1 to 2 decades in token count, placing the Small and Large grids squarely within the realistic range. The Extra Small and Extra Large grids bracket this range on either side, illustrating how the biases shrink or grow as the sampling window narrows or widens. The Extra Large grid (±16x, ~2.4 decades) is the default used in all single-grid analyses in the preceding sections.

Bar chart showing token prediction error by surface and grid width
Figure 3: Relative error in compute-optimal token prediction when extrapolating from the training range (10¹⁷–10²¹ FLOPs) to 10²⁴ FLOPs. Negative values indicate underestimation: the inferred scaling law predicts fewer tokens than optimal. Bars are grouped by sampling grid width. Annotations for the Chinchilla surface show \(D^*\) (true compute-optimal token count) versus \(\hat{D}^*\) (the Approach 2 estimate); the Small and Large grid annotations are emphasized (thicker borders) as they fall within the realistic 1–2 decade range typical of scaling law experiments, while Extra Small and Extra Large bracket either side as more extreme configurations.
Raw data:
Surface α β Grid True D* Inferred D* Abs Error Rel Error
Symmetric Surface (α = β)
Symmetric 0.31 0.31 XS (±2×) 408.2B 408.2B ≈0 ≈0%
Symmetric 0.31 0.31 S (±4×) 408.2B 408.2B ≈0 ≈0%
Symmetric 0.31 0.31 L (±8×) 408.2B 408.2B ≈0 ≈0%
Symmetric 0.31 0.31 XL (±16×) 408.2B 408.2B ≈0 ≈0%
Chinchilla Surface (α ≠ β)
Chinchilla 0.34 0.28 XS (±2×) 4.04T 4.02T −13.2B −0.33%
Chinchilla 0.34 0.28 S (±4×) 4.04T 3.98T −52.5B −1.30%
Chinchilla 0.34 0.28 L (±8×) 4.04T 3.92T −117.2B −2.90%
Chinchilla 0.34 0.28 XL (±16×) 4.04T 3.83T −205.8B −5.10%
Asymmetric Surface (α/β = 3)
Asymmetric 0.465 0.155 XS (±2×) 45.1Q 44.3Q −755.4T −1.67%
Asymmetric 0.465 0.155 S (±4×) 45.1Q 42.2Q −2.9Q −6.50%
Asymmetric 0.465 0.155 L (±8×) 45.1Q 38.8Q −6.3Q −13.91%
Asymmetric 0.465 0.155 XL (±16×) 45.1Q 34.7Q −10.4Q −23.12%

B = billion, T = trillion, Q = quadrillion. Training range: 10¹⁷–10²¹ FLOPs. Evaluation budget: 10²⁴ FLOPs.

The key observations from this figure: errors are essentially zero on the symmetric surface at every grid width, they grow with both grid width and the degree of exponent asymmetry, and the inferred token counts are always underestimates.

✓ Key Result

Consider the Chinchilla surface with the Large grid (±8x), a practical sampling range for real experiments. When extrapolating to 10²⁴ FLOPs, the true optimal token count is 4.04 trillion, but Approach 2 predicts only 3.92 trillion: a 2.9% underestimate, or roughly 117 billion fewer tokens than optimal. While 2.9% may seem modest, recall that this simulation uses unrealistically ideal conditions: perfectly centered sampling grids at every compute budget and zero measurement noise. Real experiments, where the true optimum is unknown, data is noisy, and the scaling exponent imbalance may be larger than Chinchilla's modest \(\alpha/\beta \approx 1.2\), can only do worse.

Off-Center Sampling: Exponent and Extrapolation Errors

The previous sections assumed perfectly centered sampling. At every compute budget, the IsoFLOP grid was placed exactly at the true optimum. In practice, you don't know \(N^*\) before running the experiment. Sampling centers are guesses, informed by prior estimates or heuristics, and they will likely be wrong by some amount.

This is a distinct source of error from the asymmetry bias examined earlier. Asymmetry errors arise from the shape of the loss surface (\(\alpha \neq \beta\)); off-center errors arise from where you place the sampling grid. To isolate this new effect, we return to the symmetric surface (\(\alpha = \beta = 0.31\)) where asymmetry bias is zero by construction.

Constant Multiplicative Bias

The simplest form of off-center sampling is a constant multiplicative offset: every compute budget's sampling center is shifted by the same factor from the true optimum. A "3× offset" means each IsoFLOP grid is centered at \(3 \times D^*\) instead of \(D^*\), so the grid midpoint consistently sits at three times the true optimal token count.

Because this offset is the same at every compute budget, it has a familiar geometric effect where each parabola vertex shifts by a constant amount in log-space. This is the same mechanism as asymmetry bias. The slope of \(\log D^*\) vs \(\log C\) is unaffected (a constant additive shift in log-space doesn't change the slope), so the scaling exponent is preserved perfectly. The intercept, however, absorbs the entire error.

Off-center sampling with constant multiplicative bias showing zero exponent error but systematic intercept error
Figure 4: Effect of a constant 3× offset in sampling centers on the symmetric surface. Top left: IsoFLOP curves at the Large grid (±8×), with black diamonds marking the (off-center) sampling center, red × the true \(D^*\), and blue + the inferred \(D^*\). Top right: extrapolation error in compute-optimal token prediction at 10²⁴ FLOPs for each grid width, using the same XS through XL grids defined earlier. Bottom row: exponent and intercept errors across grid widths from XS (±2×) to XL (±16×), plotted on the same y-axis scale. The exponent is recovered perfectly (flat at zero) while the intercept shows systematic bias that varies with grid width.

The extrapolation bar chart (top right) shows what this means for token prediction. All four grid widths overestimate \(D^*\), with the narrowest grid (XS) producing the largest error. This is the reverse of the asymmetry bias pattern, where wider grids amplified error. Here, narrower grids are more sensitive to off-center placement because fewer samples lie near the true optimum.

The intercept error panel (bottom right) confirms the pattern across the full continuum of grid widths. The error is always positive (the inferred \(D^*\) overshoots) and decreases monotonically as the grid widens, reflecting how a wider sampling range brings more of the true loss curve's shape into the fit, partially compensating for the misplaced center.
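The same construction (again our own sketch, not the article's code) shows the pattern numerically: centering the grid \(\log_{10} 3\) above the true optimum on the symmetric surface produces a positive vertex shift that shrinks as the grid widens.

import numpy as np

def vertex_with_offset(alpha, beta, offset, W, n=15):
    """Fitted parabola vertex in log10 units, measured from the true optimum."""
    w = offset + np.linspace(-W / 2, W / 2, n)          # grid centered off the optimum
    L = (beta / alpha) * 10**(-alpha * w) + 10**(beta * w)
    p, q, _ = np.polyfit(w, L, deg=2)
    return -q / (2 * p)

for W in (0.60, 1.20, 1.81, 2.41):                      # XS, S, L, XL grid widths
    print(W, vertex_with_offset(0.31, 0.31, np.log10(3), W))
# positive shift at every width (optimum overestimated), largest for the narrowest grid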

✓ Key Result

Consider the symmetric surface with the Large grid (±8×) and a 3× offset, where every IsoFLOP grid is centered at three times the true optimal token count. When extrapolating to 10²⁴ FLOPs, the true optimal token count is 408.2 billion, but Approach 2 predicts 419.0 billion: a 2.6% overestimate, roughly 10.8 billion more tokens than optimal. Compare this with the Chinchilla asymmetry result at the same grid width: a 2.9% underestimate. The magnitudes are comparable, but the sources are entirely different. Asymmetry bias comes from the shape of the loss surface; off-center bias comes from where you place the grid. In a real experiment, both act simultaneously.

Drifting Bias

When the offset varies with compute budget, a qualitatively different failure mode emerges. To illustrate this, we apply a linear drift. The sampling center starts at the true optimum for the lowest budget and drifts to 3× the true optimum at the highest budget, interpolating linearly in log-compute space.

Because the offset now differs across compute budgets, it no longer cancels in the slope of \(\log D^*\) vs \(\log C\). Both the exponent and the intercept are affected.
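Concretely, if the vertex shift induced by the off-center grid at budget \(C\) is \(s(C)\) (in \(\log_{10}\) units), the fitted relation becomes \(\log_{10} \hat{D}^* = b \log_{10} C + b_0 + s(C)\): any component of \(s(C)\) that varies linearly with \(\log_{10} C\) adds directly to the fitted slope, while only a constant component stays confined to the intercept.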

Off-center sampling with drifting bias showing both exponent and intercept errors
Figure 5: Effect of a linear drift in sampling centers (centered at true optimum for lowest budget, drifting to 3× at highest budget) on the symmetric surface. Unlike the constant bias case, the exponent error (bottom left) is now non-zero: the slope of \(\log D^*\) vs \(\log C\) is distorted because the offset varies across compute budgets.

Compare the bottom-left panels of Figures 4 and 5: constant bias produces a flat line at zero (exponent preserved), while drifting bias produces a non-zero exponent error that varies with grid width.

✓ Key Message

Constant bias preserves exponents; any compute-dependent bias pattern distorts them. The distinction matters because exponent errors compound during extrapolation, while intercept errors remain fixed.

IsoFLOP Curves in the Wild: Evidence from Published Studies

The previous sections used synthetic, noise-free simulations to isolate Approach 2's biases under controlled conditions. A natural question is whether the conditions that trigger these biases, asymmetric loss surfaces and imperfectly centered sampling, actually arise in practice. To get a sense of this, we can look at IsoFLOP curves published in three of the most prominent scaling law studies[1],[2],[3].

IsoFLOP curves from Chinchilla, Llama 3, and DeepSeek scaling law papers
Figure 6: IsoFLOP curves from three published scaling law studies. Left: Chinchilla (training loss vs parameters). Center: Llama 3 (validation loss vs training tokens). Right: DeepSeek (bits-per-byte vs FLOPs/token). Each panel shows curves at multiple compute budgets, fit using Approach 2.

Several features relevant to the biases studied in this article are visible across all three panels: the sampling grids are not perfectly centered on the curve minima, the degree of off-centering varies across compute budgets, and the fitted parabola vertices do not always coincide with the apparent minima of the sampled points.

To be clear, this is not a criticism of these studies. These are among the most careful and influential scaling law analyses published. The point is a more general one: the conditions under which Approach 2's biases activate, asymmetric surfaces and imperfect sampling centers, appear to be the norm rather than the exception. The idealized conditions of the Happy Path (symmetric surface, perfectly centered grids) are the special case.

Compounding Errors

Given evidence that both surface asymmetry and off-center sampling are present in real studies, we can simulate what happens when these biases act simultaneously. Using the same three loss surfaces from earlier sections, we combine them with the 3× drift and 3× constant offset from the off-center analysis. We fit Approach 2 on compute budgets from 10¹⁷ to 10²¹ FLOPs and extrapolate \(D^*\) predictions to 10²⁴ FLOPs across all four grid widths.

Compounding errors: D* prediction error under drift and constant offset sampling biases across three surfaces and four grid widths
Figure 7: Relative error in \(D^*\) at 10²⁴ FLOPs with off-center sampling on all three loss surfaces. Left: constant 3× center offset at every budget. Right: linear drift to 3× at the highest compute budget. Bars are grouped by sampling grid width (XS through XL). Negative values indicate underestimation; positive values indicate overestimation. On the symmetric surface, the constant offset results correspond to Figure 4 and the drift results correspond to Figure 5; the asymmetric surfaces reveal how these sampling biases interact with the inherent asymmetry bias.
Raw data:
Config Surface Grid True D* Inferred D* Rel Error
Offset 3× (sampling center at 3× true optimum at every budget)
Offset 3× Symmetric XS (±2×) 408.2B 424.5B +3.97%
Offset 3× Symmetric S (±4×) 408.2B 422.4B +3.47%
Offset 3× Symmetric L (±8×) 408.2B 419.0B +2.65%
Offset 3× Symmetric XL (±16×) 408.2B 414.4B +1.51%
Offset 3× Chinchilla XS (±2×) 4.04T 4.32T +7.11%
Offset 3× Chinchilla S (±4×) 4.04T 4.27T +5.69%
Offset 3× Chinchilla L (±8×) 4.04T 4.17T +3.38%
Offset 3× Chinchilla XL (±16×) 4.04T 4.05T +0.24%
Offset 3× Asymmetric XS (±2×) 45.1Q 53.8Q +19.22%
Offset 3× Asymmetric S (±4×) 45.1Q 51.6Q +14.41%
Offset 3× Asymmetric L (±8×) 45.1Q 48.2Q +6.96%
Offset 3× Asymmetric XL (±16×) 45.1Q 44.0Q −2.42%
Drift to 3× (sampling center drifts from true optimum to 3× at highest budget)
Drift to 3× Symmetric XS (±2×) 408.2B 433.0B +6.07%
Drift to 3× Symmetric S (±4×) 408.2B 429.4B +5.17%
Drift to 3× Symmetric L (±8×) 408.2B 423.3B +3.70%
Drift to 3× Symmetric XL (±16×) 408.2B 415.1B +1.69%
Drift to 3× Chinchilla XS (±2×) 4.04T 4.50T +11.61%
Drift to 3× Chinchilla S (±4×) 4.04T 4.43T +9.83%
Drift to 3× Chinchilla L (±8×) 4.04T 4.32T +6.94%
Drift to 3× Chinchilla XL (±16×) 4.04T 4.16T +3.05%
Drift to 3× Asymmetric XS (±2×) 45.1Q 60.7Q +34.57%
Drift to 3× Asymmetric S (±4×) 45.1Q 58.7Q +30.04%
Drift to 3× Asymmetric L (±8×) 45.1Q 55.5Q +22.97%
Drift to 3× Asymmetric XL (±16×) 45.1Q 51.4Q +14.00%

B = billion, T = trillion, Q = quadrillion. Training range: 10¹⁷–10²¹ FLOPs. Evaluation budget: 10²⁴ FLOPs.

Comparing with the baseline in Figure 3, where asymmetry bias alone produces errors up to −5% on Chinchilla and −23% on the Asymmetric surface, the two bias sources interact in opposite directions. Off-center sampling pushes errors positive (overestimating \(D^*\)), while asymmetry bias pushes errors negative (underestimating). The net error depends on which source dominates. With narrow grids, asymmetry bias is negligible and the sampling bias determines the error: drift to 3× produces +6% on the symmetric surface and +12% on Chinchilla. With wider grids, asymmetry bias grows and begins to offset the sampling bias. On Chinchilla with a constant 3× offset, this cancellation is nearly perfect with the XL grid (+0.24%), but this is only coincidental.

On the Asymmetric surface, the drift configuration produces the largest errors in the figure: +35% with the XS grid and still +14% with XL. Even the constant offset configuration reaches +19% with XS before the asymmetry bias partially offsets it with wider grids.

These 3× perturbations are representative of realistic conditions. The IsoFLOP curves they produce on the symmetric surface (top-left panels of Figures 4 and 5) show sampling centers that are visibly displaced from the curve minima, with the displacement either uniform across budgets (constant offset) or growing toward higher budgets (drift). Both patterns are qualitatively similar to what is observed in the published studies shown in Figure 6, where sampling grids are not perfectly centered and the degree of off-centering varies across compute budgets. A 3× factor means the sampling center sits at three times the true optimal token count, which is likely within the range of uncertainty practitioners face when choosing sampling centers before the optimum is known.

Figure A2 provides a more detailed view: it shows how \(D^*\) extrapolation errors evolve across compute budgets from 10²² to 10²⁵ FLOPs, revealing which bias sources produce errors that grow with extrapolation distance (drift) versus those that remain roughly constant (surface asymmetry and constant offsets), and how these patterns vary across multiple drift rates and center offset magnitudes.

✓ Key Result

Multiple bias sources act simultaneously in any real experiment. Surface asymmetry and off-center sampling each produce meaningful errors on their own. When they happen to act in the same direction, the combined error exceeds either one alone: on the Asymmetric surface with drift to 3×, errors reach 35% even when using the narrowest grid, where the parabolic approximation is most accurate. When they oppose, partial cancellation can occur, but this depends on the specific combination of surface geometry, offset magnitude, and grid width, making it unreliable in practice.

Robust Fits: Unbiased Estimation with Linear Separation

The previous sections showed that Approach 2's parabolic approximation introduces systematic biases in intercepts (from asymmetry) and potentially exponents (from off-center sampling), and that the conditions driving these biases are visible in published scaling law studies. The natural alternative is Approach 3, which fits all five surface parameters \((E, A, B, \alpha, \beta)\) simultaneously via nonlinear least squares. This avoids the parabolic approximation entirely but brings its own set of problems.

Problems with Direct Surface Fitting

A recent survey of over 50 scaling law papers[13] documents the landscape of fitting practices and their failure modes. The problems described below apply to scaling law fitting in general, not just Chinchilla forms, but they are directly relevant because Approach 3 involves the same kind of nonlinear optimization. Over half of the papers surveyed do not fully specify their fitting procedure (optimizer, loss function, or initialization), which compounds reproducibility challenges.

The most common optimizers for scaling law fits are BFGS and L-BFGS. Some studies use SGD-family optimizers like Adam and Adagrad, though these are noted as sometimes poorly suited for curve fitting due to limited data efficiency. At least one study[14] forgoes optimization entirely in favor of pure grid search because fitted solutions are too unstable.

In practice, this instability takes several forms. Results are sensitive to initialization: different starting points for the optimizer can lead to substantially different fitted parameters. Results are also sensitive to optimizer hyperparameters such as convergence tolerance and gradient estimation method. And the optimizer frequently converges to local minima rather than the global optimum.

Initialization is the most studied source of variability. Common mitigations include grid search over thousands of starting points (running the optimizer from each and keeping the best fit), random sampling of starting points, evaluating a coarse grid without optimization and seeding the optimizer from the single best candidate, or initializing from previously published parameter values. None of these reliably solve the problem. The survey's own experiments show that full-grid optimization over 4500 starting points can yield results that diverge significantly from reported figures, evidence of "the difficulty of optimizing over this space, and the presence of many local minima."

A simpler alternative is to log-linearize the power law and fit with linear regression. However, the log transformation changes the error distribution and exaggerates errors at small loss values, biasing parameter estimates. This bias is easily observed in simulations like ours. The survey also finds that the choice of loss function (whether Log-Huber, Huber, MSE, or MAE) affects fitted parameters unpredictably across datasets, and non-MSE objectives can introduce systematic bias in parameter estimates. Our goal is to identify a fitting method that is simple, stable, and efficient rather than to address outliers or other statistical concerns, so we use MSE for all fits in this article.

The survey's experimental analysis varies optimizer, loss function, and initialization strategy across three datasets. The overarching finding is that none of these choices reliably eliminates instability, and results shift unpredictably between datasets. A key contributor is the high dimensionality of the joint five-parameter optimization, which creates a complex loss landscape with many local minima and interacting sensitivities. Reducing the dimensionality of the nonlinear search is one way to make the problem more tractable.

Variable Projection (VPNLS)

The Chinchilla loss surface has a partially linear structure that can be exploited. For any fixed values of \(\alpha\) and \(\beta\), the remaining parameters \((E, A, B)\) enter the model linearly and can be solved exactly via least squares. This is the same computational shortcut that motivates Approach 2 (optimizing exponential terms separately from linear terms), but applied here without the parabolic approximation.

The algorithm searches over \((\alpha, \beta)\) and, at each candidate pair, solves for \((E, A, B)\) via non-negative least squares (NNLS). A coarse 32×32 grid search identifies a good starting region, and a Nelder-Mead simplex optimizer refines it. The linear separation is maintained throughout. The optimizer only ever navigates the two-dimensional \((\alpha, \beta)\) surface, never the full five-parameter space. We term this method Variable Projection with Non-negative Least Squares (VPNLS).

import numpy as np
from scipy.optimize import minimize, nnls

def vpnls(N, D, L, lo=0.05, hi=1.0):
    """Fit L(N, D) = E + A/N**alpha + B/D**beta by variable projection."""

    def objective(theta):
        alpha, beta = theta
        X = np.column_stack([np.ones_like(N), N**-alpha, D**-beta])  # design matrix, one row per observation
        _, rnorm = nnls(X, L)                  # linear solve with E, A, B >= 0
        return rnorm**2                        # ||L - X.[E, A, B]||^2

    # coarse 32x32 grid search (the lo-hi exponent window is an adjustable assumption)
    grid = np.linspace(lo, hi, 32)
    alpha0, beta0 = min(((a, b) for a in grid for b in grid), key=objective)

    # refine in 2D only
    res = minimize(objective, x0=(alpha0, beta0), method="Nelder-Mead")
    alpha, beta = res.x

    # recover linear params at the solution
    X = np.column_stack([np.ones_like(N), N**-alpha, D**-beta])
    (E, A, B), _ = nnls(X, L)
    return E, A, B, alpha, beta

The choice of Nelder-Mead over L-BFGS-B is deliberate. VPNLS uses NNLS for the inner solve to guarantee that \(E\), \(A\), and \(B\) remain non-negative, preventing physically meaningless fits. However, NNLS has no closed-form gradient with respect to the outer parameters \((\alpha, \beta)\). Switching to ordinary least squares would restore differentiability but cannot enforce non-negativity. With NNLS, L-BFGS-B must rely on finite-difference gradients, which creates a set of interacting tuning parameters (eps, jac, ftol, gtol, maxcor, maxls) where tight tolerances demand gradient accuracy that finite differences cannot reliably provide.

Nelder-Mead avoids this entirely. Its few settings (xatol, fatol) are independent and work well out of the box. Nelder-Mead scales poorly to high dimensions, but variable projection reduces the search to just two dimensions, which is exactly the regime where simplex methods excel.
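As a quick usage sketch (the sample sizes and surface constants below are illustrative assumptions, not the article's experimental configuration), fitting noise-free samples from a Chinchilla-like surface with the vpnls function above recovers the parameters to high precision:

import numpy as np

rng = np.random.default_rng(0)
N = 10**rng.uniform(7, 10, size=100)                    # model sizes (illustrative range)
D = 10**rng.uniform(9, 12, size=100)                    # token counts (illustrative range)
L = 1.69 + 400 / N**0.34 + 400 / D**0.28                # noise-free Chinchilla-like surface

print(vpnls(N, D, L))                                   # ~ (1.69, 400, 400, 0.34, 0.28)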

Method Comparison

To validate this choice, we compare nine method configurations on noise-free synthetic data across three loss surfaces (symmetric, Chinchilla, and high imbalance) and 20 sampling ranges. This is the best case for gradient-based methods since the data contains no noise that could obscure gradient information.

The configurations fall into two groups. The first uses 5D direct optimization (Approach 3), fitting all five parameters jointly with L-BFGS-B using either analytical gradients, forward finite differences, or central finite differences. The second uses 2D variable projection over \((\alpha, \beta)\) only, comparing VPNLS (Nelder-Mead), L-BFGS-B with four finite-difference configurations (default \(\varepsilon\), central differences, \(\varepsilon = 10^{-6}\), and \(\varepsilon = 10^{-10}\)), and a fine 256² grid search with no local refinement.

Method comparison showing geometric mean error and max error across nine optimizer configurations
Figure 8: Comparison of nine fitting methods on noise-free synthetic data across three loss surfaces and 20 sampling ranges (60 fits total per method). Left: geometric mean of |relative error| (%) pooled across all surfaces, grid widths, and parameters, with horizontal bars spanning the min-to-max range. Filled dots indicate convergence on all 60 fits; open dots indicate at least one failure (count annotated). Right: maximum |relative error| (%) per parameter over successful fits, on a log-scale colormap. Methods are sorted by geometric mean error, with the worst at top.

In the left panel, each dot shows the typical (geometric mean) parameter recovery error for one method, and the horizontal bar shows the range from best to worst case across 60 scenarios. The right panel breaks this down by parameter, showing the worst-case error for each.

Consider the best Approach 3 configuration (5D L-BFGS-B with analytical gradients). Even with exact gradients on noise-free data, the worst-case errors reach about 5% for the scaling exponent \(\alpha\) and about 2% for the irreducible loss \(E\). While a few percent may appear modest, the preceding sections show that errors of this magnitude in scaling parameters translate into meaningful distortions when extrapolating compute-optimal predictions to higher budgets. VPNLS recovers all five parameters with errors on the order of 10⁻⁸%, effectively eliminating parameter estimation as a source of extrapolation error. Figure A1 breaks this down by surface and sampling range, also revealing that Approach 3's errors can vary systematically with sampling range on certain surfaces.

Looking at the full set of methods, a clear hierarchy emerges. High-resolution grid search (256²) is stable across all conditions but provides the poorest overall precision among 2D methods, limited by grid resolution.

5D direct optimization (Approach 3) is more accurate on average than grid search but highly variable across conditions. The 5D configurations that rely on finite-difference gradients rather than analytical gradients perform particularly poorly and serve as a useful negative control. They demonstrate what high variability and instability look like, and Approach 3 with analytical gradients exhibits a similar pattern at somewhat lower magnitude. The full per-parameter breakdown (Figure A1) shows these instability patterns in detail.

L-BFGS-B with 2D variable projection can match VPNLS precision, but the optimizer fails to converge in a non-trivial fraction of scenarios even in this relatively small test suite. The choice of finite-difference scheme matters considerably. By default, scipy's L-BFGS-B approximates gradients with forward differences: each partial derivative is estimated as \((f(x + h) - f(x)) / h\). Passing jac='3-point' to scipy.optimize.minimize switches to 3-point central differences, where each partial is estimated as \((f(x + h) - f(x - h)) / (2h)\). The central formula is generally more accurate for smooth objectives because it samples symmetrically around the point of interest. In our tests, this closes the precision gap with Nelder-Mead (from roughly 10⁻⁵% to 10⁻⁸% error), but introduces sporadic line search failures. Notably, these failures can be false positives. The optimizer has already reached the true minimum, with residual sum of squares near machine zero, but the line search cannot verify further progress because function values are too small to distinguish. In scipy, this surfaces as result.success = False with an ABNORMAL status from scipy.optimize.minimize, even though the returned parameters are correct.
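For illustration, a sketch of this configuration (our own; the data, bounds, and starting point are assumptions, and the objective mirrors the 2D variable-projection objective above):

import numpy as np
from scipy.optimize import minimize, nnls

rng = np.random.default_rng(0)
N = 10**rng.uniform(7, 10, size=100)
D = 10**rng.uniform(9, 12, size=100)
L = 1.69 + 400 / N**0.34 + 400 / D**0.28                # illustrative noise-free surface

def objective(theta):                                    # 2D variable-projection objective
    a, b = theta
    X = np.column_stack([np.ones_like(N), N**-a, D**-b])
    return nnls(X, L)[1]**2

res = minimize(objective, x0=(0.3, 0.3), method="L-BFGS-B",
               jac="3-point",                            # central differences instead of forward
               bounds=[(0.05, 1.0), (0.05, 1.0)])
# res.success can be False (an "ABNORMAL" line-search message) even when res.x already
# sits at the true minimum, because near-zero residuals defeat the progress check.
print(res.x, res.fun, res.success)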

L-BFGS-B remains a viable alternative to Nelder-Mead for practitioners willing to tune settings carefully and who understand that certain convergence errors from scipy are not necessarily problematic. That said, VPNLS with Nelder-Mead is simpler, requires less tuning, and recovers parameter estimates with precision at least as high as any other method tested. It technically achieves the most precise estimates, though the margin over a well-configured L-BFGS-B with 3-point central differences is small.

Method comparison data:
Method Failures Max E err% Max A err% Max B err% Max α err% Max β err%
2D Nelder-Mead (VPNLS) 0/60 5.2×10⁻⁸ 6.3×10⁻⁸ 7.9×10⁻⁸ 1.2×10⁻⁸ 2.0×10⁻⁸
2D L-BFGS-B (central diff) 1/60 8.3×10⁻⁸ 5.3×10⁻⁸ 6.4×10⁻⁸ 1.2×10⁻⁸ 2.0×10⁻⁸
2D L-BFGS-B (default ε) 0/60 1.6×10⁻⁵ 1.0×10⁻⁵ 1.3×10⁻⁵ 2.1×10⁻⁶ 3.9×10⁻⁶
2D L-BFGS-B (ε=10⁻¹⁰) 3/60 1.6×10⁻⁷ 8.9×10⁻⁷ 8.6×10⁻⁷ 1.8×10⁻⁷ 1.7×10⁻⁷
2D L-BFGS-B (ε=10⁻⁶) 20/60 1.2×10⁻³ 1.1×10⁻³ 1.3×10⁻³ 2.2×10⁻⁴ 3.8×10⁻⁴
2D Grid (256²) 0/60 2.58 2.03 2.03 0.44 0.57
5D L-BFGS-B (analytical) 0/60 2.23 29.8 6.14 5.03 1.33
5D L-BFGS-B (central diff) 1/60 103 2,334 343 78.9 28.8
5D L-BFGS-B (finite diff) 2/60 113 2,334 832 80.6 44.0

Maximum |relative error| (%) across 60 fits (3 surfaces × 20 sampling ranges), computed over successful (converged) fits only. Failure counts show convergence failures out of 60 total fits.

✓ Key Result

VPNLS eliminates the biases inherent in the parabolic approximation and avoids the fragile gradient tuning that complicates L-BFGS-B. All five loss surface parameters \((E, A, B, \alpha, \beta)\) are recovered with machine precision, and extrapolation to higher compute budgets is exact.

Conclusion

The biases documented in this article are structural, not statistical. They exist on noise-free data with perfect experimental conditions. Real experiments, which contend with measurement noise and unknown optima, can only make them worse.

Two independent sources of error compound in practice. Surface asymmetry (\(\alpha \neq \beta\)) biases intercepts, and off-center sampling biases intercepts or exponents depending on whether the offset is constant or varies with compute budget. Both act simultaneously in any real experiment, and IsoFLOP curves from published scaling law studies exhibit exactly the conditions that trigger them: parabola vertices that clearly do not coincide with the true loss minimum, with the degree of misalignment varying across compute budgets. At practical grid widths with Chinchilla-like asymmetry, token count errors of 5% or more are typical; on more asymmetric surfaces, the errors reach 20% or more.

A practical alternative exists. VPNLS (Variable Projection with Non-negative Least Squares) recovers all five surface parameters with machine precision, uses the same intuitive linear separation that makes Approach 2 appealing, and is straightforward to implement.

Because VPNLS recovers the full loss surface rather than just scaling exponents, it may also provide a more precise foundation for the analytical extensions to the Chinchilla model discussed in the introduction. These extensions build on the same functional form and in most cases retain the partially linear structure that variable projection exploits, making them a natural direction for future work.

Practitioners using Approach 2 should be aware that intercept estimates carry a systematic bias that grows with exponent asymmetry and sampling grid width. When precision matters for extrapolation to large compute budgets, VPNLS offers one robust alternative, though the underlying principle is more general. Any method that exploits the linear separability of the Chinchilla loss surface can avoid the parabolic approximation while retaining much of Approach 2's simplicity.

Limitations

Several limitations scope the conclusions of this study. We highlight the most important ones here.

Appendix

A. Detailed Method Comparison

Detailed method comparison showing per-parameter error across surfaces and sampling ranges
Figure A1: Per-parameter recovery error for nine fitting methods across three loss surfaces and 20 sampling ranges (baseline, no bias). Each panel shows absolute relative error (%) on a log scale versus sampling range, with one curve per method. Rows correspond to loss surfaces (symmetric, Chinchilla, high imbalance); columns correspond to parameters (E, A, B, α, β). Gaps indicate convergence failures.

B. Combined Extrapolation Error by Compute Budget

Extrapolation error in D* across compute budgets for all surfaces, sampling ranges, and bias configurations
Figure A2: Relative error in compute-optimal token count \(D^*\) when extrapolating from the fitting range (10¹⁷–10²¹ FLOPs) to higher compute budgets (10²²–10²⁵ FLOPs), with asymmetry and sampling biases acting simultaneously. Columns correspond to loss surfaces (symmetric, Chinchilla, Asymmetric); rows correspond to sampling ranges (narrow ±2×, medium ±16×, wide ±100×). The wide row uses an extreme grid width not employed in the main text, included here to further illustrate how far results can deviate with a misconfigured experiment. Each curve represents a different sampling bias configuration: baseline (no bias), two linear drift rates (drift_0.2 and drift_0.4, where the value is the log₁₀ offset at the highest compute budget), and two constant center offsets (scale_1.5 and scale_2.0, where the value is the multiplicative factor applied at every budget). On symmetric surfaces, errors are driven entirely by off-center sampling; on asymmetric surfaces, the inherent surface bias adds a constant offset visible as the non-zero baseline curve. Drift-based biases produce errors that grow with extrapolation distance (steeper curves), while constant offsets and surface asymmetry produce flat or slowly varying errors. With wider sampling ranges, surface asymmetry dominates and can either reinforce or partially offset the sampling biases.

References

  1. "Training Compute-Optimal Large Language Models," ArXiv. https://arxiv.org/abs/2203.15556
  2. "The Llama 3 Herd of Models," ArXiv. https://arxiv.org/abs/2407.21783
  3. "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism," ArXiv. https://arxiv.org/abs/2401.02954
  4. "Exploring Scaling Laws for EHR Foundation Models," ArXiv. https://arxiv.org/abs/2505.22964
  5. "Sequence modeling and design from molecular to genome scale with Evo," bioRxiv. https://www.biorxiv.org/content/10.1101/2024.02.27.582234v2
  6. "Scaling Laws for Imitation Learning in Single-Agent Games," TMLR. https://arxiv.org/abs/2307.09423
  7. "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design," NeurIPS. https://arxiv.org/abs/2305.13035
  8. "Scaling Laws of Motion Forecasting and Planning -- Technical Report," ArXiv. https://arxiv.org/abs/2506.08228
  9. "Training compute-optimal transformer encoder models," Other. https://aclanthology.org/2025.emnlp-main.1804.pdf
  10. "Scaling Laws For Diffusion Transformers," ArXiv. https://arxiv.org/abs/2410.08184
  11. "Scaling Behavior of Discrete Diffusion Language Models," ArXiv. https://arxiv.org/abs/2512.10858
  12. "Scaling Laws for Compute Optimal Biosignal Transformers," Other. https://dspacemainprd01.lib.uwaterloo.ca/server/api/core/bitstreams/b66b1078-b359-4688-8dac-45e78806eb3d/content
  13. "(Mis)fitting: A Survey of Scaling Laws," ICLR 2025. https://arxiv.org/abs/2502.18969
  14. "Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic," CVPR 2024. https://arxiv.org/abs/2404.07177
  15. "Evaluating the Robustness of Chinchilla Compute-Optimal Scaling," ArXiv. https://arxiv.org/abs/2509.23963
  16. "Scaling Laws for Neural Language Models," ArXiv. https://arxiv.org/abs/2001.08361
  17. "Scaling Data-Constrained Language Models," ArXiv. https://arxiv.org/abs/2305.16264
  18. "MuPT: A Generative Symbolic Music Pretrained Transformer," ArXiv. https://arxiv.org/abs/2404.06393
  19. "Scaling Laws for Precision," ArXiv. https://arxiv.org/abs/2411.04330
  20. "Reconciling Kaplan and Chinchilla Scaling Laws," TMLR. https://arxiv.org/abs/2406.12907
  21. "Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining," ArXiv. https://arxiv.org/abs/2510.03313
  22. "Scaling Laws for Optimal Data Mixtures," ArXiv. https://arxiv.org/abs/2507.09404
  23. "Scaling Laws are Redundancy Laws," ArXiv. https://arxiv.org/abs/2509.20721
  24. "Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models," ArXiv. https://arxiv.org/abs/2507.17702
  25. "Establishing Task Scaling Laws via Compute-Efficient Model Ladders," ArXiv. https://arxiv.org/abs/2412.04403
  26. "Predictable Scale: Part II, Farseer: A Refined Scaling Law in Large Language Models," ArXiv. https://arxiv.org/abs/2506.10972
  27. "Can Language Models Discover Scaling Laws?," ArXiv. https://arxiv.org/abs/2507.21184