Working paper · PushDev Notes

Scaling laws revisited: what 2025 taught us

Five empirical updates to the Chinchilla picture, with implications for how labs spend their next billion.

Yavari
PushDev Notes · March 11, 2026
Abstract

The Chinchilla result (Hoffmann et al., 2022) gave the field a clean rule for splitting a compute budget between parameters and tokens, and almost every frontier run since has been argued for or against it. Three years of follow-up work has revised the picture in five concrete ways: the original fit was shown to be buggy, inference cost reshapes the optimum toward smaller models trained far longer, the token side has a hard ceiling once unique data runs out, training precision is a hidden third axis, and the binding constraint has begun moving from pre-training to test time. This note maps those five updates onto primary sources, says where each is solid and where it is still soft, and draws out what they jointly imply for how a lab should spend its next large training budget.

For about two years, "Chinchilla-optimal" was the closest thing the field had to a law you could quote in a planning meeting. Given a fixed compute budget, how should you divide it between a bigger model and more tokens? The answer from Hoffmann et al. (2022), roughly twenty training tokens per parameter, was clean enough to act on and was directionally right. What three years of follow-up work has shown is that almost every part of that answer needed a footnote: the fit itself, the budget it optimizes, the supply of tokens, the precision of the parameters, and even whether pre-training is still the budget that matters most.

1. Introduction

The reason to revisit this now is not academic tidiness. The Chinchilla rule is a spending rule, and the bill has grown. A lab committing a nine-figure budget to a single run is implicitly betting on a particular point in a multi-dimensional trade-off, and the two dimensions the rule was originally fit on, parameters and tokens, are no longer the only ones that bind. Five things changed between the 2022 result and now. None of them overturns scaling; all of them move the optimum, sometimes by a lot.

2. Background

Scaling laws started as an empirical regularity: test loss falls as a smooth power law in model size, dataset size, and compute. Kaplan et al. (2020) drew the first influential version and concluded that, given more compute, most of it should go to a bigger model and comparatively little to more data. That prescription shaped the first generation of very large, relatively under-trained models.

Chinchilla corrected it. Hoffmann et al. (2022) fit the trade-off three ways and found that parameters and tokens should scale in roughly equal proportion, which put the compute-optimal ratio near twenty tokens per parameter and implied that the headline models of the day were badly over-sized for the data they had seen. The same compute, spent on a smaller model and more tokens, bought lower loss. That is the picture the next five updates revise.
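As a rough illustration of how the twenty-to-one rule turns a compute budget into a model size and a token count, the sketch below assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6ND). That approximation and the function name `compute_optimal_split` are assumptions brought in for illustration; neither is stated in this note.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed rule of thumb: C ~ 6 * N * D


def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Return (params, tokens) such that tokens = ratio * params and 6*N*D = C."""
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute and ratio must be positive")
    # Substituting D = ratio * N into C = 6 * N * D gives N = sqrt(C / (6 * ratio)).
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    n, d = compute_optimal_split(5.88e23)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions a budget of 5.88e23 FLOPs comes out at roughly 70 billion parameters and 1.4 trillion tokens, the scale usually quoted for Chinchilla itself. Changing `tokens_per_param` shows how quickly the implied model size moves once the ratio is treated as adjustable rather than fixed.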

3. Five updates to the picture

3.1 The original fit was quantitatively soft

The first revision is to Chinchilla itself. Besiroglu et al. (2024), at Epoch AI, attempted to replicate the third of Hoffmann et al.'s estimation procedures, the parametric fit to the loss surface, and could not. The reported confidence intervals were implausibly tight: intervals that narrow would imply hundreds of thousands of training runs, where the original work likely ran fewer than five hundred. The replication traced this to an optimizer that terminated early because of a loss-scale choice, plus parameter values that were rounded in the paper body in a way that biased the predictions. Their corrected fit lands in a different place and, tellingly, agrees with Hoffmann et al.'s other two methods.
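The run-count argument rests on a standard statistical scaling: for a well-behaved estimator, the width of a confidence interval shrinks roughly with the square root of the number of independent observations. The sketch below makes that arithmetic explicit. The function name `implied_runs` and the width ratio of 30 are hypothetical, chosen only to show the order of magnitude, and are not figures from Besiroglu et al.

```python
import math


def implied_runs(actual_runs, width_ratio):
    """Runs needed for intervals `width_ratio` times narrower than
    `actual_runs` would plausibly support, assuming width ~ 1/sqrt(n)."""
    if actual_runs <= 0 or width_ratio <= 0:
        raise ValueError("inputs must be positive")
    # Narrowing an interval by a factor k requires about k**2 times the data.
    return math.ceil(actual_runs * width_ratio ** 2)


if __name__ == "__main__":
    # Hypothetical: 500 runs, reported intervals 30x tighter than expected.
    print(implied_runs(500, 30))
```

Because the requirement grows with the square of the ratio, even a modest-looking gap in interval width separates a few hundred runs from several hundred thousand.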

The practical takeaway is modest but real: the precise exponents of the canonical fit were never as pinned-down as the tight error bars suggested. The twenty-to-one rule is a useful center of mass, not a constant of nature, and anyone treating its exact coefficients as load-bearing was building on softer ground than the original plots implied.

3.2 Inference rewrites the optimum

Chinchilla optimizes a training budget in isolation. But a model that gets deployed is paid for twice, once to train and then again, indefinitely, to serve. Sardana and Frankle (2024) re-derived the trade-off with inference in the objective and reached a different prescription: if you expect meaningful serving demand, on the order of a billion requests, you should train a smaller model for longer than Chinchilla recommends, well past twenty tokens per parameter, because the smaller model is cheaper every time it runs. They trained 47 models to check the formula and found quality still improving at token-to-parameter ratios into the thousands.

There is an important caveat buried in that same paper, and it cuts against naive over-training. Scaling laws fit only on data collected at typical token-to-parameter ratios tend to over-estimate the benefit of additional tokens once you push to extreme ratios. In other words, the curve that justifies training small-and-long is also the curve most likely to be extrapolated past where it was measured. The direction is well-supported; the magnitude at the extremes is exactly where the evidence thins.
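To make the inference-aware trade-off concrete, the sketch below searches for the model size that reaches a fixed target loss at the lowest lifetime cost. It assumes a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β, about 6N FLOPs per training token, and about 2N FLOPs per served token. The coefficients are illustrative placeholders rather than fitted values from any paper, and all function names are hypothetical.

```python
import math

# Illustrative placeholder coefficients for L(N, D) = E + A/N**ALPHA + B/D**BETA.
# They are assumptions for this sketch, not fitted values from any paper.
E, A, B, ALPHA, BETA = 1.7, 400.0, 400.0, 0.34, 0.28


def tokens_for_loss(params, target_loss):
    """Training tokens needed for `params` to reach `target_loss`, or None."""
    gap = target_loss - E - A / params ** ALPHA
    if gap <= 0:
        return None  # model too small to reach the target at any token count
    return (B / gap) ** (1.0 / BETA)


def lifetime_flops(params, train_tokens, inference_tokens):
    """Assumed accounting: 6*N FLOPs per training token, 2*N per served token."""
    return 6.0 * params * train_tokens + 2.0 * params * inference_tokens


def cheapest_model(target_loss, inference_tokens):
    """Grid-search model size for the lowest lifetime FLOPs at a fixed loss."""
    best = None
    for i in range(401):  # sizes from 1e8 to 1e12 parameters, log-spaced
        params = 10.0 ** (8.0 + 0.01 * i)
        tokens = tokens_for_loss(params, target_loss)
        if tokens is None:
            continue
        cost = lifetime_flops(params, tokens, inference_tokens)
        if best is None or cost < best[0]:
            best = (cost, params, tokens)
    return best  # (cost, params, tokens)


if __name__ == "__main__":
    for served in (0.0, 1e13):  # hypothetical lifetime inference tokens
        cost, n, d = cheapest_model(2.0, served)
        print(f"served={served:.0e} N={n:.3g} D={d:.3g} D/N={d / n:.0f}")
```

With these placeholder numbers, adding a serving term pulls the cheapest model toward fewer parameters and a higher token-to-parameter ratio. The loss surface here is extrapolated far outside any range it could have been fitted on, so the size of the shift should be read as qualitative only, which is the same caveat the paragraph above raises.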

3.3 The token side has a ceiling

Training a small model longer assumes you have the tokens to do it with. Increasingly, at the frontier, you do not: the supply of high-quality unique text is finite, and the obvious move is to repeat what you have. Muennighoff et al. (2023) measured the cost of doing so across roughly four hundred training runs. The encouraging part: repeating data for up to about four epochs changes loss negligibly compared with the same volume of fresh tokens. The sobering part: the value of repeats decays after that, with meaningful gains mostly gone by around sixteen epochs and effectively nil by forty, and larger models overfit repeated data faster than small ones.

This turns "more tokens" from a free variable into a budgeted one. The data-constrained scaling law adds a term for the diminishing return of repetition, which means the inference-optimal advice from 3.2 (train smaller, feed it more) runs into a wall the moment "more" has to mean "again." The two updates have to be read together.
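One way to picture the diminishing return of repetition is to convert repeated epochs into a fresh-token equivalent. The sketch below assumes a saturating-exponential form in the spirit of the data-constrained law. The exact functional form, the decay constant `R_STAR`, and the function name are assumptions of this sketch, not values quoted in this note.

```python
import math

R_STAR = 15.0  # assumed decay constant for repeated epochs (illustrative)


def effective_tokens(unique_tokens, epochs, r_star=R_STAR):
    """Fresh-token equivalent of training `epochs` passes over `unique_tokens`.

    Assumed form: U * (1 + r_star * (1 - exp(-repeats / r_star))),
    where repeats = epochs - 1, so a single epoch is worth exactly U.
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    repeats = epochs - 1
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


if __name__ == "__main__":
    unique = 1e12  # hypothetical pool of unique tokens
    for epochs in (1, 4, 16, 40):
        eff = effective_tokens(unique, epochs)
        share = eff / (unique * epochs)
        print(f"{epochs:>2} epochs: {share:.2f} of fresh-token value")
```

Under these assumed values, four epochs retain most of their fresh-token value, while by forty epochs each additional pass adds very little. Feeding `effective_tokens` into a compute-optimal or inference-aware calculation in place of the raw token count is one simple way to read the two updates together.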

3.4 Precision is a hidden third axis

Chinchilla counts parameters and tokens. It does not count bits. Kumar et al. (2024) treat training and inference precision as first-class variables and show that precision behaves like a quiet discount on capacity: training in lower precision reduces a model's effective parameter count, so the parameter you paid for is not always the parameter you get. The more striking result is on the inference side. The damage done by post-training quantization grows with the amount of pre-training data, to the point where, for a model you intend to quantize for serving, additional pre-training tokens can become actively harmful.

That last finding interacts badly with the previous two. The inference-aware recipe says train small and long; the data-constrained law says you will be repeating tokens to do it; and the precision law says that if the payoff for all that training is a model you then quantize to serve cheaply, some of the extra training you just bought is working against you. None of these papers is wrong. They simply optimize different cross-sections of the same budget, and the cross-sections have started to collide.
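The two precision effects described in this section can be sketched as simple functions. The functional forms below are assumptions patterned on the verbal description above: a capacity discount that saturates as training bits increase, and a quantization penalty that grows with tokens and shrinks with parameters and serving bits. Every constant and function name is a hypothetical placeholder, not a fitted value from Kumar et al.

```python
import math

# All constants below are hypothetical placeholders, not fitted values.
GAMMA_TRAIN = 2.5         # bit scale of the training-precision capacity discount
C_PTQ = 0.1               # overall scale of the post-training quantization penalty
EXP_D, EXP_N = 0.5, 0.5   # growth with tokens, shrinkage with parameters
GAMMA_POST = 1.5          # bit scale of the serving-precision penalty


def effective_params(params, train_bits):
    """Assumed capacity discount: N_eff = N * (1 - exp(-bits / gamma))."""
    return params * (1.0 - math.exp(-train_bits / GAMMA_TRAIN))


def ptq_loss_penalty(params, tokens, serve_bits):
    """Assumed loss penalty from quantizing after training.

    Grows with training tokens, shrinks with parameters and serving bits.
    """
    ratio = tokens ** EXP_D / params ** EXP_N
    return C_PTQ * ratio * math.exp(-serve_bits / GAMMA_POST)


if __name__ == "__main__":
    n = 1e9  # hypothetical 1B-parameter model
    print(f"effective params at 8-bit training: {effective_params(n, 8):.3g}")
    for d in (1e11, 1e12):
        print(f"tokens={d:.0e} 4-bit penalty={ptq_loss_penalty(n, d, 4):.3f}")
```

The point of the sketch is the sign of each dependence rather than the magnitudes: holding model size and serving precision fixed, the penalty rises as the token count rises, which is how extra pre-training can work against a model that will be quantized.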

3.5 The binding axis is moving to test time

The largest revision is not to any coefficient but to which budget dominates. Snell et al. (2024) showed that compute spent at inference, letting a model search, verify, and revise its own answers, can substitute for compute spent in pre-training. Their compute-optimal test-time strategy is more than four times as efficient as a naive best-of-N baseline, and in a FLOPs-matched comparison, test-time compute let a small base model outperform one fourteen times its size on problems it could already sometimes solve. This is the line of work that the reasoning models of 2025 are built on.

It does not repeal Chinchilla. It demotes it. If a non-trivial share of a model's delivered quality now comes from how hard it thinks at run time, then the pre-training trade-off that Chinchilla so cleanly described is optimizing a shrinking fraction of the total compute that determines how good the deployed system feels. The scaling question stopped being only "how do I split a training budget" and became "how do I split a budget across training and thinking."
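A simplified way to see why extra samples help only on problems a model can already sometimes solve is the naive best-of-N calculation below. It assumes independent samples and a verifier that always recognizes a correct answer, and it counts inference FLOPs only, taking per-token cost as proportional to parameter count. This is the baseline that Snell et al. improve on, not their compute-optimal strategy, and the function names are hypothetical.

```python
def best_of_n_success(p_single, n_samples):
    """P(at least one of n independent samples is correct), assuming a
    verifier that always recognizes a correct sample."""
    if not 0.0 <= p_single <= 1.0 or n_samples < 1:
        raise ValueError("need 0 <= p_single <= 1 and n_samples >= 1")
    return 1.0 - (1.0 - p_single) ** n_samples


def samples_at_equal_inference_flops(size_ratio):
    """Samples a model `size_ratio` times smaller can draw for the same
    per-token inference FLOPs, assuming cost proportional to parameters."""
    return max(1, int(size_ratio))


if __name__ == "__main__":
    n = samples_at_equal_inference_flops(14)
    for p in (0.0, 0.05, 0.2):  # hypothetical single-sample success rates
        print(f"p={p:.2f} best-of-{n} success={best_of_n_success(p, n):.3f}")
```

A problem with a single-sample success rate of zero stays at zero however many samples are drawn, while a problem solved one time in five is solved most of the time with fourteen samples. A full FLOPs-matched comparison would also account for pre-training compute, which this sketch leaves out.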

4. Open problems

The honest state of the field is that we have five well-measured one-dimensional slices and almost no map of how they interact. The most valuable open work is in the seams between sections 3.2 through 3.5: a joint law over tokens, repetition, precision, and test-time compute, rather than four separate ones each assuming the others are held fixed.

Distillation is a sixth axis with the same problem. Busbridge et al. (2025), at Apple and Oxford, give a clean scaling law for it: student quality is predictable from how a compute budget is split between training the teacher and training the student, and distillation beats ordinary supervised training only up to a compute threshold that depends on student size, or when a capable teacher already exists. It is a useful recipe and also another isolated slice, optimized as if the other five were not in play.

Two cautions are worth stating plainly. First, several of these results are best-supported in the regimes where they were measured and weakest exactly where labs most want to extrapolate them: the extreme token-to-parameter ratios and the largest models. Second, the dependent variable has quietly shifted under all of this work: pre-training laws predict next-token loss, but the thing being sold is now downstream task quality after post-training and test-time search, and the mapping between the two is not itself a clean power law.

5. What it means for the next budget

If there is a single revision to carry out of the last three years, it is that "Chinchilla-optimal" answers a question most labs no longer have. The relevant objective is no longer "lowest training loss for this compute," but "best delivered quality per dollar across training, serving, and thinking, for a model of a precision I can actually deploy." Under that objective the defaults invert: train smaller than your training budget alone would suggest, train longer but not past the point where you are merely re-reading your data, decide your serving precision before you decide your token count, and budget for test-time compute as a first-class line item rather than a rounding error. The original law is not wrong. It is one term in a longer expression the field is still learning to write down.

Cite this paper
@misc{yavari2026scaling,
  author = {Yavari},
  title  = {Scaling laws revisited: what 2025 taught us},
  year   = {2026},
  url    = {https://pushdev.tech/papers/scaling-laws-revisited}
}