Scaling Laws for LLM Pretraining
Jonas Vetterle (@jvetterle)
You might have heard about LLM progress hitting a wall, but at the same time, everyone seems to be raving about new models coming out with better performance and new capabilities. So what's going on?
To answer this question, we have to consider the different stages of LLM training and inference. When people say LLMs are hitting a wall, they typically mean the pre-training phase, where the model is trained on a large dataset for a long time. There are signs of this particular stage slowing down due to data bottlenecks - we have simply used up most of the high-quality text available for pre-training and aren't generating enough new data. As far as inference-time scaling is concerned, there is no sign of it slowing down yet.
This article dives into the pre-training phase and the scaling laws that govern it. We're going to review some of the seminal papers in this area and see how the focus has shifted from compute-optimality at training time to compute-optimality at inference time. If you're interested in whether we have in fact hit a wall as of late 2024, check out this article.
But maybe you're here because you simply wondered what LLMs have to do with rodents like the Chinchilla. In that case, just keep on reading - we'll get to that in a bit.
What are LLM Scaling Laws?
LLM scaling laws are empirical observations about how scale correlates with model performance. What is meant by scale depends on whether we're talking about training or inference time.
When we talk about training time, scale typically refers to the following:
- the number of parameters in the model, typically denoted as $N$
- the amount of data (tokens) used to train the model, typically denoted as $D$, or the number of training steps, denoted as $S$
- the amount of compute used to train the model, typically denoted as $C$, often measured in GPU-days or petaflop-days (PF-days), where one PF-day $= 10^{15} \times 24 \times 3600 \approx 8.64 \times 10^{19}$ floating point operations
Performance in this case is the model's autoregressive cross-entropy loss (negative log-likelihood), so not necessarily the performance on specific downstream tasks. But the idea is that models with a lower loss are likely more capable and perform better on downstream tasks as well.
At inference/test time, scale typically refers to the amount of compute used to generate the final model prediction.
Scaling laws are typically expressed in the functional form of a power-law relationship, $y = a x^k$, which is often used to describe the relationship between two quantities $x$ and $y$ that holds over many orders of magnitude.[2]
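To make that concrete, here is a minimal Python sketch (with made-up numbers) showing that a power law is a straight line in log-log space - which is also how the exponents in the papers below are typically estimated, via a linear fit on log-transformed data.

```python
import numpy as np

# A power law y = a * x^k is a straight line in log-log space, so the exponent k
# can be read off as the slope of a linear fit. The data below is synthetic,
# purely to illustrate the mechanics.
a_true, k_true = 2.0, -0.05
x = np.logspace(6, 12, 20)                    # e.g. model sizes spanning 6 orders of magnitude
y = a_true * x**k_true                        # idealised, noise-free "loss"

k_fit, log_a_fit = np.polyfit(np.log(x), np.log(y), 1)
print(f"recovered exponent k ≈ {k_fit:.3f}")  # ≈ -0.05
```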
Kaplan Scaling Laws
Kaplan et al.[1] is a seminal paper that introduced the concept of LLM scaling laws. They study the effect of model size, data, and compute on test loss. As the figure below demonstrates, there is a power-law relationship between each of model size, data, and compute and performance. These relationships seem to hold over several orders of magnitude, as long as training is not bottlenecked by the other two factors.
One obvious thing to note about these relationships is that the loss is necessarily bounded below by some non-zero value, given the inherent entropy of natural language. So we can hypothesize that these relationships will eventually have to flatten out, though there is no explicit offset accounting for this in the Kaplan Scaling Laws. As we'll see later, this is one of the reasons suggested for the difference between the Kaplan and Chinchilla Scaling Laws.
Here are the actual coefficients they estimate for each of the relationships:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076$$
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095$$
$$L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}, \quad \alpha_C^{\min} \approx 0.050$$

where $L$ is the loss, $N$ is the number of non-embedding parameters, $D$ is the number of tokens and $C_{\min}$ is the (optimally allocated) compute in PF-days. So, for example, doubling the compute reduces the loss by a factor of $2^{-0.050} \approx 0.97$, versus $2^{-0.095} \approx 0.94$ when doubling dataset size and $2^{-0.076} \approx 0.95$ when doubling model size, respectively.
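As a quick sanity check on those factors, here is a small snippet that computes the loss-reduction implied by each exponent quoted above when the corresponding quantity is doubled.

```python
# Back-of-the-envelope check of the Kaplan exponents quoted above:
# doubling N, D or C_min multiplies the loss by 2**(-alpha) for the matching exponent.
alpha_N, alpha_D, alpha_C = 0.076, 0.095, 0.050

for name, alpha in [("model size", alpha_N), ("data", alpha_D), ("compute", alpha_C)]:
    factor = 2 ** (-alpha)
    print(f"doubling {name}: loss x {factor:.3f} (~{(1 - factor) * 100:.1f}% lower)")
```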
Apart from these basic scaling laws, the authors also investigate the effect of varying two factors at once. In particular, they propose the following parameterization of the loss when varying both model size and data:

$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$$

Empirically, the coefficients are estimated as $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.103$ (similar to the coefficients from the univariate scaling laws above, $0.076$ and $0.095$). These relationships are shown in the figure below.
Off of these coefficients they arrive at the conclusion that as you increase model size, you should increase data sublinearly, according to $D \propto N^{\alpha_N / \alpha_D} \approx N^{0.74}$. Or in other words, when increasing model size by a factor of 10, you should increase data by a factor of roughly $10^{0.74} \approx 5.5$.
The authors also show a second scaling law that varies both model size and compute, measured here in number of training steps. In particular, the number of training steps here is $S_{\min}$, which refers to the minimum number of steps required to reach a certain loss (see the paper for details):

$$L(N, S_{\min}) = \left( \frac{N_c}{N} \right)^{\alpha_N} + \left( \frac{S_c}{S_{\min}} \right)^{\alpha_S}$$

where the coefficients are estimated to be $\alpha_N \approx 0.077$ and $\alpha_S \approx 0.76$. This is illustrated nicely by the figure below.
Finally, the authors also establish various relationships between $C_{\min}$ (the minimal compute budget to achieve a certain loss), the optimal model size $N$, optimal batch size $B$, and optimal number of steps $S$, where the dataset size is $D = B \cdot S$:

$$N \propto C_{\min}^{0.73}, \quad B \propto C_{\min}^{0.24}, \quad S \propto C_{\min}^{0.03}, \quad D = B \cdot S \propto C_{\min}^{0.27}$$

This can succinctly be summarised by: given a 10x increase in training budget, one should scale model size by about 5.5x and data by about 1.8x.
Or even more succinctly: big models are more important than big data.
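Here is a minimal sketch of what these allocation exponents mean in practice: given some multiple of the current compute budget, how much bigger should the model, batch size, number of steps and dataset get? The exponents are the ones quoted above.

```python
# How Kaplan et al.'s allocation exponents split a larger training budget.
# Exponents as quoted above: N ∝ C_min^0.73, B ∝ C_min^0.24, S ∝ C_min^0.03, D = B·S ∝ C_min^0.27.
def kaplan_allocation(compute_multiplier: float) -> dict[str, float]:
    return {
        "model size N": compute_multiplier ** 0.73,
        "batch size B": compute_multiplier ** 0.24,
        "steps S": compute_multiplier ** 0.03,
        "data D = B*S": compute_multiplier ** 0.27,
    }

for quantity, growth in kaplan_allocation(10.0).items():
    print(f"{quantity:>12}: x {growth:.2f}")
# 10x compute -> ~5.4x model size and ~1.9x data, consistent with the rule of thumb above.
```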
Chinchilla Scaling Laws
While the Kaplan Scaling Laws suggest that scaling model size is more important than scaling data, given a fixed compute budget, the Chinchilla Scaling Laws by Hoffmann et al.[5] suggest that model size and data are equally important.
There are a few key differences that lead to this conclusion:
- Kaplan et al. used relatively small models (up to 1B parameters), whereas Hoffmann et al. trained models of up to 16B parameters
- Kaplan et al. use a fixed cosine cycle length and corresponding learning rate schedule. Hoffmann et al. argue that a cycle length much longer than the target number of training steps leads to suboptimally trained models. This contributes to the conclusion that scaling model size and data are actually equally important.
To estimate the Chinchilla Scaling Laws, three different methods are used, all of which lead to roughly the same conclusion.
Method 1: Fix model sizes and vary number of training tokens
For a fixed set of different model sizes, vary the number of training tokens, and measure the loss at many points during training. Then find the model size and training token count that lead to the lowest loss at each FLOP budget. This leads to

$$N_{opt} \propto C^{0.50}, \quad D_{opt} \propto C^{0.50}$$

i.e. that model size and data should be scaled exactly equally with compute.
Method 2: IsoFLOP profiles
For a fixed set of different FLOP budgets, vary the model size, and consider the loss at the end of training (i.e. when the FLOP budget is exhausted). This leads to a similar result,

$$N_{opt} \propto C^{0.49}, \quad D_{opt} \propto C^{0.51}$$

i.e. that model size and data should be scaled almost exactly equally with compute.
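To illustrate the isoFLOP idea numerically, here is a hedged Python sketch. Instead of real training runs, it uses the parametric loss introduced under Method 3 below, with the coefficients published in the Chinchilla paper ($E \approx 1.69$, $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, $\beta \approx 0.28$), purely as a stand-in: for each FLOP budget it sweeps model sizes (with $D = C / 6N$), picks the loss-minimising size, and recovers the exponent $a$ in $N_{opt} \propto C^a$ from the slope in log-log space.

```python
import numpy as np

# Stand-in loss model (Chinchilla parametric fit), used here only to mimic
# the loss you would otherwise measure from real training runs.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

flop_budgets = [1e19, 1e20, 1e21, 1e22]
optimal_N = []
for C in flop_budgets:
    N_grid = np.logspace(7, 12, 2000)            # candidate model sizes
    D_grid = C / (6 * N_grid)                    # tokens implied by C ≈ 6·N·D
    losses = loss(N_grid, D_grid)
    optimal_N.append(N_grid[np.argmin(losses)])  # loss-minimising N for this budget

# Slope of log N_opt vs log C gives the exponent a in N_opt ∝ C^a.
a, _ = np.polyfit(np.log(flop_budgets), np.log(optimal_N), 1)
print(f"estimated exponent a ≈ {a:.2f}")         # ≈ 0.45 with these stand-in coefficients
```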
Method 3: Fitting a parametric loss function
Similar to Kaplan et al., here the authors fit a parametric loss function to the data, of the form

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $E$ is the offset that was missing in Kaplan et al. and which accounts for the inherent entropy of natural language. The estimated coefficients ($\alpha \approx 0.34$, $\beta \approx 0.28$) imply

$$N_{opt} \propto C^{0.46}, \quad D_{opt} \propto C^{0.54}$$

i.e. slightly different from the above, but still suggesting that model size and data should be scaled almost equally with compute.
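For the parametric approach, the compute-optimal allocation can also be written in closed form once you impose $C \approx 6ND$. Below is a minimal sketch using the coefficients reported for the Chinchilla parametric fit; the budget passed in at the end is just an arbitrary illustrative value, not one from the paper.

```python
# Compute-optimal N and D under L(N, D) = E + A/N^alpha + B/D^beta with C ≈ 6·N·D.
# Coefficient values are the ones reported for the Chinchilla parametric fit.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_optimal(C: float) -> tuple[float, float]:
    a = beta / (alpha + beta)    # exponent of N_opt ∝ C^a (≈ 0.45 with these rounded values)
    b = alpha / (alpha + beta)   # exponent of D_opt ∝ C^b (≈ 0.55)
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    n_opt = G * (C / 6) ** a
    d_opt = (1 / G) * (C / 6) ** b
    return n_opt, d_opt

n_opt, d_opt = chinchilla_optimal(1e24)   # arbitrary example budget in FLOPs
print(f"N_opt ≈ {n_opt:.2e} parameters, D_opt ≈ {d_opt:.2e} tokens")
```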
Segue: Reconciling Kaplan and Chinchilla Scaling Laws
As an aside, Pearce and Song[3] have recently proposed a way to reconcile the Kaplan and Chinchilla Scaling Laws. The main reasons for the discrepancy between the two are:
- Kaplan et al. using much smaller models than Hoffmann et al.
- Kaplan et al. using the non-embedding parameter count for estimating scaling laws, whereas Hoffmann et al. use the total parameter count.
- As mentioned above, Kaplan et al. do not include an offset accounting for the irreducible loss (the inherent entropy of natural language).
When these differences are accounted for, the two papers' results are actually quite similar.
The Chinchilla Trap, or: Going Beyond Chinchilla-Optimal
While the Chinchilla paper suggested that scaling model size and data are equally important, the Scaling Laws proposed by Sardana et al.[4] go even further and say that we should scale data even more than model size, if we expect a certain amount of inference demand. Just following the Chinchilla Scaling Laws leads to the "Chinchilla Trap", whereby you end up with a model that is way too large and therefore expensive to run at large scale at inference time.
The bottom line of the paper is that if you expect a certain amount of inference demand for your model of a given size (number of parameters), you should actually train a smaller model for much longer, until you reach the same quality as the large model. Even though this will incur higher training costs, it will lead to lower total costs when considering both training and inference.
Sardana et al. investigate what it means for model size and pre-training data if we assume that a model will be used for a fixed amount of inference, measured as a total number of inference tokens $D_{inf}$. The authors also impose the constraint $L(N, D_{tr}) = \ell$, i.e. the pre-training loss is fixed to a target value $\ell$. This is to ensure that compute costs are minimized for a given level of loss/quality.
The objective function they are minimizing is

$$\min_{N, \, D_{tr}} \; C_{train}(N, D_{tr}) + C_{inf}(N, D_{inf}) \approx 6 N D_{tr} + 2 N D_{inf} \quad \text{s.t.} \quad L(N, D_{tr}) = \ell$$

where $6ND_{tr}$ and $2ND_{inf}$ are standard approximations for the number of FLOPs required to train and run a transformer model, respectively.
The goal is to find the optimal model size and number of training tokens that minimize total compute cost given a fixed pre-training loss. This is the converse of the Kaplan and Chinchilla setups, where the compute cost was fixed and model size and data were varied to minimize the pre-training loss.
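Here is a rough numerical sketch of that objective, assuming (as above) the Chinchilla parametric loss as the quality model and the $6ND$ / $2ND$ FLOP approximations. The target loss and inference demand below are hypothetical placeholders, not numbers from the paper.

```python
import numpy as np

# Sketch of the Sardana et al. style objective: pick N and D_train to minimise
# total FLOPs 6·N·D_train + 2·N·D_inf while hitting a target pretraining loss.
# The Chinchilla parametric loss is used as a stand-in quality model;
# D_inf is an assumed lifetime inference demand in tokens.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(N: float, target_loss: float) -> float:
    """Training tokens needed to reach target_loss at model size N (solve L(N, D) = target)."""
    gap = target_loss - E - A / N**alpha
    return (B / gap) ** (1 / beta) if gap > 0 else np.inf

def total_flops(N: float, target_loss: float, D_inf: float) -> float:
    D_train = tokens_for_loss(N, target_loss)
    return 6 * N * D_train + 2 * N * D_inf

target_loss, D_inf = 2.4, 2e12        # hypothetical quality target and inference demand
N_grid = np.logspace(8, 11, 4000)     # candidate model sizes
costs = [total_flops(N, target_loss, D_inf) for N in N_grid]
best = N_grid[int(np.argmin(costs))]
print(f"cost-minimising model size ≈ {best:.2e} params, "
      f"trained on ≈ {tokens_for_loss(best, target_loss):.2e} tokens")
```

Increasing the assumed inference demand pushes the cost-minimising model smaller and the training token count higher, which is exactly the paper's point.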
The plots below summarize the findings from this approach. I added horizontal and vertical lines to the original figure from the paper to make it easier to read. To understand the results, suppose you want to train to a loss of 2.4 and expect a certain inference demand (in FLOPs). Then:
- training a Chinchilla-optimal model would have an overall (training + inference) cost 2x that of training a compute-optimal model (i.e. one that accounts for inference)
- a compute-optimal model would have only 33% of the parameters of a Chinchilla-optimal model
- a compute-optimal model would have a pre-training token-to-parameter ratio 5x that of a Chinchilla-optimal model
The authors then train models with 150M to 6B parameters and with token-to-parameter ratios ranging from 1k to 10k. The plots below demonstrate that, for the hyperparameter ranges they consider, the pre-training loss shows no sign of flattening out as a model is trained for longer (i.e. as the token-to-parameter ratio increases).
Note, though, that the token-to-parameter ratio axis is logarithmic, so it becomes exponentially more expensive to push models to lower loss or higher accuracy levels.
One last interesting finding is that, since the authors trained models considerably longer than Chinchilla, they were able to estimate the coefficients of the Chinchilla Scaling Laws for those very high token-to-parameter ratios. The result of this exercise is that the loss decreases more slowly with increasing token-to-parameter ratio than the Chinchilla Scaling Laws would suggest. In other words, the Chinchilla Scaling Laws were, relatively speaking, a bit too optimistic about the benefits of training longer.
The coefficients they estimate still imply that model size and data are roughly equally important, but slightly favor model size over data. However, this doesn't change the main conclusion: if you want to serve a certain amount of inference demand, you should train a model on more tokens than the Chinchilla-optimal recipe suggests. It just means that doing so is a bit more expensive than the Chinchilla Scaling Laws would imply.
Conclusion
That's a wrap! We've covered three of what are, in my opinion, the most important papers dealing with LLM pre-training scaling laws.
We've seen how the initial focus was on training large models on a relatively small number of tokens, because that is the most efficient thing to do for a given compute budget.
However, this changed with the Chinchilla Scaling Laws, which suggest that model size and data are equally important once the laws are estimated with larger models and other methodological improvements.
Finally, we've seen that Chinchilla-optimal models are only optimal in the academic setting, when you don't consider the inference demand. In the real world, when you expect a certain amount of inference demand, you should train (potentially smaller) models for much longer, as this will lead to lower total costs.
References
[1] J. Kaplan et al. Scaling Laws for Neural Language Models, 2020. [paper]
[2] Christopher T. Kello et al. Scaling laws in cognitive sciences, 2010. [paper]
[3] T. Pearce and J. Song. Reconciling Kaplan and Chinchilla Scaling Laws, 2024. [paper]
[4] N. Sardana et al. Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, 2024. [paper]
[5] J. Hoffmann et al. Training Compute-Optimal Large Language Models, 2022.