A brief history of LLM Scaling Laws and what to expect in 2025
Jonas Vetterle (@jvetterle)
OpenAI just unveiled their new reasoning model, o3, which beats the previous SOTA on the ARC dataset by a large margin and scores a breathtaking result on the challenging FrontierMath dataset. While we're still updating our priors on what this means for the trajectory of AI progress, it's clear that the model is a significant step forward in terms of reasoning capabilities.
However, if you've read recent news coverage (i.e. up until last week) about stalling AI progress, including anonymous leaks and the occasional Gary Marcus rant, you probably noticed a certain degree of pessimism about the speed of advancement. Many were, and probably still are, wondering whether LLM Scaling Laws, which predict that increases in compute, data and model size lead to ever better models, have "hit a wall". Have we reached a limit in terms of how much we can scale the current paradigm: transformer-based LLMs?
Apart from the releases of the first publicly available reasoning models (OpenAI's o1, Google's Gemini 2.0 Flash, and now also o3, which will be released to the public in 2025), most model providers have been focusing on what on the surface looked like incremental improvements to their existing models. In that sense, for the most part, 2024 has been a year of consolidation - many models have essentially caught up with what used to be the go-to model at the beginning of the year, GPT-4.
But that masks the progress that's actually been made to the "workhorse" models like GPT-4o, Sonnet 3.5, Llama 3 etc. (i.e. everything that's not a reasoning model), which are the models most commonly used in AI applications. The big labs have continued to ship new versions of these models that pushed SOTA performance across the board, and which came with huge improvements on tasks like coding and solving math problems.
One cannot help but notice that 2024 has been the year in which improvements in model performance were primarily driven by post-training and scaling test-time compute. In terms of pre-training there hasn't been as much news. This has led to some speculation that the (pre-training) scaling laws are breaking down, and that we are reaching the limits of what is possible with current models, data and compute.
In this post, I'll recap the history of LLM scaling laws and share my thoughts on where we are headed next. It's difficult to make predictions from outside the big AI labs. But based on the information that I've seen, here is my summary of how scaling LLMs might continue in 2025:
- pre-training: limited - compute scaling is underway, but we're likely constrained by the lack of new, high-quality data at sufficient scale
- post-training: more likely - use of synthetic data has been shown to be very effective and this will likely continue
- inference-time: also likely - OpenAI and Google/Deepmind started this trend this year and other players will follow; also, watch out for open source replications of this; on the application layer we'll see ever more agent products
What are LLM Scaling Laws?
Before we dive in, what are LLM Scaling Laws? In short: they are empirical observations about how scale (measured in compute, model size and dataset size) correlates with model performance. You can check out my more detailed article about pre-training scaling laws and test-time scaling laws.
With that in mind, let's see where we are and how we got here.
Compute-optimal pre-training - Kaplan and Chinchilla
The original Scaling Laws referred to the pre-training phase of LLMs. The "Kaplan" Scaling Laws[3] (OpenAI, 2020) suggest that as your pre-training compute budget increases, you should scale model size more than data. This means: given a 10x increase in training budget, one should scale model size by 5.5x and data by 1.8x.
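To see roughly where those multipliers come from: Kaplan et al. fit power laws of approximately N ∝ C^0.73 for compute-optimal model size and D ∝ C^0.27 for data. The snippet below is my own illustration of evaluating those (approximate) exponents, not code from the paper:

```python
# Rough illustration of the Kaplan-style compute allocation (not code from the paper).
# Assumed exponents: model size N ~ C^0.73, data D ~ C^0.27 (approximate fits from Kaplan et al.).
ALPHA_N = 0.73  # how compute-optimal model size grows with compute
ALPHA_D = 0.27  # how compute-optimal dataset size grows with compute

def kaplan_allocation(compute_multiplier: float) -> tuple[float, float]:
    """Return (model-size multiplier, data multiplier) for a given compute multiplier."""
    return compute_multiplier ** ALPHA_N, compute_multiplier ** ALPHA_D

n_mult, d_mult = kaplan_allocation(10.0)
print(f"10x compute -> {n_mult:.1f}x model size, {d_mult:.1f}x data")
# 10x compute -> 5.4x model size, 1.9x data (close to the 5.5x / 1.8x figures above)
```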
GPT-3[1], which was also released by OpenAI in 2020, presumably followed these scaling laws and was trained on an unusually small amount of data given its size. That is, it had 175B parameters but was trained on only 300B tokens, which equates to roughly 1.7 tokens per parameter.
There were certain flaws in those original scaling laws, such as not accounting for embedding parameters and generally using quite small models to estimate the Scaling Laws, which didn't necessarily hold for larger models. The "Chinchilla" Scaling Laws[2] (Deepmind, 2022) corrected for some of these and came to very different conclusions.
In particular, data turns out to be much more important than previously thought, and so model size and data should be scaled equally with compute. These new findings suggested that models like GPT-3 and others published around that time were actually severely undertrained. A model like GPT-3 with 175B parameters should have been trained on ~3.5T tokens to be compute-optimal, which is roughly 20 tokens per parameter. Or, by the inverse argument, a model trained on GPT-3's 300B tokens should have had only ~15B parameters, i.e. been roughly 12x smaller.
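To make those numbers concrete, here is a back-of-the-envelope sketch (my own, not from the Chinchilla paper) using the common approximation of ~6·N·D training FLOPs and the ~20 tokens-per-parameter rule of thumb:

```python
# Back-of-the-envelope check of the Chinchilla rule of thumb (~20 tokens per parameter).
# Assumes training compute C ~= 6 * N * D FLOPs, a standard approximation for dense transformers.
TOKENS_PER_PARAM = 20

def optimal_tokens(n_params: float) -> float:
    """Tokens a model of this size should roughly see to be compute-optimal."""
    return TOKENS_PER_PARAM * n_params

def optimal_params(n_tokens: float) -> float:
    """Model size that is roughly compute-optimal for a fixed token budget."""
    return n_tokens / TOKENS_PER_PARAM

gpt3_params, gpt3_tokens = 175e9, 300e9
print(f"GPT-3 training compute:         ~{6 * gpt3_params * gpt3_tokens:.1e} FLOPs")
print(f"Optimal tokens for 175B params: ~{optimal_tokens(gpt3_params):.1e}")   # ~3.5e12 (3.5T)
print(f"Optimal params for 300B tokens: ~{optimal_params(gpt3_tokens):.1e}")   # ~1.5e10 (15B)
```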
The Chinchilla Trap or: Optimizing for Inference
Just following the Chinchilla Scaling Laws leads to the "Chinchilla Trap", whereby you end up with a model that is way too large and therefore expensive to run at scale at inference time. See, for example, the Llama 1 paper by Touvron et al.[4] (Meta, 2023), which notes that the loss continues to decrease beyond what's Chinchilla-optimal. The Llama 1 models are trained with a ratio of up to 142 tokens per parameter in the case of the smallest (7B) model, which is trained on 1T tokens. This trend continued with Llama 2[5] (Meta, 2023), where the token count was doubled to 2T, resulting in a ratio of up to 284 tokens per parameter. And it continues in Llama 3[6] (Meta, 2024), where the ratio is up to 1,875 tokens per parameter (8B model trained on 15T tokens). Training these small models for longer makes them surprisingly performant, and also cheap to run at inference time.
Evidence for this comes not only from the fact that the Llama 3 models are trained at extremely high token-to-parameter ratios, but also from the literature. For example, Sardana et al.[7] (MosaicML, 2023) estimate Scaling Laws that take inference-time compute into account. In their experiments they train models with ratios of up to 10,000 tokens per parameter, and find that the loss continues to decrease beyond what's Chinchilla-optimal. The plots in their paper nicely illustrate the point of training smaller models for longer, and how that can lead to lower total cost if you expect high enough inference demand.
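To see why expected inference demand changes the calculus, here is a toy lifetime-compute comparison in the spirit of Sardana et al., though not their actual methodology: it assumes ~6·N·D FLOPs for training and ~2·N FLOPs per generated token, and deliberately ignores any quality difference between the two models:

```python
# Toy lifetime-compute comparison: a Chinchilla-style 70B model vs. an overtrained 8B model.
# Assumed approximations: training ~= 6 * N * D FLOPs, inference ~= 2 * N FLOPs per generated token.
# Quality differences between the two models are deliberately ignored here.
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    training = 6 * n_params * train_tokens
    inference = 2 * n_params * inference_tokens
    return training + inference

expected_inference_tokens = 10e12  # assumed lifetime inference demand: 10T tokens

big = lifetime_flops(70e9, 1.4e12, expected_inference_tokens)   # roughly Chinchilla-optimal 70B
small = lifetime_flops(8e9, 15e12, expected_inference_tokens)   # Llama-3-style overtrained 8B

print(f"70B / 1.4T training tokens: {big:.2e} total FLOPs")
print(f"8B  / 15T training tokens:  {small:.2e} total FLOPs")
```

Under these assumptions the 8B model actually spends more FLOPs on training than the 70B one, but far less in total once inference is accounted for.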
Scaling Test-Time Compute
Needless to say, training models on ever more data, and with ever more parameters, is computationally expensive. The Llama 3 paper states that training the flagship model used 3.8 × 10^25 FLOPs, roughly 50x more than Llama 2. And according to EpochAI, the largest known training budget so far (as of Dec 2024) is an estimated ~5 × 10^25 FLOPs, in the case of Gemini Ultra. That's a lot of compute, especially if you consider scaling it up by a few more orders of magnitude.
In response to this, 2024 has seen the release of models like OpenAI's o1 and most recently o3, which make use of test-time compute to generate predictions. So instead of just generating an answer straight away, these models generate chains of thought, or use RL techniques at test time, to generate better answers. In layman's terms, one could say that we're giving the model more time to "think" before giving an answer. This has given rise to an entirely different LLM scaling law: that of test-time compute. If you're interested, this literature review goes into more detail about the different approaches to inference-time compute scaling.
I also recommend this interesting talk by OpenAI's Noam Brown, in which he shares his learnings from training models to play games (Poker, Chess, Hex etc.), and how test-time compute enabled SOTA performance that would have been out of reach purely by scaling training compute.
For example, if there is a trade-off between training and inference-time compute whereby you can save 10x on the training budget in exchange for a 15x increase in inference-time compute, it may make sense to do so in cases where training compute is already very costly and inference compute is cheap.
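A hypothetical break-even calculation (all numbers made up purely for illustration) shows how query volume decides whether such a trade is worth it:

```python
# Hypothetical break-even for trading training compute against test-time compute.
# All numbers are illustrative assumptions, not figures from any paper or lab.
baseline_train_flops = 1e26                               # assumed baseline training budget
train_saving = baseline_train_flops * (1 - 1 / 10)        # spend 10x less on training
base_flops_per_query = 1e15                               # assumed inference cost per query
extra_flops_per_query = base_flops_per_query * (15 - 1)   # 15x more inference compute per query

break_even_queries = train_saving / extra_flops_per_query
print(f"The trade pays off below ~{break_even_queries:.1e} queries over the model's lifetime")
```

Below that query volume the saved training compute dominates; above it, the extra inference compute does.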
Are Scaling Laws still valid or have we hit a wall?
That's the big question, and it's hard to answer from outside the big labs. Let's review what people on the inside say, keeping in mind that there may be some bias in their statements.
- Dario Amodei of Anthropic: "I’ve seen the story happen for enough times to really believe that probably the scaling is going to continue, and that there’s some magic to it that we haven’t really explained on a theoretical basis yet."
- Sam Altman of OpenAI: "there is no wall"
Add to this the fact that companies are still expanding their data centers, with xAI's Colossus cluster hosting 100,000 H100s and plans to expand it to at least 1 million GPUs.
While there are engineering challenges and energy bottlenecks to scaling compute, it is underway. But compute is just one of the factors in LLM Scaling Laws, the other two being model size and data. With bigger clusters, it's also possible to train bigger models in a given amount of time. However, scaling data is a different story.
EpochAI estimates that there are 510T tokens of data available on the indexed web, while the largest known training dataset is ~18T tokens (Qwen2.5). So it would seem that there is still a lot of room to scale data, but most of that data is low quality or repetitive. Add to that the fact that, over the past 1-2 years, a growing share of newly added text on the internet is LLM-generated. There are still potentially new data sources available, like transcribing all video on the internet, or using text that is not on the open internet (e.g. proprietary data), but the low-hanging fruit has been picked.
Diminishing returns to scaling are actually exactly what you would expect from a power-law relationship. That is, to get the first unit of improvement you need 1 unit of data, then 10 for the next, then 100, and so on. As Yann LeCun puts it, this applies to all "long tail" domains, i.e. domains in which the diversity of inputs keeps growing with dataset size, like dialogue and question answering.
Looking at the equations and plots of the Scaling Laws, it should be clear that there is a limit to these relationships, which was also acknowledged in the original Kaplan paper[3]. The reason is that there is an inherent entropy in natural language, so the loss cannot be reduced to zero. While it may currently seem that performance just keeps increasing linearly with the log of compute, data and model size, at some point it will have to flatten out. The question is not if, but when this will happen.
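For reference, the parametric loss fitted in the Chinchilla paper[2] makes this explicit: it decomposes the loss into an irreducible entropy term plus two power-law terms that shrink with model size and data (the exponents below are the approximate published fits):

$$
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}, \qquad \alpha \approx 0.34,\quad \beta \approx 0.28
$$

Here N is the number of parameters, D the number of training tokens, A and B fitted constants, and E the irreducible entropy of natural language: no matter how far N and D are scaled, the loss never drops below E.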
Have we reached this point now? It's difficult to answer, because it's not as easy as just scaling compute or data by another order of magnitude and seeing what happens. AI labs are building large new clusters which will enable them to train models for even longer and to see if the loss continues to decrease at the same rate. For all I know, we haven't really trained these models on 100,000 H100s yet, let alone 1,000,000, so it's hard to tell how much further we can reduce the training loss. More importantly though, we have but one internet, so scaling data is a much harder problem. And as we know from the Kaplan Scaling Laws, these laws only hold when training is not bottlenecked by any one of these factors.
However, given the impressive performance of models that make use of test-time compute, and this week's announcement of OpenAI's o3, it should be clear that scaling test-time compute is here to stay.
As the chart below shows, the jump in performance on the challenging ARC dataset when scaling test-time compute is quite significant. Going from o3 low to o3 high, the model is given 172x more compute to generate an answer. It then uses on average 57M tokens per question, which amounts to 13.8 minutes of runtime, whereas it uses only 330k tokens and 1.3 minutes per question in the low-compute setting.
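As a quick sanity check on those figures (my own arithmetic, using only the numbers quoted above):

```python
# Quick sanity check of the o3 low- vs. high-compute figures quoted above.
low_tokens_per_question = 330e3
compute_multiplier = 172
high_tokens_per_question = low_tokens_per_question * compute_multiplier
print(f"~{high_tokens_per_question / 1e6:.0f}M tokens per question in the high-compute setting")
```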
According to Noam Brown, this is only the beginning. It's possible we're going to let models run for hours, days or weeks to answer really challenging questions next year.
Conclusion
Given the momentum and hardware rollout that's happening, I think it's safe to say that people are going to try to push the scaling laws further by throwing more compute at the problem. This will happen on the training side, by pre-training models for longer and investing more in post-training, but especially on the inference side, by letting models "think" longer before giving an answer.
It's also likely that the public won't always have access to the biggest models, which might be the most performant but are simply too expensive to run at scale. Models like GPT-4o or Sonnet 3.5 are probably relatively small models optimized for inference, whereas the 405B-parameter Llama 3 model, which is quite unwieldy to serve, could be used as a great teacher model for smaller models, or for generating synthetic data.
Trends from this year, which will surely continue into 2025 (an easy prediction to make at this point in the year), are:
- Agents: agents are really just another form of test-time compute, but one that is more accessible to application developers and the general public rather than just the big labs. Though it's clear that the big labs are heavily investing in agents too.
- Test-time compute: this is the big one. As we've seen with o1, Gemini 2.0 Flash and o3, it will be the solution for use cases that require more complex reasoning, or where it makes sense to trade off some training compute for more inference compute.
- Synthetic data: to my knowledge this is mainly used for post-training, though I think you could consider cleaning up the internet a form of synthetic data generation as well. Reading the LLM papers from this year, it's clear that synthetic data for SFT was quite important for improving performance on tasks like math and coding. Synthetic data is more useful in some domains than in others, so I'm not sure whether it can really plug the gap left by the lack of human-written data.
So my conclusion from all of this is that we may well have reached a point where pre-training scaling laws are not exactly breaking down, but perhaps slowing down, and that shouldn't be surprising: it's mainly driven by the fact that we've exhausted a lot of the sources of high-quality text.
That however doesn't mean that there won't be any more progress in the field, because pre-training is just one part of the puzzle. As we've seen, scaling test time compute, and using synthetic data, are likely going to be the big drivers of progress in the near future. For all we know, we're only in the early innings of test-time scaling laws, so there's still a lot of room for improvement there.
In sum, this is where I see the most potential for scaling LLMs in 2025:
- pre-training: limited - compute scaling is underway, but we're likely constrained by the lack of new, high-quality data at sufficient scale
- post-training: more likely - use of synthetic data has been shown to be very effective and this will likely continue
- inference-time: also likely - OpenAI and Google/Deepmind started this trend this year and other players will follow; also, watch out for open source replications of this; on the application layer we'll see ever more agent products
References
[1] T. Brown et al. Language Models are Few-Shot Learners, 2020. [paper]
[2] J. Hoffmann et al. Training Compute-Optimal Large Language Models, 2022. [paper]
[3] J. Kaplan et al. Scaling Laws for Neural Language Models, 2020. [paper]
[4] H. Touvron et al. LLaMA: Open and Efficient Foundation Language Models, 2023. [paper]
[5] H. Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. [paper]
[6] Llama Team, AI @ Meta. The Llama 3 Herd of Models, 2024. [paper]
[7] N. Sardana et al. Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws, 2024. [paper]