When researchers build large language models (LLMs), they aim to maximize performance under a given computational and financial budget. Since training a model can cost millions of dollars, developers need to make judicious decisions about cost-determining choices, such as the model architecture, optimizers, and training datasets, before committing to a model. To anticipate the quality and accuracy of a large model's predictions, practitioners often turn to scaling laws: using smaller, cheaper models to try to approximate the performance of a much larger target model. The challenge, however, is that there are thousands of ways to create a scaling law.
New work by researchers at MIT and the MIT-IBM Watson AI Lab addresses this by amassing and releasing a collection of hundreds of models and metrics concerning their training and performance, which supports roughly a thousand scaling laws. From this, the team developed a meta-analysis and guide for selecting small models and estimating scaling laws for different LLM model families, so that a budget is best applied toward generating reliable performance predictions.
“The idea of building mathematical models of the training process is a few years old, but I think what’s new here is that most of the work people had done before asks, after the fact, ‘having trained all of these models, when we try to train a new large model, can we say how best to use our compute budget?’” says Jacob Andreas, a professor in the Department of Electrical Engineering and Computer Science and principal investigator with the MIT-IBM Watson AI Lab.
The study was recently presented at the International Conference on Machine Learning by Andreas, along with MIT-IBM Watson AI Lab researchers Leshem Choshen and Yang Zhang of IBM Research.
Inferred performance
No matter how you slice it, developing LLMs is an expensive undertaking: from decisions about the number of parameters and tokens, data selection and size, and training techniques to determining output accuracy and tuning to target applications and tasks. Scaling laws offer a way to forecast model behavior by relating a large model's loss to the performance of smaller, less costly models from the same family, avoiding the need to fully train every candidate. The main differences between the smaller models are the number of parameters and the token training size. According to Choshen, articulating scaling laws not only enables better pre-training decisions, but also democratizes the field by allowing researchers without vast resources to understand and build effective scaling laws.
The functional form of a scaling law is relatively simple, combining components from small models that capture the number of parameters and their scaling effect, the number of training tokens and their scaling effect, and the baseline performance of the model family of interest. Together, these help researchers estimate a target large model's performance loss; the smaller the loss, the better the target model's outputs are likely to be.
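As an illustration, one commonly used functional form (a Chinchilla-style power law, assumed here for concreteness; the study compares many variants) writes the predicted loss of a model with a given parameter count and token budget as a family-specific baseline plus two power-law terms. A minimal sketch, with purely illustrative constants:

```python
def predicted_loss(n_params, n_tokens, E, A, alpha, B, beta):
    """Chinchilla-style form: irreducible loss E for the model family, plus
    power-law penalties for limited parameters and limited training tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Hypothetical fitted constants for a made-up model family (illustrative only)
fit = dict(E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28)

print(predicted_loss(1e9, 100e9, **fit))    # small, cheap model
print(predicted_loss(70e9, 1.4e12, **fit))  # large target model: lower predicted loss
```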
These laws allow research teams to weigh trade-offs efficiently and to test how best to allocate limited resources. They are particularly useful for evaluating the scaling of a certain variable, such as the number of tokens, and for A/B testing different pre-training setups.
In general, scaling laws are not new; however, in the field of AI, they emerged as models grew and costs skyrocketed. “It’s like scaling laws just appeared at some point in the field,” says Choshen. “They started getting attention, but no one really tested how good they are and what you need to do to make a good scaling law.” Moreover, scaling laws were themselves something of a black box. “Whenever people created scaling laws in the past, it was always one model, or one model family, and one dataset, and one developer,” says Andreas. “There really hasn’t been a lot of systematic meta-analysis, because everybody is individually training their own scaling laws.”
Build better
To investigate this, Choshen, Andreas, and Zhang created a large dataset. They collected LLMs from 40 model families, including Pythia, OPT, OLMo, LLaMA, BLOOM, T5-Pile, GPT, mixture-of-experts models, and others. These included 485 unique, pre-trained models, along with data on their training checkpoints, computational cost (FLOPs), training epochs and seeds, and 1.9 million performance metrics of loss and downstream tasks. The models varied in their architectures, weights, and so on. Using these models, the researchers fit over 1,000 scaling laws and compared their accuracy across architectures, model sizes, and training regimes, as well as testing how the number of models, the inclusion of intermediate training checkpoints, and partial training affected the predictive power of scaling laws for target models. They used measurements of absolute relative error (ARE); this is the difference between the scaling law's prediction and the observed loss of a large, trained model. With this, the team compared the scaling laws and, after analysis, distilled practical recommendations for AI practitioners about what makes an effective scaling law.
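To make the fitting and scoring procedure concrete, here is a minimal sketch of estimating one such law from small-model runs and evaluating it with absolute relative error; the model sizes, losses, and the specific power-law form are hypothetical stand-ins, whereas the real analysis fit and compared many variants across families:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical final losses from small runs in one model family:
# (parameter count, training tokens, measured loss)
n_params = np.array([160e6, 160e6, 410e6, 410e6, 1.0e9, 1.0e9, 2.8e9, 2.8e9])
n_tokens = np.array([100e9, 300e9, 100e9, 300e9, 100e9, 300e9, 100e9, 300e9])
losses   = np.array([2.69,  2.60,  2.51,  2.42,  2.39,  2.30,  2.29,  2.20])

def scaling_law(x, E, A, alpha, B, beta):
    n, d = x
    return E + A / n**alpha + B / d**beta

# Fit the five free parameters to the small-model observations
popt, _ = curve_fit(scaling_law, (n_params, n_tokens), losses,
                    p0=[1.7, 400.0, 0.3, 400.0, 0.3], maxfev=20000)

# Extrapolate to a hypothetical 12B-parameter target and score the prediction
predicted = scaling_law((12e9, 2e12), *popt)
observed = 2.00  # would come from actually training the target model
are = abs(predicted - observed) / observed  # absolute relative error
print(f"predicted loss: {predicted:.3f}, ARE: {are:.1%}")
```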
Their shared guidelines walk developers through the steps and choices to consider, and what to expect. First, it is crucial to decide on a compute budget and a target model accuracy. The team found that an absolute relative error of about 4 percent is the best accuracy one can expect, owing to random seed noise, but errors of up to 20 percent can still be useful for decision-making. The researchers identified several factors that improve predictions, such as including intermediate training checkpoints rather than relying solely on final losses; this makes scaling laws more reliable. However, very early training data, before about 10 billion tokens, is noisy, reduces accuracy, and should be discarded. They recommend prioritizing training more models to improve the robustness of a scaling law's predictions, rather than just larger ones; selecting five models provides a reliable starting point.
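A minimal sketch of applying those selection rules, assuming a hypothetical list of per-checkpoint records (the 10-billion-token cutoff and five-model minimum come from the guidance above):

```python
MIN_TOKENS = 10e9  # early checkpoints (before ~10B tokens) are noisy; discard them
MIN_MODELS = 5     # a reliable starting point is roughly five models in the family

def select_fit_points(checkpoints):
    """Keep intermediate and final checkpoints past the early-training noise,
    and check that enough distinct models remain to fit a robust scaling law."""
    usable = [c for c in checkpoints if c["tokens"] >= MIN_TOKENS]
    models = {c["model"] for c in usable}
    if len(models) < MIN_MODELS:
        raise ValueError(f"only {len(models)} usable models; consider training "
                         "more small models rather than a single larger one")
    return usable

# Example with hypothetical records: six models, three checkpoints each
records = [{"model": f"family-{i}", "tokens": t, "loss": 3.0 - 0.1 * i}
           for i in range(6) for t in (5e9, 50e9, 300e9)]
print(len(select_fit_points(records)))  # 12 checkpoints survive the token cutoff
```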
In general, including larger models improves predictions, but costs can be saved by partially training the target model, to about 30 percent of its dataset, and using that for extrapolation. If the budget is heavily constrained, developers should consider training one smaller model within the target family and borrowing scaling law parameters from a model family with a similar architecture; however, this may not work for encoder-decoder models. Finally, the MIT-IBM research team found that, when scaling laws were compared across model families, there was strong correlation between two sets of hyperparameters, meaning that three of the five hyperparameters explained nearly all of the variation and could likely capture the model behavior. Together, these guidelines provide a systematic approach that makes scaling law estimation more efficient, reliable, and accessible for AI researchers working under varying budget constraints.
This work produced several surprises: partially trained small models are still very predictive, and further, the intermediate training stages of a fully trained model can be used, as if they were individual models, to predict another target model. “Basically, you don’t pay anything in training, because you already trained the full model, so the half-trained model is just a byproduct of what you did,” says Choshen. Another feature Andreas points out is that, when aggregated, the variability across model families and different experiments jumps out and is noisier than expected. Surprisingly, the researchers also found that it is possible to use scaling laws on large models to predict the performance of smaller models. Other research in the field has hypothesized that smaller models are a “different beast” compared to large ones; however, Choshen disagrees. “If they were totally different, they would have shown totally different behavior, and they don’t.”
Although this work focused on model training time, the researchers plan to extend their analysis to model inference. “It’s not only how my model gets better as I add more training data or more parameters, but instead as I let it think for longer and draw more samples. I think there must be some lessons here about how to build a predictive model of how much thinking you need to do at run time,” says Andreas. He says a theory of inference-time scaling laws may become even more critical because “it’s not like I’m going to train one model and then be done.”
This research was supported, in part, by the MIT-IBM Watson AI Lab and a Sloan Research Fellowship.