Challenges of data selection in LLM pre-training
Developing large language models requires substantial computational investment, especially when it comes to choosing among candidate pre-training corpora. A comprehensive comparison of datasets at the scale of billions of parameters and billions of tokens can consume hundreds of thousands of GPU hours per run. Practitioners therefore rely on smaller-scale experiments as proxies for large-scale behavior. However, these “pilot” studies are rarely published, resulting in a fragmented landscape in which each lab repeats similar small-scale tests without sharing benchmarks or methods. This opacity hinders reproducibility, limits collective insight, and obscures the real trade-offs between development compute and final model performance.

DataDecide
To address these limitations, the Allen Institute for AI (AI2), in collaboration with the University of Washington and the University of Pennsylvania, has released DataDecide, a comprehensive suite of controlled pre-training experiments spanning 25 distinct corpora and 14 model sizes ranging from 4 million to 1 billion parameters. DataDecide’s datasets include well-known sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, as well as variations produced by domain ablation, deduplication, quality filtering, and source mixing. Each model is trained at a fixed token-to-parameter ratio of 100 (100 tokens per parameter), reflecting the “overtraining” regime used to optimize inference efficiency. In total, more than 1,050 models and over 30,000 checkpoints, each evaluated on a suite of downstream tasks, have been released to the public.
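As a rough illustration of the budgets this ratio implies, the short sketch below computes training-token counts at 100 tokens per parameter; the parameter counts used are illustrative stand-ins, not the exact released configurations.

```python
# Minimal sketch: token budgets implied by a fixed 100-tokens-per-parameter ratio.
# Parameter counts are illustrative stand-ins for the 4M-1B ladder, not the exact
# configurations released with DataDecide.
TOKENS_PER_PARAM = 100

illustrative_sizes = {"4M": 4_000_000, "150M": 150_000_000, "1B": 1_000_000_000}

for name, n_params in illustrative_sizes.items():
    n_tokens = TOKENS_PER_PARAM * n_params
    print(f"{name:>4}: {n_params:>13,} params -> {n_tokens:>17,} training tokens")
```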
Technical structure and pragmatic benefits
DataDecide organizes its experiments along three axes:
- Data Recipes: Twenty-five well-documented pre-training corpora, each embodying a distinct curation strategy (see Table 1 of the paper for complete recipe specifications).
- Model Scales: Fourteen parameter configurations (4M to 1B), derived programmatically from the OLMo model ladder to ensure consistent training hyperparameters across scales. Each non-target scale includes two “early” seed runs, while the 1B-parameter target scale has three complete seed reruns to quantify variability.
- Evaluation Suite: Ten OLMES benchmarks (e.g., MMLU, ARC Easy/Challenge, HellaSwag, MBPP, HumanEval) provide a multifaceted view of language understanding, commonsense reasoning, and code-generation performance. The resulting experiment grid is sketched below.
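Here is a minimal sketch of that experiment grid; the recipe names, ladder steps, and seed counts are assumptions drawn from the description above, not the released identifiers.

```python
# Minimal sketch of the DataDecide experiment grid: 25 recipes x 14 scales x seed runs.
# Recipe names and ladder steps are illustrative assumptions, not released identifiers.
from itertools import product

recipes = [f"recipe_{i:02d}" for i in range(25)]  # stand-ins for Dolma, DCLM, C4 variants, ...
scales = ["4M", "6M", "8M", "10M", "14M", "16M", "20M",
          "60M", "90M", "150M", "300M", "530M", "750M", "1B"]  # assumed ladder steps
seeds_per_scale = {s: (3 if s == "1B" else 2) for s in scales}  # per the description above

runs = [
    {"recipe": r, "scale": s, "seed": seed}
    for r, s in product(recipes, scales)
    for seed in range(seeds_per_scale[s])
]
print(f"{len(runs)} training runs in this illustrative grid")
```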
By releasing the pre-training corpora and the corresponding models, DataDecide enables researchers to (a checkpoint-loading sketch follows this list):
- Reuse released checkpoints for new evaluations without retraining.
- Experiment with novel prediction methods (e.g., advanced scaling-law fits, smoothing techniques).
- Study the sensitivity of benchmarks to training data and model scale.
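As a sketch of such checkpoint reuse, the snippet below loads one released model with the Hugging Face transformers library; the repository id is a hypothetical placeholder, and the actual identifiers are listed with the DataDecide release.

```python
# Minimal sketch: reusing a released checkpoint for a new evaluation.
# The repo id below is a hypothetical placeholder; see the DataDecide release on
# Hugging Face for the actual model identifiers and revisions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "allenai/DataDecide-example-150M"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```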
Key findings and quantitative insights
DataDecide’s systematic analysis yields four practical guidelines:
- Single-scale baseline robustness: Ranking corpora by downstream accuracy at a single small scale (e.g., 150M parameters) already identifies the best dataset at the 1B-parameter target scale with high decision accuracy. In contrast, extrapolating scaling-law fits across eight baseline scales does not beat this simple heuristic, underscoring its cost-effectiveness (a sketch of the decision-accuracy computation follows this list).
- Task-dependent compute sensitivity: The compute budget required for reliable decisions varies by task. Benchmarks such as MMLU and ARC Easy can be predicted with as little as 0.01% of the target compute, while HellaSwag and SocialIQA require orders of magnitude more compute to reach similar decision accuracy.
- Proxy metric selection: Continuous likelihood metrics, in particular the character-normalized average probability of the correct continuation (correct probability) and the total probability of the correct continuation (total probability), outperform discrete accuracy measurements at small scales. The effect is most pronounced on code tasks (MBPP, HumanEval), where decision accuracy jumps from near chance to over 80% when the correct-probability proxy is used.
- Variance and spread considerations: High decision accuracy is associated with low run-to-run variance (noise) and a wide spread in performance across recipes. Proxy metrics that reduce noise or amplify spread therefore directly improve prediction reliability.
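To make “decision accuracy” concrete, here is a minimal sketch (with made-up scores, not DataDecide results) that ranks recipes by a small-scale proxy, such as the probability assigned to the correct continuation, and counts how often pairwise comparisons agree with the target-scale accuracy.

```python
# Minimal sketch of decision accuracy: the fraction of recipe pairs for which a
# small-scale proxy metric and the 1B target-scale accuracy agree on which recipe
# is better. All scores below are made up for illustration.
from itertools import combinations

small_scale_proxy = {  # e.g. correct-continuation probability measured at 150M
    "recipe_a": 0.42, "recipe_b": 0.47, "recipe_c": 0.39, "recipe_d": 0.51,
}
target_scale_acc = {   # downstream benchmark accuracy measured at 1B
    "recipe_a": 0.58, "recipe_b": 0.61, "recipe_c": 0.55, "recipe_d": 0.60,
}

def decision_accuracy(proxy: dict, target: dict) -> float:
    """Fraction of recipe pairs ranked the same way by the proxy and the target."""
    pairs = list(combinations(proxy, 2))
    agree = sum((proxy[a] > proxy[b]) == (target[a] > target[b]) for a, b in pairs)
    return agree / len(pairs)

print(f"decision accuracy: {decision_accuracy(small_scale_proxy, target_scale_acc):.2f}")
```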
Conclusion
DataDecide transforms pre-training data selection from an ad hoc art into a transparent, data-driven science. By opening up all 25 corpora, 1,050 models, more than 30,000 checkpoints, and the evaluation scripts on Hugging Face and GitHub, AI2 invites the community to replicate its findings, extend the evaluation to new benchmarks, and innovate on decision-making methods. As LLM development continues to demand ever-greater compute, DataDecide offers a principled framework for minimizing wasted experiments and maximizing insight, paving the way for more efficient, reproducible, and collaborative AI research.
Check out the Paper and the released models on Hugging Face for further technical details.

