LLM - Pretraining Dataset Research
• DataDecide: How to Predict Best Pretraining Data with Small Experiments (arXiv:2504.11393)
• Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources (arXiv:2504.04152)
• BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining (arXiv:2508.10975)
• Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset (arXiv:2412.02595)
• The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining (arXiv:2510.00866)
• Data, Data Everywhere: A Guide for Pretraining Dataset Construction (arXiv:2407.06380)
• Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models (arXiv:2505.22232)
• Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset (arXiv:2508.15096)
• Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models (arXiv:2407.07263)
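Since every entry above carries its arXiv identifier, the collection can also be pulled programmatically. Below is a minimal sketch, using only the Python standard library, that queries the public arXiv export API (http://export.arxiv.org/api/query) for the IDs listed here; the helper name fetch_arxiv_metadata and the ARXIV_IDS constant are illustrative conveniences, not part of any of the papers.

```python
import urllib.request
import xml.etree.ElementTree as ET

# arXiv IDs taken from the collection above.
ARXIV_IDS = [
    "2504.11393",  # DataDecide
    "2504.04152",  # Rethinking Multilingual Continual Pretraining
    "2508.10975",  # BeyondWeb
    "2412.02595",  # Nemotron-CC
    "2510.00866",  # The Data-Quality Illusion
    "2407.06380",  # Data, Data Everywhere
    "2505.22232",  # Judging Quality Across Languages
    "2508.15096",  # Nemotron-CC-Math
    "2407.07263",  # Reuse, Don't Retrain
]

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the arXiv feed


def fetch_arxiv_metadata(ids):
    """Query the arXiv export API and yield (id, title, summary) per entry."""
    url = "http://export.arxiv.org/api/query?id_list=" + ",".join(ids)
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.findall(ATOM + "entry"):
        arxiv_id = entry.find(ATOM + "id").text.rsplit("/", 1)[-1]
        # Collapse the whitespace/newlines arXiv inserts into titles and abstracts.
        title = " ".join(entry.find(ATOM + "title").text.split())
        summary = " ".join(entry.find(ATOM + "summary").text.split())
        yield arxiv_id, title, summary


if __name__ == "__main__":
    for arxiv_id, title, _summary in fetch_arxiv_metadata(ARXIV_IDS):
        print(f"{arxiv_id}: {title}")
```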