Dataset and pre-trained models for "Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training" (NeurIPS 2025)