---
base_model:
- answerdotai/ModernBERT-base
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---

# Random Baseline Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens randomly sampled from the SlimPajama dataset. It serves as a baseline for comparing data selection methods in the Meta-rater research.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Training Data**: Randomly sampled from the SlimPajama dataset
- **Domain Distribution**: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 52.79%
  - ARC-Easy: 51.05%
  - ARC-Challenge: 23.81%
  - SciQ: 83.50%
- **Commonsense Reasoning**: 43.94%
  - HellaSwag: 39.69%
  - SIQA: 40.28%
  - WinoGrande: 51.85%
- **Reading Comprehension**: 30.02%
  - RACE: 30.43%
  - OpenbookQA: 29.60%
- **Overall Average**: 43.78%

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
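For baseline comparisons and data quality research (see the sections below), per-token loss and perplexity are often more informative than sampled generations. The snippet below is a minimal sketch of such an evaluation, assuming the same checkpoint name as above and keeping inputs within the model's 1,024-token context window.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "opendatalab/meta-rater-1b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Language models trained on randomly sampled data provide a simple baseline."
# Truncate to the model's 1,024-token context window
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    # Passing labels=input_ids returns the mean cross-entropy over
    # the (shifted) next-token predictions
    outputs = model(**inputs, labels=inputs.input_ids)

perplexity = torch.exp(outputs.loss)
print(f"loss = {outputs.loss.item():.3f}, perplexity = {perplexity.item():.2f}")
```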
## Research Context

This model serves as a baseline in the Meta-rater research, demonstrating the performance achievable with random data selection.

Key findings:

- **Convergence Speed**: Models trained with Meta-rater data selection match the performance of this 30B-token baseline using only 15B tokens
- **Efficiency**: Meta-rater models outperform this baseline by 3.23% average accuracy when trained on the same 30B tokens
- **Token Efficiency**: This model requires 60B tokens to match the performance of Meta-rater models trained on 30B tokens

## Applications

This model can be used for:

- **Baseline comparisons** in data selection research
- **General language modeling** tasks
- **Research on training efficiency** and data quality
- **Educational purposes** for understanding transformer training

## Limitations

- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Performance lower than models trained with curated data selection

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.