---
base_model:
- answerdotai/ModernBERT-base
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---

# Random Baseline Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens randomly sampled from the SlimPajama dataset. It serves as a baseline for comparing data selection methods in the Meta-rater research.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Training Data**: Randomly sampled from the SlimPajama dataset
- **Domain Distribution**: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 52.79%
  - ARC-Easy: 51.05%
  - ARC-Challenge: 23.81%
  - SciQ: 83.50%
- **Commonsense Reasoning**: 43.94%
  - HellaSwag: 39.69%
  - SIQA: 40.28%
  - WinoGrande: 51.85%
- **Reading Comprehension**: 30.02%
  - RACE: 30.43%
  - OpenbookQA: 29.60%
- **Overall Average**: 43.78%

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
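For baseline comparisons and data quality research (see the sections below), per-token loss and perplexity are often more informative than sampled generations. The snippet below is a minimal sketch of such an evaluation, assuming the same checkpoint name as above and keeping inputs within the model's 1,024-token context window.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "opendatalab/meta-rater-1b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Language models trained on randomly sampled data provide a simple baseline."
# Truncate to the model's 1,024-token context window
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    # Passing labels=input_ids returns the mean cross-entropy over
    # the (shifted) next-token predictions
    outputs = model(**inputs, labels=inputs.input_ids)

perplexity = torch.exp(outputs.loss)
print(f"loss = {outputs.loss.item():.3f}, perplexity = {perplexity.item():.2f}")
```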
## Research Context

This model serves as a baseline in the Meta-rater research, demonstrating the performance achievable with random data selection.

Key findings:

- **Convergence Speed**: Models trained with Meta-rater data selection match the performance of this 30B-token baseline using only 15B tokens
- **Efficiency**: Meta-rater models outperform this baseline by 3.23% average accuracy when trained on the same 30B tokens
- **Token Efficiency**: This model requires 60B tokens to match the performance of Meta-rater models trained on 30B tokens

## Applications

This model can be used for:

- **Baseline comparisons** in data selection research
- **General language modeling** tasks
- **Research on training efficiency** and data quality
- **Educational purposes** for understanding transformer training

## Limitations

- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Performance lower than models trained with curated data selection

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.