Part of the **Kazakh SLM** collection: small language models, tokenizers, and datasets for Kazakh language modeling.
A GPT-2 language model trained from scratch on Kazakh text. Best quality/size ratio in the Soz GPT-2 model family.
| Property | Value |
|---|---|
| Parameters | ~30M |
| Architecture | GPT-2 |
| Vocab size | 50,257 |
| Hidden dim | 384 |
| Layers | 8 |
| Attention heads | 6 |
| FFN inner dim | 1,536 |
| Training data | kazakh-clean-pretrain (~80M tokens) |
| Epochs | 4 |
| Final train loss | ~5.49 |
| Final eval loss | ~5.815 |
| Tokenizer | kazakh-gpt2-50k |
| License | Apache 2.0 |
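The parameter count in the table can be roughly reproduced from the architecture numbers above. This is a sketch assuming a standard GPT-2 layout with tied input/output embeddings and a 1,024-token context window (the context length is not stated in the card, so the exact total may differ slightly):

```python
def gpt2_param_count(vocab=50_257, n_ctx=1024, d=384, n_layer=8, d_ff=1_536):
    """Parameter count for a standard GPT-2 stack with tied embeddings."""
    emb = vocab * d + n_ctx * d          # token + position embeddings
    attn = d * 3 * d + 3 * d             # fused QKV projection (weight + bias)
    attn += d * d + d                    # attention output projection
    mlp = d * d_ff + d_ff                # FFN up-projection
    mlp += d_ff * d + d                  # FFN down-projection
    ln = 2 * 2 * d                       # two LayerNorms (scale + shift each)
    return emb + n_layer * (attn + mlp + ln) + 2 * d  # + final LayerNorm

print(f"{gpt2_param_count():,} parameters")
```

With these assumptions the total lands in the low tens of millions, consistent with the "~30M" figure; the precise number depends on context length and embedding-tying details that the table does not pin down.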
With ~30M parameters trained on ~80M tokens over 4 epochs, it reaches the lowest eval loss relative to model size in the Soz GPT-2 family.
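The eval loss in the table is easier to interpret as perplexity. Assuming it is a mean cross-entropy in nats per token (the Hugging Face Trainer default), perplexity is simply its exponential:

```python
import math

eval_loss = 5.815                    # final eval loss from the table (nats/token)
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.0f}")  # ≈ 335
```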
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model uses the separately published kazakh-gpt2-50k tokenizer
tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-gpt2-30m")

# Kazakh prompt: "Kazakhstan — "
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Part of the Soz — Kazakh Language Models project, a research effort to build open-source language models for Kazakh.
```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-gpt2-30m}
}
```
Apache 2.0