A small GPT-2 language model trained from scratch on Kazakh text. Part of the Soz project for building native Kazakh language models.
| Property | Value |
|---|---|
| Parameters | ~8M |
| Architecture | GPT-2 |
| Vocab size | 50,257 |
| Hidden dim | 128 |
| Layers | 6 |
| Attention heads | 4 |
| FFN inner dim | 512 |
| Training data | kazakh-clean-pretrain (~80M tokens) |
| Epochs | 2 |
| Final train loss | ~6.4 |
| Tokenizer | kazakh-gpt2-50k |
| License | Apache 2.0 |
This is the smallest model in the Soz GPT-2 family. At ~8M parameters trained on ~80M tokens, its tokens-to-parameters ratio is ~10:1, roughly half the ~20:1 ratio the Chinchilla scaling laws suggest is compute-optimal. It serves as a baseline for scaling experiments.
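The ~8M figure can be sanity-checked from the table's hyperparameters. A back-of-the-envelope count, assuming the standard GPT-2 weight layout with a 1024-token context and tied input/output embeddings (neither of which is stated above):

```python
# Sanity-check the ~8M parameter figure using the hyperparameters
# from the table: vocab 50,257, hidden dim 128, 6 layers, FFN dim 512.
# Assumes standard GPT-2 layout: learned positional embeddings over a
# 1024-token context, and tied input/output embeddings (no separate LM head).

VOCAB, CTX, D_MODEL, N_LAYER, D_FFN = 50_257, 1_024, 128, 6, 512

def gpt2_param_count(vocab, ctx, d, n_layer, d_ffn):
    """Count learnable parameters in a GPT-2-style decoder."""
    emb = vocab * d + ctx * d                      # token + position embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)       # fused QKV proj + output proj
    mlp = (d * d_ffn + d_ffn) + (d_ffn * d + d)    # up-projection + down-projection
    lns = 2 * (2 * d)                              # two LayerNorms per block
    block = attn + mlp + lns
    final_ln = 2 * d                               # final LayerNorm
    return emb + n_layer * block + final_ln

total = gpt2_param_count(VOCAB, CTX, D_MODEL, N_LAYER, D_FFN)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # ~7.8M
```

The embedding matrix alone accounts for ~6.4M of the ~7.8M parameters, which is why the 50k vocabulary dominates the budget at this scale.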
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer lives in its own repo (see "Tokenizer" in the table above).
tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-gpt2-8m")

# Sample a short continuation of a Kazakh prompt ("Kazakhstan — ").
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Part of Soz (Kazakh Language Models), a research effort to build open-source language models for Kazakh.
```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-gpt2-8m}
}
```
Apache 2.0