Soz: Kazakh Language Models from Scratch
Building foundational language models for Kazakh: models, tokenizers, and training corpora.
A LLaMA-architecture language model trained from scratch on Kazakh text, featuring modern design choices: RoPE positional embeddings, SwiGLU activations, RMSNorm, and no bias terms.
| Property | Value |
|---|---|
| Parameters | ~30M |
| Architecture | LLaMA (RoPE, SwiGLU, RMSNorm, no bias) |
| Vocab size | 50,257 |
| Hidden dim | 384 |
| Layers | 8 |
| Attention heads | 6 |
| Intermediate dim | 1,024 |
| Tied embeddings | Yes |
| Training data | kazakh-clean-pretrain (~80M tokens) |
| Epochs | 8 |
| Learning rate | 6e-4 |
| Weight decay | 0.05 |
| Tokenizer | kazakh-gpt2-50k |
| License | Apache 2.0 |
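As a sanity check, the parameter count implied by the table can be worked out by hand. The sketch below assumes the standard LLaMA layout: tied input/output embeddings, no bias terms, no grouped-query attention (key/value heads equal the 6 query heads), and RoPE, which adds no learned parameters. The variable names are for illustration only.

```python
# Back-of-the-envelope parameter count from the table above.
vocab, d, layers, d_ff = 50_257, 384, 8, 1_024

embed = vocab * d        # token embeddings (tied with the LM head, so counted once)
attn = 4 * d * d         # Q, K, V, O projections per layer, no biases
mlp = 3 * d * d_ff       # SwiGLU: gate, up, and down projections
norms = 2 * d            # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + d  # + final RMSNorm
print(f"{total:,}")  # 33,460,992
```

Under these assumptions the total lands in the low-30M range, in line with the ~30M figure quoted above; note that roughly 19M of it sits in the tied embedding table.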
This model serves as a modern architecture comparison against the GPT-2 30M variant trained on the same data. The LLaMA architecture incorporates several advances: Rotary Position Embeddings (RoPE), SwiGLU feed-forward networks, RMSNorm instead of LayerNorm, and removal of bias terms.
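Two of the components named above are compact enough to sketch in a few lines of NumPy. This is an illustrative re-implementation, not the model's actual code: the function and weight names are invented for the example, and the shapes follow the table (hidden 384, intermediate 1,024).

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: gate and up projections, elementwise
    # SiLU-gated product, then a down projection; no bias terms.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square and a learned
    # gain; unlike LayerNorm there is no mean subtraction and no bias.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

# Shapes matching the table: hidden dim 384, intermediate dim 1,024
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 384))
h = rmsnorm(x, np.ones(384))
y = swiglu_ffn(h,
               rng.normal(size=(384, 1024)),
               rng.normal(size=(384, 1024)),
               rng.normal(size=(1024, 384)))
print(y.shape)  # (1, 384)
```

RoPE, the third piece, is applied inside attention rather than as a standalone layer: it rotates query/key pairs by position-dependent angles instead of adding a learned position embedding, which is why it contributes no parameters.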
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model was trained with the kazakh-gpt2-50k tokenizer (see table above)
tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-llama-30m")

# Prompt: "Қазақстан — " ("Kazakhstan — ")
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Part of the Soz — Kazakh Language Models project, a research effort to build open-source language models for Kazakh.
```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-llama-30m}
}
```