Soz: Kazakh Language Models from Scratch
A LLaMA-architecture language model with ~50M parameters trained on a domain-balanced Kazakh corpus.
| Property | Value |
|---|---|
| Parameters | ~50M |
| Architecture | LLaMA (RoPE, SwiGLU, RMSNorm) |
| Training data | Domain-balanced Kazakh corpus |
| License | Apache 2.0 |
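For intuition on where a ~50M figure can come from in this architecture, here is a back-of-the-envelope parameter count for a LLaMA-style model with tied embeddings. All dimensions below (50k vocabulary, 512 hidden size, 8 layers, SwiGLU intermediate size 1365) are hypothetical placeholders for illustration, not this model's actual configuration.

```python
def llama_param_count(vocab, d_model, n_layers, d_ff):
    # Hypothetical dimensions -- not read from the released checkpoint.
    emb = vocab * d_model              # token embeddings (tied with output head)
    attn = 4 * d_model * d_model       # Q, K, V, O projections per layer
    mlp = 3 * d_model * d_ff           # SwiGLU: gate, up, and down projections
    norms = 2 * d_model                # two RMSNorm weight vectors per layer
    return emb + n_layers * (attn + mlp + norms) + d_model  # + final RMSNorm

total = llama_param_count(vocab=50_000, d_model=512, n_layers=8, d_ff=1365)
print(f"{total / 1e6:.1f}M parameters")
```

With these placeholder dimensions the count lands at roughly 50M, about half of it in the embedding table, which is typical for small models with large vocabularies.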
Usage with the Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the same checkpoint.
tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-llama-50m-balanced")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-llama-50m-balanced")

# Generate a continuation of a Kazakh prompt ("Kazakhstan is ...").
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
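The `generate` call samples with `temperature=0.8`. As a minimal, model-free sketch of what that parameter does: the logits are divided by the temperature before the softmax, so values below 1 sharpen the next-token distribution and values above 1 flatten it. Everything below is illustrative pure Python, not the Transformers internals.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax:
    # temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=0.8, rng=None):
    # Inverse-CDF sampling over the tempered distribution,
    # analogous to what do_sample=True does per generated token.
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temperature)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.8))  # sharper than the raw softmax
print(softmax_with_temperature(logits, 2.0))  # flatter, more diverse sampling
```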
This model is part of Soz, a research project building open-source foundational language models, tokenizers, and training corpora for Kazakh from scratch.
```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-llama-50m-balanced}
}
```
Apache 2.0