Part of the **Kazakh SLM** collection: small language models, tokenizers, and datasets for Kazakh language modeling.
A GPT-2 language model trained from scratch on Kazakh text. Best quality/size ratio in the Soz GPT-2 model family.
| Property | Value |
|---|---|
| Parameters | ~30M |
| Architecture | GPT-2 |
| Vocab size | 50,257 |
| Hidden dim | 384 |
| Layers | 8 |
| Attention heads | 6 |
| FFN inner dim | 1,536 |
| Training data | kazakh-clean-pretrain (~80M tokens) |
| Epochs | 4 |
| Final train loss | ~5.49 |
| Final eval loss | ~5.815 |
| Tokenizer | kazakh-gpt2-50k |
| License | Apache 2.0 |
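The parameter count in the table can be roughly reproduced from the architecture numbers above. This is a sketch assuming a standard GPT-2 layout with tied input/output embeddings and a 1,024-token context window (the context length is not stated in the card, so the exact total may differ slightly):

```python
def gpt2_param_count(vocab=50_257, n_ctx=1024, d=384, n_layer=8, d_ff=1_536):
    """Parameter count for a standard GPT-2 stack with tied embeddings."""
    emb = vocab * d + n_ctx * d          # token + position embeddings
    attn = d * 3 * d + 3 * d             # fused QKV projection (weight + bias)
    attn += d * d + d                    # attention output projection
    mlp = d * d_ff + d_ff                # FFN up-projection
    mlp += d_ff * d + d                  # FFN down-projection
    ln = 2 * 2 * d                       # two LayerNorms (scale + shift each)
    return emb + n_layer * (attn + mlp + ln) + 2 * d  # + final LayerNorm

print(f"{gpt2_param_count():,} parameters")
```

With these assumptions the total lands in the low tens of millions, consistent with the "~30M" figure; the precise number depends on context length and embedding-tying details that the table does not pin down.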
With ~30M parameters trained on ~80M tokens over 4 epochs, it reaches the lowest eval loss relative to model size in the Soz GPT-2 family.
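The eval loss in the table is easier to interpret as perplexity. Assuming it is a mean cross-entropy in nats per token (the Hugging Face Trainer default), perplexity is simply its exponential:

```python
import math

eval_loss = 5.815                    # final eval loss from the table (nats/token)
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.0f}")  # ≈ 335
```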
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model uses the separately published kazakh-gpt2-50k tokenizer
tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-gpt2-30m")

# Kazakh prompt: "Kazakhstan — "
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Part of the Soz — Kazakh Language Models project, a research effort to build open-source language models for Kazakh.
```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-gpt2-30m}
}
```
Apache 2.0