Kazakh GPT-2 8M (Soz)

A small GPT-2 language model trained from scratch on Kazakh text. Part of the Soz project for building native Kazakh language models.

Overview

| Property | Value |
|---|---|
| Parameters | ~8M |
| Architecture | GPT-2 |
| Vocab size | 50,257 |
| Hidden dim | 128 |
| Layers | 6 |
| Attention heads | 4 |
| FFN inner dim | 512 |
| Training data | kazakh-clean-pretrain (~80M tokens) |
| Epochs | 2 |
| Final train loss | ~6.4 |
| Tokenizer | kazakh-gpt2-50k |
| License | Apache 2.0 |

Design

This is the smallest model in the Soz GPT-2 family. At ~8M parameters trained on ~80M tokens for 2 epochs (~160M tokens seen, i.e. roughly 20 tokens per parameter), it sits close to the Chinchilla-optimal token-to-parameter ratio of ~20:1. It serves as a baseline for scaling experiments.
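As a sanity check, both the parameter count and the token budget can be reproduced from the table above. This sketch assumes a standard GPT-2 layout with tied input/output embeddings and a 1024-token context window (the context length is not listed in the card, so that value is an assumption):

```python
# Back-of-the-envelope parameter count for the config in the Overview table.
vocab, d, n_layer, d_ff, n_ctx = 50_257, 128, 6, 512, 1024

embeddings = vocab * d + n_ctx * d          # token + position embeddings
per_block = (
    2 * (2 * d)                             # two LayerNorms (weight + bias)
    + (d * 3 * d + 3 * d)                   # fused q/k/v projection
    + (d * d + d)                           # attention output projection
    + (d * d_ff + d_ff)                     # FFN up-projection
    + (d_ff * d + d)                        # FFN down-projection
)
total = embeddings + n_layer * per_block + 2 * d  # + final LayerNorm

print(f"{total:,} parameters")              # 7,753,856, i.e. ~7.75M
print(f"tokens/param: {2 * 80e6 / total:.1f}")  # 2 epochs x 80M tokens ~ 20.6
```

The result matches the ~7.75M figure reported for the safetensors checkpoint, and the ~20 tokens-per-parameter budget is what motivates the Chinchilla comparison above.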

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-gpt2-8m")

input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Training Details

  • Trained from scratch using the Soz training pipeline
  • Optimizer: AdamW
  • Precision: bfloat16
  • Hardware: NVIDIA A10 GPUs
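AdamW differs from plain Adam-with-L2 in that weight decay is applied directly to the parameter rather than folded into the gradient. A minimal pure-Python sketch of one AdamW step for a single scalar parameter (illustrative only; the hyperparameter values are common defaults, not the Soz pipeline's actual settings):

```python
import math

def adamw_step(param, grad, m, v, t, lr=3e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (decoupled decay)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad              # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad * grad       # second moment: EMA of grad^2
    m_hat = m / (1 - b1 ** t)                 # bias correction (t = step index)
    v_hat = v / (1 - b2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam update
    param -= lr * weight_decay * param        # decoupled weight decay
    return param, m, v

# One step on a toy parameter:
p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```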

Project

Part of the Soz (Kazakh Language Models) project, a research effort to build open-source language models for Kazakh.

Citation

@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-gpt2-8m}
}

License

Apache 2.0
