Soz: Kazakh Language Models from Scratch
Building foundational language models for Kazakh: models, tokenizers, and training corpora.
A LLaMA-architecture language model trained from scratch on Kazakh text, featuring modern design choices: RoPE positional embeddings, SwiGLU activations, RMSNorm, and no bias terms.
| Property | Value |
|---|---|
| Parameters | ~30M |
| Architecture | LLaMA (RoPE, SwiGLU, RMSNorm, no bias) |
| Vocab size | 50,257 |
| Hidden dim | 384 |
| Layers | 8 |
| Attention heads | 6 |
| Intermediate dim | 1,024 |
| Tied embeddings | Yes |
| Training data | kazakh-clean-pretrain (~80M tokens) |
| Epochs | 8 |
| Learning rate | 6e-4 |
| Weight decay | 0.05 |
| Tokenizer | kazakh-gpt2-50k |
| License | Apache 2.0 |
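As a sanity check, the parameter count implied by the table can be worked out by hand. The sketch below assumes the standard LLaMA layout: tied input/output embeddings, no bias terms, no grouped-query attention (key/value heads equal the 6 query heads), and RoPE, which adds no learned parameters. The variable names are for illustration only.

```python
# Back-of-the-envelope parameter count from the table above.
vocab, d, layers, d_ff = 50_257, 384, 8, 1_024

embed = vocab * d        # token embeddings (tied with the LM head, so counted once)
attn = 4 * d * d         # Q, K, V, O projections per layer, no biases
mlp = 3 * d * d_ff       # SwiGLU: gate, up, and down projections
norms = 2 * d            # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + d  # + final RMSNorm
print(f"{total:,}")  # 33,460,992
```

Under these assumptions the total lands in the low-30M range, in line with the ~30M figure quoted above; note that roughly 19M of it sits in the tied embedding table.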
This model serves as a modern architecture comparison against the GPT-2 30M variant trained on the same data. The LLaMA architecture incorporates several advances: Rotary Position Embeddings (RoPE), SwiGLU feed-forward networks, RMSNorm instead of LayerNorm, and removal of bias terms.
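Two of the components named above are compact enough to sketch in a few lines of NumPy. This is an illustrative re-implementation, not the model's actual code: the function and weight names are invented for the example, and the shapes follow the table (hidden 384, intermediate 1,024).

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: gate and up projections, elementwise
    # SiLU-gated product, then a down projection; no bias terms.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square and a learned
    # gain; unlike LayerNorm there is no mean subtraction and no bias.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

# Shapes matching the table: hidden dim 384, intermediate dim 1,024
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 384))
h = rmsnorm(x, np.ones(384))
y = swiglu_ffn(h,
               rng.normal(size=(384, 1024)),
               rng.normal(size=(384, 1024)),
               rng.normal(size=(1024, 384)))
print(y.shape)  # (1, 384)
```

RoPE, the third piece, is applied inside attention rather than as a standalone layer: it rotates query/key pairs by position-dependent angles instead of adding a learned position embedding, which is why it contributes no parameters.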
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The model was trained with the kazakh-gpt2-50k tokenizer (see table above)
tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-llama-30m")

# Prompt: "Қазақстан — " ("Kazakhstan — ")
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Part of the Soz — Kazakh Language Models project, a research effort to build open-source language models for Kazakh.
```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-llama-30m}
}
```