# Kazakh LLaMA 30M (Soz)

A LLaMA-architecture language model trained from scratch on Kazakh text, featuring modern design choices: RoPE positional embeddings, SwiGLU activations, RMSNorm, and no bias terms.

## Overview

| Property | Value |
|---|---|
| Parameters | ~30M |
| Architecture | LLaMA (RoPE, SwiGLU, RMSNorm, no bias) |
| Vocab size | 50,257 |
| Hidden dim | 384 |
| Layers | 8 |
| Attention heads | 6 |
| Intermediate dim | 1,024 |
| Tied embeddings | Yes |
| Training data | kazakh-clean-pretrain (~80M tokens) |
| Epochs | 8 |
| Learning rate | 6e-4 |
| Weight decay | 0.05 |
| Tokenizer | kazakh-gpt2-50k |
| License | Apache 2.0 |
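As a back-of-the-envelope sanity check, the shapes in the table imply the following parameter count, assuming standard LLaMA weight shapes (this breakdown is an illustration, not taken from the training code):

```python
# Hypothetical parameter-count estimate from the table above,
# assuming standard LLaMA weight shapes with no bias terms.
vocab, d, n_layers, d_ff = 50_257, 384, 8, 1_024

embed = vocab * d              # token embeddings, tied with the LM head (counted once)
attn = 4 * d * d               # Wq, Wk, Wv, Wo projections
mlp = 3 * d * d_ff             # gate, up, down projections (SwiGLU)
norms = 2 * d                  # two RMSNorm weight vectors per block
per_layer = attn + mlp + norms

total = embed + n_layers * per_layer + d   # + final RMSNorm
print(f"{total:,}")  # 33,460,992 — roughly 33.5M including tied embeddings
```

The headline "30M" is consistent with this total once it is rounded; the embedding table alone accounts for about 19.3M of the parameters at this vocabulary size.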

## Design

This model serves as a modern-architecture comparison point for the GPT-2 30M variant trained on the same data. The LLaMA architecture incorporates several advances over GPT-2: Rotary Position Embeddings (RoPE), SwiGLU feed-forward networks, RMSNorm in place of LayerNorm, and the removal of bias terms.
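To make these choices concrete, here is a toy, pure-Python sketch of the three components (per-feature, scalar math only; the model's actual implementation uses batched tensor operations):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square only; unlike LayerNorm,
    # there is no mean subtraction and no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def silu(v):
    # SiLU (swish): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    # SwiGLU feed-forward gating: silu(W_gate x) * (W_up x), elementwise;
    # the linear projections themselves are omitted for brevity.
    return [silu(g) * u for g, u in zip(gate, up)]

def rope_rotate(x0, x1, pos, inv_freq):
    # RoPE: rotate each (even, odd) feature pair by an angle that grows
    # linearly with the token position, encoding relative offsets.
    angle = pos * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

Note that at position 0 the RoPE rotation is the identity, and that RMSNorm's lack of a learned bias is what allows the whole architecture to drop bias terms consistently.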

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("saken-tukenov/kazakh-llama-30m")

input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Details

- Trained from scratch using the Soz training pipeline
- Optimizer: AdamW with weight decay 0.05
- Precision: bfloat16
- Hardware: NVIDIA A10 GPUs

## Project

Part of Soz (Kazakh Language Models), a research effort to build open-source language models for Kazakh.

## Citation

```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-llama-30m}
}
```

## License

Apache 2.0
