Dataset: stukenov/sozkz-corpus-balanced-kk-gpt2-v1 (~480k rows)
Building foundational language models for Kazakh — models, tokenizers, and training corpora.
- Legacy tokenized corpus (v1, domain-balanced)
- Legacy tokenized corpus for LLaMA experiments (32K BPE vocabulary)
- LLaMA 30M — modern architecture (RoPE, SwiGLU, RMSNorm)
- LLaMA 50M — early from-scratch experiment
- LLaMA 150M — largest early model
- LLaMA 50M trained on the balanced corpus
- LLaMA 150M trained on the balanced corpus
- Pythia 14M DAPT (domain-adaptive pretraining) — first Kazakh LM experiment (proof of concept)
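The "modern architecture" models above use RoPE, SwiGLU, and RMSNorm, the components popularized by LLaMA. As a rough illustration of what these operations compute (not the repository's actual training code), here is a minimal pure-Python sketch; the shapes and function names are illustrative assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale the vector by the reciprocal of its root-mean-square,
    # then apply a learned per-dimension gain. Unlike LayerNorm there is no
    # mean subtraction and no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def swiglu(x, W, V):
    # SwiGLU gating for the feed-forward block: SiLU(x @ W) elementwise-
    # multiplied with (x @ V). W and V are given as lists of columns here.
    def silu(v):
        return v / (1.0 + math.exp(-v))
    a = [sum(xi * wi for xi, wi in zip(x, col)) for col in W]
    b = [sum(xi * vi for xi, vi in zip(x, col)) for col in V]
    return [silu(ai) * bi for ai, bi in zip(a, b)]

def rope(x, pos, theta=10000.0):
    # Rotary position embedding: rotate consecutive feature pairs by a
    # position-dependent angle, so relative positions fall out of dot products.
    d, out = len(x), []
    for i in range(0, d, 2):
        angle = pos * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

RMSNorm is cheaper than LayerNorm (one statistic instead of two), and RoPE replaces learned absolute position embeddings, which is part of why these small from-scratch models favor the LLaMA recipe.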