# KZ-CALM SentencePiece Tokenizer (Kazakh, 4096 vocab)
A SentencePiece BPE tokenizer trained on Kazakh text for use in TTS (text-to-speech) models. Part of the KZ-CALM project — a Kazakh consistency-latent TTS system.
## Model Details
| Property | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 4,096 tokens |
| Character coverage | 100% |
| Training data | 232,350 Kazakh utterances from stukenov/kzcalm-tts-kk-v1 |
| Library | SentencePiece |
| License | Apache 2.0 |
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence (BOS) |
| `</s>` | 2 | End of sequence (EOS) |
| `<unk>` | 3 | Unknown token |
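The special-token IDs above are typically applied outside SentencePiece itself, when batching sequences for the model. A minimal plain-Python sketch (the token-ID lists are made-up examples, not real encoder output) of framing sequences with BOS/EOS and right-padding a batch:

```python
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2  # IDs from the table above

def frame(ids):
    """Wrap a token-ID sequence with BOS (1) and EOS (2)."""
    return [BOS_ID] + ids + [EOS_ID]

def pad_batch(seqs):
    """Frame each sequence, then right-pad all to the longest with <pad> (0)."""
    framed = [frame(s) for s in seqs]
    max_len = max(len(s) for s in framed)
    return [s + [PAD_ID] * (max_len - len(s)) for s in framed]

# Hypothetical encoder output for two utterances
batch = pad_batch([[142, 87, 12], [305, 8]])
# -> [[1, 142, 87, 12, 2], [1, 305, 8, 2, 0]]
```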
## Files

| File | Description |
|---|---|
| `tokenizer.model` | SentencePiece binary model (load with `SentencePieceProcessor`) |
| `tokenizer.vocab` | Human-readable vocabulary file (token + log-probability per line) |
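The `tokenizer.vocab` layout (one tab-separated `piece`/`log-probability` pair per line) can be inspected without SentencePiece. A minimal sketch, parsing a made-up three-line excerpt; real entries and scores will differ:

```python
# Each line of tokenizer.vocab is "<piece>\t<log-probability>"
sample = "<pad>\t0\n▁сәлем\t-8.1\n,\t-3.2\n"

vocab = {}
for line in sample.splitlines():
    piece, logprob = line.split("\t")
    vocab[piece] = float(logprob)

print(len(vocab))        # 3
print(vocab["▁сәлем"])   # -8.1
```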
## Usage

### Basic Encoding/Decoding

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Encode text to token IDs
text = "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."
ids = sp.Encode(text)
print(ids)
# Example output: [142, 87, 12, 305, 8, ...]

# Decode back to text
decoded = sp.Decode(ids)
print(decoded)
# "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."

# Encode to subword pieces
pieces = sp.EncodeAsPieces(text)
print(pieces)
# Example: ['▁Сәлем', ',', '▁әлем', '!', '▁Бұл', '▁қазақ', ...]
```
### Download and Use with `huggingface_hub`

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download("stukenov/kzcalm-sp-tokenizer-4k-kk-v1", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

print(sp.GetPieceSize())  # 4096
print(sp.Encode("Қазақстан"))
```
### Use with the KZ-CALM Wrapper

```python
from kzcalm.tokenizer.sp_tokenizer import KazakhTokenizer

tok = KazakhTokenizer("tokenizer.model")

ids = tok.encode("Сәлем, әлем!", add_bos=True, add_eos=True)
# [1, 142, 87, 12, 305, 2] (BOS + tokens + EOS)

text = tok.decode(ids)
# "Сәлем, әлем!"
```
## Training Details

- **Source corpus**: All transcription texts from `stukenov/kzcalm-tts-kk-v1`, a unified Kazakh TTS dataset combining KazakhTTS (177K samples) and KazEmoTTS (55K samples).
- **Preprocessing**: Texts extracted via DuckDB column pruning from remote Parquet shards (no audio download needed). Empty and whitespace-only lines excluded.
- **Training**: SentencePiece `SentencePieceTrainer.Train()` with `input_sentence_size=5000000`, `shuffle_input_sentence=True`, multi-threaded.
- **Vocabulary size choice**: 4,096 was selected as a balance between granularity and sequence length for TTS. Kazakh is agglutinative with rich morphology, so a smaller vocabulary would produce very long sequences, while a larger one would be sparse given the ~232K training sentences.
## Intended Use
This tokenizer is designed for:
- KZ-CALM TTS pipeline: Converts Kazakh text into token IDs that feed into the Transformer backbone for speech synthesis.
- Kazakh NLP experiments: General-purpose Kazakh subword tokenization.
- Text preprocessing for any model consuming Kazakh text input.
## Limitations
- Trained only on TTS transcription data — may not cover specialized vocabulary (medical, legal, technical terms).
- No language-specific normalization is applied (numbers, dates, abbreviations appear in their raw text form). A separate text normalizer should be used upstream.
- The vocabulary is optimized for Kazakh; it will perform poorly on other languages (Russian, English text will be heavily fragmented).
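As a toy illustration of the upstream-normalization point above, here is a hypothetical digit-spelling pass. The `normalize_digits` name and the digit-by-digit strategy are inventions for this sketch; a production Kazakh normalizer would also need full number grammar, ordinals, dates, and abbreviation expansion.

```python
# Kazakh digit names, applied digit-by-digit (a deliberate simplification;
# a real normalizer would render "25" as a proper compound numeral).
KK_DIGITS = {
    "0": "нөл", "1": "бір", "2": "екі", "3": "үш", "4": "төрт",
    "5": "бес", "6": "алты", "7": "жеті", "8": "сегіз", "9": "тоғыз",
}

def normalize_digits(text: str) -> str:
    """Replace each all-digit token with its digits spelled out in Kazakh."""
    words = []
    for token in text.split():
        if token.isdigit():
            words.extend(KK_DIGITS[ch] for ch in token)
        else:
            words.append(token)
    return " ".join(words)

print(normalize_digits("25 адам"))  # "екі бес адам"
```

Running such a pass before tokenization keeps raw digits out of the token stream, which matters for TTS since the model must otherwise learn pronunciations for unspoken symbols.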
## Related Resources

- Training dataset: `stukenov/kzcalm-tts-kk-v1` (232K samples, 438.8 h of Kazakh speech)
- Project: KZ-CALM, a Kazakh Consistency-Latent TTS system