# KZ-CALM SentencePiece Tokenizer (Kazakh, 4096 vocab)

A SentencePiece BPE tokenizer trained on Kazakh text for use in TTS (text-to-speech) models. Part of the KZ-CALM project — a Kazakh consistency-latent TTS system.

## Model Details

| Property | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 4,096 tokens |
| Character coverage | 100% |
| Training data | 232,350 Kazakh utterances from stukenov/kzcalm-tts-kk-v1 |
| Library | SentencePiece |
| License | Apache 2.0 |

## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence (BOS) |
| `</s>` | 2 | End of sequence (EOS) |
| `<unk>` | 3 | Unknown token |
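The `<pad>` token matters mostly at batch time: sequences of different lengths must be padded to a common length before they can be stacked into a tensor. A minimal sketch, assuming right-padding with the pad ID from the table above (the `pad_batch` helper and the example IDs are illustrative, not part of this package):

```python
# Sketch: right-padding a batch of encoded sequences with <pad> (ID 0).
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad variable-length ID sequences to the batch max length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = [[1, 142, 87, 2], [1, 305, 2]]
print(pad_batch(batch))
# [[1, 142, 87, 2], [1, 305, 2, 0]]
```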

## Files

| File | Description |
|---|---|
| `tokenizer.model` | SentencePiece binary model (load with `SentencePieceProcessor`) |
| `tokenizer.vocab` | Human-readable vocabulary file (token + log-probability per line) |
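The `.vocab` file is plain text with one tab-separated `token<TAB>log-probability` pair per line, so it can be inspected without SentencePiece at all. A sketch of reading that format (the sample lines are illustrative; the real file has 4,096 lines):

```python
# Sketch: parsing the tokenizer.vocab format (token <TAB> log-probability).
# The sample below is illustrative, not taken from the real file.
sample = "<pad>\t0\n<s>\t0\n▁сәлем\t-8.1234"

vocab = {}
for line in sample.splitlines():
    token, logprob = line.split("\t")
    vocab[token] = float(logprob)

print(len(vocab))      # 3
print(vocab["<pad>"])  # 0.0
```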

## Usage

### Basic Encoding/Decoding

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Encode text to token IDs
text = "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."
ids = sp.Encode(text)
print(ids)
# Example output: [142, 87, 12, 305, 8, ...]

# Decode back to text
decoded = sp.Decode(ids)
print(decoded)
# "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."

# Encode to subword pieces
pieces = sp.EncodeAsPieces(text)
print(pieces)
# Example: ['▁Сәлем', ',', '▁әлем', '!', '▁Бұл', '▁қазақ', ...]
```
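The `▁` character in the pieces above is U+2581 (LOWER ONE EIGHTH BLOCK), which SentencePiece uses to mark word-initial pieces. Decoding essentially joins the pieces and turns that marker back into a space, which can be sketched in plain Python (using pieces from the example above):

```python
# Sketch: how subword pieces map back to surface text. "▁" (U+2581)
# marks a word-initial piece; joining pieces and replacing the marker
# with a space restores the original string.
pieces = ["▁Сәлем", ",", "▁әлем", "!"]
text = "".join(pieces).replace("▁", " ").lstrip()
print(text)  # "Сәлем, әлем!"
```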

### Download and Use with huggingface_hub

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download("stukenov/kzcalm-sp-tokenizer-4k-kk-v1", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

print(sp.GetPieceSize())  # 4096
print(sp.Encode("Қазақстан"))
```

### Use with KZ-CALM Wrapper

```python
from kzcalm.tokenizer.sp_tokenizer import KazakhTokenizer

tok = KazakhTokenizer("tokenizer.model")
ids = tok.encode("Сәлем, әлем!", add_bos=True, add_eos=True)
# [1, 142, 87, 12, 305, 2]  (BOS + tokens + EOS)

text = tok.decode(ids)
# "Сәлем, әлем!"
```

## Training Details

- **Source corpus:** all transcription texts from stukenov/kzcalm-tts-kk-v1, a unified Kazakh TTS dataset combining KazakhTTS (177K samples) and KazEmoTTS (55K samples).
- **Preprocessing:** texts extracted via DuckDB column pruning from remote Parquet shards (no audio download needed); empty and whitespace-only lines excluded.
- **Training:** SentencePiece's `SentencePieceTrainer.Train()` with `input_sentence_size=5000000`, `shuffle_input_sentence=True`, multi-threaded.
- **Vocabulary size choice:** 4,096 balances granularity against sequence length for TTS. Kazakh is agglutinative with rich morphology, so a smaller vocabulary would produce very long sequences, while a larger one would be sparse given the ~232K training sentences.
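Putting the bullet points above together, a training invocation along these lines would reproduce the setup. This is a configuration sketch, not the project's actual script: the input path and thread count are placeholders, and the special-token IDs are set explicitly to match the table above (SentencePiece's defaults differ, e.g. `pad_id=-1`):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="kk_texts.txt",          # one utterance per line (placeholder path)
    model_prefix="tokenizer",      # writes tokenizer.model / tokenizer.vocab
    model_type="bpe",
    vocab_size=4096,
    character_coverage=1.0,        # 100% character coverage
    input_sentence_size=5_000_000,
    shuffle_input_sentence=True,
    num_threads=8,                 # multi-threaded training (assumed count)
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,
)
```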

## Intended Use

This tokenizer is designed for:

- **KZ-CALM TTS pipeline:** converting Kazakh text into token IDs that feed the Transformer backbone for speech synthesis.
- **Kazakh NLP experiments:** general-purpose Kazakh subword tokenization.
- **Text preprocessing** for any model consuming Kazakh text input.

## Limitations

- Trained only on TTS transcription data; it may not cover specialized vocabulary (medical, legal, technical terms).
- No language-specific normalization is applied (numbers, dates, and abbreviations appear in their raw text form); a separate text normalizer should be used upstream.
- The vocabulary is optimized for Kazakh and will perform poorly on other languages: Russian and English text will be heavily fragmented.
