# KZ-CALM SentencePiece Tokenizer (Kazakh, 4096 vocab)

A SentencePiece BPE tokenizer trained on Kazakh text for use in TTS (text-to-speech) models. Part of the KZ-CALM project — a Kazakh consistency-latent TTS system.

## Model Details

| Property | Value |
|---|---|
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocabulary size | 4,096 tokens |
| Character coverage | 100% |
| Training data | 232,350 Kazakh utterances from stukenov/kzcalm-tts-kk-v1 |
| Library | SentencePiece |
| License | Apache 2.0 |

## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence (BOS) |
| `</s>` | 2 | End of sequence (EOS) |
| `<unk>` | 3 | Unknown token |
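The `<pad>` token matters mostly at batch time: sequences of different lengths must be padded to a common length before they can be stacked into a tensor. A minimal sketch, assuming right-padding with the pad ID from the table above (the `pad_batch` helper and the example IDs are illustrative, not part of this package):

```python
# Sketch: right-padding a batch of encoded sequences with <pad> (ID 0).
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad variable-length ID sequences to the batch max length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = [[1, 142, 87, 2], [1, 305, 2]]
print(pad_batch(batch))
# [[1, 142, 87, 2], [1, 305, 2, 0]]
```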

## Files

| File | Description |
|---|---|
| `tokenizer.model` | SentencePiece binary model (load with `SentencePieceProcessor`) |
| `tokenizer.vocab` | Human-readable vocabulary file (token + log-probability per line) |
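The `.vocab` file is plain text with one tab-separated `token<TAB>log-probability` pair per line, so it can be inspected without SentencePiece at all. A sketch of reading that format (the sample lines are illustrative; the real file has 4,096 lines):

```python
# Sketch: parsing the tokenizer.vocab format (token <TAB> log-probability).
# The sample below is illustrative, not taken from the real file.
sample = "<pad>\t0\n<s>\t0\n▁сәлем\t-8.1234"

vocab = {}
for line in sample.splitlines():
    token, logprob = line.split("\t")
    vocab[token] = float(logprob)

print(len(vocab))      # 3
print(vocab["<pad>"])  # 0.0
```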

## Usage

### Basic Encoding/Decoding

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Encode text to token IDs
text = "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."
ids = sp.Encode(text)
print(ids)
# Example output: [142, 87, 12, 305, 8, ...]

# Decode back to text
decoded = sp.Decode(ids)
print(decoded)
# "Сәлем, әлем! Бұл қазақ тіліндегі мәтін."

# Encode to subword pieces
pieces = sp.EncodeAsPieces(text)
print(pieces)
# Example: ['▁Сәлем', ',', '▁әлем', '!', '▁Бұл', '▁қазақ', ...]
```
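The `▁` character in the pieces above is U+2581 (LOWER ONE EIGHTH BLOCK), which SentencePiece uses to mark word-initial pieces. Decoding essentially joins the pieces and turns that marker back into a space, which can be sketched in plain Python (using pieces from the example above):

```python
# Sketch: how subword pieces map back to surface text. "▁" (U+2581)
# marks a word-initial piece; joining pieces and replacing the marker
# with a space restores the original string.
pieces = ["▁Сәлем", ",", "▁әлем", "!"]
text = "".join(pieces).replace("▁", " ").lstrip()
print(text)  # "Сәлем, әлем!"
```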

### Download and Use with huggingface_hub

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

model_path = hf_hub_download("stukenov/kzcalm-sp-tokenizer-4k-kk-v1", "tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

print(sp.GetPieceSize())  # 4096
print(sp.Encode("Қазақстан"))
```

### Use with KZ-CALM Wrapper

```python
from kzcalm.tokenizer.sp_tokenizer import KazakhTokenizer

tok = KazakhTokenizer("tokenizer.model")
ids = tok.encode("Сәлем, әлем!", add_bos=True, add_eos=True)
# [1, 142, 87, 12, 305, 2]  (BOS + tokens + EOS)

text = tok.decode(ids)
# "Сәлем, әлем!"
```

## Training Details

- **Source corpus:** all transcription texts from stukenov/kzcalm-tts-kk-v1, a unified Kazakh TTS dataset combining KazakhTTS (177K samples) and KazEmoTTS (55K samples).
- **Preprocessing:** texts extracted via DuckDB column pruning from remote Parquet shards (no audio download needed); empty and whitespace-only lines excluded.
- **Training:** SentencePiece's `SentencePieceTrainer.Train()` with `input_sentence_size=5000000`, `shuffle_input_sentence=True`, multi-threaded.
- **Vocabulary size choice:** 4,096 balances granularity against sequence length for TTS. Kazakh is agglutinative with rich morphology, so a smaller vocabulary would produce very long sequences, while a larger one would be sparse given the ~232K training sentences.
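Putting the bullet points above together, a training invocation along these lines would reproduce the setup. This is a configuration sketch, not the project's actual script: the input path and thread count are placeholders, and the special-token IDs are set explicitly to match the table above (SentencePiece's defaults differ, e.g. `pad_id=-1`):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="kk_texts.txt",          # one utterance per line (placeholder path)
    model_prefix="tokenizer",      # writes tokenizer.model / tokenizer.vocab
    model_type="bpe",
    vocab_size=4096,
    character_coverage=1.0,        # 100% character coverage
    input_sentence_size=5_000_000,
    shuffle_input_sentence=True,
    num_threads=8,                 # multi-threaded training (assumed count)
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,
)
```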

## Intended Use

This tokenizer is designed for:

- **KZ-CALM TTS pipeline:** converting Kazakh text into token IDs that feed the Transformer backbone for speech synthesis.
- **Kazakh NLP experiments:** general-purpose Kazakh subword tokenization.
- **Text preprocessing** for any model consuming Kazakh text input.

## Limitations

- Trained only on TTS transcription data; it may not cover specialized vocabulary (medical, legal, technical terms).
- No language-specific normalization is applied (numbers, dates, and abbreviations appear in their raw text form); a separate text normalizer should be used upstream.
- The vocabulary is optimized for Kazakh and will perform poorly on other languages: Russian and English text will be heavily fragmented.
