
kazakh-gec-mt5-base-run13-finetune

Run 13: the latest and best mT5-base GEC model in this series, produced by the final fine-tuning stage.

Overview

Property	Value
Task	Kazakh Grammatical Error Correction
Architecture	mT5-base (seq2seq)
Base model	saken-tukenov/kazakh-gec-mt5-base-run12-kazsandra-new
Training data	kazakh-synthetic-gec-datasets
Language	Kazakh (kk)
License	CC-BY-SA-4.0

Among the mT5-base runs in this project, this is the best variant. It is the output of the final fine-tuning stage.

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and the fine-tuned seq2seq model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("saken-tukenov/kazakh-gec-mt5-base-run13-finetune")
model = AutoModelForSeq2SeqLM.from_pretrained("saken-tukenov/kazakh-gec-mt5-base-run13-finetune")

# Prepend the "gec: " task prefix to the sentence to be corrected
input_text = "gec: " + "Мен кеше мектепке бардым"
inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)
# Generate the corrected sentence and decode it back to plain text
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
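To correct several sentences in one call, the same steps can be wrapped in a small helper. The following is a minimal sketch, not part of the original card: the function names `add_prefix`, `correct_batch`, and `load_model` are hypothetical, and the 128-token limits are simply carried over from the single-sentence example.

```python
from typing import Iterable, List

MODEL_ID = "saken-tukenov/kazakh-gec-mt5-base-run13-finetune"
TASK_PREFIX = "gec: "  # task prefix used in the single-sentence example


def add_prefix(sentences: Iterable[str]) -> List[str]:
    """Prepend the task prefix to every input sentence."""
    return [TASK_PREFIX + s for s in sentences]


def correct_batch(sentences, tokenizer, model, max_length: int = 128) -> List[str]:
    """Run a batch of sentences through the model (hypothetical helper)."""
    inputs = tokenizer(
        add_prefix(sentences),
        return_tensors="pt",
        padding=True,        # pad to the longest sentence in the batch
        truncation=True,
        max_length=max_length,
    )
    outputs = model.generate(**inputs, max_new_tokens=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and model; the import is deferred so the helpers above
    can be used or tested without transformers installed."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```

Typical use would be `tokenizer, model = load_model()` followed by `correct_batch(["Мен кеше мектепке бардым"], tokenizer, model)`. Passing the tokenizer and model in as arguments keeps the loading cost out of the per-batch call.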

Training Details

Fine-tuned from saken-tukenov/kazakh-gec-mt5-base-run12-kazsandra-new
Training data: 1M+ synthetic GEC pairs (correct Kazakh sentences paired with copies containing introduced errors)
Task prefix: "gec: "
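The card does not say how the errors were introduced into the synthetic pairs. Purely as an illustration of the general idea, the sketch below builds one (source, target) pair by adding toy character-level noise to a correct sentence. The functions `corrupt` and `make_pair`, the three error operations, and the `error_rate` parameter are all assumptions and should not be read as the author's actual data pipeline.

```python
import random
from typing import Tuple

TASK_PREFIX = "gec: "  # task prefix listed in Training Details


def corrupt(sentence: str, rng: random.Random, error_rate: float = 0.1) -> str:
    """Introduce toy errors: delete, duplicate, or swap letters (illustrative only)."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < error_rate:
            op = rng.choice(("delete", "duplicate", "swap"))
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend((c, c))
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                # Swap the current character with the following one
                out.extend((chars[i + 1], c))
                i += 2
                continue
        out.append(c)  # keep the character unchanged
        i += 1
    return "".join(out)


def make_pair(correct: str, rng: random.Random, error_rate: float = 0.1) -> Tuple[str, str]:
    """Return (model input with task prefix, correct target sentence)."""
    return TASK_PREFIX + corrupt(correct, rng, error_rate), correct


rng = random.Random(0)  # fixed seed so the example is reproducible
source, target = make_pair("Мен кеше мектепке бардым", rng, error_rate=0.2)
print(source)
print(target)
```

A real GEC corpus would more likely use linguistically motivated errors (for example case endings or vowel harmony in Kazakh) rather than random character noise; the sketch only shows the shape of a training pair.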

Project

This model is part of the Kazakh GEC project, which builds grammatical error correction models for Kazakh.

Citation

@misc{tukenov2026gec,
  title={Kazakh Grammatical Error Correction with mT5},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/saken-tukenov/kazakh-gec-mt5-base-run13-finetune}
}

License

CC-BY-SA-4.0


Model size: 0.6B params (Safetensors, tensor type F32)
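As a rough guide to what these figures mean in practice, the memory needed for the weights alone can be estimated from the parameter count and the tensor type. The sketch below assumes exactly 600 million parameters (the page only gives the rounded "0.6B") and ignores activations and any generation overhead; `weight_bytes` is a hypothetical helper.

```python
# Bytes used per parameter for common tensor types
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2}


def weight_bytes(num_params: int, tensor_type: str = "F32") -> int:
    """Approximate size of the model weights in bytes."""
    return num_params * BYTES_PER_PARAM[tensor_type]


params = 600_000_000  # assumption: "0.6B params" taken as exactly 600M
size = weight_bytes(params, "F32")
print(f"{size / 2**30:.1f} GiB")  # prints "2.2 GiB"
```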

Collection including stukenov/sozkz-fix-mt5b-kk-gec-run13-v1: Kazakh GEC: Grammar Error Correction. Kazakh grammatical error correction: 13 progressive training runs on mT5-small and mT5-base (7 items).