
Model Card for OmniGEC-Minimal-8B

OmniGEC-Minimal-8B extends the open-weight AYA-Expanse-8B with instruction tuning and supervised fine-tuning on OmniGEC, a silver-standard GEC corpus that includes MultiGEC-25, Wikipedia edits, and Reddit edits for 11 low- and mid-resource European languages. The result is a single model capable of paragraph-level correction across all covered languages, achieving state-of-the-art (SOTA) results for paragraph-based editing in the minimal and fluency tracks.

Per-language GLEU scores on the MultiGEC-25 test set (minimal track)

Language	OmniGEC-Minimal-8B (AYA-Expanse-8B)	OmniGEC-Minimal-12B (Gemma-3-12B)
Czech	65.13	66.39
English	78.08	77.30
Estonian	41.52	55.12
German	78.22	75.47
Greek	56.03	53.01
Italian	77.83	74.70
Latvian	71.71	81.54
Slovenian	54.22	58.31
Swedish	55.99	63.91
Ukrainian	76.41	75.17
Average	65.51	68.09
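
The Average row appears to be the unweighted mean of the per-language scores. As a quick sanity check, the sketch below recomputes it from the minimal-track table above; the dictionary and variable names are illustrative and not part of the repository.

```python
from statistics import mean

# Per-language GLEU on the MultiGEC-25 test set (minimal track),
# copied from the table above.
minimal_8b = {
    "Czech": 65.13, "English": 78.08, "Estonian": 41.52, "German": 78.22,
    "Greek": 56.03, "Italian": 77.83, "Latvian": 71.71, "Slovenian": 54.22,
    "Swedish": 55.99, "Ukrainian": 76.41,
}
minimal_12b = {
    "Czech": 66.39, "English": 77.30, "Estonian": 55.12, "German": 75.47,
    "Greek": 53.01, "Italian": 74.70, "Latvian": 81.54, "Slovenian": 58.31,
    "Swedish": 63.91, "Ukrainian": 75.17,
}

# Unweighted (macro) average over languages, rounded like the table.
avg_8b = round(mean(list(minimal_8b.values())), 2)
avg_12b = round(mean(list(minimal_12b.values())), 2)
print(avg_8b, avg_12b)
```

Every language counts equally in this average, regardless of how much test data it has.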

Per-language GLEU scores on the MultiGEC-25 test set (fluency track)

Language	OmniGEC-Fluency-8B (AYA-Expanse-8B)	OmniGEC-Fluency-12B (Gemma-3-12B)
Estonian	49.55	52.42
Icelandic	35.04	42.50
Ukrainian	75.82	71.88
Average	53.47	55.60

Training Data

Sub-corpus	Tokens	Source	Notes
WikiEdits-MultiGEC	≈ 1.2 M	Human Wikipedia “copy-edit” revisions (6 m window)	capped EN size to reduce bias
Reddit-MultiGEC	≈ 13 M	Posts from ≥ 400 language-specific subreddits	content-moderated, GPT-4o-mini corrections
UberText-GEC (TBD)	≈ 110 M	Ukrainian Telegram corpus	GPT-4o-mini corrections, UA-only
MultiGEC-25	≈ 0.5 M	Gold-standard shared-task data	train/dev/test = 80 / 10 / 10

Silver corrections were created with a three-step pipeline (prompt → generate 3 candidates → aggregate) using o1-preview and GPT-4o-mini.
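
The shape of that pipeline can be sketched as follows. This is only an illustration under stated assumptions: `build_prompt`, `generate_candidates`, `aggregate_majority`, and the `make_fake_llm` stub are hypothetical names, the prompt wording is invented, and majority voting is a stand-in because this card does not specify the aggregation rule.

```python
from collections import Counter
from typing import Callable, List


def build_prompt(text: str, language: str) -> str:
    # Step 1: wrap the source text in a correction instruction
    # (hypothetical wording, not the prompt used for OmniGEC).
    return f"Correct the grammar of the following {language} text:\n{text}"


def generate_candidates(llm: Callable[[str], str], prompt: str, n: int = 3) -> List[str]:
    # Step 2: query the model n times to obtain n candidate corrections.
    return [llm(prompt) for _ in range(n)]


def aggregate_majority(candidates: List[str]) -> str:
    # Step 3 (stand-in): keep the most frequent candidate;
    # ties resolve to the candidate seen first.
    return Counter(candidates).most_common(1)[0][0]


def make_fake_llm(outputs: List[str]) -> Callable[[str], str]:
    # Stub that replays canned outputs instead of calling a real API.
    it = iter(outputs)
    return lambda prompt: next(it)


fake_llm = make_fake_llm([
    "She goes to school every day.",
    "She goes to school every day.",
    "She go to school every day.",
])
prompt = build_prompt("She go to school every day .", "English")
silver = aggregate_majority(generate_candidates(fake_llm, prompt))
print(silver)
```

In a real run, the stub would be replaced by calls to the correction model, and the aggregation step might itself be performed by a model instead of a vote.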

Evaluation

Metric: GLEU via the official MultiGEC-25 CodaLab evaluator (minimal & fluency tracks).
Both OmniGEC-tuned models surpass the paragraph-based baseline LLaMA-3-8B by +9–10 GLEU on the minimal track and deliver the current best open scores for Estonian and Latvian.

OmniGEC-Tuned Checkpoints

AYA-Expanse-8B · Gemma-3-12B-IT

🔧 Quick start

pip install transformers
git clone https://github.com/r-kovalch/omnigec-models.git
cd omnigec-models

from transformers import AutoTokenizer, AutoModelForCausalLM
from src.instruction_templates import multigec_prompts
from src.utils.multigec import LANG_TO_CODE, LANG_CODE_TO_TOKEN

# For AYA-based models (OmniGEC-Minimal-8B, OmniGEC-Fluency-8B)
def format_prompt_aya(example):
    language_code = LANG_TO_CODE[example["language"]]
    language_token = LANG_CODE_TO_TOKEN[language_code]

    user_input = example["feature"]
    prompt_template = multigec_prompts[example["language"]].prompt_template
    instruction = prompt_template.format(original_text=user_input)

    text = f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{language_token}{instruction}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"

    return text

# For Gemma-based models (OmniGEC-Minimal-12B, OmniGEC-Fluency-12B)
def format_prompt_gemma(example):
    language_code = LANG_TO_CODE[example["language"]]
    # Gemma special tokens do not contain "|", so strip it from the language token
    language_token = LANG_CODE_TO_TOKEN[language_code].replace("|", "")

    user_input = example["feature"]
    prompt_template = multigec_prompts[example["language"]].prompt_template
    instruction = prompt_template.format(original_text=user_input)

    text = f"<start_of_turn>user\n{language_token}{instruction}<end_of_turn>\n<start_of_turn>model\n"

    return text

repo = "lang-uk/OmniGEC-Minimal-8B"   # or -Fluency-8B / -12B
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

# Both formatting functions expect a dict with "language" and "feature" keys.
# The "language" value must be a key of LANG_TO_CODE and multigec_prompts;
# "english" is an assumption here, check src/utils/multigec.py for the exact keys.
example = {"language": "english", "feature": "She go to school every day ."}
# Choose the formatting function according to the base model (Aya or Gemma)
prompt = format_prompt_aya(example)

out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=1600)
print(tok.decode(out[0], skip_special_tokens=True))
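
For decoder-only models, `generate` returns the prompt tokens followed by the continuation, so the decoded output above contains the instruction as well as the correction. A minimal sketch for keeping only the newly generated part is shown below; `decode_new_tokens` is a hypothetical helper, not part of the repository.

```python
def decode_new_tokens(tok, output_ids, prompt_len):
    """Decode only the tokens generated after the first `prompt_len` prompt tokens."""
    return tok.decode(output_ids[prompt_len:], skip_special_tokens=True)


# Usage with the objects from the quick start (commented out because it
# needs the model weights):
# inputs = tok(prompt, return_tensors="pt")
# out = model.generate(**inputs, max_new_tokens=1600)
# print(decode_new_tokens(tok, out[0], inputs["input_ids"].shape[1]))
```

Slicing by prompt length avoids string matching on the decoded text, which can fail when special tokens are dropped during decoding.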

Limitations

Reddit and UberText corrections are machine-generated; noise remains, especially in slang.
Generated outputs longer than 1,600 tokens are truncated unless you raise max_new_tokens.
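
One way to avoid truncated corrections is to derive the generation budget from the input length, on the assumption that a correction is roughly as long as its input. The sketch below does that; `pick_max_new_tokens` and the 20% headroom are hypothetical choices, not values taken from the notebooks.

```python
def pick_max_new_tokens(n_input_tokens: int, floor: int = 1600) -> int:
    """Return a generation budget of at least `floor` tokens that grows with input length."""
    headroom = n_input_tokens // 5  # 20% headroom, an arbitrary choice
    return max(floor, n_input_tokens + headroom)


# Usage with a Hugging Face tokenizer `tok` and the source text `text`
# (commented out because it needs the tokenizer files and model weights):
# n_input = len(tok(text)["input_ids"])
# out = model.generate(**tok(prompt, return_tensors="pt"),
#                      max_new_tokens=pick_max_new_tokens(n_input))
```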

Details

For details on use, please refer to our GitHub.
We strongly recommend following the inference code used in our notebooks for both Gemma and Aya, as there are additional parameters, such as temperature, top_k, and max_new_tokens, that are specific to each model.