MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
MCAT (Multilingual Cost-effective Accelerated Speech-to-Text Translator) is a framework designed to scale many-to-many speech-to-text translation (S2TT) to 70 languages using Multimodal Large Language Models (MLLMs). It introduces a language scaling method using curriculum learning and an optimized speech adapter that reduces speech sequences to just 30 tokens, significantly improving inference efficiency.
- Paper: MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
- Repository: https://github.com/yxduir/m2m-70
Model Description
MCAT addresses the twin challenges of language coverage and inference efficiency in S2TT. It supports all 4,830 possible translation directions (70 × 69 ordered source-target pairs) across 70 languages. By combining a curriculum learning strategy with a compressed speech representation, it achieves state-of-the-art performance on benchmarks such as FLEURS while maintaining high inference speed.
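To illustrate the efficiency idea, the sketch below pools a variable-length speech encoder output down to a fixed 30 tokens, the sequence length MCAT's adapter produces. This is only a minimal illustration with simple average pooling; the actual MCAT adapter is a trained module, and the function name, frame rate, and feature dimension here are assumptions.

```python
import numpy as np

NUM_TOKENS = 30  # fixed output length reported by MCAT

def compress_speech_features(features: np.ndarray, num_tokens: int = NUM_TOKENS) -> np.ndarray:
    """Pool a (T, d) encoder output down to (num_tokens, d) by averaging
    contiguous chunks of frames. Illustrative only: MCAT's adapter is a
    learned module, not plain pooling."""
    t, d = features.shape
    # Split the T frames into num_tokens roughly equal contiguous chunks.
    boundaries = np.linspace(0, t, num_tokens + 1, dtype=int)
    return np.stack([
        features[boundaries[i]:boundaries[i + 1]].mean(axis=0)
        for i in range(num_tokens)
    ])

# A 10-second utterance at 50 frames/s gives 500 frames; the adapter output
# is always 30 tokens regardless of input length, so the LLM's attention
# cost over the speech prefix stays constant.
feats = np.random.randn(500, 1024)
out = compress_speech_features(feats)
```

Whatever the input duration, the LLM only ever attends over 30 speech tokens, which is the source of the inference speedup.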
Supported Languages (70)
Afrikaans (afr), Amharic (amh), Arabic (ara), Assamese (asm), Azerbaijani (azj), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Catalan (cat), Czech (ces), Chinese (cmn), Welsh (cym), Danish (dan), German (deu), Greek (ell), English (eng), Estonian (est), Persian (fas), Finnish (fin), French (fra), Galician (glg), Gujarati (guj), Hebrew (heb), Hindi (hin), Croatian (hrv), Hungarian (hun), Armenian (hye), Indonesian (ind), Icelandic (isl), Italian (ita), Javanese (jav), Japanese (jpn), Kannada (kan), Georgian (kat), Kazakh (kaz), Khmer (khm), Kyrgyz (kir), Korean (kor), Lao (lao), Latvian (lav), Lithuanian (lit), Malayalam (mal), Macedonian (mkd), Malay (msa), Burmese (mya), Dutch (nld), Norwegian (nob), Nepali (npi), Punjabi (pan), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Slovak (slk), Slovenian (slv), Spanish (spa), Serbian (srp), Swedish (swe), Swahili (swh), Tamil (tam), Telugu (tel), Tagalog (tgl), Thai (tha), Turkish (tur), Ukrainian (ukr), Urdu (urd), Uzbek (uzb), Vietnamese (vie), Cantonese (yue).
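As a quick sanity check on the coverage claim, the 70 language codes above yield exactly 4,830 ordered translation directions:

```python
from itertools import permutations

# The 70 ISO 639-3 codes listed above.
LANGS = (
    "afr amh ara asm azj bel ben bos bul cat ces cmn cym dan deu ell eng est "
    "fas fin fra glg guj heb hin hrv hun hye ind isl ita jav jpn kan kat kaz "
    "khm kir kor lao lav lit mal mkd msa mya nld nob npi pan pol por ron rus "
    "slk slv spa srp swe swh tam tel tgl tha tur ukr urd uzb vie yue"
).split()

# Many-to-many S2TT covers every ordered source -> target pair
# with distinct source and target languages.
directions = list(permutations(LANGS, 2))

print(len(LANGS), len(directions))  # 70 70*69 = 4830
```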
Installation and Usage
The model is implemented using the SLAM-LLM framework. For detailed installation instructions and inference scripts, please refer to the official GitHub repository.
Citation
@article{du2025mcat,
title={MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages},
author={Du, Yexing and Liu, Kaiyuan and Pan, Youcheng and Yang, Bo and Deng, Keqi and Chen, Xie and Xiang, Yang and Liu, Ming and Qin, Bin and Wang, YaoWei},
journal={arXiv preprint arXiv:2512.01512},
year={2025}
}
@inproceedings{du2025making,
title={Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning},
author={Du, Yexing and Pan, Youcheng and Ma, Ziyang and Yang, Bo and Yang, Yifan and Deng, Keqi and Chen, Xie and Xiang, Yang and Liu, Ming and Qin, Bing},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={12466--12478},
year={2025}
}