| library_name: transformers | |
| tags: | |
| - speech | |
| - automatic-speech-recognition | |
| - whisper | |
| - multilingual | |
| - fine-tuned | |
| - mlc-slm | |
| - speaker-diarization | |
| - meeting-transcription | |
| - DiCoW | |
| - BUT-FIT | |
| pipeline_tag: automatic-speech-recognition | |
| license: cc-by-4.0 | |
| datasets: | |
| - microsoft/NOTSOFAR | |
| - edinburghcstr/ami | |
| # DiCoW\_v3\_MLC — BUT-FIT Model for MLC-SLM Challenge | |
| This repository contains the **DiCoW\_v3\_MLC** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) for the [MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm). | |
| Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. | |
| This model is available under the terms of CC BY 4.0. It incorporates an MIT-licensed base model and CC BY 4.0 licensed training data. | |
| The model is described in detail in the following papers: | |
| * 📰 **Journal paper (main DiCoW paper):** [DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition](https://authors.elsevier.com/a/1lI9m_K8BYumVY) | |
| * 📰 **ICASSP paper (initial DiCoW experiments):** [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683) | |
| * 📰 **MLC-SLM Challenge submission paper:** [BUT System for the MLC-SLM Challenge](https://www.arxiv.org/abs/2506.13414) | |
| ## Model Summary | |
| The model is based on **Whisper large-v3-turbo**, initially trained on: | |
| * **NOTSOFAR-1** | |
| * **AMI** Meeting Corpus | |
| * **Libri2Mix** dataset | |
| It is then fine-tuned on the **MLC-SLM dataset** as part of the MLC-SLM Challenge. | |
| ## Model Details | |
| * **Developed by:** BUT Speech\@FIT, Brno University of Technology | |
| * **Model type:** Whisper large-v3-turbo + DiCoW composition | |
| * **Language(s):** Multilingual (primarily English, but supports multiple languages) | |
| * **License:** cc-by-4.0 | |
| * **Fine-tuned from:** openai/whisper-large-v3-turbo | |
| * **Challenge:** MLC-SLM (Multilingual Conversational Speech Language Model) | |
| ## Model Sources | |
| * **Training Code:** [TS-ASR-Whisper GitHub](https://github.com/BUTSpeechFIT/TS-ASR-Whisper) | |
| * **Inference Code & DiCoW framework:** [DiCoW GitHub](https://github.com/BUTSpeechFIT/DiCoW) | |
| ## Getting Started | |
| ```python | |
| from transformers import AutoModelForSpeechSeq2Seq | |
| MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC" | |
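| # trust_remote_code=True is needed because the DiCoW model class ships as custom code in this repository. | |
| dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True) | |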
| ``` | |
| For detailed inference and full pipelines, refer to: | |
| 👉 [DiCoW GitHub inference repo](https://github.com/BUTSpeechFIT/DiCoW) | |
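DiCoW conditions Whisper on speaker diarization outputs. As a rough illustration of what such conditioning input can look like, the sketch below turns diarization segments into per-frame labels for one target speaker, using the silence/target/non-target/overlap scheme described in the DiCoW papers. The function name `stno_labels`, the `(speaker, start_s, end_s)` tuple layout, and the rate of 50 frames per second are assumptions made for this example. The actual input format is defined by the custom code in the DiCoW repository and may differ (for instance, it may use soft probabilities instead of hard labels).

```python
from collections import Counter


def stno_labels(segments, target, duration_s, frame_rate=50):
    """Label each frame as S (silence), T (target only), N (non-target only) or O (overlap).

    segments: iterable of (speaker, start_s, end_s) diarization segments (hypothetical layout).
    frame_rate: frames per second; 50 is an assumption for this sketch.
    """
    n_frames = int(round(duration_s * frame_rate))
    target_active = [False] * n_frames
    other_active = [False] * n_frames

    for speaker, start_s, end_s in segments:
        lo = max(0, int(round(start_s * frame_rate)))
        hi = min(n_frames, int(round(end_s * frame_rate)))
        active = target_active if speaker == target else other_active
        for t in range(lo, hi):
            active[t] = True

    labels = []
    for tgt, oth in zip(target_active, other_active):
        if tgt and oth:
            labels.append("O")  # target overlapped by at least one other speaker
        elif tgt:
            labels.append("T")  # target speaks alone
        elif oth:
            labels.append("N")  # only other speakers are active
        else:
            labels.append("S")  # nobody speaks
    return labels


# Two speakers with a 0.5 s overlap in a 4 s recording.
segments = [("spk0", 0.0, 2.0), ("spk1", 1.5, 3.0)]
labels = stno_labels(segments, target="spk0", duration_s=4.0)
print(Counter(labels))
```

Running the same function once per diarized speaker yields one label sequence per speaker, which matches the target-speaker framing: the model transcribes one speaker at a time, guided by that speaker's activity.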
| ### tcpWER/CER (%) on the MLC-SLM development set | |
| | Language | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) | | |
| |----------------|---------------|------------|---------|-----------------------|-------------------|----------------| | |
| | American En. | 14.1 | 20.6 | 11.1 | 53.7 | 36.5 | 22.5 | | |
| | Australian En. | 11.7 | 19.4 | 7.4 | 52.6 | 23.6 | 13.0 | | |
| | British En. | 10.1 | 16.7 | 7.7 | 71.9 | 26.1 | 17.6 | | |
| | Filipino En. | 9.2 | 17.7 | 7.5 | 50.4 | 25.5 | 15.2 | | |
| | Indian En. | 14.0 | 14.3 | 13.3 | 70.7 | 14.9 | 14.0 | | |
| | French | 28.1 | 27.7 | 16.1 | 96.0 | 37.8 | 27.5 | | |
| | German | 20.7 | 21.2 | 23.9 | 86.7 | 30.1 | 27.3 | | |
| | Italian | 17.9 | 16.2 | 12.3 | 83.3 | 19.8 | 16.4 | | |
| | Japanese (\*) | 21.6 | 19.2 | 13.7 | 71.3 | 25.8 | 23.3 | | |
| | Korean (\*) | 13.8 | 12.8 | 8.5 | 59.6 | 24.5 | 22.8 | | |
| | Portuguese | 21.2 | 24.5 | 19.5 | 118.8 | 33.1 | 29.7 | | |
| | Russian | 17.7 | 17.6 | 11.6 | 69.2 | 22.5 | 16.7 | | |
| | Spanish | 12.3 | 11.6 | 8.7 | 75.6 | 18.2 | 16.3 | | |
| | Thai (\*) | 14.5 | 31.9 | 14.2 | 83.6 | 34.4 | 20.1 | | |
| | Vietnamese | 27.2 | 30.0 | 15.3 | 82.8 | 33.8 | 24.7 | | |
| | **Overall** | **16.8** | **22.0** | **12.9**| **76.1** | **28.4** | **20.8** | | |
| > *Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.* | |
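To make the word-level versus character-level distinction concrete, here is a simplified sketch of plain WER and CER based on Levenshtein distance. It is not the tcpWER/tcpCER used for the table: those metrics additionally apply time constraints and speaker assignment, which this example leaves out. The helper names and the choice to drop spaces before computing CER are assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent, normalized by the reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    # Tokens are whitespace-separated words.
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    # Tokens are characters; spaces are dropped (an assumption of this sketch).
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.1f}%")

# Text written without spaces collapses to a single "word", so WER is uninformative.
ref_ja, hyp_ja = "今日は晴れです", "今日は雨です"
print(f"WER: {wer(ref_ja, hyp_ja):.1f}%  CER: {cer(ref_ja, hyp_ja):.1f}%")
```

The second example hints at one common reason for reporting character-level rates: when a script is written without spaces between words, whitespace tokenization gives no usable word boundaries.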
| **Notes:** | |
| - GT = Ground-Truth Segmentation | |
| - Real diar = Real Diarization | |
| - FT = DiCoW fine-tuned on the MLC-SLM dataset | |
| - Baseline uses Whisper large-v3 with chunked inference + fine-tuned Pyannote diarization. | |
| - DiCoW uses fine-tuned DiariZen diarization. | |
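The gap between systems is easier to compare as a relative error reduction. The sketch below computes it from the **Overall** row of the table above; the numbers are copied from that row, while the helper `relative_reduction` and the dictionary layout are only illustrative.

```python
# Overall tcpWER/CER (%) copied from the "Overall" row of the results table.
overall = {
    "GT": {"Baseline": 16.8, "DiCoW": 22.0, "FT": 12.9},
    "Real diar": {"Baseline": 76.1, "DiCoW": 28.4, "FT": 20.8},
}


def relative_reduction(reference, system):
    """Relative error reduction in percent; negative values mean the system is worse."""
    return 100.0 * (reference - system) / reference


for condition, scores in overall.items():
    for system in ("DiCoW", "FT"):
        rel = relative_reduction(scores["Baseline"], scores[system])
        print(f"{condition:9s} {system:5s} {rel:+6.1f}% relative to Baseline")
```

With ground-truth segmentation, the DiCoW column is worse than the baseline overall and only the FT column improves on it. With real diarization, both DiCoW columns reduce the overall error substantially relative to the baseline.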
| ## Citation | |
| If you use this model, please cite: | |
| ```bibtex | |
| @article{POLOK2026101841, | |
| title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition}, | |
| journal = {Computer Speech \& Language}, | |
| volume = {95}, | |
| pages = {101841}, | |
| year = {2026}, | |
| issn = {0885-2308}, | |
| doi = {10.1016/j.csl.2025.101841}, | |
| url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X}, | |
| author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget}, | |
| keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation}, | |
| } | |
| @INPROCEEDINGS{10887683, | |
| author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš}, | |
| booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, | |
| title={Target Speaker ASR with Whisper}, | |
| year={2025}, | |
| volume={}, | |
| number={}, | |
| pages={1-5}, | |
| keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper}, | |
| doi={10.1109/ICASSP49660.2025.10887683} | |
| } | |
| @misc{polok2025mlcslmchallenge, | |
| title={BUT System for the MLC-SLM Challenge}, | |
| author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget}, | |
| year={2025}, | |
| eprint={2506.13414}, | |
| archivePrefix={arXiv}, | |
| primaryClass={eess.AS}, | |
| url={https://arxiv.org/abs/2506.13414}, | |
| } | |
| ``` | |
| ## Contact | |
| For questions or collaborations, feel free to email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz) | |
| **BUT Speech@FIT, Brno University of Technology** | |
| GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT) |