---
license: other
license_name: meralion-public-license-v3
license_link: https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf
extra_gated_fields:
  First Name: text
  Last Name: text
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI developer/Researcher
      - Other
  I consent to being contacted by the MERaLiON team for feedback or follow-up regarding my experience using the model: checkbox
extra_gated_description: >-
  By downloading this model, you acknowledge that you have read and agree to be
  bound by the Terms and Conditions set out in this document [MERaLiON Public
  License v3](https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf).
  The information you provide will be collected, stored, processed, and shared
  in accordance with the [A*STAR Privacy Policy](https://www.a-star.edu.sg/privacy-statement).
extra_gated_button_content: Submit
datasets:
  - MERaLiON/Multitask-National-Speech-Corpus-v1
language:
  - en
  - zh
  - ms
  - ta
  - id
  - th
  - vi
metrics:
  - wer
  - bleu
base_model:
  - openai/whisper-large-v3
  - google/gemma-2-9b-it
library_name: transformers
tags:
  - meralion
  - meralion-2
---
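🚀 [MERaLiON-2-10B](https://huggingface.co/MERaLiON/MERaLiON-2-10B) | 🚀 [MERaLiON-2-10B-ASR](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR) | 🚀 [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B)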
## Introduction

We are pleased to announce the release of **MERaLiON-2**, the latest addition to the MERaLiON family of speech-text large language models. Our flagship model, [**MERaLiON-2-10B**](https://huggingface.co/MERaLiON/MERaLiON-2-10B), demonstrates competitive performance across benchmark evaluations in tasks such as multilingual automatic speech recognition (ASR), speech translation (ST), audio scene understanding, emotion recognition, and general speech comprehension. These results are comparable to those achieved by other state-of-the-art open-source AudioLLMs, including Qwen2.5-Omni-7B and Phi-4-multimodal-instruct.

MERaLiON-2-10B is specifically designed to follow complex instructions with a nuanced understanding of **Singapore’s multilingual and multicultural context**. It integrates a localized Whisper-large-v3 speech encoder and a Gemma-2-9b text decoder.

The following graph presents task-specific evaluation scores, assessed using the **LLM-as-a-Judge** framework across multiple datasets. For the speech translation task, performance is measured using the BLEU metric, where higher scores indicate better translation quality.
In addition, we introduce an ASR-optimized variant, [**MERaLiON-2-10B-ASR**](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR), which delivers a **5–30%** performance improvement over OpenAI’s `whisper-large-v3` on speech recognition tasks. This enhancement spans Singapore’s 4 official languages—**English**, **Mandarin**, **Malay**, and **Tamil**—as well as 3 South-East Asian languages: **Indonesian**, **Thai**, and **Vietnamese**. The model also demonstrates robust handling of **code-switching scenarios** and local colloquialisms, reflecting its adaptability to Singapore’s diverse linguistic landscape.
The following visualization illustrates the **1 - Word Error Rate (WER)** metric across these seven languages, comparing MERaLiON-2-10B-ASR with other leading models. A higher value indicates better transcription accuracy.
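As a minimal sketch of how the plotted metric can be derived, the snippet below computes WER as the word-level edit distance divided by the number of reference words, then reports 1 - WER. The function names and the plain whitespace tokenization are illustrative assumptions; the text normalization used in the actual evaluation is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def one_minus_wer(reference: str, hypothesis: str) -> float:
    """Accuracy-style view of WER: higher is better."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
    print(f"1 - WER: {one_minus_wer(ref, hyp):.3f}")
```

Because insertions count as errors, WER can exceed 1 when a hypothesis is much longer than its reference, in which case 1 - WER becomes negative.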
We also provide [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B), which balances performance with reduced computational requirements, enabling broader accessibility and lightweight deployment.
- **Extended Audio Length**: Supports audio inputs of up to 300 seconds (5 minutes) for audio and speech question answering tasks. **For speech transcription (ASR) and speech translation (ST) tasks, inputs of up to 30 seconds are recommended for satisfactory performance**.
- **Expanded Language Coverage**: In addition to English, Chinese, and Singlish, V2 introduces support for Malay, Tamil, and other South-East Asian languages including Indonesian, Thai, and Vietnamese.
- **Improved Performance**: Achieves higher performance across a wide range of tasks. See the [Performance](#performance) section for detailed benchmarks.
- **Higher Quality Training Data**: Trained on 120,000 hours of curated speech and audio data, filtered for quality and diversity, with an emphasis on local and multilingual audio sources.
- **Three Model Variants**: Available in general-purpose ([MERaLiON-2-10B](https://huggingface.co/MERaLiON/MERaLiON-2-10B)), ASR-optimized ([MERaLiON-2-10B-ASR](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR)), and lightweight ([MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B)) configurations to balance latency, compute efficiency, and task performance across different deployment needs.
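The 30-second guidance for ASR and ST suggests splitting longer recordings before transcription. The sketch below shows one simple way to do that on a plain sequence of samples. The 16 kHz rate follows the audio specification in the Model Description; the fixed, non-overlapping windows and the helper name `split_into_chunks` are assumptions for illustration, not the authors' pipeline.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 16_000   # sample rate listed in the model's audio specification
ASR_CHUNK_SECONDS = 30    # suggested input length for ASR and ST tasks


def split_into_chunks(
    samples: Sequence[float],
    chunk_seconds: int = ASR_CHUNK_SECONDS,
    sample_rate: int = SAMPLE_RATE_HZ,
) -> List[Sequence[float]]:
    """Split a mono waveform into consecutive, non-overlapping chunks.

    The final chunk may be shorter than chunk_seconds.
    """
    if chunk_seconds <= 0 or sample_rate <= 0:
        raise ValueError("chunk_seconds and sample_rate must be positive")
    chunk_size = chunk_seconds * sample_rate
    return [samples[start:start + chunk_size]
            for start in range(0, len(samples), chunk_size)]


if __name__ == "__main__":
    # 75 seconds of silence stands in for a real recording.
    audio = [0.0] * (75 * SAMPLE_RATE_HZ)
    for index, chunk in enumerate(split_into_chunks(audio)):
        print(f"chunk {index}: {len(chunk) / SAMPLE_RATE_HZ:.0f} s")
```

Fixed windows can cut a word at a chunk boundary; splitting at pauses or using a small overlap are possible refinements, at the cost of extra post-processing to merge the transcripts.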
## Model Description:
MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**earning **i**n **O**ne **N**etwork.
MERaLiON-2 is a family of Speech-Text Large Language Models tailored for **Singapore’s multilingual and multicultural landscape**, as well as the wider **Southeast Asian region**.
The 10B model integrates a localized [Whisper-Large-V3](https://huggingface.co/openai/whisper-large-v3) speech encoder with the [Gemma2-9b-IT](https://huggingface.co/google/gemma-2-9b-it) text decoder.
The 3B model integrates a localized [Whisper-Large-V3](https://huggingface.co/openai/whisper-large-v3) speech encoder with the [Gemma2-2b-IT](https://huggingface.co/google/gemma-2-2b-it) text decoder.
MERaLiON-2-10B is finetuned on **120,000 hours of speech and audio data** across **6 diverse tasks**: Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), Audio Captioning (AC), Audio-Scene Question Answering (ASQA) and Paralinguistic Question Answering (PQA).
The model supports long-form audio inputs of up to 300 seconds (5 minutes) and is specifically adapted to handle the linguistic nuances, accents, and dialects commonly found across Singapore and neighboring countries.
- **Developed by:** I2R, A\*STAR, Singapore
- **Model type:** Multimodal LLM
- **Language(s):** Primarily English (Global and Singapore) and Chinese, with support for audio in regional languages including Malay, Tamil, Indonesian, Thai, and Vietnamese.
- **Audio:** **Mono** channel audio, **16000** Hz, up to **300** seconds.
- **License:** [MERaLiON Public License](https://huggingface.co/datasets/MERaLiON/MERaLiON_Public_Licence/blob/main/MERaLiON-Public-Licence-v3.pdf)
- **Demo:** [MERaLiON-AudioLLM Web Demo](https://meralion.org/demo/)
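The audio constraints listed above (mono, 16000 Hz, up to 300 seconds) can be checked before a file is sent to the model. The following sketch uses only Python's standard `wave` module and therefore assumes uncompressed WAV input; the function name `check_wav` and its list-of-problems return value are hypothetical choices for illustration.

```python
import wave
from typing import List

EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
MAX_DURATION_SECONDS = 300


def check_wav(path: str) -> List[str]:
    """Return human-readable problems; an empty list means the file fits the spec."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono audio, found {channels} channels")
    if sample_rate != EXPECTED_SAMPLE_RATE_HZ:
        problems.append(
            f"expected {EXPECTED_SAMPLE_RATE_HZ} Hz, found {sample_rate} Hz")
    if duration > MAX_DURATION_SECONDS:
        problems.append(
            f"expected at most {MAX_DURATION_SECONDS} s, found {duration:.1f} s")
    return problems
```

Files that fail the check would need to be downmixed, resampled, or trimmed with an audio tool of your choice before inference.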
**MERaLiON-2** is an upgraded version of [MERaLiON-AudioLLM](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION).
## Performance:
We benchmark the MERaLiON-2 series with the extended [AudioBench benchmark](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) against several recently released open-source multimodal models (SALMONN-7B, the Qwen2.5-Omni series, and Phi-4-Multimodal) as well as two cascade models.
**Better Automatic Speech Recognition (ASR) Accuracy**
MERaLiON-2-10B-ASR and MERaLiON-2-10B demonstrate leading performance in Singlish, Mandarin, Malay, Tamil, and other Southeast Asian languages, while maintaining competitive results in English compared to `Whisper-large-v3`. The following table shows the average transcription `Word Error Rate` (lower is better) by language for the MERaLiON family and other leading AudioLLMs. The `Private Dataset` row covers a collection of locally accented Singapore speech with code-switching.
Please visit [AudioBench benchmark](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) for dataset-level evaluation results.
| Language / Dataset | MERaLiON-2-10B-ASR | MERaLiON-2-10B | MERaLiON-2-3B | whisper_large_v3 | cascade-whisper_large_v3-llama_3_8b_instruct | cascade-whisper_large_v2-gemma2_9b_cpt-sea_lionv3_instruct | MERaLiON-AudioLLM-Whisper-SEA-LION | Qwen2.5-Omni-7B | SeaLLMs-Audio-7B | Qwen2.5-Omni-3B | SALMONN_7B | phi_4_multimodal_instruct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Thai | 0.096526 | 0.109365 | 0.107279 | 0.121073 | 0.120257 | 0.172105 | 0.919330 | 0.126497 | 0.117152 | 0.163150 | 1.191099 | 1.510068 |
| Tamil | 0.271279 | 0.327081 | 0.344081 | 0.441483 | 0.475225 | 0.492336 | 0.561315 | 1.024916 | 2.325402 | 1.315143 | 1.306694 | 1.876722 |
| Singlish | 0.129830 | 0.168813 | 0.180395 | 0.248945 | 0.251608 | 0.255717 | 0.143800 | 0.439071 | 0.795990 | 0.389393 | 0.441490 | 0.448863 |
| Malay | 0.194638 | 0.209074 | 0.279891 | 0.219692 | 0.311921 | 0.314378 | 0.289895 | 1.460664 | 0.765565 | 2.943750 | 1.085867 | 3.762933 |
| English | 0.078544 | 0.088259 | 0.122295 | 0.080841 | 0.081568 | 0.104830 | 0.110567 | 0.134216 | 0.197824 | 0.110353 | 0.191492 | 0.098225 |
| Indonesian | 0.121020 | 0.142813 | 0.131950 | 0.137102 | 0.135390 | 0.159476 | 0.298365 | 0.168659 | 0.220227 | 0.205216 | 1.653502 | 3.565510 |
| Mandarin | 0.103694 | 0.132025 | 0.145878 | 0.170980 | 0.196867 | 0.291733 | 0.291183 | 0.102419 | 0.309782 | 0.130429 | 0.939545 | 0.238879 |
| Vietnamese | 0.118693 | 0.134808 | 0.155110 | 0.148474 | 0.136075 | 0.164078 | 0.952040 | 0.205491 | 0.222001 | 0.186786 | 1.521174 | 1.805643 |
| Private Dataset | 0.106150 | 0.112360 | 0.147258 | 0.116630 | 0.118434 | 0.143812 | 0.130667 | 0.222770 | 0.496540 | 0.164556 | 0.273304 | 0.229450 |
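To put the table in relative terms, the sketch below computes the relative WER reduction of MERaLiON-2-10B-ASR over whisper_large_v3 for each row, using values copied from the table above. Relative reduction is one common way to express an ASR improvement; how the headline improvement range in the Introduction was computed is not stated, so this is an illustration rather than a reproduction of that figure. The helper name `relative_wer_reduction` is hypothetical.

```python
# WER values copied from the table above: (MERaLiON-2-10B-ASR, whisper_large_v3)
WER = {
    "Thai": (0.096526, 0.121073),
    "Tamil": (0.271279, 0.441483),
    "Singlish": (0.129830, 0.248945),
    "Malay": (0.194638, 0.219692),
    "English": (0.078544, 0.080841),
    "Indonesian": (0.121020, 0.137102),
    "Mandarin": (0.103694, 0.170980),
    "Vietnamese": (0.118693, 0.148474),
    "Private Dataset": (0.106150, 0.116630),
}


def relative_wer_reduction(model_wer: float, baseline_wer: float) -> float:
    """Fraction by which the model lowers WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - model_wer) / baseline_wer


if __name__ == "__main__":
    for name, (model_wer, baseline_wer) in WER.items():
        reduction = relative_wer_reduction(model_wer, baseline_wer)
        print(f"{name}: {reduction:.1%} lower WER than whisper_large_v3")
```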

| Task | MERaLiON-2-10B | MERaLiON-AudioLLM-Whisper-SEA-LION | MERaLiON-2-10B-ASR | MERaLiON-2-3B | SeaLLMs-Audio-7B | Qwen2-Audio-7B-Instruct | Qwen2.5-Omni-3B | phi_4_multimodal_instruct | cascade-whisper_large_v3-llama_3_8b_instruct | Qwen2.5-Omni-7B | cascade-whisper_large_v2-gemma2_9b_cpt-sea_lionv3_instruct | Qwen-Audio-Chat | SALMONN_7B | WavLLM_fairseq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Speech Instruction | 70.200000 | 70.800000 | 13.400000 | 19.100000 | 66.900000 | 48.700000 | 65.000000 | 36.200000 | 66.100000 | 58.300000 | 72.900000 | 10.200000 | 12.900000 | 20.400000 |
| Emotion Recognition | 63.736268 | 48.577313 | 53.693298 | 54.040797 | 52.007576 | 49.846540 | 33.037836 | 40.677800 | 50.937578 | 31.469397 | 48.214969 | 41.671551 | 33.584869 | 50.801545 |
| Audio Scene Question Answering | 51.140374 | 52.207756 | 49.511886 | 46.141353 | 50.193739 | 47.048025 | 48.123228 | 42.217143 | 21.876943 | 45.669153 | 18.043681 | 51.618622 | 51.816958 | 33.034083 |
| Gender Recognition | 95.109423 | 97.177396 | 97.220335 | 93.810266 | 75.449392 | 95.963266 | 47.867210 | 70.718047 | 57.039409 | 48.724711 | 19.421130 | 60.349349 | 84.365092 | 60.773275 |
| Spoken QA (Singlish) | 66.550000 | 58.900000 | 61.850000 | 59.700000 | 51.350000 | 46.700000 | 60.500000 | 61.950000 | 59.350000 | 58.400000 | 53.750000 | 42.300000 | 43.200000 | 51.200000 |
| Audio Captioning | 35.604270 | 36.976419 | 34.466710 | 33.243839 | 45.089372 | 37.278810 | 39.200328 | 30.832409 | 2.915778 | 31.896243 | 3.140568 | 39.988663 | 28.880570 | 6.200867 |
| Spoken Dialogue Summarisation | 53.100000 | 53.600000 | 55.800000 | 48.550000 | 45.450000 | 36.300000 | 46.750000 | 50.750000 | 45.850000 | 43.150000 | 51.000000 | 25.250000 | 14.400000 | 39.450000 |
| Spoken QA (English) | 79.735049 | 63.711481 | 73.975834 | 68.715179 | 70.920519 | 68.888565 | 67.818546 | 75.513152 | 78.526569 | 68.415131 | 67.814538 | 66.069047 | 60.649071 | 70.595242 |
| Music Understanding | 63.942713 | 51.347936 | 60.657119 | 55.602359 | 63.689975 | 71.609099 | 59.309183 | 55.265375 | 56.697557 | 47.598989 | 50.463353 | 59.056445 | 49.705139 | 44.313395 |
| Accent Recognition | 41.815396 | 43.799799 | 47.788864 | 60.054981 | 10.143836 | 10.901397 | 0.478694 | 3.097615 | 21.398482 | 0.587293 | 25.929693 | 17.550294 | 11.577381 | 14.294613 |
| Speech Translation | 27.391115 | 27.086366 | 28.540359 | 22.130258 | 21.143215 | 10.826666 | 21.776628 | 13.827110 | 13.536272 | 20.688241 | 21.437997 | 4.973184 | 13.486003 | 9.046791 |