Whisper Large v3 Turbo Quran (LoRA Fine-Tuned)

This is a specialized Automatic Speech Recognition (ASR) model for Quranic recitation. It is a fine-tuned version of openai/whisper-large-v3-turbo, optimized to recognize fully diacritized Quranic Arabic (with tashkeel) with high accuracy while maintaining exceptional inference speed.

Model Performance

  • Word Error Rate (WER): Achieved 12.69% on the ahishamm/QURANICWhisperDataset test set (20% of the dataset).
  • Accuracy: The model demonstrates high precision in capturing Quranic vocabulary and standard script nuances.
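For reference, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal pure-Python sketch of the standard definition (in practice, libraries such as jiwer or Hugging Face evaluate compute this; this is not the evaluation script used here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("a b c d", "a x c"))  # 1 substitution + 1 deletion over 4 words -> 0.5
```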

Architecture & Trade-offs

This model utilizes the Turbo architecture, which reduces the decoder depth from 32 layers (in the standard Large v3) to 4 layers.

  • Pros: Extremely fast inference speed (significantly lower latency than Medium or Large v3).
  • Cons: Due to the reduced decoder depth, the model has less capacity for long-range context retention compared to the full Large model. This makes it slightly more prone to hallucinations and repetition loops (e.g., repeating a word during silence), especially if not configured correctly during inference.

Recommendation: This model is well suited to fast verse transcription, live transcription, and other applications where low latency is critical. For offline batch processing where speed is not a priority, a full Whisper Large v3 (or even Whisper Medium) model will offer higher semantic stability, especially on long verses.

Training Details

The model was trained using LoRA (Low-Rank Adaptation) in a multi-stage curriculum learning process to ensure stability and precision.
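To illustrate the LoRA mechanics (not the actual training code; the rank `r = 16` and `alpha = 32` below are illustrative values, not this model's hyperparameters): the frozen weight W is left untouched, and only a low-rank update B @ A is trained, which is why adapter training is so much cheaper than full fine-tuning.

```python
import numpy as np

# LoRA: effective weight is W + (alpha / r) * B @ A, with W frozen.
d_out, d_in, r, alpha = 1280, 1280, 16, 32  # 1280 = Whisper Large hidden size
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # trained
B = np.zeros((d_out, r))                   # starts at zero, so W' == W at step 0

W_adapted = W + (alpha / r) * (B @ A)      # effective weight at inference

full_params = d_out * d_in
lora_params = r * (d_out + d_in)
print(f"trainable fraction: {lora_params / full_params:.3f}")  # 0.025
```

With rank 16 on a 1280x1280 projection, only about 2.5% of the layer's parameters are trainable, and because B is initialized to zero the adapted model starts out identical to the base model.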

Datasets

The training and evaluation process utilized a comprehensive mix of professional recitations:

  1. Training & Validation:
  2. Testing: a held-out 20% split of ahishamm/QURANICWhisperDataset (the test set reported under Model Performance).

Methodology

  • Curriculum Learning: The model was trained gradually across these datasets to refine its understanding of Tajweed and Quranic sentence structures.
  • Data Augmentation: To ensure the model remains robust against real-world conditions (non-studio microphones, background noise, varying volumes), diverse audio augmentations (gain adjustments, spectral masking and white noise) were applied during the training process.
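The augmentations named above can be sketched as follows. This is a minimal NumPy illustration of the three listed transforms (gain adjustment, additive white noise, SpecAugment-style spectral masking), not the actual training pipeline, and the parameter ranges are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_gain(wave, low_db=-6.0, high_db=6.0):
    """Scale the waveform by a random gain drawn in decibels."""
    gain_db = rng.uniform(low_db, high_db)
    return wave * (10.0 ** (gain_db / 20.0))

def add_white_noise(wave, snr_db=20.0):
    """Mix in white noise at a target signal-to-noise ratio."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wave + rng.standard_normal(wave.shape) * np.sqrt(noise_power)

def spec_mask(spec, max_freq_bins=10, max_time_frames=20):
    """Zero out a random frequency band and time span (SpecAugment-style)."""
    spec = spec.copy()
    f0 = int(rng.integers(0, spec.shape[0] - max_freq_bins))
    t0 = int(rng.integers(0, spec.shape[1] - max_time_frames))
    spec[f0:f0 + int(rng.integers(1, max_freq_bins + 1)), :] = 0.0
    spec[:, t0:t0 + int(rng.integers(1, max_time_frames + 1))] = 0.0
    return spec

wave = np.sin(np.linspace(0.0, 2 * np.pi * 440, 16000))  # 1 s dummy 440 Hz tone
augmented = add_white_noise(random_gain(wave))

spec = rng.random((80, 300))  # dummy 80-mel x 300-frame spectrogram
masked = spec_mask(spec)
```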

Usage

This model is fully compatible with the Hugging Face transformers pipeline.

CRITICAL NOTE ON INFERENCE: Due to the Turbo architecture's reduced decoder depth, you must carefully control the stride_length_s parameter. A long stride (e.g., 5s or more) can cause the model to lose context and enter infinite repetition loops. It is recommended to keep the stride length to around 2 or 3 seconds.

```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="MaddoggProduction/whisper-l-v3-turbo-quran-lora-dataset-mix",
    device=0  # 0 for GPU usage, -1 for CPU
)

# Transcribe audio
result = pipe(
    "path_to_audio.mp3",
    chunk_length_s=30,
    stride_length_s=2,  # 2s or 3s to prevent loops and hallucinations
    batch_size=8,
    return_timestamps=True,
    generate_kwargs={
        "task": "transcribe",
        "language": "arabic",
        "num_beams": 1  # 1 is sufficient, adjust as needed
    }
)

print(result["text"])
```
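With return_timestamps=True, the pipeline also returns a "chunks" list of segments with (start, end) tuples, which is useful for aligning the transcript with the recitation. A small sketch using mock data in that shape (the helper name and the sample content are illustrative):

```python
def format_chunks(result):
    """Render ASR pipeline chunks as '[start -> end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        end = start if end is None else end  # end can be None on a cut-off final chunk
        lines.append(f"[{start:6.2f} -> {end:6.2f}] {chunk['text'].strip()}")
    return "\n".join(lines)

# Mock result in the shape the transformers ASR pipeline returns
mock = {
    "text": "بسم الله الرحمن الرحيم",
    "chunks": [{"timestamp": (0.0, 4.5), "text": " بسم الله الرحمن الرحيم"}],
}
print(format_chunks(mock))
```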
Model size: 0.8B parameters (Safetensors, F16)