Spanish Regional Slang Whisper

A fine-tuned OpenAI Whisper model specialized for Spanish speech recognition, with enhanced support for regional dialects and colloquial (slang) expressions.

Model Description

This model extends Whisper's Spanish ASR capabilities to accurately transcribe informal speech patterns, regional vocabulary, and dialectal variations across four major Spanish-speaking regions:

| Region    | Code  | Key Features                                                                |
|-----------|-------|-----------------------------------------------------------------------------|
| Spain     | es-ES | Vosotros conjugation, distinción (z/c vs. s); slang: tío, mola, guay, flipar |
| Mexico    | es-MX | Ustedes, diminutives (-ito/-ita); slang: chido, güey, qué onda, neta         |
| Argentina | es-AR | Voseo (vos sos), sheísmo (ll/y → sh); slang: che, boludo, pibe, laburo       |
| Chile     | es-CL | Mixed voseo, "po" particle; slang: bacán, cachai, fome, al tiro              |
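The regional slang markers in this table are what the dialect-accuracy evaluation checks for in transcriptions. A minimal sketch of such a marker check is below; the `REGIONAL_MARKERS` table and `detect_region` helper are illustrative, not part of the released model:

```python
# Illustrative marker lookup built from the table above; the word lists
# are a small sample and the helper name is hypothetical.
REGIONAL_MARKERS = {
    "es-ES": {"tío", "mola", "guay", "flipar"},
    "es-MX": {"chido", "güey", "neta"},
    "es-AR": {"che", "boludo", "pibe", "laburo"},
    "es-CL": {"bacán", "cachai", "fome"},
}

def detect_region(transcription: str) -> dict:
    """Count how many known slang markers of each region appear in a transcription."""
    words = {w.strip(".,;!?¡¿") for w in transcription.lower().split()}
    return {region: len(words & markers) for region, markers in REGIONAL_MARKERS.items()}

print(detect_region("che boludo, qué buen laburo"))
```

A real marker check would also need to handle multi-word phrases (e.g. "qué onda", "al tiro"), which a per-word intersection misses.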

Intended Uses

  • Language Learning Applications: Transcribe learner speech for pronunciation and fluency feedback
  • Regional Dialect Recognition: Identify and preserve regional markers in transcriptions
  • Conversational Spanish ASR: Handle informal, colloquial speech patterns
  • Spanish Media Transcription: Transcribe content with regional slang and expressions

Training Data

The model was fine-tuned on a diverse dataset including:

  • Synthetic Audio: TTS-generated regional slang phrases (ElevenLabs)
  • Conversational Corpus: 5.5+ hours of authentic Spanish speech
  • Movie Dialogues: 6,500+ examples from Spanish films with regional variety
  • Regional Voice Samples: Mozilla Common Voice Spanish accents

Vocabulary Coverage: 156+ unique slang words and 67+ region-specific phrases across all four regions.

Training Procedure

Hyperparameters

  • Base Model: openai/whisper-small
  • Epochs: 3
  • Batch Size: 8
  • Learning Rate: 1e-5
  • Optimizer: AdamW
  • Compute Type: int8 (quantized)
  • Hardware: NVIDIA A100 80GB / RTX 4090
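Assuming a standard Hugging Face sequence-to-sequence fine-tuning setup, the hyperparameters above would map to a configuration along these lines (the dict name and key names are assumptions, not the project's actual config):

```python
# Illustrative configuration mirroring the hyperparameters listed above.
TRAINING_CONFIG = {
    "base_model": "openai/whisper-small",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 1e-5,
    "optimizer": "adamw",
    "compute_type": "int8",  # quantization used at inference time
}

def steps_per_epoch(num_examples: int,
                    batch_size: int = TRAINING_CONFIG["per_device_train_batch_size"]) -> int:
    """Optimizer steps per epoch at this batch size (ceiling division)."""
    return -(-num_examples // batch_size)

# e.g. the 6,500 movie-dialogue examples alone yield 813 steps per epoch
print(steps_per_epoch(6500))
```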

Training Command

python -m src.regional_training.train_all_regions train \
  --component stt \
  --region all \
  --epochs 3 \
  --batch-size 8

Evaluation Results

| Metric              | Target | Description                                |
|---------------------|--------|--------------------------------------------|
| WER                 | <15%   | Word Error Rate per region                 |
| CER                 | <10%   | Character Error Rate                       |
| Latency             | <2 s   | Processing time for typical utterances     |
| Dialect Accuracy    | ≥80%   | Regional slang marker recognition          |
| Marker Preservation | ≥90%   | Regional slang words correctly transcribed |
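WER and CER are edit-distance metrics: the minimum number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A self-contained sketch of the word-level computation (pure Python, no dependency on an evaluation library):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five -> WER 0.20
print(word_error_rate("qué onda güey todo bien", "qué onda wey todo bien"))
```

CER is the same computation over characters instead of words (`list(reference)` in place of `reference.split()`).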

How to Use

With Transformers

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("shraavb/spanish-slang-whisper")
model = WhisperForConditionalGeneration.from_pretrained("shraavb/spanish-slang-whisper")

# Load audio (Whisper expects 16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)

# Extract log-mel input features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Generate with the language forced to Spanish
forced_decoder_ids = processor.get_decoder_prompt_ids(language="es", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# Decode token IDs to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

With Pipeline

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="shraavb/spanish-slang-whisper",
    chunk_length_s=30,
    device="cuda"  # or "cpu"
)

result = transcriber("audio.wav", generate_kwargs={"language": "es"})
print(result["text"])

With faster-whisper (Optimized Inference)

from faster_whisper import WhisperModel

model = WhisperModel("shraavb/spanish-slang-whisper", compute_type="int8")
segments, info = model.transcribe("audio.wav", language="es")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
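The segment start/end times above can be turned into subtitle-style timecodes for media transcription. A small helper for that (hypothetical, not part of faster-whisper):

```python
def to_timecode(seconds: float) -> str:
    """Format a time in seconds as an SRT-style HH:MM:SS,mmm timecode."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(to_timecode(75.5))  # 00:01:15,500
```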

Model Architecture

Whisper Encoder-Decoder Architecture
├── Input: Audio (16kHz, mono)
├── Preprocessing: Mel-spectrogram extraction
├── Encoder: 12 transformer layers (12 attention heads, d_model = 768)
├── Decoder: 12 transformer layers (12 attention heads, autoregressive)
└── Output: Spanish text transcription + timestamps

Model Size: 244M parameters (0.2B)
Tensor Type: F32 (Safetensors format)

Limitations and Bias

  • Regional Coverage: Optimized for Spain, Mexico, Argentina, and Chile. Other Spanish dialects (Caribbean, Andean, etc.) may have reduced accuracy.
  • Slang Evolution: Colloquial language evolves rapidly; some newer slang may not be recognized.
  • Audio Quality: Performance degrades with noisy audio or heavy background music.
  • Code-Switching: Limited support for Spanish-English code-switching common in some regions.
  • Formal Speech: Optimized for informal/conversational speech; may introduce colloquialisms in formal transcriptions.

Ethical Considerations

  • The model reflects regional speech patterns and may transcribe profanity or crude slang when present in audio.
  • Regional dialect detection should not be used for speaker profiling or discrimination.
  • Training data includes synthetic and web-sourced content; original sources should be respected.

Citation

@misc{spanish-slang-whisper,
  author = {Shraavasti Bhat},
  title = {Spanish Regional Slang Whisper: Fine-tuned ASR for Spanish Dialects},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/shraavb/spanish-slang-whisper}}
}

Acknowledgments

  • OpenAI for the Whisper base model
  • Mozilla Common Voice for Spanish accent data
  • ElevenLabs for TTS synthesis capabilities