Spanish Regional Slang Whisper
A fine-tuned OpenAI Whisper model specialized for Spanish speech recognition with enhanced support for regional dialects and colloquial expressions (slang).
Model Description
This model extends Whisper's Spanish ASR capabilities to accurately transcribe informal speech patterns, regional vocabulary, and dialectal variations across four major Spanish-speaking regions:
| Region | Code | Key Features |
|---|---|---|
| Spain | es-ES | Vosotros conjugation, distinción (z/c vs s), slang: tío, mola, guay, flipar |
| Mexico | es-MX | Ustedes, diminutives (-ito/ita), slang: chido, güey, qué onda, neta |
| Argentina | es-AR | Voseo (vos sos), sheísmo (ll/y→sh), slang: che, boludo, pibe, laburo |
| Chile | es-CL | Mixed voseo, "po" particle, slang: bacán, cachai, fome, al tiro |
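The slang column of the table can be turned into a simple lookup for checking which regional markers show up in a transcript. The sketch below is illustrative only: `REGIONAL_MARKERS` and `find_regional_markers` are hypothetical names, not part of this model or its repository, and the word lists are just the examples from the table, not the full training vocabulary.

```python
import re
from typing import Dict, List

# Slang examples copied from the table above (not the full vocabulary).
REGIONAL_MARKERS: Dict[str, List[str]] = {
    "es-ES": ["tío", "mola", "guay", "flipar"],
    "es-MX": ["chido", "güey", "qué onda", "neta"],
    "es-AR": ["che", "boludo", "pibe", "laburo"],
    "es-CL": ["bacán", "cachai", "fome", "al tiro"],
}


def find_regional_markers(transcript: str) -> Dict[str, List[str]]:
    """Return the markers found in the transcript, grouped by region code."""
    text = transcript.lower()
    hits: Dict[str, List[str]] = {}
    for region, markers in REGIONAL_MARKERS.items():
        # Word boundaries keep "che" from matching inside "noches".
        found = [m for m in markers if re.search(rf"\b{re.escape(m)}\b", text)]
        if found:
            hits[region] = found
    return hits


print(find_regional_markers("Che, boludo, ¿qué onda?"))
```

Exact-match lookup misses inflected forms (for example "flipo" for "flipar"), so it is best treated as a quick sanity check on transcriptions and not as a dialect classifier.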
Intended Uses
- Language Learning Applications: Transcribe learner speech for pronunciation and fluency feedback
- Regional Dialect Recognition: Identify and preserve regional markers in transcriptions
- Conversational Spanish ASR: Handle informal, colloquial speech patterns
- Spanish Media Transcription: Transcribe content with regional slang and expressions
Training Data
The model was fine-tuned on a diverse dataset including:
- Synthetic Audio: TTS-generated regional slang phrases (ElevenLabs)
- Conversational Corpus: 5.5+ hours of authentic Spanish speech
- Movie Dialogues: 6,500+ examples from Spanish films with regional variety
- Regional Voice Samples: Mozilla Common Voice Spanish accents
Vocabulary Coverage: 156+ unique slang words and 67+ region-specific phrases across all four regions.
Training Procedure
Hyperparameters
- Base Model: openai/whisper-small
- Epochs: 3
- Batch Size: 8
- Learning Rate: 1e-5
- Optimizer: AdamW
- Compute Type: int8 (quantized)
- Hardware: NVIDIA A100 80GB / RTX 4090
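To get a feel for what these settings imply, the number of optimizer steps can be estimated from the dataset size. This is a rough sketch under stated assumptions: `FineTuneConfig` and `total_optimizer_steps` are hypothetical helpers, gradient accumulation and multi-GPU training are assumed to be off, and 6,500 is only the movie-dialogue example count from the Training Data section, used as an illustrative dataset size because the full training set size is not stated.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed above (hypothetical container)."""
    base_model: str = "openai/whisper-small"
    epochs: int = 3
    batch_size: int = 8
    learning_rate: float = 1e-5
    optimizer: str = "adamw"


def total_optimizer_steps(num_examples: int, cfg: FineTuneConfig) -> int:
    """Steps for a full run, assuming one optimizer step per batch."""
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = FineTuneConfig()
print(total_optimizer_steps(6500, cfg))  # illustrative dataset size only
```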
Training Command
python -m src.regional_training.train_all_regions train \
    --component stt \
    --region all \
    --epochs 3 \
    --batch-size 8
Evaluation Results
The table lists the target threshold for each metric:
| Metric | Target | Description |
|---|---|---|
| WER | <15% | Word Error Rate per region |
| CER | <10% | Character Error Rate |
| Latency | <2s | Processing time for typical utterances |
| Dialect Accuracy | 80%+ | Regional slang marker recognition |
| Marker Preservation | 90%+ | Regional slang words correctly transcribed |
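WER and CER are both edit-distance ratios, computed over words and characters respectively. The following is a minimal pure-Python sketch for checking a transcription against the targets in the table; it is not the evaluation code used for this model. The `marker_preservation` function reflects one assumed reading of that metric (the share of reference slang words that survive in the hypothesis), and it handles single-word markers only.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def marker_preservation(reference: str, hypothesis: str, markers: List[str]) -> float:
    """Assumed definition: fraction of reference markers kept in the hypothesis."""
    ref_tokens = set(reference.lower().split())
    hyp_tokens = set(hypothesis.lower().split())
    expected = [m for m in markers if m in ref_tokens]
    if not expected:
        return 1.0
    return sum(m in hyp_tokens for m in expected) / len(expected)


reference = "qué chido está el concierto güey"
hypothesis = "qué chido esta el concierto wey"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
print(f"Markers kept: {marker_preservation(reference, hypothesis, ['chido', 'güey']):.0%}")
```

In this example two of the six reference words differ, so the WER is 2/6, well above the 15% target, and only one of the two slang markers is preserved. In practice, text is usually normalized (casing, punctuation) before scoring, which this sketch leaves out.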
How to Use
With Transformers
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
# Load model and processor
processor = WhisperProcessor.from_pretrained("shraavb/spanish-slang-whisper")
model = WhisperForConditionalGeneration.from_pretrained("shraavb/spanish-slang-whisper")
# Load audio (16kHz mono)
import librosa
audio, sr = librosa.load("audio.wav", sr=16000)
# Process and transcribe
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
# Generate with forced Spanish language
forced_decoder_ids = processor.get_decoder_prompt_ids(language="es", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
With Pipeline
from transformers import pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="shraavb/spanish-slang-whisper",
    chunk_length_s=30,
    device="cuda",  # or "cpu"
)
result = transcriber("audio.wav", generate_kwargs={"language": "es"})
print(result["text"])
With faster-whisper (Optimized Inference)
from faster_whisper import WhisperModel
model = WhisperModel("shraavb/spanish-slang-whisper", compute_type="int8")
segments, info = model.transcribe("audio.wav", language="es")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
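For media transcription, the segment start and end times printed above can be written out as subtitles. The sketch below formats `(start, end, text)` tuples as SubRip (SRT) text; `srt_timestamp` and `to_srt` are hypothetical helpers and not part of faster-whisper.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


print(to_srt([(0.0, 1.25, " Che, ¿qué onda? "), (1.25, 3.5, " Todo bien, po. ")]))
```

With the faster-whisper loop above, the tuples could be produced as `(segment.start, segment.end, segment.text)` for each segment.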
Model Architecture
Whisper Encoder-Decoder Architecture
├── Input: Audio (16kHz, mono)
├── Preprocessing: Mel-spectrogram extraction
├── Encoder: 12 transformer layers (12 attention heads)
├── Decoder: 12 transformer layers (autoregressive)
└── Output: Spanish text transcription + timestamps
Model Size: 244M parameters (0.2B)
Tensor Type: F32 (Safetensors format)
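The parameter count and tensor type give a quick estimate of the weight footprint, and the 16 kHz input rate determines the size of the mel-spectrogram fed to the encoder. The arithmetic below is a back-of-the-envelope sketch: the hop length (160 samples) and 80 mel bins come from the Whisper reference implementation and are not stated in this card, and the estimate covers weights only (no activations, no quantization scales).

```python
NUM_PARAMS = 244_000_000  # from the model size above
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

SAMPLE_RATE = 16_000  # Hz, as required for input audio
HOP_LENGTH = 160      # samples per mel frame (Whisper reference value, assumed here)
N_MELS = 80           # mel bins (Whisper reference value, assumed here)


def weight_size_mb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


def mel_frames(duration_s: float) -> int:
    """Number of mel-spectrogram frames for a clip of the given duration."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_size_mb(NUM_PARAMS, dtype):.0f} MB")
print(f"30 s chunk -> {N_MELS} x {mel_frames(30)} mel features")
```

Under these assumptions the F32 weights take roughly 976 MB, and an int8 copy roughly a quarter of that, which is the motivation for the int8 option in the faster-whisper example.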
Limitations and Bias
- Regional Coverage: Optimized for Spain, Mexico, Argentina, and Chile. Other Spanish dialects (Caribbean, Andean, etc.) may have reduced accuracy.
- Slang Evolution: Colloquial language evolves rapidly; some newer slang may not be recognized.
- Audio Quality: Performance degrades with noisy audio or heavy background music.
- Code-Switching: Limited support for Spanish-English code-switching common in some regions.
- Formal Speech: Optimized for informal/conversational speech; may introduce colloquialisms in formal transcriptions.
Ethical Considerations
- The model reflects regional speech patterns and may transcribe profanity or crude slang when present in audio.
- Regional dialect detection should not be used for speaker profiling or discrimination.
- Training data includes synthetic and web-sourced content; original sources should be respected.
Citation
@misc{spanish-slang-whisper,
author = {Shraavasti Bhat},
title = {Spanish Regional Slang Whisper: Fine-tuned ASR for Spanish Dialects},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/shraavb/spanish-slang-whisper}}
}
Related Resources
- Training Dataset: shraavb/spanish-slang-stt-data
- Base Model: openai/whisper-small
- Project Repository: SpeakEasy
Acknowledgments
- OpenAI for the Whisper base model
- Mozilla Common Voice for Spanish accent data
- ElevenLabs for TTS synthesis capabilities
Evaluation results (self-reported)
- Word Error Rate on Spanish Regional Slang STT Data: 15.000
- Character Error Rate on Spanish Regional Slang STT Data: 10.000