Spanish Regional Slang Whisper

A fine-tuned OpenAI Whisper model specialized for Spanish speech recognition, with enhanced support for regional dialects and colloquial (slang) expressions.

Model Description

This model extends Whisper's Spanish ASR capabilities to accurately transcribe informal speech patterns, regional vocabulary, and dialectal variations across four major Spanish-speaking regions:

| Region    | Code  | Key Features                                                                |
|-----------|-------|-----------------------------------------------------------------------------|
| Spain     | es-ES | Vosotros conjugation, distinción (z/c vs. s); slang: tío, mola, guay, flipar |
| Mexico    | es-MX | Ustedes, diminutives (-ito/-ita); slang: chido, güey, qué onda, neta         |
| Argentina | es-AR | Voseo (vos sos), sheísmo (ll/y → sh); slang: che, boludo, pibe, laburo       |
| Chile     | es-CL | Mixed voseo, "po" particle; slang: bacán, cachai, fome, al tiro              |
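The regional slang markers in this table are what the dialect-accuracy evaluation checks for in transcriptions. A minimal sketch of such a marker check is below; the `REGIONAL_MARKERS` table and `detect_region` helper are illustrative, not part of the released model:

```python
# Illustrative marker lookup built from the table above; the word lists
# are a small sample and the helper name is hypothetical.
REGIONAL_MARKERS = {
    "es-ES": {"tío", "mola", "guay", "flipar"},
    "es-MX": {"chido", "güey", "neta"},
    "es-AR": {"che", "boludo", "pibe", "laburo"},
    "es-CL": {"bacán", "cachai", "fome"},
}

def detect_region(transcription: str) -> dict:
    """Count how many known slang markers of each region appear in a transcription."""
    words = {w.strip(".,;!?¡¿") for w in transcription.lower().split()}
    return {region: len(words & markers) for region, markers in REGIONAL_MARKERS.items()}

print(detect_region("che boludo, qué buen laburo"))
```

A real marker check would also need to handle multi-word phrases (e.g. "qué onda", "al tiro"), which a per-word intersection misses.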

Intended Uses

  • Language Learning Applications: Transcribe learner speech for pronunciation and fluency feedback
  • Regional Dialect Recognition: Identify and preserve regional markers in transcriptions
  • Conversational Spanish ASR: Handle informal, colloquial speech patterns
  • Spanish Media Transcription: Transcribe content with regional slang and expressions

Training Data

The model was fine-tuned on a diverse dataset including:

  • Synthetic Audio: TTS-generated regional slang phrases (ElevenLabs)
  • Conversational Corpus: 5.5+ hours of authentic Spanish speech
  • Movie Dialogues: 6,500+ examples from Spanish films with regional variety
  • Regional Voice Samples: Mozilla Common Voice Spanish accents

Vocabulary Coverage: 156+ unique slang words and 67+ region-specific phrases across all four regions.

Training Procedure

Hyperparameters

  • Base Model: openai/whisper-small
  • Epochs: 3
  • Batch Size: 8
  • Learning Rate: 1e-5
  • Optimizer: AdamW
  • Compute Type: int8 (quantized)
  • Hardware: NVIDIA A100 80GB / RTX 4090
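Assuming a standard Hugging Face sequence-to-sequence fine-tuning setup, the hyperparameters above would map to a configuration along these lines (the dict name and key names are assumptions, not the project's actual config):

```python
# Illustrative configuration mirroring the hyperparameters listed above.
TRAINING_CONFIG = {
    "base_model": "openai/whisper-small",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 1e-5,
    "optimizer": "adamw",
    "compute_type": "int8",  # quantization used at inference time
}

def steps_per_epoch(num_examples: int,
                    batch_size: int = TRAINING_CONFIG["per_device_train_batch_size"]) -> int:
    """Optimizer steps per epoch at this batch size (ceiling division)."""
    return -(-num_examples // batch_size)

# e.g. the 6,500 movie-dialogue examples alone yield 813 steps per epoch
print(steps_per_epoch(6500))
```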

Training Command

python -m src.regional_training.train_all_regions train \
  --component stt \
  --region all \
  --epochs 3 \
  --batch-size 8

Evaluation Results

| Metric              | Target | Description                                |
|---------------------|--------|--------------------------------------------|
| WER                 | <15%   | Word Error Rate per region                 |
| CER                 | <10%   | Character Error Rate                       |
| Latency             | <2 s   | Processing time for typical utterances     |
| Dialect Accuracy    | ≥80%   | Regional slang marker recognition          |
| Marker Preservation | ≥90%   | Regional slang words correctly transcribed |
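WER and CER are edit-distance metrics: the minimum number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A self-contained sketch of the word-level computation (pure Python, no dependency on an evaluation library):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five -> WER 0.20
print(word_error_rate("qué onda güey todo bien", "qué onda wey todo bien"))
```

CER is the same computation over characters instead of words (`list(reference)` in place of `reference.split()`).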

How to Use

With Transformers

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
processor = WhisperProcessor.from_pretrained("shraavb/spanish-slang-whisper")
model = WhisperForConditionalGeneration.from_pretrained("shraavb/spanish-slang-whisper")

# Load audio (Whisper expects 16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)

# Extract log-mel input features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Generate with the language forced to Spanish
forced_decoder_ids = processor.get_decoder_prompt_ids(language="es", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# Decode token IDs to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

With Pipeline

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="shraavb/spanish-slang-whisper",
    chunk_length_s=30,
    device="cuda"  # or "cpu"
)

result = transcriber("audio.wav", generate_kwargs={"language": "es"})
print(result["text"])

With faster-whisper (Optimized Inference)

from faster_whisper import WhisperModel

model = WhisperModel("shraavb/spanish-slang-whisper", compute_type="int8")
segments, info = model.transcribe("audio.wav", language="es")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
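The segment start/end times above can be turned into subtitle-style timecodes for media transcription. A small helper for that (hypothetical, not part of faster-whisper):

```python
def to_timecode(seconds: float) -> str:
    """Format a time in seconds as an SRT-style HH:MM:SS,mmm timecode."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(to_timecode(75.5))  # 00:01:15,500
```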

Model Architecture

Whisper Encoder-Decoder Architecture
├── Input: Audio (16kHz, mono)
├── Preprocessing: Mel-spectrogram extraction
├── Encoder: 12 transformer layers (12 attention heads, d_model = 768)
├── Decoder: 12 transformer layers (12 attention heads, autoregressive)
└── Output: Spanish text transcription + timestamps

Model Size: 244M parameters (0.2B)
Tensor Type: F32 (Safetensors format)

Limitations and Bias

  • Regional Coverage: Optimized for Spain, Mexico, Argentina, and Chile. Other Spanish dialects (Caribbean, Andean, etc.) may have reduced accuracy.
  • Slang Evolution: Colloquial language evolves rapidly; some newer slang may not be recognized.
  • Audio Quality: Performance degrades with noisy audio or heavy background music.
  • Code-Switching: Limited support for Spanish-English code-switching common in some regions.
  • Formal Speech: Optimized for informal/conversational speech; may introduce colloquialisms in formal transcriptions.

Ethical Considerations

  • The model reflects regional speech patterns and may transcribe profanity or crude slang when present in audio.
  • Regional dialect detection should not be used for speaker profiling or discrimination.
  • Training data includes synthetic and web-sourced content; original sources should be respected.

Citation

@misc{spanish-slang-whisper,
  author = {Shraavasti Bhat},
  title = {Spanish Regional Slang Whisper: Fine-tuned ASR for Spanish Dialects},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/shraavb/spanish-slang-whisper}}
}

Acknowledgments

  • OpenAI for the Whisper base model
  • Mozilla Common Voice for Spanish accent data
  • ElevenLabs for TTS synthesis capabilities