datasets:
- azeem-ahmed/Common_Voice_Corpus_22_0_Urdu
library_name: transformers
---

# Wav2Vec2-XLS-R-1B Fine-Tuned for Urdu ASR 🎙️🇵🇰

This repository hosts a fine-tuned version of **[facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b)** for **Automatic Speech Recognition (ASR) in Urdu**. The model was trained on the **Common Voice Corpus 22.0 (Urdu subset)** with custom preprocessing, error handling, and training monitoring.

---

## ✨ Highlights

- **Base Model**: facebook/wav2vec2-xls-r-1b (1B parameters, multilingual)
- **Target Language**: Urdu
- **Dataset**: [Mozilla Common Voice 22.0 (Urdu)](https://commonvoice.mozilla.org/en/datasets)
- **Training Framework**: Hugging Face Transformers + Datasets
- **Metrics Logged**: Training Loss, Validation Loss, WER, CER
- **Hardware**: Single NVIDIA RTX 4090 (24 GB VRAM)
- **Optimizations**:
  - FP16 mixed precision
  - Gradient checkpointing
  - RTX 4090–specific CUDA/TF32 tuning
  - Early stopping & loss monitoring
- **Robust Preprocessing**: Custom Urdu text cleaner, enhanced audio validation, dynamic vocabulary generation
- **Comprehensive Tracking**: Weights & Biases integration, CSV logging, and Markdown summary reports

---
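The "dynamic vocabulary generation" step mentioned above can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's actual preprocessing code: for CTC fine-tuning, the vocabulary is typically the set of characters found in the cleaned transcripts, with the space remapped to a word-delimiter token and `[UNK]`/`[PAD]` appended.

```python
# Illustrative sketch of dynamic vocabulary generation for CTC fine-tuning.
# The helper name and the sample transcripts are hypothetical.

def build_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id."""
    chars = sorted(set("".join(transcripts)))
    vocab = {c: i for i, c in enumerate(chars)}
    # Wav2Vec2's CTC tokenizer conventionally uses "|" as the word delimiter.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Special tokens required for CTC training.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

sample_transcripts = ["سلام دنیا", "اردو زبان"]  # toy examples
vocab = build_vocab(sample_transcripts)
```

The resulting dict would normally be written to a `vocab.json` and loaded into a `Wav2Vec2CTCTokenizer`.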

### 🏗️ Model Architecture

- Base: `facebook/wav2vec2-xls-r-1b`
- Architecture: Wav2Vec2-CTC (Connectionist Temporal Classification)
- Feature encoder: frozen during fine-tuning
- Dropouts (for regularization):
  - Attention: 0.1
  - Activation: 0.1
  - Hidden: 0.1
  - Feature projection: 0.0
  - Final: 0.0

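In Transformers terms, the dropout values above correspond to the following `Wav2Vec2Config` fields. This is a configuration sketch only (the full checkpoint load and training script are not shown here):

```python
from transformers import Wav2Vec2Config

# The dropout settings listed above, expressed as Wav2Vec2 config fields.
config = Wav2Vec2Config(
    attention_dropout=0.1,
    activation_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    final_dropout=0.0,
)
```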
**Hyperparameters**

- Batch Size:
  - Train: 4 (gradient accumulation = 2 → effective batch = 8)
  - Eval: 8
- Learning Rate: 3e-5
- Optimizer: AdamW with weight decay = 0.01
- Warmup Steps: 1000
- Max Grad Norm: 1.0
- Epochs: 30
- Save/Eval Steps: 1000
- Logging Steps: 25
- Early Stopping: patience = 5
72
+ **Metrics**
73
+ - Word Error Rate (WER)
74
+ - Character Error Rate (CER)
75
+ - Training/Validation Loss (with NaN/Inf safeguards)
76
+
77
+ ---
78
+ ## ๐Ÿš€ Model Performance
79
+ This model achieves exceptional performance on Urdu speech recognition:
80
+
81
+ - **Best WER (Word Error Rate): 33.75%**
82
+ - **Best CER (Character Error Rate): 27.00%**
83
+ - **44.7% improvement** from initial performance
84
+ - Robust performance across 30 training epochs
85
+
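The 44.7% figure is the relative reduction between the first and best WER values above:

```python
# Relative WER reduction from the first to the best evaluation step.
initial_wer, best_wer = 0.6107, 0.3375
relative_reduction = (initial_wer - best_wer) / initial_wer
print(f"{relative_reduction:.1%}")  # 44.7%
```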

## 📊 Training Metrics

The model was trained for **30 epochs** with a batch size optimized for the RTX 4090. Metrics were logged continuously.

| Step  | Epoch | Training Loss | Validation Loss | WER    | CER    |
|-------|-------|---------------|-----------------|--------|--------|
| 1000  | 1.09  | 3.1996        | 1.0216          | 0.6107 | 0.4886 |
| 2000  | 2.18  | 5.5422        | 0.8069          | 0.4751 | 0.3801 |
| 3000  | 3.28  | 3.8995        | 0.7641          | 0.4441 | 0.3553 |
| 4000  | 4.37  | 1.7375        | 0.7140          | 0.4175 | 0.3340 |
| 5000  | 5.46  | 1.8486        | 0.7205          | 0.3998 | 0.3198 |
| 6000  | 6.55  | 4.2864        | 0.6949          | 0.3970 | 0.3176 |
| 7000  | 7.64  | 5.7143        | 0.7016          | 0.3783 | 0.3026 |
| 8000  | 8.73  | 3.0777        | 0.6733          | 0.3817 | 0.3053 |
| 9000  | 9.83  | 3.3163        | 0.6827          | 0.3646 | 0.2916 |
| 10000 | 10.92 | 2.6399        | 0.6645          | 0.3647 | 0.2918 |
| 11000 | 12.01 | 1.9039        | 0.7104          | 0.3684 | 0.2947 |
| 12000 | 13.10 | 2.7625        | 0.6930          | 0.3624 | 0.2899 |
| 13000 | 14.19 | 4.1890        | 0.7066          | 0.3621 | 0.2897 |
| 14000 | 15.28 | 4.8301        | 0.7281          | 0.3565 | 0.2852 |
| 15000 | 16.38 | 2.8099        | 0.7179          | 0.3540 | 0.2832 |
| 16000 | 17.47 | 2.1910        | 0.7339          | 0.3527 | 0.2821 |
| 17000 | 18.56 | 6.7916        | 0.7245          | 0.3589 | 0.2871 |
| 18000 | 19.65 | 4.7375        | 0.7599          | 0.3485 | 0.2788 |
| 19000 | 20.74 | 6.2273        | 0.7414          | 0.3471 | 0.2776 |
| 20000 | 21.83 | 2.4164        | 0.7877          | 0.3519 | 0.2815 |
| 21000 | 22.93 | 3.9591        | 0.7595          | 0.3422 | 0.2737 |
| 22000 | 24.02 | 7.3049        | 0.7994          | 0.3430 | 0.2744 |
| 23000 | 25.11 | 4.7571        | 0.8182          | 0.3457 | 0.2766 |
| 24000 | 26.20 | 2.9164        | 0.8067          | 0.3417 | 0.2733 |
| 25000 | 27.29 | 4.1302        | 0.8132          | 0.3377 | 0.2701 |
| 26000 | 28.38 | 4.2031        | 0.8328          | 0.3383 | 0.2707 |
| 27000 | 29.48 | 1.2038        | 0.8367          | 0.3375 | 0.2700 |
| 27480 | 30.00 | 5.8839        | 0.8261          | 0.3376 | 0.2701 |

## 💻 Usage

### 1. Install Dependencies

```bash
pip install torch librosa soundfile transformers datasets jiwer
```

### 2. Transcribe an Audio File

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa

processor = Wav2Vec2Processor.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model = Wav2Vec2ForCTC.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model.eval()

# Wav2Vec2 expects 16 kHz mono audio; librosa resamples on load.
speech, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```

## 📜 Citation

```bibtex
@misc{azeem2025wav2vec2urdu,
  title={Fine-tuned Wav2Vec2-XLS-R-1B for Urdu ASR},
  author={Ahmed, Azeem},
  year={2025},
  howpublished={\url{https://huggingface.co/azeem-ahmed/wav2vec2-xls-r-1b-urdu}},
}
```

## 🙏 Acknowledgements

- Facebook AI Research for Wav2Vec2-XLS-R
- Mozilla for Common Voice 22.0
- The Hugging Face team for Transformers and Datasets
- Weights & Biases for experiment tracking

##### 🌟 Star this repository if you find it useful!

_Built with ❤️ for the Urdu language community_