lordChipotle committed
Commit ef4deaf · verified · 1 Parent(s): 90ae74a

Update README.md

Files changed (1): README.md +177 -34
README.md CHANGED
---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- qwen2-vl
- automatic-speech-recognition
- speech-understanding
- audio
- multi-modal
model_name: Qwen2-VL-7B-Audio-ASR
---

# Model Card for Qwen2-VL-7B-Audio-ASR

## Model Details

**Model Description:** This project extends `Qwen/Qwen2.5-VL-7B-Instruct`, a powerful Vision-Language Model (VLM), into a multi-modal model capable of understanding and transcribing spoken English. By integrating the audio encoder from OpenAI's Whisper `large-v3`, we have effectively taught a VLM to "hear," enabling it to perform high-quality Automatic Speech Recognition (ASR).

The core of this work lies in a data processing pipeline that allows for batch-efficient training. The model was fine-tuned in a two-stage process: adapter tuning first, followed by end-to-end QLoRA optimization.

- **Developed by:** lordChipotle
- **Model Type:** Audio-Vision-Language Model
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Audio Encoder:** OpenAI Whisper `large-v3`

## Notebook Walkthrough

If you're interested in the full training code, see this [Colab notebook](https://colab.research.google.com/drive/132FZOydWessJdiPxt5hlXJri44WkP90P?usp=sharing).

## Technical Approach & Pipeline

The primary challenge was to enable a VLM, originally designed for text and images, to process variable-length audio inputs. We achieved this through the following pipeline:

![Training Pipeline Diagram](https://imgur.com/a/CKuM9sf)

1. **Conversation Formatting:** Each audio-text pair from the dataset is first structured into a conversational format.
2. **Chat Templating & Placeholder Injection:** A custom chat template is applied, inserting special placeholder tokens (`<|audio_start|>`, `<|audio_pad|>`, `<|audio_end|>`) where the audio information belongs. The number of `<|audio_pad|>` tokens is scaled with the audio clip's duration.
3. **Dual-Path Encoding:**
   * The **Whisper audio encoder** processes the raw audio waveform to generate rich audio embeddings.
   * The **Qwen2 text encoder** processes the text part of the prompt.
4. **Dynamic Embedding Swapping:** In the final step before the LLM, the placeholder embeddings in the text stream are dynamically replaced ("hot-swapped") with their corresponding audio embeddings, creating a unified text-and-audio embedding sequence.
5. **Training:** The model is then trained on this combined sequence to predict the ground-truth transcript. This approach allows audio and text data to be batched efficiently.

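The placeholder injection and embedding swap steps can be sketched in isolation. A minimal sketch in plain PyTorch; the token ids, the pads-per-second rate, and both function names are illustrative assumptions, not the model's actual values:

```python
import torch

# Illustrative values only: real token ids come from the model's tokenizer, and
# the pads-per-second rate depends on the Whisper encoder's output frame rate.
AUDIO_START, AUDIO_PAD, AUDIO_END = 151652, 151653, 151654
PADS_PER_SECOND = 50

def audio_placeholder_ids(duration_sec: float) -> list:
    """Scale the number of <|audio_pad|> tokens with clip duration (step 2)."""
    n_pads = max(1, int(duration_sec * PADS_PER_SECOND))
    return [AUDIO_START] + [AUDIO_PAD] * n_pads + [AUDIO_END]

def swap_audio_embeddings(text_embeds: torch.Tensor,
                          input_ids: torch.Tensor,
                          audio_embeds: torch.Tensor) -> torch.Tensor:
    """Overwrite <|audio_pad|> positions with audio embeddings (step 4)."""
    mask = input_ids == AUDIO_PAD                  # (batch, seq) boolean mask
    fused = text_embeds.clone()
    fused[mask] = audio_embeds.reshape(-1, audio_embeds.shape[-1]).to(fused.dtype)
    return fused

# Tiny demo: a 0.04 s clip maps to 2 pad tokens, whose embeddings get replaced.
ids = torch.tensor([audio_placeholder_ids(0.04)])
text_embeds = torch.randn(1, ids.shape[1], 8)
audio_embeds = torch.randn(1, 2, 8)
fused = swap_audio_embeddings(text_embeds, ids, audio_embeds)
```

Because the swap is a masked assignment over the whole batch, clips of different lengths can share one padded batch, which is what makes the training batch-efficient.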
## How to Get Started with the Model

Use the code below to get started with the model for speech transcription.

```python
import torch
import torchaudio
import torchaudio.transforms as T
from peft import PeftModel
from transformers import BitsAndBytesConfig, Qwen2VLProcessor
# AudioQwen2VLForConditionalGeneration is provided by this repo's custom modeling code.
from transformers.models.qwen2_vl.modeling_qwen2_vl import AudioQwen2VLForConditionalGeneration

# --- Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float32

BASE_REPO = "lordChipotle/qwen2-vl-audio-7b"
ADAPTER_REPO = "lordChipotle/qwen2-vl-audio-7b-qlora"

# --- Load Model and Processor ---
print("Loading base model, processor, and applying LoRA adapter...")
processor = Qwen2VLProcessor.from_pretrained(BASE_REPO, trust_remote_code=True)

# Load the base model with a 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DTYPE,
)

model = AudioQwen2VLForConditionalGeneration.from_pretrained(
    BASE_REPO,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Apply the LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_REPO)
print("Model, processor, and adapter loaded.")

# --- Inference Functions ---
def prepare_audio(audio_path, target_sr=16000):
    """Load audio, resample to 16 kHz, and downmix to mono."""
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != target_sr:
        resampler = T.Resample(orig_freq=sample_rate, new_freq=target_sr)
        waveform = resampler(waveform)
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    return waveform.squeeze().numpy()

def transcribe(audio_path, max_new_tokens=128):
    print(f"Loading and preparing audio from: {audio_path}")
    audio_array = prepare_audio(audio_path)

    chat = [
        {"role": "system", "content": [{"type": "text", "text": "You are an ASR assistant."}]},
        {"role": "user", "content": [{"type": "audio", "array": audio_array}, {"type": "text", "text": "Transcribe this."}]},
    ]

    text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    print("Generating transcription...")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    response = processor.decode(outputs[0], skip_special_tokens=True)

    # Keep only the assistant's turn; the split falls back to the full decode
    # if the marker is absent.
    return response.split("assistant\n")[-1].strip()

# --- Example Usage ---
# Download a sample audio file for testing:
# !wget https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac
AUDIO_FILE = "1.flac"

transcription = transcribe(AUDIO_FILE)
print("\n--- TRANSCRIPTION ---")
print(transcription)
print("---------------------")
```

## Deployment and Inference

For optimized inference, especially in a production environment, it is recommended to use a serving framework like `vLLM`, which can provide significant speedups.

## Training Details

### Training Data

The model was fine-tuned on a subset of the `speechbrain/LargeScaleASR` dataset (recently renamed `speechbrain/LoquaciousSet`), which comprises 25,000 hours of diverse, transcribed English speech. For this project, a smaller shard consisting of the first two parts of the `small` configuration (`train-0000*` and `train-0001*`) was used for training, and the first part of the `test` set (`test-00000*`) was used for validation.

### Training Procedure

The fine-tuning was conducted in two stages to adapt the VLM for audio processing.

#### Stage 1: Audio Adapter Training

In the first stage, the language model and the pre-trained Whisper audio encoder were frozen, and only the newly introduced `audio_proj` layer was trained. This stage aligns the audio feature space with the language model's embedding space.

- **Learning Rate:** `1e-4`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `4` (effective batch size of 8)
- **Max Steps:** `1000`

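In code, the Stage 1 scheme amounts to disabling gradients everywhere and re-enabling them only for the adapter. A minimal sketch: the `AudioVLM` class and its submodule names are illustrative stand-ins, with `audio_proj` matching the layer named above:

```python
import torch.nn as nn

class AudioVLM(nn.Module):
    """Toy stand-in for the real model; only the module names matter here."""
    def __init__(self, d_audio=1280, d_model=3584):
        super().__init__()
        self.audio_encoder = nn.Linear(80, d_audio)        # stands in for Whisper large-v3
        self.language_model = nn.Linear(d_model, d_model)  # stands in for the Qwen2 LLM
        self.audio_proj = nn.Linear(d_audio, d_model)      # the adapter trained in Stage 1

def freeze_for_stage1(model: AudioVLM) -> int:
    # Freeze everything, then unfreeze only the audio projection adapter.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.audio_proj.parameters():
        p.requires_grad = True
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

m = AudioVLM()
trainable = freeze_for_stage1(m)  # only audio_proj's weight and bias remain trainable
```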
#### Stage 2: QLoRA End-to-End Fine-Tuning

In the second stage, the model was fine-tuned end-to-end using **QLoRA** (Quantized Low-Rank Adaptation). This method significantly reduces memory requirements by quantizing the base model to 4 bits with **NF4 (4-bit NormalFloat)** quantization and training a small number of LoRA adapters on top.

- **Learning Rate:** `2e-5`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `8` (effective batch size of 16)
- **Epochs:** `1`
- **Quantization:** `4-bit NF4` with `bfloat16` compute dtype
- **LoRA Config:**
  - `r`: 16
  - `lora_alpha`: 32
  - `target_modules`: `['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'audio_proj']`
  - `lora_dropout`: 0.05

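Conceptually, each targeted module gets a low-rank update on top of its frozen (4-bit-quantized) base weight. A minimal sketch using the `r` and `lora_alpha` values above; the layer dimensions are illustrative, and `B` starts at zero in the usual LoRA initialization, so the update begins as a no-op:

```python
import torch

# Stage 2 LoRA hyperparameters from the list above; dimensions are illustrative.
r, lora_alpha = 16, 32
d_out, d_in = 64, 48

W = torch.randn(d_out, d_in)   # frozen base weight (NF4-quantized under QLoRA)
A = torch.randn(r, d_in)       # trainable LoRA matrix A
B = torch.zeros(d_out, r)      # trainable LoRA matrix B, zero-initialized

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    """y = x W^T + (alpha / r) * x A^T B^T: base output plus low-rank update."""
    return x @ W.T + (lora_alpha / r) * (x @ A.T @ B.T)

x = torch.randn(2, d_in)
y = lora_forward(x)  # identical to x @ W.T until B is trained away from zero
```

Only `A` and `B` receive gradients (2 × r × d parameters per layer instead of d², plus the `audio_proj` adapter), which is what keeps the memory footprint small.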
## Evaluation

The model's performance was monitored using Weights & Biases. The plot below shows the training and evaluation loss during the second stage of fine-tuning.

**Training & Evaluation Loss (Stage 2)**
![Training and Evaluation Loss](https://imgur.com/a/zXj0jF1)

The evaluation loss shows a consistent downward trend, indicating that the model was successfully learning to transcribe speech from the audio data. The training loss also decreased steadily, converging to a low value.

## Citation

If you use this model in your work, please consider citing the original Qwen and Whisper models, as well as this derivative work.

```bibtex
@misc{qwen2_vl_audio_asr,
  author       = {lordChipotle},
  title        = {Qwen2-VL-7B for Speech Understanding},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/lordChipotle/qwen2-vl-audio-7b-qlora}}
}
```