---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- qwen2-vl
- automatic-speech-recognition
- speech-understanding
- audio
- multi-modal
model_name: Qwen2-VL-7B-Audio-ASR
---

# Model Card for Qwen2-VL-7B-Audio-ASR

## Model Details

**Model Description:**

This project extends `Qwen/Qwen2.5-VL-7B-Instruct`, a powerful Vision-Language Model (VLM), into a multi-modal model capable of understanding and transcribing spoken English. By integrating the audio-encoding capabilities of OpenAI's Whisper `large-v3` encoder, we have effectively taught a VLM to "hear," enabling it to perform high-quality Automatic Speech Recognition (ASR).

The core of this work lies in a novel data processing pipeline that allows for batch-efficient training. The model was fine-tuned in two stages: adapter tuning first, followed by end-to-end QLoRA optimization.

- **Developed by:** lordChipotle
- **Model Type:** Audio-Vision-Language Model
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Audio Encoder:** OpenAI Whisper `large-v3`

# Notebook Walkthrough

If you're interested in the entire training code, see [this Colab notebook](https://colab.research.google.com/drive/132FZOydWessJdiPxt5hlXJri44WkP90P?usp=sharing).

# Technical Approach & Pipeline

The primary challenge was to enable a VLM, originally designed for text and images, to process variable-length audio inputs. We achieved this through the following pipeline:

![Training Pipeline Diagram](https://imgur.com/a/CKuM9sf)

[See the diagram](https://imgur.com/a/CKuM9sf)

1. **Conversation Formatting:** Each audio-text pair from the dataset is first structured into a conversational format.
2. **Chat Templating & Placeholder Injection:** A custom chat template is applied, which inserts special placeholder tokens (`<|audio_start|>`, `<|audio_pad|>`, `<|audio_end|>`) where the audio information belongs. The number of `<|audio_pad|>` tokens is scaled with the audio clip's duration.
3. **Dual-Path Encoding:**
   * The **Whisper audio encoder** processes the raw audio waveform to generate rich audio embeddings.
   * The **Qwen2 text encoder** processes the text part of the prompt.
4. **Dynamic Embedding Swapping:** In the final step before the LLM, the placeholder embeddings from the text stream are dynamically replaced ("hot-swapped") with their corresponding audio embeddings, creating a unified text-and-audio embedding sequence (see the sketch after this list).
5. **Training:** The model is then trained on this combined sequence to predict the ground-truth text transcript.

This approach allows for efficient batching of audio and text data.
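To make steps 2 and 4 concrete, here is a minimal sketch of the placeholder scaling and the embedding hot-swap. The placeholder token ID, the tokens-per-second ratio, and the tensor layout are illustrative assumptions, not the exact values used in the custom model code.

```python
import torch

# Minimal sketch of steps 2 and 4 above. AUDIO_PAD_ID and the
# tokens-per-second ratio are illustrative assumptions.
AUDIO_PAD_ID = 151657  # assumed ID of <|audio_pad|> in the extended vocab

def num_audio_pad_tokens(duration_s: float, tokens_per_sec: int = 25) -> int:
    """Scale the number of <|audio_pad|> placeholders with clip duration."""
    return max(1, int(duration_s * tokens_per_sec))

def swap_audio_embeddings(
    input_ids: torch.Tensor,    # (batch, seq_len) token IDs incl. <|audio_pad|>
    text_embeds: torch.Tensor,  # (batch, seq_len, dim) from the embedding layer
    audio_embeds: torch.Tensor, # (n_audio_tokens, dim) projected Whisper features
) -> torch.Tensor:
    # Mark every placeholder position; the mask broadcasts over the hidden dim.
    mask = (input_ids == AUDIO_PAD_ID).unsqueeze(-1)
    # masked_scatter fills masked positions in order, so audio_embeds must be
    # concatenated in the same order the placeholders appear across the batch.
    return text_embeds.masked_scatter(mask, audio_embeds.to(text_embeds.dtype))
```

This is the same mechanism Qwen2-VL uses to splice image embeddings into the token stream, which is why the placeholder trick ports cleanly to audio.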
## How to Get Started with the Model

Use the code below to get started with the model for speech transcription.

```python
import torch
import torchaudio
import torchaudio.transforms as T
from peft import PeftModel
from transformers import BitsAndBytesConfig, Qwen2VLProcessor
from transformers.models.qwen2_vl.modeling_qwen2_vl import AudioQwen2VLForConditionalGeneration

# --- Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float32
BASE_REPO = "lordChipotle/qwen2-vl-audio-7b"
ADAPTER_REPO = "lordChipotle/qwen2-vl-audio-7b-qlora"

# --- Load Model and Processor ---
print("Loading base model, processor, and applying LoRA adapter...")
processor = Qwen2VLProcessor.from_pretrained(BASE_REPO, trust_remote_code=True)

# Load the base model in 4-bit (NF4) to keep memory usage low
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DTYPE,
)
model = AudioQwen2VLForConditionalGeneration.from_pretrained(
    BASE_REPO,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Apply the LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_REPO)
print("Model, processor, and adapter loaded.")

# --- Inference Functions ---
def prepare_audio(audio_path, target_sr=16000):
    """Load an audio file, resample to 16 kHz, and downmix to mono."""
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != target_sr:
        resampler = T.Resample(orig_freq=sample_rate, new_freq=target_sr)
        waveform = resampler(waveform)
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    return waveform.squeeze().numpy()

def transcribe(audio_path, max_new_tokens=128):
    print(f"Loading and preparing audio from: {audio_path}")
    audio_array = prepare_audio(audio_path)
    chat = [
        {"role": "system", "content": [{"type": "text", "text": "You are an ASR assistant."}]},
        {"role": "user", "content": [
            {"type": "audio", "array": audio_array},
            {"type": "text", "text": "Transcribe this."},
        ]},
    ]
    text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    # The custom processor is expected to turn the audio placeholders into
    # Whisper features; if your build exposes an explicit audio keyword
    # instead, pass `audio_array` to this call as well.
    inputs = processor(text=[text], return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    print("Generating transcription...")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    response = processor.decode(outputs[0], skip_special_tokens=True)
    # Keep only the assistant's reply if the template header survives decoding
    return response.split("assistant\n")[-1].strip()

# --- Example Usage ---
# Download a sample audio file for testing:
# !wget https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac
AUDIO_FILE = "1.flac"

transcription = transcribe(AUDIO_FILE)
print("\n--- TRANSCRIPTION ---")
print(transcription)
print("---------------------")
```

## Deployment and Inference

For optimized inference, especially in a production environment, it is recommended to use a serving framework such as `vLLM`, which can provide significant speedups over plain `transformers` generation.

## Training Details

### Training Data

The model was fine-tuned on a subset of the `speechbrain/LargeScaleASR` dataset (recently renamed to `speechbrain/LoquaciousSet`), which comprises 25,000 hours of diverse, transcribed English speech. For this project, a smaller shard consisting of the first two parts of the 'small' configuration (`train-0000*` and `train-0001*`) was used for training, and the first part of the 'test' set (`test-00000*`) was used for validation; a loading sketch is shown below.
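A minimal sketch of that shard selection with 🤗 `datasets` follows. The glob patterns are an assumption about the repository's file layout; adjust them to match the actual shard naming in the 'small' configuration.

```python
from datasets import load_dataset

# Sketch of the shard selection described above. The glob patterns are an
# assumption about the repository's file layout, not verified paths.
data_files = {
    "train": ["small/train-0000*", "small/train-0001*"],
    "validation": ["test/test-00000*"],
}
ds = load_dataset("speechbrain/LargeScaleASR", data_files=data_files)
print(ds)
```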
### Training Procedure

The fine-tuning was conducted in two stages to effectively adapt the VLM for audio processing.

#### Stage 1: Audio Adapter Training

In the first stage, the language model and the pre-trained Whisper audio encoder were frozen; only the newly introduced `audio_proj` layer was trained. This stage aligns the audio feature space with the language model's embedding space.

- **Learning Rate:** `1e-4`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `4` (effective batch size of 8)
- **Max Steps:** `1000`

#### Stage 2: QLoRA End-to-End Fine-Tuning

In the second stage, the model was fine-tuned end-to-end using **QLoRA** (Quantized Low-Rank Adaptation). This method significantly reduces memory requirements by quantizing the base model to 4 bits using **NF4 (4-bit NormalFloat)** quantization and then training a small number of LoRA adapters on top.

- **Learning Rate:** `2e-5`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `8` (effective batch size of 16)
- **Epochs:** `1`
- **Quantization:** 4-bit NF4 with `bfloat16` compute dtype
- **LoRA Config** (see the sketch at the end of this card):
  - `r`: 16
  - `lora_alpha`: 32
  - `target_modules`: `['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'audio_proj']`
  - `lora_dropout`: 0.05

## Evaluation

The model's performance was monitored using Weights & Biases. The plot below shows the training and evaluation loss during the second stage of fine-tuning.

**Training & Evaluation Loss (Stage 2)**

![Training and Evaluation Loss](https://imgur.com/a/zXj0jF1)

[See the chart](https://imgur.com/a/zXj0jF1)

The evaluation loss shows a consistent downward trend, indicating that the model was successfully learning to transcribe speech from the audio data. The training loss also decreased steadily, converging to a low value.

## Citation

If you use this model in your work, please consider citing the original Qwen and Whisper models, as well as this derivative work.

```bibtex
@misc{qwen2_vl_audio_asr,
  author       = {lordChipotle},
  title        = {Qwen2-VL-7B for Speech Understanding},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/lordChipotle/qwen2-vl-audio-7b-qlora}}
}
```
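## Appendix: Stage 2 Configuration Sketch

For reference, the Stage 2 hyperparameters listed above map roughly onto the following `peft` + `bitsandbytes` setup. This is a sketch, not the exact training script: the `task_type`, `bias`, and trainer wiring are assumptions not stated elsewhere in this card.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, as listed in Stage 2.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA settings from the Stage 2 list; task_type and bias are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj", "audio_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Typical wiring (base-model loading as in the quickstart above):
# model = AudioQwen2VLForConditionalGeneration.from_pretrained(
#     BASE_REPO, quantization_config=bnb_config, device_map="auto")
# model = prepare_model_for_kbit_training(model)
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
```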