---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- qwen2-vl
- automatic-speech-recognition
- speech-understanding
- audio
- multi-modal
model_name: Qwen2-VL-7B-Audio-ASR
---

# Model Card for Qwen2-VL-7B-Audio-ASR

## Model Details

**Model Description:**

This project extends `Qwen/Qwen2.5-VL-7B-Instruct`, a powerful Vision-Language Model (VLM), into a multi-modal model capable of understanding and transcribing spoken English. By integrating the audio-encoding capabilities of OpenAI's Whisper `large-v3` encoder, we have effectively taught a VLM to "hear," enabling it to perform high-quality Automatic Speech Recognition (ASR).

The core of this work lies in a novel data processing pipeline that allows for batch-efficient training. The model was fine-tuned in two stages: adapter tuning first, followed by end-to-end QLoRA optimization.

- **Developed by:** lordChipotle
- **Model Type:** Audio-Vision-Language Model
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Audio Encoder:** OpenAI Whisper `large-v3`

# Notebook Walkthrough

If you're interested in the entire training code, see [this Colab notebook](https://colab.research.google.com/drive/132FZOydWessJdiPxt5hlXJri44WkP90P?usp=sharing).

# Technical Approach & Pipeline

The primary challenge was to enable a VLM, originally designed for text and images, to process variable-length audio inputs. We achieved this through the following pipeline:

![Training Pipeline Diagram](https://imgur.com/a/CKuM9sf)

[See the diagram](https://imgur.com/a/CKuM9sf)

1. **Conversation Formatting:** Each audio-text pair from the dataset is first structured into a conversational format.
2. **Chat Templating & Placeholder Injection:** A custom chat template is applied, which inserts special placeholder tokens (`<|audio_start|>`, `<|audio_pad|>`, `<|audio_end|>`) where the audio information belongs. The number of `<|audio_pad|>` tokens is scaled with the audio clip's duration.
3. **Dual-Path Encoding:**
   * The **Whisper audio encoder** processes the raw audio waveform to generate rich audio embeddings.
   * The **Qwen2 text encoder** processes the text part of the prompt.
4. **Dynamic Embedding Swapping:** In the final step before the LLM, the placeholder embeddings from the text stream are dynamically replaced ("hot-swapped") with their corresponding audio embeddings, creating a unified text-and-audio embedding sequence (see the sketch after this list).
5. **Training:** The model is then trained on this combined sequence to predict the ground-truth text transcript.

This approach allows for efficient batching of audio and text data.
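To make steps 2 and 4 concrete, here is a minimal sketch of the placeholder scaling and the embedding hot-swap. The placeholder token ID, the tokens-per-second ratio, and the tensor layout are illustrative assumptions, not the exact values used in the custom model code.

```python
import torch

# Minimal sketch of steps 2 and 4 above. AUDIO_PAD_ID and the
# tokens-per-second ratio are illustrative assumptions.
AUDIO_PAD_ID = 151657  # assumed ID of <|audio_pad|> in the extended vocab

def num_audio_pad_tokens(duration_s: float, tokens_per_sec: int = 25) -> int:
    """Scale the number of <|audio_pad|> placeholders with clip duration."""
    return max(1, int(duration_s * tokens_per_sec))

def swap_audio_embeddings(
    input_ids: torch.Tensor,    # (batch, seq_len) token IDs incl. <|audio_pad|>
    text_embeds: torch.Tensor,  # (batch, seq_len, dim) from the embedding layer
    audio_embeds: torch.Tensor, # (n_audio_tokens, dim) projected Whisper features
) -> torch.Tensor:
    # Mark every placeholder position; the mask broadcasts over the hidden dim.
    mask = (input_ids == AUDIO_PAD_ID).unsqueeze(-1)
    # masked_scatter fills masked positions in order, so audio_embeds must be
    # concatenated in the same order the placeholders appear across the batch.
    return text_embeds.masked_scatter(mask, audio_embeds.to(text_embeds.dtype))
```

This is the same mechanism Qwen2-VL uses to splice image embeddings into the token stream, which is why the placeholder trick ports cleanly to audio.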
## How to Get Started with the Model

Use the code below to get started with the model for speech transcription.

```python
import torch
import torchaudio
import torchaudio.transforms as T
from peft import PeftModel
from transformers import BitsAndBytesConfig, Qwen2VLProcessor
from transformers.models.qwen2_vl.modeling_qwen2_vl import AudioQwen2VLForConditionalGeneration

# --- Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float32
BASE_REPO = "lordChipotle/qwen2-vl-audio-7b"
ADAPTER_REPO = "lordChipotle/qwen2-vl-audio-7b-qlora"

# --- Load Model and Processor ---
print("Loading base model, processor, and applying LoRA adapter...")
processor = Qwen2VLProcessor.from_pretrained(BASE_REPO, trust_remote_code=True)

# Load the base model in 4-bit (NF4) to keep memory usage low
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DTYPE,
)
model = AudioQwen2VLForConditionalGeneration.from_pretrained(
    BASE_REPO,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Apply the LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_REPO)
print("Model, processor, and adapter loaded.")

# --- Inference Functions ---
def prepare_audio(audio_path, target_sr=16000):
    """Load an audio file, resample to 16 kHz, and downmix to mono."""
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != target_sr:
        resampler = T.Resample(orig_freq=sample_rate, new_freq=target_sr)
        waveform = resampler(waveform)
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    return waveform.squeeze().numpy()

def transcribe(audio_path, max_new_tokens=128):
    print(f"Loading and preparing audio from: {audio_path}")
    audio_array = prepare_audio(audio_path)
    chat = [
        {"role": "system", "content": [{"type": "text", "text": "You are an ASR assistant."}]},
        {"role": "user", "content": [
            {"type": "audio", "array": audio_array},
            {"type": "text", "text": "Transcribe this."},
        ]},
    ]
    text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    # The custom processor is expected to turn the audio placeholders into
    # Whisper features; if your build exposes an explicit audio keyword
    # instead, pass `audio_array` to this call as well.
    inputs = processor(text=[text], return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    print("Generating transcription...")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    response = processor.decode(outputs[0], skip_special_tokens=True)
    # Keep only the assistant's reply if the template header survives decoding
    return response.split("assistant\n")[-1].strip()

# --- Example Usage ---
# Download a sample audio file for testing:
# !wget https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac
AUDIO_FILE = "1.flac"

transcription = transcribe(AUDIO_FILE)
print("\n--- TRANSCRIPTION ---")
print(transcription)
print("---------------------")
```

## Deployment and Inference

For optimized inference, especially in a production environment, it is recommended to use a serving framework such as `vLLM`, which can provide significant speedups over plain `transformers` generation.

## Training Details

### Training Data

The model was fine-tuned on a subset of the `speechbrain/LargeScaleASR` dataset (recently renamed to `speechbrain/LoquaciousSet`), which comprises 25,000 hours of diverse, transcribed English speech. For this project, a smaller shard consisting of the first two parts of the 'small' configuration (`train-0000*` and `train-0001*`) was used for training, and the first part of the 'test' set (`test-00000*`) was used for validation; a loading sketch is shown below.
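A minimal sketch of that shard selection with 🤗 `datasets` follows. The glob patterns are an assumption about the repository's file layout; adjust them to match the actual shard naming in the 'small' configuration.

```python
from datasets import load_dataset

# Sketch of the shard selection described above. The glob patterns are an
# assumption about the repository's file layout, not verified paths.
data_files = {
    "train": ["small/train-0000*", "small/train-0001*"],
    "validation": ["test/test-00000*"],
}
ds = load_dataset("speechbrain/LargeScaleASR", data_files=data_files)
print(ds)
```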
### Training Procedure

The fine-tuning was conducted in two stages to effectively adapt the VLM for audio processing.

#### Stage 1: Audio Adapter Training

In the first stage, the language model and the pre-trained Whisper audio encoder were frozen; only the newly introduced `audio_proj` layer was trained. This stage aligns the audio feature space with the language model's embedding space.

- **Learning Rate:** `1e-4`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `4` (effective batch size of 8)
- **Max Steps:** `1000`

#### Stage 2: QLoRA End-to-End Fine-Tuning

In the second stage, the model was fine-tuned end-to-end using **QLoRA** (Quantized Low-Rank Adaptation). This method significantly reduces memory requirements by quantizing the base model to 4 bits using **NF4 (4-bit NormalFloat)** quantization and then training a small number of LoRA adapters on top.

- **Learning Rate:** `2e-5`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `8` (effective batch size of 16)
- **Epochs:** `1`
- **Quantization:** 4-bit NF4 with `bfloat16` compute dtype
- **LoRA Config** (see the sketch at the end of this card):
  - `r`: 16
  - `lora_alpha`: 32
  - `target_modules`: `['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'audio_proj']`
  - `lora_dropout`: 0.05

## Evaluation

The model's performance was monitored using Weights & Biases. The plot below shows the training and evaluation loss during the second stage of fine-tuning.

**Training & Evaluation Loss (Stage 2)**

![Training and Evaluation Loss](https://imgur.com/a/zXj0jF1)

[See the chart](https://imgur.com/a/zXj0jF1)

The evaluation loss shows a consistent downward trend, indicating that the model was successfully learning to transcribe speech from the audio data. The training loss also decreased steadily, converging to a low value.

## Citation

If you use this model in your work, please consider citing the original Qwen and Whisper models, as well as this derivative work.

```bibtex
@misc{qwen2_vl_audio_asr,
  author       = {lordChipotle},
  title        = {Qwen2-VL-7B for Speech Understanding},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/lordChipotle/qwen2-vl-audio-7b-qlora}}
}
```
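## Appendix: Stage 2 Configuration Sketch

For reference, the Stage 2 hyperparameters listed above map roughly onto the following `peft` + `bitsandbytes` setup. This is a sketch, not the exact training script: the `task_type`, `bias`, and trainer wiring are assumptions not stated elsewhere in this card.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, as listed in Stage 2.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA settings from the Stage 2 list; task_type and bias are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj", "audio_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Typical wiring (base-model loading as in the quickstart above):
# model = AudioQwen2VLForConditionalGeneration.from_pretrained(
#     BASE_REPO, quantization_config=bnb_config, device_map="auto")
# model = prepare_model_for_kbit_training(model)
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
```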