---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- qwen2-vl
- automatic-speech-recognition
- speech-understanding
- audio
- multi-modal
model_name: Qwen2-VL-7B-Audio-ASR
---

# Model Card for Qwen2-VL-7B-Audio-ASR

## Model Details

**Model Description:** This project extends `Qwen/Qwen2.5-VL-7B-Instruct`, a powerful Vision-Language Model (VLM), into a multi-modal model capable of understanding and transcribing spoken English. By integrating the audio-encoding capabilities of OpenAI's Whisper `large-v3` encoder, we have effectively taught a VLM to "hear," enabling it to perform high-quality Automatic Speech Recognition (ASR).

The core of this work lies in a novel data processing pipeline that allows for batch-efficient training. The model was fine-tuned using a two-stage process, starting with adapter tuning and followed by end-to-end QLoRA optimization.

- **Developed by:** lordChipotle
- **Model Type:** Audio-Vision-Language Model
- **Language(s):** English
- **License:** Apache-2.0
- **Finetuned from model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Audio Encoder:** OpenAI Whisper `large-v3`
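
To make the connection between the two backbones concrete, here is a minimal sketch of an audio projection adapter that maps Whisper encoder features into the language model's embedding space. The `audio_proj` name matches the training details below; the hidden sizes (1280 for the Whisper `large-v3` encoder, 3584 for the 7B Qwen backbone) and the MLP shape are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Illustrative adapter: Whisper encoder features -> LLM embedding space."""

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, num_audio_frames, audio_dim)
        return self.proj(audio_features)  # (batch, num_audio_frames, llm_dim)
```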

## Notebook Walkthrough

If you're interested in the entire training code, please see [this Colab notebook](https://colab.research.google.com/drive/132FZOydWessJdiPxt5hlXJri44WkP90P?usp=sharing).

## Technical Approach & Pipeline

The primary challenge was to enable a VLM, originally designed for text and images, to process variable-length audio inputs. We achieved this through the following pipeline:

![Pipeline Diagram](https://imgur.com/CKuM9sf.png)

[See the diagram](https://imgur.com/a/CKuM9sf)

1. **Conversation Formatting:** Each audio-text pair from the dataset is first structured into a conversational format.
2. **Chat Templating & Placeholder Injection:** A custom chat template is applied, which inserts special placeholder tokens (`<|audio_start|>`, `<|audio_pad|>`, `<|audio_end|>`) where the audio information belongs. The number of `<|audio_pad|>` tokens is scaled based on the audio clip's duration.
3. **Dual-Path Encoding:**
   * The **Whisper audio encoder** processes the raw audio waveform to generate rich audio embeddings.
   * The **Qwen2 text encoder** processes the text part of the prompt.
4. **Dynamic Embedding Swapping:** In the final step before the LLM, the placeholder embeddings from the text stream are dynamically replaced ("hot-swapped") with their corresponding audio embeddings, creating a unified text-and-audio embedding sequence (see the sketch after this list).
5. **Training:** The model is then trained on this combined sequence to predict the ground-truth text transcript. This approach allows for efficient batching of audio and text data.
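
The placeholder-and-swap mechanism in steps 2 and 4 can be outlined roughly as follows. This is an illustrative sketch rather than the project's actual code: the special token strings match those listed above, but the frames-per-second ratio, tensor shapes, and helper names (`num_pad_tokens`, `merge_audio_into_text`) are assumptions.

```python
import torch

AUDIO_START, AUDIO_PAD, AUDIO_END = "<|audio_start|>", "<|audio_pad|>", "<|audio_end|>"

def num_pad_tokens(duration_sec: float, frames_per_sec: float = 25.0) -> int:
    # Step 2: scale the number of <|audio_pad|> placeholders with clip duration.
    return max(1, round(duration_sec * frames_per_sec))

def inject_placeholders(instruction: str, duration_sec: float) -> str:
    # Step 2: place the audio placeholders where the clip belongs in the prompt.
    pads = AUDIO_PAD * num_pad_tokens(duration_sec)
    return f"{AUDIO_START}{pads}{AUDIO_END}\n{instruction}"

def merge_audio_into_text(
    text_embeds: torch.Tensor,   # (batch, seq_len, hidden) from the LLM embedding table
    audio_embeds: torch.Tensor,  # (total_pad_tokens, hidden) projected Whisper features
    input_ids: torch.Tensor,     # (batch, seq_len)
    audio_pad_token_id: int,
) -> torch.Tensor:
    # Step 4: hot-swap each placeholder embedding with its projected audio frame.
    merged = text_embeds.clone()
    pad_mask = input_ids == audio_pad_token_id        # one True per <|audio_pad|>
    merged[pad_mask] = audio_embeds.to(merged.dtype)  # requires one frame per placeholder
    return merged
```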

## How to Get Started with the Model

Use the code below to get started with the model for speech transcription.

```python
import torch
import torchaudio
import torchaudio.transforms as T
import requests
from peft import PeftModel
from transformers import (
    BitsAndBytesConfig,
    Qwen2VLProcessor,
    AutoModelForCausalLM,
)
from transformers.models.qwen2_vl.modeling_qwen2_vl import AudioQwen2VLForConditionalGeneration

# --- Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float32

BASE_REPO = "lordChipotle/qwen2-vl-audio-7b"
ADAPTER_REPO = "lordChipotle/qwen2-vl-audio-7b-qlora"

# --- Load Model and Processor ---
print("Loading base model, processor, and applying LoRA adapter...")
processor = Qwen2VLProcessor.from_pretrained(BASE_REPO, trust_remote_code=True)

# Load the base model with a 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DTYPE,
)

model = AudioQwen2VLForConditionalGeneration.from_pretrained(
    BASE_REPO,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Apply the LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_REPO)
print("Model, processor, and adapter loaded.")

# --- Inference Functions ---
def prepare_audio(audio_path, target_sr=16000):
    # Load the clip, resample to 16 kHz, and downmix to mono.
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != target_sr:
        resampler = T.Resample(orig_freq=sample_rate, new_freq=target_sr)
        waveform = resampler(waveform)
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    return waveform.squeeze().numpy()

def transcribe(audio_path, max_new_tokens=128):
    print(f"Loading and preparing audio from: {audio_path}")
    audio_array = prepare_audio(audio_path)

    chat = [
        {"role": "system", "content": [{"type": "text", "text": "You are an ASR assistant."}]},
        {"role": "user", "content": [{"type": "audio", "array": audio_array}, {"type": "text", "text": "Transcribe this."}]},
    ]

    text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    print("Generating transcription...")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    response = processor.decode(outputs[0], skip_special_tokens=True)

    # Keep only the assistant's turn if the template markers are present.
    if "assistant\n" in response:
        return response.split("assistant\n")[-1].strip()
    return response

# --- Example Usage ---
# Download a sample audio file for testing:
# !wget https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac
AUDIO_FILE = "1.flac"

transcription = transcribe(AUDIO_FILE)
print("\n--- TRANSCRIPTION ---")
print(transcription)
print("---------------------")
```

## Deployment and Inference

For optimized inference, especially in a production environment, it is recommended to use serving frameworks like `vLLM`, which can provide significant speedups.

## Training Details

### Training Data

The model was fine-tuned on a subset of the `speechbrain/LargeScaleASR` dataset (recently renamed to `speechbrain/LoquaciousSet`). This dataset comprises 25,000 hours of diverse, transcribed English speech. For this project, a smaller shard consisting of the first two parts of the `small` configuration (`train-0000*` and `train-0001*`) was used for training, and the first part of the test set (`test-00000*`) was used for validation.
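
As a rough illustration of how such a shard can be pulled from the Hub, the snippet below loads only those file patterns with the `datasets` library. The exact parquet paths under the `small` configuration are assumptions about the repository layout; adjust them to the dataset's actual file structure.

```python
from datasets import load_dataset

# Hypothetical file patterns for the shard described above; verify against the
# actual layout of speechbrain/LargeScaleASR (now speechbrain/LoquaciousSet).
data_files = {
    "train": ["small/train-0000*", "small/train-0001*"],
    "validation": ["test/test-00000*"],
}
dataset = load_dataset("speechbrain/LargeScaleASR", data_files=data_files)
print(dataset)
```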

### Training Procedure

The fine-tuning was conducted in two stages to effectively adapt the VLM for audio processing.

#### Stage 1: Audio Adapter Training

In the first stage, the language model and the pre-trained Whisper audio encoder were frozen, and only the newly introduced `audio_proj` layer was trained. This stage aligns the audio feature space with the language model's embedding space (a minimal sketch of the setup follows the hyperparameters below).

- **Learning Rate:** `1e-4`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `4` (effective batch size of 8)
- **Max Steps:** `1000`
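
The following sketch shows what that setup could look like, assuming the custom model exposes the projection layer as an `audio_proj` attribute; the attribute name and the `TrainingArguments` wiring are illustrative, not the project's exact training script.

```python
from transformers import TrainingArguments

# Freeze everything, then unfreeze only the audio projection adapter
# (the `audio_proj` attribute name is assumed here).
for param in model.parameters():
    param.requires_grad = False
for param in model.audio_proj.parameters():
    param.requires_grad = True

stage1_args = TrainingArguments(
    output_dir="stage1-audio-adapter",
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size of 8
    max_steps=1000,
    bf16=True,
    logging_steps=50,
    report_to="wandb",
)
```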

#### Stage 2: QLoRA End-to-End Fine-Tuning

In the second stage, adaptation was extended to the whole network: the model was fine-tuned end-to-end using **QLoRA** (Quantized Low-Rank Adaptation), which significantly reduces memory requirements by quantizing the base model to 4 bits with **NF4 (4-bit NormalFloat)** quantization and training a small number of LoRA adapters on top (see the configuration sketch after the list below).

- **Learning Rate:** `2e-5`
- **Batch Size:** `2` (per device)
- **Gradient Accumulation Steps:** `8` (effective batch size of 16)
- **Epochs:** `1`
- **Quantization:** 4-bit NF4 with `bfloat16` compute dtype
- **LoRA Config:**
  - `r`: 16
  - `lora_alpha`: 32
  - `target_modules`: `['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'audio_proj']`
  - `lora_dropout`: 0.05
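
Putting those numbers together, a Stage 2 configuration with `peft` and `bitsandbytes` might be sketched as follows. The optimizer defaults, the `task_type`, and the assumption that `audio_proj` can be targeted directly by `LoraConfig` are illustrative rather than taken from the training script.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization of the base model with bfloat16 compute dtype.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention/MLP projections and the audio adapter.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj", "audio_proj",
    ],
    task_type="CAUSAL_LM",
)

# Assumes `model` was loaded with quantization_config=bnb_config, as in the
# getting-started snippet above.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

stage2_args = TrainingArguments(
    output_dir="stage2-qlora",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size of 16
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
    report_to="wandb",
)
```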

## Evaluation

The model's performance was monitored using Weights & Biases. The plots below show the training and evaluation loss during the second stage of fine-tuning.

**Training & Evaluation Loss (Stage 2)**

![Loss Curves](https://imgur.com/zXj0jF1.png)

[Chart](https://imgur.com/a/zXj0jF1)

The evaluation loss shows a consistent downward trend, indicating that the model was successfully learning to transcribe speech from the audio data. The training loss also decreased steadily, converging to a low value.

## Citation

If you use this model in your work, please consider citing the original Qwen and Whisper models, as well as this derivative work.

```bibtex
@misc{qwen2_vl_audio_asr,
  author       = {lordChipotle},
  title        = {Qwen2-VL-7B for Speech Understanding},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/lordChipotle/qwen2-vl-audio-7b-qlora}}
}
```