# ColQwen3.5-v1
A visual document retrieval model based on Qwen3.5-4B, trained using the ColPali late-interaction architecture. Generates multi-vector embeddings from document page images for token-level query-document matching via MaxSim.
4.5B parameters | 320-dim embeddings | LoRA fine-tuned (r=32) | BF16
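The MaxSim matching described above can be sketched in a few lines. This is an illustrative numpy version (the name `maxsim_score` is ours, not a `colpali_engine` API), assuming the model's token embeddings are L2-normalized so dot products act as cosine similarities:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one page.

    query_emb: (q_len, dim) query token embeddings
    doc_emb:   (d_len, dim) document-page token embeddings
    Assumes rows are L2-normalized, so dot products are cosine similarities.
    """
    sim = query_emb @ doc_emb.T           # (q_len, d_len) token-token similarities
    return float(sim.max(axis=1).sum())   # best page token per query token, summed
```

Summing each query token's best match rewards pages that cover every part of the query, which is the intuition behind late interaction versus single-vector pooling.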
## Benchmark Results
Evaluated with MTEB v2.10.8. Best score per dataset in bold.
### ViDoRe V1 (nDCG@5)
| Dataset | ColQwen3.5-v1 | Nemotron 4B | Nemotron 3B | TomoroAI 4B | Jina v4 |
|---|---|---|---|---|---|
| ArxivQA | **0.9250** | 0.9203 | 0.8835 | 0.9066 | 0.8846 |
| DocVQA | 0.6628 | **0.6739** | 0.6621 | 0.6624 | 0.6014 |
| InfoVQA | 0.9388 | 0.9331 | **0.9492** | 0.9429 | 0.9379 |
| ShiftProject | **0.9298** | 0.9226 | 0.9070 | 0.8739 | 0.9293 |
| SynDocQA AI | **1.0000** | 0.9926 | 0.9963 | 0.9926 | 0.9926 |
| SynDocQA Energy | 0.9639 | 0.9619 | 0.9663 | 0.9691 | **0.9726** |
| SynDocQA Gov | 0.9782 | **0.9802** | 0.9782 | 0.9717 | 0.9659 |
| SynDocQA Health | 0.9889 | 0.9852 | 0.9926 | **0.9963** | 0.9913 |
| TabFQuAD | 0.9374 | **0.9805** | 0.9594 | 0.9433 | 0.9560 |
| TatDQA | **0.8409** | 0.8119 | 0.8057 | 0.7983 | 0.8035 |
| Average | **0.9166** | 0.9162 | 0.9100 | 0.9057 | 0.9035 |
### ViDoRe V3 English (nDCG@5)
| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| ComputerScience | **0.7734** | 0.7419 | 0.7514 | 0.7175 | 0.7528 |
| Energy | **0.6294** | 0.6023 | 0.5838 | 0.5842 | 0.5824 |
| FinanceEn | 0.6599 | **0.6753** | 0.6712 | 0.6417 | 0.6041 |
| FinanceFr | **0.4241** | 0.4202 | 0.3730 | 0.3859 | 0.3877 |
| Hr | **0.6353** | 0.6037 | 0.6256 | 0.6206 | 0.6060 |
| Industrial | 0.5574 | **0.5787** | 0.5447 | 0.5443 | 0.5229 |
| Pharmaceuticals | 0.6551 | **0.6612** | 0.6524 | 0.6303 | 0.6226 |
| Physics | **0.4688** | 0.4640 | 0.4128 | 0.4191 | 0.4423 |
| Average | **0.6004** | 0.5934 | 0.5769 | 0.5680 | 0.5651 |
### ViDoRe V3 Multilingual (nDCG@5)
| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| ComputerScience | 0.7506 | 0.7419 | 0.7514 | 0.7175 | **0.7528** |
| Energy | **0.6766** | 0.6023 | 0.5838 | 0.5842 | 0.5824 |
| FinanceEn | 0.5791 | **0.6753** | 0.6712 | 0.6417 | 0.6041 |
| FinanceFr | **0.4769** | 0.4202 | 0.3730 | 0.3859 | 0.3877 |
| Hr | 0.5843 | 0.6037 | **0.6256** | 0.6206 | 0.6060 |
| Industrial | 0.4908 | **0.5787** | 0.5447 | 0.5443 | 0.5229 |
| Pharmaceuticals | 0.6264 | **0.6612** | 0.6524 | 0.6303 | 0.6226 |
| Physics | **0.4790** | 0.4640 | 0.4128 | 0.4191 | 0.4423 |
| Average | 0.5830 | **0.5934** | 0.5769 | 0.5680 | 0.5651 |
### ViDoRe V2 English (nDCG@5)
| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| BioMedicalLectures | 0.6705 | **0.6718** | 0.6518 | 0.6359 | 0.6479 |
| ESGReportsHL | 0.6998 | 0.7465 | **0.7538** | 0.6512 | 0.6871 |
| ESGReports | 0.5817 | **0.6300** | 0.6030 | 0.5194 | 0.5498 |
| EconomicsReports | 0.5947 | 0.5910 | **0.6619** | 0.5955 | 0.5955 |
| Average | 0.6367 | 0.6598 | **0.6676** | 0.6005 | 0.6201 |
### ViDoRe V2 Multilingual (nDCG@5)
| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| BioMedicalLectures | 0.6338 | **0.6718** | 0.6518 | 0.6359 | 0.6479 |
| ESGReports | 0.5510 | **0.6300** | 0.6030 | 0.5194 | 0.5498 |
| EconomicsReports | 0.5420 | 0.5910 | **0.6619** | 0.5955 | 0.5955 |
| Average | 0.5756 | 0.6309 | **0.6389** | 0.5836 | 0.5977 |
## Limitations
Performance is strongly tied to training data coverage. The model excels on general documents, finance, and multilingual content where training data was available, but lags on specialized domains (industrial, HR, ESG, biomedical, economics) that were not represented in training. We expect targeted domain data to close these gaps in future releases.
## Usage

```python
import torch
from PIL import Image

from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor

model = ColQwen3_5.from_pretrained(
    "athrael-soju/colqwen3.5-v1",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
)
processor = ColQwen3_5Processor.from_pretrained("athrael-soju/colqwen3.5-v1")

# Embed document images
images = [Image.open("page1.png"), Image.open("page2.png")]
batch = processor.process_images(images).to(model.device)
with torch.no_grad():
    doc_embeddings = model(**batch)  # (batch, seq_len, 320)

# Embed queries
queries = ["What is the revenue for Q4?", "Show me the organizational chart"]
batch = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    model.rope_deltas = None  # Reset for text-only input
    query_embeddings = model(**batch)  # (batch, seq_len, 320)

# Score with late interaction (MaxSim)
scores = processor.score(query_embeddings, doc_embeddings)
```
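The score matrix has one row per query and one column per page; a small helper (hypothetical, not part of `colpali_engine`) turns it into per-query page rankings:

```python
import numpy as np

def top_k_pages(scores, k: int = 2):
    """Return the indices of the k best-scoring pages for each query.

    scores: (num_queries, num_docs) late-interaction score matrix.
    """
    scores = np.asarray(scores)
    return np.argsort(-scores, axis=1)[:, :k]  # descending score order per row
```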
## Training
4-phase LoRA pipeline with seed averaging at each stage:
- Base training on ~761K query-image pairs (8 datasets, 5 languages)
- Seed averaging (4 seeds)
- Hard negative retraining (mined ~6 negatives per sample)
- Domain specialization on finance + table documents with 20% replay
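Seed averaging above refers to combining runs trained from different random seeds into one checkpoint. A minimal sketch, assuming uniform averaging (the weighting scheme is our assumption) and checkpoints loaded as name-to-array dicts; `average_checkpoints` is an illustrative name, not a released utility:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniform parameter averaging ("model soup" over seeds).

    checkpoints: list of dicts mapping parameter name -> np.ndarray,
    all from the same architecture (matching keys and shapes).
    """
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in checkpoints[0]
    }
```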
## Acknowledgments
Training compute provided by Vultr.
## Citation

```bibtex
@misc{colqwen35,
  title={ColQwen3.5: Visual Document Retrieval with Hybrid Linear-Attention Models},
  author={athrael-soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/colqwen3.5-v1}
}
```
## License
Apache 2.0