# ColQwen3.5-v1

A visual document retrieval model based on Qwen3.5-4B, trained with the ColPali late-interaction recipe. It generates multi-vector embeddings from document page images and matches queries to documents at the token level via MaxSim.
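Concretely, MaxSim scores a query $Q$ against a document page $D$ by matching each query token embedding $q_i$ to its best-matching document token embedding $d_j$ and summing over query tokens:

```latex
s(Q, D) = \sum_{i=1}^{|Q|} \max_{j = 1, \dots, |D|} \langle q_i, d_j \rangle
```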

4.5B parameters | 320-dim embeddings | LoRA fine-tuned (r=32) | BF16

## Benchmark Results

Evaluated with MTEB v2.10.8. Best score per dataset in bold.

### ViDoRe V1 (nDCG@5)

| Dataset | ColQwen3.5-v1 | Nemotron 4B | Nemotron 3B | TomoroAI 4B | Jina v4 |
|---|---|---|---|---|---|
| ArxivQA | **0.9250** | 0.9203 | 0.8835 | 0.9066 | 0.8846 |
| DocVQA | 0.6628 | **0.6739** | 0.6621 | 0.6624 | 0.6014 |
| InfoVQA | 0.9388 | 0.9331 | **0.9492** | 0.9429 | 0.9379 |
| ShiftProject | **0.9298** | 0.9226 | 0.9070 | 0.8739 | 0.9293 |
| SynDocQA AI | **1.0000** | 0.9926 | 0.9963 | 0.9926 | 0.9926 |
| SynDocQA Energy | 0.9639 | 0.9619 | 0.9663 | 0.9691 | **0.9726** |
| SynDocQA Gov | 0.9782 | **0.9802** | 0.9782 | 0.9717 | 0.9659 |
| SynDocQA Health | 0.9889 | 0.9852 | 0.9926 | **0.9963** | 0.9913 |
| TabFQuAD | 0.9374 | **0.9805** | 0.9594 | 0.9433 | 0.9560 |
| TatDQA | **0.8409** | 0.8119 | 0.8057 | 0.7983 | 0.8035 |
| **Average** | **0.9166** | 0.9162 | 0.9100 | 0.9057 | 0.9035 |

### ViDoRe V3 — English (nDCG@5)

| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| ComputerScience | **0.7734** | 0.7419 | 0.7514 | 0.7175 | 0.7528 |
| Energy | **0.6294** | 0.6023 | 0.5838 | 0.5842 | 0.5824 |
| FinanceEn | 0.6599 | **0.6753** | 0.6712 | 0.6417 | 0.6041 |
| FinanceFr | **0.4241** | 0.4202 | 0.3730 | 0.3859 | 0.3877 |
| Hr | **0.6353** | 0.6037 | 0.6256 | 0.6206 | 0.6060 |
| Industrial | 0.5574 | **0.5787** | 0.5447 | 0.5443 | 0.5229 |
| Pharmaceuticals | 0.6551 | **0.6612** | 0.6524 | 0.6303 | 0.6226 |
| Physics | **0.4688** | 0.4640 | 0.4128 | 0.4191 | 0.4423 |
| **Average** | **0.6004** | 0.5934 | 0.5769 | 0.5680 | 0.5651 |

### ViDoRe V3 — Multilingual (nDCG@5)

| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| ComputerScience | 0.7506 | 0.7419 | 0.7514 | 0.7175 | **0.7528** |
| Energy | **0.6766** | 0.6023 | 0.5838 | 0.5842 | 0.5824 |
| FinanceEn | 0.5791 | **0.6753** | 0.6712 | 0.6417 | 0.6041 |
| FinanceFr | **0.4769** | 0.4202 | 0.3730 | 0.3859 | 0.3877 |
| Hr | 0.5843 | 0.6037 | **0.6256** | 0.6206 | 0.6060 |
| Industrial | 0.4908 | **0.5787** | 0.5447 | 0.5443 | 0.5229 |
| Pharmaceuticals | 0.6264 | **0.6612** | 0.6524 | 0.6303 | 0.6226 |
| Physics | **0.4790** | 0.4640 | 0.4128 | 0.4191 | 0.4423 |
| **Average** | 0.5830 | **0.5934** | 0.5769 | 0.5680 | 0.5651 |

### ViDoRe V2 — English (nDCG@5)

| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| BioMedicalLectures | 0.6705 | **0.6718** | 0.6518 | 0.6359 | 0.6479 |
| ESGReportsHL | 0.6998 | 0.7465 | **0.7538** | 0.6512 | 0.6871 |
| ESGReports | 0.5817 | **0.6300** | 0.6030 | 0.5194 | 0.5498 |
| EconomicsReports | 0.5947 | 0.5910 | **0.6619** | 0.5955 | 0.5955 |
| **Average** | 0.6367 | 0.6598 | **0.6676** | 0.6005 | 0.6201 |

### ViDoRe V2 — Multilingual (nDCG@5)

| Dataset | ColQwen3.5-v1 | TomoroAI 4B | Nemotron 3B | Jina v4 | ColNomic 7B |
|---|---|---|---|---|---|
| BioMedicalLectures | 0.6338 | **0.6718** | 0.6518 | 0.6359 | 0.6479 |
| ESGReports | 0.5510 | **0.6300** | 0.6030 | 0.5194 | 0.5498 |
| EconomicsReports | 0.5420 | 0.5910 | **0.6619** | 0.5955 | 0.5955 |
| **Average** | 0.5756 | 0.6309 | **0.6389** | 0.5836 | 0.5977 |

## Limitations

Performance is strongly tied to training data coverage. The model excels on general documents, finance, and multilingual content where training data was available, but lags on specialized domains (industrial, HR, ESG, biomedical, economics) that were not represented in training. We expect targeted domain data to close these gaps in future releases.

## Usage

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor

model = ColQwen3_5.from_pretrained(
    "athrael-soju/colqwen3.5-v1",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
)
processor = ColQwen3_5Processor.from_pretrained("athrael-soju/colqwen3.5-v1")

# Embed document images
images = [Image.open("page1.png"), Image.open("page2.png")]
batch = processor.process_images(images).to(model.device)
with torch.no_grad():
    doc_embeddings = model(**batch)  # (batch, seq_len, 320)

# Embed queries
queries = ["What is the revenue for Q4?", "Show me the organizational chart"]
batch = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    model.rope_deltas = None  # Reset for text-only input
    query_embeddings = model(**batch)  # (batch, seq_len, 320)

# Score with late interaction (MaxSim)
scores = processor.score(query_embeddings, doc_embeddings)
```
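The `processor.score` call computes late-interaction (MaxSim) scores. As a rough sketch of what it does (illustrative only, not the library's actual implementation; real batches are padded to a common token length and query padding is masked out):

```python
import torch

def maxsim_scores(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """For each query token, take the maximum dot-product similarity
    over all document tokens, then sum over query tokens."""
    # query_emb: (num_queries, q_len, dim); doc_emb: (num_docs, d_len, dim)
    # Pairwise token similarities: (num_queries, num_docs, q_len, d_len)
    sims = torch.einsum("qtd,psd->qpts", query_emb, doc_emb)
    # MaxSim: best document token per query token, summed over query tokens
    return sims.max(dim=-1).values.sum(dim=-1)  # (num_queries, num_docs)

# Tiny worked example: one query with two tokens, one document with two tokens
q = torch.tensor([[[1.0, 0.0], [0.0, 1.0]]])  # (1, 2, 2)
d = torch.tensor([[[1.0, 0.0], [1.0, 1.0]]])  # (1, 2, 2)
scores = maxsim_scores(q, d)
print(scores)  # tensor([[2.]]): each query token finds a match with similarity 1
```

Because every query token is matched independently, MaxSim rewards pages that cover all parts of the query rather than just its dominant topic.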

## Training

Four-phase LoRA pipeline, with seed averaging applied at each stage:

  1. Base training on ~761K query-image pairs (8 datasets, 5 languages)
  2. Seed averaging (4 seeds)
  3. Hard negative retraining (mined ~6 negatives per sample)
  4. Domain specialization on finance + table documents with 20% replay
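Seed averaging can be thought of as a uniform "model soup" over the LoRA adapter weights of runs that differ only in random seed. A minimal sketch under that assumption (the helper below is illustrative, not the actual training code):

```python
import torch

def average_state_dicts(state_dicts: list[dict]) -> dict:
    """Uniformly average matching tensors across checkpoints,
    e.g. LoRA adapter weights from runs with different seeds."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Toy example: two "seeds" of a one-tensor adapter
sd_a = {"lora_A.weight": torch.tensor([[1.0, 2.0]])}
sd_b = {"lora_A.weight": torch.tensor([[3.0, 4.0]])}
avg = average_state_dicts([sd_a, sd_b])
print(avg["lora_A.weight"])  # tensor([[2., 3.]])
```

Averaging only makes sense when the checkpoints share an architecture and initialization lineage, which is why it is applied per stage rather than across stages.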

## Acknowledgments

Training compute provided by Vultr.

## Citation

```bibtex
@misc{colqwen35,
  title={ColQwen3.5: Visual Document Retrieval with Hybrid Linear-Attention Models},
  author={athrael-soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/colqwen3.5-v1}
}
```

## License

Apache 2.0
