Reason-mxbai-colbert-v0-32m
Reason-mxbai-colbert-v0-32m is a ~32M-parameter late-interaction retriever, fine-tuned from mixedbread-ai/mxbai-edge-colbert-v0-32m for reasoning-intensive retrieval on the BRIGHT benchmark.
It is an edge-scale sibling of Reason-ModernColBERT (150M): the same late-interaction recipe applied to an order-of-magnitude smaller backbone, with a widened projection head (64 → 128 dim) and a two-stage curriculum on VL + BGE-reasoner + ReasonIR-HQ hard negatives.
Average BRIGHT nDCG@10 = 19.00 at ~5× smaller inference cost than Reason-ModernColBERT (150M). See the Evaluation section for per-split numbers.
Model Details
- Model Type: PyLate ColBERT (late-interaction, multi-vector)
- Base model: mixedbread-ai/mxbai-edge-colbert-v0-32m
- Parameters: ~32M (backbone) + widened projection head
- Document Length (training): 2048 tokens
- Query Length (training): 256 tokens
- Output Dimensionality: 128 per token (widened from the base's 64-dim; see the shape check after this list)
- Similarity Function: MaxSim
- Training Data:
  - hanhainebula/bge-reasoner-data — instruction-prefixed triples spanning the 12 BRIGHT domains
  - reasonir/reasonir-data — the vl split (warmup) and the hq split with hard negatives (polish)
- Language: en
- License: CC-BY-NC-4.0 (inherited from training data)
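As a quick check of the per-token output dimensionality listed above, the snippet below encodes one document and inspects the embedding shape. It assumes (as in the Usage section below) that encode returns one array of per-token vectors per input document.

from pylate import models

# Load the published checkpoint; every token is projected to a 128-dim vector.
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")

doc_embeddings = model.encode(["Mitosis is the process by which a cell divides."], is_query=False)
print(doc_embeddings[0].shape)  # expected: (num_tokens, 128), i.e. the widened projection head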
Model Architecture
ColBERT(
(0): Transformer({'max_seq_length': 127, 'do_lower_case': True}) with ModernBertModel
hidden_size=384, num_hidden_layers=10, num_attention_heads=6,
position_embedding_type='sans_pos', max_position_embeddings=7999
(1): Dense(384 → 768, bias=False)
(2): Dense(768 → 768, bias=False)
(3): Dense(768 → 128, bias=False) # widened from 64 → 128 to give MaxSim more channels
)
The final projection head was widened from 64 → 128 using a small-random initialization (std = 10% of existing row std) so the new channels receive non-zero gradient from the start. See training/widen_colbert_projection.py.
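For reference, the widening step can be sketched as follows. This is an illustrative reconstruction rather than the actual training/widen_colbert_projection.py: it keeps the original 64 output rows of the final Dense weight and appends 64 new rows drawn with a standard deviation equal to 10% of the existing weights' std (the card specifies 10% of the existing row std; this sketch uses the overall std for brevity).

import torch

def widen_projection_weight(old_weight: torch.Tensor, new_out_dim: int = 128, rel_std: float = 0.10) -> torch.Tensor:
    # old_weight: (64, 768) weight of the final projection, laid out as (out_features, in_features).
    old_out, in_dim = old_weight.shape
    # New channels start small but non-zero, so they receive gradient from the first training step.
    new_rows = torch.randn(new_out_dim - old_out, in_dim, dtype=old_weight.dtype) * (rel_std * old_weight.std())
    # The original 64 dims are preserved exactly; only the appended 64 dims are learned during fine-tuning.
    return torch.cat([old_weight, new_rows], dim=0)  # (128, 768)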
Why widen the projection?
The base mxbai-edge-colbert-v0-32m outputs 64-dim per-token vectors. On reasoning-intensive retrieval with many structurally-similar tokens (code syntax, LaTeX math notation, operator punctuation), 64 channels saturate fast — MaxSim discrimination on those splits hits an architectural ceiling. Widening to 128-dim doubles the per-token channel budget, matching Reason-ModernColBERT's output dimensionality. The base weights are preserved exactly on the first 64 dims; only the extra 64 dims are learned during fine-tuning.
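For concreteness, MaxSim can be written in a few lines of plain PyTorch (a minimal sketch, assuming both sides are already L2-normalized per token, as PyLate does when encoding):

import torch

def maxsim(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (Q, dim), doc_tokens: (D, dim). With dim=128, each query token has twice
    # the channel budget of the 64-dim base when it picks its best-matching document token.
    token_sims = query_tokens @ doc_tokens.T       # (Q, D) per-token cosine similarities
    return token_sims.max(dim=1).values.sum()      # best document token per query token, summed over the query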
Usage
First install PyLate:
pip install -U pylate
Retrieval
from pylate import indexes, models, retrieve

# Load the model and build a Voyager index for the document collection.
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)

docs = ["document 1 text", "document 2 text", "document 3 text"]
doc_ids = ["1", "2", "3"]

# Encode documents (is_query=False) and add them to the index.
doc_embs = model.encode(docs, batch_size=32, is_query=False, show_progress_bar=True)
index.add_documents(documents_ids=doc_ids, documents_embeddings=doc_embs)

# Encode the query (is_query=True) and retrieve the top-10 documents by MaxSim.
retriever = retrieve.ColBERT(index=index)
query_embs = model.encode(
    ["Given a Biology post, retrieve relevant passages that help answer the post.\nQuery: how do cells divide?"],
    is_query=True,
)
scores = retriever.retrieve(queries_embeddings=query_embs, k=10)
Tip: for best results on BRIGHT-style queries, prepend the domain instruction (Given a {Biology|Coding|Math|...} post, retrieve relevant passages...) followed by \nQuery: {raw_query} — that's the format the model was trained on (via the BGE-reasoner data).
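A small helper keeps that prefix consistent across queries; make_bright_query below is a hypothetical convenience function for illustration, not part of PyLate:

def make_bright_query(domain: str, raw_query: str) -> str:
    # Mirrors the instruction-prefixed format used in the BGE-reasoner training data.
    return (
        f"Given a {domain} post, retrieve relevant passages that help answer the post."
        f"\nQuery: {raw_query}"
    )

query_embs = model.encode([make_bright_query("Biology", "how do cells divide?")], is_query=True)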
Reranking (no index)
from pylate import rank, models

model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")

queries = ["query A", "query B"]
# One candidate list per query; candidates are rescored with MaxSim, no index required.
documents = [["document A", "document B"], ["document 1", "document C", "document B"]]
doc_ids = [[1, 2], [1, 3, 2]]

q_embs = model.encode(queries, is_query=True)
d_embs = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=doc_ids,
    queries_embeddings=q_embs,
    documents_embeddings=d_embs,
)
Evaluation
BRIGHT Benchmark — nDCG@10
Evaluated with the MTEB BrightRetrieval task via evaluation/evaluate_bright.py. query_length=256 (pony=32) and document_length=2048 (matches training setup).
| Split | nDCG@10 |
|---|---|
| Biology | 32.71 |
| Earth Science | 43.88 |
| Economics | 18.70 |
| Psychology | 22.62 |
| Robotics | 18.43 |
| Stackoverflow | 16.78 |
| Sustainable Living | 20.77 |
| Leetcode | 17.67 |
| Pony | 20.73 |
| AoPS | 5.05 |
| TheoremQA (questions) | 8.38 |
| TheoremQA (theorems) | 2.25 |
| Full mean | 19.00 |
Raw per-split JSON under results/.
Context
- Reason-ModernColBERT (150M): 22.62 mean at ~5× the parameter count.
- Dense single-vector baselines at similar scale (< 1B): ~13-15 mean.
- Our 64-dim predecessor (mxbai-edge-32m trained on same curriculum, pre-widening): ~18.4 mean.
On the natural-language and instruction-following splits (biology, earth_science, sustainable_living, pony, psychology), the 32M model is competitive with, and on some splits beats, the 150M Reason-ModernColBERT. It lags on the symbol-dense splits (leetcode, stackoverflow, aops, theoremqa) because of architectural choices in the base model: a case-insensitive tokenizer, no global positional embeddings (sans_pos), and a shallow 10-layer backbone. These cannot be recovered by training and cap performance on code and formal-math retrieval.
Training
Two-stage curriculum on 8 H100 GPUs (2 nodes × 4 GPUs, matching Reason-ModernColBERT's 8-GPU setup):
- Widen projection head: Small-random init for the new 64 channels, verified non-zero at encode time.
- Stage 1 (VL warmup): 1 epoch on the reasonir/reasonir-data VL split (~181k triples), lr=1e-5, global batch 2048, query_length=256, document_length=2048.
- Stage 2 (BGE + HQ-hn polish): 1 epoch on merged BGE-reasoner (instruction-prefixed triples spanning the 12 BRIGHT domains) + ReasonIR-HQ with hard negatives (~2.7M triples total), lr=5e-6, global batch 2048.
Training loss
pylate.losses.cached_contrastive.CachedContrastive with temperature=1.0 and gather_across_devices=True. max_grad_norm=100 (set via environment variable; the default of 1.0 over-clips when the widened projection produces large bootstrap gradients).
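As a minimal sketch of how that loss object is constructed (the checkpoint path and the surrounding training loop are assumptions; only the class and the two hyperparameters come from the description above):

from pylate import models
from pylate.losses.cached_contrastive import CachedContrastive

model = models.ColBERT(model_name_or_path="path/to/widened-checkpoint")  # hypothetical local path

# GradCache-style contrastive loss: activation caching lets the 2048 global batch fit in memory;
# temperature and cross-device gathering match the training settings listed above.
train_loss = CachedContrastive(
    model=model,
    temperature=1.0,
    gather_across_devices=True,
)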
Expected total wall-clock time on 2 nodes × 4 H100 GPUs: 6-8 hours.
Evaluation (reproduce)
python evaluation/evaluate_bright.py \
--model_path <path-to-this-model> \
--model_version baseline \
--query_length 256 \
--document_length 2048 \
--output_root results/
License
apache-2.0
Citation
If you find this model useful, please cite it together with the upstream work it builds on:
@misc{Reason-mxbai-colbert-v0-32m,
title={Reason-mxbai-colbert-v0-32m},
author={Abdelrahman Abdallah and Adam Jatowt},
url={https://huggingface.co/DataScience-UIBK/Reason-mxbai-colbert-v0-32m},
year={2025}
}
@misc{Reason-ModernColBERT,
title={Reason-ModernColBERT},
author={Chaffin, Antoine},
url={https://huggingface.co/lightonai/Reason-ModernColBERT},
year={2025}
}
@misc{mxbai-edge-colbert-v0-32m,
title={mxbai-edge-colbert-v0-32m},
author={Mixedbread AI},
url={https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m},
year={2025}
}
Framework Versions
- Python: 3.10
- PyLate: 1.1.7+
- Sentence Transformers: 4.0.2
- Transformers: 4.48.2
- PyTorch: 2.5.1 (CUDA 12.4)
- Accelerate: 1.1.1
- Datasets: 2.21.0