Reason-mxbai-colbert-v0-32m

Reason-mxbai-colbert-v0-32m is a ~32M-parameter late-interaction retriever, fine-tuned from mixedbread-ai/mxbai-edge-colbert-v0-32m for reasoning-intensive retrieval on the BRIGHT benchmark.

It is an edge-scale sibling of Reason-ModernColBERT (150M): the same late-interaction recipe applied to an order-of-magnitude smaller backbone, with a widened projection head (64 → 128 dim) and a two-stage curriculum on VL + BGE-reasoner + ReasonIR-HQ hard negatives.

Average BRIGHT nDCG@10 = 19.00 at ~5× smaller inference cost than Reason-ModernColBERT (150M). See the Evaluation section for per-split numbers.

Model Details

  • Model Type: PyLate ColBERT (late-interaction, multi-vector)
  • Base model: mixedbread-ai/mxbai-edge-colbert-v0-32m
  • Parameters: ~32M (backbone) + widened projection head
  • Document Length (training): 2048 tokens
  • Query Length (training): 256 tokens
  • Output Dimensionality: 128 per token, widened from the base's 64 dims (see the quick shape check after this list)
  • Similarity Function: MaxSim
  • Training Data: reasonir/reasonir-data (VL split + HQ split with hard negatives) and BGE-reasoner BRIGHT-domain triples (see Training)
  • Language: en
  • License: CC-BY-NC-4.0 (inherited from training data)
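
A quick way to confirm the widened output head is to encode a short text and inspect the per-token embedding shape (a minimal sanity check, not part of the original card; it assumes PyLate returns one array of per-token vectors per input):

from pylate import models

# Load the model and encode one document; each input yields a (num_tokens, dim) array.
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")
embs = model.encode(["sanity check"], is_query=False)
print(embs[0].shape)  # expected: (num_tokens, 128)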

Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 127, 'do_lower_case': True}) with ModernBertModel
      hidden_size=384, num_hidden_layers=10, num_attention_heads=6,
      position_embedding_type='sans_pos', max_position_embeddings=7999
  (1): Dense(384 → 768, bias=False)
  (2): Dense(768 → 768, bias=False)
  (3): Dense(768 → 128, bias=False)       # widened from 64 → 128 to give MaxSim more channels
)

The final projection head was widened from 64 → 128 using a small-random initialization (std = 10% of existing row std) so the new channels receive non-zero gradient from the start. See training/widen_colbert_projection.py.
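
A minimal sketch of that widening step (the real logic lives in training/widen_colbert_projection.py; this version assumes the final Dense layer stores its weight as a (64, 768) matrix and simply appends new output rows):

import torch

def widen_projection(weight: torch.Tensor, new_out: int = 128, frac: float = 0.10) -> torch.Tensor:
    """Append extra output channels to a (out_dim, in_dim) projection weight.

    The original 64 rows are kept exactly; the new rows are drawn with a std of
    roughly 10% of the existing rows' std, so they start close to zero yet
    still receive non-zero gradients during fine-tuning.
    """
    old_out, in_dim = weight.shape                     # e.g. (64, 768)
    std = frac * weight.std(dim=1).mean().item()       # "small-random" scale
    new_rows = torch.randn(new_out - old_out, in_dim) * std
    return torch.cat([weight, new_rows], dim=0)        # (128, 768)

Because the first 64 dims are untouched, the widened model initially scores documents almost exactly like the base model; the new channels only start contributing once fine-tuning updates them.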

Why widen the projection?

The base mxbai-edge-colbert-v0-32m outputs 64-dim per-token vectors. On reasoning-intensive retrieval with many structurally similar tokens (code syntax, LaTeX math notation, operator punctuation), 64 channels saturate quickly, and MaxSim discrimination on those splits hits an architectural ceiling. Widening to 128 dims doubles the per-token channel budget, matching Reason-ModernColBERT's output dimensionality. The base weights are preserved exactly on the first 64 dims; only the extra 64 dims are learned during fine-tuning.
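
For reference, the MaxSim score those per-token vectors feed into can be sketched as follows (a generic late-interaction illustration over L2-normalized token matrices, not this repository's code):

import torch

def maxsim(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim over (q_len, dim) and (d_len, dim) token matrices.

    Each query token is matched to its best document token and the per-token
    maxima are summed; more dims per token allows finer-grained matching.
    """
    sim = query_embs @ doc_embs.T          # (q_len, d_len) token-token similarities
    return sim.max(dim=1).values.sum()     # best match per query token, summed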

Usage

First install PyLate:

pip install -U pylate

Retrieval

from pylate import indexes, models, retrieve

# Load the model from the Hub.
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")

# Create an on-disk Voyager index; override=True recreates it if it already exists.
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)

# Encode and index the documents (multi-vector embeddings, one vector per token).
docs = ["document 1 text", "document 2 text", "document 3 text"]
doc_ids = ["1", "2", "3"]
doc_embs = model.encode(docs, batch_size=32, is_query=False, show_progress_bar=True)
index.add_documents(documents_ids=doc_ids, documents_embeddings=doc_embs)

# Encode the query (with the BRIGHT-style instruction prefix) and retrieve the top 10.
retriever = retrieve.ColBERT(index=index)
query_embs = model.encode(
    ["Given a Biology post, retrieve relevant passages that help answer the post.\nQuery: how do cells divide?"],
    is_query=True,
)
scores = retriever.retrieve(queries_embeddings=query_embs, k=10)

Tip: for best results on BRIGHT-style queries, prepend the domain instruction (Given a {Biology|Coding|Math|...} post, retrieve relevant passages...) followed by \nQuery: {raw_query} — that's the format the model was trained on (via the BGE-reasoner data).
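
As a small illustration of that format (the instruction wording is the Biology example from the snippet above; swap in the matching domain for other splits):

instruction = "Given a Biology post, retrieve relevant passages that help answer the post."
raw_query = "how do cells divide?"
query_embs = model.encode([f"{instruction}\nQuery: {raw_query}"], is_query=True)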

Reranking (no index)

from pylate import rank, models

model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0-32m")

# One candidate list per query; the id lists mirror the nested document structure.
queries = ["query A", "query B"]
documents = [["document A", "document B"], ["document 1", "document C", "document B"]]
doc_ids = [[1, 2], [1, 3, 2]]

# Encode queries and candidates, then rerank each candidate list by MaxSim score.
q_embs = model.encode(queries, is_query=True)
d_embs = model.encode(documents, is_query=False)
reranked = rank.rerank(
    documents_ids=doc_ids,
    queries_embeddings=q_embs,
    documents_embeddings=d_embs,
)

Evaluation

BRIGHT Benchmark — nDCG@10

Evaluated with the MTEB BrightRetrieval task via evaluation/evaluate_bright.py, using query_length=256 (32 for the pony split) and document_length=2048, matching the training setup.

Split                   nDCG@10
Biology                   32.71
Earth Science             43.88
Economics                 18.70
Psychology                22.62
Robotics                  18.43
StackOverflow             16.78
Sustainable Living        20.77
LeetCode                  17.67
Pony                      20.73
AoPS                       5.05
TheoremQA (questions)      8.38
TheoremQA (theorems)       2.25
Full mean                 19.00

Raw per-split JSON under results/.

Context

  • Reason-ModernColBERT (150M): 22.62 mean at ~5× the parameter count.
  • Dense single-vector baselines at similar scale (< 1B): ~13-15 mean.
  • Our 64-dim predecessor (mxbai-edge-32m trained on same curriculum, pre-widening): ~18.4 mean.

On the natural-language and instruction-following splits (biology, earth_science, sustainable_living, pony, psychology), the 32M model is competitive with, and on individual splits beats, the 150M Reason-ModernColBERT. It lags on symbol-dense splits (leetcode, stackoverflow, aops, theoremqa) because of architectural choices in the base model: a case-insensitive tokenizer, no global positional embeddings (sans_pos), and a shallow 10-layer backbone. These constraints cannot be trained away and cap performance on code and formal-math retrieval.

Training

Two-stage curriculum on 8 H100 GPUs (2 nodes × 4 GPUs, matching Reason-ModernColBERT's 8-GPU setup), preceded by a one-off projection-widening step:

  1. Widen projection head: small-random init for the new 64 channels, verified non-zero at encode time.
  2. Stage 1 (VL warm-up): 1 epoch on the reasonir/reasonir-data VL split (~181k triples), lr=1e-5, global batch 2048, query_length=256, document_length=2048.
  3. Stage 2 (BGE + HQ-hn polish): 1 epoch on merged BGE-reasoner triples covering the 12 BRIGHT domains (with instruction prefixes) + ReasonIR-HQ with hard negatives (~2.7M triples total), lr=5e-6, global batch 2048.

Training loss

  • pylate.losses.cached_contrastive.CachedContrastive (temperature=1.0, gather_across_devices=True).
  • max_grad_norm=100 (set via env var; default 1.0 over-clips when the widened projection has high bootstrap gradients).
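
A hedged sketch of how such a stage can be run with PyLate's loss on top of the Sentence Transformers trainer (the dataset subset name, checkpoint path, and column layout below are assumptions, not the exact training script):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models, utils

model = models.ColBERT(
    model_name_or_path="path/to/stage1-checkpoint",  # hypothetical local checkpoint
    query_length=256,
    document_length=2048,
)
# Assumed subset name; the real run merged BGE-reasoner and ReasonIR-HQ triples.
train_dataset = load_dataset("reasonir/reasonir-data", "hq", split="train")

# Loss described above: cached contrastive, in-batch negatives gathered across devices.
loss = losses.CachedContrastive(model=model, temperature=1.0, gather_across_devices=True)

args = SentenceTransformerTrainingArguments(
    output_dir="outputs/stage2",
    num_train_epochs=1,
    learning_rate=5e-6,
    per_device_train_batch_size=256,  # 8 GPUs x 256 = 2048 global batch
    bf16=True,
    max_grad_norm=100.0,              # see the over-clipping note above
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()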

Expected total wall-clock time on 2 nodes × 4 H100s: 6-8 hours.

Evaluation (reproduce)

python evaluation/evaluate_bright.py \
    --model_path <path-to-this-model> \
    --model_version baseline \
    --query_length 256 \
    --document_length 2048 \
    --output_root results/

License

apache-2.0 (note: the Model Details section lists CC-BY-NC-4.0 as inherited from the training data)

Citation

If you find this model useful, please cite it along with the upstream work it builds on:

@misc{Reason-mxbai-colbert-v0-32m,
  title={Reason-mxbai-colbert-v0-32m},
  author={Abdelrahman Abdallah and Adam Jatowt},
  url={https://huggingface.co/DataScience-UIBK/Reason-mxbai-colbert-v0-32m},
  year={2025}
}



@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={Chaffin, Antoine},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT},
  year={2025}
}

@misc{mxbai-edge-colbert-v0-32m,
  title={mxbai-edge-colbert-v0-32m},
  author={Mixedbread AI},
  url={https://huggingface.co/mixedbread-ai/mxbai-edge-colbert-v0-32m},
  year={2025}
}

Framework Versions

  • Python: 3.10
  • PyLate: 1.1.7+
  • Sentence Transformers: 4.0.2
  • Transformers: 4.48.2
  • PyTorch: 2.5.1 (CUDA 12.4)
  • Accelerate: 1.1.1
  • Datasets: 2.21.0