# bge-small-rrf-v3

Paper: [vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents](https://arxiv.org/abs/2604.15484) (arXiv:2604.15484)

A 33M-parameter (384-dim) English embedding model, fine-tuned from BAAI/bge-small-en-v1.5 using the self-supervised hybrid-retrieval disagreement signal described in Steffens (2026) and implemented in vstash. Same size and speed as the base model; higher retrieval quality on all three BEIR datasets it was trained on.
## What changed vs v2

bge-small-rrf-v3 is trained with the winning config from the 2026-04-19 H-R9 ablation:

- 2x training volume. 60,000 target triples across three BEIR datasets instead of v2's 30,000. Volume was the single largest lever at this scale: every dataset improved simultaneously when doubling `total_triples`, with no trade-off observed.
- Same corpus balance. `temperature=0.5` sampling keeps the same ratio v2 used; the volume increase scales every dataset proportionally rather than reshuffling.
- Observability. The training pipeline now records NDCG@3 and Recall@100 alongside NDCG@10 in `training_meta.json`.
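Temperature sampling draws each corpus's share of the triple budget in proportion to a fractional power of its size, flattening the imbalance between small and large datasets. A minimal sketch with `alpha=0.5` (p_i ∝ n_i^0.5), using the document counts from the Pipeline section below; vstash's actual weighting may differ in detail:

```python
# Temperature sampling sketch: quota_i ∝ n_i ** alpha, alpha = 0.5.
# Corpus sizes are the doc counts listed in the Pipeline section;
# the exact quantity vstash weights by is an assumption here.
sizes = {"scifact": 5183, "nfcorpus": 3633, "fiqa": 57638}
alpha = 0.5
total_triples = 60_000

weights = {name: n ** alpha for name, n in sizes.items()}
z = sum(weights.values())
quota = {name: round(total_triples * w / z) for name, w in weights.items()}
print(quota)  # FiQA gets the largest share, but far less than its raw 87% of docs
```

With `alpha=1.0` the split would track raw corpus size; lowering alpha toward 0 moves it toward a uniform split.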
## Eval numbers
Evaluated on the full 5-dataset BEIR cut (SciFact, NFCorpus, SciDocs, FiQA, ArguAna) with vstash's production retrieval pipeline (RRF hybrid + adaptive weights + doc-level dedup, wide top-100 candidate pool). Both v2 and v3 re-evaluated under the same pipeline for an apples-to-apples comparison.
### Absolute NDCG@10 vs BM25, ColBERTv2, and previous releases
| Dataset | BM25 | ColBERTv2 | Base | v2 | v3 |
|---|---|---|---|---|---|
| SciFact | 0.665 | 0.693 | 0.9082 | 0.9107 | 0.9361 |
| NFCorpus | 0.325 | 0.344 | 0.3674 | 0.4325 | 0.3927 |
| SciDocs | 0.158 | 0.154 | 0.3637 | 0.3676 | 0.3693 |
| FiQA | 0.236 | 0.356 | 0.6509 | 0.6541 | 0.7506 |
| ArguAna | 0.315 | 0.463 | 0.7686 | 0.7579 | 0.7540 |
| macro | - | - | 0.6118 | 0.6246 | 0.6405 |
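The macro row is the unweighted mean of the five per-dataset scores. Recomputing it from the v3 column of the table:

```python
# Per-dataset NDCG@10 for v3, copied from the table above.
v3 = {"SciFact": 0.9361, "NFCorpus": 0.3927, "SciDocs": 0.3693,
      "FiQA": 0.7506, "ArguAna": 0.7540}

# Macro average: unweighted mean across datasets.
macro = sum(v3.values()) / len(v3)
print(round(macro, 4))  # 0.6405, matching the macro row
```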
### Wins vs ColBERTv2

Both v2 and v3 beat ColBERTv2 on 5/5 BEIR datasets under this pipeline. v3 improves the macro by +0.016 absolute NDCG@10 over v2 (+2.6% relative).
### v3 vs v2, dataset by dataset
| Dataset | v2 | v3 | Winner | Note |
|---|---|---|---|---|
| SciFact | 0.9107 | 0.9361 | v3 | +0.025 absolute |
| FiQA | 0.6541 | 0.7506 | v3 | +0.097 absolute (the big v3 win) |
| SciDocs | 0.3676 | 0.3693 | v3 | within noise |
| NFCorpus | 0.4325 | 0.3927 | v2 | v2 retains the advantage (-0.040 in v3) |
| ArguAna | 0.7579 | 0.7540 | v2 | within noise |
Use v3 by default: it wins or ties on 3/5 datasets, the macro is cleanly higher, and the FiQA improvement is substantial (+0.0965 NDCG@10 over v2, roughly +14% relative). v2 remains the better pick when NFCorpus-style retrieval dominates your workload (keyword-heavy medical / biomedical corpora).
### Supporting metrics
v3's candidate-set health (Recall@100) and head quality (NDCG@3) also improve across the board on datasets where the overall NDCG@10 went up:
- FiQA Recall@100: 0.9188 (v2) -> 0.9867 (v3)
- SciFact NDCG@3: 0.8976 (v2) -> 0.9265 (v3)
Full metrics JSON at `experiments/results/v2_v3_head_to_head.json`. Reproduce the comparison with `python -m experiments.v2_v3_head_to_head` or the Colab notebook `experiments/v2_v3_head_to_head.ipynb`. The full H-R9 ablation (the temperature / volume sweep that produced the v3 config) is in `experiments/retrain_roadmap.md`.
## Usage
### Drop-in via sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Stffens/bge-small-rrf-v3")
embeddings = model.encode(["what is hybrid retrieval?"], normalize_embeddings=True)
```
### Inside vstash

```bash
vstash reindex --model Stffens/bge-small-rrf-v3
```
### As a search / RAG backbone

Same API as bge-small-en-v1.5: 384 dimensions, cosine similarity, instruction-free encoding. Drop into any retrieval stack built around the base model.
## Training recipe

Reproducible via the published notebook `retrain_t1_4_multi_beir.ipynb` after setting `total_triples=60000`. Single command:

```bash
vstash retrain-multi \
  --store scifact=./scifact.db \
  --store nfcorpus=./nfcorpus.db \
  --store fiqa=./fiqa.db \
  --sampling-strategy temperature \
  --sampling-temperature 0.5 \
  --total-triples 60000 \
  --epochs 2 --lr 3e-6 --batch-size 32 \
  --bulk-mine --bulk-eval \
  --seed 42 \
  --output ./bge-small-rrf-v3
```
### Pipeline

1. Ingest SciFact (5,183 docs), NFCorpus (3,633), and FiQA (57,638) into separate vstash stores.
2. Sample training queries from BEIR `queries.jsonl` + qrels (v5 labeled-query recipe).
3. Mine hard negatives via vec-heavy / FTS-heavy RRF disagreement on each store.
4. Train one model on the union with MNRL for 2 epochs.
5. Evaluate per-dataset; promote the candidate only if macro NDCG@10 exceeds the base.
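The RRF-disagreement mining step can be sketched as: fuse vector-heavy and FTS-heavy rankings with reciprocal-rank fusion, then flag documents that rank high under one weighting but low under the other as hard-negative candidates. This is an illustrative reconstruction, not vstash's actual code; the weights, `k=60`, and cutoffs are assumptions:

```python
# Reciprocal-rank fusion: score(d) = sum_i w_i / (k + rank_i(d)).
def rrf(rankings, weights, k=60):
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy top-4 lists for one query; production runs use wide top-100 pools.
vec = ["d1", "d2", "d3", "d4"]   # dense-retrieval ranking
fts = ["d5", "d6", "d1", "d7"]   # full-text-search ranking

vec_heavy = rrf([vec, fts], weights=[0.8, 0.2])
fts_heavy = rrf([vec, fts], weights=[0.2, 0.8])

# Disagreement: docs near the top of one fused list but not the other
# become hard-negative candidates for the mined triples.
disagree = [d for d in vec_heavy[:2] if d not in fts_heavy[:2]]
print(disagree)  # ["d2"]: ranked high only by the vec-heavy fusion
```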
### Hyperparameters
| Key | Value |
|---|---|
| Base model | BAAI/bge-small-en-v1.5 |
| Loss | MultipleNegativesRankingLoss |
| Total training triples | 60,000 (target) / 39,852 (emitted) |
| Sampling | temperature, alpha=0.5 |
| Epochs | 2 |
| Learning rate | 3e-6 |
| Batch size | 32 |
| Warmup steps | 50 |
| max_seq_length | 256 |
| Mixed precision | FP16 (AMP on) |
| Seed | 42 |
| Training hardware | NVIDIA A100 |
| Training time | ~15 minutes |
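MultipleNegativesRankingLoss treats every other in-batch positive as a negative: for a batch of (query, positive) pairs it is cross-entropy over the scaled query-passage similarity matrix, with the diagonal as the target. A dependency-free illustration of the loss (the actual training uses sentence-transformers; `scale=20.0` is that library's default):

```python
import math

def mnrl(sim, scale=20.0):
    """In-batch MNRL: cross-entropy per row, row i's positive is column i."""
    losses = []
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # max-shift for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 3x3 cosine-similarity matrix: diagonal entries are the true pairs.
good = mnrl([[0.9, 0.2, 0.1],
             [0.3, 0.8, 0.2],
             [0.1, 0.3, 0.7]])

# If in-batch negatives outscore the positives, the loss rises sharply.
bad = mnrl([[0.2, 0.9, 0.1],
            [0.8, 0.3, 0.2],
            [0.3, 0.7, 0.1]])
assert good < bad
```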
## Limitations
- English only. The base model and training data are English. Cross-lingual retrieval may regress vs a multilingual model.
- NFCorpus still saturates. Even at 2x volume, NFCorpus NDCG@10 stays around 0.376, short of v5's published 0.409. The gap is likely a model-capacity limit (33M params) and could be closed by a cross-encoder reranker on top; see vstash's T2.4 design doc.
- Domain-specific corpora (clinical, legal, heavily-jargoned) may benefit more from retraining with `vstash retrain-multi --store mydomain=...` on top of v3 than from using v3 out of the box.
## Citation
```bibtex
@software{vstash_bge_small_rrf_v3_2026,
  author = {Steffens, Jayson},
  title = {bge-small-rrf-v3: self-supervised retrieval fine-tune of BGE-small via vstash},
  year = {2026},
  url = {https://huggingface.co/Stffens/bge-small-rrf-v3}
}
```

For vstash itself:
```bibtex
@misc{steffens2026vstash,
  author = {Steffens, Jayson},
  title = {vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents},
  year = {2026},
  eprint = {2604.15484},
  archivePrefix = {arXiv},
  primaryClass = {cs.IR},
  url = {https://arxiv.org/abs/2604.15484}
}

@software{vstash_2026,
  author = {Steffens, Jayson},
  title = {vstash: local-first document memory with instant semantic search},
  year = {2026},
  url = {https://github.com/stffns/vstash}
}
```