Jina Affiliation Reranker
A cross-encoder reranker fine-tuned for affiliation string matching. Given a pair of affiliation strings, it predicts how likely they are to refer to the same institution.
Use Case
This model is designed for matching and disambiguating messy real-world affiliation strings against canonical institution records from the Research Organization Registry (ROR).
Examples of what it handles:
- Abbreviations: "MIT" ↔ "Massachusetts Institute of Technology"
- Word reordering: "University of Oxford" ↔ "Oxford University"
- Partial matches: "Dept. of Physics, Stanford" ↔ "Stanford University"
- International variants: "東京大学" ↔ "University of Tokyo"
- OCR noise: "Univ ersity of Cal ifornia" ↔ "University of California"
Usage
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cometadata/jina-reranker-v2-multilingual-affiliations",
    trust_remote_code=True,
)

# Score affiliation pairs (higher = more likely same institution)
pairs = [
    ["University of California, Berkeley", "UC Berkeley"],
    ["University of California, Berkeley", "Berkeley College"],
]
scores = model.predict(pairs)
# e.g. [0.82, 0.15]: the first pair matches, the second does not

# Rank candidates for an affiliation string
results = model.rank(
    "MIT, Cambridge, MA",
    [
        "Massachusetts Institute of Technology",
        "MIT University (India)",
        "University of Cambridge",
    ],
)
# Returns a list of dicts (corpus_id, score), sorted by descending score
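When linking against a registry, the top-ranked candidate is not always a true match, so it can help to apply a score cutoff before accepting it. The sketch below shows one way to do that. It assumes the list-of-dicts format that `CrossEncoder.rank` returns (`corpus_id` and `score` keys). The helper name `pick_match`, the 0.5 threshold, and the hand-written scores are illustrative assumptions, not values from this model card.

```python
from typing import Optional, Sequence


def pick_match(
    ranked: Sequence[dict],
    candidates: Sequence[str],
    threshold: float = 0.5,  # assumed cutoff; tune it on your own labeled data
) -> Optional[str]:
    """Return the best-scoring candidate, or None if nothing clears the threshold.

    `ranked` is expected in the format produced by `CrossEncoder.rank`:
    a list of dicts with `corpus_id` (index into `candidates`) and `score`.
    """
    if not ranked:
        return None
    best = max(ranked, key=lambda r: r["score"])
    if best["score"] < threshold:
        return None
    return candidates[best["corpus_id"]]


candidates = [
    "Massachusetts Institute of Technology",
    "MIT University (India)",
    "University of Cambridge",
]

# Hand-written scores standing in for real model output (illustrative only).
example_ranked = [
    {"corpus_id": 0, "score": 0.91},
    {"corpus_id": 2, "score": 0.12},
    {"corpus_id": 1, "score": 0.08},
]

print(pick_match(example_ranked, candidates))
```

With the real model, `ranked` would come from `model.rank("MIT, Cambridge, MA", candidates)`. Returning `None` for low scores leaves room for a manual-review queue instead of forcing a link to the nearest wrong institution.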
Training
Base Model: jinaai/jina-reranker-v2-base-multilingual
Dataset: cometadata/triplet-loss-for-embedding-affiliations-sample-1
- ~8K triplets (anchor, positive, negative)
- 80% hard negatives (similar but different institutions)
- 20% easy negatives (clearly different institutions)
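The training configuration in this section lists `BinaryCrossEntropyLoss`, which trains on labeled pairs, not triplets. The model card does not say how the triplets were converted, so the following is a minimal sketch of one common approach: each triplet is split into a positive pair (label 1) and a negative pair (label 0). The function name `triplets_to_pairs` and the sample triplet are hypothetical.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]        # (anchor, positive, negative)
LabeledPair = Tuple[str, str, float]  # (text_a, text_b, label)


def triplets_to_pairs(triplets: Iterable[Triplet]) -> List[LabeledPair]:
    """Expand each triplet into one positive and one negative labeled pair."""
    pairs: List[LabeledPair] = []
    for anchor, positive, negative in triplets:
        pairs.append((anchor, positive, 1.0))  # same institution
        pairs.append((anchor, negative, 0.0))  # different institution
    return pairs


# Hypothetical triplet, not taken from the dataset.
sample = [
    ("UC Berkeley", "University of California, Berkeley", "Berkeley College"),
]
pairs = triplets_to_pairs(sample)
for text_a, text_b, label in pairs:
    print(label, "|", text_a, "|", text_b)
```

Under this assumption, each triplet yields two training pairs, so the labels stay balanced between matches and non-matches.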
Configuration:
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Loss | BinaryCrossEntropyLoss |
| Validation split | 15% |
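A 15% validation split like the one in the table can be produced with a seeded shuffle. This is a sketch only: the model card does not state the seed or the splitting method, so `seed=42`, the helper name `train_val_split`, and the round figure of 8,000 items (standing in for "~8K triplets") are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_val_split(
    items: Sequence[T],
    val_fraction: float = 0.15,  # matches the "Validation split" row
    seed: int = 42,              # assumed; not stated in the model card
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and hold out `val_fraction` for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# 8,000 placeholder items standing in for the ~8K triplets.
train, val = train_val_split(range(8000))
print(len(train), len(val))  # 6800 1200
```

Splitting at the triplet level, before any expansion into pairs, keeps both pairs from the same anchor on the same side of the split.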
Evaluation
Evaluated on 300 test cases across 10 difficulty tiers:
| Tier | Cases | Base Model | Fine-tuned | Δ |
|---|---|---|---|---|
| Baseline | 30 | 100.0% | 100.0% | — |
| OCR/Noise | 30 | 100.0% | 100.0% | — |
| Abbreviations | 40 | 60.0% | 80.0% | +20.0% |
| Hierarchical | 35 | 71.4% | 77.1% | +5.7% |
| Medical/Hospital | 25 | 64.0% | 68.0% | +4.0% |
| Research Labs | 25 | 80.0% | 84.0% | +4.0% |
| International | 35 | 82.9% | 91.4% | +8.6% |
| Disambiguation | 31 | 45.2% | 51.6% | +6.5% |
| Negative Controls | 19 | 100.0% | 100.0% | — |
| Ultra-Hard | 30 | 93.3% | 96.7% | +3.3% |
Overall: accuracy 78.3% → 84.3% (+6.0 percentage points), MRR 0.873 → 0.913
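The overall accuracy follows from the per-tier table as a case-weighted average, and MRR is the mean of the reciprocal rank of the correct candidate. The sketch below recomputes the overall accuracy from the table and includes a small MRR helper. The model card does not publish its evaluation script, so the function names and the toy ranks are illustrative assumptions.

```python
from typing import Sequence, Tuple

# (tier, cases, base accuracy %, fine-tuned accuracy %), copied from the table
TIERS: Sequence[Tuple[str, int, float, float]] = [
    ("Baseline", 30, 100.0, 100.0),
    ("OCR/Noise", 30, 100.0, 100.0),
    ("Abbreviations", 40, 60.0, 80.0),
    ("Hierarchical", 35, 71.4, 77.1),
    ("Medical/Hospital", 25, 64.0, 68.0),
    ("Research Labs", 25, 80.0, 84.0),
    ("International", 35, 82.9, 91.4),
    ("Disambiguation", 31, 45.2, 51.6),
    ("Negative Controls", 19, 100.0, 100.0),
    ("Ultra-Hard", 30, 93.3, 96.7),
]


def overall_accuracy(tiers: Sequence[tuple], column: int) -> float:
    """Case-weighted accuracy (%); column 2 = base model, 3 = fine-tuned."""
    total = sum(row[1] for row in tiers)
    # Recover whole correct-case counts from the rounded percentages.
    correct = sum(round(row[1] * row[column] / 100) for row in tiers)
    return 100 * correct / total


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """`ranks` holds the 1-based position of the correct candidate per query."""
    return sum(1 / r for r in ranks) / len(ranks)


print(round(overall_accuracy(TIERS, 2), 1))  # 78.3
print(round(overall_accuracy(TIERS, 3), 1))  # 84.3
print(mean_reciprocal_rank([1, 2, 1, 4]))    # 0.6875 (toy ranks)
```

The recomputed totals (235 and 253 correct out of 300) agree with the reported 78.3% and 84.3%.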
Model Details
- Parameters: 278M
- Max sequence length: 1024 tokens
- Output: Single relevance score (0-1)
- Languages: Multilingual (inherits from base model)
License
CC-BY-NC-4.0 (inherited from the base model; non-commercial use only)
Citation
@misc{jina-affiliation-reranker,
title={Jina Affiliation Reranker},
author={cometadata},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/cometadata/jina-reranker-v2-multilingual-affiliations}
}
Model tree for cometadata/jina-reranker-v2-multilingual-affiliations-comet-training-only
Base model
jinaai/jina-reranker-v2-base-multilingual