---
pipeline_tag: text-classification
tags:
- gguf
- embedding
- qwen3
- llama-cpp
- jina-embeddings-v5
- feature-extraction
- mteb
- vllm
- sentence-transformers
language:
- multilingual
base_model: jinaai/jina-embeddings-v5-text-small
base_model_relation: quantized
inference: false
license: cc-by-nc-4.0
library_name: llama.cpp
---
### **jina-embeddings-v5-text-small-classification**: Classification-Targeted Embedding Distillation
[Elastic Inference Service](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis) | [ArXiv](https://arxiv.org/abs/2602.15547) | [Release Note](https://jina.ai/news/jina-embeddings-v5-text-distilling-4b-quality-into-sub-1b-multilingual-embeddings) | [Blog](https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text)
### Model Overview
`jina-embeddings-v5-text-small-classification` is a compact, high-performance text embedding model designed for classification.
It is part of the **jina-embeddings-v5-text** model family, which also includes [jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano), a smaller model for more resource-constrained use cases.
Trained using a novel approach that combines distillation with task-specific contrastive losses, `jina-embeddings-v5-text-small-classification` outperforms existing state-of-the-art models of similar size across diverse embedding benchmarks.
| Feature | Value |
| --- | --- |
| Parameters | 677M |
| Supported Tasks | `classification` |
| Max Sequence Length | 32768 |
| Embedding Dimension | 1024 |
| Matryoshka Dimensions | 32, 64, 128, 256, 512, 768, 1024 |
| Pooling Strategy | Last-token pooling |
| Base Model | jinaai/jina-embeddings-v5-text-small |
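The Matryoshka dimensions listed above mean a full 1024-dimensional embedding can be truncated to any of the smaller sizes and re-normalized, trading a little quality for storage. A minimal sketch with NumPy, using a random unit vector as a stand-in for real model output:

```python
import numpy as np

# Random unit vector standing in for a real 1024-dim model embedding.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(1024)
embedding /= np.linalg.norm(embedding)

def truncate_matryoshka(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

small = truncate_matryoshka(embedding, 256)  # any listed dim: 32 ... 1024
print(small.shape)  # (256,)
```

With `sentence-transformers`, the same effect is available via the `truncate_dim` argument.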

### Training and Evaluation
For training details and evaluation results, see our [technical report](https://arxiv.org/abs/2602.15547).
### Usage
#### Requirements
The following Python packages are required:
- `transformers>=5.1.0`
- `torch>=2.8.0`
- `peft>=0.15.2`
- `vllm>=0.15.1`
#### Optional / Recommended
- **flash-attention**: Installing [flash-attention](https://github.com/Dao-AILab/flash-attention) is recommended for improved inference speed and efficiency, but not mandatory.
- **sentence-transformers**: If you want to use the model via the `sentence-transformers` interface, install this package as well.
#### via Elastic Inference Service
Elastic Inference Service (EIS) is the fastest way to use v5-text in production: it provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment.
```bash
PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}
```
See the [Elastic Inference Service documentation](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis) for setup details.
#### via sentence-transformers
```python
from sentence_transformers import SentenceTransformer
import torch
model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-classification",
    model_kwargs={"dtype": torch.bfloat16},  # Recommended for GPUs
    config_kwargs={"_attn_implementation": "flash_attention_2"},  # Recommended but optional
)
# Optional: set truncate_dim in encode() to control embedding size
texts = [
"My order hasn't arrived yet and it's been two weeks.",
"How do I reset my password?",
"I'd like a refund for my recent purchase.",
"Your product exceeded my expectations. Great job!",
]
# Encode texts
embeddings = model.encode(texts)
print(embeddings.shape)
# (4, 1024)
similarity = model.similarity(embeddings, embeddings)
print(similarity)
# tensor([[1.0000, 0.7347, 0.7988, 0.7523],
# [0.7347, 1.0000, 0.7440, 0.7228],
# [0.7988, 0.7440, 1.0000, 0.7321],
# [0.7523, 0.7228, 0.7321, 1.0000]])
```
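Since this checkpoint targets classification, a common pattern is to label a text by comparing its embedding against labeled reference embeddings (e.g. nearest centroid). A minimal sketch with NumPy, where random unit vectors stand in for `model.encode()` output and the label names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(x):
    # L2-normalize along the last axis, as the model does for its outputs.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for model.encode() output: one labeled reference embedding
# per class, plus one query embedding (labels are hypothetical).
references = normalize(rng.standard_normal((4, 1024)))
labels = np.array(["shipping", "account", "refund", "praise"])

# Simulate a query embedding close to the "refund" reference.
noise = normalize(rng.standard_normal(1024))
query = normalize(references[2] + 0.2 * noise)

# For unit vectors, cosine similarity reduces to a dot product.
scores = references @ query
predicted = labels[int(np.argmax(scores))]
print(predicted)  # "refund"
```

In practice the references would be class centroids computed from labeled training texts encoded with this model.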
#### via vLLM
```python
from vllm import LLM
from vllm.config.pooler import PoolerConfig
# Initialize model
name = "jinaai/jina-embeddings-v5-text-small-classification"
model = LLM(
    model=name,
    dtype="float16",
    runner="pooling",
    pooler_config=PoolerConfig(seq_pooling_type="LAST", normalize=True),
)
# Create text prompts
document1 = "Overview of climate change impacts on coastal cities"
document1_prompt = f"Document: {document1}"
document2 = "The impacts of climate change on large cities"
document2_prompt = f"Document: {document2}"
# Encode all prompts
prompts = [document1_prompt, document2_prompt]
outputs = model.encode(prompts, pooling_task="embed")
embed_document1 = outputs[0].outputs.data
embed_document2 = outputs[1].outputs.data
```
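Because `normalize=True` is set in `PoolerConfig`, the returned vectors are unit length and cosine similarity reduces to a dot product. A sketch of the comparison step, with random unit vectors standing in for `outputs[i].outputs.data`:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(x):
    return x / np.linalg.norm(x)

# Hypothetical stand-ins for outputs[0].outputs.data and
# outputs[1].outputs.data (real values come from model.encode above).
embed_document1 = normalize(rng.standard_normal(1024))
embed_document2 = normalize(embed_document1 + 0.2 * normalize(rng.standard_normal(1024)))

# With normalized embeddings, cosine similarity is a plain dot product.
similarity = float(embed_document1 @ embed_document2)
print(round(similarity, 3))
```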
#### via Text Embeddings Inference
- Via Docker on CPU:
```bash
docker run -p 8080:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id jinaai/jina-embeddings-v5-text-small-classification \
--dtype float32 --pooling last-token
```
- Via Docker on NVIDIA GPU (Turing, Ampere, Ada Lovelace, Hopper or Blackwell):
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
--model-id jinaai/jina-embeddings-v5-text-small-classification \
--dtype float16 --pooling last-token
```
> Alternatively, you can run the server with `cargo`; see the [Text Embeddings Inference documentation](https://hf.co/docs/text-embeddings-inference) for details.
Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
```bash
curl -X POST http://127.0.0.1:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "jinaai/jina-embeddings-v5-text-small-classification",
        "input": [
            "Document: The impacts of climate change on coastal cities are significant..."
        ]
    }'
```
Alternatively, send the request via the [Text Embeddings Inference API](https://huggingface.github.io/text-embeddings-inference/) instead, which applies the task prompt for you so you don't have to format the inputs manually:
```bash
curl -X POST http://127.0.0.1:8080/embed \
    -H "Content-Type: application/json" \
    -d '{
        "inputs": "Overview of climate change impacts on coastal cities",
        "prompt_name": "document"
    }'
```
#### via llama.cpp (GGUF)
After installing llama.cpp, run `llama-server` to host the embedding model as an OpenAI-API-compatible HTTP server, selecting the desired model version (e.g. `F16`):
```sh
llama-server -hf jinaai/jina-embeddings-v5-text-small-classification:F16 --embedding --pooling last -ub 32768
```
Client:
```bash
curl -X POST "http://127.0.0.1:8080/v1/embeddings" \
    -H "Content-Type: application/json" \
    -d '{
        "input": [
            "Document: A beautiful sunset over the beach",
            "Document: Un beau coucher de soleil sur la plage",
            "Document: 海滩上美丽的日落",
            "Document: 浜辺に沈む美しい夕日",
            "Document: Golden sunlight melts into the horizon, painting waves in warm amber and rose, while the sky whispers goodnight to the quiet, endless sea."
        ]
    }'
```
#### via Optimum (ONNX)
You can run the ONNX-optimized version of the model locally using Hugging Face's `optimum` library. Make sure you have the required dependencies installed (e.g., `pip install optimum[onnxruntime] transformers torch`):
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
model_id = "jinaai/jina-embeddings-v5-text-small-classification"
# 1. Load tokenizer and ONNX model
# We specify the subfolder 'onnx' where the weights are located
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = ORTModelForFeatureExtraction.from_pretrained(
    model_id,
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",  # Or "CUDAExecutionProvider" for GPU
    trust_remote_code=True,
)
# 2. Prepare input
texts = ["Document: How do I use Jina ONNX models?", "Document: Information about semantic matching."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# 3. Inference
with torch.no_grad():
    outputs = model(**inputs)
# 4. Pooling (crucial for jina-embeddings-v5)
# The model uses last-token pooling: take the hidden state of the
# last non-padding token in each sequence.
last_hidden_state = outputs.last_hidden_state
# Find the indices of the last token (usually the end of the sequence)
sequence_lengths = inputs.attention_mask.sum(dim=1) - 1
embeddings = last_hidden_state[torch.arange(last_hidden_state.size(0)), sequence_lengths]
print('embeddings shape:', embeddings.shape)
print('embeddings:', embeddings)
```
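The last-token gather above assumes right padding (the attention mask is 1 for real tokens, then 0). On small dummy arrays, shown here in NumPy for illustration, the indexing works like this:

```python
import numpy as np

# Dummy stand-ins for the real tensors: batch of 2 sequences,
# max length 5, hidden size 4 (the real hidden size is 1024).
last_hidden_state = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)
attention_mask = np.array([[1, 1, 1, 0, 0],   # 3 real tokens, right-padded
                           [1, 1, 1, 1, 1]])  # 5 real tokens

# Index of the last non-padding token in each sequence.
sequence_lengths = attention_mask.sum(axis=1) - 1  # -> [2, 4]

# Advanced indexing picks one hidden state per sequence.
embeddings = last_hidden_state[np.arange(2), sequence_lengths]
print(embeddings.shape)  # (2, 4): one vector per sequence
```

If your tokenizer pads on the left instead, the last real token is simply the final position and no gather is needed.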
### License
The model is licensed under CC BY-NC 4.0. For commercial use, please [contact us](mailto:sales@jina.ai).
### Citation
If you find `jina-embeddings-v5-text-small-classification` useful in your research, please cite the following paper:
```bibtex
@misc{akram2026jinaembeddingsv5texttasktargetedembeddingdistillation,
  title={jina-embeddings-v5-text: Task-Targeted Embedding Distillation},
  author={Mohammad Kalim Akram and Saba Sturua and Nastia Havriushenko and Quentin Herreros and Michael Günther and Maximilian Werk and Han Xiao},
  year={2026},
  eprint={2602.15547},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15547},
}
```