---
pipeline_tag: text-classification
tags:
- gguf
- embedding
- qwen3
- llama-cpp
- jina-embeddings-v5
- feature-extraction
- mteb
- vllm
- sentence-transformers
language:
- multilingual
base_model: jinaai/jina-embeddings-v5-text-small
base_model_relation: quantized
inference: false
license: cc-by-nc-4.0
library_name: llama.cpp
---

Jina AI: Your Search Foundation, Supercharged!

### **jina-embeddings-v5-text-small-classification**: Classification-Targeted Embedding Distillation

[Elastic Inference Service](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis) | [ArXiv](https://arxiv.org/abs/2602.15547) | [Release Note](https://jina.ai/news/jina-embeddings-v5-text-distilling-4b-quality-into-sub-1b-multilingual-embeddings) | [Blog](https://www.elastic.co/search-labs/blog/jina-embeddings-v5-text)

### Model Overview

jina-embeddings-v5-text Architecture

`jina-embeddings-v5-text-small-classification` is a compact, high-performance text embedding model designed for classification. It is part of the **jina-embeddings-v5-text** model family, which also includes [jina-embeddings-v5-text-nano](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano), a smaller model for more resource-constrained use cases. Trained using a novel approach that combines distillation with task-specific contrastive losses, `jina-embeddings-v5-text-small-classification` outperforms existing state-of-the-art models of similar size across diverse embedding benchmarks.

| Feature | Value |
| --- | --- |
| Parameters | 677M |
| Supported Tasks | `classification` |
| Max Sequence Length | 32768 |
| Embedding Dimension | 1024 |
| Matryoshka Dimensions | 32, 64, 128, 256, 512, 768, 1024 |
| Pooling Strategy | Last-token pooling |
| Base Model | jinaai/jina-embeddings-v5-text-small |

![v5_benchmarks_combined](https://cdn-uploads.huggingface.co/production/uploads/6476ff2699a5ce743ccea3fc/7WjMQChM6XAOI9LhREChg.png)

### Training and Evaluation

For training details and evaluation results, see our [technical report](https://arxiv.org/abs/2602.15547).

### Usage
### Requirements

The following Python packages are required:

- `transformers>=5.1.0`
- `torch>=2.8.0`
- `peft>=0.15.2`
- `vllm>=0.15.1`

### Optional / Recommended

- **flash-attention**: Installing [flash-attention](https://github.com/Dao-AILab/flash-attention) is recommended for improved inference speed and efficiency, but it is not mandatory.
- **sentence-transformers**: To use the model via the `sentence-transformers` interface, install this package as well.
#### via Elastic Inference Service

The fastest way to use v5-text in production. Elastic Inference Service (EIS) provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment.

```bash
PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}
```

See the [Elastic Inference Service documentation](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis) for setup details.
#### via sentence-transformers

```python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-classification",
    model_kwargs={"dtype": torch.bfloat16},  # Recommended for GPUs
    config_kwargs={"_attn_implementation": "flash_attention_2"},  # Recommended but optional
)

# Optional: set truncate_dim in encode() to control embedding size
texts = [
    "My order hasn't arrived yet and it's been two weeks.",
    "How do I reset my password?",
    "I'd like a refund for my recent purchase.",
    "Your product exceeded my expectations. Great job!",
]

# Encode texts
embeddings = model.encode(texts)
print(embeddings.shape)
# (4, 1024)

similarity = model.similarity(embeddings, embeddings)
print(similarity)
# tensor([[1.0000, 0.7347, 0.7988, 0.7523],
#         [0.7347, 1.0000, 0.7440, 0.7228],
#         [0.7988, 0.7440, 1.0000, 0.7321],
#         [0.7523, 0.7228, 0.7321, 1.0000]])
```
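Because the model is trained with Matryoshka representation learning, embeddings can be truncated to any of the listed dimensions (32–1024) and re-normalized without retraining. A minimal NumPy sketch of that post-processing — the random vector here stands in for a real 1024-d embedding, and `truncate_embedding` is an illustrative helper name, not part of any library:

```python
import numpy as np

# Stand-in for a real 1024-dimensional embedding returned by the model
rng = np.random.default_rng(0)
full = rng.standard_normal(1024)
full /= np.linalg.norm(full)

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize (Matryoshka truncation)."""
    trunc = emb[:dim]
    return trunc / np.linalg.norm(trunc)

# Truncate to one of the supported Matryoshka dimensions
small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

With `sentence-transformers`, passing `truncate_dim` to `encode()` achieves the same effect server-side.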
#### via vLLM

```python
from vllm import LLM
from vllm.config.pooler import PoolerConfig

# Initialize model
name = "jinaai/jina-embeddings-v5-text-small-classification"
model = LLM(
    model=name,
    dtype="float16",
    runner="pooling",
    pooler_config=PoolerConfig(seq_pooling_type="LAST", normalize=True),
)

# Create text prompts
document1 = "Overview of climate change impacts on coastal cities"
document1_prompt = f"Document: {document1}"

document2 = "The impacts of climate change on large cities"
document2_prompt = f"Document: {document2}"

# Encode all prompts
prompts = [document1_prompt, document2_prompt]
outputs = model.encode(prompts, pooling_task="embed")

embed_document1 = outputs[0].outputs.data
embed_document2 = outputs[1].outputs.data
```
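Since the pooler config above sets `normalize=True`, the returned embeddings are unit-length and cosine similarity reduces to a plain dot product. A self-contained sketch with stand-in vectors (in practice, substitute `embed_document1` / `embed_document2` from the snippet above):

```python
import numpy as np

# Stand-ins for two normalized embeddings returned by the model
a = np.array([0.6, 0.8, 0.0])
b = np.array([0.8, 0.6, 0.0])

# For unit-length vectors, cosine similarity is just the dot product
cosine = float(a @ b)
print(cosine)  # 0.96
```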
#### via Text Embeddings Inference

- Via Docker on CPU:

```bash
docker run -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
  --model-id jinaai/jina-embeddings-v5-text-small-classification \
  --dtype float32 --pooling last-token
```

- Via Docker on NVIDIA GPU (Turing, Ampere, Ada Lovelace, Hopper, or Blackwell):

```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
  --model-id jinaai/jina-embeddings-v5-text-small-classification \
  --dtype float16 --pooling last-token
```

> Alternatively, you can run with `cargo`; more information can be found in the [Text Embeddings Inference documentation](https://hf.co/docs/text-embeddings-inference).

Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):

```bash
curl -X POST http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jinaai/jina-embeddings-v5-text-small-classification",
    "input": [
      "Document: The impacts of climate change on coastal cities are significant..."
    ]
  }'
```

Or use the [Text Embeddings Inference API](https://huggingface.github.io/text-embeddings-inference/) instead, which applies the task prompt for you so you don't have to format the inputs manually:

```bash
curl -X POST http://127.0.0.1:8080/embed \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Overview of climate change impacts on coastal cities",
    "prompt_name": "document"
  }'
```
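On the OpenAI-compatible route you prepend the task prefix yourself, while `prompt_name` on `/embed` applies it server-side. A small helper sketching the prefix convention used in the examples above (`build_payload` is a hypothetical name, not part of any client library):

```python
import json

def build_payload(texts, prefix="Document: "):
    """Build an OpenAI-style /v1/embeddings payload with the task prefix applied."""
    return json.dumps({
        "model": "jinaai/jina-embeddings-v5-text-small-classification",
        "input": [prefix + t for t in texts],
    })

payload = build_payload(["Overview of climate change impacts on coastal cities"])
print(payload)
```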
#### via llama.cpp (GGUF)

After installing llama.cpp, you can run `llama-server` to host the embedding model as an OpenAI-API-compatible HTTP server with the respective model version:

```sh
llama-server -hf jinaai/jina-embeddings-v5-text-small-classification:F16 --embedding --pooling last -ub 32768
```

Client:

```sh
curl -X POST "http://127.0.0.1:8080/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Document: A beautiful sunset over the beach",
      "Document: Un beau coucher de soleil sur la plage",
      "Document: 海滩上美丽的日落",
      "Document: 浜辺に沈む美しい夕日",
      "Document: Golden sunlight melts into the horizon, painting waves in warm amber and rose, while the sky whispers goodnight to the quiet, endless sea."
    ]
  }'
```
#### via Optimum (ONNX)

You can run the ONNX-optimized version of the model locally using Hugging Face's `optimum` library. Make sure you have the required dependencies installed (e.g., `pip install optimum[onnxruntime] transformers torch`):

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "jinaai/jina-embeddings-v5-text-small-classification"

# 1. Load tokenizer and ONNX model
# We specify the subfolder 'onnx' where the weights are located
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = ORTModelForFeatureExtraction.from_pretrained(
    model_id,
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",  # Or "CUDAExecutionProvider" for GPU
    trust_remote_code=True,
)

# 2. Prepare input
texts = ["Document: How do I use Jina ONNX models?", "Document: Information about semantic matching."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# 3. Inference
with torch.no_grad():
    outputs = model(**inputs)

# 4. Pooling (crucial for Jina-v5)
# Jina-v5 uses LAST-TOKEN pooling:
# take the hidden state of the last non-padding token.
last_hidden_state = outputs.last_hidden_state

# Find the index of the last non-padding token in each sequence
sequence_lengths = inputs.attention_mask.sum(dim=1) - 1
embeddings = last_hidden_state[torch.arange(last_hidden_state.size(0)), sequence_lengths]

print("embeddings shape:", embeddings.shape)
print("embeddings:", embeddings)
```
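The last-token pooling step can be checked in isolation with dummy tensors (the shapes here are illustrative, not the model's real hidden size): each sequence's embedding is the hidden state at the position of its last non-padding token, as determined by the attention mask.

```python
import torch

# Dummy batch: 2 sequences, 4 positions, hidden size 3
last_hidden_state = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)

# Sequence 0 has 4 real tokens; sequence 1 has 2 real tokens plus right-padding
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])

# Index of the last non-padding token per sequence
sequence_lengths = attention_mask.sum(dim=1) - 1  # tensor([3, 1])

# Gather the hidden state at that position for each sequence
embeddings = last_hidden_state[torch.arange(2), sequence_lengths]
print(embeddings)
# tensor([[ 9., 10., 11.],
#         [15., 16., 17.]])
```

Note this indexing assumes right-padded batches, which is what the tokenizer call above produces.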
### License

The model is licensed under CC BY-NC 4.0. For commercial use, please [contact us](mailto:sales@jina.ai).

### Citation

If you find `jina-embeddings-v5-text-small-classification` useful in your research, please cite the following paper:

```
@misc{akram2026jinaembeddingsv5texttasktargetedembeddingdistillation,
      title={jina-embeddings-v5-text: Task-Targeted Embedding Distillation},
      author={Mohammad Kalim Akram and Saba Sturua and Nastia Havriushenko and Quentin Herreros and Michael Günther and Maximilian Werk and Han Xiao},
      year={2026},
      eprint={2602.15547},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.15547},
}
```