# gte_L4_uniform_distilled (Distilled)

A lightweight sentence encoder created from alibaba-NLP/gte-multilingual-base via layer pruning, vocabulary pruning, and knowledge distillation.
## Model Details

| Property | Value |
|----------|-------|
| Teacher | alibaba-NLP/gte-multilingual-base |
| Architecture | GTE-multilingual (pruned) |
| Hidden dim | 768 |
| Layers | 4 / 12 |
| Layer indices | [0, 4, 7, 11] |
| Strategy | 4 layers, evenly spaced across the 12-layer teacher |
| Parameters | 72,785,664 |
| Model size (FP32) | 277.7 MB |
| Distilled | Yes |
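(For reference: 72,785,664 parameters × 4 bytes = 291,142,656 bytes ≈ 277.7 MiB, matching the reported FP32 size.)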
## Architecture

```
==============================================================
 TEACHER: GTE-multilingual (12L)  →  STUDENT: 4L / 57,376 vocab
==============================================================

         TEACHER                           STUDENT

┌───────────────────────┐         ┌───────────────────────┐
│     Input Tokens      │         │     Input Tokens      │
└───────────┬───────────┘         └───────────┬───────────┘
            │                                 │
┌───────────┴───────────┐         ┌───────────┴───────────┐
│      Embeddings       │         │  Embeddings (pruned)  │
│    vocab: 250,048     │         │     vocab: 57,376     │
│       dim: 768        │         │       dim: 768        │
└───────────┬───────────┘         └───────────┬───────────┘
            │                                 │
┌───────────────────────┐         ┌───────────────────────┐
│        Layer 0        │ ──────► │     Layer 0 (L0)      │
├───────────────────────┤         ├───────────────────────┤
│        Layer 1        │    ╳    │                       │
├───────────────────────┤         │                       │
│        Layer 2        │    ╳    │                       │
├───────────────────────┤         │                       │
│        Layer 3        │    ╳    │                       │
├───────────────────────┤         ├───────────────────────┤
│        Layer 4        │ ──────► │     Layer 1 (L4)      │
├───────────────────────┤         ├───────────────────────┤
│        Layer 5        │    ╳    │                       │
├───────────────────────┤         │                       │
│        Layer 6        │    ╳    │                       │
├───────────────────────┤         ├───────────────────────┤
│        Layer 7        │ ──────► │     Layer 2 (L7)      │
├───────────────────────┤         ├───────────────────────┤
│        Layer 8        │    ╳    │                       │
├───────────────────────┤         │                       │
│        Layer 9        │    ╳    │                       │
├───────────────────────┤         │                       │
│       Layer 10        │    ╳    │                       │
├───────────────────────┤         ├───────────────────────┤
│       Layer 11        │ ──────► │     Layer 3 (L11)     │
└───────────┬───────────┘         └───────────┬───────────┘
            │                                 │
┌───────────┴───────────┐         ┌───────────┴───────────┐
│     Mean Pooling      │         │     Mean Pooling      │
│   → 768d embedding    │         │   → 768d embedding    │
└───────────────────────┘         └───────────────────────┘

Size:   1058.2 MB (FP32)  →  277.7 MB (FP32)
Params: 277,405,440       →  72,785,664
Reduction: 73.8%
==============================================================
```
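As a companion to the diagram, here is a minimal sketch of how such uniform layer pruning is typically done with transformers. This is not the card's actual export script: the `encoder.layer` attribute path is an assumption (BERT-style layout) and may differ in GTE's custom modeling code.

```python
# Sketch only: keep teacher layers [0, 4, 7, 11] as student layers 0..3.
import copy
import torch.nn as nn
from transformers import AutoModel

KEEP = [0, 4, 7, 11]  # uniform selection across the 12-layer teacher

teacher = AutoModel.from_pretrained(
    "alibaba-NLP/gte-multilingual-base", trust_remote_code=True
)
student = copy.deepcopy(teacher)
# Assumption: transformer blocks live in a BERT-style ModuleList at
# `encoder.layer`; adjust the path for GTE's custom modeling code.
student.encoder.layer = nn.ModuleList(student.encoder.layer[i] for i in KEEP)
student.config.num_hidden_layers = len(KEEP)
```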
## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gte_L4_uniform_distilled", trust_remote_code=True)

sentences = [
    "Hello, how are you?",
    "안녕하세요",
    "Bonjour, comment allez-vous?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
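To sanity-check the multilingual embedding space, pairwise cosine similarities can be computed with `sentence_transformers.util.cos_sim`; the unrelated third sentence below is an illustrative addition, not part of the original example.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gte_L4_uniform_distilled", trust_remote_code=True)
embeddings = model.encode([
    "Hello, how are you?",
    "안녕하세요",               # Korean greeting
    "The cat sat on the mat.",  # unrelated sentence
])

# 3x3 matrix of pairwise cosine similarities; the two greetings should
# typically score higher with each other than with the third sentence.
print(util.cos_sim(embeddings, embeddings))
```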
## MTEB Evaluation Results

Overall average across all 25 tasks: **56.97%**

| Task Group | Average |
|------------|---------|
| Classification | 63.0% |
| Clustering | 35.73% |
| STS | 70.49% |
### Classification

| Task | Average | Details |
|------|---------|---------|
| AmazonCounterfactualClassification | 69.34% | en-ext: 74.06%, en: 72.1%, de: 69.15% |
| Banking77Classification | 82.7% | default: 82.7% |
| ImdbClassification | 60.94% | default: 60.94% |
| MTOPDomainClassification | 81.66% | en: 89.15%, es: 84.21%, fr: 82.21% |
| MassiveIntentClassification | 40.9% | en: 70.82%, zh-CN: 68.47%, ja: 66.5% |
| MassiveScenarioClassification | 46.01% | en: 76.68%, zh-CN: 75.65%, ja: 72.94% |
| ToxicConversationsClassification | 63.41% | default: 63.41% |
| TweetSentimentExtractionClassification | 59.05% | default: 59.05% |
### Clustering

| Task | Average | Details |
|------|---------|---------|
| ArXivHierarchicalClusteringP2P | 52.87% | default: 52.87% |
| ArXivHierarchicalClusteringS2S | 47.21% | default: 47.21% |
| BiorxivClusteringP2P.v2 | 25.97% | default: 25.97% |
| MedrxivClusteringP2P.v2 | 29.68% | default: 29.68% |
| MedrxivClusteringS2S.v2 | 24.9% | default: 24.9% |
| StackExchangeClustering.v2 | 43.5% | default: 43.5% |
| StackExchangeClusteringP2P.v2 | 34.78% | default: 34.78% |
| TwentyNewsgroupsClustering.v2 | 26.93% | default: 26.93% |
### STS

| Task | Average | Details |
|------|---------|---------|
| BIOSSES | 67.24% | default: 67.24% |
| SICK-R | 73.92% | default: 73.92% |
| STS12 | 73.6% | default: 73.6% |
| STS13 | 76.98% | default: 76.98% |
| STS14 | 75.26% | default: 75.26% |
| STS15 | 84.9% | default: 84.9% |
| STS17 | 58.52% | en-en: 83.29%, es-es: 79.59%, ko-ko: 70.34% |
| STS22.v2 | 44.26% | zh: 68.37%, es: 61.22%, it: 60.83% |
| STSBenchmark | 79.77% | default: 79.77% |
## Distillation Impact

Scores before and after Stage 2 knowledge distillation: "Before" is the pruned student prior to distillation, "After" is the final distilled model.

| Task | Before | After | Delta |
|------|--------|-------|-------|
| AmazonCounterfactualClassification | 65.24% | 69.34% | +4.1%p |
| ArXivHierarchicalClusteringP2P | 50.97% | 52.87% | +1.9%p |
| ArXivHierarchicalClusteringS2S | 43.38% | 47.21% | +3.83%p |
| Banking77Classification | 68.58% | 82.7% | +14.12%p |
| BiorxivClusteringP2P.v2 | 20.78% | 25.97% | +5.19%p |
| BIOSSES | 42.61% | 67.24% | +24.63%p |
| ImdbClassification | 63.28% | 60.94% | -2.34%p |
| MassiveIntentClassification | 35.71% | 40.9% | +5.19%p |
| MassiveScenarioClassification | 37.58% | 46.01% | +8.43%p |
| MedrxivClusteringP2P.v2 | 26.37% | 29.68% | +3.31%p |
| MedrxivClusteringS2S.v2 | 20.98% | 24.9% | +3.92%p |
| MTOPDomainClassification | 68.67% | 81.66% | +12.99%p |
| SICK-R | 55.11% | 73.92% | +18.81%p |
| StackExchangeClustering.v2 | 34.36% | 43.5% | +9.14%p |
| StackExchangeClusteringP2P.v2 | 31.55% | 34.78% | +3.23%p |
| STS12 | 47.97% | 73.6% | +25.63%p |
| STS13 | 65.61% | 76.98% | +11.37%p |
| STS14 | 57.02% | 75.26% | +18.24%p |
| STS15 | 64.76% | 84.9% | +20.14%p |
| STS17 | 17.95% | 58.52% | +40.57%p |
| STS22.v2 | 40.55% | 44.26% | +3.71%p |
| STSBenchmark | 62.23% | 79.77% | +17.54%p |
| ToxicConversationsClassification | 57.84% | 63.41% | +5.57%p |
| TweetSentimentExtractionClassification | 48.1% | 59.05% | +10.95%p |
| TwentyNewsgroupsClustering.v2 | 12.03% | 26.93% | +14.9%p |
## Training

### Stage 1: Layer Pruning

- Teacher: alibaba-NLP/gte-multilingual-base (12 layers, 768d)
- Selected layers: [0, 4, 7, 11] (4 layers, evenly spaced across the 12-layer teacher)
- Vocabulary pruning: 250,048 → 57,376 tokens (a sketch follows this list)
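The vocabulary-pruning step amounts to copying only the kept rows of the embedding matrix into a smaller table. The helper below is illustrative, not the actual script: `keep_ids` is assumed to come from, e.g., tokenizing the target corpora, and the tokenizer itself must be rebuilt with the same old-to-new id mapping.

```python
import torch
import torch.nn as nn

def prune_embedding(embedding: nn.Embedding, keep_ids: list[int]) -> nn.Embedding:
    """Copy only the rows for kept token ids into a smaller embedding table."""
    idx = torch.tensor(sorted(set(keep_ids)), dtype=torch.long)
    pruned = nn.Embedding(len(idx), embedding.embedding_dim)
    with torch.no_grad():
        pruned.weight.copy_(embedding.weight[idx])
    # In this model's case: 250,048 -> 57,376 rows.
    return pruned
```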
### Stage 2: Knowledge Distillation

- Method: MSE + cosine-similarity loss (a sketch follows this list)
- Data: MTEB Classification/Clustering/STS task datasets
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Schedule: cosine annealing over 3 epochs
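A minimal sketch of the distillation objective named above, assuming the MSE and cosine terms are applied to the pooled sentence embeddings and summed; the weights `alpha`/`beta` are illustrative knobs, not values from the training script.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor,
                      teacher_emb: torch.Tensor,
                      alpha: float = 1.0,
                      beta: float = 1.0) -> torch.Tensor:
    """MSE pulls student embeddings toward the teacher's; the cosine
    term additionally aligns their directions."""
    mse = F.mse_loss(student_emb, teacher_emb)
    cos = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * mse + beta * cos
```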
## Supported Languages (18)

ko, en, ja, zh, es, fr, de, pt, it, ru, ar, hi, th, vi, id, tr, nl, pl