---
base_model:
- google/embeddinggemma-300m
base_model_relation: finetune
datasets:
- ByronLeeee/CN-Law-Query-Retrieval-Dataset
language:
- zh
library_name: sentence-transformers
license: apache-2.0
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:65783
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- sentences:
  - 'title: 中华人民共和国生物安全法 第五十条 | text: 第五十条 病原微生物实验室的设立单位应当制定生物安全事件应急预案...'
  - 'title: 中华人民共和国土地管理法 第五十五条 | text: 第五十五条 以出让等有偿使用方式取得国有土地使用权的建设单位...'
  - 'title: 江西省实施《中华人民共和国土地管理法》办法 第三十五条 | text: 第三十五条 以出让等有偿使用方式取得国有土地使用权的建设单位...'
  source_sentence: 国有土地使用权出让费用缴纳规定
- sentences:
  - 'title: 中华人民共和国城乡规划法 第四十条 | text: 第四十条 在城市、镇规划区内进行建筑物、构筑物...'
  - 'title: 中华人民共和国禁毒法 第六十一条 | text: 第六十一条 容留他人吸食、注射毒品或者介绍买卖毒品,构成犯罪的...'
  - 'title: 中华人民共和国刑法 第三百四十七条 | text: 第三百四十七条 走私、贩卖、运输、制造毒品,无论数量多少...'
  source_sentence: 介绍买卖毒品构成犯罪吗?
---

# EmbeddingGemma-300M-LawVault (Chinese Legal RAG)

## 📖 Model Introduction

**EmbeddingGemma-300M-LawVault** is a high-performance embedding model fine-tuned specifically for **Chinese legal RAG (Retrieval-Augmented Generation)** scenarios.

Built on Google's [embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m), the model was trained with a contrastive-learning objective combining **MultipleNegativesRankingLoss** and **MatryoshkaLoss** on a high-quality dataset of over 65,000 `(Query, Positive, Hard Negative)` triplets. Compared to the base model, it retrieves legal statutes significantly more accurately, handles colloquial legal inquiries, and is more robust to distractor clauses.

> **Note**: This model is fine-tuned exclusively on **Chinese laws and regulations**. Its performance on other languages or non-legal domains has not been evaluated and is not guaranteed.
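The contrastive objective above can be illustrated with a minimal pure-Python sketch (toy similarity scores only, not the actual training code): each query's positive document sits on the diagonal of the in-batch similarity matrix, and every other document in the batch acts as a negative.

```python
import math

def mnrl_loss(sim_matrix):
    """In-batch Multiple Negatives Ranking Loss (sketch).
    Row i holds query i's similarity to every document in the
    batch; the positive is the diagonal entry sim_matrix[i][i]."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        denom = sum(math.exp(s) for s in row)
        total += -math.log(math.exp(row[i]) / denom)
    return total / len(sim_matrix)

# Toy scaled-cosine scores for a batch of 2 (query, document) pairs:
good = [[10.0, 1.0], [0.5, 9.0]]  # positives score highest
bad = [[1.0, 10.0], [9.0, 0.5]]   # positives score lowest
print(mnrl_loss(good) < mnrl_loss(bad))  # True
```

During actual fine-tuning, the hard negative from each triplet is typically appended as an extra column of this similarity matrix, which is what makes source-aware hard negatives effective.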
### Key Highlights

* **Domain Specialization**: Addresses a common failure of general-purpose models: distinguishing national laws from local regulations and administrative rules with near-identical wording.
* **Anti-Interference**: Trained with source-aware hard negatives—the base model's incorrect top retrievals for the same query—so the model learns to filter out confusingly similar but incorrect clauses.
* **Colloquial Understanding**: The training set includes LLM-generated queries that simulate real-world user questions, bridging the semantic gap between formal legal terminology and everyday language.
* **Matryoshka Embeddings**: Supports flexible output dimensions (768, 512, 256, 128), significantly reducing storage costs with little performance loss.

## 📊 Evaluation Performance

The model was evaluated on a held-out test set of 120 unseen colloquial legal queries generated by DeepSeek V3.2 from real legal scenarios. End-to-end RAG retrieval results:

| Metric | Base Model | **Finetuned Model (Ours)** | Notes |
| :--- | :---: | :---: | :--- |
| **Hit Rate @ 10** | 85.0% | **98.0%** | Significant reduction in "answer not found" cases |
| **Top-1 Accuracy** | 58.0% | **92.0%** | +34 points; the vast majority of correct answers rank 1st |
| **MRR @ 10** | 0.78 | **0.96** | Very high ranking quality |

*Note: The test environment used a LanceDB vector database covering a full slice of the Chinese laws and regulations corpus.*

### Case Study

| User Query | Base Model Rank | **Finetuned Rank** |
| :--- | :---: | :---: |
| "Can the provincial cultural relics bureau directly transfer artifacts unearthed in our area?" | ❌ Not retrieved (10+) | ✅ **1st** |
| "What are the legal requirements for merchants when setting product prices?" | ❌ Not retrieved (10+) | ✅ **1st** |
| "If land is requisitioned for a large hydropower station, how is compensation calculated?" | 2nd | ✅ **1st** |
| "How does the government financially support rural revitalization?" | 6th | ✅ **1st** |

## 🚀 Usage

### Install Dependencies

```bash
pip install -U sentence-transformers
```

### Load Model

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub
model_path = "ByronLeeee/EmbeddingGemma-300M-LawVault"
model = SentenceTransformer(model_path, trust_remote_code=True)

# 1. Define the query (the model is trained on Chinese text)
query = "抢劫罪一般判几年?"  # "What is the typical sentence for robbery?"

# 2. Define documents — recommended format: "title: {Law Name} | text: {Content}"
documents = [
    "title: 中华人民共和国刑法 第二百六十三条 | text: 以暴力、胁迫或者其他方法抢劫公私财物的,处三年以上十年以下有期徒刑,并处罚金...",
    "title: 中华人民共和国刑法 第二百六十七条 | text: 抢夺公私财物,数额较大的,或者多次抢夺的,处三年以下有期徒刑、拘役或者管制...",
    "title: 陕西省专利条例 第二十四条 | text: 负责专利执法的部门...可以查封或者扣押。",
]

# 3. Encode
query_vec = model.encode(query)
doc_vecs = model.encode(documents)

# 4. Compute similarity (cosine by default)
similarities = model.similarity(query_vec, doc_vecs)
print(similarities)
```

---

## Training Details (Generated by Trainer)

### Dataset

* **Size**: 65,783 training triplets (Anchor, Positive, Hard Negative)
* **Source**: Chinese laws & regulations (civil, criminal, administrative, etc.)

### Training Hyperparameters

* **Batch Size**: 24 per device (effective batch size 144 with gradient accumulation)
* **Gradient Accumulation**: 6 steps
* **Learning Rate**: 2e-05
* **Epochs**: 3
* **Precision**: bf16 (BFloat16)
* **Max Sequence Length**: 1024 tokens

### Loss Function

**MatryoshkaLoss** wrapping **MultipleNegativesRankingLoss**:

```json
{
  "matryoshka_dims": [768, 512, 256, 128],
  "matryoshka_weights": [1, 1, 1, 1]
}
```

### Training Logs
<details>
<summary>Click to expand detailed logs</summary>

| Epoch | Step | Training Loss |
|:------:|:----:|:-------------:|
| 0.0022 | 1 | 3.5148 |
| ... | ... | ... |
| 1.0 | 457 | 0.2123 |
| 2.0 | 914 | 0.0749 |
| 3.0 | 1371 | 0.0369 |

</details>
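Because the model was trained with MatryoshkaLoss, the leading dimensions of each embedding form a usable lower-dimensional embedding on their own. A minimal pure-Python sketch of the truncate-and-renormalize step (a toy 4-d vector stands in for a real 768-d embedding):

```python
import math

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` dimensions of a Matryoshka-trained
    embedding and rescale to unit length so cosine similarity
    still behaves as expected."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, -0.3, 0.8, 0.1]  # toy stand-in for a 768-d vector
small = truncate_and_renormalize(full, 2)
print(len(small))                                    # 2
print(abs(sum(x * x for x in small) - 1.0) < 1e-9)   # True (unit norm)
```

Recent sentence-transformers releases can also perform this truncation for you via the `truncate_dim` argument when constructing `SentenceTransformer`.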
### Framework Versions

- Python: 3.13.1
- Sentence Transformers: 5.1.2
- Transformers: 4.57.1
- PyTorch: 2.9.1+cu130
- Accelerate: 1.12.0
- Datasets: 4.4.1
- Tokenizers: 0.22.1

## Citation

If you use this model, please cite the following:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```