---
base_model:
- google/embeddinggemma-300m
base_model_relation: finetune
datasets:
- ByronLeeee/CN-Law-Query-Retrieval-Dataset
language:
- zh
library_name: sentence-transformers
license: apache-2.0
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:65783
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- sentences:
  - 'title: 中华人民共和国生物安全法 第五十条 | text: 第五十条 病原微生物实验室的设立单位应当制定生物安全事件应急预案...'
  - 'title: 中华人民共和国土地管理法 第五十五条 | text: 第五十五条 以出让等有偿使用方式取得国有土地使用权的建设单位...'
  - 'title: 江西省实施《中华人民共和国土地管理法》办法 第三十五条 | text: 第三十五条 以出让等有偿使用方式取得国有土地使用权的建设单位...'
  source_sentence: 国有土地使用权出让费用缴纳规定
- sentences:
  - 'title: 中华人民共和国城乡规划法 第四十条 | text: 第四十条 在城市、镇规划区内进行建筑物、构筑物...'
  - 'title: 中华人民共和国禁毒法 第六十一条 | text: 第六十一条 容留他人吸食、注射毒品或者介绍买卖毒品,构成犯罪的...'
  - 'title: 中华人民共和国刑法 第三百四十七条 | text: 第三百四十七条 走私、贩卖、运输、制造毒品,无论数量多少...'
  source_sentence: 介绍买卖毒品构成犯罪吗?
---

# EmbeddingGemma-300M-LawVault (Chinese Legal RAG)

## 📖 Model Introduction

**EmbeddingGemma-300M-LawVault** is a high-performance embedding model fine-tuned specifically for **Chinese legal RAG (Retrieval-Augmented Generation)** scenarios.

Built on Google's [embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m), the model was trained with a contrastive-learning objective combining **MultipleNegativesRankingLoss** and **MatryoshkaLoss** on a high-quality dataset of over 65,000 `(Query, Positive, Hard Negative)` triplets. Compared to the base model, it retrieves legal statutes significantly more accurately, handles colloquial legal inquiries, and is more robust to distractor clauses.

> **Note**: This model is fine-tuned exclusively on **Chinese laws and regulations**. Its performance on other languages or non-legal domains has not been evaluated and is not guaranteed.
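The contrastive objective above can be illustrated with a minimal pure-Python sketch (toy similarity scores only, not the actual training code): each query's positive document sits on the diagonal of the in-batch similarity matrix, and every other document in the batch acts as a negative.

```python
import math

def mnrl_loss(sim_matrix):
    """In-batch Multiple Negatives Ranking Loss (sketch).
    Row i holds query i's similarity to every document in the
    batch; the positive is the diagonal entry sim_matrix[i][i]."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        denom = sum(math.exp(s) for s in row)
        total += -math.log(math.exp(row[i]) / denom)
    return total / len(sim_matrix)

# Toy scaled-cosine scores for a batch of 2 (query, document) pairs:
good = [[10.0, 1.0], [0.5, 9.0]]  # positives score highest
bad = [[1.0, 10.0], [9.0, 0.5]]   # positives score lowest
print(mnrl_loss(good) < mnrl_loss(bad))  # True
```

During actual fine-tuning, the hard negative from each triplet is typically appended as an extra column of this similarity matrix, which is what makes source-aware hard negatives effective.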
### Key Highlights

* **Domain Specialization**: Addresses a common failure of general-purpose models: distinguishing national laws from local regulations and administrative rules with near-identical wording.
* **Anti-Interference**: Trained with source-aware hard negatives—the base model's incorrect top retrievals for the same query—so the model learns to filter out confusingly similar but incorrect clauses.
* **Colloquial Understanding**: The training set includes LLM-generated queries that simulate real-world user questions, bridging the semantic gap between formal legal terminology and everyday language.
* **Matryoshka Embeddings**: Supports flexible output dimensions (768, 512, 256, 128), significantly reducing storage costs with little performance loss.

## 📊 Evaluation Performance

The model was evaluated on a held-out test set of 120 unseen colloquial legal queries generated by DeepSeek V3.2 from real legal scenarios. End-to-end RAG retrieval results:

| Metric | Base Model | **Finetuned Model (Ours)** | Notes |
| :--- | :---: | :---: | :--- |
| **Hit Rate @ 10** | 85.0% | **98.0%** | Significant reduction in "answer not found" cases |
| **Top-1 Accuracy** | 58.0% | **92.0%** | +34 points; the vast majority of correct answers rank 1st |
| **MRR @ 10** | 0.78 | **0.96** | Very high ranking quality |

*Note: The test environment used a LanceDB vector database covering a full slice of the Chinese laws and regulations corpus.*

### Case Study

| User Query | Base Model Rank | **Finetuned Rank** |
| :--- | :---: | :---: |
| "Can the provincial cultural relics bureau directly transfer artifacts unearthed in our area?" | ❌ Not retrieved (10+) | ✅ **1st** |
| "What are the legal requirements for merchants when setting product prices?" | ❌ Not retrieved (10+) | ✅ **1st** |
| "If land is requisitioned for a large hydropower station, how is compensation calculated?" | 2nd | ✅ **1st** |
| "How does the government financially support rural revitalization?" | 6th | ✅ **1st** |

## 🚀 Usage

### Install Dependencies

```bash
pip install -U sentence-transformers
```

### Load Model

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub
model_path = "ByronLeeee/EmbeddingGemma-300M-LawVault"
model = SentenceTransformer(model_path, trust_remote_code=True)

# 1. Define the query (the model is trained on Chinese text)
query = "抢劫罪一般判几年?"  # "What is the typical sentence for robbery?"

# 2. Define documents — recommended format: "title: {Law Name} | text: {Content}"
documents = [
    "title: 中华人民共和国刑法 第二百六十三条 | text: 以暴力、胁迫或者其他方法抢劫公私财物的,处三年以上十年以下有期徒刑,并处罚金...",
    "title: 中华人民共和国刑法 第二百六十七条 | text: 抢夺公私财物,数额较大的,或者多次抢夺的,处三年以下有期徒刑、拘役或者管制...",
    "title: 陕西省专利条例 第二十四条 | text: 负责专利执法的部门...可以查封或者扣押。",
]

# 3. Encode
query_vec = model.encode(query)
doc_vecs = model.encode(documents)

# 4. Compute similarity (cosine by default)
similarities = model.similarity(query_vec, doc_vecs)
print(similarities)
```

---

## Training Details (Generated by Trainer)

### Dataset

* **Size**: 65,783 training triplets (Anchor, Positive, Hard Negative)
* **Source**: Chinese laws & regulations (civil, criminal, administrative, etc.)

### Training Hyperparameters

* **Batch Size**: 24 per device (effective batch size 144 with gradient accumulation)
* **Gradient Accumulation**: 6 steps
* **Learning Rate**: 2e-05
* **Epochs**: 3
* **Precision**: bf16 (BFloat16)
* **Max Sequence Length**: 1024 tokens

### Loss Function

**MatryoshkaLoss** wrapping **MultipleNegativesRankingLoss**:

```json
{
  "matryoshka_dims": [768, 512, 256, 128],
  "matryoshka_weights": [1, 1, 1, 1]
}
```

### Training Logs
<details>
<summary>Click to expand detailed logs</summary>

| Epoch | Step | Training Loss |
|:------:|:----:|:-------------:|
| 0.0022 | 1 | 3.5148 |
| ... | ... | ... |
| 1.0 | 457 | 0.2123 |
| 2.0 | 914 | 0.0749 |
| 3.0 | 1371 | 0.0369 |

</details>
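Because the model was trained with MatryoshkaLoss, the leading dimensions of each embedding form a usable lower-dimensional embedding on their own. A minimal pure-Python sketch of the truncate-and-renormalize step (a toy 4-d vector stands in for a real 768-d embedding):

```python
import math

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` dimensions of a Matryoshka-trained
    embedding and rescale to unit length so cosine similarity
    still behaves as expected."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, -0.3, 0.8, 0.1]  # toy stand-in for a 768-d vector
small = truncate_and_renormalize(full, 2)
print(len(small))                                    # 2
print(abs(sum(x * x for x in small) - 1.0) < 1e-9)   # True (unit norm)
```

Recent sentence-transformers releases can also perform this truncation for you via the `truncate_dim` argument when constructing `SentenceTransformer`.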
### Framework Versions

- Python: 3.13.1
- Sentence Transformers: 5.1.2
- Transformers: 4.57.1
- PyTorch: 2.9.1+cu130
- Accelerate: 1.12.0
- Datasets: 4.4.1
- Tokenizers: 0.22.1

## Citation

If you use this model, please cite the following:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```