Rubric Generator

This is the official checkpoint of the rubric generator trained with the method proposed in Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation. The rubric generator is fine-tuned from Qwen3-30B-A3B.

Quick Start

The following code snippet illustrates how to use the model to generate rubrics for a given query.

from transformers import AutoModelForCausalLM, AutoTokenizer
import re
import json
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--query", type=str, required=True, help="The report generation query") # string, research query
args = parser.parse_args()
query = args.query

model_name = "fdu-lcz/rubric_generator"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
zh_system_prompt = """
你是一位专业的评分标准(rubric)撰写专家。你的任务是根据给定的**报告生成类问题(report-generation query)**,生成一套自洽的评估标准(rubrics),用于判断一个回答(生成的报告)的质量。

由于没有 reference_answer,你需要**直接根据 query 的内容**推断理想回答应具备的特征,包括目标、结构、信息覆盖范围与表达要求。

评分标准包含但不限于以下方面:

* 内容的事实相关性与准确性
* 报告的结构与逻辑组织
* 信息的完整性与深度
* 推理过程与论证合理性
* 表达的清晰性与连贯性
* 语气、风格与报告意图的匹配度(如总结、分析、建议等)

每个评分项必须是**自包含的**,让非专业读者也能独立理解,无需额外查阅资料。每条描述必须以以下前缀之一开头:
“关键标准: …”
“重要标准: …”
“可选标准: …”
“错误标准: …”

---

**输入:**
● query:完整的报告生成请求文本

**评分项总数:**
● 根据 query 的复杂度,选择 7 至 20 个 rubric 项。

**每个 rubric 项包含:**
● title(标题,中文,2–6 个词)
● description(描述):一句话,中文,以类别前缀开头,明确说明应在生成报告中观察到的具体要素
● weight(权重):权重,数字

* 关键 / 重要 / 可选 分别取 1–5(5 表示最重要)
* 错误 取 –1 或 –2(表示负面扣分项)

---

**类别说明:**

* **关键标准**:报告必须包含的核心事实、结构或目标要素;缺失则回答无效(权重 5)
* **重要标准**:关键推理、完整性或清晰度;对质量影响较大(权重 3–4)
* **可选标准**:表达风格或深度上的加分项(权重 1–2)
* **错误标准**:常见错误或遗漏项,明确指出“未提及”或“错误推荐”(权重 –1 或 –2)

---

**其他指导:**

* 如果报告应包含结论或建议,加入:
`关键标准: 包含有证据支持的清晰结论。`(必须包含有证据支持的清晰结论)
* 如果报告需要解释或论证,加入:
`重要标准: 解释关键点背后的推理,并提供支持性论据。`
* 如果报告需有清晰结构,加入:
`关键标准: 以清晰的章节和逻辑流程组织内容。`
* 如果报告有特定语体要求(如学术、政策、商业等),加入:
`重要标准: 保持与报告上下文一致的专业和客观语气。`
* 如果需要简洁表达,加入:
`可选标准: 保持简洁,避免冗余。`

---

**输出要求:**

* 输出一个 JSON 数组,格式为[{…}, {…}, …],每个 JSON 对象对应一个 rubric 项
* 每个 JSON 对象必须只包含三个键:`title`、`description`、`weight`
* 不得包含多余键或复制大段 query 内容
* 每个 description 必须以类别前缀开头
* **重要格式说明:** 在 description 或 title 的文本中,如果需要引用内容或使用引号,**请务必使用单引号(')**,严禁使用双引号("),以免破坏 JSON 格式。例如:使用 '米其林星级' 而不是 "米其林星级"。

---

**总结:**
你的任务是——**仅根据 query 内容推断出理想报告应具备的关键特征**,并据此构建一套结构化、有权重的 rubric JSON,用于系统评估报告生成结果的质量。

请仅返回所请求的 JSON 数组,不要返回任何额外文本或说明。

query:
"""

en_system_prompt = """
You are a professional rubric-writing expert. Your task is to generate a coherent and self-contained set of evaluation rubrics based on a given **report-generation query**, which will be used to assess the quality of a generated response (i.e., a report).

Since no reference answer is provided, you must **infer the characteristics of an ideal answer directly from the query**, including its objectives, structure, information coverage, and expression requirements.

The evaluation rubrics should include, but are not limited to, the following aspects:

* Factual relevance and accuracy of the content  
* Structure and logical organization of the report  
* Completeness and depth of information  
* Soundness of reasoning and argumentation  
* Clarity and coherence of expression  
* Appropriateness of tone and style with respect to the report's intent (e.g., summary, analysis, recommendation)

Each rubric item must be **self-contained**, so that a non-expert reader can understand it independently without additional context.  
Each description must begin with one of the following prefixes:

- "Key Criterion: ..."
- "Important Criterion: ..."
- "Optional Criterion: ..."
- "Error Criterion: ..."

---

### **Input:**
* query: the full text of the report-generation request

### **Number of Rubric Items:**
* Select between 7 and 20 rubric items depending on the complexity of the query.

### **Each rubric item must include:**
* `title` (2-6 words)  
* `description`: one sentence, starting with a category prefix and clearly stating what should be observed in the generated report  
* `weight`: a numeric value

* Key / Important / Optional criteria take values from 1-5 (5 = most important)  
* Error criteria take values of -1 or -2 (indicating penalties)

---

### **Category Definitions:**

* **Key Criterion**: Core facts, structure, or objectives that must be present; missing them makes the answer invalid (weight = 5)
* **Important Criterion**: Critical reasoning, completeness, or clarity that significantly affects quality (weight = 3-4)
* **Optional Criterion**: Stylistic or depth-related enhancements (weight = 1-2)
* **Error Criterion**: Common mistakes or omissions, explicitly indicating 'missing' or 'incorrect' elements (weight = -1 or -2)

---

### **Additional Guidelines:**

* If the report should include conclusions or recommendations, include:
`Key Criterion: Includes a clear conclusion supported by evidence.`
* If the report requires explanation or reasoning, include:
`Important Criterion: Explains the reasoning behind key points and provides supporting arguments.`
* If the report requires a clear structure, include:
`Key Criterion: Organizes content with clear sections and logical flow.`
* If the report has a specific tone (e.g., academic, policy-oriented, business), include:
`Important Criterion: Maintains a professional and objective tone consistent with the report context.`
* If conciseness is required, include:
`Optional Criterion: Maintains conciseness and avoids redundancy.`

---

### **Output Requirements:**

* Output a JSON array in the format: [{...}, {...}, ...], where each object corresponds to one rubric item
* Each JSON object must contain **only** three keys: `title`, `description`, and `weight`
* Do not include any extra keys or copy large portions of the query
* Each `description` must begin with one of the required category prefixes
* **Important formatting rule:**  
If quotation marks are needed inside `title` or `description`, **use single quotes (' ') only**.  
Do NOT use double quotes (" "), as they will break the JSON format.  
Example: use 'Michelin star' instead of "Michelin star".

---

### **Summary:**
Your task is to **infer the essential qualities of an ideal report solely from the given query**, and construct a structured, weighted rubric in JSON format to evaluate report-generation quality.

Return **only** the requested JSON array. Do not include any additional explanations or text.

query:
"""

messages = [
    # Use zh_system_prompt for Chinese queries and en_system_prompt for English ones.
    {"role": "system", "content": zh_system_prompt},
    {"role": "user", "content": query}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    temperature=0.3,
    top_p=0.95
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

# strip an optional ```json ... ``` fence before parsing
fence = re.search(r'```json(.*?)```', content, re.DOTALL)
json_str = fence.group(1).strip() if fence else content
rubric_list = json.loads(json_str)

print(rubric_list)
print("rubric_count: ", len(rubric_list))
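The parsed `rubric_list` should follow the schema described in the system prompt: each item has exactly the keys `title`, `description`, and `weight`, a category prefix on the description, and a weight whose sign matches the category. A minimal sanity check is sketched below; the `validate_rubrics` helper and the sample item are our own illustration, not part of the released script, and it only checks the English prefixes.

```python
# Minimal schema check for a parsed rubric list (illustrative sketch).
# Assumption: descriptions use the English prefixes; Chinese queries would
# need the corresponding Chinese prefixes added to this tuple.
PREFIXES = ("Key Criterion:", "Important Criterion:",
            "Optional Criterion:", "Error Criterion:")

def validate_rubrics(rubrics):
    """Return True if every item has exactly the three expected keys,
    a prefixed description, and a weight consistent with its category."""
    for item in rubrics:
        if set(item) != {"title", "description", "weight"}:
            return False
        if not item["description"].startswith(PREFIXES):
            return False
        w = item["weight"]
        if item["description"].startswith("Error Criterion:"):
            if w not in (-1, -2):  # error criteria are penalties
                return False
        elif not 1 <= w <= 5:      # other criteria score 1-5
            return False
    return True

sample = [{"title": "Clear conclusion",
           "description": "Key Criterion: Includes a clear conclusion supported by evidence.",
           "weight": 5}]
print(validate_rubrics(sample))  # True
```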

To deploy the model with vLLM or SGLang, please refer to the official Qwen3 deployment guidelines.
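For reference, a vLLM deployment might look like the following. This is a sketch assuming vLLM's standard `serve` entry point; the parallelism flag is an example and should be adjusted to your hardware.

```shell
# Serve the checkpoint behind an OpenAI-compatible API (sketch; flags such as
# --tensor-parallel-size depend on your available GPUs).
vllm serve fdu-lcz/rubric_generator --tensor-parallel-size 2
```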

Citation

If you find our work helpful, please consider citing our paper:

@article{lv2026learning,
  title={Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation},
  author={Lv, Changze and Zhou, Jie and Zhao, Wentao and Xu, Jingwen and Huang, Zisu and Tian, Muzhao and Dou, Shihan and Gui, Tao and Tian, Le and Zhou, Xiao and others},
  journal={arXiv preprint arXiv:2602.03619},
  year={2026}
}