# Tool-Genesis-Qwen3-8B-SFT

A fine-tuned Qwen3-8B model for autonomous MCP (Model Context Protocol) tool server generation. Given a natural language scenario description, the model generates a complete, runnable MCP server with tool schemas and implementation code.

## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Architecture | Qwen3ForCausalLM |
| Parameters | 8B |
| Hidden size | 4096 |
| Layers | 36 |
| Attention heads | 32 |
| Context length | 131,072 tokens |
| Weight precision | BF16 (safetensors) |
| Training method | Full-parameter SFT |
| Training epochs | 3 |
| Training steps | 117 |
| Final training loss | 0.522 |
| Training data | ~2,500 samples |

## Training

The model was fine-tuned on curated MCP server generation examples from the Tool-Genesis benchmark. Each training sample consists of:

- **Input:** a natural language scenario description specifying what the MCP server should do
- **Output:** a complete Python MCP server implementation using the FastMCP framework
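For concreteness, one such input/output pair might look like the sketch below. The field names, scenario text, and server body are illustrative stand-ins, not taken from the actual dataset:

```python
import json

# Hypothetical SFT sample in the input/output shape described above.
# Field names and content are illustrative, not from the real dataset.
sample = {
    "input": (
        "A todo-list service that lets users add tasks, list open tasks, "
        "and mark tasks as completed."
    ),
    "output": (
        "from mcp.server.fastmcp import FastMCP\n\n"
        'mcp = FastMCP("todo")\n\n'
        "@mcp.tool()\n"
        "def add_task(title: str) -> str:\n"
        '    """Add a task and return its id."""\n'
        "    ...\n\n"
        'if __name__ == "__main__":\n'
        "    mcp.run()\n"
    ),
}

# Serialized as one JSONL line for training.
line = json.dumps(sample)
print(sorted(sample.keys()))
```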

### Training Configuration

- Epochs: 3
- Total steps: 117 (~39 steps/epoch)
- Final training loss: 0.522
- Training runtime: ~4.6 hours
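The reported step counts imply an effective batch size of roughly 64, which can be sanity-checked from the numbers above (assuming all ~2,500 samples are seen once per epoch):

```python
# Back out the effective batch size implied by the reported training numbers.
samples_per_epoch = 2500   # "~2,500 samples"
steps_per_epoch = 39       # 117 total steps / 3 epochs
effective_batch = round(samples_per_epoch / steps_per_epoch)
print(effective_batch)
```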

### Loss Curve

| Step | Loss |
|---|---|
| 1 | 0.763 |
| 10 | 0.690 |
| 20 | 0.641 |
| 39 (end of epoch 1) | 0.539 |
| 60 | 0.434 |
| 78 (end of epoch 2) | 0.436 |
| 100 | 0.420 |
| 117 (end of epoch 3) | 0.522 |

## Benchmark Results

Evaluated on the Tool-Genesis Benchmark (86 MCP servers, 4-level evaluation).

### Direct Generation (single-call, no agent loop)

| Model | L1 Compliance | L1 Launch | L2 Schema F1 | L2 UT Soft |
|---|---|---|---|---|
| Qwen3-8B (base) | 0.686 | 0.012 | 0.011 | 0.001 |
| **Qwen3-8B-SFT (ours)** | 0.826 | 0.047 | 0.046 | 0.017 |
| Qwen3-235B | 0.874 | 0.333 | 0.316 | 0.142 |
| GPT-4.1 | 0.881 | 0.738 | 0.691 | 0.267 |
| GPT-5.1 | 0.855 | 0.759 | 0.713 | 0.291 |

SFT gains over base Qwen3-8B (Direct), in absolute points:

- L1 Compliance: +14.0 pts (0.686 → 0.826)
- L1 Launch: +3.5 pts (0.012 → 0.047)
- L2 Schema F1: +3.5 pts (0.011 → 0.046)
- L2 UT Soft: +1.6 pts (0.001 → 0.017)

### With Coder-Agent (multi-turn with sandbox)

| Model | L1 Compliance | L1 Launch | L2 Schema F1 | L2 UT Soft |
|---|---|---|---|---|
| Qwen3-8B (base, coder-agent) | 0.776 | 0.694 | 0.653 | 0.246 |
| Qwen3-235B (coder-agent) | 0.868 | 0.971 | 0.914 | 0.459 |
| GPT-4.1 (coder-agent) | 0.884 | 0.756 | 0.691 | 0.288 |
| GPT-5.1 (coder-agent) | 0.906 | 0.941 | 0.877 | 0.426 |

**Note:** The coder-agent strategy dramatically improves all models by providing an iterative sandbox-based coding loop. The SFT model has not yet been evaluated with the coder-agent strategy.
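In outline, such a coder-agent strategy is a generate → launch → feed-errors-back loop. A minimal sketch (the `generate` and `run_in_sandbox` callables are hypothetical stand-ins, not the benchmark's actual harness):

```python
# Hedged sketch of an iterative coder-agent loop. `generate` stands in for
# the model call and `run_in_sandbox` for a sandboxed launch check; neither
# reflects the benchmark's real interfaces.
def coder_agent(scenario, generate, run_in_sandbox, max_turns=5):
    prompt = scenario
    code = ""
    for _ in range(max_turns):
        code = generate(prompt)
        ok, log = run_in_sandbox(code)
        if ok:  # server launched cleanly: stop iterating
            break
        # Otherwise append the failure log and ask the model for a fix.
        prompt = f"{scenario}\n\nPrevious attempt failed:\n{log}\nFix the code."
    return code
```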

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tool-genesis/Tool-Genesis-Qwen3-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = """You are a developer building MCP tool servers in Python.
Build a complete MCP server for the following scenario:

A weather information service that provides current weather data,
forecasts, and weather alerts for any location worldwide.

Output only the Python source code using the FastMCP framework."""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sampling must be enabled for `temperature` to take effect.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Evaluation Protocol

The Tool-Genesis benchmark evaluates generated MCP servers across four levels:

| Level | What it tests |
|---|---|
| L1: Protocol Compliance | JSON format validity and server launch success |
| L2: Semantic Correctness | Tool schema matching (F1) and unit test pass rate |
| L3: Capability Boundary | No unauthorized capabilities or dangerous extra tools |
| L4: Task Utility | Downstream task completion using generated tools |
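For intuition, the L2 schema-matching score can be read as an F1 over matched tool schemas. A simplified sketch that matches on tool names only (the benchmark's actual matcher is assumed to be stricter, e.g. also comparing parameter schemas):

```python
# Simplified illustration of a schema F1: precision/recall over tool names.
# The real metric presumably also checks parameters and types.
def schema_f1(generated: set, reference: set) -> float:
    tp = len(generated & reference)  # tools present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(generated)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

score = schema_f1({"get_weather", "get_forecast", "get_alerts"},
                  {"get_weather", "get_alerts"})
print(score)
```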


## Citation

```bibtex
@misc{tool_genesis_2025,
  title={Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent},
  author={Xia, Bowei and Hu, Mengkang and Wang, Shijian and Jin, Jiarui and Jiao, Wenxiang and Lu, Yuan and Li, Kexin and Luo, Ping},
  year={2025},
  note={Project page: https://tool-genesis.github.io}
}
```

## License

Apache 2.0
