# Tool-Genesis-Qwen3-8B-SFT

A fine-tuned Qwen3-8B model for autonomous MCP (Model Context Protocol) tool server generation. Given a natural language scenario description, the model generates a complete, runnable MCP server with tool schemas and implementation code.

## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Architecture | Qwen3ForCausalLM |
| Parameters | 8B |
| Hidden size | 4096 |
| Layers | 36 |
| Attention heads | 32 |
| Context length | 131,072 tokens |
| Weight precision | BF16 (safetensors) |
| Training method | Full-parameter SFT |
| Training epochs | 3 |
| Training steps | 117 |
| Final training loss | 0.522 |
| Training data | ~2,500 samples |

## Training

The model was fine-tuned on curated MCP server generation examples from the Tool-Genesis benchmark. Each training sample consists of:

- **Input:** a natural language scenario description specifying what the MCP server should do
- **Output:** a complete Python MCP server implementation using the FastMCP framework
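For concreteness, one such input/output pair might look like the sketch below. The field names, scenario text, and server body are illustrative stand-ins, not taken from the actual dataset:

```python
import json

# Hypothetical SFT sample in the input/output shape described above.
# Field names and content are illustrative, not from the real dataset.
sample = {
    "input": (
        "A todo-list service that lets users add tasks, list open tasks, "
        "and mark tasks as completed."
    ),
    "output": (
        "from mcp.server.fastmcp import FastMCP\n\n"
        'mcp = FastMCP("todo")\n\n'
        "@mcp.tool()\n"
        "def add_task(title: str) -> str:\n"
        '    """Add a task and return its id."""\n'
        "    ...\n\n"
        'if __name__ == "__main__":\n'
        "    mcp.run()\n"
    ),
}

# Serialized as one JSONL line for training.
line = json.dumps(sample)
print(sorted(sample.keys()))
```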

### Training Configuration

- Epochs: 3
- Total steps: 117 (~39 steps/epoch)
- Final training loss: 0.522
- Training runtime: ~4.6 hours
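The reported step counts imply an effective batch size of roughly 64, which can be sanity-checked from the numbers above (assuming all ~2,500 samples are seen once per epoch):

```python
# Back out the effective batch size implied by the reported training numbers.
samples_per_epoch = 2500   # "~2,500 samples"
steps_per_epoch = 39       # 117 total steps / 3 epochs
effective_batch = round(samples_per_epoch / steps_per_epoch)
print(effective_batch)
```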

### Loss Curve

| Step | Loss |
|---|---|
| 1 | 0.763 |
| 10 | 0.690 |
| 20 | 0.641 |
| 39 (end of epoch 1) | 0.539 |
| 60 | 0.434 |
| 78 (end of epoch 2) | 0.436 |
| 100 | 0.420 |
| 117 (end of epoch 3) | 0.522 |

## Benchmark Results

Evaluated on the Tool-Genesis Benchmark (86 MCP servers, 4-level evaluation).

### Direct Generation (single-call, no agent loop)

| Model | L1 Compliance | L1 Launch | L2 Schema F1 | L2 UT Soft |
|---|---|---|---|---|
| Qwen3-8B (base) | 0.686 | 0.012 | 0.011 | 0.001 |
| **Qwen3-8B-SFT (ours)** | 0.826 | 0.047 | 0.046 | 0.017 |
| Qwen3-235B | 0.874 | 0.333 | 0.316 | 0.142 |
| GPT-4.1 | 0.881 | 0.738 | 0.691 | 0.267 |
| GPT-5.1 | 0.855 | 0.759 | 0.713 | 0.291 |

SFT gains over base Qwen3-8B (Direct), in absolute points:

- L1 Compliance: +14.0 pts (0.686 → 0.826)
- L1 Launch: +3.5 pts (0.012 → 0.047)
- L2 Schema F1: +3.5 pts (0.011 → 0.046)
- L2 UT Soft: +1.6 pts (0.001 → 0.017)

### With Coder-Agent (multi-turn with sandbox)

| Model | L1 Compliance | L1 Launch | L2 Schema F1 | L2 UT Soft |
|---|---|---|---|---|
| Qwen3-8B (base, coder-agent) | 0.776 | 0.694 | 0.653 | 0.246 |
| Qwen3-235B (coder-agent) | 0.868 | 0.971 | 0.914 | 0.459 |
| GPT-4.1 (coder-agent) | 0.884 | 0.756 | 0.691 | 0.288 |
| GPT-5.1 (coder-agent) | 0.906 | 0.941 | 0.877 | 0.426 |

**Note:** The coder-agent strategy dramatically improves all models by providing an iterative sandbox-based coding loop. The SFT model has not yet been evaluated with the coder-agent strategy.
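In outline, such a coder-agent strategy is a generate → launch → feed-errors-back loop. A minimal sketch (the `generate` and `run_in_sandbox` callables are hypothetical stand-ins, not the benchmark's actual harness):

```python
# Hedged sketch of an iterative coder-agent loop. `generate` stands in for
# the model call and `run_in_sandbox` for a sandboxed launch check; neither
# reflects the benchmark's real interfaces.
def coder_agent(scenario, generate, run_in_sandbox, max_turns=5):
    prompt = scenario
    code = ""
    for _ in range(max_turns):
        code = generate(prompt)
        ok, log = run_in_sandbox(code)
        if ok:  # server launched cleanly: stop iterating
            break
        # Otherwise append the failure log and ask the model for a fix.
        prompt = f"{scenario}\n\nPrevious attempt failed:\n{log}\nFix the code."
    return code
```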

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tool-genesis/Tool-Genesis-Qwen3-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = """You are a developer building MCP tool servers in Python.
Build a complete MCP server for the following scenario:

A weather information service that provides current weather data,
forecasts, and weather alerts for any location worldwide.

Output only the Python source code using the FastMCP framework."""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sampling must be enabled for `temperature` to take effect.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Evaluation Protocol

The Tool-Genesis benchmark evaluates generated MCP servers across four levels:

| Level | What it tests |
|---|---|
| L1: Protocol Compliance | JSON format validity and server launch success |
| L2: Semantic Correctness | Tool schema matching (F1) and unit test pass rate |
| L3: Capability Boundary | No unauthorized capabilities or dangerous extra tools |
| L4: Task Utility | Downstream task completion using generated tools |
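For intuition, the L2 schema-matching score can be read as an F1 over matched tool schemas. A simplified sketch that matches on tool names only (the benchmark's actual matcher is assumed to be stricter, e.g. also comparing parameter schemas):

```python
# Simplified illustration of a schema F1: precision/recall over tool names.
# The real metric presumably also checks parameters and types.
def schema_f1(generated: set, reference: set) -> float:
    tp = len(generated & reference)  # tools present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(generated)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

score = schema_f1({"get_weather", "get_forecast", "get_alerts"},
                  {"get_weather", "get_alerts"})
print(score)
```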


## Citation

```bibtex
@misc{tool_genesis_2025,
  title={Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent},
  author={Xia, Bowei and Hu, Mengkang and Wang, Shijian and Jin, Jiarui and Jiao, Wenxiang and Lu, Yuan and Li, Kexin and Luo, Ping},
  year={2025},
  note={Project page: https://tool-genesis.github.io}
}
```

## License

Apache 2.0
