---
language:
- en
- zh
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-8B
tags:
- qwen3
- tool-use
- mcp
- code-generation
- sft
datasets:
- tool-genesis/Tool-Genesis-Benchmark
pipeline_tag: text-generation
model-index:
- name: Tool-Genesis-Qwen3-8B-SFT
  results:
  - task:
      type: text-generation
      name: MCP Server Generation (Direct)
    dataset:
      name: Tool-Genesis Benchmark
      type: tool-genesis/Tool-Genesis-Benchmark
    metrics:
    - type: compliance
      value: 0.826
      name: L1 Compliance
    - type: launch_rate
      value: 0.047
      name: L1 Launch Rate
    - type: schema_f1
      value: 0.046
      name: L2 Schema F1
    - type: ut_soft
      value: 0.017
      name: L2 UT Soft
---

# Tool-Genesis-Qwen3-8B-SFT

A fine-tuned [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) model for autonomous MCP (Model Context Protocol) tool server generation. Given a natural-language scenario description, the model generates a complete, runnable MCP server with tool schemas and implementation code.

## Model Details

| Property | Value |
|---|---|
| Base model | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
| Architecture | Qwen3ForCausalLM |
| Parameters | 8B |
| Hidden size | 4096 |
| Layers | 36 |
| Attention heads | 32 |
| Context length | 131,072 tokens |
| Training method | Full-parameter SFT |
| Training epochs | 3 |
| Training steps | 117 |
| Training loss | 0.522 |
| Training data | ~2,500 samples |

## Training

The model was fine-tuned on curated MCP server generation examples from the Tool-Genesis benchmark.
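Concretely, each fine-tuning pair can be pictured as a prompt/completion record like the one below. This is a hypothetical shape with illustrative field names and a truncated target; the released data format may differ.

```python
# Hypothetical shape of one SFT sample (field names are illustrative,
# not the released data format). The input is a scenario description;
# the target is FastMCP server source code.
sample = {
    "input": (
        "A weather information service that provides current weather "
        "data, forecasts, and weather alerts for any location worldwide."
    ),
    "output": (
        "from mcp.server.fastmcp import FastMCP\n"
        "\n"
        'mcp = FastMCP("weather")\n'
        "\n"
        "# ... tool definitions ...\n"
    ),
}
```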
Each training sample consists of:

- **Input**: a natural-language scenario description specifying what the MCP server should do
- **Output**: a complete Python MCP server implementation using the FastMCP framework

### Training Configuration

- **Epochs**: 3
- **Total steps**: 117 (~39 steps/epoch)
- **Final training loss**: 0.522
- **Training runtime**: ~4.6 hours

### Loss Curve

| Step | Loss |
|---|---|
| 1 | 0.763 |
| 10 | 0.690 |
| 20 | 0.641 |
| 39 (epoch 1) | 0.539 |
| 60 | 0.434 |
| 78 (epoch 2) | 0.436 |
| 100 | 0.420 |
| 117 (epoch 3) | 0.522 |

## Benchmark Results

Evaluated on the [Tool-Genesis Benchmark](https://huggingface.co/datasets/tool-genesis/Tool-Genesis-Benchmark) (86 MCP servers, 4-level evaluation).

### Direct Generation (single-call, no agent loop)

| Model | L1 Compliance | L1 Launch | L2 Schema F1 | L2 UT Soft |
|---|---|---|---|---|
| Qwen3-8B (base) | 0.686 | 0.012 | 0.011 | 0.001 |
| **Qwen3-8B-SFT (ours)** | **0.826** | **0.047** | **0.046** | **0.017** |
| Qwen3-235B | 0.874 | 0.333 | 0.316 | 0.142 |
| GPT-4.1 | 0.881 | 0.738 | 0.691 | 0.267 |
| GPT-5.1 | 0.855 | 0.759 | 0.713 | 0.291 |

**SFT gains over base Qwen3-8B (Direct), in absolute points:**

- L1 Compliance: +0.140 (0.686 → 0.826)
- L1 Launch: +0.035 (0.012 → 0.047)
- L2 Schema F1: +0.035 (0.011 → 0.046)
- L2 UT Soft: +0.016 (0.001 → 0.017)

### With Coder-Agent (multi-turn with sandbox)

| Model | L1 Compliance | L1 Launch | L2 Schema F1 | L2 UT Soft |
|---|---|---|---|---|
| Qwen3-8B (base, coder-agent) | 0.776 | 0.694 | 0.653 | 0.246 |
| Qwen3-235B (coder-agent) | 0.868 | 0.971 | 0.914 | 0.459 |
| GPT-4.1 (coder-agent) | 0.884 | 0.756 | 0.691 | 0.288 |
| GPT-5.1 (coder-agent) | 0.906 | 0.941 | 0.877 | 0.426 |

> **Note**: The coder-agent strategy, an iterative sandbox-based coding loop, substantially improves most models — most dramatically the smaller and mid-sized ones. The SFT model has not yet been evaluated with the coder-agent strategy.
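To make the L2 Schema F1 numbers above concrete, the sketch below computes an F1 score over tool-name sets for a generated server versus a reference spec. This is our own simplified illustration — the benchmark's actual matcher compares full tool schemas, not just names — and the tool names are hypothetical.

```python
# Simplified illustration of schema-matching F1: compare the set of tool
# names a generated server exposes against a reference specification.
# The benchmark's real matcher is richer (it scores schemas, not just
# names); this only shows how an F1 number is formed.
def schema_f1(generated: set[str], reference: set[str]) -> float:
    if not generated or not reference:
        return 0.0
    tp = len(generated & reference)          # tools present in both
    precision = tp / len(generated)
    recall = tp / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 2 of 3 generated tools match the reference.
reference = {"get_current_weather", "get_forecast", "get_alerts"}
generated = {"get_current_weather", "get_forecast", "get_humidity"}
print(round(schema_f1(generated, reference), 3))  # → 0.667
```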
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tool-genesis/Tool-Genesis-Qwen3-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

prompt = """You are a developer building MCP tool servers in Python.
Build a complete MCP server for the following scenario:

A weather information service that provides current weather data, forecasts, and weather alerts for any location worldwide.

Output only the Python source code using the FastMCP framework."""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sampling must be enabled for `temperature` to take effect;
# transformers ignores it under greedy decoding.
outputs = model.generate(
    **inputs, max_new_tokens=4096, do_sample=True, temperature=0.2
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Evaluation Protocol

The Tool-Genesis benchmark evaluates generated MCP servers across four levels:

| Level | What it tests |
|---|---|
| **L1: Protocol Compliance** | JSON format validity and server launch success |
| **L2: Semantic Correctness** | Tool schema matching (F1) and unit test pass rate |
| **L3: Capability Boundary** | No unauthorized capabilities or dangerous extra tools |
| **L4: Task Utility** | Downstream task completion using generated tools |

## Links

- **Benchmark dataset**: [tool-genesis/Tool-Genesis-Benchmark](https://huggingface.co/datasets/tool-genesis/Tool-Genesis-Benchmark)
- **Code**: [github.com/Tool-Genesis/Tool-Genesis](https://github.com/Tool-Genesis/Tool-Genesis)

## Citation

```bibtex
@misc{tool_genesis_2025,
  title={Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent},
  author={Xia, Bowei and Hu, Mengkang and Wang, Shijian and Jin, Jiarui and Jiao, Wenxiang and Lu, Yuan and Li, Kexin and Luo, Ping},
  year={2025},
  note={Project page: https://tool-genesis.github.io}
}
```

## License

Apache 2.0