ThinkFL: Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning

ThinkFL Model Family

Model	Link
ThinkFL-Llama3.2-3B	🤗 Hugging Face
ThinkFL-Llama3-8B	🤗 Hugging Face
ThinkFL-Qwen2.5-0.5B	🤗 Hugging Face
ThinkFL-Qwen2.5-3B	🤗 Hugging Face

ThinkFL Overview

ThinkFL is a lightweight, self-refining failure localization model for microservice systems that overcomes the limitations of rigid invocation workflows and resource-intensive inference in existing LLM-based approaches. Built on a sub-10B-parameter LLM and trained via a Progressive Multi-Stage Group Relative Policy Optimization (GRPO) fine-tuning strategy, ThinkFL features a Recursion-of-Thought Actor capable of autonomously exploring multi-step diagnostic paths through standardized tool invocation—integrating trace analysis and metrics evaluation to produce interpretable, ranked root cause hypotheses. These outputs are refined using expert SRE feedback and a Multi-Factor Failure Localization Grader, enabling efficient, accurate, and cost-effective failure localization even in high-frequency incident environments.

Quickstart

ThinkFL can be directly launched using vLLM.

Deployment with vLLM

For production deployment, you can use vLLM to serve ThinkFL as an OpenAI-compatible API endpoint. Please refer to the vLLM PR for implementation details.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model ThinkFL-Llama3.2-3B \
  --served-model-name ThinkFL \
  --trust-remote-code \
  --tensor-parallel-size=1 \
  --port 8000

Example Prompt Usage

You can interact with the deployed ThinkFL model using standard OpenAI-compatible chat completions. The model is designed to act as a software operations engineer that diagnoses system failures by invoking predefined tools.

Below is a complete example of a multi-turn conversation where the model:

Analyzes a root trace, Recursively explores child spans, Queries fluctuating metrics around a suspicious service, Finally outputs a ranked list of potential root causes.

[
  {
    "role": "system",
    "content": "You are a software operations engineer. Your task is to systematically diagnose and identify the root cause of software failures.\n\nYou have the following tools:\n\n[{\"type\": \"function\", \"function\": {\"name\": \"search_traces\", \"description\": \"Get all traces with the input span_id as the parent\", \"parameters\": {\"type\": \"object\", \"properties\": {\"parent_span_id\": {\"type\": \"string\", \"description\": \"The parent span_id to be queried.\"}}, \"required\": [\"parent_span_id\"]}}}, {\"type\": \"function\", \"function\": {\"name\": \"search_fluctuating_metrics\", \"description\": \"Get all fluctuating metrics with the input service_name and around the input timestamp\", \"parameters\": {\"type\": \"object\", \"properties\": {\"service_name\": {\"type\": \"string\", \"description\": \"The service_name to be queried.\"}, \"timestamp\": {\"type\": \"string\", \"description\": \"The timestamp to be queried.\"}}, \"required\": [\"service_name\", \"timestamp\"]}}}, {\"type\": \"function\", \"function\": {\"name\": \"print_results\", \"description\": \"Print the root cause node/service/pod and corresponding reason.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"root_causes\": {\"type\": \"array\", \"description\": \"An array of potential root causes ordered by likelihood. You may include any number of causes, but you must provide at least three potential root causes.\", \"items\": {\"type\": \"object\", \"properties\": {\"node\": {\"type\": \"string\", \"description\": \"The root cause node (e.g. node-1).\"}, \"service\": {\"type\": \"string\", \"description\": \"The root cause service (e.g. recommendationservice).\"}, \"pod\": {\"type\": \"string\", \"description\": \"The root cause pod (e.g. recommendationservice-0).\"}}, \"required\": []}}}}}}]\n\nThe parameters for these tools should be filled as a JSON object, with JSON keys as parameter names and key values as parameter content.\n\nYou must insert the following command once in your reply to call the tool:\n\n<tool_call>{'name': tool name, **can and must only** be one of [search_traces, search_fluctuating_metrics, print_results], 'arguments': tool input}<tool_call>"
  },
  {
    "role": "user",
    "content": "Please read the following root trace and identify corresponding root cause service.\nA possible approach is to recursively search for the traces you suspect have issues, and combine with metrics data to confirm the root cause.\n                             timestamp      cmdb_id           span_id                          trace_id  duration  type status_code              operation_name parent_span\n5159784  1647758897785   frontend-2  a260c8c867cf4b1c  f25cf75421090eef9c6c14270f1eb385  17244804  http           0  hipstershop.Frontend/Recv.         NaN"
  },
  {
    "role": "assistant",
    "content": "<tool_call>{\"name\": \"search_traces\", \"arguments\": \"{\\\"parent_span_id\\\": \\\"a260c8c867cf4b1c\\\"}\"}</tool_call>"
  },
  {
    "role": "tool",
    "content": "[{\"cmdb_id\":\"frontend-2\",\"duration\":17132774,\"operation_name\":\"hipstershop.CartService/GetCart\",\"parent_span\":\"a260c8c867cf4b1c\",\"span_id\":\"91486122a04fec40\",\"status_code\":\"0\",\"timestamp\":1647758897795,\"trace_id\":\"f25cf75421090eef9c6c14270f1eb385\",\"type\":\"rpc\"}, ...]"
  },
  {
    "role": "assistant",
    "content": "<tool_call>{\"name\": \"search_traces\", \"arguments\": \"{\\\"parent_span_id\\\": \\\"91486122a04fec40\\\"}\"}</tool_call>"
  },
  {
    "role": "tool",
    "content": "[{\"cmdb_id\":\"cartservice-2\",\"duration\":16177000,\"operation_name\":\"/hipstershop.CartService/GetCart\",\"parent_span\":\"91486122a04fec40\",\"span_id\":\"d55ed143102a3146\",\"status_code\":\"200\",\"timestamp\":1647758898844,\"trace_id\":\"f25cf75421090eef9c6c14270f1eb385\",\"type\":\"http\"}]"
  },
  {
    "role": "assistant",
    "content": "<tool_call>{\"name\": \"search_traces\", \"arguments\": \"{\\\"parent_span_id\\\": \\\"d55ed143102a3146\\\"}\"}</tool_call>"
  },
  {
    "role": "tool",
    "content": "[{\"cmdb_id\":\"cartservice-2\",\"duration\":2000,\"operation_name\":\"HGET\",\"parent_span\":\"d55ed143102a3146\",\"span_id\":\"f3d07142bfe0404b\",\"status_code\":\"Ok\",\"timestamp\":1647758914868,\"trace_id\":\"f25cf75421090eef9c6c14270f1eb385\",\"type\":\"db\"}]"
  },
  {
    "role": "assistant",
    "content": "<tool_call>{\"name\": \"search_fluctuating_metrics\", \"arguments\": \"{\\\"service_name\\\": \\\"cartservice\\\", \\\"timestamp\\\": \\\"1647758897785\\\"}\"}</tool_call>"
  },
  {
    "role": "tool",
    "content": "{\"node-5.cartservice2-0.container_cpu_cfs_throttled_periods\": {\"regular_mean\": 0.2, ...}, \"node-6.cartservice-2.container_cpu_usage_seconds\": {\"regular_mean\": 0.492..., \"current_mean\": 0.547...}, ...}"
  },
  {
    "role": "assistant",
    "content": "<tool_call>{\"name\": \"print_results\", \"arguments\": \"{\\\"root_causes\\\": [{\\\"service\\\": \\\"cartservice\\\"}, {\\\"pod\\\": \\\"cartservice-2\\\"}, {\\\"node\\\": \\\"node-6\\\"}, {\\\"service\\\": \\\"redis-cart\\\"}, {\\\"service\\\": \\\"productcatalog-cart\\\"}, {\\\"pod\\\": \\\"cartservice-2\\\"}, {\\\"service\\\": \\\"cartservice\\\"}, {\\\"node\\\": \\\"node-6\\\"}, {\\\"service\\\": \\\"cartservice\\\"}, {\\\"service\\\": \\\"recommendationservice\\\"}, {\\\"service\\\": \\\"adservice\\\"}, {\\\"service\\\": \\\"productcatalogservice\\\"}, {\\\"service\\\": \\\"currencyservice\\\"}]}\"}<tool_call>"
  }
]

In practice, you can directly adopt the calling pattern from RCLAgent (https://github.com/LLMLog/RCLAgent), which provides a more comprehensive agent framework-including advanced mechanisms such as critical rethink, tool orchestration, and multi-step reasoning. To integrate ThinkFL, simply replace the underlying LLM interface in RCLAgent with the ThinkFL API endpoint deployed via vLLM.

We also provide a dedicated calling repository for ThinkFL (https://github.com/LLM4AIOps/ThinkFL), which offers a simpler integration path-though it omits the data preprocessing steps for brevity and ease of use.

Citation

If you find our work helpful, please cite:

@article{zhang2025thinkfl,
  title={ThinkFL: Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning},
  author={Zhang, Lingzhe and Zhai, Yunpeng and Jia, Tong and Duan, Chiming and Yu, Siyu and Gao, Jinyang and Ding, Bolin and Wu, Zhonghai and Li, Ying},
  journal={arXiv preprint arXiv:2504.18776},
  year={2025}
}
@article{zhang2025adaptive,
  title={Adaptive root cause localization for microservice systems with multi-agent recursion-of-thought},
  author={Zhang, Lingzhe and Jia, Tong and Wang, Kangjin and Hong, Weijie and Duan, Chiming and He, Minghua and Li, Ying},
  journal={arXiv preprint arXiv:2508.20370},
  year={2025}
}

Downloads last month: 1

Safetensors

Model size

3B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Zhang-Lingzhe/ThinkFL-Qwen2.5-3B

Quantizations

1 model

Papers for Zhang-Lingzhe/ThinkFL-Qwen2.5-3B

Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought

Paper • 2508.20370 • Published Aug 28, 2025

ThinkFL: Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning

Paper • 2504.18776 • Published Apr 26, 2025