---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

# SubagentVL (Self-Calling Chain-of-Thoughts, sCoT)

[![arXiv](https://img.shields.io/badge/Arxiv-2512.08511-b31b1b.svg?logo=arXiv)](http://arxiv.org/abs/2512.08511) [![Paper](https://img.shields.io/badge/Hugging%20Face-Paper-yellow?logo=huggingface)](https://huggingface.co/papers/2512.08511)

## Model Description

SubagentVL is a Vision-Language Model (VLM) trained with the **Self-Calling Chain-of-Thoughts (sCoT)** visual reasoning paradigm. sCoT is designed to enhance visual reasoning, particularly on high-resolution images, by **reformulating the complex interleaved multimodal Chain-of-Thought (iMCoT) into a language-only reasoning trajectory augmented with self-calling**.

The core idea is that a **main agent** (the VLM itself) decomposes a complex visual query into a sequence of simple, atomic subtasks, which are then delegated to **parameter-sharing virtual replicas called subagents**. These subagents handle localized visual capabilities such as grounding, OCR, or captioning in isolated contexts, returning concise textual outputs that the main agent aggregates to derive the final answer.

* **Base Model:** Qwen2.5-VL-7B.
* **Paradigm:** Thinking with images via self-calling (sCoT).
* **Key Advantage:** sCoT is significantly **easier to incentivize** through reinforcement learning than traditional iMCoT methods, yielding substantial gains in training effectiveness and efficiency.
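To make the paradigm concrete, here is a minimal sketch of what one sCoT episode could look like at inference time. Everything below is illustrative: the `<call>` tag syntax, the helper names, and the message format are assumptions rather than the released protocol; the paper only fixes that each call carries a task type, a prompt, and a bounding box, and that subagents are parameter-sharing replicas of the same model answering in isolated contexts.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical call syntax: the main agent emits a tagged JSON object such as
#   <call>{"task": "ocr", "prompt": "Read the sign.", "bbox": [x0, y0, x1, y1]}</call>
# The three required fields (task type, prompt, bounding box) come from the paper;
# the concrete tag/JSON format here is an assumption for illustration.
CALL_PATTERN = re.compile(r"<call>(.*?)</call>", re.DOTALL)

def parse_subagent_call(step: str) -> Optional[dict]:
    """Extract the first subagent call from a reasoning step, if any."""
    match = CALL_PATTERN.search(step)
    return json.loads(match.group(1)) if match else None

def run_scot(generate: Callable[[list], str], image, question: str,
             max_calls: int = 8) -> str:
    """One sCoT episode: the main agent reasons in text only and delegates
    atomic visual subtasks to subagents (the same VLM on a fresh context)."""
    context = [{"role": "user", "image": image, "text": question}]
    for _ in range(max_calls):
        step = generate(context)          # main agent: language-only reasoning
        call = parse_subagent_call(step)
        if call is None:                  # no further subtasks -> final answer
            return step
        # Localize the region of interest; `image` is assumed to be PIL-like.
        crop = image.crop(tuple(call["bbox"]))
        # Subagent: a parameter-sharing replica invoked on an isolated context.
        # In the real protocol the task type would condition the subagent prompt.
        result = generate([{"role": "user", "image": crop, "text": call["prompt"]}])
        # Only the subagent's concise textual output flows back to the main agent.
        context += [{"role": "assistant", "text": step},
                    {"role": "tool", "text": result}]
    return generate(context)              # call budget exhausted: answer directly
```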
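Because the model is built on Qwen2.5-VL-7B, loading it should follow the standard Qwen2.5-VL interface in `transformers`. The snippet below is a sketch under that assumption, with a placeholder repository id; note that a single `generate` call only produces the main agent's next reasoning step, so full self-calling behavior requires an outer loop like the one sketched above.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "path/to/SubagentVL"  # placeholder: substitute the released checkpoint id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("high_res_scene.png")  # e.g. an ultra-high-resolution photo
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is written on the sign next to the red car?"},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (one main-agent reasoning step).
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```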
## Training Details

### Training Paradigm: Agentic Reinforcement Learning

The model was optimized with end-to-end **agentic reinforcement learning (RL)** to explicitly reward coherent self-reflection and efficient subtask coordination.

* **Algorithm:** **Group Relative Policy Optimization (GRPO)** was employed for its efficiency and stability.
* **RL Steps:** Training was conducted for **80 RL steps**. Analysis of the training dynamics shows that the model initially learns to solve tasks independently, but by the third stage of training the agent consistently issues more subagent calls, indicating a matured coordination strategy.
* **Optimization Strategy:** Optimization focuses solely on the main agent's reasoning and action outputs, achieved by applying a **token-wise loss mask that excludes gradients on the subagents' textual responses**.

### Training Data

The model was trained on subsets of the comprehensive dataset collected by DeepEyes, which integrates diverse visual reasoning sources. The training corpus for this model includes:

1. **Fine-grained data (47%):** Derived from the V\* training set, this data consists of high-resolution images paired with detailed perception questions, designed to maximize the effectiveness of tool-based reasoning.
2. **Chart data (30%):** Synthetic chart and graph images that enrich the diversity of visual elements and quantitative patterns.

Using both the fine-grained and chart data was shown to **stabilize the training process** and maintain strong scores on high-resolution benchmarks.

## Intended Uses and Limitations

### Intended Uses

This model is intended for **complex visual reasoning tasks** that require decomposing a query and performing localized perception, such as:

* Answering questions based on ultra-high-resolution images (up to 8K) where the relevant details are confined to small regions.
* Tasks involving tool-calling behaviors such as **OCR, visual grounding, and captioning** through the self-calling mechanism.
* Applications benefiting from a resource-efficient RL-trained agent, as the sCoT paradigm is more scalable than previous iMCoT methods.

### Limitations

* **General Visual Ability:** While sCoT greatly enhances complex reasoning, RL that optimizes only the reasoning trajectory (masking out subagent responses) shows **only marginal gains in low-level visual skills** such as grounding and OCR compared to DeepEyes.
* **Abstract Reasoning:** Analysis in the paper suggests that including data emphasizing abstract symbolic reasoning (such as the "Reason" subset of the DeepEyes data) can **degrade performance** on high-resolution visual tasks, as it shifts the model's focus away from precise region-based perception and tool-calling strategies.
* **Tool Constraints:** The model relies on a strictly constrained tool-calling protocol (each call must specify a task type, a prompt, and a bounding box); relaxing these constraints leads to degenerate calling patterns and lower performance.

## Citation

If you find our work helpful, please cite us as follows:

```bibtex
@article{yang2025thinking,
  title={Thinking with Images via Self-Calling Agent},
  author={Yang, Wenxi and Zhao, Yuzhong and Wan, Fang and Ye, Qixiang},
  journal={arXiv preprint arXiv:2512.08511},
  year={2025}
}
```