---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

# SubagentVL (Self-Calling Chain-of-Thoughts, sCoT)

[![arXiv](https://img.shields.io/badge/Arxiv-2512.08511-b31b1b.svg?logo=arXiv)](http://arxiv.org/abs/2512.08511) [![Paper](https://img.shields.io/badge/Hugging%20Face-Paper-yellow?logo=huggingface)](https://huggingface.co/papers/2512.08511)

## Model Description

SubagentVL is a Vision-Language Model (VLM) trained with the **Self-Calling Chain-of-Thoughts (sCoT)** visual reasoning paradigm. sCoT is designed to enhance visual reasoning, particularly on high-resolution images, by **reformulating the complex interleaved multimodal Chain-of-Thought (iMCoT) into a language-only reasoning trajectory augmented with self-calling**.

The core idea is that a **main agent** (the VLM itself) decomposes a complex visual query into a sequence of simple, atomic subtasks, which are then delegated to **parameter-sharing virtual replicas called subagents**. These subagents handle localized visual capabilities such as grounding, OCR, or captioning in isolated contexts, returning concise textual outputs that the main agent aggregates to derive the final answer.

* **Base Model:** Qwen2.5-VL-7B.
* **Paradigm:** Thinking with images via self-calling (sCoT).
* **Key Advantage:** sCoT is significantly **easier to incentivize** through reinforcement learning than traditional iMCoT methods, yielding substantial gains in training effectiveness and efficiency.
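To make the paradigm concrete, here is a minimal sketch of what one sCoT episode could look like at inference time. Everything below is illustrative: the `<call>` tag syntax, the helper names, and the message format are assumptions rather than the released protocol; the paper only fixes that each call carries a task type, a prompt, and a bounding box, and that subagents are parameter-sharing replicas of the same model answering in isolated contexts.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical call syntax: the main agent emits a tagged JSON object such as
#   <call>{"task": "ocr", "prompt": "Read the sign.", "bbox": [x0, y0, x1, y1]}</call>
# The three required fields (task type, prompt, bounding box) come from the paper;
# the concrete tag/JSON format here is an assumption for illustration.
CALL_PATTERN = re.compile(r"<call>(.*?)</call>", re.DOTALL)

def parse_subagent_call(step: str) -> Optional[dict]:
    """Extract the first subagent call from a reasoning step, if any."""
    match = CALL_PATTERN.search(step)
    return json.loads(match.group(1)) if match else None

def run_scot(generate: Callable[[list], str], image, question: str,
             max_calls: int = 8) -> str:
    """One sCoT episode: the main agent reasons in text only and delegates
    atomic visual subtasks to subagents (the same VLM on a fresh context)."""
    context = [{"role": "user", "image": image, "text": question}]
    for _ in range(max_calls):
        step = generate(context)          # main agent: language-only reasoning
        call = parse_subagent_call(step)
        if call is None:                  # no further subtasks -> final answer
            return step
        # Localize the region of interest; `image` is assumed to be PIL-like.
        crop = image.crop(tuple(call["bbox"]))
        # Subagent: a parameter-sharing replica invoked on an isolated context.
        # In the real protocol the task type would condition the subagent prompt.
        result = generate([{"role": "user", "image": crop, "text": call["prompt"]}])
        # Only the subagent's concise textual output flows back to the main agent.
        context += [{"role": "assistant", "text": step},
                    {"role": "tool", "text": result}]
    return generate(context)              # call budget exhausted: answer directly
```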
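Because the model is built on Qwen2.5-VL-7B, loading it should follow the standard Qwen2.5-VL interface in `transformers`. The snippet below is a sketch under that assumption, with a placeholder repository id; note that a single `generate` call only produces the main agent's next reasoning step, so full self-calling behavior requires an outer loop like the one sketched above.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "path/to/SubagentVL"  # placeholder: substitute the released checkpoint id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("high_res_scene.png")  # e.g. an ultra-high-resolution photo
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is written on the sign next to the red car?"},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (one main-agent reasoning step).
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```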
## Training Details

### Training Paradigm: Agentic Reinforcement Learning

The model was optimized with end-to-end **agentic reinforcement learning (RL)** to explicitly reward coherent self-reflection and efficient subtask coordination.

* **Algorithm:** **Group Relative Policy Optimization (GRPO)** was employed for its efficiency and stability.
* **RL Steps:** Training was conducted for **80 RL steps**. Analysis of the training dynamics shows that the model initially learns to solve tasks independently, but by the third stage of training the agent consistently issues more subagent calls, indicating a matured coordination strategy.
* **Optimization Strategy:** Optimization focuses solely on the main agent's reasoning and action outputs, achieved by applying a **token-wise loss mask that excludes gradients on the subagents' textual responses**.

### Training Data

The model was trained on subsets of the comprehensive dataset collected by DeepEyes, which integrates diverse visual reasoning sources. The training corpus for this model includes:

1. **Fine-grained data (47%):** Derived from the V\* training set, this data consists of high-resolution images paired with detailed perception questions, designed to maximize the effectiveness of tool-based reasoning.
2. **Chart data (30%):** Synthetic chart and graph images that enrich the diversity of visual elements and quantitative patterns.

Using both the fine-grained and chart data was shown to **stabilize the training process** and maintain strong scores on high-resolution benchmarks.

## Intended Uses and Limitations

### Intended Uses

This model is intended for **complex visual reasoning tasks** that require decomposing a query and performing localized perception, such as:

* Answering questions based on ultra-high-resolution images (up to 8K) where the relevant details are confined to small regions.
* Tasks involving tool-calling behaviors such as **OCR, visual grounding, and captioning** through the self-calling mechanism.
* Applications benefiting from a resource-efficient RL-trained agent, as the sCoT paradigm is more scalable than previous iMCoT methods.

### Limitations

* **General Visual Ability:** While sCoT greatly enhances complex reasoning, RL that optimizes only the reasoning trajectory (masking out subagent responses) shows **only marginal gains in low-level visual skills** such as grounding and OCR compared to DeepEyes.
* **Abstract Reasoning:** Analysis in the paper suggests that including data emphasizing abstract symbolic reasoning (such as the "Reason" subset of the DeepEyes data) can **degrade performance** on high-resolution visual tasks, as it shifts the model's focus away from precise region-based perception and tool-calling strategies.
* **Tool Constraints:** The model relies on a strictly constrained tool-calling protocol (each call must specify a task type, a prompt, and a bounding box); relaxing these constraints leads to degenerate calling patterns and lower performance.

## Citation

If you find our work helpful, please cite us as follows:

```bibtex
@article{yang2025thinking,
  title={Thinking with Images via Self-Calling Agent},
  author={Yang, Wenxi and Zhao, Yuzhong and Wan, Fang and Ye, Qixiang},
  journal={arXiv preprint arXiv:2512.08511},
  year={2025}
}
```