AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
Abstract
AdaptMMBench is a comprehensive benchmark for evaluating adaptive multimodal reasoning in Vision-Language Models, measuring the rationality of reasoning mode selection through dynamic difficulty assessment and multi-dimensional evaluation of the reasoning process.
Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to switch dynamically between tool-augmented visual reasoning and text-only reasoning to improve both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance, and they neglect fine-grained analysis of the reasoning process. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench uses a Matthews Correlation Coefficient (MCC) metric to evaluate how rationally models select among reasoning modes, isolating this meta-cognitive ability by dynamically identifying task difficulty based on each model's capability boundary. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it is notably decoupled from final accuracy. In contrast, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
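The abstract names the MCC metric but not its exact protocol; below is a minimal sketch of how such a mode-selection rationality score could be computed. The `needs_tool`/`chose_tool` labels and the idea of deriving `needs_tool` from the model's own capability boundary (e.g., whether text-only attempts fail) are assumptions for illustration, not the paper's specification.

```python
# Hypothetical sketch of an MCC-based mode-selection rationality score.
# Field names and the labeling rule are illustrative assumptions.
import math

def mode_selection_mcc(records):
    """records: list of dicts with
       'needs_tool' - True if the task lies beyond the model's text-only
                      capability boundary (dynamically determined),
       'chose_tool' - True if the model actually invoked tool-augmented
                      visual reasoning for this task."""
    tp = fp = tn = fn = 0
    for r in records:
        if r["chose_tool"] and r["needs_tool"]:
            tp += 1
        elif r["chose_tool"] and not r["needs_tool"]:
            fp += 1
        elif not r["chose_tool"] and not r["needs_tool"]:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# A model that mostly picks the needed mode scores near 1; a random or
# always-one-mode policy scores near 0, regardless of final accuracy.
records = [
    {"needs_tool": True,  "chose_tool": True},
    {"needs_tool": False, "chose_tool": False},
    {"needs_tool": True,  "chose_tool": False},
    {"needs_tool": False, "chose_tool": False},
]
print(mode_selection_mcc(records))  # ~0.577
```

Unlike raw accuracy, MCC is balanced across both error types (using tools when unnecessary, and skipping them when needed), which is why it can isolate selection rationality from overall task performance.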
Community
AdaptMMBench is designed to evaluate adaptive multimodal reasoning beyond final accuracy. It focuses on whether vision-language models make correct reasoning mode selections and execute high-quality, efficient reasoning processes.
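To make the process-quality dimension concrete, here is a hedged sketch of a key-step-coverage score: the fraction of annotated reference steps evidenced in a model's reasoning trace. The simple substring matching rule is an assumption; the paper does not state how reference steps are aligned with traces.

```python
# Illustrative key-step-coverage score for process evaluation.
# The matching rule (case-insensitive substring match) is an assumption.
def key_step_coverage(trace: str, reference_steps: list[str]) -> float:
    """Fraction of annotated key steps that appear in the reasoning trace."""
    if not reference_steps:
        return 1.0
    hits = sum(1 for step in reference_steps if step.lower() in trace.lower())
    return hits / len(reference_steps)

trace = "Crop the chart region, read the y-axis value at 2021, then subtract the 2020 value."
steps = ["crop the chart region", "read the y-axis value", "subtract"]
print(key_step_coverage(trace, steps))  # 1.0
```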
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images (2026)
- MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning (2026)
- SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning (2025)
- MM-THEBench: Do Reasoning MLLMs Think Reasonably? (2026)
- ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking (2025)
- Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs (2026)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts (2026)
Models citing this paper 0
Datasets citing this paper 1
Spaces citing this paper 0
Collections including this paper 0