kaizuberbuehler's Collections: Vision Language Models
BLINK: Multimodal Large Language Models Can See but Not Perceive • arXiv:2404.12390 • 26 upvotes
TextSquare: Scaling up Text-Centric Visual Instruction Tuning • arXiv:2404.12803 • 30 upvotes
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models • arXiv:2404.13013 • 31 upvotes
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD • arXiv:2404.06512 • 30 upvotes
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs • arXiv:2404.05719 • 83 upvotes
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding • arXiv:2404.05726 • 23 upvotes
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens • arXiv:2404.03413 • 27 upvotes
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI • arXiv:2311.16502 • 38 upvotes
Kosmos-2: Grounding Multimodal Large Language Models to the World • arXiv:2306.14824 • 35 upvotes
CogVLM: Visual Expert for Pretrained Language Models • arXiv:2311.03079 • 27 upvotes
Pegasus-v1 Technical Report • arXiv:2404.14687 • 33 upvotes
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites • arXiv:2404.16821 • 59 upvotes
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs • arXiv:2404.16375 • 18 upvotes
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension • arXiv:2404.16790 • 10 upvotes
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning • arXiv:2404.16994 • 37 upvotes
What matters when building vision-language models? • arXiv:2405.02246 • 103 upvotes
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis • arXiv:2405.21075 • 26 upvotes
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions • arXiv:2406.04325 • 74 upvotes
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels • arXiv:2406.09415 • 51 upvotes
OpenVLA: An Open-Source Vision-Language-Action Model • arXiv:2406.09246 • 43 upvotes
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding • arXiv:2406.09411 • 19 upvotes
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models • arXiv:2406.09403 • 23 upvotes
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus • arXiv:2406.08707 • 17 upvotes
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs • arXiv:2406.11833 • 62 upvotes
VideoLLM-online: Online Video Large Language Model for Streaming Video • arXiv:2406.11816 • 26 upvotes
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation • arXiv:2406.09961 • 55 upvotes
Needle In A Multimodal Haystack • arXiv:2406.07230 • 54 upvotes
Wolf: Captioning Everything with a World Summarization Framework • arXiv:2407.18908 • 32 upvotes
Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model • arXiv:2408.00754 • 23 upvotes
OmniParser for Pure Vision Based GUI Agent • arXiv:2408.00203 • 24 upvotes
LongVILA: Scaling Long-Context Visual Language Models for Long Videos • arXiv:2408.10188 • 52 upvotes
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation • arXiv:2408.12528 • 51 upvotes
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges • arXiv:2409.01071 • 27 upvotes
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • arXiv:2409.12191 • 78 upvotes
NVLM: Open Frontier-Class Multimodal LLMs • arXiv:2409.11402 • 74 upvotes
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness • arXiv:2409.18125 • 34 upvotes
OmniBench: Towards The Future of Universal Omni-Language Models • arXiv:2409.15272 • 30 upvotes
Progressive Multimodal Reasoning via Active Retrieval • arXiv:2412.14835 • 73 upvotes
Apollo: An Exploration of Video Understanding in Large Multimodal Models • arXiv:2412.10360 • 147 upvotes
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions • arXiv:2412.08737 • 54 upvotes
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows • arXiv:2412.01169 • 13 upvotes
PaliGemma 2: A Family of Versatile VLMs for Transfer • arXiv:2412.03555 • 133 upvotes
ShowUI: One Vision-Language-Action Model for GUI Visual Agent • arXiv:2411.17465 • 89 upvotes
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection • arXiv:2411.14794 • 13 upvotes
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization • arXiv:2411.10442 • 87 upvotes
LLaVA-o1: Let Vision Language Models Reason Step-by-Step • arXiv:2411.10440 • 129 upvotes
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions • arXiv:2411.07461 • 23 upvotes
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework • arXiv:2411.06176 • 45 upvotes
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models • arXiv:2411.04996 • 50 upvotes
Analyzing The Language of Visual Tokens • arXiv:2411.05001 • 24 upvotes
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation • arXiv:2411.04999 • 18 upvotes
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining • arXiv:2501.00958 • 109 upvotes
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM • arXiv:2501.01904 • 33 upvotes
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks • arXiv:2501.08326 • 34 upvotes
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs • arXiv:2501.06186 • 65 upvotes
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? • arXiv:2501.05510 • 44 upvotes
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives • arXiv:2501.04003 • 27 upvotes
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot • arXiv:2501.09012 • 10 upvotes
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation • arXiv:2501.09755 • 35 upvotes
Do generative video models learn physical principles from watching videos? • arXiv:2501.09038 • 34 upvotes
FAST: Efficient Action Tokenization for Vision-Language-Action Models • arXiv:2501.09747 • 28 upvotes
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding • arXiv:2501.12380 • 84 upvotes
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model • arXiv:2501.12368 • 45 upvotes
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding • arXiv:2501.13106 • 90 upvotes
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos • arXiv:2501.13826 • 23 upvotes
Temporal Preference Optimization for Long-Form Video Understanding • arXiv:2501.13919 • 23 upvotes
PixelWorld: Towards Perceiving Everything as Pixels • arXiv:2501.19339 • 17 upvotes
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment • arXiv:2502.04328 • 29 upvotes
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More • arXiv:2502.03738 • 11 upvotes
Scaling Pre-training to One Hundred Billion Data for Vision Language Models • arXiv:2502.07617 • 29 upvotes
CoS: Chain-of-Shot Prompting for Long Video Understanding • arXiv:2502.06428 • 10 upvotes
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents • arXiv:2502.09560 • 35 upvotes
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency • arXiv:2502.09621 • 28 upvotes
Exploring the Potential of Encoder-free Architectures in 3D LMMs • arXiv:2502.09620 • 26 upvotes
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data • arXiv:2502.08468 • 16 upvotes
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models • arXiv:2502.09696 • 43 upvotes
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC • arXiv:2502.14282 • 29 upvotes
Qwen2.5-VL Technical Report • arXiv:2502.13923 • 214 upvotes
Soundwave: Less is More for Speech-Text Alignment in LLMs • arXiv:2502.12900 • 86 upvotes
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models • arXiv:2502.16033 • 18 upvotes
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model • arXiv:2502.18906 • 12 upvotes
Token-Efficient Long Video Understanding for Multimodal LLMs • arXiv:2503.04130 • 96 upvotes
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs • arXiv:2503.01743 • 89 upvotes
Visual-RFT: Visual Reinforcement Fine-Tuning • arXiv:2503.01785 • 86 upvotes
EgoLife: Towards Egocentric Life Assistant • arXiv:2503.03803 • 46 upvotes
Unified Video Action Model • arXiv:2503.00200 • 14 upvotes
Unified Reward Model for Multimodal Understanding and Generation • arXiv:2503.05236 • 123 upvotes
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL • arXiv:2503.07536 • 88 upvotes
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning • arXiv:2503.07365 • 61 upvotes
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model • arXiv:2503.05132 • 57 upvotes
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning • arXiv:2503.10480 • 56 upvotes
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning • arXiv:2503.10291 • 36 upvotes
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning • arXiv:2503.05379 • 38 upvotes
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models • arXiv:2503.06749 • 31 upvotes
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning • arXiv:2503.07608 • 23 upvotes
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization • arXiv:2503.10615 • 17 upvotes
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training • arXiv:2503.08525 • 17 upvotes
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering • arXiv:2503.06492 • 11 upvotes
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance • arXiv:2503.10391 • 12 upvotes
Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru • arXiv:2503.07587 • 11 upvotes
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples • arXiv:2410.14669 • 39 upvotes
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge • arXiv:2504.10342 • 11 upvotes
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills • arXiv:2503.12533 • 68 upvotes
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning • arXiv:2503.15558 • 50 upvotes
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM • arXiv:2503.14478 • 48 upvotes
API Agents vs. GUI Agents: Divergence and Convergence • arXiv:2503.11069 • 36 upvotes
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey • arXiv:2503.12605 • 35 upvotes
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding • arXiv:2503.12797 • 32 upvotes
Cube: A Roblox View of 3D Intelligence • arXiv:2503.15475 • 31 upvotes
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization • arXiv:2503.12937 • 30 upvotes
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era • arXiv:2503.12329 • 27 upvotes
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models • arXiv:2503.16257 • 27 upvotes
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers • arXiv:2503.11579 • 21 upvotes
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning • arXiv:2503.13444 • 17 upvotes
CLS-RL: Image Classification with Rule-Based Reinforcement Learning • arXiv:2503.16188 • 13 upvotes
Free-form language-based robotic reasoning and grasping • arXiv:2503.13082 • 11 upvotes
Qwen2.5-Omni Technical Report • arXiv:2503.20215 • 170 upvotes
Video-R1: Reinforcing Video Reasoning in MLLMs • arXiv:2503.21776 • 79 upvotes
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy • arXiv:2503.19757 • 51 upvotes
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? • arXiv:2503.19990 • 35 upvotes
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation • arXiv:2503.19622 • 31 upvotes
Judge Anything: MLLM as a Judge Across Any Modality • arXiv:2503.17489 • 23 upvotes
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning • arXiv:2503.18013 • 20 upvotes
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models • arXiv:2503.18923 • 14 upvotes
Can Large Vision Language Models Read Maps Like a Human? • arXiv:2503.14607 • 10 upvotes
Improved Visual-Spatial Reasoning via R1-Zero-Like Training • arXiv:2504.00883 • 67 upvotes
Towards Physically Plausible Video Generation via VLM Planning • arXiv:2503.23368 • 40 upvotes
Unicorn: Text-Only Data Synthesis for Vision Language Model Training • arXiv:2503.22655 • 39 upvotes
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 • arXiv:2503.24376 • 38 upvotes
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources • arXiv:2504.00595 • 37 upvotes
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme • arXiv:2504.02587 • 32 upvotes
Scaling Analysis of Interleaved Speech-Text Language Models • arXiv:2504.02398 • 31 upvotes
Scaling Language-Free Visual Representation Learning • arXiv:2504.01017 • 32 upvotes
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers • arXiv:2504.00502 • 26 upvotes
DASH: Detection and Assessment of Systematic Hallucinations of VLMs • arXiv:2503.23573 • 12 upvotes
SmolVLM: Redefining small and efficient multimodal models • arXiv:2504.05299 • 205 upvotes
OmniSVG: A Unified Scalable Vector Graphics Generation Model • arXiv:2504.06263 • 183 upvotes
arXiv:2504.07491 • 137 upvotes
One-Minute Video Generation with Test-Time Training • arXiv:2504.05298 • 110 upvotes
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought • arXiv:2504.05599 • 85 upvotes
An Empirical Study of GPT-4o Image Generation Capabilities • arXiv:2504.05979 • 64 upvotes
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning • arXiv:2504.07956 • 46 upvotes
Scaling Laws for Native Multimodal Models • arXiv:2504.07951 • 30 upvotes
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning • arXiv:2504.02949 • 21 upvotes
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement • arXiv:2504.07934 • 21 upvotes
Towards Visual Text Grounding of Multimodal Large Language Model • arXiv:2504.04974 • 17 upvotes
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) • arXiv:2504.03151 • 15 upvotes
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models • arXiv:2504.03641 • 14 upvotes
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models • arXiv:2504.06148 • 13 upvotes
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning • arXiv:2504.06958 • 13 upvotes
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models • arXiv:2504.10479 • 306 upvotes
Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability • arXiv:2504.08003 • 49 upvotes
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness • arXiv:2504.10514 • 48 upvotes
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning • arXiv:2504.08837 • 43 upvotes
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling • arXiv:2504.13169 • 39 upvotes
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding • arXiv:2504.09925 • 39 upvotes
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model • arXiv:2504.07615 • 35 upvotes
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model • arXiv:2504.10068 • 30 upvotes
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models • arXiv:2504.11468 • 30 upvotes
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding • arXiv:2504.10465 • 27 upvotes
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering • arXiv:2504.05506 • 25 upvotes
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation • arXiv:2504.13055 • 19 upvotes
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding • arXiv:2504.13180 • 20 upvotes
Breaking the Data Barrier -- Building GUI Agents Through Task Generalization • arXiv:2504.10127 • 17 upvotes
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning • arXiv:2504.09641 • 16 upvotes
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search • arXiv:2504.09130 • 12 upvotes
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images • arXiv:2504.08727 • 12 upvotes
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models • arXiv:2504.15279 • 78 upvotes
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models • arXiv:2504.15271 • 67 upvotes
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs • arXiv:2504.17432 • 40 upvotes
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation • arXiv:2504.17207 • 30 upvotes
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs • arXiv:2504.15280 • 25 upvotes
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs • arXiv:2504.15415 • 23 upvotes