Hunmin_vlm_235b_v0.11_merged_cua
1. Overview
Author: 임성준 (Sungjun Lim)
Role: Project Lead & Primary Researcher
Base model: Qwen/Qwen3-VL-235B-A22B-Instruct (MoE, 235B total / ~22B active)
Adapter: LoRA with layer-wise rank allocation and adaptive learning rate scheduling
Focus: Enhancing grounding, UI understanding, and end-to-end computer-use task execution
License: Apache 2.0 (model weights & code)
This model is fine-tuned from Qwen/Qwen3-VL-235B-A22B-Instruct to enhance computer-use and browser-use performance while preserving general Korean VLM/LLM capabilities.
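The layer-wise rank allocation mentioned above is not specified in detail. As a rough illustration, one common scheme assigns larger LoRA ranks to deeper layers, which tend to carry more task-specific features during instruction tuning. This is a minimal sketch under that assumption; the linear schedule, `base_rank`, and `max_rank` values are hypothetical, not the released configuration:

```python
def allocate_ranks(num_layers: int, base_rank: int = 16, max_rank: int = 64) -> dict:
    """Assign per-layer LoRA ranks, growing linearly with depth.

    Hypothetical allocation scheme for illustration only; the actual
    per-layer ranks used for this model are not published.
    """
    ranks = {}
    for layer in range(num_layers):
        # Interpolate from base_rank (first layer) to max_rank (last layer).
        frac = layer / max(num_layers - 1, 1)
        ranks[f"model.layers.{layer}"] = int(round(base_rank + frac * (max_rank - base_rank)))
    return ranks

rank_pattern = allocate_ranks(num_layers=8)
print(rank_pattern["model.layers.0"], rank_pattern["model.layers.7"])  # 16 64
```

A mapping like this could then be passed to a LoRA configuration that supports per-module rank patterns (e.g. the `rank_pattern` argument in PEFT's `LoraConfig`).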
2. Evaluation Results
2-1. Agent (Computer Use & Browser Use)
The final performance improvements were verified to be statistically significant based on six independent evaluation runs (p-value < 0.0001).
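As an illustration of how a significance check across repeated runs can be computed, the sketch below evaluates a paired t-statistic over hypothetical per-run scores; the actual run-level numbers behind the reported p-value are not published:

```python
import math
import statistics

# Hypothetical per-run overall scores for six paired evaluation runs.
# These are illustrative placeholders, not the actual run results.
baseline = [31.2, 31.6, 31.4, 31.5, 31.8, 31.6]
lora     = [33.4, 33.7, 33.5, 33.6, 33.5, 33.7]

diffs = [b - a for a, b in zip(baseline, lora)]
n = len(diffs)
# Paired t-statistic: mean difference over its standard error.
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
print(round(t, 2))
```

With the t-statistic in hand, the p-value follows from the t-distribution with n−1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`).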
i. OSWorld-G (Grounding Benchmark)
| Category | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
|---|---|---|---|---|
| Overall Accuracy | 65.99% (66.67%) | 68.00% (68.44%) | +2.01 | +3.05% |
| Fine-grained Manipulation | 52.69% | 56.60% | +3.92 | +7.43% |
- Grounding accuracy improved by approximately 2 percentage points on average, with the largest gain (+3.92 pp) in fine-grained manipulation. Values in parentheses denote best-run results.
ii. OSWorld (Simulation Benchmark)
| Category | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
|---|---|---|---|---|
| Overall Success Rate | 31.51% (31.85%) | 33.57% (34.9%) | +2.06 | +6.53% |
| Daily Applications | 40.32% | 46.34% | +6.01 | +14.91% |
- Overall success rate improved by approximately 2 percentage points on average (values in parentheses denote best-run results).
- Comparing best runs, the success rate increased from 31.85% (baseline) to 34.9% (LoRA-tuned).
| Domain | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
|---|---|---|---|---|
| vlc | 26.83% | 40.29% | +13.46 | +50.17% |
| chrome | 38.21% | 41.94% | +3.73 | +9.76% |
| libreoffice_writer | 52.16% | 55.06% | +2.90 | +5.56% |
- The largest domain-level gains were observed in the vlc, chrome, and libreoffice_writer domains.
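The Δ columns in the tables above follow the usual convention: the absolute difference in percentage points and the relative change over baseline. A minimal helper reproducing the vlc row:

```python
def deltas(baseline: float, tuned: float) -> tuple[float, float]:
    """Return (absolute delta in percentage points, relative delta in %)."""
    pp = tuned - baseline
    rel = pp / baseline * 100
    return round(pp, 2), round(rel, 2)

print(deltas(26.83, 40.29))  # vlc row -> (13.46, 50.17)
```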
2-2. Korean General Capability
Training focused on multimodal grounding and agent optimization; the results below indicate that general-purpose multimodal and text capabilities are preserved at near-baseline levels.
i. Multimodal Benchmarks
| Model | K-MMBench | K-SEED | K-MMStar | K-DTCBench | K-LLaVA-W |
|---|---|---|---|---|---|
| Baseline | 90.94% | 80.85% | 65.33% | 95.83% | 97.36% |
| LoRA-Tuned | 91.06% | 79.74% | 65.40% | 95.00% | 99.83% |
ii. Language & Reasoning Benchmarks
| Model | KMMLU-Redux | KMMLU-Pro | K-Arena-Hard-Auto | K-IFEval |
|---|---|---|---|---|
| Baseline | 71.67 | 67.67 | 73.14 | 89.40 |
| LoRA-Tuned | 71.00 | 66.68 | 73.14 | 88.98 |
3. Curated Training Data (≈1,000 High-Quality Samples)
Curated multimodal instruction-tuning data focused on UI grounding, layout understanding, icon recognition, and interaction trajectories:
- Component & description grounding (~450 samples): rule-based and generated datasets covering documents, slides, scrolling, snap icons, and component libraries
- Icon grounding (~200 samples): icon-specific vision-language alignment data
- Layout understanding (~200 samples): Layout200k, Layout400k (Claude-augmented), OS layout datasets
- UI / interaction trajectories (~150 samples): SeeClick, GUIEnv, WebUI, OmniAct, Mind2Web, AndroidControl, RicoSCA, etc.
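The exact annotation format of these samples is not published. As a hypothetical illustration only, a single UI-grounding instruction sample might be structured like this (all field names, paths, and coordinates below are invented for the sketch):

```python
import json

# Hypothetical schema for one UI-grounding training sample; the actual
# field names and annotation format used for this model are not published.
sample = {
    "image": "screenshots/settings_panel.png",
    "instruction": "Click the 'Save' button in the bottom-right corner.",
    "target": {"element": "button", "label": "Save", "bbox": [1652, 980, 1744, 1016]},
    "action": {"type": "click", "x": 1698, "y": 998},
}
print(json.dumps(sample, indent=2))
```

Pairing the natural-language instruction with both a bounding box and an executable action is what lets one sample serve grounding and trajectory objectives at once.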
4. Future Work & Extensions
While this release empirically demonstrates the effectiveness of lightweight LoRA adaptation, it represents only an initial step toward fully optimized multimodal computer-use agents. Several key directions are planned for future extensions:
- Full-Parameter Optimization: the current model applies LoRA-based fine-tuning; future work will explore full-parameter fine-tuning for further improvements.
- Reinforcement Learning: to move beyond supervised fine-tuning, we plan to incorporate reinforcement learning–based optimization that directly rewards successful long-horizon task completion and robustness under real-world UI uncertainty.
- Long-Horizon Trajectory Data Synthesis: complex computer-use tasks often require reasoning over extended action sequences across multiple applications; we will develop a scalable data synthesis pipeline that generates long-horizon, multi-step interaction trajectories.
- Expanded Evaluation Ecosystem: evaluation will be extended beyond OSWorld and OSWorld-G to additional agent benchmarks, enabling more fine-grained analysis of real-world usability.
Together, these extensions aim to transition the model from a grounding-focused PoC to a strong multimodal agent.