Hunmin_vlm_235b_v0.11_merged_cua

1. Overview

Author: 임성준 (Sungjun Lim)
Role: Project Lead & Primary Researcher
Base model: Qwen/Qwen3-VL-235B-A22B-Instruct (MoE, 235B total / ~22B active)
Adapter: LoRA with layer-wise rank allocation and adaptive learning rate scheduling (see the configuration sketch after this list)
Focus: Enhancing grounding, UI understanding, and end-to-end computer-use task execution
License: Apache 2.0 (model weights & code)
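
As a rough illustration of such an adapter setup, the sketch below expresses layer-wise rank allocation with peft's `rank_pattern`. The default rank, target modules, and regex patterns are illustrative assumptions, not the published configuration of this release, and the adaptive learning-rate schedule is not shown.

```python
# Illustrative only: layer-wise LoRA rank allocation via peft's rank_pattern.
# The default rank, target modules, and per-layer ranks below are assumptions,
# not the configuration actually used for this model.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                   # default rank for unmatched modules
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Map layer-name regexes to ranks that differ from the default,
    # e.g. giving later decoder blocks a higher rank.
    rank_pattern={
        r".*layers\.(3[0-9]|4[0-9])\.self_attn\.(q|v)_proj": 64,
    },
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```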

This model is fine-tuned from Qwen3-VL-235B-A22B-Instruct to improve computer-use and browser-use performance while preserving general Korean VLM/LLM capabilities.
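
A minimal inference sketch with transformers is shown below. It assumes a recent transformers release that provides the image-text-to-text auto classes and enough GPU memory for the MoE checkpoint; the screenshot URL and prompt are placeholders.

```python
# Minimal inference sketch (assumes a recent transformers release and
# sufficient GPU memory for the 235B MoE checkpoint; inputs are placeholders).
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "mncai/hunmin_vlm_235b_v0.11_merged_cua"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/screenshot.png"},  # placeholder
        {"type": "text", "text": "Locate the Save button and return its click coordinates."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```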


2. Evaluation Results

2-1. Agent (Computer Use & Browser Use)

The final performance improvements were verified to be statistically significant based on six independent evaluation runs (p-value < 0.0001).
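
The exact statistical test is not specified here; as one way to run such a check, the sketch below applies a two-sample t-test from scipy to per-run scores. The six values per system are placeholders, not the actual run results.

```python
# Sketch: significance check over per-run scores with scipy's two-sample
# t-test. The values below are placeholders, not the actual run results.
from scipy import stats

baseline_runs = [31.2, 31.4, 31.5, 31.5, 31.6, 31.8]  # placeholder success rates (%)
tuned_runs    = [33.3, 33.5, 33.5, 33.6, 33.7, 33.8]  # placeholder success rates (%)

t_stat, p_value = stats.ttest_ind(tuned_runs, baseline_runs)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
```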

i. OSWorld-G (Grounding Benchmark)

| Category | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
| --- | --- | --- | --- | --- |
| Overall Accuracy | 65.99% (66.67%) | 68.00% (68.44%) | +2.01 | +3.05% |
| Fine-grained Manipulation | 52.69% | 56.60% | +3.92 | +7.43% |

(Values in parentheses are best-run results; all other values are averages over the six evaluation runs.)
  • Grounding accuracy improved by approximately 2 percentage points on average, with the largest gain (+3.92 pp) in fine-grained manipulation. (The Δ columns are derived as in the worked example below.)
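
For reference, the Δ columns are simple derived quantities: Δ (pp) is the absolute difference in percentage points, and Δ rel. is the relative change over the baseline.

```python
# Worked example for the Δ columns, using the OSWorld-G overall accuracy row.
baseline, tuned = 65.99, 68.00
delta_pp = tuned - baseline                # +2.01 percentage points
delta_rel = (tuned - baseline) / baseline  # +3.05% relative improvement
print(f"Δ (pp) = {delta_pp:+.2f}, Δ rel. = {delta_rel:+.2%}")
```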

ii. OSWorld (Simulation Benchmark)

| Category | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
| --- | --- | --- | --- | --- |
| Overall Success Rate | 31.51% (31.85%) | 33.57% (34.90%) | +2.06 | +6.53% |
| Daily Applications | 40.32% | 46.34% | +6.01 | +14.91% |
  • The overall success rate improved by approximately 2 percentage points on average.
  • Comparing best runs, the success rate increased from 31.85% (baseline) to 34.90% (LoRA-tuned).

| Domain | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
| --- | --- | --- | --- | --- |
| vlc | 26.83% | 40.29% | +13.46 | +50.17% |
| chrome | 38.21% | 41.94% | +3.73 | +9.76% |
| libreoffice_writer | 52.16% | 55.06% | +2.90 | +5.56% |
  • The largest domain-level gains were observed in vlc, chrome, and libreoffice_writer.

2-2. Korean General Capability

Training focused on multimodal grounding and agent optimization; the benchmarks below indicate that general-purpose Korean multimodal and text capabilities are preserved at near-baseline levels.

i. Multimodal Benchmarks

| Model | K-MMBench | K-SEED | K-MMStar | K-DTCBench | K-LLaVA-W |
| --- | --- | --- | --- | --- | --- |
| Baseline | 90.94% | 80.85% | 65.33% | 95.83% | 97.36% |
| LoRA-Tuned | 91.06% | 79.74% | 65.40% | 95.00% | 99.83% |

ii. Language & Reasoning Benchmarks

| Model | KMMLU-Redux | KMMLU-Pro | K-Arena-Hard-Auto | K-IFEval |
| --- | --- | --- | --- | --- |
| Baseline | 71.67 | 67.67 | 73.14 | 89.40 |
| LoRA-Tuned | 71.00 | 66.68 | 73.14 | 88.98 |

3. Curated Training Data (≈1,000 High-Quality Samples)

Curated multimodal instruction-tuning data focused on UI grounding, layout understanding, icon recognition, and interaction trajectories (summarized in the sketch after this list):

  • Component & description grounding (~450 samples)
    Rule-based + generated datasets covering documents, slides, scrolling, snap icons, component libraries

  • Icon grounding (~200 samples)
    Icon-specific vision-language alignment data

  • Layout understanding (~200 samples)
    Layout200k, Layout400k (Claude-augmented), OS layout datasets

  • UI / Interaction trajectories (~150 samples)
    SeeClick, GUIEnv, WebUI, OmniAct, Mind2Web, AndroidControl, RicoSCA, etc.
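
As a compact summary of the mixture above, the approximate per-category sample budgets can be written as a simple mapping; the category keys are shorthand labels, not official dataset names.

```python
# Approximate sample budgets of the curated mixture (keys are shorthand
# labels for the four groups listed above).
MIXTURE = {
    "component_description_grounding": 450,
    "icon_grounding": 200,
    "layout_understanding": 200,
    "ui_interaction_trajectories": 150,
}
assert sum(MIXTURE.values()) == 1000  # ≈1,000 curated samples in total
```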


4. Future Work & Extensions

While this release empirically demonstrates the effectiveness of lightweight LoRA adaptation, it represents only an initial step toward fully optimized multimodal computer-use agents. Several key directions are planned for future extensions:

  • Full-Parameter Optimization
    The current model applies LoRA-based fine-tuning. Future work will explore full-parameter fine-tuning for further improvements.

  • Reinforcement Learning
    To move beyond supervised fine-tuning, we plan to incorporate reinforcement learning–based optimization methods. These approaches will directly reward successful long-horizon task completion and robustness under real-world UI uncertainty.

  • Long-Horizon Trajectory Data Synthesis
    Complex computer-use tasks often require reasoning over extended action sequences across multiple applications. To address this, we will develop a scalable data synthesis pipeline that generates long-horizon, multi-step interaction trajectories.

  • Expanded Evaluation Ecosystem
    Evaluation will be extended beyond OSWorld and OSWorld-G to include additional agent benchmarks. This will enable more fine-grained analysis of real-world usability.

Together, these extensions aim to transition the model from a grounding-focused PoC to a strong multimodal agent.
