Hunmin_vlm_235b_v0.11_merged_cua

1. Overview

Author: 임성준 (Sungjun Lim)
Role: Project Lead & Primary Researcher
Base model: Qwen/Qwen3-VL-235B-A22B-Instruct (MoE, 235B total / ~22B active)
Adapter: LoRA with layer-wise rank allocation and adaptive learning rate scheduling (see the configuration sketch after this list)
Focus: Enhancing grounding, UI understanding, and end-to-end computer-use task execution
License: Apache 2.0 (model weights & code)
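
As a rough illustration of such an adapter setup, the sketch below expresses layer-wise rank allocation with peft's `rank_pattern`. The default rank, target modules, and regex patterns are illustrative assumptions, not the published configuration of this release, and the adaptive learning-rate schedule is not shown.

```python
# Illustrative only: layer-wise LoRA rank allocation via peft's rank_pattern.
# The default rank, target modules, and per-layer ranks below are assumptions,
# not the configuration actually used for this model.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                   # default rank for unmatched modules
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Map layer-name regexes to ranks that differ from the default,
    # e.g. giving later decoder blocks a higher rank.
    rank_pattern={
        r".*layers\.(3[0-9]|4[0-9])\.self_attn\.(q|v)_proj": 64,
    },
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```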

This model is fine-tuned from Qwen3-VL-235B-A22B-Instruct to improve computer-use and browser-use performance while preserving general Korean VLM/LLM capabilities.
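
A minimal inference sketch with transformers is shown below. It assumes a recent transformers release that provides the image-text-to-text auto classes and enough GPU memory for the MoE checkpoint; the screenshot URL and prompt are placeholders.

```python
# Minimal inference sketch (assumes a recent transformers release and
# sufficient GPU memory for the 235B MoE checkpoint; inputs are placeholders).
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "mncai/hunmin_vlm_235b_v0.11_merged_cua"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/screenshot.png"},  # placeholder
        {"type": "text", "text": "Locate the Save button and return its click coordinates."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```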


2. Evaluation Results

2-1. Agent (Computer Use & Browser Use)

The final performance improvements were verified to be statistically significant based on six independent evaluation runs (p-value < 0.0001).
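
The exact statistical test is not specified here; as one way to run such a check, the sketch below applies a two-sample t-test from scipy to per-run scores. The six values per system are placeholders, not the actual run results.

```python
# Sketch: significance check over per-run scores with scipy's two-sample
# t-test. The values below are placeholders, not the actual run results.
from scipy import stats

baseline_runs = [31.2, 31.4, 31.5, 31.5, 31.6, 31.8]  # placeholder success rates (%)
tuned_runs    = [33.3, 33.5, 33.5, 33.6, 33.7, 33.8]  # placeholder success rates (%)

t_stat, p_value = stats.ttest_ind(tuned_runs, baseline_runs)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
```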

i. OSWorld-G (Grounding Benchmark)

| Category | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
| --- | --- | --- | --- | --- |
| Overall Accuracy | 65.99% (66.67%) | 68.00% (68.44%) | +2.01 | +3.05% |
| Fine-grained Manipulation | 52.69% | 56.60% | +3.92 | +7.43% |

(Values in parentheses are best-run results; all other values are averages over the six evaluation runs.)
  • Grounding accuracy improved by approximately 2 percentage points on average, with the largest gain (+3.92 pp) in fine-grained manipulation. (The Δ columns are derived as in the worked example below.)
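
For reference, the Δ columns are simple derived quantities: Δ (pp) is the absolute difference in percentage points, and Δ rel. is the relative change over the baseline.

```python
# Worked example for the Δ columns, using the OSWorld-G overall accuracy row.
baseline, tuned = 65.99, 68.00
delta_pp = tuned - baseline                # +2.01 percentage points
delta_rel = (tuned - baseline) / baseline  # +3.05% relative improvement
print(f"Δ (pp) = {delta_pp:+.2f}, Δ rel. = {delta_rel:+.2%}")
```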

ii. OSWorld (Simulation Benchmark)

| Category | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
| --- | --- | --- | --- | --- |
| Overall Success Rate | 31.51% (31.85%) | 33.57% (34.90%) | +2.06 | +6.53% |
| Daily Applications | 40.32% | 46.34% | +6.01 | +14.91% |
  • The overall success rate improved by approximately 2 percentage points on average.
  • Comparing best runs, the success rate increased from 31.85% (baseline) to 34.90% (LoRA-tuned).

| Domain | Baseline | LoRA-Tuned | Δ (pp) | Δ rel. |
| --- | --- | --- | --- | --- |
| vlc | 26.83% | 40.29% | +13.46 | +50.17% |
| chrome | 38.21% | 41.94% | +3.73 | +9.76% |
| libreoffice_writer | 52.16% | 55.06% | +2.90 | +5.56% |
  • The largest domain-level gains were observed in vlc, chrome, and libreoffice_writer.

2-2. Korean General Capability

Training focused on multimodal grounding and agent optimization; the benchmarks below indicate that general-purpose Korean multimodal and text capabilities are preserved at near-baseline levels.

i. Multimodal Benchmarks

| Model | K-MMBench | K-SEED | K-MMStar | K-DTCBench | K-LLaVA-W |
| --- | --- | --- | --- | --- | --- |
| Baseline | 90.94% | 80.85% | 65.33% | 95.83% | 97.36% |
| LoRA-Tuned | 91.06% | 79.74% | 65.40% | 95.00% | 99.83% |

ii. Language & Reasoning Benchmarks

| Model | KMMLU-Redux | KMMLU-Pro | K-Arena-Hard-Auto | K-IFEval |
| --- | --- | --- | --- | --- |
| Baseline | 71.67 | 67.67 | 73.14 | 89.40 |
| LoRA-Tuned | 71.00 | 66.68 | 73.14 | 88.98 |

3. Curated Training Data (≈1,000 High-Quality Samples)

Curated multimodal instruction-tuning data focused on UI grounding, layout understanding, icon recognition, and interaction trajectories (summarized in the sketch after this list):

  • Component & description grounding (~450 samples)
    Rule-based + generated datasets covering documents, slides, scrolling, snap icons, component libraries

  • Icon grounding (~200 samples)
    Icon-specific vision-language alignment data

  • Layout understanding (~200 samples)
    Layout200k, Layout400k (Claude-augmented), OS layout datasets

  • UI / Interaction trajectories (~150 samples)
    SeeClick, GUIEnv, WebUI, OmniAct, Mind2Web, AndroidControl, RicoSCA, etc.
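
As a compact summary of the mixture above, the approximate per-category sample budgets can be written as a simple mapping; the category keys are shorthand labels, not official dataset names.

```python
# Approximate sample budgets of the curated mixture (keys are shorthand
# labels for the four groups listed above).
MIXTURE = {
    "component_description_grounding": 450,
    "icon_grounding": 200,
    "layout_understanding": 200,
    "ui_interaction_trajectories": 150,
}
assert sum(MIXTURE.values()) == 1000  # ≈1,000 curated samples in total
```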


4. Future Work & Extensions

While this release empirically demonstrates the effectiveness of lightweight LoRA adaptation, it represents only an initial step toward fully optimized multimodal computer-use agents. Several key directions are planned for future extensions:

  • Full-Parameter Optimization
    The current model applies LoRA-based fine-tuning. Future work will explore full-parameter fine-tuning for further improvements.

  • Reinforcement Learning
    To move beyond supervised fine-tuning, we plan to incorporate reinforcement learning–based optimization methods. These approaches will directly reward successful long-horizon task completion and robustness under real-world UI uncertainty.

  • Long-Horizon Trajectory Data Synthesis
    Complex computer-use tasks often require reasoning over extended action sequences across multiple applications. To address this, we will develop a scalable data synthesis pipeline that generates long-horizon, multi-step interaction trajectories.

  • Expanded Evaluation Ecosystem
    Evaluation will be extended beyond OSWorld and OSWorld-G to include additional agent benchmarks. This will enable more fine-grained analysis of real-world usability.

Together, these extensions aim to transition the model from a grounding-focused PoC to a strong multimodal agent.
