arxiv:2604.19742

PlayCoder: Making LLM-Generated GUI Code Playable

Published on Apr 21 · Submitted by Wei Tao on Apr 22 · Tencent

Abstract

Large language models struggle to generate logically correct GUI applications. This motivates the PlayEval benchmark and the PlayCoder framework, which uses a multi-agent approach to improve functional correctness through iterative repair.

AI-generated summary

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
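The Play@k definition above can be made concrete with a small sketch. It assumes each task has a list of per-candidate playability verdicts (for example, produced by a tester such as PlayTester) and reports the fraction of tasks where at least one of the first k candidates is playable. The function name `play_at_k`, the task names, and the verdicts are hypothetical, and the paper's exact aggregation or estimator may differ from this direct count.

```python
from typing import Dict, List


def play_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k candidates is playable.

    `results` maps a task name to per-candidate verdicts: True means the
    candidate was played end-to-end without a logic violation.
    This is an illustrative aggregation, not the paper's reference code.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not results:
        return 0.0
    solved = sum(1 for verdicts in results.values() if any(verdicts[:k]))
    return solved / len(results)


# Hypothetical verdicts for four tasks with three candidates each.
results = {
    "snake": [False, True, False],
    "tetris": [False, False, False],
    "calculator": [False, False, True],
    "paint": [False, False, False],
}

print(f"Play@1 = {play_at_k(results, 1):.2f}")
print(f"Play@3 = {play_at_k(results, 3):.2f}")
```

With these made-up verdicts, no task is playable on the first candidate, while two of the four tasks have a playable candidate within three attempts.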

Community


🤖 Current code LLMs can generate GUI code that compiles, but the result is rarely playable or interactively functional.
This work builds a complete pipeline from evaluation to refinement for LLM-generated GUI programs.

🎮 PlayEval: a new multi-language benchmark for playable GUI applications
📏 Play@k: a dedicated metric focusing on real interaction logic quality
🛠️ PlayCoder: a multi-agent framework that iteratively repairs and improves code

The work provides valuable insights for researchers interested in code generation and LLM agents.

Feel free to contact us at pzy2000@sjtu.edu.cn if you have further questions!

the closed loop of PlayTester feeding into PlayRefiner to repair behavior failures is the most interesting part. this shifts evaluation from unit correctness to end-to-end behavioral validity, which is exactly what GUI games need to be playable. my one question is how robust PlayTester is to variation in interaction order or stochastic paths; did you measure sensitivity of repairs to different playthrough schedules? the arxivlens breakdown helped me parse the method details, check it here: https://arxivlens.com/PaperView/Details/playcoder-making-llm-generated-gui-code-playable-6479-ac923d62


