
Generative UI: Revolutionizing AI Agent Interactions Beyond Plain Text

This paper introduces Macaron-A2UI, a novel model enabling AI agents to dynamically synthesize interactive UI controls alongside natural language, addressing the limitations of text-only interfaces.

May 27, 2026

#Academic #Agents #Fine Tuning #LLM #Reinforcement Learning

Discover Macaron-A2UI, a groundbreaking model that allows AI agents to generate interactive UI elements using a declarative protocol. Learn about its comprehensive corpus construction, A2UI-Bench for structured evaluation, and a two-stage training recipe combining SFT and GRPO to enhance user experience and agent capability.

The Bottleneck of Plain-Text Agents

As AI personal agents grow more capable, the limitations of static, text-only chat interfaces become increasingly apparent. When users need to provide structured information, compare options, confirm decisions, or juggle multiple goals in a single turn, long text replies slow reading and increase cognitive load. Generative UI — the ability for an agent to dynamically synthesize interactive controls, options, and state in real time — emerges as the necessary next interface layer.

The paper introduces Macaron-A2UI, a model that moves beyond text-only interaction by enabling agents to generate natural language alongside lightweight, executable UI actions. Rather than producing arbitrary code, the model emits structured messages in A2UI, a declarative UI protocol that a trusted client renderer translates into interactive widgets. This separation makes generation safer, more portable across rendering environments, and easier to validate automatically. The core research question is whether models can internalize this capability without relying on long schema prompts at inference time.
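The separation between a declarative message and a trusted renderer can be illustrated with a minimal sketch. The message shape below (`components`, `type`, `label`, `options`) and the allowlist are hypothetical and are not taken from the A2UI protocol itself; the point is only that the client interprets data and never executes model-written code.

```python
import json

# Hypothetical allowlist of widget types the trusted client knows how to draw.
ALLOWED_COMPONENTS = {"text", "button", "choice_list"}

def render(message_json: str) -> list[str]:
    """Turn a declarative message into display strings; reject unknown types."""
    message = json.loads(message_json)
    rendered = []
    for component in message.get("components", []):
        kind = component.get("type")
        if kind not in ALLOWED_COMPONENTS:
            raise ValueError(f"unsupported component type: {kind!r}")
        if kind == "text":
            rendered.append(component["value"])
        elif kind == "button":
            rendered.append(f"[ {component['label']} ]")
        else:  # choice_list
            rendered.extend(f"( ) {option}" for option in component["options"])
    return rendered

# Illustrative message (assumed shape, not the real A2UI schema).
example = json.dumps({
    "components": [
        {"type": "text", "value": "Pick a time:"},
        {"type": "choice_list", "options": ["18:00", "19:30"]},
        {"type": "button", "label": "Confirm"},
    ]
})
print(render(example))
```

Because anything outside the allowlist is rejected before rendering, a malformed or unexpected generation fails safely instead of reaching the user.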

Figure 1: Many dialogue turns that are cumbersome in plain text become more efficient when the assistant can render lightweight structured interfaces.

Building a Generative UI Corpus

Training a model to produce protocol-compliant, contextually appropriate UI requires large-scale supervision data. The authors construct a corpus from four heterogeneous dialogue sources: task-oriented assistance (MultiWOZ and Schema-Guided Dialogue, SGD), emotional support (ESConv), and motivational interviewing (AnnoMI). These are normalized into a unified format of (context, response) pairs, where each response may contain an optional A2UI payload.

A hybrid rule-and-LLM pipeline annotates the data. For task-oriented datasets, where source annotations already constrain interaction semantics, a state-machine-style converter deterministically generates UI surfaces and widgets. For open-domain data, a two-stage LLM process is used: an Editor pass decides which turns should contain UI, and an Author pass generates the local component content. All outputs pass through deterministic post-processing and a four-level validation pipeline checking format, structure, data-binding, and semantic consistency. The final corpus contains 14,245 assistant-turn samples, with a 71.7% UI ratio and 99.2% renderability after repair.
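As a rough illustration of the deterministic branch, a rule-based converter might map each annotated system act to a widget template. The act names (`request`, `offer`, `confirm`), widget names, and payload keys below are assumptions made for this sketch; the article does not list the authors' actual rules.

```python
from typing import Optional

# Hypothetical mapping from an annotated system act to a widget template.
ACT_TO_WIDGET = {
    "request": "text_input",      # system asks the user for a slot value
    "offer": "choice_list",       # system proposes several options
    "confirm": "confirm_button",  # system asks for a yes/no decision
}

def convert_turn(act: str, slot: str, values: list[str]) -> Optional[dict]:
    """Deterministically map one annotated turn to a UI payload.

    Returns None when the act has no UI template, i.e. a text-only turn.
    """
    widget = ACT_TO_WIDGET.get(act)
    if widget is None:
        return None
    payload = {"widget": widget, "slot": slot}
    if widget == "choice_list":
        payload["options"] = list(values)
    return payload

print(convert_turn("offer", "restaurant", ["Nandos", "Zizzi"]))
print(convert_turn("inform", "area", []))
```

The same input always yields the same payload, which is what makes this branch cheap to validate compared with the LLM-driven Editor and Author passes.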

Figure 2: Overview of the A2UI corpus construction pipeline.

A Benchmark for Structured Evaluation

To complement the training corpus, the authors introduce A2UI-Bench, a dedicated benchmark of 300 tasks designed for controlled evaluation rather than training-scale diversity. Tasks are organized into three structural families:

Atomic tasks: Single-turn evaluations measuring core turn-level ability to decide whether UI is needed and to generate an appropriate interface.
Depth tasks: Multi-turn episodes testing cross-turn consistency, state maintenance, and surface lifecycle management.
Width tasks: Single-turn but compositionally broader, requiring the model to organize a unified response addressing multiple sub-goals.

Evaluation operates at three levels. L1 measures protocol correctness through automated checks on JSON parsing, schema compliance, reference integrity, and value formatting. L2 assesses task construction quality via LLM judges on trigger appropriateness, component-intent alignment, text-UI grounding, data model utilization, and action completeness. L3 evaluates user experience quality, including value-addition over plain text, conversational naturalness, and cognitive load. A complementary visual evaluation layer scores rendered screenshots for integrity, task alignment, and action clarity.
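The L1 checks lend themselves to plain code. The sketch below runs the four kinds of check named above against an assumed payload shape (a `components` list whose items may `bind` to keys in a `data` object, with `*_date` fields in ISO format). These field names and the date rule are illustrative assumptions, not the benchmark's real schema.

```python
import json
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed value-format rule

def l1_check(raw: str) -> list[str]:
    """Return a list of protocol errors; an empty list means the payload passes."""
    # 1. JSON parsing
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["json_parse"]
    # 2. Schema compliance (assumed top-level shape)
    if (not isinstance(payload, dict)
            or not isinstance(payload.get("components"), list)
            or not isinstance(payload.get("data"), dict)):
        return ["schema"]
    errors = []
    data = payload["data"]
    # 3. Reference integrity: every binding must point at an existing data key
    for component in payload["components"]:
        if not isinstance(component, dict):
            errors.append("schema")
            continue
        ref = component.get("bind")
        if ref is not None and ref not in data:
            errors.append(f"reference:{ref}")
    # 4. Value formatting
    for key, value in data.items():
        if key.endswith("_date") and not DATE_PATTERN.match(str(value)):
            errors.append(f"format:{key}")
    return errors

good = '{"components": [{"type": "date_picker", "bind": "checkin_date"}], "data": {"checkin_date": "2026-05-27"}}'
print(l1_check(good))
```

Checks of this kind need no LLM judge, which is why L1 can be fully automated while L2 and L3 rely on model-based scoring.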

A Two-Stage Training Recipe

The training pipeline runs supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO), both using parameter-efficient LoRA adaptation. SFT teaches the model the basic response format, jointly producing fluent text and protocol-compliant UI actions, using a standard autoregressive negative log-likelihood objective over the target response $y$ of length $T$ given the context $x$:

$\mathcal{L}_{\mathrm{SFT}}=-\sum_{t=1}^{T}\log p_{\theta}(y_{t}\mid x,y_{<t})$
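A small numeric example makes the objective concrete. The function below takes the probabilities a model assigns to each reference token (made-up numbers here) and sums their negative logs, which is exactly the quantity in the formula above for a single sequence.

```python
import math

def sft_loss(token_probs: list[float]) -> float:
    """Negative log-likelihood: -sum_t log p(y_t | x, y_<t) for one sequence."""
    return -sum(math.log(p) for p in token_probs)

# Made-up probabilities for a three-token target response.
loss = sft_loss([0.5, 0.25, 0.5])
print(loss)  # equals log(16), since 0.5 * 0.25 * 0.5 = 1/16
```

In practice the loss is computed from logits with a cross-entropy routine and averaged over a batch; this sketch only shows the arithmetic.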

GRPO then refines behavior under an interaction-oriented reward. For each prompt $i$, the model samples a group of $G$ candidate responses, scores each response $j$ with a reward $R_{i,j}$ that combines structural quality, task-construction quality, and user-level utility, and computes a group-relative advantage:

$A_{i,j}=R_{i,j}-\frac{1}{G}\sum_{k=1}^{G}R_{i,k}$
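The advantage is simply each reward minus the mean reward of its group. A minimal sketch, using made-up rewards for one prompt:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """A_j = R_j - mean(R) over the G responses sampled for one prompt."""
    mean_reward = sum(rewards) / len(rewards)
    return [reward - mean_reward for reward in rewards]

# Made-up rewards for a group of G = 3 sampled responses.
advantages = group_relative_advantages([0.0, 0.5, 1.0])
print(advantages)  # [-0.5, 0.0, 0.5]
```

Advantages within a group sum to zero, so responses are pushed up or down only relative to their siblings for the same prompt. Some GRPO formulations also divide by the group's standard deviation; the formula given here uses mean-centering only, and the sketch follows it.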

The reward design applies hard structural gates: malformed JSON, missing required output, or render-critical errors receive zero reward. Responses that pass these checks are scored on L1 correctness, L2 task quality, and L3 user utility. This two-stage approach is instantiated on Qwen3-30B, Qwen3-235B, and GLM-5.1 backbones.
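The gating logic can be sketched as a function that returns zero as soon as a hard check fails and only then blends the three level scores. The weights, the `render_ok` flag, and the assumption that the UI part of the response is a standalone JSON string are all illustrative; the article does not state how the authors combine L1, L2, and L3.

```python
import json

# Assumed weights; chosen only so that they sum to 1.
W_L1, W_L2, W_L3 = 0.25, 0.25, 0.5

def gated_reward(ui_payload: str, l1: float, l2: float, l3: float,
                 render_ok: bool = True) -> float:
    """Zero reward on any hard structural failure, weighted score otherwise."""
    if not ui_payload.strip():          # missing required output
        return 0.0
    try:
        json.loads(ui_payload)          # malformed JSON
    except json.JSONDecodeError:
        return 0.0
    if not render_ok:                   # render-critical error reported upstream
        return 0.0
    return W_L1 * l1 + W_L2 * l2 + W_L3 * l3

print(gated_reward('{"components": []}', l1=1.0, l2=0.5, l3=0.5))
print(gated_reward('{"components": [', l1=1.0, l2=1.0, l3=1.0))
```

Gating before scoring means a response cannot compensate for a broken payload with high judged quality, which matches the "hard" nature of the gates described above.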

Results: Internalizing UI Competence

The primary evaluation regime is the w/o schema setting, in which models receive only lightweight protocol instructions and must rely on internalized A2UI competence. The table below compares two Macaron-A2UI models in this setting against frontier models that are given the full schema.

Model	L1	L2	L3	V1	V2	V3	Avg.
GPT-5.4 w/ schema	4.02	3.59	3.27	3.46	3.73	3.17	3.54
Gemini-3.1-Pro w/ schema	4.25	3.20	2.96	3.53	3.55	3.04	3.42
Macaron-A2UI-Grande w/o schema	4.67	3.22	2.91	3.95	3.74	3.47	3.66
Macaron-A2UI-Venti w/o schema	4.47	3.36	3.28	3.95	3.76	3.52	3.72
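The Avg. column appears to be the unweighted mean of the six preceding columns. That reading is an inference rather than something the article states, but it can be checked directly against the reported numbers:

```python
# Rows copied from the table above: six column scores, then the reported Avg.
rows = {
    "GPT-5.4 w/ schema": ([4.02, 3.59, 3.27, 3.46, 3.73, 3.17], 3.54),
    "Gemini-3.1-Pro w/ schema": ([4.25, 3.20, 2.96, 3.53, 3.55, 3.04], 3.42),
    "Macaron-A2UI-Grande w/o schema": ([4.67, 3.22, 2.91, 3.95, 3.74, 3.47], 3.66),
    "Macaron-A2UI-Venti w/o schema": ([4.47, 3.36, 3.28, 3.95, 3.76, 3.52], 3.72),
}

def matches_reported(scores: list[float], reported: float) -> bool:
    """True if the plain mean agrees with the reported value to two decimals."""
    return abs(sum(scores) / len(scores) - reported) < 0.005

results = {name: matches_reported(scores, avg) for name, (scores, avg) in rows.items()}
print(results)
```

All four rows agree with a plain mean to two decimal places.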

For Qwen3-30B, SFT improves the overall score from 19.8 to 37.2, and RL pushes it further to 58.8. Qwen3-235B improves from 21.6 at base to 63.6 after SFT and reaches 74.2 after RL. The best model, Macaron-A2UI-Venti trained from GLM-5.1, achieves an overall score of 75.6, surpassing the strongest full-schema frontier baseline (GPT-5.4 at 74.1). Out-of-the-box frontier models remain weak without schema hints, confirming that lightweight instructions are insufficient for untuned models to acquire stable A2UI competence.

Figure 4: Training-pipeline ablation under the w/o schema prompt regime.

RL Dynamics and Cross-Domain Robustness

The reward trajectories during GRPO training reveal a consistent pattern. Across both model scales, the L1 reward rises first and most rapidly, indicating that protocol correctness and structural executability are the easiest properties to improve under reinforcement learning. Improvements in higher-level interaction quality occur more gradually. The 235B model shows steady improvement in both L2 and L3 rewards throughout training, while the 30B model's L3 reward remains flatter, suggesting that user-facing quality is harder to optimize at smaller scales.

Per-dataset and per-task breakdowns show strong cross-domain robustness. Macaron-A2UI-235B achieves scores in a narrow range (3.82–3.84) across MultiWOZ, SGD, ESConv, and AnnoMI. It is the best model on atomic tasks (4.38) and width tasks (3.96), while remaining competitive on depth tasks (3.14). RL primarily strengthens the model's ability to translate dialogue intent into concise, well-structured, and interaction-ready UI decisions, with especially large gains on width tasks across all four datasets.

Figure 6: Reward trajectories during GRPO training.

Why This Matters

This work establishes Generative UI for personal agents as a tractable learning problem with measurable progress. Three contributions stand out. First, the scalable pipeline for transforming heterogeneous dialogue corpora into multi-turn Generative UI data, combining LLM-based annotation with rule-based repair and validation, provides a blueprint for future data construction efforts. Second, A2UI-Bench offers a standardized evaluation framework that separates protocol validity from interaction quality, enabling rigorous comparison across models. Third, the two-stage training recipe demonstrates that executable UI generation can be internalized without long schema prompts at inference time, making deployment more practical.

The results carry an important implication: Generative UI competence does not have to depend on heavy schema prompting. Through targeted training, models can learn when to produce UI, what UI to produce, and how to produce protocol-compliant UI under lightweight instructions. This shifts the paradigm from prompt engineering toward learned interaction design, opening the door to more fluid, efficient, and personalized agent interfaces.
