home›Agentic Systems›When Should AI Agents Ask for Clarification? Timing Matters

New study reveals optimal windows for clarifying instructions in long-horizon agents, with goal info losing value after 10% of execution.

When Should AI Agents Ask for Clarification? Timing Matters

A forced-injection framework across 6,000+ runs shows that the value of clarification depends sharply on information type and timing. Goal clarification loses nearly all value after 10% of execution, while input clarification retains value through 50%. Current frontier models fail to ask within optimal windows.

#Academic#Agents#Context#Framework#LLM
When Should AI Agents Ask for Clarification? Timing Matters

When Should an AI Agent Ask for Help?

Long-horizon AI agents are designed to execute complex workflows that span hundreds of sequential actions. A single wrong assumption early in execution can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when. Until now, no prior work has measured how the value of clarification changes over the course of execution. A new paper, "Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?" (arXiv:2605.07937), fills that gap with a rigorous empirical framework.

A line chart with four colored curves representing different clarification types (goal, input, constraint, context) plotted against percentage of task execution on the x-axis (0% to 100%) and task success rate (pass@3) on the y-axis (0.0 to 1.0). The goal curve drops steeply from 0.78 to near baseline after 10% execution. The input curve declines gradually, retaining value until about 50% execution. The constraint and context curves fall in between. A horizontal dashed line marks the baseline performance when no clarification is asked. The chart title: "Value of Clarification Over Execution Trajectory."
A line chart with four colored curves representing different clarification types (goal, input, constraint, context) plotted against percentage of task execution on the x-axis (0% to 100%) and task success rate (pass@3) on the y-axis (0.0 to 1.0). The goal curve drops steeply from 0.78 to near baseline after 10% execution. The input curve declines gradually, retaining value until about 50% execution. The constraint and context curves fall in between. A horizontal dashed line marks the baseline performance when no clarification is asked. The chart title: "Value of Clarification Over Execution Trajectory."

A Forced-Injection Framework

The researchers introduced a novel forced-injection framework that provides ground-truth clarifications at controlled points in the agent's trajectory. They tested four information dimensions: goal, input, constraint, and context. Experiments were conducted across three agent benchmarks, using four frontier models (three models per benchmark, with one model used on a single benchmark only). In total, 84 task variants and over 6,000 runs were performed. This large-scale setup allowed the team to isolate the effect of timing from other variables.

Key Findings: Timing Is Everything

The value of clarification depends sharply on what information is missing. Goal clarification loses nearly all value after 10% of execution: pass@3 drops from 0.78 to baseline. In contrast, input clarification retains value through approximately 50% of execution. The most striking result: deferring any clarification type past mid-trajectory degrades performance below never asking at all.

"Deferring any clarification type past mid-trajectory degrades performance below never asking at all."

This suggests that late clarification is not just unhelpful—it actively harms the agent's performance, possibly by disrupting already established plans.

Cross-Model Consistency and Real-World Behavior

Kendall tau correlations among models sharing identical task coverage ranged from 0.78 to 0.87, confirming that timing profiles are substantially task-intrinsic. Across the full four-model panel, correlations dropped to 0.34–0.67, indicating some model-specific variation. A complementary study of 300 unscripted sessions revealed a sobering reality: no current frontier model asks within the empirically optimal window. Strategies ranged from over-asking (52% of sessions) to never asking at all. This mismatch between optimal timing and actual behavior underscores the need for better policies.

Conclusions and Contributions

The paper provides empirical demand curves that give quantitative foundation to existing theoretical frameworks. It establishes concrete design targets for timing-aware clarification policies. The authors have committed to releasing code and data publicly, enabling the community to build on their work. For developers of long-horizon agents, the message is clear: ask early, ask late, but above all, ask right.