The High Price of Intelligent Workflows
Modern AI assistants often follow multiâstep procedures â booking a flight, troubleshooting software, processing an insurance claim.
The dominant approach, called surface orchestration, wraps a large language model (LLM) in an external controller that injects prompts and routes decisions at every turn.
This works well but is expensive: each step calls a frontier model, and the orchestrator adds latency and complexity.
A team at the University of Melbourne asks a provocative question: what if we could compile the entire workflow directly into the weights of a small model, eliminating the orchestrator at runtime?
Two Architectures: Surface vs. Subterranean
Surface orchestration is like a GPS that constantly tells you where to turn.
An external program sits between the user and the LLM, feeding node prompts and deciding the next step based on the modelâs output.
Subterranean compilation flips the script.
The orchestrator is used only during training to generate example dialogues.
At deployment, the user talks directly to a fineâtuned small model â the subterranean agent â which follows the procedure from its own weights, guided by a minimal system prompt.
The paperâs core insight: procedural knowledge can be baked into parameters, not reâinjected at every call.

The Compilation Pipeline: From Flowchart to Weights
The compilation pipeline has four stages.
First, experts define the workflow as a directed graph (flowchart) with nodes for agent and user turns, and edges that encode transitions and conditions.
Second, a frontier model (Claude Sonnet 4.5) generates synthetic conversations by traversing all valid acyclic paths through the graph.
Third, a small openâsource LLM is fineâtuned on these dialogues using fullâparameter updates â lowârank methods like LoRA were shown to fail on procedural tasks.
Finally, the model is deployed without any orchestrator; it receives only a short instruction like âYou are a helpful travel booking assistant.â
The training data contains only natural dialogue, never the underlying flowchart annotations.
Procedure as Directed Graphs
Workflows are formalized as graphs with nodes (agent/user turns), edges (transitions with optional conditions), a start node, and terminal nodes for success, abandonment, or escalation.
Three domains test the approachâs range:
- Travel booking (14 nodes, 86 unique paths, 4â17 turns)
- Zoom support (14 nodes, 60 paths, encodes productâspecific knowledge about UI and error codes)
- Insurance claims (55 nodes, 2,381 paths, 9â39 turns, with nested loops and crossâphase dependencies)
The insurance graphâs complexity demonstrates that compilation can handle realâworld enterprise workflows, not just simple linear scripts.
Rigorous Evaluation with Simulated Users
All experiments use 200 scenarios per condition, generated by a dynamic user simulator (Claude Sonnet 4.5) that roleâplays customers with varied personalities, budgets, and goals â without seeing the flowchart.
Each conversation is scored by an LLMâasâjudge on five criteria (1â5 scale): Task Success, Information Accuracy, Consistency, Graceful Handling, and Naturalness.
Primary scoring uses Claude Sonnet 4.5; a robustness check rescored all conversations with GPTâ4.1 using the identical rubric.
Statistical comparisons rely on Wilcoxon signedârank or MannâWhitney U tests with HolmâBonferroni correction, plus Cohenâs d and bootstrap confidence intervals.
Travel Booking: A 3B Model Challenges the Frontier
The 3B subterranean agent (Qwen 2.5 3B Instruct, fineâtuned on 2,125 synthetic dialogues) was pitted against three baselines.
| Comparison | Task Success | Info Accuracy | Consistency | Graceful Handling | Naturalness |
|---|---|---|---|---|---|
| vs. 3B orchestrator | +0.18*** | +0.05 (n.s.) | +0.22*** | +0.20*** | +0.17*** |
| vs. LangGraph (Claude 3.5) | comparable | 4.75 vs 4.21*** | comparable | 4.07 vs 4.62*** | 4.12 vs 4.84*** |
| vs. inâcontext Claude 3.5 | ~102% of accuracy | â | â | ~82% of graceful | ~82% of natural |
The small model beats its own size when orchestrated, and it outperforms the 70Ă larger frontier model on information accuracy.
It lags in graceful handling and naturalness, but the gap is modest â and the cost is two orders of magnitude lower.
Zoom Support and the Road Ahead
Scaling to an 8B model (Qwen3â8B) on the Zoom support domain confirms the trend.
With 8 independent training runs and more data, the subterranean agent again matches or exceeds the LangGraph orchestrator on task success and accuracy, while running at a fraction of the cost.
The insurance claims domain (55 nodes) pushes the method further, showing that even deeply nested procedures can be internalized.
These results suggest a future where complex agentic workflows are deployed onâdevice or at massive scale, without paying the orchestration tax at every turn.





