Tailored news hub
home›Agentic Systems›

How to Compile Multi-Step AI Workflows Directly into Small Models

Subterranean compilation eliminates the orchestrator at runtime, slashing costs and latency while matching frontier accuracy.

How to Compile Multi-Step AI Workflows Directly into Small Models
#Agents#Automation#Fine Tuning#LLM#Training

Discover how synthetic data and full-parameter fine-tuning can internalize complex procedures in a small LLM, removing the need for external orchestration and delivering dramatic cost savings.

The High Price of Intelligent Workflows

Modern AI assistants often follow multi‑step procedures — booking a flight, troubleshooting software, processing an insurance claim.
The dominant approach, called surface orchestration, wraps a large language model (LLM) in an external controller that injects prompts and routes decisions at every turn.
This works well but is expensive: each step calls a frontier model, and the orchestrator adds latency and complexity.
A team at the University of Melbourne asks a provocative question: what if we could compile the entire workflow directly into the weights of a small model, eliminating the orchestrator at runtime?

Two Architectures: Surface vs. Subterranean

Surface orchestration is like a GPS that constantly tells you where to turn.
An external program sits between the user and the LLM, feeding node prompts and deciding the next step based on the model’s output.

Subterranean compilation flips the script.
The orchestrator is used only during training to generate example dialogues.
At deployment, the user talks directly to a fine‑tuned small model — the subterranean agent — which follows the procedure from its own weights, guided by a minimal system prompt.
The paper’s core insight: procedural knowledge can be baked into parameters, not re‑injected at every call.

A luminous GPS signal hovers above cracked earth, its blue light fracturing into thin rays that penetrate the soil. Below the surface, an intricate network of glowing root-like filaments pulses with embedded intelligence, the fragmented signal absorbed and woven into organic neural pathways that branch endlessly through dark, fertile ground. At the deepest level, a single small crystalline seed radiates quiet competence, no longer needing external guidance. The contrast between the cold, distant satellite above and the warm, self-contained organism below captures the shift from surface orchestration to subterranean compilation.

The Compilation Pipeline: From Flowchart to Weights

The compilation pipeline has four stages.
First, experts define the workflow as a directed graph (flowchart) with nodes for agent and user turns, and edges that encode transitions and conditions.
Second, a frontier model (Claude Sonnet 4.5) generates synthetic conversations by traversing all valid acyclic paths through the graph.
Third, a small open‑source LLM is fine‑tuned on these dialogues using full‑parameter updates — low‑rank methods like LoRA were shown to fail on procedural tasks.
Finally, the model is deployed without any orchestrator; it receives only a short instruction like “You are a helpful travel booking assistant.”
The training data contains only natural dialogue, never the underlying flowchart annotations.

Procedure as Directed Graphs

Workflows are formalized as graphs with nodes (agent/user turns), edges (transitions with optional conditions), a start node, and terminal nodes for success, abandonment, or escalation.
Three domains test the approach’s range:

  • Travel booking (14 nodes, 86 unique paths, 4–17 turns)
  • Zoom support (14 nodes, 60 paths, encodes product‑specific knowledge about UI and error codes)
  • Insurance claims (55 nodes, 2,381 paths, 9–39 turns, with nested loops and cross‑phase dependencies)

The insurance graph’s complexity demonstrates that compilation can handle real‑world enterprise workflows, not just simple linear scripts.

Rigorous Evaluation with Simulated Users

All experiments use 200 scenarios per condition, generated by a dynamic user simulator (Claude Sonnet 4.5) that role‑plays customers with varied personalities, budgets, and goals — without seeing the flowchart.
Each conversation is scored by an LLM‑as‑judge on five criteria (1–5 scale): Task Success, Information Accuracy, Consistency, Graceful Handling, and Naturalness.
Primary scoring uses Claude Sonnet 4.5; a robustness check rescored all conversations with GPT‑4.1 using the identical rubric.
Statistical comparisons rely on Wilcoxon signed‑rank or Mann–Whitney U tests with Holm–Bonferroni correction, plus Cohen’s d and bootstrap confidence intervals.

Travel Booking: A 3B Model Challenges the Frontier

The 3B subterranean agent (Qwen 2.5 3B Instruct, fine‑tuned on 2,125 synthetic dialogues) was pitted against three baselines.

ComparisonTask SuccessInfo AccuracyConsistencyGraceful HandlingNaturalness
vs. 3B orchestrator+0.18***+0.05 (n.s.)+0.22***+0.20***+0.17***
vs. LangGraph (Claude 3.5)comparable4.75 vs 4.21***comparable4.07 vs 4.62***4.12 vs 4.84***
vs. in‑context Claude 3.5~102% of accuracy——~82% of graceful~82% of natural

The small model beats its own size when orchestrated, and it outperforms the 70× larger frontier model on information accuracy.
It lags in graceful handling and naturalness, but the gap is modest — and the cost is two orders of magnitude lower.

Zoom Support and the Road Ahead

Scaling to an 8B model (Qwen3‑8B) on the Zoom support domain confirms the trend.
With 8 independent training runs and more data, the subterranean agent again matches or exceeds the LangGraph orchestrator on task success and accuracy, while running at a fraction of the cost.
The insurance claims domain (55 nodes) pushes the method further, showing that even deeply nested procedures can be internalized.
These results suggest a future where complex agentic workflows are deployed on‑device or at massive scale, without paying the orchestration tax at every turn.

Related Articles