
SkillOpt: Optimizing LLM Behavior with Trainable Skill Documents

Introducing SkillOpt, a novel framework that treats natural-language skill documents as trainable states for domain adaptation in large language models, enabling automated procedural improvement without modifying model weights.

May 27, 2026


SkillOpt optimizes large language model behavior by iteratively refining natural-language "skill documents" through a propose-and-test loop. An optimizer model suggests edits, the system applies them under a bounded textual learning rate, and a validation step accepts only edits that improve held-out performance. The result is robust, portable domain adaptation, even for closed-source frontier models.

A New Training Object: The Skill Document

As large language models power increasingly complex agents, adapting them to a new domain requires more than just a new prompt—it often demands improved procedures for gathering evidence, calling tools, and formatting outputs. Skill documents—compact natural-language artifacts that package these procedures—have emerged as a popular adaptation layer, but their creation is usually manual or one-shot. SkillOpt reimagines the skill document itself as a trainable state. By treating skill editing as a controlled optimization process, complete with rollouts, validation, and learning-rate–like bounds, the system can distill execution experience into reusable text without ever modifying the model’s weights. This makes domain adaptation possible even for closed, frozen frontier models.

Overview of SkillOpt

A Text-Space Optimizer with Deep-Learning Controls

SkillOpt runs a loop where a frozen target model executes tasks using the current skill, and a separate optimizer model analyzes the resulting trajectories. The process mirrors a training pipeline:

Rollout batches provide evidence (like training data).
Minibatch reflection over successes and failures proposes structured add/delete/replace edits.
A textual learning rate (an edit budget $L_t$) controls how many edits are applied per step, preserving continuity.
A validation gate evaluates candidate skills on a held-out selection split, accepting only those that improve performance. Rejected edits are preserved as negative feedback.
An epoch-wise slow/meta update captures longer-horizon regularities, acting like momentum.
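
The structured edits and the edit budget in the list above can be sketched as follows. This is a minimal illustration, not SkillOpt's implementation: it assumes a skill is stored as a list of rule lines, and the `Edit` class and `apply_edits` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """Hypothetical structured edit; not taken from the SkillOpt codebase."""
    op: str                       # "add", "delete", or "replace"
    target: Optional[str] = None  # existing rule the edit refers to (delete/replace)
    text: Optional[str] = None    # new rule text (add/replace)


def apply_edits(skill: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply at most `budget` edits to a skill given as a list of rule lines."""
    rules = list(skill)  # copy, so the current skill is never mutated
    for edit in edits[:budget]:  # edits beyond the budget are ignored
        if edit.op == "add" and edit.text:
            rules.append(edit.text)
        elif edit.op == "delete" and edit.target in rules:
            rules.remove(edit.target)
        elif edit.op == "replace" and edit.target in rules and edit.text:
            rules[rules.index(edit.target)] = edit.text
        # An edit whose target is missing is skipped but still uses budget.
    return rules
```

With a budget of 2, only the first two of three proposed edits change the skill, which is the continuity property the edit budget is meant to provide.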

Crucially, the optimizer model never touches the target model. The deployed artifact is a portable best_skill.md file, typically 300–2,000 tokens, that can be reused unchanged across models and harnesses.

Bounded Updates and the Validation Gate

The optimizer proposes edits that are first merged hierarchically (failure corrections are prioritized) and then ranked by expected utility. Only the top $L_t$ edits are applied, with the budget decaying over time (e.g., cosine schedule). This bounded textual update prevents the skill from being erased or over-edited by a single bad reflection.
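
A minimal sketch of a decaying edit budget and top-$L_t$ selection, under stated assumptions: the cosine formula, the bounds `l_max` and `l_min`, and the tuple layout of a candidate edit are illustrative choices made here, and the simple sort below stands in for the hierarchical merge, which the text does not describe in detail.

```python
import math
from typing import List, Tuple

# Hypothetical candidate layout: (expected_utility, is_failure_fix, edit_text)
Candidate = Tuple[float, bool, str]


def edit_budget(step: int, total_steps: int, l_max: int = 4, l_min: int = 1) -> int:
    """Cosine-decayed budget: l_max at step 0, l_min at the final step."""
    if total_steps <= 1:
        return l_max
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1.0 to 0.0
    return max(l_min, round(l_min + (l_max - l_min) * cosine))


def select_edits(candidates: List[Candidate], budget: int) -> List[str]:
    """Rank failure corrections first, then by utility; keep the top `budget`."""
    ranked = sorted(candidates, key=lambda c: (not c[1], -c[0]))
    return [text for _, _, text in ranked[:budget]]
```

Because the budget only shrinks, early steps can restructure the skill more freely while late steps make small, local changes.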

Every candidate skill is then evaluated on a separate selection split. It becomes the new skill only if its score strictly exceeds the current one; ties are rejected. This conservative gate is the central safety mechanism: plausible-sounding diagnoses that actually hurt the target model are caught before deployment. Rejected edits are not discarded—they enter a buffer that later optimizer calls see, providing negative feedback without any inference-time cost. The result is a propose-and-test cycle that steadily improves the skill while avoiding drift.
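
The gate described above can be sketched in a few lines. This is an illustration under assumptions, not the project's code: `validation_gate`, `score_fn`, and `rejected_buffer` are hypothetical names, and `score_fn` stands for whatever evaluates a skill on the selection split.

```python
from typing import Callable, List, Tuple


def validation_gate(
    current_skill: str,
    current_score: float,
    candidate_skill: str,
    candidate_edits: List[str],
    score_fn: Callable[[str], float],
    rejected_buffer: List[str],
) -> Tuple[str, float, bool]:
    """Accept the candidate only if it strictly beats the current score."""
    candidate_score = score_fn(candidate_skill)
    if candidate_score > current_score:  # strict comparison: ties are rejected
        return candidate_skill, candidate_score, True
    # Keep the rejected edits so later optimizer calls can see what failed.
    rejected_buffer.extend(candidate_edits)
    return current_skill, current_score, False
```

A candidate that merely matches the current score leaves the skill unchanged and adds its edits to the buffer, so the buffer grows only at training time and the deployed skill is unaffected.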

Epoch-Wise Slow/Meta Update and Harness-Agnostic Design

At the end of each epoch, SkillOpt runs the same training items under the previous and current skills, classifying them into improvements, regressions, persistent failures, and stable successes. The optimizer then writes a protected, longitudinal guidance block into the skill—its slow update—which step-level edits cannot overwrite. A separate optimizer-side meta skill summarizes which edit patterns helped, which failed, and which failures persisted, guiding future reflection calls. This separation keeps the deployed skill compact while allowing the trainer to learn from longer timelines.
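
The four-way classification can be sketched as below, assuming each training item has a binary pass/fail outcome under each skill. That binary outcome is a simplification made here (real scores may be graded), and `classify_epoch` is a hypothetical name.

```python
from typing import Dict, List


def classify_epoch(prev: Dict[str, bool], curr: Dict[str, bool]) -> Dict[str, List[str]]:
    """Bucket items by pass/fail under the previous and current skills."""
    buckets: Dict[str, List[str]] = {
        "improvements": [],         # failed before, passes now
        "regressions": [],          # passed before, fails now
        "persistent_failures": [],  # fails under both skills
        "stable_successes": [],     # passes under both skills
    }
    # Only items evaluated under both skills can be compared.
    for item_id in sorted(prev.keys() & curr.keys()):
        before, after = prev[item_id], curr[item_id]
        if after and not before:
            buckets["improvements"].append(item_id)
        elif before and not after:
            buckets["regressions"].append(item_id)
        elif not before and not after:
            buckets["persistent_failures"].append(item_id)
        else:
            buckets["stable_successes"].append(item_id)
    return buckets
```

The regressions and persistent failures are the buckets an optimizer would most plausibly draw on when writing longitudinal guidance, since they show what recent edits broke and what they never fixed.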

The entire loop is harness-agnostic. A thin adapter injects the skill into direct-chat, code-execution, or embodied environments and returns scored trajectories. The same optimizer codebase therefore trains skills for search QA, spreadsheet manipulation, document reasoning, mathematical MCQs, and household decision-making, as well as inside Codex and Claude Code sandboxes.
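
One way to picture the thin adapter is as a small interface that takes a skill and returns scored trajectories. The sketch below is an assumption about its shape, not the project's API: `Trajectory`, `HarnessAdapter`, `ToyChatAdapter`, and `mean_score` are hypothetical names, and the toy adapter fakes scoring with a keyword check instead of calling a model.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Trajectory:
    task_id: str
    transcript: str
    score: float


class HarnessAdapter(Protocol):
    """Anything that can run tasks under a skill and return scored trajectories."""
    def run(self, skill: str, task_ids: List[str]) -> List[Trajectory]: ...


class ToyChatAdapter:
    """Stand-in for a direct-chat harness; scores 1.0 if the skill says 'verify'."""
    def run(self, skill: str, task_ids: List[str]) -> List[Trajectory]:
        score = 1.0 if "verify" in skill.lower() else 0.0
        return [Trajectory(t, f"[skill]\n{skill}\n[task] {t}", score) for t in task_ids]


def mean_score(adapter: HarnessAdapter, skill: str, task_ids: List[str]) -> float:
    """Average trajectory score; the optimizer side needs nothing harness-specific."""
    trajectories = adapter.run(skill, task_ids)
    if not trajectories:
        return 0.0
    return sum(t.score for t in trajectories) / len(trajectories)
```

Because `mean_score` depends only on the `run` method, swapping a chat adapter for a code-execution or embodied one would not change the optimizer code in this sketch.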

Experimental Dominance Across the Board

SkillOpt was evaluated on six benchmarks, seven target models (from GPT‑5.5 to Qwen‑3.5‑4B), and three execution modes. It is best or tied-best in all 52 measured (model, benchmark, harness) cells. On GPT‑5.5 in direct chat, it lifts the six-benchmark average from 58.8% (no skill) to 82.3% (+23.5 points), and it beats by +5.4 points an oracle that picks the best of seven competing baselines (including human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill). Gains are largest on procedural tasks: SpreadsheetBench jumps from 41.8 to 80.7 and OfficeQA from 33.1 to 72.1. The same optimizer inside the Codex and Claude Code harnesses yields average improvements of +24.8 and +19.1 points, outperforming the strongest harness-side rival, EvoSkill, by +14.0 and +3.2 points, respectively.

Small target models benefit disproportionately: GPT‑5.4‑nano nearly doubles its score on DocVQA and triples it on ALFWorld, showing that a compact skill can supply procedural knowledge that small models lack.

Ablations: Evidence, Budgets, and the Role of Memory

Controlled ablations confirm that the optimizer’s design choices matter.

Training evidence: Procedural benchmarks improve steadily as more training data is exposed (SpreadsheetBench +30.5 points from 1% to 100% data), while factual QA saturates early.
Bounded learning rate: Removing the edit budget (allowing unbounded rewrites) degrades performance. With a budget of $L_t=4$, scores hold near the top across settings.
Rejected-edit buffer: Removing it lowers SpreadsheetBench by 4.6 points, confirming it stabilizes learning.
Epoch-wise slow/meta update: This is the most dramatic ablation. Removing both the meta skill and the slow update drops SpreadsheetBench from 77.5 to 55.0 (−22.5 points), so this mechanism is critical for retaining long-horizon lessons.

A stronger frontier optimizer always yields larger gains than a target-matched one, but even a target-matched optimizer recovers 56–74% of the strong-optimizer gain, showing the loop itself contributes value beyond raw optimizer power.
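
The recovery figure can be read as a ratio of gains over the same no-skill baseline. The sketch below assumes that definition, which the text does not spell out; `recovered_fraction` is a hypothetical name and the numbers in the usage note are made up for illustration, not taken from the experiments.

```python
def recovered_fraction(baseline: float, matched: float, strong: float) -> float:
    """Share of the strong-optimizer gain that a target-matched optimizer recovers.

    All three arguments are scores on the same benchmark:
    no skill, skill from a target-matched optimizer, skill from a strong optimizer.
    """
    strong_gain = strong - baseline
    if strong_gain <= 0:
        raise ValueError("the strong optimizer must improve on the baseline")
    return (matched - baseline) / strong_gain
```

For example, with illustrative scores of 40.0 (no skill), 53.0 (target-matched), and 60.0 (strong optimizer), the function returns 13/20 = 0.65, which would fall inside the 56–74% range reported above.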

Transfer, Compactness, and What Skills Learn

Skills trained on one model or harness transfer positively in every setting tested:

Cross-model: A SpreadsheetBench skill from GPT‑5.4 improves smaller GPT variants by +3.0 to +10.7 points.
Cross-harness: A Codex-trained spreadsheet skill transferred to Claude Code gains +59.7 points over the no-skill Claude Code baseline.
Cross-benchmark: An OlympiadBench skill yields positive gains on Omni‑MATH across three model scales.

The learned artifacts are remarkably compact: only 300–2,000 tokens after 1–4 accepted edits. The training cost per point of test-set gain varies widely (0.6–46.4M training tokens), but the expense is paid once during offline training; deployment adds zero extra cost.

Learned rules per benchmark

The rules themselves are procedural, not instance-specific. For example, the spreadsheet skill learns to “inspect workbook structure and formulas, then write evaluated static values … instead of relying on Excel recalculation,” while the ALFWorld skill adds a visited/frontier ledger and loop breaker. These are exactly the disciplined patterns a human expert would codify after observing failures—arrived at automatically by the optimizer and validated on held-out data.

Conclusion and Outlook

SkillOpt demonstrates that a natural-language skill document can serve as a trainable, self-improving adaptation layer for frozen LLM agents. By importing deep-learning controls (batches, learning rates, validation gates, and negative feedback) into text-space editing, the system produces compact, interpretable artifacts that transfer across models, harnesses, and benchmarks, setting a new state of the art for domain adaptation without weight updates. Future directions include building skill libraries, reward-free validation for open-ended tasks, and self-distillation of optimized skills back into model weights. Treating the skill itself as the trainable object opens the door for the full optimization toolkit to be applied to agentic procedures.
