A New Training Object: The Skill Document
As large language models power increasingly complex agents, adapting them to a new domain requires more than a new prompt: it often demands improved procedures for gathering evidence, calling tools, and formatting outputs. Skill documents, compact natural-language artifacts that package these procedures, have emerged as a popular adaptation layer, but they are usually written by hand or generated in one shot. SkillOpt instead treats the skill document itself as trainable state. By framing skill editing as a controlled optimization process, complete with rollouts, validation, and learning-rate-like bounds, the system can distill execution experience into reusable text without ever modifying the model's weights. This makes domain adaptation possible even for closed, frozen frontier models.

A Text-Space Optimizer with Deep-Learning Controls
SkillOpt runs a loop where a frozen target model executes tasks using the current skill, and a separate optimizer model analyzes the resulting trajectories. The process mirrors a training pipeline:
- Rollout batches provide evidence (like training data).
- Minibatch reflection over successes and failures proposes structured add/delete/replace edits.
- A textual learning rate (an edit budget) controls how many edits are applied per step, preserving continuity.
- A validation gate evaluates candidate skills on a held-out selection split, accepting only those that improve performance. Rejected edits are preserved as negative feedback.
- An epoch-wise slow/meta update captures longer-horizon regularities, acting like momentum.
Crucially, the optimizer model never touches the target model.
The deployed artifact is a portable best_skill.md file, typically 300 to 2,000 tokens, that can be reused unchanged across models and harnesses.
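To make the add/delete/replace edits listed above concrete, here is a minimal sketch of one way such edits could be represented and applied to a skill stored as a list of lines. The `Edit` schema, its field names, and the line-level matching are hypothetical choices for illustration; the source does not specify the edit format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """One structured edit to a skill document (hypothetical schema)."""
    op: str                        # "add", "delete", or "replace"
    target: Optional[str] = None   # existing line to delete or replace
    text: Optional[str] = None     # new line for add or replace


def apply_edits(skill_lines: List[str], edits: List[Edit]) -> List[str]:
    """Return a new list of lines with the edits applied in order.

    Edits whose target line is not present are skipped, so a stale
    proposal cannot corrupt the skill (an assumption of this sketch).
    """
    lines = list(skill_lines)  # copy so the input skill is left untouched
    for e in edits:
        if e.op == "add":
            lines.append(e.text)
        elif e.op == "delete":
            if e.target in lines:
                lines.remove(e.target)
        elif e.op == "replace":
            if e.target in lines:
                lines[lines.index(e.target)] = e.text
        else:
            raise ValueError(f"unknown op: {e.op}")
    return lines
```

Returning a new list rather than mutating in place keeps the current skill available for comparison against the candidate.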
Bounded Updates and the Validation Gate
The optimizer proposes edits that are first merged hierarchically (failure corrections are prioritized) and then ranked by expected utility. Only the top-ranked edits that fit within the budget are applied, and the budget decays over time (e.g., on a cosine schedule). This bounded textual update prevents the skill from being erased or over-edited by a single bad reflection.
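The bounded update can be illustrated with a small sketch. The function names `edit_budget` and `select_edits`, the `(kind, utility, text)` tuple layout, the `max_edits` and `min_edits` defaults, and the exact rule by which failure fixes outrank other edits are all assumptions made for illustration, not details taken from SkillOpt.

```python
import math
from typing import List, Tuple

# (kind, expected_utility, text); kind is "failure_fix" or anything else.
ProposedEdit = Tuple[str, float, str]


def edit_budget(step: int, total_steps: int,
                max_edits: int = 4, min_edits: int = 1) -> int:
    """Cosine-decayed number of edits allowed at a given step."""
    if total_steps <= 1:
        return max_edits
    progress = min(step, total_steps - 1) / (total_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return max(min_edits, round(min_edits + (max_edits - min_edits) * cosine))


def select_edits(edits: List[ProposedEdit], budget: int) -> List[ProposedEdit]:
    """Put failure fixes first, then sort by utility, and keep `budget` edits."""
    ranked = sorted(edits, key=lambda e: (e[0] != "failure_fix", -e[1]))
    return ranked[:budget]
```

Under these assumptions the budget starts at `max_edits` and shrinks to `min_edits` by the final step, so later steps make smaller changes to the skill.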
Every candidate skill is then evaluated on a separate selection split. It becomes the new skill only if its score strictly exceeds the current one; ties are rejected. This conservative gate is the central safety mechanism: plausible-sounding diagnoses that actually hurt the target model are caught before deployment. Rejected edits are not discarded; they enter a buffer that later optimizer calls see, providing negative feedback without any inference-time cost. The result is a propose-and-test cycle that steadily improves the skill while avoiding drift.
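A minimal sketch of the gate described above, assuming a hypothetical `score_fn` that returns the selection-split score for a skill string. The class name, its fields, and the use of a short text summary to represent a rejected edit are illustrative choices, not the system's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ValidationGate:
    score_fn: Callable[[str], float]   # hypothetical selection-split scorer
    current_skill: str
    current_score: float = field(init=False)
    rejected: List[str] = field(default_factory=list)  # negative-feedback buffer

    def __post_init__(self) -> None:
        # Score the starting skill once so candidates have a baseline.
        self.current_score = self.score_fn(self.current_skill)

    def consider(self, candidate: str, edit_summary: str) -> bool:
        """Accept the candidate only on strict improvement; ties are rejected."""
        score = self.score_fn(candidate)
        if score > self.current_score:
            self.current_skill, self.current_score = candidate, score
            return True
        # Keep a record of what was tried so later proposals can avoid it.
        self.rejected.append(edit_summary)
        return False
```

In this sketch the `rejected` list would be passed to the next optimizer call as context; it is never part of the deployed skill, which matches the claim that the buffer adds no inference-time cost.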
Epoch-Wise Slow/Meta Update and Harness-Agnostic Design
At the end of each epoch, SkillOpt runs the same training items under the previous and current skills, classifying them into improvements, regressions, persistent failures, and stable successes. The optimizer then writes a protected, longitudinal guidance block into the skill (its slow update), which step-level edits cannot overwrite. A separate optimizer-side meta skill summarizes which edit patterns helped, which failed, and which failures persisted, guiding future reflection calls. This separation keeps the deployed skill compact while allowing the trainer to learn from longer timelines.
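The four-way comparison at the end of an epoch might look like the following sketch. It assumes each training item has a binary pass/fail outcome under each skill, keyed by a hypothetical item ID; the function name and bucket keys are illustrative, and real scores may well be graded rather than binary.

```python
from typing import Dict, List


def classify_epoch(prev: Dict[str, bool],
                   curr: Dict[str, bool]) -> Dict[str, List[str]]:
    """Bucket items by how their outcome changed between two skills."""
    buckets: Dict[str, List[str]] = {
        "improvements": [],
        "regressions": [],
        "persistent_failures": [],
        "stable_successes": [],
    }
    # Only items run under both skills can be compared.
    for item_id in sorted(prev.keys() & curr.keys()):
        before, after = prev[item_id], curr[item_id]
        if not before and after:
            buckets["improvements"].append(item_id)
        elif before and not after:
            buckets["regressions"].append(item_id)
        elif not before and not after:
            buckets["persistent_failures"].append(item_id)
        else:
            buckets["stable_successes"].append(item_id)
    return buckets
```

The regressions and persistent failures are the buckets most likely to inform a longitudinal guidance block, since neither is visible from a single step's rollouts.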
The entire loop is harness-agnostic. A thin adapter injects the skill into direct-chat, code-execution, or embodied environments and returns scored trajectories. The same optimizer codebase therefore trains skills for search QA, spreadsheet manipulation, document reasoning, mathematical MCQs, and household decision-making, as well as inside Codex and Claude Code sandboxes.
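One way to picture the thin adapter is as a small interface that takes a skill plus a list of tasks and returns scored trajectories. The `HarnessAdapter` protocol, the `Trajectory` fields, and the toy `DirectChatAdapter` below are hypothetical names for illustration; the `model` and `grader` callables stand in for whatever the real harness provides.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Trajectory:
    task_id: str
    transcript: str
    score: float


class HarnessAdapter(Protocol):
    """Anything that can run tasks under a skill and score the results."""
    def run(self, skill: str, task_ids: List[str]) -> List[Trajectory]: ...


class DirectChatAdapter:
    """Toy adapter: prepends the skill to each prompt and grades the reply."""

    def __init__(self, model: Callable[[str], str],
                 grader: Callable[[str, str], float]) -> None:
        self.model = model      # hypothetical: prompt -> reply
        self.grader = grader    # hypothetical: (task_id, reply) -> score

    def run(self, skill: str, task_ids: List[str]) -> List[Trajectory]:
        results = []
        for task_id in task_ids:
            prompt = f"{skill}\n\nTask: {task_id}"
            reply = self.model(prompt)
            results.append(Trajectory(task_id, reply, self.grader(task_id, reply)))
        return results
```

Because the optimizer only depends on the `run` signature in this sketch, a code-execution or embodied adapter could be swapped in without changing the training loop.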
Experimental Dominance Across the Board
SkillOpt was evaluated on six benchmarks, seven target models (from GPT-5.5 to Qwen-3.5-4B), and three execution modes. Out of 52 measured (model, benchmark, harness) cells, it is best or tied-best on all 52. On GPT-5.5 in direct chat, it lifts the six-benchmark average from 58.8% (no skill) to 82.3% (+23.5 points), and beats an oracle that picks the best of seven competing baselines (human-written, one-shot LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) by +5.4 points. Gains are largest on procedural tasks: SpreadsheetBench jumps from 41.8 to 80.7, OfficeQA from 33.1 to 72.1. The same optimizer inside Codex and Claude Code harnesses yields average improvements of +24.8 and +19.1 points, outperforming the strongest harness-side rival, EvoSkill, by +14.0 and +3.2 points respectively.
Small target models also benefit disproportionately: GPT-5.4-nano nearly doubles on DocVQA and triples on ALFWorld, showing that a compact skill can supply procedural knowledge that small models lack.
Ablations: Evidence, Budgets, and the Role of Memory
Controlled ablations confirm that the optimizer's design choices matter.
- Training evidence: Procedural benchmarks improve steadily as more training data is exposed (SpreadsheetBench +30.5 points from 1% to 100% data), while factual QA saturates early.
- Bounded learning rate: Removing the edit budget (allowing unbounded rewrites) degrades performance. With a bounded budget, scores hold near the top across settings.
- Rejected-edit buffer: Removing it lowers SpreadsheetBench by 4.6 points, confirming it stabilizes learning.
- Epoch-wise slow/meta update: This is the most dramatic ablation. Removing both the meta skill and the slow update drops SpreadsheetBench from 77.5 to 55.0 (-22.5 points). This mechanism is critical for retaining long-horizon lessons.
A stronger frontier optimizer always yields larger gains than a target-matched one, but even a target-matched optimizer recovers 56 to 74% of the strong-optimizer gain, showing the loop itself contributes value beyond raw optimizer power.
Transfer, Compactness, and What Skills Learn
Skills trained on one model or harness transfer positively everywhere tested:
- Cross-model: A SpreadsheetBench skill from GPT-5.4 improves smaller GPT variants by +3.0 to +10.7 points.
- Cross-harness: A Codex-trained spreadsheet skill transferred to Claude Code gains +59.7 points over the no-skill Claude Code baseline.
- Cross-benchmark: An OlympiadBench skill yields positive gains on Omni-MATH across three model scales.
The learned artifacts are remarkably compact: only 300 to 2,000 tokens after 1 to 4 accepted edits. Cost per test-point gain varies (0.6 to 46.4 M training tokens), but the expense is paid once during offline training; deployment adds zero extra cost.

The rules themselves are procedural, not instance-specific. For example, the spreadsheet skill learns to "inspect workbook structure and formulas, then write evaluated static values ... instead of relying on Excel recalculation," while the ALFWorld skill adds a visited/frontier ledger and loop breaker. These are exactly the disciplined patterns a human expert would codify after observing failures, here arrived at automatically by the optimizer and validated on held-out data.
Conclusion and Outlook
SkillOpt demonstrates that a natural-language skill document can serve as a trainable, self-improving adaptation layer for frozen LLM agents. By importing deep-learning controls (batches, learning rates, validation gates, and negative feedback) into text-space editing, the system produces compact, interpretable artifacts that transfer across models, harnesses, and benchmarks, yielding a new state of the art for no-weight-update domain adaptation. Future directions include building skill libraries, reward-free validation for open-ended tasks, and self-distillation of optimized skills back into model weights. Treating the skill itself as the trainable object opens the door for the full optimization toolkit to be applied to agentic procedures.



