
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Language Tasks

A data-free framework for training language models without external supervision, improving performance on open-ended and short-form QA benchmarks.

June 9, 2026

#Academic #LLM #Reinforcement Learning #Training

Introducing SCOPE, a data-free self-play framework for open-ended tasks that co-evolves a Challenger for task generation and a Solver for answering. It uses a self-judge to create rubrics and grade responses, improving 7-8B instruction-tuned models by up to +10.4 points on open-ended and +13.8 points on held-out QA benchmarks.

A New Frontier for Self-Improving AI

AI systems have mastered games like Go and chess through self-play, learning superhuman strategies without human data. But applying the same principle to open-ended language tasks, like writing a research report or planning a project, has remained elusive. The reason is simple: a game has a clear outcome that can be checked automatically, while an essay has no single correct answer to verify.

A new framework called SCOPE aims to change this. It is presented as the first method to extend data-free self-play to open-ended tasks, where success is measured by quality rather than a binary right or wrong. Instead of relying on human-curated prompts or expensive frontier models to act as judges, SCOPE creates a self-contained loop in which a model learns by playing against itself. The results suggest that a model can improve on complex, creative tasks without external supervision.

How SCOPE’s Self-Play Ecosystem Works

SCOPE operates by splitting a single base model into three roles: a Challenger, a Solver, and a fixed Judge. The Challenger and Solver are the two evolving policies that drive learning, while the Judge remains frozen to provide a stable evaluation standard.

Overview of SCOPE

The process unfolds in a loop. First, the Challenger reads a source document from a corpus like Wikipedia and generates a complex, document-grounded task. The Judge then creates a task-specific rubric from the same source document. Crucially, the Solver never sees this document; it must answer the task by performing multi-turn retrieval to find the necessary information. The Judge grades the Solver’s response against the rubric, and this score becomes the reward signal. The Challenger is rewarded for creating tasks that are moderately difficult for the current Solver, while the Solver is rewarded for satisfying the rubric’s criteria. This creates a sustainable cycle of improvement, as the Challenger must constantly devise harder tasks to stay ahead of the improving Solver.
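The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the data flow only, not the paper's implementation: the names `Episode`, `run_iteration`, `propose_task`, `write_rubric`, `solve`, and `grade` are hypothetical, the four roles are passed in as plain callables, and `samples_per_task` is an assumed knob for drawing several Solver answers per task so that an average rubric score exists.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Episode:
    task: str
    rubric: List[str]
    scores: List[float]  # one Judge score in [0, 1] per sampled Solver answer

    @property
    def mean_score(self) -> float:
        # Average rubric score for this task across the sampled answers.
        return mean(self.scores)


def run_iteration(
    documents: List[str],
    propose_task: Callable[[str], str],              # Challenger: document -> task
    write_rubric: Callable[[str, str], List[str]],   # Judge: (document, task) -> rubric
    solve: Callable[[str], str],                     # Solver: task -> answer
    grade: Callable[[str, List[str], str], float],   # Judge: (task, rubric, answer) -> score
    samples_per_task: int = 4,                       # assumption, not from the article
) -> List[Episode]:
    episodes = []
    for doc in documents:
        task = propose_task(doc)            # the Challenger reads the document
        rubric = write_rubric(doc, task)    # the Judge reads the same document
        # The Solver receives only the task, never the source document.
        answers = [solve(task) for _ in range(samples_per_task)]
        scores = [grade(task, rubric, answer) for answer in answers]
        episodes.append(Episode(task, rubric, scores))
    return episodes
```

In this sketch each `Episode` carries what both policies need for their updates: the per-answer scores would serve as the Solver's reward, and `mean_score` would feed the Challenger's difficulty-based reward. The Solver's multi-turn retrieval is hidden inside `solve` and is not modeled here.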

The Necessity of Co-Evolution

A key finding is that the Challenger and Solver must co-evolve. The paper shows that if the Challenger is frozen after the first iteration, the Solver’s performance quickly plateaus. Without a co-evolving adversary, the tasks become too easy, and the learning signal disappears.

The framework uses its reward function to maintain this balance. The Challenger’s reward is highest when the Solver’s average rubric score $\bar{g}$ is near 0.5, the point of maximum feedback variance. This is formalized with a difficulty reward, $f_{\mathrm{diff}}$, which peaks when $\bar{g}$ equals a target $\tau$ and falls linearly to zero as $\bar{g}$ moves away from it:

$f_{\mathrm{diff}}(\bar{g};\,\tau)=\max\!\Bigl(0,\;1-\frac{|\bar{g}-\tau|}{\min(\tau,\,1{-}\tau)}\Bigr)$

This equation gives the Challenger an incentive to propose tasks right at the Solver’s capability frontier: with $\tau = 0.5$, a task on which the Solver averages 0 or 1 earns the Challenger nothing. The paper also introduces a cosine length penalty to prevent the Solver from "reward hacking" by simply writing longer answers to please the rubric judge. Ablations show that removing either the co-evolution or these guardrails causes training to collapse, indicating that both the adversarial dynamic and careful reward design are needed for sustained self-improvement.
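The difficulty reward $f_{\mathrm{diff}}$ defined above translates directly into code. The sketch below is a plain transcription of the formula; the function name `difficulty_reward` and the default `target=0.5` are assumptions for illustration (the article only says the reward peaks near 0.5).

```python
def difficulty_reward(mean_score: float, target: float = 0.5) -> float:
    """Triangular reward that peaks when mean_score equals target.

    mean_score: the Solver's average rubric score on a task, in [0, 1].
    target: the desired difficulty (tau in the formula), strictly inside (0, 1).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    # Distance from the target to the nearer end of [0, 1]; the reward
    # reaches zero once the score is this far from the target.
    half_width = min(target, 1.0 - target)
    return max(0.0, 1.0 - abs(mean_score - target) / half_width)


if __name__ == "__main__":
    for g in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(g, difficulty_reward(g))
```

Dividing by $\min(\tau, 1-\tau)$ keeps the peak symmetric: when the target is off-center, the reward still reaches zero at the nearer boundary of the score range and stays clipped at zero beyond that distance on the other side.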

Matching Curated Data Without Any

The results are striking. SCOPE was tested on three 7–8 billion parameter models: Qwen2.5, Qwen3, and OLMo-3. Across eight diverse open-ended benchmarks—from deep research and scholarly QA to creative writing—SCOPE delivered substantial gains. For instance, on the Qwen2.5-7B model, the average score jumped from 24.4 to 34.8, a gain of over 10 points.

Model	Base	SCOPE	GRPO (curated data)
Qwen2.5-7B	24.4	34.8	33.4
Qwen3-8B	37.7	43.1	41.5
OLMo-3-7B	30.7	38.5	39.0

Notably, SCOPE achieved this without a single curated prompt or external judge. The comparison baseline is a model trained with GRPO on ~9,000 human-curated prompts with frontier-model rubrics: SCOPE exceeded it on Qwen2.5-7B and Qwen3-8B and came within 0.5 points of it on OLMo-3-7B. The gains were most pronounced on research-intensive tasks, where the model’s ability to retrieve and synthesize information is critical. This suggests that self-generated data can be about as effective as human-curated data for complex, open-ended learning.
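The arithmetic behind these comparisons can be checked directly from the table. The short script below uses only the twelve numbers reported above; the function name `summarize` and the dictionary layout are illustrative choices.

```python
# Average open-ended benchmark scores from the table: (base, SCOPE, GRPO on curated data).
SCORES = {
    "Qwen2.5-7B": (24.4, 34.8, 33.4),
    "Qwen3-8B": (37.7, 43.1, 41.5),
    "OLMo-3-7B": (30.7, 38.5, 39.0),
}


def summarize(scores=SCORES):
    """Return {model: (gain over base, margin over the GRPO baseline)}."""
    summary = {}
    for model, (base, scope, grpo) in scores.items():
        gain = round(scope - base, 1)     # improvement from SCOPE training
        margin = round(scope - grpo, 1)   # positive means SCOPE beats the baseline
        summary[model] = (gain, margin)
    return summary


if __name__ == "__main__":
    for model, (gain, margin) in summarize().items():
        print(f"{model}: +{gain} over base, {margin:+} vs GRPO")
```

The gains over the base model are 10.4, 5.4, and 7.8 points, and the margins against the curated-data baseline are +1.4, +1.6, and -0.5 points.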

Generalizing Beyond the Training Ground

Perhaps the most surprising result is how well SCOPE’s training transferred to entirely different tasks. Although trained exclusively on open-ended, document-grounded tasks, the models showed significant improvement on held-out short-form question-answering benchmarks. On Qwen2.5-7B, the average score on seven QA benchmarks rose by 13.8 points, surpassing the model trained on curated data.

This suggests that the skills learned through SCOPE, namely strategic retrieval and information synthesis, are broadly applicable. A controlled experiment separated these two capabilities. By swapping components between an early and a late-stage Solver, the study showed that SCOPE improves both retrieval and synthesis, with the dominant source of gain depending on the task. For multi-hop questions requiring chained queries, retrieval improved more. For single-hop questions, synthesis was the bigger factor. This helps explain why SCOPE’s benefits transfer so well: it trains general retrieval and synthesis skills rather than a narrow task-solver.
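One way to read such a component-swap experiment is as a 2x2 design: score every pairing of an early or late retriever with an early or late synthesizer, then split the total gain between the two components. The sketch below is an assumption about how such an attribution could be computed, not the paper's procedure; the function name `attribute_gain` and the averaging scheme are hypothetical, and no scores from the paper are used.

```python
from typing import Dict, Tuple

Stage = str  # "early" or "late"


def attribute_gain(score: Dict[Tuple[Stage, Stage], float]) -> Dict[str, float]:
    """Split the early-to-late gain into retrieval and synthesis parts.

    score[(retriever_stage, synthesizer_stage)] is the benchmark score for
    that pairing. Each component's effect is averaged over both settings of
    the other component, so the two parts always sum to the total gain.
    """
    ee = score[("early", "early")]
    le = score[("late", "early")]   # late retriever, early synthesizer
    el = score[("early", "late")]   # early retriever, late synthesizer
    ll = score[("late", "late")]
    retrieval = ((le - ee) + (ll - el)) / 2
    synthesis = ((el - ee) + (ll - le)) / 2
    return {"total": ll - ee, "retrieval": retrieval, "synthesis": synthesis}
```

Under this reading, a multi-hop benchmark would show a larger `retrieval` share and a single-hop benchmark a larger `synthesis` share.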

The Bottleneck is Rubric Quality

The self-judging mechanism is the linchpin of the entire framework. The paper’s analysis reveals that the quality of the rubric, not the grading itself, is the bottleneck. When the rubric generator was scaled down to a 4B parameter model, performance dropped sharply because the rubrics became generic, missing the specific, document-grounded details needed for a meaningful evaluation. In contrast, scaling the grader model had almost no effect.

Rubric quality matters more than grading

This finding suggests that, for self-improving AI, the ability to define success criteria matters more than the ability to grade a final answer against them. SCOPE’s success lies in its ability to automatically generate specific, task-relevant rubrics from source documents, creating a closed loop where a model can teach itself what a good answer looks like and then learn to produce it. This work marks a step toward AI systems that can expand their capabilities without relying on human supervision.
