The Sampling Bottleneck in LLM Self-Improvement
Large language models (LLMs) and agentic systems can solve complex reasoning tasks, but their self-improvement hinges on generating high-quality samples. At training time, better samples enable more effective post-training; at inference, they drive test-time scaling. The dominant sampling methods—best-of-N and tree search—share two fundamental limitations.
First, they rely on sparse verification signals, typically binary or coarse-grained feedback, which provides little guidance during search. Second, they construct candidates by autoregressively extending trajectories, confining exploration to regions with substantial model probability mass. On hard problems, correct solutions often lie in low-probability regions that these methods rarely reach. This paper introduces a framework that addresses both issues simultaneously.
Bidirectional Evolutionary Search: A New Framework
Bidirectional Evolutionary Search (BES) couples a forward evolutionary search with a backward goal-decomposition process. The forward search augments standard autoregressive expansion with evolution operators that recombine partial trajectories, generating candidates beyond the model’s typical distribution. The backward search recursively decomposes the original task into checkable sub-goals, producing dense intermediate feedback that guides the forward search.

This bidirectional design enables BES to discover solutions that neither pure expansion nor sparse-reward search can reach, making it effective for both post-training sample generation and inference-time problem solving.
Forward Search: Evolution Operators Beyond Autoregressive Expansion
The forward search maintains a population of partial trajectories (nodes). At each step, it applies one of five operators: expansion (sampling new steps from the policy) or one of four evolution operators inspired by biological recombination.

- Combination merges suffixes of two trajectories beyond a shared prefix.
- Deletion removes an interior step to produce a shorter candidate.
- Translocation transplants a single step from one trajectory into another.
- Crossover splices the prefix of one trajectory onto the tail of another.
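The four operators above can be sketched on trajectories represented as plain lists of steps. This is a minimal illustration, not the paper's implementation: the list representation, the function names, the random choice of cut points, and the order in which Combination concatenates the two suffixes are all assumptions made for the example.

```python
import random
from typing import List

Trajectory = List[str]  # assumed representation: an ordered list of reasoning steps


def shared_prefix_len(a: Trajectory, b: Trajectory) -> int:
    """Length of the longest common prefix of two trajectories."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def combination(a: Trajectory, b: Trajectory) -> Trajectory:
    """Keep the shared prefix, then append both diverging suffixes (a's first)."""
    k = shared_prefix_len(a, b)
    return a[:k] + a[k:] + b[k:]


def deletion(a: Trajectory, rng: random.Random) -> Trajectory:
    """Drop one interior step; the first and last steps are always kept."""
    if len(a) < 3:
        return list(a)  # no interior step to remove
    i = rng.randrange(1, len(a) - 1)
    return a[:i] + a[i + 1:]


def translocation(donor: Trajectory, recipient: Trajectory,
                  rng: random.Random) -> Trajectory:
    """Copy a single step from the donor into a random position of the recipient."""
    if not donor:
        return list(recipient)
    step = rng.choice(donor)
    pos = rng.randrange(len(recipient) + 1)
    return recipient[:pos] + [step] + recipient[pos:]


def crossover(a: Trajectory, b: Trajectory, rng: random.Random) -> Trajectory:
    """Splice a prefix of a onto a tail of b, with randomly chosen cut points."""
    i = rng.randrange(len(a) + 1)
    j = rng.randrange(len(b) + 1)
    return a[:i] + b[j:]
```

In a real system each step would be a model-generated reasoning step or action, and the recombined trajectory would typically be re-scored before it joins the population.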
Parents are selected via a Boltzmann distribution over backward scores (and over pair scores for two-parent operators), with the temperature annealed during the search so that selection shifts from exploration to exploitation. These operators let the search restructure and recombine existing trajectories, generating candidates that a single policy rollout is unlikely to produce.
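Boltzmann parent selection with an annealed temperature can be sketched as follows. The geometric schedule, its default endpoints, and the function names are assumptions for illustration; the source does not specify the schedule.

```python
import math
import random
from typing import List, Sequence


def boltzmann_probs(scores: Sequence[float], temperature: float) -> List[float]:
    """Softmax of scores / temperature, shifted by the max for numerical stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    m = max(scores)
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 1.0, t_end: float = 0.05) -> float:
    """Assumed schedule: geometric interpolation from t_start down to t_end."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac


def select_parent(scores: Sequence[float], temperature: float,
                  rng: random.Random) -> int:
    """Sample the index of one parent in proportion to its Boltzmann weight."""
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

At a high temperature the distribution is close to uniform, so weak candidates are still sampled; as the temperature falls, selection concentrates on the highest-scoring nodes.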
Backward Search: Dense Feedback via Goal Decomposition
The backward search builds a rooted goal tree by recursively prompting the policy to decompose the top-level task into finer sub-goals. Each sub-goal comes with a local verifier that tests how well a candidate node addresses it.
The score of a node is computed recursively over the goal tree: each goal's score mixes its own verifier output with the scores of its child sub-goals, and a weighting coefficient balances parent and child contributions. For leaf sub-goals, which have no children, the score comes from the local verifier alone. If a goal is fully satisfied, the score short-circuits to 1. For two-parent operators, a pair score uses the maximum of the two parents’ verifier outputs, favoring complementary candidates that cover different parts of the goal tree.
This dense, interpretable signal guides parent selection even when no candidate has fully solved the problem, which makes the search more efficient than one driven by terminal rewards alone.
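One plausible form of this recursion is sketched below. It is an assumption, not the paper's formula: the score is taken to be a convex combination, with a hypothetical weight `lam`, of a goal's own verifier output and the mean score of its children. The `Goal` class, the string candidates, and the choice to average the pair score over leaf sub-goals are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Verifier = Callable[[str], float]  # maps a candidate to a value in [0, 1]


@dataclass
class Goal:
    name: str
    verifier: Verifier
    children: List["Goal"] = field(default_factory=list)


def backward_score(goal: Goal, candidate: str, lam: float = 0.5) -> float:
    """Assumed recursion: lam * own verifier + (1 - lam) * mean child score."""
    own = goal.verifier(candidate)
    if own >= 1.0:         # goal fully satisfied: short-circuit to 1
        return 1.0
    if not goal.children:  # leaf sub-goal: verifier output only
        return own
    child_mean = sum(backward_score(c, candidate, lam)
                     for c in goal.children) / len(goal.children)
    return lam * own + (1.0 - lam) * child_mean


def leaf_goals(goal: Goal) -> List[Goal]:
    """All leaf sub-goals under (and including) the given goal."""
    if not goal.children:
        return [goal]
    return [leaf for c in goal.children for leaf in leaf_goals(c)]


def pair_score(goal: Goal, cand_a: str, cand_b: str) -> float:
    """Assumed pair score: mean over leaves of the max of the two verifier outputs."""
    leaves = leaf_goals(goal)
    return sum(max(g.verifier(cand_a), g.verifier(cand_b)) for g in leaves) / len(leaves)


# Toy goal tree: the root needs an "answer", supported by two lemmas.
root = Goal("solve", lambda c: float("answer" in c), [
    Goal("lemma1", lambda c: float("lemma1" in c)),
    Goal("lemma2", lambda c: float("lemma2" in c)),
])
```

In this toy tree a candidate that establishes only one lemma still receives partial credit, and two candidates that each cover a different lemma obtain a higher pair score than two candidates covering the same one.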
Theoretical Guarantees: Escaping the Entropy Shell
The paper provides two theoretical motivations. First, under mild assumptions (bounded per-step surprise, decaying step dependence, and linear block total correlation), Theorem 4.4 proves that expansion-only search is confined to a narrow entropy shell of bounded size. In contrast, evolution operators that recombine blocks from independent trajectories produce candidates with expected log-probability strictly beyond this shell, with a positive fraction escaping it.
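The concentration behind the entropy shell can be illustrated with a toy experiment. The sketch below assumes independent, identically distributed steps, which is much stronger than the theorem's assumptions, and it only shows the expansion-only half of the claim: sampled sequences have a per-step surprise (negative log-probability) that clusters around the entropy of the step distribution. The distribution, the shell width, and the function names are all hypothetical.

```python
import math
import random
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def per_step_surprise(p: Sequence[float], length: int, rng: random.Random) -> float:
    """Average negative log-probability of one sequence sampled i.i.d. from p."""
    tokens = rng.choices(range(len(p)), weights=p, k=length)
    return -sum(math.log(p[t]) for t in tokens) / length


def shell_fraction(p: Sequence[float], length: int, width: float,
                   n_samples: int, seed: int = 0) -> float:
    """Fraction of sampled sequences whose per-step surprise is within width of the entropy."""
    rng = random.Random(seed)
    h = entropy(p)
    hits = sum(abs(per_step_surprise(p, length, rng) - h) <= width
               for _ in range(n_samples))
    return hits / n_samples
```

Calling `shell_fraction([0.7, 0.2, 0.1], length, 0.1, 500)` for increasing `length` shows the fraction inside the shell approaching 1, which is why pure sampling rarely produces sequences far from the model's typical log-probability.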
Second, Theorem 4.5 shows that backward sub-goal decomposition yields an exponential reduction in sample complexity. Terminal-only search must sample a candidate that is already a complete solution, while backward-guided search needs only to collect evidence for each sub-goal. In the symmetric case, the ratio between the two sample complexities is exponential in the number of sub-goals.
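A back-of-the-envelope model shows where an exponential gap of this kind can come from. It is a simplified illustration, not the bound of Theorem 4.5: it assumes `k` independent sub-goals, each satisfied by a single sample with the same probability `p`, and it assumes that backward-guided search can keep the evidence for each sub-goal once it is found.

```python
def terminal_only_samples(p: float, k: int) -> float:
    """Expected samples until one candidate satisfies all k sub-goals at once."""
    return 1.0 / (p ** k)


def backward_guided_samples(p: float, k: int) -> float:
    """Expected samples when sub-goals are solved one at a time and each success is kept."""
    return k / p


def speedup(p: float, k: int) -> float:
    """Ratio of the two expected sample counts under the toy model."""
    return terminal_only_samples(p, k) / backward_guided_samples(p, k)


for k in (2, 5, 10, 20):
    print(k, terminal_only_samples(0.5, k), backward_guided_samples(0.5, k))
```

Under these assumptions the terminal-only cost grows as `(1/p)**k` while the guided cost grows only linearly in `k`, so the ratio grows exponentially with the number of sub-goals.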
Post-Training and Inference: Experimental Gains
BES was evaluated on both post-training and inference tasks.
Post-training. On logical reasoning (Knights-and-Knaves), GRPO and MaxRL showed little improvement, while BES steadily increased validation accuracy (Figure 3). On multi-hop reasoning (MuSiQue), BES substantially outperformed GRPO and Tree-GRPO across two model scales (Table 1).

| Method | Accuracy (%) | # Valid Search | # Valid Actions | Finish Ratio |
|---|---|---|---|---|
| Llama-3.2-3B-Instruct | ||||
| Base model | 4.0 | – | – | – |
| + GRPO | 2.1 (-1.9) | 0.84 | 0.20 | 0.64 |
| + Tree-GRPO | 3.9 (-0.1) | 1.50 | 2.14 | 0.64 |
| + BES | 7.0 (+3.0) | 2.31 | 3.29 | 0.97 |
| Llama-3.1-8B-Instruct | ||||
| Base model | 6.6 | – | – | – |
| + GRPO | 5.6 (-1.0) | 1.46 | 1.83 | 0.37 |
| + Tree-GRPO | 7.4 (+0.8) | 0.65 | 1.36 | 0.71 |
| + BES | 10.4 (+3.8) | 2.11 | 3.05 | 0.94 |
Inference. On three open problem-solving benchmarks (Circle Packing in a square, Circle Packing in a rectangle, and Heilbronn Convex), BES outperformed the open-source frameworks OpenEvolve, GEPA, and ShinkaEvolve in both average and best-case performance, with lower variance (Table 2).

| Strategy | Circle Packing (Sq.), Avg. | Circle Packing (Sq.), Best | Circle Packing (Rect.), Avg. |
|---|---|---|---|
| OpenEvolve | 2.531 | 2.541 | 2.267 |
| GEPA | 2.613 | 2.628 | 2.326 |
| ShinkaEvolve | 2.464 | 2.541 | 2.335 |
| BES | 2.623 | 2.632 | 2.349 |
Ablation, Cost, and Conclusion
An ablation study on logical reasoning confirmed that both the evolution operators and the backward search contribute to BES’s gains (Figure 4). Removing either component reduced performance, though both ablations still outperformed GRPO and MaxRL.

Cost analysis showed that BES adds less than 30% wall-clock overhead over Tree-GRPO while delivering substantially better accuracy and search behavior. On open problems, BES incurred modest additional API cost (19 per run) compared to ShinkaEvolve (13) but achieved consistently higher objective values.
In summary, BES addresses the twin challenges of sparse verification and confined exploration through a bidirectional evolutionary search. By coupling forward recombination with backward goal decomposition, it discovers high-quality solutions that elude existing methods, enabling consistent improvements in both post-training and inference across diverse reasoning domains.



