
How Bidirectional Evolutionary Search Improves LLM Self-Improvement

Discover BES, a novel framework coupling forward evolutionary search with backward goal decomposition to overcome sampling bottlenecks in LLM reasoning.

May 28, 2026

#Agents #Fine Tuning #LLM #Reinforcement Learning #Training

This article explains Bidirectional Evolutionary Search (BES), a new framework that enhances LLM self-improvement by combining evolutionary operators for broader exploration with dense, intermediate feedback from goal decomposition. Learn how BES tackles the limitations of traditional sampling methods like best-of-N and tree search.

The Sampling Bottleneck in LLM Self-Improvement

Large language models (LLMs) and agentic systems can solve complex reasoning tasks, but their self-improvement hinges on generating high-quality samples. At training time, better samples enable more effective post-training; at inference, they drive test-time scaling. The two dominant sampling methods, best-of-N and tree search, share two fundamental limitations.

First, they rely on sparse verification signals, typically binary or coarse-grained feedback, which provide little guidance during search. Second, they construct candidates by autoregressively extending trajectories, confining exploration to regions with substantial model probability mass. On hard problems, correct solutions often lie in low-probability regions that these methods rarely reach. The BES paper introduces a framework that addresses both issues simultaneously.
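To make the second limitation concrete, here is a minimal sketch of how best-of-N behaves when correct solutions carry little probability mass. It assumes independent samples and a binary verifier, each sample being correct with probability `p`; the function names and the numbers in the usage note are illustrative and do not come from the paper.

```python
import math


def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, target: float) -> int:
    """Smallest n for which best-of-n succeeds with probability >= target."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
```

Under these assumptions, a per-sample success rate of `1e-4` leaves `best_of_n_success(1e-4, 64)` below 1%, and `samples_needed(1e-4, 0.5)` runs into the thousands. Drawing more samples from the same distribution helps only slowly.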

Bidirectional Evolutionary Search: A New Framework

Bidirectional Evolutionary Search (BES) couples a forward evolutionary search with a backward goal-decomposition process. The forward search augments standard autoregressive expansion with evolution operators that recombine partial trajectories, generating candidates beyond the model’s typical distribution. The backward search recursively decomposes the original task into checkable sub-goals, producing dense intermediate feedback that guides the forward search.

Figure 1: Comparison of tree search and Bidirectional Evolutionary Search (BES).
Left: Tree search constructs candidates by sequentially expanding steps, confined to a narrow entropy shell.
Right: BES escapes this shell through evolution operators that recombine parts of different trajectories, with backward search providing dense sub-goal feedback.

This bidirectional design enables BES to discover solutions that neither pure expansion nor sparse-reward search can reach, making it effective for both post-training sample generation and inference-time problem solving.

Forward Search: Evolution Operators Beyond Autoregressive Expansion

The forward search maintains a population of partial trajectories (nodes). At each step, it applies one of five operators: expansion (sampling new steps from the policy) or one of four evolution operators inspired by biological recombination.

Figure 2: Forward search operators. (a) Expansion: the policy generates new steps. (b) Combination: two trajectories sharing a common prefix have their distinct suffixes concatenated. (c) Deletion: an interior step is removed. (d) Translocation: one step is replaced by a step from another trajectory. (e) Crossover: a prefix of one trajectory is spliced onto the tail of another.

Combination merges suffixes of two trajectories beyond a shared prefix.
Deletion removes an interior step to produce a shorter candidate.
Translocation transplants a single step from one trajectory into another.
Crossover splices the prefix of one trajectory onto the tail of another.
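The four evolution operators can be sketched on a toy representation in which a trajectory is a list of step strings. This representation, the function names, and the explicit index arguments are assumptions made for illustration. The paper's implementation may choose cut points and steps differently.

```python
from typing import List, Sequence

Trajectory = List[str]  # hypothetical representation: one string per reasoning step


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading steps that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def combination(a: Sequence[str], b: Sequence[str]) -> Trajectory:
    """Keep the shared prefix once, then concatenate the two distinct suffixes."""
    k = shared_prefix_len(a, b)
    return list(a[:k]) + list(a[k:]) + list(b[k:])


def deletion(a: Sequence[str], i: int) -> Trajectory:
    """Remove interior step i (neither the first nor the last step)."""
    if not 0 < i < len(a) - 1:
        raise ValueError("i must index an interior step")
    return list(a[:i]) + list(a[i + 1:])


def translocation(a: Sequence[str], i: int, b: Sequence[str], j: int) -> Trajectory:
    """Replace step i of a with step j of b."""
    out = list(a)
    out[i] = b[j]
    return out


def crossover(a: Sequence[str], cut_a: int, b: Sequence[str], cut_b: int) -> Trajectory:
    """Splice the prefix a[:cut_a] onto the tail b[cut_b:]."""
    return list(a[:cut_a]) + list(b[cut_b:])
```

For example, with `a = ["s0", "a1", "a2"]` and `b = ["s0", "b1"]`, `combination(a, b)` keeps `"s0"` once and appends both suffixes, while `crossover(a, 2, b, 1)` joins the first two steps of `a` to the last step of `b`.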

Parents are selected via a Boltzmann distribution over backward scores (and pair scores for two-parent operators), with temperature annealing from exploration to exploitation. These operators allow the search to restructure and recombine existing trajectories, generating candidates that no single policy rollout could produce.
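A Boltzmann selection step with an annealed temperature might look like the sketch below. The geometric schedule, the default temperatures, and all function names are assumptions chosen for illustration. The article states only that the temperature anneals from exploration to exploitation.

```python
import math
import random
from typing import List, Sequence


def boltzmann_probs(scores: Sequence[float], temperature: float) -> List[float]:
    """Softmax of scores / temperature, shifted by the max score for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 1.0, t_end: float = 0.05) -> float:
    """Geometric interpolation from t_start to t_end (an assumed schedule)."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac


def select_parent(scores: Sequence[float], temperature: float,
                  rng: random.Random) -> int:
    """Sample the index of one parent from the Boltzmann distribution."""
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

At a high temperature the distribution is close to uniform, so weak candidates are still sampled. As the temperature falls, probability mass concentrates on the highest-scoring nodes.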

Backward Search: Dense Feedback via Goal Decomposition

The backward search builds a rooted goal tree by recursively prompting the policy to decompose the top-level task into finer sub-goals. Each sub-goal $g$ comes with a local verifier $V_g(x,n) \in [0,1]$ that tests how well a candidate node $n$ addresses it.

The score of a node is computed recursively:

$s(n,g) = \alpha \cdot V_g(x,n) + (1-\alpha) \cdot \frac{1}{|\text{ch}(g)|} \sum_{g' \in \text{ch}(g)} s(n,g')$

where $\alpha$ weights a goal's own verifier output against the mean score of its child sub-goals $\text{ch}(g)$. For leaf sub-goals, $s(n,g) = V_g(x,n)$. If a goal is fully satisfied, the score short-circuits to 1. For two-parent operators, a pair score $s(n_a, n_b)$ uses the maximum of the two parents’ verifier outputs, favoring complementary candidates that cover different parts of the goal tree.
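The recursion can be written out as a short sketch. The `Goal` class, the default `alpha = 0.5`, and the reading of "fully satisfied" as a verifier output of 1 are assumptions. The pair score here applies the same recursion to the per-goal maximum of the two parents' verifier outputs, which is one plausible reading of the description rather than the paper's confirmed definition.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Goal:
    # verifier(x, n) plays the role of V_g(x, n) and returns a value in [0, 1]
    verifier: Callable[[str, str], float]
    children: List["Goal"] = field(default_factory=list)


def _aggregate(value: Callable[[Goal], float], g: Goal, alpha: float) -> float:
    """Blend a goal's own value with the mean aggregated value of its children."""
    v = value(g)
    if v >= 1.0:
        return 1.0  # assumed short-circuit: the goal is fully satisfied
    if not g.children:
        return v  # leaf sub-goal: s(n, g) = V_g(x, n)
    child_mean = sum(_aggregate(value, c, alpha) for c in g.children) / len(g.children)
    return alpha * v + (1.0 - alpha) * child_mean


def score(x: str, n: str, g: Goal, alpha: float = 0.5) -> float:
    """Backward score s(n, g) of a single candidate node n."""
    return _aggregate(lambda h: h.verifier(x, n), g, alpha)


def pair_score(x: str, n_a: str, n_b: str, g: Goal, alpha: float = 0.5) -> float:
    """Assumed pair score: the same recursion over the per-goal maximum."""
    return _aggregate(lambda h: max(h.verifier(x, n_a), h.verifier(x, n_b)), g, alpha)
```

With two leaf sub-goals that are each satisfied by a different parent, each parent alone earns partial credit, while the pair scores higher because the two together cover both leaves. That is the sense in which the pair score favors complementary candidates.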

This dense, interpretable signal guides parent selection even when no candidate has fully solved the problem, so the search can rank candidates by partial progress instead of waiting for a fully verified solution.

Theoretical Guarantees: Escaping the Entropy Shell

The paper provides two theoretical motivations. First, under mild assumptions (bounded per-step surprise, decaying step dependence, and linear block total correlation), Theorem 4.4 proves that expansion-only search is confined to a narrow entropy shell $A_\epsilon(T) = \{y : |-\log P(y) - H_T| \le \epsilon T\}$, whose size is at most $\exp(H_T + \epsilon T)$. In contrast, evolution operators that recombine blocks from independent trajectories produce candidates with expected log-probability strictly beyond this shell, with a positive fraction escaping it.
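The size bound on the shell follows from a standard typical-set counting argument. The short derivation below uses only the definition of $A_\epsilon(T)$ and the fact that probabilities sum to at most 1. It is a sketch of why the bound holds, not a reproduction of the paper's proof. Every $y \in A_\epsilon(T)$ satisfies $-\log P(y) \le H_T + \epsilon T$, so

$P(y) \ge \exp(-(H_T + \epsilon T)) \quad \text{for all } y \in A_\epsilon(T)$

Summing over the shell gives

$1 \ge \sum_{y \in A_\epsilon(T)} P(y) \ge |A_\epsilon(T)| \cdot \exp(-(H_T + \epsilon T))$

and rearranging yields

$|A_\epsilon(T)| \le \exp(H_T + \epsilon T)$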

Second, Theorem 4.5 shows that backward sub-goal decomposition yields an exponential reduction in sample complexity. Terminal-only search requires $\Omega(1/\prod_i p_i)$ candidates to find a complete solution, while backward-guided search needs only $O(p_{\min}^{-1} \log(m/\delta))$ to collect evidence for all $m$ sub-goals. In the symmetric case $p_i = p$, the ratio is $\Omega(p^{-(m-1)} / \log(m/\delta))$, exponential in the number of sub-goals.
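To get a feel for the gap, the two expressions can be evaluated numerically. This sketch drops the constants hidden in the $\Omega$ and $O$ notation and uses made-up values for $p_i$, $m$, and $\delta$, so the outputs indicate orders of magnitude only. The function names are hypothetical.

```python
import math
from typing import Sequence


def terminal_only_samples(ps: Sequence[float]) -> float:
    """Leading term 1 / prod(p_i) of the terminal-only lower bound."""
    return 1.0 / math.prod(ps)


def backward_guided_samples(ps: Sequence[float], delta: float) -> float:
    """Leading term log(m / delta) / p_min of the backward-guided upper bound."""
    m = len(ps)
    return math.log(m / delta) / min(ps)


def complexity_ratio(ps: Sequence[float], delta: float) -> float:
    """Ratio of the two leading terms (constants ignored)."""
    return terminal_only_samples(ps) / backward_guided_samples(ps, delta)
```

With five sub-goals at `p = 0.1` each and `delta = 0.05`, the terminal-only term is about 100,000 candidates and the backward-guided term is about 46, a gap of roughly 2,000-fold. In the symmetric case, each additional sub-goal multiplies the terminal-only term by `1/p` but affects the backward-guided term only through the logarithm.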

Post-Training and Inference: Experimental Gains

BES was evaluated on both post-training and inference tasks.

Post-training. On logical reasoning (Knights-and-Knaves), GRPO and MaxRL showed little improvement, while BES steadily increased validation accuracy (Figure 3). On multi-hop reasoning (MuSiQue), BES substantially outperformed GRPO and Tree-GRPO across two model scales (Table 1).

Figure 3: EMA-smoothed validation accuracy on logical reasoning.
BES consistently improves while baselines stagnate.

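Table 1: Post-training results on multi-hop reasoning (MuSiQue). Values in parentheses are changes in accuracy relative to the base model.
Method	Accuracy (%)	# Valid Search	# Valid Actions	Finish Ratio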
Llama-3.2-3B-Instruct
Base model	4.0	–	–	–
+ GRPO	2.1 (-1.9)	0.84	0.20	0.64
+ Tree-GRPO	3.9 (-0.1)	1.50	2.14	0.64
+ BES	7.0 (+3.0)	2.31	3.29	0.97
Llama-3.1-8B-Instruct
Base model	6.6	–	–	–
+ GRPO	5.6 (-1.0)	1.46	1.83	0.37
+ Tree-GRPO	7.4 (+0.8)	0.65	1.36	0.71
+ BES	10.4 (+3.8)	2.11	3.05	0.94

Inference. On three open problem-solving benchmarks, Circle Packing (Sq.), Circle Packing (Rect.), and Heilbronn (Convex), BES outperformed all open-source frameworks in both average and best-case performance, with lower variance (Table 2).

Strategy	Circle Packing (Sq.)	Circle Packing (Rect.)	Heilbronn (Convex)
	Avg.	Best	Avg.
OpenEvolve	2.531	2.541	2.267
GEPA	2.613	2.628	2.326
ShinkaEvolve	2.464	2.541	2.335
BES	2.623	2.632	2.349

Ablation, Cost, and Conclusion

An ablation study on logical reasoning confirmed that both the evolution operators and the backward search contribute to BES’s gains (Figure 4). Removing either component reduced performance, though both ablations still outperformed GRPO and MaxRL.

Figure 4: Ablation study on logical reasoning.
Removing evolution operators or answer reweighting degrades performance.

Cost analysis showed that BES adds less than 30% wall-clock overhead over Tree-GRPO while delivering substantially better accuracy and search behavior. On open problems, BES incurred modest additional API cost (14–19 USD per run) compared to ShinkaEvolve (12–13 USD per run) but achieved consistently higher objective values.

In summary, BES addresses the twin challenges of sparse verification and confined exploration through a bidirectional evolutionary search. By coupling forward recombination with backward goal decomposition, it discovers high-quality solutions that elude existing methods, enabling consistent improvements in both post-training and inference across diverse reasoning domains.
