The Recursion Ceiling is a Myth: NovaSky Unleashes Recursive Language Models

Models are no longer bounded by single-call context windows; SkyRL's infrastructure enables execution-driven meta-reasoning via stateful child agents.

May 26, 2026

#Agents #Python #Reinforcement Learning #Sandboxing #Training

Discover how NovaSky's SkyRL framework shatters the limitations of large language models. By spawning recursive child agents within persistent Python sandboxes, models can now reason in multi-turn, multi-agent trees, redefining what "thinking" means for AI.

The recursion ceiling is a myth — and this code breaks it

You think large language models are bounded by their single-call context windows. You're wrong. The NovaSky team just shipped an environment where models spawn recursive child agents that reason inside stateful Python sandboxes, call sub-models mid-rollout, and terminate only when a structured FINAL answer is produced. This isn't a prompt engineering trick — it's infrastructure for execution-driven meta-reasoning. SkyRL's Recursive Language Models (RLM) implementation turns a flat RL gym into a multi-turn, multi-agent reasoning tree. With 1.9k stars on GitHub, NovaSky-AI/SkyRL is quietly redefining what "training a model to think" actually means.

Inside the stateful Python sandbox that teaches models to think in steps

The core is PersistentREPL, a Python interpreter that survives across turns, blocks dangerous builtins such as eval, and injects scaffold identifiers (FINAL_VAR, SHOW_VARS, context) directly into the namespace. Models don't just chat; they execute code, inspect variables, and query LM helpers (llm_query, rlm_query) that can themselves spawn child RLM agents with independent REPLs. Timeouts use SIGALRM with a thread-based fallback for Ray workers. After every execute(), the reserved identifiers are restored, so any attempt to shadow them is silently undone before the next turn. The environment forces the model to structure its reasoning as a sequence of thought-action-observation cycles, with each turn's code parsed from the final ```repl block of the response.
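To make the persistence and restoration behavior concrete, here is a minimal sketch of a REPL of this kind. It is not SkyRL's PersistentREPL: the class name SketchREPL, the BLOCKED_BUILTINS set, and the method bodies are assumptions for illustration, and the timeout handling and LM helpers are left out entirely.

```python
import builtins
import contextlib
import io

# Assumption: an illustrative block list; the real set of blocked builtins may differ.
BLOCKED_BUILTINS = {"eval", "exec", "compile", "input"}


class SketchREPL:
    """Toy persistent interpreter: one namespace shared by every execute() call."""

    def __init__(self, context):
        # Expose every builtin except the blocked ones to the model's code.
        safe = {k: v for k, v in vars(builtins).items() if k not in BLOCKED_BUILTINS}
        self.final_answer = None
        # Scaffold identifiers that must survive any reassignment by the model.
        self._reserved = {
            "context": context,
            "FINAL_VAR": self._final_var,
            "SHOW_VARS": self._show_vars,
        }
        self.namespace = {"__builtins__": safe, **self._reserved}

    def _final_var(self, name):
        # Record the value of a named variable as the rollout's final answer.
        self.final_answer = self.namespace[name]

    def _show_vars(self):
        # Print user-defined names only, hiding scaffold and dunder entries.
        user = sorted(
            k for k in self.namespace
            if k not in self._reserved and not k.startswith("__")
        )
        print(user)

    def execute(self, code):
        """Run one turn of code and return captured stdout (or the error text)."""
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)
        except Exception as exc:
            buf.write(f"{type(exc).__name__}: {exc}\n")
        finally:
            # Undo any shadowing of reserved identifiers before the next turn.
            self.namespace.update(self._reserved)
        return buf.getvalue()
```

In this sketch a variable assigned in one execute() call is visible in the next, a call to eval raises NameError inside the sandbox, and reassigning context only lasts until the end of that turn.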

Training rollout trees: where each leaf is a child agent with its own REPL

RLMGymGenerator manages not one rollout but a tree. A parent rollout can invoke subcall_fn, which launches a full agent_loop for a child, returning its final_answer after ast.literal_eval processing. Child rollouts carry a _rlm_parent_rid sentinel for linkage, and an optional train_child_trajectories flag inlines their step-outputs back into the parent trajectory once root finalization occurs. If a frozen OpenRouter model is configured, child calls route through OpenRouterInferenceEngine; otherwise they default to the policy engine. Every assistant token in both parent and child sequences receives a loss_mask — the generator demands step_wise_trajectories=True. Metrics flow to Weights & Biases, already logged for RLM-2b-4b-E2E-Runs.
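The tree structure can be sketched in a few lines. This is an illustrative toy, not RLMGymGenerator itself: the Rollout dataclass, the registry dict, the max_depth guard, toy_policy, and collect_steps are all hypothetical names, and a plain function stands in for the model. Only the ideas come from the description above: a subcall_fn that runs a full child loop, ast.literal_eval on the child's answer, a parent id for linkage, and optional inlining of child steps.

```python
import ast
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rollout:
    rid: str
    parent_rid: Optional[str]  # stands in for the _rlm_parent_rid sentinel
    steps: list = field(default_factory=list)
    final_answer: object = None


def parse_final(raw):
    """Convert a FINAL string to a Python value when it is a valid literal."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # keep free-form text answers as plain strings


def agent_loop(prompt, policy, parent_rid=None, depth=0, max_depth=2, registry=None):
    """Run one rollout; the policy may call subcall_fn to spawn child rollouts."""
    registry = {} if registry is None else registry
    rollout = Rollout(rid=uuid.uuid4().hex, parent_rid=parent_rid)
    registry[rollout.rid] = rollout

    def subcall_fn(sub_prompt):
        if depth >= max_depth:  # assumption: a simple recursion guard
            return None
        child = agent_loop(sub_prompt, policy, rollout.rid, depth + 1, max_depth, registry)
        return child.final_answer

    raw = policy(prompt, subcall_fn)
    rollout.steps.append({"prompt": prompt, "response": raw})
    rollout.final_answer = parse_final(raw)
    return rollout


def collect_steps(root, registry, train_child_trajectories=True):
    """Return the root's steps, optionally followed by all descendants' steps."""
    steps = list(root.steps)
    if train_child_trajectories:
        for r in registry.values():
            if r.parent_rid == root.rid:
                steps.extend(collect_steps(r, registry, True))
    return steps


def toy_policy(prompt, subcall_fn):
    # Stand-in for a model: "sum 1 2 3" delegates each number to a child call.
    if prompt.startswith("sum "):
        parts = [subcall_fn(f"value {tok}") for tok in prompt.split()[1:]]
        return str(sum(parts))
    return prompt.split()[1]
```

Calling agent_loop("sum 1 2 3", toy_policy, registry={}) produces one root rollout and three children linked by parent_rid. With train_child_trajectories enabled, collect_steps returns the children's steps alongside the root's.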

FSDP1 was already dead — this PR just buried it

For months, SkyRL's FSDP1 backend existed as dead weight. SFT pipelines rejected it. FSDP2 was the operational default, yet dual-backend maintenance bloated the codebase with redundant dispatchers, parallel helper functions, FSDP1-specific LoRA prefixes, and duplicated CI test matrices. Pull request #1659 ended the charade. It removed every trace — get_fsdp_state_ctx, offload_fsdp_model_to_cpu, get_sharding_strategy — and standardized the strategy identifier to "fsdp". The old "fsdp2" alias now merely emits a DeprecationWarning and normalizes. Maintaining two paths was never about flexibility; it was technical debt masquerading as compatibility.
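The alias behavior described above is a common deprecation pattern. Here is a minimal sketch, assuming a hypothetical helper named normalize_strategy. The function name and message text are illustrative and are not taken from the pull request.

```python
import warnings


def normalize_strategy(name):
    """Map the deprecated "fsdp2" alias to "fsdp"; pass other names through."""
    if name == "fsdp2":
        warnings.warn(
            'strategy="fsdp2" is deprecated; use strategy="fsdp" instead',
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return "fsdp"
    return name
```

Old configs keep working because the alias is normalized at the boundary, so the rest of the code only ever sees a single identifier.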

What's left after the purge: a lean, single-path distributed training stack

The cleanup touched fsdp_utils.py, fsdp_strategy.py, fsdp_worker.py, and configuration defaults. The _handle.reshard(True) workarounds vanished. FSDPBackendOverrides.strategy now defaults to "fsdp". Test suites shed 14 parametrized FSDP1 rows, and a new alias test verifies the deprecation warning. A GPU CI matrix collapsed to a single FSDP path. Twenty pytest executions passed. A grep sweep confirmed zero remaining references to legacy identifiers across skyrl/, tests/, examples/, and docs/. Configs that still use strategy="fsdp2" will run, but they're now running FSDP2 under a clean, honest name.
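A sweep like the one described can be reproduced with a short script. The sketch below is an assumption about how such a check might look, not the command the maintainers ran. The function name find_legacy_references is hypothetical, and the identifier list contains only the three names mentioned in this article.

```python
from pathlib import Path

# Legacy identifiers named in the article; a real sweep may cover more.
LEGACY = ("get_fsdp_state_ctx", "offload_fsdp_model_to_cpu", "get_sharding_strategy")


def find_legacy_references(roots, identifiers=LEGACY):
    """Return (path, line number, identifier) for every match under the roots."""
    hits = []
    for root in roots:
        base = Path(root)
        if not base.is_dir():
            continue  # skip directories that do not exist in this checkout
        for path in sorted(base.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # ignore binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                for ident in identifiers:
                    if ident in line:
                        hits.append((str(path), lineno, ident))
    return hits
```

Running find_legacy_references(["skyrl", "tests", "examples", "docs"]) from the repository root and getting an empty list would correspond to the zero-reference result reported above.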

The real moat isn't model size — it's environment depth

While the industry obsesses over parameter counts, NovaSky-AI is weaponizing the training environment itself. The RLM system doesn't just evaluate a model — it forces the model to become a programmer, an orchestrator, and a recursive reasoner. Each child REPL is a laboratory where sub-problems are decomposed and solved independently. The FSDP consolidation proves the same engineering ethos: remove cruft, enforce a single coherent path, and optimize for what actually works. If you're still betting on static benchmarks to measure intelligence, you're missing the shift. The future belongs to frameworks that let models run code, spawn sub-minds, and reason through execution — and SkyRL just lit the fuse.