The recursion ceiling is a myth — and this code breaks it
You think large language models are bounded by their single-call context windows. You're wrong. The NovaSky team just shipped an environment where models spawn recursive child agents that reason inside stateful Python sandboxes, call sub-models mid-rollout, and terminate only when a structured FINAL answer is produced. This isn't a prompt engineering trick — it's infrastructure for execution-driven meta-reasoning. SkyRL's Recursive Language Models (RLM) implementation turns a flat RL gym into a multi-turn, multi-agent reasoning tree. With 1.9k stars on GitHub, NovaSky-AI/SkyRL is quietly redefining what "training a model to think" actually means.
Inside the stateful Python sandbox that teaches models to think in steps
The core is PersistentREPL, a Python interpreter that survives across turns, blocks dangerous builtins like eval, and injects scaffold identifiers (FINAL_VAR, SHOW_VARS, context) directly into the namespace. Models don't just chat; they execute code, inspect variables, and query LM helpers (llm_query, rlm_query) that themselves can spawn child RLM agents with independent REPLs. Timeouts use SIGALRM with a thread-based fallback for Ray workers, where code may run off the main thread and signal handlers cannot be installed. After every execute(), reserved identifiers are restored, so any attempt to shadow them is silently undone. The environment forces the model to structure its reasoning as a sequence of thought-action-observation cycles, with each turn's code parsed from the final repl-tagged fenced code block in the model's output.
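To make the mechanics concrete, here is a minimal sketch of a persistent, restricted interpreter. It is not SkyRL's PersistentREPL: the class name MiniPersistentREPL, the BLOCKED_BUILTINS set, and the assumption that FINAL_VAR takes a variable name are all illustrative, and timeouts are left out entirely.

```python
import builtins
import contextlib
import io

# Hypothetical names: MiniPersistentREPL and BLOCKED_BUILTINS are illustrative
# and are not taken from the SkyRL source.
BLOCKED_BUILTINS = {"eval", "exec", "compile", "open", "input"}


class MiniPersistentREPL:
    """Toy interpreter whose namespace survives across execute() calls."""

    def __init__(self, context):
        # Expose every builtin except the blocked ones.
        safe_builtins = {
            name: getattr(builtins, name)
            for name in dir(builtins)
            if name not in BLOCKED_BUILTINS
        }
        self.final_answer = None
        # Reserved scaffold identifiers, re-injected after every execute().
        self._reserved = {
            "context": context,
            "FINAL_VAR": self._final_var,
            "SHOW_VARS": self._show_vars,
        }
        self.namespace = {"__builtins__": safe_builtins, **self._reserved}

    def _final_var(self, name):
        # Assumption: FINAL_VAR receives the *name* of the answer variable.
        self.final_answer = self.namespace[name]
        return self.final_answer

    def _show_vars(self):
        # List user-defined names, hiding dunders and scaffold identifiers.
        return sorted(
            key for key in self.namespace
            if not key.startswith("__") and key not in self._reserved
        )

    def execute(self, code):
        """Run code in the shared namespace and return captured output."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:
            # Report errors as an observation instead of crashing the turn.
            buffer.write(f"{type(exc).__name__}: {exc}")
        finally:
            # Undo any shadowing of the reserved identifiers.
            self.namespace.update(self._reserved)
        return buffer.getvalue()
```

Removing names from the builtins mapping is a guardrail, not a security boundary; a real sandbox needs process-level isolation on top of it.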

Training rollout trees: where each leaf is a child agent with its own REPL
RLMGymGenerator manages not one rollout but a tree. A parent rollout can invoke subcall_fn, which launches a full agent_loop for a child and returns its final_answer after ast.literal_eval processing. Child rollouts carry a _rlm_parent_rid sentinel for linkage, and an optional train_child_trajectories flag inlines their step outputs back into the parent trajectory once root finalization occurs. If a frozen OpenRouter model is configured, child calls route through OpenRouterInferenceEngine; otherwise they default to the policy engine. Every assistant token in both parent and child sequences receives a loss_mask; the generator requires step_wise_trajectories=True. Metrics are logged to Weights & Biases, with runs already recorded under RLM-2b-4b-E2E-Runs.
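The tree structure is easier to see in code. The sketch below is a toy, not RLMGymGenerator: Rollout, parse_final, toy_policy, and select_training_rollouts are hypothetical names, the "policy" is a plain function rather than a model, and parent_rid merely stands in for the _rlm_parent_rid linkage described above.

```python
import ast
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rollout:
    rid: int
    parent_rid: Optional[int] = None  # stand-in for the _rlm_parent_rid link
    steps: list = field(default_factory=list)
    final_answer: object = None


def parse_final(raw):
    """Interpret the answer as a Python literal, else keep the raw string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def agent_loop(task, policy, registry, parent_rid=None):
    """Run one rollout; the policy may recurse through subcall_fn."""
    rollout = Rollout(rid=len(registry), parent_rid=parent_rid)
    registry.append(rollout)

    def subcall_fn(sub_task):
        # Each subcall is a full child rollout linked back to this one.
        child = agent_loop(sub_task, policy, registry, parent_rid=rollout.rid)
        return child.final_answer

    raw = policy(task, subcall_fn)
    rollout.steps.append((task, raw))
    rollout.final_answer = parse_final(raw)
    return rollout


def toy_policy(task, subcall_fn):
    """Stub policy: sum a list, delegating halves of long lists to children."""
    numbers = ast.literal_eval(task)
    if len(numbers) <= 2:
        return str(sum(numbers))
    mid = len(numbers) // 2
    left = subcall_fn(str(numbers[:mid]))
    right = subcall_fn(str(numbers[mid:]))
    return str(left + right)


def select_training_rollouts(registry, train_child_trajectories):
    """Keep only root rollouts unless child trajectories are trained too."""
    if train_child_trajectories:
        return list(registry)
    return [r for r in registry if r.parent_rid is None]
```

Calling agent_loop("[1, 2, 3, 4, 5]", toy_policy, registry) with an empty registry list produces one root and four descendants, each child recording the rid of the rollout that spawned it.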
FSDP1 was already dead — this PR just buried it
For months, SkyRL's FSDP1 backend existed as dead weight. SFT pipelines rejected it. FSDP2 was the operational default, yet dual-backend maintenance bloated the codebase with redundant dispatchers, parallel helper functions, FSDP1-specific LoRA prefixes, and duplicated CI test matrices. Pull request #1659 ended the charade. It removed every trace — get_fsdp_state_ctx, offload_fsdp_model_to_cpu, get_sharding_strategy — and standardized the strategy identifier to "fsdp". The old "fsdp2" alias now merely emits a DeprecationWarning and normalizes. Maintaining two paths was never about flexibility; it was technical debt masquerading as compatibility.
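The alias handling follows a familiar deprecation pattern. A minimal sketch, assuming a helper named normalize_strategy (a hypothetical name; the PR's actual function and warning text may differ):

```python
import warnings


def normalize_strategy(strategy: str) -> str:
    """Map the deprecated "fsdp2" alias onto the canonical "fsdp" name."""
    if strategy == "fsdp2":
        warnings.warn(
            'strategy="fsdp2" is deprecated; use strategy="fsdp" instead.',
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's config code
        )
        return "fsdp"
    return strategy
```

Because Python hides DeprecationWarning by default outside of __main__ and test runners, users of old configs may only see the message when running with -W default or under pytest.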
What's left after the purge: a lean, single-path distributed training stack
The cleanup touched fsdp_utils.py, fsdp_strategy.py, fsdp_worker.py, and configuration defaults. The _handle.reshard(True) workarounds vanished. FSDPBackendOverrides.strategy now defaults to "fsdp". Test suites shed 14 parametrized FSDP1 rows, and a new alias test verifies the warning. The GPU CI matrix collapsed to a single FSDP path. Twenty pytest executions passed. A grep sweep confirmed zero remaining references to legacy identifiers across skyrl/, tests/, examples/, and docs/. Configs that still use strategy="fsdp2" will run, but they're now running FSDP2 under a clean, honest name.
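A sweep like this is easy to reproduce. The sketch below is one way to do it in Python rather than with grep itself; find_legacy_references is a hypothetical helper, and the identifier list contains only the three names mentioned in this article, not necessarily the full set the PR checked.

```python
from pathlib import Path

# Only the legacy names cited in this article; the real sweep may cover more.
LEGACY_IDENTIFIERS = (
    "get_fsdp_state_ctx",
    "offload_fsdp_model_to_cpu",
    "get_sharding_strategy",
)


def find_legacy_references(root, subdirs=("skyrl", "tests", "examples", "docs")):
    """Return (relative_path, line_number, identifier) for every match."""
    hits = []
    for subdir in subdirs:
        base = Path(root) / subdir
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                for ident in LEGACY_IDENTIFIERS:
                    if ident in line:
                        hits.append((str(path.relative_to(root)), lineno, ident))
    return hits
```

An empty return value corresponds to the "zero remaining references" result; wiring the check into CI would keep the legacy names from creeping back in.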
The real moat isn't model size — it's environment depth
While the industry obsesses over parameter counts, NovaSky-AI is weaponizing the training environment itself. The RLM system doesn't just evaluate a model — it forces the model to become a programmer, an orchestrator, and a recursive reasoner. Each child REPL is a laboratory where sub-problems are decomposed and solved independently. The FSDP consolidation proves the same engineering ethos: remove cruft, enforce a single coherent path, and optimize for what actually works. If you're still betting on static benchmarks to measure intelligence, you're missing the shift. The future belongs to frameworks that let models run code, spawn sub-minds, and reason through execution — and SkyRL just lit the fuse.



