New LLM "Sleep" Phase Boosts Long-Context Performance

A novel arXiv study introduces an offline "sleep" mechanism for Transformer-based language models, improving long-horizon task efficiency without increasing online inference costs.

May 28, 2026

#Academic #Context #LLM #Memory

Researchers propose a "sleep" phase for large language models that converts recent context into persistent fast weights and then clears the key-value cache. The approach targets the attention bottleneck, letting models handle long-context tasks without a growing cache, with reported gains on benchmarks such as math reasoning.

A Sleep-Inspired Mechanism for Language Models

Transformer-based large language models struggle with long-context tasks because attention scales poorly with sequence length. A new arXiv study proposes an offline “sleep” phase that converts recent context into persistent fast weights and clears the key-value cache. This design shifts extra computation to the sleep phase while preserving the latency of wake-time prediction. The method improves performance on long-horizon benchmarks without increasing online inference cost.

How the Sleep Phase Works

The model periodically enters sleep and processes accumulated context through N offline recurrent passes. During each pass, fast weights in its state-space model (SSM) blocks are updated via a learned local rule. After sleep, the key-value cache is cleared. The fast weights then serve as a persistent memory of recent context. Wake-time predictions use only these weights, avoiding expensive attention over the full history.
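
A minimal sketch of this wake/sleep cycle, under loud assumptions: the article does not give the paper's update rule, so the learned local rule is replaced here by a fixed delta rule, the SSM block is reduced to a single fast-weight matrix, and the names `matvec`, `sleep`, `n_passes`, and `lr` are hypothetical. It illustrates only the control flow: N offline passes over the cached context, followed by a cache reset.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def sleep(W: Matrix, kv_cache: List[Tuple[Vector, Vector]],
          n_passes: int, lr: float = 0.5) -> Matrix:
    """Consolidate cached (key, value) pairs into fast weights W, then clear the cache.

    ASSUMPTION: a fixed delta rule stands in for the paper's learned local rule.
    """
    for _ in range(n_passes):                  # N offline recurrent passes
        for key, value in kv_cache:
            pred = matvec(W, key)              # what the fast weights currently recall
            err = [v - p for v, p in zip(value, pred)]
            for i in range(len(W)):            # local update: W += lr * outer(err, key)
                for j in range(len(key)):
                    W[i][j] += lr * err[i] * key[j]
    kv_cache.clear()                           # after sleep, the cache is emptied
    return W


# Toy context: two orthonormal keys, each bound to a value vector.
cache = [([1.0, 0.0], [1.0, 2.0]), ([0.0, 1.0], [3.0, 4.0])]
W = sleep([[0.0, 0.0], [0.0, 0.0]], cache, n_passes=4)

# Wake time: read from the fast weights only; the cache is now empty.
print(len(cache), matvec(W, [1.0, 0.0]))  # 0 [0.9375, 1.875]
```

In this toy setting each extra pass halves the remaining recall error. That is a property of the stand-in delta rule with orthonormal keys, not a claim about the paper's results.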

The Attention Bottleneck

Self-attention cost grows quadratically with context length when a sequence is processed in full, which makes long-horizon tasks computationally heavy. Caching keys and values avoids recomputation during generation, but the cache grows with every token, so memory demands rise with context length. The sleep-inspired method reimagines memory management: by periodically sleeping, the model compresses context into SSM fast weights and resets the cache. This transforms a growing-cache problem into a fixed number of offline passes, offering a practical path toward efficient long-context inference with hybrid transformer–state-space architectures.
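
A back-of-the-envelope calculation makes the growing-cache problem concrete. The formula is the standard size of a key-value cache (keys plus values, per layer, per KV head). The model dimensions, the fixed sleep interval, and the function names below are hypothetical and do not come from the paper; the sketch also ignores the fixed memory taken by the fast weights themselves.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def peak_cached_tokens(seq_len, sleep_interval=None):
    """Largest number of tokens ever held in the cache.

    ASSUMPTION: sleep happens every `sleep_interval` tokens and empties the cache.
    """
    if sleep_interval is None:
        return seq_len
    return min(seq_len, sleep_interval)


# Hypothetical model shape, chosen only for illustration.
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    no_sleep = kv_cache_bytes(peak_cached_tokens(seq_len), **CONFIG)
    with_sleep = kv_cache_bytes(peak_cached_tokens(seq_len, 4_096), **CONFIG)
    print(f"{seq_len} tokens: {no_sleep / 2**30:.1f} GiB without sleep, "
          f"{with_sleep / 2**30:.1f} GiB peak with sleep every 4096 tokens")
```

Under these assumptions the cache without sleep grows from 1 GiB at 8,192 tokens to 16 GiB at 131,072 tokens, while the peak with periodic sleep stays at 0.5 GiB regardless of total length.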

Tests: Synthetic Tasks and Math Reasoning

The authors evaluate on two controlled synthetic tasks (cellular automata and multi-hop graph retrieval) and on a more realistic math reasoning benchmark. The baselines are a regular transformer and SSM-attention hybrids without the sleep mechanism, and both fail on math reasoning. When equipped with sleep, the models improve on all tasks, which indicates that offline recurrence can rescue models that otherwise fail on complex, long-range dependencies.
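
To give a feel for why a cellular-automaton task stresses sequential depth, here is a small illustration using the well-known elementary Rule 110. This is an assumption for illustration only: the article does not say which automaton, grid size, or prompt format the paper uses, and `ca_step` and `ca_rollout` are hypothetical names. The point is that the state after t steps is obtained by applying the local rule t times in sequence.

```python
def ca_step(state, rule=110):
    """One step of an elementary cellular automaton with wrap-around edges."""
    n = len(state)
    return [
        # The (left, center, right) neighborhood indexes one bit of the rule number.
        (rule >> (state[i - 1] * 4 + state[i] * 2 + state[(i + 1) % n])) & 1
        for i in range(n)
    ]


def ca_rollout(state, steps, rule=110):
    """Apply the rule `steps` times; each step depends on the previous one."""
    for _ in range(steps):
        state = ca_step(state, rule)
    return state


start = [0, 0, 0, 1, 0]
print(ca_rollout(start, 1))  # [0, 0, 1, 1, 0]
print(ca_rollout(start, 2))  # [0, 1, 1, 1, 0]
```

A task built this way lets the required number of sequential steps be dialed up or down, which is one plausible reason such tasks are useful for probing deeper reasoning.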

Findings: More Sleep, Deeper Reasoning

Key findings from the paper:

The sleep mechanism improves performance on the tested tasks.
Increasing the number of offline passes (N) yields further gains.
The largest improvements occur on examples requiring deeper reasoning steps.
Baseline models—a regular transformer and SSM-attention hybrids—fail on math reasoning; the sleep-equipped model succeeds.

This suggests that, at least for the models and tasks tested, offline consolidation can be necessary rather than merely helpful on certain challenging tasks.

Paper and Authors

Title: “Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference.”
Authors: Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti.
arXiv ID: 2605.26099 (v2, revised 27 May 2026; original submission 25 May 2026).
License: CC BY 4.0.
Primary subject: Computation and Language (cs.CL). Secondary: Artificial Intelligence (cs.AI).