A Sleep-Inspired Mechanism for Language Models
Transformer-based large language models struggle with long-context tasks because attention scales poorly with sequence length. A new arXiv study proposes an offline “sleep” phase that converts recent context into persistent fast weights and clears the key-value cache. This design shifts extra computation to the sleep phase while preserving the latency of wake-time prediction. The method improves performance on long-horizon benchmarks without increasing online inference cost.
How the Sleep Phase Works
The model periodically enters sleep and processes the accumulated context through N offline recurrent passes. During each pass, the fast weights in its state-space model (SSM) blocks are updated via a learned local rule. After sleep, the key-value cache is cleared, and the fast weights serve as a persistent memory of the recent context. Wake-time predictions then rely on these weights instead of running expensive attention over the full history.
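To make the loop concrete, here is a minimal Python sketch of a sleep phase. It is an illustration under stated assumptions, not the paper's implementation: a simple delta-rule update stands in for the learned local rule, a single weight matrix stands in for the SSM fast weights, and the names `sleep`, `recall`, `n_passes`, and `lr` are hypothetical.

```python
def recall(fast_weights, key):
    """Read from the fast weights: a plain matrix-vector product."""
    return [sum(w * k for w, k in zip(row, key)) for row in fast_weights]


def sleep(fast_weights, kv_cache, n_passes, lr=0.5):
    """Consolidate cached (key, value) pairs into fast weights, then clear the cache.

    Assumption: the delta rule below is a stand-in for the paper's learned
    local rule, which this sketch does not reproduce.
    """
    for _ in range(n_passes):              # N offline recurrent passes
        for key, value in kv_cache:
            pred = recall(fast_weights, key)
            for i, row in enumerate(fast_weights):
                err = value[i] - pred[i]   # local error for output unit i
                for j, k in enumerate(key):
                    row[j] += lr * err * k
    kv_cache.clear()                       # the cache is reset after sleep
    return fast_weights


weights = [[0.0, 0.0], [0.0, 0.0]]
cache = [([1.0, 0.0], [2.0, 3.0]), ([0.0, 1.0], [5.0, 7.0])]
sleep(weights, cache, n_passes=3)
print(recall(weights, [1.0, 0.0]))  # [1.75, 2.625]; nears [2.0, 3.0] as n_passes grows
print(len(cache))                   # 0: the cache is empty after sleep
```

In this toy setting, with orthonormal keys and `lr=0.5`, each pass halves the remaining recall error, so extra passes give a closer reconstruction of the cached values. How the real model behaves depends on its learned rule and architecture.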

The Attention Bottleneck
The cost of attention grows quadratically with context length, which makes long-horizon tasks computationally heavy. Caching keys and values avoids recomputation, but the cache itself grows with every new token and drives up memory demands. The sleep-inspired method takes a different approach to memory management: by periodically sleeping, the model compresses context into SSM fast weights and resets the cache. This turns a growing-cache problem into a fixed number of offline passes, offering a practical path toward efficient long-context inference with hybrid transformer–state-space architectures.
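A back-of-the-envelope calculation shows why the cache matters. The sketch below uses the standard size formula for a key-value cache (keys and values stored per layer, per head, per token) and compares it with a fixed-size state. The model dimensions and the `fast_weight_bytes` layout are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (factor 2) are stored for every layer, head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def fast_weight_bytes(n_layers, d_in, d_out, bytes_per_value=2):
    # Assumed layout: one d_out x d_in matrix per layer, independent of n_tokens.
    return n_layers * d_in * d_out * bytes_per_value


GIB = 1024 ** 3
for n_tokens in (4096, 32768, 262144):
    cache = kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"{n_tokens:>7} tokens: KV cache = {cache / GIB:.0f} GiB")

state = fast_weight_bytes(n_layers=32, d_in=4096, d_out=4096)
print(f"fixed fast-weight state = {state / GIB:.0f} GiB")
```

Under these assumed dimensions and 16-bit values, the cache grows from 2 GiB at 4,096 tokens to 128 GiB at 262,144 tokens, while the fixed state stays at 1 GiB regardless of how much context has been consumed.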
Tests: Synthetic Tasks and Math Reasoning
The authors evaluate on controlled synthetic tasks (cellular automata and multi-hop graph retrieval) and on a more realistic math reasoning benchmark. The baselines are a regular transformer and SSM-attention hybrids that lack the sleep mechanism, and both kinds of baseline fail on math reasoning. When equipped with sleep, the models show performance gains across all tasks, which suggests that offline recurrence can rescue models from failure on complex, long-range dependencies.
Findings: More Sleep, Deeper Reasoning
Key findings from the paper:
- The sleep mechanism improves performance on the tested tasks.
- Increasing the number of offline passes (N) yields further gains.
- The largest improvements occur on examples requiring deeper reasoning steps.
- Baseline models—a regular transformer and SSM-attention hybrids—fail on math reasoning; the sleep-equipped model succeeds.
This suggests that, for certain challenging tasks, offline consolidation may be necessary rather than merely helpful, at least for the architectures tested.
Paper and Authors
Title: “Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference.”
Authors: Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti.
arXiv ID: 2605.26099 (v2, revised 27 May 2026; original submission 25 May 2026).
License: CC BY 4.0.
Primary subject: Computation and Language (cs.CL).
Secondary: Artificial Intelligence (cs.AI).




