Training

Achieving state-of-the-art performance with AudioVAE, full-history conditioning, and reward-free self-corrective post-training for robust, expressive, and efficient speech synthesis.

dots.tts: 2B-Parameter Continuous Autoregressive TTS Foundation Model

Introducing dots.tts, a 2B-parameter continuous autoregressive text-to-speech foundation model. It combines AudioVAE, full-history conditioning, and reward-free self-corrective post-training to reach state-of-the-art results on multilingual benchmarks, with stable generation, voice cloning, and emotional expressiveness, and it uses MeanFlow distillation for efficient synthesis.

Introducing three core primitives for aggregating diverse models to achieve lower validation loss and improved data efficiency.

Hyper-Epoch Pretraining (q0) for Data-Constrained Language Models

1Q Labs researchers introduce Hyper-Epoch Pretraining (q0), a conceptual shift from single-model training to exploring and aggregating a population of models. q0 uses cyclic schedules, chain distillation, and a learned prior to achieve significant data efficiency gains and lower validation loss in multi-epoch pretraining.

A data-free framework for training language models without external supervision, improving performance on open-ended and short-form QA benchmarks.

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Language Tasks

Introducing SCOPE, a data-free self-play framework for open-ended tasks that co-evolves a Challenger for task generation and a Solver for answering. It uses a self-judge to create rubrics and grade responses, improving 7-8B instruction-tuned models by up to +10.4 points on open-ended and +13.8 points on held-out QA benchmarks.

A system-algorithm co-designed framework edits 1280x704 video at 24 FPS on consumer GPUs with enhanced temporal consistency.

SANA-Streaming: Real-time Video Editing with Hybrid Diffusion Transformer

SANA-Streaming introduces a hybrid diffusion transformer and Cycle-Reverse Regularization for real-time streaming video editing. Optimized for NVIDIA Blackwell (RTX 5090), it achieves 1280x704 resolution at 24 FPS with superior temporal coherence and throughput on consumer GPUs.

Exploring the architecture and application of state-externalizing harnesses in AI agent development.

Harness-1: Reinforcement Learning for Search Agents

Harness-1 applies reinforcement learning to search agents through state-externalizing harnesses. The project, detailed in arXiv:2606.02373, offers a framework for AI agent development.

NVIDIA's latest foundation model for robotics and embodied AI, integrating diverse sensory data for advanced physical intelligence.

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA introduces Cosmos 3, an omnimodal world model for physical AI applications. It draws on diverse data inputs to help robots and embodied AI systems better understand and interact with the physical world.

Discover how a novel framework, inspired by diffusion models, enables training of massive Transformers with a significantly reduced memory footprint.

How DiffusionBlocks Overcomes the Deep Learning Memory Wall

Explore the "memory wall" in deep learning and how DiffusionBlocks, by reinterpreting residual networks as diffusion processes, offers a principled, block-wise training method. Learn how it dramatically cuts memory usage for large Transformer models, making them accessible on standard hardware.

Understanding the geometric modeling advantage of direct clean-latent regression over velocity prediction in compressed VAE spaces.

Why Clean-Latent Prediction Outperforms Velocity in Diffusion Models

Explore how the choice of prediction target profoundly impacts diffusion model performance, even in latent spaces. This article details a controlled study comparing clean-latent (JLT) and velocity prediction (DiT), revealing why direct clean-latent regression consistently yields superior results due to fundamental differences in the underlying regression problem.
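
To make the two prediction targets concrete, here is a minimal sketch under a stated assumption: a linear noising path z_t = (1 - t) * x + t * eps, with velocity defined as v = eps - x. The teaser does not say which path the study uses, so this convention and every function name below are illustrative only. Under that assumption the clean latent can be recovered exactly from the velocity, which shows that the two targets are algebraically interchangeable and differ only in what the network is asked to regress.

```python
import random

def interpolate(x, eps, t):
    """Noisy latent on the assumed linear path: z_t = (1 - t) * x + t * eps."""
    return [(1 - t) * xi + t * ei for xi, ei in zip(x, eps)]

def velocity_target(x, eps):
    """Velocity target under the assumed path: v = eps - x."""
    return [ei - xi for xi, ei in zip(x, eps)]

def clean_from_velocity(z_t, v, t):
    """Recover the clean latent from a velocity: x = z_t - t * v."""
    return [zi - t * vi for zi, vi in zip(z_t, v)]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]    # stand-in clean latent
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]  # Gaussian noise sample
t = 0.3

z_t = interpolate(x, eps, t)
v = velocity_target(x, eps)
x_rec = clean_from_velocity(z_t, v, t)

# Largest absolute gap between the true and the recovered clean latent.
max_err = max(abs(a - b) for a, b in zip(x, x_rec))
```

A clean-latent model regresses `x` directly, while a velocity model regresses `v` and converts afterwards; the sketch only checks that conversion and makes no claim about which target trains better.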

Discover how LeJEPA achieves linear identifiability and why a Gaussian latent distribution is crucial for perfect recovery of underlying AI world models.

Why Gaussianity is Key to Identifiable World Models in AI

Explore the "if and only if" theorem behind LeJEPA's success in representation learning. Understand the role of Gaussian distributions, alignment, and regularization in achieving linear identifiability in AI's quest for robust world models.

Introducing SkillOpt, a novel framework that treats natural-language skill documents as trainable states for domain adaptation in large language models, enabling automated procedural improvement without modifying model weights.

SkillOpt: Optimizing LLM Behavior with Trainable Skill Documents

SkillOpt optimizes large language model behavior by iteratively refining natural-language "skill documents" through a propose-and-test loop. It uses an optimizer model to suggest edits, applies them under a bounded textual learning rate, and validates improvements, ensuring robust and portable domain adaptation for even closed-source frontier models.
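
The propose-and-test loop described above can be sketched as follows. This is a toy illustration, not SkillOpt's implementation: the optimizer model is replaced by a hypothetical scripted `propose` function, the "bounded textual learning rate" is approximated as a cap on the number of changed lines per step, and validation is a keyword-counting stand-in for a real evaluation set. All names are assumptions made for the example.

```python
import difflib
from typing import Callable, List, Tuple

def changed_lines(old: List[str], new: List[str]) -> int:
    """Count lines added or removed between two versions of a skill document."""
    return sum(1 for d in difflib.ndiff(old, new) if d.startswith(("+ ", "- ")))

def optimize_skill(
    doc: List[str],
    propose: Callable[[List[str]], List[str]],
    score: Callable[[List[str]], float],
    steps: int = 5,
    max_changed_lines: int = 2,
) -> Tuple[List[str], float]:
    """Keep a proposed edit only if it is small enough and improves the score."""
    best_doc, best_score = list(doc), score(doc)
    for _ in range(steps):
        candidate = propose(best_doc)
        # Assumed stand-in for the bounded textual learning rate.
        if changed_lines(best_doc, candidate) > max_changed_lines:
            continue
        candidate_score = score(candidate)
        if candidate_score > best_score:  # validate before accepting
            best_doc, best_score = candidate, candidate_score
    return best_doc, best_score

# Toy validation: count how many required behaviours the document mentions.
REQUIRED = ["cite sources", "show units"]

def toy_score(doc: List[str]) -> int:
    text = "\n".join(doc).lower()
    return sum(keyword in text for keyword in REQUIRED)

def make_toy_proposer() -> Callable[[List[str]], List[str]]:
    """Hypothetical optimizer: replays a fixed queue of suggested additions."""
    proposals = iter([
        ["Always cite sources."],               # small and helpful
        ["Be nice.", "Be brief.", "Be bold."],  # exceeds the edit bound
        ["Use a friendly tone."],               # small, but no score gain
        ["Show units in every answer."],        # small and helpful
    ])
    return lambda doc: doc + next(proposals, [])

skill = ["Answer the question directly."]
final_doc, final_score = optimize_skill(
    skill, make_toy_proposer(), toy_score, steps=4, max_changed_lines=2
)
```

In this toy run only the first and fourth proposals are accepted; the model weights are never touched, which is the property that lets the same loop apply to closed-source models.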

New hybrid models leverage offline consolidation, inspired by biological sleep, to overcome attention cache limitations in long-horizon tasks.

LLMs Learn to "Sleep" for Deeper Reasoning

This article explores how "LLM sleep," an offline consolidation phase, allows hybrid attention-SSM models to improve deep reasoning by iteratively refining fast-weight memories. Inspired by hippocampal replay, this method addresses the computational bottleneck of context eviction, enhancing performance on complex sequential tasks without increasing prediction-time cost.
