Training

Page 1 of 3

Achieving state-of-the-art performance with AudioVAE, full-history conditioning, and reward-free self-corrective post-training for robust, expressive, and efficient speech synthesis.

dots.tts: 2B-Parameter Continuous Autoregressive TTS Foundation Model

Introducing dots.tts, a 2B-parameter continuous autoregressive text-to-speech foundation model. It leverages AudioVAE, full-history conditioning, and self-corrective post-training for unparalleled performance on multilingual benchmarks, offering strong generation stability, voice cloning, and emotional expressiveness with efficient MeanFlow distillation.

Introducing three core primitives for aggregating diverse models to achieve lower validation loss and improved data efficiency

Hyper-Epoch Pretraining (q0) for Data-Constrained Language Models

1Q Labs researchers introduce Hyper-Epoch Pretraining (q0), a conceptual shift from single-model training to exploring and aggregating a population of models. q0 uses cyclic schedules, chain distillation, and a learned prior to achieve significant data efficiency gains and lower validation loss in multi-epoch pretraining.

A data-free framework for training language models without external supervision, improving performance on open-ended and short-form QA benchmarks.

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Language Tasks

Introducing SCOPE, a data-free self-play framework for open-ended tasks that co-evolves a Challenger for task generation and a Solver for answering. It uses a self-judge to create rubrics and grade responses, improving 7-8B instruction-tuned models by up to +10.4 points on open-ended and +13.8 points on held-out QA benchmarks.

Investigating the potential of Parameter-Efficient Fine-Tuning to enable individual models with massive scale.

Scaling PEFT for Trillion-Parameter Personal Models

This article explores the scaling capabilities of Parameter-Efficient Fine-Tuning (PEFT) towards creating millions of personal models, each potentially reaching trillion-parameter scales. It delves into the architectural and practical considerations for achieving such unprecedented model personalization and efficiency.

A system-algorithm co-designed framework achieves 24 FPS 1280x704 resolution editing on consumer GPUs with enhanced temporal consistency.

SANA-Streaming: Real-time Video Editing with Hybrid Diffusion Transformer

SANA-Streaming introduces a hybrid diffusion transformer and Cycle-Reverse Regularization for real-time streaming video editing. Optimized for NVIDIA Blackwell (RTX 5090), it achieves 1280x704 resolution at 24 FPS with superior temporal coherence and throughput on consumer GPUs.

Exploring the architecture and application of state-externalizing harnesses in AI agent development.

Harness-1: Reinforcement Learning for Search Agents

Harness-1 introduces a novel approach to reinforcement learning for search agents through state-externalizing harnesses. This project, detailed in arXiv:2606.02373, provides a framework for advanced AI agent development.

NVIDIA's latest foundation model for robotics and embodied AI, integrating diverse sensory data for advanced physical intelligence.

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA introduces Cosmos 3, a cutting-edge omnimodal world model designed for physical AI applications. This project leverages diverse data inputs to enable robots and embodied AI systems to better understand and interact with the physical world, pushing the boundaries of autonomous intelligence.

Explore Ideogram 4's state-of-the-art capabilities, including multilingual text rendering, structured JSON prompting, and leading performance in design benchmarks.

What is Ideogram 4: The Open-Weight Text-to-Image Foundation Model?

Ideogram 4 is Ideogram's first open-weight text-to-image foundation model, trained from scratch. It features a new structured JSON prompting interface, best-in-class multilingual text rendering, deep language understanding, explicit layout/color controls, and native 2k resolution. It leads open-weight models in Design Arena and ContraLabs typography evaluations.

Discover NVIDIA's 550B parameter LatentMoE model, optimized for agentic reasoning, long-context analysis, and multilingual capabilities with Multi-Token Prediction.

NVIDIA Nemotron-3-Ultra 550B: A Frontier LLM for Complex AI Workflows

Nemotron-3-Ultra-550B-A55B-BF16 is a frontier-scale LLM by NVIDIA, featuring a LatentMoE architecture, Mamba-2 + MoE + Attention hybrid, and Multi-Token Prediction. Designed for complex multi-step agents, long-context analysis, and high-accuracy reasoning across multiple languages, it offers configurable reasoning and is released under the OpenMDW License.

Discover how a novel framework, inspired by diffusion models, enables training of massive Transformers with significantly reduced memory footprint.

How DiffusionBlocks Overcomes the Deep Learning Memory Wall

Explore the "memory wall" in deep learning and how DiffusionBlocks, by reinterpreting residual networks as diffusion processes, offers a principled, block-wise training method. Learn how it dramatically cuts memory usage for large Transformer models, making them accessible on standard hardware.

Discover BES, a novel framework coupling forward evolutionary search with backward goal decomposition to overcome sampling bottlenecks in LLM reasoning.

How Bidirectional Evolutionary Search Improves LLM Self-Improvement

This article explains Bidirectional Evolutionary Search (BES), a new framework that enhances LLM self-improvement by combining evolutionary operators for broader exploration with dense, intermediate feedback from goal decomposition. Learn how BES tackles the limitations of traditional sampling methods like best-of-N and tree search.

Understanding the geometric modeling advantage of direct clean-latent regression over velocity prediction in compressed VAE spaces.

Why Clean-Latent Prediction Outperforms Velocity in Diffusion Models

Explore how the choice of prediction target profoundly impacts diffusion model performance, even in latent spaces. This article details a controlled study comparing clean-latent (JLT) and velocity prediction (DiT), revealing why direct clean-latent regression consistently yields superior results due to fundamental differences in the underlying regression problem.