Video

A novel framework overcoming high-resolution bottlenecks with mask-free shifted-window attention and lightweight autoencoders for live-stream applications.

SwiftVR: Real-Time Generative Video Restoration on Consumer GPUs

SwiftVR is a streaming, one-step generative video restoration framework for live-stream applications. It addresses consumer GPU bottlenecks with mask-free shifted-window self-attention and a lightweight autoencoder, achieving real-time 1080p streaming on consumer-grade GPUs and 4K on an H100.

Explore NAVA's Align-then-Fuse MMDiT architecture for native audio-visual alignment, enabling precise multi-timbre control and language-described camera movements.

How NAVA Generates Synchronized 720p Audio-Video from a Single Prompt

NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized 720p video and audio from a single prompt. It uses an Align-then-Fuse MMDiT architecture to establish audio-video correspondence, and offers multi-speaker speech with timbre control, fast generation, and language-described camera control.

Native editing, not generation, is the silent revolution that just left the prompt-to-pixel circus behind.

You’ve Been Lied To About Video AI’s Real Breakthrough

The AI world is obsessed with generating video from scratch, but the true frontier is native editing through conversation. Gemini Omni’s ability to surgically alter existing footage without re-rendering shatters the old pipeline approach, even as token costs threaten to gatekeep the revolution.

A 2.6B-parameter diffusion transformer synthesizing 720p video with 6-DoF camera control, hybrid linear attention, and two-stage refinement.

SANA-WM: Open-Source Bidirectional World Model for Minute-Long Video

SANA-WM is an efficient open-source world model trained for one-minute video generation. It uses a bidirectional image-to-video diffusion transformer with hybrid linear attention, dual-branch camera control, and a two-stage pipeline. It runs in under 8 GB of VRAM and generates 60-second 720p clips in 34 seconds on a single RTX 5090.
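To put the quoted timing in perspective, here is a small sketch of the arithmetic. It uses only the two figures from the summary above (a 60-second clip generated in 34 seconds); the function name `realtime_factor` is a hypothetical helper, not part of SANA-WM.

```python
def realtime_factor(clip_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of compute.

    A value above 1 means the clip is generated faster than it plays back.
    """
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return clip_seconds / generation_seconds


# Figures quoted in the summary: 60 s of 720p video in 34 s on one RTX 5090.
factor = realtime_factor(60, 34)
print(f"{factor:.2f}x real time")  # prints "1.76x real time"
```

By this measure the reported run is roughly 1.76 times faster than playback. That describes batch throughput, which is not the same thing as frame-by-frame streaming latency.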

End-to-end training and inference system using NVFP4 quantization, Balanced SP, and multi-shot attention sink for real-time, long, interactive video generation.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

LongLive-2.0 presents the first end-to-end NVFP4 system for long video generation. It introduces Balanced Sequence Parallelism (SP) and NVFP4 quantization to accelerate training and inference. On Blackwell GPUs, W4A4 inference and a quantized KV cache reduce memory use and boost throughput. A clean training pipeline directly fine-tunes diffusion models into autoregressive models with standalone LoRA for real-time generation. A multi-shot attention sink enables stable streaming. Experiments show up to a 2.15× training speedup and a 1.84× inference speedup, achieving 45.7 FPS at 5B parameters.
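For readers unfamiliar with 4-bit floating-point quantization, the following is a rough sketch of the general idea behind a block-scaled FP4 format. It is not the LongLive-2.0 implementation. It assumes, as we understand the NVFP4 format, 4-bit E2M1 values sharing one scale per block of 16 elements; for simplicity the scale stays in full precision here, whereas the real format stores it in a compact type. The names `E2M1_GRID`, `BLOCK`, and `fake_quantize_fp4` are hypothetical.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign bit is separate).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # assumption: number of elements that share one scale


def fake_quantize_fp4(x, block=BLOCK):
    """Round x onto a block-scaled E2M1 grid and return dequantized values.

    The input is flattened and split into consecutive blocks; the result
    has the same shape as the input.
    """
    x = np.asarray(x, dtype=np.float64)
    if x.size % block != 0:
        raise ValueError("input size must be a multiple of the block size")
    blocks = x.reshape(-1, block)

    # One scale per block, chosen so the largest magnitude maps to 6.0.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)

    # Snap each scaled magnitude to the nearest grid point.
    # Ties go to the smaller value; real hardware rounding may differ.
    scaled = np.abs(blocks) / scale
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)

    # Restore sign and scale so the error can be inspected directly.
    dequantized = np.sign(blocks) * E2M1_GRID[idx] * scale
    return dequantized.reshape(x.shape)


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 32))
approx = fake_quantize_fp4(weights)
print("max abs error:", np.abs(weights - approx).max())
```

Because each block carries its own scale, a single outlier only coarsens the 16 values around it rather than the whole tensor, which is the main reason such small blocks are attractive at 4 bits. "W4A4" in the summary refers to applying this kind of quantization to both weights and activations.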