SANA-Streaming: Real-time Video Editing with Hybrid Diffusion Transformer

A system-algorithm co-designed framework edits 1280×704 video at 24 FPS on consumer GPUs with improved temporal consistency.

SANA-Streaming introduces a hybrid diffusion transformer and Cycle-Reverse Regularization for real-time streaming video editing. Optimized for NVIDIA Blackwell (RTX 5090), it achieves 1280x704 resolution at 24 FPS with superior temporal coherence and throughput on consumer GPUs.

Real-Time Streaming Video Editing on Consumer GPUs

Interactive applications like live broadcasting and gaming demand real-time video editing with both temporal consistency and high throughput on limited hardware. Previous video editing models often rely on full softmax attention, which becomes memory-prohibitive for long streams, while purely linear attention variants introduce visible chunk-boundary artifacts.

SANA-Streaming tackles these challenges through a system‑algorithm co‑design that combines a hybrid diffusion transformer, cycle‑reverse regularization, and hardware‑aware optimizations for the NVIDIA Blackwell architecture. Running on a single RTX 5090 GPU, it edits 1280×704‑resolution videos at 24 end‑to‑end frames per second, with the DiT core reaching 58 FPS. This work shows that careful co‑design can unlock practical real-time video editing on consumer hardware.

Image 1: Overview of SANA-Streaming

Hybrid Diffusion Transformer for Streaming Consistency

The heart of SANA-Streaming is a hybrid diffusion transformer that interleaves two attention mechanisms. Most blocks use Gated DeltaNet (GDN) linear attention, which compresses the streaming history into a fixed‑size recurrent state updated frame‑by‑frame. This provides a compact global memory whose size is independent of the video length, preventing the memory explosion of full softmax attention.
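
To make the fixed-size recurrent state concrete, here is a minimal NumPy sketch of a generic gated delta rule update. It illustrates the general technique only and is not SANA-Streaming's implementation: the function names, the per-token scalar gates `alpha` and `beta`, and the key normalization are assumptions made for the example.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent update of a gated delta rule (illustrative sketch).

    S     : (d_v, d_k) recurrent state, the fixed-size memory
    q, k  : (d_k,) query and key for the current token
    v     : (d_v,) value for the current token
    alpha : scalar in (0, 1], decay gate applied to the old state (assumed scalar)
    beta  : scalar in [0, 1], write strength of the delta update (assumed scalar)
    """
    k = k / (np.linalg.norm(k) + 1e-6)  # unit-norm key keeps the update well behaved
    # Decay the old memory, erase what it stored under key k, then write v.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q  # new state and the read-out for this token

def run_stream(tokens, d_k, d_v):
    """Process a stream of (q, k, v, alpha, beta) tuples with constant memory."""
    S = np.zeros((d_v, d_k))
    outputs = []
    for q, k, v, alpha, beta in tokens:
        S, o = gated_delta_step(S, q, k, v, alpha, beta)
        outputs.append(o)
    return S, np.stack(outputs)
```

However many tokens pass through `run_stream`, the state `S` keeps the shape `(d_v, d_k)`, which is the property the paragraph above relies on.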

A minority of blocks use softmax attention with a sliding local window and a persistent sink chunk. These blocks restore fine‑grained local correspondence, which is essential for preserving source details across chunk boundaries. During inference, each GDN block caches only terminal recurrent states, while softmax blocks attend to a small, constrained context.
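
The constrained context of the softmax blocks can be pictured as a chunk-level visibility mask. The sketch below assumes a single sink chunk at the start of the stream and a window measured in chunks; the names `visible_chunks`, `chunk_mask`, `window`, and `n_sink` are hypothetical and the actual window and sink sizes are not given here.

```python
import numpy as np

def visible_chunks(current, window, n_sink=1):
    """Chunk indices a softmax block may attend to while processing chunk `current`."""
    sink = range(min(n_sink, current + 1))                # persistent sink chunk(s)
    local = range(max(0, current - window + 1), current + 1)  # sliding local window
    return sorted(set(sink) | set(local))

def chunk_mask(n_chunks, window, n_sink=1):
    """Boolean (query chunk, key chunk) mask; True means attention is allowed."""
    mask = np.zeros((n_chunks, n_chunks), dtype=bool)
    for i in range(n_chunks):
        mask[i, visible_chunks(i, window, n_sink)] = True
    return mask
```

Each row of the mask has at most `window + n_sink` allowed chunks, so the attention cost per chunk stays bounded as the stream grows.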

Together, the hybrid design eliminates the flickering artifacts of pure linear attention while keeping memory constant. Compared with an all‑softmax variant, it uses only 5.56 GB VRAM and runs 3.7× faster, making high‑resolution streaming editing feasible on a consumer GPU.

Image 2: Hybrid streaming diffusion transformer

Cycle-Reverse Regularization: Learning Consistency from Unpaired Data

Minute‑length video editing demands long‑range temporal stability, but paired edited long videos are extremely rare. SANA-Streaming introduces Cycle-Reverse Regularization, a training strategy that requires only long source videos.

The forward pass performs streaming editing according to a given instruction. The resulting chunk is then used as a visual condition for a reverse edit, guided by an inverse prompt (e.g., “restore the original scene”). The reverse branch is trained with a flow‑matching objective to reconstruct the corresponding source frame. This cycle‑consistency objective forces the model to preserve source structure, motion, and non‑edited regions across hundreds of frames, even without paired supervision.
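
The cycle described above can be sketched as a loss function. This is a toy NumPy illustration under stated assumptions, not the authors' training code: `edit_fn` and `velocity_model` are hypothetical callables standing in for the streaming editor and the reverse branch, and the straight noise-to-data path with velocity target `target - noise` is one common flow-matching convention that the source does not specify.

```python
import numpy as np

def flow_matching_loss(velocity_model, target, condition, prompt, rng):
    """Regress the velocity along a straight path from noise to `target`."""
    noise = rng.standard_normal(target.shape)
    t = rng.uniform()                          # random time in [0, 1)
    x_t = (1.0 - t) * noise + t * target       # point on the noise-to-data path
    v_target = target - noise                  # constant velocity of that path
    v_pred = velocity_model(x_t, t, condition, prompt)
    return float(np.mean((v_pred - v_target) ** 2))

def cycle_reverse_loss(edit_fn, velocity_model, source_chunk,
                       forward_prompt, inverse_prompt, rng):
    """Forward edit, then train the reverse branch to reconstruct the source."""
    edited = edit_fn(source_chunk, forward_prompt)   # forward streaming edit
    # The edited chunk is the visual condition; the source is the target.
    return flow_matching_loss(velocity_model, source_chunk, edited,
                              inverse_prompt, rng)
```

Only `source_chunk` appears as a regression target, which is why the objective needs long source videos but no paired edited ones.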

The cycle‑reverse loss complements the existing LongLive‑style streaming long training with distribution matching distillation (DMD), which already teaches causal rollout. Experiments show that the regularization eliminates drifting and flickering, maintaining appearance consistency over minute‑long sequences.

Image 4: Streaming Long Training and Cycle-Reverse Regularization

Efficient System Co‑design: Fused Kernels and Mixed-Precision Quantization

To meet throughput targets on consumer GPUs, SANA-Streaming applies two hardware‑aware optimizations. First, a fused GDN kernel implemented in Triton partitions the spatial dimension and keeps the compact recurrent state in SRAM, achieving 1.5–2.2× speedup over a naive PyTorch implementation across various GPU architectures.

Second, a mixed‑precision quantization (MPQ) policy search is performed for the NVIDIA Blackwell architecture. Rather than assigning a uniform precision, the search evaluates per‑layer and per‑block sensitivity. Robust layers such as attention query/key projections and temporal FFN components can be safely demoted to NVFP4, while sensitive layers (patch embedding, output projection) remain in BF16 or FP8.
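
One simple way to turn per-layer sensitivity measurements into a precision policy is a greedy pass that gives each layer the cheapest format it tolerates. The sketch below is an assumption about how such a search could look, not the procedure used in SANA-Streaming: the function name, the `tolerance` threshold, the `keep_high` set, and the layout of the `sensitivity` dictionary are all hypothetical.

```python
PRECISIONS = ["NVFP4", "FP8", "BF16"]  # ordered from cheapest to most precise

def search_policy(sensitivity, tolerance, keep_high=()):
    """Pick, per layer, the cheapest precision whose measured quality drop fits.

    sensitivity : {layer_name: {precision: quality_drop}} (hypothetical measurements)
    tolerance   : largest quality drop accepted for a single layer
    keep_high   : layer names that are never demoted below FP8
    """
    policy = {}
    for layer, drops in sensitivity.items():
        candidates = PRECISIONS[1:] if layer in keep_high else PRECISIONS
        # Missing measurements count as infinitely sensitive; BF16 is the fallback.
        policy[layer] = next(
            (p for p in candidates if drops.get(p, float("inf")) <= tolerance),
            "BF16",
        )
    return policy
```

A greedy per-layer pass like this ignores interactions between layers, so a real search would likely re-validate the combined policy end to end before accepting it.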

The resulting mixed‑precision policy yields a 1.59× DiT latency reduction over the BF16 baseline with negligible quality loss. Combined with the GDN kernel, these system‑level optimizations enable 24 end‑to‑end FPS on a single RTX 5090.

Image 6: Per-layer quantization policy search

Data Pipeline for High-Quality Streaming Training

Training a real‑time video editor requires large‑scale, high‑quality data. SANA-Streaming builds a pipeline that constructs both short‑video editing pairs and long‑video editing instructions.

For short clips, a taxonomy‑guided process generates diverse edit instructions, and an image editor modifies the first frame as a visual anchor. A controllable video generator then produces the edited video, conditioned on the source, the edited first frame, and an extracted pose sequence to preserve motion. A vision‑language model (VLM) verifies each sample for instruction alignment, consistency, and visual quality.

For long videos, a VLM generates forward and backward editing prompts from anchor frames of source videos. These serve the streaming long training and cycle‑reverse regularization without needing paired edited videos. This pipeline ensures motion‑preserving, instruction‑following edits that form the foundation for the model’s fidelity and streaming stability.

Image 8: Data Pipeline

Experimental Results: Real-Time Speed and Editing Quality

SANA-Streaming was evaluated on the OpenVE‑Bench pixel‑aligned editing categories. The undistilled bidirectional model achieves a state‑of‑the‑art average score of 2.62 with only 2B parameters, outperforming larger methods such as VACE and OpenVE‑Edit.

The step‑distilled streaming version maintains competitive quality (2.42) while running at 24 end‑to‑end FPS on a single RTX 5090, more than 100× faster than the previous state of the art. Ablations confirm that cycle‑reverse regularization improves temporal stability, and the fused GDN kernel plus mixed‑precision quantization together deliver a 1.59× DiT speedup. The causal VAE decoder, distilled from a bidirectional teacher, recovers sharp details and matches the teacher’s fidelity.

These results validate the co‑design approach, demonstrating that high‑resolution, real-time video editing is now achievable on consumer hardware.

Image 9: Qualitative comparison

Conclusion and Broader Impact

SANA-Streaming demonstrates that minute‑length, high‑resolution video editing can run in real time on a consumer GPU by uniting architectural innovation, training strategies, and hardware‑aware system design. The hybrid transformer, cycle‑reverse regularization, and efficient kernels collectively overcome latency, memory, and data‑scarcity bottlenecks.

Limitations include sensitivity to ambiguous instructions and the persistent shortage of diverse long editing data. The system incorporates safeguards such as input screening, generation‑time controls, and output monitoring to mitigate potential misuse, including deepfakes. This work sets a practical baseline for interactive video editing and highlights how co‑design can accelerate generative AI toward real‑world deployment.
