Real-Time Streaming Video Editing on Consumer GPUs
Interactive applications like live broadcasting and gaming demand real-time video editing with both temporal consistency and high throughput on limited hardware. Previous video editing models often rely on full softmax attention, which becomes memory-prohibitive for long streams, while purely linear attention variants introduce visible chunk-boundary artifacts.
SANA-Streaming tackles these challenges through a system-algorithm co-design that combines a hybrid diffusion transformer, cycle-reverse regularization, and hardware-aware optimizations for the NVIDIA Blackwell architecture. Running on a single RTX 5090 GPU, it edits 1280×704-resolution videos at 24 end-to-end frames per second, with the DiT core reaching 58 FPS. This work shows that careful co-design can unlock practical real-time video editing on consumer hardware.

Hybrid Diffusion Transformer for Streaming Consistency
The heart of SANA-Streaming is a hybrid diffusion transformer that interleaves two attention mechanisms. Most blocks use Gated DeltaNet (GDN) linear attention, which compresses the streaming history into a fixed-size recurrent state updated frame by frame. This provides a compact global memory whose size is independent of the video length, preventing the memory explosion of full softmax attention.
A minority of blocks use softmax attention with a sliding local window and a persistent sink chunk. These blocks restore fine-grained local correspondence, which is essential for preserving source details across chunk boundaries. During inference, each GDN block caches only terminal recurrent states, while softmax blocks attend to a small, constrained context.
Together, the hybrid design eliminates the flickering artifacts of pure linear attention while keeping memory constant. Compared with an all-softmax variant, it uses only 5.56 GB VRAM and runs 3.7× faster, making high-resolution streaming editing feasible on a consumer GPU.
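The fixed-size memory can be illustrated with a minimal pure-Python sketch of a gated delta-rule update, the family of recurrences that Gated DeltaNet belongs to. The function names (gdn_step, read, run_stream), the toy dimensions, and the scalar gates alpha and beta are assumptions made for illustration, not details of SANA-Streaming; the sketch only shows that the state keeps the same shape however many tokens are processed.

```python
def gdn_step(S, k, v, alpha, beta):
    """One gated delta-rule update: S <- alpha * S (I - beta k k^T) + beta v k^T.

    S is a d_v x d_k matrix (list of rows), k a key of length d_k,
    v a value of length d_v; alpha is a decay gate, beta a write strength.
    """
    d_v, d_k = len(S), len(k)
    # What the current state would retrieve for this key (S @ k).
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]


def read(S, q):
    """Query the memory: o = S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


def run_stream(num_tokens, d_k=4, d_v=3, alpha=0.99, beta=0.5):
    """Feed a synthetic token stream through the recurrence and return the final state."""
    S = [[0.0] * d_k for _ in range(d_v)]
    for t in range(num_tokens):
        k = [1.0 if j == t % d_k else 0.0 for j in range(d_k)]  # toy unit-norm key
        v = [float((t + i) % 5) for i in range(d_v)]            # toy value
        S = gdn_step(S, k, v, alpha, beta)
    return S
```

With alpha = 1 and beta = 1, writing a new value under an existing unit-norm key replaces the old value instead of adding to it, which is the "delta" part of the rule. After 10 tokens or 10,000 tokens, run_stream returns a state of the same d_v × d_k size.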
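The constrained context of the softmax blocks can be sketched as a small index computation. This is an illustrative assumption about how a sliding window plus a sink chunk might be selected; the function name visible_chunks, the default of one sink chunk, and the choice to count the current chunk inside the window are hypothetical rather than taken from SANA-Streaming.

```python
def visible_chunks(current, window, sink=1):
    """Chunk indices a softmax block may attend to while processing chunk `current`.

    `sink` leading chunks stay visible for the whole stream; `window` is the
    number of most recent chunks (including the current one) kept locally.
    """
    sink_ids = set(range(min(sink, current + 1)))
    start = max(current - window + 1, 0)
    local_ids = set(range(start, current + 1))
    # Union removes duplicates while the window still overlaps the sink.
    return sorted(sink_ids | local_ids)


early = visible_chunks(2, window=3)    # [0, 1, 2]
late = visible_chunks(10, window=3)    # [0, 8, 9, 10]
```

Under these assumptions the context never exceeds window + sink chunks, so the attention cost per chunk stays bounded as the stream grows.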

Cycle-Reverse Regularization: Learning Consistency from Unpaired Data
Minute-length video editing demands long-range temporal stability, but paired edited long videos are extremely rare. SANA-Streaming introduces Cycle-Reverse Regularization, a training strategy that requires only long source videos.
The forward pass performs streaming editing according to a given instruction. The resulting chunk is then used as a visual condition for a reverse edit, guided by an inverse prompt (e.g., "restore the original scene"). The reverse branch is trained with a flow-matching objective to reconstruct the corresponding source frame. This cycle-consistency objective forces the model to preserve source structure, motion, and non-edited regions across hundreds of frames, even without paired supervision.
The cycle-reverse loss complements the existing LongLive-style streaming long training with distribution matching distillation (DMD), which already teaches causal rollout. Experiments show that the regularization eliminates drifting and flickering, maintaining appearance consistency over minute-long sequences.
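The cycle described above can be sketched with toy vectors in place of video latents. The sketch assumes a linear interpolation path between noise and the target with velocity target minus noise, which is one common flow-matching convention; the names edit_fn, predict_velocity, flow_matching_loss and cycle_reverse_loss are hypothetical, and details such as whether gradients flow through the forward edit are not specified here.

```python
import random


def flow_matching_loss(predict_velocity, target, cond, prompt, rng):
    """MSE between predicted and true velocity on a linear noise-to-target path."""
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    t = rng.random()  # time in [0, 1)
    x_t = [(1 - t) * n + t * x for n, x in zip(noise, target)]
    v_target = [x - n for n, x in zip(noise, target)]
    v_pred = predict_velocity(x_t, t, cond, prompt)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(target)


def cycle_reverse_loss(edit_fn, predict_velocity, source, instruction,
                       inverse_prompt, rng):
    """Forward edit, then train the reverse branch to reconstruct the source."""
    edited = edit_fn(source, instruction)  # no paired edited target is needed
    # The edited chunk is the visual condition; the source is the target.
    return flow_matching_loss(predict_velocity, source, edited, inverse_prompt, rng)


# Toy stand-ins: the "edit" brightens every value by 1.0.
def toy_edit(frame, instruction):
    return [x + 1.0 for x in frame]


def oracle_velocity(x_t, t, cond, prompt):
    # Knows that the source equals cond - 1.0, so it can recover the true velocity.
    source = [c - 1.0 for c in cond]
    return [(s - x) / (1 - t) for s, x in zip(source, x_t)]


def zero_velocity(x_t, t, cond, prompt):
    return [0.0 for _ in x_t]
```

A predictor that correctly undoes the edit drives the loss to zero, while one that ignores the condition does not, which is the signal the regularizer relies on.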

Efficient System Co-design: Fused Kernels and Mixed-Precision Quantization
To meet throughput targets on consumer GPUs, SANA-Streaming applies two hardware-aware optimizations. First, a fused GDN kernel implemented in Triton partitions the spatial dimension and keeps the compact recurrent state in SRAM, achieving a 1.5× to 2.2× speedup over a naive PyTorch implementation across various GPU architectures.
Second, a mixed-precision quantization (MPQ) policy search is performed for the NVIDIA Blackwell architecture. Rather than assigning a uniform precision, the search evaluates per-layer and per-block sensitivity. Robust layers such as attention query/key projections and temporal FFN components can be safely demoted to NVFP4, while sensitive layers (patch embedding, output projection) remain in BF16 or FP8.
The resulting mixed-precision policy yields a 1.59× DiT latency reduction over the BF16 baseline with negligible quality loss. Combined with the GDN kernel, these system-level optimizations enable 24 end-to-end FPS on a single RTX 5090.
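A sensitivity-driven policy of this kind can be sketched as a greedy assignment. The following is a simplified illustration, not the search used by SANA-Streaming: the layer names, the sensitivity scores, the budget, and the function search_policy are all hypothetical, and the FP8 tier is omitted so that every layer is either NVFP4 or BF16.

```python
def search_policy(sensitivity, budget, keep_high=()):
    """Greedy sketch of a mixed-precision policy.

    `sensitivity[name]` is a hypothetical quality penalty measured when that
    layer alone is quantized to NVFP4. Layers are demoted from least to most
    sensitive until the summed penalty would exceed `budget`.
    """
    policy = {name: "BF16" for name in sensitivity}
    spent = 0.0
    for name in sorted(sensitivity, key=sensitivity.get):
        if name in keep_high:
            continue  # pinned to high precision regardless of score
        if spent + sensitivity[name] > budget:
            break     # every remaining layer is at least as sensitive
        policy[name] = "NVFP4"
        spent += sensitivity[name]
    return policy


# Hypothetical scores, chosen only to mirror the qualitative description above.
scores = {
    "attn.q_proj": 0.01,
    "attn.k_proj": 0.01,
    "ffn.temporal": 0.02,
    "patch_embed": 0.30,
    "out_proj": 0.25,
}
policy = search_policy(scores, budget=0.05)
```

With these made-up numbers the query/key projections and the temporal FFN are demoted, while patch embedding and output projection stay in BF16. A real search would measure sensitivity on validation data and would likely account for interactions between layers.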
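The reported frame rates translate into per-frame time budgets with simple arithmetic. The sketch below assumes, purely for illustration, that the DiT and the remaining stages (such as encoding and decoding) run one after another for each frame; the source does not state how the stages are scheduled, so the "remaining" figure is an upper-level estimate rather than a measured number.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


end_to_end_ms = frame_budget_ms(24)      # about 41.7 ms per frame
dit_ms = frame_budget_ms(58)             # about 17.2 ms per frame
remaining_ms = end_to_end_ms - dit_ms    # about 24.4 ms, assuming serial stages
```

Under that serial assumption, the DiT accounts for a bit over 40% of the per-frame budget, which suggests why the non-DiT stages also matter for end-to-end throughput.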

Data Pipeline for High-Quality Streaming Training
Training a real-time video editor requires large-scale, high-quality data. SANA-Streaming builds a pipeline that constructs both short-video editing pairs and long-video editing instructions.
For short clips, a taxonomy-guided process generates diverse edit instructions, and an image editor modifies the first frame as a visual anchor. A controllable video generator then produces the edited video, conditioned on the source, the edited first frame, and an extracted pose sequence to preserve motion. A vision-language model (VLM) verifies each sample for instruction alignment, consistency, and visual quality.
For long videos, a VLM generates forward and backward editing prompts from anchor frames of source videos. These serve the streaming long training and cycle-reverse regularization without needing paired edited videos. This pipeline ensures motion-preserving, instruction-following edits that form the foundation for the model's fidelity and streaming stability.
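The verification step can be pictured as a filter over scored samples. The sketch below is an assumption about how such a gate might look: the Scores fields follow the three criteria named above, but the 0-to-1 scale, the threshold of 0.7, and the names keep_sample and filter_samples are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical VLM ratings for one generated editing pair, each in [0, 1]."""
    alignment: float    # does the edit follow the instruction?
    consistency: float  # are motion and unedited regions preserved?
    quality: float      # is the result visually clean?


def keep_sample(scores, threshold=0.7):
    """Accept a sample only if its weakest criterion clears the threshold."""
    return min(scores.alignment, scores.consistency, scores.quality) >= threshold


def filter_samples(samples, threshold=0.7):
    """Return the ids of samples that pass verification."""
    return [sid for sid, s in samples.items() if keep_sample(s, threshold)]


candidates = {
    "clip_a": Scores(0.9, 0.8, 0.85),
    "clip_b": Scores(0.95, 0.4, 0.9),   # fails on consistency
    "clip_c": Scores(0.7, 0.7, 0.7),
}
accepted = filter_samples(candidates)
```

Gating on the minimum rather than the average means a sample cannot compensate for broken motion with high visual quality, which matches the goal of keeping only motion-preserving, instruction-following pairs.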

Experimental Results: Real-Time Speed and Editing Quality
SANA-Streaming was evaluated on the OpenVE-Bench pixel-aligned editing categories. The undistilled bidirectional model achieves a state-of-the-art average score of 2.62 with only 2B parameters, outperforming larger methods like VACE and OpenVE-Edit.
The step-distilled streaming version maintains competitive quality (2.42) while running at 24 end-to-end FPS on a single RTX 5090, over 100× faster than the previous state of the art. Ablations confirm that cycle-reverse regularization improves temporal stability, and the fused GDN kernel plus mixed-precision quantization together deliver a 1.59× DiT speedup. The causal VAE decoder, distilled from a bidirectional teacher, recovers sharp details and matches the teacher's fidelity.
These results validate the co-design approach, demonstrating that high-resolution real-time video editing is now achievable on consumer hardware.

Conclusion and Broader Impact
SANA-Streaming demonstrates that minute-length, high-resolution video editing can run in real time on a consumer GPU by uniting architectural innovation, training strategies, and hardware-aware system design. The hybrid transformer, cycle-reverse regularization, and efficient kernels collectively overcome latency, memory, and data-scarcity bottlenecks.
Limitations include sensitivity to ambiguous instructions and the persistent shortage of diverse long editing data. The system incorporates safeguards such as input screening, generation-time controls, and output monitoring to mitigate potential misuse, including deepfakes. This work sets a practical baseline for interactive video editing and highlights how co-design can accelerate generative AI toward real-world deployment.