Overview
Long video generation faces severe memory and compute bottlenecks during both training and inference. Existing work focuses on algorithmic improvements but largely neglects infrastructure optimizations. LongLive-2.0 introduces an end-to-end NVFP4 (4-bit floating-point) parallel infrastructure that co-designs training and inference for long video generation. The system achieves up to 2.15× training speedup and 1.84× inference speedup, enabling real-time generation at 45.7 FPS for a 5B-parameter model.
Key contributions include:
- Balanced SP: a sequence-parallel (SP) layout that pairs clean-history and noisy-target chunks on each GPU, balancing loss computation and enabling SP-aware VAE encoding.
- NVFP4 training and inference: full W4A4 quantization of weights, activations, and KV cache, with hardware acceleration on Blackwell GPUs.
- Clean training pipeline: directly fine-tunes a diffusion model into a long, multi-shot autoregressive (AR) model without complex ODE initialization or multi-stage distillation.
- Multi-shot attention sink: preserves global and shot-level identity during streaming generation with sliding-window attention.
Training Infrastructure
Balanced Sequence Parallelism
LongLive-2.0 trains a chunk-level AR diffusion model using teacher forcing. The efficient formulation concatenates the clean-history and noisy-target latent streams into one sequence, but naive SP on that sequence creates workload imbalance and replicated VAE encoding. Balanced SP instead assigns each GPU the clean and noisy latents from the same temporal chunk, so every rank holds both context and target tokens. This paired layout balances loss-bearing tokens across ranks and lets the teacher-forcing masks apply naturally after the Ulysses All-to-All communication. VAE encoding is also sharded: each rank encodes only its local chunk plus a left halo covering the temporal receptive field, so the per-rank encoding cost falls from the full sequence to roughly one rank's share of it plus the halo.
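The paired layout can be illustrated with a small sketch. It is not the paper's implementation: the function names, the contiguous split used for "naive" SP, and the frame-based halo helper are all assumptions made for illustration, and the chunk count is assumed to divide evenly across ranks.

```python
from typing import List, Tuple

Shard = List[Tuple[str, int]]  # (stream kind, temporal chunk index)


def naive_sp_layout(num_chunks: int, world_size: int) -> List[Shard]:
    """Assumed naive SP: split the concatenated [clean | noisy] sequence contiguously."""
    sequence = [("clean", i) for i in range(num_chunks)]
    sequence += [("noisy", i) for i in range(num_chunks)]
    per_rank = len(sequence) // world_size
    return [sequence[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def balanced_sp_layout(num_chunks: int, world_size: int) -> List[Shard]:
    """Each rank holds the clean and noisy latents of the same temporal chunks."""
    per_rank = num_chunks // world_size
    layout = []
    for r in range(world_size):
        chunks = range(r * per_rank, (r + 1) * per_rank)
        layout.append([("clean", i) for i in chunks] + [("noisy", i) for i in chunks])
    return layout


def loss_tokens_per_rank(layout: List[Shard]) -> List[int]:
    """Only noisy-target chunks contribute to the training loss."""
    return [sum(1 for kind, _ in shard if kind == "noisy") for shard in layout]


def vae_frame_range(rank: int, frames_per_rank: int, halo: int) -> Tuple[int, int]:
    """Half-open frame range a rank encodes: its local span plus a left halo."""
    start = rank * frames_per_rank
    return max(0, start - halo), start + frames_per_rank


if __name__ == "__main__":
    print(loss_tokens_per_rank(naive_sp_layout(8, 4)))     # uneven loss work
    print(loss_tokens_per_rank(balanced_sp_layout(8, 4)))  # even loss work
    print(vae_frame_range(rank=2, frames_per_rank=16, halo=4))
```

Under the assumed contiguous split, half of the ranks hold no loss-bearing tokens at all, while the paired layout gives every rank the same number.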
NVFP4 Training
NVFP4 represents each element as an E2M1 4-bit value with hierarchical scaling (block-wise FP8 scale and tensor-wise FP32 scale). The paper applies NVFP4 to all linear layers during AR training and DMD distillation, while keeping numerically sensitive operations (reductions, normalization, optimizer states) in higher precision. For gradient-sensitive paths, a Random Hadamard Transform (RHT) is applied before quantization. Combined with Balanced SP, NVFP4 training yields a 1.3×–2.1× speedup over BF16+SP baselines, with the largest gains at longer video lengths (64s).
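As a rough illustration of the block-scaled format, the sketch below quantizes one block of values onto the E2M1 grid. It is a simplification under stated assumptions: the block size of 16 is assumed, the block scale is kept as a Python float rather than rounded to FP8, the tensor-wise FP32 scale and the RHT step are omitted, and `quantize_block` and `dequantize_block` are hypothetical helper names.

```python
# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed number of elements sharing one block scale


def quantize_block(block):
    """Return (scale, codes), where codes are signed values on the E2M1 grid."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / 6.0  # map the largest magnitude to the top of the grid
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_MAGNITUDES, key=lambda g: abs(target - g))  # nearest grid point
        codes.append(mag if x >= 0 else -mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the block scale and grid codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    block = [0.1 * i for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    print(max(abs(a - b) for a, b in zip(block, restored)))
```

Because the widest gap on the grid is between 4 and 6, the rounding error of any element in this sketch is at most one block scale.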
Inference Infrastructure
W4A4 NVFP4 Inference
On Blackwell GPUs, the generator runs in W4A4 NVFP4 mode, replacing BF16 GEMMs with FP4 GEMMs for up to 4× theoretical throughput improvement. The backbone is trained with NVFP4-aware training (not post-training quantization), preserving quality. KV cache is also quantized to NVFP4 using micro-block scaling and adaptive scale selection (Four Over Six), achieving a 3.6× compression ratio with negligible overhead (<2%).
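The 3.6× figure is consistent with a simple bit count, assuming a 16-bit (BF16) baseline cache and one 8-bit scale shared by each 16-element block. Both the baseline and the block size are assumptions here, and the per-tensor scale is ignored as negligible:

```latex
\[
\text{bits per element} = 4 + \frac{8}{16} = 4.5,
\qquad
\text{compression ratio} = \frac{16}{4.5} \approx 3.56 \approx 3.6\times
\]
```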
Asynchronous Streaming Decoding
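VAE decoding is often a bottleneck. LongLive-2.0 dedicates one GPU to streaming VAE decoding and overlaps it with DiT denoising. Since denoising dominates (it takes longer than decoding), end-to-end latency drops from roughly the sum of the denoising and decoding times to approximately the denoising time alone. The GPU memory required for the VAE is reduced as well.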
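The effect of the overlap can be seen in a toy two-stage pipeline model. This is a sketch under simplifying assumptions (constant per-chunk times, no transfer cost between GPUs), with hypothetical function names and made-up timings rather than measurements from the paper.

```python
def sequential_latency(num_chunks: int, denoise_s: float, decode_s: float) -> float:
    """Denoise and decode each chunk back to back on the same device."""
    return num_chunks * (denoise_s + decode_s)


def overlapped_latency(num_chunks: int, denoise_s: float, decode_s: float) -> float:
    """Two-stage pipeline: a chunk is decoded while the next one is denoised."""
    denoise_done = 0.0
    decode_done = 0.0
    for _ in range(num_chunks):
        denoise_done += denoise_s
        # Decoding starts once the chunk is denoised and the decoder is free.
        decode_done = max(decode_done, denoise_done) + decode_s
    return decode_done


if __name__ == "__main__":
    print(sequential_latency(10, 2.0, 0.5))
    print(overlapped_latency(10, 2.0, 0.5))
```

When denoising is the slower stage, only the final chunk's decode remains exposed, so the overlapped total is close to the denoising time alone.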
Sequence Parallelism on Non-Blackwell GPUs
For H100/A100 GPUs without native NVFP4 support, SP inference with quantized KV cache reduces communication volume by ~3.6×, enabling real-time generation. Table 6 shows SP=2 with 4-bit KV cache cuts latency from 31.0s to 18.3s for 16s videos on H100.
Algorithm-Level Designs
Clean Training Pipeline
Unlike prior methods (Self-Forcing, Causal-Forcing) that require ODE initialization and multi-stage DMD, LongLive-2.0 directly fine-tunes a bidirectional diffusion model (Wan2.2-TI2V-5B) into a long, multi-shot AR model using long-video data. Few-step distillation is performed in one stage with only LoRA adapters trained, keeping the quantized backbone frozen. This yields a streamlined pipeline that supports long, interactive, multi-shot, and real-time generation.
Multi-Shot Attention Sink
To prevent appearance drift during streaming inference with sliding-window attention, the paper introduces two cooperating anchor sets:
- Global Sink: the first frames of the video, permanently fixed.
- Shot-Level Sink: the first frames of the current shot, re-bound at scene cuts.
This integrates seamlessly with chunk-wise prompting: a prompt switch triggers local re-binding of the shot-level sink without affecting global identity.
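The two anchor sets can be sketched as a rule for which past frames remain visible to the current frame. The sink sizes, the window length, and the name `visible_frames` are hypothetical, and the sketch works at frame granularity rather than on the actual KV cache.

```python
def visible_frames(current: int, shot_start: int, n_global: int,
                   n_shot: int, window: int) -> list:
    """Frame indices the current frame may attend to, sorted and de-duplicated."""
    # Global sink: the first frames of the whole video, never evicted.
    global_sink = range(0, min(n_global, current + 1))
    # Shot-level sink: the first frames of the current shot, re-bound at each cut.
    shot_sink = range(shot_start, min(shot_start + n_shot, current + 1))
    # Sliding window over the most recent frames, including the current one.
    local = range(max(0, current - window + 1), current + 1)
    return sorted(set(global_sink) | set(shot_sink) | set(local))


if __name__ == "__main__":
    # Within a shot that began at frame 60.
    print(visible_frames(current=100, shot_start=60, n_global=3, n_shot=3, window=8))
    # After a scene cut at frame 120, only the shot-level sink moves.
    print(visible_frames(current=130, shot_start=120, n_global=3, n_shot=3, window=8))
```

In both calls the global sink stays at the first frames of the video, while the shot-level anchors follow the most recent cut.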
Experimental Results
Training Efficiency
Table 1 shows end-to-end AR training iteration times. NVFP4 with Balanced SP is the fastest configuration, with speedups of 1.3×, 1.4×, and 2.1× over BF16 with standard SP (the "BF16 w/ SP" column) for 16s, 32s, and 64s videos, respectively.
| Input Length | BF16 w/o SP | BF16 w/ SP | BF16 Balanced SP | NVFP4 Balanced SP |
|---|---|---|---|---|
| 16s | 75.3 | 52.2 | 45.8 | 40.1 (1.3×) |
| 32s | 202.7 | 162.7 | 136.8 | 119.3 (1.4×) |
| 64s | OOM | 1372.9 | 1196.5 | 639.5 (2.1×) |
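
The quoted speedups can be recomputed from the iteration times in Table 1. The short script below does so and also separates the contribution of Balanced SP from that of NVFP4. The dictionary keys and function name are illustrative only.

```python
# Iteration times copied from Table 1 (the "BF16 w/o SP" column is omitted).
TABLE_1 = {
    "16s": {"bf16_sp": 52.2, "bf16_balanced": 45.8, "nvfp4_balanced": 40.1},
    "32s": {"bf16_sp": 162.7, "bf16_balanced": 136.8, "nvfp4_balanced": 119.3},
    "64s": {"bf16_sp": 1372.9, "bf16_balanced": 1196.5, "nvfp4_balanced": 639.5},
}


def speedups(table):
    """Ratios of iteration times, rounded to two decimals, per input length."""
    result = {}
    for length, row in table.items():
        result[length] = {
            "total": round(row["bf16_sp"] / row["nvfp4_balanced"], 2),
            "balanced_sp_only": round(row["bf16_sp"] / row["bf16_balanced"], 2),
            "nvfp4_only": round(row["bf16_balanced"] / row["nvfp4_balanced"], 2),
        }
    return result


if __name__ == "__main__":
    for length, ratios in speedups(TABLE_1).items():
        print(length, ratios)
```

The 64s row is where the NVFP4 step contributes most of the gain, which matches the statement that the largest speedups appear at longer video lengths.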
Inference Efficiency
Table 3 shows progressive optimizations on GB200. The 2-step NVFP4 model achieves 45.7 FPS with 19.4 GB peak memory for 64s videos.
| Inference Settings | FPS↑ | 16s E2E (s) | 16s Mem (GB) | 32s E2E (s) | 32s Mem (GB) | 64s E2E (s) | 64s Mem (GB) |
|---|---|---|---|---|---|---|---|
| BF16 | 24.8 | 26.6 | 36.4 | 53.2 | 36.4 | 112.9 | 36.4 |
| NVFP4 | 32.0 | 22.9 | 29.7 | 46.6 | 29.7 | 96.0 | 29.7 |
| + NVFP4 KV Cache | 29.7 | 23.8 | 19.4 | 48.9 | 19.4 | 99.5 | 19.4 |
| + Async Decoding | 29.7 | 15.9 | 19.4 | 29.1 | 19.4 | 57.6 | 19.4 |
| 3 Steps | 35.2 | 12.7 | 19.4 | 23.2 | 19.4 | 46.0 | 19.4 |
| 2 Steps | 45.7 | 11.2 | 19.4 | 19.2 | 19.4 | 36.3 | 19.4 |
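
The headline inference numbers follow directly from the rows of Table 3. As a quick check, the ratios below use only values from the table:

```latex
\[
\frac{\text{FPS}_{\text{2 steps}}}{\text{FPS}_{\text{BF16}}} = \frac{45.7}{24.8} \approx 1.84,
\qquad
\frac{\text{E2E}_{64\text{s}}^{\text{NVFP4 KV}}}{\text{E2E}_{64\text{s}}^{\text{async}}} = \frac{99.5}{57.6} \approx 1.73,
\qquad
\frac{\text{E2E}_{64\text{s}}^{\text{BF16}}}{\text{E2E}_{64\text{s}}^{\text{2 steps}}} = \frac{112.9}{36.3} \approx 3.11
\]
```

The first ratio matches the 1.84× inference speedup quoted in the overview. The second isolates the end-to-end gain from asynchronous decoding at 64s, where FPS and peak memory are unchanged.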
Benchmark Performance
On VBench (short video), LongLive-2.0-5B achieves a Total score of 85.06 at 1280×720 resolution, outperforming all baselines. On VBench-Long (60s video), it achieves the best average rank (3.67), with the highest subject consistency (97.48) and background consistency (97.00).
Conclusion
LongLive-2.0 demonstrates that algorithm–infrastructure co-design can dramatically improve the efficiency of long video generation. By introducing Balanced SP, NVFP4 quantization across training and inference, and a clean training pipeline, the system achieves state-of-the-art throughput and memory efficiency while maintaining high generation quality. The work is the first end-to-end NVFP4 system tailored for long video generation, and its principles can inform future low-precision infrastructure for generative models.
Limitations: NVFP4 acceleration is hardware-dependent (Blackwell GPUs). On non-Blackwell platforms, SP inference with quantized KV cache provides an alternative.
Broader Impacts: The system reduces computational costs and resource barriers, sharing ethical considerations with existing video generation models.



