LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

End-to-end training and inference system using NVFP4 quantization, Balanced SP, and multi-shot attention sink for real-time, long, interactive video generation.

LongLive-2.0 presents the first end-to-end NVFP4 system for long video generation. It introduces Balanced Sequence Parallelism (SP) and NVFP4 quantization to accelerate training and inference. On Blackwell GPUs, W4A4 inference and quantized KV cache reduce memory and boost throughput. A clean training pipeline directly fine-tunes diffusion models into autoregressive models with standalone LoRA for real-time generation. Multi-shot attention sink enables stable streaming. Experiments show up to 2.15× training speedup and 1.84× inference speedup, achieving 45.7 FPS at 5B parameters.

Overview

Long video generation faces severe memory and compute bottlenecks during both training and inference. Existing works focus on algorithmic improvements but largely neglect infrastructure optimizations. LongLive-2.0 introduces an end-to-end NVFP4 (4-bit floating-point) parallel infrastructure that co-designs training and inference for long video generation. The system achieves up to 2.15× training speedup and 1.84× inference speedup, enabling real-time generation at 45.7 FPS for a 5B-parameter model.

Key contributions include:

  • Balanced SP: a sequence-parallel (SP) layout that pairs clean-history and noisy-target chunks on each GPU, balancing loss computation and enabling SP-aware VAE encoding.
  • NVFP4 training and inference: 4-bit quantization of weights and activations (W4A4) and of the KV cache, with hardware acceleration on Blackwell GPUs.
  • Clean training pipeline: directly fine-tunes a diffusion model into a long, multi-shot autoregressive (AR) model without complex ODE initialization or multi-stage distillation.
  • Multi-shot attention sink: preserves global and shot-level identity during streaming generation with sliding-window attention.

Training Infrastructure

Balanced Sequence Parallelism

LongLive-2.0 trains a chunk-level AR diffusion model using teacher forcing. The efficient formulation concatenates clean-history and noisy-target latent streams into one sequence, but naive SP creates workload imbalance and replicated VAE encoding. Balanced SP assigns each GPU the clean and noisy latents from the same temporal chunk, so every rank holds both context and target tokens. This paired layout balances loss-bearing tokens and allows natural teacher-forcing masks after Ulysses All-to-All communication. VAE encoding is also sharded: each rank encodes only its local chunk plus a left halo covering the temporal receptive field, reducing per-rank cost from $O(F)$ to $O(F/P + h)$ (for $F$ frames, $P$ ranks, and a halo of $h$ frames).
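The imbalance and the paired layout can be illustrated with a small sketch. This is not the paper's implementation: the function names, the contiguous-shard baseline, and the frame-level halo arithmetic are all assumptions made for illustration, and chunk counts are assumed to divide evenly across ranks.

```python
from typing import Dict, List, Tuple

Layout = List[Dict[str, List[int]]]


def naive_sp_layout(num_chunks: int, world_size: int) -> Layout:
    """Shard the concatenated [clean..., noisy...] sequence contiguously."""
    sequence = [("clean", c) for c in range(num_chunks)]
    sequence += [("noisy", c) for c in range(num_chunks)]
    per_rank = len(sequence) // world_size
    layout = []
    for rank in range(world_size):
        shard = sequence[rank * per_rank:(rank + 1) * per_rank]
        layout.append({
            "clean": [c for kind, c in shard if kind == "clean"],
            "noisy": [c for kind, c in shard if kind == "noisy"],
        })
    return layout


def balanced_sp_layout(num_chunks: int, world_size: int) -> Layout:
    """Give each rank the clean and noisy copies of the same temporal chunks."""
    per_rank = num_chunks // world_size
    layout = []
    for rank in range(world_size):
        chunks = list(range(rank * per_rank, (rank + 1) * per_rank))
        layout.append({"clean": chunks, "noisy": list(chunks)})
    return layout


def vae_frame_range(rank: int, total_frames: int, world_size: int, halo: int) -> Tuple[int, int]:
    """Half-open frame range a rank encodes: its local slice plus a left halo."""
    per_rank = total_frames // world_size
    start = rank * per_rank
    return max(0, start - halo), start + per_rank


# Only noisy (target) chunks carry loss, so count them per rank.
naive_loss_chunks = [len(r["noisy"]) for r in naive_sp_layout(8, 4)]
balanced_loss_chunks = [len(r["noisy"]) for r in balanced_sp_layout(8, 4)]
```

With 8 chunks on 4 ranks, the contiguous split leaves two ranks with no loss-bearing chunks, while the paired layout gives every rank two. The halo helper shows the F/P + h cost: rank 2 of 4 over 64 frames with a 3-frame halo encodes frames 29 to 47.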

NVFP4 Training

NVFP4 represents each element as an E2M1 4-bit value with hierarchical scaling (block-wise FP8 scale and tensor-wise FP32 scale). The paper applies NVFP4 to all linear layers during AR training and DMD distillation, while keeping numerically sensitive operations (reductions, normalization, optimizer states) in higher precision. For gradient-sensitive paths, a Random Hadamard Transform (RHT) is applied before quantization. Combined with Balanced SP, NVFP4 training yields a 1.3×–2.1× speedup over BF16+SP baselines, with the largest gains at longer video lengths (64s).
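A rough sketch of block-scaled E2M1 quantization follows. It is a simplification under stated assumptions: the 16-element block size is taken from NVIDIA's published NVFP4 format rather than from this article, the block scale is kept as an ordinary Python float instead of being rounded to FP8, and the tensor-wise FP32 scale and the Random Hadamard Transform are omitted. The function names are hypothetical.

```python
from typing import List, Tuple

# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed micro-block size


def quantize_block(block: List[float]) -> Tuple[float, List[float]]:
    """Return (scale, codes) with codes drawn from the signed E2M1 grid."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(magnitude if x >= 0 else -magnitude)
    return scale, codes


def fake_quantize(values: List[float]) -> List[float]:
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out: List[float] = []
    for i in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[i:i + BLOCK_SIZE])
        out.extend(scale * c for c in codes)
    return out


exact = fake_quantize([0.0, 0.5, -1.0, 1.5, 2.0, -3.0, 4.0, 6.0])
rounded = fake_quantize([12.0, 1.4])
```

Values already on the scaled grid survive unchanged, while `1.4` in a block whose maximum is `12.0` is rounded to `1.0`. That coarse grid is one reason reductions, normalization, and optimizer states are kept in higher precision.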

Inference Infrastructure

W4A4 NVFP4 Inference

On Blackwell GPUs, the generator runs in W4A4 NVFP4 mode, replacing BF16 GEMMs with FP4 GEMMs for up to 4× theoretical throughput improvement. The backbone is trained with NVFP4-aware training (not post-training quantization), which preserves quality. The KV cache is also quantized to NVFP4 using micro-block scaling and adaptive scale selection (Four Over Six), achieving a 3.6× compression ratio with negligible overhead (<2%).
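One plausible way to arrive at a figure close to the reported 3.6× is simple bit accounting. The sketch below assumes a BF16 baseline, 16-element blocks, and one 8-bit scale per block, and it ignores the per-tensor FP32 scale. These parameters are assumptions, not values stated in the article.

```python
def nvfp4_bits_per_element(block_size: int = 16, scale_bits: int = 8, value_bits: int = 4) -> float:
    """Average storage per element: the 4-bit value plus its share of the block scale."""
    return value_bits + scale_bits / block_size


def compression_ratio(baseline_bits: int = 16) -> float:
    """Ratio of baseline storage to block-scaled 4-bit storage."""
    return baseline_bits / nvfp4_bits_per_element()


ratio = compression_ratio()  # 16 / 4.5
```

Under these assumptions each element costs 4.5 bits, so the ratio is 16 / 4.5, or roughly 3.56.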

Asynchronous Streaming Decoding

VAE decoding is often a bottleneck. LongLive-2.0 dedicates one GPU to streaming VAE decoding and overlaps it with DiT denoising. Since denoising dominates ($t_{\text{DiT}} \geq t_{\text{VAE}}$), end-to-end latency reduces from $C(t_{\text{DiT}} + t_{\text{VAE}})$ to approximately $C \cdot t_{\text{DiT}} + t_{\text{VAE}}$, and GPU memory for VAE drops to $\mathcal{O}(T_c)$.
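The latency expression can be checked with a toy two-stage pipeline model. The sketch reads $C$ as the number of chunks, which is an assumption, and treats per-chunk times as constant. The function names are hypothetical.

```python
def sequential_latency(num_chunks: int, t_dit: float, t_vae: float) -> float:
    """Denoise and decode each chunk back to back on the same device."""
    return num_chunks * (t_dit + t_vae)


def overlapped_latency(num_chunks: int, t_dit: float, t_vae: float) -> float:
    """Decode chunk i on a separate device while chunk i+1 is being denoised."""
    dit_done = 0.0
    vae_done = 0.0
    for _ in range(num_chunks):
        dit_done += t_dit                           # denoiser never waits
        vae_done = max(vae_done, dit_done) + t_vae  # decoder waits for its input
    return vae_done


before = sequential_latency(10, 1.0, 0.5)
after = overlapped_latency(10, 1.0, 0.5)
```

With 10 chunks, 1.0 s of denoising, and 0.5 s of decoding per chunk, the sequential schedule takes 15.0 s and the overlapped one 10.5 s, matching $C \cdot t_{\text{DiT}} + t_{\text{VAE}}$. Only the final chunk's decode is left exposed.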

Sequence Parallelism on Non-Blackwell GPUs

For H100/A100 GPUs without native NVFP4 support, SP inference with quantized KV cache reduces communication volume by ~3.6×, enabling real-time generation. Table 6 shows SP=2 with 4-bit KV cache cuts latency from 31.0s to 18.3s for 16s videos on H100.

Algorithm-Level Designs

Clean Training Pipeline

Unlike prior methods (Self-Forcing, Causal-Forcing) that require ODE initialization and multi-stage DMD, LongLive-2.0 directly fine-tunes a bidirectional diffusion model (Wan2.2-TI2V-5B) into a long, multi-shot AR model using long-video data. Few-step distillation is performed in one stage with only LoRA adapters trained, keeping the quantized backbone frozen. This yields a streamlined pipeline that supports long, interactive, multi-shot, and real-time generation.
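The "frozen backbone plus trainable adapter" arrangement can be sketched with the standard LoRA formulation, $y = Wx + \frac{\alpha}{r} BAx$. This is a generic illustration in plain Python, not the paper's code. The helper names and the `alpha` default are assumptions, and the real backbone weights would be NVFP4-quantized rather than ordinary floats.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w_frozen: Matrix, lora_a: Matrix, lora_b: Matrix,
                 x: List[float], alpha: float = 1.0) -> List[float]:
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    rank = len(lora_a)  # A has shape (r, in_features)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]                      # rank-1 down-projection
y_init = lora_forward(w, a, [[0.0], [0.0]], [2.0, 3.0])   # B starts at zero
y_tuned = lora_forward(w, a, [[1.0], [0.0]], [2.0, 3.0])  # after a toy update
```

With `B` at zero the adapter leaves the frozen output unchanged, so fine-tuning starts from the base model's behavior and the adapter can be stored or swapped separately from the backbone.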

Multi-Shot Attention Sink

To prevent appearance drift during streaming inference with sliding-window attention, the paper introduces two cooperating anchor sets:

  • Global Sink ($\mathcal{A}_g$): first $S_g$ frames of the video, permanently fixed.
  • Shot-Level Sink ($\mathcal{A}_s$): first $S_s$ frames of the current shot, re-bound at scene cuts.

This integrates seamlessly with chunk-wise prompting: a prompt switch triggers local re-binding of $\mathcal{A}_s$ without affecting global identity.
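The two anchor sets can be pictured as an index-selection rule on top of a sliding window. The sketch below works at frame granularity with a hypothetical `window` parameter. The actual model operates on latent chunks, and its window size is not given here, so treat every name and value as an assumption.

```python
from typing import List


def visible_frames(current: int, shot_start: int, global_sink: int,
                   shot_sink: int, window: int) -> List[int]:
    """Frames that frame `current` may attend to under causal attention."""
    # Global sink: the first S_g frames of the whole video, never evicted.
    global_anchor = set(range(0, min(global_sink, current + 1)))
    # Shot-level sink: the first S_s frames of the current shot.
    shot_anchor = set(range(shot_start, min(shot_start + shot_sink, current + 1)))
    # Sliding window over the most recent frames, including `current`.
    recent = set(range(max(0, current - window + 1), current + 1))
    return sorted(global_anchor | shot_anchor | recent)


mid_shot = visible_frames(current=100, shot_start=60, global_sink=2, shot_sink=2, window=4)
after_cut = visible_frames(current=101, shot_start=101, global_sink=2, shot_sink=2, window=4)
```

In the middle of a shot that began at frame 60, frame 100 sees `[0, 1, 60, 61, 97, 98, 99, 100]`. After a cut at frame 101 the shot-level anchor re-binds to the new shot while frames 0 and 1 stay visible, giving `[0, 1, 98, 99, 100, 101]`.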

Experimental Results

Training Efficiency

Table 1 shows end-to-end AR training iteration times. NVFP4 + Balanced SP is the fastest configuration, with speedups of 1.3×, 1.4×, and 2.1× over BF16+SP for 16s, 32s, and 64s videos respectively.

| Input Length | BF16 w/o SP | BF16 w/ SP | BF16 Balanced SP | NVFP4 Balanced SP |
|---|---|---|---|---|
| 16s | 75.3 | 52.2 | 45.8 | 40.1 (1.3×) |
| 32s | 202.7 | 162.7 | 136.8 | 119.3 (1.4×) |
| 64s | OOM | 1372.9 | 1196.5 | 639.5 (2.1×) |
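The quoted speedups follow directly from the iteration times. A short check, using only the Table 1 values for BF16 with SP and NVFP4 with Balanced SP (the dictionary keys are illustrative names):

```python
# Iteration times copied from Table 1 (units as reported there).
iteration_time = {
    "16s": {"bf16_sp": 52.2, "nvfp4_balanced_sp": 40.1},
    "32s": {"bf16_sp": 162.7, "nvfp4_balanced_sp": 119.3},
    "64s": {"bf16_sp": 1372.9, "nvfp4_balanced_sp": 639.5},
}

speedups = {
    length: round(t["bf16_sp"] / t["nvfp4_balanced_sp"], 2)
    for length, t in iteration_time.items()
}
```

This gives 1.3, 1.36, and 2.15. The last value corresponds to the headline "up to 2.15×" training speedup, which the table rounds to 2.1×.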

Inference Efficiency

Table 3 shows progressive optimizations on GB200. The 2-step NVFP4 model achieves 45.7 FPS with 19.4 GB peak memory for 64s videos.

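| Inference Settings | FPS↑ | 16s E2E (s) | 16s Mem (GB) | 32s E2E (s) | 32s Mem (GB) | 64s E2E (s) | 64s Mem (GB) |
|---|---|---|---|---|---|---|---|
| BF16 | 24.8 | 26.6 | 36.4 | 53.2 | 36.4 | 112.9 | 36.4 |
| NVFP4 | 32.0 | 22.9 | 29.7 | 46.6 | 29.7 | 96.0 | 29.7 |
| + NVFP4 KV Cache | 29.7 | 23.8 | 19.4 | 48.9 | 19.4 | 99.5 | 19.4 |
| + Async Decoding | 29.7 | 15.9 | 19.4 | 29.1 | 19.4 | 57.6 | 19.4 |
| 3 Steps | 35.2 | 12.7 | 19.4 | 23.2 | 19.4 | 46.0 | 19.4 |
| 2 Steps | 45.7 | 11.2 | 19.4 | 19.2 | 19.4 | 36.3 | 19.4 |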
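To see where the gains come from, the 64s end-to-end times in Table 3 can be compared row by row. The sketch only re-derives ratios from the reported numbers, and the variable names are illustrative.

```python
# (setting, FPS, 64s end-to-end seconds, peak memory in GB) from Table 3.
rows = [
    ("BF16", 24.8, 112.9, 36.4),
    ("NVFP4", 32.0, 96.0, 29.7),
    ("+ NVFP4 KV Cache", 29.7, 99.5, 19.4),
    ("+ Async Decoding", 29.7, 57.6, 19.4),
    ("3 Steps", 35.2, 46.0, 19.4),
    ("2 Steps", 45.7, 36.3, 19.4),
]

# Latency gain of each row relative to the row above it.
step_gain = {
    cur[0]: round(prev[2] / cur[2], 2)
    for prev, cur in zip(rows, rows[1:])
}

fps_gain = round(rows[-1][1] / rows[0][1], 2)
e2e_gain = round(rows[0][2] / rows[-1][2], 2)
memory_gain = round(rows[0][3] / rows[-1][3], 2)
```

The FPS ratio of 1.84 matches the headline inference speedup. In these numbers the KV cache row slightly increases latency (a ratio just below 1) while cutting peak memory, and asynchronous decoding contributes the largest single latency gain.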

Benchmark Performance

On VBench (short video), LongLive-2.0-5B achieves a Total score of 85.06 at 1280×720 resolution, outperforming all baselines. On VBench-Long (60s video), it achieves the best average rank (3.67), with the highest subject consistency (97.48) and background consistency (97.00).

Conclusion

LongLive-2.0 demonstrates that algorithm–infrastructure co-design can dramatically improve the efficiency of long video generation. By introducing Balanced SP, NVFP4 quantization across training and inference, and a clean training pipeline, the system achieves state-of-the-art throughput and memory efficiency while maintaining high generation quality. The work is the first end-to-end NVFP4 system tailored for long video generation, and its principles can inform future low-precision infrastructure for generative models.

Limitations: NVFP4 acceleration is hardware-dependent (Blackwell GPUs). On non-Blackwell platforms, SP inference with quantized KV cache provides an alternative.

Broader Impacts: The system reduces computational costs and resource barriers, sharing ethical considerations with existing video generation models.