
How NAVA Generates Synchronized 720p Audio-Video from a Single Prompt

Explore NAVA's Align-then-Fuse MMDiT architecture for native audio-visual alignment, enabling precise multi-timbre control and language-described camera movements.

NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized 720p video and audio from a single prompt. It utilizes an Align-then-Fuse MMDiT architecture to establish audio-video correspondence, offering features like multi-speaker speech with timbre control, fast generation, and language-described camera control.

TL;DR

NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt. A single model handles multi-speaker speech with reference-timbre control as well as image-conditioned continuations.

Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT architecture. A dedicated alignment space first establishes audio–video correspondence, then context is fused via cross-attention.

Highlights:

  • Fast 720p Generation: synchronized 720p audio-video in about 1 minute using 8-GPU Ulysses sequence parallelism.
  • Dual-Channel Audio: stereo audio (scene + speech) jointly denoised with video, with no post-hoc vocoder alignment.
  • Precise Multi-Timbre Control: reference WAVs bound to ~~... speech spans for per-speaker voice identity.
  • Language-Described Camera Control: shot composition, motion, and pacing specified directly in the prompt.
  • Multi-Resolution: landscape, portrait, and square aspect ratios from the same checkpoint.

Architecture

NAVA instantiates Native Audio-Visual Alignment as an Align-then-Fuse MMDiT stack built on the Wan2.2 backbone.

Hierarchical Alignment Layers: 10 double-stream blocks. Video and audio keep separate QKV projections and FFNs but share joint self-attention over the concatenated sequence [video_tokens; audio_tokens], plus dedicated cross-attention to text. This builds an alignment space in which audio-video correspondence is learned without interference from semantic context.

Unified Fusion Layers: 20 single-stream blocks. Video and audio share QKV projections and FFNs; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.

Key hyperparameters: dim=3072, ffn_dim=14336, 24 attention heads, 30 layers (10 alignment + 20 fusion). 3D RoPE encodes video positions (temporal + height + width), while 1D RoPE encodes audio positions; both are applied inside the joint-attention path.
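The layer counts and hyperparameters above can be collected into a small bookkeeping sketch. The class name `NavaLayout` and the helper functions are hypothetical and not taken from the NAVA codebase; only the numbers come from the text. The position helpers assume video tokens are laid out in raster order, giving one (time, row, col) triple per video token for 3D RoPE and one index per audio token for 1D RoPE.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class NavaLayout:
    """Hypothetical container for the hyperparameters quoted in the text."""
    dim: int = 3072
    ffn_dim: int = 14336
    num_heads: int = 24
    num_double_stream: int = 10  # Hierarchical Alignment Layers
    num_single_stream: int = 20  # Unified Fusion Layers

    @property
    def num_layers(self) -> int:
        return self.num_double_stream + self.num_single_stream

    @property
    def head_dim(self) -> int:
        return self.dim // self.num_heads

    def block_kinds(self) -> list:
        # Alignment blocks run first, fusion blocks after (Align-then-Fuse).
        return ["double"] * self.num_double_stream + ["single"] * self.num_single_stream


def video_positions(t: int, h: int, w: int) -> list:
    """One (time, row, col) index per video token (assumed raster order)."""
    return list(product(range(t), range(h), range(w)))


def audio_positions(n: int) -> list:
    """One scalar index per audio token."""
    return list(range(n))


layout = NavaLayout()
print(layout.num_layers, layout.head_dim)  # 30 128
```

The per-head width of 128 follows directly from 3072 / 24.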

Timbre-in-Context and Cross-Modal CFG

For multi-speaker scenes, Timbre-in-Context Conditioning injects reference-WAV speaker embeddings (ReDimNet, 192‑d) through the context pathway. These embeddings are bound to ~~... speech spans, enabling per-speaker timbre control.

At inference, 3D cross-modal CFG applies independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (video_align_guidance_scale, audio_align_guidance_scale). This keeps AV synchronization tight without sacrificing generation quality.
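The article does not spell out the guidance formula, so the following is only a sketch of one common way to compose two guidance directions, not NAVA's actual implementation. It assumes three denoiser predictions per stream: unconditional, text-conditioned with the other modality hidden, and fully conditioned. The function name and the `text_scale` and `align_scale` arguments are hypothetical; `align_scale` stands in for the role that `video_align_guidance_scale` or `audio_align_guidance_scale` would play for the corresponding stream.

```python
def compose_guidance(uncond, text_only, full, text_scale, align_scale):
    """Combine a text direction and a cross-modal alignment direction.

    uncond    : prediction with no text and no cross-modal context
    text_only : prediction with text, other modality hidden (assumed setup)
    full      : prediction with text and the other modality visible
    """
    return [
        u + text_scale * (t - u) + align_scale * (f - t)
        for u, t, f in zip(uncond, text_only, full)
    ]


# With both scales at 1.0 the result collapses to the fully conditioned prediction.
baseline = compose_guidance([0.0, 1.0], [1.0, 2.0], [1.5, 2.0], 1.0, 1.0)

# Larger scales push further along each direction independently.
guided = compose_guidance([0.0, 1.0], [1.0, 2.0], [1.5, 2.0], 5.0, 2.0)
```

Keeping the two scales separate is what lets synchronization strength be tuned without also changing how strongly the text prompt is followed.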

What sets NAVA apart from existing open-source AV models:

| Design axis | Typical baselines | NAVA |
| --- | --- | --- |
| Stream layout | Dual-tower (post-hoc align) or fully unified tri-modal | Align-then-Fuse: alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | Timbre-in-Context via reference WAVs |
| Param budget | 10 B – 32 B | 6.3 B |

Evaluation — VerseBench and Speech Quality

NAVA achieves the best AV synchronization, video quality, and audio WER on VerseBench, with the smallest parameter budget among joint AV models.

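| Model | Params | Sync-C ↑ | Sync-D ↓ | Video Quality ↑ | WER ↓ |
| --- | --- | --- | --- | --- | --- |
| Ovi 1.1 | 10 B | 7.4839 | 7.9791 | 0.636 | 0.102 |
| MOVA | 32 B | 7.2888 | 7.808 | 0.603 | 0.126 |
| Davinci | 15 B | 7.1487 | 7.8158 | 0.600 | 0.151 |
| LTX 2.3 | 19 B | 7.2476 | 7.6902 | 0.576 | 0.106 |
| NAVA | 6.3 B | 7.7914 | 7.5655 | 0.659 | 0.099 |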
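The ranking claim can be checked directly against the reported numbers. The sketch below copies the table into a dictionary and picks the best model per metric, respecting the arrow direction of each column; the variable and function names are illustrative.

```python
# Columns as reported in the table: params (B), Sync-C, Sync-D, video quality, WER.
COLUMNS = ("params_b", "sync_c", "sync_d", "video_quality", "wer")
RESULTS = {
    "Ovi 1.1": (10.0, 7.4839, 7.9791, 0.636, 0.102),
    "MOVA": (32.0, 7.2888, 7.808, 0.603, 0.126),
    "Davinci": (15.0, 7.1487, 7.8158, 0.600, 0.151),
    "LTX 2.3": (19.0, 7.2476, 7.6902, 0.576, 0.106),
    "NAVA": (6.3, 7.7914, 7.5655, 0.659, 0.099),
}
# True where a larger value is better (the up arrows in the table header).
HIGHER_IS_BETTER = {
    "sync_c": True,
    "sync_d": False,
    "video_quality": True,
    "wer": False,
}


def best_model(metric: str) -> str:
    """Return the model with the best value for one metric column."""
    idx = COLUMNS.index(metric)
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(RESULTS, key=lambda name: RESULTS[name][idx])


def smallest_model() -> str:
    return min(RESULTS, key=lambda name: RESULTS[name][0])


for metric in HIGHER_IS_BETTER:
    print(metric, "->", best_model(metric))
print("smallest ->", smallest_model())
```

Every line printed names NAVA, which agrees with the statement that it leads on all four metrics with the smallest parameter count.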

On Seed-TTS-eval, NAVA delivers speech quality close to dedicated audio-only systems, with 5.81 WER and 62.4 speaker similarity — far ahead of other joint AV models like DreamID-Omni.

Quick Facts and Components

  • Architecture: Align-then-Fuse MMDiT (Wan2.2 backbone)
  • Parameters: 6.3 B
  • Resolution: 1280×704 (recommended) · 960×960 supported
  • Frames / FPS: 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s
  • Audio: 25 latent tokens/sec, ≤ 10 s
  • Sampling: Flow matching · UniPC scheduler · 50 default steps
  • Precision: bf16
  • Parallelism: Single-GPU or Ulysses sequence parallel (up to 8 GPUs)

Shipped components: WanAVModel backbone (6.3 B), Wan2.2 Video VAE (causal 3D ConvNet, 16×16×4 compression), LTX Audio VAE + Vocoder (128 latent channels, built-in waveform decoder), umt5-xxl Text Encoder, and ReDimNet speaker embedder.
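The resolution, frame-count, and compression figures above imply concrete latent sizes. The sketch below works through the arithmetic under two assumptions that the article does not state: that the quoted frame counts (37, 55–61) refer to latent frames, and that a causal video VAE with 4× temporal compression maps T latent frames to 4(T − 1) + 1 pixel frames. Under that reading, 37 latent frames correspond to 145 pixel frames, or about 6 s at 24 fps, which is consistent with the quoted duration. Any patchification the backbone applies on top of the VAE is not modelled, and the function names are hypothetical.

```python
def pixel_frames(latent_frames: int, temporal_stride: int = 4) -> int:
    """Pixel frames for a causal VAE: the first frame is kept, the rest are strided."""
    return (latent_frames - 1) * temporal_stride + 1


def clip_stats(latent_frames, width, height, fps=24,
               spatial_stride=16, audio_tokens_per_sec=25):
    """Derive duration, latent grid, and audio token count for one clip."""
    frames = pixel_frames(latent_frames)
    duration = frames / fps
    return {
        "pixel_frames": frames,
        "duration_s": round(duration, 2),
        # (time, height, width) of the video latent, before any patchification
        "latent_grid": (latent_frames, height // spatial_stride, width // spatial_stride),
        "audio_tokens": round(duration * audio_tokens_per_sec),
    }


stats = clip_stats(latent_frames=37, width=1280, height=704)
print(stats)
```

The same function gives roughly 9 s for 55 latent frames and 10 s for 61, matching the upper range listed in the quick facts.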

How to Use — Quick Start

After setup, run one of the provided scripts:

bash scripts/inference.sh          # General T2AV
bash scripts/inference_timbre.sh   # I2AV + timbre control

Outputs land under eval_results/.

First-time setup:

git clone https://github.com/ernie-research/NAVA && cd NAVA
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn --no-build-isolation
huggingface-cli download  --local-dir .

Custom Batches and Prompt Rewriting

Write a JSONL file with one prompt per line:

{"prompt": "一位男子在海边奔跑,镜头跟随。背景是海浪声和风声。"}
{"prompt": "两人对话~~Hello~~Hi there", "spk_wavs": ["spk1.wav", "spk2.wav"]}
{"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"}
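A prompt file like the one above can also be produced programmatically. The helper below is a minimal sketch: `write_prompts` is a hypothetical name, and the checks it performs (non-empty `prompt`, absolute `image_path`) are assumptions based on the examples rather than documented requirements. Passing `ensure_ascii=False` keeps Chinese prompts readable in the file.

```python
import json
import os


def write_prompts(records, path):
    """Write one JSON object per line; return the number of records written."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            if not rec.get("prompt"):
                raise ValueError("each record needs a non-empty 'prompt'")
            image = rec.get("image_path")
            # Assumption: the example uses an absolute path, so enforce that here.
            if image is not None and not os.path.isabs(image):
                raise ValueError(f"image_path should be absolute: {image}")
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)


records = [
    {"prompt": "一位男子在海边奔跑,镜头跟随。背景是海浪声和风声。"},
    {"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"},
]
count = write_prompts(records, "my_prompts.jsonl")
```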

Launch with torchrun (8 GPUs with Ulysses SP):

SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \
  --nnodes=1 --nproc_per_node=8 \
  inference_nava.py \
  --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
  --ckpt NAVA.ckpt --out_dir ./outputs \
  --data_file my_prompts.jsonl \
  --width 1280 --height 704 --frames 37 --fps 24 \
  --steps 50 --save_sample --gen_turn 1 --use_sp
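For scripted runs, the same invocation can be assembled in Python. This sketch only builds the argument list from the flags shown above and does not launch anything; `build_cmd` is a hypothetical helper, and the meaning of each flag is whatever `inference_nava.py` defines. Dropping `--use_sp` when only one process is requested is an assumption.

```python
import shlex

CONFIG = "configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml"


def build_cmd(data_file, out_dir="./outputs", gpus=8,
              width=1280, height=704, frames=37, fps=24, steps=50):
    """Return the torchrun argument list for one batch of prompts."""
    cmd = [
        "torchrun", "--nnodes=1", f"--nproc_per_node={gpus}",
        "inference_nava.py",
        "--config", CONFIG,
        "--ckpt", "NAVA.ckpt",
        "--out_dir", out_dir,
        "--data_file", data_file,
        "--width", str(width), "--height", str(height),
        "--frames", str(frames), "--fps", str(fps),
        "--steps", str(steps),
        "--save_sample", "--gen_turn", "1",
    ]
    if gpus > 1:
        # Assumption: sequence parallelism is only useful with several processes.
        cmd.append("--use_sp")
    return cmd


cmd = build_cmd("my_prompts.jsonl")
print(shlex.join(cmd))
```

To execute it, the list can be passed to `subprocess.run(cmd, env={**os.environ, "SETUPTOOLS_USE_DISTUTILS": "stdlib"}, check=True)`, which reproduces the environment variable set in the shell version.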

NAVA is trained on Chinese dense captions. Short or English prompts benefit from rewriting before inference. Three pathways are provided: a vLLM batch server (< 2 s/prompt), a local transformers script, and a Gradio "Rewrite" button — all using Qwen3-4B-Thinking-2507, with ~~... spans preserved verbatim.
