How NAVA Generates Synchronized 720p Audio-Video from a Single Prompt

Explore NAVA's Align-then-Fuse MMDiT architecture for native audio-visual alignment, enabling precise multi-timbre control and language-described camera movements.

June 9, 2026

#Content Generation #Open Source #TTS

NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized 720p video and audio from a single prompt. It uses an Align-then-Fuse MMDiT architecture to establish audio-video correspondence, and it supports multi-speaker speech with timbre control, fast generation, and language-described camera control.

TL;DR

NAVA is a 6.3 B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt. It handles multi-speaker speech with reference-timbre control and image-conditioned continuations, all within a single model.

Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT architecture. A dedicated alignment space first establishes audio–video correspondence, then context is fused via cross-attention.

Highlights:

720p 1-min Fast Generation — 720p synchronized audio-video in ~1 minute using 8‑GPU Ulysses sequence parallel.
Dual-Channel Audio — stereo audio (scene + speech) jointly denoised with video, no post-hoc vocoder alignment.
Precise Multi-Timbre Control — reference WAVs bound to ~~... speech spans for per-speaker voice identity.
Language-Described Camera Control — shot composition, motion, and pacing directly from the prompt.
Multi-Resolution — landscape/portrait/square aspect ratios from the same checkpoint.

Architecture

NAVA instantiates Native Audio-Visual Alignment as an Align-then-Fuse MMDiT stack built on the Wan2.2 backbone.

Hierarchical Alignment Layers — 10 double-stream blocks. Video and audio keep separate QKV projections and FFNs but share joint self-attention over concatenated [video_tokens; audio_tokens], plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference.

Unified Fusion Layers — 20 single-stream blocks. Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.

Key hyperparameters: dim=3072, ffn_dim=14336, 24 attention heads, 30 layers. 3D RoPE handles video (temporal + height + width), while 1D RoPE handles audio, applied jointly inside the joint-attention path.
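
The layer layout and positional scheme described above can be sketched in plain Python. This is only an illustration: the names `NavaLayout`, `video_positions`, and `audio_positions` are hypothetical and not taken from the NAVA codebase, and the token-grid sizes in the usage are toy values. The numbers in the dataclass are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NavaLayout:
    """Hyperparameters quoted in the text; the class itself is hypothetical."""
    dim: int = 3072
    ffn_dim: int = 14336
    num_heads: int = 24
    double_stream_blocks: int = 10   # Hierarchical Alignment Layers
    single_stream_blocks: int = 20   # Unified Fusion Layers

    @property
    def num_layers(self) -> int:
        return self.double_stream_blocks + self.single_stream_blocks

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the model width.
        assert self.dim % self.num_heads == 0
        return self.dim // self.num_heads


def video_positions(frames: int, height: int, width: int):
    """3D position ids (t, h, w), one per video token, in raster order."""
    return [(t, h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]


def audio_positions(num_tokens: int):
    """1D position ids, one per audio token."""
    return list(range(num_tokens))


layout = NavaLayout()
video = video_positions(frames=2, height=2, width=3)   # toy grid, 12 tokens
audio = audio_positions(num_tokens=5)                  # toy length
joint_len = len(video) + len(audio)  # length of [video_tokens; audio_tokens]

print(layout.num_layers, layout.head_dim, joint_len)   # 30 128 17
```

The 10 + 20 split accounts for the 30 layers, and 3072 / 24 gives a head width of 128. How the 3D and 1D position ids are mapped onto rotary frequencies inside the joint attention is not specified in the text and is left out of the sketch.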

Timbre-in-Context and Cross-Modal CFG

For multi-speaker scenes, Timbre-in-Context Conditioning injects reference-WAV speaker embeddings (ReDimNet, 192‑d) through the context pathway. These embeddings are bound to ~~... speech spans, enabling per-speaker timbre control.

At inference, 3D cross-modal CFG applies independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (video_align_guidance_scale, audio_align_guidance_scale). This keeps AV synchronization tight without sacrificing generation quality.
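
The text does not give the exact guidance formula, so the following is a minimal sketch of one common way to combine several guidance directions, not NAVA's confirmed implementation. It assumes three forward passes per modality: one with all conditioning dropped, one with text only, and one with text plus the cross-modal context. The function name `compose_guidance`, the scale name `video_text_guidance_scale`, and every number are hypothetical; only `video_align_guidance_scale` is a name from the text.

```python
def compose_guidance(uncond, text_cond, full_cond, text_scale, align_scale):
    """Combine three predictions for one modality (assumed formulation).

    uncond    : prediction with text and cross-modal context dropped
    text_cond : prediction with text, cross-modal context dropped
    full_cond : prediction with text and cross-modal context
    """
    return [
        u + text_scale * (t - u) + align_scale * (f - t)
        for u, t, f in zip(uncond, text_cond, full_cond)
    ]


# Toy per-element predictions for the video stream (made-up numbers).
uncond = [0.0, 1.0]
text_cond = [1.0, 1.0]
full_cond = [2.0, 3.0]

video_text_guidance_scale = 3.0    # hypothetical name and value
video_align_guidance_scale = 2.0   # name from the text, value made up

video_pred = compose_guidance(
    uncond, text_cond, full_cond,
    text_scale=video_text_guidance_scale,
    align_scale=video_align_guidance_scale,
)
print(video_pred)   # [5.0, 5.0]
```

With both scales set to 1.0 the result reduces to the fully conditioned prediction, which is a quick sanity check on the composition. The audio stream would call the same function with its own pair of scales, which is what makes the scales independent per modality.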

What sets NAVA apart from existing open-source AV models:

Design axis	Typical baselines	NAVA
Stream layout	Dual-tower (post-hoc align) or fully unified tri-modal	Align-then-Fuse — alignment space first, context fused after
Speech control	Caption-only, no per-speaker timbre	Timbre-in-Context via reference WAVs
Param budget	10 B – 32 B	6.3 B

Evaluation — VerseBench and Speech Quality

On VerseBench, NAVA posts the best score in every reported column: highest Sync-C, lowest Sync-D, highest video quality, and lowest WER. It does so with the smallest parameter budget of the joint AV models compared.

Model	Params	Sync-C ↑	Sync-D ↓	Video Quality ↑	WER ↓
Ovi 1.1	10 B	7.4839	7.9791	0.636	0.102
MOVA	32 B	7.2888	7.808	0.603	0.126
Davinci	15 B	7.1487	7.8158	0.600	0.151
LTX 2.3	19 B	7.2476	7.6902	0.576	0.106
NAVA	6.3 B	7.7914	7.5655	0.659	0.099

On Seed-TTS-eval, NAVA delivers speech quality close to dedicated audio-only systems, with 5.81 WER and 62.4 speaker similarity — far ahead of other joint AV models like DreamID-Omni.

Quick Facts and Components


Architecture	Align-then-Fuse MMDiT (Wan2.2 backbone)
Parameters	6.3 B
Resolution	1280×704 (recommended) · 960×960 supported
Frames / FPS	37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s (durations match latent frame counts under the video VAE's 4× temporal compression)
Audio	25 latent tokens/sec, ≤ 10 s
Sampling	Flow matching · UniPC scheduler · 50 default steps
Precision	bf16
Parallelism	Single-GPU or Ulysses sequence parallel (up to 8 GPUs)

Shipped components: WanAVModel backbone (6.3 B), Wan2.2 Video VAE (causal 3D ConvNet, 16×16×4 compression), LTX Audio VAE + Vocoder (128 latent channels, built-in waveform decoder), umt5-xxl Text Encoder, and ReDimNet speaker embedder.
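
The frame counts and durations in the table can be reconciled with a little arithmetic. The sketch below assumes that the frame counts are latent frames and that the causal video VAE decodes the first latent frame to one pixel frame and each later one to four. That reading is an inference from the 16×16×4 compression figure, not something the text states. The function names are hypothetical.

```python
def decoded_frames(latent_frames: int, temporal_stride: int = 4) -> int:
    """Pixel frames from a causal video VAE (assumed behaviour): the first
    latent frame yields one frame, each later one yields `temporal_stride`."""
    return (latent_frames - 1) * temporal_stride + 1


def duration_seconds(latent_frames: int, fps: int = 24) -> float:
    """Clip length in seconds at the given playback rate."""
    return decoded_frames(latent_frames) / fps


def latent_grid(width: int, height: int, spatial_stride: int = 16):
    """Latent width and height under 16x16 spatial compression."""
    assert width % spatial_stride == 0 and height % spatial_stride == 0
    return width // spatial_stride, height // spatial_stride


for n in (37, 55, 61):
    print(n, decoded_frames(n), round(duration_seconds(n), 2))
# 37 145 6.04
# 55 217 9.04
# 61 241 10.04

print(latent_grid(1280, 704))   # (80, 44)
print(latent_grid(960, 960))    # (60, 60)
print(25 * 10)                  # 250 audio latent tokens at the 10 s cap
```

Under this reading, 37, 55, and 61 frames give about 6 s, 9 s, and 10 s, which lines up with the table. Any further patchification inside the transformer would shrink the token grid again and is not modelled here.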

How to Use — Quick Start

Once the first-time setup below is done, run one of the provided scripts:

bash scripts/inference.sh          # General T2AV
bash scripts/inference_timbre.sh   # I2AV + timbre control

Outputs land under eval_results/.

First-time setup:

git clone https://github.com/ernie-research/NAVA && cd NAVA
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn --no-build-isolation
huggingface-cli download  --local-dir .

Custom Batches and Prompt Rewriting

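Write a JSONL file with one JSON object per line. Every object needs a prompt; spk_wavs optionally lists the reference WAVs for the prompt's speech spans, and image_path optionally supplies a conditioning image (a first frame in the sample). The sample prompts are in Chinese, the language NAVA is trained on. In English they read roughly: "A man runs along the seashore with the camera following him; the background is the sound of waves and wind", "Two people in conversation" (followed by the two speech spans), and "The camera follows the subject...":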

{"prompt": "一位男子在海边奔跑，镜头跟随。背景是海浪声和风声。"}
{"prompt": "两人对话~~Hello~~Hi there", "spk_wavs": ["spk1.wav", "spk2.wav"]}
{"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"}

Launch with torchrun (8 GPUs with Ulysses SP):

SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \
--nnodes=1 --nproc_per_node=8 \
inference_nava.py \
--config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
--ckpt NAVA.ckpt --out_dir ./outputs \
--data_file my_prompts.jsonl \
--width 1280 --height 704 --frames 37 --fps 24 \
--steps 50 --save_sample --gen_turn 1 --use_sp
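
To see what the 8-process launch with `--use_sp` implies, here is a small simulation of Ulysses-style sequence parallelism. It assumes the standard scheme, in which each rank holds a slice of the tokens outside attention and an all-to-all regroups the data so that each rank holds every token but only a slice of the heads inside attention. NAVA's own implementation may differ in detail. The function names and the sequence lengths are hypothetical; the 24 heads come from the architecture description.

```python
def ulysses_plan(seq_len: int, num_heads: int, world_size: int) -> dict:
    """Per-rank workload under Ulysses-style sequence parallelism."""
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by world_size")
    if seq_len % world_size != 0:
        raise ValueError("seq_len must be divisible by world_size (or padded)")
    return {
        "tokens_per_rank": seq_len // world_size,   # outside attention
        "heads_per_rank": num_heads // world_size,  # inside attention
    }


def shard_by_tokens(seq_len: int, num_heads: int, world_size: int):
    """Layout outside attention: each rank owns a token slice, all heads."""
    per_rank = seq_len // world_size
    return [
        [(tok, head)
         for tok in range(r * per_rank, (r + 1) * per_rank)
         for head in range(num_heads)]
        for r in range(world_size)
    ]


def all_to_all_to_heads(shards, num_heads: int):
    """Layout inside attention: each rank owns a head slice, all tokens."""
    world_size = len(shards)
    heads_per_rank = num_heads // world_size
    regrouped = [[] for _ in range(world_size)]
    for shard in shards:
        for tok, head in shard:
            regrouped[head // heads_per_rank].append((tok, head))
    return regrouped


plan = ulysses_plan(seq_len=4096, num_heads=24, world_size=8)
print(plan)   # {'tokens_per_rank': 512, 'heads_per_rank': 3}

token_shards = shard_by_tokens(seq_len=16, num_heads=24, world_size=8)
head_shards = all_to_all_to_heads(token_shards, num_heads=24)
print(sorted({head for _, head in head_shards[0]}))   # [0, 1, 2]
print(len({tok for tok, _ in head_shards[0]}))        # 16
```

With 24 heads, the divisibility requirement allows 1, 2, 3, 4, 6, or 8 processes within the stated 8-GPU limit, and at 8 processes each rank attends over the full sequence with 3 heads.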

NAVA is trained on Chinese dense captions. Short or English prompts benefit from rewriting before inference. Three pathways are provided: a vLLM batch server (< 2 s/prompt), a local transformers script, and a Gradio "Rewrite" button — all using Qwen3-4B-Thinking-2507, with ~~... spans preserved verbatim.
