TL;DR
NAVA is a 6.3 B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt. It handles multi-speaker speech with reference-timbre control and image-conditioned continuations, all within a single model.
Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT architecture. A dedicated alignment space first establishes audio–video correspondence, then context is fused via cross-attention.
Highlights:
- 720p 1-min Fast Generation: 720p synchronized audio-video in ~1 minute using Ulysses sequence parallelism across 8 GPUs.
- Dual-Channel Audio: stereo audio (scene + speech) jointly denoised with video, with no post-hoc vocoder alignment.
- Precise Multi-Timbre Control: reference WAVs bound to ~~... speech spans for per-speaker voice identity.
- Language-Described Camera Control: shot composition, motion, and pacing directly from the prompt.
- Multi-Resolution: landscape/portrait/square aspect ratios from the same checkpoint.
Architecture
NAVA instantiates Native Audio-Visual Alignment as an Align-then-Fuse MMDiT stack built on the Wan2.2 backbone.
Hierarchical Alignment Layers — 10 double-stream blocks. Video and audio keep separate QKV projections and FFNs but share joint self-attention over concatenated [video_tokens; audio_tokens], plus dedicated cross-attention to text.
This builds an alignment space where AV correspondence is learned without semantic context interference.
Unified Fusion Layers — 20 single-stream blocks. Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.
Key hyperparameters: dim=3072, ffn_dim=14336, 24 attention heads, 30 layers. 3D RoPE handles video (temporal + height + width), while 1D RoPE handles audio, applied jointly inside the joint-attention path.
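As a quick consistency check on these numbers, the layer layout can be written down as a small config object. This is an illustrative sketch only: the class, field, and method names (`NavaLayoutSketch`, `num_double_stream`, `block_plan`, and so on) are hypothetical and not taken from the NAVA codebase. Only the numeric values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NavaLayoutSketch:
    """Hypothetical container for the hyperparameters quoted above."""

    dim: int = 3072
    ffn_dim: int = 14336
    num_heads: int = 24
    num_double_stream: int = 10  # Hierarchical Alignment Layers
    num_single_stream: int = 20  # Unified Fusion Layers

    @property
    def num_layers(self) -> int:
        return self.num_double_stream + self.num_single_stream

    @property
    def head_dim(self) -> int:
        if self.dim % self.num_heads != 0:
            raise ValueError("dim must be divisible by num_heads")
        return self.dim // self.num_heads

    def block_plan(self) -> list:
        """Block types in execution order: alignment blocks first, fusion blocks after."""
        return (["double_stream"] * self.num_double_stream
                + ["single_stream"] * self.num_single_stream)


layout = NavaLayoutSketch()
print(layout.num_layers, layout.head_dim)  # 30 128
```

The 10 + 20 split adds up to the stated 30 layers, and 3072 / 24 gives a per-head width of 128.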
Timbre-in-Context and Cross-Modal CFG
For multi-speaker scenes, Timbre-in-Context Conditioning injects reference-WAV speaker embeddings (ReDimNet, 192‑d) through the context pathway.
These embeddings are bound to ~~... speech spans, enabling per-speaker timbre control.
At inference, 3D cross-modal CFG applies independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (video_align_guidance_scale, audio_align_guidance_scale).
This keeps AV synchronization tight without sacrificing generation quality.
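The text does not spell out the guidance formula, so the following is only one plausible way to combine a text-guidance term with a separate alignment-guidance term for a single modality. It assumes three forward passes per step: unconditional, text-conditioned with the other modality withheld, and fully conditioned. The function name, argument names, the decomposition itself, and the toy scale values are assumptions, not NAVA's confirmed implementation. Only the idea of separate scales per modality and alignment direction (`video_align_guidance_scale`, `audio_align_guidance_scale`) comes from the source.

```python
def guided_prediction(uncond, text_only, full, text_scale, align_scale):
    """Combine three model outputs (flat lists of floats) for one modality.

    uncond    : prediction with no text and no cross-modal context
    text_only : prediction with text, but without the other modality
    full      : prediction with text and the other modality
    """
    return [
        u + text_scale * (t - u) + align_scale * (f - t)
        for u, t, f in zip(uncond, text_only, full)
    ]


# Toy numbers: each modality gets its own pair of scales.
video = guided_prediction([0.0, 0.0], [1.0, 2.0], [1.5, 2.0],
                          text_scale=5.0, align_scale=2.0)
audio = guided_prediction([0.0], [1.0], [1.5],
                          text_scale=3.0, align_scale=1.0)
print(video, audio)  # [6.0, 10.0] [3.5]
```

With both scales set to 1.0 the expression collapses to the fully conditioned prediction, which is a useful sanity check for any formulation of this kind.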
What sets NAVA apart from existing open-source AV models:
| Design axis | Typical baselines | NAVA |
|---|---|---|
| Stream layout | Dual-tower (post-hoc align) or fully unified tri-modal | Align-then-Fuse — alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | Timbre-in-Context via reference WAVs |
| Param budget | 10 B – 32 B | 6.3 B |
Evaluation — VerseBench and Speech Quality
NAVA achieves the best AV synchronization, video quality, and audio WER on VerseBench, with the smallest parameter budget among joint AV models.
| Model | Params | Sync-C ↑ | Sync-D ↓ | Video Quality ↑ | WER ↓ |
|---|---|---|---|---|---|
| Ovi 1.1 | 10 B | 7.4839 | 7.9791 | 0.636 | 0.102 |
| MOVA | 32 B | 7.2888 | 7.808 | 0.603 | 0.126 |
| Davinci | 15 B | 7.1487 | 7.8158 | 0.600 | 0.151 |
| LTX 2.3 | 19 B | 7.2476 | 7.6902 | 0.576 | 0.106 |
| NAVA | 6.3 B | 7.7914 | 7.5655 | 0.659 | 0.099 |
On Seed-TTS-eval, NAVA delivers speech quality close to dedicated audio-only systems, with 5.81 WER and 62.4 speaker similarity — far ahead of other joint AV models like DreamID-Omni.
Quick Facts and Components
| Item | Value |
|---|---|
| Architecture | Align-then-Fuse MMDiT (Wan2.2 backbone) |
| Parameters | 6.3 B |
| Resolution | 1280×704 (recommended) · 960×960 supported |
| Frames / FPS | 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s |
| Audio | 25 latent tokens/sec, ≤ 10 s |
| Sampling | Flow matching · UniPC scheduler · 50 default steps |
| Precision | bf16 |
| Parallelism | Single-GPU or Ulysses sequence parallel (up to 8 GPUs) |
Shipped components: WanAVModel backbone (6.3 B), Wan2.2 Video VAE (causal 3D ConvNet, 16×16×4 compression), LTX Audio VAE + Vocoder (128 latent channels, built-in waveform decoder), umt5-xxl Text Encoder, and ReDimNet speaker embedder.
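The frame counts in the table match the quoted durations only if `frames` counts latent frames rather than decoded frames: 37 decoded frames at 24 fps would last about 1.5 s. The sketch below assumes the usual causal-VAE relation for 4× temporal compression, `decoded = (latent - 1) * 4 + 1`. That relation and the helper names are assumptions; the source does not state them.

```python
TEMPORAL_STRIDE = 4   # temporal factor of the 16×16×4 VAE compression
SPATIAL_STRIDE = 16   # spatial factor of the same VAE
AUDIO_TOKENS_PER_SEC = 25


def decoded_frames(latent_frames: int) -> int:
    """Decoded frame count under the assumed causal-VAE relation."""
    return (latent_frames - 1) * TEMPORAL_STRIDE + 1


def clip_seconds(latent_frames: int, fps: int = 24) -> float:
    return decoded_frames(latent_frames) / fps


def latent_grid(width: int, height: int) -> tuple:
    """Latent width and height after 16x spatial compression."""
    return width // SPATIAL_STRIDE, height // SPATIAL_STRIDE


def audio_tokens(seconds: float) -> int:
    return round(seconds * AUDIO_TOKENS_PER_SEC)


for n in (37, 55, 61):
    print(n, decoded_frames(n), round(clip_seconds(n), 2))
# 37 145 6.04
# 55 217 9.04
# 61 241 10.04
print(latent_grid(1280, 704), audio_tokens(6.0))  # (80, 44) 150
```

Under this assumption the three frame settings reproduce the table's "≈ 6 s" and "≈ 9–10 s" figures, and both listed resolutions divide evenly by the spatial stride.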
How to Use — Quick Start
After setup, run one of the provided scripts:
```bash
bash scripts/inference.sh          # General T2AV
bash scripts/inference_timbre.sh   # I2AV + timbre control
```
Outputs land under eval_results/.
First-time setup:
```bash
git clone https://github.com/ernie-research/NAVA && cd NAVA
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn --no-build-isolation
huggingface-cli download --local-dir .
```
Custom Batches and Prompt Rewriting
Write a JSONL file with one prompt per line:
```json
{"prompt": "一位男子在海边奔跑,镜头跟随。背景是海浪声和风声。"}
{"prompt": "两人对话~~Hello~~Hi there", "spk_wavs": ["spk1.wav", "spk2.wav"]}
{"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"}
```
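A prompt file in this shape can also be generated programmatically. The sketch below uses only the field names visible in the examples above (`prompt`, `spk_wavs`, `image_path`). The helper name `to_jsonl` and its validation rules are illustrative assumptions, and the real loader may accept or require other fields.

```python
import json


def to_jsonl(rows):
    """Serialize records to JSONL text, one JSON object per line."""
    lines = []
    for i, row in enumerate(rows):
        if not row.get("prompt"):
            raise ValueError(f"record {i} needs a non-empty 'prompt'")
        if "spk_wavs" in row and not isinstance(row["spk_wavs"], list):
            raise ValueError(f"record {i}: 'spk_wavs' must be a list of paths")
        # ensure_ascii=False keeps Chinese prompts readable in the file.
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"


records = [
    {"prompt": "一位男子在海边奔跑,镜头跟随。背景是海浪声和风声。"},
    {"prompt": "镜头跟随主体...", "image_path": "/abs/path/first_frame.png"},
]
text = to_jsonl(records)
print(text, end="")
```

Writing the returned string to `my_prompts.jsonl` with UTF-8 encoding produces a file that can be passed to `--data_file`.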
Launch with torchrun (8 GPUs with Ulysses SP):
```bash
SETUPTOOLS_USE_DISTUTILS=stdlib torchrun \
  --nnodes=1 --nproc_per_node=8 \
  inference_nava.py \
  --config configs/baseline_t2av_demo_mmdit_no_split_ltx_control_unipc.yaml \
  --ckpt NAVA.ckpt --out_dir ./outputs \
  --data_file my_prompts.jsonl \
  --width 1280 --height 704 --frames 37 --fps 24 \
  --steps 50 --save_sample --gen_turn 1 --use_sp
```
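For intuition about what `--use_sp` with 8 processes implies, the sketch below shows the bookkeeping behind Ulysses-style sequence parallelism in general terms. Outside attention each rank holds a slice of the token sequence with all heads. An all-to-all then regroups the data so that, inside attention, each rank sees the full sequence for a subset of heads. This describes the general technique rather than NAVA's code; the function name and the example sequence length are made up for illustration.

```python
def ulysses_layout(seq_len: int, num_heads: int, world_size: int) -> dict:
    """Per-rank shapes before and after the Ulysses all-to-all."""
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by the SP degree")
    if seq_len % world_size != 0:
        raise ValueError("this sketch assumes seq_len is already padded "
                         "to a multiple of world_size")
    return {
        "outside_attention": {
            "tokens_per_rank": seq_len // world_size,
            "heads_per_rank": num_heads,
        },
        "inside_attention": {
            "tokens_per_rank": seq_len,
            "heads_per_rank": num_heads // world_size,
        },
    }


# 24 heads comes from the architecture section; 32_000 tokens is a made-up length.
layout = ulysses_layout(seq_len=32_000, num_heads=24, world_size=8)
print(layout["inside_attention"])  # {'tokens_per_rank': 32000, 'heads_per_rank': 3}
```

With 24 attention heads, 8 GPUs divide the heads evenly (3 per rank), which is consistent with the stated upper limit of 8 GPUs for sequence parallelism.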
NAVA is trained on Chinese dense captions.
Short or English prompts benefit from rewriting before inference.
Three pathways are provided: a vLLM batch server (< 2 s/prompt), a local transformers script, and a Gradio "Rewrite" button — all using Qwen3-4B-Thinking-2507, with ~~... spans preserved verbatim.



