SANA-WM: Open-Source Bidirectional World Model for Minute-Long Video

A 2.6B-parameter diffusion transformer synthesizing 720p video with 6-DoF camera control, hybrid linear attention, and two-stage refinement

May 24, 2026

#Academic #Content Generation #Open Source #Training

SANA-WM is an efficient open-source world model trained for one-minute video generation. It uses a bidirectional image-to-video diffusion transformer with hybrid linear attention, dual-branch camera control, and a two-stage pipeline. According to user reports, it runs on under 8 GB of VRAM, and a distilled version denoises a 60-second 720p clip in 34 seconds on a single RTX 5090.

Introduction

SANA-WM is an efficient open-source world model trained natively for one-minute generation. The bidirectional checkpoint released here is a 2.6B-parameter image-to-video diffusion transformer that synthesises 720p, minute-scale videos with precise 6-DoF camera control, paired with the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Core Architecture Designs

Four core designs drive the architecture:

Hybrid Linear Attention — frame-wise Gated DeltaNet combined with softmax attention every Nth block for memory-efficient long-context modelling.
Dual-Branch Camera Control — independent main and camera branches enable precise per-frame trajectory adherence.
Two-Stage Generation Pipeline — a long-video refiner stitched on top of Stage-1 latents improves quality and temporal consistency.
Robust Annotation Pipeline — metric-scale 6-DoF camera poses extracted from public video corpora yield spatiotemporally consistent action supervision.
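
The hybrid attention layout in the first item can be pictured as a per-block schedule. The sketch below is illustrative only: the source does not state the value of N or the network depth, so `num_blocks=8` and `softmax_every=4` are placeholders, and the helper name `attention_schedule` is hypothetical. It also assumes that "every Nth block" means blocks N, 2N, 3N, and so on, counted from 1.

```python
def attention_schedule(num_blocks: int, softmax_every: int) -> list[str]:
    """Return the attention type used by each transformer block.

    Assumption: blocks N, 2N, 3N, ... (1-indexed) use softmax attention,
    and every other block uses frame-wise Gated DeltaNet.
    """
    if num_blocks < 1 or softmax_every < 1:
        raise ValueError("num_blocks and softmax_every must be positive")
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "gated_deltanet"
        for i in range(num_blocks)
    ]


# Placeholder sizes; the real depth and N are not given in the source.
schedule = attention_schedule(num_blocks=8, softmax_every=4)
print(schedule)
```

Under this reading most blocks use linear attention, which keeps memory bounded, and the occasional softmax block attends over the full context.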

Citation

@article{zhu2026sanawm,
  title = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
  author = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
  journal = {arXiv preprint arXiv:2605.15178},
  year = {2026},
}

Repository Layout

Component	Path in repo	Size
Sana DiT (Stage 1)	`dit/sana_wm_1600m_720p.safetensors`	10 GB
LTX-2 VAE (diffusers)	`vae/`	2 GB
LTX-2 refiner (Stage 2)	`refiner/refiner.safetensors`	41 GB
Gemma text encoder for the refiner	`refiner/text_encoder/`	46 GB
Inference config	`config.yaml`	—

The Sana text encoder (gemma-2-2b-it) is not bundled here — it is fetched on demand from the public Hugging Face mirror.

Example invocation:

python inference_video_scripts/inference_sana_wm.py \
    --image asset/sana_wm/demo_0.png \
    --prompt asset/sana_wm/demo_0.txt \
    --action "w-80,jw-40,w-40,lw-60,w-100" \
    --translation_speed 0.055 \
    --rotation_speed_deg 1.2 \
    --num_frames 321 \
    --output_dir results/demo
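
The `--action` string in the command above can be read as comma-separated segments of the form `keys-count`. The sketch below parses such a string and counts the poses it implies. It is not the repository's parser: the function names are hypothetical, and it assumes that the number after the dash is a step count and that only the letters W, A, S, D, I, J, K, L are valid keys. Under that reading the demo string sums to 320 steps, and 320 + 1 is consistent with the `(F+1, 4, 4)` rollout and with `--num_frames 321`.

```python
def parse_action(action: str) -> list[tuple[str, int]]:
    """Split an action string like "w-80,jw-40" into (keys, steps) pairs.

    Assumption: each segment is "<keys>-<steps>", where <keys> is one or
    more of the letters w/a/s/d/i/j/k/l held together for <steps> steps.
    """
    segments = []
    for token in action.split(","):
        keys, sep, count = token.strip().partition("-")
        if not sep or not keys or not count.isdigit():
            raise ValueError(f"malformed segment: {token!r}")
        unknown = set(keys) - set("wasdijkl")
        if unknown:
            raise ValueError(f"unknown keys {sorted(unknown)} in {token!r}")
        segments.append((keys, int(count)))
    return segments


def pose_count(action: str) -> int:
    """Number of poses in the rolled-out trajectory: all steps plus the start pose."""
    return sum(steps for _, steps in parse_action(action)) + 1


demo = "w-80,jw-40,w-40,lw-60,w-100"
print(parse_action(demo))  # [('w', 80), ('jw', 40), ('w', 40), ('lw', 60), ('w', 100)]
print(pose_count(demo))    # 321
```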

Usage Details

Weights are fetched from this repository on first use. Pass --no_refiner to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE instead. To run fully offline, override any of --config / --model_path / --refiner_checkpoint / --refiner_gemma_root with local paths.
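
For an offline run, the override flags named above can be assembled from a local copy of the repository. The helper below is a sketch under assumptions: the function name `offline_args` and the root directory are hypothetical, and the pairing of each flag with a path from the Repository Layout table is inferred from the names rather than confirmed by the source.

```python
from pathlib import PurePosixPath


def offline_args(root: str, use_refiner: bool = True) -> list[str]:
    """Build CLI overrides that point at a local copy of the weights.

    Assumption: each flag maps to the similarly named entry in the
    repository layout; check against the script's --help before relying on it.
    """
    base = PurePosixPath(root)
    args = [
        "--config", str(base / "config.yaml"),
        "--model_path", str(base / "dit" / "sana_wm_1600m_720p.safetensors"),
    ]
    if use_refiner:
        args += [
            "--refiner_checkpoint", str(base / "refiner" / "refiner.safetensors"),
            "--refiner_gemma_root", str(base / "refiner" / "text_encoder"),
        ]
    else:
        # Skip Stage 2 and decode Stage-1 latents directly.
        args.append("--no_refiner")
    return args


# "/models/sana_wm" is a placeholder for wherever the repository was downloaded.
print(" ".join(offline_args("/models/sana_wm")))
```

The Sana text encoder is described as fetched on demand, so it would presumably also need to be present in the local Hugging Face cache.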

The output frame size is fixed at 704 × 1280; input images are resized with their aspect ratio preserved and then center-cropped to that resolution.
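
To make the resize-and-crop step concrete, the sketch below computes the intermediate size and crop box for a given input. It assumes 704 is the height and 1280 the width, and that the resize scales the image until it fully covers the target before cropping. The function name is hypothetical, and the actual script may round differently.

```python
def cover_resize_and_crop(width: int, height: int,
                          target_w: int = 1280, target_h: int = 704):
    """Return ((new_w, new_h), (left, top, right, bottom)) for a center crop.

    Assumption: the image is scaled, keeping its aspect ratio, until both
    sides are at least the target size; the excess is then cropped evenly.
    """
    if min(width, height, target_w, target_h) < 1:
        raise ValueError("all dimensions must be positive")
    if target_w * height >= target_h * width:
        # Input is no wider than the target shape: match width, overshoot height.
        new_w = target_w
        new_h = -(-height * target_w // width)  # ceiling division
    else:
        # Input is wider than the target shape: match height, overshoot width.
        new_h = target_h
        new_w = -(-width * target_h // height)  # ceiling division
    left = (new_w - target_w) // 2
    top = (new_h - target_h) // 2
    return (new_w, new_h), (left, top, left + target_w, top + target_h)


print(cover_resize_and_crop(1920, 1080))  # ((1280, 720), (0, 8, 1280, 712))
```

For a 1920 × 1080 input this trims 8 rows from the top and bottom, so a standard 16:9 first frame loses very little content.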

Inputs

Argument	Format
`--image`	RGB image (any PIL-readable format) — used as the first frame.
`--prompt`	UTF-8 text file containing the conditioning prompt.
`--camera`	NumPy `.npy` of shape `(F, 4, 4)` — per-frame camera-to-world matrices.
`--action`	WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`.
`--intrinsics`	Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`.
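
The intrinsics constraints in the table can be checked before launching a run. The sketch below validates the array shape and applies the standard pinhole relation, FOV = 2·atan(W / 2f), to test the `[25°, 120°]` range. Several things here are assumptions: the source does not say whether the check uses horizontal or vertical FOV (horizontal is used), the focal length is taken to be in pixels at the 1280-pixel output width, and both function names are hypothetical.

```python
import math


def intrinsics_shape_ok(shape: tuple, num_frames: int) -> bool:
    """True if shape is one of the documented forms: (3, 3), (F, 3, 3), or (4,)."""
    return tuple(shape) in {(3, 3), (num_frames, 3, 3), (4,)}


def horizontal_fov_deg(fx: float, width: int = 1280) -> float:
    """Horizontal field of view of a pinhole camera, in degrees.

    Assumption: fx is the focal length in pixels at the given image width.
    """
    if fx <= 0:
        raise ValueError("focal length must be positive")
    return math.degrees(2.0 * math.atan(width / (2.0 * fx)))


def fov_in_range(fx: float, width: int = 1280,
                 lo: float = 25.0, hi: float = 120.0) -> bool:
    """Check the FOV against the documented [25, 120] degree window."""
    return lo <= horizontal_fov_deg(fx, width) <= hi


print(intrinsics_shape_ok((3, 3), num_frames=321))
print(fov_in_range(900.0))
```

A focal length of 640 pixels at width 1280 gives exactly 90°, which is a convenient reference point when sanity-checking hand-written intrinsics.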

Social Feedback

Posts on social media report the following characteristics and performance figures:

Runs on less than 8 GB VRAM.
A distilled version denoises a 60-second 720p clip in 34 seconds on one RTX 5090.
Reported to be 36× faster than older open models.
Trained on approximately 213K public videos for 15 days on 64 H100 GPUs.
Compatible with ComfyUI and Diffusers plugins.
Licensed under Apache 2.0.
One user reported getting only a regular video as output, which may indicate a quality or correctness issue.