SANA-WM: Open-Source Bidirectional World Model for Minute-Long Video

A 2.6B-parameter diffusion transformer synthesizing 720p video with 6-DoF camera control, hybrid linear attention, and two-stage refinement

SANA-WM is an efficient open-source world model trained for one-minute video generation. It uses a bidirectional image-to-video diffusion transformer with hybrid linear attention, dual-branch camera control, and a two-stage generation pipeline. According to social media reports, it runs on under 8 GB of VRAM, and a distilled version denoises a 60-second 720p clip in 34 seconds on a single RTX 5090.

Introduction

SANA-WM is an efficient open-source world model trained natively for one-minute generation. The bidirectional checkpoint released here is a 2.6B-parameter image-to-video diffusion transformer that synthesises 720p, minute-scale videos with precise 6-DoF camera control, paired with the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Core Architecture Designs

Four core designs drive the architecture:

  1. Hybrid Linear Attention — frame-wise Gated DeltaNet combined with softmax attention every Nth block for memory-efficient long-context modelling.
  2. Dual-Branch Camera Control — independent main and camera branches enable precise per-frame trajectory adherence.
  3. Two-Stage Generation Pipeline — a long-video refiner stitched on top of Stage-1 latents improves quality and temporal consistency.
  4. Robust Annotation Pipeline — metric-scale 6-DoF camera poses extracted from public video corpora yield spatiotemporally consistent action supervision.
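
The interleaving in design 1 can be pictured as a per-block schedule. The snippet below is only a sketch of that idea: the block count, the value of N, and the choice to place softmax attention on blocks N, 2N, ... are assumptions for illustration, since the text does not state them, and the labels `gated_deltanet` and `softmax` are placeholder names rather than identifiers from the SANA-WM code.

```python
def hybrid_block_schedule(num_blocks: int, softmax_every: int) -> list[str]:
    """Return one attention label per block, with softmax at every Nth block."""
    if num_blocks <= 0 or softmax_every <= 0:
        raise ValueError("num_blocks and softmax_every must be positive")
    schedule = []
    # Count blocks from 1 so that blocks N, 2N, ... are the softmax ones.
    for index in range(1, num_blocks + 1):
        kind = "softmax" if index % softmax_every == 0 else "gated_deltanet"
        schedule.append(kind)
    return schedule


# Hypothetical sizes, chosen only to keep the printout short.
schedule = hybrid_block_schedule(num_blocks=8, softmax_every=4)
print(schedule)
```

Under this assumed layout most blocks use the linear-attention variant, and only a small fraction pay the quadratic cost of softmax attention.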

Citation

@article{zhu2026sanawm,
title = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
author = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
journal = {arXiv preprint arXiv:2605.15178},
year = {2026},
}

Repository Layout

Component                          | Path in repo                        | Size
Sana DiT (Stage 1)                 | dit/sana_wm_1600m_720p.safetensors  | 10 GB
LTX-2 VAE (diffusers)              | vae/                                | 2 GB
LTX-2 refiner (Stage 2)            | refiner/refiner.safetensors         | 41 GB
Gemma text encoder for the refiner | refiner/text_encoder/               | 46 GB
Inference config                   | config.yaml                         | -

The Sana text encoder (gemma-2-2b-it) is not bundled here — it is fetched on demand from the public Hugging Face mirror.

python inference_video_scripts/inference_sana_wm.py \
--image asset/sana_wm/demo_0.png \
--prompt asset/sana_wm/demo_0.txt \
--action "w-80,jw-40,w-40,lw-60,w-100" \
--translation_speed 0.055 \
--rotation_speed_deg 1.2 \
--num_frames 321 \
--output_dir results/demo
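
The --action string in the command above can be split into segments with a few lines of Python. This is a minimal sketch, assuming each segment has the form keys-count and that the count is a number of steps; the meaning of the individual keys is not spelled out in this text, so the sketch does not interpret them. The helper name `parse_action` is hypothetical and is not part of the SANA-WM scripts.

```python
def parse_action(spec: str) -> list[tuple[str, int]]:
    """Split a string like "w-80,jw-40" into [("w", 80), ("jw", 40)]."""
    segments = []
    for token in spec.split(","):
        keys, sep, count = token.strip().partition("-")
        if not sep or not keys or not count.isdigit():
            raise ValueError(f"malformed action segment: {token!r}")
        segments.append((keys, int(count)))
    return segments


segments = parse_action("w-80,jw-40,w-40,lw-60,w-100")
total_steps = sum(count for _, count in segments)
print(segments)
# Number of steps, then number of poses if the starting pose is included.
print(total_steps, total_steps + 1)
```

The demo string sums to 320 steps. If each step is one frame and the trajectory includes the starting pose, that gives 321 poses, which would match --num_frames 321 and the (F+1, 4, 4) trajectory shape listed under Inputs.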

Usage Details

Weights are fetched from this repository on first use. Pass --no_refiner to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE instead. To run fully offline, override any of --config / --model_path / --refiner_checkpoint / --refiner_gemma_root with local paths.
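
As an illustration of the offline overrides, the sketch below assembles the demo command with all four flags pointed at a local copy of the repository. The mapping from each flag to a file in the Repository Layout table is an assumption based on the names, and `/data/sana_wm` is a hypothetical download location. It does not address the Sana text encoder, which the text says is fetched on demand.

```python
import shlex
from pathlib import PurePosixPath


def offline_command(root: PurePosixPath) -> list[str]:
    """Build the demo command with every weight path set to a local file."""
    return [
        "python", "inference_video_scripts/inference_sana_wm.py",
        "--image", "asset/sana_wm/demo_0.png",
        "--prompt", "asset/sana_wm/demo_0.txt",
        "--action", "w-80,jw-40,w-40,lw-60,w-100",
        "--num_frames", "321",
        "--output_dir", "results/demo",
        # Assumed flag-to-file mapping, inferred from the repository layout.
        "--config", str(root / "config.yaml"),
        "--model_path", str(root / "dit" / "sana_wm_1600m_720p.safetensors"),
        "--refiner_checkpoint", str(root / "refiner" / "refiner.safetensors"),
        "--refiner_gemma_root", str(root / "refiner" / "text_encoder"),
    ]


# Hypothetical local directory holding a copy of the repository files.
command = offline_command(PurePosixPath("/data/sana_wm"))
print(shlex.join(command))
```

Building the argument list first and quoting it with `shlex.join` keeps the action string intact when the command is pasted into a shell.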

The output frame size is fixed at 704 x 1280. Input images are resized with their aspect ratio preserved and then center-cropped to that resolution.
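
The geometry of that preprocessing step can be sketched as follows. The sketch assumes 704 is the height and 1280 the width, and that the image is scaled just enough to cover the target before cropping; the actual script may round or resample differently. The function name is hypothetical.

```python
def cover_resize_and_crop(width: int, height: int,
                          target_w: int = 1280, target_h: int = 704):
    """Return (resized_size, crop_box) for a cover resize plus center crop."""
    # Use the larger scale factor so both target dimensions are covered.
    scale = max(target_w / width, target_h / height)
    # Clamp after rounding so the resized image is never smaller than the target.
    new_w = max(target_w, round(width * scale))
    new_h = max(target_h, round(height * scale))
    left = (new_w - target_w) // 2
    top = (new_h - target_h) // 2
    # Box order is (left, upper, right, lower).
    return (new_w, new_h), (left, top, left + target_w, top + target_h)


size, box = cover_resize_and_crop(1920, 1080)
print(size, box)
```

With Pillow, the two results could be passed to `Image.resize` and `Image.crop` respectively, which take a `(width, height)` size and a `(left, upper, right, lower)` box.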

Inputs

Argument     | Format
--image      | RGB image (any PIL-readable format), used as the first frame.
--prompt     | UTF-8 text file containing the conditioning prompt.
--camera     | NumPy .npy of shape (F, 4, 4): per-frame camera-to-world matrices.
--action     | WASD/IJKL DSL, e.g. "w-80,jw-40,w-40,lw-60,w-100". We roll it out to a (F+1, 4, 4) trajectory. Mutually exclusive with --camera.
--intrinsics | Optional. .npy of shape (3, 3), (F, 3, 3), or (4,). If omitted, we estimate intrinsics from --image with Pi3X and abort if the resulting FOV is outside [25°, 120°].
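
The FOV bound on --intrinsics can be checked ahead of time with the standard pinhole relation between focal length and field of view. This is a sketch under stated assumptions: it uses the horizontal FOV and a focal length `fx` in pixels, whereas the text does not say which FOV the script measures, and it does not assume any particular layout for the (4,) form. Both function names are hypothetical.

```python
import math

# Accepted range in degrees, as quoted in the Inputs table.
FOV_MIN_DEG, FOV_MAX_DEG = 25.0, 120.0


def horizontal_fov_deg(fx: float, image_width: float) -> float:
    """Horizontal field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(image_width / (2.0 * fx)))


def fov_in_range(fx: float, image_width: float) -> bool:
    """True if the horizontal FOV lies inside the accepted range."""
    return FOV_MIN_DEG <= horizontal_fov_deg(fx, image_width) <= FOV_MAX_DEG


# Illustrative focal lengths for a 1280-pixel-wide frame.
for fx in (300.0, 640.0, 4000.0):
    print(fx, round(horizontal_fov_deg(fx, 1280), 1), fov_in_range(fx, 1280))
```

A very long focal length gives a narrow FOV below the lower bound, and a very short one gives a wide FOV above the upper bound, so both ends of the range can be hit by unusual inputs.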

Social Feedback

Social media reports highlight the following user-observed characteristics and performance:

  • Runs on less than 8 GB VRAM.
  • A distilled version denoises a 60-second 720p clip in 34 seconds on one RTX 5090.
  • Reported to be 36× faster than older open models.
  • Trained on approximately 213K public videos for 15 days on 64 H100 GPUs.
  • Compatible with ComfyUI and Diffusers plugins.
  • Licensed under Apache 2.0.
  • One user reported receiving only a regular video output (potential quality or correctness issue).