Unified Audio Generation and Editing with UNISON
UNISON is a unified latent flow-matching framework that handles multiple audio and speech tasks with a single set of weights. It supports text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound scene generation, and fine-grained audio editing, all within one model and one forward pass. Built on a deep LLM fusion strategy, UNISON uses a frozen Qwen2.5-Omni-7B language model to inject semantic representations layer-wise into a diffusion backbone, which removes the need for task-specific encoders or heads. The design is a step toward general-purpose audio generation systems.
How UNISON Works: Mask Channels and Deep LLM Fusion
UNISON’s architecture is built around a shared VAE encoder/decoder and an MM-DiT backbone. The VAE compresses raw audio into a latent space; flow matching runs in that latent space, and the VAE decoder maps the generated latents back to audio. Task identity is encoded via a mask channel that conditions the generation process without extra modules. Source or reference audio is VAE-encoded and injected by channel concatenation.
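The channel-concatenation conditioning described above can be sketched as follows. This is a minimal illustration, not UNISON's code: the function name `build_dit_input`, the use of exactly one mask channel, the "ones mean generate" mask convention, and the channel-first nested-list layout are all assumptions made for the example.

```python
from typing import List

Latent = List[List[float]]  # hypothetical layout: [channels][frames]


def build_dit_input(noisy: Latent, reference: Latent, mask: List[float]) -> Latent:
    """Stack noisy latent, VAE-encoded reference, and a mask along the channel axis."""
    frames = len(mask)
    for name, latent in (("noisy", noisy), ("reference", reference)):
        if any(len(channel) != frames for channel in latent):
            raise ValueError(f"{name} latent does not match the mask length")
    # Channel concatenation: the backbone receives all three as one input.
    return [list(c) for c in noisy] + [list(c) for c in reference] + [list(mask)]


def zeros(channels: int, frames: int) -> Latent:
    """Helper for the example: an all-zero latent."""
    return [[0.0] * frames for _ in range(channels)]


# Pure generation (assumed convention): empty reference, mask of ones everywhere.
gen_input = build_dit_input(zeros(40, 8), zeros(40, 8), [1.0] * 8)

# Editing (assumed convention): the reference holds the source audio latent and
# the mask marks only the frames that should be regenerated.
edit_mask = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
edit_input = build_dit_input(zeros(40, 8), zeros(40, 8), edit_mask)
```

Under these assumptions, a 40-channel latent yields an 81-channel input (40 noisy, 40 reference, 1 mask), and the same input layout serves both generation and editing.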
The key innovation is deep LLM fusion: hidden states from uniformly sampled layers of the frozen Qwen2.5-Omni-7B model are projected via learned linear layers and injected into corresponding MM-DiT double-stream blocks. This layer-wise integration provides rich linguistic and acoustic context, enabling the model to unify diverse generation and editing tasks under a single forward pass. No separate text encoders or task-specific heads are needed.
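One way to picture the layer-wise fusion is as two small steps: pick one LLM layer per DiT block at even spacing, then project that layer's hidden state to the DiT width. The sketch below is illustrative only. The helper names, the floor-based spacing rule, and the example sizes (28 LLM layers, 20 double-stream blocks) are assumptions, and the `project` function is a bias-free stand-in for a learned linear layer.

```python
from typing import List


def uniform_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> List[int]:
    """Pick one LLM layer index per DiT block, evenly spaced from first to last layer."""
    if num_dit_blocks < 1 or num_llm_layers < num_dit_blocks:
        raise ValueError("need 1 <= num_dit_blocks <= num_llm_layers")
    if num_dit_blocks == 1:
        return [num_llm_layers - 1]
    last = num_llm_layers - 1
    # Integer arithmetic keeps the endpoints exact: block 0 -> layer 0, last block -> last layer.
    return [(i * last) // (num_dit_blocks - 1) for i in range(num_dit_blocks)]


def project(hidden: List[float], weight: List[List[float]]) -> List[float]:
    """Bias-free linear projection; weight has shape [out_dim][in_dim]."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


# Assumed sizes for illustration only.
layer_ids = uniform_layer_indices(28, 20)

# Each DiT block would get its own projection; one toy 2 -> 3 projection is shown here.
projected = project([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a real model each selected hidden state is a sequence of token vectors rather than a single vector, and the projection weights are trained while the LLM stays frozen.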
Available Checkpoints
Two variants are provided, differing in VAE sampling rate and model capacity. Both share the same Qwen2.5-Omni-7B encoder and inference pipeline.
| Directory | VAE | DiT depth | Channels | Config |
|---|---|---|---|---|
| unison_D20S0_O_40ch/ | MMAudio 44 kHz | 20 double + 0 single | 40 | D20S0_O_40ch.yaml |
| unison_D24S0_O_20ch/ | MMAudio 16 kHz | 24 double + 0 single | 20 | D24S0_O_20ch.yaml |
The 44 kHz variant provides higher-quality audio for music and general sound. The 16 kHz variant uses more transformer blocks (24 instead of 20) but fewer channels (20 instead of 40), which makes it suitable for speech.
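For scripting, the two rows of the table can be captured in a small lookup. This is a convenience sketch rather than part of the repository: the `Variant` class, the `"44k"`/`"16k"` keys, and the `checkpoint_paths` helper are hypothetical, and the default `checkpoints/` and `unison/config/` roots are assumed to apply to both variants.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Variant:
    directory: str
    vae: str
    double_blocks: int
    single_blocks: int
    channels: int
    config: str


# Values copied from the checkpoint table; the dictionary keys are hypothetical labels.
VARIANTS: Dict[str, Variant] = {
    "44k": Variant("unison_D20S0_O_40ch", "MMAudio 44 kHz", 20, 0, 40, "D20S0_O_40ch.yaml"),
    "16k": Variant("unison_D24S0_O_20ch", "MMAudio 16 kHz", 24, 0, 20, "D24S0_O_20ch.yaml"),
}


def checkpoint_paths(key: str,
                     checkpoints_root: str = "checkpoints",
                     config_root: str = "unison/config") -> Tuple[str, str]:
    """Return (checkpoint directory, model config path) for a variant key."""
    variant = VARIANTS[key]
    return f"{checkpoints_root}/{variant.directory}", f"{config_root}/{variant.config}"
```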
Multi-Task Prompting
UNISON uses unified prompt formats to specify tasks. The following table shows how each task is triggered.
| Task | Prompt format |
|---|---|
| Text-to-Audio (T2A) | [Audio] {caption} |
| Text-to-Speech (TTS) | [Speech] A {female/male} voice saying "{text}" |
| Mixed Speech + Sound | [Speech] A {gender} voice saying "{text}" [Audio] {background} |
| Zero-shot Speaker Cloning | [Speech with voice] {ref_text}, {target_text} |
| Audio Scene Editing (add/remove/replace/denoise) | [Edit] [Audio] {instruction} |
| Speech-in-Scene Editing (content/insert/delete) | [Edit] [Speech] {instruction} |
| Timed Temporal Composition | `[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...` |
The mask channel and VAE-encoded reference concatenation allow the model to interpret these prompts without separate input branches.
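Because every task is selected by the prompt string alone, small builder functions help keep the tags consistent. The helpers below are hypothetical (they are not part of UNISON) and simply fill in the templates from the table; the `:g` number formatting for timestamps is a choice made for this sketch.

```python
from typing import Iterable, Tuple


def t2a_prompt(caption: str) -> str:
    return f"[Audio] {caption}"


def tts_prompt(gender: str, text: str) -> str:
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    return f'[Speech] A {gender} voice saying "{text}"'


def mixed_prompt(gender: str, text: str, background: str) -> str:
    # Speech part first, then the background sound description.
    return f"{tts_prompt(gender, text)} {t2a_prompt(background)}"


def clone_prompt(ref_text: str, target_text: str) -> str:
    return f"[Speech with voice] {ref_text}, {target_text}"


def edit_prompt(kind: str, instruction: str) -> str:
    if kind not in ("Audio", "Speech"):
        raise ValueError("kind must be 'Audio' or 'Speech'")
    return f"[Edit] [{kind}] {instruction}"


def timed_prompt(events: Iterable[Tuple[float, float, str]]) -> str:
    """Build a timed composition prompt from (start_s, end_s, description) triples."""
    parts = [f"From {start:g}s to {end:g}s, {event}." for start, end, event in events]
    return "[Audio] " + " ".join(parts)
```

For example, `timed_prompt([(0, 5, "a dog barks"), (5, 10, "rain falls")])` produces `[Audio] From 0s to 5s, a dog barks. From 5s to 10s, rain falls.`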
Getting Started
To run UNISON locally:
- Clone the repository and install dependencies.
- Download MMAudio VAE weights (`v1-44.pth` or `v1-16.pth`, and `best_netG.pt` for the 16 kHz VAE) from the MMAudio release.
- Set the environment variable `QWEN_OMNI_MODEL_PATH` to your local Qwen2.5-Omni-7B installation.
- Use Hugging Face’s `snapshot_download` to fetch the UNISON checkpoints into a `checkpoints/` directory.
Each checkpoint is a single `model.safetensors` file, automatically unwrapped from EMA if needed.
The pipeline accepts either a checkpoint directory or a direct file path.
```bash
git clone https://github.com/lizhaoqing/UNISON
cd UNISON
pip install -r requirements.txt
# Optional: pip install flash-attn --no-build-isolation

export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B

# Place downloaded MMAudio VAE weights in unison/models/mmaudio/data/ext_weights/
# Then download UNISON checkpoints (e.g., via snapshot_download)
```
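Before running inference, it can help to confirm that the setup steps above actually took effect. The following preflight check is a hypothetical helper, not something shipped with UNISON. It uses only the file names, directory, and environment variable mentioned above; the `"44k"`/`"16k"` labels are made up for the example.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# File names from the setup steps, keyed by a hypothetical variant label.
REQUIRED_VAE_FILES = {
    "44k": ["v1-44.pth"],
    "16k": ["v1-16.pth", "best_netG.pt"],
}


def preflight(variant: str, weights_dir: Path,
              env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of problems; an empty list means the setup looks complete."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("QWEN_OMNI_MODEL_PATH"):
        problems.append("QWEN_OMNI_MODEL_PATH is not set")
    for name in REQUIRED_VAE_FILES[variant]:
        if not (weights_dir / name).is_file():
            problems.append(f"missing VAE weight file: {weights_dir / name}")
    return problems
```

A typical call would be `preflight("44k", Path("unison/models/mmaudio/data/ext_weights"))`, run from the repository root.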
Generating and Editing with UNISON
Inference is launched via a single script.
The pipeline supports both the infer.sh bash helper and a direct Python call.
Key parameters include:
- `--num_inference_steps`: ODE solver steps (default 100; use 50 for faster generation).
- `--guidance_scale`: classifier-free guidance strength (default 4.5).
- `--seed`: reproducibility seed (default 42).
- `--gen_duration`: output length in seconds for generation tasks (default 10.0).
- `--ref_duration`: reference clip length for zero-shot TTS (default 3.0).
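To make the roles of the step count and guidance scale concrete, here is a generic fixed-step Euler sampler with classifier-free guidance on a toy problem. It is not UNISON's solver, which the text above does not specify beyond "ODE solver steps". The time direction (start at t=0, finish at t=1) and the toy velocity fields that stand in for the network are assumptions.

```python
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def guided_velocity(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def euler_sample(x: List[float], v_cond: Velocity, v_uncond: Velocity,
                 num_steps: int = 100, guidance_scale: float = 4.5) -> List[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with num_steps equal Euler steps."""
    x = list(x)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = guided_velocity(v_cond(x, t), v_uncond(x, t), guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toward(target: List[float]) -> Velocity:
    """Toy velocity field that moves x in a straight line to `target` by t=1."""
    def velocity(x: List[float], t: float) -> List[float]:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# With guidance_scale=1.0 only the conditional field is followed.
sample = euler_sample([0.3, -0.7], toward([1.0, 2.0]), toward([0.0, 0.0]),
                      num_steps=50, guidance_scale=1.0)
```

Each step costs two model evaluations here (conditional and unconditional), which is why halving the step count roughly halves sampling time. In this toy setup a scale above 1 pushes the result past the conditional target, which mirrors how stronger guidance trades diversity for prompt adherence.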
The model can switch between all tasks using the --task_mode all flag.
Outputs are saved to a dedicated directory.
A single-prompt example below demonstrates text-to-audio generation.
```bash
# 44 kHz variant
bash scripts/infer.sh \
  --checkpoint_dir checkpoints/unison_D20S0_O_40ch \
  --model_config unison/config/D20S0_O_40ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_44k.yaml \
  --task_mode all

# Or single-prompt generation
python unison/pipelines/infer.py \
  --model_ckpt checkpoints/unison_D20S0_O_40ch \
  --model_config unison/config/D20S0_O_40ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_44k.yaml \
  --omni_model_path "$QWEN_OMNI_MODEL_PATH" \
  --task_mode generation \
  --gen_prompt "[Audio] Rain falling on a tin roof with distant thunder" \
  --gen_duration 10.0 \
  --output_dir outputs/demo
```
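When generating many prompts from Python, the single-prompt command can be assembled programmatically. The wrapper below is a hypothetical convenience function. It uses only flags that appear in this document, and it assumes that `infer.py` accepts the sampling parameters (`--num_inference_steps`, `--guidance_scale`, `--seed`) alongside the flags shown in the command above.

```python
import os
from typing import List, Optional


def build_infer_command(model_ckpt: str, model_config: str, vae_config: str,
                        prompt: str, output_dir: str, *,
                        omni_model_path: Optional[str] = None,
                        gen_duration: float = 10.0,
                        num_inference_steps: int = 100,
                        guidance_scale: float = 4.5,
                        seed: int = 42) -> List[str]:
    """Return the argument list for a single-prompt generation run."""
    omni = omni_model_path or os.environ.get("QWEN_OMNI_MODEL_PATH")
    if not omni:
        raise ValueError("set QWEN_OMNI_MODEL_PATH or pass omni_model_path")
    return [
        "python", "unison/pipelines/infer.py",
        "--model_ckpt", model_ckpt,
        "--model_config", model_config,
        "--vae_config", vae_config,
        "--omni_model_path", omni,
        "--task_mode", "generation",
        "--gen_prompt", prompt,
        "--gen_duration", str(gen_duration),
        "--num_inference_steps", str(num_inference_steps),
        "--guidance_scale", str(guidance_scale),
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]


cmd = build_infer_command(
    "checkpoints/unison_D20S0_O_40ch",
    "unison/config/D20S0_O_40ch.yaml",
    "unison/models/mmaudio/vae_config_44k.yaml",
    "[Audio] Rain falling on a tin roof with distant thunder",
    "outputs/demo",
    omni_model_path="Qwen/Qwen2.5-Omni-7B",
)
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would launch the run. Using an argument list instead of a shell string avoids quoting problems with prompts that contain quotes or brackets.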




