How UNISON Unifies Audio and Speech Generation with Deep LLM Fusion

Explore UNISON, a single-model framework leveraging latent flow-matching and Qwen2.5-Omni-7B for diverse audio tasks, from text-to-audio to complex scene editing.

June 6, 2026

#Academic #Content Generation #LLM #Open Source #TTS

UNISON is a unified latent flow-matching framework for audio and speech generation and editing. With a single set of weights, one architecture, and one forward pass, it covers text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound scene generation, and audio and speech-in-scene editing, using deep LLM fusion with Qwen2.5-Omni-7B.

Unified Audio Generation and Editing with UNISON

UNISON handles multiple audio and speech tasks with a single set of weights: text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound scene generation, and fine-grained audio editing, all within one model and one forward pass. Its deep LLM fusion strategy takes hidden states from a frozen Qwen2.5-Omni-7B language model and injects them layer-wise into an MM-DiT backbone, so no task-specific encoders or heads are needed. The design is a step toward general-purpose audio generation systems.

How UNISON Works: Mask Channels and Deep LLM Fusion

UNISON’s architecture is built around a shared VAE encoder/decoder and an MM-DiT backbone. The VAE compresses audio into a latent space, where flow matching generates latents that the VAE decoder converts back to audio. Task identity is encoded in a mask channel that conditions generation without extra modules. Source or reference audio is VAE-encoded and injected by channel concatenation.
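As a rough picture of what channel concatenation with a mask channel can look like, the sketch below stacks a noisy latent, a VAE-encoded reference latent, and a one-channel mask along the channel axis. The function names, the channel ordering, and the 0/1 mask convention are assumptions made for illustration; the article does not specify how UNISON lays out or populates these channels.

```python
# Illustrative sketch only: tensors are nested lists shaped [channels][frames].
# The ordering (noisy latent, reference latent, mask) and the mask values
# are assumptions, not UNISON's documented layout.

def concat_channels(*tensors):
    """Concatenate [channels][frames] tensors along the channel axis."""
    frame_counts = {len(channel) for tensor in tensors for channel in tensor}
    if len(frame_counts) != 1:
        raise ValueError("all tensors must have the same number of frames")
    return [channel for tensor in tensors for channel in tensor]


def build_model_input(noisy_latent, reference_latent, mask):
    """Stack the noisy latent, the reference latent, and a 1-channel mask."""
    return concat_channels(noisy_latent, reference_latent, [mask])


# Toy sizes: 4 latent channels, 6 latent frames.
channels, frames = 4, 6
noisy = [[0.1] * frames for _ in range(channels)]
reference = [[0.0] * frames for _ in range(channels)]  # zeros when no reference is given
mask = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]                  # hypothetical: 1 marks frames to generate

model_input = build_model_input(noisy, reference, mask)
print(len(model_input), len(model_input[0]))
```

The point of the design is that every task feeds the backbone a tensor of the same shape, so one set of weights can serve generation and editing alike.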

The key innovation is deep LLM fusion: hidden states from uniformly sampled layers of the frozen Qwen2.5-Omni-7B model are projected via learned linear layers and injected into corresponding MM-DiT double-stream blocks. This layer-wise integration provides rich linguistic and acoustic context, enabling the model to unify diverse generation and editing tasks under a single forward pass. No separate text encoders or task-specific heads are needed.
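The layer pairing described above can be sketched as a small index-mapping function: pick evenly spaced LLM layers, one per MM-DiT double-stream block. This is a minimal sketch under stated assumptions. The helper name is hypothetical, the rounding rule is one reasonable choice among several, and the LLM layer count of 28 is an illustrative value rather than something the article states.

```python
def uniform_layer_indices(num_llm_layers, num_dit_blocks):
    """Return one evenly spaced LLM layer index per DiT block.

    Spacing runs from the first layer (0) to the last (num_llm_layers - 1),
    with round-half-up done in integer arithmetic.
    """
    if num_llm_layers < 1 or num_dit_blocks < 1:
        raise ValueError("layer and block counts must be positive")
    if num_dit_blocks == 1:
        return [num_llm_layers - 1]
    span = num_llm_layers - 1
    gaps = num_dit_blocks - 1
    return [(i * span + gaps // 2) // gaps for i in range(num_dit_blocks)]


# Assumed 28 LLM layers feeding the 20 double-stream blocks of the 44 kHz variant.
pairing = uniform_layer_indices(num_llm_layers=28, num_dit_blocks=20)
for block, layer in enumerate(pairing):
    print(f"DiT block {block:2d} <- LLM hidden state from layer {layer}")
```

In a full model, each selected hidden state would then pass through its own learned linear projection before entering the paired block; that step is omitted here.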

Available Checkpoints

Two variants are provided, differing in VAE sampling rate and model capacity. Both share the same Qwen2.5-Omni-7B encoder and inference pipeline.

Directory	VAE	DiT depth	Channels	Config
`unison_D20S0_O_40ch/`	MMAudio 44 kHz	20 double + 0 single	40	`D20S0_O_40ch.yaml`
`unison_D24S0_O_20ch/`	MMAudio 16 kHz	24 double + 0 single	20	`D24S0_O_20ch.yaml`

The 44 kHz variant provides higher-quality audio for music and general sound; the 16 kHz variant uses more transformer blocks (24 versus 20) but fewer channels (20 versus 40), and is suited to speech.

Multi-Task Prompting

UNISON uses unified prompt formats to specify tasks. The following table shows how each task is triggered.

Task	Prompt format
Text-to-Audio (T2A)	`[Audio] {caption}`
Text-to-Speech (TTS)	`[Speech] A {female/male} voice saying "{text}"`
Mixed Speech + Sound	`[Speech] A {gender} voice saying "{text}" [Audio] {background}`
Zero-shot Speaker Cloning	`[Speech with voice] {ref_text}, {target_text}`
Audio Scene Editing (add/remove/replace/denoise)	`[Edit] [Audio] {instruction}`
Speech-in-Scene Editing (content/insert/delete)	`[Edit] [Speech] {instruction}`
Timed Temporal Composition	`[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...`

The mask channel and VAE-encoded reference concatenation allow the model to interpret these prompts without separate input branches.
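The prompt formats in the table are regular enough to build programmatically. The helpers below are a hypothetical convenience layer, not part of the UNISON repository; they simply fill in the templates from the table. Joining timed events with a single space is an assumption.

```python
def t2a_prompt(caption):
    return f"[Audio] {caption}"


def tts_prompt(text, gender):
    return f'[Speech] A {gender} voice saying "{text}"'


def mixed_prompt(text, gender, background):
    return f"{tts_prompt(text, gender)} [Audio] {background}"


def clone_prompt(ref_text, target_text):
    return f"[Speech with voice] {ref_text}, {target_text}"


def edit_prompt(kind, instruction):
    """kind is "Audio" for scene editing or "Speech" for speech-in-scene editing."""
    if kind not in ("Audio", "Speech"):
        raise ValueError('kind must be "Audio" or "Speech"')
    return f"[Edit] [{kind}] {instruction}"


def timed_prompt(events):
    """events is a list of (start_seconds, end_seconds, description) tuples."""
    parts = [f"From {start}s to {end}s, {desc}." for start, end, desc in events]
    return "[Audio] " + " ".join(parts)


print(t2a_prompt("Rain falling on a tin roof with distant thunder"))
print(mixed_prompt("Welcome aboard", "female", "a busy train station"))
print(timed_prompt([(0, 3, "a dog barks"), (3, 6, "a car passes")]))
```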

Getting Started

To run UNISON locally:

Clone the repository and install dependencies.
Download MMAudio VAE weights (v1-44.pth or v1-16.pth, and best_netG.pt for the 16 kHz VAE) from the MMAudio release.
Set the environment variable QWEN_OMNI_MODEL_PATH to the Qwen2.5-Omni-7B model, either a local path or the Hugging Face ID Qwen/Qwen2.5-Omni-7B.
Use Hugging Face’s snapshot_download to fetch the UNISON checkpoints into a checkpoints/ directory.

Each checkpoint is a single model.safetensors file. The pipeline unwraps EMA weights automatically if needed and accepts either a checkpoint directory or a direct file path.

git clone https://github.com/lizhaoqing/UNISON
cd UNISON
pip install -r requirements.txt
# Optional: pip install flash-attn --no-build-isolation
export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B
# Place downloaded MMAudio VAE weights in unison/models/mmaudio/data/ext_weights/
# Then download UNISON checkpoints (e.g., via snapshot_download)
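The checkpoint download step can be scripted with `snapshot_download` from the `huggingface_hub` package. The sketch below is an assumption-laden example: the repository ID is a placeholder because the article does not name the Hugging Face repo, and the filter pattern assumes each variant lives in its own subdirectory, as the checkpoint table suggests.

```python
from pathlib import Path

# Placeholder: the article does not give the Hugging Face repository ID.
HF_REPO_ID = "your-org/UNISON"


def variant_patterns(variant):
    """Glob patterns restricting the download to one variant's subdirectory (assumed layout)."""
    return [f"{variant}/*"]


def download_variant(repo_id, variant, local_dir="checkpoints"):
    """Fetch one UNISON variant into local_dir and return its directory."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=variant_patterns(variant),
    )
    return Path(local_dir) / variant


# Example call (requires network access and a real repository ID):
# ckpt_dir = download_variant(HF_REPO_ID, "unison_D20S0_O_40ch")
```

Restricting the download with `allow_patterns` avoids pulling both variants when only one is needed.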

Generating and Editing with UNISON

Inference can be launched either through the scripts/infer.sh helper or by calling unison/pipelines/infer.py directly. Key parameters include:

--num_inference_steps: ODE solver steps (default 100; use 50 for faster generation).
--guidance_scale: classifier-free guidance strength (default 4.5).
--seed: reproducibility seed (default 42).
--gen_duration: output length in seconds for generation tasks (default 10.0).
--ref_duration: reference clip length for zero-shot TTS (default 3.0).
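To make the first two parameters concrete, here is a toy sketch of how a step count and a guidance scale typically interact in a flow-matching sampler. It is not UNISON's sampler: the Euler integrator, the time convention running from 0 to 1, and the toy velocity function are all assumptions chosen for illustration.

```python
def guided_velocity(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional velocity."""
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn, x0, num_inference_steps=100, guidance_scale=4.5):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_inference_steps
    for step in range(num_inference_steps):
        t = step * dt
        v_cond = velocity_fn(x, t, conditioned=True)
        v_uncond = velocity_fn(x, t, conditioned=False)
        v = guided_velocity(v_cond, v_uncond, guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, conditioned):
    """Stand-in for the network: points straight at a fixed target."""
    target = [1.0, -1.0] if conditioned else [0.0, 0.0]
    return [(goal - xi) / (1.0 - t) for goal, xi in zip(target, x)]


# Fewer steps means fewer model evaluations; each step costs two with guidance.
print(euler_sample(toy_velocity, [0.3, 0.7], num_inference_steps=50, guidance_scale=1.0))
```

With a guidance scale of 1.0 the sampler follows the conditional velocity alone; larger values push the result further from the unconditional prediction.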

Setting --task_mode all covers every task in one run, while --task_mode generation, used in the second command below, restricts the run to generation. The single-prompt command writes its output to the directory passed as --output_dir and demonstrates text-to-audio generation.

# 44 kHz variant, all tasks
bash scripts/infer.sh \
  --checkpoint_dir checkpoints/unison_D20S0_O_40ch \
  --model_config unison/config/D20S0_O_40ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_44k.yaml \
  --task_mode all

# Or single-prompt generation
python unison/pipelines/infer.py \
  --model_ckpt checkpoints/unison_D20S0_O_40ch \
  --model_config unison/config/D20S0_O_40ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_44k.yaml \
  --omni_model_path "$QWEN_OMNI_MODEL_PATH" \
  --task_mode generation \
  --gen_prompt "[Audio] Rain falling on a tin roof with distant thunder" \
  --gen_duration 10.0 \
  --output_dir outputs/demo
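For batch runs, the single-prompt command can be assembled from Python. The wrapper below is a hypothetical helper, not part of the repository. It uses only flags that appear in this article, and it assumes that infer.py accepts the sampling flags (--num_inference_steps, --guidance_scale, --seed) alongside the ones shown in the command above.

```python
import shlex


def build_infer_command(model_ckpt, model_config, vae_config, omni_model_path,
                        gen_prompt, output_dir, gen_duration=10.0,
                        num_inference_steps=100, guidance_scale=4.5, seed=42):
    """Return the argv list for one text-to-audio generation run."""
    return [
        "python", "unison/pipelines/infer.py",
        "--model_ckpt", model_ckpt,
        "--model_config", model_config,
        "--vae_config", vae_config,
        "--omni_model_path", omni_model_path,
        "--task_mode", "generation",
        "--gen_prompt", gen_prompt,
        "--gen_duration", str(gen_duration),
        "--num_inference_steps", str(num_inference_steps),
        "--guidance_scale", str(guidance_scale),
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]


cmd = build_infer_command(
    model_ckpt="checkpoints/unison_D20S0_O_40ch",
    model_config="unison/config/D20S0_O_40ch.yaml",
    vae_config="unison/models/mmaudio/vae_config_44k.yaml",
    omni_model_path="Qwen/Qwen2.5-Omni-7B",
    gen_prompt="[Audio] Rain falling on a tin roof with distant thunder",
    output_dir="outputs/demo",
    num_inference_steps=50,  # the faster setting mentioned above
)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain brackets or quotation marks.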
