How UNISON Unifies Audio and Speech Generation with Deep LLM Fusion

Explore UNISON, a single-model framework leveraging latent flow-matching and Qwen2.5-Omni-7B for diverse audio tasks, from text-to-audio to complex scene editing.

UNISON is a unified latent flow-matching framework for audio and speech generation and editing. A single set of weights covers text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound scene generation, and audio and speech-in-scene editing, with conditioning supplied by deep LLM fusion with Qwen2.5-Omni-7B.

Unified Audio Generation and Editing with UNISON

UNISON handles all of these tasks with one model and one shared forward path, rather than a separate network per task. Its deep LLM fusion strategy takes hidden states from a frozen Qwen2.5-Omni-7B language model and injects them layer-wise into the MM-DiT backbone, which removes the need for task-specific encoders or heads. The design is a step toward general-purpose audio generation systems.

How UNISON Works: Mask Channels and Deep LLM Fusion

UNISON’s architecture is built around a shared VAE encoder/decoder and an MM-DiT backbone. The VAE compresses audio into a latent space; latent flow matching generates in that space, and the VAE decoder maps the result back to audio. Task identity is encoded via a mask channel that conditions generation without extra modules. Source or reference audio is VAE-encoded and injected by channel concatenation.
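The channel layout described above can be pictured with a toy example. This is a minimal sketch under stated assumptions, not UNISON's actual code: the function name `build_dit_input`, the channel ordering, the single mask channel, and the meaning of the mask values are all illustrative guesses. Only the idea of concatenating a VAE-encoded reference and a mask channel comes from the description.

```python
from typing import List

Latent = List[List[float]]  # layout: [channels][frames]


def build_dit_input(noisy: Latent, reference: Latent, mask: List[float]) -> Latent:
    """Stack noisy latent, reference latent, and mask along the channel axis."""
    frames = len(mask)
    for row in noisy + reference:
        if len(row) != frames:
            raise ValueError("all channels must share the same number of frames")
    # Hypothetical ordering: [noisy | reference | mask]
    return [list(r) for r in noisy] + [list(r) for r in reference] + [list(mask)]


channels, frames = 40, 8  # 40 latent channels as in the 44 kHz variant; 8 frames is arbitrary
noisy = [[0.0] * frames for _ in range(channels)]
reference = [[1.0] * frames for _ in range(channels)]
# Assumed convention: 1.0 marks frames to regenerate, 0.0 marks frames to keep.
mask = [0.0] * 4 + [1.0] * 4

x = build_dit_input(noisy, reference, mask)
print(len(x), len(x[0]))  # 81 8
```

Under this layout, a pure generation task would pass an all-zero reference and an all-one mask, so the same input shape serves generation and editing alike.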

The key innovation is deep LLM fusion: hidden states from uniformly sampled layers of the frozen Qwen2.5-Omni-7B model are projected via learned linear layers and injected into corresponding MM-DiT double-stream blocks. This layer-wise integration provides rich linguistic and acoustic context, enabling the model to unify diverse generation and editing tasks under a single forward pass. No separate text encoders or task-specific heads are needed.
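The layer-to-block mapping can be sketched in a few lines. The exact sampling rule is not spelled out here, so the evenly spaced rule below is an assumption, as are the helper names `sample_layer_indices` and `project` and the layer count of 28, which is used purely for illustration. The block count of 20 matches the double-stream depth of the 44 kHz variant.

```python
from typing import List


def sample_layer_indices(num_llm_layers: int, num_blocks: int) -> List[int]:
    """Pick num_blocks evenly spaced layer indices in [0, num_llm_layers - 1]."""
    if num_llm_layers < 1 or num_blocks < 1:
        raise ValueError("both counts must be positive")
    if num_blocks == 1:
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_blocks - 1)
    return [round(i * step) for i in range(num_blocks)]


def project(hidden: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Learned linear projection y = W h + b, written out for a single token."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]


# Illustrative sizes: 28 LLM layers feeding 20 double-stream blocks.
indices = sample_layer_indices(28, 20)
print(indices[0], indices[-1], len(indices))  # 0 27 20

# Block i would receive project(hidden_states[indices[i]], W_i, b_i).
print(project([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.5]))  # [1.0, 2.0, 3.5]
```

Each block gets its own projection so that the LLM hidden size can differ from the MM-DiT width; since the LLM is frozen, only the projections and the backbone would need training.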

Available Checkpoints

Two variants are provided, differing in VAE sampling rate and model capacity. Both share the same Qwen2.5-Omni-7B encoder and inference pipeline.

| Directory | VAE | DiT depth | Channels | Config |
| --- | --- | --- | --- | --- |
| unison_D20S0_O_40ch/ | MMAudio 44 kHz | 20 double + 0 single | 40 | D20S0_O_40ch.yaml |
| unison_D24S0_O_20ch/ | MMAudio 16 kHz | 24 double + 0 single | 20 | D24S0_O_20ch.yaml |

The 44 kHz variant provides higher-quality audio for music and general sound; the 16 kHz variant uses more transformer blocks but fewer channels (20 instead of 40) and is suited to speech.

Multi-Task Prompting

UNISON uses unified prompt formats to specify tasks. The following table shows how each task is triggered.

| Task | Prompt format |
| --- | --- |
| Text-to-Audio (T2A) | `[Audio] {caption}` |
| Text-to-Speech (TTS) | `[Speech] A {female/male} voice saying "{text}"` |
| Mixed Speech + Sound | `[Speech] A {gender} voice saying "{text}" [Audio] {background}` |
| Zero-shot Speaker Cloning | `[Speech with voice] {ref_text}, {target_text}` |
| Audio Scene Editing (add/remove/replace/denoise) | `[Edit] [Audio] {instruction}` |
| Speech-in-Scene Editing (content/insert/delete) | `[Edit] [Speech] {instruction}` |
| Timed Temporal Composition | `[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...` |

The mask channel and VAE-encoded reference concatenation allow the model to interpret these prompts without separate input branches.
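Because the prompt formats are plain string templates, they are easy to build programmatically. The helpers below are a hypothetical convenience layer, not part of the UNISON repository; the function names are invented, and joining timed events with a single space is an assumption about how the template should be laid out.

```python
from typing import Iterable, Tuple


def t2a_prompt(caption: str) -> str:
    return f"[Audio] {caption}"


def tts_prompt(text: str, gender: str = "female") -> str:
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    return f'[Speech] A {gender} voice saying "{text}"'


def mixed_prompt(text: str, gender: str, background: str) -> str:
    return f"{tts_prompt(text, gender)} [Audio] {background}"


def clone_prompt(ref_text: str, target_text: str) -> str:
    return f"[Speech with voice] {ref_text}, {target_text}"


def edit_prompt(kind: str, instruction: str) -> str:
    if kind not in ("Audio", "Speech"):
        raise ValueError("kind must be 'Audio' or 'Speech'")
    return f"[Edit] [{kind}] {instruction}"


def timed_prompt(events: Iterable[Tuple[float, float, str]]) -> str:
    # Each event is (start_seconds, end_seconds, description).
    parts = [f"From {start:g}s to {end:g}s, {event}." for start, end, event in events]
    return "[Audio] " + " ".join(parts)


print(tts_prompt("Hello there", "male"))
# [Speech] A male voice saying "Hello there"
print(timed_prompt([(0, 3, "a dog barks"), (3, 6.5, "a car passes")]))
# [Audio] From 0s to 3s, a dog barks. From 3s to 6.5s, a car passes.
```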

Getting Started

To run UNISON locally:

  • Clone the repository and install dependencies.
  • Download MMAudio VAE weights (v1-44.pth or v1-16.pth, and best_netG.pt for the 16 kHz VAE) from the MMAudio release.
  • Set the environment variable QWEN_OMNI_MODEL_PATH to your local Qwen2.5-Omni-7B installation.
  • Use Hugging Face’s snapshot_download to fetch the UNISON checkpoints into a checkpoints/ directory.

Each checkpoint is a single model.safetensors file; EMA-wrapped weights are unwrapped automatically when needed. The pipeline accepts either a checkpoint directory or a direct file path.

git clone https://github.com/lizhaoqing/UNISON
cd UNISON
pip install -r requirements.txt
# Optional: pip install flash-attn --no-build-isolation
export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B
# Place downloaded MMAudio VAE weights in unison/models/mmaudio/data/ext_weights/
# Then download UNISON checkpoints (e.g., via snapshot_download)
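The download step can be scripted with `huggingface_hub`. This is a sketch under assumptions: the hub repository id is not given here, so it is left as a parameter you must supply, and the idea that each variant lives in a sub-folder named after its directory is a guess about the repository layout. The helper names are hypothetical.

```python
from pathlib import Path


def local_checkpoint_dir(variant: str, root: str = "checkpoints") -> Path:
    """Path where a variant is expected to land after download."""
    return Path(root) / variant


def fetch_variant(repo_id: str, variant: str, root: str = "checkpoints") -> Path:
    """Download one variant's files into root/ and return its local directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=repo_id,                 # placeholder: supply the real hub repo id
        local_dir=root,
        allow_patterns=[f"{variant}/*"],  # assumed layout: one sub-folder per variant
    )
    return local_checkpoint_dir(variant, root)


print(local_checkpoint_dir("unison_D20S0_O_40ch").as_posix())
# checkpoints/unison_D20S0_O_40ch
```

Calling `fetch_variant(repo_id, "unison_D20S0_O_40ch")` with the real repository id would then return a path suitable for `--checkpoint_dir` or `--model_ckpt`.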

Generating and Editing with UNISON

Inference runs through a single pipeline, reachable either via the infer.sh bash helper or a direct Python call. Key parameters include:

  • --num_inference_steps: ODE solver steps (default 100; use 50 for faster generation).
  • --guidance_scale: classifier-free guidance strength (default 4.5).
  • --seed: reproducibility seed (default 42).
  • --gen_duration: output length in seconds for generation tasks (default 10.0).
  • --ref_duration: reference clip length for zero-shot TTS (default 3.0).
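To see what the first two parameters control, consider a toy one-dimensional sampler. This is not UNISON's solver: it is a generic Euler integrator with classifier-free guidance, run on an invented velocity field, and the time convention (noise at t = 0, data at t = 1) is an assumption.

```python
from typing import Callable

Velocity = Callable[[float, float], float]


def guided_velocity(v_cond: Velocity, v_uncond: Velocity, guidance_scale: float) -> Velocity:
    """Classifier-free guidance: v = v_uncond + s * (v_cond - v_uncond)."""
    def v(x: float, t: float) -> float:
        u = v_uncond(x, t)
        return u + guidance_scale * (v_cond(x, t) - u)
    return v


def euler_sample(velocity: Velocity, x0: float, num_inference_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = x0, 1.0 / num_inference_steps
    for k in range(num_inference_steps):
        x = x + dt * velocity(x, k * dt)
    return x


def toward(target: float) -> Velocity:
    """Toy velocity field whose flow carries any start point to `target` at t = 1."""
    return lambda x, t: (target - x) / (1.0 - t)


v = guided_velocity(toward(1.0), toward(0.0), guidance_scale=4.5)
sample = euler_sample(v, x0=-0.3, num_inference_steps=50)
print(round(sample, 6))  # 4.5
```

Each step costs two model evaluations (conditional and unconditional), which is why halving `--num_inference_steps` roughly halves generation time. A guidance scale above 1 extrapolates past the conditional prediction: in this toy case the sample lands at 4.5 rather than at the conditional target 1.0.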

The --task_mode all flag enables every task with the same checkpoint. Outputs are written to a dedicated output directory. The examples below show the bash helper and a single-prompt text-to-audio call.

# 44 kHz variant
bash scripts/infer.sh \
  --checkpoint_dir checkpoints/unison_D20S0_O_40ch \
  --model_config unison/config/D20S0_O_40ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_44k.yaml \
  --task_mode all

# Or single-prompt generation
python unison/pipelines/infer.py \
  --model_ckpt checkpoints/unison_D20S0_O_40ch \
  --model_config unison/config/D20S0_O_40ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_44k.yaml \
  --omni_model_path "$QWEN_OMNI_MODEL_PATH" \
  --task_mode generation \
  --gen_prompt "[Audio] Rain falling on a tin roof with distant thunder" \
  --gen_duration 10.0 \
  --output_dir outputs/demo
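For batch jobs it can be convenient to assemble the same call from Python. The wrapper below is a hypothetical helper, not part of the repository. It uses only flags named in this article, and it assumes that infer.py accepts the sampling flags (`--num_inference_steps`, `--guidance_scale`, `--seed`) alongside the ones shown in the example.

```python
import os
import shlex
from typing import List


def build_infer_command(
    model_ckpt: str,
    model_config: str,
    vae_config: str,
    prompt: str,
    output_dir: str,
    gen_duration: float = 10.0,
    num_inference_steps: int = 100,
    guidance_scale: float = 4.5,
    seed: int = 42,
) -> List[str]:
    """Build the argument list for a single-prompt generation run."""
    omni_path = os.environ.get("QWEN_OMNI_MODEL_PATH", "Qwen/Qwen2.5-Omni-7B")
    return [
        "python", "unison/pipelines/infer.py",
        "--model_ckpt", model_ckpt,
        "--model_config", model_config,
        "--vae_config", vae_config,
        "--omni_model_path", omni_path,
        "--task_mode", "generation",
        "--gen_prompt", prompt,
        "--gen_duration", str(gen_duration),
        "--num_inference_steps", str(num_inference_steps),
        "--guidance_scale", str(guidance_scale),
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]


cmd = build_infer_command(
    model_ckpt="checkpoints/unison_D20S0_O_40ch",
    model_config="unison/config/D20S0_O_40ch.yaml",
    vae_config="unison/models/mmaudio/vae_config_44k.yaml",
    prompt="[Audio] Rain falling on a tin roof with distant thunder",
    output_dir="outputs/demo",
    num_inference_steps=50,
)
print(shlex.join(cmd))  # shell-quoted command line, ready to inspect or run
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would launch the run; building a list rather than a single string avoids quoting problems with prompts that contain brackets or quotes.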