
How MOSS-SoundEffect v2.0 Revolutionizes Text-to-Audio Synthesis

Explore the Diffusion Transformer with Flow Matching that powers high-fidelity 48 kHz audio generation from natural language.

May 27, 2026


MOSS-SoundEffect v2.0 is a text-to-audio model built on a 1.3B-parameter Diffusion Transformer trained with Flow Matching. This article covers its capabilities, its English and Chinese prompt support, and the recommended generation settings.

A New Generation of Sound Effect Synthesis

MOSS-SoundEffect v2.0 is the text-to-audio model in the MOSS-TTS family. Unlike v1’s discrete-token autoregressive backbone, v2.0 uses a continuous-latent Diffusion Transformer (DiT) trained with a Flow Matching objective. It generates high-fidelity 48 kHz audio from natural language captions.

The model covers natural and urban environments, animals, human actions, and short percussive clips. Output duration is controllable up to 30 seconds through a duration tag that is prepended to the prompt during training. Prompts can be written in English or Chinese, which makes the model usable in multilingual applications.

Diffusion Transformer with Flow Matching

The core of MOSS-SoundEffect v2.0 is a 1.3B-parameter Diffusion Transformer operating in a compressed latent space provided by a DAC VAE. A Qwen3 text encoder converts the natural-language prompt into a conditioning embedding.

Training follows the Flow Matching paradigm: the model learns to reverse a continuous-time corruption process, mapping Gaussian noise to target latent representations. The v1 autoregressive model generated discrete audio tokens one at a time; the continuous-latent approach is intended to give smoother, more natural transitions and better long-term structure. To control output length, a numerical duration tag (up to 30 s) is prepended to the prompt during training, so generation length can be varied without altering the model architecture.
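To make the Flow Matching objective concrete, here is a minimal sketch of how one training pair can be built. It assumes the common linear (rectified-flow style) path between noise and data, which the article does not confirm for MOSS-SoundEffect v2.0. The function names and the toy three-value "latent" are illustrative only.

```python
import random


def flow_matching_pair(x1, t, rng):
    """Build one training example on a straight noise-to-data path.

    x1  : target latent as a list of floats (toy stand-in for a VAE latent)
    t   : time in [0, 1]; t=0 is pure noise, t=1 is the clean latent
    rng : random.Random used to draw the Gaussian noise sample
    Assumption: linear interpolation path, not confirmed by the source.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # Gaussian noise sample
    xt = [(1.0 - t) * n + t * d for n, d in zip(x0, x1)]   # point on the path at time t
    v_target = [d - n for n, d in zip(x0, x1)]             # constant velocity of that path
    return x0, xt, v_target


def mse(pred, target):
    """Mean squared error, the usual regression loss for the velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
latent = [0.5, -1.0, 2.0]
t = 0.25
noise, x_t, v = flow_matching_pair(latent, t, rng)

# Following the target velocity for the remaining time (1 - t) lands on the latent.
recovered = [x + (1.0 - t) * vi for x, vi in zip(x_t, v)]
```

In a real training loop the network would receive `x_t`, `t`, and the text embedding, and `mse` would compare its predicted velocity against `v_target`.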

Model Variant and Recommended Settings

MOSS-SoundEffect v2.0 is available as a single 1.3B-parameter Diffusion Transformer. The following tables list the model details and the suggested settings for optimal generation.

Model	Architecture	DiT Variant	Parameters
MOSS-SoundEffect-V2.0	DiT + Flow Matching	1.3B	1.3B

Parameter	Default	Description
`num_inference_steps`	100	Number of flow-match solver steps.
`cfg_scale`	4.0	Classifier-free guidance weight.
`sigma_shift`	5.0	Flow-match scheduler shift applied per call.
`seconds`	10.0	Output duration in seconds (up to 30).
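The sketch below illustrates what `sigma_shift` and `cfg_scale` typically do in flow-matching samplers. The shift formula is the one used by several open-source flow-matching schedulers, and the guidance rule is standard classifier-free guidance. Whether MOSS-SoundEffect v2.0 implements them exactly this way is an assumption, and the helper names `shifted_sigmas` and `apply_cfg` are hypothetical.

```python
def shifted_sigmas(num_inference_steps, sigma_shift):
    """Noise levels from 1.0 (pure noise) to 0.0 (clean), warped by the shift.

    Assumed formula: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    A shift above 1 spends more of the solver steps at high noise levels.
    """
    sigmas = []
    for i in range(num_inference_steps + 1):
        s = 1.0 - i / num_inference_steps          # uniform schedule
        sigmas.append(sigma_shift * s / (1.0 + (sigma_shift - 1.0) * s))
    return sigmas


def apply_cfg(v_uncond, v_cond, cfg_scale):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return [u + cfg_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


uniform = shifted_sigmas(4, 1.0)   # shift of 1 leaves the schedule unchanged
warped = shifted_sigmas(4, 5.0)    # default shift from the table
guided = apply_cfg([0.0, 0.2], [1.0, 0.2], cfg_scale=4.0)
```

With a shift of 5.0, the midpoint of the uniform schedule (0.5) moves to about 0.83, so more steps are taken while the latent is still mostly noise. A `cfg_scale` of 1.0 reduces to the plain conditional prediction.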

Quick Setup

To start using MOSS-SoundEffect v2.0, create an isolated Python 3.12 environment and install the necessary dependencies. The commands below set up a conda environment, clone the repository, and install the full package with PyTorch CUDA 12.8 support. A minimal inference-only install is also provided.

conda create -n moss-soundeffect-v2 python=3.12 -y
conda activate moss-soundeffect-v2
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS/moss_soundeffect_v2

# Full install with fine-tuning support
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-cu128,finetune]"

# Alternatively, inference-only install (still includes Gradio demo)
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-cu128]"

Basic Usage

The pipeline can be loaded and used with a few lines of Python. The first call may take several minutes while torch.compile and Triton CUDA Graphs warm up. If compilation fails, set TORCHDYNAMO_DISABLE=1 to skip it.

import torch
from moss_soundeffect_v2 import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained(
    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

audio = pipe(
    prompt="A dog barking loudly in a park.",
    seconds=10,
    num_inference_steps=100,
    cfg_scale=4.0,
)  # (B, C, T) waveform tensor

pipe.save_audio(audio, "out.wav")
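When prompts and durations come from user input, it can help to validate them before calling the pipeline. The helper below is a hypothetical convenience, not part of the MOSS-SoundEffect package. It only reuses the keyword arguments shown in the call above, with the defaults and the 30-second limit taken from the settings table.

```python
MAX_SECONDS = 30.0  # upper limit stated in the settings table


def build_generation_kwargs(prompt, seconds=10.0, num_inference_steps=100, cfg_scale=4.0):
    """Validate inputs and return keyword arguments for a pipeline call.

    Hypothetical helper: the checks mirror the documented limits only.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty caption")
    if not 0.0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    return {
        "prompt": prompt.strip(),
        "seconds": seconds,
        "num_inference_steps": num_inference_steps,
        "cfg_scale": cfg_scale,
    }


kwargs = build_generation_kwargs("Rain falling on a tin roof.", seconds=12)
# audio = pipe(**kwargs)  # `pipe` is the pipeline object created earlier
```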

Important Notes

If you encounter TorchDynamo or Triton errors during the first inference call, disable dynamo by setting TORCHDYNAMO_DISABLE=1 before launching Python. A Gradio demo is included in the inference-only install. For fine-tuning recipes and more examples, refer to the GitHub README.
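The variable can also be set from inside a script rather than the shell. This is a small sketch that assumes PyTorch reads `TORCHDYNAMO_DISABLE` when it is first imported, so the assignment is placed before any `import torch`. The function name is illustrative.

```python
import os


def disable_torchdynamo():
    """Set the environment variable that turns TorchDynamo compilation off."""
    os.environ["TORCHDYNAMO_DISABLE"] = "1"


disable_torchdynamo()
# Import torch and the pipeline only after this point, for example:
# import torch
# from moss_soundeffect_v2 import MossSoundEffectPipeline
```

The shell equivalent is to prefix the launch command, for example `TORCHDYNAMO_DISABLE=1 python your_script.py`, where `your_script.py` is a placeholder for your own entry point.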

