A New Generation of Sound Effect Synthesis
MOSS-SoundEffect v2.0 is the text-to-audio model in the MOSS-TTS family. Unlike v1’s discrete-token autoregressive backbone, v2.0 uses a continuous-latent Diffusion Transformer (DiT) trained with a Flow Matching objective. It generates high-fidelity 48 kHz audio from natural language captions.
The model covers natural and urban environments, animals, human actions, and short percussive clips. Output duration is controllable up to 30 seconds through a duration tag that is prepended to the prompt during training. Prompts can be written in either English or Chinese.
Diffusion Transformer with Flow Matching
The core of MOSS-SoundEffect v2.0 is a 1.3B-parameter Diffusion Transformer operating in a compressed latent space provided by a DAC VAE. A Qwen3 text encoder converts the natural-language prompt into a conditioning embedding.
Training follows the Flow Matching paradigm: the model learns to reverse a continuous-time corruption process, mapping Gaussian noise to target latent representations. Compared with the v1 autoregressive model, which generated discrete audio tokens, this continuous-latent approach enables smoother, more natural transitions and better long-term structure. To control output length, a numerical duration tag (up to 30 s) is prepended to the prompt during training, so the output length can be varied without altering the model architecture.
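The noise-to-latent mapping described above can be illustrated with a toy sketch. It assumes the straight-line (rectified-flow) path, which is one common Flow Matching formulation and is not confirmed as the exact objective used by MOSS-SoundEffect v2.0. All function names are hypothetical, and a three-element list stands in for a latent.

```python
import random

def interpolate(noise, target, t):
    """Point on the straight path from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]

def target_velocity(noise, target):
    """Velocity of the straight path; it is constant in t: target - noise."""
    return [x - n for n, x in zip(noise, target)]

def flow_matching_loss(predicted_velocity, noise, target):
    """Mean squared error between a predicted velocity and the true one."""
    true_v = target_velocity(noise, target)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, true_v)) / len(true_v)

def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
target = [0.5, -1.0, 2.0]  # stand-in for a latent vector
noise = [rng.gauss(0.0, 1.0) for _ in target]

# Oracle velocity field: what a perfectly trained model would output for this pair.
def oracle(x, t):
    return target_velocity(noise, target)

sample = euler_sample(oracle, noise, num_steps=100)
```

In a real system the oracle is replaced by the trained network, which also receives the text conditioning. The number of Euler steps plays the role of `num_inference_steps`.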
Model Variant and Recommended Settings
MOSS-SoundEffect v2.0 is available as a single 1.3B-parameter Diffusion Transformer. The following tables list the model details and the suggested settings for optimal generation.
| Model | Architecture | DiT Variant | Parameters |
|---|---|---|---|
| MOSS-SoundEffect-V2.0 | DiT + Flow Matching | 1.3B | 1.3B |

| Parameter | Default | Description |
|---|---|---|
| num_inference_steps | 100 | Number of flow-match solver steps. |
| cfg_scale | 4.0 | Classifier-free guidance weight. |
| sigma_shift | 5.0 | Flow-match scheduler shift applied per call. |
| seconds | 10.0 | Output duration in seconds. Up to 30. |

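
To give a feel for `sigma_shift` and `cfg_scale`, here is a minimal sketch of how such settings are commonly applied in flow-matching samplers. The shift formula and the guidance combination below are widely used conventions. That MOSS-SoundEffect v2.0 applies them in exactly this form is an assumption, and the function names are hypothetical.

```python
def shift_sigma(sigma, shift):
    """Common flow-match time shift: shift * s / (1 + (shift - 1) * s)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

def sigma_schedule(num_steps, shift):
    """num_steps + 1 noise levels, from 1.0 (pure noise) down to 0.0 (clean)."""
    return [shift_sigma(1.0 - i / num_steps, shift) for i in range(num_steps + 1)]

def apply_cfg(v_uncond, v_cond, cfg_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by cfg_scale."""
    return [u + cfg_scale * (c - u) for u, c in zip(v_uncond, v_cond)]

# Defaults from the settings table.
sigmas = sigma_schedule(num_steps=100, shift=5.0)
guided = apply_cfg([0.0, 0.2], [1.0, 0.4], cfg_scale=4.0)
```

With a shift greater than 1, the schedule spends more of its steps at high noise levels. A `cfg_scale` of 1.0 reduces to the plain conditional prediction.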
Quick Setup
To start using MOSS-SoundEffect v2.0, create an isolated Python 3.12 environment and install the necessary dependencies. The commands below set up a conda environment, clone the repository, and install the full package with PyTorch CUDA 12.8 support. A minimal inference-only install is also provided.
```bash
conda create -n moss-soundeffect-v2 python=3.12 -y
conda activate moss-soundeffect-v2

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS/moss_soundeffect_v2

# Full install with fine-tuning support
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-cu128,finetune]"

# Alternatively, inference-only install (still includes Gradio demo)
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-cu128]"
```
Basic Usage
The pipeline can be loaded and used with a few lines of Python.
The first call may take several minutes due to torch.compile and Triton CUDA Graph warm-up.
To avoid compilation issues, set TORCHDYNAMO_DISABLE=1 if needed.
```python
import torch
from moss_soundeffect_v2 import MossSoundEffectPipeline

pipe = MossSoundEffectPipeline.from_pretrained(
    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

audio = pipe(
    prompt="A dog barking loudly in a park.",
    seconds=10,
    num_inference_steps=100,
    cfg_scale=4.0,
)  # (B, C, T) waveform tensor

pipe.save_audio(audio, "out.wav")
```
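Before a long generation run, it can help to validate the call arguments against the documented limits. The sketch below is not part of the package: `generation_kwargs` and `expected_num_samples` are hypothetical helpers, and the assumption that the output length equals `seconds` times the 48 kHz sample rate has not been checked against the pipeline.

```python
SAMPLE_RATE_HZ = 48_000  # output sample rate stated in the model description
MAX_SECONDS = 30.0       # documented upper bound on output duration

def generation_kwargs(prompt, seconds=10.0, num_inference_steps=100, cfg_scale=4.0):
    """Validate arguments and return them as keyword arguments for the pipeline call."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}]")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    return {
        "prompt": prompt,
        "seconds": seconds,
        "num_inference_steps": num_inference_steps,
        "cfg_scale": cfg_scale,
    }

def expected_num_samples(seconds):
    """Assumed length T of the (B, C, T) waveform: seconds * sample rate."""
    return round(seconds * SAMPLE_RATE_HZ)

kwargs = generation_kwargs("A dog barking loudly in a park.", seconds=10)
num_samples = expected_num_samples(kwargs["seconds"])
```

The resulting dictionary could then be passed as `pipe(**kwargs)`.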
Important Notes
If you encounter TorchDynamo or Triton errors during the first inference call, disable dynamo by setting TORCHDYNAMO_DISABLE=1 before launching Python.
A Gradio demo is included in the inference-only install.
For fine-tuning recipes and more examples, refer to the GitHub README.
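The TorchDynamo workaround from the notes above can also be applied from inside a script instead of the shell. This is a minimal sketch: `disable_torchdynamo` is a hypothetical helper, and calling it before `torch` is imported is a conservative choice rather than a documented requirement.

```python
import os

def disable_torchdynamo():
    """Set TORCHDYNAMO_DISABLE=1 for the current process."""
    os.environ["TORCHDYNAMO_DISABLE"] = "1"

disable_torchdynamo()

# Import torch and moss_soundeffect_v2 only after this point, for example:
# import torch
# from moss_soundeffect_v2 import MossSoundEffectPipeline
```

Setting the variable in the shell (`TORCHDYNAMO_DISABLE=1 python script.py`) has the same effect and leaves the script unchanged.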





