A Fully Continuous Autoregressive TTS Foundation Model
This paper introduces dots.tts, a 2-billion-parameter text-to-speech (TTS) system that operates entirely in a continuous latent space, removing the need for discrete acoustic tokens. The work targets a core problem in continuous autoregressive generation: long-range error accumulation. Without the quantization buffer that discrete codecs provide, small prediction errors compound across time, degrading quality. The authors address this with three complementary innovations.
First, they train a semantically structured AudioVAE using multiple objectives, including a WavLM alignment loss, making the latent space both high-fidelity and learnable for the downstream model. Second, they decompose generation into a semantic encoder, a language model backbone, and a full-context autoregressive flow-matching head, keeping semantic reasoning and acoustic rendering separate. Third, they apply reward-free self-corrective post-training to the flow-matching head, teaching it to recover from its own inference-time errors. The result is a model that achieves state-of-the-art stability and voice-cloning quality while preserving the expressiveness enabled by continuous latents.
Architecture: Decoupling Semantics and Acoustics
dots.tts is built from three specialized modules: a semantic encoder, an LLM backbone, and an autoregressive flow-matching head. The semantic encoder compresses each generated four-frame patch of 25 Hz VAE latents into a single 6.25 Hz embedding, stripping away high-variance acoustic detail before feeding it back to the language model. This restriction is critical: the LLM sees only a compact semantic summary of the history, not the raw latent, which prevents acoustic errors from destabilizing the autoregressive rollout.

The LLM backbone, initialized from Qwen2.5-1.5B, consumes BPE text tokens interleaved with these audio-semantic embeddings. Its hidden states condition an autoregressive flow-matching head, a Diffusion Transformer (DiT) that generates the next four-frame VAE latent patch. During training, the head uses a block-causal attention mask that exactly reproduces the per-step context seen at inference, enabling parallel training across all patches while maintaining strict causality. A speaker embedding extracted by a frozen CAM++ encoder is injected via adaLN-zero modulation, and classifier-free guidance is applied jointly over text content and timbre.
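The block-causal pattern can be illustrated with a small sketch. It shows only the generic idea, assuming frames are grouped into four-frame patches and that a frame may attend to every frame in its own or an earlier patch. The function name and the boolean-list representation are illustrative, and the mask used in the paper may also cover conditioning tokens that are not modeled here.

```python
def block_causal_mask(num_frames: int, patch_size: int = 4) -> list[list[bool]]:
    """Return mask[i][j], True when frame i may attend to frame j.

    Frames in the same patch see each other fully; across patches,
    attention only flows from later patches to earlier ones.
    """
    return [
        [(j // patch_size) <= (i // patch_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Two patches of four frames each: 'x' marks an allowed attention edge.
mask = block_causal_mask(num_frames=8, patch_size=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the mask depends only on patch indices, all patches of an utterance can be trained in one forward pass while each patch still sees exactly the history it would have at inference.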
Building a Learnable Continuous Latent Space
The AudioVAE is trained in two stages on 48 kHz audio. Stage 1 targets reconstruction quality using a BigVGAN-v2-style adversarial and multi-scale mel-spectral loss, regularized by a KL and flow prior. The encoder is fully causal, using strided convolutional residual blocks to achieve 1920× temporal downsampling, producing a 128-dimensional latent stream at 25 Hz.
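The rates quoted above follow directly from the sampling rate and the downsampling factor. The short check below uses only numbers stated in the text; the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 48_000     # input audio sampling rate
DOWNSAMPLE_FACTOR = 1920    # temporal downsampling of the causal encoder
FRAMES_PER_PATCH = 4        # latent frames generated per autoregressive step

latent_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR   # latent frames per second
patch_rate_hz = latent_rate_hz / FRAMES_PER_PATCH     # autoregressive steps per second
patch_duration_ms = 1000 / patch_rate_hz              # audio covered by one patch

print(latent_rate_hz, patch_rate_hz, patch_duration_ms)
```

Each autoregressive step therefore covers 160 ms of audio, which is why the language model operates at 6.25 Hz while the VAE latents run at 25 Hz.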
Stage 2 targets learnability. A heavily compressed latent can reconstruct well yet retain so much acoustic variation that a downstream LLM struggles to use it as a generation target. The authors add a frame-level cosine alignment loss against a frozen WavLM teacher and a multitask downstream block trained jointly on ASR, emotion, and speaker classification. This makes the space semantically structured without sacrificing reconstruction. The resulting VAE achieves a WER of 4.14% and SIM of 0.969 on LibriSpeech test-other, placing it in the top band of continuous representations and well ahead of discrete codecs, so reconstruction is not a downstream bottleneck.
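A frame-level cosine alignment loss can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the VAE features have already been projected to the teacher's dimension and time-aligned frame by frame, and it assumes the common form of one minus cosine similarity, averaged over frames. The function name is hypothetical.

```python
import math


def cosine_alignment_loss(student, teacher, eps=1e-8):
    """Mean over frames of (1 - cosine similarity) between paired feature vectors.

    student, teacher: equal-length lists of frames, each frame a list of floats.
    """
    assert len(student) == len(teacher), "features must be time-aligned"
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / (norm + eps)
    return total / len(student)


# Frame 1 points the same way as the teacher, frame 2 is orthogonal to it.
student_frames = [[1.0, 0.0], [0.0, 2.0]]
teacher_frames = [[2.0, 0.0], [1.0, 0.0]]
loss = cosine_alignment_loss(student_frames, teacher_frames)
print(round(loss, 6))
```

The loss ignores vector magnitude, so it constrains the direction of each latent frame toward the teacher's representation without fixing its scale.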
Self-Corrective Alignment and MeanFlow Distillation
Post-training proceeds in two stages, both updating only the DiT acoustic generator. The first stage adapts the SOAR (Self-corrective alignment) framework to the autoregressive flow-matching head. For each training sample, the model performs a detached one-step Euler rollout using its own CFG-guided prediction, creating an off-trajectory state that simulates inference-time errors. It then learns to steer these states back toward the clean latent endpoint. This reward-free process directly addresses the multi-step ODE mismatch between pretraining and inference, where small velocity errors accumulate across patches.
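The rollout-and-correct idea can be made concrete with a toy example. It is one plausible reading of the description above rather than the SOAR objective itself: it assumes a linear noise-to-data path with t=0 at noise and t=1 at the clean latent, and it takes the corrective target to be the constant velocity that carries the off-trajectory state to the clean endpoint in the remaining time. All function names are hypothetical, and the "model prediction" is simulated by adding a fixed error to the true velocity.

```python
def lerp(noise, clean, t):
    """Point on the straight path from noise (t=0) to clean (t=1)."""
    return [(1 - t) * n + t * c for n, c in zip(noise, clean)]


def euler_step(x, velocity, dt):
    """One explicit Euler step of the flow ODE."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]


def corrective_target(x_off, clean, t_next):
    """Velocity that moves x_off to the clean endpoint in the remaining time."""
    return [(c - xi) / (1.0 - t_next) for c, xi in zip(clean, x_off)]


noise, clean = [0.0, 0.0], [1.0, 2.0]
t, dt = 0.25, 0.25

x_t = lerp(noise, clean, t)                          # on-trajectory training state
true_v = [c - n for c, n in zip(clean, noise)]
pred_v = [v + 0.2 for v in true_v]                   # stand-in for an imperfect guided prediction
x_off = euler_step(x_t, pred_v, dt)                  # detached rollout: treated as a constant
target_v = corrective_target(x_off, clean, t + dt)   # regression target at the drifted state

# Following the corrective target for the remaining time lands on the clean latent.
landed = euler_step(x_off, target_v, 1.0 - (t + dt))
print(x_off, target_v, landed)
```

In a real training loop the rollout state would be computed without gradients, and the model's velocity at that state would be regressed toward the corrective target.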
The second stage applies CFG-aware MeanFlow distillation. A frozen self-corrected teacher generates trajectories with classifier-free guidance, and a student DiT is trained to predict the mean velocity over variable-length intervals with a single conditional forward pass. Because CFG is fused into the distillation target, the student avoids the separate conditional and unconditional evaluations required by standard CFG. At inference, the student needs only 2–4 function evaluations per patch, enabling low-latency generation while preserving the corrected teacher behavior.
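The cost difference can be sketched in a few lines. The guidance formula below is the standard classifier-free guidance combination; the sampler assumes a time convention running from t=0 (noise) to t=1 (data) and a student that returns the mean velocity over an interval [t, r]. The toy student is a hypothetical closed-form stand-in for a trained network, chosen so the result can be checked by hand.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Standard CFG: needs one conditional and one unconditional evaluation."""
    return [vu + scale * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]


def sample_with_mean_velocity(x, mean_velocity_fn, num_steps):
    """Integrate from t=0 to t=1 in num_steps jumps, one network call per jump."""
    calls = 0
    for k in range(num_steps):
        t, r = k / num_steps, (k + 1) / num_steps
        u = mean_velocity_fn(x, t, r)   # single forward pass, guidance already baked in
        calls += 1
        x = [xi + (r - t) * ui for xi, ui in zip(x, u)]
    return x, calls


clean = [1.0, 2.0]


def toy_student(x, t, r):
    """Hypothetical student: exact mean velocity of a straight path toward `clean`."""
    return [(c - xi) / (1.0 - t) for c, xi in zip(clean, x)]


sample, nfe = sample_with_mean_velocity([0.0, 0.0], toy_student, num_steps=4)
print(sample, nfe)
```

With standard CFG, a four-step sampler would call the network eight times per patch (`cfg_velocity` needs both branches at every step), whereas the distilled student needs four.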
State-of-the-Art Zero-Shot Voice Cloning
On Seed-TTS-Eval, the primary zero-shot voice-cloning benchmark, dots.tts achieves the best average performance. The self-corrected model (SOAR) reaches an average WER of 2.95% and an average SIM of 79.2, leading the next-best SIM (Seed-TTS, 77.8) by 1.4 points. The MeanFlow-distilled variant at NFE=4 keeps average WER within 0.01 of SOAR at a cost of roughly one SIM point.
| Model | test-en WER↓ / SIM↑ | test-zh WER↓ / SIM↑ | test-zh-hard WER↓ / SIM↑ | Average WER↓ / SIM↑ |
|---|---|---|---|---|
| dots.tts (SOAR) | 1.30 / 77.1 | 0.94 / 81.0 | 6.60 / 79.5 | 2.95 / 79.2 |
| dots.tts (MF, NFE=4) | 1.29 / 76.2 | 0.94 / 80.0 | 6.60 / 78.5 | 2.94 / 78.2 |
| CosyVoice 3 | 2.22 / 72.0 | 1.12 / 78.1 | 5.83 / 75.8 | 3.06 / 75.3 |
| Seed-TTS | 2.25 / 76.2 | 1.12 / 79.6 | 7.59 / 77.6 | 3.65 / 77.8 |
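
The Average column is consistent with an unweighted mean over the three test subsets. The check below recomputes it from the per-subset figures in the table; that the averages are unweighted is an assumption, though it reproduces every reported value.

```python
# (WER %, SIM) for test-en, test-zh, test-zh-hard, copied from the table above.
results = {
    "dots.tts (SOAR)": [(1.30, 77.1), (0.94, 81.0), (6.60, 79.5)],
    "dots.tts (MF, NFE=4)": [(1.29, 76.2), (0.94, 80.0), (6.60, 78.5)],
    "CosyVoice 3": [(2.22, 72.0), (1.12, 78.1), (5.83, 75.8)],
    "Seed-TTS": [(2.25, 76.2), (1.12, 79.6), (7.59, 77.6)],
}


def averages(rows):
    """Unweighted mean WER and SIM across subsets."""
    wer = sum(w for w, _ in rows) / len(rows)
    sim = sum(s for _, s in rows) / len(rows)
    return wer, sim


for name, rows in results.items():
    wer, sim = averages(rows)
    print(f"{name}: {wer:.2f} / {sim:.1f}")
```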
On the 24-language MiniMax multilingual benchmark, dots.tts (SOAR) leads average speaker similarity at 83.9, taking the per-language SIM lead on 19 of 24 languages. The WER picture is mixed: a few low-resource outliers pull up the average, a limitation attributed to insufficient BPE token coverage for script-divergent languages.
Expressiveness and Cross-Lingual Capabilities
On EmergentTTS-Eval, which uses a Gemini-2.5-Pro audio judge for head-to-head comparisons against gpt-4o-mini-tts, dots.tts (Pretrain) leads the open-source field with a 49.2% overall win rate. It achieves the top open-source score on Emotions (72.7%) and, at 65.7%, the highest Syntactic Complexity score across all systems, open and closed. The SOAR stage improves text faithfulness on syntactically complex utterances by 7.3 points but trades off some emotional expressiveness.
On CV3-Eval's cross-lingual voice-cloning subset, dots.tts (SOAR) leads SIM in both directions: 75.0 for English→Chinese and 72.8 for Chinese→English, 6–8 points above CosyVoice 3. This demonstrates strong timbre disentanglement, a critical capability for preserving speaker identity across languages. The MeanFlow-distilled variant inherits these gains, with the NFE=4 model taking the hard-English WER lead at 4.37%.
Real-Time Streaming and Deployment
The model is designed from the outset for causal, low-latency inference. A 1T1A interleaved sequence layout alternates single BPE text tokens with 6.25 Hz audio steps, allowing an upstream conversational LLM to drive synthesis at its own text-emission rate. Speech can begin within a single text token of generation, without buffering a full utterance.
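The 1T1A layout can be pictured as a simple alternation of two streams. The sketch below is illustrative only: the function name and tuple representation are hypothetical, and appending the leftover items once one stream runs out is an assumption, since the text does not say how the real system handles unequal lengths.

```python
from itertools import zip_longest


def interleave_1t1a(text_tokens, audio_steps):
    """Alternate one text token with one audio step; leftovers follow in order."""
    sequence = []
    for text, audio in zip_longest(text_tokens, audio_steps):
        if text is not None:
            sequence.append(("text", text))
        if audio is not None:
            sequence.append(("audio", audio))
    return sequence


# Two text tokens arrive from the upstream LLM; three audio steps are generated.
layout = interleave_1t1a(["Hel", "lo"], ["a0", "a1", "a2"])
print(layout)
```

Because the first audio step directly follows the first text token, synthesis can start as soon as the upstream model emits a single token.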
Combined with CFG-aware MeanFlow distillation at NFE=4, the system achieves a first-packet latency of 85 ms at RTF 0.231 in plain mode and 54 ms at RTF 0.245 in interleaved streaming mode on a single NVIDIA H800 GPU. The LLM runs on vLLM with continuous batching and paged-KV attention, while the AR-FM head and semantic encoder are JIT-compiled. This efficiency profile makes dots.tts suitable for real-time conversational deployment. The full training and inference code, along with pretrained, post-trained, and distilled checkpoints, is released under the Apache 2.0 license.
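To put the real-time factor in concrete terms, the sketch below assumes the usual definition of RTF as compute time divided by the duration of the audio produced, and it assumes the cost is spread evenly across patches. The function name is illustrative.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to synthesize audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


PATCH_SECONDS = 4 / 25  # four 25 Hz latent frames per patch, i.e. 0.16 s of audio

for mode, rtf in [("plain", 0.231), ("interleaved streaming", 0.245)]:
    total = synthesis_seconds(10.0, rtf)
    per_patch_ms = 1000 * synthesis_seconds(PATCH_SECONDS, rtf)
    print(f"{mode}: 10 s of speech in {total:.2f} s, "
          f"about {per_patch_ms:.1f} ms of compute per 160 ms patch")
```

An RTF below 1 means audio is produced faster than it is played back, so once the first packet has been delivered the stream can stay ahead of playback.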




