
Why Clean-Latent Prediction Outperforms Velocity in Diffusion Models

Understanding the geometric modeling advantage of direct clean-latent regression over velocity prediction in compressed VAE spaces.


Explore how the choice of prediction target profoundly impacts diffusion model performance, even in latent spaces. This article details a controlled study comparing clean-latent (JLT) and velocity prediction (DiT), revealing why direct clean-latent regression consistently yields superior results due to fundamental differences in the underlying regression problem.

Clean Prediction in Latent Space: A Controlled Target Study


Diffusion models can be trained to predict the clean data, the added noise, or a velocity field. While these targets are algebraically convertible, recent work (JiT) showed that directly regressing the clean image in pixel space exploits low-dimensional structure better than predicting ambient noise or velocity. This paper asks: does the choice of prediction target still matter when the model operates in a compressed VAE latent space, where much of the pixel variability has already been removed? The authors introduce JLT, a 130M latent diffusion Transformer that uses clean-latent prediction over frozen FLUX.2 VAE codes. By comparing JLT with a matched velocity-prediction DiT under identical representation, architecture, and training settings, the study isolates the effect of the direct regression target. The central finding is that clean-latent prediction consistently outperforms velocity prediction, demonstrating that target parameterization is a geometric modeling choice, not merely an algebraic rewrite.

Prediction Targets and Algebraic Equivalence

The forward corruption process mixes a clean latent $x$ and Gaussian noise $\epsilon$ along a linear path:

$$z_t = t x + (1-t)\,\epsilon, \quad t \in [0,1].$$

Three common direct targets are the clean latent $y_x = x$, the noise $y_\epsilon = \epsilon$, and the velocity $y_v = x - \epsilon$. For a fixed $t \in (0,1)$, any one target determines the others through an affine readout. For example, from a predicted clean latent $\hat{x}_\theta$, one can recover $\hat{\epsilon}_\theta = (z_t - t\hat{x}_\theta)/(1-t)$ and $\hat{v}_\theta = (\hat{x}_\theta - z_t)/(1-t)$. This algebraic equivalence often leads practitioners to treat target choice as a notation change. However, the network is trained before this readout is applied, and the readout scales prediction errors differently across noise levels: an error in $\hat{x}_\theta$ is multiplied by $1/(1-t)$ in the implied velocity. The paper’s controlled comparison changes only the direct target (clean latent for JLT, velocity for the matched DiT) while keeping the representation, backbone, and training fixed, revealing that the induced regression problem differs substantially.
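The readout relations can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the function names (`corrupt`, `eps_from_x`, `v_from_x`) and the array shape are illustrative assumptions. It confirms that an exact clean latent recovers the exact noise and velocity, and measures how a small error in the clean prediction is scaled by the velocity readout.

```python
import numpy as np


def corrupt(x, eps, t):
    """Linear path z_t = t * x + (1 - t) * eps."""
    return t * x + (1.0 - t) * eps


def eps_from_x(x_hat, z_t, t):
    """Noise implied by a clean-latent prediction (requires t < 1)."""
    return (z_t - t * x_hat) / (1.0 - t)


def v_from_x(x_hat, z_t, t):
    """Velocity implied by a clean-latent prediction (requires t < 1)."""
    return (x_hat - z_t) / (1.0 - t)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))    # stand-in clean latents (hypothetical shape)
eps = rng.normal(size=(4, 16))  # Gaussian noise

readout_gain = {}
for t in (0.1, 0.5, 0.9):
    z_t = corrupt(x, eps, t)
    # An exact clean latent recovers the exact noise and velocity.
    assert np.allclose(eps_from_x(x, z_t, t), eps)
    assert np.allclose(v_from_x(x, z_t, t), x - eps)
    # Perturb the clean prediction and measure the induced velocity error.
    delta = 1e-3 * rng.normal(size=x.shape)
    v_err = v_from_x(x + delta, z_t, t) - (x - eps)
    readout_gain[t] = np.linalg.norm(v_err) / np.linalg.norm(delta)

print(readout_gain)
```

The measured gain equals $1/(1-t)$, so the same clean-latent error costs more in velocity units as $t$ approaches 1, which is one way to see that the two targets weight noise levels differently.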

Target-Geometry Analysis: Why the Target Matters

A local linear-Gaussian analysis explains the empirical gap. Assume $x \sim \mathcal{N}(0,\Sigma)$ and $\epsilon \sim \mathcal{N}(0,I)$. The marginal target covariances are:

$$\operatorname{Cov}(y_x) = \Sigma, \quad \operatorname{Cov}(y_v) = \Sigma + I.$$

Velocity prediction adds an isotropic unit floor to every direction. If $\Sigma$ is anisotropic, low-variance latent directions become unit-variance in $y_v$, while clean prediction keeps their target variance small. The conditional ambiguity also differs. For a single coordinate with eigenvalue $\lambda_i$,

$$\operatorname{Var}(v_i \mid z_i) = \frac{1}{(1-t)^2} \operatorname{Var}(x_i \mid z_i).$$

The Bayes estimators reveal a further mechanism: as $\lambda_i \to 0$, the clean-target coefficient tends to $0$, attenuating low-variance directions, while the velocity-target coefficient tends to $-1/(1-t)$, amplifying them. Thus, even though the targets are linearly convertible after prediction, they present different supervised regression problems to the network.
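These limits can be reproduced for a single coordinate. The sketch below assumes the scalar model $x_i \sim \mathcal{N}(0, \lambda_i)$, $\epsilon_i \sim \mathcal{N}(0, 1)$, $z_i = t x_i + (1-t)\epsilon_i$, and uses the standard Gaussian conditioning formulas (coefficient = covariance with $z_i$ divided by $\operatorname{Var}(z_i)$). The closed forms in the code are derived here from those assumptions rather than quoted from the paper, and the function names are illustrative.

```python
def bayes_coefficients(lam, t):
    """Linear Bayes estimators for x ~ N(0, lam), eps ~ N(0, 1).

    Returns (a_x, a_v) such that E[x | z] = a_x * z and E[v | z] = a_v * z,
    where z = t * x + (1 - t) * eps and v = x - eps.
    """
    var_z = t**2 * lam + (1.0 - t) ** 2
    a_x = t * lam / var_z                # Cov(x, z) / Var(z)
    a_v = (t * lam - (1.0 - t)) / var_z  # Cov(v, z) / Var(z)
    return a_x, a_v


def conditional_variances(lam, t):
    """Posterior variances Var(x | z) and Var(v | z) for the same model."""
    var_z = t**2 * lam + (1.0 - t) ** 2
    var_x = lam * (1.0 - t) ** 2 / var_z
    var_v = lam / var_z
    return var_x, var_v


t = 0.5
for lam in (1.0, 1e-2, 1e-4, 1e-6):
    a_x, a_v = bayes_coefficients(lam, t)
    var_x, var_v = conditional_variances(lam, t)
    print(f"lam={lam:g}  a_x={a_x:.6f}  a_v={a_v:.6f}  "
          f"var_v/var_x={var_v / var_x:.3f}")
```

As `lam` shrinks, `a_x` approaches 0 while `a_v` approaches $-1/(1-t)$, and the variance ratio stays at $1/(1-t)^2$ for every `lam`, matching the relation stated above.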

Architecture and Training Setup

JLT is a Base-scale latent Transformer with 12 blocks, hidden dimension 768, 12 attention heads, and a 128-dimensional bottleneck patch embedding, totaling 130M parameters. It follows the JiT-B/16 configuration closely but replaces raw image patches with fixed FLUX.2 VAE latent tokens. Two patch-size variants are evaluated: JLT-B/1 and JLT-B/2, corresponding to VAE-grid patches of size 1 and 2. The matched velocity baselines, DiT-B/1 and DiT-B/2, use the same architecture and representation but predict $v = x - \epsilon$. All models are trained for 250K steps (200 epochs) with AdamW, a base learning rate of $5\times10^{-5}$ (scaled to $2\times10^{-4}$), and an effective batch size of 1024. To isolate the target effect, the implementation omits repeated in-context class-token concatenation and the auxiliary classification loss used in JiT. Class conditioning is otherwise standard.
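The quoted training numbers are mutually consistent, as a quick calculation shows. This sketch collects the values from the paragraph above in a hypothetical `TrainSetup` dataclass (the field names are illustrative, not the authors' configuration keys). It assumes the learning rate is scaled linearly relative to a reference batch of 256, which the text does not state explicitly, and uses the ImageNet-1k training set size of 1,281,167 images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Values quoted in the text; field names are illustrative."""
    depth: int = 12
    hidden_dim: int = 768
    num_heads: int = 12
    bottleneck_dim: int = 128
    steps: int = 250_000
    batch_size: int = 1024
    base_lr: float = 5e-5


IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k training set size
REFERENCE_BATCH = 256              # assumption: linear scaling rule, reference batch 256

cfg = TrainSetup()
head_dim = cfg.hidden_dim // cfg.num_heads
epochs = cfg.steps * cfg.batch_size / IMAGENET_TRAIN_IMAGES
scaled_lr = cfg.base_lr * cfg.batch_size / REFERENCE_BATCH

print(f"head_dim={head_dim}, epochs={epochs:.1f}, scaled_lr={scaled_lr:.1e}")
```

Under these assumptions, 250K steps at batch size 1024 comes to just under 200 passes over the training set, and the scaled learning rate matches the quoted $2\times10^{-4}$.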

Matched Target Ablation: Clean vs. Velocity

The core experiment fixes the representation, Transformer scale, and training settings, varying only the direct prediction target. Results on ImageNet $256\times256$ are shown in Table 1 and Figure 2.

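| Model | Target | Patch | FID-50K |
| --- | --- | --- | --- |
| DiT-B/1 | velocity | /1 | 6.56 |
| JLT-B/1 | clean | /1 | 2.56 |
| DiT-B/2 | velocity | /2 | 28.71 |
| JLT-B/2 | clean | /2 | 14.81 |
| JLT-B/1 (guided) | clean | /1 | 2.50 |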

Table 1: Matched latent target ablation on ImageNet $256\times256$. Clean-latent prediction dominates velocity prediction at both patch sizes.


Figure 2: Training curves for the matched target ablation. Clean-latent variants maintain lower FID and higher Inception Score throughout training.

Clean-latent prediction improves FID from 6.56 to 2.56 at patch /1, and from 28.71 to 14.81 at patch /2. The training curves show that the clean-latent model enters the low-FID regime earlier and sustains a clear margin. With classifier-free guidance, JLT-B/1 reaches an FID of 2.50. The advantage is not tied to a specific patch size, confirming that the target geometry itself drives the improvement.
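To put the two gaps on a common scale, the relative FID reduction can be computed from the unguided Table 1 values. This is a small illustrative calculation; the helper name `relative_fid_reduction` is an assumption, and relative reduction is one convenient summary, not a metric reported in the paper.

```python
def relative_fid_reduction(fid_velocity, fid_clean):
    """Fraction by which FID drops when switching from velocity to clean."""
    return (fid_velocity - fid_clean) / fid_velocity


# (velocity FID, clean FID) per patch size, unguided rows of Table 1.
table1 = {"/1": (6.56, 2.56), "/2": (28.71, 14.81)}

reductions = {
    patch: relative_fid_reduction(fid_v, fid_x)
    for patch, (fid_v, fid_x) in table1.items()
}

for patch, r in reductions.items():
    print(f"patch {patch}: {100 * r:.1f}% lower FID with clean-latent prediction")
```

The reduction is about 61% at patch /1 and about 48% at patch /2, so the gain is large at both patch sizes even though the absolute FID levels differ.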

Comparison with Representative Baselines

Table 2 places the guided JLT-B/1 result alongside established ImageNet $256\times256$ models. JLT is a 130M latent model trained for only 250K steps, yet it achieves competitive FID.

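| Model | FID-50K | IS | Train |
| --- | --- | --- | --- |
| ADM (guided) | 3.94 | 215.3 | 1000K |
| LDM-4 | 3.60 | 247.7 | 178K |
| DiT-XL/2 | 2.27 | 278.2 | 7M |
| SiT-XL/2 | 2.06 | 270.3 | 7M |
| JiT-B/16 | 2.09 | 282.3 | 800K |
| JiT-B/32 | 2.28 | 278.4 | 800K |
| JLT-B/1 (ours) | 2.50 | | 250K |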

Table 2: Guided ImageNet $256\times256$ comparison. JLT-B/1 is a smaller-scale model trained for fewer steps, yet its FID is close to that of much larger systems.

While XL-scale or representation-aligned models achieve lower FID, they alter multiple factors simultaneously. The comparison contextualizes the magnitude of the clean-prediction benefit without claiming a new state-of-the-art. The key takeaway is that a simple target change inside a fixed latent space yields substantial gains, even at modest scale.

Conclusion and Implications

This study demonstrates that clean-latent prediction consistently outperforms velocity prediction when the representation, architecture, and training are held constant. The local Gaussian analysis provides a mechanism: velocity prediction adds an isotropic covariance floor and amplifies low-variance latent directions, while clean prediction attenuates them. These findings reframe target parameterization in latent diffusion as a geometric modeling choice, not an algebraic detail. The result is not explained by latent compression alone, because the gap appears inside the same latent space. Limitations include the focus on ImageNet $256\times256$ and a 130M configuration; future work should validate the mechanism across tokenizers and datasets. The paper encourages practitioners to treat the direct prediction target as a first-class design dimension, even in compressed latent spaces.
