Training

Discover how LeJEPA achieves linear identifiability and why a Gaussian latent distribution is crucial for perfect recovery of underlying AI world models.

Why Gaussianity is Key to Identifiable World Models in AI

Explore the "if and only if" theorem behind LeJEPA's success in representation learning. Understand the role of Gaussian distributions, alignment, and regularization in achieving linear identifiability in AI's quest for robust world models.

Shanghai-based AI firm, backed by Tencent and Alibaba, details M2's MoE architecture and "interleaved thinking," while previewing M3's significant performance gains for ultra-long contexts.

MiniMax Unveils M2 Series, Teases M3 with 9.7x Speedup via Sparse Attention

MiniMax releases a technical report on its M2 model series, featuring a sparse Mixture-of-Experts backbone and innovative "interleaved thinking." The report also previews the upcoming M3 model, which achieves a 9.7x prefilling speedup with MiniMax Sparse Attention (MSA) for 1-million-token sequences, pushing AI efficiency boundaries.

Explore the Diffusion Transformer with Flow Matching that powers high-fidelity 48 kHz audio generation from natural language.

How MOSS-SoundEffect v2.0 Revolutionizes Text-to-Audio Synthesis

Discover MOSS-SoundEffect v2.0, a cutting-edge text-to-audio model using a 1.3B-parameter Diffusion Transformer and Flow Matching for superior sound generation. Learn about its capabilities, multilingual support, and optimal settings for creating diverse audio content.

Introducing SkillOpt, a novel framework that treats natural-language skill documents as trainable states for domain adaptation in large language models, enabling automated procedural improvement without modifying model weights.

SkillOpt: Optimizing LLM Behavior with Trainable Skill Documents

SkillOpt optimizes large language model behavior by iteratively refining natural-language "skill documents" through a propose-and-test loop. It uses an optimizer model to suggest edits, applies them under a bounded textual learning rate, and validates improvements, ensuring robust and portable domain adaptation for even closed-source frontier models.

New hybrid models leverage offline consolidation, inspired by biological sleep, to overcome attention cache limitations in long-horizon tasks.

LLMs Learn to "Sleep" for Deeper Reasoning

This article explores how "LLM sleep," an offline consolidation phase, allows hybrid attention-SSM models to improve deep reasoning by iteratively refining fast-weight memories. Inspired by hippocampal replay, this method addresses the computational bottleneck of context eviction, enhancing performance on complex sequential tasks without increasing prediction-time cost.

Models are no longer bounded by single-call context windows; SkyRL's infrastructure enables execution-driven meta-reasoning via stateful child agents.

The Recursion Ceiling is a Myth: NovaSky Unleashes Recursive Language Models

Discover how NovaSky's SkyRL framework shatters the limitations of large language models. By spawning recursive child agents within persistent Python sandboxes, models can now reason in multi-turn, multi-agent trees, redefining what "thinking" means for AI.

Elon Musk confirms the 1.5T-parameter model, triple the size of its predecessor, has entered fine-tuning ahead of a public launch within weeks, with enhanced coding capabilities.

xAI Completes Grok V9-Medium Training, June Release Expected

xAI has finished training its Grok V9-Medium foundation model, a 1.5-trillion-parameter system with significant improvements over its predecessor, v8-small. The model, which heavily emphasizes coding tasks through Cursor data, is now undergoing fine-tuning and reinforcement learning, with a public release anticipated in early to mid-June 2026.

Subterranean compilation eliminates the orchestrator at runtime, slashing costs and latency while matching frontier accuracy.

How to Compile Multi-Step AI Workflows Directly into Small Models

Discover how synthetic data and full-parameter fine-tuning can internalize complex procedures in a small LLM, removing the need for external orchestration and delivering dramatic cost savings.

Full fine-tune family based on Alibaba's Z-Image S3-DiT, with variants for quality, speed, and low VRAM.

Z-Anime: Full Anime Fine-Tune on Z-Image Base

Z-Anime is a full fine-tune of the Z-Image Base architecture, not a LoRA merge. It provides anime-style generation with natural language prompting, high diversity, and multiple variants including Base, Distill-8-Step, Distill-4-Step, GGUF, and AIO. It supports 8GB VRAM and includes the VAE and text encoder.

Enhanced lighting, sharper focus, natural skin texture, and improved anatomy for cinematic image generation.

Juggernaut Z V1: Cinematic Fine-Tune of Z-Image Base

Juggernaut Z V1 is a cinematic fine-tune of Z-Image Base, trained by KandooAI and released by RunDiffusion. It features dramatic lighting, sharper focus, natural skin, improved anatomy, and better ethnic diversity out of the box. Available in FP16, FP8, and GGUF formats for Diffusers and other workflows.

A 2.6B-parameter diffusion transformer synthesizing 720p video with 6-DoF camera control, hybrid linear attention, and two-stage refinement.

SANA-WM: Open-Source Bidirectional World Model for Minute-Long Video

SANA-WM is an efficient open-source world model trained for one-minute video generation. It uses a bidirectional image-to-video diffusion transformer with hybrid linear attention, dual-branch camera control, and a two-stage pipeline. It runs in under 8GB of VRAM and generates 60-second 720p clips in 34 seconds on a single RTX 5090.

End-to-end training and inference system using NVFP4 quantization, Balanced SP, and multi-shot attention sink for real-time, long, interactive video generation.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

LongLive-2.0 presents the first end-to-end NVFP4 system for long video generation. It introduces Balanced Sequence Parallelism (SP) and NVFP4 quantization to accelerate training and inference. On Blackwell GPUs, W4A4 inference and quantized KV cache reduce memory and boost throughput. A clean training pipeline directly fine-tunes diffusion models into autoregressive models with standalone LoRA for real-time generation. Multi-shot attention sink enables stable streaming. Experiments show up to 2.15× training speedup and 1.84× inference speedup, achieving 45.7 FPS at 5B parameters.