MiniMax Publishes M2 Series Report and Teases M3 with Sparse Attention
On May 27, 2026, MiniMax released a technical report detailing its M2 model series: M2, M2.5, and M2.7. The Shanghai-based AI firm, backed by Tencent, Alibaba, and miHoYo, also previewed its upcoming M3 model. AI Engineering Lead Skyler Miao stated that M3 is entering its final preparation stage. The new model introduces MiniMax Sparse Attention (MSA), a custom sparse mechanism designed to slash the computational load for ultra-long contexts. Preliminary hardware profiling at 1-million-token sequences shows a 9.7× speedup in prefill latency and a 15.6× boost in decoding speed compared to the full-attention M2. The M2 series itself brings interleaved thinking, a scalable reinforcement learning system called Forge, and autonomous engineering milestones inside the company. The report arrives as the AI industry shifts toward efficiency-focused architectures.
M2's Sparse Mixture-of-Experts Backbone
The M2 series is built on a sparse Mixture-of-Experts (MoE) decoder-only Transformer. The foundational backbone contains 229.9 billion total parameters but activates only 9.8 billion per token, distributed across 256 fine-grained experts. Expert routing uses sigmoid gating combined with learnable, expert-specific bias terms. This design reduces reliance on restrictive auxiliary losses, letting the model scale efficiently while maintaining a manageable per-token compute budget.
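As a rough illustration of sigmoid gating with per-expert bias, here is a minimal sketch in plain Python. It assumes the bias only influences which experts are selected, while the mixing weights come from the unbiased sigmoid scores. The article does not state this detail, nor the number of experts active per token, so `route_token`, `top_k`, and the toy values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits, bias, top_k):
    """Select top_k experts for one token and return {expert_index: weight}.

    Assumption (not confirmed by the report): the bias shifts the selection
    ranking only; the returned weights are normalized unbiased sigmoid scores.
    """
    scores = [sigmoid(z) for z in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    ranked = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


# Toy setup: 4 experts, 2 active per token (M2 is reported to use 256 experts).
logits = [2.0, 1.0, 0.5, -1.0]
no_bias = route_token(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
# Raising the bias of expert 2 pulls it into the selected set without
# any auxiliary balancing loss being involved.
with_bias = route_token(logits, [0.0, 0.0, 0.5, 0.0], top_k=2)

# Share of parameters active per token, from the figures quoted above.
active_fraction = 9.8 / 229.9

print(sorted(no_bias), sorted(with_bias), round(active_fraction, 4))
```

In this toy case the bias changes which experts fire but not how their outputs are weighted, which is one way a router can be nudged toward balanced load without an extra loss term.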

Why Full Attention Survived the Sub-quadratic Rejection
MiniMax explored sub-quadratic attention alternatives, namely Lightning Attention and hybrid Sliding Window Attention (SWA), but chose to keep full multi-head attention with Grouped Query Attention (GQA) across all 62 layers. On the RULER 128K complex word extraction task, SWA variants dropped from a baseline score of 90.0 to 72.0 when context exceeded 32,000 tokens. Sub-quadratic methods also hit memory-bound constraints during training, lacked native prefix caching support, and could not integrate cleanly with Multi-Token Prediction (MTP) modules for speculative decoding. Retaining quadratic attention preserved multi-hop reasoning capability.
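The trade-off between full and sliding-window attention can be sketched with a single-layer visibility check and a count of scored query-key pairs. This is a simplified illustration, not MiniMax's implementation: the 4,096-token window is a hypothetical value, and real hybrid SWA stacks can pass information across layers, so the sketch shows only the direct-access limitation.

```python
from typing import Optional


def can_attend(query_pos: int, key_pos: int, window: Optional[int] = None) -> bool:
    """Causal visibility for one layer; window=None means full attention."""
    if key_pos > query_pos:
        return False
    if window is None:
        return True
    return query_pos - key_pos < window


def full_pairs(n: int) -> int:
    """Query-key pairs scored by causal full attention over n tokens."""
    return n * (n + 1) // 2


def swa_pairs(n: int, window: int) -> int:
    """Query-key pairs scored by causal sliding-window attention."""
    if n <= window:
        return full_pairs(n)
    return full_pairs(window) + (n - window) * window


context = 128 * 1024   # a 128K-token context, as in the RULER setting
window = 4096          # hypothetical window size, not taken from the report
fact_pos = 100         # a fact placed near the start of the context
last = context - 1     # the final query position

print("full attention sees fact:", can_attend(last, fact_pos))
print("sliding window sees fact:", can_attend(last, fact_pos, window))
print("pair ratio full/SWA:", full_pairs(context) / swa_pairs(context, window))
```

The pair count shows where the compute savings of a window come from, while the visibility check shows what a single windowed layer gives up: direct access to tokens that sit far behind the query.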
Interleaved Thinking and the Forge Reinforcement Learning System
M2 introduced an "interleaved thinking" protocol: the model alternates between natural-language planning traces and explicit tool invocations, appending chain-of-thought blocks directly into the conversation history. This prevents state drift and enables recovery from runtime errors. To train long-horizon agent workflows, MiniMax built Forge, a scalable reinforcement learning system that splits execution into agent, middleware (Gateway Server and Data Pool), and training/inference engines. Two innovations manage trajectory-length variance:
- Windowed FIFO Scheduling maintains distributional stability by operating a sliding window over the generation queue.
- Prefix Tree Merging reuses shared conversation prefixes during batch training, yielding up to a 40× speedup with zero approximation error.
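The idea behind prefix merging can be sketched with a small trie that counts how many token positions remain once shared prefixes are stored only once. This is an illustrative sketch, not Forge's code: `merged_token_count` and the toy trajectories are hypothetical, and a real trainer would also need attention masks so that each branch sees only its own path. Because a token's forward computation under causal attention depends only on the tokens before it, sharing a common prefix does not change the result, which is consistent with the "zero approximation error" claim.

```python
class TrieNode:
    def __init__(self):
        self.children = {}


def merged_token_count(sequences):
    """Count distinct trie nodes, i.e. token positions left after merging."""
    root = TrieNode()
    count = 0
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node.children:
                node.children[tok] = TrieNode()
                count += 1
            node = node.children[tok]
    return count


# Three sampled trajectories that share a six-token conversation prefix.
prefix = ["sys", "user", "plan", "call", "result", "plan2"]
samples = [
    prefix + ["a1", "a2"],
    prefix + ["b1", "b2"],
    prefix + ["c1", "c2"],
]

naive = sum(len(s) for s in samples)   # every sequence processed separately
merged = merged_token_count(samples)   # shared prefix processed once

print(naive, merged, naive / merged)
```

The savings grow with the length of the shared prefix and the number of branches, so long multi-turn agent trajectories sampled from the same context benefit the most.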
Forge directly produced the M2.7 checkpoint.
M2.5 and M2.7: Autonomous Engineering at MiniMax
M2.5 completed 30% of internal tasks and accounted for 80% of newly committed code at MiniMax headquarters. M2.7 advanced further, acting as an independent machine learning engineer inside an automated harness. It profiles its own training runs, diagnoses anomalies, reads logs, and modifies its codebase and configurations. MiniMax reports that M2.7 handled between 30% and 50% of its own development workflow. On OpenAI's MLE Bench Lite, which tests autonomous ML research, M2.7 achieved a 66.6% medal rate across independent 24-hour trials, tying the closed-weight Gemini 3.1 Pro from Google.
M3 Teaser: MiniMax Sparse Attention (MSA) and Efficiency Gains
MSA is described as a GQA-driven dynamic block selection mechanism. An Index Branch rapidly scans the full context to identify key tokens, then routes them to a S



