home›LLMs›

MiniMax Unveils M2 Series, Teases M3 with 9.7x Speedup via Sparse Attention

Shanghai-based AI firm, backed by Tencent and Alibaba, details M2's MoE architecture and "interleaved thinking," while previewing M3's significant performance gains for ultra-long contexts.

May 28, 2026

#Agents #Context #LLM #Reinforcement Learning #Training

MiniMax releases a technical report on its M2 model series, featuring a sparse Mixture-of-Experts backbone and innovative "interleaved thinking." The report also previews the upcoming M3 model, which achieves a 9.7x prefilling speedup with MiniMax Sparse Attention (MSA) for 1-million-token sequences, pushing AI efficiency boundaries.

MiniMax Publishes M2 Series Report and Teases M3 with Sparse Attention

On May 27, 2026, MiniMax released a technical report detailing its M2 model series — M2, M2.5, and M2.7. The Shanghai-based AI firm, backed by Tencent, Alibaba, and miHoYo, also previewed its upcoming M3 model. AI Engineering Lead Skyler Miao stated that M3 is entering its final preparation stage. The new model introduces MiniMax Sparse Attention (MSA), a custom sparse mechanism designed to slash the computational load for ultra-long contexts. Preliminary hardware profiling at 1‑million‑token sequences shows a 9.7× speedup in prefilling latency and a 15.6× boost in decoding generation speed compared to the full‑attention M2. The M2 series itself brings interleaved thinking, a scalable reinforcement learning system called Forge, and autonomous engineering milestones inside the company. The report arrives as the AI industry shifts toward efficiency‑focused architectures.

M2’s Sparse Mixture‑of‑Experts Backbone

The M2 series is built on a sparse Mixture‑of‑Experts (MoE) decoder‑only Transformer. The foundational backbone contains 229.9 billion total parameters but activates only 9.8 billion per token, distributed across 256 fine‑grained experts. Expert routing uses sigmoid gating combined with learnable, expert‑specific bias terms. This design reduces reliance on restrictive auxiliary losses, letting the model scale efficiently while maintaining a manageable per‑token compute budget.

A vast, dark neural landscape of interlocking geometric shards, each glowing with a faint, intricate network of blue and gold threads. In the center, a single, brilliant crystalline node pulses with focused light, while countless other shards around it remain dim and dormant. The scene evokes a sense of immense scale and selective activation, with deep shadows and luminous highlights suggesting efficient, sparse computation.

Why Full Attention Survived the Sub‑quadratic Rejection

MiniMax explored sub‑quadratic attention alternatives — Lightning Attention and hybrid Sliding Window Attention (SWA) — but chose to keep full multi‑head attention with Grouped Query Attention (GQA) across all 62 layers. On the RULER 128K complex word extraction task, SWA variants dropped from a baseline score of 90.0 to 72.0 when context exceeded 32,000 tokens. Sub‑quadratic methods also hit memory‑bound constraints during training, lacked native prefix caching support, and could not integrate cleanly with Multi‑Token Prediction (MTP) modules for speculative decoding. Retaining quadratic attention preserved multi‑hop reasoning capability.

Interleaved Thinking and the Forge Reinforcement Learning System

M2 introduced an “interleaved thinking” protocol: the model alternates between natural‑language planning traces and explicit tool invocations, appending chain‑of‑thought blocks directly into the conversation history. This prevents state drift and enables recovery from runtime errors. To train long‑horizon agent workflows, MiniMax built Forge — a scalable reinforcement learning system that splits execution into agent, middleware (Gateway Server and Data Pool), and training/inference engines. Two innovations manage trajectory‑length variance:

Windowed FIFO Scheduling maintains distributional stability by operating a sliding window over the generation queue.
Prefix Tree Merging reuses shared conversation prefixes during batch training, yielding up to a 40× speedup with zero approximation error.

Forge directly produced the M2.7 checkpoint.

M2.5 and M2.7: Autonomous Engineering at MiniMax

M2.5 completed 30% of internal tasks and 80% of newly committed code at MiniMax headquarters. M2.7 advanced further, acting as an independent machine learning engineer inside an automated harness. It profiles its own training runs, diagnoses anomalies, reads logs, and modifies its codebase and configurations. MiniMax reports that M2.7 handled between 30% and 50% of its own development workflow. On OpenAI’s MLE Bench Lite, which tests autonomous ML research, M2.7 achieved a 66.6% medal rate across independent 24‑hour trials — tying the closed‑weight Gemini 3.1 Pro from Google.

M3 Teaser: MiniMax Sparse Attention (MSA) and Efficiency Gains

MSA is described as a GQA‑driven dynamic block selection mechanism. An Index Branch rapidly scans the full context to identify key tokens, then routes them to a S