Gemma 4 MTP Fails to Deliver Speed Gains on Top GPUs

Community benchmarks show MTP slower or roughly equal on an RTX 5090 + RTX 4090 pair, a 7900 XTX, and dual RTX 3080s; only a mixed VRAM/CPU setup sees a clear boost.

May 21, 2026


Reddit users tested the work-in-progress Gemma 4 MTP model. Most high-end GPU configurations saw equal or worse performance compared to non-MTP inference. Only a mixed VRAM/CPU setup showed a significant speedup. At least one tester also reported instability. The community anticipates further optimizations.

The much-anticipated release of Gemma 4 MTP (Multi-Token Prediction) has arrived — but the community’s first testing chronicle paints a picture of promise tempered by inconsistency.

While speculative-decoding enthusiasts hoped for a leap in inference speed, early benchmarks from a dozen Reddit testers reveal a sobering reality: MTP is not yet the silver bullet for high-end hardware, and its real value may lie in niche, memory-constrained environments.

As we sift through the data, a clear thesis emerges: MTP’s performance is highly dependent on hardware configuration, model size, and use case, and it currently fails to justify its overhead on most consumer GPUs. However, the technology hints at a future where predictive preloading could transform how we run large Mixture-of-Experts (MoE) models.


Hardware Showdown: Where MTP Worked — and Where It Didn’t

Four testers provided detailed performance numbers spanning AMD and NVIDIA GPUs. The results are anything but uniform.

User	GPU(s)	Model & Quant	Without MTP (tok/s)	With MTP (tok/s)	Notes
nickm_27	AMD Radeon 7900 XTX	Gemma 26B-A4B	120	100–130 (varies by task)	“still not good enough to justify” MTP
EveningIncrease7579	Dual NVIDIA RTX 3080 20GB	Q8 31B	20	10	“seems instable”
SBoots	RTX 5090 (32GB) + RTX 4090 (24GB)	Gemma (52-token prompt)	32.17	28.81	Draft acceptance rate: 0.55447
DragonfruitIll660	Mixed VRAM/CPU (half VRAM half CPU)	Gemma 31B	1.8	3.5–4.5	Clear speedup
DragonfruitIll660	Full VRAM (Q2KL on RTX 3080 mobile)	Gemma 31B	20	~25	Smaller boost

“With MTP: 10 t/s … seems instable.” — EveningIncrease7579

The clearest win came from DragonfruitIll660’s mixed VRAM/CPU configuration, where MTP raised throughput from 1.8 to 3.5–4.5 tok/s, roughly a 1.9x to 2.5x speedup. The same tester’s full-VRAM run gained only about 25% (20 to ~25 tok/s). This is the most compelling evidence that MTP’s strength lies in offload-heavy scenarios.
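To compare the rows on one scale, the reported figures can be turned into speedup ratios (MTP throughput divided by baseline throughput). The short Python sketch below does only that arithmetic on the numbers from the table; the labels and the `speedup_range` helper are illustrative names, and the "~25" entry is treated as exactly 25.

```python
# Reported throughput in tok/s: (label, without MTP, (with MTP low, with MTP high)).
# Single values are stored as a range with equal ends.
results = [
    ("nickm_27, 7900 XTX", 120.0, (100.0, 130.0)),
    ("EveningIncrease7579, dual RTX 3080", 20.0, (10.0, 10.0)),
    ("SBoots, RTX 5090 + RTX 4090", 32.17, (28.81, 28.81)),
    ("DragonfruitIll660, mixed VRAM/CPU", 1.8, (3.5, 4.5)),
    ("DragonfruitIll660, full VRAM", 20.0, (25.0, 25.0)),
]


def speedup_range(baseline, with_mtp):
    """Return (low, high) speedup ratios; values below 1.0 mean MTP was slower."""
    low, high = with_mtp
    return (low / baseline, high / baseline)


for label, baseline, with_mtp in results:
    low, high = speedup_range(baseline, with_mtp)
    if low == high:
        print(f"{label}: {low:.2f}x")
    else:
        print(f"{label}: {low:.2f}x to {high:.2f}x")
```

A ratio below 1.0 marks a slowdown, so the dual RTX 3080 result (0.5x) is the worst case and the mixed VRAM/CPU result is the only one that clears 1.5x.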

Patterns Behind the Inconsistency

Several recurring themes explain why MTP failed to impress on powerful setups.

Draft Acceptance Rate

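SBoots reported a draft acceptance rate of about 55% (0.55447), meaning only a little over half of the drafted tokens were accepted by the main model. Every rejected draft token is wasted work: it was generated and verified, then discarded. With acceptance this low, the drafting and verification overhead can cancel out any speed advantage. For MTP to pay off on fast GPUs, acceptance rates likely need to be considerably higher; 70–80% is a rough estimate rather than a measured threshold.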
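The link between acceptance rate and speed can be sketched with the standard back-of-the-envelope model for speculative decoding. It is a simplification, not a description of how the Gemma 4 MTP implementation works: it assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that `draft_cost` and `verify_overhead` (both hypothetical parameters, expressed as fractions of one main-model forward pass) capture all extra work.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the gamma drafted tokens is accepted independently with
    probability alpha, and that the main model always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float,
                      verify_overhead: float = 0.0) -> float:
    """Estimated speedup over plain decoding under the assumptions above.

    draft_cost and verify_overhead are relative to one main-model pass.
    """
    step_cost = 1.0 + verify_overhead + gamma * draft_cost
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Illustrative settings only: 3 drafted tokens, draft head at 10% of a full pass.
for alpha in (0.55, 0.70, 0.80):
    print(f"alpha={alpha:.2f}: "
          f"{expected_tokens_per_step(alpha, 3):.2f} tokens/step, "
          f"{estimated_speedup(alpha, 3, 0.1):.2f}x estimated speedup")
```

Under these idealized assumptions, an acceptance rate of 0.55 with three drafted tokens yields about 2.02 tokens per step and an estimated speedup of about 1.55x. That the testers measured slowdowns instead suggests the real cost per step is well above this model's; for example, setting `verify_overhead=1.0` pushes the same case below 1.0x.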

Instability and Optimization Flux

EveningIncrease7579 described the dual RTX 3080 setup as unstable. Meanwhile, nickm_27 tested again after “latest mtp optimizations” were merged — while MTP improved, it still could not beat the non-MTP baseline on the 7900 XTX.

Use-Case Dependency

DragonfruitIll660 noted that casual chat, the task tested by most users, is “likely one of the weaker spots for MTP.” This suggests MTP may do better in specific workloads (e.g., batch generation, structured outputs) where the next tokens are easier to predict.

“I think casual chat is likely one of the weaker spots for MTP.” — DragonfruitIll660

The Counterargument: What MTP Could Enable

Despite the lackluster benchmarks, several community voices argued that MTP’s potential should not be dismissed.

rog-uk highlighted a fascinating possibility: “predictive expert preloading” for MoE models. If MTP can anticipate which experts will be needed next, it could preload them into VRAM, allowing consumer GPUs (8–16 GB) to run models that ordinarily require 24 GB or more. The key assumption is high expert reuse and small expert sizes.
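The preloading idea can be illustrated with a toy simulation. This is a sketch of the concept as rog-uk describes it, not of any existing implementation: `ExpertCache`, `run_trace`, the four-expert capacity, and the routing trace are all hypothetical, and the cache simply models "experts resident in VRAM" with least-recently-used eviction.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache standing in for the experts currently resident in VRAM."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _load(self, expert_id):
        """Make an expert resident, evicting the least recently used if full."""
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)
            return
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)
        self._resident[expert_id] = True

    def prefetch(self, predicted_ids):
        """Load predicted experts ahead of time; not counted as hits or misses."""
        for expert_id in predicted_ids:
            self._load(expert_id)

    def use(self, expert_id):
        """Record a real access; a miss models a stall on a CPU-to-GPU transfer."""
        hit = expert_id in self._resident
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self._load(expert_id)
        return hit


def run_trace(trace, capacity, predictions=None):
    """Replay per-token expert usage; optionally prefetch before each token."""
    cache = ExpertCache(capacity)
    for step, experts in enumerate(trace):
        if predictions is not None:
            cache.prefetch(predictions[step])
        for expert_id in experts:
            cache.use(expert_id)
    return cache.hits, cache.misses


# Hypothetical routing trace: the experts each of four tokens activates.
trace = [[0, 1], [2, 3], [0, 4], [5, 1]]
print("no prefetch (hits, misses):", run_trace(trace, capacity=4))
print("perfect prefetch (hits, misses):", run_trace(trace, capacity=4, predictions=trace))
```

With perfect predictions every access becomes a hit, which is the best case. In practice the benefit would depend on how accurately the next experts can be predicted and on whether the transfer finishes before the expert is needed, neither of which this toy models.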

scheurneus echoed this, questioning whether MTP weights — which are relatively small — could fit entirely in VRAM on 8 GB cards, reducing CPU-GPU transfers and thus improving latency.

PromptInjection_ predicted that MTP will eventually be “blazing fast and great for agentic usage,” where iterative generation and planning benefit from lookahead.

These viewpoints are a reminder that the current tests measure only raw throughput. The real value of MTP may lie not in raw speed on high-end hardware but in enabling larger models on modest setups. DragonfruitIll660’s mixed VRAM/CPU results are an early data point for exactly that scenario.

A Call for Continued Experimentation

The Gemma 4 MTP release is a work in progress — the original poster warned as much: “a work in progress so you have to compile it yourself, and you shouldn’t expect it to work.”

Yet the community has already uncovered a critical insight: MTP is not universally faster, but it is not universally slower either. It offers a lifeline for those running models partially offloaded to CPU, and it whispers of a future where predictive token generation unlocks new architectures.

We urge developers to focus optimization efforts on two fronts:

Boosting draft acceptance rates through better model alignment or adaptive thresholds.
Targeting memory-constrained hardware where a gain of even a few tok/s (1.8 → 3.5–4.5 tok/s) transforms usability.

MTP may not be ready for the flagship GPUs of today. But as MoE models grow and consumer VRAM stagnates, techniques like predictive expert preloading could shift the landscape. The testing chronicle is still being written — and the next chapter depends on the community’s willingness to iterate.

“Downloading now, i have been waiting for this.” — wgaca2

So have we. Let’s keep testing, keep compiling, and keep refining.