What ByteShape's Qwen 3.6 35B Quants Reveal About Model Optimization

Insights from NTP and MTP variants, benchmarking across GPUs and CPUs, and community reports on speed, quality, and memory trade-offs.

ByteShape released GGUF quantizations of Qwen 3.6 35B-A3B in NTP and MTP variants. This article covers why lower bpw isn't always optimal, how MTP boosts GPU generation speed by 20–40%, and why MMLU was excluded, along with community benchmarks and hardware-specific recommendations.

A New Benchmark for Local LLM Quantization

ByteShape has released GGUF quantizations of the Qwen 3.6 35B-A3B model for local deployment. The release comes in two families: standard NTP (Next Token Prediction) and MTP (Multi-Token Prediction). Both are available on Hugging Face, and a detailed blog post covers methodology and results.

Blog: byteshape.com/blogs/Qwen3.6-35B-A3B/
NTP models: huggingface.co/byteshape/Qwen3.6-35B-A3B-GGUF
MTP models: huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF

This release includes direct comparisons with quants from Bartowski, Unsloth, Mudler, and AesSedai, giving the community a comprehensive reference for choosing the right quantization for their hardware.

Figure: Token generation speed (tok/s) by quantization, NTP vs. MTP, across GPU models (RTX 4090, RTX 5090, RTX 4080, RTX 5060 Ti, Pro 6000), annotated with the 20–40% speedup for MTP variants on GPUs.

Understanding NTP and MTP

NTP (Next Token Prediction) is the traditional approach: the model predicts one token at a time. Standard quantizations of Qwen 3.6 35B-A3B follow this pattern. The largest NTP variant (GPU-5, 4.15 bpw) often holds up against smaller quantizations on both quality and speed, hence ByteShape's recommendation for NTP: “pick the largest quant that fits.”
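The "largest quant that fits" rule can be sketched as a small selection routine. This is a minimal illustration, not ByteShape's tooling: the parameter count (35e9), the size formula (parameters × bpw / 8, ignoring file metadata and any tensors kept at higher precision), the headroom figure, and every catalog entry except GPU-5's 4.15 bpw are assumptions made for the example.

```python
from typing import Optional

PARAMS = 35e9  # assumed total parameter count for a "35B" model


def approx_size_gb(bpw: float, params: float = PARAMS) -> float:
    """Rough weight size in GB (10^9 bytes): params * bits-per-weight / 8."""
    return params * bpw / 8 / 1e9


def largest_fitting(quants: dict[str, float], budget_gb: float,
                    headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-bpw quant whose weights fit in budget minus headroom.

    headroom_gb is a placeholder for KV cache and runtime buffers (assumed value).
    """
    usable = budget_gb - headroom_gb
    fitting = {name: bpw for name, bpw in quants.items()
               if approx_size_gb(bpw) <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Only the GPU-5 figure (4.15 bpw) comes from the article; the rest are hypothetical.
catalog = {"hypothetical-small": 3.0, "hypothetical-mid": 3.6, "GPU-5": 4.15}

print(f"GPU-5 weights: ~{approx_size_gb(4.15):.1f} GB")
print("24 GB budget:", largest_fitting(catalog, budget_gb=24.0))
print("16 GB budget:", largest_fitting(catalog, budget_gb=16.0))
```

In practice the headroom depends heavily on context length and KV cache type, so the real cutoff should be measured rather than estimated.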

MTP (Multi-Token Prediction) is a newer technique that predicts multiple tokens in a single forward pass. On GPUs, MTP can boost generation speed by 20–40%, depending on the workload. However, MTP increases runtime memory usage, so on a 16 GB GPU the largest MTP model may be impractical. ByteShape recommends the GPU-2 MTP variant for those devices.
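One common way to use extra predicted tokens is to treat them as a draft that the main model then verifies, keeping only the prefix it agrees with. The sketch below illustrates that general idea with greedy decoding over toy callables. It is an assumption about the scheme, not ByteShape's or llama.cpp's implementation, and `target_next`, `draft_next`, and `draft_max` are hypothetical names.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_greedy(target_next: NextFn, draft_next: NextFn,
                       prompt: List[Token], n_new: int,
                       draft_max: int = 3) -> List[Token]:
    """Generate n_new tokens; the output equals plain greedy decoding with target_next."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. Draft up to draft_max tokens cheaply.
        draft: List[Token] = []
        for _ in range(draft_max):
            draft.append(draft_next(seq + draft))
        # 2. Verify. A real engine checks all drafted positions in one batched
        #    forward pass; here the target is simply called position by position.
        for tok in draft:
            expected = target_next(seq)
            seq.append(expected)      # the target's token always wins
            if expected != tok or len(seq) >= goal:
                break                 # first disagreement discards the rest
        else:
            # Every drafted token was accepted: the pass yields one bonus token.
            if len(seq) < goal:
                seq.append(target_next(seq))
    return seq[len(prompt):]


# Toy models: the target counts upward mod 10; the draft is wrong right after a 4.
def target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(speculative_greedy(target, draft, [0], 8))
```

Because every emitted token is the one the target model would have chosen, drafting only changes how many tokens are produced per verification step, not which tokens come out.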

On CPUs, MTP is not attractive. Prompt processing on CPUs is already slow, and MTP further degrades it. The CPU recommendation remains NTP.

Why MMLU Was Excluded

ByteShape deliberately excluded MMLU from benchmarking. They discovered an answer-format compliance issue specific to Qwen 3.6 (not present in Qwen 3.5). Even the full-precision model sometimes failed to respond in the strict format expected by the benchmark, despite 5-shot prompts. Since this is a baseline-model behavior rather than a quantization artifact, MMLU would introduce noise into quantization comparisons.
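To see why format non-compliance pollutes a quantization comparison, consider a strict scorer that only credits responses matching an exact answer pattern. The pattern and sample responses below are hypothetical and are not the actual MMLU harness. The point is only that a correct but badly formatted answer is scored the same as a wrong one.

```python
import re
from typing import Optional

# Hypothetical strict format: the whole response must be "Answer: X", X in A-D.
STRICT = re.compile(r"Answer: ([ABCD])")


def parse_strict(response: str) -> Optional[str]:
    """Return the answer letter only if the response matches the format exactly."""
    m = STRICT.fullmatch(response.strip())
    return m.group(1) if m else None


def score(responses: list[str], gold: list[str]) -> float:
    """Accuracy under strict parsing; unparseable responses count as wrong."""
    hits = sum(parse_strict(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)


gold = ["B", "C", "A"]
responses = [
    "Answer: B",                 # compliant and correct
    "The correct option is C.",  # correct content, wrong format
    "Answer: D",                 # compliant but wrong
]
print(f"strict accuracy: {score(responses, gold):.2f}")
```

If the full-precision baseline already loses points this way, score differences between quants partly reflect formatting luck rather than quantization damage.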

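For reference, this is the llama-server command that Reddit user janvitos reported using for the MTP run described under Community Benchmarks:

llama-server \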
  --fit --fit-margin 1664 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 \
  --cache-type-v-draft q8_0 \
  --multi-token-prediction \
  --draft-p-min 0.75 \
  --draft-max 3 \
  --no-mmap \
  --mlock \
  --threads 8 \
  --temp 0.0

Community Benchmarks and Real-World Reports

Early adopters have shared results across a range of hardware configurations.

User janvitos (Reddit) ran ByteShape’s Qwen3.6-35B-A3B-IQ4_XS (4.19 bpw) MTP model on an RTX 4070 Super 12 GB using the llama-server command shown above. Across 9 requests, they achieved an average of 110.24 tok/s, with an aggregate accept rate of 0.8749. They noted that the --fit-margin flag may need tweaking depending on VRAM.

User Mooncast Productions (Twitter) tested on an RTX 2080, comparing MTP against speculative decoding using Beellama K/V. They found MXFP4 considerably faster, reporting a 41% performance boost even though the models were 3 GB larger.

User Ankit Prateek (Twitter) benchmarked Qwen 3.6 27B on a single RTX 5090 and reported that at extreme context scales, MTP roughly doubles generation throughput with no quality loss. Specific figures:

  • 128K context (Q4_K_XL): 40.8 → 83.1 tok/s (2.0×)
  • 250K context (Q5_K_XL): 25.3 → 54.2 tok/s (2.1×)

He emphasized that at temp=0, MTP holds a strict veto layer, so logical, code, and math accuracy remain 100% intact.
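The multipliers in the list follow directly from the before and after throughput figures. A quick check, using only the numbers quoted above (the dictionary layout and function name are just for illustration):

```python
# (baseline tok/s, MTP tok/s) as reported for Qwen 3.6 27B on a single RTX 5090
reported = {
    "128K context (Q4_K_XL)": (40.8, 83.1),
    "250K context (Q5_K_XL)": (25.3, 54.2),
}


def speedup(before: float, after: float) -> float:
    """Throughput ratio: how many times faster 'after' is than 'before'."""
    return after / before


for label, (before, after) in reported.items():
    ratio = speedup(before, after)
    print(f"{label}: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% faster)")
```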

User Fahd Mirza (Twitter) reported that Qwen3.6 27B reached 56 tok/s with MTP and ngram-mod stacked, up from 22 tok/s, using four extra flags and no extra model files. By his account, MTP handles creative generation (its prediction heads are baked into the weights) while ngram-mod handles repetition, and the two methods stack in a single server command.

The LM Studio developers announced that Multi-Token Prediction is now in beta in LM Studio, requiring version 0.4.14+3 and llama.cpp engine 2.15.0.

Key Takeaways for Practitioners

ByteShape’s release underscores a practical lesson: bpw (bits per weight) should not be minimized blindly. If a larger quantization fits within your memory and context budget, it may deliver better quality and speed, especially for NTP models.

For GPU users with sufficient VRAM, MTP offers a substantial speedup: typically 20–40% in ByteShape's results, and roughly 2× at extreme context lengths in one community report on Qwen 3.6 27B. The reports cited here describe no loss in output quality. CPU users should stick with NTP, as MTP worsens already slow prompt processing.

Finally, the community reports suggest that MTP is now a practical feature in mainline llama.cpp, with beta support in LM Studio. With careful tuning (e.g., --fit-margin, KV cache types), even a mid-range GPU like the RTX 4070 Super can exceed 100 tok/s on the 35B-A3B model. The era of high-speed local LLM inference is here.
