
Understanding Uncensored LLMs: A Deep Dive into Qwen3.5-35B-A3B-Heretic-V2

Explore the technical innovations, ethical considerations, and practical applications of uncensored large language models, focusing on a community-driven variant of Qwen3.5.

May 26, 2026

#Agents #Content Generation #Fine Tuning #LLM #Open Source

Learn about the architecture and capabilities of uncensored language models, specifically Qwen3.5-35B-A3B-Heretic-V2. Discover how multi-token prediction and various quantization formats enhance performance and accessibility, while understanding the implications of removing safety filters for research and development.

The Rise of Uncensored Language Models

Most large language models are aligned during post-training to refuse or deflect requests on certain sensitive or controversial topics.
An uncensored model has that refusal behavior removed or weakened, so it gives open-ended responses without built-in moralizing or refusals.
This approach is valued by researchers studying model behavior, by writers seeking creative freedom, and by developers who prefer to add their own safety layers externally.
Think of a standard model as a car with a speed limiter; an uncensored variant removes that limiter, handing full control to the driver.
Such models are not inherently dangerous, but they require responsible use, just as any powerful tool does.
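The idea of adding a safety layer externally can be illustrated with a minimal sketch. Everything here is hypothetical: `BLOCKED_TERMS`, `guarded_generate`, and the `echo_model` stand-in are illustrative names, and a real deployment would likely use a dedicated moderation model rather than a keyword list.

```python
from typing import Callable

# Hypothetical blocklist, for illustration only; a real deployment would
# more likely call a moderation classifier or policy engine.
BLOCKED_TERMS = ("example-banned-term",)

REFUSAL = "This request is outside the deployment's policy."


def violates_policy(text: str) -> bool:
    """Return True if the text contains any blocked term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_generate(generate: Callable[[str], str], prompt: str) -> str:
    """Wrap a raw text generator with checks on both the prompt and the reply."""
    if violates_policy(prompt):
        return REFUSAL
    reply = generate(prompt)
    if violates_policy(reply):
        return REFUSAL
    return reply


def echo_model(prompt: str) -> str:
    """Stand-in for a call to an uncensored model; it simply echoes the prompt."""
    return f"Echo: {prompt}"


print(guarded_generate(echo_model, "hello"))
```

The point of the design is that the policy lives in the application, where the developer can change it, instead of inside the model weights.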

Meet Qwen3.5-35B-A3B-Heretic-V2

The base model, Qwen3.5-35B-A3B, belongs to a new generation of efficient large language models that use a mixture-of-experts architecture.
The “A3B” suffix follows the Qwen naming convention for roughly 3 billion activated parameters: of the model's 35 billion total parameters, only about 3 billion are used for any given token, which makes inference faster while retaining strong reasoning.
The uncensored-heretic-v2 variant, released by community contributor llmfan46, strips away the standard alignment guardrails.
The “v2” indicates a refined uncensoring process, likely based on iterative feedback and improved training or ablation techniques.
This is a community-driven release, not an official Qwen product, and it showcases how open-weight models empower rapid customization.
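To make the sparse-activation idea concrete: if about 3 billion of 35 billion parameters are active per token, that is roughly 8.6% of the model. The sketch below shows the general top-k routing pattern used by mixture-of-experts layers. It is a toy under stated assumptions: the four scalar "experts", the router scores, and `k=2` are invented for illustration and do not reflect the actual Qwen3.5 configuration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_layer(x, experts, router_scores, k):
    """Run only the selected experts and mix their outputs by routing weight."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy experts: each one just scales its input (hypothetical, for illustration).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, -1.0, 2.0]  # would come from a learned router

selected = route_top_k(router_scores, k=2)
print("experts used:", sorted(selected), "of", len(experts))
print("layer output:", moe_layer(1.0, experts, router_scores, k=2))
```

Only the selected experts are evaluated, which is where the per-token compute savings come from; the unselected experts still have to be stored in memory.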

[Image: abstract illustration of sparsely activated nodes and a figure facing several branching paths, representing sparse activation and multi-token prediction]

Multi-Token Prediction, Kept Intact

A standout feature of the original Qwen3.5-35B-A3B is Native Multi-Token Prediction (MTP).
Instead of predicting only the next token at each step, MTP lets the model also anticipate several tokens further ahead.
This can boost generation speed, typically when the inference runtime treats those extra predictions as drafts that the main model then verifies, much like a chess player planning a few moves ahead instead of only the next one.
Uncensored fine-tunes can inadvertently break or discard such capabilities.
This release explicitly preserves native MTP, so you get the unrestricted model without sacrificing the architectural innovations that make it performant.
It’s a careful balance between freedom and efficiency.
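One common way runtimes turn multi-token predictions into a speedup is draft-and-verify (speculative) decoding. The toy below sketches that idea only; it is not Qwen's MTP implementation. `target_next` stands in for the main model, `draft_tokens` stands in for an MTP head that is deliberately wrong now and then, and all names are hypothetical.

```python
def target_next(context):
    """Toy 'main model': next token is the sum of the last two, mod 10."""
    return (context[-1] + context[-2]) % 10


def draft_tokens(context, n):
    """Toy 'MTP head': guesses n tokens ahead, with a deliberate error on 7."""
    ctx = list(context)
    guesses = []
    for _ in range(n):
        guess = (ctx[-1] + ctx[-2]) % 10
        if guess == 7:
            guess = 0  # deliberate mistake so verification has work to do
        guesses.append(guess)
        ctx.append(guess)
    return guesses


def generate_plain(prompt, num_tokens):
    """Baseline: one main-model step per generated token."""
    ctx = list(prompt)
    for _ in range(num_tokens):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]


def generate_with_drafts(prompt, num_tokens, n_draft=3):
    """Accept drafted tokens while they match the main model's choice.

    A real runtime checks all drafts in one batched forward pass; here the
    check is simulated token by token, and `steps` counts those passes.
    """
    ctx = list(prompt)
    target_len = len(prompt) + num_tokens
    steps = 0
    while len(ctx) < target_len:
        steps += 1
        for guess in draft_tokens(ctx, n_draft):
            correct = target_next(ctx)
            ctx.append(correct)  # always keep the main model's token
            if guess != correct or len(ctx) >= target_len:
                break  # discard the remaining drafts after a mismatch
    return ctx[len(prompt):], steps


tokens, steps = generate_with_drafts([1, 1], 12)
print(tokens == generate_plain([1, 1], 12), steps)
```

Because every kept token is the one the main model would have chosen, the output matches plain decoding; the gain is in needing fewer verification passes than tokens.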

A Model in Many Shapes: Formats and Quantization

To run on everything from cloud GPUs to personal laptops, the model is distributed in multiple formats:

Safetensors: The original, unquantized weights, typically stored at 16-bit precision. Ideal for further fine-tuning or high-accuracy inference when hardware permits.
GGUF quantizations: Compressed versions tailored for CPU and consumer-grade GPU inference with tools like llama.cpp. They trade some accuracy for large memory savings, and the loss grows as the bit width drops.
NVFP4: NVIDIA’s 4-bit floating-point format, natively accelerated on Blackwell GPUs. Earlier architectures such as Hopper lack native FP4 tensor-core support, so any benefit there depends on the runtime and is mostly the smaller memory footprint. On supported hardware it offers a new sweet spot between speed and precision.
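A rough way to choose between these formats is to estimate the memory needed for the weights alone. The sketch below uses the 35 billion total parameter count from the model name; the bits-per-weight values are approximate assumptions (quantized formats store per-block scale factors, so they cost slightly more than their nominal bit width), and actual file sizes vary by quantization recipe.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


TOTAL_PARAMS = 35e9  # total parameters, taken from the "35B" in the model name

# Assumed, approximate bits per weight; not measured from the actual files.
ASSUMED_BITS = {
    "16-bit safetensors": 16.0,
    "8-bit GGUF (approx.)": 8.5,
    "4-bit GGUF (approx.)": 4.5,
    "NVFP4 (approx.)": 4.5,
}

for name, bits in ASSUMED_BITS.items():
    print(f"{name:>22}: ~{weight_memory_gib(TOTAL_PARAMS, bits):.1f} GiB")
```

Note that the estimate uses the total parameter count, not the roughly 3 billion active ones: all experts must be resident in memory even though only a few run per token. The KV cache and activations come on top of these figures.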

llmfan46 hosts each format in a separate Hugging Face repository, making it straightforward to pick the one that fits your environment.

Where to Find It and What Comes Next

All repositories live under the author’s model index at hf.co/llmfan46/models.
The main model page includes a model card with usage notes, and the GGUF repo offers a range of quantization levels.
As a community release, the project welcomes feedback, and the “v2” label hints that further iterations may arrive.
If you are exploring the frontiers of open-weight generative AI, whether for research, creative writing, or self-hosted assistants, this uncensored, MTP-preserving variant opens a door that official channels keep locked.
Just remember: with the guardrails off, the responsibility for safe and ethical deployment rests entirely with you.