What is MiniCPM5-1B and How Does Its Dual-Mode Architecture Work?

Explore MiniCPM5-1B, a 1B-parameter LLM designed for on-device deployment, featuring state-of-the-art performance and a unique 'Think'/'No Think' dual-mode chat template.

Discover MiniCPM5-1B, an efficient 1B-parameter causal language model optimized for local and resource-constrained environments. Learn about its Llama-based architecture, impressive 131K context window, and innovative 'Think' and 'No Think' modes that enable it to function as both a fast assistant and a deliberate reasoner from a single checkpoint.

Overview and Architecture

MiniCPM5-1B is a dense 1B-parameter causal language model designed for on‑device, local deployment and resource‑constrained settings. The release reports state-of-the-art results among open‑source models in the 1B class. The architecture is a standard LlamaForCausalLM stack, so it needs no custom kernels or code forks.

Key specifications:

  • Total parameters: 1,080,632,832 (679,552,512 non‑embedding)
  • 24 layers with Grouped Query Attention (16 query heads, 2 key‑value heads)
  • Native context window: 131,072 tokens
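The specifications above allow some quick arithmetic: how many parameters sit in the embeddings, and roughly how much memory the key‑value cache needs at full context. The sketch below is an estimate only. The per‑head dimension is not stated in this article, so head_dim=128 is an assumption, as is storing the cache in BF16 (2 bytes per element); the function name kv_cache_bytes is illustrative.

```python
# Figures taken from the specification list above.
TOTAL_PARAMS = 1_080_632_832
NON_EMBEDDING_PARAMS = 679_552_512
NUM_LAYERS = 24
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2
CONTEXT_TOKENS = 131_072

# Parameters held in the embedding tables.
EMBEDDING_PARAMS = TOTAL_PARAMS - NON_EMBEDDING_PARAMS


def kv_cache_bytes(seq_len, num_layers=NUM_LAYERS, num_kv_heads=NUM_KV_HEADS,
                   head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size for one sequence.

    head_dim=128 and bytes_per_elem=2 (BF16) are ASSUMPTIONS, not values
    from the article. The leading factor 2 counts keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


gqa = kv_cache_bytes(CONTEXT_TOKENS)
# Hypothetical comparison: the same model with one KV head per query head.
mha = kv_cache_bytes(CONTEXT_TOKENS, num_kv_heads=NUM_QUERY_HEADS)

print(f"embedding parameters: {EMBEDDING_PARAMS:,}")
print(f"KV cache at full context (GQA, 2 KV heads): {gqa / 2**30:.1f} GiB")
print(f"KV cache at full context (16 KV heads):     {mha / 2**30:.1f} GiB")
```

Under these assumptions, using 2 key‑value heads instead of 16 shrinks the cache by a factor of 8, which is the main reason Grouped Query Attention matters for long contexts on small devices.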

A single checkpoint powers both a fast assistant and a deliberate reasoner through a built‑in chat template that toggles Think and No Think modes via the enable_thinking flag. This makes the model directly usable for local assistants, coding agents, tool‑call workflows, and reasoning tasks.

Model Variants

The release provides five formats to suit different runtimes:

  • BF16 final checkpoint – post‑trained with RL and online preference data (recommended)
  • SFT‑only checkpoint – after supervised fine‑tuning, before RL
  • Base checkpoint – pre‑training only
  • GGUF – quantised format for llama.cpp, Ollama, and LM Studio
  • MLX / 4‑bit – optimised for Apple Silicon via MLX

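All variants share the same architecture. The three checkpoints differ in training stage, while the GGUF and MLX builds target specific runtimes, so you can choose the one that best fits your hardware and workflow.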

Dual Think / No Think Chat Modes

The chat template switches between two operating modes simply by setting the enable_thinking parameter. No separate checkpoint is needed.

Mode        Recommended sampling            enable_thinking
Think       temperature=0.9, top_p=0.95     True
No Think    temperature=0.7, top_p=0.95     False

  • Think mode engages the model’s capacity for step‑by‑step reasoning, suitable for complex problems.
  • No Think mode produces quicker, more direct responses for everyday assistant tasks.

This design lets the same compact model serve both as a fast chat assistant and a deliberate reasoning engine.

Install the dependencies:

pip install -U "transformers>=5.6" accelerate torch

Then load the model and generate a reply in No Think mode:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openbmb/MiniCPM5-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Pick dtype and device placement automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Who are you? Please briefly introduce yourself."}]
# enable_thinking=False selects No Think mode; set it to True for Think mode
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
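
The generation call above does not pass the sampling values recommended for each mode. As a minimal sketch, those values can be kept in a small helper and passed to generate explicitly. The helper name generation_kwargs and the constant RECOMMENDED_SAMPLING are illustrative, not part of the release; do_sample, temperature, top_p, and max_new_tokens are standard Transformers generation arguments.

```python
# Recommended sampling values from the mode table, keyed by enable_thinking.
RECOMMENDED_SAMPLING = {
    True: {"temperature": 0.9, "top_p": 0.95},   # Think
    False: {"temperature": 0.7, "top_p": 0.95},  # No Think
}


def generation_kwargs(enable_thinking: bool, max_new_tokens: int = 128) -> dict:
    """Build keyword arguments for model.generate() for the chosen mode."""
    return {
        "do_sample": True,  # temperature and top_p only apply when sampling is on
        "max_new_tokens": max_new_tokens,
        **RECOMMENDED_SAMPLING[enable_thinking],
    }


# Usage with the objects from the snippet above (not executed here):
#   outputs = model.generate(**inputs, **generation_kwargs(enable_thinking=False))
print(generation_kwargs(enable_thinking=True))
```

Think mode typically produces longer outputs than No Think mode, so a larger max_new_tokens budget is probably worth passing there; the right value depends on the task. Keep the enable_thinking value passed to apply_chat_template and to this helper in sync.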

Tool Calling

MiniCPM5-1B natively emits XML‑style tool calls. To convert them into standard OpenAI‑compatible tool_calls, the recommended backend is SGLang with its built‑in minicpm5 parser, which requires no extra model patching.

Launch the SGLang server with the tool‑call parser enabled, then send requests through the standard /v1/chat/completions endpoint.

python -m sglang.launch_server --model-path openbmb/MiniCPM5-1B --port 30000 \
--tool-call-parser minicpm5
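
Once the server is running, a request with tool definitions can be sent to the endpoint. The sketch below uses only the Python standard library and assumes the server is reachable on localhost at the port from the launch command. The get_weather tool and the helper names build_request, extract_tool_calls, and send are hypothetical, defined only for illustration; the request and response shapes follow the OpenAI chat completions format.

```python
import json
import urllib.request

# Assumes the server from the launch command is running locally.
BASE_URL = "http://localhost:30000/v1/chat/completions"

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(user_text: str) -> dict:
    """Build a chat completions payload that offers the tool to the model."""
    return {
        "model": "openbmb/MiniCPM5-1B",
        "messages": [{"role": "user", "content": user_text}],
        "tools": [WEATHER_TOOL],
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from an OpenAI-style response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In this format, arguments arrive as a JSON-encoded string.
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]


def send(payload: dict) -> dict:
    """POST the payload to the server and return the decoded JSON reply."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Usage (requires the running server, so it is not executed here):
#   reply = send(build_request("What is the weather in Paris?"))
#   print(extract_tool_calls(reply))
```

If the model answers in plain text instead of calling a tool, extract_tool_calls returns an empty list and the text is in the message's content field.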

Deployment Flexibility and Agent Skills

Because the model uses the standard LlamaForCausalLM architecture, it loads directly into mainstream inference engines without custom kernels or code modifications. The project provides step‑by‑step deployment cookbooks for:

  • Transformers (BF16/FP16 local inference, GPU and CPU)
  • vLLM (OpenAI‑compatible server)
  • SGLang (recommended for tool calling)
  • llama.cpp (GGUF, CPU/GPU hybrid)

Additionally, Agent Skills are available as GitHub resources, offering tailored instructions for users building coding agents with tools like Cursor or Claude Code. Together, these resources let you quickly move from model download to a production‑ready assistant, all within a compact 1B footprint.
