
How LFM2.5-8B-A1B Powers On-Device AI with Unmatched Throughput

Explore the LFM2.5 hybrid model architecture for efficient, agentic, and multilingual personal assistants on diverse hardware.


LFM2.5-8B-A1B is a new hybrid model in the LFM2.5 family, designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning, competes with larger models on instruction following and agentic tasks, and delivers the highest CPU and GPU inference throughput in its size class, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

Introduction

LFM2.5-8B-A1B is a new reasoning-tuned model from the LFM2.5 family, built for on-device AI and agentic workflows. It extends the LFM2 architecture with scaled-up pre-training and large-scale reinforcement learning, packing performance that rivals much larger dense and mixture-of-experts models into a compact footprint. Designed as an on-device personal assistant, it chains tool calls and follows complex instructions on a wide range of hardware. The model achieves unmatched throughput in its size class on both CPU and GPU, with day-one support for vLLM, llama.cpp, MLX, and SGLang. Compared with its predecessor, LFM2-8B-A1B, this release significantly improves instruction following, hallucination resistance, and agentic task success.

Model Details

LFM2.5-8B-A1B is a general-purpose, text-only hybrid model with 8.3B total parameters, of which only 1.5B are active per token. Its 24 layers combine 18 double-gated LIV convolutional layers with 6 grouped-query attention layers. The model was trained on 38 trillion tokens, supports a context length of 128,000 tokens, and uses a vocabulary of 128,000 tokens covering nine languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish. Recommended generation parameters are temperature 0.2, top_k 80, and repetition_penalty 1.05. This compact design suits on-device deployment while maintaining strong reasoning capabilities.
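To put the parameter counts above in perspective, here is a rough back-of-the-envelope sketch. It assumes that the full set of weights has to be resident in memory (as is typical for sparsely activated models) and it counts weights only, ignoring the KV cache, activations, and any quantization overhead. The helper name is illustrative, not part of any library.

```python
# Figures quoted in the text above.
TOTAL_PARAMS_B = 8.3   # total parameters, in billions
ACTIVE_PARAMS_B = 1.5  # parameters active per token, in billions
CONV_LAYERS = 18       # double-gated LIV convolutional layers
ATTN_LAYERS = 6        # grouped-query attention layers

# Share of the weights that takes part in computing each token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
# Share of layers that use attention rather than convolution.
attention_fraction = ATTN_LAYERS / (CONV_LAYERS + ATTN_LAYERS)


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


print(f"active fraction:  {active_fraction:.1%}")
print(f"attention layers: {attention_fraction:.0%} of {CONV_LAYERS + ATTN_LAYERS}")
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS_B, bits):.1f} GB")
```

Under these assumptions, memory use tracks the total parameter count while per-token compute tracks the active count, which is why the active figure matters most for throughput.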

Chat Template and Tool Use

The model uses a ChatML-like format with special tokens. Assistant turns include an explicit chain of thought before the final answer, making it a reasoning model. The template is:

<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant
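
As an illustration of how the template above is assembled, the following minimal sketch renders a list of messages into the same string layout. It is a hand-written approximation, not the model's official template: in practice the tokenizer's own apply_chat_template should be used, since it may handle tools and reasoning traces in ways this sketch does not. The function name render_chat is hypothetical.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages into the ChatML-like layout shown above (approximation)."""
    parts = ["<|startoftext|>"]
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat(
    [
        {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
        {"role": "user", "content": "What is C. elegans?"},
    ]
)
print(prompt)
```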

Tool use follows four steps: define tools as a JSON object in the system prompt; the model outputs a Pythonic function call between <|tool_call_start|> and <|tool_call_end|>; execute the call and return the result with the tool role; the model then interprets the output and provides a final answer. This structured approach enables reliable agentic AI behavior for real-world applications.
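
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the get_weather tool, its JSON definition, and the sample model output are hypothetical, and the parser assumes the call is written either as a single Python-style call or as a list of calls with keyword arguments only.

```python
import ast
import json
import re

# Step 1 (assumed shape): a tool definition to serialize into the system prompt.
TOOL_SPEC = [{
    "name": "get_weather",  # hypothetical tool
    "description": "Return the current weather for a city.",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}]
system_prompt = "List of tools: " + json.dumps(TOOL_SPEC)

TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)


def parse_tool_calls(text):
    """Step 2: extract (name, kwargs) pairs from Pythonic tool-call spans."""
    calls = []
    for span in TOOL_CALL_RE.findall(text):
        tree = ast.parse(span.strip(), mode="eval").body
        nodes = tree.elts if isinstance(tree, (ast.List, ast.Tuple)) else [tree]
        for node in nodes:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                raise ValueError("unsupported tool call syntax")
            if node.args:
                raise ValueError("only keyword arguments are handled in this sketch")
            # literal_eval accepts AST nodes and only evaluates plain literals.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls


def get_weather(city):
    """Hypothetical tool implementation returning canned data."""
    return {"city": city, "temperature_c": 21}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_text):
    """Step 3: execute each call and wrap the result in a tool-role message."""
    return [
        {"role": "tool", "content": json.dumps(TOOLS[name](**kwargs))}
        for name, kwargs in parse_tool_calls(assistant_text)
    ]


sample = '<|tool_call_start|>[get_weather(city="Paris")]<|tool_call_end|>'
tool_messages = run_tool_calls(sample)
print(tool_messages)
```

Step 4 then consists of appending these tool messages to the conversation and generating again so the model can interpret the result. Parsing with ast rather than eval keeps the model's output from executing arbitrary code.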

Inference

LFM2.5-8B-A1B is supported across multiple frameworks for flexible deployment. Use Transformers for simple inference with direct model access, vLLM for high-throughput GPU serving, llama.cpp for cross-platform inference with CPU offloading, MLX for Apple Silicon, and LM Studio for local desktop use. Model checkpoints are available in native format, GGUF for llama.cpp, ONNX for cross-platform runtimes, and MLX for Mac devices. The model is optimized for agentic workflows, tool use, structured outputs, and multilingual assistants; it is not intended for heavy programming tasks or for knowledge-intensive QA without retrieval. A Transformers example:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the model and tokenizer
model_id = "LiquidAI/LFM2.5-8B-A1B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    # attn_implementation="flash_attention_2" <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Print tokens as they are generated, hiding the prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Format the conversation with the model's chat template
prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

# Generate with the recommended sampling parameters
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.2,
    top_k=80,
    repetition_penalty=1.05,
    max_new_tokens=8192,  # leaves room for the reasoning trace and the answer
    streamer=streamer,
)
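
For served deployments such as vLLM, the same recommended sampling parameters can be sent over an OpenAI-compatible chat endpoint. The sketch below only builds the request body and shows how it could be posted with the standard library. The endpoint URL is an assumption about a local setup, and passing top_k and repetition_penalty as extra body fields is an assumption about the server (vLLM's OpenAI-compatible server accepts them; other servers may not).

```python
import json
import urllib.request

# Assumed local endpoint of an OpenAI-compatible server; adjust to your setup.
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, max_tokens: int = 8192) -> dict:
    """Build a chat-completions body with the recommended sampling parameters."""
    return {
        "model": "LiquidAI/LFM2.5-8B-A1B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "top_k": 80,                 # extra sampling field, server-dependent
        "repetition_penalty": 1.05,  # extra sampling field, server-dependent
        "max_tokens": max_tokens,
    }


def send(payload: dict, endpoint: str = ENDPOINT) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("What is C. elegans?")
print(json.dumps(payload, indent=2))
# response = send(payload)  # requires a running server at ENDPOINT
```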

Fine-Tuning

Fine-tuning is recommended to adapt LFM2.5 for specific use cases. Supported methods include continued pre-training (CPT) with Unsloth for text completion and translation, supervised fine-tuning (SFT) with LoRA using Unsloth or TRL, direct preference optimization (DPO) with TRL, and group relative policy optimization (GRPO) with Unsloth or TRL. Each method is accompanied by documentation and Colab notebooks, making it easy to customize the model for on-device AI applications or specialized agentic tasks.
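
Whichever supervised method is chosen, the training data usually has to be put into a chat format first. The following sketch converts prompt/response pairs into JSON Lines records with a "messages" field, which is the conversational layout that TRL's SFT tooling accepts. The example pairs and helper names are hypothetical, and the exact fields expected by a given notebook should be checked against its documentation.

```python
import io
import json


def to_sft_record(prompt, response, system=None):
    """Wrap one prompt/response pair as a conversational training record."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    messages.append({"role": "assistant", "content": response})
    return {"messages": messages}


def dump_jsonl(records, fp):
    """Write one JSON object per line; keep non-ASCII text readable."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical training pairs for a small translation assistant.
pairs = [
    ("Translate to French: good morning", "Bonjour"),
    ("Translate to Spanish: thank you", "Gracias"),
]
records = [to_sft_record(p, r, system="You are a concise translator.") for p, r in pairs]

buffer = io.StringIO()  # stands in for open("train.jsonl", "w", encoding="utf-8")
dump_jsonl(records, buffer)
print(buffer.getvalue())
```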

Performance Improvements

LFM2.5-8B-A1B improves on LFM2-8B-A1B on every benchmark listed below, with the largest gains in hallucination resistance and agentic tasks. The improvements are attributed to reasoning, extended pre-training, and large-scale RL.

Benchmark                               LFM2-8B-A1B   LFM2.5-8B-A1B   Δ
AA-Omniscience Index                    -78.42        -24.70          +53.72
AA-Omniscience Accuracy                 7.33          8.67            +1.34
AA-Omniscience Non-Hallucination Rate   7.46          63.47           +56.01
IFEval                                  79.44         91.84           +12.40
IFBench                                 26.00         56.47           +30.47
Multi-IF                                58.54         79.93           +21.39
MATH500                                 74.80         88.76           +13.96
AIME25                                  20.00         42.53           +22.53
BFCLv3                                  45.07         64.36           +19.29
BFCLv4                                  25.52         48.50           +22.98
Tau² Telecom                            13.60         88.07           +74.47
Tau² Retail                             7.02          39.82           +32.80

The AA-Omniscience Index, which rewards correct answers and penalizes hallucinations, improved by over 53 points. Instruction following (IFEval) and agentic benchmarks (BFCL, Tau²) saw substantial jumps, making this model a strong candidate for on-device AI assistants that require reliable, low-hallucination performance.
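
To make the relationship between the three AA-Omniscience rows concrete, here is a small worked sketch. It rests on two assumptions that are not stated in this article: that the index is the percentage of correct answers minus the percentage of incorrect answers, and that the hallucination rate is the share of non-correct responses that were answered incorrectly rather than declined. Under those assumptions the reported index values can be reconstructed, to within rounding, from the accuracy and non-hallucination figures.

```python
def omniscience_index(accuracy_pct: float, non_hallucination_pct: float) -> float:
    """Estimate the index as % correct minus % incorrect (assumed definition)."""
    not_correct_pct = 100.0 - accuracy_pct
    # Assumed: hallucination rate = incorrect / all non-correct responses.
    hallucination_rate = 1.0 - non_hallucination_pct / 100.0
    incorrect_pct = not_correct_pct * hallucination_rate
    return accuracy_pct - incorrect_pct


# (accuracy, non-hallucination rate, reported index), taken from the table.
reported = {
    "LFM2-8B-A1B": (7.33, 7.46, -78.42),
    "LFM2.5-8B-A1B": (8.67, 63.47, -24.70),
}
for name, (accuracy, non_hallucination, index) in reported.items():
    estimate = omniscience_index(accuracy, non_hallucination)
    print(f"{name}: reconstructed {estimate:.2f}, reported {index:.2f}")
```

Read this way, the index gain comes mostly from the model declining to answer when it does not know, since accuracy itself moves by only 1.34 points.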
