LLMs

Explore the LFM2.5 hybrid model architecture for efficient, agentic, and multilingual personal assistants on diverse hardware.

How LFM2.5-8B-A1B Powers On-Device AI with Unmatched Throughput

LFM2.5-8B-A1B is a new family of hybrid models designed for on-device deployment, building on the LFM2 architecture with extended pre-training and reinforcement learning. It is competitive with larger models on instruction following and agentic tasks, and it claims unmatched throughput for CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.
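The "8B-A1B" suffix follows a naming pattern that several models on this page share. As a rough sketch, assuming the common convention that `<total>B-A<active>B` means total parameters and parameters activated per token (the summaries here do not spell this out), the active share can be read straight from the name. The helper `active_fraction` is a hypothetical name used only for illustration.

```python
import re

# Assumption: "<total>B-A<active>B" encodes total and per-token active parameters,
# e.g. "8B-A1B" = about 8B total, about 1B active. This is a naming convention,
# not something the model cards summarized here state explicitly.
_SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def active_fraction(model_name: str) -> float:
    """Return the active/total parameter ratio parsed from a model name."""
    match = _SIZE_PATTERN.search(model_name)
    if match is None:
        raise ValueError(f"no '<total>B-A<active>B' pattern in {model_name!r}")
    total = float(match.group(1))
    active = float(match.group(2))
    return active / total


if __name__ == "__main__":
    for name in (
        "LFM2.5-8B-A1B",
        "Nemotron-3-Ultra-550B-A55B-BF16",
        "Qwen3.5-35B-A3B-Heretic-V2",
    ):
        print(f"{name}: {active_fraction(name):.1%} of parameters active per token")
```

Under that assumption, a small active share is what lets a model with a large total parameter count run with the per-token compute of a much smaller one.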

Discover NVIDIA's 550B parameter LatentMoE model, optimized for agentic reasoning, long-context analysis, and multilingual capabilities with Multi-Token Prediction.

NVIDIA Nemotron-3-Ultra 550B: A Frontier LLM for Complex AI Workflows

Nemotron-3-Ultra-550B-A55B-BF16 is a frontier-scale LLM from NVIDIA that combines a LatentMoE architecture, a hybrid Mamba-2 + MoE + Attention design, and Multi-Token Prediction. Designed for complex multi-step agents, long-context analysis, and high-accuracy reasoning across multiple languages, it offers configurable reasoning and is released under the OpenMDW License.

A new arXiv study introduces an offline "sleep" mechanism for Transformer-based language models, improving long-horizon task efficiency without increasing online inference costs.

New LLM "Sleep" Phase Boosts Long-Context Performance

Researchers propose a "sleep" phase for large language models that converts recent context into persistent fast weights, clearing the key-value cache. This innovative approach addresses the attention bottleneck, enabling models to handle long-context tasks efficiently and perform better on complex benchmarks like math reasoning.
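To make the idea of folding context into fast weights concrete, here is a toy sketch. It uses the classic outer-product associative memory as a stand-in, which is an assumption: the summary above does not describe the paper's actual update rule. The function names `sleep` and `recall` are hypothetical.

```python
def sleep(kv_cache, fast_weights):
    """Fold every cached (key, value) pair into fast_weights, then empty the cache.

    Toy stand-in for the "sleep" phase: W += value * key^T for each pair.
    This is NOT the paper's method, only an illustration of the general idea.
    """
    for key, value in kv_cache:
        for i, v in enumerate(value):
            for j, k in enumerate(key):
                fast_weights[i][j] += v * k  # outer-product update
    kv_cache.clear()  # the context now lives only in the weights
    return fast_weights


def recall(fast_weights, query):
    """Read from the fast weights with a matrix-vector product W @ query."""
    return [sum(w * q for w, q in zip(row, query)) for row in fast_weights]


if __name__ == "__main__":
    # Two cached tokens with orthonormal keys, so recall is exact in this toy case.
    cache = [([1.0, 0.0], [2.0, 3.0]), ([0.0, 1.0], [5.0, 7.0])]
    weights = [[0.0, 0.0], [0.0, 0.0]]
    sleep(cache, weights)
    print(len(cache))                    # cache is empty after sleeping
    print(recall(weights, [1.0, 0.0]))   # value stored under the first key
```

The point of the sketch is the cost profile: after `sleep`, a lookup costs one fixed-size matrix-vector product regardless of how many tokens were folded in, whereas attention over a key-value cache grows with context length.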

Shanghai-based AI firm, backed by Tencent and Alibaba, details M2's MoE architecture and "interleaved thinking," while previewing M3's significant performance gains for ultra-long contexts.

MiniMax Unveils M2 Series, Teases M3 with 9.7x Speedup via Sparse Attention

MiniMax releases a technical report on its M2 model series, featuring a sparse Mixture-of-Experts backbone and innovative "interleaved thinking." The report also previews the upcoming M3 model, which achieves a 9.7x prefilling speedup with MiniMax Sparse Attention (MSA) for 1-million-token sequences, pushing AI efficiency boundaries.
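For readers unfamiliar with sparse Mixture-of-Experts, the following is a minimal sketch of generic top-k routing. It is not taken from the MiniMax report: the number of experts, the value of `k`, and the names `route_top_k` and `moe_forward` are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    return sum(gate * experts[i](x) for i, gate in route_top_k(router_logits, k))


if __name__ == "__main__":
    # Four toy "experts"; only two of them run for this input.
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
    logits = [0.0, 5.0, 5.0, -1.0]  # hypothetical router scores for one token
    print(route_top_k(logits, k=2))
    print(moe_forward(1.0, experts, logits, k=2))
```

Because only `k` of the experts are evaluated per token, compute per token scales with `k` rather than with the total number of experts, which is the sense in which such a backbone is "sparse".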

Explore MiniCPM5-1B, a 1B-parameter LLM designed for on-device deployment, featuring state-of-the-art performance and a unique 'Think'/'No Think' dual-mode chat template.

What is MiniCPM5-1B and How Does Its Dual-Mode Architecture Work?

Discover MiniCPM5-1B, an efficient 1B-parameter causal language model optimized for local and resource-constrained environments. Learn about its Llama-based architecture, impressive 131K context window, and innovative 'Think' and 'No Think' modes that enable it to function as both a fast assistant and a deliberate reasoner from a single checkpoint.

Explore the technical innovations, ethical considerations, and practical applications of uncensored large language models, focusing on a community-driven variant of Qwen3.5.

Understanding Uncensored LLMs: A Deep Dive into Qwen3.5-35B-A3B-Heretic-V2

Learn about the architecture and capabilities of uncensored language models, specifically Qwen3.5-35B-A3B-Heretic-V2. Discover how multi-token prediction and various quantization formats enhance performance and accessibility, while understanding the implications of removing safety filters for research and development.

Elon Musk confirms that the 1.5T-parameter model, triple the size of its predecessor, is now entering fine-tuning ahead of a public launch within weeks, with enhanced coding capabilities.

xAI Completes Grok V9-Medium Training, June Release Expected

xAI has finished training its Grok V9-Medium foundation model, a 1.5-trillion-parameter model with significant improvements over its predecessor, v8-small. The model, which heavily emphasizes coding tasks through Cursor data, is now undergoing fine-tuning and reinforcement learning, with a public release anticipated in early to mid-June 2026.

Exploring the motivations, training data, capabilities, and community reactions to a language model that only knows the world before 1931.

Inside Talkie: The 13B LM Trained Only on Pre-1931 Text

Talkie is a 13B-parameter language model trained exclusively on 260 billion tokens of text published before 1931. Built by Nick Levine, Alec Radford, and David Duvenaud to study AI generalization, it sparks discussion on historical perspective and anachronistic outputs. This deep dive covers data sources, processing, limitations, and public release plans.