home›LLMs›

NVIDIA Nemotron-3-Ultra 550B: A Frontier LLM for Complex AI Workflows

Discover NVIDIA's 550B parameter LatentMoE model, optimized for agentic reasoning, long-context analysis, and multilingual capabilities with Multi-Token Prediction.

June 5, 2026

#Agents #Content Generation #LLM #Open Source #Training

Nemotron-3-Ultra-550B-A55B-BF16 is a frontier-scale LLM by NVIDIA, featuring a LatentMoE architecture, Mamba-2 + MoE + Attention hybrid, and Multi-Token Prediction. Designed for complex multi-step agents, long-context analysis, and high-accuracy reasoning across multiple languages, it offers configurable reasoning and is released under the OpenMDW License.

Introducing NVIDIA Nemotron-3-Ultra 550B-A55B

Released June 4, 2026, the NVIDIA Nemotron-3-Ultra 550B-A55B-BF16 is a frontier-scale LLM engineered for the most demanding reasoning, agentic, and long-context workloads. It packs 550 billion total parameters, but only 55 billion are active thanks to its LatentMoE architecture.

The model supports up to 1 million tokens of context and works across English, French, Spanish, German, Italian, Japanese, Korean, Hindi, Portuguese, and Chinese. A built‑in thinking mode can be toggled via the chat template. Designed for AI agents, RAG, and high‑stakes analytical tasks, it is available under the OpenMDW 1.1 license for both commercial and research use.

Hybrid LatentMoE and Multi‑Token Prediction

The Ultra model combines Mamba‑2, mixture‑of‑experts, and selective attention layers into a latent MoE hybrid. Tokens are projected into a smaller latent space for expert routing, improving accuracy per byte.

Multi‑Token Prediction (MTP) layers share weights across prediction heads, which boosts training signal quality and enables faster inference through native speculative decoding. During pre‑training, NVIDIA used an NVFP4 recipe — most linear layers store weights, activations, and gradients in 4‑bit floating point, while stability‑critical projections (latent, MTP, attention, embeddings) remain in BF16 or MXFP8. This balanced design delivers frontier efficiency without sacrificing accuracy.

Four‑Stage Training Pipeline

Training unfolded in four steps:

Pre‑training on ~20T tokens of crawled and synthetic data with the NVFP4 recipe.
Supervised fine‑tuning on math, code, tool‑calling, and long‑range retrieval data.
Reinforcement learning using asynchronous GRPO across math, code, science, and multi‑turn tool use; MTP accelerated rollout generation.
Multi‑Domain On‑Policy Distillation (MOPD) — teacher models guide learning on the model’s own rollouts, aligning behavior with what it will actually produce at inference time.

Pre‑training data cutoff is September 2025; post‑training data is fresh through May 2026. All datasets and environment code (Megatron‑LM, NeMo RL, NeMo Gym, Data Designer) are open‑source.

Benchmark Highlights

Nemotron‑3‑Ultra competes at the top of the LLM leaderboard. It excels in agentic coding, high‑level math, and extreme long‑context retrieval.

Benchmark	Nemotron‑3‑Ultra	Qwen‑3.5 397B	DS‑v4‑Pro
SWE‑Bench Verified	71.9	69.9	74.0
LiveCodeBench (v6)	89.0	79.3	92.5
GPQA (no tools)	87.0	87.1	87.8
MMLU‑Pro	86.8	88.3	87.5
RULER (1M tokens)	94.7	90.1	94.2
MMLU‑ProX (10‑lang avg)	83.0	86.4	85.6

Full results and evaluation harness details are available in the technical report.

Deployment Quick Start

The BF16 checkpoint is a large model. For single‑node inference, 8× B200 GPUs (≈1.5 TB HBM) are recommended. Multi‑node setups can use H100/H200/GB200/GB300 clusters orchestrated with Ray v2. All configurations enable chunked prefill and MTP‑based speculative decoding (5 draft tokens). Below are the basic launch scripts.

# Set the IP for the head node in RAY_HEAD_IP
export RAY_HEAD_IP=
export RAY_PORT=6379
export RAY_ADDRESS=${RAY_HEAD_IP}:${RAY_PORT}

# Start Ray head node (vLLM/SGLang will run on this node)
ray start --head --node-ip-address=${RAY_HEAD_IP} --port=${RAY_PORT}

# Start Ray worker node(s)
ray start --address=${RAY_HEAD_IP}:${RAY_PORT} --block

# Verify Ray cluster is ready
ray status --address=${RAY_HEAD_IP}:${RAY_PORT}

export MODEL_CKPT=PATH/TO/MODEL/CHECKPOINT

docker run -d --name nemotron-ultra-vllm \
--gpus all \
--ipc=host \
--network=host \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v $MODEL_CKPT:/model:ro \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e SAFETENSORS_FAST_GPU=1 \
-e NVIDIA_TF32_OVERRIDE=1 \
-e VLLM_LOGGING_LEVEL=INFO \
vllm/vllm-openai:v0.22.0 \
/model \
--host 0.0.0.0 \
--port 8000 \
--served-model-name