Introducing NVIDIA Nemotron-3-Ultra 550B-A55B
Released June 4, 2026, the NVIDIA Nemotron-3-Ultra 550B-A55B-BF16 is a frontier-scale LLM engineered for the most demanding reasoning, agentic, and long-context workloads. It packs 550 billion total parameters, but only 55 billion are active thanks to its LatentMoE architecture.
The model supports up to 1 million tokens of context and works across English, French, Spanish, German, Italian, Japanese, Korean, Hindi, Portuguese, and Chinese. A builtâin thinking mode can be toggled via the chat template. Designed for AI agents, RAG, and highâstakes analytical tasks, it is available under the OpenMDW 1.1 license for both commercial and research use.
Hybrid LatentMoE and MultiâToken Prediction
The Ultra model combines Mambaâ2, mixtureâofâexperts, and selective attention layers into a latent MoE hybrid. Tokens are projected into a smaller latent space for expert routing, improving accuracy per byte.
MultiâToken Prediction (MTP) layers share weights across prediction heads, which boosts training signal quality and enables faster inference through native speculative decoding. During preâtraining, NVIDIA used an NVFP4 recipe â most linear layers store weights, activations, and gradients in 4âbit floating point, while stabilityâcritical projections (latent, MTP, attention, embeddings) remain in BF16 or MXFP8. This balanced design delivers frontier efficiency without sacrificing accuracy.
FourâStage Training Pipeline
Training unfolded in four steps:
- Preâtraining on ~20T tokens of crawled and synthetic data with the NVFP4 recipe.
- Supervised fineâtuning on math, code, toolâcalling, and longârange retrieval data.
- Reinforcement learning using asynchronous GRPO across math, code, science, and multiâturn tool use; MTP accelerated rollout generation.
- MultiâDomain OnâPolicy Distillation (MOPD) â teacher models guide learning on the modelâs own rollouts, aligning behavior with what it will actually produce at inference time.
Preâtraining data cutoff is September 2025; postâtraining data is fresh through May 2026. All datasets and environment code (MegatronâLM, NeMo RL, NeMo Gym, Data Designer) are openâsource.
Benchmark Highlights
Nemotronâ3âUltra competes at the top of the LLM leaderboard. It excels in agentic coding, highâlevel math, and extreme longâcontext retrieval.
| Benchmark | Nemotronâ3âUltra | Qwenâ3.5 397B | DSâv4âPro |
|---|---|---|---|
| SWEâBench Verified | 71.9 | 69.9 | 74.0 |
| LiveCodeBench (v6) | 89.0 | 79.3 | 92.5 |
| GPQA (no tools) | 87.0 | 87.1 | 87.8 |
| MMLUâPro | 86.8 | 88.3 | 87.5 |
| RULER (1M tokens) | 94.7 | 90.1 | 94.2 |
| MMLUâProX (10âlang avg) | 83.0 | 86.4 | 85.6 |
Full results and evaluation harness details are available in the technical report.
Deployment Quick Start
The BF16 checkpoint is a large model. For singleânode inference, 8Ă B200 GPUs (â1.5âŻTB HBM) are recommended. Multiânode setups can use H100/H200/GB200/GB300 clusters orchestrated with Ray v2. All configurations enable chunked prefill and MTPâbased speculative decoding (5 draft tokens). Below are the basic launch scripts.
# Set the IP for the head node in RAY_HEAD_IP export RAY_HEAD_IP= export RAY_PORT=6379 export RAY_ADDRESS=${RAY_HEAD_IP}:${RAY_PORT} # Start Ray head node (vLLM/SGLang will run on this node) ray start --head --node-ip-address=${RAY_HEAD_IP} --port=${RAY_PORT} # Start Ray worker node(s) ray start --address=${RAY_HEAD_IP}:${RAY_PORT} --block # Verify Ray cluster is ready ray status --address=${RAY_HEAD_IP}:${RAY_PORT}
export MODEL_CKPT=PATH/TO/MODEL/CHECKPOINT
docker run -d --name nemotron-ultra-vllm \ --gpus all \ --ipc=host \ --network=host \ --shm-size=16g \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ -v $MODEL_CKPT:/model:ro \ -e VLLM_WORKER_MULTIPROC_METHOD=spawn \ -e SAFETENSORS_FAST_GPU=1 \ -e NVIDIA_TF32_OVERRIDE=1 \ -e VLLM_LOGGING_LEVEL=INFO \ vllm/vllm-openai:v0.22.0 \ /model \ --host 0.0.0.0 \ --port 8000 \ --served-model-name



