Reinforcement Learning

A data-free framework for training language models without external supervision, improving performance on open-ended and short-form QA benchmarks.

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Language Tasks

SCOPE is a data-free self-play framework for open-ended tasks that co-evolves a Challenger, which generates tasks, and a Solver, which answers them. A self-judge writes rubrics and grades responses, improving 7-8B instruction-tuned models by up to +10.4 points on open-ended benchmarks and +13.8 points on held-out QA benchmarks.

Exploring the architecture and application of state-externalizing harnesses in AI agent development.

Harness-1: Reinforcement Learning for Search Agents

Harness-1 introduces a novel approach to reinforcement learning for search agents through state-externalizing harnesses. This project, detailed in arXiv:2606.02373, provides a framework for advanced AI agent development.

NVIDIA's latest foundation model for robotics and embodied AI, integrating diverse sensory data for advanced physical intelligence.

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA introduces Cosmos 3, a cutting-edge omnimodal world model designed for physical AI applications. This project leverages diverse data inputs to enable robots and embodied AI systems to better understand and interact with the physical world, pushing the boundaries of autonomous intelligence.

Millions invested in LLM alignment are undone by a simple script and electricity costing less than a fast-food meal, exposing a critical flaw in AI safety economics.

The $20 AI De-alignment: How Safety Guardrails Evaporate for Pocket Change

A group called Heretic demonstrated how to strip alignment and censorship from 168 open-weight LLMs for just $20, using "weight surgery." This automated process, which bypasses human judgment, reveals a six-order-of-magnitude cost asymmetry that undermines corporate-scale AI safety investments and highlights performance gains in de-aligned models.
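
To make the "six-order-of-magnitude" figure concrete, here is a small arithmetic sketch. Only the $20 cost and the phrase "millions invested" come from the summary above; the $20 million alignment budget is a hypothetical value chosen because it is the figure that sits exactly six orders of magnitude above $20.

```python
import math


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return how many powers of ten separate two positive amounts."""
    return math.log10(larger / smaller)


# Stated in the summary: the de-alignment run cost about $20.
dealignment_cost_usd = 20

# Hypothetical: an alignment budget of $20 million, picked for illustration.
alignment_budget_usd = 20_000_000

gap = orders_of_magnitude(alignment_budget_usd, dealignment_cost_usd)
print(f"Cost asymmetry: about {gap:.0f} orders of magnitude")
```

Any budget in the tens of millions gives a gap between six and seven orders of magnitude, so the claim implies safety spending at roughly that scale.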

Discover BES, a novel framework coupling forward evolutionary search with backward goal decomposition to overcome sampling bottlenecks in LLM reasoning.

How Bidirectional Evolutionary Search Improves LLM Self-Improvement

This article explains Bidirectional Evolutionary Search (BES), a new framework that enhances LLM self-improvement by combining evolutionary operators for broader exploration with dense, intermediate feedback from goal decomposition. Learn how BES tackles the limitations of traditional sampling methods like best-of-N and tree search.
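
For context, the best-of-N baseline mentioned above can be sketched in a few lines. This is not BES itself: it shows only the sampling method BES is contrasted with. The names `best_of_n`, `toy_sample`, and `toy_score` are hypothetical stand-ins for a real model sampler and verifier.

```python
import random
from typing import Callable


def best_of_n(
    prompt: str,
    sample: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Draw n independent candidates and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (assumptions): the "model" guesses an integer and the
# "verifier" rewards guesses close to a hidden target of 42.
def toy_sample(prompt: str, rng: random.Random) -> str:
    return str(rng.randint(0, 100))


def toy_score(prompt: str, candidate: str) -> float:
    return -abs(int(candidate) - 42)


print(best_of_n("guess the number", toy_sample, toy_score, n=16))
```

Each candidate is drawn independently and scored only once it is complete, so nothing learned from one sample guides the next. The summary above describes BES as addressing this with evolutionary operators and intermediate feedback.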

Shanghai-based AI firm, backed by Tencent and Alibaba, details M2's MoE architecture and "interleaved thinking," while previewing M3's significant performance gains for ultra-long contexts.

MiniMax Unveils M2 Series, Teases M3 with 9.7x Speedup via Sparse Attention

MiniMax releases a technical report on its M2 model series, featuring a sparse Mixture-of-Experts backbone and innovative "interleaved thinking." The report also previews the upcoming M3 model, which achieves a 9.7x prefilling speedup with MiniMax Sparse Attention (MSA) for 1-million-token sequences, pushing AI efficiency boundaries.

This paper introduces Macaron-A2UI, a novel model enabling AI agents to dynamically synthesize interactive UI controls alongside natural language, addressing the limitations of text-only interfaces.

Generative UI: Revolutionizing AI Agent Interactions Beyond Plain Text

Discover Macaron-A2UI, a groundbreaking model that allows AI agents to generate interactive UI elements using a declarative protocol. Learn about its comprehensive corpus construction, A2UI-Bench for structured evaluation, and a two-stage training recipe combining SFT and GRPO to enhance user experience and agent capability.
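
The GRPO stage of that recipe relies on group-relative advantages: each sampled response is scored against the other responses drawn for the same prompt. The sketch below shows that normalization in its generic form. It is not Macaron-A2UI's actual reward or training code. The function name and the example rewards are hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt,
# e.g. 1.0 if the generated UI passed a check and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat their group's average get a positive advantage and the rest get a negative one, so no separate learned value model is needed to produce the baseline.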

Models are no longer bounded by single-call context windows; SkyRL's infrastructure enables execution-driven meta-reasoning via stateful child agents.

The Recursion Ceiling is a Myth: NovaSky Unleashes Recursive Language Models

Discover how NovaSky's SkyRL framework shatters the limitations of large language models. By spawning recursive child agents within persistent Python sandboxes, models can now reason in multi-turn, multi-agent trees, redefining what "thinking" means for AI.

Elon Musk confirms that the 1.5T-parameter model, three times the size of its predecessor, has entered fine-tuning ahead of a public launch within weeks, with enhanced coding capabilities.

xAI Completes Grok V9-Medium Training, June Release Expected

xAI has finished training Grok V9-Medium, a 1.5-trillion-parameter foundation model with significant improvements over its predecessor, v8-small. The model, which heavily emphasizes coding tasks through Cursor data, is now undergoing fine-tuning and reinforcement learning, with a public release anticipated in early to mid-June 2026.