Reinforcement Learning

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Language Tasks
SCOPE is a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates tasks and a Solver that answers them. A self-judge writes rubrics and grades the responses. The reported gains for 7-8B instruction-tuned models are up to +10.4 points on open-ended benchmarks and +13.8 points on held-out QA benchmarks.

Harness-1: Reinforcement Learning for Search Agents
Harness-1 applies reinforcement learning to search agents through state-externalizing harnesses. The project, detailed in arXiv:2606.02373, offers a framework for developing AI agents.

Cosmos 3: Omnimodal World Models for Physical AI
NVIDIA introduces Cosmos 3, an omnimodal world model designed for physical AI applications. It draws on diverse data inputs to help robots and embodied AI systems better understand and interact with the physical world.

The $20 AI De-alignment: How Safety Guardrails Evaporate for Pocket Change
A group called Heretic demonstrated how to strip alignment and censorship from 168 open-weight LLMs for just $20, using "weight surgery." The process is automated and bypasses human judgment. According to the article, it reveals a six-order-of-magnitude cost asymmetry that undermines corporate-scale AI safety investments, and the de-aligned models show performance gains.

How Bidirectional Evolutionary Search Improves LLM Self-Improvement
This article explains Bidirectional Evolutionary Search (BES), a new framework that enhances LLM self-improvement by combining evolutionary operators for broader exploration with dense, intermediate feedback from goal decomposition. Learn how BES tackles the limitations of traditional sampling methods like best-of-N and tree search.

MiniMax Unveils M2 Series, Teases M3 with 9.7x Speedup via Sparse Attention
MiniMax releases a technical report on its M2 model series, which features a sparse Mixture-of-Experts backbone and "interleaved thinking." The report also previews the upcoming M3 model, which achieves a 9.7x prefilling speedup on 1-million-token sequences using MiniMax Sparse Attention (MSA).

Generative UI: Revolutionizing AI Agent Interactions Beyond Plain Text
Discover Macaron-A2UI, a groundbreaking model that allows AI agents to generate interactive UI elements using a declarative protocol. Learn about its comprehensive corpus construction, A2UI-Bench for structured evaluation, and a two-stage training recipe combining SFT and GRPO to enhance user experience and agent capability.

The Recursion Ceiling is a Myth: NovaSky Unleashes Recursive Language Models
Discover how NovaSky's SkyRL framework addresses the recursion limits of large language models. By spawning recursive child agents within persistent Python sandboxes, models can reason in multi-turn, multi-agent trees, redefining what "thinking" means for AI.

xAI Completes Grok V9-Medium Training, June Release Expected
xAI has finished training Grok V9-Medium, a 1.5-trillion-parameter foundation model with significant improvements over its predecessor, v8-small. The model, which heavily emphasizes coding tasks using Cursor data, is now undergoing fine-tuning and reinforcement learning, with a public release anticipated in early to mid-June 2026.