A New Kind of Interaction: Thinking Machines Lab Unveils Native Real-Time AI
In a research preview announced today, Thinking Machines Lab has introduced a new class of AI models designed to interact with humans as naturally as we communicate with each other—continuously, across audio, video, and text, with real-time thinking and acting. These interaction models are trained from scratch using a multi-stream, micro-turn architecture that avoids the rigid turn-taking of traditional chatbots. Instead of waiting for user input and freezing during generation, these models maintain a constant two-way exchange, capable of interrupting, being interrupted, and even speaking while listening.
"Interactivity scales alongside intelligence," the team writes in their accompanying paper. "Our models take in audio, video, and text continuously; think, respond, and act in real time."
This approach marks a departure from the dominant paradigm where AI labs prioritize autonomous, single-threaded agents. The result, according to early benchmarks, is state-of-the-art performance in both intelligence and responsiveness—achieved without external scaffolding or voice-activity-detection harnesses.

The Collaboration Bottleneck: Why Current Interfaces Fall Short
Most commercial frontier models experience reality in a single thread: they wait for the user to finish input, then generate a complete response. This turn-based design creates a narrow channel that fundamentally limits human-AI collaboration. A 2025 METR study (Kwa et al.) found that current models are not optimized for human-in-the-loop workflows. Even frontier model cards from Anthropic note that synchronous interactive use shows less clear benefits, with models perceived as too slow, whereas autonomous agent setups better elicit coding capabilities.
But real-world work, the Thinking Machines Lab team argues, requires true collaboration—humans clarifying, giving feedback, and interjecting. Effective human communication relies on what Clark and Brennan (1991) called grounding through copresence, contemporality, and simultaneity. Ong (1982) emphasized the richness of orality, while Hayek (1945) and Scott (1998) highlighted the irreplaceable value of localized, tacit knowledge.
"Autonomous interfaces are valuable, but the real work of problem-solving requires collaboration with human clarifying and feedback," the researchers state. "Current interfaces push humans out."
The solution, they contend, is not to bolt interactivity onto existing models via harnesses like voice-activity-detection—a lesson reminiscent of Sutton’s "Bitter Lesson" (2019). Instead, interactivity must be native to the model itself to scale with intelligence.
Capabilities and Approach: A Model That Lives in the Moment
When interactivity is native, a new range of behaviors emerges. The Thinking Machines Lab model demonstrates:
- Seamless dialog management: it tracks when to think, yield, self-correct, or invite a response implicitly.
- Verbal and visual interjections: it jumps in based on context, even mid-sentence.
- Simultaneous speech: for example, live translation while the user continues speaking.
- Time-awareness: it can initiate speech at a specific moment (e.g., breathing reminders).
- Concurrent tool calls, search, and generative UI while speaking or listening.
All of this happens in longer, continuous sessions without resetting the context.
System Overview
The model operates in a constant two-way exchange across audio, video, and text. It is built around two components:
- An interaction model that remains always present, processing streaming input in 200ms micro-turns with no turn boundaries.
- An asynchronous background model that handles deeper reasoning, delegation, and tool integration. The interaction model hands it the full conversation context; results are streamed back and interleaved into the ongoing conversation at appropriate moments.
Both models are intelligent; the interaction model is competitive on its own. The system builds on previous work from Qwen-omni, KAME, MoshiRAG, and others.
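As a rough illustration of the micro-turn idea, the sketch below slices a session into fixed 200 ms steps. At every step it consumes one input chunk and fills one output slot (possibly with silence), so there is no turn boundary to wait for. The 16 kHz sample rate, the `MicroTurn` type, and `toy_policy` are assumptions made for this example and do not come from the paper; the real model replaces the toy policy entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

MICRO_TURN_MS = 200          # from the article: 200 ms micro-turns
SAMPLE_RATE_HZ = 16_000      # assumption: the article gives no sample rate
SAMPLES_PER_TURN = SAMPLE_RATE_HZ * MICRO_TURN_MS // 1000


@dataclass
class MicroTurn:
    index: int                 # position of the micro-turn in the session
    start_ms: int              # offset of this chunk from session start
    heard: Sequence[float]     # input audio samples for this chunk
    said: Optional[str]        # output for the same chunk; None means silence


Policy = Callable[[Sequence[float], List[MicroTurn]], Optional[str]]


def split_into_chunks(samples: Sequence[float], size: int = SAMPLES_PER_TURN):
    """Cut a sample stream into fixed-size chunks (the last may be shorter)."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def run_session(samples: Sequence[float], policy: Policy) -> List[MicroTurn]:
    """Each micro-turn reads one input chunk and produces one output slot."""
    history: List[MicroTurn] = []
    for i, chunk in enumerate(split_into_chunks(samples)):
        said = policy(chunk, history)   # may speak while input keeps arriving
        history.append(MicroTurn(i, i * MICRO_TURN_MS, chunk, said))
    return history


def toy_policy(chunk: Sequence[float], history: List[MicroTurn]) -> Optional[str]:
    """Hypothetical stand-in for the model: interject when the input is loud."""
    loud = max((abs(s) for s in chunk), default=0.0) > 0.5
    return "mm-hm" if loud else None


# One second of audio: silent except for a loud sample in the third chunk.
audio = [0.0] * SAMPLE_RATE_HZ
audio[2 * SAMPLES_PER_TURN + 10] = 0.9
session = run_session(audio, toy_policy)
for turn in session:
    print(turn.index, turn.start_ms, turn.said)
```

The design point is that input and output share one clock: the policy is asked for output on every chunk rather than only after an end-of-speech signal from a voice-activity detector.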
Key Technical Innovations
- Time-aligned micro-turns: 200ms input/output chunks that avoid voice-activity-detection harnesses, enabling interjections, reactions to visuals, and speaking while listening.
- Encoder-free early fusion: audio processed as dMel (Bai et al., 2024), images as 40x40 patches via hMLP (Touvron et al., 2022), audio decoded with a flow head (Lipman et al., 2022). All components co-trained from scratch.
- Inference optimization: streaming sessions with persistent sequences in GPU memory, upstreamed to SGLang (pull request #19171). Optimized kernels for gather+gemv on Mixture-of-Experts layers.
- Trainer-sampler alignment: batch-invariant kernels with less than 5% overhead, using deterministic all-reduce/reduce-scatter via NVLS on Blackwell hardware.
- Robust safety: modality-appropriate refusals using TTS-generated refusal data, long-horizon red-teaming, and parity with text-based refusals.
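The trainer-sampler alignment item rests on a general property of floating-point arithmetic: addition is not associative, so a kernel that changes its reduction order with batch size can return different numbers for the same input. The sketch below shows this in plain Python. It is not the team's GPU kernels or their NVLS collectives; `adaptive_kernel` and `invariant_kernel` are hypothetical names, and the hand-written `naive_sum` is used because recent Python versions make the built-in `sum` more accurate for floats, which would hide the effect.

```python
def naive_sum(xs):
    """Left-to-right float accumulation, like a simple reduction loop."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def row_sum(row, num_splits):
    """Reduce one row via `num_splits` partial sums (similar to split-K)."""
    size = -(-len(row) // num_splits)  # ceiling division
    partials = [naive_sum(row[i:i + size]) for i in range(0, len(row), size)]
    return naive_sum(partials)


def adaptive_kernel(batch):
    """Not batch-invariant: the split count depends on the batch size."""
    splits = 2 if len(batch) == 1 else 1
    return [row_sum(row, splits) for row in batch]


def invariant_kernel(batch, splits=2):
    """Batch-invariant: every row uses the same reduction order."""
    return [row_sum(row, splits) for row in batch]


row = [1e16, 0.5, -1e16, 0.5]      # chosen so rounding depends on the order
other = [1.0, 2.0, 3.0, 4.0]

alone = adaptive_kernel([row])[0]
batched = adaptive_kernel([row, other])[0]
print("adaptive:", alone, batched, "equal:", alone == batched)

alone_inv = invariant_kernel([row])[0]
batched_inv = invariant_kernel([row, other])[0]
print("invariant:", alone_inv, batched_inv, "equal:", alone_inv == batched_inv)
```

Batch invariance here means reproducibility rather than higher accuracy: the invariant version returns the same value for `row` whether it is processed alone or alongside other rows, which is what lets a sampler and a trainer agree on the same computation.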
Benchmarking the Frontier: Intelligence and Interactivity
The flagship model, TML-Interaction-Small (276B total parameters, 12B active in a Mixture-of-Experts architecture), was evaluated on a new suite of benchmarks that measure both intelligence and interactivity simultaneously.
| Category | Benchmark · Modality | TML-Interaction-Small | GPT-realtime-2.0 (minimal) | GPT-realtime-1.5 | Gemini-3.1-flash-live (minimal) | Qwen 3.5 OMNI-plus-realtime | GPT-realtime-2.0 (xhigh) | Gemini-3.1-flash-live (high) |
|---|---|---|---|---|---|---|---|---|
| | Mode | Instant | Instant | Instant | Instant | Instant | Thinking | Thinking |
| Streaming | FD-bench V1 Turn-taking latency (s) · Audio | 0.40 | 1.18 | 0.59 | 0.57 | 2.14 | 1.63 | 0.94 |
| Streaming | FD-bench V1.5 Average · Audio | 77.8 | 46.8 | 48.3 | 54.3 | 39.0 | 47.8 | 45.5 |
| Streaming | FD-bench V3 Response Quality (%) / Pass@1 (%) · Audio + Tools | 82.8* / 68.0* | 80.0 / 52.0 | 77.9 / 55.0 | 68.5 / 48.0 | 60.0 / 50.0 | 81.0 / 58.0 | 71.4 / 48.0 |
| Streaming | QIVD** Accuracy (%) · Video + Audio | 54.0 | 57.5 | 41.2 | 54.7 | 59.0 | 58.2 | 56.1 |
| Turn-based | Audio MultiChallenge APR (%) · Audio | 43.4 | 37.6 | 34.7 | 26.8 | -*** | 48.5 | 36.1 |
| Turn-based | BigBench Audio Accuracy (%) · Audio | 75.7 / 96.5* | 71.8 | 81.4 | 71.3 | 73.0 | 96.6**** | 96.6 |
| Turn-based | IFEval (VoiceBench) Accuracy (%) · Audio | 82.1 | 81.7 | 68.1 | 67.6 | 80.3 | 83.2 | 82.8 |
| Turn-based | IFEval Accuracy (%) · Text | 89.7 | 89.6 | 87.5 | 85.8 | 83.4 | 95.2 | 90.0 |
| Turn-based | Harmbench Refusal rate (%) · Text | 99.0 | 99.5 | 100.0 | 99.0 | 99.5 | 100.0 | 98.0 |
* Background agent enabled for reasoning/tool calls. ** QIVD (Qualcomm IVD): streaming video-audio QA; GPT-4o-mini grader. *** Not listed by Scale AI. **** Reported by Artificial Analysis.
In streaming tasks, TML-Interaction-Small achieved the best turn-taking latency (0.40s) and the highest FD-bench V1.5 average score (77.8) of all models tested. Among the instant models it also led in instruction-following for audio (IFEval VoiceBench: 82.1%), trailing only the two thinking configurations (83.2% and 82.8%), and it showed a competitive refusal rate (99.0%).
New Dimensions of Interactivity
To measure behaviors that no existing model performs well, the team created two internal benchmarks:
- TimeSpeak: initiate speech at specified times with correct content (e.g., breathing reminders). LLM-judged semantic + timing accuracy.
- CueSpeak: speak at the appropriate moment with correct response, even simultaneously with the user (e.g., codeswitch correction). Macro-averaged accuracy.
And three visual proactivity benchmarks:
- RepCount-A: stream video + audio instruction to count repetitions; grade last number vs. ground truth.
- ProactiveVideoQA: answer only when the answer becomes available in a streaming video; silence baseline at 25.0.
- Charades: stream video + audio instruction; say 'start'/'stop' at action intervals; temporal IoU.
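Two of the scoring rules named above have standard definitions that are easy to state in code: temporal IoU (used for Charades) and macro-averaged accuracy (used for CueSpeak). The sketch below implements those textbook definitions. The article does not describe the team's actual scoring scripts, so the function names, the interval format, and the category labels are assumptions for illustration.

```python
from collections import defaultdict


def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals in seconds."""
    (ps, pe), (ts, te) = pred, truth
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0


def macro_accuracy(results):
    """Average the per-category accuracies so each category counts equally.

    `results` is an iterable of (category, is_correct) pairs.
    """
    per_category = defaultdict(list)
    for category, is_correct in results:
        per_category[category].append(1.0 if is_correct else 0.0)
    if not per_category:
        return 0.0
    accuracies = [sum(v) / len(v) for v in per_category.values()]
    return sum(accuracies) / len(accuracies)


# The model said "start" at 2 s and "stop" at 6 s; the action ran from 4 s to 8 s.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))   # 2 s overlap over a 6 s union

# Hypothetical categories: three of four correct in one, one of one in the other.
outcomes = [
    ("codeswitch", True), ("codeswitch", True),
    ("codeswitch", True), ("codeswitch", False),
    ("interjection", True),
]
print(macro_accuracy(outcomes))               # mean of 0.75 and 1.0
```

Macro-averaging keeps a large category from dominating the score: in the example the plain (micro) accuracy would be 4/5, while the macro average weights both categories equally.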
According to the team, no existing commercial model performs these tasks meaningfully; GPT-realtime-2.0 (minimal), for example, is reported as either similar or worse. The team has partnered with benchmark providers and plans to open a research grant for further evaluation.
Model Details, Limitations, and Road Ahead
TML-Interaction-Small uses 276B total parameters with 12B active via Mixture-of-Experts. The team describes it as the first model to achieve a strong balance of intelligence and native interactivity.
However, the researchers are candid about limitations:
- Long sessions: current streaming handles short to medium contexts; context accumulation remains a challenge.
- Compute and deployment: the model is described as robust to delays, but it still requires reliable connectivity and high-end hardware.
- Alignment and safety: this is a new research area. The team is collecting user feedback and has opened grants for safety research.
- Scaling: larger pretrained models are currently too slow for real-time interaction; a scaled version is planned for release later in 2026.
- Background agents: deeper integration and agentic intelligence improvements are ongoing.
Release Plans
The limited research preview is now open for feedback. A wider release, including a larger model and improved background agents, is expected later in 2026. Developers can apply for early access via the Thinking Machines Lab jobs board and provide feedback at interaction@thinkingmachines.ai.
"Interaction models represent a fundamental shift in how humans and AI can work together," the team concludes. "We believe this approach will eventually become the default way we interact with intelligent systems."
Citation: Thinking Machines Lab, "Interaction Models: A Scalable Approach to Human-AI Collaboration", Thinking Machines Lab: Connectionism, May 2026. DOI: 10.64434/tml.20260511.
