Open Source

A novel framework overcoming high-resolution bottlenecks with mask-free shifted-window attention and lightweight autoencoders for live-stream applications.

SwiftVR: Real-Time Generative Video Restoration on Consumer GPUs

SwiftVR is a streaming one-step generative video restoration framework for live-stream applications. It addresses consumer GPU bottlenecks with mask-free shifted-window self-attention and a lightweight autoencoder, achieving real-time 1080p streaming on consumer-grade GPUs and 4K on an H100.

Achieving state-of-the-art performance with AudioVAE, full-history conditioning, and reward-free self-corrective post-training for robust, expressive, and efficient speech synthesis.

dots.tts: 2B-Parameter Continuous Autoregressive TTS Foundation Model

dots.tts is a 2B-parameter continuous autoregressive text-to-speech foundation model. It uses AudioVAE, full-history conditioning, and reward-free self-corrective post-training to reach state-of-the-art performance on multilingual benchmarks, offering strong generation stability, voice cloning, and emotional expressiveness, with efficiency gained through MeanFlow distillation.

Introducing three core primitives for aggregating diverse models to achieve lower validation loss and improved data efficiency.

Hyper-Epoch Pretraining (q0) for Data-Constrained Language Models

1Q Labs researchers introduce Hyper-Epoch Pretraining (q0), a conceptual shift from single-model training to exploring and aggregating a population of models. q0 uses cyclic schedules, chain distillation, and a learned prior to achieve significant data efficiency gains and lower validation loss in multi-epoch pretraining.

Microsoft Research's text-space optimizer enables self-evolving agent capabilities, demonstrated in a multimodal paper-figure extraction task.

SkillOpt: Optimizing Agent Skills with Trainable Natural-Language Descriptions

SkillOpt, from Microsoft Research, is a text-space optimizer that treats agent skill documentation as trainable external state. This approach lets agents evolve their own capabilities, as shown by @omarsar0's integration, which improved paper-figure extraction quality by 20 points.

User u/FineTime5266 shares surprising results from DALL-E 3 using solely emoji strings, sparking community interest and discussion.

Emoji-Only Prompts Drive AI Image Generation Experiment on r/ChatGPT

An r/ChatGPT user, u/FineTime5266, details experiments with AI image generation using only emoji prompts, showcasing surprisingly good results. The post includes example emoji strings and an AutoModerator message regarding prompt sharing and Discord community engagement.

Explore NAVA's Align-then-Fuse MMDiT architecture for native audio-visual alignment, enabling precise multi-timbre control and language-described camera movements.

How NAVA Generates Synchronized 720p Audio-Video from a Single Prompt

NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized 720p video and audio from a single prompt. It utilizes an Align-then-Fuse MMDiT architecture to establish audio-video correspondence, offering features like multi-speaker speech with timbre control, fast generation, and language-described camera control.

Leverage stealth addresses and x402 HTTP payments for private, auditable on-chain activity without sacrificing security or using special tokens.

How ProwlFi Enables Confidential Solana Transactions for AI Agents

ProwlFi provides infrastructure for Solana-based AI agents to achieve transaction confidentiality using single-use stealth addresses and x402 HTTP payments. Learn how it offers a private, auditable trail for operators while keeping payments unlinkable and invisible to the public, all on standard Solana infrastructure.

Explore the LFM2.5 hybrid model architecture for efficient, agentic, and multilingual personal assistants on diverse hardware.

How LFM2.5-8B-A1B Powers On-Device AI with Unmatched Throughput

LFM2.5-8B-A1B is a new family of hybrid models designed for on-device deployment, building on the LFM2 architecture with extended pre-training and reinforcement learning. Its performance on instruction following and agentic tasks is competitive with larger models, and it claims unmatched throughput for CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

This project isn't just a clever name; it's a robust, distributed AI architecture inspired by the iconic sitcom.

Munder Difflin: Beyond The Office's Humor, a Serious Open-Source Multi-Agent System Emerges

Explore Munder Difflin, an open-source multi-agent system drawing inspiration from "The Office." This project offers a practical, distributed AI architecture, demonstrating how pop culture can spark serious software innovation.

Examining user reactions and observed behaviors of the new AI memory feature from public discussion forums.

ChatGPT's Memory System: Invasive, Irrelevant, or Inevitable?

A new ChatGPT memory system that generates conversation summaries and carries them across chats faces user criticism for being invasive, irrelevant, and detrimental to structured projects. Observed behaviors include continuous "gigantic summaries," meta-level statements, and context carried between chats, leaving users annoyed and frustrated over their lack of control.

A practical guide to implementing a rigorous, preregistered workflow for computational research with zero third-party dependencies.

How Science Superpowers Transforms AI Agents into Disciplined Scientific Collaborators

Science Superpowers guides AI agents through a rigorous, preregistered workflow for scientific collaboration, ensuring precision, reproducibility, and protection against p-hacking. This guide details its functionality, emphasizing its design with zero third-party dependencies and its installation across agent harnesses such as Cursor, Claude Code, and Gemini CLI.

Explore UNISON, a single-model framework leveraging latent flow-matching and Qwen2.5-Omni-7B for diverse audio tasks, from text-to-audio to complex scene editing.

How UNISON Unifies Audio and Speech Generation with Deep LLM Fusion

UNISON is a unified latent flow-matching framework for audio and speech generation and editing. With a single set of weights, it covers text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound scene generation, and audio/speech-in-scene editing in one model, one architecture, and one forward pass, using deep LLM fusion with Qwen2.5-Omni-7B.