Training

Exploring the motivations, training data, capabilities, and community reactions to a language model that only knows the world before 1931

Inside Talkie: The 13B LM Trained Only on Pre-1931 Text

Talkie is a 13B-parameter language model trained exclusively on 260 billion tokens of text published before 1931. Built by Nick Levine, Alec Radford, and David Duvenaud to study AI generalization, it has sparked discussion of historical perspective and anachronistic outputs. This deep dive covers data sources, processing, limitations, and public release plans.

Community benchmarks show MTP running no faster, and sometimes slower, than non-MTP inference on the RTX 5090, 7900 XTX, and dual 3080s; only a mixed VRAM setup sees a boost.

Gemma 4 MTP Fails to Deliver Speed Gains on Top GPUs

Reddit users tested the work-in-progress Gemma 4 MTP (multi-token prediction) model. Most high-end GPU configurations saw equal or worse performance compared to non-MTP inference; only a mixed VRAM/CPU setup showed a significant speedup. Stability issues were also reported, and the community anticipates further optimizations.

Combining hierarchical latent tokenization with block-wise discrete diffusion and self-speculation for faster byte-level language models

Fast Byte Latent Transformer: Efficient Byte-Level Generation via Diffusion and Speculation

This paper introduces BLT Diffusion (BLT-D), BLT Self-speculation (BLT-S), and BLT Diffusion+Verification (BLT-DV) to accelerate byte-level language models. By replacing autoregressive decoding with block-wise diffusion and verification, the methods reduce memory bandwidth by over 50%, and by up to 92% with larger blocks, while maintaining competitive performance on translation and code generation tasks.