How Bonsai 4B's Ternary Weights Revolutionize Compact Text-to-Image AI

Discover the innovative ternary weight architecture of Bonsai Image Ternary 4B, achieving 6.4x model size reduction with high visual fidelity and rapid inference on consumer hardware.

Explore Bonsai Image Ternary 4B, a 1.21 GB Diffusion Transformer using ternary weights for efficient text-to-image generation. Learn how this model delivers fast, high-quality results without negative prompts, running natively on Linux and Windows with CUDA.

A 1.21 GB Diffusion Transformer with Ternary Weights

Bonsai Image Ternary 4B is a compact text-to-image diffusion model built on the FLUX.2 Klein 4B architecture. Its core innovation is ternary weights: each weight takes a value in {−1, 0, +1}, which shrinks the transformer from 7.75 GB to 1.21 GB (6.4× smaller). Compared with a binary {−1, +1} scheme, the ternary design adds a zero state that improves visual fidelity and prompt coherence: in the benchmarks below, the ternary variant outscores the binary one on GenEval, HPSv3, and DPG-Bench.

The model runs natively on Linux and Windows through CUDA and the Gemlite low‑bit kernel, delivering a 1024×1024 image in 4.5 s on an RTX 3080 and 2.8 s on an A100. A 4‑step FlowMatch‑Euler sampler with guidance = 1.0 eliminates the need for negative prompts or CFG. A companion Apple Silicon variant is also available.

Ternary Weight Representation and Storage

Each ternary weight is defined as:

w_i = scale_g * t_i, t_i ∈ {−1, 0, +1}

One shared FP16 scale is stored per group of 128 weights. A ternary value carries log₂(3) ≈ 1.585 bits, and the scale adds 16/128 = 0.125 bits per weight, so the effective bit-width is about 1.71 bits/weight, an idealized 16/1.71 ≈ 9.4× reduction relative to FP16. All 100 matmul-heavy linear layers (Q/K/V projections, MLP weights, double-stream add-K/Q/V linears) are ternary; precision-sensitive supporting tensors remain FP16.
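The group-wise representation and the bit-width arithmetic above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model's actual quantizer: the absmean rounding rule in `quantize_ternary` is an assumption borrowed from common ternary schemes, and all function names are hypothetical.

```python
import math
import numpy as np

GROUP_SIZE = 128  # weights per shared FP16 scale, as stated in the text


def quantize_ternary(w, group_size=GROUP_SIZE):
    """Ternarize a flat weight vector group by group.

    Assumed rule (absmean): scale_g = mean(|w|) over the group,
    t_i = round(w_i / scale_g) clipped to {-1, 0, +1}.
    """
    groups = w.reshape(-1, group_size).astype(np.float32)
    scale = np.abs(groups).mean(axis=1, keepdims=True)
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    t = np.clip(np.rint(groups / scale), -1, 1).astype(np.int8)
    return t, scale.astype(np.float16)


def dequantize_ternary(t, scale):
    """Reconstruct w_i = scale_g * t_i for every group g."""
    return (t.astype(np.float32) * scale.astype(np.float32)).reshape(-1)


def effective_bits(group_size=GROUP_SIZE, scale_bits=16):
    """Entropy of a ternary symbol plus the amortized scale overhead."""
    return math.log2(3) + scale_bits / group_size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(1024).astype(np.float32)
    t, scale = quantize_ternary(w)
    w_hat = dequantize_ternary(t, scale)
    print("bits/weight:", round(effective_bits(), 3))
    print("idealized reduction vs FP16:", round(16 / effective_bits(), 2))
```

The zero code lets small weights be dropped outright instead of being forced to ±scale, which is the intuition behind the "zero state" described above.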

| Format | Transformer size | Reduction | Ratio |
|---|---|---|---|
| FP16 FLUX.2 Klein 4B | 7.75 GB | - | 1.0× |
| Ternary Bonsai 4B | 1.21 GB | 84.4% | 6.4× |

The CUDA deployment uses an INT2 pack (2 bits per ternary), resulting in a 1.54 GB on‑disk representation. The zero state is the quality lever that keeps visual performance close to the full FP16 model.
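To make the INT2 pack concrete, here is a hedged sketch of one way to store four ternary values per byte. The bit layout (offset-by-one codes, lowest bits first) and the function names are assumptions for illustration; the layout Gemlite actually uses may differ.

```python
import numpy as np


def pack_int2(t):
    """Pack ternary values {-1, 0, +1} into 2-bit codes, four per byte.

    Assumed mapping: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10 (0b11 is unused).
    """
    t = np.asarray(t, dtype=np.int8).reshape(-1)
    if t.size % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    codes = (t + 1).astype(np.uint8).reshape(-1, 4)
    packed = (
        codes[:, 0]
        | (codes[:, 1] << 2)
        | (codes[:, 2] << 4)
        | (codes[:, 3] << 6)
    )
    return packed.astype(np.uint8)


def unpack_int2(packed):
    """Inverse of pack_int2: recover the int8 ternary values."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.astype(np.int8).reshape(-1) - 1


if __name__ == "__main__":
    t = np.array([-1, 0, 1, 1, 0, 0, -1, 1], dtype=np.int8)
    packed = pack_int2(t)
    assert np.array_equal(unpack_int2(packed), t)
    print(packed.nbytes, "bytes for", t.size, "ternary weights")
```

Spending 2 bits on a symbol that carries about 1.585 bits wastes one of the four codes, which is consistent with the packed file being larger than the 1.21 GB figure.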

Deployment Architecture and Runtime Efficiency

The model backbone is FLUX.2 Klein 4B, a 25‑block MMDiT diffusion transformer (5 double‑stream + 20 single‑stream). The sampler is FlowMatchEuler‑discrete with 4 steps, guidance = 1.0, and shift = 3.0.
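The sampler settings can be illustrated with a small flow-matching Euler loop. This is a sketch under stated assumptions rather than the project's scheduler code: the sigma grid and the time-shift formula follow a commonly used convention, and `velocity_fn` is a hypothetical stand-in for the diffusion transformer.

```python
import numpy as np


def shifted_sigmas(num_steps=4, shift=3.0):
    """Noise levels from 1 down to 0, warped by a time-shift factor.

    Assumed schedule: a linear grid 1, ..., 1/num_steps mapped through
    sigma' = shift * sigma / (1 + (shift - 1) * sigma), then a final 0.
    """
    sigmas = np.linspace(1.0, 1.0 / num_steps, num_steps)
    sigmas = shift * sigmas / (1.0 + (shift - 1.0) * sigmas)
    return np.append(sigmas, 0.0)


def euler_sample(velocity_fn, noise, num_steps=4, shift=3.0):
    """Integrate dx/dsigma = v(x, sigma) from sigma = 1 to sigma = 0."""
    sigmas = shifted_sigmas(num_steps, shift)
    x = noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)          # one model call per step
        x = x + (sigma_next - sigma) * v   # explicit Euler update
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((4, 4))
    noise = rng.standard_normal((4, 4))
    # Toy velocity for a straight noise-to-data path: v = noise - data.
    sample = euler_sample(lambda x, sigma: noise - data, noise)
    assert np.allclose(sample, data)
    print(shifted_sigmas())
```

With 4 steps and shift = 3.0 this grid evaluates to sigmas 1.0, 0.9, 0.75, 0.5, 0.0, so the shift spends more of the step budget at high noise levels. Because guidance = 1.0, each step needs only one model call, with no separate unconditional pass.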

At inference, the Qwen3‑4B text encoder is compressed to 4‑bit HQQ and offloaded immediately after prompt encoding. The denoising loop therefore only carries the compact ternary transformer and an FP16 VAE with tiled 128 px decode. Total CUDA payload is 4.55 GB:

| Component | Size |
|---|---|
| Gemlite INT2 transformer | 1.54 GB |
| HQQ 4-bit text encoder | 2.84 GB |
| FP16 VAE | 0.17 GB |

Peak GPU memory at 1024² on an RTX 3080 is ~6.8 GiB end-to-end. The stack works natively on Linux x86_64 and Windows via the same CUDA/Gemlite kernels.
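Tiled decoding keeps peak memory bounded by decoding the latent in small patches instead of all at once. The sketch below shows the idea with a toy decoder; the upsampling factor, the latent tile size, and the function names are hypothetical, and a real VAE would typically need overlapping tiles with blending to hide seams.

```python
import numpy as np


def toy_decode(latent, factor=8):
    """Stand-in for a VAE decoder: nearest-neighbour upsampling."""
    up = np.repeat(latent, factor, axis=-2)
    return np.repeat(up, factor, axis=-1)


def tiled_decode(latent, decode_fn, tile=16, factor=8):
    """Decode a (C, H, W) latent tile by tile and stitch the outputs.

    With the assumed factor of 8, a 16-latent tile yields a 128 px patch,
    so only one small patch is decoded at a time.
    """
    _, h, w = latent.shape
    out = None
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            piece = decode_fn(latent[:, y:y + tile, x:x + tile])
            if out is None:
                shape = (piece.shape[0], h * factor, w * factor)
                out = np.empty(shape, dtype=piece.dtype)
            oy, ox = y * factor, x * factor
            out[:, oy:oy + piece.shape[1], ox:ox + piece.shape[2]] = piece
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((4, 48, 40)).astype(np.float32)
    full = toy_decode(latent)
    tiled = tiled_decode(latent, toy_decode)
    assert np.array_equal(full, tiled)
    print(tiled.shape)
```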

Throughput and Benchmark Performance

Throughput (seconds per image, 4 denoising steps)

| Platform | 512² (s) | 1024² (s) |
|---|---|---|
| A100 (Colab) | 1.1 | 2.8 |
| RTX PRO 6000 Blackwell (Colab) | 1.0 | 2.1 |
| RTX 3080 10 GB | 1.4 | 4.5 |
| RTX 3060 6 GB (laptop) | 3.3 | 17.5 |

Benchmarks (higher is better on every metric). Comparison models were evaluated under matched settings; smaller backbones were tested at 512×512 where noted.

| Model | Transf. (GB) | GenEval | HPSv3 | DPG-Bench |
|---|---|---|---|---|
| Bonsai Ternary 4B | 1.21 | 0.723 | 12.22 | 0.851 |
| Bonsai Binary 4B | 0.93 | 0.671 | 11.15 | 0.822 |
| FLUX.2 Klein 4B (FP16) | 7.75 | 0.819 | 12.84 | 0.853 |
| FLUX.1-schnell | 23.8 | 0.716 | 12.67 | 0.848 |
| SDXL | 5.14 | 0.300 | 10.05 | 0.740 |
| PixArt-Σ XL 2 | 1.20 | 0.541 | 11.93 | 0.769 |
| Stable Diffusion 1.5 | 1.72 | 0.396 | 4.20 | 0.601 |
| BK-SDM-Small | 0.98 | 0.297 | 3.05 | 0.559 |

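Bonsai Image Ternary 4B stays close to the FP16 FLUX.2 Klein 4B while reducing the transformer footprint by 6.4×: it retains about 88% of the GenEval score (0.723 vs. 0.819), 95% of HPSv3 (12.22 vs. 12.84), and 99.8% of DPG-Bench (0.851 vs. 0.853). It also edges out the much larger FLUX.1-schnell on GenEval and DPG-Bench, moving the quality-footprint frontier.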

Use Cases and Limitations

Use Cases

  • Local creative tooling on CUDA‑equipped consumer GPUs.
  • Private generation with data residency for compliance‑sensitive workflows.
  • Rapid iteration thanks to low latency and no remote queue.
  • Commodity‑GPU serving with reduced memory pressure.
  • Native deployment on Windows and Linux.

Limitations

  • Not bit‑identical to the FP16 FLUX.2 Klein 4B; quality depends on prompt and detail complexity.
  • Ternary execution relies on Gemlite low‑bit GEMM kernels, as standard hardware paths are not yet fully ternary‑native.
  • Once the transformer is compressed, the remaining components, notably VAE decode, can become the visible memory bottleneck; text-encoder offloading and tiled decode mitigate this.

Quick Start

Linux (bash):

```bash
git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
./scripts/download_model.sh   # ternary is the default
./scripts/serve.sh
```

Windows (PowerShell):

```powershell
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned   # one-time
.\setup.ps1
.\scripts\download_model.ps1
.\scripts\serve.ps1
```

Python usage:

```python
from backend_gpu.server import build_pipeline

# Load the Gemlite 2-bit ternary checkpoint.
pipe = build_pipeline(model_id="prism-ml/bonsai-image-ternary-4B-gemlite-2bit")

# 4 steps with guidance 1.0, matching the sampler settings described above;
# no negative prompt is needed.
image = pipe(
    prompt="A bonsai tree in a quiet ceramic studio, soft morning light",
    num_inference_steps=4,
    guidance_scale=1.0,
    height=1024,
    width=1024,
).images[0]
image.save("bonsai.png")
```