
How Bonsai 4B's Ternary Weights Revolutionize Compact Text-to-Image AI

Discover the innovative ternary weight architecture of Bonsai Image Ternary 4B, achieving 6.4x model size reduction with high visual fidelity and rapid inference on consumer hardware.

May 27, 2026


Explore Bonsai Image Ternary 4B, a 1.21 GB Diffusion Transformer using ternary weights for efficient text-to-image generation. Learn how this model delivers fast, high-quality results without negative prompts, running natively on Linux and Windows with CUDA.

A 1.21 GB Diffusion Transformer with Ternary Weights

Bonsai Image Ternary 4B is a compact text-to-image diffusion model built on the FLUX.2 Klein 4B architecture. Its core innovation is the use of ternary weights (each weight takes a value in {−1, 0, +1}), which shrinks the transformer from 7.75 GB to just 1.21 GB (6.4× smaller). Compared with a binary (two-valued) scheme, the ternary design adds a zero state that improves visual fidelity and prompt coherence: the ternary model scores higher than the binary Bonsai variant on every benchmark reported in this article.

The model runs natively on Linux and Windows through CUDA and the Gemlite low-bit kernel, delivering a 1024×1024 image in 4.5 s on an RTX 3080 and 2.8 s on an A100. A 4-step FlowMatch-Euler sampler with guidance = 1.0 removes the need for negative prompts or classifier-free guidance (CFG). A companion Apple Silicon variant is also available.

Ternary Weight Representation and Storage

Each ternary weight is defined as:

w_i = scale_g * t_i, t_i ∈ {−1, 0, +1}

Here scale_g is the single FP16 scale shared by the group g of 128 weights that contains weight i. A ternary value carries log₂(3) ≈ 1.585 bits, and the scale adds 16 / 128 = 0.125 bits per weight, so the effective bit-width is about 1.71 bits/weight, an idealized 9.4× reduction relative to FP16 (16 / 1.71). All 100 matmul-heavy linear layers (Q/K/V projections, MLP weights, and double-stream add-K/Q/V linears) are ternary; precision-sensitive supporting tensors remain FP16.
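
The representation above can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual kernel: the helper names `dequantize_group` and `effective_bits_per_weight` and the toy four-weight group are assumptions made for the example, while the arithmetic uses only the figures stated in the text.

```python
import math

GROUP_SIZE = 128   # weights sharing one scale (from the text)
SCALE_BITS = 16    # one FP16 scale per group (from the text)


def dequantize_group(scale, ternary):
    """Reconstruct w_i = scale_g * t_i for one group (hypothetical helper)."""
    if any(t not in (-1, 0, 1) for t in ternary):
        raise ValueError("ternary values must be -1, 0, or +1")
    return [scale * t for t in ternary]


def effective_bits_per_weight(group_size=GROUP_SIZE, scale_bits=SCALE_BITS):
    """Information in one ternary digit plus the amortized cost of the scale."""
    return math.log2(3) + scale_bits / group_size


# Toy group of 4 weights for readability (real groups hold 128).
weights = dequantize_group(0.5, [-1, 0, 1, 1])
bits = effective_bits_per_weight()
ideal_ratio = 16 / bits  # relative to 16-bit FP16 weights

print(weights)
print(f"{bits:.2f} bits/weight, {ideal_ratio:.1f}x smaller than FP16")
```

The printed bit-width and ratio should match the 1.71 bits/weight and 9.4× figures quoted above.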

Format	Transformer size	Reduction	Ratio
FP16 FLUX.2 Klein 4B	7.75 GB	—	1.0×
Bonsai Image Ternary 4B	1.21 GB	84.4%	6.4×
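
The Reduction and Ratio columns follow directly from the two sizes, as this short check shows. The variable names are illustrative. The measured ratio is lower than the idealized per-weight figure, which is consistent with some tensors remaining in FP16.

```python
FP16_GB = 7.75      # FLUX.2 Klein 4B transformer (from the table)
TERNARY_GB = 1.21   # ternary Bonsai transformer (from the table)

ratio = FP16_GB / TERNARY_GB
reduction_pct = (1 - TERNARY_GB / FP16_GB) * 100

print(f"{ratio:.1f}x smaller, {reduction_pct:.1f}% reduction")
```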

The CUDA deployment packs each ternary value into 2 bits (INT2) rather than the ~1.585-bit ideal, resulting in a 1.54 GB on-disk representation. The zero state is the quality lever that keeps visual performance close to the full FP16 model.
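
To make the 2-bit packing concrete, here is a minimal sketch that stores four ternary values per byte. The code mapping (−1 → 0, 0 → 1, +1 → 2) and the bit order inside each byte are assumptions chosen for illustration; the article does not describe the actual Gemlite layout.

```python
def pack_ternary(values):
    """Pack ternary values into bytes, four 2-bit codes per byte."""
    if any(v not in (-1, 0, 1) for v in values):
        raise ValueError("ternary values must be -1, 0, or +1")
    codes = [v + 1 for v in values]        # assumed mapping: -1->0, 0->1, +1->2
    codes += [1] * (-len(codes) % 4)       # pad to a multiple of 4 with the code for 0
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)        # first value sits in the lowest two bits
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary values from packed bytes."""
    values = []
    for byte in packed:
        for j in range(4):
            values.append(((byte >> (2 * j)) & 0b11) - 1)
    return values[:count]


original = [-1, 0, 1, 1, 0, -1]
packed = pack_ternary(original)
restored = unpack_ternary(packed, len(original))
assert restored == original
print(len(packed), "bytes for", len(original), "weights")
```

Each 2-bit slot has four possible codes but only three are used, which is why a 2-bit pack takes more space than the 1.585-bit information content would suggest.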

Deployment Architecture and Runtime Efficiency

The model backbone is FLUX.2 Klein 4B, a 25‑block MMDiT diffusion transformer (5 double‑stream + 20 single‑stream). The sampler is FlowMatchEuler‑discrete with 4 steps, guidance = 1.0, and shift = 3.0.
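
The sampler settings can be illustrated with a small, dependency-free sketch of a shifted flow-matching schedule and Euler integration. It assumes a linearly spaced base schedule and the static time-shift formula commonly used by flow-matching Euler schedulers; whether Bonsai's pipeline builds its schedule exactly this way is not stated in the article. The function names and the toy velocity function are hypothetical, and in the real model the velocity comes from the ternary transformer.

```python
def shifted_sigmas(num_steps=4, shift=3.0):
    """Noise levels from 1 down to 0, warped toward high noise by `shift`."""
    base = [1.0 - i / num_steps for i in range(num_steps)]   # 1.0, 0.75, 0.5, 0.25
    warped = [shift * s / (1.0 + (shift - 1.0) * s) for s in base]
    return warped + [0.0]                                     # finish at zero noise


def euler_sample(velocity_fn, x, sigmas):
    """Integrate dx/dsigma = v(x, sigma) with one Euler step per interval."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: on the straight path x_sigma = (1 - sigma) * data + sigma * noise,
# the exact velocity is the constant (noise - data), so Euler recovers `data`.
data = [0.2, -0.4]
noise = [1.0, 0.5]
sigmas = shifted_sigmas()
result = euler_sample(lambda x, s: [n - d for n, d in zip(noise, data)], noise, sigmas)

print([round(s, 3) for s in sigmas])
print([round(r, 6) for r in result])
```

With guidance = 1.0 there is no separate unconditional pass, so each of the 4 steps costs a single transformer evaluation in this picture.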

At inference, the Qwen3‑4B text encoder is compressed to 4‑bit HQQ and offloaded immediately after prompt encoding. The denoising loop therefore only carries the compact ternary transformer and an FP16 VAE with tiled 128 px decode. Total CUDA payload is 4.55 GB:

Component	Size
Gemlite INT2 transformer	1.54 GB
HQQ 4‑bit text encoder	2.84 GB
FP16 VAE	0.17 GB
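
The component sizes add up as follows. This short check also shows the weight footprint that remains once the text encoder has been offloaded; the dictionary keys simply mirror the table rows, and the figures count weights only, so activations and runtime buffers come on top.

```python
payload_gb = {
    "Gemlite INT2 transformer": 1.54,
    "HQQ 4-bit text encoder": 2.84,
    "FP16 VAE": 0.17,
}

total_gb = sum(payload_gb.values())
# The text encoder is offloaded right after prompt encoding (per the text),
# so the denoising loop only holds the transformer and the VAE.
denoise_gb = payload_gb["Gemlite INT2 transformer"] + payload_gb["FP16 VAE"]

print(f"total: {total_gb:.2f} GB, resident while denoising: {denoise_gb:.2f} GB")
```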

Peak GPU memory at 1024² on an RTX 3080 is ~6.8 GiB end-to-end. The stack works natively on Linux x86_64 and Windows via the same CUDA/Gemlite kernels.

Throughput and Benchmark Performance

Latency per image in seconds (4 denoising steps)

Platform	512² (s)	1024² (s)
A100 (Colab)	1.1	2.8
RTX PRO 6000 Blackwell (Colab)	1.0	2.1
RTX 3080 10 GB	1.4	4.5
RTX 3060 6 GB (laptop)	3.3	17.5
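
The table reports seconds per image; converting to images per minute gives a throughput view of the same data. The sketch below assumes images are generated one after another with no batching or pipelining, which the article does not specify, and the helper name is illustrative.

```python
# (512² seconds, 1024² seconds) per image, copied from the table above.
latency_s = {
    "A100 (Colab)": (1.1, 2.8),
    "RTX PRO 6000 Blackwell (Colab)": (1.0, 2.1),
    "RTX 3080 10 GB": (1.4, 4.5),
    "RTX 3060 6 GB (laptop)": (3.3, 17.5),
}


def images_per_minute(seconds_per_image):
    """Sequential throughput implied by a per-image latency."""
    return 60.0 / seconds_per_image


for platform, (t512, t1024) in latency_s.items():
    slowdown = t1024 / t512  # cost of 4x the pixels on the same GPU
    print(f"{platform}: {images_per_minute(t1024):.1f} img/min at 1024², "
          f"{slowdown:.1f}x slower than 512²")
```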

Benchmarks (higher is better for every metric). Comparison models were evaluated under matched settings; smaller backbones were tested at 512×512 where noted.

Model	Transf. (GB)	GenEval	HPSv3	DPG-Bench
Bonsai Ternary 4B	1.21	0.723	12.22	0.851
Bonsai Binary 4B	0.93	0.671	11.15	0.822
FLUX.2 Klein 4B (FP16)	7.75	0.819	12.84	0.853
FLUX.1-schnell	23.8	0.716	12.67	0.848
SDXL	5.14	0.300	10.05	0.740
PixArt-Σ XL 2	1.20	0.541	11.93	0.769
Stable Diffusion 1.5	1.72	0.396	4.20	0.601
BK-SDM-Small	0.98	0.297	3.05	0.559

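Bonsai Image Ternary 4B stays close to the FP16 FLUX.2 Klein 4B on DPG-Bench (0.851 vs 0.853) and HPSv3 (12.22 vs 12.84), with a larger gap on GenEval (0.723 vs 0.819), while reducing the transformer footprint by 6.4×. That combination moves the quality-footprint frontier.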
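
One rough way to quantify the gap to the FP16 baseline is the fraction of each score the ternary model retains. This is a crude summary, since the three metrics use different scales, and the variable names are illustrative.

```python
# (ternary score, FP16 FLUX.2 Klein 4B score), copied from the table above.
scores = {
    "GenEval": (0.723, 0.819),
    "HPSv3": (12.22, 12.84),
    "DPG-Bench": (0.851, 0.853),
}

retained_pct = {
    name: 100.0 * ternary / fp16 for name, (ternary, fp16) in scores.items()
}

for name, pct in retained_pct.items():
    print(f"{name}: {pct:.1f}% of the FP16 score")
```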

Use Cases and Limitations

Use Cases

Local creative tooling on CUDA‑equipped consumer GPUs.
Private generation with data residency for compliance‑sensitive workflows.
Rapid iteration thanks to low latency and no remote queue.
Commodity‑GPU serving with reduced memory pressure.
Native deployment on Windows and Linux.

Limitations

Not bit‑identical to the FP16 FLUX.2 Klein 4B; quality depends on prompt and detail complexity.
Ternary execution relies on Gemlite low‑bit GEMM kernels, as standard hardware paths are not yet fully ternary‑native.
After compressing the transformer, the VAE can become a visible memory bottleneck, mitigated by text‑encoder offloading and tiled decode.

Getting Started

On Linux (bash):

git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
./scripts/download_model.sh   # ternary is the default
./scripts/serve.sh

On Windows (PowerShell), run from the repository root:

Set-ExecutionPolicy -Scope CurrentUser RemoteSigned   # one-time
.\setup.ps1
.\scripts\download_model.ps1
.\scripts\serve.ps1

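From Python, using the pipeline builder in the demo repository:

# build_pipeline is provided by the demo repository's backend_gpu package.
from backend_gpu.server import build_pipeline

# Load the Gemlite 2-bit packed ternary checkpoint.
pipe = build_pipeline(model_id="prism-ml/bonsai-image-ternary-4B-gemlite-2bit")

# 4 steps with guidance_scale=1.0 match the sampler settings described earlier;
# no negative prompt is passed.
image = pipe(
    prompt="A bonsai tree in a quiet ceramic studio, soft morning light",
    num_inference_steps=4,
    guidance_scale=1.0,
    height=1024,
    width=1024,
).images[0]
image.save("bonsai.png")  # write the generated image to disk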

