A 1.21 GB Diffusion Transformer with Ternary Weights
Bonsai Image Ternary 4B is a compact text-to-image diffusion model built on the FLUX.2 Klein 4B architecture. Its core innovation is ternary weights (each weight takes a value in {−1, 0, +1}), which shrink the transformer from 7.75 GB to 1.21 GB (6.4× smaller). Compared with the binary variant, the ternary design adds a zero state that improves visual fidelity and prompt coherence: in the benchmark table below, the ternary model outscores the binary one on GenEval, HPSv3, and DPG-Bench.
The model runs natively on Linux and Windows through CUDA and the Gemlite low-bit kernels, generating a 1024×1024 image in 4.5 s on an RTX 3080 and 2.8 s on an A100. Sampling uses a 4-step FlowMatch-Euler schedule at guidance = 1.0, so neither negative prompts nor classifier-free guidance (CFG) are needed. A companion Apple Silicon variant is also available.
Ternary Weight Representation and Storage
Each ternary weight is defined as:
w_i = scale_g * t_i, t_i ∈ {−1, 0, +1}
One shared FP16 scale is stored per group of 128 weights. A ternary value carries log₂(3) ≈ 1.585 bits of information; adding the amortized scale overhead (16 bits / 128 weights = 0.125 bits) gives an effective width of about 1.71 bits/weight, an idealized 9.4× reduction relative to 16-bit weights. All 100 matmul-heavy linears (Q/K/V projections, MLP weights, double-stream add-K/Q/V linears) are ternary; precision-sensitive supporting tensors remain FP16.
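The per-group reconstruction and the bit budget above can be checked with a short sketch. It uses only the definitions in this section (group size 128, one FP16 scale per group); the function names and the toy scales are hypothetical and are not taken from the model's code.

```python
import math

GROUP_SIZE = 128   # weights per shared scale (from the text)
SCALE_BITS = 16    # one FP16 scale per group (from the text)

def dequantize(ternary, scales, group_size=GROUP_SIZE):
    """Reconstruct w_i = scale_g * t_i, where g = i // group_size."""
    if any(t not in (-1, 0, 1) for t in ternary):
        raise ValueError("ternary codes must be -1, 0, or +1")
    return [scales[i // group_size] * t for i, t in enumerate(ternary)]

def effective_bits(bits_per_code, group_size=GROUP_SIZE, scale_bits=SCALE_BITS):
    """Bits per weight once the shared scale is amortized over its group."""
    return bits_per_code + scale_bits / group_size

ideal_bits = effective_bits(math.log2(3))   # information-theoretic ternary cost
ideal_ratio = 16 / ideal_bits               # versus 16-bit weights

# Toy example: two groups of 128 codes with hypothetical scales.
codes = [1, 0, -1, 0] * 64                  # 256 ternary codes
weights = dequantize(codes, scales=[0.5, 0.25])

print(f"{ideal_bits:.2f} bits/weight, {ideal_ratio:.1f}x vs FP16")
```

The 1.585-bit figure is an information-theoretic bound; a practical format has to round each code up to a storable width, which is why deployed files are larger than the idealized ratio suggests.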
| Format | Transformer size | Reduction | Ratio |
|---|---|---|---|
| FP16 FLUX.2 Klein 4B | 7.75 GB | — | 1.0× |
| Ternary Bonsai 4B | 1.21 GB | 84.4% | 6.4× |
The CUDA deployment uses an INT2 pack (2 bits per ternary), resulting in a 1.54 GB on‑disk representation. The zero state is the quality lever that keeps visual performance close to the full FP16 model.
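A 2-bit pack stores four ternary values per byte. The sketch below illustrates the idea only: the code-to-bit mapping (−1 → 0, 0 → 1, +1 → 2) and the little-endian slot order are assumptions for illustration, not the actual Gemlite layout.

```python
def pack_ternary(codes):
    """Pack ternary codes (-1, 0, +1) four per byte, two bits each."""
    if len(codes) % 4:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, t in enumerate(codes[i:i + 4]):
            if t not in (-1, 0, 1):
                raise ValueError("ternary codes must be -1, 0, or +1")
            byte |= (t + 1) << (2 * j)   # assumed encoding: -1->0, 0->1, +1->2
        out.append(byte)
    return bytes(out)

def unpack_ternary(packed):
    """Inverse of pack_ternary: recover the ternary codes from packed bytes."""
    return [((byte >> (2 * j)) & 0b11) - 1 for byte in packed for j in range(4)]

codes = [-1, 0, 1, 0, 1, 1, -1, -1]
packed = pack_ternary(codes)
restored = unpack_ternary(packed)

# Storage cost per weight with one FP16 scale per 128 weights.
packed_bits = 2 + 16 / 128   # 2.125 bits/weight
```

Under this assumed layout, one of the four 2-bit patterns is never used, so the packed cost of 2.125 bits/weight sits above the idealized 1.71 bits/weight.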
Deployment Architecture and Runtime Efficiency
The model backbone is FLUX.2 Klein 4B, a 25‑block MMDiT diffusion transformer (5 double‑stream + 20 single‑stream). The sampler is FlowMatchEuler‑discrete with 4 steps, guidance = 1.0, and shift = 3.0.
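The sampler settings above can be sketched in plain Python. This is a minimal illustration that assumes a linear noise-level grid and the time-shift formula commonly used by flow-matching Euler schedulers; the exact grid used by the Bonsai pipeline is not stated here. The velocity function is a toy stand-in for the transformer, and all names are hypothetical.

```python
def shifted_sigmas(num_steps=4, shift=3.0):
    """Noise levels from 1 down to 0 with the time shift applied."""
    raw = [1.0 - k / num_steps for k in range(num_steps)]   # assumed linear grid
    shifted = [shift * s / (1.0 + (shift - 1.0) * s) for s in raw]
    return shifted + [0.0]                                  # finish at the clean sample

def euler_sample(velocity_fn, x, sigmas):
    """Integrate dx/dsigma = v(x, sigma) with explicit Euler steps."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x

sigmas = shifted_sigmas()

# Toy velocity: the exact straight-line direction from the target to the noise.
noise, target = [0.8, -1.2, 0.3], [0.1, 0.5, -0.4]
sample = euler_sample(lambda x, s: [n - t for n, t in zip(noise, target)], noise, sigmas)
```

With shift = 3.0 the schedule spends its early steps at high noise levels (1.0, 0.9, 0.75, 0.5, then 0.0). Because guidance = 1.0 needs only one model evaluation per step, the loop calls the velocity function once per step.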
At inference, the Qwen3‑4B text encoder is compressed to 4‑bit HQQ and offloaded immediately after prompt encoding. The denoising loop therefore only carries the compact ternary transformer and an FP16 VAE with tiled 128 px decode. Total CUDA payload is 4.55 GB:
| Component | Size |
|---|---|
| Gemlite INT2 transformer | 1.54 GB |
| HQQ 4‑bit text encoder | 2.84 GB |
| FP16 VAE | 0.17 GB |
Peak GPU memory at 1024² on an RTX 3080 is ~6.8 GiB end-to-end. The stack works natively on Linux x86_64 and Windows via the same CUDA/Gemlite kernels.
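The memory figures in this section can be cross-checked with a few lines of arithmetic. The sketch assumes the component table uses decimal GB and that only the text encoder leaves the GPU before denoising; variable names are illustrative.

```python
GB, GIB = 10**9, 2**30

components_gb = {
    "Gemlite INT2 transformer": 1.54,
    "HQQ 4-bit text encoder": 2.84,
    "FP16 VAE": 0.17,
}

total_gb = sum(components_gb.values())

# The text encoder is offloaded after prompt encoding, so it is not
# resident while the denoising loop runs.
denoise_resident_gb = total_gb - components_gb["HQQ 4-bit text encoder"]

# Reported end-to-end peak, converted from GiB to decimal GB.
peak_gb = 6.8 * GIB / GB

print(f"total weights:             {total_gb:.2f} GB")
print(f"resident during denoising: {denoise_resident_gb:.2f} GB")
print(f"reported peak:             {peak_gb:.2f} GB")
```

The reported peak is larger than the summed weights, since it also covers activations and other runtime allocations that the component table does not list.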
Throughput and Benchmark Performance
Throughput (seconds per image, 4 denoising steps)
| Platform | 512² (s) | 1024² (s) |
|---|---|---|
| A100 (Colab) | 1.1 | 2.8 |
| RTX PRO 6000 Blackwell (Colab) | 1.0 | 2.1 |
| RTX 3080 10 GB | 1.4 | 4.5 |
| RTX 3060 6 GB (laptop) | 3.3 | 17.5 |
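The timings above convert directly into images per minute and into a resolution scaling factor. The sketch assumes images are generated one at a time with no batching; the numbers are copied from the table.

```python
latency_s = {  # platform: (512x512, 1024x1024) seconds per image, from the table
    "A100 (Colab)": (1.1, 2.8),
    "RTX PRO 6000 Blackwell (Colab)": (1.0, 2.1),
    "RTX 3080 10 GB": (1.4, 4.5),
    "RTX 3060 6 GB (laptop)": (3.3, 17.5),
}

def images_per_minute(seconds_per_image):
    """Sequential throughput implied by a per-image latency."""
    return 60.0 / seconds_per_image

scaling = {}
for name, (t512, t1024) in latency_s.items():
    scaling[name] = t1024 / t512   # 1024x1024 has 4x the pixels of 512x512
    print(f"{name}: {images_per_minute(t1024):.1f} img/min at 1024x1024, "
          f"{scaling[name]:.1f}x slower than at 512x512")
```

The scaling factor ranges from about 2.1× on the RTX PRO 6000 to about 5.3× on the 6 GB laptop GPU, so the cost of moving to 1024² depends heavily on the card.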
Benchmarks (higher is better on every metric). Comparison models were evaluated under matched settings; smaller backbones were tested at 512×512 where noted.
| Model | Transf. (GB) | GenEval | HPSv3 | DPG-Bench |
|---|---|---|---|---|
| Bonsai Ternary 4B | 1.21 | 0.723 | 12.22 | 0.851 |
| Bonsai Binary 4B | 0.93 | 0.671 | 11.15 | 0.822 |
| FLUX.2 Klein 4B (FP16) | 7.75 | 0.819 | 12.84 | 0.853 |
| FLUX.1-schnell | 23.8 | 0.716 | 12.67 | 0.848 |
| SDXL | 5.14 | 0.300 | 10.05 | 0.740 |
| PixArt-Σ XL 2 | 1.20 | 0.541 | 11.93 | 0.769 |
| Stable Diffusion 1.5 | 1.72 | 0.396 | 4.20 | 0.601 |
| BK-SDM-Small | 0.98 | 0.297 | 3.05 | 0.559 |
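Bonsai Image Ternary 4B stays close to the FP16 FLUX.2 Klein 4B on DPG-Bench (0.851 vs. 0.853) and HPSv3 (12.22 vs. 12.84), with a larger gap on GenEval (0.723 vs. 0.819), while reducing the transformer footprint by 6.4×. This effectively moves the quality-footprint frontier.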
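How close the compressed models sit to the FP16 baseline can be expressed as a retention ratio per metric. The sketch below uses only the table values; treating a simple score ratio as "retention" is a rough proxy and an assumption of this example, not a metric defined by the benchmarks.

```python
scores = {  # model: (transformer GB, GenEval, HPSv3, DPG-Bench), from the table
    "Bonsai Ternary 4B": (1.21, 0.723, 12.22, 0.851),
    "Bonsai Binary 4B": (0.93, 0.671, 11.15, 0.822),
    "FLUX.2 Klein 4B (FP16)": (7.75, 0.819, 12.84, 0.853),
}
METRICS = ("GenEval", "HPSv3", "DPG-Bench")
BASELINE = "FLUX.2 Klein 4B (FP16)"

def retention(model, baseline=BASELINE):
    """Each score as a fraction of the baseline model's score."""
    pairs = zip(METRICS, scores[model][1:], scores[baseline][1:])
    return {metric: value / ref for metric, value, ref in pairs}

ternary = retention("Bonsai Ternary 4B")
binary = retention("Bonsai Binary 4B")
size_ratio = scores[BASELINE][0] / scores["Bonsai Ternary 4B"][0]

for metric in METRICS:
    print(f"{metric}: ternary {ternary[metric]:.1%}, binary {binary[metric]:.1%}")
print(f"transformer size ratio: {size_ratio:.1f}x")
```

On these numbers the ternary model retains roughly 88% of the baseline GenEval score, 95% of HPSv3, and nearly all of DPG-Bench, and it retains more than the binary variant on each metric.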
Use Cases and Limitations
Use Cases
- Local creative tooling on CUDA‑equipped consumer GPUs.
- Private generation with data residency for compliance‑sensitive workflows.
- Rapid iteration thanks to low latency and no remote queue.
- Commodity‑GPU serving with reduced memory pressure.
- Native deployment on Windows and Linux.
Limitations
- Not bit‑identical to the FP16 FLUX.2 Klein 4B; quality depends on prompt and detail complexity.
- Ternary execution relies on Gemlite low‑bit GEMM kernels, as standard hardware paths are not yet fully ternary‑native.
- With the transformer compressed, the text encoder and VAE account for a larger share of memory and can become the visible bottleneck; text-encoder offloading and tiled VAE decode mitigate this.
```bash
git clone https://github.com/PrismML-Eng/Bonsai-Image-Demo.git
cd Bonsai-Image-Demo
./setup.sh
./scripts/download_model.sh   # ternary is the default
./scripts/serve.sh
```

```powershell
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned   # one-time
.\setup.ps1
.\scripts\download_model.ps1
.\scripts\serve.ps1
```

```python
from backend_gpu.server import build_pipeline

pipe = build_pipeline(model_id="prism-ml/bonsai-image-ternary-4B-gemlite-2bit")
image = pipe(
    prompt="A bonsai tree in a quiet ceramic studio, soft morning light",
    num_inference_steps=4,
    guidance_scale=1.0,
    height=1024,
    width=1024,
).images[0]
image.save("bonsai.png")
```



