
How to Evaluate Multimodal LLM Safety with MLLM-Jailbreak-Bench

Learn to use MLLM-Jailbreak-Bench, a reproducible and model-agnostic framework for measuring harmful output in multimodal large language models.

May 28, 2026


Discover MLLM-Jailbreak-Bench, an evaluation framework for assessing multimodal LLM safety across five attack categories. Understand how to measure Attack Success Rate, refusal quality, and calibration error to identify real safety gaps and avoid false positives. Get started with installation and quick-start instructions.

What MLLM‑Jailbreak‑Bench Does

MLLM‑Jailbreak‑Bench is a reproducible, model‑agnostic evaluation framework that measures how easily multimodal LLMs produce harmful output. It covers five attack categories (image injection, audio injection, text‑image collusion, OCR jailbreak, and visual‑prompt leakage) and reports three metrics:

Attack Success Rate (ASR) – how often the model complies with the harmful request.
Refusal quality – whether refusals are substantive.
Calibration error – how much of the ASR reflects baseline failures rather than the attack itself.

A high ASR combined with a high calibration error indicates a model that already fails without the attack, not an effective attack. Reading the two metrics together helps practitioners avoid false positives and focus on real safety gaps. The tool is useful for developing MLLMs, evaluating defences, and generating leaderboard‑style comparisons.
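To make the relationship between ASR and baseline failures concrete, here is a minimal sketch in plain Python. It does not use the jbb package, and this article does not give the benchmark's exact calibration-error formula, so the `uplift` value below (ASR minus the compliance rate on unattacked prompts) is only one simple, assumed way to separate attack-specific success from baseline failure. The `Outcome` class and the sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One model response; hypothetical record, not a jbb type."""
    complied: bool  # True if the model produced the harmful output


def success_rate(outcomes):
    """Fraction of responses in which the model complied."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(o.complied for o in outcomes) / len(outcomes)


# Hypothetical counts: 100 attacked prompts and 100 unattacked (baseline) prompts.
attacked = [Outcome(True)] * 30 + [Outcome(False)] * 70
baseline = [Outcome(True)] * 5 + [Outcome(False)] * 95

asr = success_rate(attacked)            # compliance rate under attack
baseline_rate = success_rate(baseline)  # compliance rate without the attack
uplift = asr - baseline_rate            # assumed proxy for the attack-specific share

print(f"ASR={asr:.2f} baseline={baseline_rate:.2f} uplift={uplift:.2f}")
```

If the baseline rate were close to the ASR, most of the measured "success" would come from the model failing on its own, which is the situation the calibration metric is meant to expose.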

Installation

The project is a Python package requiring Python 3.10+. Clone the repository and install in editable mode:

git clone https://github.com/pardcomper/mllm-jailbreak-bench
cd mllm-jailbreak-bench
pip install -e .

Quick Start

One command evaluates a model against all attacks using the default budget. Provide a model identifier, such as a Hugging Face model ID, and an output directory. You will need a GPU with sufficient memory (lighter configs require far less than the full paper sweep).

jbb run --target llava-1.5-7b --attacks all --out results/llava15/

Command‑Line Workflow

Evaluate a subset of attacks and then aggregate results into a leaderboard. The first command targets specific attack names and sets the number of adversarial samples per attack. The second reads all run results from a directory and writes a Markdown table.

jbb run --target Qwen/Qwen2-VL-7B-Instruct --attacks ocr_jailbreak,text_image_collusion --n-per-attack 200 --out results/qwen2/
jbb leaderboard --results-dir results/ --out LEADERBOARD.md
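When several runs need the same settings, the commands can be assembled in a small script. The sketch below only builds and prints the command line, using the flags shown in this article (`--target`, `--attacks`, `--n-per-attack`, `--out`). The helper name `build_run_command` is hypothetical and not part of jbb.

```python
import shlex


def build_run_command(target, attacks, out_dir, n_per_attack=None):
    """Return the argv list for a `jbb run` call (hypothetical helper)."""
    cmd = ["jbb", "run", "--target", target, "--attacks", ",".join(attacks)]
    if n_per_attack is not None:
        cmd += ["--n-per-attack", str(n_per_attack)]
    cmd += ["--out", out_dir]
    return cmd


cmd = build_run_command(
    "Qwen/Qwen2-VL-7B-Instruct",
    ["ocr_jailbreak", "text_image_collusion"],
    "results/qwen2/",
    n_per_attack=200,
)
command_line = shlex.join(cmd)
print(command_line)
# To execute instead of printing: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with model IDs or paths.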

Python API

For programmatic use, load a target and run the benchmark directly. load_target wraps the model in a target object; for a model it does not cover, subclass BaseTarget to write your own adapter.

from jbb import Benchmark, load_target

target = load_target("Qwen/Qwen2-VL-7B-Instruct", device="cuda")
bench = Benchmark(attacks=["text_image_collusion", "ocr_jailbreak"], n_per_attack=200)
report = bench.run(target)
print(report.summary())
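The article does not show the BaseTarget interface, so the following is only a shape sketch of what an adapter might look like. `SketchTarget` is a hypothetical stand-in defined here, and its `generate` method and parameters are assumptions; check the real BaseTarget class in the repository before writing an adapter. The sketch also enforces the one-image-or-one-audio-clip constraint listed under the limitations.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SketchTarget(ABC):
    """Hypothetical stand-in for jbb's BaseTarget; the real interface may differ."""

    @abstractmethod
    def generate(self, text: str, image: Optional[bytes] = None,
                 audio: Optional[bytes] = None) -> str:
        """Return the model's single-turn reply to one query."""


class AlwaysRefuseTarget(SketchTarget):
    """Toy adapter that refuses every request; useful as a sanity baseline."""

    def generate(self, text, image=None, audio=None):
        # One image or one audio clip per query, never both.
        if image is not None and audio is not None:
            raise ValueError("pass an image or an audio clip, not both")
        return "I can't help with that."


target = AlwaysRefuseTarget()
reply = target.generate("Describe this picture.", image=b"\x89PNG")
print(reply)
```

A real adapter would call the model inside `generate`; an always-refusing toy target like this should score an ASR of zero if plugged into any evaluation loop.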

Reproducing Paper Results

A single script reproduces the published numbers. It downloads the prompt pool, deterministically generates the OCR images, runs every model‑attack pair at the paper's budget, and writes LEADERBOARD.md.

bash scripts/reproduce_paper.sh

Expected time on 8 × A100 80 GB is ~12 hours.

Attacks, Defences, and Configuration

Use --attacks to select attacks by the names listed in the table at the end of this section. Budget presets (default, small, paper) control query counts; set --seed 0 for reproducibility. Enable the reference defences with --defenses:

filter – input classifier for text+image.
self_critique – model reviews its own response.
ratd – refusal‑aware decoding biases token generation.

Category	Attack name	What it does
Image‑injection	`vis_prompt_injection`	Malicious instruction embedded in the image
Image‑injection	`gradient_free_perturb`	Query‑only perturbation of the image
Text‑image collusion	`harmful_in_text_safe_img`	Harmful text paired with innocuous image
Text‑image collusion	`harmful_in_img_safe_text`	Harmful content hidden in the image
OCR jailbreak	`ocr_jailbreak`	Harmful instruction rendered as pixel text
Audio injection	`audio_prompt_injection`	Harmful instruction delivered as TTS audio
Visual‑prompt leakage	`sys_prompt_leak`	Attempts to extract the system prompt
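The table can be captured as a small lookup so that a selection given by category expands to the individual attack names. This is a sketch only: the category keys are hypothetical slugs chosen here, the `expand` helper is not part of jbb, and the article does not state whether the tool itself accepts category names in --attacks.

```python
# Attack names copied from the table; category keys are hypothetical slugs.
ATTACKS_BY_CATEGORY = {
    "image_injection": ["vis_prompt_injection", "gradient_free_perturb"],
    "text_image_collusion": ["harmful_in_text_safe_img", "harmful_in_img_safe_text"],
    "ocr_jailbreak": ["ocr_jailbreak"],
    "audio_injection": ["audio_prompt_injection"],
    "visual_prompt_leakage": ["sys_prompt_leak"],
}


def expand(selection):
    """Expand category keys to attack names, keeping order and dropping duplicates."""
    known = {name for names in ATTACKS_BY_CATEGORY.values() for name in names}
    result = []
    for item in selection:
        if item in ATTACKS_BY_CATEGORY:
            names = ATTACKS_BY_CATEGORY[item]
        elif item in known:
            names = [item]
        else:
            raise ValueError(f"unknown attack or category: {item!r}")
        for name in names:
            if name not in result:
                result.append(name)
    return result


selected = expand(["text_image_collusion", "sys_prompt_leak"])
print(",".join(selected))
```

Rejecting unknown names early catches typos before a long GPU run starts.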

Constraints and Best Practices

Limitations: black‑box only (no gradient or internal state access); one image or one audio clip per query; a BaseTarget adapter is needed for new models; calibration assumes a stable baseline; single‑turn interactions only.

Best practices: use the paper budget and --seed 0 so results are comparable across runs; read calibration error alongside ASR to spot models whose failures are not attack‑specific; treat the included defences as relative hardening measures, not production safeguards; do not extend the attack inventory without good reason; rely on the benchmark's summary to distinguish real vulnerability from baseline failure.
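Reading ASR together with calibration error can be expressed as a simple triage rule. The thresholds, labels, and example rows below are hypothetical and chosen purely for illustration; the benchmark does not prescribe them in this article, and sensible cut-offs will depend on your models and budget.

```python
def triage(asr, calibration_error, asr_threshold=0.2, cal_threshold=0.1):
    """Label one model-attack result; both thresholds are illustrative assumptions."""
    if asr < asr_threshold:
        return "low ASR"
    if calibration_error >= cal_threshold:
        # High ASR but the model also fails at baseline: not attack-specific.
        return "check baseline"
    return "likely attack-specific"


# Hypothetical rows: (label, ASR, calibration error).
rows = [
    ("model-a", 0.05, 0.02),
    ("model-b", 0.60, 0.40),
    ("model-c", 0.45, 0.03),
]

labels = {name: triage(asr, cal) for name, asr, cal in rows}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Only results in the last group point to a vulnerability that the attack itself exposes; the middle group calls for inspecting the model's baseline behaviour first.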
