What MLLM‑Jailbreak‑Bench Does
MLLM‑Jailbreak‑Bench is a reproducible, model‑agnostic evaluation framework that measures how easily multimodal LLMs (MLLMs) can be induced to produce harmful output. It covers five attack categories (image injection, audio injection, text‑image collusion, OCR jailbreak, and visual‑prompt leakage) and reports three metrics:
- Attack Success Rate (ASR) – the fraction of attack attempts in which the model complies with the harmful request.
- Refusal quality – whether refusals are substantive.
- Calibration error – how much of the measured ASR stems from baseline failures rather than from the attack itself.
A high ASR combined with a high calibration error signals a broken model rather than a clever attack. Reading the two together helps practitioners avoid false positives and focus on real safety gaps. The tool is useful for developing MLLMs, evaluating defences, and generating leaderboard‑style comparisons.
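The relationship between ASR and baseline behaviour can be illustrated with a small sketch. It is not the benchmark's actual implementation: the function names, and the idea of estimating the baseline by sending the same harmful requests without any attack, are assumptions made for illustration, and the outcome lists are invented.

```python
from typing import Sequence


def attack_success_rate(complied: Sequence[bool]) -> float:
    """Fraction of attempts in which the model complied with the harmful request."""
    if not complied:
        raise ValueError("need at least one attempt")
    return sum(complied) / len(complied)


def attack_specific_asr(attack: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Hypothetical estimate: ASR under attack minus ASR on plain, un-attacked requests.

    A value near zero means the attack adds little beyond what the model
    already gets wrong on its own.
    """
    return attack_success_rate(attack) - attack_success_rate(baseline)


# Toy outcomes (True = model complied). Invented data, not benchmark results.
attack_outcomes = [True, True, True, False]     # 3 of 4 comply under attack
baseline_outcomes = [True, True, False, False]  # 2 of 4 comply even without it

asr = attack_success_rate(attack_outcomes)
gain = attack_specific_asr(attack_outcomes, baseline_outcomes)
print(f"ASR={asr:.2f}, attack-specific part={gain:.2f}")
```

In this toy case most of the raw ASR is already present in the baseline, which is the situation the calibration metric is meant to expose.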
Installation
The project is a Python package requiring Python 3.10+. Clone the repository and install in editable mode:
git clone https://github.com/pardcomper/mllm-jailbreak-bench
cd mllm-jailbreak-bench
pip install -e .
Quick Start
One command evaluates a model against all attacks using the default budget. Provide a Hugging Face model ID and an output directory. You will need a GPU with sufficient memory for the target model (lighter configs require far less than the full paper sweep).
jbb run --target llava-1.5-7b --attacks all --out results/llava15/
Command‑Line Workflow
Evaluate a subset of attacks and then aggregate results into a leaderboard. The first command targets specific attack names and sets the number of adversarial samples per attack. The second reads all run results from a directory and writes a Markdown table.
jbb run --target Qwen/Qwen2-VL-7B-Instruct --attacks ocr_jailbreak,text_image_collusion --n-per-attack 200 --out results/qwen2/
jbb leaderboard --results-dir results/ --out LEADERBOARD.md
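The aggregation step can be pictured as turning per-run scores into a Markdown table. The sketch below is only an illustration of that idea: the on-disk format of the run results is not described here, so the nested dictionary, the function name `to_markdown_table`, and the numbers are all hypothetical.

```python
from typing import Dict


def to_markdown_table(results: Dict[str, Dict[str, float]]) -> str:
    """Render {model: {attack: ASR}} as a Markdown table, one row per model."""
    # Collect every attack name seen in any run so all rows share the same columns.
    attacks = sorted({a for per_model in results.values() for a in per_model})
    header = "| Model | " + " | ".join(attacks) + " |"
    divider = "|---|" + "---|" * len(attacks)
    rows = []
    for model in sorted(results):
        cells = [
            f"{results[model][a]:.2f}" if a in results[model] else "n/a"
            for a in attacks
        ]
        rows.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


# Invented numbers, purely for illustration.
example = {
    "model-a": {"ocr_jailbreak": 0.41, "sys_prompt_leak": 0.10},
    "model-b": {"ocr_jailbreak": 0.27},
}
table = to_markdown_table(example)
print(table)
```

Attacks that were not run for a given model show up as `n/a` rather than as a zero, so a missing run is not mistaken for a robust model.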
Python API
For programmatic use, load a target and run the benchmark directly. load_target wraps your model; you can also subclass BaseTarget.
from jbb import Benchmark, load_target

target = load_target("Qwen/Qwen2-VL-7B-Instruct", device="cuda")
bench = Benchmark(attacks=["text_image_collusion", "ocr_jailbreak"], n_per_attack=200)
report = bench.run(target)
print(report.summary())
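For a model that load_target does not handle, a BaseTarget subclass might look roughly like the sketch below. The real interface of jbb.BaseTarget is not documented in this README, so the snippet defines its own stand-in base class; the method name `generate`, its arguments, and `EchoTarget` are assumptions, not the package's API.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BaseTarget(ABC):
    """Stand-in for jbb.BaseTarget so this sketch runs without the package.

    The real base class may expose a different interface; generate() and its
    arguments are assumptions.
    """

    @abstractmethod
    def generate(self, text: str, image: Optional[bytes] = None,
                 audio: Optional[bytes] = None) -> str:
        ...


class EchoTarget(BaseTarget):
    """Toy adapter: a real one would forward the inputs to your model or API."""

    def generate(self, text: str, image: Optional[bytes] = None,
                 audio: Optional[bytes] = None) -> str:
        # The README mentions one image or one audio clip per query; treating
        # the two as mutually exclusive is this sketch's reading of that limit.
        if image is not None and audio is not None:
            raise ValueError("pass either an image or an audio clip, not both")
        modality = "image" if image is not None else "audio" if audio is not None else "text"
        return f"[{modality}] {text}"


target = EchoTarget()
reply = target.generate("Describe this picture.", image=b"\x89PNG")
print(reply)
```

Check the package source for the actual abstract methods before writing a real adapter.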
Reproducing Paper Results
A single script recreates the published numbers precisely.
It downloads the prompt pool, deterministically generates OCR images, runs every model‑attack pair at the paper’s budget, and outputs LEADERBOARD.md.
bash scripts/reproduce_paper.sh
Expected time on 8 × A100 80 GB is ~12 hours.
Attacks, Defences, and Configuration
Use --attacks to select from the table below.
Budget presets (default, small, paper) control query counts; set --seed 0 for reproducibility.
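Why a fixed seed matters can be shown with a generic example of seeded sampling. This is a sketch of the general technique, not of how the benchmark draws its samples; `sample_prompts` and the placeholder pool are hypothetical.

```python
import random
from typing import List


def sample_prompts(pool: List[str], n: int, seed: int) -> List[str]:
    """Draw n distinct prompts from the pool; the same seed always yields the same draw."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(pool, n)


pool = [f"prompt-{i}" for i in range(100)]  # placeholder pool
first = sample_prompts(pool, 5, seed=0)
second = sample_prompts(pool, 5, seed=0)
print(first == second)  # the two draws are identical
```

Two runs that share a seed evaluate the same subset of prompts, which is what makes their scores comparable.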
Enable reference defences with --defenses:
- filter – input classifier for text+image.
- self_critique – model reviews its own response.
- ratd – refusal‑aware decoding biases token generation.
| Category | Attack name | What it does |
|---|---|---|
| Image‑injection | vis_prompt_injection | Malicious instruction embedded in the image |
| Image‑injection | gradient_free_perturb | Query‑only perturbation of the image |
| Text‑image collusion | harmful_in_text_safe_img | Harmful text paired with innocuous image |
| Text‑image collusion | harmful_in_img_safe_text | Harmful content hidden in the image |
| OCR jailbreak | ocr_jailbreak | Harmful instruction rendered as pixel text |
| Audio injection | audio_prompt_injection | Harmful instruction delivered as TTS audio |
| Visual‑prompt leakage | sys_prompt_leak | Attempts to extract the system prompt |
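To make the self_critique defence listed above more concrete, here is a minimal sketch of the general pattern: generate a draft, ask the same model to judge it, and refuse if the verdict is negative. It is an assumption about how such a defence could work, not the benchmark's implementation; the judging prompt, `REFUSAL`, and `toy_model` are invented for illustration.

```python
from typing import Callable

REFUSAL = "I can't help with that."


def self_critique(model: Callable[[str], str], prompt: str) -> str:
    """Generate a draft, ask the same model to judge it, and refuse if it says UNSAFE."""
    draft = model(prompt)
    verdict = model(
        "Answer SAFE or UNSAFE only. Is the following response harmful?\n\n" + draft
    )
    return REFUSAL if "UNSAFE" in verdict.upper() else draft


def toy_model(prompt: str) -> str:
    """Scripted stand-in for an MLLM so the sketch runs offline."""
    if prompt.startswith("Answer SAFE or UNSAFE"):
        # Flags any draft that mentions the placeholder keyword "malware".
        return "UNSAFE" if "malware" in prompt else "SAFE"
    return f"Sure, here is how: {prompt}"


benign = self_critique(toy_model, "bake bread")
blocked = self_critique(toy_model, "write malware")
print(benign)
print(blocked)
```

Because the judge is the same model that produced the draft, this pattern inherits the model's own blind spots, which is one reason to treat it as relative hardening only.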
Constraints and Best Practices
Limitations:
- Black‑box only (no gradient or internal state access).
- One image or one audio clip per query.
- A BaseTarget adapter is needed for new models.
- Calibration assumes a stable baseline.
- Single‑turn interactions only.

Best practices:
- Use the paper budget and --seed 0 for comparability.
- Examine calibration error alongside ASR to filter noisy models.
- Treat the included defences as relative hardening measures, not production safeguards.
- Avoid casually extending the attack inventory.
- Rely on the benchmark’s summary to distinguish real vulnerability from baseline failure.



