can-i-finetune-this — Practical Guide
What it does
This tool answers the question “Can I fine‑tune this LLM on my consumer GPU?” before you download the weights or lose time to an out-of-memory (OOM) error. It provides:
- Memory estimation – models the VRAM consumption for a given model, method (LoRA/QLoRA), sequence length, batch size, and LoRA rank, including weights, gradients, optimizer states, activations, and a fragmentation safety margin.
- Feasibility check – tells you whether a config fits on your GPU, with a confidence level.
- Recommendation – searches for a feasible configuration (e.g., recommends a lower rank or shorter sequence length).
- Benchmarking – runs a real mini‑training step on a tiny model (sshleifer/tiny-gpt2, ~5 MB) to measure actual peak VRAM on your machine.
- Calibration – uses benchmark results to correct the static estimates, making them more accurate for your specific GPU/driver/software stack.
- Recipe generation – produces a ready‑to‑run Hugging Face + PEFT + TRL training script tailored to your chosen config.
- Reporting & comparison – generates Markdown reports of benchmark results and compares multiple runs.
The core insight is that simple static estimates (like accelerate estimate-memory) only cover model loading, not training overhead. This tool adds a detailed memory model and a real measurement feedback loop.
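To make the component list above concrete, here is a rough back-of-the-envelope sketch of a static LoRA/QLoRA estimate. It is not the tool's actual memory model: the bytes-per-parameter figures, the fixed activation number, the 20% margin, and the names `rough_estimate` and `StaticEstimate` are all illustrative assumptions.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class StaticEstimate:
    weights_gib: float
    adapter_gib: float
    gradients_gib: float
    optimizer_gib: float
    activations_gib: float
    margin_gib: float

    @property
    def total_gib(self) -> float:
        return (self.weights_gib + self.adapter_gib + self.gradients_gib
                + self.optimizer_gib + self.activations_gib + self.margin_gib)


def rough_estimate(base_params, lora_params, method, activations_gib,
                   margin_fraction=0.20):
    # Assumption: frozen base weights cost 2 bytes/param (bf16) for LoRA
    # and roughly 0.5 bytes/param (4-bit quantized) for QLoRA.
    bytes_per_base_param = {"lora": 2.0, "qlora": 0.5}[method]
    weights = base_params * bytes_per_base_param / GIB
    # Assumption: adapter weights and their gradients are stored in fp32.
    adapter = lora_params * 4 / GIB
    gradients = lora_params * 4 / GIB
    # AdamW keeps two moment tensors per trainable parameter (fp32 assumed).
    optimizer = lora_params * 8 / GIB
    subtotal = weights + adapter + gradients + optimizer + activations_gib
    # Assumption: fragmentation/safety margin as a flat fraction of the subtotal.
    margin = subtotal * margin_fraction
    return StaticEstimate(weights, adapter, gradients, optimizer,
                          activations_gib, margin)


# Hypothetical inputs: 1.5B base params, 18M adapter params, 1.5 GiB activations.
est = rough_estimate(base_params=1.5e9, lora_params=18e6,
                     method="qlora", activations_gib=1.5)
print(f"total = {est.total_gib:.2f} GiB, fits in 16 GiB: {est.total_gib <= 16}")
```

The activation term is passed in as a plain number here because it is the hardest component to predict; that is exactly the part a real measurement is meant to correct.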
How to get started
Installation
canifinetune has a core install and three optional extras:
| Layer | Command | What you get |
|---|---|---|
| Core (estimate, recommend, recipe, report, compare) | pip install canifinetune | CLI commands, no PyTorch required |
| Training (bench, actual fine‑tuning) | pip install canifinetune[train] | Adds torch, transformers, peft, bitsandbytes, trl, datasets |
| Reporting extras | pip install canifinetune[report] | Pandas/tabulate for prettier tables |
| Development | pip install canifinetune[dev] | pytest, ruff, mypy |
Important: PyTorch should generally be installed with the CUDA wheel that matches your driver. For example, with uv:
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
If you are using uv for the whole environment:
uv venv
uv pip install -e ".[dev,report]"
# Add training deps when you want to run benchmarks:
uv pip install -e ".[dev,train,report]"
Refer to docs/troubleshooting.md for Windows / WSL / bitsandbytes specifics.
Minimal setup
- Install the core package (or core + training if you plan to run benchmarks).
- (Optional but recommended) Run canifinetune doctor to verify your environment and GPU visibility.
- You can now use any of the CLI commands.
Practical usage
All commands are invoked through the canifinetune CLI. The README shows the following workflows:
1. Check your machine
canifinetune doctor
Prints hardware info (GPU, driver, CUDA version) and checks that required dependencies are available.
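For readers who want to see what such a check involves, the following is a minimal sketch of an environment probe. It is not the implementation of canifinetune doctor; the function name `check_environment` and the report keys are hypothetical, and the package list simply mirrors the [train] extras named in the installation table.

```python
import importlib.util


def check_environment(packages=("torch", "transformers", "peft",
                                "bitsandbytes", "trl", "datasets")):
    """Report which packages are importable and whether CUDA is visible."""
    # find_spec locates a package without importing it, so this stays cheap.
    report = {name: importlib.util.find_spec(name) is not None
              for name in packages}
    report["cuda_available"] = False
    report["gpu_name"] = None
    report["cuda_version"] = None
    if report.get("torch"):
        try:
            import torch
        except (ImportError, OSError):
            # A broken install (e.g. missing CUDA libraries) counts as absent.
            report["torch"] = False
            return report
        if torch.cuda.is_available():
            report["cuda_available"] = True
            report["gpu_name"] = torch.cuda.get_device_name(0)
            report["cuda_version"] = torch.version.cuda
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```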
2. Estimate VRAM for a specific config
canifinetune estimate \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --gpu-vram-gb 16 \
  --seq-len 2048 \
  --micro-batch-size 1 \
  --lora-rank 16
The output is a table with a feasibility verdict (YES/NO) and a memory breakdown per component (static model, quantization overhead, trainable params, gradients, optimizer states, activations, CUDA/fragmentation, safety margin, total).
3. Let the tool recommend a feasible config
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16
It searches for a configuration (rank, sequence length, method) that fits your GPU.
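One plausible way to implement such a search is a simple grid walk that tries the most demanding settings first and stops at the first one that fits. The sketch below assumes exactly that; the search order, the candidate values, and the cost function `toy_estimate_gib` (with made-up coefficients) are illustrative and are not taken from the tool.

```python
from itertools import product


def toy_estimate_gib(method, seq_len, rank):
    # Placeholder cost model with made-up coefficients, NOT the tool's estimator.
    base = 3.0 if method == "lora" else 1.0
    return base + 0.002 * seq_len + 0.01 * rank


def recommend(gpu_vram_gb, estimate=toy_estimate_gib,
              methods=("lora", "qlora"),
              seq_lens=(4096, 2048, 1024, 512),
              ranks=(32, 16, 8)):
    """Return the first config that fits, trying demanding options first."""
    for method, seq_len, rank in product(methods, seq_lens, ranks):
        needed = estimate(method, seq_len, rank)
        if needed <= gpu_vram_gb:
            return {"method": method, "seq_len": seq_len,
                    "lora_rank": rank, "estimated_gb": needed}
    return None  # nothing in the grid fits


print(recommend(gpu_vram_gb=8))
print(recommend(gpu_vram_gb=1))
```

With the toy cost model, an 8 GB budget rejects every LoRA config at sequence length 4096 and settles on LoRA at 2048 with rank 32, while a 1 GB budget yields no feasible config.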
4. Run a real benchmark (requires [train] extras)
canifinetune bench --model sshleifer/tiny-gpt2 --method lora --steps 3
Downloads a tiny model and runs a few training steps, measuring actual peak VRAM. The results are saved locally so they can later be used for calibration.
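The measurement itself can be sketched with PyTorch's peak-memory counters. This is an assumption about how one might measure peak VRAM, not a description of what canifinetune bench does internally; `measure_peak_vram_gib` and `toy_step` are hypothetical names, and the toy step stands in for a real training step.

```python
def measure_peak_vram_gib(step_fn, steps=3):
    """Run step_fn `steps` times; return peak allocated VRAM in GiB.

    Returns None when PyTorch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # start counting from a clean peak
    for _ in range(steps):
        step_fn()
    torch.cuda.synchronize()  # make sure queued GPU work has finished
    return torch.cuda.max_memory_allocated() / 1024 ** 3


def toy_step():
    # Stand-in for a training step: one forward and backward pass on the GPU.
    import torch
    x = torch.randn(64, 256, device="cuda", requires_grad=True)
    (x @ x.T).sum().backward()


print(measure_peak_vram_gib(toy_step))
```

Note that `torch.cuda.max_memory_allocated` only counts memory handed out to tensors by PyTorch's allocator. The CUDA context and allocator caching are not included, so a tool such as nvidia-smi will typically report a higher figure.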
5. Generate a ready‑to‑run training recipe
canifinetune recipe \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --seq-len 2048 \
  --output recipes/qwen2.5-1.5b-qlora-4080
Produces a folder with a training script (run_qlora.py or similar) and a configuration file, ready to execute with minimal modification.
6. Calibrate estimates using benchmark results
canifinetune calibrate --benchmarks benchmarks/results
Reads the benchmark JSON files and updates the static estimator’s internal parameters to better match your hardware.
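As an illustration of the idea, the sketch below fits a single multiplicative correction from (estimated, measured) pairs by least squares. Using one global scale factor is an assumption made for brevity; the parameters the tool actually adjusts are not described in this guide, and `fit_scale` is a hypothetical name. The sample pair reuses the static and measured figures this guide quotes for Qwen 1.5B QLoRA on an RTX 4080.

```python
def fit_scale(pairs):
    """Least-squares scale s minimizing sum((measured - s * estimated) ** 2)."""
    num = sum(est * meas for est, meas in pairs)
    den = sum(est * est for est, _ in pairs)
    if den == 0:
        raise ValueError("need at least one non-zero estimate")
    return num / den


# (static estimate in GB, measured peak in GB)
pairs = [(3.16, 7.10)]
scale = fit_scale(pairs)

new_static_estimate = 4.0  # hypothetical estimate for another config
print(f"scale = {scale:.2f}")
print(f"calibrated = {new_static_estimate * scale:.2f} GB")
```

With a single pair the fit reduces to the plain ratio measured / estimated; more benchmark runs average out noise in individual measurements.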
7. Generate a Markdown report of benchmarks
canifinetune report --benchmarks benchmarks/results --out report.md
canifinetune compare --benchmarks benchmarks/results --out compare.md
Configuration and options
All key parameters are passed as CLI flags. There is no standalone configuration file (though the recipe command generates scripts that contain the config). The main parameters are:
| Parameter | Description | Required for |
|---|---|---|
| --model | Hugging Face model ID (e.g., Qwen/Qwen2.5-1.5B-Instruct) | estimate, recommend, bench, recipe |
| --method | Fine‑tuning method: lora or qlora | estimate, bench, recipe |
| --gpu-vram-gb | Total VRAM of your GPU in GB (e.g., 12, 16, 24) | estimate, recommend |
| --seq-len | Sequence length in tokens | estimate, recommend, recipe |
| --micro-batch-size | Batch size per GPU (not gradient accumulation) | estimate, recipe |
| --lora-rank | LoRA rank (e.g., 8, 16, 32) | estimate, recommend, recipe |
| --steps | Number of training steps for benchmarking | bench |
| --output | Output directory for recipe | recipe |
| --benchmarks | Path to directory containing benchmark result JSONs | calibrate, report, compare |
| --out | Output file for reports | report, compare |
The static estimate also uses internal assumptions about target_modules (standard for the model architecture), gradient checkpointing (off by default), and optimizer (AdamW). These are listed in the assumptions block of the output.
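These assumptions matter because target_modules and the LoRA rank together determine how many parameters are trainable, which in turn drives gradient and optimizer memory. The standard LoRA count is rank × (d_in + d_out) per adapted linear layer; the sketch below applies it to a hypothetical model whose layer count and shapes are invented for illustration and do not describe any specific checkpoint.

```python
def lora_trainable_params(shapes, rank):
    """Count LoRA parameters for a list of adapted (d_in, d_out) linear layers.

    Each adapted layer gains A (rank x d_in) and B (d_out x rank).
    """
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)


# Hypothetical model: 28 layers, hidden size 1536,
# adapters on four square projections per layer.
shapes = [(1536, 1536)] * 4 * 28

for rank in (8, 16, 32):
    params = lora_trainable_params(shapes, rank)
    # Assumption: AdamW keeps two fp32 moments, so 8 bytes per trainable param.
    optimizer_mib = params * 8 / 1024 ** 2
    print(f"rank={rank:2d}  params={params:,}  AdamW states = {optimizer_mib:.1f} MiB")
```

The count is linear in the rank, so halving the rank halves the adapter, gradient, and optimizer footprint, while the frozen base weights and most of the activation memory stay the same.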
Known constraints and limitations
The README explicitly states the current scope and known gaps:
- Single consumer GPU only – no multi‑GPU, no DeepSpeed, no FSDP support (though future roadmap may add it).
- Single node – no distributed training.
- LoRA / QLoRA only – full fine‑tuning is not considered.
- Causal LM only – classification or encoder‑decoder architectures are not modeled (roadmap may extend).
- Hugging Face stack – only models/datasets/trainers from the HF ecosystem are supported.
- Static estimates have limited accuracy – activation memory in particular is hard to predict. Every estimate is tagged with a confidence level and an assumptions block. The tool encourages running bench and calibrate to ground the numbers.
- The bench command currently uses a tiny model (sshleifer/tiny-gpt2) as a proxy – this may not scale perfectly to large models, but it does capture your environment’s overheads (bitsandbytes unpacking, fragmentation, etc.).
The project’s roadmap adds further context: throughput modeling (tokens/sec), auto‑tuning of gradient accumulation steps, and a web UI are not yet implemented.
Best practices
While the README does not have a dedicated “Best practices” section, the following advice can be derived:
- Always run canifinetune doctor first to confirm your GPU and required libraries are visible.
- Use canifinetune recommend as a starting point – it will find a feasible config without manual trial and error.
- Run canifinetune bench on your actual machine, then canifinetune calibrate to improve the accuracy of future estimates. The difference between static and measured VRAM can be significant (the README shows 3.16 GB static vs 7.10 GB real for a Qwen 1.5B QLoRA at seq_len=2048 on an RTX 4080).
- Install PyTorch separately with the correct CUDA wheel – don’t rely on the [train] dependency to pick the right version; use the index URL shown in the install section.
- When using uv, install torch after the venv is created, before adding training extras, to avoid version conflicts.
- Consult docs/troubleshooting.md if you encounter issues on Windows or WSL, especially with bitsandbytes.
Notable procedures
Installing PyTorch with the correct CUDA backend
The README gives this explicit pattern (using uv as example):
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
This must be done before (or instead of) the automatic dependency resolution from [train]. The same principle applies to pip.
Upgrading from a previous version
No specific migration steps are documented. The core commands are backwards‑compatible as long as you re‑install canifinetune from PyPI. If you have locally stored benchmark results, they remain in benchmarks/results/ and can be reused.
Using the recipe output
The canifinetune recipe command generates a folder containing a training script and a config file. This script is intended to be run as‑is (with your own dataset and tokenizer) but may require minor edits to point to your data. The README does not detail the script’s internals, but it is based on the Hugging Face TRL trainer.
Collecting baselines for a new GPU
The repository’s scripts/ directory contains helper scripts for batch‑running benchmarks across many configs. The RTX 4080 baselines in docs/rtx4080_baselines.md were produced with these scripts. You can adapt them to generate your own reference table.



