Fine‑tune Qwen 2.5‑VL (or any Vision‑Language model with the same API) on image grounding tasks using GRPO (Group Relative Policy Optimization) in just a few lines of code.
- Plug‑and‑play trainer – drop in your own JSON dataset of prompts + bounding‑boxes and start training.
- Image‑aware data collator – automatically loads, preprocesses and batches images.
- Reward‑based optimisation – leverages the `trl` library’s GRPO algorithm for RL‑style fine‑tuning (see the quick‑start sketch below).
- Minimal codebase – only three Python files, easy to read and customise.
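A rough quick‑start sketch of that workflow using plain `trl` (the repository's own trainer adds the image‑aware collator and model wrapper described below; the dataset path, model id and placeholder reward here are assumptions, not the project's exact code):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward – see the rewards section below for the real logic.
def accuracy_reward_coord(completions, solution, **kwargs):
    return [0.0 for _ in completions]

# Any JSON file of prompts + bounding‑boxes works; column names are illustrative.
dataset = load_dataset("json", data_files="data/train.json", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",   # or a pre‑loaded model object
    reward_funcs=[accuracy_reward_coord],
    args=GRPOConfig(output_dir="outputs/grpo-grounding"),
    train_dataset=dataset,
)
trainer.train()
```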
- Accepts an `image_processor` and an `images_root` folder.
- Overrides `data_collator` (sketched below) to:
  - Load images with Pillow.
  - Batch‑encode them via the Hugging Face `AutoProcessor`.
  - Return a dict containing:
    - `pixel_values` – tensor (C × H × W)
    - `prompt` – instruction string
    - `solution` – ground‑truth bbox or coordinates
    - `scales` – original image size
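A minimal sketch of what that collator can look like; the `images_root` layout and the JSON field names (`image`, `prompt`, `solution`) are assumptions about the dataset, not the repository's exact code:

```python
import os
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

def data_collator(examples, images_root="data/images"):
    # Load every image in the batch from disk with Pillow.
    images = [
        Image.open(os.path.join(images_root, ex["image"])).convert("RGB")
        for ex in examples
    ]
    # In practice the prompt should carry the processor's chat template /
    # image placeholder tokens; plain strings are used here for brevity.
    prompts = [ex["prompt"] for ex in examples]

    # Batch‑encode text + images through the Hugging Face AutoProcessor.
    encoded = processor(text=prompts, images=images, padding=True, return_tensors="pt")

    return {
        "pixel_values": encoded["pixel_values"],          # image tensors
        "prompt": prompts,                                # instruction strings
        "solution": [ex["solution"] for ex in examples],  # ground‑truth bbox / coords
        "scales": [img.size for img in images],           # original (width, height)
    }
```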
Tiny subclass that forwards all arguments to the real Qwen 2.5‑VL model while gracefully ignoring the extra `logits_to_keep` parameter expected by GRPO.
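A sketch of such a wrapper, assuming the `Qwen2_5_VLForConditionalGeneration` class from a recent `transformers` release (the subclass name is hypothetical):

```python
from transformers import Qwen2_5_VLForConditionalGeneration

class QwenVLWrapper(Qwen2_5_VLForConditionalGeneration):
    def forward(self, *args, logits_to_keep=None, **kwargs):
        # GRPO passes `logits_to_keep`; swallow it here if the underlying
        # forward() does not accept it, and hand everything else straight through.
        return super().forward(*args, **kwargs)
```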
Currently only `accuracy_reward_coord`, which returns 1 if the (x, y) coordinate predicted by the model falls inside the ground‑truth bounding box and 0 otherwise. Feel free to add IoU‑ or distance‑based rewards here.
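A hedged sketch of that reward in the shape `trl` expects (one float per completion); the completion format – a bare `(x, y)` pair – and the `solution` layout `[x1, y1, x2, y2]` are assumptions to adapt to your own data:

```python
import re

def accuracy_reward_coord(completions, solution, **kwargs):
    """Return 1.0 when the predicted point falls inside the ground‑truth box."""
    rewards = []
    for completion, bbox in zip(completions, solution):
        # Pull the first "(x, y)" pair out of the generated text.
        match = re.search(r"\(?\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)?", str(completion))
        if match is None:
            rewards.append(0.0)
            continue
        x, y = float(match.group(1)), float(match.group(2))
        x1, y1, x2, y2 = bbox
        rewards.append(1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0)
    return rewards
```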
Provides a concrete example wiring everything together.
Customise the constants at the top, or replace them with argparse flags for production use.
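One way to swap those constants for flags; the flag names and defaults below are illustrative only, not the repository's actual CLI:

```python
import argparse

parser = argparse.ArgumentParser(description="GRPO grounding fine‑tuning")
parser.add_argument("--model-name", default="Qwen/Qwen2.5-VL-3B-Instruct")
parser.add_argument("--dataset-json", default="data/train.json")
parser.add_argument("--images-root", default="data/images")
parser.add_argument("--output-dir", default="outputs/grpo-grounding")
parser.add_argument("--batch-size", type=int, default=2)
parser.add_argument("--num-generations", type=int, default=2)
args = parser.parse_args()
```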
| Hyper‑parameter | Where to set | Notes |
|---|---|---|
| `per_device_train_batch_size` | `GRPOConfig` | Limited by GPU memory – images are heavy! |
| `num_generations` | `GRPOConfig` | How many action samples to draw per prompt. |
| `reward_funcs` | trainer init | List of callables returning a reward ∈ {0, 1}. |
| `bf16` / `fp16` | `GRPOConfig` | Use bf16 on A100/H100 for speed and memory efficiency. |
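For reference, those knobs map onto a `GRPOConfig` roughly like this (values are illustrative; `per_device_train_batch_size` and `bf16`/`fp16` come from the underlying `transformers.TrainingArguments`):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs/grpo-grounding",
    per_device_train_batch_size=2,   # limited by GPU memory – images are heavy
    num_generations=2,               # action samples per prompt; should divide the global batch size
    bf16=True,                       # prefer bf16 on A100/H100
)
```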
Released under the MIT License – free to use, modify and distribute.
- TRL library for GRPO.
- Qwen‑VL team for the open‑source model.