🖼️ GRPO Images Trainer

Fine‑tune Qwen 2.5‑VL (or any Vision‑Language model with the same API) on image grounding tasks using GRPO (Group Relative Policy Optimization) in just a few lines of code.


✨ Why use this repo?

  • Plug‑and‑play trainer – drop in your own JSON dataset of prompts + bounding boxes (see the example record after this list) and start training.
  • Image‑aware data collator – automatically loads, preprocesses and batches images.
  • Reward‑based optimisation – leverages the trl library’s GRPO algorithm for RL‑style fine‑tuning.
  • Minimal codebase – only three Python files, easy to read and customise.
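
For illustration only, a record in that JSON dataset might look like the sketch below. The field names (image, prompt, solution) and the [x1, y1, x2, y2] bbox format are assumptions inferred from the collator outputs described under “Under the hood”, not a documented schema – check your own data pipeline.

```json
[
  {
    "image": "0001.jpg",
    "prompt": "Point to the red mug on the desk.",
    "solution": [412, 135, 498, 220]
  }
]
```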

🔍 Under the hood

PersonalizedGRPOTrainer (extends trl.GRPOTrainer)

  • Accepts an image_processor and an images_root folder.
  • Overrides data_collator (a sketch follows this list) to
    1. Load images with Pillow.
    2. Batch‑encode them via the Hugging Face AutoProcessor.
    3. Return a dict containing
      • pixel_values – tensor (C × H × W)
      • prompt – instruction string
      • solution – ground‑truth bbox or coordinates
      • scales – original image size
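
A minimal sketch of such a collator, assuming a Pillow + AutoProcessor pipeline; the hypothetical make_collator helper stands in for the override that lives inside PersonalizedGRPOTrainer:

```python
import os

from PIL import Image


def make_collator(image_processor, images_root):
    """Sketch of the image-aware collator described above."""

    def collate(features):
        # 1. Load images with Pillow.
        images = [
            Image.open(os.path.join(images_root, f["image"])).convert("RGB")
            for f in features
        ]
        # 2. Batch-encode them via the Hugging Face AutoProcessor.
        encoded = image_processor(images=images, return_tensors="pt")
        # 3. Return the fields GRPO needs downstream.
        return {
            "pixel_values": encoded["pixel_values"],        # batched image tensor
            "prompt": [f["prompt"] for f in features],      # instruction strings
            "solution": [f["solution"] for f in features],  # ground-truth bboxes
            "scales": [img.size for img in images],         # original (W, H) sizes
        }

    return collate
```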

Qwen2_5_VLForConditionalGenerationWithLogits (wrapper)

Tiny subclass that forwards all arguments to the real Qwen 2.5‑VL model while gracefully ignoring the extra logits_to_keep argument that trl’s GRPOTrainer passes during training.
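
A plausible shape for that wrapper, assuming the stock transformers class name – the only change is swallowing the extra kwarg:

```python
from transformers import Qwen2_5_VLForConditionalGeneration


class Qwen2_5_VLForConditionalGenerationWithLogits(Qwen2_5_VLForConditionalGeneration):
    """Identical to the base model, except that forward() silently drops
    the logits_to_keep kwarg passed in by trl's GRPOTrainer."""

    def forward(self, *args, logits_to_keep=None, **kwargs):
        return super().forward(*args, **kwargs)
```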

Reward functions (rewards.py)

Currently only accuracy_reward_coord, which returns 1 if the (x, y) coordinate predicted by the model falls inside the ground‑truth bounding box and 0 otherwise.
Feel free to add IoU‑ or distance‑based rewards here.
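
A minimal sketch of such a reward function, assuming trl’s reward-function calling convention (completions plus dataset columns as kwargs), plain-text completions, and an illustrative “(x, y)” parsing regex:

```python
import re


def accuracy_reward_coord(completions, solution, **kwargs):
    """Return 1.0 per completion whose predicted (x, y) point lies inside
    the ground-truth bbox [x1, y1, x2, y2], else 0.0."""
    rewards = []
    for completion, bbox in zip(completions, solution):
        match = re.search(r"\((\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\)", completion)
        if match is None:
            rewards.append(0.0)  # unparseable output earns no reward
            continue
        x, y = float(match.group(1)), float(match.group(2))
        x1, y1, x2, y2 = bbox
        rewards.append(1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0)
    return rewards
```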

Training script (main.py)

Provides a concrete example that wires everything together.
Customise the constants at the top, or replace them with argparse flags for production use.
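
For orientation, the wiring could look roughly like the sketch below. The checkpoint name, file paths, import locations, and the trainer’s extra keyword arguments are all assumptions – main.py is the authoritative version:

```python
from datasets import load_dataset
from transformers import AutoProcessor
from trl import GRPOConfig

from rewards import accuracy_reward_coord
# Import paths below are guesses; adjust to wherever the repo defines them.
from trainer import (
    PersonalizedGRPOTrainer,
    Qwen2_5_VLForConditionalGenerationWithLogits,
)

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # any Qwen 2.5-VL checkpoint
IMAGES_ROOT = "data/images"               # folder containing training images
DATA_FILE = "data/train.json"             # prompts + ground-truth bboxes

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGenerationWithLogits.from_pretrained(MODEL_ID)
dataset = load_dataset("json", data_files=DATA_FILE, split="train")

config = GRPOConfig(
    output_dir="checkpoints",
    per_device_train_batch_size=1,  # images are memory-hungry
    num_generations=4,              # completions sampled per prompt
    bf16=True,                      # prefer bf16 on A100/H100
)

trainer = PersonalizedGRPOTrainer(
    model=model,
    args=config,
    reward_funcs=[accuracy_reward_coord],
    train_dataset=dataset,
    processing_class=processor,
    image_processor=processor,  # custom kwargs per the trainer description
    images_root=IMAGES_ROOT,
)
trainer.train()
```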


⚙️ Configuration tips

| Hyper‑parameter | Where to set | Notes |
| --- | --- | --- |
| per_device_train_batch_size | GRPOConfig | Limited by GPU memory – images are heavy! |
| num_generations | GRPOConfig | How many completions to sample per prompt. |
| reward_funcs | trainer init | List of callables returning a reward ∈ {0, 1}. |
| bf16 / fp16 | GRPOConfig | Use bf16 on A100/H100 for speed and memory efficiency. |

📄 License

Released under the MIT License – free to use, modify and distribute.


🙏 Acknowledgements
