Hi,
Thanks for the great work on mPLUG-Owl3! I was wondering whether the template below is the correct chat format for multi-image inference, since the README doesn't explicitly mention it. With the following code, the model does take multiple images as input, but the performance is below my expectations. Please let me know if my template is incorrect (specifically the real_prompt and the message formation).
Looking forward to hearing from you. Thanks!
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

huggingface_model_id = 'mPLUG/mPLUG-Owl3-7B-240728'
model = AutoModelForCausalLM.from_pretrained(
    huggingface_model_id,
    torch_dtype=torch.half,
    attn_implementation="flash_attention_2",
    trust_remote_code=True
).eval().to("cuda")
tokenizer = AutoTokenizer.from_pretrained(huggingface_model_id)
processor = model.init_processor(tokenizer)

# Given a bunch of image paths: image_paths = ['file1.png', 'file2.png', ...]
images = []
for image_path in image_paths:
    images.append(Image.open(image_path).convert("RGB"))

# One <|image|> placeholder per image, prepended to the text question in `prompt`
real_prompt = '<|image|>' * len(image_paths) + prompt
messages = [{"role": "user", "content": real_prompt}, {"role": "assistant", "content": ""}]

inputs = processor(messages, images=images, video=None).to("cuda")
generated_text = model.generate(**inputs, tokenizer=tokenizer,
                                max_new_tokens=256, decode_text=True)[0]
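For reference, the only other message formation I could think of interleaves the placeholders with short per-image labels instead of prepending them all at once. This is just a sketch of that variant, assuming the processor matches <|image|> tokens to images in order (the "Image N:" labels are placeholders of my own, not from the README), and it reuses prompt, processor, and images from above:

# Sketch of an interleaved variant (assumption: <|image|> placeholders are
# matched to `images` in order, so only the surrounding text changes).
labeled = ''.join(f'Image {i + 1}: <|image|>\n' for i in range(len(images)))
messages = [
    {"role": "user", "content": labeled + prompt},
    {"role": "assistant", "content": ""},
]
inputs = processor(messages, images=images, video=None).to("cuda")

Is either of these the intended format, or is there a recommended way to separate multiple images in the prompt?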