This repository contains the code and gesture type annotations used in our paper: "Do Multimodal Large Language Models Truly See What We Point At? Investigating Indexical, Iconic, and Symbolic Gesture Comprehension" (Nishida et al., ACL 2025).
The repository provides:
- Preprocessing scripts for the dataset used in our experiments
- Step-by-step code and prompt templates for running gesture explanation tasks with multimodal large language models (MLLMs), including:
  - GPT-4o (OpenAI)
  - Gemini 1.5 Pro (Google DeepMind)
  - Qwen2.5-VL-7B-Instruct (Alibaba)
  - LLaVA-NeXT-Video (LLaVA Project)
- Evaluation scripts (LLM-as-a-judge)
.
├── README.md # This file
├── codes/ # All scripts and templates for processing and evaluation
│ ├── dataset-preparation/ # Preprocessing scripts for Miraikan SC Corpus
│ ├── step1_generate.py # Run LLMs to generate gesture descriptions
│ ├── step2_evaluate_by_llm.py # Use LLMs to evaluate generated descriptions
│ ├── step3_calc_evaluation_scores.py # Aggregate and score the evaluations
│ ├── utils.py # Utility functions
│ ├── prompt_templates/ # Prompt templates
│ └── run.sh # Example script for running the full pipeline
├── data/ # Directory for input data (COMING SOON)
└── results/ # Directory for generated descriptions and evaluation results
We recommend using a recent version of Python (>=3.10). Install dependencies via:
pip install -r requirements.txt
The dataset used in our experiments is formatted as a JSON list, where each entry corresponds to a gesture instance. Each entry contains:
- the gesture annotations,
- the utterances overlapping with the gesture time span, and
- the video clip overlapping with the gesture time span.
Here is an example entry:
{
  "example_key": "data04_264.6_266.935",
  "corresponding_utterances": [
    {
      "id": "a7418",
      "time_span": [264.812, 266.511],
      "speaker": "scA",
      "utterance": "こんなでかい天文台,知ってます?"
    },
    ...
  ],
  "gesture": {
    "id": "a8042",
    "time_span": [264.6, 266.935],
    "gesturer": "scA",
    "position": "hand",
    "perspective": "intentional",
    "description": "客に対し,模型への注意を向けさせ,「天文台」が指示する対象を明示する",
    "gesture_type": "Indexical"
  }
}
- `example_key` (str): Unique identifier for the gesture instance
- `corresponding_utterances` (list[dict]): List of utterance objects, each containing:
  - `id` (str): Unique ID for the utterance (e.g., ELAN annotation ID)
  - `time_span` (tuple[float]): Start and end time in seconds
  - `speaker` (str): Speaker label (e.g., `v01`, `scA`)
  - `utterance` (str): Transcribed text of the utterance
- `gesture` (dict): Object containing gesture annotations:
  - `id` (str): Unique ID for the gesture annotation (e.g., ELAN annotation ID)
  - `time_span` (tuple[float]): Start and end time of the gesture in seconds
  - `gesturer` (str): Identifier of the person performing the gesture (e.g., `scA`)
  - `position` (str): Body part used in the gesture (e.g., `hand`)
  - `perspective` (str): Perspective used to write the `description` (e.g., `intentional`)
  - `description` (str): Gesture description (i.e., the relevant annotation)
  - `gesture_type` (str): Gesture type (e.g., `Indexical`, `Iconic`, `Symbolic`)
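For reference, here is a minimal sketch of how such a file can be loaded and iterated over in Python. The path below is only an example, and the keys follow the format described above:

```python
import json

# Minimal sketch: load the dataset and print a few fields per gesture instance.
# The path below is only an example; point it at your own dataset.json.
with open("data/mscc/v1/dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)  # a JSON list, one dict per gesture instance

for entry in dataset:
    gesture = entry["gesture"]
    print(entry["example_key"], gesture["gesture_type"], gesture["time_span"])
    for utt in entry["corresponding_utterances"]:
        print(f'  [{utt["speaker"]}] {utt["utterance"]}')
```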
Each gesture instance is also associated with a directory of extracted frame images.
The expected directory structure is:
<dataset_dir>/
├── dataset.json
└── frames/
└── <example_key>/
├── <example_key>.frame_000.jpg
├── <example_key>.frame_001.jpg
├── <example_key>.frame_002.jpg
└── ...
For example, if `example_key = "data05_449.382_453.557"`, the expected frame image files are located at:
<dataset_dir>/
├── dataset.json
└── frames/
└── data05_449.382_453.557/
├── data05_449.382_453.557.frame_000.jpg
├── data05_449.382_453.557.frame_001.jpg
├── data05_449.382_453.557.frame_002.jpg
└── ...
These frames are used as visual input to multimodal LLMs.
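As an illustration (not part of the released scripts), a helper like the following can collect the frame paths for a given `example_key` under this layout:

```python
from pathlib import Path

def load_frame_paths(dataset_dir: str, example_key: str) -> list[Path]:
    """Return sorted frame image paths for one gesture instance.

    Sketch only: assumes the <dataset_dir>/frames/<example_key>/ layout and the
    <example_key>.frame_NNN.jpg naming convention shown above.
    """
    frame_dir = Path(dataset_dir) / "frames" / example_key
    return sorted(frame_dir.glob(f"{example_key}.frame_*.jpg"))

# Example (paths are hypothetical):
# frames = load_frame_paths("data/mscc/v1", "data05_449.382_453.557")
```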
The Miraikan SC Corpus, the dataset used in our paper, is currently undergoing ethical review and preparation for public release.
We plan to make the dataset available upon completion of this process.
In the meantime, you may prepare your own dataset in the same format and run the full pipeline as described.
To use your own dataset, format the JSON and frame directory as described above and place them under:
data/your_dataset_name/
├── dataset.json
└── frames/
You can then follow the steps in the Usage section (from Step 2) to run the full evaluation pipeline.
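Before running the pipeline, a quick sanity check along these lines (a sketch, not one of the provided scripts) can confirm that every entry has extracted frames:

```python
import json
from pathlib import Path

# Sketch of a sanity check for a custom dataset: verify that every entry in
# dataset.json has at least one extracted frame image in the expected directory.
dataset_dir = Path("data/your_dataset_name")  # as in the layout above
with open(dataset_dir / "dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)

for entry in dataset:
    key = entry["example_key"]
    frames = sorted((dataset_dir / "frames" / key).glob(f"{key}.frame_*.jpg"))
    if not frames:
        print(f"WARNING: no frame images found for {key}")
```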
1. Prepare the dataset (run from inside the `codes/dataset-preparation/` directory):

   cd codes/dataset-preparation
   ./run.sh  # COMING SOON
2. Generate gesture descriptions (run from inside the `codes/` directory):

   cd codes
   python step1_generate.py --llm_type ${LLM_TYPE} --dataset ${DATASET} --results_dir ${RESULTS_DIR} --prefix ${MY_PREFIX}
   The script requires you to specify several environment variables:

   - `LLM_TYPE`: Backend MLLM to use (e.g., `openai`, `gemini`, `qwen`, `llava`)
   - `DATASET`: Path to the preprocessed input dataset (JSON file)
   - `RESULTS_DIR`: Directory where results will be stored
   - `MY_PREFIX`: Identifier for the experiment version
   Example:

   LLM_TYPE=openai
   DATASET=<path to this repository>/data/mscc/v1/dataset.json
   RESULTS_DIR=<path to this repository>/results
   MY_PREFIX=example
3. Evaluate the explanations using LLMs (also from inside the `codes/` directory):

   python step2_evaluate_by_llm.py --input_file ${RESULTS_DIR}/${MY_PREFIX}/results.jsonl --output_file ${RESULTS_DIR}/${MY_PREFIX}/evaluation_by_llm.jsonl
4. Calculate evaluation scores (also from inside the `codes/` directory):

   python step3_calc_evaluation_scores.py --input ${RESULTS_DIR}/${MY_PREFIX}/evaluation_by_llm.jsonl
Or, run the whole pipeline (from inside `codes/`):
cd codes
./run.sh
This project is licensed under the MIT License.
If you find this work useful, please cite our paper:
@inproceedings{nishida-etal-2025-multimodal,
title = "Do Multimodal Large Language Models Truly See What We Point At? Investigating Indexical, Iconic, and Symbolic Gesture Comprehension",
author = "Nishida, Noriki and
Inoue, Koji and
Nakayama, Hideki and
Bono, Mayumi and
Takanashi, Katsuya",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-short.40/",
pages = "514--524",
ISBN = "979-8-89176-252-7"
}
For any questions or issues, please contact: Noriki Nishida – noriki.nishida[at]riken.jp