This repository contains the code and gesture type annotations used in our paper: "Do Multimodal Large Language Models Truly See What We Point At? Investigating Indexical, Iconic, and Symbolic Gesture Comprehension" (Nishida et al., ACL 2025).
The repository provides:
- Preprocessing scripts for the dataset used in our experiments
- Step-by-step code and prompt templates for running gesture explanation tasks with multimodal large language models (MLLMs), including:
  - GPT-4o (OpenAI)
  - Gemini 1.5 Pro (Google DeepMind)
  - Qwen2.5-VL-7B-Instruct (Alibaba)
  - LLaVA-NeXT-Video (LLaVA Project)
- Evaluation scripts (LLM-as-a-judge)
.
├── README.md # This file
├── codes/ # All scripts and templates for processing and evaluation
│ ├── dataset-preparation/ # Preprocessing scripts for Miraikan SC Corpus
│ ├── step1_generate.py # Run LLMs to generate gesture descriptions
│ ├── step2_evaluate_by_llm.py # Use LLMs to evaluate generated descriptions
│ ├── step3_calc_evaluation_scores.py # Aggregate and score the evaluations
│ ├── utils.py # Utility functions
│ ├── prompt_templates/ # Prompt templates
│ └── run.sh # Example script for running the full pipeline
├── data/ # Directory for input data (COMING SOON)
└── results/ # Directory for generated descriptions and evaluation results
We recommend using a recent version of Python (>=3.10). Install dependencies via:
pip install -r requirements.txt
The dataset used in our experiments is formatted as a JSON list, where each entry corresponds to a gesture instance. Each entry contains:
- the gesture annotations,
- the utterances overlapping with the gesture time span, and
- the video clip overlapping with the gesture time span.
Here is an example entry:
{
  "example_key": "data04_264.6_266.935",
  "corresponding_utterances": [
    {
      "id": "a7418",
      "time_span": [264.812, 266.511],
      "speaker": "scA",
      "utterance": "こんなでかい天文台,知ってます?"
    },
    ...
  ],
  "gesture": {
    "id": "a8042",
    "time_span": [264.6, 266.935],
    "gesturer": "scA",
    "position": "hand",
    "perspective": "intentional",
    "description": "客に対し,模型への注意を向けさせ,「天文台」が指示する対象を明示する",
    "gesture_type": "Indexical"
  }
}
- `example_key` (str): Unique identifier for the gesture instance
- `corresponding_utterances` (list[dict]): List of utterance objects, each containing:
  - `id` (str): Unique ID for the utterance (e.g., ELAN annotation ID)
  - `time_span` (tuple[float]): Start and end time in seconds
  - `speaker` (str): Speaker label (e.g., `v01`, `scA`)
  - `utterance` (str): Transcribed text of the utterance
- `gesture` (dict): Object containing gesture annotations:
  - `id` (str): Unique ID for the gesture annotation (e.g., ELAN annotation ID)
  - `time_span` (tuple[float]): Start and end time of the gesture in seconds
  - `gesturer` (str): Identifier of the person performing the gesture (e.g., `scA`)
  - `position` (str): Body part used in the gesture (e.g., `hand`)
  - `perspective` (str): Perspective used to write the `description` (e.g., `intentional`)
  - `description` (str): Gesture description (i.e., the relevant annotation)
  - `gesture_type` (str): Gesture type (e.g., `Indexical`, `Iconic`, `Symbolic`)
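For reference, here is a minimal sketch of how such a file can be loaded and iterated over in Python. The path below is only an example, and the keys follow the format described above:

```python
import json

# Minimal sketch: load the dataset and print a few fields per gesture instance.
# The path below is only an example; point it at your own dataset.json.
with open("data/mscc/v1/dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)  # a JSON list, one dict per gesture instance

for entry in dataset:
    gesture = entry["gesture"]
    print(entry["example_key"], gesture["gesture_type"], gesture["time_span"])
    for utt in entry["corresponding_utterances"]:
        print(f'  [{utt["speaker"]}] {utt["utterance"]}')
```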
Each gesture instance is also associated with a directory of extracted frame images.
The expected directory structure is:
<dataset_dir>/
├── dataset.json
└── frames/
└── <example_key>/
├── <example_key>.frame_000.jpg
├── <example_key>.frame_001.jpg
├── <example_key>.frame_002.jpg
└── ...
For example, if `example_key = "data05_449.382_453.557"`, the expected frame image files are located at:
<dataset_dir>/
├── dataset.json
└── frames/
└── data05_449.382_453.557/
├── data05_449.382_453.557.frame_000.jpg
├── data05_449.382_453.557.frame_001.jpg
├── data05_449.382_453.557.frame_002.jpg
└── ...
These frames are used as visual input to multimodal LLMs.
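As an illustration (not part of the released scripts), a helper like the following can collect the frame paths for a given `example_key` under this layout:

```python
from pathlib import Path

def load_frame_paths(dataset_dir: str, example_key: str) -> list[Path]:
    """Return sorted frame image paths for one gesture instance.

    Sketch only: assumes the <dataset_dir>/frames/<example_key>/ layout and the
    <example_key>.frame_NNN.jpg naming convention shown above.
    """
    frame_dir = Path(dataset_dir) / "frames" / example_key
    return sorted(frame_dir.glob(f"{example_key}.frame_*.jpg"))

# Example (paths are hypothetical):
# frames = load_frame_paths("data/mscc/v1", "data05_449.382_453.557")
```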
The Miraikan SC Corpus, the dataset used in our paper, is currently undergoing ethical review and preparation for public release.
We plan to make the dataset available upon completion of this process.
In the meantime, you may prepare your own dataset in the same format and run the full pipeline as described.
To use your own dataset, format the JSON and frame directory as described above and place them under:
data/your_dataset_name/
├── dataset.json
└── frames/
You can then follow the steps in the Usage section (from Step 2) to run the full evaluation pipeline.
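Before running the pipeline, a quick sanity check along these lines (a sketch, not one of the provided scripts) can confirm that every entry has extracted frames:

```python
import json
from pathlib import Path

# Sketch of a sanity check for a custom dataset: verify that every entry in
# dataset.json has at least one extracted frame image in the expected directory.
dataset_dir = Path("data/your_dataset_name")  # as in the layout above
with open(dataset_dir / "dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)

for entry in dataset:
    key = entry["example_key"]
    frames = sorted((dataset_dir / "frames" / key).glob(f"{key}.frame_*.jpg"))
    if not frames:
        print(f"WARNING: no frame images found for {key}")
```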
1. Prepare the dataset (run from inside the `codes/dataset-preparation/` directory):

   cd codes/dataset-preparation
   ./run.sh  # COMING SOON
2. Generate gesture descriptions (run from inside the `codes/` directory):

   cd codes
   python step1_generate.py --llm_type ${LLM_TYPE} --dataset ${DATASET} --results_dir ${RESULTS_DIR} --prefix ${MY_PREFIX}
   The script requires you to specify several environment variables:

   - `LLM_TYPE`: Backend MLLM to use (e.g., `openai`, `gemini`, `qwen`, `llava`)
   - `DATASET`: Path to the preprocessed input dataset (JSON file)
   - `RESULTS_DIR`: Directory where results will be stored
   - `MY_PREFIX`: Identifier for the experiment version
   Example:

   LLM_TYPE=openai
   DATASET=<path to this repository>/data/mscc/v1/dataset.json
   RESULTS_DIR=<path to this repository>/results
   MY_PREFIX=example
3. Evaluate the explanations using LLMs (also from inside the `codes/` directory):

   python step2_evaluate_by_llm.py --input_file ${RESULTS_DIR}/${MY_PREFIX}/results.jsonl --output_file ${RESULTS_DIR}/${MY_PREFIX}/evaluation_by_llm.jsonl
4. Calculate evaluation scores (also from inside the `codes/` directory):

   python step3_calc_evaluation_scores.py --input ${RESULTS_DIR}/${MY_PREFIX}/evaluation_by_llm.jsonl
Or, run the whole pipeline (from inside `codes/`):
cd codes
./run.sh
This project is licensed under the MIT License.
If you find this work useful, please cite our paper:
@inproceedings{nishida-etal-2025-multimodal,
title = "Do Multimodal Large Language Models Truly See What We Point At? Investigating Indexical, Iconic, and Symbolic Gesture Comprehension",
author = "Nishida, Noriki and
Inoue, Koji and
Nakayama, Hideki and
Bono, Mayumi and
Takanashi, Katsuya",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-short.40/",
pages = "514--524",
ISBN = "979-8-89176-252-7"
}
For any questions or issues, please contact: Noriki Nishida – noriki.nishida[at]riken.jp