Overview

Motivation

While several existing egocentric benchmarks include evaluations related to state prediction, they typically focus on instantaneous (or short-term) and localized state changes (e.g., the state of a single object after a visual cue), and they do not provide a systematic protocol for evaluating scene-level predictions under long action horizons. We therefore design a scalable annotation pipeline and build a new benchmark that enables fine-grained, quantitative evaluation of long-horizon egocentric scene prediction.

Key Features

  • Long Action Horizons: 113 atomic actions on average
  • Fine-grained Evaluation: Objects, attributes, and relations
  • Numerous Objects: 20000+ object instances
  • Diverse Scenarios: 12+ real-world scenarios
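Fine-grained evaluation covers objects, their attributes, and the relations between them. As a minimal sketch of what a scene-level state could look like (the class names, fields, and example values here are illustrative assumptions, not the benchmark's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    """One object instance in a predicted scene state (hypothetical schema)."""
    name: str                                                 # e.g. "mug"
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. {"filled": "empty"}

@dataclass
class SceneState:
    """Scene-level state: a set of objects plus pairwise relations."""
    objects: list[ObjectState] = field(default_factory=list)
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)

# Example: the state after a hypothetical "put down mug" action
scene = SceneState(
    objects=[ObjectState("mug", {"filled": "empty"}), ObjectState("table")],
    relations=[("mug", "on", "table")],
)
```

A representation along these lines makes each of the three evaluation targets (object coverage, attribute correctness, relation correctness) directly checkable against a ground-truth annotation.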

Scene Annotation Pipeline

Annotation Pipeline Diagram

Scene annotation pipeline. Rather than having the MLLM generate annotations directly from the images, we adopt a multi-step pipeline that ensures object coverage and the accuracy of attributes and relations, greatly reducing the amount of manual annotation required.
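The decomposition described above can be sketched as a sequence of narrower queries rather than one open-ended generation call. The step names, interfaces, and stub outputs below are illustrative assumptions, not the benchmark's actual pipeline code:

```python
# Hypothetical sketch of a multi-step annotation pipeline.
# Each step would be backed by an MLLM query in practice; here the
# steps return fixed stub values so the control flow is runnable.

def detect_objects(image):
    """Step 1: enumerate object instances first, to guarantee coverage."""
    return ["mug", "table"]  # stub output

def annotate_attributes(image, objects):
    """Step 2: query attributes only for the detected objects."""
    return {obj: {"visible": "yes"} for obj in objects}

def annotate_relations(image, objects):
    """Step 3: query pairwise relations over the detected objects."""
    return [("mug", "on", "table")]  # stub output

def annotate_scene(image):
    """Compose the steps; downstream checks see a structured annotation,
    not free-form MLLM text, which narrows what must be verified manually."""
    objects = detect_objects(image)
    return {
        "objects": objects,
        "attributes": annotate_attributes(image, objects),
        "relations": annotate_relations(image, objects),
    }

annotation = annotate_scene(image=None)
```

Constraining later steps to the object list produced in the first step is one way a pipeline like this can keep attribute and relation annotations grounded in objects that are actually present.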

Evaluation Results

Main Evaluation Results

Evaluation results on EXPLORE-Bench. Short, Medium, and Long denote the subsets with short, medium, and long atomic-action sequences, respectively. Full denotes the full dataset. Dark green and light green indicate the best and second-best results among all models. Orange denotes using Qwen2-VL-7B-Instruct as the base model. ✝ denotes results on the EXPLORE-Bench (tiny) set. # and * represent the non-thinking and thinking modes, respectively.

Experiments

To study how well stepwise reasoning works with a non-thinking model, we first segment each atomic-action sequence in two ways: (1) into a fixed number of segments, and (2) into windows of a fixed size. We then evaluate Qwen3-VL-8B-Instruct across subsets under two inference strategies: (1) single-turn inference with stepwise reasoning, and (2) multi-turn inference with stepwise reasoning. The results are shown in the figures below. Short, Medium, and Long denote the subsets with short, medium, and long atomic-action sequences; Full denotes the full dataset. Note that the two dashed lines corresponding to "Default Setting: Medium" and "Default Setting: Full" overlap.
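The two segmentation schemes can be sketched as follows; the function names are our own, and the benchmark's actual splitting code may differ (e.g., in how a remainder is distributed):

```python
def segment_by_count(actions, num_segments):
    """Split an atomic-action sequence into a fixed number of segments,
    distributing any remainder over the earliest segments."""
    base, rem = divmod(len(actions), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        end = start + base + (1 if i < rem else 0)
        segments.append(actions[start:end])
        start = end
    return segments

def segment_by_window(actions, window_size):
    """Split an atomic-action sequence into fixed-size windows;
    the last window may be shorter."""
    return [actions[i:i + window_size] for i in range(0, len(actions), window_size)]

actions = [f"action_{i}" for i in range(10)]
by_count = segment_by_count(actions, 3)    # segment sizes: 4, 3, 3
by_window = segment_by_window(actions, 4)  # window sizes: 4, 4, 2
```

The key difference is which quantity stays constant as sequences grow: fixing the segment count lets segment length scale with the horizon, while fixing the window size keeps per-step context constant and lets the number of reasoning steps grow instead.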

Score vs Segments

Unified score vs. number of segments

Score vs Window Size

Unified score vs. window size

Inference Length Segments

Description length vs. number of segments

Inference Length Windows

Description length vs. window size

Inference Time Segments

Inference time vs. number of segments

Inference Time Windows

Inference time vs. window size

Case Studies

Citation

BibTeX

@article{yu2026explore,
  title={EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning},
  author={Yu, Chengjun and Zhu, XuHan and Du, Chaoqun and Yu, Pengfei and Zhai, Wei and Cao, Yang and Zha, Zheng-Jun},
  journal={arXiv preprint arXiv:2603.09731},
  year={2026}
}