Overview
Motivation
While several existing egocentric benchmarks include evaluations related to state prediction, they typically focus on instantaneous (or short-term), localized state changes (e.g., the state of a single object after a visual cue) and provide no systematic protocol for evaluating scene-level predictions over long action horizons. We therefore design a scalable annotation pipeline and build a new benchmark that enables fine-grained, quantitative evaluation of long-horizon egocentric scene prediction.
Key Features
- Long Action Horizons: 113 atomic actions per sequence on average
- Fine-grained Evaluation: Objects, attributes, and relations
- Numerous Objects: 20,000+ object instances
- Diverse Scenarios: 12+ real-world scenarios
Scene Annotation Pipeline
Scene annotation pipeline. Rather than having the MLLM generate annotations directly from the images, we adopt a multi-step pipeline to ensure object coverage and the accuracy of attributes and relations, greatly reducing the amount of manual annotation required.
Evaluation Results
Evaluation results on EXPLORE-Bench. Short, Medium, and Long denote the subsets with short, medium, and long atomic-action sequences, respectively; Full denotes the full dataset. Dark green and light green indicate the best and second-best results among all models. Orange denotes using Qwen2-VL-7B-Instruct as the base model. ✝ denotes results on the EXPLORE-Bench (tiny) set. # and * denote the non-thinking and thinking modes, respectively.
Experiments
To study how well stepwise reasoning works with the non-thinking model, we first segment the atomic-action sequence in two ways: (1) into a fixed number of segments, and (2) with a fixed window size. We then evaluate Qwen3-VL-8B-Instruct on the resulting subsets under two inference strategies: (1) single-turn inference with stepwise reasoning, and (2) multi-turn inference with stepwise reasoning. The results are shown in the figures below. Short, Medium, and Long denote the subsets with short, medium, and long atomic-action sequences; Full denotes the full dataset. Note that the dashed lines for "Default Setting: Medium" and "Default Setting: Full" overlap.
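The two segmentation schemes can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names (`segment_by_count`, `segment_by_window`) and the toy action list are assumptions.

```python
# Minimal sketch of the two ways to segment an atomic-action sequence:
# (1) a fixed number of segments, (2) a fixed window size.
# Hypothetical helpers; not the benchmark's actual code.
import math

def segment_by_count(actions, num_segments):
    """Split a sequence into (up to) num_segments roughly equal segments."""
    size = math.ceil(len(actions) / num_segments)
    return [actions[i:i + size] for i in range(0, len(actions), size)]

def segment_by_window(actions, window_size):
    """Split a sequence into consecutive windows of window_size actions."""
    return [actions[i:i + window_size] for i in range(0, len(actions), window_size)]

actions = [f"action_{i}" for i in range(10)]
print(segment_by_count(actions, 3))   # 3 segments of sizes 4, 4, 2
print(segment_by_window(actions, 4))  # windows of sizes 4, 4, 2
```

With a fixed number of segments, segment size grows with sequence length; with a fixed window size, the number of segments grows instead, which is why the two schemes trade off differently as the action horizon lengthens.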
Unified score vs. number of segments
Unified score vs. window size
Description length vs. number of segments
Description length vs. window size
Inference time vs. number of segments
Inference time vs. window size
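The multi-turn stepwise strategy evaluated above can be sketched as a loop over action segments, where each turn feeds one segment and the model refines its scene prediction. This is a hedged illustration: `model_chat` is a placeholder for an MLLM API call, and the prompt wording is an assumption, not the benchmark's actual protocol.

```python
# Hypothetical sketch of multi-turn inference with stepwise reasoning.
# model_chat(image, history, prompt) stands in for a real MLLM call.
def multi_turn_stepwise(model_chat, image, segments):
    """Feed action segments one turn at a time; return the final scene prediction."""
    history = []       # accumulated (prompt, response) turns
    prediction = None
    for seg in segments:
        prompt = "Actions: " + "; ".join(seg) + "\nUpdate the predicted scene state."
        prediction = model_chat(image, history, prompt)  # one turn per segment
        history.append((prompt, prediction))
    return prediction
```

Single-turn inference with stepwise reasoning would instead place all segments in one prompt and ask the model to reason segment by segment within a single response, trading the growing multi-turn context for one long generation.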
Case Studies
Citation
BibTeX
@article{yu2026explore,
title={EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning},
author={Yu, Chengjun and Zhu, XuHan and Du, Chaoqun and Yu, Pengfei and Zhai, Wei and Cao, Yang and Zha, Zheng-Jun},
journal={arXiv preprint arXiv:2603.09731},
year={2026}
}