NavBench: Probing Multimodal Large Language Models for Embodied Navigation

Yanyuan Qiao1, Haodong Hong2,3, Wenqi Lyu1, Dong An4, Siqi Zhang5, Yutong Xie4, Xinyu Wang1, Qi Wu1*

1The University of Adelaide    2The University of Queensland    3CSIRO Data61    4Mohamed bin Zayed University of Artificial Intelligence    5Tongji University

NavBench teaser figure

NavBench evaluates MLLMs across three comprehension tasks and a step-by-step execution task, assessing their ability to understand navigation behavior, track progress, reason about observations and actions, and act accordingly. Step-by-step navigation is assessed across difficulty levels defined by cognitive, spatial, and execution complexity.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong generalization in vision-language tasks, yet their ability to understand and act within embodied environments remains underexplored. We present NavBench, a benchmark for evaluating the embodied navigation capabilities of MLLMs under zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks (global instruction alignment, temporal progress estimation, and local observation-action reasoning) covering 3,200 question-answer pairs; and (2) step-by-step execution, spanning 432 episodes across 72 indoor scenes stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLMs' outputs into robotic actions. We evaluate both proprietary and open-source models and find that GPT-4o performs well across tasks, while lighter open-source models succeed in simpler cases. Results also show that models with higher comprehension scores tend to achieve better execution performance. Providing map-based context improves decision accuracy, especially in medium-difficulty scenarios. However, most models struggle with temporal understanding; estimating progress during navigation remains a key challenge.
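For illustration, the sketch below shows one way the zero-shot step-by-step setting can be driven: at each step the model is shown the instruction, its action history, and the candidate directions, and asked for a single option letter or STOP. The environment wrapper (env), prompt wording, and query_mllm call are assumptions for this sketch, not NavBench's exact protocol.

from dataclasses import dataclass

@dataclass
class Observation:
    instruction: str            # full natural-language instruction
    candidate_views: list[str]  # one description/image per navigable direction

def query_mllm(prompt: str) -> str:
    """Placeholder for a call to an MLLM API (e.g., GPT-4o)."""
    raise NotImplementedError

def choose_action(obs: Observation, history: list[str]) -> str:
    # Present the instruction, the action history, and the candidate
    # directions, then ask the model to pick one option or stop.
    options = "\n".join(f"{chr(65 + i)}. {v}" for i, v in enumerate(obs.candidate_views))
    prompt = (
        f"Instruction: {obs.instruction}\n"
        f"Actions taken so far: {', '.join(history) or 'none'}\n"
        f"Candidate directions:\n{options}\n"
        "Reply with a single option letter, or STOP if the goal is reached."
    )
    return query_mllm(prompt).strip().upper()

def run_episode(env, max_steps: int = 30) -> bool:
    obs, history = env.reset(), []          # env is a hypothetical simulator wrapper
    for _ in range(max_steps):
        action = choose_action(obs, history)
        if action == "STOP":
            break
        obs = env.step(action)              # move to the selected viewpoint
        history.append(action)
    return env.success()                    # e.g., stopped near the goal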

NavBench construction pipeline and statistics

NavBench construction pipeline and statistics.
(a) QA generation for comprehension tasks at global, progress, and local levels.
(b) Execution pipeline combining automatic difficulty scoring and human ratings.
(c) Benchmark statistics, including comprehension (comp.) task distribution, QA counts, and execution statistics (e.g., instruction length, steps, distance).
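As a concrete illustration of (b), one simple way to combine an automatic difficulty score with human ratings is a weighted blend followed by thresholding into difficulty buckets; the weights, normalization, and cut-offs below are assumptions, not the benchmark's exact recipe.

def difficulty_score(cognitive: float, spatial: float, execution: float,
                     human_rating: float, auto_weight: float = 0.5) -> float:
    """Blend an automatic score with a human rating; all inputs in [0, 1]."""
    auto = (cognitive + spatial + execution) / 3.0
    return auto_weight * auto + (1.0 - auto_weight) * human_rating

def stratify(score: float) -> str:
    """Map a blended score to a difficulty bucket (cut-offs are assumed)."""
    if score < 1 / 3:
        return "easy"
    return "medium" if score < 2 / 3 else "hard"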

Global Instruction Alignment Examples

Example 1

Example 1 trajectory

You are presented with a sequence of panoramic views that represent a navigation path from the starting point to the goal location. Identify the correct instruction.

Example 2

Example 2 trajectory

You are presented with a sequence of panoramic views that represent a navigation path from the starting point to the goal location. Identify the correct instruction.
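A small sketch of how such a multiple-choice item can be assembled: the correct instruction is mixed with distractor instructions, the options are shuffled, and the answer letter is recorded. The prompt wording mirrors the examples above; the function name, option count, and shuffling are illustrative.

import random

def build_alignment_item(num_views: int, correct: str, distractors: list[str],
                         seed: int = 0) -> tuple[str, str]:
    """Return (prompt, answer_letter) for one instruction-alignment item."""
    rng = random.Random(seed)
    options = distractors + [correct]
    rng.shuffle(options)
    answer = chr(65 + options.index(correct))
    listing = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    prompt = (
        "You are presented with a sequence of panoramic views that represent a "
        "navigation path from the starting point to the goal location. "
        "Identify the correct instruction.\n"
        f"[{num_views} panoramic views attached]\n" + listing
    )
    return prompt, answer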

Temporal Progress Estimation Examples

Example 1

Example 3-1 trajectory

You are given a navigation instruction divided into multiple sub-instructions, along with a trajectory. Your task is to determine how many sub-instructions have been completed based on the views provided.

Example 2

Example 3-2 trajectory

You are given a navigation instruction divided into multiple sub-instructions, along with a trajectory. Your task is to determine how many sub-instructions have been completed based on the views provided.
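In this task the model outputs a count of completed sub-instructions. One natural way to score it, assumed here rather than taken from the paper, is exact-match accuracy of the predicted count against the ground truth.

def progress_accuracy(predicted: list[int], ground_truth: list[int]) -> float:
    """Fraction of trajectories whose completed-sub-instruction count is exact."""
    assert len(predicted) == len(ground_truth)
    return sum(p == g for p, g in zip(predicted, ground_truth)) / len(ground_truth)

# Example: progress_accuracy([2, 3], [2, 1]) -> 0.5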

Local Observation-Action Reasoning Examples

Example 1

Given the current view and a direction to move, select the location matching the expected view after moving.
Inputs: current view and movement direction; options A, B, C, D are candidate views.

Example 2

Given the current view and a target view, select the direction that is most likely to lead to the target view.
Inputs: current view and target view; options A, B, C, D are candidate directions.
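One way such local reasoning items can be generated from a panoramic viewpoint is sketched below: the view along the ground-truth direction is the answer, and views along other navigable directions serve as distractors. The function and field names, and the fixed option order, are illustrative assumptions.

def build_local_item(current_view: str, direction: str, next_view: str,
                     distractor_views: list[str]) -> dict:
    """Return one observation-action item; the true next view hides among distractors."""
    options = distractor_views + [next_view]   # shuffle in practice; fixed here for brevity
    return {
        "prompt": ("Given the current view and a direction to move, select the "
                   "location matching the expected view after moving."),
        "current_view": current_view,
        "direction": direction,
        "options": options,
        "answer": chr(65 + options.index(next_view)),
    }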

Real-world Embodied Navigation Pipeline

Overview of the real-world embodied navigation pipeline

Overview of the real-world embodied navigation pipeline, demonstrating how MLLMs can be integrated into physical navigation systems.
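The sketch below illustrates the kind of translation layer such a pipeline needs: a constrained textual action emitted by the MLLM is parsed into a low-level command for the robot. The action grammar, command tuple, and fail-safe behavior are assumptions for this sketch, not the exact interface used in NavBench.

import re

ACTION_RE = re.compile(r"(FORWARD|TURN LEFT|TURN RIGHT|STOP)\s*([\d.]+)?")

def to_robot_command(mllm_output: str) -> tuple[str, float]:
    """Parse a constrained action string into a (command, magnitude) pair."""
    match = ACTION_RE.search(mllm_output.upper())
    if match is None:
        return ("stop", 0.0)                    # fail safe on unparseable output
    verb, arg = match.group(1), float(match.group(2) or 0.0)
    if verb == "FORWARD":
        return ("move", arg)                    # metres
    if verb.startswith("TURN"):
        sign = 1.0 if verb.endswith("LEFT") else -1.0
        return ("rotate", sign * arg)           # degrees, positive = left
    return ("stop", 0.0)

# Example: to_robot_command("turn left 30") -> ("rotate", 30.0)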

Results

Leaderboard results

BibTeX


@article{qiao2025navbench,
  title={NavBench: Probing Multimodal Large Language Models for Embodied Navigation},
  author={Qiao, Yanyuan and Hong, Haodong and Lyu, Wenqi and An, Dong and Zhang, Siqi and Xie, Yutong and Wang, Xinyu and Wu, Qi},
  journal={arXiv preprint arXiv:2506.01031},
  year={2025}
}

© 2025 NavBench Project. All rights reserved.