Scalable visual target navigation with foundation models

Autonomous robots are becoming increasingly capable of operating in indoor environments, but reliably finding a specific target in an unfamiliar space remains difficult. A robot that is asked to find an object, such as a laptop, a document, or a piece of equipment, must interpret what it sees, decide where to explore, and adapt when the environment is only partially known. In his thesis, Bangguo Yu studies visual target navigation, a problem at the intersection of robot perception, mapping, reasoning, and decision-making.
Yu develops a modular navigation framework that progresses from single-robot search to multi-robot cooperation. He first shows how reinforcement learning can improve exploration by combining semantic maps with frontier-based search. He then demonstrates that large language models can provide useful commonsense knowledge for object search without costly task-specific training. Next, he extends navigation from simple object categories to richer natural-language descriptions, enabling robots to search for targets described by attributes or spatial relations. He also introduces a cooperative multi-robot setting in which several robots share information and divide the exploration effort more effectively.
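To make the idea concrete, the sketch below shows one way frontier-based exploration can be biased by semantic and commonsense cues. It is a minimal illustration rather than Yu's implementation: the function names, the distance weight, and the toy likelihood table are assumptions that stand in for the learned policy or language-model query described in the thesis.

```python
import math

def select_frontier(frontiers, semantic_map, robot_pos, target, room_likelihood):
    """Return the frontier cell most worth exploring for the given target.

    frontiers:       list of (x, y) cells on the explored/unexplored boundary
    semantic_map:    dict (x, y) -> room or object label observed near that cell
    robot_pos:       current (x, y) position of the robot
    target:          object category to find, e.g. "laptop"
    room_likelihood: callable (target, label) -> score in [0, 1]; in the thesis
                     setting this role would be played by a learned policy or a
                     language model asked how likely the target is near that label
    """
    def score(cell):
        label = semantic_map.get(cell, "unknown")
        commonsense = room_likelihood(target, label)
        dist = math.dist(robot_pos, cell)
        # Prefer semantically promising frontiers; lightly penalise distance
        # so that nearby frontiers win ties (weight chosen arbitrarily here).
        return commonsense - 0.05 * dist

    return max(frontiers, key=score) if frontiers else None


if __name__ == "__main__":
    # Toy usage: a hand-written likelihood table stands in for an LLM query.
    table = {("laptop", "office"): 0.9, ("laptop", "kitchen"): 0.2}
    likelihood = lambda tgt, lbl: table.get((tgt, lbl), 0.1)
    frontiers = [(2, 3), (10, 1)]
    semantic_map = {(2, 3): "kitchen", (10, 1): "office"}
    print(select_frontier(frontiers, semantic_map, (0, 0), "laptop", likelihood))
```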
Finally, Yu addresses privacy-aware navigation, allowing robots to choose routes that reduce unnecessary exposure in sensitive or crowded environments.

Together, the results show that combining mapping, language-based reasoning, vision-language models, and robot cooperation can make autonomous navigation more efficient, more flexible, and better aligned with real-world requirements.