"Where am I?": Scene Retrieval
with Language

1ETH Zürich, 2Stanford University, 3Microsoft Mixed Reality Lab
Teaser Image

Given an open-set natural language query (left, red) and a reference map of environments represented by a set of 3D scene graphs (right, yellow), we establish text-to-scene-graph correspondences. The text and scene graph correspondences are matched according to their embeddings in a joint embedding space (blue). These embeddings are jointly learned by a joint embedding model (green). Additionally, the text-query content in brackets represent potential downstream applications for our system, they are not part of the scene description.

Abstract

Natural language interfaces for embodied AI are becoming more ubiquitous in our daily lives. This opens up further opportunities for language-based interaction, such as a user verbally instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval" and it is closely related to "coarse-localization," but we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. We present Text2SceneGraphMatcher, a "scene-retrieval" pipeline that learns joint embeddings between text descriptions and scene graphs to determine if they are a match.

Given open-set natural language text descriptions of scenes, our method showws the potential for using language and scene graphs for performing scene retrieval. We evaluate our method on scenes from the 3DSSG dataset, and also a set of human annotated scenes. We show generalizability to open-set human annotated language queries. Our joint embedding model also works in a “retrieval-based” fashion, meaning we can embed our scene graphs in a first step and then retrieve the most similar embedding text and scene graph pair.