

Advancement in “Spatial-AI” Enables Robots to Perceive Physical Environments Like a Human


Engineers at MIT are working toward giving robots the ability to follow high-level commands, such as going to another room to retrieve an item for a person. For this to be possible, robots will need to perceive their physical environments much as humans do.

Luca Carlone is an assistant professor of aeronautics and astronautics at MIT. 

“In order to make any decision in the world, you need to have a mental model of the environment around you,” Carlone says. “This is something so effortless for humans. But for robots it’s a painfully hard problem, where it’s about transforming pixel values that they see through a camera, into an understanding of the world.”

To take on this challenge, the researchers modeled a representation of spatial perception for robots based on how humans perceive and navigate their physical environments.

3D Dynamic Scene Graphs

The new model is called 3D Dynamic Scene Graphs, and it enables a robot to generate a 3D map of its physical surroundings, including objects and their semantic labels. The robot can also map out people, rooms, walls, and other structures in the environment.

The model also lets the robot extract information from the 3D map that it can use to locate objects and rooms and to track the movement of people.

“This compressed representation of the environment is useful because it allows our robot to quickly make decisions and plan its path,” Carlone says. “This is not too far from what we do as humans. If you need to plan a path from your home to MIT, you don't plan every single position you need to take. You just think at the level of streets and landmarks, which helps you plan your route faster.”
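The "streets and landmarks" idea can be sketched in code: planning over a coarse graph of rooms rather than over every point on a dense map means far fewer nodes to search. The sketch below is purely illustrative; the room names and the `plan_route` function are hypothetical and not part of any real codebase.

```python
from collections import deque

def plan_route(adjacency, start, goal):
    """Breadth-first search over a coarse room graph.

    Planning at this abstraction level (rooms, not raw map points)
    mirrors the "streets and landmarks" analogy from the article.
    All names here are illustrative.
    """
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

# A hypothetical floor plan: rooms connected by doorways.
floor = {
    "office": ["corridor"],
    "corridor": ["office", "kitchen", "lobby"],
    "kitchen": ["corridor"],
    "lobby": ["corridor"],
}
print(plan_route(floor, "office", "kitchen"))
# ['office', 'corridor', 'kitchen']
```

A search over four rooms finishes almost instantly, whereas the same query over the millions of faces in a dense mesh would not.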

According to Carlone, robots that rely on this model could do much more than domestic tasks. They could also take on high-level work alongside people in factories, or help locate survivors at a disaster site.

3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans

Current Methods vs New Model

Current methods for robotic vision and navigation mainly focus either on 3D mapping, which allows robots to reconstruct their environment in three dimensions in real-time, or on semantic segmentation, in which robots classify features in the environment as semantic objects, such as a car versus a bicycle. Semantic segmentation is typically done on 2D images.

The newly developed model of spatial perception is the first of its kind to generate a 3D map of the environment in real-time and label objects, people, and structures within the 3D map at the same time. 

To build the new model, the researchers relied on Kimera, an open-source library the same team previously developed to construct a 3D geometric model of an environment while simultaneously encoding what each object likely is, such as a chair versus a desk.

“Like the mythical creature that is a mix of different animals, we wanted Kimera to be a mix of mapping and semantic understanding in 3D,” Carlone says.

Kimera used images from a robot's camera and inertial measurements from onboard sensors to reconstruct the scene as a 3D mesh in real-time. To label that mesh, Kimera used a neural network trained on millions of real-world images to predict the label of each pixel, then used ray-casting to project those labels into 3D.

Through the use of this technique, the robot’s environment can be mapped out in a three-dimensional mesh where each face is color-coded, identifying it as a part of objects, structures, or people in the environment. 
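The lifting of per-pixel labels into 3D can be illustrated with a simplified pinhole-camera back-projection, a stand-in for the ray-casting step described above. This is a toy sketch, not Kimera's actual implementation; the function name, parameters, and class ids are all assumptions for illustration.

```python
import numpy as np

def backproject_labels(depth, labels, fx, fy, cx, cy):
    """Lift per-pixel semantic labels into a labeled 3D point cloud.

    'depth' is an HxW array of depths along the camera axis, and
    'labels' an HxW array of class ids from a segmentation network.
    Pinhole back-projection here stands in for the ray-casting step;
    all names are illustrative, not from the Kimera codebase.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # camera-frame x
    y = (v - cy) * z / fy          # camera-frame y
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points, labels.reshape(-1)

# Toy 2x2 depth image: every pixel 1 m away, two semantic classes.
depth = np.ones((2, 2))
labels = np.array([[0, 0], [1, 1]])   # e.g. 0 = "wall", 1 = "floor"
pts, lab = backproject_labels(depth, labels, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pts.shape, lab.tolist())
# (4, 3) [0, 0, 1, 1]
```

Each 3D point carries the semantic class of the pixel it came from, which is what lets the mesh faces be color-coded by category.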

3D Mesh to 3D Dynamic “Scene Graphs”

Because the dense 3D semantic mesh is computationally expensive and time-consuming to work with, the researchers built algorithms on top of Kimera that distill it into 3D dynamic "scene graphs."

The 3D semantic mesh is broken down into distinct semantic layers, and the robot can then view the scene one layer at a time. The layers range from objects and people, to open spaces and structures, to rooms, corridors, and halls, up to whole buildings.

This layering lets the robot narrow its focus rather than having to analyze billions of points and faces, and it also allows the algorithms to track humans and their movement within the environment in real-time.
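The layered structure described above can be sketched as a small data structure: nodes that each live in one abstraction layer, with containment edges between layers, so a query can be restricted to a single level without touching the dense mesh underneath. This is a minimal illustrative sketch, not the Kimera or scene-graph API; the class and layer names are assumptions.

```python
# A minimal layered scene-graph sketch (illustrative, not the real API).
LAYERS = ["building", "room", "place", "object", "agent"]

class SceneGraph:
    def __init__(self):
        self.nodes = {}          # node id -> (layer, attributes)
        self.edges = []          # (parent id, child id) containment links

    def add_node(self, node_id, layer, **attrs):
        assert layer in LAYERS, f"unknown layer: {layer}"
        self.nodes[node_id] = (layer, attrs)

    def add_edge(self, parent, child):
        self.edges.append((parent, child))

    def layer(self, layer):
        """View one abstraction level at a time."""
        return [n for n, (l, _) in self.nodes.items() if l == layer]

g = SceneGraph()
g.add_node("building_1", "building")
g.add_node("kitchen", "room")
g.add_node("mug", "object", label="mug")
g.add_node("person_1", "agent", pose=(2.0, 1.5))
g.add_edge("building_1", "kitchen")
g.add_edge("kitchen", "mug")
print(g.layer("object"))
# ['mug']
```

Keeping people in their own "agent" layer is what makes it cheap to re-query and update their positions each frame while the static layers stay untouched.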

The new model was tested in a photo-realistic simulator that simulates a robot navigating an office environment with moving people. 

“We are essentially enabling robots to have mental models similar to the ones humans use,” Carlone says. “This can impact many applications, including self-driving cars, search and rescue, collaborative manufacturing, and domestic robotics.”

Carlone was joined by lead author and MIT graduate student Antoni Rosinol.

“Our approach has just been made possible thanks to recent advances in deep learning and decades of research on simultaneous localization and mapping,” Rosinol says. “With this work, we are making the leap toward a new era of robotic perception called spatial-AI, which is just in its infancy but has great potential in robotics and large-scale virtual and augmented reality.”

The research was presented at the Robotics: Science and Systems virtual conference.


Alex McFarland is a tech writer who covers the latest developments in artificial intelligence. He has worked with AI startups and publications across the globe.