

Machine Learning Model Understands Object Relationships



Researchers at Massachusetts Institute of Technology (MIT) have developed a new machine learning (ML) model that understands the underlying relationships between objects in a scene. The model represents individual relationships one at a time before combining the representations to describe the overall scene. 

Through this new approach, the ML model can generate more accurate images from text descriptions, even when the scene contains multiple objects arranged in different relationships with one another.

This new development is important given that many deep learning models are unable to understand the entangled relationships between individual objects.

The team’s model could be used in cases where industrial robots must perform multi-step manipulation tasks, such as stacking items or assembling appliances. It is also a step toward machines that can learn from and interact with their environments, just like humans.

Yilun Du is a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper. Du co-led the research with Shuang Li, a CSAIL PhD student, and Nan Liu, a graduate student at the University of Illinois at Urbana-Champaign. The team also included Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences, and senior author Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science. Both Tenenbaum and Torralba are members of CSAIL.

The New Framework

“When I look at a table, I can’t say that there is an object at XYZ location. Our minds don’t work like that. In our minds, when we understand a scene, we really understand it based on the relationships between the objects. We think that by building a system that can understand the relationships between objects, we could use that system to more effectively manipulate and change our environments,” says Du.

The new framework can generate an image of a scene based on a text description of objects and their relationships. 

The system breaks the description down into smaller pieces, one for each individual relationship. Each piece is then modeled separately, and the pieces are combined through an optimization process that generates an image of the scene.

Because the description is split into shorter pieces, the system can recombine those pieces in different ways, enabling it to adapt to scene descriptions it has never encountered.
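
The idea of scoring each relationship separately and then optimizing against the combined score can be pictured with a toy sketch. This is not the researchers' model: it assumes objects are points on a line rather than pixels in an image, and it replaces the learned per-relationship models with a hand-written score called an energy, which is low when a relationship holds. The names MARGIN, left_of_energy, left_of_grad and compose are hypothetical and chosen only for illustration.

```python
# Toy illustration only: objects are points on a line, not images, and the
# per-relation score is hand-written rather than learned.

MARGIN = 1.0  # hypothetical minimum gap required for "a is left of b"


def left_of_energy(pos, a, b):
    """Zero once object a sits at least MARGIN to the left of object b."""
    violation = max(0.0, MARGIN - (pos[b] - pos[a]))
    return violation ** 2


def left_of_grad(pos, a, b):
    """Gradient of left_of_energy with respect to every object's position."""
    violation = max(0.0, MARGIN - (pos[b] - pos[a]))
    grad = {name: 0.0 for name in pos}
    grad[a] = 2.0 * violation
    grad[b] = -2.0 * violation
    return grad


def total_energy(relations, pos):
    """Compose the description by summing one energy per relation."""
    return sum(left_of_energy(pos, a, b) for a, b in relations)


def compose(relations, pos, steps=200, lr=0.1):
    """Gradient descent on the summed energy of all relations."""
    pos = dict(pos)
    for _ in range(steps):
        total = {name: 0.0 for name in pos}
        for a, b in relations:
            grad = left_of_grad(pos, a, b)
            for name in pos:
                total[name] += grad[name]
        for name in pos:
            pos[name] -= lr * total[name]
    return pos


# "A table to the left of a stool. A stool to the left of a couch."
relations = [("table", "stool"), ("stool", "couch")]
start = {"table": 0.0, "stool": 0.0, "couch": 0.0}
layout = compose(relations, start)
print(layout, total_energy(relations, layout))
```

Because each relationship contributes its own term, a longer description only adds more terms to the sum. Nothing has to be retrained for a new combination, which is the property the article attributes to the compositional approach.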

“Other systems would take all the relations holistically and generate the image one-shot from the description. However, such approaches fail when we have out-of-distribution descriptions, such as descriptions with more relations, since these models can’t really adapt one shot to generate images containing more relationships. However, as we are composing these separate, smaller models together, we can model a larger number of relationships and adapt to novel combinations,” Du says.

The system can also carry out this process in reverse. If it is fed an image, it can find text descriptions that match the relationships between objects in the scene. 
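
One way to picture the reverse direction, again as a toy sketch rather than the researchers' method: assume the same kind of per-relationship score is available, evaluate it for every candidate statement about a given scene, and keep the statements that score well. The scene here is a set of positions on a line, not an image, and MARGIN, left_of_energy and describe are hypothetical names.

```python
from itertools import permutations

MARGIN = 1.0  # hypothetical minimum gap required for "a is left of b"


def left_of_energy(pos, a, b):
    """Zero once object a sits at least MARGIN to the left of object b."""
    return max(0.0, MARGIN - (pos[b] - pos[a])) ** 2


def describe(pos, threshold=1e-6):
    """Return every 'a left of b' statement the scene satisfies."""
    return [
        f"{a} left of {b}"
        for a, b in permutations(sorted(pos), 2)
        if left_of_energy(pos, a, b) < threshold
    ]


# A toy "scene": object positions along a single axis.
scene = {"table": -1.5, "stool": 0.0, "couch": 2.0}
print(describe(scene))
```

The same scoring function serves both directions in this sketch: generation searches for a scene that minimizes the score for a fixed description, while description searches over candidate statements for a fixed scene.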

Evaluating the Model

The researchers asked humans to evaluate whether the generated images matched the original scene description. When descriptions contained three relationships, which was the most complex type, 91 percent of participants said the new model performed better than other deep learning methods.

“One interesting thing we found is that for our model, we can increase our sentence from having one relation description to having two, or three, or even four descriptions, and our approach continues to be able to generate images that are correctly described by those descriptions, while other methods fail,” Du says.

The model also demonstrated an impressive ability to work with descriptions it hadn’t encountered previously.

“This is very promising because that is closer to how humans work. Humans may only see several examples, but we can extract useful information from just those few examples and combine them together to create infinite combinations. And our model has such a property that allows it to learn from fewer data but generalize to more complex scenes or image generations,” Li says.

The team will now look to test the model on real-world images that are more complex and explore how to eventually incorporate the model into robotics systems. 


Alex McFarland is a Brazil-based writer who covers the latest developments in artificial intelligence & blockchain. He has worked with top AI companies and publications across the globe.