Researchers at Carnegie Mellon University have developed a computer model that can translate text describing physical movements into simple computer-generated animations. The work could eventually make it possible to create movies and other animations directly from a model that reads the script.
Scientists have made progress in getting computers both to understand natural language and to generate physical poses from a script. This new computer model could be the link between the two.
Louis-Philippe Morency, an associate professor in the Language Technologies Institute (LTI), and Chaitanya Ahuja, an LTI Ph.D. student, have been using a neural architecture called Joint Language-to-Pose (JL2P). The JL2P model jointly embeds sentences and physical motions, allowing it to learn how language relates to actions, gestures, and movement.
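The core idea of a joint embedding can be illustrated with a toy sketch. Everything below is invented for illustration: in the actual JL2P model the encoders are neural networks trained on motion-capture data, not the simple averaging used here. The sketch only shows the goal of such training, namely that a sentence and its matching pose sequence land close together in one shared vector space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy "encoders": stand-ins for the neural networks that would map a
# sentence and a pose sequence into the same embedding space.
def encode_sentence(word_vectors):
    """Average word vectors -- a crude stand-in for a sentence encoder."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def encode_pose(joint_positions):
    """Average joint positions over time -- a crude stand-in for a pose encoder."""
    dim = len(joint_positions[0])
    return [sum(p[i] for p in joint_positions) / len(joint_positions) for i in range(dim)]

# Training a joint embedding would push the similarity of matched
# sentence/pose pairs toward 1 and mismatched pairs toward 0.
sentence_emb = encode_sentence([[1.0, 0.0], [0.8, 0.2]])
pose_emb = encode_pose([[0.9, 0.1], [0.85, 0.15]])
print(round(cosine(sentence_emb, pose_emb), 3))
```

Once both modalities live in one space, generating motion from text reduces to decoding a pose sequence from the neighborhood of the sentence's embedding.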
“I think we’re in an early stage of this research, but from a modeling, artificial intelligence and theory perspective, it’s a very exciting moment,” Morency said. “Right now, we’re talking about animating virtual characters. Eventually, this link between language and gestures could be applied to robots; we might be able to simply tell a personal assistant robot what we want it to do.
“We also could eventually go the other way — using this link between language and animation so a computer could describe what is happening in a video,” he added.
Ahuja will present the JL2P model on September 19 at the International Conference on 3D Vision in Quebec City, Canada.
The JL2P model was trained with a curriculum-learning approach: it first learned short, easy sequences, such as “A person walks forward,” and then moved on to longer and harder ones, such as “A person steps forward, then turns around and steps forward again,” or “A person jumps over an obstacle while running.”
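The curriculum idea above can be sketched in a few lines. This is a minimal illustration, not the paper's actual training pipeline: it uses word count as a crude difficulty proxy and simply grows the training set from easiest to hardest, one stage at a time.

```python
# Example descriptions taken from the article; difficulty is approximated
# by description length, which is an assumption made for this sketch.
examples = [
    "A person steps forward, then turns around and steps forward again",
    "A person walks forward",
    "A person jumps over an obstacle while running",
]

def difficulty(description):
    """Word count as a crude stand-in for sequence difficulty."""
    return len(description.split())

def curriculum(examples):
    """Yield training sets from easiest to hardest, one more example per stage."""
    ordered = sorted(examples, key=difficulty)
    for end in range(1, len(ordered) + 1):
        # Each stage re-trains on everything seen so far, easy examples included.
        yield ordered[:end]

for stage, batch in enumerate(curriculum(examples), start=1):
    print(f"stage {stage}: {len(batch)} example(s), hardest = {batch[-1]!r}")
```

A real curriculum would also grow the length of the pose sequences the model must generate, not just the length of the text.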
To interpret a sequence, the model looks first at verbs and adverbs, which describe the action and its speed, and then at nouns and adjectives, which describe locations and directions. According to Ahuja, the end goal is to animate complex sequences with multiple actions happening simultaneously or in order.
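The parsing order described above can be illustrated with a toy word-grouping sketch. The tiny hand-built lexicon here is purely illustrative; the actual model learns these language-to-motion associations from data rather than consulting a word list.

```python
# Hypothetical mini-lexicon for illustration only. Verbs/adverbs describe
# the action and its speed; nouns/adjectives describe locations/directions.
LEXICON = {
    "walks": "verb", "runs": "verb", "jumps": "verb", "turns": "verb",
    "quickly": "adverb", "slowly": "adverb",
    "person": "noun", "obstacle": "noun",
    "forward": "adjective", "left": "adjective", "right": "adjective",
}

def extract_cues(sentence):
    """Group words by the role they play in describing the motion."""
    cues = {"action": [], "location": []}
    for word in sentence.lower().split():
        tag = LEXICON.get(word)
        if tag in ("verb", "adverb"):        # what happens, and how fast
            cues["action"].append(word)
        elif tag in ("noun", "adjective"):   # where, and in which direction
            cues["location"].append(word)
    return cues

print(extract_cues("A person walks forward quickly"))
```

In the real model this distinction emerges inside the learned embedding rather than from explicit tagging, but the division of labor is the same: action words drive *what* the figure does, and location words drive *where* it goes.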
For now, the animations are limited to stick figures, but the scientists plan to keep developing the model. One complication, according to Morency, is that many things happen at the same time, even in simple sequences.
“Synchrony between body parts is very important,” Morency said. “Every time you move your legs, you also move your arms, your torso and possibly your head. The body animations need to coordinate these different components, while at the same time achieving complex actions. Bringing language narrative within this complex animation environment is both challenging and exciting. This is a path toward better understanding of speech and gestures.”
If the Joint Language-to-Pose model develops to the point where it can create complex animations and actions from language alone, the possibilities are wide-ranging. Beyond film and animation, it could also advance the understanding of speech and gestures.
Turning to artificial intelligence, the JL2P model could also be applied to robots: a robot could be given instructions in natural language, understand them, and respond accordingly.
These developments could affect many different fields as the model becomes more capable of understanding complex language.