A collaboration between Microsoft Research Asia and Duke University has produced a machine learning system capable of generating video solely from a text prompt, without the use of Generative Adversarial Networks (GANs).
The project is titled GODIVA (Generating Open-DomaIn Videos from nAtural Descriptions), and builds on some of the approaches used by OpenAI's DALL-E image synthesis system, revealed earlier this year.
GODIVA uses the Vector Quantised-Variational AutoEncoder (VQ-VAE) model, first introduced by researchers from Google's DeepMind in 2017, which is also an essential component in DALL-E's transformational capabilities.
VQ-VAE has been used in a number of projects to generate predicted video, where the user supplies an initial number of frames and asks the system to generate additional frames.
However, the authors of the new paper claim that GODIVA is the first pure text-to-video (T2V) implementation to use VQ-VAE rather than GANs, with which previous projects have obtained more erratic results.
Seed Points In Text-To-Video
Though the submission is short on details as to the criteria by which origination frames are created, GODIVA appears to summon seed imagery from nowhere before going on to extrapolate it into low-resolution video frames.
In fact, the origination comes from labels in the data used: GODIVA was pre-trained on the HowTo100M dataset, comprising 136 million captioned video clips sourced from YouTube over 15 years, and featuring 23,000 labeled activities. Nonetheless, each possible activity is present in very high numbers of clips, increasing with generalization (e.g. ‘Pets and animals’ has 3.5 million clips, whereas ‘dogs’ has 762,000 clips), and so there is still a great choice of possible starting points.
The model was evaluated on Microsoft's MSR Video to Text (MSR-VTT) dataset. As further tests of the architecture, GODIVA was trained from scratch on the Moving MNIST dataset and the Double Moving MNIST dataset, both derived from the original MNIST database, a collaboration between Microsoft, Google and the Courant Institute of Mathematical Sciences at NYU.
Frame Evaluation In Continuous Video Synthesis
In line with Peking University's IRC-GAN, GODIVA adds four additional columnar checks to the original MNIST method, which evaluated prior and following frames by moving up>down and then left>right. IRC-GAN and GODIVA also consider frames by moving attention left>right, right>left, up>down and down>up.
Evaluating Video Quality And Fidelity To Prompt
To understand how well the image generation succeeded, the researchers utilized two metrics: one based on CLIP similarity, and a novel Relative Matching (RM) metric.
OpenAI's CLIP framework is capable of zero-shot matching of images to text, as well as facilitating image synthesis by reversing this model. The researchers divided the CLIP-derived score by the calculated similarity between the text prompt and the ground truth video in order to arrive at an RM score. In a separate scoring round, the output was evaluated by 200 people and the results compared to the programmatic scores.
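The RM calculation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes that the CLIP-derived score for a video is the mean cosine similarity between the prompt embedding and each frame embedding, and the function names and toy vectors are hypothetical stand-ins for real CLIP encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_similarity(text_embedding, frame_embeddings):
    """Mean text-to-frame similarity for one video.

    Assumption: per-frame scores are averaged; the article does not
    say how frames are aggregated.
    """
    sims = [cosine_similarity(text_embedding, f) for f in frame_embeddings]
    return sum(sims) / len(sims)


def relative_matching(text_embedding, generated_frames, ground_truth_frames):
    """Score of the generated video divided by the score of the real video."""
    generated_score = clip_similarity(text_embedding, generated_frames)
    reference_score = clip_similarity(text_embedding, ground_truth_frames)
    return generated_score / reference_score


# Toy 2-D embeddings (hypothetical; real CLIP embeddings are much larger).
prompt = [1.0, 0.0]
real_video = [[1.0, 0.0], [1.0, 0.0]]
fake_video = [[1.0, 1.0], [1.0, 0.0]]

rm_score = relative_matching(prompt, fake_video, real_video)
print(f"RM = {rm_score:.3f}")
```

Under this definition, a generated video that matches the prompt exactly as well as the ground truth clip scores 1.0 (or 100%), which is why the ratio is easier to read than a raw similarity value.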
TFGAN can produce 128×128-pixel output, compared with the 64×64 output that constrains GODIVA and T2V in the above examples, but the researchers note that GODIVA not only produces bolder and more committed movement, but also generates scene changes without any specific prompting, and does not shy away from generating close-up shots.
In later runs, GODIVA also generates 128x128px output, with changes in POV.
In the project's own RM metric, GODIVA is able to achieve scores approaching 100% in terms of authenticity (quality of video) and fidelity (how closely the generated content matches the input prompt).
The researchers concede, however, that video-based CLIP metrics would be a welcome addition to this area of image synthesis, since they would provide a level playing field for evaluating the quality of results without resorting to the over-fitting and lack of generalization that have been increasingly criticized in ‘standard’ computer vision challenges over the last ten years.
They also observe that generating longer videos will be a logistical consideration in further development of the system, since just 10 frames of 64x64px output requires 2560 visual tokens, a pipeline bloat that is likely to get expensive and unmanageable rather quickly.
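The arithmetic behind that figure can be reconstructed as a rough sketch. 2560 tokens over 10 frames is 256 tokens per frame, which is consistent with a 16×16 token grid, i.e. each 64×64 frame being downsampled by a factor of 4 per side. That factor is inferred from the numbers quoted here, not stated in the article, and the function names are hypothetical. The pairwise count assumes every token attends to every other token, which is an upper bound rather than a description of GODIVA's actual attention scheme.

```python
def visual_token_count(frames, height, width, downsample=4):
    """Number of discrete visual tokens for a clip.

    Assumption: the encoder shrinks each side by `downsample`
    (4 is inferred from 2560 tokens / 10 frames = 256 = 16 * 16).
    """
    if height % downsample or width % downsample:
        raise ValueError("frame size must be divisible by the downsample factor")
    tokens_per_frame = (height // downsample) * (width // downsample)
    return frames * tokens_per_frame


def full_attention_pairs(tokens):
    """Token pairs scored if every token attends to every token (upper bound)."""
    return tokens * tokens


small = visual_token_count(frames=10, height=64, width=64)
large = visual_token_count(frames=10, height=128, width=128)

print(small)  # 2560
print(large)  # 10240
print(full_attention_pairs(large) // full_attention_pairs(small))  # 16
```

Doubling the resolution quadruples the token count, and under dense attention the pairwise cost grows sixteen-fold, which illustrates why longer or larger videos become expensive so quickly.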