Microsoft Proposes GODIVA, a Text-To-Video Machine Learning Framework

A collaboration between Microsoft Research Asia and Duke University has produced a machine learning system capable of generating video solely from a text prompt, without the use of Generative Adversarial Networks (GANs).

The project is titled GODIVA (Generating Open-DomaIn Videos from nAtural Descriptions), and builds on some of the approaches used by OpenAI's DALL-E image synthesis system, revealed earlier this year.

Early results from GODIVA, with frames from videos created from two prompts. The top two examples were generated from the prompt 'Play golf on grass', and the bottom third from the prompt 'A baseball game is played'. Source: https://arxiv.org/pdf/2104.14806.pdf

GODIVA uses the Vector Quantised-Variational AutoEncoder (VQ-VAE) model, first introduced by researchers at Google's DeepMind in 2017, which is also an essential component in DALL-E's transformational capabilities.
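
At the heart of VQ-VAE is a quantisation step that snaps each continuous latent vector produced by the encoder onto its nearest entry in a learned codebook, so that an image (or frame) becomes a grid of discrete tokens. The sketch below illustrates that lookup in isolation; the array sizes and names are illustrative assumptions rather than values taken from the GODIVA paper.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each continuous encoder output to its nearest codebook entry.

    z_e:      (num_positions, dim) continuous latents from the encoder
    codebook: (num_codes, dim) learned embedding vectors
    Returns discrete code indices and the quantised latents passed to the decoder.
    """
    # Squared Euclidean distance from every latent vector to every codebook vector
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # nearest-neighbour lookup
    z_q = codebook[indices]          # quantised latents
    return indices, z_q

# Illustrative sizes only: a 16x16 latent grid and a 512-entry codebook
rng = np.random.default_rng(0)
indices, z_q = vector_quantize(rng.normal(size=(256, 64)), rng.normal(size=(512, 64)))
print(indices.shape, z_q.shape)      # (256,) (256, 64)
```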

Architecture of the VQ-VAE model, with the embedding space to the right and the encoder/decoder sharing dimensional space in order to lower losses during reconstruction. Source: https://arxiv.org/pdf/1711.00937.pdf

VQ-VAE has been used in a number of projects to generate predicted video, where the user supplies an initial number of frames and requests the system to generate additional frames:

Earlier work: VQ-VAE infers frames from very limited supplied source material. Source: Supplementary materials at https://openreview.net/forum?id=bBDlTR5eDIX

However, the authors of the new paper claim that GODIVA is the first pure text-to-video (T2V) implementation to use VQ-VAE rather than GANs, which have produced more erratic results in previous projects.

Seed Points In Text-To-Video

Though the submission is short on detail about how the originating frames are created, GODIVA appears to summon seed imagery from nowhere before extrapolating it into low-resolution video frames.

A columnar representation of the three-dimensional sparse attention system that powers GODIVA for text-to-video tasks. The auto-regressive prediction is conditioned on four factors: the input text, the same relative position in the previous frame (similar to NVIDIA's SPADE and other methods that build on or move beyond Optical Flow approaches), the same rows of the same frame, and the same columns of the same frame.
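
The pattern in that caption can be pictured as a sparsity mask over the flattened visual tokens. The sketch below builds such a mask for a toy grid, letting each position attend to the text tokens, the same position in the previous frame, its own row, and its own column; it is an illustration of the general idea under those assumptions, not code from the paper.

```python
import numpy as np

def sparse_attention_mask(num_frames, height, width, num_text_tokens):
    """Boolean mask: entry [q, k] is True if visual query position q may attend to key k.

    Keys are laid out as text tokens first, then visual tokens in
    (frame, row, column) raster order. Sizes are illustrative only.
    """
    num_visual = num_frames * height * width
    mask = np.zeros((num_visual, num_text_tokens + num_visual), dtype=bool)

    def key_index(f, r, c):
        return num_text_tokens + (f * height + r) * width + c

    for f in range(num_frames):
        for r in range(height):
            for c in range(width):
                q = (f * height + r) * width + c
                mask[q, :num_text_tokens] = True              # condition on the input text
                if f > 0:
                    mask[q, key_index(f - 1, r, c)] = True    # same position, previous frame
                for cc in range(width):
                    mask[q, key_index(f, r, cc)] = True       # same row of the same frame
                for rr in range(height):
                    mask[q, key_index(f, rr, c)] = True       # same column of the same frame
    return mask

mask = sparse_attention_mask(num_frames=3, height=4, width=4, num_text_tokens=8)
print(mask.shape, int(mask.sum()))   # far fewer allowed pairs than full attention
```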

In fact, the origination comes from labels in the data used: GODIVA was pre-trained on the HowTo100M dataset, which comprises 136 million captioned video clips sourced from YouTube (around 15 years of footage in total), covering 23,000 labeled activities. Nonetheless, each possible activity is represented by a very large number of clips, with counts rising as the categories become more general ('Pets and animals' has 3.5 million clips, whereas 'dogs' has 762,000), so there is still a great choice of possible starting points.

The model was evaluated on Microsoft's MSR Video to Text (MSR-VTT) dataset. As further tests of the architecture, GODIVA was trained from scratch on the Moving MNIST and Double Moving MNIST datasets, both derived from the original MNIST database, a collaboration between Microsoft, Google and the Courant Institute of Mathematical Sciences at NYU.

Frame Evaluation In Continuous Video Synthesis

In line with Peking University's IRC-GAN, GODIVA adds four additional columnar checks to the original Moving MNIST method, which evaluated prior and following frames by moving up>down and then left>right. IRC-GAN and GODIVA also evaluate frames by moving attention left>right, right>left, up>down and down>up.
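
For a sense of what those four directions mean in practice, the snippet below enumerates the four traversal orders over a small grid of frame positions; this is purely a schematic illustration of direction-aware evaluation, not code from either project.

```python
def scan_orders(height, width):
    """Four traversal orders over a (height x width) grid of frame positions,
    corresponding to left>right, right>left, up>down and down>up. Illustrative only."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    return {
        "left_to_right": sorted(cells, key=lambda rc: (rc[0], rc[1])),
        "right_to_left": sorted(cells, key=lambda rc: (rc[0], -rc[1])),
        "up_to_down":    sorted(cells, key=lambda rc: (rc[1], rc[0])),
        "down_to_up":    sorted(cells, key=lambda rc: (rc[1], -rc[0])),
    }

for name, order in scan_orders(2, 3).items():
    print(name, order)
```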

Additional generated frames from GODIVA.

Evaluating Video Quality And Fidelity To Prompt

To understand how well the image generation succeeded, the researchers utilized two metrics: one based on CLIP similarity, and a novel Relative Matching (RM) metric.

OpenAI's CLIP framework is capable of zero-shot matching of images to text, as well as facilitating image synthesis by reversing this model. The researchers divided the CLIP-derived score by the calculated similarity between the text prompt and the ground truth video in order to arrive at an RM score. In a separate scoring round, the output was evaluated by 200 people and the results compared to the programmatic scores.
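
As a rough sketch of how such a relative score could be computed with OpenAI's open-source CLIP package, assuming per-frame similarities are simply averaged (the paper's exact formulation may differ):

```python
# A minimal sketch, assuming the open-source `clip` package (github.com/openai/CLIP)
# and a list of PIL frames; the GODIVA authors' exact averaging may differ.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_video_similarity(frames, prompt):
    """Average cosine similarity between a text prompt and a list of PIL frames."""
    text = clip.tokenize([prompt]).to(device)
    images = torch.stack([preprocess(frame) for frame in frames]).to(device)
    with torch.no_grad():
        image_feats = model.encode_image(images)
        text_feats = model.encode_text(text)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feats.T).mean().item()

def relative_matching(generated_frames, ground_truth_frames, prompt):
    """Similarity of the generated video to the prompt, relative to the ground truth's."""
    return clip_video_similarity(generated_frames, prompt) / \
           clip_video_similarity(ground_truth_frames, prompt)
```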

Finally, GODIVA was tested against two prior frameworks: TFGAN and T2V, a 2017 collaboration between Duke University and NEC.

Comparison of output from T2V, TFGAN and GODIVA.

TFGAN can produce 128×128 pixel output, in comparison to the 64×64 output that constrains GODIVA and T2V in the above examples, but the researchers note that GODIVA not only produces bolder and more committed movement, but will also generate scene changes without any specific prompting, and does not shy away from close-up shots.

In later runs, GODIVA also generates 128x128px output, with changes in POV:

GODIVA's 128x128px output for the baseball prompt.

In the project's own RM metric, GODIVA is able to achieve scores approaching 100% in terms of authenticity (quality of video) and fidelity (how closely the generated content matches the input prompt).

The researchers concede, however, that the development of video-based CLIP metrics would be a welcome addition to this area of image synthesis, since it would provide a level playing field for evaluating the quality of results, without the over-fitting and lack of generalization that have drawn increasing criticism of 'standard' computer vision challenges over the last ten years.

They also observe that generating longer videos will be a logistical consideration in further development of the system, since just ten frames of 64x64px output require 2,560 visual tokens, a pipeline bloat that is likely to become expensive and unmanageable rather quickly.
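
That figure is easy to reproduce if, as is typical for VQ-VAE front-ends, each 64×64 frame is assumed to compress to a 16×16 grid of discrete codes (the downsampling factor here is an assumption; only the 2,560-token total comes from the paper):

```python
def visual_token_count(num_frames, frame_size, downsample=4):
    """Discrete visual tokens for a clip, assuming each frame_size x frame_size frame
    is compressed to a (frame_size // downsample)^2 grid of codes."""
    grid = frame_size // downsample
    return num_frames * grid * grid

print(visual_token_count(10, 64))    # 2560 tokens, matching the figure quoted above
print(visual_token_count(10, 128))   # 10240 tokens: one resolution step quadruples the cost
print(visual_token_count(100, 64))   # 25600 tokens for a clip ten times longer
```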