Removing Objects From Video More Efficiently With Machine Learning
New research from China reports state-of-the-art results – as well as an impressive improvement in efficiency – for a new video inpainting system that can adroitly remove objects from footage.
The technique, called End-to-End framework for Flow-Guided video Inpainting (E2FGVI), is also capable of removing watermarks and various other kinds of occlusion from video content.
To see more examples in better resolution, check out the video embedded at the end of the article.
Though the model featured in the published paper was trained on 432×240px videos (a common low input size, constrained by available GPU memory vs. optimal batch sizes, among other factors), the authors have since released E2FGVI-HQ, which can handle videos at arbitrary resolution.
The code for the current version is available at GitHub, while the HQ version, released last Sunday, can be downloaded from Google Drive and Baidu Disk.
E2FGVI can process 432×240 video at 0.12 seconds per frame on a Titan XP GPU (12GB VRAM), and the authors report that the system operates fifteen times faster than prior state-of-the-art methods based on optical flow.
Tested on standard datasets for this sub-sector of image synthesis research, the new method was able to outperform rivals in both qualitative and quantitative evaluation rounds.
The paper is titled Towards An End-to-End Framework for Flow-Guided Video Inpainting, and is a collaboration between four researchers from Nankai University, together with a researcher from HiSilicon Technologies.
What’s Missing in This Picture
Besides its obvious applications for visual effects, high quality video inpainting is set to become a core defining feature of new AI-based image synthesis and image-altering technologies.
This is particularly the case for body-altering fashion applications, and other frameworks that seek to ‘slim down’ or otherwise alter scenes in images and video. In such cases, it’s necessary to convincingly ‘fill in’ the extra background that is exposed by the synthesis.
Coherent Optical Flow
Optical flow (OF) has become a core technology in the development of video object removal. Like an atlas, OF provides a one-shot map of a temporal sequence. Often used to measure velocity in computer vision initiatives, OF can also enable temporally consistent inpainting, where the aggregate sum of the task can be considered in a single pass, instead of Disney-style ‘per-frame’ attention, which inevitably leads to temporal discontinuity.
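The core operation that flow enables can be made concrete with a toy sketch. The NumPy snippet below (an illustration, not code from the paper) backward-warps a frame into its predecessor using a dense flow field with nearest-neighbour sampling: this is the basic mechanism that lets inpainting systems carry known pixels along motion trajectories instead of treating each frame in isolation.

```python
import numpy as np

def warp_backward(src: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp `src` into the previous frame using a dense flow field.

    `flow[y, x]` holds the (dx, dy) displacement of the pixel at (y, x)
    between the two frames. Nearest-neighbour sampling keeps the toy
    example dependency-free; real systems use sub-pixel interpolation.
    """
    h, w = src.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return src[src_y, src_x]

# A single bright pixel that moves one pixel to the right between frames:
frame_t = np.zeros((4, 4), dtype=np.uint8)
frame_t[1, 1] = 255
frame_t1 = np.zeros((4, 4), dtype=np.uint8)
frame_t1[1, 2] = 255

# Flow of (+1, 0) everywhere: warping frame t+1 back recovers frame t.
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[..., 0] = 1.0
print(np.array_equal(warp_backward(frame_t1, flow), frame_t))  # → True
```

In an inpainting setting, the same warp is applied in both temporal directions, so that a region occluded in one frame can be filled from frames where it was visible.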
Video inpainting methods to date have centered on a three-stage process: flow completion, where the video is essentially mapped out into a discrete and explorable entity; pixel propagation, where the holes in ‘corrupted’ videos are filled in by bidirectionally propagating pixels; and content hallucination (pixel ‘invention’ that’s familiar to most of us from deepfakes and text-to-image frameworks such as the DALL-E series) where the estimated ‘missing’ content is invented and inserted into the footage.
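The three-stage pipeline can be sketched on toy data as follows. This is a minimal NumPy illustration, not the actual DFVI or FGVC code: flow completion is assumed already done (the supplied flow covers the holes), propagation copies valid pixels from the previous frame along the flow, and the ‘hallucination’ stage is a trivial mean-fill standing in for a generative network.

```python
import numpy as np

def inpaint_three_stage(frames, masks, flows):
    """Toy three-stage video inpainting on grayscale frames.

    frames: list of (H, W) float arrays; masks: list of (H, W) bool
    arrays marking corrupted pixels; flows: list of (H, W, 2) arrays,
    flows[t] giving (dx, dy) motion from frame t to frame t+1.
    """
    out = [frames[0].copy()]                     # frame 0 assumed clean
    for t in range(1, len(frames)):
        frame, hole = frames[t].copy(), masks[t].copy()
        h, w = frame.shape
        for y, x in zip(*np.nonzero(hole)):
            dx, dy = flows[t - 1][y, x]          # stage 1: completed flow
            sy, sx = int(y - dy), int(x - dx)    # stage 2: propagate pixel
            if 0 <= sy < h and 0 <= sx < w and not masks[t - 1][sy, sx]:
                frame[y, x] = out[t - 1][sy, sx]
                hole[y, x] = False
        frame[hole] = frame[~hole].mean()        # stage 3: 'hallucinate'
        out.append(frame)
    return out

# Static scene (zero flow) with one corrupted pixel in frame 1:
frame0 = np.arange(16, dtype=float).reshape(4, 4)
frame1 = frame0.copy()
hole = np.zeros((4, 4), dtype=bool)
hole[1, 1] = True
frame1[1, 1] = 0.0
result = inpaint_three_stage(
    [frame0, frame1],
    [np.zeros((4, 4), dtype=bool), hole],
    [np.zeros((4, 4, 2))],
)
print(result[1][1, 1])  # → 5.0, recovered from frame 0
```

Pixels whose flow points outside the frame, or into another hole, fall through to the hallucination stage; in the prior works discussed here, that stage is handled by a separately trained image inpainting network.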
The central innovation of E2FGVI is to combine these three stages into an end-to-end system, obviating the need to carry out manual operations on the content or the process.
The paper observes that the need for manual intervention prevents these older pipelines from taking full advantage of GPU acceleration, making them quite time-consuming. From the paper*:
‘Taking DFVI as an example, completing one video with the size of 432 × 240 from DAVIS, which contains about 70 frames, needs about 4 minutes, which is unacceptable in most real-world applications. Besides, except for the above-mentioned drawbacks, only using a pretrained image inpainting network at the content hallucination stage ignores the content relationships across temporal neighbors, leading to inconsistent generated content in videos.’
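The quoted figures imply a substantial per-frame gap. The arithmetic below works only from the numbers stated above (4 minutes for roughly 70 frames, versus E2FGVI's reported 0.12 seconds per frame); the paper's headline 15× figure is its own comparison against flow-based methods, not this calculation.

```python
# Per-frame cost implied by the quoted DFVI figures, vs. E2FGVI's
# reported throughput on 432x240 video.
dfvi_per_frame = 4 * 60 / 70       # 4 minutes over ~70 frames
e2fgvi_per_frame = 0.12            # seconds per frame, from the paper
print(round(dfvi_per_frame, 2))                      # → 3.43 s/frame
print(round(dfvi_per_frame / e2fgvi_per_frame, 1))   # → 28.6x
```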
By uniting the three stages of video inpainting, E2FGVI is able to substitute the second stage, pixel propagation, with feature propagation. In the more segmented processes of prior works, features are not so extensively available, because each stage is relatively hermetic, and the workflow only semi-automated.
Additionally, the researchers have devised a temporal focal transformer for the content hallucination stage, which considers not just the direct neighbors of pixels in the current frame (i.e. what is happening in that part of the frame in the previous or next image), but also the distant neighbors that are many frames away, and yet will influence the cohesive effect of any operations performed on the video as a whole.
The new feature-based central section of the workflow is able to take advantage of more feature-level processes and learnable sampling offsets, while the project’s novel focal transformer, according to the authors, extends the size of focal windows ‘from 2D to 3D’.
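The intuition behind mixing local and distant temporal neighbors can be shown with a simple frame-selection sketch. The radius and stride values below are illustrative placeholders, not E2FGVI's actual hyperparameters, and the function is a hypothetical helper rather than part of the released code.

```python
def temporal_neighbors(t: int, n_frames: int, radius: int = 2, stride: int = 10):
    """Pick reference frames for frame `t`: every local neighbour within
    `radius`, plus distant frames sampled every `stride` frames across
    the whole clip, so far-away content can still inform the fill."""
    local = [i for i in range(t - radius, t + radius + 1) if 0 <= i < n_frames]
    distant = [i for i in range(0, n_frames, stride) if i not in local]
    return sorted(set(local + distant))

# For frame 25 of a 70-frame clip: immediate neighbours 23-27, plus
# distant reference frames spread across the sequence.
print(temporal_neighbors(25, 70))
# → [0, 10, 20, 23, 24, 25, 26, 27, 30, 40, 50, 60]
```

The transformer then attends jointly over features from all of these frames, which is how an edit stays coherent with content that only appears many frames away.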
Tests and Data
To test E2FGVI, the researchers evaluated the system against two popular video object segmentation datasets: YouTube-VOS, and DAVIS. YouTube-VOS features 3741 training video clips, 474 validation clips, and 508 test clips, while DAVIS features 60 training video clips, and 90 test clips.
E2FGVI was trained on YouTube-VOS and evaluated on both datasets. During training, object masks (the green areas in the images above, and the embedded video below) were generated to simulate video completion.
For metrics, the researchers adopted Peak signal-to-noise ratio (PSNR), Structural similarity (SSIM), Video-based Fréchet Inception Distance (VFID), and Flow Warping Error – the latter to measure temporal stability in the affected video.
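PSNR, the first of these metrics, is straightforward to compute by hand. The sketch below is a standard NumPy implementation of the textbook formula (10·log10(MAX²/MSE)), included for orientation; it is not the evaluation code used in the paper, and SSIM and VFID require considerably more machinery.

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference frame and a
    reconstructed one; higher means a closer match."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((8, 8), 100, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 110            # a single pixel off by 10
print(round(psnr(ref, noisy), 2))  # → 46.19
```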
The prior architectures against which the system was tested were VINet, DFVI, LGTSM, CAP, FGVC, STTN, and FuseFormer.
In addition to achieving the best scores against all competing systems, the researchers conducted a qualitative user study, in which videos transformed by five representative methods were shown individually to twenty volunteers, who were asked to rate them in terms of visual quality.
The authors note that despite the general preference for their method, one of the baselines, FGVC, received user ratings that do not reflect its quantitative scores, and they suggest this indicates that E2FGVI may, if only speciously, be generating ‘more visually pleasant results’.
In terms of efficiency, the authors note that their system greatly reduces floating-point operations (FLOPs) and inference time on a single Titan GPU on the DAVIS dataset, and observe that the results show E2FGVI running 15× faster than flow-based methods.
‘[E2FGVI] holds the lowest FLOPs in contrast to all other methods. This indicates that the proposed method is highly efficient for video inpainting.’
*My conversion of authors’ inline citations to hyperlinks.
First published 19th May 2022.