New research out of China offers an effective and novel method for restoring detail and resolution to user-uploaded video that is automatically compressed on platforms such as WeChat and YouTube in order to save bandwidth and storage space.
In contrast to prior methods, which upscale videos based on generic training data, the new approach derives a degradation feature map (DFM) for each frame of the compressed video: effectively an overview of the regions in the frame that compression has damaged or deteriorated the most.
The restorative process, which leverages convolutional neural networks (CNNs), among other technologies, is guided and focused by the information in the DFM, allowing the new method to surpass the performance and accuracy of prior approaches.
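As a rough intuition for what such a map represents, the Python sketch below builds a toy per-pixel error map from an original frame and its compressed copy, then uses that map to decide how strongly each pixel is corrected. This is an illustrative assumption and not the VOTES method: according to the article, the real map comes from a learned module, whereas this toy compares against the original frame purely to make the idea visible. The names `degradation_map` and `guided_blend` are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel values


def degradation_map(original: Frame, compressed: Frame) -> Frame:
    """Toy stand-in for a DFM: absolute per-pixel error, scaled to [0, 1]."""
    diff = [[abs(o - c) for o, c in zip(row_o, row_c)]
            for row_o, row_c in zip(original, compressed)]
    peak = max(max(row) for row in diff)
    if peak == 0:
        # Identical frames: nothing is degraded anywhere.
        return [[0.0 for _ in row] for row in diff]
    return [[d / peak for d in row] for row in diff]


def guided_blend(compressed: Frame, candidate: Frame, dmap: Frame) -> Frame:
    """Move each pixel toward a candidate restoration in proportion to the map."""
    return [[c + w * (r - c) for c, r, w in zip(row_c, row_r, row_w)]
            for row_c, row_r, row_w in zip(compressed, candidate, dmap)]


# Hypothetical 2x2 frames: the right-hand column was damaged by compression.
original = [[10.0, 10.0], [10.0, 10.0]]
compressed = [[10.0, 8.0], [10.0, 6.0]]

dmap = degradation_map(original, compressed)
# The original doubles as the candidate restoration here, for illustration only.
restored = guided_blend(compressed, original, dmap)
```

In this toy, undamaged pixels are left untouched, and the most damaged pixel receives the full correction, which is the sense in which a map can 'guide and focus' a restoration step.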
The ground truth for the process was obtained by uploading high-quality video to four popular sharing platforms and downloading the compressed results. From these pairs, the researchers developed a computer vision pipeline capable of abstractly learning compression artifacts and detail loss, so that it can be applied across a number of platforms to restore the videos to near-original quality, based on completely apposite data.
Material used in the research has been compiled into a HQ/LQ dataset titled User Videos Shared on Social Media (UVSSM), and has been made available for download (password: rsqw) at Baidu, for the benefit of subsequent research projects seeking to develop new methods to restore platform-compressed video.
The code for the system, which is known as Video restOration through adapTive dEgradation Sensing (VOTES), has also been released at GitHub, though its implementation entails a number of pull-based dependencies.
The paper is titled Restoration of User Videos Shared on Social Media, and comes from three researchers at Shenzhen University, and one from the Department of Electronic and Information Engineering at the Hong Kong Polytechnic University.
From Artifacts to Facts
The ability to restore the quality of web-scraped videos without the generic, sometimes excessive ‘hallucination’ of detail provided by programs such as Gigapixel (and most of the popular open source packages of similar scope) could have implications for the computer vision research sector.
Research into video-based CV technologies frequently relies on footage obtained from platforms such as YouTube and Twitter, where the compression methods and codecs used are closely guarded, cannot be easily gleaned based on artifact patterns or other visual indicators, and may change periodically.
Most of the projects that leverage web-found video are not researching compression, and have to make allowances for the available quality of compressed video that the platforms offer, since they have no access to the original high-quality versions that the users uploaded.
Therefore the ability to faithfully restore greater quality and resolution to such videos, without introducing downstream influence from unrelated computer vision datasets, could help obviate the frequent workarounds and accommodations that CV projects must currently make for the degraded video sources.
Though platforms such as YouTube will occasionally trumpet major changes in the way they compress users' videos (such as the adoption of the VP9 codec), none of them explicitly reveals the entire process or the exact codecs and settings used to slim down the high-quality files that users upload.
Achieving improved output quality from user uploads has therefore become something of a Druidic art in the last ten or so years, with various (mostly unconfirmed) ‘workarounds’ going in and out of fashion.
Prior approaches to deep learning-based video restoration have involved generic feature extraction, either as an approach to single-frame restoration or in a multi-frame architecture that leverages optical flow (i.e. that takes account of neighboring frames, both earlier and later, when restoring the current frame).
All of these approaches have had to contend with the ‘black box’ effect – the fact that they cannot examine compression effects in the core technologies, because it is not certain either what the core technologies are, or how they were configured for any particular user-uploaded video.
VOTES, instead, seeks to extract salient features directly from the original and compressed video, and determine patterns of transformation that will generalize to the standards of a number of platforms.
VOTES uses a specially developed degradation sensing module (DSM, see image above) to extract features in convolutional blocks. Multiple frames are then passed to a feature extraction and alignment module (FEAM), with these then being shunted to a degradation modulation module (DMM). Finally, the reconstruction module outputs the restored video.
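The order of the four stages can be sketched as a simple data flow in Python. This is a structural illustration only: the stage bodies are trivial stand-ins rather than the paper's networks, all function names are hypothetical, and the choice to sense degradation on the middle frame of the clip is an assumption made for the sketch.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]  # grayscale frame as rows of pixel values


def restore_clip(frames: Sequence[Frame],
                 sense: Callable[[Frame], Frame],
                 align: Callable[[Sequence[Frame]], Frame],
                 modulate: Callable[[Frame, Frame], Frame],
                 reconstruct: Callable[[Frame], Frame]) -> Frame:
    """Chain the stages in the order the article lists them."""
    center = frames[len(frames) // 2]    # assumption: sense the middle frame
    dfm = sense(center)                  # degradation sensing (DSM role)
    features = align(frames)             # multi-frame extraction/alignment (FEAM role)
    modulated = modulate(features, dfm)  # degradation modulation (DMM role)
    return reconstruct(modulated)        # reconstruction module role


# Trivial stand-ins so that the flow can be executed end to end.
def zero_map(frame: Frame) -> Frame:
    return [[0.0 for _ in row] for row in frame]


def temporal_mean(frames: Sequence[Frame]) -> Frame:
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]


def scale_by_map(features: Frame, dfm: Frame) -> Frame:
    return [[v * (1.0 + w) for v, w in zip(row_v, row_w)]
            for row_v, row_w in zip(features, dfm)]


def identity(frame: Frame) -> Frame:
    return frame


clip = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]  # three 1x2 frames
output = restore_clip(clip, zero_map, temporal_mean, scale_by_map, identity)
```

Passing the stages in as callables keeps the sketch honest about what it does not know: only the ordering is taken from the description above, and each stand-in could be swapped for a real model.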
Data and Experiments
In the new work, the researchers have concentrated their efforts on restoring video uploaded to and re-downloaded from the WeChat platform, but were concerned to ensure that the resulting algorithm could be adapted to other platforms.
It transpired that once they had obtained an effective restoration model for WeChat videos, adapting it to Bilibili, Twitter and YouTube took only 90 seconds for a single epoch for each platform's custom model (on a machine running four NVIDIA Tesla P40 GPUs with a total of 96GB of VRAM).
To populate the UVSSM dataset, the researchers gathered 264 videos ranging from 5 to 30 seconds in length, each with a 30fps frame rate, sourced either directly from mobile phone cameras or from the internet. The videos were all either 1920×1080 or 1280×720 resolution.
Content (see earlier image) included city views, landscapes, people, and animals, among a variety of other subjects, and is usable in the public dataset under a Creative Commons Attribution license, allowing reuse.
The authors uploaded 214 videos to WeChat using five different brands of mobile phone, obtaining WeChat's default video resolution of 960×540 (unless the source video is already smaller than these dimensions), which is among the most ‘punitive’ conversions across popular platforms.
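To put a number on that conversion, the short Python sketch below computes the per-axis scale factor and the fraction of pixels that survive when a 1920×1080 source is reduced to 960×540. The helper name `downscale_stats` is hypothetical, and the calculation covers resolution only, not the additional loss from the platform's codec settings.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height) in pixels


def downscale_stats(src: Size, dst: Size) -> Tuple[float, float, float]:
    """Return (width factor, height factor, fraction of pixels retained)."""
    (src_w, src_h), (dst_w, dst_h) = src, dst
    return src_w / dst_w, src_h / dst_h, (dst_w * dst_h) / (src_w * src_h)


# A full-HD upload reduced to WeChat's default resolution.
width_factor, height_factor, retained = downscale_stats((1920, 1080), (960, 540))
```

Each axis is halved, so only a quarter of the original pixels remain before any compression artifacts are taken into account.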
For the later comparisons against the conversion routines of other platforms, the researchers uploaded 50 videos not included in the original 214 to Bilibili, YouTube, and Twitter. The videos' original resolution was 1280×720, with the downloaded versions standing at 640×360.
This brings the UVSSM dataset to a total of 364 pairs of original (HQ) and shared (LQ) videos: 214 uploaded to WeChat, and 50 each to Bilibili, YouTube, and Twitter.
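The totals can be checked with a few lines of Python. The dictionary layout is just an illustrative way of arranging the figures quoted in the article; it does not reflect how the released dataset is organized on disk.

```python
# HQ/LQ pairs per platform, as reported in the article.
pairs_per_platform = {"WeChat": 214, "Bilibili": 50, "YouTube": 50, "Twitter": 50}

# The same 50 source videos were shared to three platforms, so the number of
# unique source videos is smaller than the number of HQ/LQ pairs.
unique_sources = pairs_per_platform["WeChat"] + 50
total_pairs = sum(pairs_per_platform.values())
```

This is why 264 gathered videos yield 364 pairs: the 50 cross-platform videos are each counted once per platform.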
For the experiments, 10 random videos were selected as the test set, four as the validation set, and the remaining 200 as the core training set. The experiments were run five times with K-fold cross-validation, and the results were averaged across the runs.
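A minimal Python sketch of a 200/4/10 partition repeated five times is given below. It is an assumption-laden illustration: the article does not say how the folds were constructed, so this version simply reshuffles the 214 WeChat videos with a different seed each time, which is repeated random splitting rather than a strict K-fold scheme. The function name `split_once` and the integer video IDs are hypothetical.

```python
import random
from typing import Iterable, List, Tuple


def split_once(video_ids: Iterable[int], seed: int,
               n_test: int = 10, n_val: int = 4
               ) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle the IDs reproducibly and cut them into train/val/test lists."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # seeded, so each run is repeatable
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


wechat_ids = range(214)  # placeholder IDs for the 214 WeChat pairs
splits = [split_once(wechat_ids, seed) for seed in range(5)]

# Metrics from each of the five runs would then be averaged.
```

Each run uses every video exactly once across the three subsets, so no clip is seen in both training and testing within a run.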
In tests for video restoration, VOTES was compared to Spatio-Temporal Deformable Fusion (STDF). For resolution enhancement, it was tested against Video Restoration with Enhanced Deformable Convolutional Networks (EDVR), RSDN, Video Super-resolution with Temporal Group Attention (VSR_TGA), and BasicVSR. Google's single-stage method COMISR was also included, though it does not fit the architecture type of the other prior works.
The methods were tested against both UVSSM and the REDS dataset, with VOTES achieving the highest scores.
The authors contend that the qualitative results also indicate the superiority of VOTES against the prior systems.
First published 19th August 2022.