A collaboration between a Chinese AI research group and US-based researchers has developed what may be the first real innovation in deepfakes technology since the phenomenon emerged four years ago.
The new method can perform faceswaps that outperform all other existing frameworks on standard perceptual tests, without needing to exhaustively gather and curate large dedicated datasets and train them for up to a week for just a single identity. For the examples presented in the new paper, models were trained on the entirety of two popular celebrity datasets, on one NVIDIA Tesla P40 GPU for about three days.
The new approach removes the need to ‘paste’ the transplanted identity crudely into the target video, which frequently leads to tell-tale artifacts that appear where the fake face ends and the real, underlying face begins. Rather, ‘hallucination maps’ are used to perform a deeper mingling of visual facets, because the system separates identity from context far more effectively than current methods, and therefore can blend the target identity at a more profound level.
Effectively the new hallucination map provides a more complete context for the swap, as opposed to the hard masks that often require extensive curation (and in the case of DeepFaceLab, separate training) while providing limited flexibility in terms of real incorporation of the two identities.
The paper, titled One-stage Context and Identity Hallucination Network, is authored by researchers affiliated with JD AI Research, and the University of Massachusetts Amherst, and was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800. It was introduced at the 29th ACM International Conference on Multimedia, on October 20th- 24th, at Chengdu, China.
No Need for ‘Face-On’ Parity
Both the most popular current deepfake software, DeepFaceLab, and competing fork FaceSwap, perform tortuous and frequently hand-curated workflows in order to identify which way a face is inclined, what obstacles are in the way that must be accounted for (again, manually), and must cope with many other irritating impediments (including lighting) that make their use far from the ‘point-and-click’ experience inaccurately portrayed in the media since the advent of deepfakes.
By contrast, CihaNet does not require that two images be facing the camera directly in order to extract and exploit useful identity information from a single image.
The CihaNet project, according to the authors, was inspired by the 2019 collaboration between Microsoft Research and Peking University, called FaceShifter, though it makes some notable and critical changes to the core architecture of the older method.
FaceShifter uses two Adaptive Instance Normalization (AdaIN) networks to handle identity information, which data is then transposed into the target image via a mask, in a way similar to current popular deepfake software (and with all its related limitations), using an additional HEAR-Net (which includes a separately trained sub-net trained on occlusion obstacles – an additional layer of complexity).
Instead, the new architecture directly uses this ‘contextual’ information for the transformative process itself, via a two-step single Cascading Adaptive Instance Normalization (C-AdaIN) operation, which provides consistency of context (i.e. face skin and occlusions) of ID-relevant areas.
The second sub-net crucial to the system is called Swapping Block (SwapBlk), which generates an integrated feature from the context of the reference image and the embedded ‘identity’ information from the source image, bypassing the multiple stages necessary to accomplish this by conventional current means.
To help distinguish between context and identity, a hallucination map is generated for each level, standing in for a soft-segmentation mask, and acting on a wider range of features for this critical part of the deepfake process.
In this way, the entire swapping process is accomplished in a single stage and without post-processing.
Data and Testing
To try out the system, the researchers trained four models on two highly popular and variegated open image datasets – CelebA-HQ and NVIDIA’s Flickr-Faces-HQ Dataset (FFHQ), each containing 30,000 and 70,000 images respectively.
No pruning or filtering was performed on these base datasets. In each case, the researchers trained the entirety of each dataset on the single Tesla GPU over three days, with a learning rate of 0.0002 on Adam optimization.
They then rendered out a series of random swaps among the thousands of personalities featured in the datasets, without regard for whether or not the faces were similar or even gender-matched, and compared CihaNet’s results to the output from four leading deepfake frameworks: FaceSwap (which stands in for the more popular DeepFaceLab, since it shares a root codebase in the original 2017 repository that brought deepfakes to the world); the aforementioned FaceShifter; FSGAN; and SimSwap.
The three metrics used in evaluating the results were Structural Similarity (SSIM), pose estimation error and ID retrieval accuracy, which is computed based on the percentage of successfully retrieved pairs.
The researchers contend that CihaNet represents a superior approach in terms of qualitative results, and a notable advance on the current state of the art in deepfake technologies, by removing the burden of extensive and labor-intensive masking architectures and methodologies, and achieving a more useful and actionable separation of identity from context.
Take a look below to see further video examples of the new technique. You can find the full-length video here.
From supplementary materials for the new paper, CihaNet performs faceswapping on various identities. Source: https://mitchellx.github.io/#video