
High Precision Semantic Image Editing with EditGAN


Generative Adversarial Networks (GANs) have been finding new applications in the image editing industry. Over the past few months, EditGAN has been gaining popularity in the AI/ML community as a novel method for high-precision, high-quality semantic image editing.

In this article, we will look at the EditGAN model in detail and explain why it might prove to be a milestone in semantic image editing.

So let’s start. But before we get into what EditGAN is, it’s important to understand why it matters and why it is a significant step forward.

Why EditGAN?

Although traditional GAN architectures have helped AI-based image editing advance significantly, building an image editing system on a GAN architecture from scratch comes with some major challenges.

  1. During the training phase, a GAN architecture requires a large amount of labeled data with semantic segmentation annotations. 
  2. It typically provides only high-level control over the output. 
  3. It often just interpolates back and forth between images. 

It can be observed that although traditional GAN architectures get the work done, they are not effective for wide-scale deployment. This sub-par efficiency is the reason why EditGAN was introduced by NVIDIA in 2021. 

EditGAN is proposed as an effective method for high-precision, high-quality semantic image editing that allows users to edit an image by altering its highly detailed segmentation mask. One of the reasons EditGAN is a scalable method for image editing tasks is its architecture. 

The EditGAN model is built on a GAN framework that models images and their semantic segmentations jointly, and it requires only a handful of labeled or annotated training examples. The developers of EditGAN embed an image into the GAN’s latent space and modify it by performing conditional latent code optimization in accordance with the segmentation edit. Furthermore, to amortize this optimization, the model finds “editing vectors” in latent space that realize the edits. 

The architecture of the EditGAN framework allows the model to learn an arbitrary number of editing vectors that can then be applied directly to other images with high speed and efficiency. Furthermore, experimental results indicate that EditGAN can edit images with a previously unseen level of detail while preserving image quality. 

To sum up why we need EditGAN: it is the first GAN-based image editing framework that

  1. Offers very high-precision editing. 
  2. Requires only a handful of labeled data. 
  3. Can be deployed effectively in real-time scenarios. 
  4. Allows compositionality of multiple edits simultaneously. 
  5. Works on GAN-generated, real embedded, and even out-of-domain images. 

High-Precision Semantic Image Editing with EditGAN 

StyleGAN2, a state-of-the-art GAN framework for image synthesis, is the primary image generation component of EditGAN. The StyleGAN2 framework maps latent codes drawn from a multivariate normal distribution into realistic images. 

StyleGAN2 is a deep generative model trained to synthesize images of the highest possible quality while acquiring a semantic understanding of the images it models. 
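
To make the z → w → image pipeline concrete, here is a minimal, heavily simplified sketch in PyTorch. The `MappingNetwork` and `ToyGenerator` classes below are toy stand-ins for illustration only, not NVIDIA’s actual StyleGAN2 architecture; in practice, EditGAN builds on pre-trained StyleGAN2 weights.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Toy stand-in for StyleGAN2's mapping network: z -> w."""
    def __init__(self, z_dim=512, w_dim=512, n_layers=8):
        super().__init__()
        layers, dim = [], z_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class ToyGenerator(nn.Module):
    """Toy stand-in for the StyleGAN2 synthesis network: w -> RGB image."""
    def __init__(self, w_dim=512, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(nn.Linear(w_dim, 3 * img_size * img_size), nn.Tanh())

    def forward(self, w):
        return self.net(w).view(-1, 3, self.img_size, self.img_size)

mapping, synthesis = MappingNetwork(), ToyGenerator()
z = torch.randn(4, 512)   # latent codes drawn from a standard normal distribution
w = mapping(z)            # intermediate latent codes
images = synthesis(w)     # (4, 3, 64, 64) synthetic images
```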

Segmentation Training and Inference

To segment a new image, the EditGAN model embeds it into the GAN’s latent space using optimization together with an encoder, and then runs the trained segmentation branch. Building on previous work, the framework trains an encoder to embed images into the latent space. The encoder is trained with standard pixel-wise L2 and LPIPS reconstruction losses using both samples from the GAN and real training data. Furthermore, the model explicitly regularizes the encoder using the known latent codes when working with GAN samples. 

As a result, the model embeds the annotated images from the dataset labeled with semantic segmentation into the latent space, and uses a cross-entropy loss to train the segmentation branch of the generator. 
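
To make this training objective concrete, here is a minimal sketch in PyTorch. The `encoder`, `generator`, and `seg_branch` arguments are assumed stand-ins for the corresponding EditGAN components (for instance, the toy modules sketched earlier), and the `lpips` package is used as one common implementation of the LPIPS perceptual loss; the exact losses and weights differ in the actual framework.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips -- one common LPIPS implementation (assumption)

perceptual = lpips.LPIPS(net="vgg")  # pre-trained perceptual loss; expects images in [-1, 1]

def encoder_loss(encoder, generator, real_images, gan_images, gan_codes):
    """Sketch of the encoder objective: pixel-wise L2 + LPIPS reconstruction
    losses on real and GAN samples, plus explicit latent regularization on
    GAN samples (where the true latent code is known)."""
    loss = 0.0
    for imgs in (real_images, gan_images):
        w = encoder(imgs)
        recon = generator(w)
        loss = loss + F.mse_loss(recon, imgs) + perceptual(recon, imgs).mean()
    # Regularize the encoder toward the known latent codes of the GAN samples.
    loss = loss + F.mse_loss(encoder(gan_images), gan_codes)
    return loss

def segmentation_loss(seg_branch, latent_codes, target_masks):
    """Cross-entropy on the segmentation branch for the few annotated images
    embedded into the latent space (target_masks: per-pixel class indices)."""
    logits = seg_branch(latent_codes)          # (B, num_classes, H, W)
    return F.cross_entropy(logits, target_masks)
```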

Using Segmentation Editing to Find Semantics in Latent Space

The primary purpose of EditGAN is to leverage the joint distribution of semantic segmentations and images for high-precision image editing. Say we have an image x that needs to be edited: the model embeds it into EditGAN’s latent space, or samples an image from the model itself. The segmentation branch then generates the corresponding segmentation y, since RGB images and segmentations share the same latent codes w. Developers can then use any labeling or digital painting tool to modify the segmentation manually as required. 
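
Below is a minimal sketch, in the same spirit as the earlier snippets, of how such a segmentation-driven edit could be optimized. It assumes the hypothetical `generator` and `seg_branch` decode the same latent code into an RGB image and segmentation logits, that `edited_mask` holds the user-modified per-pixel class labels, and that `edit_region` marks the pixels the user touched; the losses, weights, and step counts are illustrative, not EditGAN’s exact configuration.

```python
import torch
import torch.nn.functional as F

def optimize_edit(w_init, edited_mask, edit_region, generator, seg_branch,
                  steps=100, lr=0.05, lambda_rgb=10.0):
    """Optimize a latent offset so the segmentation matches the edited mask
    while the image stays unchanged outside the edited region."""
    original = generator(w_init).detach()
    delta_w = torch.zeros_like(w_init, requires_grad=True)
    opt = torch.optim.Adam([delta_w], lr=lr)

    for _ in range(steps):
        w = w_init + delta_w
        image, seg_logits = generator(w), seg_branch(w)

        # 1) The segmentation should match the user-edited mask.
        seg_loss = F.cross_entropy(seg_logits, edited_mask)
        # 2) Pixels outside the edited region should stay as they were.
        keep = (1.0 - edit_region).unsqueeze(1)          # (B, 1, H, W)
        rgb_loss = F.mse_loss(image * keep, original * keep)

        loss = seg_loss + lambda_rgb * rgb_loss
        opt.zero_grad(); loss.backward(); opt.step()

    return delta_w.detach()   # the "editing vector" realizing this edit
```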

Different Ways of Editing during Inference

The latent-space editing vectors obtained through optimization are semantically meaningful and often disentangled from other attributes. Therefore, to edit a new image, the model can directly embed the image into the latent space and apply the same editing operations it learned previously, without re-running the optimization from scratch. It would be safe to say that the editing vectors the model learns amortize the optimization that was originally needed to perform the edit. 
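
Once such an editing vector has been found, reusing it on a new image is cheap. The sketch below assumes the hypothetical `encoder` and `generator` from the earlier snippets and an `edit_vector` such as the `delta_w` returned above.

```python
def apply_editing_vector(new_image, encoder, generator, edit_vector, scale=1.0):
    """Reuse a previously learned editing vector on a new image: embed the
    image, add the (scaled) vector, and decode -- no per-image optimization."""
    w = encoder(new_image)                 # embed into the latent space
    return generator(w + scale * edit_vector)
```

In this view, the scale parameter is what lets a user dial the strength of an edit up or down at interactive rates.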

It is worth noting that disentanglement is still not perfect, and editing vectors do not always return the best results when applied to other images. However, this can be addressed by removing editing artifacts from other parts of the image with a few additional optimization steps at test time. 

Based on the above, the EditGAN framework can be used to edit images in three different modes, summarized in the sketch that follows this list. 

  • Real-Time Editing with Editing Vectors

For edits that are localized and disentangled, the model edits images by applying previously learned editing vectors at different scales, manipulating the images at interactive rates. 

  • Using Self-Supervised Refinement for Vector-based Editing

For localized edits that are not perfectly disentangled from other parts of the image, the model initializes the edit using previously learned editing vectors and then removes editing artifacts with a few additional optimization steps at test time. 

  • Optimization-based Editing

To perform large-scale and image-specific edits, the model performs optimization from scratch, because such edits cannot be transferred to other images using editing vectors. 
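
The three modes can be thought of as a simple dispatch over the same building blocks. The sketch below reuses the hypothetical `optimize_edit` helper and the encoder/generator stand-ins from the earlier snippets; the mode names and step counts are illustrative.

```python
def edit_image(image, mode, encoder, generator, seg_branch,
               edit_vector=None, edited_mask=None, edit_region=None, scale=1.0):
    """Sketch of dispatching between the three editing modes."""
    w = encoder(image)                                   # embed the image
    if mode == "vector":                                 # real-time, vector-based
        return generator(w + scale * edit_vector)
    if mode == "vector+refinement":                      # vector init + refinement
        w_init = w + scale * edit_vector
        delta = optimize_edit(w_init, edited_mask, edit_region,
                              generator, seg_branch, steps=30)
        return generator(w_init + delta)
    if mode == "optimization":                           # per-image optimization
        delta = optimize_edit(w, edited_mask, edit_region,
                              generator, seg_branch, steps=100)
        return generator(w + delta)
    raise ValueError(f"unknown editing mode: {mode}")
```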

Implementation

The EditGAN framework is evaluated on images from four different categories: Cars, Birds, Cats, and Faces. The segmentation branch of the model is trained using only 16, 30, 30, and 16 labeled image-mask pairs for Cars, Birds, Cats, and Faces, respectively. When an image is edited purely via optimization, or when the model is learning the editing vectors, it performs 100 optimization steps using the Adam optimizer. 

For the Cat, Car, and Faces categories, the editing functionality is demonstrated on real images from DatasetGAN’s test set that were not used to train the GAN framework. These images are embedded into EditGAN’s latent space using optimization and the encoder. For the Birds category, the editing is shown on GAN-generated images. 

Results

Qualitative Results

In-Domain Results

Applying previously learned editing vectors to novel images and refining them with 30 optimization steps, the EditGAN framework produces editing operations that are disentangled for all classes and preserve the overall quality of the images. Comparing the results of EditGAN with other frameworks, it can be observed that EditGAN outperforms other methods at performing high-precision, complex edits while preserving subject identity and image quality at the same time. 

What’s astonishing is that the EditGAN framework can perform extremely high-precision edits, like dilating the pupils or editing the wheel spokes on a car’s tires. Furthermore, EditGAN can be used to edit semantic parts of objects that consist of only a few pixels, or to perform large-scale modifications to an image. It is worth noting that several of EditGAN’s editing operations can generate manipulated images unlike any that appear in the GAN training data. 

Out of Domain Results

To evaluate EditGAN’s out-of-domain performance, the framework has been tested on the MetFaces dataset. The EditGAN model uses in-domain real faces to create editing vectors. The model then embeds out-of-domain MetFaces portraits using a 100-step optimization process and applies the editing vectors via a 30-step self-supervised refinement process. 

Quantitative Results

To measure EditGAN’s image editing capabilities quantitatively, the model uses the smile edit benchmark first introduced by MaskGAN. Faces with a neutral expression are replaced with smiling faces, and performance is measured across three metrics. 

  • Semantic Correctness

The model uses a pre-trained smile attribute classifier to measure whether the faces in the images show smiling expressions after editing. 

  • Distribution-level Image Quality

Kernel Inception Distance (KID) and Frechet Inception Distance (FID) are calculated between the CelebA test dataset and 400 edited test images. 

  • Identity Preservation

The model’s ability to preserve the identity of subjects when editing an image is measured using a pre-trained ArcFace feature extraction network. A sketch of how all three metrics might be computed follows below. 
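
As an illustration, the three metrics could be computed along the following lines. The `smile_classifier` and `face_embedder` models are assumed pre-trained stand-ins for the attribute classifier and the ArcFace feature extractor, and torchmetrics is used here as one possible implementation of FID and KID; the evaluation code used for the published numbers may differ.

```python
import torch
import torch.nn.functional as F
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

def evaluate_smile_edit(edited, celeba_test, originals,
                        smile_classifier, face_embedder):
    """Sketch of the three smile-edit metrics described above.
    Images are assumed to be float tensors in [0, 1] of shape (N, 3, H, W)."""
    # 1) Semantic correctness: fraction of edited faces classified as smiling
    #    (class index 1 = "smiling" is an assumption of this sketch).
    smile_acc = (smile_classifier(edited).argmax(dim=1) == 1).float().mean()

    # 2) Distribution-level image quality: FID / KID against the CelebA test set.
    fid = FrechetInceptionDistance(normalize=True)
    kid = KernelInceptionDistance(subset_size=50, normalize=True)
    for metric in (fid, kid):
        metric.update(celeba_test, real=True)
        metric.update(edited, real=False)
    kid_mean, _ = kid.compute()

    # 3) Identity preservation: cosine similarity between face embeddings of
    #    each edited image and its original.
    id_sim = F.cosine_similarity(face_embedder(edited),
                                 face_embedder(originals)).mean()

    return {"smile_acc": smile_acc.item(), "fid": fid.compute().item(),
            "kid": kid_mean.item(), "id_similarity": id_sim.item()}
```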

On the smile edit benchmark, the performance of the EditGAN framework is compared against four baseline methods:

  • MaskGAN

MaskGAN takes non-smiling images, their segmentation masks, and a target smiling segmentation mask as input. It is worth noting that, compared to EditGAN, the MaskGAN framework requires a large amount of annotated data. 

  • Local Editing

EditGAN is also compared with Local Editing, a method that clusters GAN features to perform local edits and depends on reference images. 

  • InterFaceGAN

Like EditGAN, InterFaceGAN attempts to find editing vectors in the latent space of the model. However, unlike EditGAN, the InterFaceGAN model uses a large amount of annotated data and auxiliary attribute classifiers, and it does not offer the same fine editing precision. 

  • StyleGAN2 Distillation

This method offers an alternative approach that does not require embedding real images; instead, it uses editing vectors to create a synthetic training dataset. 

Limitations

Because EditGAN is based on a GAN framework, it shares the same limitation as any other GAN model: it can work only with images that can be modeled by the GAN. This restriction is the main reason why it is difficult to deploy EditGAN across arbitrary scenarios. However, it is worth noting that EditGAN’s high-precision edits can be readily transferred to other images by making use of editing vectors. 

Conclusion

One of the major reasons why GANs are not an industry standard in the image editing field is their limited practicality. GAN frameworks usually require a large amount of annotated training data, and they often fall short in efficiency and accuracy. 

EditGAN aims to tackle the issues presented by conventional GAN frameworks and positions itself as an effective method for high-quality, high-precision semantic image editing. The results so far indicate that EditGAN delivers what it claims, and it already performs better than some current industry-standard practices and models. 

"An engineer by profession, a writer by heart". Kunal is a technical writer with a deep love & understanding of AI and ML, dedicated to simplifying complex concepts in these fields through his engaging and informative documentation.