A new research initiative between the US and China has proposed the use of Generative Adversarial Networks (GANs) to increase the realism of driving simulators.
In a novel take on the challenge of producing photorealistic POV driving scenarios, the researchers have developed a hybrid method that plays to the strengths of varying approaches, by mixing the more photorealistic output of CycleGAN-based systems with more conventionally-generated elements, which require a greater level of detail and consistency, such as road markings and the actual vehicles observed from the driver's point of view.
The system, called Hybrid Generative Neural Graphics (HGNG), injects highly-limited output from a conventional, CGI-based driving simulator into a GAN pipeline, where the NVIDIA SPADE framework takes over the work of environment generation.
The advantage, according to the authors, is that driving environments will become potentially more diverse, creating a more immersive experience. As it stands, even converting CGI output to photoreal neural rendering output cannot solve the problem of repetition, as the original footage entering the neural pipeline is constrained by the limits of the model environments, and their tendency to repeat textures and meshes.
The paper states*:
‘The fidelity of a conventional driving simulator depends on the quality of its computer graphics pipeline, which consists of 3D models, textures, and a rendering engine. High-quality 3D models and textures require artisanship, whereas the rendering engine must run complicated physics calculations for the realistic representation of lighting and shading.'
The new paper is titled Photorealism in Driving Simulations: Blending Generative Adversarial Image Synthesis with Rendering, and comes from researchers at the Department of Electrical and Computer Engineering at Ohio State University, and Chongqing Changan Automobile Co Ltd in Chongqing, China.
HGNG transforms the semantic layout of an input CGI-generated scene by mixing partially rendered foreground material with GAN-generated environments. Though the researchers experimented with various datasets on which to train the models, the most effective proved to be the KITTI Vision Benchmark Suite, which predominantly features captures of driver-POV material from the German town of Karlsruhe.
The researchers experimented with both Conditional GAN (cGAN) and CYcleGAN (CyGAN) as generative networks, finding ultimately that each has strengths and weaknesses: cGAN requires paired datasets, and CyGAN does not. However, CyGAN cannot currently outperform the state-of-the-art in conventional simulators, pending further improvements in domain adaptation and cycle consistency. Therefore cGAN, with its additional paired data requirements, obtains the best results at the moment.
In the HGNG neural graphics pipeline, 2D representations are formed from CGI-synthesized scenes. The objects that are passed through to the GAN flow from the CGI rendering are limited to ‘essential' elements, including road markings and vehicles, which a GAN itself cannot currently render at adequate temporal consistency and integrity for a driving simulator. The cGAN-synthesized image is then blended with the partial physics-based render.
To test the system, the researchers used SPADE, trained on Cityscapes, to convert the semantic layout of the scene into photorealistic output. The CGI source came from open source driving simulator CARLA, which leverages the Unreal Engine 4 (UE4).
The shading and lighting engine of UE4 provided the semantic layout and the partially rendered images, with only vehicles and lane markings output. Blending was achieved with a GP-GAN instance trained on the Transient Attributes Database, and all experiments runs on a NVIDIA RTX 2080 with 8 GB of GDDR6 VRAM.
The researchers tested for semantic retention – the ability of the output image to correspond to the initial semantic segmentation mask intended as the template for the scene.
In the test images above, we see that in the ‘render only' image (bottom left), the full render does not obtain plausible shadows. The researchers note that here (yellow circle) shadows of trees that fall onto the sidewalk were mistakenly classified by DeepLabV3 (the semantic segmentation framework used for these experiments) as ‘road' content.
In the middle column-flow, we see that cGAN-created vehicles do not have enough consistent definition to be usable in a driving simulator (red circle). In the right-most column flow, the blended image conforms to the original semantic definition, while retaining essential CGI-based elements.
To evaluate realism, the researchers used Frechet Inception Distance (FID) as a performance metric, since it can operate on paired data or unpaired data.
Three datasets were used as ground truth: Cityscapes, KITTI, and ADE20K.
The output images were compared against each other using FID scores, and against the physics-based (i.e., CGI) pipeline, while semantic retention was also evaluated.
In the results above, which relate to semantic retention, higher scores are better, with the CGAN pyramid-based approach (one of several pipelines tested by the researchers) scoring highest.
The results pictured directly above pertain to FID scores, with HGNG scoring highest through use of the KITTI dataset.
The ‘Only render' method (denoted as ) pertains to the output from CARLA, a CGI flow which is not expected to be photorealistic.
Qualitative results on the conventional rendering engine (‘c' in image directly above) exhibit unrealistic distant background information, such as trees and vegetation, while requiring detailed models and just-in-time mesh loading, as well as other processor-intensive procedures. In the middle (b), we see that cGAN fails to obtain adequate definition for the essential elements, cars and road markings. In the proposed blended output (a), vehicle and road definition is good, whilst the ambient environment is diverse and photorealistic.
The paper concludes by suggesting that the temporal consistency of the GAN-generated section of the rendering pipeline could be increased through the use of larger urban datasets, and that future work in this direction could offer a real alternative to costly neural transformations of CGI-based streams, while providing greater realism and diversity.
* My conversion of the authors' inline citations to hyperlinks.
First published 23rd July 2022.