The Visual Computing Research Center at Shenzhen University in China has developed a large-scale urban scene dataset offering diverse, fully semantically labeled simulations of major cities around the world, as a resource for driving, drone, and other machine learning projects that rely on simulated environments.
Titled UrbanScene3D, the simulator features a variety of dense, detailed, and navigable urban reconstructions with realistic textures. Many of the scenarios were created by professional modelers working from publicly available aerial data, and feature a level of human-led optimization that is currently difficult or expensive to achieve in fully programmatic image synthesis and RGB-D capture systems, such as photogrammetry pipelines or Neural Radiance Fields (NeRF).
The project addresses one of the major imbalances in computer vision research: a lack of rich, semantically labeled urban environment datasets with high-quality model structure, in contrast to the abundance of comparable semantic and modeling data for interior scenes.
Simulations run in UrbanScene3D can provide ground truth for generation of subsequent project-specific datasets relating to autonomous vehicles and drones, among other possibilities.
The project’s source files, around 70GB, have been released free for research and educational use. The implementation can run in a C++ environment or in Python, and requires Unreal Engine 4 (version 4.24 recommended). For aerial projects, such as drone training and simulation, the project also supports Microsoft’s AirSim.
UrbanScene3D features six CAD environments created by professional artists from images or satellite maps, together with five reconstructed real-world environments. The CAD scenes are reconstructions of New York City, Chicago, San Francisco, Shenzhen, Suzhou, and Shanghai, while the image-derived data centers on five specific scenes from these cities, including a hospital and a university campus.
The raw acquisition data for UrbanScene3D is also being made available, comprising high-resolution aerial images at 6000x4000 pixels and 4K aerial videos, along with camera poses and the reconstructed 3D models.
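With camera poses and depth available as ground truth, a typical downstream step is back-projecting a depth map into a 3D point cloud. The following is a minimal sketch in NumPy, assuming a standard pinhole camera model; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the depth values are purely illustrative, not the dataset's actual calibration:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-space 3D points
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # shape (H, W, 3)

# Toy example: a flat surface 2 m in front of the camera,
# with made-up intrinsics for a tiny 6x4 image.
depth = np.full((4, 6), 2.0)
pts = depth_to_points(depth, fx=300.0, fy=300.0, cx=3.0, cy=2.0)
print(pts.shape)  # (4, 6, 3)
```

In practice the camera-space points would then be transformed into world coordinates using the per-frame poses supplied with the dataset.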
The project aims to address the limitations of existing urban scene datasets, and is the first to provide high quality CAD-level detail together with semantic labeling and depth-map information. Previous efforts include:
Released in 2014, Microsoft’s Common Objects in Context (COCO) dataset features 1.5 million object instances across 80 categories, together with object recognition in context and five captions per image. COCO does not provide ground-truth meshes with pose or depth information.
The KITTI Vision Benchmark Suite
Produced by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago, KITTI provides depth information, but not instance masks.
The Cityscapes Dataset for Semantic Urban Scene Understanding was released in 2016, and features dense semantic segmentation along with instance segmentation of people and vehicles. Its primary objective is to aid the development of autonomous driving systems and adjacent sectors of urban monitoring.
It features eight categories (flat, human, vehicle, construction, object, nature, sky, and void) and offers fine annotations across 5,000 images.
CityScape was released in 2020, and is similar in features to UrbanScene3D, except that it lacks CAD modeling.
Launched in 2018 and led by Baidu Research, ApolloCar3D is a collaboration between a number of academic research units across the West and Asia, including the University of California, San Diego, the Australian National University, and Northwestern Polytechnical University in Xi’an, China.
ApolloCar3D is aimed specifically at ground-level autonomous vehicle research, and features 5,277 driving images and over 60,000 vehicle instances backed by detailed 3D CAD models rendered at absolute sizes and labeled with semantic keypoints. The dataset is more than 20 times larger than KITTI but, unlike UrbanScene3D, offers only partial depth information.
HoliCity, described as ‘A City-Scale Data Platform for Learning Holistic 3D Structures’, is a 2021 collaboration between UC Berkeley, Stanford, USC, and ByteDance Research in Palo Alto. It comprises a city-scale 3D dataset with a high level of structural detail, offering 6,300 real-world panorama scenes covering an area exceeding 20 square kilometers.
The project is aimed at real-world applications such as localization, augmented reality, mapping, and city-scale reconstruction. Though it features CAD modeling, its level of detail falls below that of UrbanScene3D.