Scaling up representations of text and visuals has been a major focus of research in recent years. This work has driven major breakthroughs in language and vision learning. However, despite the popularity of scaling text and visual representations, scaling representations of 3D scenes and objects remains insufficiently explored.
Today, we will discuss Uni3D, a 3D foundation model that aims to explore unified 3D representations. The Uni3D framework employs a 2D-initialized ViT framework, pretrained end-to-end, to align image-text features with their corresponding 3D point cloud features.
The Uni3D framework uses pretext tasks and a simple architecture to leverage the abundance of pretrained 2D models and image-text-aligned models as initializations and targets, respectively. This approach unleashes the full potential of 2D models and strategies to scale them to the 3D world.
In this article, we will delve deeper into 3D computer vision and the Uni3D framework, exploring the essential concepts and the architecture of the model. So, let’s begin.
Uni3D and 3D Representation Learning: An Introduction
In the past few years, computer vision has emerged as one of the most heavily invested domains in the AI industry. Following significant advancements in 2D computer vision frameworks, developers have shifted their focus to 3D computer vision. This field, particularly 3D representation learning, merges aspects of computer graphics, machine learning, computer vision, and mathematics to automate the processing and understanding of 3D geometry. The rapid development of 3D sensors like LiDAR, along with their widespread applications in the AR/VR industry, has resulted in 3D representation learning gaining increased attention. Its potential applications continue to grow daily.
Although existing frameworks have shown remarkable progress in 3D model architecture, task-oriented modeling, and learning objectives, most explore 3D architecture on a relatively small scale with limited data, parameters, and task scenarios. The challenge of learning scalable 3D representations, which can then be applied to real-time applications in diverse environments, remains largely unexplored.
In recent years, scaling up pre-trained large language models has revolutionized the natural language processing domain, and recent work indicates that this progress translates from language to 2D vision through data and model scaling. This opens the door for developers to reattempt the same success in 3D: learning a scalable 3D representation that transfers to real-world applications.
Uni3D is a scalable, unified 3D pretraining framework developed to learn large-scale 3D representations, tested at the scale of over a billion parameters, over 10 million images paired with over 70 million texts, and over a million 3D shapes. The figure below plots zero-shot accuracy against parameter count for the Uni3D framework, which successfully scales 3D representations from 6 million to over a billion parameters.
The Uni3D framework uses a 2D ViT (Vision Transformer) as its 3D encoder, pre-trained end-to-end to align image-text-aligned features with 3D point cloud features. The framework uses pretext tasks and a simple architecture to leverage the abundance of pretrained 2D models and image-text-aligned models as initializations and targets, respectively, thus unleashing the full potential of 2D models and the strategies used to scale them in the 3D world. The flexibility and scalability of the Uni3D framework are demonstrated by:
- Scaling the model from 6 million to over a billion parameters.
- Varying the 2D initialization, from visual self-supervised models to text-supervised models.
- Scaling the text-image target model from 150 million to over a billion parameters.
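The end-to-end alignment behind these scaling axes can be illustrated with a minimal, CLIP-style contrastive objective. The sketch below uses random NumPy arrays as stand-ins for the 3D encoder outputs and the frozen image/text target embeddings; the function names and the temperature value are illustrative, not taken from the Uni3D codebase.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Map each embedding onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(pc_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched point-cloud / target pairs
    share a row index, so the positives sit on the diagonal."""
    logits = l2_normalize(pc_emb) @ l2_normalize(target_emb).T / temperature
    labels = np.arange(len(logits))

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the point->target and target->point directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
pc_emb = rng.normal(size=(4, 8))                 # stand-in 3D encoder outputs
random_targets = rng.normal(size=(4, 8))         # unrelated targets
loss_random = contrastive_loss(pc_emb, random_targets)
loss_aligned = contrastive_loss(pc_emb, pc_emb)  # perfectly aligned pairs
```

Training drives the point cloud embedding of each shape toward the image-text embedding of the same shape, which is what later enables zero-shot transfer.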
Under the flexible and unified framework offered by Uni3D, developers observe a coherent boost in performance when scaling each component. Large-scale 3D representation learning also benefits immensely from shareable 2D models and scale-up strategies.
As seen in the figure below, the Uni3D framework delivers a boost in performance over prior art in both few-shot and zero-shot settings. Notably, the Uni3D framework achieves a zero-shot classification accuracy of over 88% on ModelNet, which is on par with several state-of-the-art supervised methods.
Furthermore, the Uni3D framework delivers top-notch accuracy and performance on other representative 3D tasks such as part segmentation and open-world understanding. Uni3D aims to bridge the gap between 2D and 3D vision by scaling 3D foundation models with a unified yet simple pre-training approach, learning more robust 3D representations across a wide array of tasks and ultimately helping 2D and 3D vision converge across modalities.
Uni3D: Related Work
The Uni3D framework draws inspiration from, and builds on, previous work in 3D representation learning and foundation models, especially across different modalities.
3D Representation Learning
3D representation learning uses point clouds for 3D understanding of objects. The field has been explored extensively in recent years, and it has been observed that point clouds can be pre-trained under self-supervision using specific 3D pretext tasks, including masked point modeling, self-reconstruction, and contrastive learning.
It is worth noting that these methods work with limited data and often do not investigate multimodal representations linking 2D or natural language to 3D. However, motivated by the recent success of the CLIP framework, which learns visual concepts from raw text with high efficiency using contrastive learning, subsequent work seeks to learn 3D representations by aligning image, text, and point cloud features with the same contrastive learning method.
Developers have been working exhaustively on designing foundation models that scale up and unify multimodal representations. In the NLP domain, for example, work on scaling up pre-trained language models is steadily revolutionizing the industry. Similar advancements can be observed in 2D vision, where data and model scaling techniques carry the progress from language over to 2D models. Such frameworks are difficult to replicate for 3D, however, because of the limited availability of 3D data and the challenges encountered when unifying and scaling up 3D frameworks.
By learning from these two lines of work, developers have created the Uni3D framework, the first 3D foundation model with over a billion parameters. It makes use of a unified ViT (Vision Transformer) architecture that allows developers to scale the Uni3D model using unified 2D or NLP scaling strategies. Developers hope this method will allow the Uni3D framework to bridge the gap that currently separates 2D and 3D vision, along with facilitating multimodal convergence.
Uni3D: Method and Architecture
The above image shows a generic overview of the Uni3D framework, a scalable and unified pre-training 3D framework for large-scale 3D representation learning. Developers make use of over 70 million texts and 10 million images paired with over a million 3D shapes to scale the Uni3D framework to over a billion parameters. The framework uses a 2D ViT (Vision Transformer) as its 3D encoder, trained end-to-end to align text-image data with 3D point cloud features, allowing Uni3D to deliver the desired efficiency and accuracy across a wide array of benchmarks. Let us now have a detailed look at how the Uni3D framework works.
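For a ViT to consume a point cloud, the cloud must first be turned into a token sequence. The toy snippet below picks patch centers, gathers each center's nearest neighbours, and linearly projects every flattened local patch into a token; it is a simplified illustration only, and Uni3D's actual tokenizer (sampling scheme and learned patch embedding) differs in detail.

```python
import numpy as np

def tokenize_point_cloud(points, num_tokens=4, k=8, embed_dim=16, seed=0):
    """Toy point tokenizer: choose patch centers, gather each center's k
    nearest neighbours, and linearly project every flattened local patch
    into a ViT-style token."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=num_tokens, replace=False)]
    # Pairwise distances from every center to every point: (num_tokens, N).
    dists = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    neighbour_idx = np.argsort(dists, axis=1)[:, :k]      # k nearest per center
    patches = points[neighbour_idx].reshape(num_tokens, -1)  # (num_tokens, k*3)
    proj = rng.normal(size=(patches.shape[1], embed_dim))    # stand-in for a learned linear layer
    return patches @ proj                                    # (num_tokens, embed_dim)

points = np.random.default_rng(1).normal(size=(64, 3))
tokens = tokenize_point_cloud(points)
```

Once the cloud is a sequence of fixed-width tokens, the downstream transformer blocks are indistinguishable from a standard 2D ViT, which is what makes the unified scaling strategy possible.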
Scaling the Uni3D Framework
Prior studies on point cloud representation learning have traditionally focused heavily on designing particular model architectures that deliver better performance across a wide range of applications, working on limited amounts of data from small-scale datasets. Recent studies have tried to explore the possibility of scalable pre-training in 3D, but there were no major outcomes due to the limited availability of 3D data. To solve the scalability problem of 3D frameworks, the Uni3D framework leverages a vanilla transformer structure that almost mirrors a Vision Transformer, solving the scaling problem by applying unified 2D or NLP scaling-up strategies to the model size.
Another major challenge encountered by prior work on scaling 3D representations is the difficulty of convergence and the overfitting that result from large model sizes. An effective approach to overcoming this hurdle is to pretrain individual 3D backbones with specific 3D pretext tasks and initialize from the pretrained parameters. However, this approach carries high training costs, and it is also difficult to establish a robust initialization for cross-modal learning given the limited amount of 3D data available for training.
The Uni3D framework leverages a vanilla transformer, the structure of which closely resembles ViT. With this approach, the Uni3D framework can naturally adopt the pre-trained large models with other modalities to initialize the Uni3D framework.
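Because the encoder's transformer blocks share the 2D ViT layout, initialization can be as simple as copying matching block weights from a pretrained ViT checkpoint while leaving the new point tokenizer to be trained from scratch. The snippet below illustrates the idea with made-up key names and NumPy arrays standing in for real checkpoint tensors; it is not Uni3D's actual loading code.

```python
import numpy as np

vit_checkpoint = {  # pretend pretrained 2D ViT weights (illustrative keys)
    "patch_embed.weight": np.ones((16, 768)),
    "blocks.0.attn.weight": np.full((768, 768), 0.5),
    "blocks.0.mlp.weight": np.full((768, 3072), 0.25),
}

uni3d_encoder = {   # 3D encoder: a new point tokenizer plus shared blocks
    "point_tokenizer.weight": np.zeros((24, 768)),  # trained from scratch
    "blocks.0.attn.weight": np.zeros((768, 768)),
    "blocks.0.mlp.weight": np.zeros((768, 3072)),
}

loaded = []
for name, weight in vit_checkpoint.items():
    # Copy only the transformer blocks; the 2D patch embed has no 3D analogue.
    if name.startswith("blocks.") and name in uni3d_encoder:
        uni3d_encoder[name] = weight.copy()
        loaded.append(name)
```

This sidesteps the costly 3D-specific pretraining mentioned above: the expensive representation learning has already been paid for in 2D, and only the tokenizer and alignment are learned on 3D data.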
The Uni3D framework attempts to learn multimodal alignment across images, language, and point clouds using paradigms similar to the OpenShape and ULIP frameworks. Furthermore, to ensure a fair comparison with other methods, the Uni3D framework trains on the ensembled 3D dataset from OpenShape, which combines four 3D datasets.
Experiments and Results
The Uni3D framework is tested across different settings and various classification tasks, including its performance in zero-shot and few-shot settings, results on open-world understanding, and more. Let's have a detailed look at these results.
Zero-Shot Shape Classification
To evaluate the performance of the Uni3D framework on zero-shot shape classification, developers conduct experiments on three benchmarks: the ModelNet, ScanObjectNN, and Objaverse-LVIS datasets. ModelNet and ScanObjectNN are widely used classification datasets consisting of 40 and 15 object categories, respectively, whereas the Objaverse-LVIS benchmark is a cleaned and annotated dataset of over 40,000 objects across 1,100+ categories. The comparison between frameworks is shown in the image below; as can be seen, the Uni3D framework significantly outperforms previous state-of-the-art frameworks across different settings.
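With aligned embeddings, zero-shot classification reduces to a nearest-text lookup: embed each candidate category name with the text encoder, then predict the category whose embedding is most cosine-similar to the shape embedding. The sketch below uses stand-in NumPy embeddings; in practice they would come from Uni3D's point encoder and a CLIP-style text encoder over prompts such as "a 3D model of a chair".

```python
import numpy as np

def zero_shot_classify(shape_emb, class_text_embs):
    """Predict the category whose text embedding has the highest cosine
    similarity with the 3D shape embedding (no classifier is trained)."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(class_text_embs) @ norm(shape_emb)
    return int(np.argmax(sims))

# Stand-in embeddings: three orthogonal "category" vectors and one shape
# embedding that lies closest to category 1.
text_embs = np.eye(3)
shape = np.array([0.1, 0.9, 0.2])
pred = zero_shot_classify(shape, text_embs)
```

No labels from the target dataset are used at any point, which is what makes the evaluation zero-shot.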
Few-Shot Linear Probing
In AI, linear probing is a common method for evaluating the representations a model learns. To evaluate Uni3D's linear probing ability, developers freeze the parameters of the Uni3D framework, following the same settings as OpenShape, and then train a linear classifier for Uni3D using few-shot class labels. The figure below shows the linear probing performance of different frameworks on the Objaverse-LVIS dataset, averaged over 10 random seeds. As can be seen, the Uni3D framework significantly outperforms existing methods under different few-shot settings.
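Conceptually, linear probing treats the frozen encoder's outputs as fixed features and fits only a linear softmax classifier on the few labelled shots. The toy sketch below implements that protocol with plain gradient descent on synthetic features; it illustrates the idea and is not the evaluation code used for Uni3D.

```python
import numpy as np

def linear_probe(frozen_feats, labels, num_classes, steps=200, lr=0.5):
    """Fit a linear softmax classifier on frozen features; the encoder
    itself is never updated, so only the weight matrix W is learned."""
    W = np.zeros((frozen_feats.shape[1], num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = frozen_feats @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        # Gradient of the cross-entropy loss with respect to W.
        W -= lr * frozen_feats.T @ (probs - onehot) / len(labels)
    return W

# Toy "frozen" features: two well-separated clusters, two shots per class.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W = linear_probe(feats, labels, num_classes=2)
preds = np.argmax(feats @ W, axis=1)
```

Because only the linear head is trained, probe accuracy directly reflects how linearly separable the frozen representation already is.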
Open-World Understanding

To evaluate the capability of the Uni3D framework to understand real-world shapes and objects, developers use the ScanNet dataset to explore Uni3D's performance. It is worth noting that ground-truth instance segmentation is available, and the primary objective is to recognize the category of each individual instance in a scene in a zero-shot setting. The results are shown in the image below. As can be seen, the Uni3D framework delivers exceptional results on real-world understanding and recognition, outperforming existing frameworks by a significant margin despite never training on real-world datasets.
3D Shape Retrieval

The multimodal representations learned by the Uni3D framework allow it to retrieve 3D shapes naturally from either texts or images. To retrieve 3D shapes, the model calculates the cosine similarity between the embeddings of 3D shapes and the embedding of a query text prompt or query image, then uses the KNN (K-Nearest Neighbour) algorithm to return the 3D shapes that most resemble the query; results are shown in the figure below. As can be seen, the Uni3D framework successfully retrieves 3D shapes from real-world images, even though training images are used only for rendering and the gap between real-world and training images is substantial. Additionally, the model can take two input images and retrieve shapes similar to both by computing the cosine similarity between the averaged embeddings of the two images and the 3D shape embeddings. These results are interesting, as they demonstrate Uni3D's ability to learn diverse 3D representations and perceive multiple 2D signals.
In the first column, the framework uses query images to return the 3D shapes most similar to them. In the second column, it uses two input images to retrieve 3D shapes resembling both. Finally, in the last column, the model uses query texts and returns the 3D shapes that most closely match the text query.
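The retrieval procedure described above (cosine similarity plus a top-k nearest-neighbour lookup, with two-image queries handled by averaging the image embeddings) can be sketched as follows, using small stand-in embeddings in place of real Uni3D outputs.

```python
import numpy as np

def retrieve_shapes(query_emb, shape_embs, k=2):
    """Rank gallery 3D-shape embeddings by cosine similarity to a query
    embedding and return the indices of the top-k matches (KNN)."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(shape_embs) @ norm(query_emb)
    return np.argsort(-sims)[:k]  # highest similarity first

# Stand-in gallery of three shape embeddings and two image embeddings.
shapes = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
img_a = np.array([1.0, 0.2])
img_b = np.array([0.2, 1.0])

# Two-image query: average the image embeddings, then retrieve as usual.
two_image_query = (img_a + img_b) / 2
top = retrieve_shapes(two_image_query, shapes, k=1)
```

A text query works identically: the query embedding simply comes from the text encoder instead of the image encoder, since all three modalities share one embedding space.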
Conclusion

In this article, we have discussed Uni3D, a scalable and unified 3D pretraining framework developed to learn large-scale 3D representations, tested at the scale of over a billion parameters, over 10 million images paired with over 70 million texts, and over a million 3D shapes. The framework's developers use a vanilla transformer whose structure mirrors a ViT, which allows them to scale up the Uni3D framework using unified 2D or NLP scaling strategies. Furthermore, the Uni3D framework can transfer a wide array of pre-trained 2D models and 2D strategies to the 3D world. The experimental results already demonstrate the framework's potential: Uni3D returns accurate and efficient results across a wide array of settings and outperforms existing state-of-the-art frameworks.