
Dreamcraft3D: Hierarchical 3D Generation With Bootstrapped Diffusion Prior


Generative AI models have been a hot topic in the AI industry for a while now. The recent success of 2D generative models has shaped the way we create visual content today. Yet although the AI community has achieved remarkable success with 2D generative models, generating 3D content remains a major challenge for deep generative frameworks. This is especially true as demand for 3D-generated content reaches an all-time high, driven by video games, applications, virtual reality, and even cinema. While existing 3D generative frameworks deliver acceptable results for certain categories and tasks, they cannot efficiently generate arbitrary 3D objects, a shortfall largely attributable to the scarcity of extensive 3D training data. To work around this, developers have recently proposed leveraging the guidance offered by pre-trained text-to-image generative models, an approach that has shown promising results.

In this article, we will discuss DreamCraft3D, a hierarchical framework for 3D content generation that produces coherent, high-fidelity 3D objects. The DreamCraft3D framework uses a 2D reference image to guide the geometry sculpting stage before enhancing the texture, with a focus on addressing the consistency issues that plague current methods. Additionally, DreamCraft3D employs a view-dependent diffusion model for score distillation sampling, which helps sculpt geometry that renders coherently from every angle.

We will take a closer look at the DreamCraft3D framework for 3D content generation. Furthermore, we will explore the concept of leveraging pretrained Text-to-Image (T2I) models for 3D content generation, and examine how DreamCraft3D uses this approach to generate realistic 3D content.

DreamCraft3D: An Introduction

DreamCraft3D is a hierarchical pipeline for generating 3D content. The framework first leverages a state-of-the-art text-to-image (T2I) generative model to create a high-quality 2D image from a text prompt. This lets DreamCraft3D exploit the ability of modern 2D diffusion models to represent the visual semantics described in the prompt while retaining the creative freedom these models offer. The generated image is then lifted to 3D through cascaded geometry sculpting and texture boosting phases, with specialized techniques applied at each stage by decomposing the problem.

For geometry, the DreamCraft3D framework focuses heavily on the global 3D structure and multi-view consistency, accepting compromises on detailed texture at this stage. Once the geometry is resolved, the framework shifts its focus to optimizing coherent, realistic textures by implementing a 3D-aware diffusion prior that bootstraps the 3D optimization. These two optimization phases, geometry sculpting and texture boosting, each come with key design considerations.

In short, DreamCraft3D can be described as a generative AI framework that leverages a hierarchical 3D content generation pipeline to transform 2D images into their 3D counterparts while maintaining holistic 3D consistency.

Leveraging Pretrained Text-to-Image (T2I) Models

The idea of leveraging pretrained text-to-image (T2I) models for generating 3D content was first introduced by the DreamFusion framework in 2022. DreamFusion enforces a Score Distillation Sampling (SDS) loss to optimize the 3D representation so that its renderings at random viewpoints align with the text-conditioned image distribution learned by a powerful text-to-image diffusion model. Although this approach delivered decent results, it suffered from two major issues: blurriness and over-saturation. To tackle these issues, recent works implement various stage-wise optimization strategies that improve on the 2D distillation loss, ultimately leading to higher-quality and more realistic 3D generations.
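
To make the idea concrete, here is a minimal sketch of an SDS update in the spirit of DreamFusion. The `unet` noise predictor, its signature, and the weighting term are assumptions for illustration, not any framework's actual code.

```python
# Minimal sketch of Score Distillation Sampling (SDS), assuming a
# hypothetical `unet(noisy_image, t, text_emb)` that predicts noise.
import torch

def sds_loss(rendered, unet, text_emb, alphas_cumprod):
    """SDS loss for one rendered view.

    rendered: (B, C, H, W) image rendered from the 3D model (requires grad).
    alphas_cumprod: (T,) cumulative noise schedule of the diffusion model.
    """
    B = rendered.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=rendered.device)   # random timestep
    noise = torch.randn_like(rendered)

    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * noise    # forward diffusion

    with torch.no_grad():
        pred_noise = unet(noisy, t, text_emb)               # frozen 2D prior

    w = 1 - a                                               # common weighting choice
    grad = w * (pred_noise - noise)                         # SDS gradient
    # Trick: backprop grad through the renderer without differentiating
    # the U-Net itself.
    return (grad.detach() * rendered).sum()
```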

However, despite their recent success, these frameworks still cannot match the ability of 2D generative models to synthesize complex content. Furthermore, they are often riddled with the “Janus problem”, a condition in which 3D renderings that appear plausible individually exhibit stylistic and semantic inconsistencies when examined as a whole.

To tackle the issues faced by prior works, the DreamCraft3D framework explores a holistic, hierarchical 3D content generation pipeline, drawing inspiration from the manual artistic process in which a concept is first sketched as a 2D draft, after which the artist sculpts the rough geometry, refines the geometric details, and paints high-fidelity textures. Following the same approach, DreamCraft3D breaks the exhaustive 3D generation task down into manageable steps. It starts by generating a high-quality 2D image from a text prompt, then proceeds to lift that image into 3D via geometry sculpting and texture boosting stages. Splitting the process into subsequent stages lets DreamCraft3D exploit the full potential of hierarchical generation, which ultimately results in superior-quality 3D output.

In the first stage, the DreamCraft3D framework deploys geometry sculpting to produce consistent and plausible 3D geometry from the 2D reference image. Beyond using the SDS loss on novel views and photometric losses at the reference view, the framework introduces a range of strategies to promote geometric consistency. It leverages Zero-1-to-3, an off-the-shelf viewpoint-conditioned image translation model, to model the distribution of novel views given the reference image. Additionally, the framework transitions from an implicit surface representation to a mesh representation for coarse-to-fine geometric refinement.

The second stage of the DreamCraft3D framework uses a bootstrapped score distillation approach to boost the textures. Current view-conditioned diffusion models are trained on limited 3D data, which is why they often struggle to match the fidelity of 2D diffusion models. Because of this limitation, DreamCraft3D fine-tunes the diffusion model on multi-view renderings of the very 3D instance being optimized, an approach that augments the 3D textures while maintaining multi-view consistency. As the diffusion model trains on these multi-view renderings, it provides increasingly better guidance for 3D texture optimization, and this mutually reinforcing loop allows DreamCraft3D to achieve a remarkable level of texture detail while preserving view consistency.

As can be observed in the images above, the DreamCraft3D framework is capable of producing creative 3D content with realistic textures and intricate geometric structures. The first image shows the body of Son Goku, an anime character, combined with the head of a running wild boar, whereas the second depicts a Beagle dressed in a detective's outfit. Below are some additional examples.

DreamCraft3D: Working and Architecture

As outlined above, the DreamCraft3D framework first leverages a state-of-the-art text-to-image generative model to create a high-quality 2D image from a text prompt, then lifts that image to 3D through cascaded geometry sculpting and texture boosting phases, applying specialized techniques at each stage. The following image briefly sums up how the DreamCraft3D framework works.

Let's take a detailed look at the key design considerations for the geometry sculpting and texture boosting phases.

Geometry Sculpting

Geometry sculpting is the first stage, in which the DreamCraft3D framework creates a 3D model that aligns with the appearance of the reference image at the reference view while remaining plausible under different viewing angles. To this end, the framework uses the SDS loss to encourage each sampled view to render an image that a pre-trained diffusion model recognizes as plausible. To utilize the guidance from the reference image effectively, the framework penalizes photometric differences between the reference and the rendered image at the reference view, with the loss computed only within the foreground region. Additionally, to encourage scene sparsity, the framework adds a mask loss on the rendered silhouette. Even so, maintaining consistent appearance and semantics across back views remains a challenge, which is why the framework employs the additional techniques below to produce detailed, coherent geometry.
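
The following is a minimal sketch of how these stage-one objectives could be combined. The `render` callable, `sds_loss_fn`, and the loss weights are illustrative assumptions rather than DreamCraft3D's released code.

```python
# Hedged sketch of the geometry-sculpting objective: photometric + mask
# losses at the reference view, plus SDS on a sampled novel view.
import torch.nn.functional as F

def stage1_loss(render, ref_image, ref_mask, ref_pose, novel_pose,
                sds_loss_fn, lambda_rgb=1.0, lambda_mask=0.5, lambda_sds=0.1):
    # Photometric loss at the reference view, restricted to the foreground.
    ref_render, ref_silhouette = render(ref_pose)
    rgb_loss = F.mse_loss(ref_render * ref_mask, ref_image * ref_mask)

    # Mask loss: the rendered silhouette should match the reference mask,
    # which also encourages scene sparsity.
    mask_loss = F.mse_loss(ref_silhouette, ref_mask)

    # SDS loss on a randomly sampled novel view, scored by a frozen prior.
    novel_render, _ = render(novel_pose)
    sds = sds_loss_fn(novel_render)

    return lambda_rgb * rgb_loss + lambda_mask * mask_loss + lambda_sds * sds
```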

3D-Aware Diffusion Prior

3D optimization that relies on per-view supervision alone is under-constrained, which is the primary reason the DreamCraft3D framework uses Zero-1-to-3, a view-conditioned diffusion model. Zero-1-to-3 offers enhanced viewpoint awareness because it has been trained at scale on 3D assets. Given the reference image and a relative camera pose, this fine-tuned diffusion model hallucinates what the object should look like from that pose.
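
To make the conditioning difference concrete, here is a hedged sketch of score distillation driven by such a view-conditioned prior; `view_unet` is a hypothetical stand-in for the fine-tuned model, which takes the reference image and a relative camera pose instead of a text embedding.

```python
# Sketch of 3D-aware score distillation, assuming a hypothetical
# `view_unet(noisy, t, ref_image, rel_pose)` that predicts noise while
# conditioning on the reference image and the relative camera pose
# (delta-azimuth, delta-elevation, delta-radius), as in Zero-1-to-3.
import torch

def sds_3d_aware(novel_render, view_unet, ref_image, rel_pose, alphas_cumprod):
    B = novel_render.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,),
                      device=novel_render.device)
    noise = torch.randn_like(novel_render)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * novel_render + (1 - a).sqrt() * noise
    with torch.no_grad():
        pred = view_unet(noisy, t, ref_image, rel_pose)  # viewpoint-aware prior
    grad = (1 - a) * (pred - noise)
    return (grad.detach() * novel_render).sum()          # backprop into renderer
```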

Progressive View Training

Directly optimizing free views across the full 360 degrees can lead to geometric artifacts or discrepancies, such as an extra leg on a chair, an outcome attributable to the inherent ambiguity of a single reference image. To tackle this, the DreamCraft3D framework progressively enlarges the range of training views, gradually propagating the well-established geometry until the full 360-degree result is obtained.
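
Below is a minimal sketch of such a schedule, assuming a linear widening of the allowed azimuth range; the exact schedule and bounds in DreamCraft3D may differ.

```python
# Progressive view training: the sampled camera azimuth range widens
# over training, so geometry established near the reference view is
# propagated toward the back views.
import random

def sample_camera(step, total_steps, min_azimuth=30.0, max_azimuth=180.0):
    frac = min(step / (0.5 * total_steps), 1.0)     # widen over the first half
    limit = min_azimuth + frac * (max_azimuth - min_azimuth)
    azimuth = random.uniform(-limit, limit)         # degrees from the front view
    elevation = random.uniform(-10.0, 45.0)         # illustrative bounds
    return azimuth, elevation
```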

Diffusion Time Step Annealing

The DreamCraft3D framework employs a diffusion timestep annealing strategy to align with the coarse-to-fine progression of the 3D optimization. At the start of optimization, the framework prioritizes sampling larger diffusion timesteps, which provide guidance about the global structure. As training proceeds, it linearly anneals the sampling range over hundreds of iterations. Thanks to this annealing strategy, the framework establishes a plausible global geometry during the early optimization steps before refining the structural details.
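
A minimal sketch of this strategy might look as follows; the endpoint values are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Diffusion timestep annealing: early iterations sample large timesteps
# (global structure), later ones smaller (fine detail).
import random

def sample_timestep(step, total_steps, t_max_start=0.98, t_max_end=0.5,
                    t_min=0.02, num_train_timesteps=1000):
    frac = min(step / total_steps, 1.0)
    t_max = t_max_start + frac * (t_max_end - t_max_start)  # anneal upper bound
    t = random.uniform(t_min, t_max)
    return int(t * num_train_timesteps)
```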

Detailed Structural Enhancement

The DreamCraft3D framework initially optimizes an implicit surface representation to establish a coarse structure. It then uses this result to initialize a textured 3D mesh representation based on a deformable tetrahedral grid (DMTet), which disentangles the learning of geometry and texture. Once the structural enhancement is complete, the model can preserve the high-frequency details obtained from the reference image by refining the textures alone.
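
For illustration, here is a hedged sketch of a DMTet-style parameterization: a fixed tetrahedral grid whose vertices carry a learnable SDF value and a small learnable deformation, from which a marching-tetrahedra step (omitted here) would extract the surface mesh.

```python
# Illustrative DMTet-style geometry container; not DreamCraft3D's code.
import torch
import torch.nn as nn

class DMTetGeometry(nn.Module):
    def __init__(self, grid_vertices):               # (V, 3) tetra grid vertices
        super().__init__()
        self.register_buffer("grid", grid_vertices)
        # Learnable signed distance per vertex; its sign marks inside/outside.
        self.sdf = nn.Parameter(torch.randn(grid_vertices.shape[0]) * 0.1)
        # Small learnable per-vertex offset that lets the grid fit the surface.
        self.deform = nn.Parameter(torch.zeros_like(grid_vertices))

    def vertices(self, max_offset=0.05):
        # Deformed vertex positions; marching tetrahedra would interpolate
        # the SDF zero crossing between vertices into a triangle mesh.
        return self.grid + max_offset * torch.tanh(self.deform)
```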

Texture Boosting Using Bootstrapped Score Distillation

Although the geometry sculpting stage excels at learning detailed and coherent geometry, it blurs the texture to some extent, a consequence of relying on a 2D prior model that operates at a coarse resolution, along with the limited sharpness offered by the 3D diffusion model. Furthermore, common texture issues such as over-saturation and over-smoothing arise from the large classifier-free guidance weight.

The framework uses a Variational Score Distillation (VSD) loss to augment the realism of the textures, and opts for a Stable Diffusion model in this phase to obtain high-resolution gradients. It keeps the tetrahedral grid fixed, so the optimization targets realistic rendering of the mesh's texture rather than its structure. During this learning stage, DreamCraft3D does not use the Zero-1-to-3 model, since it adversely affects texture quality, and the resulting inconsistent textures can compound into bizarre 3D outputs.
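
The overall bootstrapping loop can be sketched at a high level as follows; every function name here is a placeholder for illustration, not DreamCraft3D's public API.

```python
# High-level sketch of the bootstrapped texture stage: alternate between
# (a) fine-tuning a personalized diffusion model on multi-view renderings
# of the current 3D asset and (b) using that model as the prior in a
# VSD-style texture update.
def texture_boosting(render_views, finetune_diffusion, vsd_update,
                     texture_params, rounds=3, steps_per_round=1000):
    prior = None
    for _ in range(rounds):
        # (a) Render the asset from many viewpoints and adapt the diffusion
        # model to this specific instance.
        multiview_images = render_views(texture_params)
        prior = finetune_diffusion(multiview_images, init=prior)

        # (b) Optimize only the texture against the bootstrapped prior;
        # the tetrahedral geometry stays fixed in this stage.
        for _ in range(steps_per_round):
            texture_params = vsd_update(texture_params, prior)
    return texture_params
```

As the texture improves, the fine-tuned prior sees sharper renderings and in turn provides sharper guidance, which is the mutually reinforcing loop the section above describes.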

Experiments and Results

To evaluate the performance of the DreamCraft3D framework, it is compared against current state-of-the-art frameworks, and both qualitative and quantitative results are analyzed.

Comparison with Baseline Models

DreamCraft3D is compared against five state-of-the-art frameworks: DreamFusion, Magic3D, ProlificDreamer, Magic123, and Make-it-3D. The test benchmark comprises 300 input images, a mix of real-world images and images generated by Stable Diffusion. Each image in the benchmark comes with a text prompt, a predicted depth map, and an alpha mask for the foreground; the text prompts for the real images are sourced from an image captioning model.

Qualitative Analysis

The following image compares the DreamCraft3D framework with the current baseline models. As can be seen, frameworks that rely on a text-to-3D approach often face multi-view consistency issues.

On one hand, the ProlificDreamer framework offers realistic textures but falls short of generating a plausible 3D object. Image-to-3D frameworks like Make-it-3D manage to create high-quality frontal views but fail to maintain ideal geometry. The images generated by the Magic123 framework exhibit better geometric regularization, but their textures and geometric details are overly saturated and smoothed. Compared to these frameworks, DreamCraft3D, with its bootstrapped score distillation method, not only maintains semantic consistency but also improves overall imaginative diversity.

Quantitative Analysis

To generate compelling 3D content that not only resembles the input reference image but also conveys consistent semantics from various perspectives, the techniques used by DreamCraft3D are compared against the baseline models using four metrics: PSNR and LPIPS measure fidelity at the reference viewpoint, Contextual Distance assesses pixel-level congruence, and the CLIP score estimates semantic coherence. The results are shown in the following image.
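
For reference, two of these metrics follow standard formulas, sketched below; LPIPS and Contextual Distance are typically computed with their respective reference implementations.

```python
# Standard-formula sketches of PSNR (reference-view fidelity) and CLIP
# similarity (semantic coherence between an image and its prompt).
import torch
import torch.nn.functional as F

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in decibels; higher is better.
    mse = F.mse_loss(pred, target)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def clip_similarity(image_features, text_features):
    # Cosine similarity between CLIP image and text embeddings;
    # the features come from a pretrained CLIP encoder.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    return (image_features * text_features).sum(-1).mean()
```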

Conclusion

In this article, we have discussed DreamCraft3D, a hierarchical pipeline for generating 3D content. The DreamCraft3D framework leverages a state-of-the-art Text-to-Image (T2I) generative model to create a high-quality 2D image from a text prompt. This approach allows the framework to maximize the capabilities of cutting-edge 2D diffusion models in representing the visual semantics described in the prompt, while retaining the creative freedom these models offer. The generated image is then lifted into 3D through cascaded geometry sculpting and texture boosting phases, with specialized techniques applied at each stage by decomposing the problem. As a result of this approach, the DreamCraft3D framework can produce high-fidelity, consistent 3D assets with compelling textures, viewable from multiple angles.

"An engineer by profession, a writer by heart". Kunal is a technical writer with a deep love & understanding of AI and ML, dedicated to simplifying complex concepts in these fields through his engaging and informative documentation.