Models trained with 2D data can also generate 3D images

Models trained only on 2D data can also generate 3D images: enter a simple text prompt, and out comes a 3D model. How capable is this "AI painter"? Judge the results for yourself.

The 3D models it generates have both density and color.

And they can be rendered under different lighting conditions.

Not only that, it can even fuse multiple generated 3D models into one scene.

What's more, the resulting 3D models can be exported as meshes for further processing in modeling software. It is essentially a souped-up NeRF, and this AI artist, called DreamFusion, is the latest work from Google Research.

Does the name DreamFusion sound familiar? That's right: it echoes Dream Fields! Not long ago, a Chinese developer open-sourced an AI painting program based on that model, and DreamFusion evolved from Dream Fields. So what changed between Dream Fields and DreamFusion to produce such a huge leap?

Diffusion models are key

In a word, the biggest difference between DreamFusion and Dream Fields is how the loss is computed. DreamFusion replaces CLIP with a new way of computing the loss: the loss comes from Imagen, a text-to-image diffusion model. Diffusion models should be familiar to everyone by now this year. DreamFusion is driven by a diffusion model trained on billions of image-text pairs, making it, in effect, a NeRF optimized by a diffusion model, which is remarkable just to think about.

However, using a diffusion model directly for 3D synthesis would require a large-scale labeled 3D dataset and an efficient architecture for denoising 3D data, neither of which currently exists, so another route was needed. In this work, the researchers cleverly sidestep these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis. Specifically, the Imagen diffusion model computes the loss during 3D generation, and the 3D model is optimized against it. So how is the loss computed?
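The overall loop can be sketched in a few lines. Everything below is a toy stand-in of my own construction, not the paper's code: `render` plays the role of the NeRF renderer, and `denoise` plays the role of the frozen Imagen model, here rigged to pull pixels toward a fixed "prompt-matching" image of 0.5s.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(theta):
    """Toy stand-in for a NeRF renderer: maps scene parameters
    theta to a flat vector of 'pixel' values in (-1, 1)."""
    return np.tanh(theta)

def denoise(noisy_image, t):
    """Toy stand-in for a frozen, text-conditioned 2D diffusion
    model (Imagen in the paper). Its noise prediction nudges
    pixels toward a fixed 'prompt-matching' image of 0.5s."""
    return noisy_image - 0.5

theta = rng.normal(size=16) * 0.1   # randomly initialized "3D scene"
lr = 0.1
for step in range(300):
    image = render(theta)
    grad = np.zeros_like(theta)
    for _ in range(8):                       # average over noise draws
        t = rng.uniform(0.02, 0.98)          # random diffusion timestep
        eps = rng.normal(size=image.shape)   # injected noise
        noisy = image + np.sqrt(t) * eps     # simplified forward process
        eps_hat = denoise(noisy, t)          # frozen model, forward only
        # SDS-style update: (eps_hat - eps) pulled back through the
        # renderer (d tanh/d theta = 1 - tanh^2), NOT through denoise().
        grad += (eps_hat - eps) * (1.0 - image**2) / 8
    theta -= lr * grad

print(render(theta).round(2))  # pixels settle near the "prompt" image
```

Only the scene parameters `theta` are ever updated; the "diffusion model" stays frozen throughout, which mirrors the division of labor described above.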

A key ingredient is a new sampling method the researchers introduce: Score Distillation Sampling (SDS), which samples in parameter space rather than pixel space. Because the parameterization constrains the output, this method gives good control over the quality of the generated image (right side of the figure below).

Here, Score Distillation Sampling expresses the loss of the generation process, and minimizing this loss through continuous optimization yields a good-quality 3D model. It is worth noting that as DreamFusion generates images, the internal parameters are optimized and effectively serve as training samples for the diffusion model; the optimized parameters have multi-scale characteristics, which benefits subsequent image generation. The diffusion model also brings another important advantage: backpropagation through the diffusion model itself is not required, because the model directly predicts the update direction.
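That last point, that the diffusion model is only ever called forward, can be made concrete. The function below is a hedged sketch of an SDS-style gradient, with a hypothetical linear renderer and linear denoiser chosen only so the result is easy to check by hand; the structure (no gradient ever flows through `denoise`) is the point.

```python
import numpy as np

def sds_grad(theta, render, render_jac, denoise, t, eps, w=1.0):
    """Score-distillation-style gradient: w(t) * (eps_hat - eps)
    pulled back through the renderer only. The diffusion model is
    used as a pure forward pass; its own Jacobian is never needed."""
    x = render(theta)                  # rendered image
    x_t = x + np.sqrt(t) * eps         # simplified forward diffusion
    eps_hat = denoise(x_t, t)          # forward call, treated as constant
    return w * render_jac(theta).T @ (eps_hat - eps)

# Hypothetical linear renderer/denoiser so the output is checkable.
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
g = sds_grad(
    theta=np.array([1.0, 1.0]),
    render=lambda th: A @ th,
    render_jac=lambda th: A,
    denoise=lambda x, t: 0.5 * x,      # "predicted noise" = half the input
    t=0.25,
    eps=np.zeros(3),
)
print(g)  # A.T @ (0.5 * A @ [1, 1]) = [1.5, 3.0]
```

Because `eps_hat` is treated as a constant, `denoise` could just as well be a remote API call: one forward evaluation per step is all the optimization needs.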

Three of the four researchers (Ben Poole, Jon Barron, and Ben Mildenhall) are from Google Research; the fourth is a doctoral student at the University of California, Berkeley. Google Research is the department within Google that conducts various state-of-the-art technology research. It also maintains its own open-source projects, publicly available on GitHub.

Their mantra: "Our team aspires to make discoveries that affect everyone, and at the heart of our approach is sharing our research and tools to advance the field." The first author, Ben Poole, holds a Ph.D. in neuroscience from Stanford University and is a researcher at Google Brain. His current research focuses on using generative models to improve algorithms for unsupervised and semi-supervised learning.
