Efficient 3D GANs with Layered Surface Volumes | 3DV 2024

Yinghao Xu, Wang Yifan, Alexander W. Bergman, Menglei Chai, Bolei Zhou, Gordon Wetzstein

Efficient 3D Articulated Human Generation with Layered Surface Volumes

ABSTRACT

Access to high-quality and diverse 3D articulated digital human assets is crucial in various applications, ranging from virtual reality to social platforms. Generative approaches, such as 3D generative adversarial networks (GANs), are rapidly replacing laborious manual content creation tools. However, existing 3D GAN frameworks typically rely on scene representations that leverage either template meshes, which are fast but offer limited quality, or volumes, which offer high capacity but are slow to render, thereby limiting the 3D fidelity in GAN settings. In this work, we introduce layered surface volumes (LSVs) as a new 3D object representation for articulated digital humans. LSVs represent a human body using multiple textured mesh layers around a conventional template. These layers are rendered using alpha compositing with fast differentiable rasterization, and they can be interpreted as a volumetric representation that allocates its capacity to a manifold of finite thickness around the template. Unlike conventional single-layer templates that struggle with representing fine off-surface details like hair or accessories, our surface volumes naturally capture such details. LSVs can be articulated, and they exhibit exceptional efficiency in GAN settings, where a 2D generator learns to synthesize the RGBA textures for the individual layers. Trained on unstructured, single-view 2D image datasets, our LSV-GAN generates high-quality and view-consistent 3D articulated digital humans without the need for view-inconsistent 2D upsampling networks.

LSV-GAN Overview

LSV-pipeline figure
LSV-GAN pipeline. A latent code z is fed into a 2D StyleGAN2 generator network, which outputs N RGBA textures. These are applied to the individual mesh layers. All textured layers together are deformed into the target pose distribution and rendered using fast, differentiable rasterization before being fed into a camera- and body-pose-conditioned StyleGAN2 discriminator. An additional face discriminator is used but not shown.
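The rendering step above composites the N rasterized RGBA layers into a single image. Below is a minimal NumPy sketch of front-to-back alpha compositing over layer textures; the array shapes, straight (non-premultiplied) alpha, and function name are illustrative assumptions, not the paper's actual differentiable rasterizer.

```python
import numpy as np

def composite_layers(layers):
    """Front-to-back alpha compositing of N rasterized RGBA layers.

    layers: float array of shape (N, H, W, 4), ordered from the outermost
    (front) layer to the innermost (back), with straight alpha in [0, 1].
    Returns the composited RGB image of shape (H, W, 3).
    """
    H, W = layers.shape[1:3]
    rgb = np.zeros((H, W, 3))
    # Transmittance: fraction of light still passing through the layers
    # composited so far (starts fully transparent).
    transmittance = np.ones((H, W, 1))
    for layer in layers:
        color, alpha = layer[..., :3], layer[..., 3:4]
        rgb += transmittance * alpha * color
        transmittance *= (1.0 - alpha)
    return rgb
```

In the actual pipeline this operation would be differentiable (e.g. implemented with a GPU rasterizer), so gradients can flow back to the 2D generator producing the textures.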


CITATION

Yinghao Xu, Wang Yifan, Alexander W. Bergman, Menglei Chai, Bolei Zhou, Gordon Wetzstein, Efficient 3D Articulated Human Generation with Layered Surface Volumes, 3DV 2024.

@inproceedings{Xu2023:LayeredSurfaceVolumes,
author = {Xu, Yinghao and Yifan, Wang and Bergman, Alexander W. and Chai, Menglei and Zhou, Bolei and Wetzstein, Gordon},
title = {Efficient 3D Articulated Human Generation with Layered Surface Volumes},
booktitle = {3DV},
year = {2024}
}

Qualitative Comparison

We compare the results of several baselines, including GNARF, a representative implementation of StylePeople, and EVA3D, with our LSV-GAN using the AIST++ and SHHQ datasets. Our approach generates high-quality 3D humans with more detailed faces and smoother motions than the baselines.

COMPARISON ON AIST++

COMPARISON ON SHHQ


Quantitative Evaluation

Method                |   AIST++      |  DeepFashion  |     SHHQ
                      |  FID↓   PCK↑  |  FID↓   PCK↑  |  FID↓   PCK↑
ENARF (128²)          | 73.07  42.85  | 77.03  43.74  | 80.54  40.17
GNARF (512²)          | 11.13  96.11  | 33.85  97.83  | 14.84  98.96
EVA3D (512²)          | 19.40  83.15  | 15.91  87.50  | 11.99  88.95
StylePeople (512²)    | 18.97  96.96  | 17.72  98.31  | 14.67  98.58
LSV-GAN (512²)        | 17.05  98.95  | 12.02  99.47  | 11.10  99.44

Ablation

Our LSV-GAN is trained with a progressive training strategy, a face discriminator, and a regularizer for hand structure. These components contribute to higher overall quality, better facial details, and improved geometry for the hand region.

Acknowledgements

We thank Thabo Beeler, Sida Peng, Jianfeng Zhang, Fangzhou Hong, and Ceyuan Yang for fruitful discussions and comments about this work.

Related Projects

You may also be interested in related projects on 3D GANs, such as:

  • Bergman et al. GNARF, NeurIPS 2022 (link)
  • Chan et al. EG3D, CVPR 2022 (link)
  • Chan et al. pi-GAN, CVPR 2021 (link)