Unsupervised Learning of Shape Programs with Repeatable Implicit Parts

NeurIPS 2022

♦ Stanford University § MIT ★ Equal contributions

Paper | Talk

Abstract: Shape programs encode shape structures by representing object parts as subroutines and constructing the overall shape by composing these subroutines. This usually involves the reuse of subroutines for repeatable parts, enabling the modeling of correlations among shape elements such as geometric similarity. However, existing learning-based shape programs suffer from limited representation capacity because they use coarse geometry representations such as geometric primitives and low-resolution voxel grids. Further, their training requires manually annotated ground-truth programs, which are expensive to obtain. We address these limitations by proposing Shape Programs with Repeatable Implicit Parts (ProGRIP). Using implicit functions to represent parts, ProGRIP greatly boosts the representation capacity of shape programs while preserving the higher-level structure of repetitions and symmetry. Meanwhile, we devise a matching-based unsupervised training objective that frees ProGRIP from the need for ground-truth program annotations. Our empirical studies show that ProGRIP outperforms existing structured representations in shape reconstruction fidelity as well as segmentation accuracy of semantic parts.
  @inproceedings{progrip2022,
      author    = {Boyang Deng and Sumith Kulal and Zhengyang Dong and Congyue Deng and Yonglong Tian and Jiajun Wu},
      title     = {Unsupervised Learning of Shape Programs with Repeatable Implicit Parts},
      booktitle = {Advances in Neural Information Processing Systems},
      year      = {2022},
  }

Video

An example of a synthesized shape program.

Fig. 1: Our method represents an object as a shape program with repeatable implicit parts (ProGRIP). The program has two levels: the top level defines a set of repeatable parts (each encoded as a latent vector $z_i$) and the bottom level defines all occurrences of each part with varying poses. The joint predictions, i.e., posed parts, are executed as posed implicit functions. Both the generation and the execution of ProGRIP are invariant to the order of predictions at both levels. ProGRIP can be learned without any annotations using our proposed matching-based unsupervised training objective.
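For concreteness, the two-level structure can be summarized as a small data structure. This is a minimal sketch in PyTorch; the class and field names are illustrative, not an API from the paper:

    from dataclasses import dataclass
    import torch

    @dataclass
    class RepeatablePart:
        """Top level: one reusable part, shared by all of its occurrences."""
        s: torch.Tensor  # scale
        z: torch.Tensor  # deep latent code for the part geometry, shape (d,)

    @dataclass
    class PartOccurrence:
        """Bottom level: one posed occurrence of a repeatable part."""
        t: torch.Tensor      # translation, shape (3,)
        R: torch.Tensor      # rotation matrix, shape (3, 3)
        delta: torch.Tensor  # existence probability, scalar in [0, 1]

    # A ProGRIP is an unordered set of parts, each with an unordered set of poses;
    # both generation and execution are invariant to the ordering of these sets.
    ProGRIP = list[tuple[RepeatablePart, list[PartOccurrence]]]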


An overview of the ProGRIP generation pipeline.

Fig. 2: Given a point cloud, an auto-encoding architecture generates a ProGRIP composed of two levels of predictions. At the geometry level, our model predicts a set of $(s_i, z_i)$ pairs as the scales and deep latents of repeatable parts (middle); at the pose level, our model predicts a set of $(t_{i,j}, R_{i,j}, \delta_{i,j})$ triplets as translations, rotations, and existence probabilities (right). Transformers are used at both levels for permutation-invariant predictions.
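A minimal sketch of this two-level set prediction, assuming learned queries and standard transformer decoders (module names, dimensions, and the 6D rotation head are our illustration, not the authors' exact architecture):

    import torch
    import torch.nn as nn

    class ProGRIPDecoder(nn.Module):
        """Two-level set prediction: geometry (s_i, z_i), then poses (t_ij, R_ij, delta_ij)."""
        def __init__(self, d_model=256, n_parts=8, n_poses=4, d_latent=128):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.geom_queries = nn.Parameter(torch.randn(n_parts, d_model))
            self.pose_queries = nn.Parameter(torch.randn(n_poses, d_model))
            self.geom_decoder = nn.TransformerDecoder(layer, num_layers=4)
            self.pose_decoder = nn.TransformerDecoder(layer, num_layers=4)
            self.to_scale = nn.Linear(d_model, 1)
            self.to_latent = nn.Linear(d_model, d_latent)
            self.to_pose = nn.Linear(d_model, 3 + 6 + 1)  # translation, 6D rotation, existence

        def forward(self, shape_tokens):  # (B, N, d_model) tokens from a point cloud encoder
            B = shape_tokens.shape[0]
            # Geometry level: an unordered set of (scale, latent) pairs, one per part.
            g = self.geom_decoder(self.geom_queries.expand(B, -1, -1), shape_tokens)
            s, z = self.to_scale(g).exp(), self.to_latent(g)  # (B, P, 1), (B, P, d_latent)
            # Pose level: condition the pose queries on each part embedding.
            q = (self.pose_queries + g.unsqueeze(2)).flatten(1, 2)  # (B, P*M, d_model)
            p = self.pose_decoder(q, shape_tokens).unflatten(1, (g.shape[1], -1))
            t, r6, logit = self.to_pose(p).split([3, 6, 1], dim=-1)  # each (B, P, M, .)
            return s, z, t, r6, logit.sigmoid()  # r6 -> rotation matrix via Gram-Schmidt

Because the queries are an unordered set processed by attention, the predictions at both levels are permutation invariant, matching the set semantics of Fig. 1.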


We represent each shape as a collection of posed implicit functions.

Fig. 3: We execute each posed part (i.e., an $(s_i, z_i, t_{i,j}, R_{i,j}, \delta_{i,j})$ tuple) as a posed implicit function. A posed implicit function constructs an occupancy function $o_{i,j}$ to answer point queries $x$. For each query point $x$, we first canonicalize it using $(s_i, t_{i,j}, R_{i,j})$, then predict its occupancy conditioned on the part latent $z_i$, and finally mask it by the binarized existence $\delta_{i,j}$.
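In code, executing one posed part might look like the following minimal sketch; the exact order of the inverse pose transform and the generic occ_net(points, latent) decoder are our assumptions:

    import torch

    def posed_occupancy(x, s, z, t, R, delta, occ_net, threshold=0.5):
        """Evaluate one posed part's occupancy o_{i,j} at query points x of shape (N, 3)."""
        # Canonicalize: undo the translation and rotation, then the scale.
        # For a rotation matrix R, (x - t) @ R applies R^{-1} to each row.
        x_canonical = (x - t) @ R / s
        occ = occ_net(x_canonical, z)       # occupancy in [0, 1], conditioned on latent z
        mask = (delta > threshold).float()  # binarized existence
        return occ * mask

The full shape is then the union of all posed parts, e.g., the pointwise maximum of their occupancies.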


We propose an unsupervised matching-based loss.

Fig. 4: On a task of fitting repeatable parts to a target shape, starting from the same initialization, the reconstruction loss (left) confines each posed part to its initial local neighborhood and consequently prevents better part arrangements. Conversely, our matching loss (right) matches posed parts to the target's local geometry by shape, rescuing the part arrangement from the suboptimal local minimum that traps the reconstruction loss.
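As a rough illustration of the idea (not the paper's exact objective), one can match predicted parts to target part geometry with a Chamfer-style cost and Hungarian assignment; the cost definition here is a stand-in for the paper's formulation:

    import torch
    from scipy.optimize import linear_sum_assignment

    def chamfer(a, b):
        """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
        d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

    def matching_loss(pred_parts, target_parts):
        """Match each predicted posed part to target local geometry by shape,
        then penalize matched pairs; matching lets parts move across the shape
        instead of being confined to their initial neighborhoods."""
        cost = torch.stack([torch.stack([chamfer(p, q) for q in target_parts])
                            for p in pred_parts])  # (P, Q) cost matrix
        rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
        return cost[rows, cols].mean()  # gradients flow through the matched costs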


Qualitative visualization of reconstruction by ProGRIP.

Fig. 5: Colors indicate the repeatable parts predicted by ProGRIP. Bottom: an exploded view with the parts pulled apart.


Qualitative visualization of semantic segmentation using ProGRIP.

Fig. 6: Each color indicates a different semantic part predicted by ProGRIP.


Qualitative visualization of interactive editing enabled by ProGRIP.

Fig. 7: A demonstration of interactive shape editing. For the two chairs shown, ProGRIP enables transferring the arms from the reference chair to the target chair. This is as simple as selecting the arms in each chair with a single click; we then replace the arm latents in the target chair with those from the reference chair to obtain the output shown. Note how the orientation of the arms is correctly updated, thanks to our disentanglement of shape and pose.
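Under the data-structure sketch above, this edit amounts to a latent swap (the indices are whatever the user's two clicks select):

    # Replace the target chair's arm latent with the reference chair's; the
    # occurrence poses (t, R) stay untouched, so the transferred arms inherit
    # the target chair's orientation automatically.
    ref_part, _ = reference_program[ref_arm_idx]
    tgt_part, tgt_poses = target_program[tgt_arm_idx]
    tgt_part.z = ref_part.z.clone()
    tgt_part.s = ref_part.s.clone()  # optionally carry over the part scale too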


Acknowledgements: We'd like to thank Pratul Srinivasan for suggesting the compactness application of ProGRIP, Kaizhi Yang for kindly open-sourcing the code of CubeSeg, the Occupancy Networks team for releasing their pre-processed occupancy samples, and the Stanford Artificial Intelligence Laboratory (SAIL) staff for maintaining and supporting our computational infrastructure. This work is in part supported by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), Center for Integrated Facility Engineering (CIFE), Toyota Research Institute (TRI), NSF RI #2211258, NSF CCRI #2120095, ONR MURI N00014-22-1-2740, and Adobe, Amazon, Analog, Autodesk, IBM, JPMC, Meta, Salesforce, and Samsung. BD is funded by a Meta Research PhD Fellowship. SK is in part supported by the Brown Institute for Media Innovation.