In the previous post, we studied the POSA representation for human-scene interaction (HSI) and the corresponding dataset for learning human-scene interaction. In this post, we look at how POSA predicts the egocentric feature map and how the framework is trained for the human-scene interaction task.
#Learning to predict egocentric feature map
Goal: learn a probabilistic function from body pose and shape to the feature space of contact and semantics. Given a body, we want to sample labelings of its mesh vertices that correspond to likely scene contacts, together with the corresponding semantic labels.
#Training a conditional Variational Autoencoder
The authors train a conditional Variational Autoencoder (cVAE) that conditions the feature map on the vertex positions $V_b$, which are themselves a function of the body pose and shape parameters.
Figure 1: cVAE architecture.
#Encoder
The encoder learns to approximate the posterior $Q(z \mid f, V_b)$:
- Input: vertex coordinates $(x_i, y_i, z_i)$, contact labels $f_c$, and semantic labels $f_s$.
- Output: a latent vector $z \sim \mathcal{N}(0, I)$, which keeps sampling simple and controllable.
The loss $\mathcal{L}_{KL}$ encourages the approximate posterior $Q(z \mid f, V_b)$ to match the prior $p(z)$:

$$\mathcal{L}_{KL} = \mathrm{KL}\big(Q(z \mid f, V_b) \;\|\; p(z)\big)$$
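For concreteness, here is a minimal PyTorch sketch of this term, assuming the encoder predicts the mean and log-variance of a diagonal Gaussian (the standard cVAE parameterization); the function names and shapes are illustrative, not taken from the POSA codebase.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients
    (the reparameterization trick)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian.
    mu, logvar: encoder outputs of shape (batch, latent_dim)."""
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
```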
- Spiral convolution [3]: since $f$ is defined on the vertices of the body mesh $M_b$, graph convolution can serve as the building block of the cVAE. It acts directly on the 3D mesh and efficiently encodes the ordering relationship between neighboring vertices; a simplified implementation is sketched below the figure.
Figure 2: Spiral sequence containing the vertices around the red-star vertex.
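The gather-and-flatten pattern behind spiral convolution is compact enough to sketch. Below is a simplified PyTorch layer in the spirit of SpiralNet++ [3], assuming the per-vertex spiral index sequences have been precomputed and truncated or padded to a fixed length; the module and argument names are my own.

```python
import torch
import torch.nn as nn

class SpiralConv(nn.Module):
    """Spiral convolution: gather each vertex's fixed-order spiral neighborhood,
    flatten the gathered features, and apply one shared linear layer."""

    def __init__(self, in_channels: int, out_channels: int, spiral_len: int):
        super().__init__()
        self.fc = nn.Linear(in_channels * spiral_len, out_channels)

    def forward(self, x: torch.Tensor, spiral_idx: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_verts, in_channels)
        # spiral_idx: (num_verts, spiral_len); row i lists vertex i followed by
        # its neighbors in a fixed spiral order (as in Figure 2)
        b, v, _ = x.shape
        gathered = x[:, spiral_idx]            # (batch, num_verts, spiral_len, in_channels)
        gathered = gathered.reshape(b, v, -1)  # concatenate features along the spiral
        return self.fc(gathered)
```

Because the spiral ordering is fixed per vertex, the layer reduces to one dense matrix multiply over the concatenated neighborhood, which is what makes it fast compared to more general graph convolutions.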
#Decoder
The decoder learns to maximize the log-likelihood by reconstructing the original per-vertex features (a toy sketch follows the list):
- Input: vertex coordinates $(x_i, y_i, z_i)$ and the latent vector $z$ from the encoder.
- Output: reconstructed contact labels $\hat{f}_c$ and reconstructed semantic labels $\hat{f}_s$.
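As a stand-in for the decoder's interface, the toy per-vertex MLP below maps vertex coordinates plus $z$ to contact and semantic logits. This is only a sketch: POSA's actual decoder is built from spiral convolutions, and `num_classes`, the hidden size, and the module name are assumptions.

```python
import torch
import torch.nn as nn

class FeatureDecoder(nn.Module):
    """Toy decoder head: per-vertex coordinates plus the latent code z in,
    contact logits (1 channel) and semantic logits (num_classes) out."""

    def __init__(self, latent_dim: int, num_classes: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + num_classes),
        )

    def forward(self, verts: torch.Tensor, z: torch.Tensor):
        # verts: (batch, num_verts, 3); z: (batch, latent_dim)
        z_tiled = z[:, None, :].expand(-1, verts.shape[1], -1)
        out = self.mlp(torch.cat([verts, z_tiled], dim=-1))
        return out[..., :1], out[..., 1:]  # contact logits, semantic logits
```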
The reconstruction loss $\mathcal{L}_{rec}$ encourages the reconstructed samples to resemble the input:

$$\mathcal{L}_{rec}(f, \hat{f}) = \lambda_c \sum_i \mathrm{BCE}(f_{c_i}, \hat{f}_{c_i}) + \lambda_s \sum_i \mathrm{CCE}(f_{s_i}, \hat{f}_{s_i})$$

Training optimizes the encoder and decoder parameters to minimize $\mathcal{L}_{total}$ using gradient descent:

$$\mathcal{L}_{total} = \alpha \mathcal{L}_{KL} + \mathcal{L}_{rec}$$
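A hedged sketch of this objective in PyTorch, with BCE on the contact channel and categorical cross-entropy on the semantic channels; the weight values are placeholders, not POSA's tuned $\lambda_c$, $\lambda_s$, $\alpha$:

```python
import torch
import torch.nn.functional as F

def total_loss(contact_logits, contact_gt, semantic_logits, semantic_gt,
               mu, logvar, lambda_c=1.0, lambda_s=1.0, alpha=1.0):
    """L_total = alpha * L_KL + L_rec. Shapes: contact_* (batch, num_verts, 1),
    semantic_logits (batch, num_verts, num_classes), semantic_gt (batch, num_verts)."""
    l_bce = F.binary_cross_entropy_with_logits(contact_logits, contact_gt)
    # cross_entropy expects (batch, num_classes, num_verts) logits and class indices
    l_cce = F.cross_entropy(semantic_logits.permute(0, 2, 1), semantic_gt)
    l_kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return alpha * l_kl + lambda_c * l_bce + lambda_s * l_cce
```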
Figure 3: Random samples from the trained cVAE.
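At test time, generating such samples needs only the prior and the decoder: draw $z \sim \mathcal{N}(0, I)$ and decode it conditioned on the body vertices $V_b$. A minimal sketch, with placeholder sizes and the (hypothetical) `FeatureDecoder` above assumed trained:

```python
import torch

torch.manual_seed(0)
num_samples, latent_dim, num_verts = 4, 256, 655  # placeholder sizes, not POSA's
z = torch.randn(num_samples, latent_dim)          # z ~ N(0, I): sample the prior
verts = torch.zeros(num_samples, num_verts, 3)    # conditioning vertices V_b

# With a trained decoder (e.g. the FeatureDecoder sketch above):
# contact_logits, semantic_logits = decoder(verts, z)
# contact = torch.sigmoid(contact_logits) > 0.5    # per-vertex contact mask
# semantics = semantic_logits.argmax(dim=-1)       # per-vertex semantic class
```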
#References
[1] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), 2019.
[2] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3D scenes by learning human-scene interaction. In Computer Vision and Pattern Recognition (CVPR), 2021.
[3] Shunwang Gong, Lei Chen, Michael Bronstein, and Stefanos Zafeiriou. SpiralNet++: A fast and highly efficient mesh convolution operator. In International Conference on Computer Vision Workshops (ICCVW), 2019.
[4] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision (ICCV), 2019.