# Populating 3D Scenes by Learning Human Scene Interaction (Part 2).

Learning Human Scene Interaction - The training pipeline.

In the previous post, we studied the POSA representation for human-scene interaction (HSI) and the dataset used to learn it. In this post, we look at how the egocentric feature map is predicted and how the POSA framework is trained for the human-scene interaction task.

## Learning to predict the egocentric feature map

Goal: learn a probabilistic function from body pose and shape to the feature space of contact and semantics. Given a body, we want to sample labelings of the vertices that correspond to likely scene contacts, together with the semantic label of each contact.

## Training a conditional Variational Autoencoder

We train a conditional Variational Autoencoder (cVAE) in which the feature map is conditioned on the vertex positions, $V_b$, which are a function of the body pose and shape parameters.

Figure 1: cVAE architecture.

## Encoder

The encoder learns the approximate posterior $Q(z| f, V_b)$:

• Input: vertex coordinates $(x_i,y_i,z_i)$, contact label $f_c$ and semantic label $f_s$.
• Output: latent vector $z$. The KL loss pulls its distribution toward the prior $N(0,I)$, so that new samples can be drawn simply from a standard normal.

The loss $\mathcal{L}_{KL}$ encourages the approximate posterior $Q(z|f,V_b)$ to match the prior distribution $p(z)$:

$\mathcal{L}_{KL} = KL(Q(z|f, V_b)\ || \ p(z))$
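To make the KL term concrete, here is a minimal sketch in plain Python. It assumes that the encoder outputs a mean and a log-variance per latent dimension (a common VAE parameterization, not something stated in this post) and that the prior is $N(0,I)$. The function names `kl_to_standard_normal` and `reparameterize` are made up for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).

    Assumption: the posterior is a diagonal Gaussian described by a mean
    and a log-variance for every latent dimension.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

# A posterior that already equals the prior has zero KL.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Moving the mean away from zero is penalized.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5

# Sampling a latent vector from a hypothetical encoder output.
z = reparameterize([0.3, -0.1], [0.0, -1.0], random.Random(0))
```

Writing the sample as `mu + sigma * eps` keeps the randomness in `eps`, which is what allows gradients to flow back to the encoder outputs in an autodiff framework.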

• Spiral convolution: since $f$ is defined on the vertices of the body mesh $M_b$, graph convolution can be used as the building block of the VAE. Spiral convolution acts directly on the 3D mesh and, because the neighbors of each vertex are visited in a fixed spiral order, it encodes the ordering relationship between nodes efficiently.

Figure 2: A spiral sequence contains the vertices around the red star vertex.
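The idea behind a spiral convolution can be sketched in a few lines of plain Python. This is a simplified illustration, not the POSA implementation: it assumes the spiral sequences have already been computed from the mesh connectivity and truncated to a fixed length, and the toy spirals below are written by hand rather than derived from a real mesh.

```python
def spiral_conv(features, spirals, weight, bias):
    """Apply one spiral convolution layer (no activation).

    features: per-vertex feature lists, shape [num_vertices][in_dim]
    spirals:  per-vertex index lists, all of the same fixed length
    weight:   shape [out_dim][spiral_len * in_dim], shared by all vertices
    bias:     shape [out_dim]
    """
    out = []
    for spiral in spirals:
        # Concatenate neighbor features in spiral order. The fixed order is
        # what lets one shared weight matrix behave like a convolution kernel.
        stacked = [value for idx in spiral for value in features[idx]]
        out.append([
            sum(w * x for w, x in zip(row, stacked)) + b
            for row, b in zip(weight, bias)
        ])
    return out

# Toy "mesh": 4 vertices with 1-D features and hand-written spirals of length 3.
features = [[1.0], [2.0], [3.0], [4.0]]
spirals = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
weight = [[1.0, 0.5, 0.25]]  # one output channel
bias = [0.0]

result = spiral_conv(features, spirals, weight, bias)
print(result)  # [[2.75], [4.5], [5.25], [5.0]]
```

Each spiral starts at the center vertex itself, so the first weight always multiplies the vertex's own feature and later weights multiply neighbors that are progressively further along the spiral.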

## Decoder

The decoder learns to maximize the log-likelihood by reconstructing the original per-vertex features:

• Input: vertex coordinates $(x_i,y_i,z_i)$ and the latent vector $z$ from the encoder
• Output: reconstructed contact label $\hat{f_c}$ and reconstructed semantic label $\hat{f_s}$

The reconstruction loss $\mathcal{L}_{rec}$ encourages the reconstructed samples to resemble the input:

$\mathcal{L}_{rec}(f,\hat{f}) = \lambda_c \cdot \sum_{i}BCE(f_c^i, \hat{f_c}^i) + \lambda_s \cdot \sum_{i}CCE(f_s^i, \hat{f_s}^i)$
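As a worked illustration of the reconstruction loss, the sketch below evaluates the binary cross-entropy (BCE) on the contact labels and the categorical cross-entropy (CCE) on the semantic labels for a body with two vertices. The weights `lambda_c = lambda_s = 1.0`, the clipping constant `eps` and the example numbers are assumptions chosen for illustration; the post does not give the values used in training.

```python
import math

def bce(target, prob, eps=1e-7):
    """Binary cross-entropy for one vertex; prob is the predicted contact probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def cce(target_class, probs, eps=1e-7):
    """Categorical cross-entropy for one vertex; probs sums to 1 over semantic classes."""
    return -math.log(max(probs[target_class], eps))

def reconstruction_loss(f_c, f_c_hat, f_s, f_s_hat, lambda_c=1.0, lambda_s=1.0):
    """Weighted sum over vertices of the contact (BCE) and semantic (CCE) terms."""
    contact = sum(bce(t, p) for t, p in zip(f_c, f_c_hat))
    semantic = sum(cce(t, p) for t, p in zip(f_s, f_s_hat))
    return lambda_c * contact + lambda_s * semantic

# Two vertices: the first is in contact, the second is not.
f_c = [1, 0]
f_c_hat = [0.9, 0.2]
# Ground-truth semantic class ids and predicted class probabilities (3 classes).
f_s = [2, 0]
f_s_hat = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]

loss = reconstruction_loss(f_c, f_c_hat, f_s, f_s_hat)
```

Here the loss equals $-(\ln 0.9 + \ln 0.8 + \ln 0.8 + \ln 0.7)$, since each term reduces to the negative log-probability assigned to the correct label.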

Training optimizes the encoder and decoder parameters to minimize $\mathcal{L}_{total}$ using gradient descent:

$\mathcal{L}_{total} = \alpha \cdot \mathcal{L}_{KL} + \mathcal{L}_{rec}$

Figure 3: Random samples from trained cVAE.
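Once training is done, random samples like the ones in Figure 3 come from the decoder alone: draw $z$ from $N(0,I)$ and decode it together with the body vertices. The sketch below shows only this sampling flow. `toy_decoder` is a placeholder written for this example and is not the POSA network, and the latent size of 4 is an arbitrary choice.

```python
import math
import random

def sample_feature_map(decoder, vertices, latent_dim, rng):
    """Draw z ~ N(0, I) and decode it, conditioned on the body vertices."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return decoder(vertices, z)

def toy_decoder(vertices, z):
    """Placeholder decoder (NOT the POSA network).

    Maps each vertex to a contact probability using a logistic function of
    the vertex height and the first latent dimension, only so that the
    sampling code has something to call.
    """
    return [1.0 / (1.0 + math.exp(v[2] - z[0])) for v in vertices]

# Three hypothetical body vertices at increasing height.
vertices = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.5), (0.0, 0.2, 1.0)]
rng = random.Random(0)

# Each call uses a fresh z, so the same body yields different feature maps.
samples = [sample_feature_map(toy_decoder, vertices, latent_dim=4, rng=rng) for _ in range(3)]
```

The body stays fixed across the three calls; only the latent vector changes, which is why a single pose can produce several plausible contact labelings.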