People constantly interact with their surroundings, and these interactions carry semantics: each can be described as a combination of an action and an object instance. These interactions are increasingly diverse. Understanding them is crucial for developing intelligent systems that can interact effectively with humans in contexts such as virtual environments, robotics, and human-computer interfaces. Moreover, the semantics of these interactions help researchers understand how humans make contact with different environments through their pose and body, and capture meaningful scene semantics.

## Introduction

Humans constantly interact with 3D space, and these interactions involve semantically meaningful physical contact between surfaces. It is therefore important to learn how humans interact with scenes and to study the applications of that knowledge.

Despite the importance of the interactions, existing representations of the human body do not explicitly represent, support, or capture them.

The SMPL-X model [1] can represent the shape and pose of people. This representation also includes the hands and face, and it supports reasoning about contact between the body and the world. However, some challenges remain:

- SMPL-X does not explicitly model contact.
- Not all parts of the body surface are equally likely to be in contact with the scene.
- The poses of body and scene semantics are highly intertwined.

POSA [2] (**P**ose with pr**O**ximitie**S** and cont**A**cts) leverages SMPL-X to capture contact and the semantics of Human-Scene Interactions (HSI) in a body-centric representation. POSA targets two challenging problems:

- Automatic scene population: Given a 3D scene and a body in a particular pose, where in the scene is this pose most likely?
- Monocular 3D human pose estimation in a 3D scene.

## Dataset

**Training data:** The PROX-E [3] dataset was used for this task. It is a set of $n$ pairs of 3D meshes

$\mathcal{M} = \{(M_{b,i}, M_{s,i})\}_{i=1}^{n}$

comprising body meshes $M_{b,i}$ and scene meshes $M_{s,i}$, where $i$ indexes the pairs in $\mathcal{M}$.

- $M_b = (V_b, F_b)$: body mesh with a fixed topology of $N_b = |V_b| = 10475$ vertices $V_b \in \mathbb{R}^{N_b \times 3}$ and body mesh faces $F_b$.
- $M_s = (V_s, F_s, L_s)$: scene mesh with a varying number of vertices $N_s = |V_s|$, triangle connectivity $F_s$ to model arbitrary scenes, and per-vertex semantic labels $L_s$ (e.g., chair, bed, sofa, ...).
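The two mesh types above can be sketched as plain data containers. This is a minimal illustration using only the Python standard library; the class names `BodyMesh` and `SceneMesh` and their field names are hypothetical and are not taken from PROX-E or POSA code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vertex = Tuple[float, float, float]  # one 3D point
Face = Tuple[int, int, int]          # triangle as three vertex indices

N_B = 10475  # number of body vertices, as stated in the text


@dataclass
class BodyMesh:
    """M_b = (V_b, F_b): body mesh with a fixed topology."""
    vertices: List[Vertex]  # V_b, must contain exactly N_B points
    faces: List[Face]       # F_b

    def __post_init__(self) -> None:
        if len(self.vertices) != N_B:
            raise ValueError(
                f"expected {N_B} body vertices, got {len(self.vertices)}"
            )


@dataclass
class SceneMesh:
    """M_s = (V_s, F_s, L_s): scene mesh with per-vertex semantic labels."""
    vertices: List[Vertex]  # V_s, any number N_s of points
    faces: List[Face]       # F_s
    labels: List[int]       # L_s, one object-class index per vertex

    def __post_init__(self) -> None:
        if len(self.labels) != len(self.vertices):
            raise ValueError("need exactly one semantic label per scene vertex")
```

The checks in `__post_init__` encode the two constraints from the list: the body vertex count is fixed, while the scene vertex count is free but must match the number of labels.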

**Figure 1**: PROX-E Dataset.

Human meshes are represented by the SMPL-X model, i.e., a differentiable function $M(\theta, \beta, \psi) : \mathbb{R}^{|\theta| \times |\beta| \times |\psi|} \rightarrow \mathbb{R}^{N_b \times 3}$ parameterized by pose $\theta$, shape $\beta$, and facial expressions $\psi$.

The pose vector $\theta = (\theta_b, \theta_f, \theta_{lh}, \theta_{rh})$ comprises body parameters $\theta_b \in \mathbb{R}^{66}$ and face parameters $\theta_f \in \mathbb{R}^9$ in axis-angle representation, together with $\theta_{lh}, \theta_{rh} \in \mathbb{R}^{12}$, which parameterize the poses of the left and right hands, respectively, in a low-dimensional pose space.

The shape parameters, $\beta \in \mathbb{R}^{10}$, represent coefficients in a low-dimensional shape space learned from a large corpus of human body scans.

The joints, $J(\beta)$, of the body in the canonical pose are regressed from the body shape.
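As a rough illustration of the parameter layout described above, the sketch below splits a flat pose vector into its four parts. Storing $\theta$ as one flat list in the order $(\theta_b, \theta_f, \theta_{lh}, \theta_{rh})$ is an assumption made for this example, and `split_pose` is a hypothetical helper, not part of any SMPL-X library.

```python
from typing import Dict, List

# Dimensions of each pose component, as given in the text.
POSE_DIMS = {"body": 66, "face": 9, "left_hand": 12, "right_hand": 12}
SHAPE_DIM = 10  # number of shape coefficients in beta


def split_pose(theta: List[float]) -> Dict[str, List[float]]:
    """Split a flat pose vector into body, face, and hand parts.

    Assumes the components are concatenated in the order
    theta_b, theta_f, theta_lh, theta_rh (an assumption of this sketch).
    """
    expected = sum(POSE_DIMS.values())
    if len(theta) != expected:
        raise ValueError(f"expected {expected} pose values, got {len(theta)}")
    parts: Dict[str, List[float]] = {}
    start = 0
    for name, dim in POSE_DIMS.items():
        parts[name] = theta[start:start + dim]
        start += dim
    return parts


theta = [0.0] * sum(POSE_DIMS.values())  # a zero pose with 66 + 9 + 12 + 12 = 99 values
parts = split_pose(theta)
print({name: len(values) for name, values in parts.items()})
```

With these dimensions, a full pose has 99 values, of which the 24 hand values live in the low-dimensional hand pose space rather than in axis-angle form.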

## Methodology: POSA Representation for HSI

POSA encodes the relationship between the human mesh $M_b$ and the scene mesh $M_s$ in an **egocentric** feature map $f$ that encodes per-vertex features on the SMPL-X mesh $M_b$:

$f : (V_b, M_s) \rightarrow [f_c, f_s]$, where $f_c$ is the contact label and $f_s$ is the semantic label of the contact point. $N_f$ denotes the feature dimension.

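**Figure 2**: Illustration of the proposed representation.

For each vertex $V_b^i$ on the body, find its closest scene point $P_s$:

$P_s = \arg\min_{P \in M_s} \lVert P - V_b^i \rVert$

The distance $f_d$ is then calculated as:

$f_d = \lVert P_s - V_b^i \rVert \in \mathbb{R}$

Given $f_d$, determine whether vertex $V_b^i$ is in contact with the scene by comparing $f_d$ with a constant contact threshold:

$f_c = \begin{cases} 1 & \text{if } f_d \le \text{contact threshold} \\ 0 & \text{otherwise} \end{cases}$

The semantic label of the contacted surface, $f_s$, is a one-hot encoding of the object class:

$f_s \in \{0, 1\}^{N_o}$

where $N_o$ is the number of object classes.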
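The per-vertex steps above (closest scene point, distance, threshold test, one-hot label) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the POSA implementation: the closest scene point is approximated by the nearest scene vertex, the default threshold of `0.05` (in scene units) is a placeholder value, the semantic label is recorded for every body vertex regardless of contact, and the function name `contact_features` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_features(
    body_vertices: Sequence[Point],
    scene_vertices: Sequence[Point],
    scene_labels: Sequence[int],
    num_classes: int,
    contact_threshold: float = 0.05,  # placeholder value, an assumption
) -> List[Tuple[int, List[int]]]:
    """Return one (f_c, f_s) pair per body vertex.

    Brute-force search over scene vertices; the nearest scene vertex
    stands in for the closest scene point (an approximation).
    """
    features = []
    for v in body_vertices:
        # Index of the scene vertex nearest to this body vertex.
        j = min(
            range(len(scene_vertices)),
            key=lambda k: math.dist(v, scene_vertices[k]),
        )
        f_d = math.dist(v, scene_vertices[j])        # distance to the scene
        f_c = 1 if f_d <= contact_threshold else 0   # binary contact label
        f_s = [0] * num_classes                      # one-hot object class
        f_s[scene_labels[j]] = 1
        features.append((f_c, f_s))
    return features


# Toy example: two scene vertices with classes 0 and 2, two body vertices.
scene = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
labels = [0, 2]
body = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.5)]
print(contact_features(body, scene, labels, num_classes=3))
```

The brute-force search costs $O(N_b \cdot N_s)$ distance evaluations; for real scene meshes a spatial index such as a KD-tree, or a true point-to-triangle distance, would likely be preferable.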