In the previous blog, we explored several widely known 2D Human Pose Estimation (HPE) algorithms, covering single-person and multi-person methods, how they work, their fundamental components, and how to train them properly. Although these methods give impressive results, the reconstructed human pose lives only in 2D image coordinates. This 2D pose information does not fully describe human motion in the real world and is therefore insufficient for many downstream applications, ranging from motion transfer and analysis to animation and robotics. In this blog, we study some of the existing 3D Human Pose Estimation methods, which aim to predict the locations of body joints in 3D space. The 3D estimation problem has attracted much interest in recent years because it provides rich 3D structural information about the human body.

Similar to 2D HPE, we first consider 3D estimation for a single person. For 3D human pose, there are two key approaches. The first covers methods that are only interested in the human skeleton, which can be represented as a set of 3D coordinates of body joints. This is called the model-free approach. The second approach studies the body shape of the human in addition to the body pose. These methods can reconstruct the 3D mesh of the full human body. The reconstruction is commonly based on a statistical 3D body model, which is learned from real 3D scans of humans and parameterized in a low-dimensional space.

**Figure 1**: Illustration of direct estimation and 2D-to-3D lifting method.

Recent model-free methods include (see Figure 1 for illustration):

- Direct estimation: These methods directly infer the 3D coordinates of body joints from the input image(s).
- 2D-to-3D lifting: These methods first leverage an existing 2D pose estimation system to extract the 2D joints, then map the 2D coordinates into 3D coordinates.

## Direct estimation

Deep learning approaches can provide superior performance compared to traditional approaches. However, 3D HPE from images is a highly under-constrained problem (e.g., there is depth ambiguity, where many 3D pose configurations project to exactly the same 2D pose in the image), so naively regressing the 3D coordinates from a 2D image may not yield desirable results in practice. In the following, we present some of the recent interesting methods for regressing 3D pose from images.

### Volumetric prediction

The problem of 3D human pose estimation using ConvNets has primarily been approached as a coordinate regression problem. In this case, the target of the network is a long vector that concatenates the coordinates of all $J$ joints of the human body. To train deep networks, a standard regression loss can be utilized:

$$\mathcal{L}_{reg} = \sum_{j=1}^{J} \lVert \mathbf{x}_j - \mathbf{x}^*_j \rVert_2^2 \tag{1}$$

where $\mathbf{x}_j$ and $\mathbf{x}^*_j \in \mathbb{R}^3$ are the predicted and ground-truth 3D coordinates for the $j$-th body joint, respectively. In practice, the global joint position (i.e. the world coordinate) is factored out by subtracting the root location from each joint location, which results in root-relative joint coordinates. Despite being simple, the straightforward direct regression approach has many potential issues. The tremendous complexity of the human pose makes the regression problem highly non-linear and difficult to train, because a single coordinate vector provides little supervision information.
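
As a rough illustration of the root-relative coordinates and the regression loss described above, here is a minimal NumPy sketch. The function names, the choice of joint 0 as the root, and the use of the squared Euclidean distance are assumptions made for this example, not details taken from a specific implementation.

```python
import numpy as np

def root_relative(joints, root_index=0):
    """Express every joint relative to the root joint (assumed to be index 0)."""
    joints = np.asarray(joints, dtype=float)
    return joints - joints[root_index]

def coordinate_regression_loss(pred, target):
    """Sum over joints of the squared Euclidean distance (assumed loss form)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sum((pred - target) ** 2))

# Toy pose with three joints given in world coordinates.
gt_world = np.array([[1.0, 2.0, 3.0],
                     [1.5, 2.0, 3.0],
                     [1.0, 3.0, 3.5]])
gt_rel = root_relative(gt_world)

# A hypothetical prediction that is slightly off for two of the joints.
pred_rel = gt_rel + np.array([[0.0, 0.0, 0.0],
                              [0.1, 0.0, 0.0],
                              [0.0, 0.0, -0.2]])
loss = coordinate_regression_loss(pred_rel, gt_rel)
print(loss)
```

After the subtraction the root joint sits at the origin, so the network only has to explain the pose and not where the person stands in the world.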

**Figure 2**: Illustration of predicted volumetric representation (Source).

Inspired by the advantages of the heatmap representation in 2D HPE (see previous blog), Pavlakos *et al.* [1] proposed to use a volumetric representation for 3D human keypoints. Specifically, each joint can be represented as a 3D volumetric heatmap of size $W \times H \times D$. The value of each volume pixel (or voxel) corresponds to the likelihood of that voxel containing the joint. Then, similar to the 2D heatmap, the target $\mathbf{G}_j$ for each joint can be generated by applying a 3D Gaussian kernel with standard deviation $\sigma$, centered at the ground-truth joint location $\mathbf{x}^*_j = (x^*_j,y^*_j,z^*_j)$ in the 3D grid:

$$\mathbf{G}_j(x,y,z) = \frac{1}{(2\pi)^{3/2}\sigma^3}\exp\left(-\frac{(x-x^*_j)^2+(y-y^*_j)^2+(z-z^*_j)^2}{2\sigma^2}\right) \tag{2}$$

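To train the model, the $L2$ loss between the predicted volume $\mathbf{V}_j$ and the target volume $\mathbf{G}_j$ of each joint can be adopted:

$$\mathcal{L}_{vol} = \sum_{j=1}^{J}\sum_{x,y,z} \left(\mathbf{V}_j(x,y,z) - \mathbf{G}_j(x,y,z)\right)^2 \tag{3}$$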

The key benefit of the volumetric representation is that it reduces the highly non-linear problem of directly regressing 3D joints to a localization problem in a discretized space. This provides rich supervision about the target to train the network. Moreover, the heatmap representation preserves the spatial structure and can take full advantage of Fully Convolutional Networks. However, one major problem is that the 3D space must be discretized, and representing the 3D volume requires a lot of memory because it adds a depth dimension to the image width and height. The accuracy also depends on the size of the volume: a low resolution may degrade performance.
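
To make the volumetric target more concrete, the following NumPy sketch builds a 3D Gaussian target around a ground-truth voxel and compares a predicted volume against it with an $L2$ loss. The function names, the default `sigma`, and the choice to leave the Gaussian unnormalized (so the peak value is 1) are assumptions for illustration only.

```python
import numpy as np

def gaussian_volume(center, shape=(64, 64, 64), sigma=2.0):
    """Build a W x H x D target volume with a Gaussian blob around `center`.

    `center` is (x, y, z) in voxel units. The blob is left unnormalized
    (peak value 1), which is a simplifying assumption of this sketch.
    """
    w, h, d = shape
    xs, ys, zs = np.meshgrid(np.arange(w), np.arange(h), np.arange(d),
                             indexing="ij")
    cx, cy, cz = center
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2 + (zs - cz) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def volume_l2_loss(pred_volume, target_volume):
    """Sum of squared per-voxel differences."""
    return float(np.sum((pred_volume - target_volume) ** 2))

# Target for one joint located at voxel (10, 20, 5) in a small 32 x 32 x 16 grid.
target = gaussian_volume((10, 20, 5), shape=(32, 32, 16))
peak = np.unravel_index(np.argmax(target), target.shape)
print(peak)

# A prediction of all zeros is penalized at every voxel covered by the blob.
print(volume_l2_loss(np.zeros_like(target), target))
```

Every voxel contributes to the loss, which is the "rich supervision" mentioned above, but it is also why memory grows quickly with the volume resolution.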

In the previous blog, we demonstrated the advantages of intermediate supervision (the losses are applied across multiple stages of the network) to mitigate the problem of vanishing gradients in practice. The authors empirically found that naively stacking multiple volumes (the output volume of the previous stage is taken as input to the next stage) may lead to worse performance and higher memory and computation costs because of the high dimensionality. For example, if we choose the volume resolution to be $64 \times 64 \times 64$ with 16 joints, the network needs to predict over 4 million voxels, which is a serious problem. To cope with this, the authors suggested a coarse-to-fine volumetric prediction scheme. Specifically, in the earlier stages, the network is supervised with lower-resolution targets, and the volume size gradually increases through the later stages (see Figure 3). The volume size in the experiments is $64 \times 64 \times D$, where $D$ is the depth dimension and takes values in $\{1,2,4,8,16,32,64\}$.
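
The voxel counts behind this argument are easy to check. The short Python sketch below computes the number of output voxels for each depth resolution mentioned above, assuming 16 joints and a $64 \times 64$ spatial grid; the variable names are chosen for this example.

```python
num_joints = 16
width = height = 64
depth_resolutions = [1, 2, 4, 8, 16, 32, 64]

# Number of voxels the network must predict for each depth resolution D.
voxels_per_resolution = {
    d: width * height * d * num_joints for d in depth_resolutions
}

for d, count in voxels_per_resolution.items():
    print(f"D = {d:2d}: {count:,} voxels")
```

At full resolution ($D = 64$) this gives 4,194,304 voxels, while the coarsest target ($D = 1$) needs only 65,536, which is why supervising early stages at low depth resolution is much cheaper.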

**Figure 3**: The coarse-to-fine volumetric prediction scheme (Source).

The authors reported experimental results on Human3.6M (a popular large-scale motion capture dataset) in Table 1. Using the volumetric representation significantly reduced the estimation error (mean per-joint coordinate error) compared to direct regression: 85.82mm for a volume size of $64 \times 64 \times 64$ versus 112.41mm for direct regression.

**Table 1**: Coordinate versus volume regression on Human3.6M (Source).

### Using ordinal depth relation information

A major issue of directly estimating 3D pose from images is the lack of accurate paired 3D ground truth. 3D datasets are typically captured using a Motion Capture (MoCap) system in a studio setting, so it is difficult to reach the variability of 2D human pose datasets. One solution is to synthesize data with a graphics rendering engine (such as Unity or Unreal Engine). However, there is no guarantee that the characteristics of the synthetic images match those of real images, so a model trained on these synthesized examples may fail to produce the desired results in practice.

Moreover, depth ambiguity makes the problem hard for the model to learn, so it may fail to produce accurate pose estimates, especially when a sufficient amount of 3D ground-truth data is not accessible. The authors in [2] proposed to leverage the depth relationship between pairs of joints (i.e. one joint is closer, or has smaller depth, than another joint). Note that although it is difficult to capture the exact 3D pose in the real world (since it often requires a high-cost motion capture system), the ordinal depth relation can be easily annotated by humans. In particular, consider the human body with $J$ joints. For each joint $j$, the first goal is to predict the depth $z_j$. For each joint pair $(i,j)$, the ordinal depth relation $r(i,j)$ is defined as follows:

- $r(i,j) = 1$, if $z_i < z_j$ ($i$ is closer than $j$)
- $r(i,j) = -1$, if $z_j < z_i$ ($j$ is closer than $i$)
- $r(i,j) = 0$, if $z_i \approx z_j$ (their depths are roughly equal)

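Suppose $z_j$ is the depth of joint $j$ predicted by the network (a relative depth instead of an exact depth at this stage). Given the depth relation $r(i,j)$, we impose depth relation constraints so that the predicted depths are consistent with the annotated ordering. This is expressed by the following loss function:

$$\mathcal{L}_{r(i,j)} = \begin{cases} \log\left(1+\exp(z_i - z_j)\right), & r(i,j) = 1 \\ \log\left(1+\exp(z_j - z_i)\right), & r(i,j) = -1 \\ (z_i - z_j)^2, & r(i,j) = 0 \end{cases} \tag{4}$$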

Then the total depth relation ranking loss is the sum of $\mathcal{L}_{r(i,j)}$ over all body joint pairs $(i,j)$: $\mathcal{L}_{rank} = \sum_{i,j}\mathcal{L}_{r(i,j)}$. Intuitively, this loss encourages $z_i$ to be smaller than $z_j$ by a margin if $r(i,j)=1$ by minimizing the difference $z_i - z_j$ inside the $\log$ function (and vice versa), and otherwise encourages them to be similar if $r(i,j)=0$. Note that we do not require the relations for all pairs of joints to be available during training. For efficiency, a subset of joint pairs is enough to impose the depth relation constraint.
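
The ranking loss can be sketched in a few lines of NumPy. The per-pair form used here (a log-sum-exp term when the joints are ordered and a squared difference when their depths are roughly equal) follows the description above; the function names and the `(i, j, r)` tuple format for annotations are assumptions of this sketch.

```python
import numpy as np

def pair_ranking_loss(z_i, z_j, r):
    """Loss for one annotated joint pair with ordinal relation r in {1, -1, 0}."""
    diff = z_i - z_j
    if r == 1:    # i should be closer: push z_i below z_j
        return np.logaddexp(0.0, diff)    # log(1 + exp(z_i - z_j))
    if r == -1:   # j should be closer: push z_j below z_i
        return np.logaddexp(0.0, -diff)   # log(1 + exp(z_j - z_i))
    return diff ** 2                      # similar depth: pull them together

def ranking_loss(depths, relations):
    """Sum the pair losses over the annotated subset of joint pairs.

    `relations` is an iterable of (i, j, r) tuples (hypothetical format).
    """
    return float(sum(pair_ranking_loss(depths[i], depths[j], r)
                     for i, j, r in relations))

# Three joints; annotations say joint 0 is closer than joint 1,
# and joints 1 and 2 are at roughly the same depth.
relations = [(0, 1, 1), (1, 2, 0)]
consistent = np.array([0.0, 2.0, 2.0])    # respects both annotations
inconsistent = np.array([2.0, 0.0, 1.0])  # violates both annotations

print(ranking_loss(consistent, relations))
print(ranking_loss(inconsistent, relations))
```

Predictions that respect the annotated ordering receive a smaller loss than predictions that violate it, and only the annotated pairs contribute.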

**Figure 4**: Visualization of the volumetric output and marginalization for a specific joint (Source).

In the previous section, we investigated the use of a volumetric representation for 3D human pose. However, in many scenarios 3D ground truth is not available, so it is not straightforward to supervise the per-voxel likelihood with the exact joint location. A solution is to decompose the volume into the 2D spatial dimensions and the depth ($z$) dimension. The authors approximate the predicted coordinates using the soft-argmax function [3]. The process is illustrated in Figure 4. In particular, the volume heatmap $\mathbf{V}_j$ for joint $j$ is fed through a softmax operation to obtain a probability distribution over 3D space. For a given image and the $j$-th body joint, the value $p_j(x,y,z) = p_j(\mathbf{x}) = \mathrm{softmax}(\mathbf{V}_j(\mathbf{x}))$ represents the probability that joint $j$ is located at the voxel $(x,y,z)$. The marginalized probabilities for the 2D image space and the depth space are computed as:

$$p_j(x,y) = \sum_{z} p_j(x,y,z), \qquad p_j(z) = \sum_{x,y} p_j(x,y,z) \tag{5}$$

This formulation also enables the network to be supervised with both unpaired 2D and 3D targets, so it can make use of the large number of in-the-wild 2D images with ground-truth joints. The total weakly-supervised loss function is the sum of the standard 2D heatmap regression loss ($L2$) and the depth relation ranking loss: $\mathcal{L} = \mathcal{L}_{rank} + \mathcal{L}_{heatmap}$, where the heat value is computed as in Equation 5 and the depth $z_j$ used in $\mathcal{L}_{rank}$ is computed as the expectation over the depth dimension, $z_j = \sum_{z} z\,p_j(z)$.

Finally, since the output 2D joint location of the model is in pixel coordinates and the joint depth is only relative, a separate neural network is also trained to reconstruct the precise human pose in 3D coordinates (note that this requires accurate 3D ground-truth data). Experiments show that incorporating the depth relations helps reduce the reconstruction error significantly, since they resolve the single-view depth ambiguity and produce a rough depth estimate that respects the ordinal depths of the human joints. Interestingly, the authors showed that with only weak supervision (2D joints and depth relation labels), the method still achieves competitive performance and outperforms the previous baseline model. This indicates that the challenging depth ambiguity for in-the-wild images can be effectively addressed by considering the depth relation labels (which are easy to annotate in practice).
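
The softmax, marginalization, and expected-depth steps can be illustrated with a small NumPy sketch. The `(W, H, D)` axis order, the function names, and the use of voxel indices as depth values are assumptions of this example rather than details from the authors' code.

```python
import numpy as np

def softmax_volume(volume):
    """Softmax over all voxels, giving a probability distribution over 3D space."""
    shifted = volume - volume.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def marginalize(prob):
    """Split a (W, H, D) distribution into a 2D heatmap p(x, y) and a depth profile p(z)."""
    p_xy = prob.sum(axis=2)           # sum over the depth axis
    p_z = prob.sum(axis=(0, 1))       # sum over both image axes
    return p_xy, p_z

def expected_depth(p_z):
    """Soft-argmax along depth: expectation of the voxel index under p(z)."""
    return float(np.sum(np.arange(p_z.shape[0]) * p_z))

# Random network output for one joint, with a strong response at voxel (3, 5, 2).
rng = np.random.default_rng(0)
volume = rng.normal(size=(8, 8, 4))
volume[3, 5, 2] += 20.0

prob = softmax_volume(volume)
p_xy, p_z = marginalize(prob)
z_hat = expected_depth(p_z)
print(np.unravel_index(np.argmax(p_xy), p_xy.shape), z_hat)
```

Because the expectation is differentiable, the depth estimate can be plugged into the ranking loss, while `p_xy` can be supervised with ordinary 2D heatmap targets.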

### Considering human body constraints

**Figure 5**: Human kinematics can be represented as a skeleton graph (Source).

As previously mentioned, training the network directly with a standard regression loss on the set of joint positions has many potential problems. For example, the inter-joint relationships are not well exploited, and no constraints on the human body structure (e.g. the bones must be symmetric, or the knee cannot bend backward) are taken into account. This can be a critical problem in real-world scenarios, since valid human poses only exist in a small subspace of the pose representation (called the manifold). To address these issues, the authors in [4] reparameterized the pose representation as a set of human bones instead of joint positions. Mathematically speaking, a bone is defined as a direction vector that points from one joint to its parent in the kinematic tree (see Figure 5):

$$\mathbf{b}_j = \mathbf{x}_{P(j)} - \mathbf{x}_j \tag{6}$$

where $\mathbf{b}_j$ is the $j$-th bone vector associated with joint position $\mathbf{x}_j$ and $P(j)$ is the index of the parent node of joint $j$ in the kinematic tree. This bone representation has many advantages over the common joint position representation. Bones are more constrained and stable than joints, hence easier to learn, and they make it easier to encode the geometric structure and express geometric constraints. To train the bone regression network, one can apply a standard regression loss between the predicted bones $\mathbf{b}_j$ and the ground-truth bones $\mathbf{b}^*_j$:

$$\mathcal{L}_{bone} = \sum_{j=1}^{J} \lVert \mathbf{b}_j - \mathbf{b}^*_j \rVert_1 \tag{7}$$

However, the authors also pointed out that the key issue with this simple loss function is that prediction errors can accumulate from the root to the end-effectors. For instance, to predict the wrist joint, we need to traverse a long path through the pelvis $\rightarrow$ thorax $\rightarrow$ shoulder $\rightarrow$ elbow $\rightarrow$ wrist. The prediction error of each bone along the path therefore accumulates, and the final position is inaccurate. To address this, the authors enforced both short-range and long-range consistency for pairs of joints. Specifically, let $u$ and $v$ be two arbitrary joints (not necessarily adjacent). The long-range, relative offset from joint $u$ to joint $v$ is the sum of all bone vectors along the path from $u$ to $v$: $\Delta\mathbf{x}_{u\rightarrow v} = \sum_{k \in u\rightarrow v}\mathbf{b}_k$. Note that this is the forward path computation (since the kinematic tree is a directed graph in this case). The backward path is simply the reverse of the forward path (i.e. from $v$ to $u$). Next, the ground-truth relative offset is computed as $\Delta\mathbf{x}^*_{u\rightarrow v} = \mathbf{x}^*_v - \mathbf{x}^*_u$. Finally, the bone compositional loss is defined as:

$$\mathcal{L}_{comp}(\mathcal{P}) = \sum_{(u,v) \in \mathcal{P}} \lVert \Delta\mathbf{x}_{u\rightarrow v} - \Delta\mathbf{x}^*_{u\rightarrow v} \rVert_1 \tag{8}$$

**Figure 6**: Example of long-range offset consistency constraint in the compositional loss.

Here, $\mathcal{P}$ is an arbitrary set of joint pairs (not necessarily all joint pairs). Intuitively, this loss enforces bone consistency along the path between every joint pair $(u,v) \in \mathcal{P}$ in the kinematic tree. Given a large number of joint pairs, each bone is constrained by multiple paths. In the experiments, the authors tried various settings for choosing $\mathcal{P}$ (e.g. $\mathcal{P}$ only contains pairs of adjacent joints, or some combination of joint pairs) and found that imposing the loss on all joint pairs yields the best performance, at a small additional computational cost. The experiments also demonstrate the strong advantages of imposing human body constraints in a pose estimation system. Although regression-based, the method achieves performance competitive with detection-based methods and shows promising results.
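
The bone representation and the long-range offsets can be sketched as follows. The six-joint toy skeleton, the joint names in the comments, the function names, and the use of an L1 norm are assumptions made for this example. Since each bone points from a joint to its parent, a bone is added when the path climbs toward the root and subtracted when it descends.

```python
import numpy as np

# Hypothetical toy skeleton: 0 pelvis, 1 thorax, 2 left shoulder,
# 3 left elbow, 4 right shoulder, 5 right elbow. -1 marks the root.
PARENT = [-1, 0, 1, 2, 1, 4]

def joints_to_bones(joints, parent=PARENT):
    """Bone j points from joint j to its parent; the root has a zero bone."""
    joints = np.asarray(joints, dtype=float)
    bones = np.zeros_like(joints)
    for j, p in enumerate(parent):
        if p >= 0:
            bones[j] = joints[p] - joints[j]
    return bones

def path_to_root(j, parent):
    """List of joints visited when walking from j up to the root."""
    path = [j]
    while parent[path[-1]] >= 0:
        path.append(parent[path[-1]])
    return path

def relative_offset(bones, u, v, parent=PARENT):
    """Offset x_v - x_u obtained by composing bones along the tree path."""
    up_v = path_to_root(v, parent)
    ancestors_v = set(up_v)
    offset = np.zeros(bones.shape[1])
    lca = None
    for k in path_to_root(u, parent):   # climb from u to the common ancestor
        if k in ancestors_v:
            lca = k
            break
        offset += bones[k]
    for k in up_v:                      # descend from the common ancestor to v
        if k == lca:
            break
        offset -= bones[k]
    return offset

def compositional_loss(pred_bones, gt_joints, pairs, parent=PARENT):
    """Sum of L1 errors (assumed norm) between composed and true offsets."""
    total = 0.0
    for u, v in pairs:
        gt_offset = gt_joints[v] - gt_joints[u]
        total += np.abs(relative_offset(pred_bones, u, v, parent) - gt_offset).sum()
    return float(total)

gt_joints = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-0.5, 1.5, 0.0],
                      [-0.8, 1.0, 0.1], [0.5, 1.5, 0.0], [0.8, 1.0, -0.1]])
gt_bones = joints_to_bones(gt_joints)

# Perturb only the left-shoulder bone: every path through it picks up the error.
pred_bones = gt_bones.copy()
pred_bones[2, 0] += 0.1
pairs = [(0, 1), (0, 3), (3, 5)]
print(compositional_loss(pred_bones, gt_joints, pairs))
```

In this toy case a single bone error of 0.1 is counted once for every pair whose path crosses that bone, which shows how each bone ends up constrained by multiple paths.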

In summary, we have explored some techniques used to address the challenges of regressing 3D human pose directly from images. Many other strategies are not covered in this review; we refer the reader to [5] for further study. In practical applications, several of these strategies (such as weak supervision from depth relations and human body constraints) can be combined to build a pose estimation system that accurately and robustly captures human motion in real-world settings.

## References

[1] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. 2017. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In *CVPR*.

[2] Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. 2018. Ordinal Depth Supervision for 3D Human Pose Estimation. In *CVPR*.

[3] Diogo C Luvizon, Hedi Tabia, and David Picard. 2019. Human pose regression by combining indirect part detection and contextual information. *Computers & Graphics*.

[4] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. 2017. Compositional human pose regression. In *ICCV*.

[5] Ce Zheng, Wenhan Wu, Chen Chen, Taojiannan Yang, Sijie Zhu, Ju Shen, Nasser Kehtarnavaz, and Mubarak Shah. 2020. Deep Learning-Based Human Pose Estimation: A Survey.