In this post, we look at the inference phase of the POSA framework for the human-scene interaction task and at how its effectiveness is evaluated.
#Inference phase
Putting people into scenes: Given a scene $M_s$, semantic labels for the objects it contains, and a body mesh $M_b$, POSA finds where in $M_s$ the given pose is likely to occur:
First, given the posed body, use the decoder of the cVAE to generate a feature map by sampling $P(f_{\text{Gen}} \mid z, V_b)$, where $f_{\text{Gen}} = [\hat{f}_c, \hat{f}_s]$.
Second, optimize the objective function:
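$$E(\tau, \theta_0, \theta) = L_{\text{afford}} + L_{\text{pen}} + L_{\text{reg}}$$
where $\tau$ is the body translation, $\theta_0$ is the global body orientation, $\theta$ is the body pose, and:
- The affordance loss $L_{\text{afford}}$ finds positions in the scene where the given pose is likely to occur.
$$L_{\text{afford}} = \lambda_1 \lVert f_{\text{Gen}_c} \cdot f_d \rVert_2^2 + \lambda_2 \sum_i \text{CCE}(f_{\text{Gen}_s}^i, f_s^i)$$
- The penetration penalty $L_{\text{pen}}$ discourages the body from penetrating the scene.
$$L_{\text{pen}} = \lambda_{\text{pen}} \sum_{f_d^i < 0} (f_d^i)^2$$
- The regularizer $L_{\text{reg}}$ encourages the estimated pose to remain close to the initial pose $\theta_{\text{init}}$ of $M_b$.
$$L_{\text{reg}} = \lambda_{\text{reg}} \lVert \theta - \theta_{\text{init}} \rVert_2^2$$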
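To make the three terms above concrete, here is a minimal NumPy sketch that evaluates them for a handful of vertices. It is not the authors' implementation. The function names, the unit default weights, and the input layout are assumptions made for illustration: `f_d` is taken to be a per-vertex signed distance to the scene (negative inside the scene, which matches the sum over $f_d^i < 0$), `f_s` is taken to be an integer scene label per vertex, and the categorical cross-entropy is computed as the negative log of the generated probability for the observed label.

```python
import numpy as np

def afford_loss(f_gen_c, f_gen_s, f_d, f_s, lam1=1.0, lam2=1.0, eps=1e-12):
    """Affordance term: contact-weighted distance plus semantic cross-entropy.

    f_gen_c: (N,) generated contact probabilities
    f_gen_s: (N, C) generated per-vertex class probabilities
    f_d:     (N,) signed distances from body vertices to the scene (assumed)
    f_s:     (N,) integer scene labels observed at each vertex (assumed)
    """
    contact_term = np.sum((f_gen_c * f_d) ** 2)        # squared L2 norm of f_Gen_c * f_d
    picked = f_gen_s[np.arange(len(f_s)), f_s]         # generated prob. of the observed class
    cce_term = -np.sum(np.log(picked + eps))           # cross-entropy summed over vertices
    return lam1 * contact_term + lam2 * cce_term

def pen_loss(f_d, lam_pen=1.0):
    """Penetration term: squared distances of vertices with negative signed distance."""
    inside = f_d[f_d < 0]
    return lam_pen * np.sum(inside ** 2)

def reg_loss(theta, theta_init, lam_reg=1.0):
    """Regularizer: squared L2 distance between the current and initial pose."""
    return lam_reg * np.sum((theta - theta_init) ** 2)

# Toy inputs: 3 vertices, 2 semantic classes (placeholder values).
f_gen_c = np.array([1.0, 0.0, 1.0])
f_gen_s = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
f_d = np.array([0.0, 0.5, -0.2])
f_s = np.array([0, 1, 1])
theta = np.zeros(6)
theta_init = np.zeros(6)

energy = (afford_loss(f_gen_c, f_gen_s, f_d, f_s)
          + pen_loss(f_d)
          + reg_loss(theta, theta_init))
print(energy)
```

In an actual optimizer, `f_d` and `f_s` would be recomputed from the scene every time $\tau$, $\theta_0$, or $\theta$ changes, and the energy would be minimized with a gradient-based method. The sketch only evaluates the energy once for fixed inputs.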
Figure 1: Putting realistic people in scenes.
Locating Clothed Bodies:
Figure 2: Locate clothed bodies in scenes.
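POSA can also place clothed bodies, using SMPL-X fits to clothed meshes from the AGORA dataset. Since the pose of a clothed body is kept fixed, the objective depends only on the translation and the global orientation:
$$E(\tau, \theta_0) = L_{\text{afford}} + L_{\text{pen}}$$
Monocular Pose Estimation with HSI: Fit SMPL-X to RGB image features such that the contacts are consistent with the 3D scene and its semantics, by minimizing an objective function with multiple terms: the re-projection error of 2D joints, priors, and physical constraints on the body:
$$E_{\text{SMPLify-X}}(\beta, \theta, \psi, \tau) = E_J + \lambda_\theta E_\theta + \lambda_\alpha E_\alpha + \lambda_\beta E_\beta + \lambda_P E_P$$
To get a pose that matches the image observations and roughly obeys the scene constraints, sample features from $P(f_{\text{Gen}} \mid z, V_b)$ for the body pose, then minimize
$$E(\beta, \theta, \psi, \tau, M_s) = E_{\text{SMPLify-X}} + \lVert f_{\text{Gen}_c} \cdot f_d \rVert + L_{\text{pen}}$$
#Evaluation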
Comparison to PROX ground truth: The authors take 4 real scenes from the PROX test set and 100 SMPL-X bodies from the AGORA dataset, corresponding to 100 different 3D scans from Renderpeople. For each body they sample one feature map using the cVAE, then automatically optimize the placement of each sample in all the scenes, one body per scene. The pose is changed slightly to fit the scene for unclothed bodies and kept fixed for clothed bodies.
For each variant, the optimization results in 400 unique body-scene pairs. They render each 3D human-scene interaction from 2 views so that subjects are able to get a good sense of the 3D relationships from the images.
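The pair count follows directly from the setup: 100 bodies placed in each of 4 scenes. A short sketch of that enumeration is below. The identifiers `body_000`, `scene_0`, `view_a`, and so on are hypothetical placeholders, and the render count is simply pairs times views, derived from the numbers above rather than reported by the authors.

```python
from itertools import product

# Hypothetical identifiers standing in for the AGORA bodies and PROX test scenes.
bodies = [f"body_{i:03d}" for i in range(100)]
scenes = [f"scene_{j}" for j in range(4)]
views = ("view_a", "view_b")

# Every body is placed once in every scene.
pairs = list(product(bodies, scenes))

# Each body-scene pair is rendered from both views.
renders = [(body, scene, view) for (body, scene) in pairs for view in views]

print(len(pairs), len(renders))  # 400 800
```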
Figure 3: Comparison to PROX ground truth. Subjects are shown pairs of a generated 3D human-scene interaction and PROX ground truth.
Comparison between POSA and PLACE: The authors directly compare POSA and PLACE using the above protocol. Adding semantics to POSA improves realism.
Figure 4: POSA compared to PLACE for 3D human-scene interaction generation.
Physical Plausibility: The authors take 1200 bodies from the AGORA dataset and place all of them in each of the 4 test scenes of PROX. They compute the following scores:
- The non-collision score for each body mesh $M_b$, which is the number of body mesh vertices with positive scene signed distance field (SDF) values divided by the total number of SMPL-X vertices.
- The contact score for each $M_b$, which is 1 if at least one vertex of $M_b$ has a non-positive SDF value, and 0 otherwise.
The experiments show that POSA and PLACE are comparable under these metrics.
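The two scores can be sketched in a few lines of NumPy, assuming the scene SDF has already been sampled at every body vertex. The function names and the toy SDF values are made up for illustration, and the sign convention (positive outside the scene geometry) follows the definitions above.

```python
import numpy as np

def non_collision_score(sdf_values):
    """Fraction of body vertices with a strictly positive scene SDF value."""
    sdf_values = np.asarray(sdf_values, dtype=float)
    return float(np.mean(sdf_values > 0))

def contact_score(sdf_values):
    """1.0 if at least one vertex has a non-positive SDF value, else 0.0."""
    sdf_values = np.asarray(sdf_values, dtype=float)
    return float(np.any(sdf_values <= 0))

# Toy body with 4 vertices: two in free space, one on the surface, one inside.
touching = np.array([0.30, 0.05, 0.0, -0.01])
# Toy body floating in free space: no collision, but also no contact.
floating = np.array([0.30, 0.05, 0.20, 0.10])

print(non_collision_score(touching), contact_score(touching))  # 0.5 1.0
print(non_collision_score(floating), contact_score(floating))  # 1.0 0.0
```

The floating body shows why the two scores are reported together: a body that never touches the scene gets a perfect non-collision score, and only the contact score penalizes it.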
Figure 5: Evaluation of the physical plausibility metric.
#References
[1] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), 2019.
[2] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3D scenes by learning human-scene interaction. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[3] Shunwang Gong, Lei Chen, Michael Bronstein, and Stefanos Zafeiriou. SpiralNet++: A fast and highly efficient mesh convolution operator. In International Conference on Computer Vision Workshops (ICCVw), 20
[4] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision (ICCV), 2019.