## Uncertainty-aware Label Distribution Learning for Facial Expression Recognition (Part 1)

Facial expression recognition (FER) plays an important role in understanding people's feelings and human interactions. Automatic emotion recognition has recently attracted considerable attention from the research community because of its wide range of applications in education, healthcare, human analysis, surveillance, and human-robot interaction. Recent FER methods are mostly based on deep learning and achieve impressive results, a success that can be attributed in part to large-scale FER datasets [1][2]. However, the ambiguity of facial expressions remains a key challenge in FER. Specifically, people with different backgrounds may perceive and interpret facial expressions differently, which can lead to noisy and inconsistent annotations. In addition, real-life facial expressions usually manifest a mixture of feelings rather than a single emotion.

## Motivation and Proposed Solution

**Figure 1**. Examples of real-world ambiguous facial expressions that can lead to noisy and inconsistent annotations.

As an example, Figure 1 shows that people may have different opinions about the expressed emotion, particularly in ambiguous images. Consequently, a distribution over emotion categories is better than a single label because it takes all sentiment classes into account and can cover various interpretations, thus mitigating the effect of ambiguity. However, existing large-scale FER datasets only provide a single label for each sample instead of a label distribution, which means we do not have a comprehensive description for each facial expression. This can lead to insufficient supervision during training and pose a big challenge for many FER systems.

To overcome the ambiguity problem in FER, we propose a new uncertainty-aware label distribution learning method that constructs emotion distributions for training samples. Specifically, we leverage the neighborhood information of samples that have similar expressions to construct the emotion distributions from single labels and use them as the training supervision signal.

## Methodology

### Preliminaries

We denote $\mathbf{x} \in \mathcal{X}$ as the instance variable in the input space $\mathcal{X}$ and $\mathbf{x}^{i}$ as the particular $i$-th instance. The label set is denoted as $\mathcal{Y} = \{y_1, y_2, ..., y_m\}$, where $m$ is the number of classes and $y_j$ is the label value of the $j$-th class. The logical label vector of $\mathbf{x}^{i}$ is indicated by $\mathbf{l}^{i} = (l^{i}_{y_1}, l^{i}_{y_2}, ..., l^{i}_{y_m})$ with $l^{i}_{y_j} \in \{0, 1\}$ and $\| \mathbf{l}^{i} \| _1 = 1$. We define the label distribution of $\mathbf{x}^{i}$ as $\mathbf{d}^{i} = (d^{i}_{y_1}, d^{i}_{y_2}, ..., d^{i}_{y_m})$ with $\| \mathbf{d}^{i} \| _1 = 1$ and $d^{i}_{y_j} \in [0, 1]$ representing the relative degree to which $\mathbf{x}^{i}$ belongs to the class $y_j$.

Most existing FER datasets assign only a single class, or equivalently a logical label $\mathbf{l}^{i}$, to each training sample $\mathbf{x}^{i}$. In particular, the given training dataset is a collection of $n$ samples with logical labels $D_l = \{ (\mathbf{x}^{i}, \mathbf{l}^{i}) \vert 1 \le i \le n\}$. However, we find that a label distribution $\mathbf{d}^i$ is a more comprehensive and suitable annotation for the image than a single label.

Inspired by the recent success of label distribution learning (LDL) in addressing label ambiguity [3], we aim to construct an emotion distribution $\mathbf{d}^i$ for each training sample $\mathbf{x}^i$, thereby transforming the *training* set $D_l$ into $D_d = \{ (\mathbf{x}^{i}, \mathbf{d}^{i}) \vert 1 \le i \le n\}$, which provides richer supervision and helps mitigate the ambiguity issue. We use cross-entropy to measure the discrepancy between the model's prediction and the constructed target distribution. Hence, the model can be trained by minimizing the following classification loss:

$$\mathcal{L}_{cls} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} d^{i}_{y_j} \log f_j(\mathbf{x}^{i}; \theta)$$

where $f(\mathbf{x}; \theta)$ is a neural network with parameters $\theta$ followed by a softmax layer that maps the input image $\mathbf{x}$ to an emotion distribution, and $f_j$ denotes its $j$-th output.
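As a minimal sketch of cross-entropy against a label distribution (rather than a single class index), the NumPy snippet below averages the per-sample cross-entropy over a toy batch. The function names, the `eps` guard, and all numeric values are assumptions made for illustration; they are not taken from the original work.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def soft_cross_entropy(target_dist, logits, eps=1e-12):
    """Mean cross-entropy between target distributions and softmax predictions.

    target_dist: (n, m) array, each row is a label distribution d^i summing to 1.
    logits:      (n, m) array of raw network outputs before the softmax.
    eps:         small constant guarding against log(0) (an implementation choice).
    """
    probs = softmax(logits)
    return float(-(target_dist * np.log(probs + eps)).sum(axis=1).mean())

# Toy batch with n = 2 samples and m = 3 classes (values are made up).
d = np.array([[0.7, 0.2, 0.1],
              [0.0, 1.0, 0.0]])   # the second row is a one-hot (logical) label
z = np.array([[2.0, 0.5, 0.1],
              [0.1, 3.0, 0.2]])
loss = soft_cross_entropy(d, z)
```

When a row of `target_dist` is one-hot, its contribution reduces to the usual single-label cross-entropy, so logical labels are a special case of this loss.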

### Overview

**Figure 2**. An overview of our Label Distribution Learning with Valence-Arousal (LDLVA) for facial expression recognition under ambiguity.

An overview of our method is presented in Figure 2. To construct the *label distribution* for each *training* instance $\mathbf{x}^i$, we leverage its neighborhood information in the valence-arousal space. In particular, we identify $K$ neighbor instances for each training sample $\mathbf{x}^i$ and use our adaptive similarity mechanism to determine their contribution degrees to the target distribution $\mathbf{d}^i$. We then combine the neighbors' predictions, weighted by their contribution degrees, with the provided label $\mathbf{l}^i$ and its uncertainty factor to obtain the label distribution $\mathbf{d}^i$. The constructed distribution $\mathbf{d}^i$ is used as supervision to train the model via label distribution learning.

### Adaptive Similarity

We assume that the label distribution of the main instance $\mathbf{x}^i$ can be computed as a linear combination of its neighbors' distributions. To determine the contribution of each neighbor, we propose an adaptive similarity mechanism that not only leverages the relationships between $\mathbf{x}^i$ and its neighbors in the auxiliary space but also uses their feature vectors extracted from the backbone. We choose the valence-arousal space [4] as the auxiliary space for constructing the target label distribution. We use the $K$-Nearest Neighbor algorithm to identify the $K$ closest points for each training sample $\mathbf{x}^i$, denoted as $N(i)$. We calculate the adaptive *contribution degree* $c^i_k$ of each neighbor instance as the product of the local similarity $s^i_k$ and the calibration score $\zeta^i_k$:

$$c^i_k = s^i_k \, \zeta^i_k, \quad \mathbf{x}^k \in N(i)$$

where the *local similarity* $s^i_k$ is defined from the distance between $\mathbf{a}^i$ and $\mathbf{a}^k$, the valence-arousal coordinates of the instance and its neighbor.
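The neighbor search and a distance-based local similarity can be sketched as follows. The text only states that the similarity depends on the distance in valence-arousal space, so the Gaussian kernel, the bandwidth `delta`, the brute-force search, and the exclusion of a sample from its own neighborhood are all assumptions of this sketch.

```python
import numpy as np

def knn_local_similarity(va, k, delta=1.0):
    """Find the K nearest neighbors in valence-arousal space and score them.

    va:    (n, 2) array of valence-arousal coordinates a^i.
    k:     number of neighbors K (must be smaller than n).
    delta: kernel bandwidth (hypothetical hyperparameter).
    Returns (neighbors, sims), both of shape (n, k).
    """
    diff = va[:, None, :] - va[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)           # (n, n) squared distances
    np.fill_diagonal(sq_dist, np.inf)            # assumption: a sample is not its own neighbor
    neighbors = np.argsort(sq_dist, axis=1)[:, :k]
    nearest_sq = np.take_along_axis(sq_dist, neighbors, axis=1)
    sims = np.exp(-nearest_sq / delta ** 2)      # assumed Gaussian kernel, values in (0, 1]
    return neighbors, sims

# Four made-up points: two with positive valence, two with negative valence.
va = np.array([[0.8, 0.6], [0.7, 0.5], [-0.6, 0.4], [-0.5, 0.5]])
neighbors, sims = knn_local_similarity(va, k=1)
```

With this toy input each point pairs with the other point on the same side of the valence axis. For a large dataset, a tree-based or approximate nearest-neighbor index would likely replace the brute-force distance matrix.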

We use a multilayer perceptron (MLP) $g$ with parameters $\phi$ to calculate the adaptive calibration score $\zeta^i_k$ from $\mathbf{v}^i$ and $\mathbf{v}^k$, the feature vectors of the two instances extracted by the backbone.
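A rough sketch of such a calibration network is shown below. The text does not specify the architecture, so the concatenation of the two feature vectors, the single ReLU hidden layer, the sigmoid output, the layer sizes, and the random initialisation are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(feat_dim, hidden_dim):
    """Randomly initialise a two-layer MLP g; the dict plays the role of phi."""
    return {
        "W1": rng.normal(scale=0.1, size=(2 * feat_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def calibration_score(phi, v_i, v_k):
    """Map a pair of feature vectors to a score in (0, 1).

    Assumptions: the pair is concatenated, the hidden layer uses ReLU,
    and the output is squashed with a sigmoid.
    """
    pair = np.concatenate([v_i, v_k], axis=-1)
    hidden = np.maximum(pair @ phi["W1"] + phi["b1"], 0.0)
    logit = hidden @ phi["W2"] + phi["b2"]
    return 1.0 / (1.0 + np.exp(-logit[..., 0]))

V = 8                                   # feature dimension (made up)
phi = init_mlp(feat_dim=V, hidden_dim=16)
v_i, v_k = rng.normal(size=V), rng.normal(size=V)
zeta = calibration_score(phi, v_i, v_k)
s = 0.9                                 # a local similarity value (made up)
contribution = s * zeta                 # contribution degree = local similarity * calibration score
```

Bounding the score to (0, 1) means the calibration can only scale a neighbor's local similarity down, which is one plausible way to discount neighbors whose valence-arousal estimates are unreliable.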

Because ground-truth valence-arousal values are not always available in practice, we use an existing method to generate pseudo valence-arousal values. The proposed adaptive similarity can correct the similarity errors that these estimates introduce in the valence-arousal space.

### Uncertainty-aware Label Distribution Construction

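After obtaining the contribution degree of each neighbor $\mathbf{x}^k \in N(i)$, we can generate the target label distribution $\mathbf{d}^i$ for the main instance $\mathbf{x}^i$. The target label distribution is calculated from the logical label $\mathbf{l}^i$ and the aggregated distribution $\tilde{\mathbf{d}}^i$, defined as follows:

$$\tilde{\mathbf{d}}^i = \frac{\sum_{k} c^i_k \, f(\mathbf{x}^k; \theta)}{\sum_{k} c^i_k}, \qquad \mathbf{d}^i = (1 - \lambda^i)\, \mathbf{l}^i + \lambda^i\, \tilde{\mathbf{d}}^i$$

where $c^i_k$ is the contribution degree of neighbor $\mathbf{x}^k$, the sums run over $\mathbf{x}^k \in N(i)$, and $\lambda^i \in [0,1]$ is the *uncertainty factor* for the logical label. It controls the balance between the provided label $\mathbf{l}^i$ and the aggregated distribution $\tilde{\mathbf{d}}^i$ from the local neighborhood.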

Intuitively, a high value of $\lambda^i$ indicates that the logical label is highly uncertain, which can be caused by an ambiguous expression or a low-quality input image, so we should put more weight on the neighborhood information $\tilde{\mathbf{d}}^i$. Conversely, when $\lambda^i$ is small, the label distribution $\mathbf{d}^i$ should be close to $\mathbf{l}^i$ since we are certain about the provided manual label. In our implementation, $\lambda^i$ is a trainable parameter for each instance and is optimized jointly with the model's parameters using gradient descent.
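The construction can be illustrated with a small NumPy sketch that blends a one-hot label with a weighted average of neighbor predictions. Normalising the contribution degrees so that the aggregated vector sums to 1 is an assumption of this sketch, as are the function name and the toy numbers; here `lam` is a fixed value, whereas the text describes it as a trainable per-sample parameter.

```python
import numpy as np

def build_target_distribution(one_hot, neighbor_preds, contrib, lam):
    """Blend the logical label with an aggregated neighborhood distribution.

    one_hot:        (m,) logical label l^i.
    neighbor_preds: (K, m) predicted distributions of the K neighbors.
    contrib:        (K,) non-negative contribution degrees of the neighbors.
    lam:            uncertainty factor lambda^i in [0, 1].
    """
    weights = contrib / contrib.sum()          # normalise so the result sums to 1
    aggregated = weights @ neighbor_preds      # (m,) aggregated distribution
    return (1.0 - lam) * one_hot + lam * aggregated

# Toy example with m = 3 classes and K = 2 neighbors (values are made up).
l_i = np.array([1.0, 0.0, 0.0])
preds = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1]])
c = np.array([0.75, 0.25])
d_i = build_target_distribution(l_i, preds, c, lam=0.4)
# d_i is [0.8, 0.16, 0.04]: mostly the annotated class, with some mass
# moved to the class favored by the neighborhood.
```

Setting `lam=0` returns the logical label unchanged, and `lam=1` returns the aggregated neighborhood distribution, matching the two extremes described above.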

### Loss Function

To enhance the model's ability to discriminate between ambiguous emotions, we also propose a discriminative loss to reduce the intra-class variations of the learned facial representations. We incorporate the label uncertainty factor $\lambda^i$ to adaptively penalize the distance between the sample and its corresponding class center. For instances with high uncertainty, the network can effectively tolerate their features in the optimization process. Furthermore, we also add pairwise distances between class centers to encourage large margins between different classes, thus enhancing the discriminative power. Our discriminative loss is calculated as follows:

$$\mathcal{L}_D = \frac{1}{2} \sum_{i=1}^{n} (1 - \lambda^i) \left\| \mathbf{v}^i - \boldsymbol{\mu}_{y^i} \right\|_2^2 + \sum_{j=1}^{m} \sum_{\substack{k=1 \\ k \neq j}}^{m} \exp\left( -\frac{\left\| \boldsymbol{\mu}_j - \boldsymbol{\mu}_k \right\|_2^2}{\sqrt{V}} \right)$$

where $y^i$ is the class index of the $i$-th sample, while $\boldsymbol{\mu}_{j}$, $\boldsymbol{\mu}_{k}$, and $\boldsymbol{\mu}_{y^i} \in \mathbb{R}^V$ are the center vectors of the $j$-th, $k$-th, and $y^i$-th classes, respectively. Intuitively, the first term of $\mathcal{L}_D$ encourages the feature vectors of one class to be close to their corresponding center, while the second term improves the inter-class discrimination by pushing the cluster centers far away from each other. Finally, the total loss for training is computed as:

$$\mathcal{L} = \mathcal{L}_{cls} + \gamma \mathcal{L}_D$$

where $\mathcal{L}_{cls}$ is the classification loss and $\gamma$ is the balancing coefficient between the two losses.
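To make the two terms of the discriminative loss concrete, here is a hedged NumPy sketch. The specific choices (the factor 1/2, the weight `1 - lam` on each sample, and an exponential penalty on squared center distances scaled by the square root of the feature dimension) are assumptions consistent with the description, and the values of `gamma` and `cls_loss` are placeholders.

```python
import numpy as np

def discriminative_loss(features, labels, centers, lam):
    """One possible uncertainty-weighted center loss with a center-separation term.

    features: (n, V) feature vectors v^i.
    labels:   (n,) integer class indices y^i.
    centers:  (m, V) class centers mu_j.
    lam:      (n,) uncertainty factors lambda^i in [0, 1].
    """
    V = features.shape[1]
    # Intra-class term: uncertain samples (large lambda) are penalised less.
    sq_to_center = ((features - centers[labels]) ** 2).sum(axis=1)
    intra = 0.5 * ((1.0 - lam) * sq_to_center).sum()
    # Inter-class term: shrinks as centers move apart (assumed exponential form).
    diff = centers[:, None, :] - centers[None, :, :]
    sq_between = (diff ** 2).sum(axis=-1)
    off_diag = ~np.eye(len(centers), dtype=bool)
    inter = np.exp(-sq_between[off_diag] / np.sqrt(V)).sum()
    return float(intra + inter)

# Toy setup: two classes in a 2-D feature space (values are made up).
features = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
lam = np.array([0.2, 0.9])
loss_d = discriminative_loss(features, labels, centers, lam)

gamma = 0.5        # hypothetical balancing coefficient
cls_loss = 1.2     # placeholder for the classification loss value
total_loss = cls_loss + gamma * loss_d
```

In this toy setup every feature sits exactly on its class center, so only the center-separation term contributes. In practice the centers would typically be learnable parameters updated together with the network.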

## References

[1] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 2019.

[2] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, 2017.

[3] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 2017.