3 posts tagged with "level: medium"

Style Transfer for 2D Talking Head Generation (Part 1)

Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent works have focused on creating photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method to generate talking head animation with learnable style references. Given a set of style reference frames, our framework can reconstruct 2D talking head animation based on a single input image and an audio stream. Our method first produces facial landmark motion from the audio stream and constructs intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to generate photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experiments show that our method achieves better results than recent state-of-the-art approaches, both qualitatively and quantitatively.

Introduction

Talking head animation is an active research topic in both academia and industry. This task has a wide range of real-world interactive applications, such as digital avatars and digital animations. Given an arbitrary input audio stream and a 2D image (or a set of 2D images) of a character, the goal of talking head animation is to generate photo-realistic frames. The output can be a 2D or 3D talking head. With recent advances in deep learning, especially generative adversarial networks, several works have addressed different aspects of the talking head animation task, such as head pose control, facial expression, emotion generation, and photo-realistic synthesis.

While there has been considerable advancement in the generation of talking head animation, achieving photo-realistic and high fidelity animation is not a trivial task. It is even more challenging to render natural motion of the head with different styles. In practice, several aspects contribute to this challenge. First, generating a photo-realistic talking head using only a single image and audio as inputs requires multi-modal synchronization and mapping between the audio stream and facial information. In many circumstances, this process may result in fuzzy backgrounds, ambiguous fidelity, or abnormal face attributes. Second, various talking and singing styles can express diverse personalities. Therefore, the animation methods should be able to adapt and generalize well to different styles. Finally, controlling the head motion and connecting it with the full-body animation remains an open problem.

Recently, several methods have been proposed to generate photo-realistic talking heads or to match the pose from a source video, while little work has focused on learning the personalized character style. In practice, apart from personalized talking styles, there are also different singing styles such as ballad and rap. These styles pose a more challenging problem for talking head animation because they have unique eye, head, mouth, and torso motion. The facial movements of singing styles are also more varied and dynamic than those of the talking style. Therefore, learning these styles and bringing them into 2D talking heads is more challenging. Currently, most style-aware talking head animation methods do not fully disentangle the audio style information from the visual information, which causes ambiguity during the transfer process.

Figure 1. Given an audio stream, a single image, and a set of style reference frames, our method generates realistic 2D talking head animation.

In this work, we present a new deep learning framework called Style Transfer for 2D talking head animation. Our framework provides an effective way to transfer talking or singing styles from the style reference to animate a single 2D portrait of a character given an arbitrary input audio stream. We first generate photo-realistic 2D animation with natural expression and motion. We then propose a new method to transfer the personalized style of a character into any talking head with a simple style-aware transfer process. Figure 1 shows an overview of our approach.

Research Overview

2D Talking Head Animation. Creating talking head animation from an input image and audio has been widely studied in the past few years. One of the earliest works treated this as a sorting task that reorders images from video footage. Some works proposed capturing 3D models of the dubber and the actor to synthesize a photo-realistic face, while others introduced a learning approach to build a trainable system that could synthesize a mouth shape from an unseen utterance. Later works focused on audio-driven generation of realistic mouth shapes and faces, or on generating full facial landmarks. The animation quality of these approaches can be further improved by creating a talking face that includes pose and expression, with a focus on generating a high-fidelity talking head with natural head pose and realistic motion. Recently, some methods have been developed to encode personalized information within the talking head animation, or to take advantage of diffusion models to improve the diversity of the generated talking faces.

Speaker Style Estimation. There are many kinds of speaker styles, such as generic, personal, controlled pose, or special expression. A generic style can be learned by training on multiple videos, while a personalized style can be captured by training specifically on one avatar of a person. In general, some well-known methods aim to generate controllable poses from an input video, or to transfer poses and expressions from another video input, such as mapping the style from a dubber to an actor. Another interesting method captures motion from a driving video and transfers it to the input image during the generation process. Speaker information and the speaking environment can be further combined to characterize speaker variability in the environment. For example, we can leverage a pre-captured database of 3D mouth shapes and associated speech audio from one speaker to refine the mouth shape of a new actor. Recently, Zhang et al. developed a state-of-the-art method that can generate diverse and synchronized talking videos from input audio and a single reference image by using a conditional variational autoencoder to capture a style code.

Speech Representation for Face Animation. Some prior works used hand-crafted models to match phonemes and mouth shapes in each millisecond of the audio signal as the speech representation. Going further, DeepSpeech paved the way for learning a speech recognition system using an end-to-end deep network. Following that, an improvement was made by training Bi-LSTMs to learn a long-term language structure that models the relationship between speech and the complex activity of faces. Additionally, Mel-frequency spectral coefficients can be used to synthesize the high-quality mouth texture of a character, which is then combined with a 3D pose matching method to synchronize the lip motion with the audio in the target animation. With the rise of diffusion techniques, Diff2lip proposed an audio-conditional diffusion model that effectively encodes audio in its generator to solve the lip-sync challenge.

Our goal is to introduce a new deep-learning framework that can transfer talking or singing styles from any personalized style reference to animate a single 2D portrait of a character given an arbitrary input audio stream. Compared to existing approaches, which have mainly focused on conventional talking head animation, our method can not only produce animation for common talking styles but also transfer several special styles that are much more challenging, such as singing.

To summarize, our research aims to propose a new framework for generating photo-realistic 2D talking head animations from an audio stream as input. Furthermore, we present a style-aware transfer technique, which enables us to learn and apply any new style to the animated head. Our generated 2D animation is photo-realistic and high fidelity with natural motions. To validate our meticulously designed system, we conduct extensive analysis and demonstrate that our proposed method outperforms recent approaches both qualitatively and quantitatively.

Global-Local Attention for Context-aware Emotion Recognition (Part 2)

In this part, we conduct experiments to validate the effectiveness of our proposed Global-Local Attention for Context-aware Emotion Recognition. Here, we only focus on static images with background context as our input. Therefore, we choose the static CAER (CAER-S) dataset [2] to validate our method. However, while experimenting with the CAER-S dataset, we observe that there is a correlation between images in the training and test sets, which can make the model less robust to changes in data and less likely to generalize well to unseen samples. More specifically, many images in the training and test sets of the CAER-S dataset are extracted from the same video, which makes them look very similar to each other. To properly evaluate the models, we propose a new way to extract static frames from the CAER video clips to create a new static image dataset called Novel CAER-S (NCAER-S). In particular, for each video in the original CAER dataset, we split the video into multiple parts. Then we randomly select one frame from each part to include in the new NCAER-S dataset. Any original video that provides frames for the training set is excluded from the testing set. This process ensures that the training frames and testing frames never come from the same original video.
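The frame-extraction procedure described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the function name `build_split`, the number of parts per video (`num_parts`), and the share of videos held out for testing (`test_ratio`) are all assumptions, since the text does not state them.

```python
import random


def build_split(video_frames, num_parts=5, test_ratio=0.2, seed=0):
    """Sample one random frame per video part, splitting at the video level.

    video_frames: dict mapping a video id to its ordered list of frame ids.
    num_parts and test_ratio are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)

    # Assign whole videos to the test set so no video feeds both splits.
    video_ids = sorted(video_frames)
    rng.shuffle(video_ids)
    num_test = max(1, int(len(video_ids) * test_ratio))
    test_videos = set(video_ids[:num_test])

    train, test = [], []
    for vid in sorted(video_frames):
        frames = video_frames[vid]
        parts = min(num_parts, len(frames))
        target = test if vid in test_videos else train
        for p in range(parts):
            # Contiguous, non-empty segment p of the video.
            start = p * len(frames) // parts
            end = (p + 1) * len(frames) // parts
            target.append((vid, rng.choice(frames[start:end])))
    return train, test
```

Because the split is decided per video before any frame is sampled, near-duplicate frames from one clip cannot leak across the training and test sets, which is the property the new dataset is meant to guarantee.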

Context-aware emotion recognition results

Table 1. Comparison with recent methods on the CAER-S dataset.

Table 1 summarizes the results of our network and other recent state-of-the-art methods on the CAER-S dataset [2]. The table shows that integrating our GLA module significantly improves the accuracy of the recent CAER-Net. In particular, our GLAMOR-Net (original) achieves 77.90% accuracy, which is a 4.38% improvement over CAER-Net-S. Compared with other recent state-of-the-art approaches, our GLAMOR-Net (ResNet-18) outperforms all of them and achieves a new state-of-the-art accuracy of 89.88%. This result confirms that our global-local attention mechanism can effectively encode both facial information and context information to improve human emotion classification.

Component Analysis

To further analyze the contribution of each component in our proposed method, we experiment with 4 different input settings on the NCAER-S dataset: (i) face only, (ii) context only with the facial region being masked, (iii) context only with the facial region visible, and (iv) both face and context (with masked face). When the context information is used, we compare the performance of the model with different context attention approaches. Note that to compute the saliency map with the proposed GLA in settings (ii) and (iii), we extract facial features using the Facial Encoding Module. However, these features are only used as the input of the GLA module to guide the learning of the context attention map, and not as the input of the Fusion Network to predict the emotion category. The results of these settings are summarized in Table 2.

Table 2. Ablation study of our proposed method on the NCAER-S dataset. w/F, w/mC, w/fC, w/CA, w/GLA denote using the output of the Facial Encoding Module, the Context Encoding Module with masked faces as input, the Context Encoding Module with visible faces as input, the standard attention in [2] and our proposed GLA, respectively, as input to the Fusion Network.

The results show that our GLA consistently improves performance in all settings. Specifically, in setting (ii), using our GLA achieves an improvement of 1.06% over the method without attention. Our GLA also improves the performance of the model when both facial and context information are used to predict emotion. Specifically, our model with GLA achieves the best result with an accuracy of 46.91%, which is 3.72% higher than the method with no attention. The results in Table 2 show the effectiveness of our Global-Local Attention module for the task of emotion recognition. They also verify that using both the local face region and the global context information is essential for improving emotion recognition accuracy.

Qualitative Analysis

Figure 5 shows the learned attention maps obtained by our GLAMOR-Net in comparison with CAER-Net-S. Our Global-Local Attention mechanism produces better saliency maps than CAER-Net-S and helps the model attend to the right discriminative regions in the surrounding background. As we can see, our model is able to focus on the gesture of the person (Figure 5f) and also on the faces of surrounding people (Figure 5c and 5d) to infer the emotion accurately.

Figure 5. Visualization of the attention maps. From top to bottom: original image in the NCAER-S dataset, image with masked face, attention map of the CAER-Net-S, and attention map of our GLAMOR-Net.

Figure 6 shows some emotion recognition results of different approaches on the CAER-S dataset. More specifically, the first two rows (i) and (ii) contain predictions of the CAER-Net-S, while the last two rows (iii) and (iv) show the results of our GLAMOR-Net. In some cases, our model was able to exploit the context effectively to perform inference accurately. For instance, with the same sad image input (shown on the (i) and (iii) rows), the CAER-Net-S misclassified it as neutral while the GLAMOR-Net correctly recognized the true emotion category. This may be because our model was able to identify that the man was hugging and appeasing the woman and inferred that they were sad. Another example is shown on the (i) and (iii) rows of the fear column. Our model classified the input accurately, while the CAER-Net-S was confused between the facial expression and the wedding surroundings, and thus incorrectly predicted the emotion as happy.

Figure 6. Example predictions on the test set. The first two rows (i) and (ii) show the results of the CAER-Net-S while the last two rows (iii) and (iv) demonstrate predictions of our GLAMOR-Net. The column names from (a) to (g) denote the ground-truth emotion of the images.

Conclusion

We have presented a novel method to exploit context information more efficiently by using the proposed global-local attention model. We have shown that our approach can noticeably improve the emotion classification accuracy compared to the current state-of-the-art results in the context-aware emotion recognition task. The results on the context-aware emotion recognition datasets consistently demonstrate the effectiveness and robustness of our method.

References

[1] Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In NIPS, 2015.

[2] Lee, J., Kim, S., Kim, S., Park, J., Sohn, K.: Context-aware emotion recognition networks. In ICCV, 2019.

Global-Local Attention for Context-aware Emotion Recognition (Part 1)

Automatic emotion recognition has been a longstanding problem in both academia and industry. It enables a wide range of applications in various domains, ranging from healthcare and surveillance to robotics and human-computer interaction. Recently, significant progress has been made in the field and many methods have demonstrated promising results. However, recent works mainly focus on facial regions while ignoring the surrounding context, which has been shown to play an important role in understanding the perceived emotion, especially when the emotions on the face are ambiguous or weakly expressed (see the examples in Figure 1).

Figure 1. Facial expression information is not always sufficient to infer people's emotions, especially when facial regions cannot be seen clearly or are occluded.

We hypothesize that the local information (i.e., facial region) and global information (i.e., context background) have a correlative relationship, and that by simultaneously learning the attention using both of them, the accuracy of the network can be improved. This is based on the fact that the emotion of one person can be indicated not only by the face’s emotion (i.e., local information) but also by other context information such as the gesture, pose, or emotion/pose of a nearby person. To that end, we propose a new deep network, namely, Global-Local Attention for Emotion Recognition Network (GLAMOR-Net), to effectively recognize human emotions using a novel global-local attention mechanism. Our network is designed to extract features from both facial and context regions independently, then learn them together using the attention module. In this way, the facial and contextual information are used together to infer human emotions.

Overview

Figure 2. The architecture of our proposed network. The whole process includes three steps. We extract the facial information (local) and context information (global) using two Encoding Modules. We then perform attention inference on the global context using the Global-Local Attention mechanism. Lastly, we fuse both features to determine the emotion.

Figure 2 shows an overview of our method. Specifically, we assume that emotions can be recognized by understanding the context components of the scene together with the facial expression. Our method aims to perform emotion recognition in the wild by incorporating both the facial information of a person and the contextual information surrounding that person. Our model consists of three components: the Encoding Module, the Global-Local Attention (GLA) Module, and the Fusion Module. Our key design is the novel GLA module, which uses facial features as the local information to attend better to salient locations in the global context.

Face and Context Encoding

Our Encoding Module comprises the Facial Encoding Module, which learns face-specific features, and the Context Encoding Module, which learns context-specific features. Specifically, both the Facial Encoding Module and the Context Encoding Module are built on several convolutional layers to extract meaningful features from the corresponding input. Each module comprises five convolutional layers followed by a Batch Normalization layer and a ReLU activation function. The number of filters starts at 32 in the first layer and increases by a factor of 2 at each subsequent layer except the last one. Each module therefore ends with a 256-channel feature map, which is the embedded representation of the input image. In practice, we also mask the facial regions in the raw context input to prevent the attention module from focusing only on the facial region while omitting the context information in other parts of the image.
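As a quick check of the filter counts described above, the following sketch computes the channel width of each layer. The helper names `channel_schedule` and `describe_encoder` are hypothetical, the 3-channel (RGB) input is an assumption, and the per-layer Conv, BatchNorm, ReLU ordering is only one reading of the description; kernel sizes and strides are not given in the text and are left out.

```python
def channel_schedule(num_layers=5, base=32):
    """Output channels per conv layer: start at `base`, double at each
    subsequent layer, except the last layer, which keeps the same width."""
    channels = [base]
    for layer in range(1, num_layers):
        is_last = layer == num_layers - 1
        channels.append(channels[-1] if is_last else channels[-1] * 2)
    return channels


def describe_encoder(in_channels=3):
    """Human-readable layer list; in_channels=3 assumes an RGB input."""
    layers = []
    prev = in_channels
    for out in channel_schedule():
        layers.append(f"Conv({prev}->{out}) -> BatchNorm -> ReLU")
        prev = out
    return layers


if __name__ == "__main__":
    print(channel_schedule())
    for line in describe_encoder():
        print(line)
```

Under these assumptions the widths are 32, 64, 128, 256, and 256, which agrees with the 256-channel feature map stated in the text.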

Global-Local Attention

Inspired by the attention mechanism [1], to model the associative relationship between the local information (i.e., the facial region in our work) and the global information (i.e., the surrounding context background), we propose the Global-Local Attention Module to guide the network to focus on meaningful regions (Figure 3). In particular, our attention mechanism models the hidden correlation between the face and different regions in the context by capturing their similarity.

Figure 3. The proposed Global-Local Attention module takes the extracted face feature vector and the context feature map as the input to perform context attention inference.

We first reduce the facial feature map $\mathbf{F}_f$ to a vector representation, denoted as $\mathbf{v}_f$, using the Global Pooling operator. The context feature map can be viewed as a set of $W_c \times H_c$ vectors with $D_c$ dimensions, where the vector in each cell $(i,j)$ represents the embedded features of the corresponding patch in the input image. Therefore, at each region $(i,j)$ in the context feature map, we have $\mathbf{F}_c^{(i,j)} = \mathbf{v}_{i,j}$.

We then concatenate $[\mathbf{v}_f; \mathbf{v}_{i,j}]$ into a holistic vector $\bar{\mathbf{v}}_{i,j}$, which contains information about both the face and a small region of the scene. We then feed $\bar{\mathbf{v}}_{i,j}$ into a feed-forward neural network to compute the score for that region. Applying the same process to all regions yields a raw score $s_{i,j}$ for each region $(i,j)$. We then apply the Softmax function spatially to produce the attention map $a_{i,j} = \text{Softmax}(s_{i,j})$. To obtain the final context representation vector, we collapse the feature map by taking the average over all regions weighted by $a_{i,j}$ as follows:

$$\mathbf{v}_c = \sum_i \sum_j (a_{i,j} \odot \mathbf{v}_{i,j})$$

where $\mathbf{v}_c \in \mathbb{R}^{D_c}$ is the final single vector encoding the context information. Intuitively, $\mathbf{v}_c$ mainly contains information from regions that have high attention, while other nonessential parts of the context are mostly ignored. With this design, our attention module can guide the network to focus on important areas based on both the facial information and the context information of the image.
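The attention pooling described in this section can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: `global_local_attention` and `score_fn` are hypothetical names, and `score_fn` stands in for the feed-forward scoring network, whose architecture is not specified here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_local_attention(v_f, context_map, score_fn):
    """Pool a context feature map into one vector, guided by the face vector.

    v_f: face vector (list of floats).
    context_map: H x W grid of D_c-dimensional vectors (nested lists).
    score_fn: hypothetical scorer mapping the concatenation [v_f; v_ij]
              to a raw score s_ij (a stand-in for the feed-forward network).
    """
    cells = [v for row in context_map for v in row]
    scores = [score_fn(v_f + v) for v in cells]  # list concat = [v_f; v_ij]
    attn = softmax(scores)                       # normalize over all cells
    dim = len(cells[0])
    # Attention-weighted sum of the context vectors.
    v_c = [sum(a * v[d] for a, v in zip(attn, cells)) for d in range(dim)]
    return v_c, attn
```

When `score_fn` returns the same value for every cell, the attention map is uniform and the result reduces to plain average pooling; a learned scorer shifts the weight toward the regions that matter given the face.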

Face and Context Fusion

Figure 4. Detailed illustration of the Adaptive Fusion.

The Fusion Module takes the face representation $\mathbf{v}_f$ and the context representation $\mathbf{v}_c$ as inputs. The face score and context score are then produced separately by two neural networks:

$$s_f = \mathcal{F}(\mathbf{v}_f; \phi_f), \quad\quad s_c = \mathcal{F}(\mathbf{v}_c; \phi_c)$$

where $\phi_f$ and $\phi_c$ are the network parameters of the face branch and the context branch, respectively. Next, we normalize these scores with the Softmax function to produce the weights for the face and context branches:

$$w_f = \frac{\exp(s_f)}{\exp(s_f)+\exp(s_c)}, \quad w_c = \frac{\exp(s_c)}{\exp(s_f)+\exp(s_c)}$$

In this way, we let the two networks competitively determine which branch is more useful than the other. Then we amplify the more useful branch and lower the effect of the other by multiplying the extracted features with the corresponding weight:

$$\mathbf{v}_f \leftarrow \mathbf{v}_f \odot w_f, \quad\quad \mathbf{v}_c \leftarrow \mathbf{v}_c \odot w_c$$

Finally, we use these vectors to estimate the emotion category. Specifically, in our experiments, after multiplying both $\mathbf{v}_f$ and $\mathbf{v}_c$ by their corresponding weights, we concatenate them together as the input for a network to make final predictions. Figure 4 shows our fusion procedure in detail.
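The fusion steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: `adaptive_fusion`, `score_face`, and `score_context` are hypothetical names, and the two scoring callables stand in for the face-branch and context-branch networks, whose architectures are not given in this post.

```python
import math


def adaptive_fusion(v_f, v_c, score_face, score_context):
    """Weight the face and context vectors by a two-way softmax, then concat.

    score_face / score_context: hypothetical callables standing in for the
    two scoring networks; each maps a vector to a scalar score.
    """
    s_f = score_face(v_f)
    s_c = score_context(v_c)

    # Two-way softmax; subtracting the max keeps exp() numerically stable.
    m = max(s_f, s_c)
    e_f, e_c = math.exp(s_f - m), math.exp(s_c - m)
    w_f = e_f / (e_f + e_c)
    w_c = e_c / (e_f + e_c)

    # Scale each branch by its weight and concatenate for the classifier.
    fused = [w_f * x for x in v_f] + [w_c * x for x in v_c]
    return fused, w_f, w_c
```

Since the two weights always sum to 1, raising one branch's score necessarily lowers the other branch's contribution, which is the competitive behavior the fusion design relies on.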

References

[1] Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In NIPS, 2015.

[2] Lee, J., Kim, S., Kim, S., Park, J., Sohn, K.: Context-aware emotion recognition networks. In ICCV, 2019.