
Style Transfer for 2D Talking Head Generation (Part 1)

An introduction to Style Transfer for 2D Talking Head Generation

Audio-driven talking head animation is a challenging research topic with many real-world applications. Recent works have focused on creating photo-realistic 2D animation, while learning different talking or singing styles remains an open problem. In this paper, we present a new method to generate talking head animation with learnable style references. Given a set of style reference frames, our framework can reconstruct 2D talking head animation based on a single input image and an audio stream. Our method first produces facial landmark motion from the audio stream and constructs intermediate style patterns from the style reference images. We then feed both outputs into a style-aware image generator to produce photo-realistic, high-fidelity 2D animation. In practice, our framework can extract the style information of a specific character and transfer it to any new static image for talking head animation. Extensive experimental results show that our method outperforms recent state-of-the-art approaches both qualitatively and quantitatively.
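
To make the described pipeline concrete, below is a minimal PyTorch-style sketch of how the three stages could be composed. The TalkingHeadPipeline wrapper, the submodule roles, and the tensor shapes in the comments are illustrative assumptions, not the framework's actual implementation.

```python
# A minimal PyTorch-style sketch of the pipeline described above. The
# TalkingHeadPipeline wrapper, submodule roles, and tensor shapes are
# illustrative assumptions, not the framework's actual implementation.
import torch.nn as nn


class TalkingHeadPipeline(nn.Module):
    def __init__(self, audio_to_landmarks: nn.Module,
                 style_encoder: nn.Module,
                 generator: nn.Module):
        super().__init__()
        self.audio_to_landmarks = audio_to_landmarks  # audio -> landmark motion
        self.style_encoder = style_encoder            # reference frames -> style patterns
        self.generator = generator                    # image + landmarks + style -> frames

    def forward(self, audio_feats, source_image, style_frames):
        # 1) Predict per-frame facial landmark motion from the audio stream.
        landmarks = self.audio_to_landmarks(audio_feats)            # (T, num_landmarks, 2)
        # 2) Build intermediate style patterns from the style reference frames.
        style_code = self.style_encoder(style_frames)               # (style_dim,)
        # 3) Render photo-realistic frames conditioned on both outputs.
        return self.generator(source_image, landmarks, style_code)  # (T, 3, H, W)
```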


Introduction

Talking head animation is an active research topic in both academia and industry. This task has a wide range of real-world interactive applications, such as digital avatars and digital animation. Given an arbitrary input audio stream and a 2D image (or a set of 2D images) of a character, the goal of talking head animation is to generate photo-realistic frames. The output can be a 2D or 3D talking head. With recent advances in deep learning, especially generative adversarial networks, several works have addressed different aspects of the talking head animation task, such as head pose control, facial expression, emotion generation, and photo-realistic synthesis.

While there has been considerable advancement in the generation of talking head animation, achieving photo-realistic and high-fidelity animation is not a trivial task. It is even more challenging to render natural head motion with different styles. In practice, several aspects contribute to this challenge. First, generating a photo-realistic talking head using only a single image and audio as inputs requires multi-modal synchronization and mapping between the audio stream and facial information. In many circumstances, this process may result in fuzzy backgrounds, ambiguous fidelity, or abnormal face attributes. Second, various talking and singing styles can express diverse personalities, so animation methods should be able to adapt and generalize well to different styles. Finally, controlling the head motion and connecting it with the full-body animation remains an open problem.

Recently, several methods have been proposed to generate photo-realistic talking heads or to match the pose from a source video, while little work has focused on learning a personalized character style. In practice, apart from personalized talking styles, there are also distinct singing styles such as ballad and rap. These styles pose a more challenging problem for talking head animation as they involve unique eye, head, mouth, and torso motions. The facial movements in singing are also more varied and dynamic than in talking, which makes learning these styles and bringing them into 2D talking heads harder. Currently, most style-aware talking head animation methods do not fully disentangle the audio style information from the visual information, which causes ambiguity during the transfer process.


Figure 1. Given an audio stream, a single image, and a set of style reference frames, our method generates realistic 2D talking head animation.

In this work, we present a new deep learning framework called Style Transfer for 2D talking head animation. Our framework provides an effective way to transfer talking or singing styles from a style reference to animate a single 2D portrait of a character, given an arbitrary input audio stream. We first generate photo-realistic 2D animation with natural expression and motion. We then propose a new method to transfer the personalized style of a character onto any talking head with a simple style-aware transfer process. Figure 1 shows an overview of our approach.

Research Overview

2D Talking Head Animation

Creating talking head animation from an input image and audio has been widely studied in the past few years. One of the earliest works considered this as a sorting task that reorders images from footage video. Some works proposed to capture 3D models of the dubber and actor to synthesize a photo-realistic face, while others introduced a learning approach to create a trainable system that can synthesize a mouth shape from an unseen utterance. Later works focused on audio-driven generation of realistic mouth shapes and faces, or of full facial landmarks. The animation quality of these approaches can be further improved by creating a talking face that includes pose and expression, mainly by generating a high-fidelity talking head with a natural head pose and realistic motions. Recently, some methods encode personalized information within the talking head animation, or take advantage of diffusion models to improve the diversity of the generated talking face.

Speaker Style Estimation

There are many kinds of speaker styles, such as generic, personal, controlled-pose, or special-expression styles. A generic style can be learned by training on multiple videos, while a personalized style can be captured by training specifically on one avatar of a person. Some well-known methods aim to generate controllable poses from an input video, or to transfer poses and expressions from another video, such as mapping the style from a dubber to an actor. Another method captures motions from a driving video and transfers them to the input image during the generation process; speaker information and the speaking environment can be further ensembled to characterize speaker variability in that environment. For example, we can leverage a pre-captured database of 3D mouth shapes and associated speech audio from one speaker to refine the mouth shape of a new actor. Recently, Zhang et al. developed a state-of-the-art method that generates diverse and synchronized talking videos from input audio and a single reference image by using a conditional variational autoencoder to capture a style code.
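
The style-code idea can be pictured with a small conditional VAE encoder. The sketch below is only illustrative: the StyleCVAEEncoder name, layer sizes, and choice of inputs (motion features conditioned on audio features) are assumptions, not the cited method's actual architecture.

```python
# A sketch of capturing a style code with a conditional VAE encoder, in the
# spirit of the approach mentioned above. The StyleCVAEEncoder name, layer
# sizes, and inputs are assumptions, not the cited method's actual code.
import torch
import torch.nn as nn


class StyleCVAEEncoder(nn.Module):
    def __init__(self, feat_dim=256, cond_dim=128, style_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.to_mu = nn.Linear(128, style_dim)      # mean of q(z | x, c)
        self.to_logvar = nn.Linear(128, style_dim)  # log-variance of q(z | x, c)

    def forward(self, motion_feats, audio_cond):
        h = self.net(torch.cat([motion_feats, audio_cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a style code while staying differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar
```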

Speech Representation for Face Animation

Some prior works used hand-crafted models that match phonemes to mouth shapes over millisecond-scale audio segments as the speech representation. Later, DeepSpeech paved the way for learning a speech recognition system with an end-to-end deep network. Following that, Bi-LSTMs were trained to learn long-term language structure that models the relationship between speech and the complex activity of faces. Additionally, Mel-frequency spectral coefficients can be used to synthesize a high-quality mouth texture for a character, which is then combined with a 3D pose-matching method to synchronize lip motion with the audio in the target animation. With the rise of diffusion techniques, Diff2lip proposed an audio-conditional diffusion model that effectively encodes audio in its generator to address the lip-sync challenge.
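
As a concrete example of such speech representations, the snippet below extracts log-mel and MFCC features from an audio clip with librosa; the sample rate, hop length, and feature dimensions are illustrative choices rather than values taken from any specific paper.

```python
# An example of extracting mel-frequency features from an audio clip with
# librosa, one common speech representation for driving face animation.
# The sample rate, hop length, and feature sizes are illustrative choices.
import librosa
import numpy as np


def extract_audio_features(wav_path: str, sr: int = 16000,
                           n_mels: int = 80, n_mfcc: int = 13, hop: int = 200):
    y, sr = librosa.load(wav_path, sr=sr)                 # mono waveform at 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
    log_mel = librosa.power_to_db(mel)                    # (n_mels, T) log-mel spectrogram
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)  # (n_mfcc, T)
    return log_mel.astype(np.float32), mfcc.astype(np.float32)
```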

Our goal is to introduce a new deep learning framework that can transfer talking or singing styles from any personalized style reference to animate a single 2D portrait of a character, given an arbitrary input audio stream. Compared to existing approaches, which mainly focus on conventional talking head animation, our method not only produces animation for common talking styles but also allows transferring several much more challenging styles, such as singing.

To summarize, our research aims to propose a new framework for generating photo-realistic 2D talking head animation from an input audio stream. Furthermore, we present a style-aware transfer technique, which enables us to learn and apply any new style to the animated head. Our generated 2D animation is photo-realistic and high-fidelity with natural motions. To validate our design, we conduct extensive analyses and demonstrate that our proposed method outperforms recent approaches both qualitatively and quantitatively.
