
Controllable Group Choreography using Contrastive Diffusion (Part 1)

An introduction to controllable group choreography

Music-driven group choreography poses a considerable challenge but holds significant potential for a wide range of industrial applications. The ability to generate synchronized and visually appealing group dance motions that are aligned with music opens up opportunities in many fields such as entertainment, advertising, and virtual performances. However, most recent works are unable to generate high-fidelity long-term motions, or fail to enable a controllable experience. In this work, we aim to address the demand for high-quality and customizable group dance generation by effectively governing the consistency and diversity of group choreographies. In particular, we utilize a diffusion-based generative approach to enable the synthesis of a flexible number of dancers and long-term group dances, while ensuring coherence with the input music. Ultimately, we introduce a Group Contrastive Diffusion (GCD) strategy to enhance the connection between dancers and their group, providing the ability to control the consistency or diversity level of the synthesized group animation via the classifier-guidance sampling technique. Through intensive experiments and evaluation, we demonstrate the effectiveness of our approach in producing visually captivating and consistent group dance motions. The experimental results show the capability of our method to achieve the desired levels of consistency and diversity while maintaining the overall quality of the generated group choreography.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Introduction

With the widespread presence of digital social media platforms, creating and editing dance videos has gained immense popularity among social communities. This surge in interest has resulted in millions of dance videos being produced and watched daily across online platforms. Recently, researchers from the computer vision, computer graphics, and machine learning communities have devoted considerable attention to developing techniques that can generate natural dance movements from music. These advancements have far-reaching implications and find applications in various domains, such as animation, the creation of virtual idols, the development of virtual metaverses, and dance education. These techniques empower artists, animators, and educators alike, providing them with powerful tools to enhance their creative endeavors and enrich the dance experience for both performers and audiences.

While significant progress has been made in generating dance motions for a single dancer, the task of producing cohesive and expressive choreography for a group of dancers has received limited attention. The generation of synchronized group dance motions that are both realistic and aligned with music remains a challenging problem in the field of computer animation and motion synthesis. This is primarily due to the complex relationship between music and human motion, the diverse range of motions required for group performances, and the lack of a suitable dataset. At present, AIOZ-GDance stands as the most recent extensive dataset available to facilitate the task of generating group choreography. Besides, while current algorithms can generate individual movements and choreographic sequences, ensuring that these elements align seamlessly with the overall group performance is also paramount.

Different from solo dance, group dance involves coordination and interaction between dancers, making it crucial and challenging to establish correlations between the motion series within a group. Besides, group dance can involve complex and diverse choreographies among participating dancers while still maintaining a semantic relationship between the motion and the input music. Exploring the consistency and diversity between the movements of dancers in the synthesized group choreography is of vital importance for creating a natural and expressive performance. The ability to control consistency and diversity in group dance generation holds great potential across various applications. One such application is in the realm of entertainment and performance. Choreographers and creative teams can leverage this control to design captivating group dance routines that seamlessly blend synchronized movements with moments of individual expression. Second, in the context of animation and virtual metaverses, control over consistency and diversity allows for the creation of visually stunning and immersive virtual dance performances. By balancing the synchronization of dancers' movements while also introducing variations and unique flourishes, the generated group dances can captivate audiences and evoke a sense of realism and authenticity. Last but not least, in dance education and training, the ability to regulate consistency and diversity in group dance generation can be invaluable. It enables instructors to provide students with a diverse range of generated dance routines and samples that challenge their abilities, promote collaboration, and foster creativity. By dynamically adjusting the level of consistency and diversity, educators can cater to the unique needs and skill level of each individual dancer, creating more inclusive instruction and enriching the learning environment. Although plenty of applications can be listed, due to limitations in data collection, investigating consistency and diversity in group choreography has not yet been carefully explored.

In this paper, our goal is to develop a controllable technique for group dance generation. We present a Group Contrastive Diffusion (GCD) strategy that learns an encoder to capture the key associations between dancers and their group. Diffusion modeling provides a flexible framework for manipulating the dance distribution, which allows us to modulate the degree of diversity and consistency in the generated dances. By using a denoising diffusion probabilistic model as the core technique, we can effectively control the trade-off between diversity and consistency during group dance generation, thanks to the guided sampling process. With this approach, we can guide the generation process toward a desired balance between diversity and consistency levels. Moreover, incorporating the encoder, which learns the association between the dancers and their group, helps keep the generated dance moves consistent with a specific dance style, music genre, or any long-term chorus. We empirically show that this approach has the potential to enhance the quality and naturalness of generated group dance performances, making it more appealing for various applications.

Figure 1. We present a contrastive diffusion method that controls the consistency (top row) and diversity (second row) in group choreography.
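To give a rough intuition of how guided sampling can steer the trade-off described above, the snippet below sketches generic classifier-style guidance, where the mean of each denoising step is nudged by the gradient of a score produced by a learned encoder. This is a minimal, generic sketch under our own naming (the function `encoder_score`, the sign convention, and `guidance_weight` are illustrative assumptions), not the exact GCD formulation, which is presented in the next post.

```python
import torch

def guided_mean(mu, sigma2, x_m, encoder_score, guidance_weight):
    """Shift the reverse-step mean by the gradient of a learned group score.

    mu, sigma2      : mean / variance of p_theta(x_{m-1} | x_m)
    encoder_score   : callable returning a scalar score for the group motion x_m
                      (e.g., how consistent the dancers look as a group)
    guidance_weight : positive values push toward higher score (consistency),
                      negative values push away from it (diversity);
                      the sign/scale convention here is an assumption
    """
    x_m = x_m.detach().requires_grad_(True)
    score = encoder_score(x_m)
    grad = torch.autograd.grad(score.sum(), x_m)[0]
    return mu + guidance_weight * sigma2 * grad
```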

2. Overview

Music-driven Choreography. Creating natural and authentic human choreography from music is a complex task. One commonly employed technique involves using a motion graph derived from a vast motion database to generate new motions. This involves combining various motion segments and optimizing transition costs along the graph path. Alternatively, there are other methods that incorporate music-motion similarity matching constraints to ensure consistency between the motion and the accompanying music. Previous studies have extensively explored these methodologies. However, most of these approaches relied on heuristic algorithms to stitch together pre-existing dance segments sourced from a limited music-dance database. While these methods are successful in generating extended and realistic dance sequences, they face limitations when trying to create entirely novel dance fragments.

In recent years, significant progress has been made in music-to-dance motion generation using Convolutional Networks (CNN), Recurrent Networks (RNN), Graph Neural Networks (GNN), Generative Adversarial Networks (GAN), or Transformers. Typically, these methods rely on multiple inputs such as the current music and a brief history of past dance movements to predict the future sequence of human poses. Recently, Gong et al. proposed an interesting task of generating dance by simultaneously utilizing both music and text instructions. A music-text feature fusion module was designed to fuse the inputs into a motion decoder that generates dance conditioned on both music and text. However, although these methods have the potential to produce natural and realistic dancing motion, they are often unable to create synchronized and harmonious movements between multiple dancers. Ensuring coordination and synchronization between dancers is a complicated problem, as it involves not only individual pose predictions but also the seamless integration of these poses within the context of a group. Achieving synchronized and harmonious group movements requires considering spatial and temporal relationships among dancers, their interactions, and the overall choreographic structure. Thus, further advances in the field are required to address these challenges, including works that use deep learning approaches such as Variational Autoencoders (VAE), GANs, and Normalising Flows.

Unfortunately, most of these networks are limited in their ability to model long-term dance sequences, as the generated motions may freeze or drift towards the end of the music. To facilitate long-term generation, many researchers apply a motion repeat constraint to predict future frames by attending to the historical motions. Nevertheless, this limits the flexibility of the model by forcing it to always look into the past.

Group Choreography. Group choreography and its related problem, multi-person motion prediction, have been an active research area, with numerous studies addressing the challenges of predicting the behaviors of multiple individuals. Alahi et al. utilize a Markov chain model to jointly analyze the trajectories of several pedestrians and predict their destinations in a given scene. Adel et al. integrate social interactions and the visual context of the environment to forecast the future motion of multiple individuals. Multi-Range Transformers have been proposed to predict the movements of groups with more than ten people engaging in social interactions. These methods leverage various techniques to capture social interactions, spatial dependencies, and temporal dynamics, generally aiming to predict accurate and socially plausible future motions for multiple individuals in different scenarios. However, despite the notable advancements achieved, there remains a demand for further investigation of the correlation between the consistency and diversity of motions within a group context. A deeper understanding of how to attain the optimal balance between consistency and diversity holds the potential to unlock new possibilities for creating group choreographies that benefit users in many circumstances.

Diffusion for Music-driven Choreography. Recently, diffusion-based approaches have shown remarkable results on several generative tasks ranging from image generation, audio synthesis, pose estimation, natural language generation, and motion synthesis, to point cloud generation, 3D object synthesis, and scene creation. Diffusion models have shown that they can achieve high mode coverage, unlike GANs, while still maintaining high sample quality. Most existing diffusion-based approaches for human motion/dance synthesis only focus on generating motion sequences for a single character, conditioned on information such as text, audio, or both audio and text. Different from these prior works, we aim to create group dance motions from music, which requires coordinating multiple characters, avoiding collisions, and maintaining coherence between them. In addition to the vanilla diffusion loss term used for training in previous works, our method employs a contrastive learning strategy that directly influences the training of the diffusion reverse process, enhancing the association within the dance group.

A prominent issue of the diffusion approach for motion synthesis is that, although it is highly effective in generating diverse samples, injecting a large amount of noise during the sampling process can lead to inconsistent results. This issue is particularly problematic for the group dance paradigm. Therefore, our desideratum is to design a group dance generation model that addresses both the diversity and consistency problems.

3. Preliminaries

3.1 Background

Given an input music sequence $\{a_1, a_2, \dots, a_T\}$, where $t \in \{1, \dots, T\}$ indexes the music frames, our goal is to generate the group motion sequences of $N$ dancers: $\{x^1_1, \dots, x^1_T; \dots; x^N_1, \dots, x^N_T\}$, where $x^i_t$ is the pose of the $i$-th dancer at frame $t$. We represent dances as sequences of poses over the 24 joints of the SMPL model, using the 6D continuous rotation representation for every joint, along with a single 3D root translation. This rotation representation ensures the uniqueness and continuity of the rotation vector, which is beneficial for training deep neural networks. We tackle the group dance generation task with a diffusion-based framework that synthesizes the motions from a random noise distribution, given the music conditioning. Thanks to the sampling process of the diffusion model, we can effectively control the consistency and diversity of the generated sequences.
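To make the motion representation concrete, here is a minimal sketch of the tensor shapes involved, assuming the standard 24-joint SMPL skeleton and the 6D rotation parameterization described above; the sequence length, group size, and music feature dimension are placeholders, not values from the released code.

```python
import torch

J = 24                       # SMPL body joints
ROT_DIM = 6                  # 6D continuous rotation per joint
POSE_DIM = J * ROT_DIM + 3   # plus a 3D root translation = 147 values per frame

T = 240                      # number of motion frames (illustrative)
N = 5                        # number of dancers in the group (illustrative)

# One group dance sample: N dancers, T frames, a 147-dim pose per frame
x = torch.zeros(N, T, POSE_DIM)

# The paired music condition: T frames of audio features
# (the feature dimension here is a placeholder, not the actual extractor output)
music = torch.zeros(T, 35)
```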

3.2 Forward Process of Diffusion Model

Given an original sample from the real data distribution $x_0 \sim q(x_0)$, following DDPM (Ho et al., 2020), the forward diffusion process is defined as a Markov process that gradually adds Gaussian noise to the data under a pre-defined noise schedule over $M$ steps:

$$q(x_m | x_{m-1}) = \mathcal{N}\left(x_m; \sqrt{1-\beta_m}\, x_{m-1}, \beta_m I\right), \quad \forall m \in \{1, 2, \dots, M\}$$

If the noise variance schedule $\beta_m$ is small and the number of diffusion steps $M$ is large enough, the distribution $q(x_M)$ at the end of the process is well-approximated by a standard normal distribution $\mathcal{N}(0, I)$, which is easy to sample from. Thanks to this nice property of the forward diffusion, we can directly obtain the noised sample at any arbitrary step $m$ without traversing the whole chain:

$$q(x_m | x_0) = \mathcal{N}\left(x_m; \sqrt{\bar{\alpha}_m}\, x_0, (1-\bar{\alpha}_m) I\right), \qquad x_m = \sqrt{\bar{\alpha}_m}\, x_0 + \sqrt{1-\bar{\alpha}_m}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\alpha_m = 1 - \beta_m$ and $\bar{\alpha}_m = \prod_{s=0}^{m} \alpha_s$.
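The closed-form expression above translates directly into code. Below is a minimal sketch of the forward noising step; the linear beta schedule and the step count are illustrative assumptions, not the exact training configuration.

```python
import torch

M = 1000                                   # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, M)      # simple linear noise schedule (assumption)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_m = prod of alphas up to step m

def q_sample(x0: torch.Tensor, m: int, noise: torch.Tensor = None) -> torch.Tensor:
    """Draw x_m ~ q(x_m | x_0) in closed form at step m (0-indexed)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[m]
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise
```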

3.3 Reverse Process

By additionally conditioning on $x_0$, the posterior of the reverse process becomes tractable and is a Gaussian distribution:

$$q(x_{m-1} | x_m, x_0) = \mathcal{N}\left(x_{m-1}; \tilde{\mu}_m, \tilde{\beta}_m I\right),$$

where $\tilde{\mu}_m$ and $\tilde{\beta}_m$ are the posterior mean and variance; the mean depends on both $x_m$ and $x_0$, while the variance is determined by the noise schedule. To obtain a sample from the original data distribution, we start by sampling from the noise distribution $q(x_M)$ and then gradually remove the noise until we reach $x_0$, following the reverse process. Therefore, our goal is to train a neural network to approximate the posterior $q(x_{m-1} | x_m)$ of the reverse process as:

$$p_{\theta}(x_{m-1} | x_m) = \mathcal{N}\left(x_{m-1}; \mu_{\theta}(x_m, m), \Sigma_{\theta}(x_m, m)\right)$$

We model only the mean $\mu_{\theta}(x_m, m)$ of the reverse distribution while keeping the variance $\Sigma_{\theta}(x_m, m)$ fixed according to the noise schedule. However, instead of predicting the noise $\epsilon_m$ at an arbitrary step $m$ as in the original DDPM formulation, we train the network to predict the original noiseless signal $x_0$. The sample at the previous step $m-1$ can then be obtained by noising back the predicted $x_0$. For the conditional generation setting, the network is additionally conditioned on the conditioning signal $c$, i.e., $x_0 \approx \mathcal{G}_\theta(x_m, m, c)$ with model parameters $\theta$.
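To make the sampling loop concrete, here is a minimal sketch of one reverse step when the network predicts the clean signal $x_0$; the posterior coefficients follow the standard DDPM closed form, while the `model(x_m, m, cond)` signature and the schedule tensors are illustrative assumptions rather than the released implementation.

```python
import torch

def p_sample_step(model, x_m, m, cond, betas, alphas, alpha_bars):
    """One reverse step x_m -> x_{m-1}, with the network predicting x_0."""
    x0_pred = model(x_m, m, cond)            # x_0 ~ G_theta(x_m, m, c)

    a_bar_m = alpha_bars[m]
    a_bar_prev = alpha_bars[m - 1] if m > 0 else torch.tensor(1.0)
    beta_m, alpha_m = betas[m], alphas[m]

    # Posterior q(x_{m-1} | x_m, x_0): mean and variance in DDPM closed form
    mean = (torch.sqrt(a_bar_prev) * beta_m / (1 - a_bar_m)) * x0_pred \
         + (torch.sqrt(alpha_m) * (1 - a_bar_prev) / (1 - a_bar_m)) * x_m
    var = beta_m * (1 - a_bar_prev) / (1 - a_bar_m)

    # No noise is added at the final step (m == 0)
    noise = torch.randn_like(x_m) if m > 0 else torch.zeros_like(x_m)
    return mean + torch.sqrt(var) * noise
```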

Next

In the next post, we will describe our proposed method in detail.
