Music-Driven Group Choreography (Part 2)
In the previous post, we introduced AIOZ-GDANCE, a new large-scale in-the-wild dataset for music-driven group dance generation. Building on this dataset, we now introduce the first strong baseline for group dance generation that can jointly generate multiple dancing motions expressively and coherently.
Music-driven Group Dance Generation Method
Problem Formulation
Given an input music audio sequence $\{a_1, a_2, \dots, a_T\}$, where $t = 1, \dots, T$ indicates the index of music segments, and the initial 3D positions of $N$ dancers $\{\tau^1_0, \tau^2_0, \dots, \tau^N_0\}$, $\tau^i_0 \in \mathbb{R}^{3}$, our goal is to generate the group motion sequences $\{y^1_1, \dots, y^1_T; \dots; y^N_1, \dots, y^N_T\}$, where $y^i_t$ is the generated pose of the $i$-th dancer at time step $t$. Specifically, we represent the human pose as a 72-dimensional vector $y = [\tau; \theta]$, where $\tau$ and $\theta$ represent the root translation and the pose parameters of the SMPL model [1], respectively.
In general, the generated group dance motion should satisfy two conditions: (i) consistency between the generated dancing motion and the input music in terms of style, rhythm, and beat; (ii) the motions and trajectories of the dancers should be coherent, without cross-body intersection between dancers. To that end, we propose the first baseline method for group dance generation that can jointly generate multiple dancing motions expressively and coherently. Figure 1 shows the architecture of our proposed Music-driven 3D Group Dance generatoR (GDanceR), which consists of three main components:
- Transformer Music Encoder.
- Initial Pose Generator.
- Group Motion Generator.
Transformer Music Encoder
From the raw audio signal of the input music, we first extract music features using the audio processing library Librosa. Concretely, we extract the mel-frequency cepstral coefficients (MFCC), MFCC delta, constant-Q chromagram, tempogram, onset strength, and one-hot beat, which results in a 438-dimensional feature vector per frame. We then encode the music sequence $a = \{a_1, a_2, \dots, a_T\}$, $a_t \in \mathbb{R}^{438}$, into a sequence of hidden representations $x = \{x_1, x_2, \dots, x_T\}$, $x_t \in \mathbb{R}^{d_a}$. In practice, we utilize the self-attention mechanism of the Transformer [2] to effectively encode the multi-scale information and the long-term dependency between music frames. The hidden audio representation at each time step is expected to contain meaningful structural information, ensuring that the generated dancing motion is coherent across the whole sequence.
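For readers who want to see the feature pipeline concretely, below is a minimal Librosa sketch of how a 438-dimensional frame-level feature could be assembled. The sampling rate, hop length, and per-feature sizes (20 MFCC + 20 delta + 12 chroma + 384 tempogram + 1 onset + 1 beat, following the library defaults) are our illustrative assumptions, not necessarily the exact configuration used in GDanceR.

```python
import librosa
import numpy as np

def extract_music_features(audio_path, sr=22050, hop_length=512):
    """Sketch of a 438-d per-frame music feature (assumed configuration)."""
    y, sr = librosa.load(audio_path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop_length)      # (20, T)
    mfcc_delta = librosa.feature.delta(mfcc)                                        # (20, T)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)          # (12, T)
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)     # (T,)
    tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr,
                                          hop_length=hop_length)                    # (384, T)

    # One-hot beat indicator per frame.
    _, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr,
                                             hop_length=hop_length)
    beat_onehot = np.zeros_like(onset_env)
    beat_onehot[beat_frames] = 1.0

    features = np.concatenate([mfcc, mfcc_delta, chroma, tempogram,
                               onset_env[None, :], beat_onehot[None, :]], axis=0)   # (438, T)
    return features.T                                                               # (T, 438)
```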
Specifically, we first embed the music features $a$ using a Linear layer followed by Positional Encoding to encode the time ordering of the sequence:
\begin{equation}
U = \mathrm{PE}(a W^{a}),
\end{equation}
where $\mathrm{PE}$ denotes the Positional Encoding and $W^{a}$ is the parameter of the linear projection layer. Then, the hidden audio information is calculated using the self-attention mechanism:
\begin{equation}
X = \mathrm{FF}\left(\mathrm{softmax}\left(\frac{(U W^{q}_{x})(U W^{k}_{x})^{\top}}{\sqrt{d^{x}_{k}}}\right) U W^{v}_{x}\right),
\end{equation}
where $W^{q}_{x}$, $W^{k}_{x}$, and $W^{v}_{x}$ are the parameters that transform the linear audio embedding $U$ into a query, a key, and a value, respectively. $d_a$ is the dimension of the hidden audio representation, $d^{x}_{k}$ is the dimension of the query and key, and $d^{x}_{v}$ is the dimension of the value. $\mathrm{FF}$ is a feed-forward neural network.
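As a rough PyTorch sketch of this encoder (linear embedding, sinusoidal Positional Encoding, and stacked self-attention with a feed-forward network), see below. The hidden size, number of heads and layers, and maximum sequence length are assumptions, and we rely on the stock `nn.TransformerEncoder` rather than hand-rolling the single attention block written above.

```python
import math
import torch
import torch.nn as nn

class MusicEncoder(nn.Module):
    """Minimal sketch of the Transformer music encoder (sizes are assumptions)."""
    def __init__(self, feat_dim=438, d_a=256, n_heads=8, ff_dim=1024,
                 n_layers=2, max_len=4096):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_a)            # a -> a W^a
        self.pos_enc = self._build_pos_enc(max_len, d_a)
        layer = nn.TransformerEncoderLayer(d_model=d_a, nhead=n_heads,
                                           dim_feedforward=ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    @staticmethod
    def _build_pos_enc(max_len, d_model):
        # Standard sinusoidal positional encoding.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, a):                                # a: (batch, T, 438)
        u = self.embed(a) + self.pos_enc[: a.size(1)].to(a.device)   # U = PE(a W^a)
        return self.encoder(u)                           # X: (batch, T, d_a)
```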
Initial Pose Generator
Given the initial positions of all dancers, we generate the initial poses by combining the audio features with the starting positions. We aggregate the audio representation by taking the average over the audio sequence, $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$. The aggregated audio $\bar{x}$ is then concatenated with the input position and fed to a multilayer perceptron (MLP) to predict the initial pose for each dancer:
\begin{equation}
y^{i}_{0} = \mathrm{MLP}\left([\bar{x}; \tau^{i}_{0}]\right),
\end{equation}
where $[;]$ is the concatenation operator and $\tau^{i}_{0}$ is the initial position of the $i$-th dancer.
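A minimal sketch of this component, assuming a two-layer MLP and the dimensions used in the encoder sketch above:

```python
import torch
import torch.nn as nn

class InitialPoseGenerator(nn.Module):
    """Sketch: average-pooled audio + starting position -> 72-d initial pose.
    The hidden width is an illustrative assumption."""
    def __init__(self, d_a=256, pos_dim=3, pose_dim=72, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_a + pos_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x, tau0):
        # x: (batch, T, d_a) encoded audio; tau0: (batch, N, 3) initial dancer positions
        x_bar = x.mean(dim=1)                                  # average over the audio sequence
        x_bar = x_bar.unsqueeze(1).expand(-1, tau0.size(1), -1)
        return self.mlp(torch.cat([x_bar, tau0], dim=-1))      # (batch, N, 72) initial poses y^i_0
```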
Group Motion Generator
To generate the group dance motion, we aim to synthesize the motion of each dancer coherently, such that it aligns well with the input music. Furthermore, we also need to maintain global consistency between all dancers. As shown in Figure 3, our Group Motion Generator comprises a Group Encoder, which encodes the group sequence information, and an MLP Decoder, which decodes the hidden representation back to the human pose space. To effectively extract both the local motion and the global information of the group dance sequence through time, we design our Group Encoder around two components: a Recurrent Neural Network [3] to capture the temporal motion dynamics of each dancer, and an Attention mechanism [2] to encode the spatial relationship of all dancers.
Specifically, at each time step $t$, the pose of each dancer in the previous frame $y^{i}_{t-1}$ is sent to an LSTM unit to encode the hidden local motion representation $h^{i}_{t}$:
\begin{equation}
h^{i}_{t} = \mathrm{LSTM}(y^{i}_{t-1}, h^{i}_{t-1}).
\end{equation}
To ensure that the motions of all dancers are globally coherent and to discourage unwanted effects such as cross-body intersection, we introduce a Cross-entity Attention mechanism. In particular, each individual motion representation $h^{i}$ is first linearly projected into a key vector $k^{i}$, a query vector $q^{i}$, and a value vector $v^{i}$ as follows:
\begin{equation}
k^i = h^i W^{k}, \quad q^i = h^i W^{q}, \quad v^i = h^i W^{v},
\end{equation}
where $W^{q}$, $W^{k}$, and $W^{v}$ are parameters that transform the hidden motion $h^{i}$ into a query, a key, and a value, respectively. $d_k$ is the dimension of the query and key, while $d_v$ is the dimension of the value vector. To encode the relationship between dancers in the scene, our Cross-entity Attention also utilizes the Scaled Dot-Product Attention as in the Transformer [2].
In practice, we find that dancers positioned closer to each other tend to have a higher correlation in their movement. Therefore, we adopt a Spatial Encoding strategy to encode the spatial relationship between each pair of dancers. The Spatial Encoding between two entities $i$ and $j$, based on their distance in 3D space, is defined as follows:
\begin{equation}
e_{ij} = \exp\left(-\frac{\lVert \tau^{i} - \tau^{j} \rVert}{\sqrt{d_{\tau}}}\right),
\end{equation}
where $d_{\tau}$ is the dimension of the position vector $\tau$. Considering the query $q^{i}$, which represents the current entity information, and the key $k^{j}$, which represents other entity information, we inject the spatial relation between these two entities into their cross-attention coefficient:
\begin{equation}
\alpha_{ij} = \mathrm{softmax}\left(\frac{(q^{i})^{\top} k^{j}}{\sqrt{d_k}} + e_{ij}\right).
\end{equation}
To preserve the relative spatial information in the attentive representation, we also embed it into the hidden value vector and obtain the global-aware representation $a^{i}$ of the $i$-th entity as follows:
\begin{equation}
a^{i} = \sum_{j} \alpha_{ij} \left(v^{j} + e_{ij}\, g\right),
\end{equation}
where $g$ is a learnable bias scaled by the Spatial Encoding. Intuitively, the Spatial Encoding acts as a bias in the attention weight, encouraging higher interactivity and awareness between closer entities. Thanks to the encoded motion and spatial information, our attention mechanism can adaptively attend to each dancer and the others in both a temporal and a spatial manner.
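The following PyTorch sketch illustrates a single-head version of this Cross-entity Attention with the distance-based Spatial Encoding bias; the hidden dimensions are assumptions for illustration, not the exact configuration of GDanceR.

```python
import math
import torch
import torch.nn as nn

class CrossEntityAttention(nn.Module):
    """Sketch of Cross-entity Attention with Spatial Encoding (single head, assumed sizes)."""
    def __init__(self, d_h=512, d_k=64, d_v=512):
        super().__init__()
        self.W_q = nn.Linear(d_h, d_k, bias=False)
        self.W_k = nn.Linear(d_h, d_k, bias=False)
        self.W_v = nn.Linear(d_h, d_v, bias=False)
        self.g = nn.Parameter(torch.zeros(d_v))   # learnable bias scaled by the Spatial Encoding
        self.d_k = d_k

    def forward(self, h, tau):
        # h:   (batch, N, d_h) hidden motion of each dancer at the current time step
        # tau: (batch, N, 3)   current 3D positions of the dancers
        q, k, v = self.W_q(h), self.W_k(h), self.W_v(h)

        # Spatial Encoding e_ij = exp(-||tau_i - tau_j|| / sqrt(d_tau))
        dist = torch.cdist(tau, tau)                               # (batch, N, N)
        e = torch.exp(-dist / math.sqrt(tau.size(-1)))

        # Attention coefficients with the spatial bias added to the logits.
        logits = q @ k.transpose(-1, -2) / math.sqrt(self.d_k) + e
        alpha = logits.softmax(dim=-1)                             # (batch, N, N)

        # Value vectors augmented with the scaled learnable bias g.
        v_aug = v.unsqueeze(1) + e.unsqueeze(-1) * self.g          # (batch, N, N, d_v)
        return (alpha.unsqueeze(-1) * v_aug).sum(dim=2)            # (batch, N, d_v): a^i
```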
We then fuse the local and global motion representations by adding $h^{i}$ and $a^{i}$ to obtain the final latent motion $g^{i}$. This final global-local representation of each entity is expected to carry comprehensive information about its own past motion as well as the motion of every other entity, enabling the MLP Decoder to generate coherent group dancing sequences. Finally, we generate the next movement $y^{i}_{t}$ based on the final motion representation $g^{i}_{t}$ as well as the hidden audio representation $x_{t}$, which captures the fine-grained correspondence between the music feature sequence and the dance movement sequence:
\begin{equation}
y^{i}_{t} = \mathrm{MLP}\left([g^{i}_{t}; x_{t}]\right).
\end{equation}
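To tie the pieces together, here is a rough sketch of one autoregressive decoding step: a per-dancer LSTM, the Cross-entity Attention sketched above, additive fusion, and an MLP decoder. The layer sizes, the use of the first three pose dimensions as the dancers' positions, and the exact decoder shape are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

class GroupMotionGenerator(nn.Module):
    """Sketch of one decoding step: LSTM -> Cross-entity Attention -> fuse -> MLP decoder.
    Sizes are illustrative; CrossEntityAttention is the sketch above."""
    def __init__(self, pose_dim=72, d_h=512, d_a=256):
        super().__init__()
        self.lstm = nn.LSTMCell(pose_dim, d_h)
        self.attn = CrossEntityAttention(d_h=d_h, d_k=64, d_v=d_h)
        self.decoder = nn.Sequential(nn.Linear(d_h + d_a, 512), nn.ReLU(),
                                     nn.Linear(512, pose_dim))

    def step(self, y_prev, state, x_t):
        # y_prev: (batch, N, 72) poses at t-1; x_t: (batch, d_a) audio feature at t
        # state:  per-dancer LSTM states, a tuple of (batch*N, d_h) tensors
        B, N, _ = y_prev.shape
        h, c = self.lstm(y_prev.reshape(B * N, -1), state)   # local motion h^i_t
        h_local = h.reshape(B, N, -1)
        tau = y_prev[..., :3]                                # root translations as positions (assumed layout)
        a_global = self.attn(h_local, tau)                   # global-aware a^i_t
        g = h_local + a_global                               # fuse local and global motion
        x_rep = x_t.unsqueeze(1).expand(-1, N, -1)
        y_t = self.decoder(torch.cat([g, x_rep], dim=-1))    # next poses y^i_t
        return y_t, (h, c)
```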
Built upon these components, our model can effectively learn and generate coherent group dance animations from several pieces of music. In the next part, we will go through the experiments and detailed studies of the method.
References
[1] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics, 2015.
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.