Controllable Group Choreography using Contrastive Diffusion (Part 3)
In the previous part, we covered the Music-Motion Transformer. In this part, let's investigate Group Modulation and the Contrastive Divergence Loss.
Our code can be found at: https://github.com/aioz-ai/GCD
1. Group Modulation
To better apply the group information constraints to the learned hidden features of the dancers, we adopt a Group Modulation layer that learns to adaptively influence the output of the transformer attention block by applying an affine transformation to the intermediate features based on the group embedding $w$. More specifically, we use two separate linear layers to learn the affine transformation parameters $S(w)$ and $b(w)$ from the group embedding $w$. The predicted affine parameters are then used to modulate the activation sequence $\mathbf{h}$ as follows:
$$\text{GroupModulation}(\mathbf{h}, w) = S(w) \cdot \frac{\mathbf{h} - \mu(\mathbf{h})}{\sigma(\mathbf{h})} + b(w)$$
where each channel of the whole activation sequence is first normalized separately using its mean $\mu(\mathbf{h})$ and standard deviation $\sigma(\mathbf{h})$, and then scaled and biased using the predicted affine parameters $S(w)$ and $b(w)$. Intuitively, this operation shifts the hidden motion features of each individual dancer towards a unified group representation, which further encourages the association between them. Finally, the output features are projected back to the original motion dimensions via a linear layer to obtain the predicted outputs.
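The modulation step described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the code from the repository: the function name `group_modulation`, the weight names `W_s`, `b_s`, `W_b`, `b_b`, the tensor shapes, and the small `eps` term are all hypothetical choices made for the example.

```python
import numpy as np

def group_modulation(h, w, W_s, b_s, W_b, b_b, eps=1e-5):
    """Modulate activations h with an affine transform predicted from w.

    h: (L, d) activation sequence (dancers and frames stacked along L)
    w: (k,) group embedding
    W_s, b_s and W_b, b_b: weights of two hypothetical linear layers
    """
    scale = W_s @ w + b_s                  # predicted scale, shape (d,)
    bias = W_b @ w + b_b                   # predicted bias, shape (d,)
    mu = h.mean(axis=0, keepdims=True)     # per-channel mean over the sequence
    sigma = h.std(axis=0, keepdims=True)   # per-channel std over the sequence
    h_norm = (h - mu) / (sigma + eps)      # normalize each channel separately
    return scale * h_norm + bias           # scale and shift with group params

rng = np.random.default_rng(0)
L, d, k = 12, 4, 3                         # toy sizes, chosen arbitrarily
h = rng.normal(size=(L, d))
w = rng.normal(size=k)
W_s, W_b = rng.normal(size=(d, k)), rng.normal(size=(d, k))
b_s, b_b = np.ones(d), np.zeros(d)
out = group_modulation(h, w, W_s, b_s, W_b, b_b)
```

After modulation, every channel has (approximately) the mean and spread dictated by the group embedding, regardless of which dancer produced the features. That is the sense in which individual features are pulled towards a shared group representation.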
2. Contrastive Diffusion
We learn representations that encode the underlying shared information between the group embedding $w$ and the group sequence $x$. Specifically, we model a density ratio that preserves the mutual information between $x$ and $w$ as:
$$f(x, w) \propto \frac{p(x \mid w)}{p(x)}$$
Here $f(\cdot)$ is a model (i.e., a neural network) that predicts a positive score (how well $x$ is related to $w$) for a pair $(x, w)$.
To enhance the association between the generated group dance (data) and the group embedding (context), we aim to maximize their mutual information with a Contrastive Encoder $f(\hat{x}, w)$ via the contrastive learning objective in the equation below. The encoder takes both the generated group dance sequence $\hat{x}$ and a group embedding $w$ as inputs, and outputs a score indicating the correspondence between the two.
$$\mathcal{L}_{nce} = -\mathbb{E}\left[\log \frac{f(\hat{x}, w)}{f(\hat{x}, w) + \sum_{\hat{x}^j \in X'} f(\hat{x}^j, w)}\right]$$
where $X'$ is a set of randomly constructed negative sequences. In general, this loss is similar to the cross-entropy loss for classifying the positive sample, and optimizing it maximizes the mutual information between the learned context representation and the data. With the contrastive objective, we expect the Contrastive Encoder to learn to distinguish between two qualities: consistency (the positive sequence) and diversity (the negative sequences). This is the key factor that lets our framework control diversity and consistency.
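To make the cross-entropy analogy concrete, here is a minimal sketch of an InfoNCE-style loss for a single positive score and a handful of negative scores. It assumes that the scores are already positive numbers (for example, the exponential of a network logit) and that the positive score is included in the denominator. The function name `contrastive_loss` and the example values are hypothetical.

```python
import math

def contrastive_loss(pos_score, neg_scores):
    """InfoNCE-style loss for one positive and several negative scores.

    All scores are assumed to be strictly positive, e.g. exp(logit).
    """
    denom = pos_score + sum(neg_scores)
    # Negative log-probability of picking the positive among all candidates.
    return -math.log(pos_score / denom)

# Encoder that clearly prefers the positive sequence: low loss.
well_separated = contrastive_loss(9.0, [0.5, 0.5])
# Encoder that cannot tell positive from negatives: loss equals log(3).
confused = contrastive_loss(1.0, [1.0, 1.0])
```

The loss drops as the encoder assigns a higher score to the consistent (positive) sequence than to the diverse (negative) ones, which is exactly the behaviour of a softmax classifier trained to pick out the positive.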
Here we describe our strategy for constructing contrastive samples. Recall that we use the reverse distribution of Gaussian diffusion, with the mean given by the model's prediction and the variance fixed by a scheduler. Given the true pair $(x_0, w)$, we first apply the forward diffusion process $q(x_m \mid x_0)$ to obtain the noised sample $x_m$. Our positive sample is then $p_\theta(x_{m-1} \mid x_m, w)$. To build a negative sample, we replace some dancers in the group with dancers from other group dance sequences ($x^j_0 \neq x_0$) with some probability, feed the mixed group through the forward process to obtain $x^j_m$, and take $p_\theta(x^j_{m-1} \mid x^j_m, w)$ as the negative sample.
Constructed this way, the positive pair $(x_0, w)$ represents a group sequence with high consistency, whereas the negative one represents a high-diversity sample: mixing in dancers from different groups is likely to produce substantially different movements between dancers.
Note that negative sequences should still match the music. They are outputs of the denoising network, whose inputs are the current music and the noised mixed group with some replaced dancers. Because the main diffusion training objective is computed only on ground-truth dances (positive samples) that are consistent with the music, the network's outputs will likely follow the music even for these manipulated inputs. Negative samples are therefore not just random bad samples but valid group dances generated by a network trained to produce group dance conditioned on music. This strategy also lets us learn a more powerful group representation, since it directly affects the reverse process, which helps maintain consistency in long-term synthesis.
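The input side of this sampling strategy can be sketched as follows. The example covers only the dancer replacement and the forward noising; the denoising network that turns the noised inputs into the actual positive and negative samples is omitted. The function names, the `(N, T, D)` layout, the per-dancer replacement probability, and the `alpha_bar_m` parameter (the cumulative noise-schedule term from the standard DDPM formulation) are assumptions made for illustration.

```python
import numpy as np

def make_negative_group(x0, other_groups, replace_prob, rng):
    """Swap each dancer for one from another group with prob. replace_prob.

    x0: (N, T, D) clean group sequence (N dancers, T frames, D features)
    other_groups: (G, N, T, D) sequences taken from different group dances
    """
    x_neg = x0.copy()
    n_dancers = x0.shape[0]
    swap = rng.random(n_dancers) < replace_prob    # which dancers to replace
    for i in np.flatnonzero(swap):
        g = rng.integers(other_groups.shape[0])    # random other group
        j = rng.integers(n_dancers)                # random dancer in it
        x_neg[i] = other_groups[g, j]
    return x_neg, swap

def forward_diffuse(x0, alpha_bar_m, rng):
    """Sample x_m from q(x_m | x_0) = N(sqrt(a) * x_0, (1 - a) * I)."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar_m) * x0 + np.sqrt(1.0 - alpha_bar_m) * noise

rng = np.random.default_rng(0)
x0 = np.zeros((3, 5, 2))             # toy positive group: all zeros
others = np.ones((4, 3, 5, 2))       # toy "other groups": all ones
x_neg, swapped = make_negative_group(x0, others, replace_prob=0.5, rng=rng)
x_m_pos = forward_diffuse(x0, alpha_bar_m=0.7, rng=rng)     # noised positive
x_m_neg = forward_diffuse(x_neg, alpha_bar_m=0.7, rng=rng)  # noised negative
```

In a full training step, both `x_m_pos` and `x_m_neg` would be passed through the denoising network together with the music and the group embedding, and the two outputs would be scored by the Contrastive Encoder.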
3. Diversity vs. Consistency
Using the Contrastive Encoder $f(x_m, w)$, we extend classifier guidance to control the generation process. We use $f(x_m, w)$ from the contrastive framework in place of the guiding classifier in the original formula, since it provides a score of how consistent the sample is with the group information. In particular, we shift the mean of the reverse diffusion process with the log gradient of the Contrastive Encoder with respect to the generated data as follows:
$$\hat{\mu}_\theta(x_m, m, w) = \mu_\theta(x_m, m, w) + \gamma \cdot \Sigma_m \, \nabla_{x_m} \log f(x_m, w)$$
where $\Sigma_m$ is the variance of the reverse process at step $m$ and $\gamma$ is the control parameter that uses the encoder to enforce consistency and connection with the group embedding. Since the Contrastive Encoder is trained to distinguish high-consistency from high-diversity samples, its gradients yield meaningful guidance signals for controlling the trade-off. Intuitively, a positive value of $\gamma$ encourages more consistency between dancers, while a negative value (which corresponds to shifting the distribution with a negative gradient step) boosts the diversity between individual dancers.
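The effect of the sign of the control parameter can be seen in a toy example. The sketch below replaces the Contrastive Encoder with a stand-in score, $\log f(x, w) = -\tfrac{1}{2}\lVert x - w \rVert^2$, whose gradient is known in closed form. The real encoder is a neural network whose gradient would come from automatic differentiation, and scaling the gradient by the variance follows standard classifier guidance. All names and values here are hypothetical.

```python
import numpy as np

def guided_mean(mu, x_m, w, gamma, variance):
    """Shift the reverse-process mean by gamma * variance * grad log f.

    Toy score: log f(x, w) = -0.5 * ||x - w||^2, so grad_x log f = w - x.
    """
    grad_log_f = w - x_m
    return mu + gamma * variance * grad_log_f

x_m = np.array([1.0, -1.0])   # current noisy sample (toy values)
w = np.array([0.0, 0.0])      # stand-in for the group "target"
mu = x_m.copy()               # unguided mean predicted by the model (toy)

toward = guided_mean(mu, x_m, w, gamma=0.5, variance=1.0)   # more consistency
away = guided_mean(mu, x_m, w, gamma=-0.5, variance=1.0)    # more diversity
```

With a positive control parameter the mean moves in the direction that raises the consistency score, and with a negative one it moves in the opposite direction, which mirrors the consistency versus diversity trade-off described above.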
Next
In the next post, we will cover the experimental setup and analysis.