AIOZ AI / Feb 14th / 53 min read

Controllable Group Choreography using Contrastive Diffusion (Part 3)

Dive into the methodology of controllable group choreography

In the previous part, we explored the Music-Motion Transformer. In this part, let's investigate Group Modulation and the Contrastive Divergence Loss.

Our code can be found at: https://github.com/aioz-ai/GCD

1. Group Modulation

To better apply the group information constraints to the learned hidden features of the dancers, we adopt a Group Modulation layer. It learns to adaptively influence the output of the transformer attention block by applying an affine transformation to the intermediate features based on the group embedding $w$. More specifically, we use two separate linear layers to learn the affine transformation parameters $S(w), b(w) \in \mathbb{R}^d$ from the group embedding $w$. The predicted affine parameters are then used to modulate the activation sequence $h = \{h^1_1\dots h^1_T;\dots;h^N_1\dots h^N_T\}$ as follows:

\tilde{h} = S(w) * \frac{h-\mu(h)}{\sigma(h)}+ b(w)

where each channel of the whole activation sequence is first normalized separately using its mean $\mu$ and standard deviation $\sigma$, and then scaled and shifted using the predicted affine parameters $S(w)$ and $b(w)$. Intuitively, this operation shifts the hidden motion features of each individual dancer towards a unified group representation, which further encourages the association between them. Finally, the output features are projected back to the original motion dimensions via a linear layer to obtain the predicted outputs $\hat{x}_0$.
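The modulation above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code from the repository: the weight names `W_s`, `b_s`, `W_b`, `b_b`, the small `eps` term, and all tensor sizes are assumptions made for the example, and the two linear layers are written as plain matrix products.

```python
import numpy as np

def group_modulation(h, w, W_s, b_s, W_b, b_b, eps=1e-5):
    """Modulate activations h of shape (N*T, d) with a group embedding w of shape (d_w,).

    W_s, b_s and W_b, b_b are hypothetical weights of the two linear layers
    that predict S(w) and b(w).
    """
    scale = w @ W_s + b_s                  # S(w), shape (d,)
    shift = w @ W_b + b_b                  # b(w), shape (d,)
    mu = h.mean(axis=0, keepdims=True)     # per-channel mean over the whole sequence
    sigma = h.std(axis=0, keepdims=True)   # per-channel standard deviation
    h_norm = (h - mu) / (sigma + eps)      # eps (an assumption) avoids division by zero
    return scale * h_norm + shift

rng = np.random.default_rng(0)
N, T, d, d_w = 3, 4, 8, 5                  # dancers, frames, channels, embedding size
h = rng.normal(size=(N * T, d))
w = rng.normal(size=d_w)
W_s, b_s = rng.normal(size=(d_w, d)), np.ones(d)
W_b, b_b = rng.normal(size=(d_w, d)), np.zeros(d)
h_tilde = group_modulation(h, w, W_s, b_s, W_b, b_b)
```

After modulation, every channel of `h_tilde` has a mean equal to $b(w)$ and a standard deviation close to $|S(w)|$, regardless of which dancer a frame belongs to, which is one way to read the "unified group representation" intuition.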

2. Contrastive Diffusion

We learn the representations that encode the underlying shared information between the group embedding information $w$ and the group sequence $x$ . Specifically, we model a density ratio that preserves the mutual information between $x$ and $w$ as:

f(x, w) \propto \frac{p({x}|{w})}{p({x})}

$f(\cdot)$ is a model (i.e., a neural network) to predict a positive score (how well $x$ is related to $w$ ) for a pair of $({x}, {w})$ .

To enhance the association between the generated group dance (data) and the group embedding (context), we aim to maximize their mutual information with a Contrastive Encoder $f(\hat{x},w)$ via the contrastive learning objective given in the equation below. The encoder takes both the generated group dance sequence $\hat{x}$ and a group embedding $w$ as inputs, and outputs a score indicating the correspondence between the two.

\mathcal{L}_{\rm nce} = - \mathbb{E} \left[ \log\frac{f(\hat{x},w)}{f(\hat{x},w) + \sum_{x^j \in X'}f(\hat{x}^j, w)}\right]

where $X'$ is a set of randomly constructed negative sequences. In general, this loss is similar to the cross-entropy loss for classifying the positive sample, and optimizing it leads to the maximization of the mutual information between the learned context representation and the data. Using the contrastive objective, we expect the Contrastive Encoder to learn to distinguish between the two quantities: consistency (the positive sequence) and diversity (the negative sequence). This is the key factor that enables the ability to control diversity and consistency in our framework.
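As a rough illustration of the objective, the loss can be computed from one positive score and a set of negative scores per sample. The sketch below assumes the encoder outputs an unnormalized logit and that the positive score is obtained as $f = \exp(\text{logit})$; this parameterization and the function name `nce_loss` are assumptions for the example, not details taken from the original implementation.

```python
import numpy as np

def nce_loss(pos_logits, neg_logits):
    """Contrastive loss for a batch.

    pos_logits: (B,)   encoder logits for the positive pairs (x_hat, w)
    neg_logits: (B, K) encoder logits for K negative sequences per sample
    Scores are assumed to be f = exp(logit), which keeps them positive.
    """
    pos = np.exp(pos_logits)                    # f(x_hat, w)
    neg = np.exp(neg_logits).sum(axis=1)        # sum of f(x_hat^j, w) over negatives
    return -np.mean(np.log(pos / (pos + neg)))  # average over the batch

# One positive and three negatives with equal scores: the positive gets probability 1/4.
loss_uninformed = nce_loss(np.zeros(1), np.zeros((1, 3)))
# A higher positive score lowers the loss.
loss_confident = nce_loss(np.array([5.0]), np.zeros((1, 3)))
```

Written this way, the loss is the cross-entropy of picking the positive sample out of the $K+1$ candidates, which matches the classification view described above.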

Here, we describe our strategy for constructing contrastive samples. Recall that we use the reverse distribution $p_\theta(x_{t-1} | x_t)$ of Gaussian diffusion, whose mean is the prediction of the model while the variance is fixed by a schedule (the approximate posterior equation introduced earlier in this series). Given the true pair $(x_0,w)$, we first apply the forward diffusion process $q(x_m|x_0)$ to obtain the noised sample $x_m$. Our positive sample is then $p_\theta(x_{m-1} | x_m, w)$. Subsequently, we construct the negative sample from the positive pair by randomly replacing dancers with dancers from other group dance sequences ($x^j_0 \neq x_0$) with some probability, and feeding the result through the forward process to obtain $x^j_m$; our negative sample is then $p_\theta(x^j_{m-1} | x^j_m, w)$.

With contrastive samples constructed this way, the positive pair $(x_0,w)$ represents a group sequence with high consistency, whereas the negative one represents a high-diversity sample. This is because mixing a sample with dancers from different groups is likely to result in substantially distinctive movements between dancers, making it a group dance sample with a high degree of diversity.

Note that negative sequences should also match the music, because they are motions generated by the network whose inputs are manipulated to increase diversity. Specifically, negative samples are the outputs of the denoising network when its inputs are the current music and the noised mixed group with some replaced dancers. Since our main diffusion training objective is computed only for ground-truth dances (positive samples) that are consistent with the music, the network is trained to reconstruct only positive samples, and its outputs will likely follow the music. Therefore, negative samples are not just random bad samples but valid group dances generated by a network that is trained to generate group dance conditioned on the music.

Our proposed strategy also allows us to learn a more powerful group representation, as it directly affects the reverse process, which is beneficial for maintaining consistency in long-term synthesis.
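The sample construction described in this section can be sketched as follows. This is a minimal sketch under several assumptions: the forward process uses the standard DDPM form $q(x_m|x_0) = \mathcal{N}(\sqrt{\bar\alpha_m}\,x_0, (1-\bar\alpha_m)I)$, the denoising network is passed in as a hypothetical callable `denoise` (music and group embedding conditioning are left out), and the names `mix_dancers`, `q_sample`, `contrastive_pair`, and `p_replace` are made up for the example.

```python
import numpy as np

def mix_dancers(x0, others, p_replace, rng):
    """Replace each dancer in x0 with probability p_replace.

    x0:     (N, T, D) positive group sequence (N dancers, T frames, D motion dims)
    others: (M, N, T, D) other group dance sequences to borrow dancers from
    """
    mixed = x0.copy()
    replace = rng.random(x0.shape[0]) < p_replace   # which dancers get swapped
    for n in np.flatnonzero(replace):
        j = rng.integers(len(others))               # pick a random other sequence
        mixed[n] = others[j, n]
    return mixed, replace

def q_sample(x0, alpha_bar_m, rng):
    """Forward diffusion to step m, assuming the standard DDPM parameterization."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_m) * x0 + np.sqrt(1.0 - alpha_bar_m) * noise

def contrastive_pair(x0, others, denoise, alpha_bar_m, p_replace, rng):
    """Return (positive, negative) outputs of the hypothetical denoiser."""
    positive = denoise(q_sample(x0, alpha_bar_m, rng))
    mixed, _ = mix_dancers(x0, others, p_replace, rng)
    negative = denoise(q_sample(mixed, alpha_bar_m, rng))
    return positive, negative
```

Both samples pass through the same denoiser, so the negative is still a network output rather than raw mixed data; only its input has been manipulated.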

3. Diversity vs. Consistency

Using the Contrastive Encoder $f(x_m,w)$, we extend classifier guidance to control the generation process. Specifically, we use $f(x_m,w)$ in place of the guiding classifier in the original formulation, since it provides a score of how consistent the sample is with the group information. We then shift the mean of the reverse diffusion process by the gradient of the log of the Contrastive Encoder with respect to the generated data:

\hat{\mu}_\theta(x_m,m) = \mu_\theta(x_m,m) + \gamma \cdot \Sigma_{\theta}(x_m,m) \nabla_{x_m}\log f(x_m,w)

where $\gamma$ is the control parameter that scales the encoder's guidance, enforcing consistency and connection with the group embedding. Since the Contrastive Encoder is trained to classify between high-consistency and high-diversity samples, its gradients yield meaningful guidance signals for controlling the trade-off between the two. Intuitively, a positive value of $\gamma$ encourages more consistency between dancers, while a negative value (which corresponds to shifting the distribution with a negative gradient step) boosts the diversity between individual dancers.
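To make the effect of the sign of $\gamma$ concrete, here is a toy sketch of the mean shift. It does not use a trained Contrastive Encoder: as a stand-in, it assumes the hypothetical score $\log f(x) = -\tfrac{1}{2}\sum_n \|x^n - \bar{x}\|^2$, which is high when dancers are close to the group mean $\bar{x}$ and ignores $w$, and whose gradient is $-(x^n - \bar{x})$. The covariance $\Sigma_\theta$ is also simplified to a scalar variance `sigma2`, and the model mean is set to the current sample for illustration.

```python
import numpy as np

def toy_grad_log_f(x_m):
    """Gradient of the toy consistency score with respect to x_m of shape (N, D)."""
    return -(x_m - x_m.mean(axis=0, keepdims=True))

def guided_mean(mu, sigma2, grad_log_f, gamma):
    """Shift the reverse-process mean by gamma * Sigma * grad log f."""
    return mu + gamma * sigma2 * grad_log_f

def spread(x):
    """Total variance across dancers, used here as a rough diversity measure."""
    return x.var(axis=0).sum()

rng = np.random.default_rng(0)
x_m = rng.normal(size=(4, 6))          # 4 dancers, 6 motion dims (toy sizes)
mu = x_m                               # stand-in for mu_theta(x_m, m)
grad = toy_grad_log_f(x_m)

mu_consistent = guided_mean(mu, 0.1, grad, gamma=2.0)    # pulls dancers together
mu_diverse = guided_mean(mu, 0.1, grad, gamma=-2.0)      # pushes dancers apart
```

With this toy score, `spread(mu_consistent)` is smaller than `spread(mu)` and `spread(mu_diverse)` is larger, mirroring the trade-off that $\gamma$ controls in the actual framework.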