In this post, we will explore a specific method for making a deep keypoint detection model lightweight: knowledge distillation.

## Knowledge Distillation for Keypoint Detection Model

Goal: before distilling the entire face moving model, instead of random initialisation, we can initialise the weights of the student model by

1) transferring weights learned from another task, or

2) distilling half of the model, i.e. training the student's guided layer to predict the output of the teacher's hidden layer.

Given the modular nature of the face moving model, we can choose to distil the keypoint detector such that its outputs (keypoints and jacobian matrices) are as similar to those of the teacher model as possible. Then we can use the weights of the distilled keypoint detector as initialisation for full face moving model distillation.

### Loss functions

As our goal is to match the keypoints and jacobian matrices, the most intuitive objective function to set is

$$L_1 = \| kp_s - kp_t \| + \| j_s - j_t \|$$

where $kp_s, j_s$ and $kp_t, j_t$ are the keypoints and jacobian matrices produced by the student model and teacher model, respectively.
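A minimal PyTorch sketch of this objective (the tensor shapes and the choice of mean absolute distance are assumptions for illustration, not the exact implementation):

```python
import torch

def keypoint_distillation_loss(kp_s, kp_t, j_s, j_t):
    """Match student keypoints and jacobians to the teacher's.

    Assumed shapes: kp_*: (B, K, 2) keypoint coordinates,
    j_*: (B, K, 2, 2) jacobian matrices.
    """
    kp_term = torch.mean(torch.abs(kp_s - kp_t))
    jac_term = torch.mean(torch.abs(j_s - j_t))
    return kp_term + jac_term
```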

#### Heatmaps

However, this approach reduces the distillation problem to a regression problem, without taking advantage of the abundant spatial information in the architecture. To let the student model learn from the intermediate spatial information of the teacher model, and to increase the student model's fidelity at hidden layers, we can add another term to the loss function:

$$L_2 = L_1 + \| h_s - h_t \|$$

where $h_s, h_t$ are post-softmax heatmaps produced by the student model and teacher model, respectively. Inspired by Hinton et al., the heatmaps used are post-softmax with the temperature set to 0.1. This temperature value softens the distribution of the heatmaps and gives a richer representation of the teacher's output than keypoint coordinates and jacobian matrices alone. Furthermore, according to Wang et al., "with face alignment, pose and viewpoint variations have large influence, hence hidden layers are preferred for distillation".
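A sketch of the heatmap term in PyTorch, applying a Hinton-style temperature softmax over the spatial dimensions (the heatmap shape and the use of mean absolute distance between heatmaps are assumptions):

```python
import torch
import torch.nn.functional as F

def softened_heatmap(logits, temperature=0.1):
    # logits: (B, K, H, W) pre-softmax heatmaps; Hinton-style
    # softmax(logits / T) taken over the spatial dimensions.
    b, k, h, w = logits.shape
    flat = logits.view(b, k, h * w) / temperature
    return F.softmax(flat, dim=-1).view(b, k, h, w)

def heatmap_loss(logits_s, logits_t, temperature=0.1):
    # Match the student's softened heatmaps to the teacher's.
    h_s = softened_heatmap(logits_s, temperature)
    h_t = softened_heatmap(logits_t, temperature)
    return torch.mean(torch.abs(h_s - h_t))
```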

#### Attention Transfer

To continue our exploration of increasing the student's fidelity at deeper layers, we experiment with matching the information encoded by the student's encoder to that of the teacher, by applying activation-based attention transfer. The teacher's encoder output $A_t$ has size (1024, 2, 2) and the student's encoder output $A_s$ has size (512, 4, 4). Since they are of different sizes, we apply a (1×1) convolution and max pooling on $A_s$ to match their sizes. Then we apply a mapping function on both activation tensors to get attention maps,

where the mapping function is

$F_{sum}^p(A) = \sum^C_{i=1} |A_i|^p$

where $A_i$ is the $i$-th channel of activation tensor $A$. The loss function is the distance between the $L_2$-normalised, vectorised attention maps:

$$L_{AT} = \left\| \frac{\mathrm{vec}(F(A_s))}{\|\mathrm{vec}(F(A_s))\|_2} - \frac{\mathrm{vec}(F(A_t))}{\|\mathrm{vec}(F(A_t))\|_2} \right\|_2$$

As demonstrated in the Attention Transfer paper, the attention maps generated by $F^p_{sum}$ are highly correlated with objects found in the image, at different depths of the network. At shallower layers, attention maps have higher values at corners and edges of the face, such as the tip of the nose, the eyes, and the lips; at deeper layers, they attend to the general structure of the face. In our case, we chose the encoder output. The attention maps at this location are very small in spatial dimension and hold a high-level representation of the input image, so we do not expect them to help much with fine-grained keypoint coordinates. However, we suspect they should help with the jacobian matrices, where local movements surrounding the keypoints are learned. They should also help downstream fidelity, as we are pushing the student model's encoder to match the output of the teacher's encoder.
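The adaptation and attention-transfer steps above can be sketched in PyTorch as follows (the channel counts come from the shapes above; the choice of $p = 2$ and of a learned 1×1 conv for channel matching are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_map(activation, p=2):
    # F_sum^p: sum |A_i|^p over channels -> (B, H, W)
    return activation.abs().pow(p).sum(dim=1)

class AttentionTransferLoss(nn.Module):
    def __init__(self, s_channels=512, t_channels=1024, p=2):
        super().__init__()
        # 1x1 conv (512 -> 1024 channels) followed by 2x2 max pooling
        # maps the student's (512, 4, 4) activation to (1024, 2, 2).
        self.adapt = nn.Conv2d(s_channels, t_channels, kernel_size=1)
        self.p = p

    def forward(self, a_s, a_t):
        a_s = F.max_pool2d(self.adapt(a_s), kernel_size=2)
        # Vectorise and L2-normalise each attention map before comparing.
        q_s = F.normalize(attention_map(a_s, self.p).flatten(1), dim=1)
        q_t = F.normalize(attention_map(a_t, self.p).flatten(1), dim=1)
        return (q_s - q_t).norm(p=2, dim=1).mean()
```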

#### Transfer Initialisation

Finally, we consider another option suggested by Wang et al.: pre-training the student model on another task. As transferability increases when the tasks become more similar, we decided to pre-train the Squeeze U-Net on a face alignment task, i.e. also landmark localisation, but with clearly defined facial landmarks rather than learned deep features.

#### Metrics

For consistent comparison between the loss functions, we use Average Keypoint Distance (AKD), the average L2 distance between the student model's predicted keypoints and those of the teacher model:

$$AKD = \frac{1}{L} \sum_{i=1}^{L} \left\| kp_s^{(i)} - kp_t^{(i)} \right\|_2$$

where L is the number of keypoints.
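The metric is a one-liner in PyTorch (the (B, L, 2) keypoint shape is an assumption):

```python
import torch

def average_keypoint_distance(kp_s, kp_t):
    # kp_*: (B, L, 2) keypoints; L2 distance per keypoint,
    # averaged over the L keypoints (and the batch).
    return (kp_s - kp_t).norm(dim=-1).mean()
```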

#### Results

https://docs.google.com/spreadsheets/d/1JJGYfOdFDOW7MhqMHF7RcumV3J1tof-bAgDb8ybrekI/edit?usp=sharing

While L2 and L3 gave competitive results, we see that L1 consistently gives the lowest val_loss, showing signs of stability and an ability to generalise well. However, given the limited dataset, it is difficult to explore the true capability of L2 and L3. Transfer initialisation also does not show significant improvement over random initialisation.