Model Compression Part 3.

Explore techniques used in a specific task, the Deep Keypoint Detection for Motion Driving task.
type: practical
level: advanced

In this post, we will explore the deep keypoint detection task and its architecture, then analyse how to make the model more lightweight.

Model goal and output

Within the First-Order Motion model, this keypoint detector network is part of the motion estimation module, where keypoints and local affine transformations predicted by this network will be used to model motion from one frame to another. The model does not require any keypoint annotations during training and is trained as part of the whole system. The keypoints learned are not specific to any structure and are often associated with highly moving parts.

In this report, we will explore the keypoint detector network in detail: its overall structure and each of its modules and layers.

Overall analysis

The model input is an RGB image of size (256,256). The image is then passed through four large modules:
1) anti-alias interpolation, which reduces the spatial dimensions to (64,64);
2) the hourglass module, which extracts feature maps and reconstructs them at the (64,64) resolution;
3) the post-processing module, which converts the reconstructed feature maps into normalised heatmaps and, in turn, keypoints; and
4) the jacobian module, which uses the keypoints to learn the local affine transformations, i.e. it focuses prediction near the keypoints.

Below is a brief summary of the running time of each module on six videos. All times are in ms.

video_id   anti_alias   hourglass   heatmap_keypoint   jacobian   total_infer_time
id10282    1.3195       24.1060     0.9515             1.6774     28.2299
id10286    2.3068       28.2253     1.7142             3.0940     35.5235
id10291    2.3897       28.3323     1.7671             3.1878     35.8659
id10281    2.5809       30.1129     1.9414             3.4794     38.3216
id10283    2.7626       30.6548     2.0634             3.7266     39.4218
id10280    2.8620       30.8420     2.1227             3.8315     39.8855
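
To make the table easier to read from a compression point of view, the short sketch below computes the share of total inference time spent in the hourglass module. The numbers are copied from the table; the helper name `module_share` is only illustrative.

```python
# Per-module running times in ms, copied from the table above.
timings = {
    "id10282": {"anti_alias": 1.3195, "hourglass": 24.1060, "heatmap_keypoint": 0.9515, "jacobian": 1.6774, "total": 28.2299},
    "id10286": {"anti_alias": 2.3068, "hourglass": 28.2253, "heatmap_keypoint": 1.7142, "jacobian": 3.0940, "total": 35.5235},
    "id10291": {"anti_alias": 2.3897, "hourglass": 28.3323, "heatmap_keypoint": 1.7671, "jacobian": 3.1878, "total": 35.8659},
    "id10281": {"anti_alias": 2.5809, "hourglass": 30.1129, "heatmap_keypoint": 1.9414, "jacobian": 3.4794, "total": 38.3216},
    "id10283": {"anti_alias": 2.7626, "hourglass": 30.6548, "heatmap_keypoint": 2.0634, "jacobian": 3.7266, "total": 39.4218},
    "id10280": {"anti_alias": 2.8620, "hourglass": 30.8420, "heatmap_keypoint": 2.1227, "jacobian": 3.8315, "total": 39.8855},
}


def module_share(row, module):
    """Fraction of the total inference time spent in one module."""
    return row[module] / row["total"]


hourglass_shares = {vid: module_share(row, "hourglass") for vid, row in timings.items()}

for vid, share in hourglass_shares.items():
    print(f"{vid}: hourglass takes {share:.1%} of total inference time")
```

On these six videos the hourglass accounts for roughly 77% to 85% of the total, which is why the rest of this post concentrates on it.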

Hourglass module

A standard U-Net is used to extract features from the input image and reconstruct them before they are converted to heatmaps. Five downblocks make up the encoder and five upblocks make up the decoder. Feature maps from the encoder are concatenated to the corresponding feature maps in the decoder through skip connections.

block             input_shape    output_shape   # params
down_sampling_1   (3,64,64)      (64,32,32)     1,920
down_sampling_2   (64,32,32)     (128,16,16)    74,112
down_sampling_3   (128,16,16)    (256,8,8)      295,680
down_sampling_4   (256,8,8)      (512,4,4)      1,181,184
down_sampling_5   (512,4,4)      (1024,2,2)     4,721,664
upsampling_1      (1024,2,2)     (512,4,4)      4,720,128
upsampling_2      (1024,4,4)     (256,8,8)      2,360,064
upsampling_3      (512,8,8)      (128,16,16)    590,208
upsampling_4      (256,16,16)    (64,32,32)     147,648
upsampling_5      (128,32,32)    (32,64,64)     36,960
hour_glass        (3,64,64)      (35,64,64)     14.130M
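
The parameter counts in the table can be reproduced with a few lines of arithmetic. The sketch below assumes that every block holds one 3x3 convolution with a bias term followed by a BatchNorm layer with a learnable scale and shift. That assumption is not stated in the post, but it matches every row of the table exactly. The helper name `conv_bn_params` is hypothetical.

```python
def conv_bn_params(in_ch, out_ch, kernel=3):
    """Parameters of a Conv2d (with bias) followed by a BatchNorm2d (scale + shift).

    Assumption: one kernel x kernel convolution per block, as described above.
    """
    conv = in_ch * out_ch * kernel * kernel + out_ch
    batchnorm = 2 * out_ch
    return conv + batchnorm


# (input channels, output channels) of each block, read off the table above.
down_blocks = [(3, 64), (64, 128), (128, 256), (256, 512), (512, 1024)]
up_blocks = [(1024, 512), (1024, 256), (512, 128), (256, 64), (128, 32)]

down_params = [conv_bn_params(i, o) for i, o in down_blocks]
up_params = [conv_bn_params(i, o) for i, o in up_blocks]
total_params = sum(down_params) + sum(up_params)

print("encoder:", down_params)
print("decoder:", up_params)
print(f"hourglass total: {total_params:,}")
```

The last two downblocks and the first two upblocks hold most of the parameters, since they are the blocks with the widest channel counts.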

Encoder

Made up of 5 downblocks. Each downblock is a sequence of Conv2d -> BatchNorm -> ReLU -> AvgPool2d. After the first block, which expands the 3 input channels to 64, the number of channels doubles with every block, and every block halves the spatial dimensions.

"the first convolution layers have 32 filters and each subsequent convolution doubles the number of filters".

layer       downblock_1   downblock_2   downblock_3   downblock_4   downblock_5   average
conv        0.306828      0.617031      0.900335      1.761473      6.210853      1.959304
batchnorm   0.162862      0.097506      0.055971      0.063747      0.083998      0.092817
relu        0.15854       0.075486      0.037425      0.03163       0.031038      0.066824
avgpool     0.06177       0.037166      0.026693      0.02421       0.027748      0.035517

The majority of a forward pass in the encoder is spent on the convolutional layers, especially the ones in the final downblock, making the convolutional layers a candidate to be replaced for better efficiency.
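
As a rough cross-check, the sketch below counts multiply-accumulate operations (MACs) and weights for each encoder convolution. It assumes 3x3 kernels with stride 1 and padding that preserves the spatial size, which is consistent with the shapes and parameter counts in the hourglass table but is not stated explicitly in the post. The function names are illustrative.

```python
# (in_channels, out_channels, input height = width) of each encoder convolution,
# read off the hourglass table.
encoder_convs = [(3, 64, 64), (64, 128, 32), (128, 256, 16), (256, 512, 8), (512, 1024, 4)]
# Measured convolution time per downblock in ms, from the table above.
measured_ms = [0.306828, 0.617031, 0.900335, 1.761473, 6.210853]


def conv_macs(in_ch, out_ch, size, kernel=3):
    """MACs of one convolution, assuming the output keeps the input's spatial size."""
    return in_ch * out_ch * kernel * kernel * size * size


def conv_weights(in_ch, out_ch, kernel=3):
    """Number of weights in the convolution kernel (bias excluded)."""
    return in_ch * out_ch * kernel * kernel


macs = [conv_macs(i, o, s) for i, o, s in encoder_convs]
weights = [conv_weights(i, o) for i, o, _ in encoder_convs]

for block, (m, w, t) in enumerate(zip(macs, weights, measured_ms), start=1):
    print(f"downblock_{block}: {m / 1e6:6.1f} M MACs, {w / 1e6:5.2f} M weights, {t:.3f} ms")
```

Under these assumptions, downblocks 2 to 5 perform the same number of MACs, yet the measured time keeps growing with the number of weights. This suggests, without proving it, that the cost of the late convolutions is driven more by their size than by their arithmetic.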

Decoder

Made up of 5 upblocks. Each upblock is a sequence of Interpolate -> Conv2d -> BatchNorm -> ReLU. The first upblock's input is the output of the encoder. Every other upblock's input is a concatenation of the previous upblock's output and the corresponding downblock output.
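
The concatenation scheme explains why the input channel counts of the upblocks in the hourglass table are larger than the previous block's output. The sketch below traces only the channel counts, using the values from that table. The function name is hypothetical, and the last concatenation with the 3-channel input image follows the description of the decoder output given later in this post.

```python
def decoder_channel_trace(input_channels, encoder_out, upblock_out):
    """Return the (in_channels, out_channels) of each upblock and the final channel count.

    After every upblock, its output is concatenated with the skip tensor of matching
    spatial size: the outputs of downblock_4..1 and finally the input image.
    """
    skips = [input_channels] + encoder_out[:-1]  # deepest skip is popped first
    trace = []
    channels = encoder_out[-1]  # the first upblock takes the encoder output as-is
    for out_ch in upblock_out:
        trace.append((channels, out_ch))
        channels = out_ch + skips.pop()  # concatenation adds the channel counts
    return trace, channels


encoder_out = [64, 128, 256, 512, 1024]  # output channels of downblock_1..5
upblock_out = [512, 256, 128, 64, 32]    # output channels of upblock_1..5

trace, final_channels = decoder_channel_trace(3, encoder_out, upblock_out)
print(trace)
print(final_channels)
```

The final count of 35 channels is the 32 channels of the last upblock plus the 3 channels of the input image.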

Note: all measurements reported are in ms.

layer           upblock_1   upblock_2   upblock_3   upblock_4   upblock_5   average
interpolation   0.034889    0.048182    0.069889    0.112527    0.192412    0.091580
conv            5.479779    5.13461     1.754365    1.080764    0.917713    2.873446
batchnorm       0.083391    0.086822    0.057727    0.065382    0.07786     0.074236
relu            0.028462    0.034385    0.032103    0.043066    0.061287    0.039861

Post-processing (Heatmaps & Keypoints)

After the Hourglass module, there are two more steps to output keypoints: 1. generating K normalised heatmaps for K keypoints, and 2. converting those heatmaps to corresponding (x,y) coordinates.

We wish to get the coordinates of the keypoints from the heatmaps. In multiple landmark localisation works, the coordinates were usually extracted by applying argmax, i.e. taking the brightest point on the heatmap to be the keypoint. However, this process is non-differentiable and may lead to quantization error. Therefore, this paper employed the differentiable variant of argmax, or soft-argmax.

Heatmaps

The goal of this sub-module is to create normalised heatmaps from the feature maps generated by the Hourglass module, i.e. every element ranges from 0 to 1 and the elements of each heatmap sum to 1. First, the output of size (35,64,64) from the hourglass module goes through another convolutional layer to generate 10 heatmaps of size (58,58), one per keypoint. As the output of the decoder is a concatenation of the input image and the earlier decoder output, this convolutional layer not only reduces the number of channels to the number of keypoints, but also acts as the last learnable layer, with a larger kernel size (7x7) for a larger receptive field.

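import torch.nn as nn
import torch.nn.functional as F

# In __init__: a 7x7 convolution that brings the 35 hourglass channels down to num_kp = 10.
# A (64, 64) input becoming (58, 58) implies pad = 0, since 64 - 7 + 1 = 58.
self.kp = nn.Conv2d(in_channels=self.predictor.out_filters, out_channels=num_kp, kernel_size=(7, 7), padding=pad)

# In forward: feature_map is the hourglass output of shape (batch, 35, 64, 64)
prediction = self.kp(feature_map)  # (batch, 10, 58, 58)
final_shape = prediction.shape
# flatten each heatmap to (batch, 10, 3364) so that softmax runs over all spatial positions
heatmap = prediction.view(final_shape[0], final_shape[1], -1)
# normalise the heatmaps with a temperature-scaled softmax
heatmap = F.softmax(heatmap / self.temperature, dim=2)
# bring back to (batch, 10, 58, 58)
heatmap = heatmap.view(*final_shape)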

Then, the heatmaps are normalised using softmax. This step has no trainable parameters, but it has one hyperparameter, the temperature, which controls the smoothness of the distribution after softmax. A temperature above 1 makes the softmax distribution smoother, while a temperature below 1 makes it sharper. In this model, the temperature is set to 0.1. As the authors put it:

Thanks to the use of a low temperature for softmax, we obtain sharper heatmaps and avoid uniform heatmaps that would lead to keypoints constantly located in the image center.
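
The effect of the temperature can be seen on a handful of made-up scores. The sketch below is plain Python so that it runs without PyTorch. The scores and the function name are illustrative and do not come from the model.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0, 2.5]  # made-up activations for four heatmap positions

sharp = softmax_with_temperature(scores, 0.1)    # the value used in this model
plain = softmax_with_temperature(scores, 1.0)
smooth = softmax_with_temperature(scores, 10.0)

for name, dist in [("T=0.1", sharp), ("T=1", plain), ("T=10", smooth)]:
    print(name, [round(p, 3) for p in dist])
```

With a temperature of 0.1 almost all of the mass lands on the largest score, while a temperature of 10 gives a nearly uniform distribution, which is the failure mode the quote describes.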

Figure 1: 10 normalised heatmaps for 10 keypoints.

Gaussian2KP

This part implements soft-argmax, a differentiable variant of argmax. Its purpose is to convert a heatmap into an (x,y) coordinate. The process is purely computational and involves no learning.

One clear advantage of soft-argmax is that it is differentiable and thus allows end-to-end training. It also alleviates the quantization error caused by the mismatch in size between the input image and the heatmaps. Efficiency-wise, it adds no extra parameters and its computational cost is negligible.
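
A minimal sketch of soft-argmax is the expected coordinate under the heatmap: each grid position is weighted by its heatmap value and the results are summed. The version below is plain Python and assumes coordinates normalised to [-1, 1]. That convention and the function name are assumptions for illustration, not details given in this post.

```python
def soft_argmax(heatmap):
    """Expected (x, y) of a normalised 2-D heatmap, with coordinates in [-1, 1].

    heatmap is a list of rows whose values are non-negative and sum to 1.
    """
    height, width = len(heatmap), len(heatmap[0])
    xs = [2 * j / (width - 1) - 1 for j in range(width)]    # column coordinates
    ys = [2 * i / (height - 1) - 1 for i in range(height)]  # row coordinates
    x = sum(heatmap[i][j] * xs[j] for i in range(height) for j in range(width))
    y = sum(heatmap[i][j] * ys[i] for i in range(height) for j in range(width))
    return x, y


# Toy 3x3 heatmap whose mass is split evenly between the centre cell and its right neighbour.
toy_heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0],
]

x, y = soft_argmax(toy_heatmap)
print(x, y)
```

A hard argmax would have to pick one of the two cells, whereas soft-argmax returns a point halfway between them. This is how it sidesteps the quantization error of a coarse heatmap grid.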

Jacobian

To focus on the movements surrounding keypoints, Jacobian matrices are computed to represent the scale and rotation of each keypoint.
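
To illustrate what a 2x2 Jacobian encoding scale and rotation does to the neighbourhood of a keypoint, the sketch below builds such a matrix and applies it to a point near the keypoint. The post does not spell out this computation, so the first-order form used here (new point = destination keypoint + J times the offset from the source keypoint) and all function names are assumptions for illustration only.

```python
import math


def rotation_scale_jacobian(angle_rad, scale):
    """2x2 matrix that rotates by angle_rad and scales uniformly by scale."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[scale * c, -scale * s],
            [scale * s, scale * c]]


def apply_local_affine(point, kp_src, kp_dst, jacobian):
    """Move a point near kp_src to its position near kp_dst (assumed first-order form)."""
    dx, dy = point[0] - kp_src[0], point[1] - kp_src[1]
    return (kp_dst[0] + jacobian[0][0] * dx + jacobian[0][1] * dy,
            kp_dst[1] + jacobian[1][0] * dx + jacobian[1][1] * dy)


# Hypothetical values: the keypoint moves from (0, 0) to (0.5, 0.5) while its
# neighbourhood rotates by 90 degrees and doubles in size.
J = rotation_scale_jacobian(math.pi / 2, 2.0)
moved = apply_local_affine((0.1, 0.0), (0.0, 0.0), (0.5, 0.5), J)
print(moved)
```

The offset of 0.1 to the right of the source keypoint becomes an offset of 0.2 along the second axis at the destination, so the point follows both the keypoint's translation and the local rotation and scaling.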

Keypoint Results

Figure 2: Visualised keypoints.

The keypoints appear to be semantically consistent across the images: for example, the bright green keypoint sits between the eyes, the blue one at the neck, and the dark blue one at the left eyebrow.