The hourglass module is used as an anti-alias function in the First Order Motion model. In this post, we look for ways to make this step more efficient. All CPU running times were measured on an Intel Core i7-8700K CPU @ 3.70GHz.
As the hourglass module accepts images of size (64, 64, 3), an interpolation module is needed to reduce the spatial dimensions of the input image. In Monkey-Net, an earlier version of the paper, only an F.interpolate() function with mode='nearest' was used to bring down the image size. This, however, appeared to produce aliasing artifacts, so the authors switched to anti-alias interpolation in the First Order Motion paper.
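To make the nearest-neighbour step concrete, here is a small NumPy sketch. It assumes the common floor(i * in / out) index rule for nearest-neighbour resizing and a 256x256 input (implied by the factor of 4 used later in this post); the function name is ours. Under that assumption, nearest-neighbour resizing by an integer factor picks exactly the same pixels as plain subsampling.

```python
import numpy as np

def nearest_downsample(img, out_h, out_w):
    """Nearest-neighbour resize using the floor(i * in / out) index rule (assumed)."""
    in_h, in_w = img.shape[:2]
    # Source row/column index for every output row/column, clamped to the image.
    rows = np.minimum(np.floor(np.arange(out_h) * (in_h / out_h)).astype(int), in_h - 1)
    cols = np.minimum(np.floor(np.arange(out_w) * (in_w / out_w)).astype(int), in_w - 1)
    return img[rows[:, None], cols[None, :]]

img = np.random.rand(256, 256, 3)   # assumed input size
small = nearest_downsample(img, 64, 64)
print(small.shape)
# With an integer factor of 4 this selects the same pixels as img[::4, ::4].
print(np.array_equal(small, img[::4, ::4]))
```

No pixel values are averaged at any point, which is why high-frequency detail can alias.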
Gaussian + Subsampling
The paper pre-filters the input image with a Gaussian filter before subsampling. The blur averages out sudden changes in intensity, so those changes carry over to the resized image more smoothly instead of showing up as aliasing artifacts.
Since the Gaussian kernel is hard-coded, this module has no learnable parameters.
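The idea can be sketched in NumPy as follows. This is not the authors' implementation: the function names, the choice sigma = (factor - 1) / 2, the kernel radius of 4 * sigma, and the edge padding are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Fixed (non-learnable) 1D Gaussian kernel, normalised to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of an (H, W, C) image."""
    radius = int(round(4 * sigma))          # assumed kernel extent
    k = gaussian_kernel_1d(sigma, radius)
    # Edge replication keeps the output the same size as the input (assumption).
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Filter along the height axis, then along the width axis.
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, tmp)

def gaussian_downsample(img, factor=4):
    """Blur first, then keep every `factor`-th pixel."""
    sigma = (factor - 1) / 2.0              # assumed relation between factor and sigma
    return gaussian_blur(img, sigma)[::factor, ::factor]

img = np.random.rand(256, 256, 3)           # assumed input size
print(gaussian_downsample(img).shape)
```

Because the kernel is computed from a formula rather than trained, nothing in this step needs gradients.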
Other methods
Regarding downsampling, the simplest technique is to keep only every k-th pixel, with k chosen according to the desired size. For instance, to halve the size of an image, we can simply keep every 2nd pixel. This technique, called subsampling, can lead to severe aliasing.
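A toy example of how plain subsampling can alias, using a synthetic stripe pattern that we made up for illustration: the image is half black and half white, but keeping every 2nd pixel only ever lands on the black columns.

```python
import numpy as np

# 1-pixel-wide vertical stripes: columns alternate 0, 1, 0, 1, ...
stripes = np.tile(np.array([0.0, 1.0]), (8, 4))   # shape (8, 8)

# Subsample by 2: keep every 2nd pixel in both directions.
half = stripes[::2, ::2]

print(stripes.mean())  # 0.5, the image is half white
print(half.mean())     # 0.0, the stripes have vanished entirely
```

A pre-filter would instead have produced values near 0.5, which is a more faithful summary of the original.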
We can also collapse every n×n window into a single value by taking its average or its median, which is equivalent to Average Pooling or Median Pooling.
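Window collapsing can be sketched with a reshape in NumPy. The helper name `pool` and the cropping of images whose sides are not divisible by n are our own choices, not something taken from the paper.

```python
import numpy as np

def pool(img, n, reducer=np.mean):
    """Collapse every n x n window of an (H, W) or (H, W, C) image with `reducer`."""
    h, w = img.shape[:2]
    h, w = h - h % n, w - w % n              # crop so both sides divide by n
    blocks = img[:h, :w].reshape(h // n, n, w // n, n, *img.shape[2:])
    return reducer(blocks, axis=(1, 3))

img = np.array([[0, 0,   1, 1],
                [0, 100, 1, 1],
                [2, 2,   3, 3],
                [2, 2,   3, 3]], dtype=np.float64)

print(pool(img, 2, np.mean))    # the outlier pulls the top-left average up to 25
print(pool(img, 2, np.median))  # the median ignores the outlier and gives 0
```

The difference in the top-left window shows why the two pooling variants can behave differently on noisy pixels.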
Finally, we can pre-filter the image before subsampling it, similar to the technique used in the paper. More sophisticated filters can be used at the expense of computation cost.
Experiments
The techniques tested include:
1. Baseline: Gaussian filter + Subsampling
2. No prefilter: Just subsampling, i.e. select every 4th pixel
3. Method used in Monkey-Net: F.interpolate() with mode=nearest
4. F.interpolate() with mode=bicubic and antialias=True
5. F.interpolate() with mode=bilinear and antialias=True
6. F.interpolate() with mode=area: Equivalent to applying an Adaptive Average Filter + Subsampling
7. Average Pooling: Equivalent to applying an Average Filter + Subsampling
8. Median filter + Subsampling
9. Bilateral filter + Subsampling
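For comparing running times across these techniques, a simple harness along the following lines could be used. This post does not describe the exact timing procedure, so the warm-up runs, the number of repeats, and the helper name are assumptions.

```python
import time
import numpy as np

def average_runtime_ms(fn, img, repeats=100, warmup=5):
    """Average wall-clock time of fn(img) in milliseconds."""
    for _ in range(warmup):                  # warm-up runs are not timed
        fn(img)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(img)
    return (time.perf_counter() - start) * 1000.0 / repeats

img = np.random.rand(256, 256, 3)            # assumed input size

# Example: plain subsampling; .copy() forces the pixels to be materialised.
t = average_runtime_ms(lambda x: x[::4, ::4].copy(), img)
print(f"subsampling: {t:.4f} ms")
```

Absolute numbers depend on the machine and the library used, so only relative comparisons on the same setup are meaningful.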
Let T be the keypoints predicted with another interpolation technique, and N the number of samples in the vox test dataset. The error measures how far T deviates from the keypoints predicted by the baseline model over these N samples.
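The exact error formula is not reproduced in this text. Purely as an illustration, and assuming the error is a mean absolute difference between the two sets of keypoints over all N samples, it could be computed as below. The function name, the (N, K, 2) array layout, and the choice of distance are all hypothetical.

```python
import numpy as np

def keypoint_error(baseline_kp, other_kp):
    """Mean absolute difference between two keypoint arrays of shape (N, K, 2).

    Assumed metric: the original formula is not available here.
    """
    baseline_kp = np.asarray(baseline_kp, dtype=np.float64)
    other_kp = np.asarray(other_kp, dtype=np.float64)
    return float(np.abs(other_kp - baseline_kp).mean())

# Made-up keypoints just to exercise the function: N=5 samples, K=10 keypoints.
rng = np.random.default_rng(0)
baseline = rng.uniform(-1.0, 1.0, size=(5, 10, 2))

print(keypoint_error(baseline, baseline))        # baseline against itself
print(keypoint_error(baseline, baseline + 0.1))  # every coordinate shifted by 0.1
```

Any metric of this kind gives 0 when a technique is compared against itself, which matches the baseline row in the results.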
Results
Interpolation methods | Error in keypoints | Average CPU Running time |
---|---|---|
Gaussian + Subsampling (Baseline) | 0 | 3.2464 ms |
Subsampling | 0.0653 | 0.1136 ms |
F.interpolate with mode=nearest | 0.0653 | 0.3729 ms |
F.interpolate with mode=bicubic | 0.1728 | 1.1172 ms |
F.interpolate with mode=bilinear | 0.1711 | 0.8454 ms |
F.interpolate with mode=area | 0.1729 | 0.4388 ms |
Average Pooling | 0.0786 | 0.3062 ms |
Median + Subsampling | 0.0689 | 56.9150 ms |
Bilateral + Subsampling | 0.0395 | 95.5080 ms |
Visually, the results most similar to the baseline are those produced by bicubic, bilinear, and area interpolation. However, the keypoints predicted after using these techniques are the furthest from those of the baseline. The bilateral filter, on the other hand, did not give the smoothest result: the jagged lines along the left jaw and at the eyes are quite prominent. It nevertheless gave the smallest deviation from the baseline keypoints, but it also took the longest to run.
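To put the trade-off in numbers, the short script below recomputes each method's running time relative to the baseline, using only the values from the results table. The dictionary layout and variable names are ours.

```python
# (error in keypoints, average CPU running time in ms), copied from the results table
results = {
    "Gaussian + Subsampling (Baseline)": (0.0, 3.2464),
    "Subsampling": (0.0653, 0.1136),
    "F.interpolate with mode=nearest": (0.0653, 0.3729),
    "F.interpolate with mode=bicubic": (0.1728, 1.1172),
    "F.interpolate with mode=bilinear": (0.1711, 0.8454),
    "F.interpolate with mode=area": (0.1729, 0.4388),
    "Average Pooling": (0.0786, 0.3062),
    "Median + Subsampling": (0.0689, 56.9150),
    "Bilateral + Subsampling": (0.0395, 95.5080),
}

baseline_name = "Gaussian + Subsampling (Baseline)"
baseline_ms = results[baseline_name][1]

for name, (err, ms) in results.items():
    print(f"{name:35s} error={err:.4f}  time relative to baseline={ms / baseline_ms:.2f}x")

# Fastest method overall, and the non-baseline method closest to the baseline keypoints.
fastest = min(results, key=lambda k: results[k][1])
closest = min((k for k in results if k != baseline_name), key=lambda k: results[k][0])
print("fastest:", fastest)
print("closest keypoints:", closest)
```

Plain subsampling comes out roughly 28 times faster than the baseline, while the bilateral filter is roughly 29 times slower.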