In this post, we explore a specific way to make deep keypoint detection models lightweight: replacing the U-Net-like architecture with a plain encoder-decoder.
Goal
To find lightweight encoder-decoder architectures with no residual connections between the encoder and the decoder.
Findings
- The majority of lightweight encoder-decoder (LED) networks were introduced in 2016-2019, following the success of FCN and ResNet.
- Pioneering works in LED include ENet and ERFNet.
- From 2016 to 2019, the main difference between LED networks lies in the encoder, as the decoder is mostly just multiple blocks of ConvTranspose2d, or even simple interpolation. Some use multiscale pyramid methods, e.g. LEDNet.
- The encoder is often made up of "core" blocks. These blocks are fairly similar across networks and are inspired by the following convolution decomposition methods:
- Residual Bottleneck Block as introduced in ResNet
- Inception Module as introduced in InceptionNet
- Depthwise Separable Conv as used extensively in XceptionNet and MobileNet
- Shuffle Unit as introduced in ShuffleNet
- From 2020 onwards, attention mechanisms are applied more in decoders to recover details, and connections between encoder and decoder became a staple, e.g. LAEDNet, LRDNet
Main techniques used in encoder
Asymmetric Convolutions
- One of the most frequently used convolution decomposition methods is to factorise a standard nxn conv into an nx1 conv followed by a 1xn conv, effectively approximating the standard conv (a minimal sketch follows Figure 1).
- Used in LEDNet, EDANet, ESNet, and ERFNet.
- When the kernel size is 3, the parameter count and computation cost are reduced by 33%: with C input and output channels, the 3x1 + 1x3 pair costs 2*3*C^2 weights versus 9*C^2 for the full 3x3.
Figure 1: a) Standard Convolution b) Asymmetric Convolution.
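As a concrete illustration, here is a minimal PyTorch sketch of the factorisation (the module name and channel counts are ours, purely for illustration):

```python
import torch
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """Approximates a 3x3 conv with a 3x1 conv followed by a 1x3 conv."""
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        # 3x1 + 1x3 costs 2*3*C^2 weights instead of 9*C^2 for a full 3x3,
        # i.e. a ~33% reduction in parameters and multiply-adds.
        self.conv3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1),
                                 padding=(dilation, 0), dilation=(dilation, 1))
        self.conv1x3 = nn.Conv2d(channels, channels, kernel_size=(1, 3),
                                 padding=(0, dilation), dilation=(1, dilation))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv1x3(self.relu(self.conv3x1(x))))

x = torch.randn(1, 64, 32, 32)
assert AsymmetricConv(64)(x).shape == x.shape  # spatial size is preserved
```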
(Inverted) Residual bottleneck connection
Given the tremendous success of ResNet in 2015, residual bottleneck blocks have been applied extensively in LED networks. The overall structure of the block follows closely that in Figure 2, with the 3x3 conv replaced by a 3x1 followed by a 1x3 conv.
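Below is a simplified PyTorch sketch of such a factorised residual block, loosely in the spirit of ERFNet's non-bottleneck-1D design (the exact interleaving of BatchNorm and ReLU in the published block may differ):

```python
import torch
import torch.nn as nn

class NonBottleneck1D(nn.Module):
    """Residual block with each 3x3 conv factorised into 3x1 + 1x3."""
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # The second conv pair can be dilated to enlarge the receptive field.
            nn.Conv2d(channels, channels, (3, 1),
                      padding=(dilation, 0), dilation=(dilation, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3),
                      padding=(0, dilation), dilation=(1, dilation)),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # identity skip connection
```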
Inception module
The premise of the Inception module is that salient parts of an image can vary greatly in size, e.g. a dog can be very close to the camera and take up most of the image, or very far away and appear in a small corner. Thus, to capture details of varying sizes, we can have filters of multiple sizes operate at the same level in parallel, providing multiscale features.
Another variant places multiple dilated convolutions with different dilation rates on different branches, in order to broaden the receptive field (see the sketch below).
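A minimal sketch of this multi-branch idea (branch widths and dilation rates are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    """Parallel 3x3 convs with different dilation rates, fused by concat + 1x1."""
    def __init__(self, in_ch: int, branch_ch: int, rates=(1, 2, 4)):
        super().__init__()
        # Each branch covers a different receptive field.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r) for r in rates]
        )
        # A 1x1 conv merges the concatenated multiscale features.
        self.merge = nn.Conv2d(branch_ch * len(rates), in_ch, 1)

    def forward(self, x):
        return self.merge(torch.cat([b(x) for b in self.branches], dim=1))
```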
ShuffleNet
The shuffle unit was created to address a problem with stacked group convolutions: outputs of a given group relate only to the inputs within that group, which blocks information flow between channel groups and weakens representation.
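The fix is a channel shuffle that interleaves channels across groups, so the next group conv sees inputs from every group. A minimal sketch of the operation:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups (as in ShuffleNet)."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel dims
    return x.view(n, c, h, w)                 # flatten back

x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0] -- the two groups are interleaved
```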
Encoder results
In this section, we discuss the encoders of four LED networks: ERFNet (2017), EDANet (2018), ESNet (2019), and LEDNet (2019). All four are strictly encoder-decoder architectures with no residual connections between encoder and decoder, and all learn in the decoder rather than relying on simple interpolation.
Regarding the feature extraction path, the main blocks of ERFNet and EDANet take the non-bottleneck-1D approach. ESNet and LEDNet, written by the same authors, instead have multiple branches in their modules; both use dilated convolutions with different rates on different branches. LEDNet additionally uses a split-transform-merge strategy, inspired by that of ShuffleNetV2. A combination of the four techniques above was used in constructing these blocks, as the sketch below illustrates.
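To tie the techniques together, here is an illustrative sketch of a split-transform-merge block loosely in the spirit of LEDNet's SS-nbt module, combining channel split, asymmetric dilated convolutions, a residual connection, and channel shuffle (layer ordering and hyperparameters are simplified relative to the paper):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class SplitShuffleBlock(nn.Module):
    """Channel split -> asymmetric (dilated) convs per branch -> merge -> shuffle."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        assert channels % 2 == 0, "channels are split into two equal halves"
        half = channels // 2

        def branch():
            return nn.Sequential(
                nn.Conv2d(half, half, (3, 1), padding=(1, 0)),
                nn.ReLU(inplace=True),
                nn.Conv2d(half, half, (1, 3), padding=(0, 1)),
                nn.ReLU(inplace=True),
                nn.Conv2d(half, half, (3, 1),
                          padding=(dilation, 0), dilation=(dilation, 1)),
                nn.ReLU(inplace=True),
                nn.Conv2d(half, half, (1, 3),
                          padding=(0, dilation), dilation=(1, dilation)),
            )

        self.left, self.right = branch(), branch()

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                               # channel split
        out = torch.cat([self.left(a), self.right(b)], dim=1)  # transform + merge
        out = torch.relu(out + x)                              # residual connection
        return channel_shuffle(out, groups=2)                  # mix the two branches
```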