Searching for a smaller model with a quick inference time
As the current Squeeze U-Net performs surprisingly well on the distillation task, we are searching for a smaller model that can maintain this performance with a quicker inference time. Given that our goal is to deploy the model on mobile devices, we target models with fast CPU runtimes. In this article, we explore the experiments we conducted to further shorten the Squeeze U-Net runtime, the reasoning behind each decision, whether it succeeded, and other models designed for CPU runtime.
Note: Experiments were done on an Intel Xeon(R) CPU E5-2640 v3 @ 2.60GHz.
Lightweight variations of Squeeze U-Net
Depthwise Separable Convolutions
In previous articles, we discussed the potential of Depthwise Separable Convolutions (DepSepConv). We considered replacing all non-pointwise convolutions in the current Squeeze U-Net with DepSepConv because of its advantages in parameter count and computation cost. As expected, both were greatly reduced. However, the new runtime is 0.0119s compared to the original 0.0142s, not exactly a big improvement.
One reason for this modest reduction is that the runtime gain from DepSepConv shrinks along with spatial resolution. Consider a convolution whose input shape is (64, S, S) and output shape is (128, S, S), where S is the spatial resolution of interest. Experimenting with S in [4, 8, 16, 32, 64], we found that as S decreased to 4, the runtimes of the regular convolution and DepSepConv became very similar, unless C_in and C_out are large. Therefore, when integrated into an already optimised network such as Squeeze U-Net, DepSepConv does not contribute a substantial reduction.
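The scale of the theoretical saving can be sketched with simple parameter and multiply-add arithmetic (a back-of-the-envelope sketch, not our benchmark code). The key point is that the MAC ratio between DepSepConv and a regular convolution is constant in S, roughly 1/C_out + 1/k^2, so the narrowing runtime gap at small resolutions comes from per-layer overheads and memory behaviour rather than from FLOPs:

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a regular k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dep_sep_params(c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out

def macs(params, s):
    """Multiply-accumulates over an s x s output feature map."""
    return params * s * s

# The (64, S, S) -> (128, S, S) layer from the experiment above:
for s in [4, 8, 16, 32, 64]:
    ratio = macs(dep_sep_params(64, 128), s) / macs(conv_params(64, 128), s)
    print(s, round(ratio, 3))  # constant ~0.119 = 1/128 + 1/9 for every s
```

Since the FLOP ratio does not change with S, the near-identical measured runtimes at S = 4 point at fixed per-layer costs and low arithmetic intensity, which is exactly the hardware argument made by the SqueezeNext authors below.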
Furthermore, the authors of SqueezeNext have pointed out that DepSepConv is inefficient in hardware terms (low arithmetic intensity) and may not perform well on some embedded devices.
Therefore, we did not integrate DepSepConv into the keypoint detection model.
Remove residual connections
Another attempt to reduce the Squeeze U-Net runtime was to remove the residual connections between the encoder and decoder. This brought the runtime down to 0.0122s (reported in Table 3), which is not satisfactory. For a further reduction, we replaced the Transposed Fire module with a variation proposed in SqueezeSeg. The current Transposed Fire module in Squeeze U-Net performs ConvTranspose2d on its input before sending it to a Fire module. In the new variation, we perform ConvTranspose2d between the squeeze and expand operations. This reduces the number of input channels before the ConvTranspose2d, which cuts the parameter count considerably.
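To see why moving the transposed convolution between the squeeze and expand operations saves parameters, compare weight counts for the two layouts. The channel widths below (256 in, squeeze 32, expand 128) are hypothetical values chosen only to illustrate the effect, and biases are ignored:

```python
def fire_transpose_before(c_in, squeeze, expand, k=3):
    """Original layout: a k x k ConvTranspose2d on the full input,
    then a Fire module (squeeze 1x1, parallel 1x1 and 3x3 expands)."""
    transpose = c_in * c_in * k * k                      # c_in -> c_in upsample
    fire = c_in * squeeze + squeeze * expand + squeeze * expand * 9
    return transpose + fire

def fire_transpose_between(c_in, squeeze, expand, k=3):
    """SqueezeSeg-style layout: squeeze first, upsample the narrow
    tensor, then expand."""
    squeeze_w = c_in * squeeze                           # 1x1 squeeze
    transpose = squeeze * squeeze * k * k                # upsample on squeezed channels
    expand_w = squeeze * expand + squeeze * expand * 9   # 1x1 and 3x3 expands
    return squeeze_w + transpose + expand_w

print(fire_transpose_before(256, 32, 128))   # 638976 weights
print(fire_transpose_between(256, 32, 128))  # 58368 weights
```

With these widths the variant needs roughly a tenth of the weights, because the expensive transposed convolution now operates on the 32 squeezed channels instead of all 256.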
As can be seen in Table 3, the number of parameters, computations, and pass size are all greatly reduced with this new Transposed Fire module variant. However, the encoder-decoder architecture predictably falls short of the original Squeeze U-Net in distillation performance. As shown in Table 4, the Squeeze Encoder Decoder clearly underfits, which emphasizes the necessity of residual connections.
Table 3

| Model | # parameters | Mult-adds (M) | Pass size (MB) | CPU runtime (s) |
|---|---|---|---|---|
| Squeeze U-Net with dep_sep_conv | 540,499 | 40.24 | 4.13 | 0.0119 |
| Squeeze Encoder Decoder | 781,375 | 26.01 | 2.78 | 0.0102 |
Table 4

| Model | Train loss | Val loss | Val distance |
|---|---|---|---|
| Squeeze Encoder Decoder | 0.0309 | 0.0343 | 0.0575 |
L3U-Net
We decided to look for other purpose-built low-latency lightweight models that can operate in real time on resource-constrained edge devices. One of them is L3U-Net. Its main contribution is the data folding technique, at its core a reshaping operation, which, when followed by a convolution, is equivalent to a strided convolution on the original tensor.
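Data folding can be sketched as a space-to-depth rearrangement. The snippet below is a minimal NumPy illustration, not the authors' implementation; it folds each 2x2 spatial block into the channel dimension, so a stride-1 convolution on the folded tensor reads the same pixels as a stride-2 convolution on the original:

```python
import numpy as np

def fold(x, r=2):
    """Data folding: reshape a (C, H, W) tensor into (C*r*r, H//r, W//r)
    by moving each r x r spatial block into the channel dimension."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, r, r, H//r, W//r)
    return x.reshape(c * r * r, h // r, w // r)

img = np.arange(3 * 64 * 64).reshape(3, 64, 64)
folded = fold(img)
print(folded.shape)  # (12, 32, 32): 4x more channels, half the resolution
```

No pixel is discarded; the information is only rearranged, which is what lets the folded input stand in for the original at a quarter of the spatial size.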
Their argument, which is supported by , is that the early convolutional layers account for a significant share of the network's latency: with few input channels of large spatial size, only a small number of cores can be kept busy. Data folding downsamples the image while increasing the number of channels, reducing this imbalanced memory distribution and making the tensor better suited for parallel processing.
By utilising this method with the L3U-Net architecture as the keypoint model, we were able to cut the number of parameters by a factor of 14 and the runtime virtually in half.
In terms of accuracy, L3U-Net displayed encouraging results (a validation loss of 0.0084 compared to 0.0054 achieved by Squeeze U-Net) and signs of improved performance with additional tuning. As a result, we came to the following conclusions: 1. The keypoint detection task at hand might not require as complex a model as we had initially believed. 2. Residual connections are very important.
EdgeSegNet
In our search for lightweight models tailored to TinyML, we also came across EdgeSegNet, the result of "a human-machine collaborative design strategy, where human-driven principled network design prototyping is coupled with machine-driven design exploration" . In other words, given defined design principles along with runtime and hardware constraints, an architecture search algorithm was set out to find the optimal model design for semantic segmentation.
EdgeSegNet is unique in that it was built under both runtime and performance restrictions. It also abandons the U-Net architecture: what remains is a single main path with a mix of long- and short-range connections added to it. The Refine Module is adapted from RefineNet, where lower-resolution feature maps are used to refine the details of intermediate feature maps. Additionally, EdgeSegNet makes heavy use of Bottleneck Modules to reduce the number of input channels before each regular 3x3 convolution.
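The saving from a Bottleneck Module is easy to quantify with a weight count. The widths below (256 channels squeezed to 64) are hypothetical values for illustration, not EdgeSegNet's actual layer sizes, but they show why shrinking the channels with 1x1 convolutions before the 3x3 pays off:

```python
def direct_3x3(c_in, c_out):
    """A plain 3x3 convolution straight on the input channels."""
    return c_in * c_out * 9

def bottleneck_3x3(c_in, c_out, c_mid):
    """Bottleneck layout: a 1x1 conv shrinks the channels to c_mid,
    a 3x3 conv runs on the narrow tensor, and a 1x1 conv restores c_out."""
    return c_in * c_mid + c_mid * c_mid * 9 + c_mid * c_out

print(direct_3x3(256, 256))          # 589824 weights
print(bottleneck_3x3(256, 256, 64))  # 69632 weights, ~8x fewer
```

The expensive 3x3 kernel only ever sees the narrow c_mid tensor, so the cost of the two extra 1x1 convolutions is repaid many times over.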
With EdgeSegNet, while the reduction in parameter count and computation cost is not as large as with L3U-Net, the forward/backward pass size is half that of L3U-Net, and we were able to reach the shortest runtime of 0.0077s. What's more astonishing is that with only a quarter as many parameters, EdgeSegNet obtained a better distillation result than the original Squeeze U-Net (lower train_loss, val_loss, and val_distance). This confirms that the keypoint detection problem is not as complex as expected and that we have not explored the full capacity of Squeeze U-Net.
 N. Beheshti and L. Johnsson, "Squeeze U-Net: A Memory and Energy Efficient Image Segmentation Network", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020. Available: 10.1109/cvprw50498.2020.00190.
 M. Hollemans, "How fast is my model?", Machinethink.net, 2018. [Online]. Available: https://machinethink.net/blog/how-fast-is-my-model/.
 B. Wu, A. Wan, X. Yue and K. Keutzer, "SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud", 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018. Available: 10.1109/icra.2018.8462926.
 O. Erman, M. Ulkar and G. Uyanik, "L^3U-net: Low-Latency Lightweight U-net Based Image Segmentation Model for Parallel CNN Processors", arXiv:2203.16528, 2022. Available: https://arxiv.org/pdf/2203.16528.pdf.
 J. Lin, W. Chen, H. Cai, C. Gan and S. Han, "MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning", NeurIPS, 2021.
 Z. Lin, B. Chwyl and A. Wong, "EdgeSegNet: A Compact Network for Semantic Segmentation", arXiv, 2019. Available: https://arxiv.org/abs/1905.04222.
 G. Lin, F. Liu, A. Milan, C. Shen and I. Reid, "RefineNet: Multi-Path Refinement Networks for Dense Prediction", Computer Vision and Pattern Recognition Conference (CVPR), 2017. Available: 10.1109/tpami.2019.2893630.
 A. Gholami et al., "SqueezeNext: Hardware-Aware Neural Network Design", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018. Available: https://arxiv.org/abs/1803.10615.