In this part, we will cover the implementation details of the Minkowski Engine and the corresponding experimental results.

**Minkowski Engine**

Minkowski Engine is an open-source auto-differentiation library for sparse tensors and the generalized sparse convolution. It is an extensive library with many functions; the main ones are described below.

**1. Sparse Tensor Quantization**:

Convert an input into unique coordinates, associated features, and optionally labels using GPU functions.
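The quantization step can be sketched in NumPy: continuous points are discretized to integer voxel indices, duplicates falling into the same voxel are collapsed, and their features are averaged. This is a CPU sketch with illustrative names, not the engine's actual GPU implementation:

```python
import numpy as np

def sparse_quantize(coords, feats, voxel_size=0.05):
    """Quantize continuous coordinates into unique voxel coordinates,
    averaging the features of points that share a voxel.
    Illustrative sketch only; names are assumptions."""
    # Discretize coordinates to integer voxel indices.
    grid = np.floor(coords / voxel_size).astype(np.int32)
    # Unique voxels plus an inverse map from points to voxels.
    unique, inverse = np.unique(grid, axis=0, return_inverse=True)
    # Sum the features of all points that land in the same voxel.
    summed = np.zeros((len(unique), feats.shape[1]))
    np.add.at(summed, inverse, feats)
    # Divide by the point count per voxel to get the average feature.
    counts = np.bincount(inverse)[:, None]
    return unique, summed / counts
```

The same idea extends to labels: points with conflicting labels in one voxel can be given an ignore label instead of an average.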

**2. Generalized Sparse Convolution**:

Generate the output coordinates dynamically, allowing arbitrary output coordinates $\mathcal{C}^{out}$ given the input coordinates $\mathcal{C}^{in}$ for the generalized sparse convolution.

Create a kernel map for convolving the input with the kernel. The kernel map identifies which inputs affect which outputs, and is defined as pairs of lists of integers: the in map $\textbf{I}$ and the out map $\textbf{O}$. An integer in an in map $i\in \textbf{I}$ indicates the row index of the coordinate matrix or the feature matrix of an input sparse tensor. Similarly, an integer in the out map $o \in \textbf{O}$ also indicates the row index of the coordinate matrix of an output sparse tensor.

**Figure 2**: Convolution Kernel Map. For example, a $3 \times 3$ kernel requires 9 kernel maps. Due to the sparsity of the tensor, some kernel maps have no elements. The extracted kernel maps are:

- Kernel map $B: 1 \mapsto 0$
- Kernel map $B: 0 \mapsto 2$
- Kernel map $H: 2 \mapsto 3$
- Kernel map $I: 0 \mapsto 0$
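The kernel-map extraction above can be sketched as follows: for each kernel offset, pair the row index of each output coordinate with the row index of the input coordinate it reads from. This is a pure-Python sketch with illustrative names, not the engine's GPU implementation:

```python
def kernel_maps(in_coords, out_coords, offsets):
    """Build (in_map, out_map) integer-pair lists per kernel offset.
    in_coords / out_coords are lists of integer coordinate tuples
    (rows of the coordinate matrices); offsets are kernel offsets."""
    in_index = {c: i for i, c in enumerate(in_coords)}
    maps = {}
    for k, off in enumerate(offsets):
        in_map, out_map = [], []
        for o, u in enumerate(out_coords):
            # The input coordinate this offset of the kernel reads from.
            v = tuple(u[d] + off[d] for d in range(len(u)))
            if v in in_index:
                in_map.append(in_index[v])
                out_map.append(o)
        if in_map:  # sparsity: keep only non-empty kernel maps
            maps[k] = (in_map, out_map)
    return maps
```

Each retained entry `k` then drives one gather-matmul-scatter step of the generalized sparse convolution.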

**3. Max Pooling and Global Pooling**

The max pooling layer selects the maximum element within a region for each channel. For a sparse tensor input, it is defined as

$$x'_{\textbf{u},i} = \max_{\textbf{v} \in \mathcal{N}^D(\textbf{u}) \cap \mathcal{C}^{in}} x_{\textbf{v},i}$$

where $x_{\textbf{u},i}$ indicates the $i$-th channel feature value at $\textbf{u}$. The region to pool features from is $\mathcal{N}^D(\textbf{u}) \cap \mathcal{C}^{in}$. Global pooling is similar to max pooling except that features from all non-zero elements in the sparse tensor are pooled:

$$x'_i = \max_{\textbf{v} \in \mathcal{C}^{in}} x_{\textbf{v},i}$$
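The max pooling described above can be sketched in NumPy, taking the output coordinates to be the input coordinates and pooling each channel over the neighbors that actually exist in the sparse tensor. This is an assumption-laden sketch, not the engine's implementation:

```python
import numpy as np

def sparse_max_pool(coords, feats, offsets):
    """Channel-wise max over the neighbors of each coordinate that
    exist in the sparse tensor (the region N^D(u) intersected with
    the input coordinates). Illustrative names only."""
    index = {tuple(c): i for i, c in enumerate(coords)}
    out = np.empty_like(feats)
    for i, u in enumerate(coords):
        # Rows of neighbors present in the sparse tensor.
        rows = [index[tuple(np.add(u, off))]
                for off in offsets if tuple(np.add(u, off)) in index]
        out[i] = feats[rows].max(axis=0)  # max per channel
    return out
```

Global pooling is the degenerate case where the region is all of $\mathcal{C}^{in}$, i.e. `feats.max(axis=0)`.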

**4. Normalization**

First, instance normalization computes statistics for each batch index separately and whitens the features of that batch index. The mean and standard deviation are

$$\mu_{bi} = \frac{1}{|\mathcal{C}^{in}_b|} \sum_{\textbf{u} \in \mathcal{C}^{in}_b} x_{\textbf{u},bi}, \qquad \sigma_{bi} = \sqrt{\frac{1}{|\mathcal{C}^{in}_b|} \sum_{\textbf{u} \in \mathcal{C}^{in}_b} \left(x_{\textbf{u},bi} - \mu_{bi}\right)^2}$$

where $x_{\textbf{u},bi}$ indicates the $i$-th channel feature at the coordinate $\textbf{u}$ with batch index $b$, and $\mathcal{C}^{in}_b$ is the set of non-zero element coordinates in the $b$-th batch. $\mu_{bi}$ is the $i$-th channel feature mean of the $b$-th batch and $\sigma_{bi}$ is the $i$-th feature channel standard deviation of the $b$-th batch.

Batch normalization is similar to instance normalization except that it computes the statistics over all batches:

$$\mu_i = \frac{1}{|\mathcal{C}^{in}|} \sum_{\textbf{u} \in \mathcal{C}^{in}} x_{\textbf{u},i}, \qquad \sigma_i = \sqrt{\frac{1}{|\mathcal{C}^{in}|} \sum_{\textbf{u} \in \mathcal{C}^{in}} \left(x_{\textbf{u},i} - \mu_i\right)^2}$$
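Both normalizations reduce to computing per-channel statistics over rows of the feature matrix, either over all non-zero elements or per batch index. A NumPy sketch with illustrative names, not the engine's implementation:

```python
import numpy as np

def sparse_batch_norm(feats, eps=1e-5):
    """Whiten features using statistics over all non-zero elements
    across the whole batch (one mean/std per channel)."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

def sparse_instance_norm(feats, batch_index, eps=1e-5):
    """Whiten features per batch index b, using only the rows whose
    coordinates belong to the b-th batch (the set C_b^in)."""
    out = np.empty_like(feats)
    for b in np.unique(batch_index):
        rows = batch_index == b
        mu = feats[rows].mean(axis=0)
        sigma = feats[rows].std(axis=0)
        out[rows] = (feats[rows] - mu) / (sigma + eps)
    return out
```

With a single batch, the two coincide; learnable scale and shift parameters are omitted for brevity.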

**5. Non-linearity Layers**

Most of the commonly used non-linearity functions are applied independently, element-wise, so they touch only the feature values and leave the coordinates unchanged. Thus, for an element-wise function $f(\cdot)$ such as the rectified-linear function (ReLU), leaky ReLU, ELU, or SELU:

$$x'_{\textbf{u},i} = f(x_{\textbf{u},i})$$
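Because the function is element-wise, applying it to a sparse tensor amounts to applying it to the feature matrix alone; the coordinate matrix passes through untouched. A minimal sketch with illustrative names:

```python
import numpy as np

def sparse_relu(coords, feats):
    """Element-wise non-linearity on a sparse tensor: only the
    feature matrix changes, the coordinates are returned as-is."""
    return coords, np.maximum(feats, 0.0)
```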

**Minkowski Convolutional Neural Networks**

Problems of high-dimensional convolutions:

- Computational cost and memory consumption increase exponentially as the dimension grows, and the increase does not necessarily lead to better performance.
- With the conventional cross-entropy loss alone, the networks have no incentive to make predictions consistent throughout space and time.

**Hybrid kernel:** The hybrid kernel is a combination of a cross-shaped kernel and a conventional cubic kernel.

Spatial dimensions: use a cubic kernel to capture the spatial geometry accurately.

Temporal dimension: use a cross-shaped kernel to connect the same point in space across time.

The hybrid kernel experimentally outperforms the tesseract kernel both in speed and accuracy.
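The hybrid kernel can be described as a set of 4D offsets: a full cubic kernel over the three spatial axes at the current time step, plus a cross-shaped kernel along the temporal axis through the spatial center. A minimal sketch (function name and default sizes are illustrative assumptions):

```python
import itertools

def hybrid_kernel_offsets(spatial_size=3, temporal_size=3):
    """4D offsets of a hybrid kernel: cubic in space at t = 0,
    cross-shaped along time at the spatial center."""
    r = spatial_size // 2
    offsets = set()
    # Cubic spatial kernel at the current time step (t = 0).
    for xyz in itertools.product(range(-r, r + 1), repeat=3):
        offsets.add((*xyz, 0))
    # Cross-shaped temporal kernel through the spatial center.
    rt = temporal_size // 2
    for t in range(-rt, rt + 1):
        offsets.add((0, 0, 0, t))
    return sorted(offsets)
```

For a $3^3$ spatial kernel and 3 time steps this gives 29 offsets, versus $3^4 = 81$ for the full tesseract kernel, which is why the hybrid kernel is cheaper.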

**Figure 1**: Various kernels in space-time. The red arrow indicates the temporal dimension and the other two axes are the spatial dimensions.

## Experiment

**ScanNet:** The ScanNet 3D segmentation benchmark consists of 3D reconstructions of real rooms. It contains 1500 rooms, some of which are repeated rooms captured with different sensors. They feed an entire room to a MinkowskiNet fully convolutionally, without cropping.

**Figure 2**: 3D Semantic Label Benchmark on ScanNet. Visualization columns: 3D input point cloud, predictions, and ground truth.

**Synthia 4D:** They use the Synthia dataset to create 3D video sequences of driving. Each sequence consists of 4 stereo RGB-D images taken from the top of a car. They back-project the depth images into 3D space to create 3D videos. The Synthia 4D dataset has an order of magnitude more 3D scans than the Synthia 3D dataset.
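Back-projecting a depth image into 3D space follows the standard pinhole camera model. A minimal NumPy sketch; the intrinsic parameters `fx`, `fy`, `cx`, `cy` are placeholders, not Synthia's actual calibration:

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Back-project a depth image into 3D camera-space points
    using the pinhole model (intrinsics are placeholders)."""
    h, w = depth.shape
    # Pixel grid: u = column index, v = row index.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # pixel column -> metric X
    y = (v - cy) * z / fy  # pixel row    -> metric Y
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Stacking the back-projected clouds of consecutive frames along an extra time axis yields the 4D input that the space-time networks consume.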

**Figure 3**: Segmentation results on the 4D Synthia dataset without noise added to the input point cloud. Visualization columns: 3D network and 4D network.

**Stanford 3D Indoor:** The ScanNet and Stanford Indoor datasets are among the largest non-synthetic datasets, which makes them ideal test beds for 3D segmentation. They achieve $+19\%$ mIoU on ScanNet and $+7\%$ on Stanford compared with the original works.

**Figure 4**: Stanford Area 5 Test. Visualization columns: RGB input, predictions, and ground truth.