The objective of deep distance metric learning (DML) is to train a deep learning model that maps training samples into feature embeddings that are close together for samples from the same category and far apart for samples from different categories. Traditional DML approaches require supervised information, i.e., class labels, to guide the training. Although supervised DML achieves impressive results on different tasks, it requires a large amount of annotated training samples. Unfortunately, such large datasets are not always available, and they are costly to annotate for specific domains. This disadvantage also limits the transferability of supervised DML to new domains or applications that do not have labeled data. These reasons have motivated recent studies that aim to learn feature embeddings without annotated datasets, i.e., unsupervised deep distance metric learning (UDML). Our study follows the same direction: learning embeddings from unlabeled data.
There are two main challenges for UDML:
- Firstly, how to define positive and negative samples for a given anchor data point, such that we can apply distance-based losses, e.g., pairwise loss or triplet loss, in the embedding space.
- Secondly, how to make the training efficient, given the large number of pairs or triplets of samples, on the order of $O(N^2)$ or $O(N^3)$, respectively, where $N$ is the number of training samples.
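To make the second challenge concrete, the following minimal sketch counts how many unordered pairs and triples can be formed from a training set. The function names `count_pairs` and `count_triplets` are illustrative, not from the paper, and the counts ignore the anchor/positive/negative roles, which only change the constant factor.

```python
from math import comb


def count_pairs(n: int) -> int:
    """Number of unordered pairs among n samples: n(n-1)/2, which grows as O(n^2)."""
    return comb(n, 2)


def count_triplets(n: int) -> int:
    """Number of unordered triples among n samples: n(n-1)(n-2)/6, which grows as O(n^3)."""
    return comb(n, 3)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(f"N={n}: pairs={count_pairs(n)}, triplets={count_triplets(n)}")
```

Even at a few thousand samples the number of triples is far too large to enumerate in every epoch, which is why mining or centroid-based shortcuts are needed.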
In this paper, we propose a new method that utilizes deep clustering for deep metric learning to address the two challenges mentioned above. In particular,
- We propose to use a deep clustering loss to learn centroids, i.e., pseudo labels, that represent semantic classes.
- During learning, these centroids are also used to reconstruct the input samples. This ensures that the centroids are representative: each centroid represents visually similar samples. Therefore, the centroids give information about positive (visually similar) and negative (visually dissimilar) samples.
- Based on pseudo labels, we propose a novel unsupervised metric loss which enforces the positive concentration and negative separation of samples in the embedding space.
The proposed framework is presented in Figure 1.
- For every original image in a batch, we make an augmented version by using a random geometric transformation.
- The input images are fed into the backbone network which is also considered as the encoder (G) to get image representations.
- The image representations are passed through the embedding module which consists of fully connected and L2 normalization layers (F), which results in unit norm image embeddings.
- The clustering module takes image embeddings as inputs, performs the clustering with a clustering loss, and outputs the cluster assignments.
- Given the cluster assignments, centroid representations are computed from image representations, which are then passed through the decoder (D) with a reconstruction loss to reconstruct images that belong to the corresponding clusters.
- The centroid representations are also passed through the embedding module (F) to get centroid embeddings. The centroid embeddings and image embeddings are used as inputs for the metric loss.
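Two of the steps above can be sketched in plain Python: the L2 normalization at the end of the embedding module, and the computation of centroid representations from cluster assignments. This is only an illustration under stated assumptions: taking the centroid as the mean of the representations assigned to a cluster is an assumption, and the function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def centroids_from_assignments(representations, assignments, num_clusters):
    """Mean representation per cluster (assumed definition of a centroid).

    Clusters that receive no sample in the batch are reported as None.
    """
    dim = len(representations[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for rep, k in zip(representations, assignments):
        counts[k] += 1
        for j, value in enumerate(rep):
            sums[k][j] += value
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else None
        for k in range(num_clusters)
    ]


if __name__ == "__main__":
    reps = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
    cents = centroids_from_assignments(reps, [0, 0, 1], num_clusters=3)
    print(cents)
    print([l2_normalize(c) for c in cents if c is not None])
```

In the full framework the normalized vectors would come out of the learned embedding module rather than being normalized directly, so the sketch only covers the final normalization step.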
We formulate the clustering of embedding features as a classification problem. We are given a set of embedding features $Z = \{z_i\}_{i=1}^{m}$ in a batch of size $m$ and a number of clusters $c \le m$ (i.e., the number of clusters is limited by the batch size). The cluster assignment for each $z_i$ is then estimated from the softmax output of a classifier over the $c$ clusters. Let $Y$ be the set of softmax outputs for $Z$.
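Inspired by Regularized Information Maximization (RIM), we use the following objective function (1) for the clustering:

$$\mathcal{L}_{C} = R(\theta) - \lambda \left[ H(Y) - H(Y \mid Z) \right] \qquad (1)$$

where $H(Y)$ and $H(Y \mid Z)$ are the entropy and conditional entropy, respectively, of the cluster assignments $Y$ given the embedding features $Z$; $R(\theta)$ regularizes the classifier parameters $\theta$; and $\lambda$ is a weighting factor that controls the relative importance of the two terms.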
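Minimizing (1) is equivalent to maximizing the marginal entropy of the cluster assignments and minimizing their conditional entropy given the embedding features. Increasing the marginal entropy encourages cluster balancing, while decreasing the conditional entropy encourages cluster separation.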
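The effect of the two entropy terms can be checked on toy softmax outputs. The sketch below assumes an objective of the form "regularizer minus a weighted difference of marginal and conditional entropy", which matches the description above but is not guaranteed to be the exact form of (1). The names `lam` and `reg` are placeholders for the weighting factor and the value of the regularization term.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one probability vector; zero entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def marginal_entropy(softmax_outputs):
    """Entropy of the batch-averaged cluster distribution (high when clusters are balanced)."""
    m = len(softmax_outputs)
    c = len(softmax_outputs[0])
    marginal = [sum(row[k] for row in softmax_outputs) / m for k in range(c)]
    return entropy(marginal)


def conditional_entropy(softmax_outputs):
    """Average per-sample entropy (low when each sample is assigned confidently)."""
    return sum(entropy(row) for row in softmax_outputs) / len(softmax_outputs)


def clustering_loss(softmax_outputs, lam=1.0, reg=0.0):
    """Assumed RIM-style objective: reg - lam * (marginal entropy - conditional entropy)."""
    gain = marginal_entropy(softmax_outputs) - conditional_entropy(softmax_outputs)
    return reg - lam * gain


if __name__ == "__main__":
    balanced = [[1.0, 0.0], [0.0, 1.0]]    # confident and balanced
    collapsed = [[1.0, 0.0], [1.0, 0.0]]   # confident but all in one cluster
    uncertain = [[0.5, 0.5], [0.5, 0.5]]   # balanced but not separated
    for name, y in (("balanced", balanced), ("collapsed", collapsed), ("uncertain", uncertain)):
        print(name, clustering_loss(y))
```

Only the balanced and confident assignment attains the lowest loss; collapsing all samples into one cluster or leaving every sample undecided both give a higher value.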
In order to enhance the representativeness of centroids, we introduce a reconstruction loss (2) that penalizes high reconstruction errors from centroids to corresponding samples. Specifically, the decoder takes a centroid representation of a cluster and minimizes the difference between input images that belong to the cluster and the reconstructed image from the centroid representation.
In (2), $D$ is the decoder, which reconstructs the samples in the batch from their corresponding centroid representations, and $m$ is the number of images in the batch.
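A minimal sketch of such a reconstruction loss is given below. It assumes a squared-error penalty averaged over the batch; the exact norm used in (2) is not recoverable from the text, so this choice is an assumption. Images are flattened to lists of numbers, and `decoder` is any callable, here a hypothetical identity function standing in for the real decoder network.

```python
def reconstruction_loss(images, assignments, centroid_reps, decoder):
    """Average squared error between each image and the decoding of its cluster's centroid.

    images        : list of flattened images (lists of floats)
    assignments   : cluster index for each image
    centroid_reps : one centroid representation per cluster
    decoder       : callable mapping a centroid representation to a flattened image
    """
    total = 0.0
    for image, k in zip(images, assignments):
        reconstruction = decoder(centroid_reps[k])
        total += sum((a - b) ** 2 for a, b in zip(image, reconstruction))
    return total / len(images)


if __name__ == "__main__":
    images = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
    assignments = [0, 0, 1]
    centroids = [[2.0, 0.0], [0.0, 2.0]]

    def identity_decoder(rep):
        # Toy stand-in for the decoder: returns the representation unchanged.
        return rep

    print(reconstruction_loss(images, assignments, centroids, identity_decoder))
```

The loss is small only when every sample in a cluster resembles what its centroid decodes to, which is the sense in which reconstruction encourages representative centroids.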
Let $z_i$ and $\hat{z}_i$ be the image embeddings of the $i$-th image in the batch and of its augmented version, respectively. The proposed metric loss (3) aims to minimize the distance between $z_i$ and $\hat{z}_i$ while pushing $z_i$ far away from negative clusters.
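The behavior described above can be illustrated with a softmax-style loss over similarities. The sketch below is not the paper's equation (3), whose exact form is not given here. It is one plausible instance, under these assumptions: embeddings are unit-norm, so a larger dot product means a smaller distance; the positive is the augmented view; the negatives are the centroid embeddings of all clusters other than the sample's own; and `temperature` is a hypothetical scaling parameter.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def center_based_softmax_loss(z, z_aug, centroid_embs, own_cluster, temperature=0.1):
    """Illustrative loss: pull z toward z_aug, push z away from other clusters' centroids.

    Returns -log( e^{s_pos} / (e^{s_pos} + sum_k e^{s_k}) ), where s_pos is the scaled
    similarity to the augmented view and s_k the scaled similarity to negative centroid k.
    """
    positive = math.exp(dot(z, z_aug) / temperature)
    negatives = sum(
        math.exp(dot(z, centroid) / temperature)
        for k, centroid in enumerate(centroid_embs)
        if k != own_cluster
    )
    return -math.log(positive / (positive + negatives))


if __name__ == "__main__":
    centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    anchor = [1.0, 0.0]
    print(center_based_softmax_loss(anchor, [1.0, 0.0], centroids, own_cluster=0))
    print(center_based_softmax_loss(anchor, [0.0, 1.0], centroids, own_cluster=0))
```

Because the negatives are centroids rather than individual samples, the cost per anchor scales with the number of clusters instead of the number of samples, which is the efficiency argument behind using centroids.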
The network in Figure 1 is trained in an end-to-end manner with a multi-task loss that combines the center-based softmax loss (3) for deep metric learning, the clustering loss (1), and the reconstruction loss (2).
We denote our model with:
- only the clustering loss (1) as the clustering-only model.
- both the clustering and metric losses, (1) and (3), as Center-based Softmax (CBS).
- all three losses, (1), (2), and (3), as Center-based Softmax with Reconstruction (CBSwR).
Tables 1 and 2 present the comparative results between methods. The results show that with only the clustering loss, the accuracy is significantly lower than that of the baseline SME. However, using the centroids from the clustering to calculate the metric loss (i.e., CBS) gives a performance boost over the baseline (i.e., SME). Furthermore, the reconstruction loss enhances the representativeness of centroids, as confirmed by the improvements of CBSwR over CBS on both datasets.
Table 3 presents the training time of different methods on the CUB200-2011 and Car196 datasets. Beyond its asymptotic complexity for training one batch, CBSwR also includes a decoder, which affects the actual training time. It is worth noting that the decoder is only involved during training. During testing, our method has a computational complexity similar to that of SME.
Table 4 presents the impact of the number of clusters in the clustering loss on the CUB200-2011 dataset with our proposed model CBSwR (recall that the number of clusters is limited by the batch size). During training, the number of samples per cluster varies with the batch and the number of clusters. In our final setting, the number of samples per cluster ranges from 2 to 11 on average. Retrieval performance differs only slightly across the different numbers of clusters, which confirms the robustness of the proposed method w.r.t. the number of clusters.
Comparison to the state of the art
Table 5 presents the comparative results on the CUB200-2011 dataset. In terms of clustering quality (NMI metric), the proposed method and the state-of-the-art UDML methods MOM and SME achieve comparable accuracy. However, in terms of retrieval accuracy (R@K), our method outperforms the other approaches. Our proposed method is also competitive with most of the supervised DML methods.
Table 6 presents the comparative results on the Car196 dataset. Compared to the other unsupervised methods, the proposed method achieves higher retrieval accuracy at all ranks K, and it is comparable to them in terms of clustering quality.
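For readers unfamiliar with the retrieval metric, R@K (Recall@K) is commonly computed as the fraction of queries that have at least one item of the same class among their K nearest neighbors in the embedding space. The sketch below follows that common definition with squared Euclidean distance; the function name and the toy data are illustrative, and the class labels are used only for evaluation, never for training.

```python
def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-label item among their k nearest neighbors.

    Every item is used once as a query and is excluded from its own neighbor list.
    """
    n = len(embeddings)
    hits = 0
    for i in range(n):
        # Squared Euclidean distance from query i to every other item, sorted ascending.
        neighbors = sorted(
            (sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j])), j)
            for j in range(n)
            if j != i
        )
        if any(labels[j] == labels[i] for _, j in neighbors[:k]):
            hits += 1
    return hits / n


if __name__ == "__main__":
    toy_embeddings = [[0.0], [1.0], [3.0], [10.0]]
    toy_labels = ["a", "b", "a", "b"]
    for k in (1, 2, 3):
        print(f"R@{k} = {recall_at_k(toy_embeddings, toy_labels, k)}")
```

R@K is non-decreasing in K, since enlarging the neighbor list can only add hits.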
Figure 2 shows the t-SNE plots of our learned embedding features on CUB200-2011. Our embedding groups visually similar objects reasonably well despite significant variations in viewpoint, pose, and configuration.
We propose a new method that utilizes deep clustering for deep metric learning to address the two challenges in UDML, i.e., positive/negative mining and efficient training. The method is based on a novel loss that consists of a learnable clustering function, a reconstruction function, and a center-based metric loss function. Our experiments on the CUB200-2011 and Car196 datasets show state-of-the-art performance on the retrieval task compared to other unsupervised learning methods.
🐱 Github: https://github.com/aioz-ai/BMVC20_CBSwR