Fine-Grained Visual Classification using Self-Assessment Classifier (Part 1)
Extracting discriminative features plays a crucial role in the fine-grained visual classification task. Most of the existing methods focus on developing attention or augmentation mechanisms to achieve this goal. However, addressing the ambiguity in the top-k prediction classes is not fully investigated. In this paper, we introduce a Self Assessment Classifier, which simultaneously leverages the representation of the image and the top-k prediction classes to reassess the classification results. Our method is inspired by self-supervised learning with coarse-grained and fine-grained classifiers to increase the discrimination of features in the backbone and produce attention maps of informative areas on the image. In practice, our method works as an auxiliary branch and can be easily integrated into different architectures. We show that by effectively addressing the ambiguity in the top-k prediction classes, our method achieves new state-of-the-art results on the CUB-200-2011, Stanford Dogs, and FGVC-Aircraft datasets. Furthermore, our method also consistently improves the accuracy of different existing fine-grained classifiers with a unified setup.
1. Introduction
The task of fine-grained visual classification involves categorizing images that belong to the same class (e.g., various species of birds, types of aircraft, or different varieties of flowers). Compared to standard image classification tasks, fine-grained classification poses greater challenges due to three primary factors: (i) significant intra-class variation, where objects within the same category exhibit diverse poses and viewpoints; (ii) subtle inter-class distinctions, where objects from different categories may appear very similar except for minor differences, such as the color patterns of a bird's head often determining its fine-grained classification; and (iii) constraints on training data availability, as annotating fine-grained categories typically demands specialized expertise and considerable annotation effort. Consequently, achieving accurate classification results solely with state-of-the-art CNN models like VGG is nontrivial.
Recent research demonstrates that a crucial strategy for fine-grained classification involves identifying informative regions across various parts of objects and extracting distinguishing features. A common approach is to learn the object's parts through human annotations. However, annotating fine-grained regions is labor-intensive, rendering this approach impractical. More recent work has explored unsupervised or weakly supervised learning techniques to identify informative object parts or region-of-interest bounding boxes. While these methods offer potential solutions that circumvent manual labeling of fine-grained regions, they come with limitations such as reduced accuracy, high computational costs during training or inference, and challenges in accurately detecting distinct bounding boxes.
In this paper, we introduce the Self Assessment Classifier (SAC) method to tackle the inherent ambiguity present in fine-grained classification tasks. Essentially, our approach is devised to reevaluate the top-k prediction outcomes and filter out uninformative regions within the input image. This mitigates inter-class ambiguity and enables the backbone network to learn more discerning features. Throughout training, our method generates attention maps that highlight informative regions within the input image. By integrating this method into a backbone network, we aim to reduce misclassifications among the top-k ambiguous classes. It is important to note that "ambiguity classes" refers to classes whose prediction uncertainty can lead to incorrect classifications. Our contributions can be succinctly outlined as follows:
- We propose a novel self-class assessment method that simultaneously learns discriminative features and addresses ambiguity issues in fine-grained visual classification tasks.
- We demonstrate the versatility of our method by showcasing its seamless integration into various fine-grained classifiers, resulting in improved state-of-the-art performance.
2. Method Overview
We propose two main steps in our method: Top-k Coarse-grained Class Search (TCCS) and Self Assessment Classifier (SAC). TCCS works as a coarse-grained classifier to extract visual features from the backbone. The Self Assessment Classifier works as a fine-grained classifier to reassess the ambiguity classes and eliminate the non-informative regions. Our SAC has four modules: the Top-k Class Embedding module encodes the information of the ambiguity classes; the Joint Embedding module jointly learns the coarse-grained features and the top-k ambiguity classes; the Self Assessment module is designed to differentiate between ambiguity classes; and finally, the Dropping module is a data augmentation method designed to erase unnecessary, inter-class similar regions from the input image. Figure 2 shows an overview of our approach.
3. Top-k Coarse-grained Class Search
The TCCS takes an image as input. Each input image is passed through a deep CNN to extract the feature map $F \in \mathbb{R}^{H \times W \times d_f}$ and the visual feature $V \in \mathbb{R}^{d_v}$. Here, $H$, $W$, and $d_f$ represent the feature map height, width, and number of channels, respectively; $d_v$ denotes the dimension of the visual feature $V$. In practice, the visual feature $V$ is usually obtained by applying some fully connected layers after the convolutional feature map $F$.
The visual feature $V$ is used by the classifier, i.e., the original classifier of the backbone, to obtain the top-k prediction results. Assume that the fine-grained dataset has $N$ classes. The top-k prediction result $C_k = \{c_1, c_2, \dots, c_k\}$ is a subset of all $N$ prediction classes, where $k$ is the number of candidate classes with the $k$-highest confidence scores.
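To make the TCCS step concrete, below is a minimal PyTorch sketch assuming a torchvision ResNet-50 as the backbone; the class name TopKCoarseSearch and all hyper-parameters are illustrative choices of this post, not the authors' implementation.

```python
# Minimal sketch of Top-k Coarse-grained Class Search (TCCS), assuming a
# torchvision ResNet-50 backbone. All names and hyper-parameters are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TopKCoarseSearch(nn.Module):
    def __init__(self, num_classes: int, k: int = 5):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to (but not including) the avgpool/fc head,
        # so the forward pass exposes the convolutional feature map F.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)  # original coarse-grained classifier
        self.k = k

    def forward(self, images: torch.Tensor):
        feat_map = self.features(images)          # F: (B, d_f=2048, H, W)
        visual = self.pool(feat_map).flatten(1)   # V: (B, d_v=2048)
        logits = self.fc(visual)                  # coarse scores over all N classes
        _, topk_idx = logits.topk(self.k, dim=1)  # indices of the k-highest scores
        return feat_map, visual, logits, topk_idx

# Usage example (e.g., CUB-200-2011 has N = 200 classes)
model = TopKCoarseSearch(num_classes=200, k=5)
x = torch.randn(2, 3, 448, 448)
F_map, V, logits, topk = model(x)
print(F_map.shape, V.shape, topk.shape)  # (2, 2048, 14, 14) (2, 2048) (2, 5)
```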
4. Self Assessment Classifier
Our Self Assessment Classifier takes the image feature $F$ and the top-k predictions $C_k$ from TCCS as input to reassess the fine-grained classification results.
Top-k Class Embedding
The output $C_k$ of the TCCS module is passed through the top-k class embedding module to produce the label embedding set $E \in \mathbb{R}^{k \times d_e}$. This module contains a word embedding layer (GloVe; Pennington et al., 2014) for encoding each word in the class labels and a GRU layer (Cho et al., 2014) for learning the temporal information in class label names. $d_e$ represents the dimension of each class label embedding. It is worth noting that the embedding module is trained end-to-end with the whole model. Hence, the class label representations are learned from scratch, without the need for any pre-extracted/pre-trained embeddings or transfer learning.
Given an input class label, we trim it to a maximum of $L$ words; class labels shorter than $L$ words are zero-padded. Each word is then represented by a $d_w$-D word embedding. This step results in a sequence of word embeddings with a size of $L \times d_w$, denoted as $E_i$ for the $i$-th class label in the class label set. In order to capture the dependency within the class label name, $E_i$ is passed through a Gated Recurrent Unit (GRU), which results in a $d_e$-D vector representation for each input class. Note that, although we use the language modality (i.e., the class label name), this is not extra information, as the class label name and the class label identity (used for calculating the loss) represent the same object category.
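A minimal sketch of this embedding module is shown below, assuming a learnable word-embedding table (the text notes pre-trained vectors are not required) and a single-layer GRU; the vocabulary size, maximum length $L$, and dimensions $d_w$, $d_e$ are placeholder values, and TopKClassEmbedding is an illustrative name rather than the authors' code.

```python
# Sketch of the top-k class embedding module: each class-label name is tokenized,
# embedded word by word, and summarized by a GRU into a d_e-dimensional vector.
# Vocabulary, max length L, and the dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class TopKClassEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_w: int = 300, d_e: int = 256, max_len: int = 8):
        super().__init__()
        self.max_len = max_len
        self.word_emb = nn.Embedding(vocab_size, d_w, padding_idx=0)  # 0 = padding token
        self.gru = nn.GRU(d_w, d_e, batch_first=True)

    def forward(self, label_tokens: torch.Tensor) -> torch.Tensor:
        # label_tokens: (B, k, L) integer word ids of the top-k class names,
        # already trimmed / zero-padded to L = max_len words.
        B, k, L = label_tokens.shape
        words = self.word_emb(label_tokens.view(B * k, L))  # (B*k, L, d_w)
        _, last_hidden = self.gru(words)                    # (1, B*k, d_e)
        return last_hidden.squeeze(0).view(B, k, -1)        # E: (B, k, d_e)

# Usage: embed k = 5 candidate class names, each padded to 8 word ids.
emb = TopKClassEmbedding(vocab_size=1000)
tokens = torch.randint(1, 1000, (2, 5, 8))
E = emb(tokens)
print(E.shape)  # (2, 5, 256)
```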
Joint Embedding
This module takes the feature map $F$ and the top-k class embedding $E$ as the input to produce the joint representation $J$ and the attention map. We first flatten $F$ into $\bar{F} \in \mathbb{R}^{f \times d_f}$, where $f = H \times W$, and stack the $k$ label embeddings into $\bar{E} \in \mathbb{R}^{k \times d_e}$. The joint representation $J \in \mathbb{R}^{d_j}$ is calculated from the two modalities $\bar{F}$ and $\bar{E}$ as follows:

$$J = \mathcal{T} \times_1 \operatorname{vec}(\bar{F}) \times_2 \operatorname{vec}(\bar{E}) \quad (1)$$

where $\mathcal{T} \in \mathbb{R}^{d_{\bar{F}} \times d_{\bar{E}} \times d_j}$ is a learnable tensor; $d_{\bar{F}} = f \times d_f$; $d_{\bar{E}} = k \times d_e$; $\operatorname{vec}(\cdot)$ is a vectorization operator; and $\times_i$ denotes the $i$-mode tensor product.
In practice, the full tensor $\mathcal{T}$ is too large and infeasible to learn. Thus, we apply a decomposition solution that reduces the size of $\mathcal{T}$ but still retains the learning effectiveness. We rely on the idea of the unitary attention mechanism. Specifically, let $J_{ij} \in \mathbb{R}^{d_j}$ be the joint representation of a couple of channels, where each channel in the couple comes from a different input modality. The joint representation $J$ is approximated by the joint representations of all couples, instead of using the fully parameterized interaction in Eq. (1). Hence, we compute $J$ as:

$$J = \sum_{i=1}^{f} \sum_{j=1}^{k} \mathcal{A}_{ij} J_{ij} \quad (2)$$
Note that in Eq. (2) we compute a weighted sum over all possible couples. The couple $(f_i, e_j)$ is associated with a scalar attention weight $\mathcal{A}_{ij}$. The set of $\mathcal{A}_{ij}$ is called the attention map $\mathcal{A}$, where $\mathcal{A} \in \mathbb{R}^{f \times k}$.
There are $f \times k$ possible couples over the two modalities. The representations of the two channels in a couple are $f_i \in \mathbb{R}^{d_f}$ and $e_j \in \mathbb{R}^{d_e}$, where $i \in \{1, \dots, f\}$ and $j \in \{1, \dots, k\}$, respectively. The joint representation $J_{ij}$ of a couple is then computed as follows:

$$J_{ij} = \mathcal{T}_u \times_1 f_i \times_2 e_j \quad (3)$$

where $\mathcal{T}_u \in \mathbb{R}^{d_f \times d_e \times d_j}$ is the learnable tensor between the channels in the couple.
From Eq. (3), we can compute the attention map $\mathcal{A}$ using a reduced parameterized bilinear interaction over the inputs $\bar{F}$ and $\bar{E}$. The attention map is computed as

$$\mathcal{A} = \operatorname{softmax}\!\left(\bar{F}\, \mathcal{T}_{\mathcal{M}}\, \bar{E}^{\top}\right) \quad (4)$$

where $\mathcal{T}_{\mathcal{M}} \in \mathbb{R}^{d_f \times d_e}$ is a learnable tensor.
The joint representation $J$ can then be rewritten as

$$J = \sum_{i=1}^{f} \sum_{j=1}^{k} \mathcal{A}_{ij} \left( \mathcal{T}_u \times_1 f_i \times_2 e_j \right) \quad (5)$$

It is also worth noting from Eq. (5) that to compute $J$, instead of learning the large tensor $\mathcal{T} \in \mathbb{R}^{d_{\bar{F}} \times d_{\bar{E}} \times d_j}$, we now only need to learn two much smaller tensors: $\mathcal{T}_u \in \mathbb{R}^{d_f \times d_e \times d_j}$ in Eq. (3) and $\mathcal{T}_{\mathcal{M}} \in \mathbb{R}^{d_f \times d_e}$ in Eq. (4).
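The sketch below illustrates Eqs. (3)-(5) in PyTorch with torch.einsum, using deliberately small dimensions ($d_f{=}512$, $d_e{=}256$, $d_j{=}256$) so the couple tensor $\mathcal{T}_u$ stays manageable; with a real ResNet-50 feature map ($d_f{=}2048$), $\mathcal{T}_u$ would typically be factorized further. The softmax normalization over all $f \times k$ couples is our assumption for illustration.

```python
# Sketch of the unitary-attention joint embedding: attention weights over all
# (f_i, e_j) couples (Eq. (4)), followed by an attention-weighted sum of
# couple-level bilinear interactions with the small tensor T_u (Eqs. (3), (5)).
# Dimensions and the softmax over all couples are illustrative assumptions.
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    def __init__(self, d_f: int = 512, d_e: int = 256, d_j: int = 256):
        super().__init__()
        self.T_u = nn.Parameter(torch.randn(d_f, d_e, d_j) * 0.01)  # tensor of Eq. (3)
        self.T_M = nn.Parameter(torch.randn(d_f, d_e) * 0.01)       # tensor of Eq. (4)

    def forward(self, feat_map: torch.Tensor, E: torch.Tensor):
        # feat_map: (B, d_f, H, W), E: (B, k, d_e)
        B, d_f, H, W = feat_map.shape
        F_bar = feat_map.flatten(2).transpose(1, 2)                  # (B, f = H*W, d_f)
        # Attention logits over all f x k couples: F_bar @ T_M @ E^T   -- Eq. (4)
        logits = torch.einsum('bia,ae,bje->bij', F_bar, self.T_M, E)
        A = torch.softmax(logits.flatten(1), dim=1).view_as(logits)  # (B, f, k)
        # Attention-weighted sum of couple-level interactions           -- Eq. (5)
        J = torch.einsum('bia,aec,bje,bij->bc', F_bar, self.T_u, E, A)
        return J, A                                                   # J: (B, d_j)

# Usage example (feature map pre-projected to d_f = 512 for this illustration)
joint = JointEmbedding()
J, A = joint(torch.randn(2, 512, 14, 14), torch.randn(2, 5, 256))
print(J.shape, A.shape)  # (2, 256) (2, 196, 5)
```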
Self Assessment
The joint representation $J$ from the Joint Embedding module is used as the input of the Self Assessment step to obtain the top-k predictions $C'_k$. Note that $C'_k = \{c'_1, c'_2, \dots, c'_k\}$. Intuitively, $C'_k$ is the top-k classification result after self-assessment. This module is a fine-grained classifier that produces the predictions to reassess the ambiguity classification results.
The contributions of the coarse-grained and fine-grained classifiers are combined as

$$\operatorname{Pr}(\hat{c}) = \alpha \operatorname{Pr}_{\mathrm{co}}(\hat{c}) + (1 - \alpha) \operatorname{Pr}_{\mathrm{fi}}(\hat{c}) \quad (6)$$

where $\alpha \in [0, 1]$ is a trade-off hyper-parameter; $\operatorname{Pr}_{\mathrm{co}}(\hat{c})$ and $\operatorname{Pr}_{\mathrm{fi}}(\hat{c})$ denote the prediction probabilities for class $\hat{c}$ from the coarse-grained and fine-grained classifiers, respectively.
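As a rough sketch, the snippet below implements a fine-grained head over $J$ together with the fusion of Eq. (6); treating the fine-grained classifier as a plain $N$-way linear layer (rather than a module that scores only the $k$ candidates) is a simplification made for this illustration.

```python
# Sketch of the Self Assessment classifier and the coarse/fine fusion of Eq. (6).
# The plain linear N-way head over J is a simplifying assumption of this sketch.
import torch
import torch.nn as nn

class SelfAssessment(nn.Module):
    def __init__(self, d_j: int = 256, num_classes: int = 200, alpha: float = 0.5):
        super().__init__()
        self.fine_fc = nn.Linear(d_j, num_classes)  # fine-grained classifier over J
        self.alpha = alpha                          # trade-off hyper-parameter in [0, 1]

    def forward(self, J: torch.Tensor, coarse_logits: torch.Tensor, k: int = 5):
        fine_logits = self.fine_fc(J)
        # Eq. (6): Pr(c) = alpha * Pr_co(c) + (1 - alpha) * Pr_fi(c)
        prob = self.alpha * coarse_logits.softmax(dim=1) \
             + (1 - self.alpha) * fine_logits.softmax(dim=1)
        reassessed_topk = prob.topk(k, dim=1).indices  # C'_k after self-assessment
        return prob, reassessed_topk

# Usage with the shapes produced by the earlier sketches
head = SelfAssessment(d_j=256, num_classes=200, alpha=0.5)
prob, topk = head(torch.randn(2, 256), torch.randn(2, 200))
print(prob.shape, topk.shape)  # (2, 200) (2, 5)
```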
Next
In the next post, we will verify the effectiveness and efficiency of the method.