Compact Trilinear Interaction for Visual Question Answering
In Visual Question Answering (VQA), answers are highly correlated with the question meaning and the visual content. Thus, to selectively utilize image, question, and answer information, we propose a novel trilinear interaction model which simultaneously learns high-level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition which efficiently parameterizes the trilinear interaction between the three inputs. Moreover, knowledge distillation is applied in Free-form Open-ended VQA, not only to reduce the computational cost and required memory but also to transfer knowledge from the trilinear interaction model to a bilinear interaction model. Extensive experiments on the benchmark datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction model achieves state-of-the-art results on all three datasets.
For the free-form open-ended VQA task, CTI achieves 67.4% on VQA-2.0 and 87.0% on TDIUC under the VQA accuracy metric.
For the multiple choice VQA task, CTI achieves 72.3% on Visual7W under the MC-VQA accuracy metric.
Compact Trilinear Interaction in VQA
Let $M = \{M_1, M_2, M_3\}$ be the representations of the three inputs, where $M_t \in \mathbb{R}^{n_t \times d_t}$, in which $n_t$ is the number of channels of the input $M_t$ and $d_t$ is the dimension of each channel.
For example, if $M_1$ is the region-based representation of an image, then $n_1$ is the number of regions and $d_1$ is the dimension of the feature representation for each region. Let $m_{t_e} \in \mathbb{R}^{d_t}$ be the $e$-th row of $M_t$, i.e., the feature representation of channel $e$ in $M_t$, where $e \in \{1, 2, \dots, n_t\}$.
The input for training VQA is a set of $(V, Q, A)$ in which $V \in \mathbb{R}^{v \times d_v}$ is an image representation, where $v$ is the number of interested regions (or bounding boxes) in the image and $d_v$ is the dimension of the representation for a region; $Q \in \mathbb{R}^{q \times d_q}$ is a question representation, where $q$ is the number of hidden states and $d_q$ is the dimension of each hidden state; and $A \in \mathbb{R}^{a \times d_a}$ is an answer representation, where $a$ is the number of hidden states and $d_a$ is the dimension of each hidden state.
We firstly compute the attention map $\mathcal{M} \in \mathbb{R}^{v \times q \times a}$ as follows:
$$\mathcal{M} = \sum_{r=1}^{R} \llbracket \mathcal{G}_r; V W_{v_r}, Q W_{q_r}, A W_{a_r} \rrbracket.$$
Then the joint representation $z \in \mathbb{R}^{d_z}$ is computed as follows:
$$z^T = \sum_{i=1}^{v} \sum_{j=1}^{q} \sum_{k=1}^{a} \mathcal{M}_{ijk} \, \big( V_i W_v \circ Q_j W_q \circ A_k W_a \big),$$
where $W_{v_r}, W_{q_r}, W_{a_r}$ and $W_v, W_q, W_a$ are learnable factor matrices; each $\mathcal{G}_r$ is a learnable Tucker tensor; $\circ$ denotes the Hadamard (element-wise) product; and $V_i$, $Q_j$, $A_k$ are the $i$-th, $j$-th, and $k$-th rows of $V$, $Q$, $A$, respectively.
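To make the decomposition concrete, a minimal PyTorch sketch is given below. It assumes a simplified PARALIND-style parameterization; the rank, the number of blocks $R$, the joint dimension $d_z$, and the toy input sizes are illustrative assumptions (the attention map is also left un-normalized), so this is a sketch of the idea rather than the authors' implementation.

    # A minimal PyTorch sketch of the decomposed trilinear interaction described
    # above. Rank, number of blocks R, and joint dimension d_z are illustrative
    # assumptions, not the configuration used in the paper.
    import torch
    import torch.nn as nn

    class CompactTrilinearInteraction(nn.Module):
        def __init__(self, d_v, d_q, d_a, d_z=512, rank=32, R=2):
            super().__init__()
            # Factor matrices of the PARALIND-style attention map, one set per block r.
            self.Wv_r = nn.ModuleList([nn.Linear(d_v, rank, bias=False) for _ in range(R)])
            self.Wq_r = nn.ModuleList([nn.Linear(d_q, rank, bias=False) for _ in range(R)])
            self.Wa_r = nn.ModuleList([nn.Linear(d_a, rank, bias=False) for _ in range(R)])
            # Small learnable core (Tucker) tensor G_r for each block.
            self.G = nn.Parameter(0.01 * torch.randn(R, rank, rank, rank))
            # Projections of each input to the common joint dimension d_z.
            self.Wv = nn.Linear(d_v, d_z, bias=False)
            self.Wq = nn.Linear(d_q, d_z, bias=False)
            self.Wa = nn.Linear(d_a, d_z, bias=False)

        def forward(self, V, Q, A):
            # V: (b, v, d_v), Q: (b, q, d_q), A: (b, a, d_a)
            M = 0
            for r in range(self.G.shape[0]):
                Vr, Qr, Ar = self.Wv_r[r](V), self.Wq_r[r](Q), self.Wa_r[r](A)
                # Attention-map block: mode products of G_r with the projected inputs.
                M = M + torch.einsum('xyz,bix,bjy,bkz->bijk', self.G[r], Vr, Qr, Ar)
            # Joint representation: attention-weighted sum of Hadamard products of
            # the projected rows, i.e. z_d = sum_ijk M_ijk * V_id * Q_jd * A_kd.
            z = torch.einsum('bijk,bid,bjd,bkd->bd',
                             M, self.Wv(V), self.Wq(Q), self.Wa(A))
            return z

    # Toy usage with hypothetical sizes: 36 image regions, 12-token question/answer.
    cti = CompactTrilinearInteraction(d_v=2048, d_q=300, d_a=300)
    z = cti(torch.randn(4, 36, 2048), torch.randn(4, 12, 300), torch.randn(4, 12, 300))
    print(z.shape)  # torch.Size([4, 512])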
Integrating CTI into different VQA tasks
For multiple choice VQA
Each input question and each answer are trimmed to a maximum of 12 words and are zero-padded if shorter than 12 words. Each word is then represented by a 300-D GloVe word embedding. Each image is represented by a grid feature (a grid of cells, each cell holding a 2048-D feature) extracted from the second-last layer of ResNet-152 pre-trained on ImageNet.
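As an illustration of this text preprocessing, the sketch below trims or zero-pads a word sequence to 12 tokens and looks up 300-D GloVe vectors; the `glove` dictionary is a hypothetical stand-in for an actual GloVe loader.

    # Hedged sketch of the question/answer preprocessing described above.
    import numpy as np

    MAX_WORDS, EMB_DIM = 12, 300

    def encode_text(words, glove):
        """Trim/zero-pad to 12 words and embed each word with a 300-D GloVe vector."""
        words = words[:MAX_WORDS]
        vecs = [glove.get(w, np.zeros(EMB_DIM, dtype=np.float32)) for w in words]
        vecs += [np.zeros(EMB_DIM, dtype=np.float32)] * (MAX_WORDS - len(vecs))
        return np.stack(vecs)  # shape (12, 300)

    # Example (glove: dict mapping word -> 300-D vector):
    # q_feat = encode_text("what color is the umbrella".split(), glove)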
Input samples are divided into positive samples and negative samples. A positive sample, labelled as 1 in binary classification, contains the image, the question, and the right answer. A negative sample, labelled as 0, contains the image, the question, and a wrong answer. These samples are passed through CTI to get the joint representation $z$. The joint representation is then passed through a binary classifier to get the prediction, and the binary cross-entropy loss is used for training the model.
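A minimal sketch of this training setup is shown below, assuming a two-layer binary classifier on top of the CTI joint representation; the layer sizes are illustrative.

    # Sketch of the MC VQA head: the CTI joint representation z is scored by a
    # binary classifier trained with binary cross-entropy. Sizes are assumptions.
    import torch
    import torch.nn as nn

    classifier = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))
    criterion = nn.BCEWithLogitsLoss()

    z = torch.randn(8, 512)                       # joint representations from CTI
    labels = torch.randint(0, 2, (8, 1)).float()  # 1 = right answer, 0 = wrong answer
    loss = criterion(classifier(z), labels)
    loss.backward()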
For free-form open-ended VQA
Unlike MC VQA, FFOE VQA treats answering as a classification problem over a set of predefined answers. Hence, the set of possible answers for each question-image pair is much larger than in MC VQA. For each question-image input, the model takes every possible answer from its answer list to compute the joint representation, which causes a high computational cost.
In addition, CTI requires all three inputs to compute the joint representation. However, during testing, no answer information is available in FFOE VQA. To overcome these challenges, we propose to use knowledge distillation to transfer the learned knowledge from a teacher model to a student model.
The loss function for the student model is defined as:
$$\mathcal{L} = \alpha \, \mathcal{L}_{CE}(Q^\tau_S, Q^\tau_T) + (1 - \alpha) \, \mathcal{L}_{CE}(Q_S, y_{true}),$$
where $\mathcal{L}_{CE}$ stands for the Cross Entropy loss; $Q_S$ is the standard softmax output of the student; $y_{true}$ is the ground-truth answer label;
$\alpha$ is a hyper-parameter controlling the importance of each loss component; $Q^\tau_S$ and $Q^\tau_T$ are the softened outputs of the student and the teacher using the same temperature parameter $\tau$, which are computed as follows:
$$Q^\tau_i = \frac{\exp(l_i / \tau)}{\sum_{j} \exp(l_j / \tau)},$$
where, for both the teacher and the student models, the logit $l$ is the prediction output by the corresponding classifier.
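A minimal sketch of this objective, assuming hard ground-truth labels, illustrative values for $\alpha$ and $\tau$, and no additional scaling of the soft term, is given below.

    # Hedged sketch of the distillation loss above (standard soft-target KD).
    # alpha, tau, batch size, and the answer-vocabulary size are assumptions.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, y_true, alpha=0.5, tau=2.0):
        # Hard-label term: cross entropy between student predictions and ground truth.
        hard = F.cross_entropy(student_logits, y_true)
        # Soft-label term: cross entropy between softened student and teacher outputs.
        log_q_s = F.log_softmax(student_logits / tau, dim=1)
        q_t = F.softmax(teacher_logits / tau, dim=1)
        soft = -(q_t * log_q_s).sum(dim=1).mean()
        return alpha * soft + (1.0 - alpha) * hard

    # Toy usage with a hypothetical answer vocabulary of 3000 classes.
    s = torch.randn(8, 3000, requires_grad=True)   # student logits
    t = torch.randn(8, 3000)                       # teacher logits
    y = torch.randint(0, 3000, (8,))               # ground-truth answer indices
    loss = distillation_loss(s, t, y)
    loss.backward()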
Results
To further evaluate the effectiveness of CTI, we conduct a detailed comparison with the current state of the art. For FFOE VQA, we compare CTI with the recent state-of-the-art methods on TDIUC and VQA-2.0 datasets. For MC VQA, we compare with the state-of-the-art methods on Visual7W dataset.
Regarding FFOE VQA, Table 1 and Table 2 show comparative results on VQA-2.0 and TDIUC, respectively. Specifically, the distilled student BAN2-CTI outperforms all compared methods over all metrics by a large margin; on TDIUC, it outperforms the current state-of-the-art QTA on both the Ari and Har metrics. The results confirm that the trilinear interaction has learned informative representations from the three inputs and that the learned information is effectively transferred to the student models by distillation.
Regarding MC VQA, Table 3 shows that CTI outperforms the compared methods by a noticeable margin, improving over the current state-of-the-art STL by 1.1%. Again, this validates the effectiveness of the proposed joint representation learning, which precisely and simultaneously learns interactions between the three inputs. We note that when comparing with other methods on Visual7W, we used grid features extracted from ResNet-152 for the image representations to ensure a fair comparison. Our proposed model can achieve further improvements by using the object detection-based features used in FFOE VQA. With these features, the model, denoted as CTIwBoxes in Table 3, achieves 72.3% accuracy under the Acc-MC metric, which improves over the current state-of-the-art STL by 4.1%.
Conclusion
A novel compact trilinear interaction is introduced to simultaneously learn high-level associations between image, question, and answer in both MC VQA and FFOE VQA. In addition, knowledge distillation is applied to FFOE VQA for the first time to overcome the computational complexity and memory issues of the interaction. Extensive experimental results show that the proposed models achieve state-of-the-art results on three benchmark datasets.