In Visual Question Answering (VQA), answers are highly correlated with the meaning of the question and the visual content. Thus, to selectively utilize image, question, and answer information, we propose a novel trilinear interaction model that simultaneously learns high-level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition that efficiently parameterizes the trilinear interaction between the three inputs. Moreover, knowledge distillation is applied to Free-form Open-ended (FFOE) VQA, not only to reduce the computational cost and required memory but also to transfer knowledge from the trilinear interaction model to a bilinear interaction model. Extensive experiments on the benchmark datasets TDIUC, VQA-2.0, and Visual7W show that the proposed compact trilinear interaction (CTI) model achieves state-of-the-art results on all three datasets.
For the free-form open-ended VQA task, CTI achieves 67.4 on VQA-2.0 and 87.0 on TDIUC under the VQA accuracy metric.
For the multiple choice VQA task, CTI achieves 72.3 on Visual7W under the MC-VQA accuracy metric.
#Compact Trilinear Interaction in VQA
Let $M = \{M_1, M_2, M_3\}$ be the representations of the three inputs, where $M_t \in \mathbb{R}^{n_t \times d_t}$, $n_t$ is the number of channels of the input $M_t$, and $d_t$ is the dimension of each channel.
For example, if $M_1$ is the region-based representation of an image, then $n_1$ is the number of regions and $d_1$ is the dimension of the feature representation for each region. Let $m_{t_e} \in \mathbb{R}^{1 \times d_t}$ be the $e$-th row of $M_t$, i.e., the feature representation of the $e$-th channel in $M_t$, where $t \in \{1, 2, 3\}$.
The input for training VQA is a set of triplets $(V, Q, A)$, in which $V$ is an image representation, $V \in \mathbb{R}^{v \times d_v}$, where $v$ is the number of interested regions (bounding boxes) in the image and $d_v$ is the dimension of the representation for a region; $Q$ is a question representation, $Q \in \mathbb{R}^{q \times d_q}$, where $q$ is the number of hidden states and $d_q$ is the dimension of each hidden state; and $A$ is an answer representation, $A \in \mathbb{R}^{a \times d_a}$, where $a$ is the number of hidden states and $d_a$ is the dimension of each hidden state.
We first compute the attention map $\mathcal{M}$ as follows:

$$\mathcal{M} = \sum_{r=1}^{R} [\![ \mathcal{G}_r;\, V W_{v_r},\, Q W_{q_r},\, A W_{a_r} ]\!]$$

Then the joint representation $z$ is computed as follows:

$$z^T = \sum_{i=1}^{v} \sum_{j=1}^{q} \sum_{k=1}^{a} \mathcal{M}_{ijk} \left( V_i W_{z_v} \circ Q_j W_{z_q} \circ A_k W_{z_a} \right)$$

where $W_{v_r}, W_{q_r}, W_{a_r}$ and $W_{z_v}, W_{z_q}, W_{z_a}$ are learnable factor matrices, $\circ$ denotes the Hadamard product, and each $\mathcal{G}_r$ is a small learnable Tucker tensor.
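To make the interaction concrete, below is a minimal PyTorch sketch of the two equations above: the attention map $\mathcal{M}$ is built as a sum of $R$ small Tucker terms and then used to weight Hadamard products of the projected inputs. The module name, the rank $R$, and the reduced dimensions `dr` and `dz` are illustrative choices, not values from the official implementation.

```python
import torch
import torch.nn as nn


class CompactTrilinearInteraction(nn.Module):
    def __init__(self, dv, dq, da, dr=32, dz=512, R=2):
        super().__init__()
        # Factor matrices W_{v_r}, W_{q_r}, W_{a_r} for the attention map (one set per rank-r term).
        self.Wv = nn.ModuleList([nn.Linear(dv, dr, bias=False) for _ in range(R)])
        self.Wq = nn.ModuleList([nn.Linear(dq, dr, bias=False) for _ in range(R)])
        self.Wa = nn.ModuleList([nn.Linear(da, dr, bias=False) for _ in range(R)])
        # Small learnable Tucker cores G_r.
        self.G = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(dr, dr, dr)) for _ in range(R)])
        # Factor matrices W_{z_v}, W_{z_q}, W_{z_a} for the joint representation z.
        self.Wzv = nn.Linear(dv, dz, bias=False)
        self.Wzq = nn.Linear(dq, dz, bias=False)
        self.Wza = nn.Linear(da, dz, bias=False)

    def forward(self, V, Q, A):
        # V: (B, v, dv), Q: (B, q, dq), A: (B, a, da)
        # Attention map M: (B, v, q, a), a sum of R small Tucker terms.
        M = 0
        for r in range(len(self.G)):
            Vr, Qr, Ar = self.Wv[r](V), self.Wq[r](Q), self.Wa[r](A)
            M = M + torch.einsum('xyz,bix,bjy,bkz->bijk', self.G[r], Vr, Qr, Ar)
        # Joint representation z: M weights the Hadamard products of the projected inputs.
        Vz, Qz, Az = self.Wzv(V), self.Wzq(Q), self.Wza(A)
        z = torch.einsum('bijk,bid,bjd,bkd->bd', M, Vz, Qz, Az)
        return z


# Example with illustrative dimensions: 196 image regions, 12 question and 12 answer tokens.
cti = CompactTrilinearInteraction(dv=2048, dq=300, da=300)
z = cti(torch.randn(2, 196, 2048), torch.randn(2, 12, 300), torch.randn(2, 12, 300))
print(z.shape)  # torch.Size([2, 512])
```

Note that this sketch materializes $\mathcal{M}$ explicitly for readability; an efficient implementation would fold the decomposition into the contraction to avoid storing the full $v \times q \times a$ tensor.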
#Integrating CTI into different VQA tasks
#For multiple choice VQA
Figure 1. The model when CTI is applied to MC VQA.
Each input question and each answer is trimmed to a maximum of 12 words and zero-padded if shorter than 12 words. Each word is then represented by a 300-D GloVe word embedding. Each image is represented by a 14×14×2048 grid feature (i.e., 196 cells, each with a 2048-D feature), extracted from the second-to-last layer of ResNet-152 pre-trained on ImageNet.
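A rough sketch of this preprocessing is given below, assuming a preloaded GloVe lookup table (a word-to-vector dict here) and torchvision's ImageNet-pretrained ResNet-152; the exact loaders, transforms, and torchvision API version used in the official code may differ.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

MAX_WORDS = 12

def embed_text(tokens, glove):
    """Trim/zero-pad to 12 tokens and map each to its 300-D GloVe vector."""
    tokens = tokens[:MAX_WORDS]
    vecs = [glove.get(w, torch.zeros(300)) for w in tokens]
    vecs += [torch.zeros(300)] * (MAX_WORDS - len(vecs))
    return torch.stack(vecs)                              # (12, 300)

# Grid features: drop ResNet-152's avgpool/fc, keep the 14x14x2048 conv map.
resnet = models.resnet152(weights='IMAGENET1K_V1')
grid_extractor = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = T.Compose([T.Resize((448, 448)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def image_grid_features(pil_image):
    x = preprocess(pil_image).unsqueeze(0)                # (1, 3, 448, 448)
    with torch.no_grad():
        fmap = grid_extractor(x)                          # (1, 2048, 14, 14)
    return fmap.flatten(2).transpose(1, 2).squeeze(0)     # (196, 2048)
```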
Input samples are divided into positive and negative samples. A positive sample, labelled 1 in the binary classification, contains an image, a question, and the right answer. A negative sample, labelled 0, contains an image, a question, and a wrong answer. These samples are passed through CTI to get the joint representation $z$, which is then fed to a binary classifier to obtain the prediction. The Binary Cross Entropy loss is used for training the model.
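This setup can be sketched as follows, with a small binary classifier on top of the joint representation $z$ (produced, e.g., by the CTI sketch above); the classifier architecture here is an assumption for illustration, not the paper's exact head.

```python
import torch
import torch.nn as nn

class MCVQAHead(nn.Module):
    def __init__(self, dz=512):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(dz, dz), nn.ReLU(),
                                        nn.Linear(dz, 1))

    def forward(self, z):
        return self.classifier(z).squeeze(-1)   # one logit per (V, Q, A) triplet

head = MCVQAHead()
criterion = nn.BCEWithLogitsLoss()              # numerically stable sigmoid + BCE

# z: joint representations for a mini-batch of positive and negative triplets.
z = torch.randn(8, 512)
labels = torch.tensor([1., 0., 0., 0., 1., 0., 0., 0.])  # 1 = correct answer
loss = criterion(head(z), labels)
loss.backward()
```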
#For free-form open-ended VQA
Figure 2. The model when CTI is applied to FFOE VQA.
Unlike MC VQA, FFOE VQA treats answering as a classification problem over a set of predefined answers. Hence, the set of possible answers for each question-image pair is much larger than in MC VQA. For each question-image input, the model must take every possible answer from its answer list to compute the joint representation, which causes a high computational cost.
In addition, CTI requires all three inputs $V, Q, A$ to compute the joint representation. However, at test time, no answer information is available in FFOE VQA. To overcome these challenges, we propose to use Knowledge Distillation to transfer the learned knowledge from a teacher model to a student model.
The loss function for the student model is defined as:
$$\mathcal{L}_{KD} = \alpha T^2 \mathcal{L}_{CE}(Q^\tau_S, Q^\tau_T) + (1 - \alpha) \mathcal{L}_{CE}(Q_S, y_{true})$$

where $\mathcal{L}_{CE}$ stands for the Cross Entropy loss; $Q_S$ is the standard softmax output of the student; $y_{true}$ is the ground-truth answer label;
$\alpha$ is a hyper-parameter controlling the importance of each loss component; and $Q^\tau_S, Q^\tau_T$ are the softened outputs of the student and the teacher using the same temperature parameter $T$, computed as follows:

$$Q^\tau_i = \frac{\exp(l_i / T)}{\sum_j \exp(l_j / T)}$$

where, for both the teacher and the student models, the logit $l$ is the prediction output by the corresponding classifier.
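A minimal sketch of this distillation loss is given below, assuming teacher and student logits over the same answer vocabulary; the default values of `alpha` and `T` are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, true_labels, alpha=0.5, T=2.0):
    # Softened distributions Q_S^tau and Q_T^tau (softmax with temperature T).
    log_q_s_tau = F.log_softmax(student_logits / T, dim=1)
    q_t_tau = F.softmax(teacher_logits / T, dim=1)
    # Cross entropy between the softened outputs, scaled by T^2.
    soft_term = -(q_t_tau * log_q_s_tau).sum(dim=1).mean()
    # Standard cross entropy with the ground-truth answer labels.
    hard_term = F.cross_entropy(student_logits, true_labels)
    return alpha * (T ** 2) * soft_term + (1 - alpha) * hard_term

# Example: batch of 4 over a 3000-answer vocabulary.
s = torch.randn(4, 3000, requires_grad=True)
t = torch.randn(4, 3000)   # teacher logits, no gradient needed
y = torch.randint(0, 3000, (4,))
loss = kd_loss(s, t, y)
loss.backward()
```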
#Results
Table 1. Performance of CTI, BAN2, and SAN on the VQA-2.0 validation set and test-dev set. BAN2-CTI and SAN-CTI are student models trained under the teacher model.
To further evaluate the effectiveness of CTI, we conduct a detailed comparison with the current state of the art. For FFOE VQA, we compare CTI with the recent state-of-the-art methods on TDIUC and VQA-2.0 datasets. For MC VQA, we compare with the state-of-the-art methods on Visual7W dataset.
Table 2. Performance comparison between different approaches with different evaluation metrics on the TDIUC validation set. BAN2-CTI and SAN-CTI are the student models trained under the compact trilinear interaction teacher model.
Regarding FFOE VQA, Table 1 and Table 2 show comparative results on VQA-2.0 and TDIUC, respectively. Specifically, the distilled student BAN2-CTI outperforms all compared methods over all metrics by a large margin, e.g., it outperforms the current state-of-the-art QTA on TDIUC by 3.4% and 5.4% on the Ari and Har metrics, respectively (Table 2). The results confirm that the trilinear interaction has learned informative representations from the three inputs and that the learned information is effectively transferred to the student models by distillation.
Table 3. Performance comparison between different approaches on the Visual7W test set. Both the training set and the validation set are used for training. All models but CTIwBoxes are trained with the same image and question representations. Note that CTIwBoxes is the CTI model using Bottom-up features instead of grid features for the image representation.
Regarding MC VQA, Table 3 shows that CTI outperforms the compared methods by a noticeable margin, i.e., it outperforms the current state-of-the-art STL by 1.1%. Again, this validates the effectiveness of the proposed joint representation learning, which precisely and simultaneously learns interactions between the three inputs. We note that when comparing with other methods on Visual7W, we used grid features extracted from ResNet-152 for the image representations to ensure a fair comparison. The proposed model can achieve further improvements by using the object detection-based features used in FFOE VQA. With these features, the model, denoted CTIwBoxes in Table 3, achieves 72.3% accuracy under the Acc-MC metric, which improves over the current state-of-the-art STL by 4.1%.
#Conclusion
A novel compact trilinear interaction is introduced to simultaneously learn high-level associations between image, question, and answer in both MC VQA and FFOE VQA. In addition, knowledge distillation is applied to FFOE VQA for the first time to overcome the computational complexity and memory issues of the interaction. Extensive experimental results show that these models achieve state-of-the-art results on three benchmark datasets.
#Open Source
Github: https://github.com/aioz-ai/ICCV19_VQA-CTI