# Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering.

A novel VQA model that utilizes the question-type prior information to improve VQA by leveraging the multiple interactions between different joint modality methods.

Different approaches have been proposed to Visual Question Answering (VQA). However, few works are aware of the behaviors of varying joint modality methods over question type prior knowledge extracted from data in constraining answer search space, of which information gives a reliable cue to reason about answers for questions asked in input images. In this blog, we share a novel VQA model that utilizes the question-type prior information to improve VQA by leveraging the multiple interactions between different joint modality methods based on their behaviors in answering questions from different types. The solid experiments on two benchmark datasets, i.e., VQA 2.0 and TDIUC, indicate that the proposed method yields the best performance with the most competitive approaches.

## #Introduction

There are works that consider types of question as the side information whichgives a strong cue to reason about the answer. However, the relation between question types and answers from training data have not been investi-gated yet. Fig. 1 shows the correlation between question types and some answersin the VQA 2.0 dataset. It suggests that a question regarding the quantityshould be answered by a number, not a color. The observation indicated that theprior information got from the correlations between question types and answers open an answer search space constrain for the VQA model. The search spaceconstrain is useful for VQA model to give out final prediction and thus, improvethe overall performance. The Fig. 1 is consistent with our observation, e.g., itclearly suggests that a question regarding the quantity should be answered by anumber, not a color.

Figure 1. The distribution of candidate answers in each question type in VQA 2.0.
Although different joint modality methods or attention mechanisms have been proposed, we hypothesize that each method may capture different aspects of the input. That means different attentions may provide different answers for questions belonged to different question types.
Fig.2 shows examples in which the attention models (SAN and BAN) attend on different regions of input images when dealing with questions from different types. Unfortunately, most of recent VQA systems are based on single attention models BAN2, SAN, MLP, MCB, STL. From the above observation, it is necessary to develop a VQA system which leverages the power of different attention models to deal with questions from different question types.

Figure 2. Examples of attention maps of different attention mechanisms. BAN and SAN identify different visual areas when answering questions from different types.

## #Methodology

The proposed multiple interaction learning with question-type prior knowledge (MILQT) is illustrated in Fig. 3. Similar to the most of the VQA systems, multiple interaction learning with question-type prior knowledge (MILQT) consists of the joint learning solution for input questions and images, followed by a multi-class classification over a set of predefined candidate answers. However, MILQT allows to leverage multiple joint modality methods under the guiding of question-types to output better answers.

Figure 3. The introduced MILQT for VQA.
As in Fig.3, MILQT consists of two modules: Question-type awareness $\mathcal{A}$, and Multi-hypothesis interaction learning $\mathcal{M}$. The first module aims to learn the question-type representation, which is further used to enhance the joint visual-question embedding features and to constrain answer search space through prior knowledge extracted from data. Based on the question-type information, the second module aims to identify the behaviors of multiple joint learning methods and then justify adjust contributions to giving out final predictions.

Question representation. Given an input question, follow the recent state-of-the-art BAN, we trim the question to a maximum of 12 words. The questions that are shorter than 12 words are zero-padded. Each word is then represented by a 600-D vector that is a concatenation of the 300-D GloVe word embedding and the augmenting embedding from training data. This step results in a sequence of word embeddings with size of $12 \times 600$ and is denoted as $f_w$. In order to obtain the intent of question, the $f_w$ is passed through a Gated Recurrent Unit (GRU) which results in a 1024-D vector representation $f_q$ for the input question.

Image representation. We use bottom-up attention, i.e. an object detection which takes as FasterRCNN backbone, to extract image representation. At first, the input image is passed through bottom-up networks to get $K \times 2048$ bounding box representation which is denotes as $f_v$ in Fig. 3.

Multi-level multi-modal fusion. Unlike the previous works that perform only one level of fusion between linguistic and visual features that may limit the capacity of these models to learn a good joint semantic space. In our work, a multi-level multi-modal fusion that encourages the model to learn a better joint semantic space is introduced which takes the question-type representation got from question-type classification component as one of inputs.

• First level multi-modal fusion: The first level fusion is similar to previous works. Given visual features $f_v$, question features $f_{q}$, and any joint modality mechanism,
we combines visual features with question features and learn attention weights to weight for visual and/or linguistic features. Different attention mechanisms have different ways for learning the joint semantic space. The output of first level multi-modal fusion is denoted as $f_{att}$ in the Fig.3.
• Second level multi-modal fusion: In order to enhance the joint semantic space, the output of the first level multi-modal fusion $f_{att}$ is combined with the question-type feature $f_{qt}$, which is the output of the last FC layer of the Question-type classification component.
We try two simple but effective operators, i.e. element-wise multiplication --- EWM or element-wise addition --- EWA, to combine $f_{att}$ and $f_{qt}$. The output of the second level multi-modal fusion, which is denoted as $f_{att-qt}$ in Fig.3, can be seen as an attention representation that is aware of the question-type information.

Given an attention mechanism, the $f_{att-qt}$ will be used as the input for a classifier that predicts an answer for the corresponding question. This is shown at the Answer prediction boxes in the Fig.3.

Multi-hypothesis interaction learning As presented in Fig.3, MILQT allows to utilize multiple hypotheses (i.e., joint modality mechanisms). Specifically, we propose a multi-hypothesis interaction learning design $\mathcal{M}$ that takes answer predictions produced by different joint modality mechanisms and interactively learn to combine them.
Let $g \in \R^{A \times J}$ be the matrix of predicted probability distributions over $A$ answers from the $J$ joint modality mechanisms. $\mathcal{M}$ outputs the distribution $\rho \in \R^{A}$, which is calculated from $g$ through Equation below:

$\rho = \mathcal{M} \left(g,w_{mil}\right) = \sum_{j}\left(m^T_{qt-ans}w_{mil} \odot g\right)$

$w_ {mil} \in \textbf{R}^{P \times J}$ is the learnable weight which control the contributions of $J$ considered joint modality mechanisms on predicting answer based on the guiding of $P$ question types; $\odot$ denotes Hardamard product.

## #Results

Experiments on VQA 2.0 test-dev and test-standard.

We evaluate MILQTon the test-dev and test-standard of VQA 2.0 dataset. To train the model,similar to previous works, we use both training set and validationset of VQA 2.0. We also use the Visual Genome as additional training data. MILQT consists of three joint modality mechanisms, i.e., BAN-2, BAN-2-Counter, and SAN accompanied with the EWM for the multi-modal fusion, andthe predicted question type together with the prior information to augment theVQA loss. Table 4 presents the results of different methods on test-dev and test-std of VQA 2.0. The results show that our MILQT yields the good performance with the most competitive approaches.

Table 1. Comparison to the state of the arts on the test-dev and test-standard of VQA 2.0. For fair comparison, Glove embedding and GRU are leveraged for question embedding and Bottom-up features are used to extract visual information. CMP, i.e.Cross-Modality with Pooling, is the LXMERT with the aforementioned setup (Source).

Experiments on TDIUC.

In order to prove the stability of MILQT, we evaluate MILQT on TDIUC dataset.
The results in Table.2 show that the proposed model establishes the state-of-the-art results on both evaluation metrics Arithmetic MPT and Harmonic MPT. Specifically, our model significantly outperforms the recent QTA, i.e., on the overall, we improve over QTA $6.1\%$ and $11.1\%$ with Arithemic MPT and Harmonic MPT metrics, respectively. It is worth noting that the results of QTA in Table. 2, which are cited from QTA, are achieved when QTA used the one-hot predicted question type of testing question to weight visual features. When using the groundtruth question type to weight visual features, QTA reported $69.11\%$ and $60.08\%$ for Arithemic MPT and Harmonic MPT metrics, respectively. Our model also outperforms these performances a large margin, i.e., the improvements are $3.9\%$ and $6.8\%$ for Arithemic MPT and Harmonic MPT metrics, respectively.

Table 2. The comparative results between the proposed model and other models onthe validation set of TDIUC (Source).

## #Conclusion

We present a multiple interaction learning with question-type prior knowledge for constraining answer search space--- MILQT that takes into account the question-type information to improve the VQA performance at different stages. The system also allows to utilize and learn different attentions under a unified model in an interacting manner. The extensive experimental results show that all proposed components improve the VQA performance. We yields the best performance with the most competitive approaches on VQA 2.0 and TDIUC dataset.