Multiple Meta-model Quantifying for Medical Visual Question Answering
Motivation
A medical Visual Question Answering (VQA) system can provide meaningful references for both doctors and patients during the treatment process. Extracting image features is one of the most important steps in a medical VQA framework, as it provides the essential information used to predict answers. Transfer learning, in which deep learning models pretrained on a large-scale labeled dataset such as ImageNet are used to initialize the feature extractor, is a popular approach. However, due to the difference in visual concepts between ImageNet images and medical images, finetuning alone is not sufficient. Recently, Model Agnostic Meta-Learning (MAML) has been introduced to overcome this problem by learning meta-weights that quickly adapt to new visual concepts. However, MAML is heavily impacted by the meta-annotation phase, which must cover all images in the medical dataset. Different from natural images, transfer learning on medical images is more challenging due to:
- (i) noisy labels may occur when labeling images in an unsupervised manner;
- (ii) high-level semantic labels cause uncertainty during learning;
- (iii) difficulty in scaling up the process to all unlabeled images in medical datasets.
In this paper, we introduce a new Multiple Meta-model Quantifying (MMQ) process to address the aforementioned problems in MAML. Intuitively, MMQ is designed to:
- (i) effectively increase meta-data by auto-annotation;
- (ii) deal with the noisy labels in the training phase by leveraging the uncertainty of predicted scores during the meta-agnostic process;
- (iii) output meta-models which contain robust features for the down-stream medical VQA task. Note that, compared with the recent meta-learning approach for medical VQA, our proposed MMQ does not take advantage of additional out-of-dataset images, while achieving superior accuracy on two challenging medical VQA datasets.
Methodology
Method overview
Our approach comprises two parts: our proposed multiple meta-model quantifying (MMQ, Figure 1) and a VQA framework that integrates the meta-models output by MMQ (Figure 2). MMQ addresses the meta-annotation problem by outputting multiple meta-models. These models are expected to be robust to each other and to have high accuracy during the inference phase of model-agnostic tasks. The VQA framework then leverages the different features extracted from the candidate meta-models to generate predicted answers.
Multiple meta-model quantifying
Multiple meta-model quantifying (Figure 1) contains three modules:
- (i) Meta-training, which trains a specific meta-model for extracting image features used in the medical VQA task by following MAML (a minimal sketch is given right after this list);
- (ii) Data refinement, which increases the training data by auto-annotation and deals with noisy labels by leveraging the uncertainty of predicted scores;
- (iii) Meta-quantifying, which selects meta-models that are robust to each other and achieve high accuracy during the inference phase of model-agnostic tasks.
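For concreteness, the following is a minimal sketch of a MAML-style meta-training loop for module (i). The tiny CNN, synthetic tasks, and hyper-parameters are illustrative placeholders only, not the actual feature extractor or meta-annotated medical data used in our framework.

```python
# Minimal MAML-style meta-training sketch (module (i)). The tiny CNN and the
# randomly generated "tasks" are illustrative placeholders, not the actual
# feature extractor or meta-annotated medical data.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_model(num_classes=4):
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(32, num_classes),
    )


def forward_with(model, params, x):
    # Functional forward pass with an explicit set of (possibly adapted) weights.
    return torch.func.functional_call(model, params, (x,))


def maml_step(model, tasks, meta_opt, inner_lr=0.01):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for support_x, support_y, query_x, query_y in tasks:
        params = dict(model.named_parameters())
        # Inner loop: one gradient step on the task's support set.
        inner_loss = F.cross_entropy(forward_with(model, params, support_x), support_y)
        grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
        adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        # Outer objective: loss of the adapted weights on the task's query set.
        meta_loss = meta_loss + F.cross_entropy(forward_with(model, adapted, query_x), query_y)
    meta_loss = meta_loss / len(tasks)
    meta_loss.backward()
    meta_opt.step()
    return meta_loss.item()


if __name__ == "__main__":
    model = make_model()
    meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Synthetic 4-way tasks: (support_x, support_y, query_x, query_y).
    tasks = [(torch.randn(8, 1, 32, 32), torch.randint(0, 4, (8,)),
              torch.randn(8, 1, 32, 32), torch.randint(0, 4, (8,)))
             for _ in range(4)]
    for _ in range(3):
        print("meta-loss:", maml_step(model, tasks, meta_opt))
```

Each outer step adapts the shared weights to every sampled task with one inner gradient step and then updates the meta-weights from the query losses of the adapted models. Rerunning meta-training after each data-refinement round yields the pool of meta-models that the meta-quantifying module later selects from.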
Unlike MAML where only one meta-model is selected, we develop the following refinement and meta-quantifying steps to select high-quality meta-models for transfer learning to the medical VQA framework later.
Data refinement. After finishing the meta-training phase, the weights of the meta-models are used to refine the dataset. This module aims to expand the meta-data pool for meta-training and to remove samples that are expected to be hard to learn or to have noisy labels (see Algorithm 1 for more details).
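As a hedged illustration of this idea (not the exact Algorithm 1), a refinement pass could pseudo-label unlabeled images with a trained meta-model and keep only confident predictions; the threshold and helper names below are hypothetical.

```python
# Hedged sketch of the data-refinement step (module (ii)); not the paper's
# exact Algorithm 1. A trained meta-model auto-annotates unlabeled images,
# and low-confidence predictions are discarded as potentially noisy labels.
import torch
import torch.nn.functional as F


@torch.no_grad()
def refine_dataset(meta_model, unlabeled_images, keep_threshold=0.9):
    """Return (image, pseudo_label) pairs the meta-model is confident about.

    `keep_threshold` is an illustrative cut-off on the top softmax score."""
    meta_model.eval()
    refined = []
    for image in unlabeled_images:
        probs = F.softmax(meta_model(image.unsqueeze(0)), dim=1).squeeze(0)
        score, pseudo_label = probs.max(dim=0)
        # Uncertain predictions are treated as hard-to-learn / noisy samples
        # and are excluded from the expanded meta-data pool.
        if score.item() >= keep_threshold:
            refined.append((image, pseudo_label.item()))
    return refined
```

The refined pairs can then be merged into the meta-data pool before the next meta-training round.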
Meta-quantifying. This module aims to identify candidate meta-models that are useful for the medical VQA task. A candidate model should achieve high performance during the validation process, and its features should be distinct from those of the other candidate models.
To achieve these goals, we design a fuse score $S_F$ that combines the predicted score with a diverse score $S_D$:

$$S_F = \gamma\, S_P + (1 - \gamma)\, S_D, \qquad S_D = \sum_{i=1}^{m} \big(1 - \mathrm{Cosine}(F_c, F_i)\big)$$

where $S_P$ is the predicted score of the current meta-model over the ground-truth label; $F_c$ is the feature extracted from the aforementioned meta-model that needs to compute the score; $F_i$ is the feature extracted from the $i$-th model of the list of $m$ meta-models; Cosine is used for similarity checking between two features; and $\gamma$ balances the two terms.
Since the predicted score $S_P$ at the ground-truth label and the diverse score $S_D$ are co-variables, the fuse score $S_F$ is covariant with both of them. This means that the larger $S_F$ is, the higher the chance that the corresponding model is selected for the VQA task. Algorithm 2 describes our meta-quantifying algorithm in detail.
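A direct translation of this score into code might look as follows, assuming `gamma` is the balancing coefficient in the equation above and that features are plain 1-D tensors.

```python
# Sketch of the fuse score: a candidate meta-model is rewarded for a high
# predicted score on the ground-truth label and for features that are
# dissimilar to those of the other candidate meta-models.
import torch
import torch.nn.functional as F


def fuse_score(pred_score, feat, other_feats, gamma=0.5):
    """pred_score: scalar tensor S_P; feat: (D,) feature F_c of the current
    model; other_feats: iterable of (D,) features F_i from the other models.
    The default gamma is illustrative only."""
    diverse = sum(1.0 - F.cosine_similarity(feat, f, dim=0) for f in other_feats)
    return gamma * pred_score + (1.0 - gamma) * diverse
```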
Integrating quantified meta-models into the medical VQA framework
To leverage the robust features extracted from the quantified meta-models, we introduce a VQA framework as in Figure 2. Specifically, each input question is trimmed to a fixed-length word sequence and zero-padded if it is shorter than that length. Each word is represented by a 300-D GloVe word embedding, and the resulting sequence is fed into a 1024-D LSTM to produce the question embedding.
Each input image is passed through the quantified meta-models obtained from the meta-quantifying module, each of which produces a feature vector. These vectors are concatenated to form an enhanced image feature (Figure 2). Since this feature combines representations from several high-performing meta-models with different views of the data, the VQA framework is expected to be less affected by bias. The image feature and the question embedding are fed into an attention mechanism (BAN or SAN) to produce a joint representation, which is used as input to a multi-class classifier over the set of predefined answer classes. To train the proposed model, we use a cross-entropy loss for the answer classification task. The whole VQA framework is then fine-tuned in an end-to-end manner.
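A minimal sketch of this framework, under the assumption that the quantified meta-models are available as feature extractors, is shown below. A simple projection with element-wise fusion stands in for the BAN/SAN attention used in our actual framework, and all names and dimensions besides the 300-D word embedding and 1024-D LSTM are illustrative.

```python
# Minimal sketch of the VQA framework in Figure 2. The meta-model feature
# extractors are assumed given; a simple element-wise fusion stands in for
# the BAN/SAN attention, and all sizes are illustrative.
import torch
import torch.nn as nn


class MMQVQA(nn.Module):
    def __init__(self, meta_models, feat_dim, vocab_size, num_answers,
                 word_dim=300, q_dim=1024):
        super().__init__()
        self.meta_models = nn.ModuleList(meta_models)       # quantified meta-models
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # GloVe vectors would be loaded here
        self.lstm = nn.LSTM(word_dim, q_dim, batch_first=True)
        self.img_proj = nn.Linear(feat_dim * len(meta_models), q_dim)
        self.classifier = nn.Sequential(
            nn.Linear(q_dim, q_dim), nn.ReLU(),
            nn.Linear(q_dim, num_answers),
        )

    def forward(self, image, question_tokens):
        # Enhanced image feature: concatenation of all meta-model features.
        img_feat = torch.cat([m(image) for m in self.meta_models], dim=1)
        # Question embedding: word embeddings fed through the 1024-D LSTM.
        _, (h, _) = self.lstm(self.word_emb(question_tokens))
        q_feat = h[-1]
        # Stand-in for the BAN/SAN attention: project and fuse element-wise.
        joint = self.img_proj(img_feat) * q_feat
        return self.classifier(joint)


# Training uses a standard cross-entropy loss over the answer classes, e.g.:
# loss = nn.CrossEntropyLoss()(model(image, question_tokens), answer_idx)
```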
Experimental Results
State-of-the-art
Table 1 presents comparative results between different methods. The results show that our MMQ outperforms other meta-learning methods by a large margin, and the gain in performance is stable across different attention mechanisms (BAN or SAN) in the VQA task. It is worth noting that, compared with the most recent state-of-the-art method MEVF, we achieve higher accuracy on free-form questions of the PathVQA dataset and on open-ended questions of the VQA-RAD dataset. Moreover, no out-of-dataset images are used in MMQ for learning meta-models. These results imply that our proposed MMQ learns essential representative information from the input images and effectively leverages the features from the meta-models to deal with challenging questions in medical VQA datasets.
Ablation Study
Table 2 presents our MMQ accuracy on the PathVQA dataset when varying the number of data-refinement steps and the number of quantified meta-models. The results show that, by using only one quantified meta-model output by our MMQ, we already significantly outperform both the MAML and MEVF baselines. This confirms the effectiveness of the proposed MMQ for dealing with the limitations of meta-annotation in medical VQA, i.e., noisy labels and scalability. Besides, leveraging more quantified meta-models further improves the overall performance.
We note that the improvements of our MMQ are more significant on free-form questions than on yes/no questions. This observation implies that free-form questions/answers, which are more challenging and require more information from the input images, benefit more from our proposed method.
Table 2 also shows that increasing the number of refinement steps and the number of quantified meta-models improves the overall result, but the gain becomes smaller after each additional loop. The training time also increases when more meta-models are used, although our testing time and total number of parameters are only slightly higher than those of MAML and MEVF. Based on the empirical results, we recommend keeping the number of refinement steps and quantified meta-models moderate to balance the trade-off between accuracy and computational cost.
Conclusion
In this paper, we proposed a new multiple meta-model quantifying method to effectively leverage meta-annotation and deal with noisy labels in the medical VQA task. Extensive experimental results show that our proposed method outperforms recent state-of-the-art meta-learning based methods by a large margin on both the PathVQA and VQA-RAD datasets. Our implementation and trained models will be released for reproducibility.
Open Source
🐱 Github: https://github.com/aioz-ai/MICCAI21_MMQ