
General Overview:

VQA research can be divided into two sub-domains, according to what its models and datasets focus on:

  1. Image understanding (involving datasets with natural images)
  2. Conceptual reasoning (involving datasets with synthetic images)

Natural and synthetic datasets therefore serve complementary purposes.

Hence, the authors argue that VQA algorithms need to be tested on both kinds of dataset. However, the literature often focuses on attaining the state of the art (SOTA) on one particular dataset, and so tackles only one of these problems.

This paper therefore not only measures the performance of the then-SOTA algorithms on both sub-domains, but also proposes a new VQA algorithm that can handle both problems at once and reaches SOTA performance.
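
The idea of scoring one model on both kinds of dataset can be sketched as below. This is a minimal illustration, not the paper's evaluation code: the `Sample` layout, the `always_yes` model and the toy datasets are all hypothetical, and exact-match accuracy is used for simplicity (the official VQA metric also gives partial credit for answers that only some annotators gave).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sample layout: (image_id, question, ground_truth_answer).
Sample = Tuple[str, str, str]
# Hypothetical model interface: takes (image_id, question), returns an answer string.
Model = Callable[[str, str], str]


def accuracy(model: Model, samples: List[Sample]) -> float:
    """Fraction of samples whose predicted answer exactly matches the ground truth."""
    if not samples:
        return 0.0
    correct = sum(
        1 for image_id, question, answer in samples
        if model(image_id, question) == answer
    )
    return correct / len(samples)


def evaluate_on_all(model: Model, datasets: Dict[str, List[Sample]]) -> Dict[str, float]:
    """Score the same model on every dataset and report each score separately."""
    return {name: accuracy(model, samples) for name, samples in datasets.items()}


def always_yes(image_id: str, question: str) -> str:
    """Trivial stand-in model that ignores its inputs (assumption for the demo)."""
    return "yes"


# Made-up toy data standing in for a natural and a synthetic dataset.
toy_datasets = {
    "natural": [
        ("img1", "Is there a dog?", "yes"),
        ("img2", "What color is the car?", "red"),
    ],
    "synthetic": [
        ("img3", "How many cubes are left of the sphere?", "2"),
        ("img4", "Is the cylinder metallic?", "yes"),
    ],
}

scores = evaluate_on_all(always_yes, toy_datasets)
print(scores)
```

Reporting the scores per dataset, rather than as a single average, keeps a strong result on one sub-domain from hiding a weak result on the other.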


Detailed Description:

1. Comparing different (SOTA) VQA models:

Why different models need to be compared on different datasets:

  1. Non-exhaustive testing:

    Many VQA algorithms that claim specific abilities are never tested on the datasets designed to test those abilities.

  2. Biases within existing datasets:

    Datasets have innate biases. With crowdsourced questions, for example, humans are more likely to ask certain kinds of questions than others.

  3. Two types of datasets:

    As discussed above, there are two types of datasets.

    1. Natural datasets: Datasets such as VQAv1 and VQAv2, which contain natural images and crowdsourced questions and answers.

      Example of an image-question pair from VQAv2 (source: this paper)

    2. Synthetic datasets: Datasets such as CLEVR and SHAPES, which contain scenes of simple geometric shapes designed to test the model's complex reasoning capabilities.

      Example of an image-question pair from CLEVR (source: this paper)

While synthetic data is useful for probing a model's reasoning capability, its ability to understand complex scenes is better tested on a natural dataset.
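
The dataset bias mentioned in point 2 above can be made concrete with a "blind" baseline that never looks at the image. The sketch below is one common way to probe such bias and is not taken from the paper: the helper names, the two-word question-type heuristic and the toy question-answer pairs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (question, answer) pairs; the image is deliberately ignored.
QAPair = Tuple[str, str]


def question_type(question: str, n_words: int = 2) -> str:
    """Crude question type: the first few lowercased words (an assumed heuristic)."""
    return " ".join(question.lower().split()[:n_words])


def fit_prior(train: List[QAPair]) -> Dict[str, str]:
    """Map each question type to its most frequent training answer."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for question, answer in train:
        counts[question_type(question)][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}


def prior_accuracy(prior: Dict[str, str], test: List[QAPair], fallback: str = "yes") -> float:
    """Exact-match accuracy of answering from the question-type prior alone."""
    correct = sum(
        1 for question, answer in test
        if prior.get(question_type(question), fallback) == answer
    )
    return correct / len(test)


# Made-up toy data; real datasets are far larger.
train = [
    ("What sport is being played?", "tennis"),
    ("What sport is this?", "tennis"),
    ("What sport are they playing?", "baseball"),
    ("Is there a dog?", "yes"),
    ("Is there a cat?", "yes"),
    ("Is there a clock?", "no"),
]
test = [
    ("What sport is shown?", "tennis"),
    ("Is there a bench?", "no"),
]

prior = fit_prior(train)
print(prior)
print(prior_accuracy(prior, test))
```

The higher this image-free baseline scores on a dataset, the more of that dataset can be answered from question statistics alone, which is the kind of bias that makes single-dataset results hard to interpret.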

Brief overview of datasets used: