Towards Fine-Tuning of VQA Models in Public Datasets


November 19, 2020


Workshop of Physical Agents (WAF)


Miguel E. Ortiz
Luis M. Bergasa
Roberto Arroyo
Sergio Alvarez Pardo
Aitor Aller


This paper studies the Visual Question Answering (VQA) topic, which combines Computer Vision (CV), Natural Language Processing (NLP) and Knowledge Representation & Reasoning (KR&R) in order to automatically provide natural language answers to questions asked by users about images. A review of the state of the art for this technology is initially carried out. Among the different approaches, we select the model known as Pythia to build upon, because it is one of the most popular and successful methods in the public VQA Challenge. Recently, Facebook AI Research (FAIR) carried out an exhaustive refactoring of the Pythia code. We choose to use this updated framework after confirming that the two implementations have analogous characteristics. We introduce the different modules of the FAIR implementation and explain how to train our model, proposing some improvements over the baseline. Different fine-tuned models are trained, obtaining an accuracy of 66.22% in the best case on the test set of the public VQA-v2 dataset. A comparison of the quantitative results for the most important experiments is discussed jointly with some qualitative results. This experimentation is performed with the aim of eventually applying VQA to eCommerce and store observation use cases in further research.