by Drew A. Hudson, Christopher D. Manning
License: Research Only
Scene understanding remains a key challenge in computer vision. Visual question answering (VQA) is not only an effective way to measure it but also an intriguing multimodal inference problem in its own right. Despite significant attention to this task in recent years, there is a widespread feeling that current benchmarks fail to provide an accurate indication of visual understanding capacity. Not only are they severely biased, they also lack semantic compositionality and do not provide any tools or measures to gain significant insight into models' performance and behavior. We design a new dataset, GQA, to address these shortcomings, featuring compositional questions over real-world images. We leverage semantic representations of both the scenes and questions to mitigate language priors and conditional biases and enable fine-grained diagnosis for different question types. The dataset consists of 22M questions about various day-to-day images. Each image is associated with a scene graph of the image's objects, attributes and relations, a new, cleaner version based on Visual Genome. Each question is associated with a structured representation of its semantics, a functional program that specifies the reasoning steps that have to be taken to answer it. Many of the GQA questions involve multiple reasoning skills, spatial understanding and multi-step inference, and thus are generally more challenging than previous visual question answering datasets used in the community. We also made sure to balance the dataset, tightly controlling the answer distribution for different groups of questions, in order to prevent educated guesses using language and world priors. Finally, we complement the dataset with a suite of new metrics to test not only the accuracy, but also the consistency, validity and plausibility of models' responses, shedding much more light on their behavior.
While the questions were auto-generated, they are based on the natural-language crowdsourced scene graphs and thus are grammatical, diverse and idiomatic.
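To make the pairing of a scene graph with a functional program more concrete, here is a minimal Python sketch. The schema, the operation names (`select`, `relate`, `filter`, `query_name`) and the toy data are all hypothetical and chosen for illustration only; they are not the dataset's actual file format or operation vocabulary.

```python
# Toy scene graph (hypothetical schema): each object has a name,
# a list of attributes, and outgoing (relation, target_id) pairs.
scene_graph = {
    "o1": {"name": "apple", "attributes": ["red"], "relations": [("on", "o2")]},
    "o2": {"name": "table", "attributes": ["wooden"], "relations": []},
    "o3": {"name": "cup", "attributes": ["white"], "relations": [("on", "o2")]},
}


def select(graph, name):
    """Return the ids of all objects with the given name."""
    return [oid for oid, obj in graph.items() if obj["name"] == name]


def relate(graph, ids, relation):
    """Return the ids of objects that stand in `relation` to any object in `ids`."""
    targets = set(ids)
    return [
        oid
        for oid, obj in graph.items()
        if any(rel == relation and tgt in targets for rel, tgt in obj["relations"])
    ]


def filter_attr(graph, ids, attribute):
    """Keep only the objects in `ids` that carry the given attribute."""
    return [oid for oid in ids if attribute in graph[oid]["attributes"]]


def query_name(graph, ids):
    """Return the name of the first remaining object, or None if none is left."""
    return graph[ids[0]]["name"] if ids else None


def run(graph, program):
    """Execute a list of (operation, *args) steps, threading the result through."""
    state = None
    for op, *args in program:
        if op == "select":
            state = select(graph, *args)
        elif op == "relate":
            state = relate(graph, state, *args)
        elif op == "filter":
            state = filter_attr(graph, state, *args)
        elif op == "query_name":
            state = query_name(graph, state)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


# Hypothetical program for "What is the red thing on the table?"
program = [("select", "table"), ("relate", "on"), ("filter", "red"), ("query_name",)]
answer = run(scene_graph, program)
print(answer)
```

Each step consumes the output of the previous one, which is what makes multi-step questions traceable: a wrong answer can be attributed to a specific reasoning step rather than to the question as a whole.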
Categories: Objects In Image, Questions