Paper out!
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena has been accepted for publication at ACL 2022. We investigate to what extent existing pretrained vision-and-language models ground text in vision and vice versa, using counterfactuals and focusing on fine-grained linguistic phenomena! (code and data)