publications
The list below only includes papers since 2022 and may not be up to date. Please refer to Google Scholar for a complete list.
2024
- Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
  Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, and Erkut Erdem
  arXiv e-prints, Jul 2024
@article{2024arXiv240712498D, author = {{Dogan}, Mustafa and {Kesen}, Ilker and {Calixto}, Iacer and {Erdem}, Aykut and {Erdem}, Erkut}, title = {{Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning}}, journal = {arXiv e-prints}, keywords = {Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition}, year = {2024}, month = jul, eid = {arXiv:2407.12498}, pages = {arXiv:2407.12498}, doi = {10.48550/arXiv.2407.12498}, archiveprefix = {arXiv}, eprint = {2407.12498}, primaryclass = {cs.CL}, adsurl = {https://ui.adsabs.harvard.edu/abs/2024arXiv240712498D}, adsnote = {Provided by the SAO/NASA Astrophysics Data System}, }
- Topic evolution before fall incidents in new fallers through natural language processing of general practitioners’ clinical notes
  Noman Dormosh, Ameen Abu-Hanna, Iacer Calixto, Martijn C Schut, Martijn W Heymans, and Nathalie van der Velde
  Age and Ageing, Feb 2024
Falls involve dynamic risk factors that change over time, but most studies on fall-risk factors are cross-sectional and do not capture this temporal aspect. The longitudinal clinical notes within electronic health records (EHR) provide an opportunity to analyse fall risk factor trajectories through Natural Language Processing techniques, specifically dynamic topic modelling (DTM). This study aims to uncover fall-related topics for new fallers and track their evolving trends leading up to falls. This case–cohort study utilised primary care EHR data covering information on older adults between 2016 and 2019. Cases were individuals who fell in 2019 but had no falls in the preceding three years (2016–18). The control group was randomly sampled individuals, with similar size to the cases group, who did not endure falls during the whole study follow-up period. We applied DTM on the clinical notes collected between 2016 and 2018. We compared the trend lines of the case and control groups using the slopes, which indicate direction and steepness of the change over time. A total of 2,384 fallers (cases) and an equal number of controls were included. We identified 25 topics that showed significant differences in trends between the case and control groups. Topics such as medications, renal care, family caregivers, hospital admission/discharge and referral/streamlining diagnostic pathways exhibited a consistent increase in steepness over time within the cases group before the occurrence of falls. Early recognition of health conditions demanding care is crucial for applying proactive and comprehensive multifactorial assessments that address underlying causes, ultimately reducing falls and fall-related injuries.
@article{10.1093/ageing/afae016, author = {Dormosh, Noman and Abu-Hanna, Ameen and Calixto, Iacer and Schut, Martijn C and Heymans, Martijn W and van der Velde, Nathalie}, title = {{Topic evolution before fall incidents in new fallers through natural language processing of general practitioners’ clinical notes}}, journal = {Age and Ageing}, volume = {53}, number = {2}, pages = {afae016}, year = {2024}, month = feb, issn = {1468-2834}, doi = {10.1093/ageing/afae016}, url = {https://doi.org/10.1093/ageing/afae016}, eprint = {https://academic.oup.com/ageing/article-pdf/53/2/afae016/56669437/afae016.pdf}, }
- ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
  Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, and 3 more authors
  In The Twelfth International Conference on Learning Representations, Feb 2024
@inproceedings{kesen-etal-2024vilma, title = {Vi{LMA}: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models}, author = {Kesen, Ilker and Pedrotti, Andrea and Dogan, Mustafa and Cafagna, Michele and Acikgoz, Emre Can and Parcalabescu, Letitia and Calixto, Iacer and Frank, Anette and Gatt, Albert and Erdem, Aykut and Erdem, Erkut}, booktitle = {The Twelfth International Conference on Learning Representations}, year = {2024}, url = {https://openreview.net/forum?id=liuqDwmbQJ}, }
2023
- LLM aided semi-supervision for efficient Extractive Dialog Summarization
  Nishant Mishra, Gaurav Sahu, Iacer Calixto, Ameen Abu-Hanna, and Issam Laradji
  In Findings of the Association for Computational Linguistics: EMNLP 2023, Dec 2023
Generating high-quality summaries for chat dialogs often requires large labeled datasets. We propose a method to efficiently use unlabeled data for extractive summarization of customer-agent dialogs. In our method, we frame summarization as a question-answering problem and use state-of-the-art large language models (LLMs) to generate pseudo-labels for a dialog. We then use these pseudo-labels to fine-tune a chat summarization model, effectively transferring knowledge from the large LLM into a smaller specialized model. We demonstrate our method on the TWEETSUMM dataset, and show that using 10% of the original labelled data set we can achieve 65.9/57.0/61.0 ROUGE-1/-2/-L, whereas the current state-of-the-art trained on the entire training data set obtains 65.16/55.81/64.37 ROUGE-1/-2/-L. In other words, in the worst case (i.e., ROUGE-L) we still effectively retain 94.7% of the performance while using only 10% of the data.
@inproceedings{mishra-etal-2023-llm, title = {{LLM} aided semi-supervision for efficient Extractive Dialog Summarization}, author = {Mishra, Nishant and Sahu, Gaurav and Calixto, Iacer and Abu-Hanna, Ameen and Laradji, Issam}, editor = {Bouamor, Houda and Pino, Juan and Bali, Kalika}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023}, month = dec, year = {2023}, address = {Singapore}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.findings-emnlp.670}, doi = {10.18653/v1/2023.findings-emnlp.670}, pages = {10002--10009}, }
- Leveraging Multi-Word Concepts to Predict Acute Kidney Injury in Intensive Care
  Lorenzo Brancato, Iacer Calixto, Ameen Abu-Hanna, and Iacopo Vagliano
  Stud Health Technol Inform, Jun 2023
  Best paper award
Acute kidney injury (AKI) is an abrupt decrease in kidney function widespread in intensive care. Many AKI prediction models have been proposed, but only few exploit clinical notes and medical terminologies. Previously, we developed and internally validated a model to predict AKI using clinical notes enriched with single-word concepts from medical knowledge graphs. However, an analysis of the impact of using multi-word concepts is lacking. In this study, we compare the use of only the clinical notes as input to prediction to the use of clinical notes retrofitted with both single-word and multi-word concepts. Our results show that 1) retrofitting single-word concepts improved word representations and improved the performance of the prediction model; 2) retrofitting multi-word concepts further improves both results, albeit slightly. Although the improvement with multi-word concepts was small, due to the small number of multi-word concepts that could be annotated, multi-word concepts have proven to be beneficial.
@article{Brancato2023-le, title = {Leveraging {Multi-Word} Concepts to Predict Acute Kidney Injury in Intensive Care}, author = {Brancato, Lorenzo and Calixto, Iacer and Abu-Hanna, Ameen and Vagliano, Iacopo}, journal = {Stud Health Technol Inform}, volume = {305}, pages = {10--13}, month = jun, year = {2023}, address = {Netherlands}, keywords = {Clinical Prediction; Knowledge Graphs; Natural Language Processing}, language = {en}, note = {Best paper award} }
- Video-and-Language (VidL) models and their cognitive relevance
  Anne Zonneveld, Albert Gatt, and Iacer Calixto
  In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2023
In this paper we give a narrative review of multi-modal video-language (VidL) models. We introduce the current landscape of VidL models and benchmarks, and draw inspiration from neuroscience and cognitive science to propose avenues for future research in VidL models in particular and artificial intelligence (AI) in general. We argue that iterative feedback loops between AI, neuroscience, and cognitive science are essential to spur progress across these disciplines. We motivate why we focus specifically on VidL models and their benchmarks as a promising type of model to bring improvements in AI and categorise current VidL efforts across multiple ‘cognitive relevance axioms’. Finally, we provide suggestions on how to effectively incorporate this interdisciplinary viewpoint into research on VidL models in particular and AI in general. In doing so, we hope to create awareness of the potential of VidL models to narrow the gap between neuroscience, cognitive science, and AI.
@inproceedings{Zonneveld_2023_ICCV, author = {Zonneveld, Anne and Gatt, Albert and Calixto, Iacer}, title = {Video-and-Language (VidL) models and their cognitive relevance}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops}, month = oct, year = {2023}, pages = {325--338}, }
- Drug-related causes attributed to acute kidney injury and their documentation in intensive care patients
  Rachel M. Murphy, Dave A. Dongelmans, Izak Yasrebi-de Kom, Iacer Calixto, Ameen Abu-Hanna, Kitty J. Jager, Nicolette F. de Keizer, and Joanna E. Klopotowska
  Journal of Critical Care, Oct 2023
Purpose: To investigate drug-related causes attributed to acute kidney injury (DAKI) and their documentation in patients admitted to the Intensive Care Unit (ICU). Methods: This study was conducted in an academic hospital in the Netherlands by reusing electronic health record (EHR) data of adult ICU admissions between November 2015 and January 2020. First, ICU admissions with acute kidney injury (AKI) stage 2 or 3 were identified. Subsequently, three modes of DAKI documentation in the EHR were examined: diagnosis codes (structured data), the allergy module (semi-structured data), and clinical notes (unstructured data). Results: In total, 8124 ICU admissions were included, with 542 (6.7%) ICU admissions experiencing AKI stage 2 or 3. The ICU physicians deemed 102 of these AKI cases (18.8%) to be drug-related. These DAKI cases were all documented in the clinical notes (100%), one in the allergy module (1%), and none via diagnosis codes. The clinical notes required the highest time investment to analyze. Conclusions: Drug-related causes comprise a substantial part of AKI in ICU patients. However, the current unstructured DAKI documentation practice via clinical notes hampers our ability to gain better insights into DAKI occurrence. Therefore, both automating DAKI identification from the clinical notes and increasing structured DAKI documentation should be encouraged.
@article{MURPHY2023154292, title = {Drug-related causes attributed to acute kidney injury and their documentation in intensive care patients}, journal = {Journal of Critical Care}, volume = {75}, pages = {154292}, year = {2023}, issn = {0883-9441}, doi = {10.1016/j.jcrc.2023.154292}, url = {https://www.sciencedirect.com/science/article/pii/S0883944123000412}, author = {Murphy, Rachel M. and Dongelmans, Dave A. and Kom, Izak Yasrebi-de and Calixto, Iacer and Abu-Hanna, Ameen and Jager, Kitty J. and {de Keizer}, Nicolette F. and Klopotowska, Joanna E.}, keywords = {Electronic health records, Acute kidney injury, Nephrotoxicity, Phenotype algorithm, Adverse drug event, Automated identification}, }
- Soft-Prompt Tuning to Predict Lung Cancer Using Primary Care Free-Text Dutch Medical Notes
  Auke Elfrink, Iacopo Vagliano, Ameen Abu-Hanna, and Iacer Calixto
  In Artificial Intelligence in Medicine, Oct 2023
We examine the use of large Transformer-based pretrained language models (PLMs) for the problem of early prediction of lung cancer using free-text patient medical notes of Dutch primary care physicians. Specifically, we investigate: 1) how soft prompt-tuning compares to standard model fine-tuning; 2) whether simpler static word embedding models (WEMs) can be more robust compared to PLMs in highly imbalanced settings; and 3) how models fare when trained on notes from a small number of patients. All our code is available open source in https://bitbucket.org/aumc-kik/prompt_tuning_cancer_prediction/.
@inproceedings{10.1007/978-3-031-34344-5_23, author = {Elfrink, Auke and Vagliano, Iacopo and Abu-Hanna, Ameen and Calixto, Iacer}, editor = {Juarez, Jose M. and Marcos, Mar and Stiglic, Gregor and Tucker, Allan}, title = {Soft-Prompt Tuning to Predict Lung Cancer Using Primary Care Free-Text Dutch Medical Notes}, booktitle = {Artificial Intelligence in Medicine}, year = {2023}, publisher = {Springer Nature Switzerland}, address = {Cham}, pages = {193--198}, isbn = {978-3-031-34344-5}, }
- SemEval-2023 Task 1: Visual Word Sense Disambiguation
  Alessandro Raganato, Iacer Calixto, Asahi Ushio, Jose Camacho-Collados, and Mohammad Taher Pilehvar
  In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Jul 2023
This paper presents the Visual Word Sense Disambiguation (Visual-WSD) task. The objective of Visual-WSD is to identify, among a set of ten images, the one that corresponds to the intended meaning of a given ambiguous word which is accompanied with minimal context. The task provides datasets for three different languages: English, Italian, and Farsi. We received a total of 96 different submissions. Out of these, 40 systems outperformed a strong zero-shot CLIP-based baseline. Participating systems proposed different zero- and few-shot approaches, often involving generative models and data augmentation. More information can be found on the task’s website: https://raganato.github.io/vwsd/.
@inproceedings{raganato-etal-2023-semeval, title = {{S}em{E}val-2023 Task 1: Visual Word Sense Disambiguation}, author = {Raganato, Alessandro and Calixto, Iacer and Ushio, Asahi and Camacho-Collados, Jose and Pilehvar, Mohammad Taher}, booktitle = {Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)}, month = jul, year = {2023}, address = {Toronto, Canada}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2023.semeval-1.308}, doi = {10.18653/v1/2023.semeval-1.308}, pages = {2227--2234}, }
- Fixing confirmation bias in feature attribution methods via semantic match
  Giovanni Cinà, Daniel Fernandez-Llaneza, Nishant Mishra, Tabea E Röber, Sandro Pezzelle, Iacer Calixto, Rob Goedhart, and Ş İlker Birbil
  arXiv preprint arXiv:2307.00897, Jul 2023
Feature attribution methods have become a staple method to disentangle the complex behavior of black box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model’s internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses on the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable (e.g., focusing on an object relevant for prediction) and undesirable model behaviors (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis on the metrics to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.
@article{cina2023fixing, title = {Fixing confirmation bias in feature attribution methods via semantic match}, author = {Cin{\`a}, Giovanni and Fernandez-Llaneza, Daniel and Mishra, Nishant and R{\"o}ber, Tabea E and Pezzelle, Sandro and Calixto, Iacer and Goedhart, Rob and Birbil, {\c{S}} {\.I}lker}, journal = {arXiv preprint arXiv:2307.00897}, year = {2023}, }
2022
- Multi3Generation: Multitask, Multilingual, Multimodal Language Generation
  Anabela Barreiro, José GC de Souza, Albert Gatt, Mehul Bhatt, Elena Lloret, Aykut Erdem, Dimitra Gkatzia, Helena Moniz, and 7 more authors
  In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, Jun 2022
This paper presents the Multitask, Multilingual, Multimodal Language Generation COST Action – Multi3Generation (CA18231), an interdisciplinary network of research groups working on different aspects of language generation. This “meta-paper” will serve as reference for citations of the Action in future publications. It presents the objectives, challenges, and links to the achieved outcomes.
@inproceedings{barreiro-etal-2022-multi3generation, title = {{M}ulti3{G}eneration: Multitask, Multilingual, Multimodal Language Generation}, author = {Barreiro, Anabela and de Souza, Jos{\'e} GC and Gatt, Albert and Bhatt, Mehul and Lloret, Elena and Erdem, Aykut and Gkatzia, Dimitra and Moniz, Helena and Russo, Irene and Kepler, Fabio and Calixto, Iacer and Paprzycki, Marcin and Portet, Fran{\c{c}}ois and Augenstein, Isabelle and Alhasani, Mirela}, booktitle = {Proceedings of the 23rd Annual Conference of the European Association for Machine Translation}, month = jun, year = {2022}, address = {Ghent, Belgium}, publisher = {European Association for Machine Translation}, url = {https://aclanthology.org/2022.eamt-1.63}, pages = {347--348}, }
- Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning
  Erkut Erdem, Menekse Kuyu, Semih Yagcioglu, Anette Frank, Letitia Parcalabescu, Barbara Plank, Andrii Babii, Oleksii Turuta, and 10 more authors
  J. Artif. Int. Res., May 2022
Developing artificial learning systems that can understand and generate natural language has been one of the long-standing goals of artificial intelligence. Recent decades have witnessed an impressive progress on both of these problems, giving rise to a new family of approaches. Especially, the advances in deep learning over the past couple of years have led to neural approaches to natural language generation (NLG). These methods combine generative language learning techniques with neural-networks based frameworks. With a wide range of applications in natural language processing, neural NLG (NNLG) is a new and fast growing field of research. In this state-of-the-art report, we investigate the recent developments and applications of NNLG in its full extent from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability and learning strategies. We summarize the fundamental building blocks of NNLG approaches from these aspects and provide detailed reviews of commonly used preprocessing steps and basic neural architectures. This report also focuses on the seminal applications of these NNLG models such as machine translation, description generation, automatic speech recognition, abstractive summarization, text simplification, question answering and generation, and dialogue generation. Finally, we conclude with a thorough discussion of the described frameworks by pointing out some open research directions.
@article{10.1613/jair.1.12918, author = {Erdem, Erkut and Kuyu, Menekse and Yagcioglu, Semih and Frank, Anette and Parcalabescu, Letitia and Plank, Barbara and Babii, Andrii and Turuta, Oleksii and Erdem, Aykut and Calixto, Iacer and Lloret, Elena and Apostol, Elena-Simona and Truic\u{a}, Ciprian-Octavian and \v{S}andrih, Branislava and Martin\v{c}i\'{c}-Ip\v{s}i\'{c}, Sanda and Berend, G\'{a}bor and Gatt, Albert and Korvel, Gr\u{a}zina}, title = {Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning}, year = {2022}, issue_date = {May 2022}, publisher = {AI Access Foundation}, address = {El Segundo, CA, USA}, volume = {73}, issn = {1076-9757}, url = {https://doi.org/10.1613/jair.1.12918}, doi = {10.1613/jair.1.12918}, journal = {J. Artif. Int. Res.}, month = may, numpages = {77}, alt_metric = {true}, dimensions = {true}, keywords = {natural language, neural networks} }
- Endowing language models with multimodal knowledge graph representations
  Ningyuan Huang, Yash R Deshpande, Yibo Liu, Houda Alberts, Kyunghyun Cho, Clara Vania, and Iacer Calixto
  arXiv preprint arXiv:2206.13163, May 2022
We propose a method to make natural language understanding models more parameter efficient by storing knowledge in an external knowledge graph (KG) and retrieving from this KG using a dense index. Given (possibly multilingual) downstream task data, e.g., sentences in German, we retrieve entities from the KG and use their multimodal representations to improve downstream task performance. We use the recently released VisualSem KG as our external knowledge repository, which covers a subset of Wikipedia and WordNet entities, and compare a mix of tuple-based and graph-based algorithms to learn entity and relation representations that are grounded on the KG multimodal information. We demonstrate the usefulness of the learned entity representations on two downstream tasks, and show improved performance on the multilingual named entity recognition task by 0.3%–0.7% F1, while we achieve up to 2.5% improvement in accuracy on the visual sense disambiguation task. All our code and data are available at: this https URL.
@article{huang2022endowing, title = {Endowing language models with multimodal knowledge graph representations}, author = {Huang, Ningyuan and Deshpande, Yash R and Liu, Yibo and Alberts, Houda and Cho, Kyunghyun and Vania, Clara and Calixto, Iacer}, journal = {arXiv preprint arXiv:2206.13163}, year = {2022}, }
- Detecting Euphemisms with Literal Descriptions and Visual Imagery
  Ilker Kesen, Aykut Erdem, Erkut Erdem, and Iacer Calixto
  In Proceedings of the 3rd Workshop on Figurative Language Processing (FLP), Dec 2022
This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive or unpleasant issues like addiction and death. The ambiguous nature of euphemistic words or expressions makes it challenging to detect their actual meaning within a context. In the first stage, we seek to mitigate this ambiguity by incorporating literal descriptions into input text prompts to our baseline model. It turns out that this kind of direct supervision yields remarkable performance improvement. In the second stage, we integrate visual supervision into our system using visual imageries, two sets of images generated by a text-to-image model by taking terms and descriptions as input. Our experiments demonstrate that visual supervision also gives a statistically significant performance boost. Our system achieved the second place with an F1 score of 87.2%, only about 0.9% worse than the best submission.
@inproceedings{kesen-etal-2022-detecting, title = {Detecting Euphemisms with Literal Descriptions and Visual Imagery}, author = {Kesen, Ilker and Erdem, Aykut and Erdem, Erkut and Calixto, Iacer}, booktitle = {Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)}, month = dec, year = {2022}, address = {Abu Dhabi, United Arab Emirates (Hybrid)}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.flp-1.9}, doi = {10.18653/v1/2022.flp-1.9}, pages = {61--67}, }
- Natural language processing for mental disorders: an overview
  Iacer Calixto, Viktoriya Yaneva, and Raphael Cardoso
  In Natural Language Processing in Healthcare: A Special Focus on Low Resource Languages, Dec 2022
In recent years, there has been a surge in interest in using natural language processing (NLP) applications for clinical psychology and psychiatry. Despite the increased societal, economic, and academic interest, there has been no systematic critical analysis of the recent progress in NLP applications for mental disorders, or of the resources available for training and evaluating such systems. This chapter addresses this gap through two main contributions. First, it provides an overview of the NLP literature related to mental disorders, with a focus on autism, dyslexia, schizophrenia, depression and mental health in general. We discuss the strengths and shortcomings of current methodologies, specifically focusing on the challenges in obtaining large volumes of high-quality domain-specific data both for English and for lower-resource languages. We also provide a list of datasets publicly available for researchers who would like to develop NLP methods for specific mental disorders, categorized according to relevant criteria such as data source, language, annotation, and size. Our second contribution is a discussion on how to support the application of these methods to various languages and social contexts. This includes recommendations on conducting robust and ethical experiments from a machine learning perspective, and a discussion on how techniques such as cross-lingual transfer learning could be applied within this area.
@incollection{2436/624261, author = {Calixto, Iacer and Yaneva, Viktoriya and Cardoso, Raphael}, title = {Natural language processing for mental disorders: an overview}, publisher = {CRC Press}, booktitle = {Natural Language Processing in Healthcare: A Special Focus on Low Resource Languages}, year = {2022}, pages = {37--59}, isbn = {9780367685393}, doi = {10.1201/9781003138013}, url = {http://hdl.handle.net/2436/624261}, }
- VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
  Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt
  In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
@inproceedings{parcalabescu-etal-2022-valse, title = {{VALSE}: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena}, author = {Parcalabescu, Letitia and Cafagna, Michele and Muradjan, Lilitta and Frank, Anette and Calixto, Iacer and Gatt, Albert}, booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, month = may, year = {2022}, address = {Dublin, Ireland}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2022.acl-long.567}, doi = {10.18653/v1/2022.acl-long.567}, pages = {8253--8280}, }