CoMeta: A Corpus for Metaphor Detection in Spanish
We present CoMeta, a manually annotated Corpus for Metaphor Detection in Spanish, with the aim of facilitating research on automatic metaphor detection. We believe that CoMeta is the largest publicly available dataset with metaphor annotations on general-domain texts in Spanish.
CoMeta comprises miscellaneous texts in Spanish: a subset of 1925 news-domain sentences from the AnCora dataset (Taulé et al. 2008); 937 sentences of wiki, blog and review texts from the GSD corpus; and 771 sentences from manually collected transcriptions of political discourse, from both the Spanish Government and parliamentary sessions of the Basque Government.
CoMeta Description
CoMeta consists of 3633 sentences annotated at token level with binary tags (B-METAPHOR/O). Only words with semantic content were candidates for labelling, that is, verbs, nouns, adjectives and adverbs. Because the annotation process is subjective, the dataset is subject to continuous updates and open to improvements.
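As an illustration of token-level binary tagging, the following minimal sketch reads sentences in which every token carries either B-METAPHOR or O. The file layout is an assumption (one token and its label per line, separated by whitespace, with a blank line between sentences); the released files may be organized differently. The function names and the sample sentences are hypothetical and are not taken from the corpus.

```python
from typing import List, Tuple

# A sentence is a list of (token, label) pairs.
Sentence = List[Tuple[str, str]]


def parse_token_labels(text: str) -> List[Sentence]:
    """Parse 'token<whitespace>label' lines; blank lines separate sentences.

    ASSUMPTION: this layout is illustrative, not the documented CoMeta format.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line.
        token, label = line.rsplit(None, 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def metaphor_tokens(sentence: Sentence) -> List[str]:
    """Return the tokens tagged B-METAPHOR in one sentence."""
    return [token for token, label in sentence if label == "B-METAPHOR"]


# Invented sample data, for illustration only.
SAMPLE = """\
El\tO
gobierno\tO
atacó\tB-METAPHOR
la\tO
propuesta\tO
.\tO

Llueve\tO
mucho\tO
.\tO
"""

for sent in parse_token_labels(SAMPLE):
    words = " ".join(token for token, _ in sent)
    print(words, "->", metaphor_tokens(sent))
```

Keeping each sentence as a list of (token, label) pairs makes it straightforward to feed the data to a token classification model or to count metaphorical tokens per sentence.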
The annotation of CoMeta has been developed by following the MIPVU guidelines (Steen et al. 2010), used to label the most popular metaphor corpus for English: the VUAM corpus.
These instructions can be summarized as follows:
- Read the entire text–discourse to establish a general understanding of the meaning.
- Determine the lexical units in the text–discourse.
- (a) For each lexical unit in the text, establish its meaning in context, that is, how it applies to an entity, relation, or attribute in the situation evoked by the text (contextual meaning). Take into account what comes before and after the lexical unit.
  (b) For each lexical unit, determine if it has a more basic contemporary meaning in other contexts than the one in the given context. For our purposes, basic meanings tend to be:
    - More concrete; what they evoke is easier to imagine, see, hear, feel, smell, and taste.
    - Related to bodily action.
    - More precise (as opposed to vague).
    - Historically older.
  Basic meanings are not necessarily the most frequent meanings of the lexical unit.
  (c) If the lexical unit has a more basic current–contemporary meaning in other contexts than the given context, decide whether the contextual meaning contrasts with the basic meaning but can be understood in comparison with it.
- If yes, mark the lexical unit as metaphorical.
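The order of the decisions in the procedure above can be sketched in code. This is only a schematic sketch: in MIPVU each judgment (contextual meaning, basic meaning, contrast, comparability) is made by a human annotator, and the code merely records those judgments and applies the final rule. The `LexicalUnit` class, its fields and the example meanings are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexicalUnit:
    """Annotator judgments for one lexical unit (hypothetical structure)."""
    text: str
    contextual_meaning: str                # step (a): meaning in context
    basic_meaning: Optional[str] = None    # step (b): None if no more basic meaning exists
    contrasts_with_basic: bool = False     # step (c): contextual meaning contrasts with basic
    understood_by_comparison: bool = False # step (c): yet is understood in comparison with it


def is_metaphorical(unit: LexicalUnit) -> bool:
    """Apply the final rule: mark as metaphorical only if step (c) is answered 'yes'."""
    # Without a more basic meaning in other contexts, the unit is not metaphorical.
    if unit.basic_meaning is None:
        return False
    # The contextual meaning must contrast with the basic one
    # and still be understandable in comparison with it.
    return unit.contrasts_with_basic and unit.understood_by_comparison


# Invented example judgments, for illustration only.
attacked = LexicalUnit(
    text="atacó",
    contextual_meaning="criticized strongly",
    basic_meaning="assaulted physically",
    contrasts_with_basic=True,
    understood_by_comparison=True,
)
table = LexicalUnit(text="mesa", contextual_meaning="piece of furniture")

print(attacked.text, "B-METAPHOR" if is_metaphorical(attacked) else "O")
print(table.text, "B-METAPHOR" if is_metaphorical(table) else "O")
```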
Datasets
The dataset is publicly available: Download CoMeta Dataset v1
Metaphor Detection Results
The results show the high performance of newer Large Language Models such as DeBERTa for English. Furthermore, the high degree of metaphor transfer between English and Spanish is noticeable. We hypothesize that these results may be due to the difference in size of the training data in the two languages, or to the application of the MIPVU guidelines to Spanish, which is not the language they were originally designed for. Future experimental work is needed to test these interpretations.
Paper
If you use this resource, please cite the following paper:
Elisa Sanchez-Bayona and Rodrigo Agerri (2022). Leveraging a New Spanish Corpus for Multilingual and Crosslingual Metaphor Detection. In CoNLL 2022.