The PARSEME-FR annotated corpus adds an annotation layer for multiword expressions (MWEs) and named entities (NEs) on top of the French Sequoia treebank. It is released in a variant of the PARSEME Shared Task 2018 format called cupt (short for CoNLL-U + PARSEME-TSV). We give a minimal description of this format here so that the documentation is self-contained; please refer to the cupt format description page for details. Since cupt is based on CoNLL-U, please also check the Universal Dependencies CoNLL-U format description page and the recommendations for the CoNLL-U Plus extended format, with which we aim to be compatible.
In short, a cupt file contains sentence-split text: each sentence is represented with one token per line, blank lines separate sentences, and comment lines starting with a hash (#) carry sentence metadata such as the raw text and the sentence ID. The first line of the cupt file contains special metadata listing the names of the columns. Each token line contains 11 columns with linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.). Documentation about the first 10 columns can be found here >> MARIE insert link <<. In PARSEME, the 11th column contains the MWE annotation; in PARSEME-FR we extend it to also represent NE information, as detailed below. Therefore, the name of this column in the header metadata is PARSEME-FR:MWE instead of PARSEME:MWE.
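The layout described above can be read with a few lines of Python. This is only a minimal sketch, not an official reader: the function name `read_cupt` is hypothetical, the header is assumed to follow the CoNLL-U Plus `# global.columns = ...` convention, columns are assumed to be tab-separated, and the two-token sentence is toy data invented for illustration.

```python
def read_cupt(lines):
    """Yield (comments, tokens) for each sentence.

    Each token is a dict keyed by the column names found in the header line.
    Assumption: the header uses the CoNLL-U Plus "# global.columns =" syntax.
    """
    columns = None
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Header metadata: whitespace-separated column names.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Sentence-level metadata (raw text, sentence ID, ...).
            comments.append(line)
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        else:
            if columns is None:
                raise ValueError("token line found before the column header")
            tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:
        # Last sentence when the file does not end with a blank line.
        yield comments, tokens


# Toy input, invented for illustration only (not taken from the corpus).
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC PARSEME-FR:MWE")
SAMPLE = "\n".join([
    HEADER,
    "# sent_id = toy-1",
    "# text = Il dort",
    "\t".join(["1", "Il", "il", "PRON", "_", "_", "2", "nsubj", "_", "_", "*"]),
    "\t".join(["2", "dort", "dormir", "VERB", "_", "_", "0", "root", "_", "_", "*"]),
    "",
])

for comments, tokens in read_cupt(SAMPLE.splitlines()):
    print(comments[0], [tok["FORM"] for tok in tokens])
```

Reading the column names from the header rather than hard-coding them keeps the sketch usable for both PARSEME:MWE and PARSEME-FR:MWE files.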
As with PARSEME:MWE, the 11th column, PARSEME-FR:MWE, contains one of the following three options:
- an asterisk '*' for words that are not part of an MWE/NE and for multiword tokens (e.g. 2-3 du)
- an underscore '_' if the MWE/NE annotation is unspecified or missing
- a list of semicolon-separated codes if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of an MWE/NE (see Lexicalized components and open slots in the PARSEME annotation guidelines).
  - If the current line contains the first lexicalized component of the MWE/NE in the sentence, the code consists of an identifier followed by a colon ':' and a category-criteria label:
    - identifiers are integers starting from 1 in each new sentence, and increased by 1 for each new annotation.
    - category-criteria labels are strings carrying information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...):
      - POS is a tag representing the part of speech of the whole MWE/NE. The tags were inferred automatically using heuristics, or defined manually for irregular constructions. MARIE add link to POS details here if relevant.
      - CATEGORY is a tag whose value depends on the type of entity being annotated. The prefix of the tag indicates whether this is an MWE (EP for expression polylexicale) or an NE (EN for entité nommée). For verbal MWEs, the suffix corresponds to the PARSEME 1.1 verbal MWE categories. For non-verbal MWEs, the suffix is unspecified (_). For named entities, the suffix corresponds to one of the 5 NE categories CARLOS ADD LINK annotated (PERSon, LOCation, ORGanization, PRODuct, EVEnt), with a sub-suffix indicating whether the category is PRIMitive or FINAL.
      - CRITERIA: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, it is a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed an MWE. The criteria acronyms correspond to those in the guidelines.
  - If the current line contains a lexicalized component of the MWE/NE that is not the first one in the sentence, the code contains only the MWE/NE identifier, as described above, and no category-criteria label.
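
The rules above for the PARSEME-FR:MWE column can be sketched as a small parser. This is a minimal illustration under stated assumptions, not part of the corpus tooling: the names `MweCode` and `parse_mwe_column` are hypothetical, the CATEGORY field is kept as an opaque string because its internal prefix/suffix syntax is not spelled out here, and the convention of returning an empty list for '*' and `None` for '_' is a choice made for this sketch only.

```python
from typing import List, NamedTuple, Optional, Tuple


class MweCode(NamedTuple):
    """One code from the PARSEME-FR:MWE column (hypothetical container)."""
    ident: int                 # annotation identifier within the sentence
    pos: Optional[str]         # POS field, None on non-initial components
    category: Optional[str]    # CATEGORY field, kept as an opaque string
    criteria: Tuple[str, ...]  # empty when the field is '_' or absent


def parse_mwe_column(value: str) -> Optional[List[MweCode]]:
    """Parse the 11th column of one token line.

    Returns [] for '*' (not part of any MWE/NE), None for '_' (annotation
    unspecified or missing), and a list of MweCode otherwise.
    """
    if value == "*":
        return []
    if value == "_":
        return None
    codes = []
    for code in value.split(";"):
        ident, colon, label = code.partition(":")
        if not colon:
            # Non-initial lexicalized component: identifier only.
            codes.append(MweCode(int(ident), None, None, ()))
            continue
        # First lexicalized component: POS|CATEGORY|CRITERION1,CRITERION2...
        pos, category, criteria = label.split("|")
        crits = () if criteria == "_" else tuple(criteria.split(","))
        codes.append(MweCode(int(ident), pos, category, crits))
    return codes


# Schematic value reusing the field names of the label template above;
# real files contain actual tags in place of POS, CATEGORY and CRIT1/CRIT2.
for parsed in parse_mwe_column("1:POS|CATEGORY|CRIT1,CRIT2;2"):
    print(parsed)
```

Grouping tokens by `ident` within a sentence then recovers each full MWE/NE, with the category-criteria label taken from the component where `pos` is not `None`.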