The PARSEME-FR annotated corpus adds an extra annotation layer for multiword expressions (MWEs) and named entities (NEs) on top of the French [Sequoia treebank](https://deep-sequoia.inria.fr/). It is released in a variant of the [PARSEME Shared task 2018](http://multiword.sourceforge.net/sharedtask2018) format, called _cupt_ (short for **C**onll-**U**+**P**arseme-**T**sv). Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the [cupt format description page](http://multiword.sourceforge.net/cupt-format) for details. Since _cupt_ is based on CoNLL-U, please also check the Universal Dependencies [CoNLL-U format description page](http://universaldependencies.org/format) and the recommendations for the [CoNLL-U Plus extended format](http://universaldependencies.org/ext-format.html), which we aim to be compatible with.
In short, a _cupt_ file contains sentence-split text: each sentence is represented with one token per line, blank lines separate sentences, and comment lines starting with a hash (#) carry sentence metadata such as the raw text and the sentence ID. The first line of the _cupt_ file contains special metadata listing the column names. Each token line contains 11 columns, holding linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.). Documentation about the first 10 columns can be found **[here >> MARIE insert link <<](XXX)**. In PARSEME, the 11th column contains the MWE annotation; in PARSEME-FR we extend it to also represent NE information, as detailed below. Therefore, the name of this column in the header metadata is _PARSEME-FR:MWE_ instead of _PARSEME:MWE_.
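The layout described above can be read with a few lines of Python. This is only a minimal sketch, not an official reader: it assumes that columns are tab-separated and that the header line follows the CoNLL-U Plus convention `# global.columns = ...`. The function name `read_cupt` and the sample sentences are made up for illustration.

```python
import io

# Assumption: the header follows the CoNLL-U Plus "global.columns" convention.
HEADER_PREFIX = "# global.columns ="


def read_cupt(stream):
    """Return (column_names, sentences) read from a cupt-like text stream.

    Each sentence is a dict with its comment lines and its token rows;
    each token row maps column names to the values found on that line.
    """
    columns = None
    sentences = []
    comments, tokens = [], []
    for raw in stream:
        line = raw.rstrip("\n")
        if columns is None and line.startswith(HEADER_PREFIX):
            columns = line[len(HEADER_PREFIX):].split()
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            if columns is None:
                raise ValueError("token line found before the column header")
            tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:  # last sentence not followed by a blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return columns, sentences


# Made-up two-sentence sample, only to exercise the reader.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME-FR:MWE",
    "# text = a b",
    "\t".join(["1", "a", "a", "X", "_", "_", "0", "root", "_", "_", "*"]),
    "\t".join(["2", "b", "b", "X", "_", "_", "1", "dep", "_", "_", "*"]),
    "",
    "# text = c",
    "\t".join(["1", "c", "c", "X", "_", "_", "0", "root", "_", "_", "*"]),
    "",
])

columns, sentences = read_cupt(io.StringIO(SAMPLE))
```

Keying each token by column name means the 11th column can be accessed as `token["PARSEME-FR:MWE"]` rather than by position.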
As in _PARSEME:MWE_, the 11th column _PARSEME-FR:MWE_ contains one of the following three options:
1. an asterisk '*' for words that are not part of an MWE/NE and for multiword tokens (e.g. _2-3 du_)
2. an underscore '_' if the MWE/NE annotation is unspecified or missing
3. a list of semicolon-separated **codes** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of an MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines).
   - If the current line contains the first lexicalized component of the MWE/NE in the sentence, the code consists of an **identifier** followed by a colon ':' and a **category-criteria label**:
     * **identifiers** are integers starting from 1 in each new sentence and increased by 1 for each new annotation.
     * **category-criteria labels** are strings carrying information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...):
       1. **POS** is a tag representing the part of speech of the whole MWE/NE. The tags were inferred automatically using heuristics, or defined manually for irregular constructions. **[MARIE add link to POS details here if relevant](XXX)**.
       2. **CATEGORY** is a tag whose value depends on the type of entity being annotated. The prefix of the tag indicates whether this is an MWE (EP for _expression polylexicale_) or an NE (EN for _entité nommée_). For verbal MWEs, the suffix corresponds to the [PARSEME 1.1 verbal MWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ). For non-verbal MWEs, the suffix is unspecified (_). For named entities, the suffix corresponds to one of the 5 annotated [NE categories CARLOS ADD LINK](XXX) (**PERS**on, **LOC**ation, **ORG**anization, **PROD**uct, **EVE**nt), with a sub-suffix indicating whether the category is **PRIM**itive or **FINAL**.
       3. **CRITERIA**: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, it is a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed an MWE. The criteria acronyms correspond to those in the guidelines.
   - If the current line contains a lexicalized component of the MWE/NE which is not the first one in the sentence, the code contains the identifier only, as described above, and no category-criteria label.
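To make the encoding of the 11th column concrete, here is a minimal Python sketch that decodes one column value and then groups the tokens of a sentence by annotation identifier. It is an illustration under stated assumptions rather than a reference implementation: the function names `parse_mwe_column` and `group_annotations` are hypothetical, and the label values `POS`, `CAT`, `CRIT1` and `CRIT2` are placeholders, not tags taken from the corpus.

```python
def parse_mwe_column(value):
    """Decode one value of the PARSEME-FR:MWE column.

    Returns None for '_' (annotation unspecified), an empty list for '*'
    (no MWE/NE), and otherwise a list of (identifier, label) pairs.
    The label is None when the code carries the identifier only.
    """
    if value == "_":
        return None
    if value == "*":
        return []
    annotations = []
    for code in value.split(";"):
        identifier, colon, label = code.partition(":")
        fields = None
        if colon:
            # Three pipe-separated fields: POS|CATEGORY|CRITERIA
            pos, category, criteria = label.split("|")
            fields = {
                "pos": pos,
                "category": category,
                "criteria": [] if criteria == "_" else criteria.split(","),
            }
        annotations.append((int(identifier), fields))
    return annotations


def group_annotations(column_values):
    """Map each identifier to its label and the 0-based token positions it covers."""
    groups = {}
    for position, value in enumerate(column_values):
        for identifier, fields in parse_mwe_column(value) or []:
            entry = groups.setdefault(identifier, {"label": None, "positions": []})
            if fields is not None:
                # Only the first lexicalized component carries the label.
                entry["label"] = fields
            entry["positions"].append(position)
    return groups


# Placeholder values for a six-token sentence (not real corpus tags).
values = ["*", "1:POS|CAT|CRIT1,CRIT2", "1;2:POS|CAT|_", "*", "2", "_"]
groups = group_annotations(values)
```

In this made-up sentence, the third token belongs to two annotations at once: it continues annotation 1 (identifier only) and opens annotation 2 (identifier plus label).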
