... | ... | @@ -8,7 +8,7 @@ The corpus is released using a variant of the [PARSEME Shared task 2018](http:// |
|
|
|
|
|
In short, a _cupt_ file contains sentence-split text, with one token per line, blank lines separating sentences, and comment lines starting with a hash (#) that carry sentence metadata such as the raw text and the sentence ID. The first line of the _cupt_ file contains special metadata listing the names of the columns. Each token line contains 11 columns, which hold linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.). Tokens roughly correspond to words, except for multiword tokens represented as ranges (see UD's [tokenization guide](http://universaldependencies.org/u/overview/tokenization.html) and PARSEME's page on [words and tokens](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=wordsandtokens)). Documentation about the first 10 columns can be found **[here >> MARIE insert link <<](XXX)**. In PARSEME's original _cupt_ format, the 11th column contains MWE annotations. In PARSEME-FR's _cupt_ variant, we extend this column to (a) represent NEs in addition to MWEs and (b) add extra information about the POS and the criteria underlying the category labels, as detailed below. Therefore, the name of this column in the header metadata is _PARSEME-FR:MWE_ instead of _PARSEME:MWE_.
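
As a rough illustration of the layout described above, the following Python sketch splits _cupt_ text into a column list and a list of sentences. It is not part of the PARSEME-FR tooling. It assumes that the header line has the form `# global.columns = ...` and that columns are tab-separated; the function name `read_cupt` and the two-token sample are made up for the example.

```python
def read_cupt(text):
    """Split cupt text into (column_names, sentences).

    Assumptions (not stated in this README): the header line looks like
    '# global.columns = ID FORM ...' and token columns are tab-separated.
    """
    columns = None
    sentences = []
    current = {"comments": [], "tokens": []}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current["tokens"]:
                sentences.append(current)
            current = {"comments": [], "tokens": []}
        elif line.startswith("#"):
            if columns is None and "global.columns" in line:
                # Header metadata: names of the columns.
                columns = line.split("=", 1)[1].split()
            else:
                # Sentence-level metadata (raw text, sentence ID, ...).
                current["comments"].append(line)
        else:
            # One token per line, one field per column.
            current["tokens"].append(line.split("\t"))
    if current["tokens"]:
        sentences.append(current)
    return columns, sentences


# Made-up sample, for illustration only.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME-FR:MWE\n"
    "# text = Il pleut\n"
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_\t*\n"
    "2\tpleut\tpleuvoir\tVERB\t_\t_\t0\troot\t_\t_\t*\n"
    "\n"
)

columns, sentences = read_cupt(SAMPLE)
```

Here `columns[-1]` gives the name of the 11th column, which makes it possible to tell a PARSEME-FR file from an original PARSEME one.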
|
|
|
|
|
|
Similarly to _PARSEME:MWE_, the information in the 11th column _PARSEME-FR:MWE_ contains one of the following three options:
|
|
|
Similarly to _PARSEME:MWE_, the information in the 11th column called _PARSEME-FR:MWE_ contains one of the following three options:
|
|
|
1. an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. _2-3 du_)
|
|
|
2. an underscore '_' if the MWE/NE annotation is unspecified or missing
|
|
|
3. a list of semicolon-separated **codes** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines).
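
The three options can be told apart with a few lines of Python. This is only a sketch: the function name `parse_mwe_column` is hypothetical, the return conventions (empty list, `None`) are a choice made for the example, and the codes are kept as opaque strings because their internal structure is not covered here.

```python
def parse_mwe_column(value):
    """Interpret the 11th column (PARSEME-FR:MWE) of one token line."""
    if value == "*":
        # Option 1: not part of any MWE/NE, or a multiword-token line.
        return []
    if value == "_":
        # Option 2: annotation unspecified or missing.
        return None
    # Option 3: one or more semicolon-separated codes, left unparsed.
    return value.split(";")


# "A" and "B" are placeholders, not real PARSEME-FR codes.
examples = {v: parse_mwe_column(v) for v in ["*", "_", "A", "A;B"]}
```

Returning `None` for the underscore keeps "not annotated" distinct from "annotated as not belonging to any MWE/NE", which matters when computing statistics over partially annotated files.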
|
... | ... | |