Update Corpus format description authored by Marie Candito
@@ -10,19 +10,28 @@ In short, a _cupt_ file contains split sentences, each represented with one toke
  Similarly to _PARSEME:MWE_, the information in the 11th column called _PARSEME-FR:MWE_ contains one of the following three options:
  1. an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. _2-3 du_)
  2. an underscore '_' if the MWE/NE annotation is unspecified or missing
- 3. a list of semicolon-separated **codes** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines).
- - for all the components of a MWE/NE except the (linearly) first one, the code is simply an **identifier**:
- * the identifier of a MWE/NE is an integer greater than or equal to 1 that is unique within the sentence: the only requirement for identifiers is that all the components of a MWE/NE have codes starting with the same identifier and that no other MWE/NE in the sentence uses it.
- - for the (linearly) first component of a MWE/NE, the code consists of an identifier followed by a colon ':' and a **pos-category-criteria label**:
- * **pos|category|criteria labels** provide information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2..., for instance ADP|MWE|IRREG describes a MWE (not a NE), whose part of speech is ADP, and for which the criteria IRREG has been used):
- 1. **POS** is a tag representing the part of speech of the whole MWE/NE, or a "_" if the MWE was classified as regular. Please refer to the [page describing the heuristics used to classify MWEs as regular / irregular, and to assign the POS to irregular ones](reg-irreg-pos-heuristics) for details.
- 2. **CATEGORY** is a tag corresponding to a category that depends on the type of entity being annotated. It contains a prefix and a suffix, separated by a dash.
- * The prefix of the tag indicates whether this is a MWE (EP for _expression polylexicale_) or a NE (EN for _entité nommée_).
- * The suffix depends on the prefix as follows:
- 1. For verbal MWEs (POS is VERB, prefix is EP), the suffix corresponds to the [PARSEME 1.1 verbal MWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ).
- 2. For non-verbal MWEs (POS is not VERB, prefix is EP), the suffix is unspecified (_).
- 3. For named entities (prefix is EN), the suffix corresponds to one of the 5 [NE categories CARLOS ADD LINK](XXX) annotated (**PERS**on, **LOC**ation, **ORG**anization, **PROD**uct, **EVE**nt), with a sub-suffix indicating if the category is **PRIM**itive or **FINAL**
+ 3. a list of semicolon-separated **CODES** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines).
+ - for all the components of a MWE/NE except the (linearly) first one, the CODE is simply an **IDENTIFIER**:
+ * the **IDENTIFIER** of a MWE/NE is an integer greater than or equal to 1 that is unique within the sentence: the only requirement for IDENTIFIERS is that all the components of a MWE/NE have CODES starting with the same IDENTIFIER and that no other MWE/NE in the sentence uses it.
+ - for the (linearly) first component of a MWE/NE, the code consists of an identifier followed by a colon ':' and a **LABEL**:
+ * **LABELS** provide information about the MWE/NE and are composed of a **POS** field, a **CATEGORY** field and a **CRITERIA** field separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...; for instance, ADP|MWE|IRREG describes a MWE (not a NE) whose part of speech is ADP and for which the criterion IRREG has been used):
+ 1. **POS** is a tag representing the part of speech of the whole MWE/NE, using the tagset of the syntactic annotation scheme (either UD or FTBdep), except for some MWEs that were classified as syntactically regular, in which case the POS is irrelevant ("_" is used). Please refer to the [page describing the heuristics used to classify MWEs as regular / irregular, and to assign the POS to irregular ones](reg-irreg-pos-heuristics) for details. Named entities always have the POS for proper nouns.
+ 2. **CATEGORY** corresponds to the type of unit being annotated. The category starts with either "NE" for named entities or "MWE" for multi-word expressions that are not named entities.
+ * MWE is used for non-verbal MWEs
+ * for verbal MWEs, the category has a suffix corresponding to the [PARSEME 1.1 verbal MWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ), hence the categories:
+ * MWE-IRV (inherently reflexive verb)
+ * MWE-LVC.cause (causative light-verb construction)
+ * MWE-LVC.full (light-verb construction)
+ * MWE-MVC (multi-verb construction)
+ * MWE-VID (verbal idiom)
+ * For named entities, the category starts with "NE-", followed by a type of named entity (**PERS**on, **LOC**ation, **ORG**anization, **PROD**uct, **EVE**nt), followed by "**.final**" or "**.prim**" to indicate whether the type matches the interpretation of the NE in context (.final) or matches the a priori interpretation only (.prim). See [NE categories CARLOS ADD LINK](XXX) for more details.
+ * for instance, a person named entity will get the category NE-PERS.final
+ * a location used in context for an organization (as in "they played against Chicago") will lead to two NE annotations: one with category NE-LOC.prim and one with category NE-ORG.final
  3. **CRITERIA**: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, this corresponds to a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed a MWE. The criteria acronyms correspond to those in the [guidelines](https://gitlab.lis-lab.fr/PARSEME-FR/PARSEME-FR-public/wikis/Criteres), e.g. CRAN, ID, PRED, LEX, etc.
  - If the current line contains a lexicalized component which is not the first one in the current MWE, the MWE code contains the MWE identifier only, as described above, and no MWE category label.
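
The column format described in this hunk can be illustrated with a small parser. This is a minimal sketch, not part of the PARSEME-FR tooling: the names `Code` and `parse_mwe_column` are hypothetical, and the sketch assumes the updated format, in which a cell is either `*`, `_`, or a semicolon-separated list of codes whose first-component form is `IDENTIFIER:POS|CATEGORY|CRITERIA`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Code:
    """One code from a PARSEME-FR:MWE cell (hypothetical helper type)."""
    identifier: int
    # The three label fields are only set on the first component of a MWE/NE.
    pos: Optional[str] = None            # "_" when the POS is irrelevant
    category: Optional[str] = None       # e.g. "MWE", "MWE-VID", "NE-PERS.final"
    criteria: Optional[List[str]] = None  # [] when the field is "_"


def parse_mwe_column(value: str) -> Optional[List[Code]]:
    """Parse the 11th-column value of one token line.

    Returns [] for '*' (not part of a MWE/NE), None for '_'
    (annotation unspecified or missing), else the list of codes.
    """
    if value == "*":
        return []
    if value == "_":
        return None
    codes = []
    for raw in value.split(";"):
        # Only the first component carries ':' followed by a label.
        identifier, sep, label = raw.partition(":")
        if not sep:
            codes.append(Code(int(identifier)))
            continue
        pos, category, criteria = label.split("|")
        codes.append(Code(
            identifier=int(identifier),
            pos=pos,
            category=category,
            criteria=[] if criteria == "_" else criteria.split(","),
        ))
    return codes
```

For example, `parse_mwe_column("1:ADP|MWE|IRREG")` yields a single code with identifier 1, POS `ADP`, category `MWE` and criteria `["IRREG"]`, while a later component of the same MWE would carry just `1`. The sketch does not check that identifiers are unique within a sentence; that check needs all token lines of the sentence.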