Changes

Marie Candito · 6e82e327
--- a/Corpus-format-description.md
+++ b/Corpus-format-description.md
@@ -10,19 +10,28 @@ In short, a _cupt_ file contains split sentences, each represented with one toke

 Similarly to _PARSEME:MWE_, the information in the 11th column called _PARSEME-FR:MWE_ contains one of the following three options:
 1. an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. _2-3 du_)
+
 2. an underscore '_' if the MWE/NE annotation is unspecified or missing
- 3. a list of semicolon-separated **codes** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines). 
-    - for all the components of a MWE/NE except the (linearly) first one, the code is simply an **identifier**:
-      * the identifier of a MWE/NE is an integer, greater or equal to 1, and is unique within the sentence: the only requirement for identifiers is that all the components of a MWE/NE must have codes starting by the same identifier and no other MWE/NE in the sentence use it. 
-    - for the (linearly) first component of a MWE/NE, the code consists of an identifier followed by a colon ':' and a **pos-category-criteria label**:
-      * **pos|category|criteria labels** provide information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2..., for instance  ADP|MWE|IRREG describes a MWE (not a NE), whose part of speech is ADP, and for which the criteria IRREG has been used):
-          1. **POS** is a tag representing the part of speech of the whole MWE/NE, or a "_" if the MWE was classified as regular. Please refer to the [page describing the heuristics used to classify MWEs as regular / irregular, and to assign the POS to irregular ones](reg-irreg-pos-heuristics) for details.
-          2. **CATEGORY** is a tag corresponding to a category that depends on the type of entity being annotated. It contains a prefix and a suffix, separated by a dash.
-              * The prefix of the tag indicates whether this is a MWE (EP for _expression polylexicale_) or a NE (EN for _entité nommée_). 
-              * The suffix depends on the prefix as follows:
-                  1. For verbal MWEs (POS is VERB, prefix is EP), the suffix corresponds to the [PARSEME 1.1 verbal MWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ). 
-                  2. For non-verbal MWEs (POS is not VERB, prefix is EP), the suffix is  unspecified (_). 
-                  3. For named entities (prefix is EN), the suffix corresponds to one of the 5 [NE categories CARLOS ADD LINK](XXX) annotated (**PERS**on, **LOC**ation, **ORG**anization, **PROD**uct, **EVE**nt), with a sub-suffix indicating if the category is **PRIM**itive or **FINAL**
+
+ 3. a list of semicolon-separated **CODES** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines).
+    - for all the components of a MWE/NE except the (linearly) first one, the CODE is simply an **IDENTIFIER**:
+      * the **IDENTIFIER** of a MWE/NE is an integer, greater than or equal to 1, and is unique within the sentence: the only requirement for IDENTIFIERS is that all the components of a MWE/NE must have CODES starting with the same IDENTIFIER and that no other MWE/NE in the sentence uses it.
+    - for the (linearly) first component of a MWE/NE, the CODE consists of an IDENTIFIER followed by a colon ':' and a **LABEL**:
+      * **LABELS** provide information about the MWE/NE and are composed of a **POS** field, a **CATEGORY** field and a **CRITERIA** field, separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...; for instance, ADP|MWE|IRREG describes a MWE (not a NE) whose part of speech is ADP and for which the criterion IRREG has been used):
+
+          1. **POS** is a tag representing the part of speech of the whole MWE/NE, using the tagset of the syntactic annotation scheme (either UD or FTBdep), except for some MWEs that were classified as syntactically regular, in which case the POS is irrelevant ("_" is used). Please refer to the [page describing the heuristics used to classify MWEs as regular / irregular, and to assign the POS to irregular ones](reg-irreg-pos-heuristics) for details. Named entities always have the POS for proper nouns. 
+          2. **CATEGORY** corresponds to the type of unit being annotated. The category starts with either "NE" for named entities, or "MWE" for multi-word expressions that are not named entities.
+              * MWE is used for non-verbal MWEs
+              * for verbal MWEs, the category has a suffix corresponding to the [PARSEME 1.1 verbal MWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ), hence the categories:
+                * MWE-IRV (inherently reflexive verb)
+                * MWE-LVC.cause (causative light-verb construction)
+                * MWE-LVC.full (light-verb construction)
+                * MWE-MVC (multi-verb construction)
+                * MWE-VID (verbal idiom)
+              * for named entities, the category starts with "NE-", followed by a type of named entity (**PERS**on, **LOC**ation, **ORG**anization, **PROD**uct, **EVE**nt), followed by "**.final**" or "**.prim**" to indicate whether the type matches the interpretation of the NE in context (.final) or matches the a priori interpretation only (.prim). See [NE categories CARLOS ADD LINK](XXX) for more details.
+                * for instance a person named entity will get the category NE-PERS.final
+                * a location used in context for an organization (as in "they played against Chicago") will lead to two NE annotations: one with category NE-LOC.prim and one with category NE-ORG.final 
+
          3. **CRITERIA**: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, this corresponds to a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed a MWE. The criteria acronyms correspond to those in the [guidelines](https://gitlab.lis-lab.fr/PARSEME-FR/PARSEME-FR-public/wikis/Criteres), e.g. CRAN, ID, PRED, LEX, etc.
    - If the current line contains a lexicalized component which is not the first one in the current MWE, the MWE code contains the MWE identifier only, as described above, and no MWE category label.
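
As an illustration of the format described in this change, here is a minimal Python sketch of how one cell of the _PARSEME-FR:MWE_ column could be split into its parts. The names `MweCode` and `parse_mwe_column` are hypothetical and not part of any PARSEME-FR tool, and the `PROPN` tag in the usage note assumes the UD tagset. The sketch assumes well-formed input and does no validation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MweCode:
    """One CODE from the PARSEME-FR:MWE column (hypothetical helper type)."""
    identifier: int
    pos: Optional[str] = None             # None when the POS field is "_" or there is no LABEL
    category: Optional[str] = None        # e.g. "MWE", "MWE-VID", "NE-PERS.final"
    criteria: Optional[List[str]] = None  # None when the code has no LABEL


def parse_mwe_column(value: str) -> List[MweCode]:
    """Parse one cell of the 11th column into a list of codes.

    '*' (not part of a MWE/NE) and '_' (unspecified) carry no codes,
    so both yield an empty list.
    """
    if value in ("*", "_"):
        return []
    codes = []
    for code in value.split(";"):
        # Only the linearly first component carries ':' followed by a LABEL.
        ident, sep, label = code.partition(":")
        if not sep:
            codes.append(MweCode(int(ident)))
            continue
        pos, category, criteria = label.split("|")
        codes.append(MweCode(
            identifier=int(ident),
            pos=None if pos == "_" else pos,
            category=category,
            criteria=[] if criteria == "_" else criteria.split(","),
        ))
    return codes
```

For example, `parse_mwe_column("1:ADP|MWE|IRREG;2")` yields two codes: the first component of MWE 1 with its full label, and a non-first component of MWE/NE 2 with the identifier only. A named entity cell such as `3:PROPN|NE-PERS.final|_` would yield an empty criteria list.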