Update Corpus format description authored by Marie Candito
@@ -12,10 +12,11 @@ Similarly to _PARSEME:MWE_, the information in the 11th column called _PARSEME-F
1. an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. _2-3 du_)
2. an underscore '_' if the MWE/NE annotation is unspecified or missing
3. a list of semicolon-separated **codes** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines).
- - If the current line contains the first lexicalized component of the MWE/NE in the sentence, the code consists of an **identifier** followed by a colon ':' and a **pos-category-criteria label**:
- * the **identifier** of a MWE/NE is an integer (starting at 1), and is unique within the sentence.
- * **pos-category-criteria labels** are strings corresponding to information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...):
- 1. **POS** is a tag representing the part of speech of the whole MWE/NE. The tags were inferred automatically using heuristics, or defined manually for irregular constructions. **[MARIE add link to POS details here if relevant](XXX)**.
+ - for all the components of a MWE/NE except the (linearly) first one, the code is simply an **identifier**:
+ * the identifier of a MWE/NE is an integer, greater than or equal to 1, and is unique within the sentence: the only requirement for identifiers is that all the components of a MWE/NE must have codes starting with the same identifier and that no other MWE/NE in the sentence uses it.
+ - for the (linearly) first component of a MWE/NE, the code consists of an identifier followed by a colon ':' and a **pos-category-criteria label**:
+ * **pos|category|criteria labels** provide information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2..., for instance ADP|MWE|IRREG describes a MWE (not a NE), whose part of speech is ADP, and for which the criterion IRREG has been used):
+ 1. **POS** is a tag representing the part of speech of the whole MWE/NE, or a "_" if the MWE was classified as regular. Please refer to the [page describing the heuristics used to classify MWEs as regular / irregular, and to assign the POS to irregular ones](reg-irreg-pos-heuristics) for details.
2. **CATEGORY** is a tag corresponding to a category that depends on the type of entity being annotated. It contains a prefix and a suffix, separated by a dash.
* The prefix of the tag indicates whether this is a MWE (EP for _expression polylexicale_) or a NE (EN for _entité nommée_).
* The suffix depends on the prefix as follows:
...
......
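As an illustration of the column format described in this hunk, the following is a minimal Python sketch of how one cell of the annotation column might be parsed. It is not part of the corpus tooling: the names `Code` and `parse_cell` are hypothetical, and the sketch assumes a cell is either '*', '_', or a semicolon-separated list of codes in which only the first component of a MWE/NE carries a `POS|CATEGORY|CRITERIA` label. The sample cell reuses the `ADP|MWE|IRREG` label quoted in the new text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Code:
    """One code of a cell (hypothetical container, for illustration only)."""
    identifier: int                       # integer >= 1, unique within the sentence
    pos: Optional[str] = None             # POS of the whole MWE/NE, or "_" if regular
    category: Optional[str] = None        # CATEGORY field of the label
    criteria: Optional[List[str]] = None  # comma-separated criteria, split into a list


def parse_cell(cell: str) -> Optional[List[Code]]:
    """Parse one cell of the annotation column.

    Returns [] for '*' (word not part of any MWE/NE, or multiword token),
    None for '_' (annotation unspecified or missing),
    and a list of Code objects otherwise.
    """
    if cell == "*":
        return []
    if cell == "_":
        return None
    codes = []
    for raw in cell.split(";"):
        # Only the (linearly) first component carries ':' plus a label.
        ident, sep, label = raw.partition(":")
        code = Code(identifier=int(ident))
        if sep:
            # Assumes exactly three pipe-separated fields, as described above.
            pos, category, criteria = label.split("|")
            code.pos = pos
            code.category = category
            code.criteria = criteria.split(",")
        codes.append(code)
    return codes


if __name__ == "__main__":
    # A word that opens MWE/NE 1 and is also a later component of MWE/NE 2.
    print(parse_cell("1:ADP|MWE|IRREG;2"))
```

Returning an empty list for '*' and `None` for '_' keeps "no annotation applies" distinct from "annotation missing", mirroring the distinction the format itself makes.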