Changes

Carlos Ramisch · eaf6109d
--- a/Corpus-format-description.md
+++ b/Corpus-format-description.md
-The PARSEME-FR annotated corpus is released using a variant of the [PARSEME Shared task 2018](multiword.sourceforge.net/sharedtask2018) format, called _cupt_ (short for **C**onll-**U**+**P**arseme-**T**sv). Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the [cupt format description page](multiword.sourceforge.net/cupt-format) for details. Since _cupt_ is based on Conll-U, please also check the Universal Dependencies [Conll-U format description page](universaldependencies.org/format) and the recommendations for [Conll-U Plus extended format](http://universaldependencies.org/ext-format.html), which we aim to be compatible with.
-
-In short, a _cupt_ file contains splitted sentences, each represented with one token per line, blank lines separating sentences, and comments preceded by hashes (#) to add sentence meta-data such as raw text and sentence IDs. The first line of the _cupt_ file contains special metadata listing the names of each column. Each token on a line contains 11 columns, corresponding to linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc). The 11th column contains MWE annotation in PARSEME, and we extend it to also represent EN information in PARSEME-FR as detailed below. Therefore, the name of this column in the header metadata is _PARSEMEFR:MWE_ instead of _PARSEME:MWE_
-
-AS for _PARSEME:MWE_, the information in the 11th column _PARSEMEFR:MWE_ contains either an asterisk _*_ (no annotation), and underscore *_* (unspecified) or a code in 2 parts indicating a MWE or EN annotation:
-
-
- column format:
-
-    - ID:POS%Type%Criterion1,Criterion2,...:
-    - POS - generated by Marie
-
-    - Type - the PARSEME ones for verbal MWEs (EP); EP for non-verbal ones, EN, ...
-
-    - Criterion - PARSEME-FR criterion, with the number removed
-    - Examples:
-        - 2:V%LVC.full%EP-OP
-
-        - 1:N%EN-PERS.final%_
-
-        - 1:ADV%_%EP-LEX
-    - Details
-if a field is not filled in => use _
-
-Verbal MWEs (EP):
-
- for the type: use the PARSEME type (LVC.full ...)
-
- for the criterion: use the PARSEME-FR criterion (OP, seV ...)
-For NEs (EN):
-
- POS = PROPN,
-
- TYPE= EN-PERS | EN-ORG | EN-LOC | EN-PROD | EN-EVE
-
- criterion usually not filled in: _  
-    Named entities: num:PROPN%EN-PERS%_
-WARNING: review the link to version 1.1
\ No newline at end of file
+The PARSEME-FR annotated corpus adds an extra annotation layer for multiword expressions (MWEs) and named entities (NEs) on top of the French [Sequoia treebank](https://deep-sequoia.inria.fr/). It is released using a variant of the [PARSEME Shared task 2018](http://multiword.sourceforge.net/sharedtask2018) format, called _cupt_ (short for **C**onll-**U**+**P**arseme-**T**sv). Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the [cupt format description page](http://multiword.sourceforge.net/cupt-format) for details. Since _cupt_ is based on Conll-U, please also check the Universal Dependencies [Conll-U format description page](http://universaldependencies.org/format) and the recommendations for [Conll-U Plus extended format](http://universaldependencies.org/ext-format.html), which we aim to be compatible with.
+
+In short, a _cupt_ file contains text split into sentences, each represented with one token per line, with blank lines separating sentences and comments preceded by hashes (#) giving sentence metadata such as raw text and sentence IDs. The first line of the _cupt_ file contains special metadata listing the names of each column. Each token line contains 11 columns, corresponding to linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.). Documentation about the first 10 columns can be found **[here >> MARIE insert link <<](XXX)**. The 11th column contains MWE annotation in PARSEME, and we extend it to also represent NE information in PARSEME-FR as detailed below. Therefore, the name of this column in the header metadata is _PARSEME-FR:MWE_ instead of _PARSEME:MWE_.
+
+Similarly to _PARSEME:MWE_, the information in the 11th column _PARSEME-FR:MWE_ contains one of the following three options:
+ 1. an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. _2-3 du_)
+ 2. an underscore '_' if the MWE/NE annotation is unspecified or missing
+ 3. a list of semicolon-separated **codes** if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see [Lexicalized components and open slots](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=lexicalized) in the PARSEME annotation guidelines).
+    - If the current line contains the first lexicalized component of the MWE/NE in the sentence, the code consists of an **identifier** followed by a colon ':' and a **category-criteria label**:
+      * **identifiers** are integers starting from 1 for each new sentence, and increased by 1 for each new annotation.
+      * **category-criteria labels** are strings corresponding to information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...):
+          1. **POS** is a tag representing the part of speech of the whole MWE/NE. The tags were inferred automatically using heuristics, or defined manually for irregular constructions. **[MARIE add link to POS details here if relevant](XXX)**.
+          2. **CATEGORY** is a tag corresponding to a category that depends on the type of entity being annotated. The prefix of the tag indicates whether this is a MWE (EP for _expression polylexicale_) or NE (EN for _entité nommée_). For verbal MWEs, the suffix corresponds to the [PARSEME 1.1 verbal MWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ). For non-verbal MWEs, the suffix is unspecified (_). For named entities, the suffix corresponds to one of the 5 [NE categories CARLOS ADD LINK](XXX) annotated (**PERS**on, **LOC**ation, **ORG**anization, **PROD**uct, **EVE**nt), with a sub-suffix indicating if the category is **PRIM**itive or **FINAL**.
+          3. **CRITERIA**: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, this corresponds to a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed a MWE. The criteria acronyms correspond to those in the guidelines.
+    - If the current line contains a lexicalized component of the MWE/NE which is not the first one in the sentence, the code consists of the identifier only, as described above, with no category-criteria label.
\ No newline at end of file
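
The three options for the _PARSEME-FR:MWE_ column described in the added text can be sketched as a small parser. This is only an illustration under stated assumptions, not code from the PARSEME-FR tools: the names `MweCode` and `parse_mwe_column` are hypothetical, the sample cell values in the usage note are made up for the example, and the label is assumed to always contain exactly three pipe-separated fields as in the description above.

```python
from typing import List, NamedTuple, Optional


class MweCode(NamedTuple):
    """One code from the 11th column (hypothetical helper type)."""
    identifier: int
    pos: Optional[str] = None             # None when the code has no label
    category: Optional[str] = None        # kept as an opaque string, e.g. prefix EP/EN plus suffix
    criteria: Optional[List[str]] = None  # [] when the field is '_', None when there is no label


def parse_mwe_column(value: str) -> List[MweCode]:
    """Parse one PARSEME-FR:MWE cell into a list of codes.

    '*' (not part of a MWE/NE) and '_' (unspecified) both yield an empty list.
    """
    if value in ("*", "_"):
        return []
    codes = []
    for code in value.split(";"):
        ident, sep, label = code.partition(":")
        if not sep:
            # Non-first lexicalized component: identifier only.
            codes.append(MweCode(int(ident)))
            continue
        # First lexicalized component: POS|CATEGORY|CRITERION1,CRITERION2...
        # (assumption: exactly three fields, otherwise this raises ValueError)
        pos, category, criteria = label.split("|")
        codes.append(MweCode(
            int(ident),
            pos,
            category,
            [] if criteria == "_" else criteria.split(","),
        ))
    return codes
```

For instance, `parse_mwe_column("1:PROPN|EN-PERS.final|_;2")` yields one fully labelled code for annotation 1 and one identifier-only code for annotation 2. To recover the label of an identifier-only code, a caller would look up the first code with the same identifier earlier in the same sentence, since identifiers restart at 1 for each sentence.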