Changes

Marie Candito · bf3910fe
--- a/Corpus-format-description.md
+++ b/Corpus-format-description.md
@@ -4,9 +4,16 @@

 The PARSEME-FR annotated corpus adds an extra annotation layer for multiword expressions (MWEs) and named entities (NEs) on top of the French [Sequoia treebank](https://deep-sequoia.inria.fr/), using the project's internal [annotation guidelines (in French)](Guide-annotation-EP-EN).

-The corpus is released using a variant of the [PARSEME Shared task 2018](http://multiword.sourceforge.net/sharedtask2018) format, called _cupt_ (short for **C**onll-**U**+**P**arseme-**T**sv). Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the [cupt format description page](http://multiword.sourceforge.net/cupt-format) for details. Since _cupt_ is based on Conll-U, please also check the Universal Dependencies [Conll-U format description page](http://universaldependencies.org/format.html) and the recommendations for [Conll-U Plus extended format](http://universaldependencies.org/ext-format.html), which we aim to be compatible with.
+The corpus is released using a variant of the [PARSEME Shared task 2018](http://multiword.sourceforge.net/sharedtask2018) format, called _cupt_ (short for **C**onll-**U**+**P**arseme-**T**sv), in which:
+  - the first 10 columns follow the [Conll-U format](http://universaldependencies.org/format.html), and encode morphological and syntactic annotations
+  - the eleventh column contains the MWE and NE annotations

-In short, a _cupt_ file contains split sentences, each represented with one token per line, blank lines separating sentences, and comments preceded by hashes (#) to add sentence meta-data such as raw text and sentence IDs. The first line of the _cupt_ file contains special metadata listing the names of each column. Each token on a line contains 11 columns, corresponding to linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc). Tokens roughly correspond to words, except for multiword tokens represented as ranges (see UD's [tokenization guide](http://universaldependencies.org/u/overview/tokenization.html) and PARSEME's page on [words and tokens](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=wordsandtokens)). Documentation about the 10 first columns can be found **[here >> MARIE insert link <<](XXX)**. In PARSEME's original _cupt_ format, the 11th column contains MWE annotations. In PARSEME-FR's _cupt_ variant, we extend it to (a) also represent NEs in addition to MWEs and (b) add extra information about the POS and criteria underlying the category labels, as detailed below. Therefore, the name of this column in the header metadata is _PARSEME-FR:MWE_ instead of _PARSEME:MWE_.
+Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the [cupt format description page](http://multiword.sourceforge.net/cupt-format) for details. Since _cupt_ is based on Conll-U, please also check the Universal Dependencies [Conll-U format description page](http://universaldependencies.org/format.html) and the recommendations for the [Conll-U Plus extended format](http://universaldependencies.org/ext-format.html), with which we aim to be compatible. In short, a _cupt_ file contains sentence-split text with one token per line, blank lines separating sentences, and comment lines starting with a hash (#) that carry sentence metadata such as the raw text and the sentence ID. The first line of the _cupt_ file contains special metadata listing the name of each column. Each token line contains 11 columns of linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.). Tokens roughly correspond to words, except for multiword tokens, which are represented as ranges (see UD's [tokenization guide](http://universaldependencies.org/u/overview/tokenization.html) and PARSEME's page on [words and tokens](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=wordsandtokens)).
+
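The file layout described above can be read with a few lines of Python. This is only a minimal sketch, not a reference implementation: it assumes columns are tab-separated (as in Conll-U), and the names `read_cupt` and `row`, the header line layout and the second sample sentence are made up for the illustration.

```python
def read_cupt(lines):
    """Yield (comments, tokens) for each sentence of a cupt-like line stream.

    Assumption: columns are separated by tabs, as in Conll-U.
    """
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence (if any).
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            # Comment lines carry metadata (column names, text, sentence ID).
            comments.append(line)
        else:
            tokens.append(line.split("\t"))
    if tokens:
        # The last sentence may not be followed by a blank line.
        yield comments, tokens


def row(tok_id, form, annot):
    """Hypothetical helper: build an 11-column line with '_' placeholders."""
    return "\t".join([tok_id, form] + ["_"] * 8 + [annot])


# Made-up sample; the header layout shown here is an assumption.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC PARSEME-FR:MWE")
SAMPLE_LINES = [
    HEADER,
    "# sent_id = 123",
    row("1", "Chez", "*"),
    row("2", "Peugeot", "1:PROPN|NE-PERS.prim|_;2:PROPN|NE-ORG.final|_"),
    "",
    "# sent_id = 124",
    row("1", "Oui", "*"),
]

for comments, tokens in read_cupt(SAMPLE_LINES):
    # Print the form and the 11th column of every token line.
    print(comments[-1], [(t[1], t[10]) for t in tokens])
```

Because the column-name line starts with a hash, this sketch simply keeps it among the comments of the first sentence.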
+
+### Format of the MWE/NE annotation layer (11th column)
+
+In PARSEME's original _cupt_ format, the 11th column contains MWE annotations. In PARSEME-FR's _cupt_ variant, we extend it to (a) also represent NEs in addition to MWEs and (b) add extra information about the POS and criteria underlying the category labels, as detailed below. Therefore, the name of this column in the header metadata is _PARSEME-FR:MWE_ instead of _PARSEME:MWE_.

 Similarly to _PARSEME:MWE_, the information in the 11th column called _PARSEME-FR:MWE_ contains one of the following three options:
 1. an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. _2-3 du_)
@@ -19,7 +26,8 @@ Similarly to _PARSEME:MWE_, the information in the 11th column called _PARSEME-F
    - for the (linearly) first component of a MWE/NE, the code consists of an identifier followed by a colon ':' and a **LABEL**:
      * **LABELS** provide information about the MWE/NE and are composed of a **POS** field, a **CATEGORY** field and a **CRITERIA** field, separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...; for instance, ADP|MWE|IRREG describes a MWE (not a NE) whose part of speech is ADP and for which the criterion IRREG has been used):

-          1. **POS** is a tag representing the part of speech of the whole MWE/NE, using the tagset of the syntactic annotation scheme (either UD or FTBdep), except for some MWEs that were classified as syntactically regular, in which case the POS is irrelevant ("_" is used). Please refer to the [page describing the heuristics used to classify MWEs as regular / irregular, and to assign the POS to irregular ones](reg-irreg-pos-heuristics) for details. Named entities always have the POS for proper nouns. 
+          1. **POS** is a tag representing the part of speech of the whole MWE/NE, using the tagset of the syntactic annotation scheme (either UD or FTBdep), except for some MWEs that were classified as syntactically regular, in which case the POS is irrelevant ("_" is used). Please refer to the [page describing the heuristics used to classify MWEs as regular / irregular, and to assign the POS to irregular ones](reg-irreg-pos-heuristics) for details.
+             - Named entities always have the POS for proper nouns. 
          2. **CATEGORY** corresponds to the type of unit being annotated. The category starts with either "NE" for named entities, or "MWE" for multi-word expressions that are not named entities.
              * MWE is used for non-verbal MWEs
              * for verbal MWEs, the category has a suffix corresponding to the [PARSEME 1.1 verbal MWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=categ), hence the categories:
@@ -36,26 +44,55 @@ Similarly to _PARSEME:MWE_, the information in the 11th column called _PARSEME-F
    - If the current line contains a lexicalized component which is not the first one in the current MWE, the MWE code contains the MWE identifier only, as described above, and no MWE category label.
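A value of the 11th column can be decoded along the lines below. This is a hedged sketch rather than an official parser: the names `MweCode` and `parse_mwe_column` are hypothetical, the ';' separator between codes is inferred from the example in this page, treating a bare '_' as "no annotation available" follows the original PARSEME convention, and reading a '_' criteria field as "no criterion recorded" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MweCode:
    """One MWE/NE code attached to a token (hypothetical structure)."""
    ident: int
    pos: Optional[str] = None        # None on non-initial components
    category: Optional[str] = None   # None on non-initial components
    criteria: List[str] = field(default_factory=list)


def parse_mwe_column(value: str) -> List[MweCode]:
    """Decode one PARSEME-FR:MWE cell into a list of codes."""
    if value in ("*", "_"):
        # '*': not part of any MWE/NE; '_': assumed to mean "unspecified".
        return []
    codes = []
    for code in value.split(";"):          # several units may share a token
        ident, sep, label = code.partition(":")
        if not sep:
            # Non-initial component: identifier only, no label.
            codes.append(MweCode(int(ident)))
            continue
        pos, category, criteria = label.split("|")
        # Assumption: '_' in the criteria field means no criterion recorded.
        crit = [] if criteria == "_" else criteria.split(",")
        codes.append(MweCode(int(ident), pos, category, crit))
    return codes


print(parse_mwe_column("3:ADV|MWE|IRREG"))
print(parse_mwe_column("5;7:PROPN|NE-LOC.final|_"))
```

The POS field is kept as a raw string, so a syntactically regular MWE still shows "_" there.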

 ### Example 
-**TODO ADD REAL EXAMPLE**

-Here is an example of sentence using the PARSEME-FR _cupt_ format described above.
+**(marie: I think a constructed example lets us cover more phenomena, ?)**
+
+Here is an example sentence in the PARSEME-FR _cupt_ format described above,
+showing only columns 1 (ID), 2 (FORM) and 11 (MWE / NE annotation).
+
+For example, "Peugeot" is annotated as a final ORG named entity (NE-ORG.final) with identifier 2, and also as a primary PERS named entity (NE-PERS.prim) with identifier 1.
+
+"tout au plus" is annotated as a MWE: more precisely, the tokens "tout", "à", "le" and "plus" are annotated with identifier 3 (the multi-word token "au" itself is not annotated). Its part of speech is "ADV", meaning that it behaves as an adverb, but it is considered syntactically irregular. The criterion used to annotate it is "IRREG".
+
+The sentence also contains an example of embedding: tokens 18 to 22 form a named entity of type ORG, which itself contains a MWE ("conseil général") and a named entity of type LOC ("Essonne").

 ```
-# text = Jean fait partie du conseil général de l'Ardèche.
+# text = Chez Peugeot tout au plus on savait que Jean Gapé ne faisait plus partie du conseil général de l'Essonne.
 # sent_id = 123
 # source_sent_id = http://deep-sequoia.inria.fr/download/sequoia-8.2.tgz sequoia-8.2/sequoia.surf.conll 123
-1 Jean     ... 1:NOUN|EN-PERS.final|_
-2 fait     ... 2:VERB|EP-VID|_
-3 partie   ... 2
-5 du       ... *
-6 de       ... *
-7 le       ... *
-8 conseil  ... 3:NOUN|EN-ORG.final|_;4:NOUN|EP-_|LEX
-9 général  ... 3;4
-10 de      ... 3
-11 l'      ... 3
-12 Ardèche ... 3;5:NOUN|EN-LOC.final|_
-13 .       ... *
+1 Chez     ... *
+2 Peugeot  ... 1:PROPN|NE-PERS.prim|_;2:PROPN|NE-ORG.final|_
+3 tout     ... 3:ADV|MWE|IRREG
+4-5 au     ... *
+4 à        ... 3
+5 le       ... 3
+6 plus     ... 3
+7 on       ... *
+8 savait   ... *
+9 que      ... *
+10 Jean    ... 4:PROPN|NE-PERS.final|_
+11 Gapé    ... 4
+12 ne      ... *
+13 faisait ... 8:_|MWE-VID|_
+14 plus    ... *
+15 partie  ... 8
+16-17 du   ... *
+16 de      ... *
+17 le      ... *
+18 conseil ... 5:PROPN|NE-ORG.final|_;6:_|MWE|LEX
+19 général ... 5;6
+20 de      ... 5
+21 l'      ... 5
+22 Essonne ... 5;7:PROPN|NE-LOC.final|_
+23 .       ... *
 ```
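To see how the identifiers tie tokens together, the sketch below regroups tokens 13 to 22 of the example into units. It is illustrative only: `collect_units` is a hypothetical name, the rows are reduced to the three columns shown above (ID, FORM, annotation), and the labels use the "NE" prefix described in the CATEGORY section.

```python
def collect_units(rows):
    """Group token forms by MWE/NE identifier.

    rows: iterable of (token ID, form, 11th-column value) triples.
    Returns {identifier: {"label": str or None, "forms": [str, ...]}}.
    """
    units = {}
    for tok_id, form, annot in rows:
        # Range lines (e.g. "16-17") and unannotated tokens carry '*'.
        if "-" in tok_id or annot == "*":
            continue
        for code in annot.split(";"):
            ident, _, label = code.partition(":")
            unit = units.setdefault(ident, {"label": None, "forms": []})
            if label:
                # Only the first component of a unit carries the label.
                unit["label"] = label
            unit["forms"].append(form)
    return units


ROWS = [
    ("13", "faisait", "8:_|MWE-VID|_"),
    ("14", "plus", "*"),
    ("15", "partie", "8"),
    ("16-17", "du", "*"),
    ("16", "de", "*"),
    ("17", "le", "*"),
    ("18", "conseil", "5:PROPN|NE-ORG.final|_;6:_|MWE|LEX"),
    ("19", "général", "5;6"),
    ("20", "de", "5"),
    ("21", "l'", "5"),
    ("22", "Essonne", "5;7:PROPN|NE-LOC.final|_"),
]

for ident, unit in collect_units(ROWS).items():
    print(ident, unit["label"], " ".join(unit["forms"]))
```

Units 6 and 7 end up nested inside unit 5 simply because their tokens also carry identifier 5; discontinuous units such as "faisait ... partie" need no special handling.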

+### Morpho-syntactic annotation schemes, and link to the MWE / NE annotations
+
+The PARSEME-FR annotated corpus comes in two variants for the morpho-syntactic annotations:
+
+ - the original dependency scheme of the [Sequoia deep corpus](https://deep-sequoia.inria.fr/)
+ - the Sequoia corpus converted to Universal Dependencies
+
+**TO BE CONTINUED HERE**