The PARSEME-FR annotated corpus adds an extra annotation layer for multiword expressions (MWEs) and named entities (NEs) on top of the French [Sequoia treebank](https://deep-sequoia.inria.fr/), following the project's internal [annotation guidelines (in French)](Guide-annotation-EP-EN).

The corpus is released in a variant of the [PARSEME Shared Task 2018](http://multiword.sourceforge.net/sharedtask2018) format called _cupt_ (short for **C**onll-**U**+**P**arseme-**T**sv). We give a minimal description of this format here so that the documentation is self-contained; please refer to the [cupt format description page](http://multiword.sourceforge.net/cupt-format) for details. Since _cupt_ is based on CoNLL-U, please also check the Universal Dependencies [CoNLL-U format description page](http://universaldependencies.org/format) and the recommendations for the [CoNLL-U Plus extended format](http://universaldependencies.org/ext-format.html), with which we aim to be compatible.

In short, a _cupt_ file contains text that is already split into sentences. Each sentence is represented with one token per line; blank lines separate sentences, and comment lines starting with a hash (#) carry sentence metadata such as the raw text and the sentence ID. The first line of the _cupt_ file contains special metadata listing the names of the columns. Each token line contains 11 columns carrying information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.). Tokens roughly correspond to words, except for multiword tokens, which are represented as ranges (see UD's [tokenization guide](http://universaldependencies.org/u/overview/tokenization.html) and PARSEME's page on [words and tokens](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=wordsandtokens)). Documentation about the first 10 columns can be found **[here >> MARIE insert link <<](XXX)**. In PARSEME's original _cupt_ format, the 11th column contains MWE annotations. In PARSEME-FR's _cupt_ variant, we extend this column to (a) represent NEs in addition to MWEs and (b) add information about the POS and the annotation criteria to the category labels, as detailed below. Therefore, the name of this column in the header metadata is _PARSEME-FR:MWE_ instead of _PARSEME:MWE_.

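The file layout described above can be read with a few lines of Python. The sketch below is illustrative only: `read_cupt` is a hypothetical helper, the `# global.columns = ...` header syntax is the one used by the CoNLL-U Plus recommendations, columns are assumed to be tab-separated, and the sample is a made-up toy sentence rather than corpus data.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Sentence = Tuple[Dict[str, str], List[Dict[str, str]]]


def read_cupt(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (metadata, tokens) for each sentence of a cupt file."""
    columns = None          # column names from the header line
    meta, tokens = {}, []   # state of the sentence being read
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Header line: the column names follow the "=" sign.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield meta, tokens
            meta, tokens = {}, []
        else:
            if columns is None:
                raise ValueError("missing column header line")
            # Token line: map each column name to its value.
            tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:  # last sentence without a trailing blank line
        yield meta, tokens


# Toy input (invented for illustration).
sample = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME-FR:MWE\n"
    "# text = Jean dort\n"
    "1\tJean\tJean\tPROPN\t_\t_\t2\tnsubj\t_\t_\t1:NOUN|EN-PERS.final|_\n"
    "2\tdort\tdormir\tVERB\t_\t_\t0\troot\t_\t_\t*\n"
    "\n"
)
sentences = list(read_cupt(sample.splitlines()))
```

Keying each token by the column names from the header, rather than by position, means the same reader works whether the last column is named _PARSEME:MWE_ or _PARSEME-FR:MWE_.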
@@ -19,22 +21,23 @@ Similarly to _PARSEME:MWE_, the information in the 11th column _PARSEME-FR:MWE_
3. **CRITERIA**: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, it contains a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed an MWE. The criteria acronyms correspond to those in the guidelines.

- If the current line contains a lexicalized component of the VMWE which is not the first one in the sentence, the VMWE code contains the VMWE identifier only, as described above, and no VMWE category label.

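The rules above can be turned into a small parser for one value of the _PARSEME-FR:MWE_ column. This is a sketch under stated assumptions, not the project's own tooling: `MweCode` and `parse_mwe_column` are hypothetical names, the field order `POS|CATEGORY|CRITERIA` and the `;` separator between codes are inferred from the example on this page, and both `*` and `_` are treated as "no annotation".

```python
from typing import List, NamedTuple, Optional


class MweCode(NamedTuple):
    mwe_id: int               # identifier of the MWE or NE in the sentence
    pos: Optional[str]        # None on non-initial components
    category: Optional[str]   # None on non-initial components
    criteria: List[str]       # empty when the field is "_"


def parse_mwe_column(value: str) -> List[MweCode]:
    """Parse one PARSEME-FR:MWE cell into a list of codes."""
    if value in ("*", "_"):   # assumption: both mean no annotation here
        return []
    codes = []
    for code in value.split(";"):
        ident, sep, label = code.partition(":")
        if not sep:
            # Non-initial component: identifier only, no label.
            codes.append(MweCode(int(ident), None, None, []))
            continue
        # First component: full label (assumed order POS|CATEGORY|CRITERIA).
        pos, category, criteria = label.split("|")
        crit = [] if criteria == "_" else criteria.split(",")
        codes.append(MweCode(int(ident), pos, category, crit))
    return codes


first = parse_mwe_column("3:NOUN|EN-ORG.final|_;4:NOUN|EP-_|LEX")
later = parse_mwe_column("3;4")
```

Here `first` holds two fully labelled codes (only the second one has a criterion, `LEX`), while `later` holds the same two identifiers with no label, as expected for a non-initial component.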
### Example

**TODO ADD REAL EXAMPLE**

Here is an example of a sentence in the PARSEME-FR _cupt_ format described above.

```
1 Jean ... 1:NOUN|EN-PERS.final|_
2 fait ... 2:VERB|EP-VID|_
3 partie ... 2
5 du ... *
6 de ... *
7 le ... *
8 conseil ... 3:NOUN|EN-ORG.final|_;4:NOUN|EP-_|LEX
9 général ... 3;4
10 de ... 3
11 l' ... 3
12 Ardèche ... 3;5:NOUN|EN-LOC.final|_
```
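To see which tokens belong to which annotation, the last column can be grouped by identifier. The snippet below is a rough sketch: `group_by_annotation` is a hypothetical helper, and `rows` copies only the first few (form, last column) pairs from the example, so the organisation name with identifier 3 is cut short here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (FORM, PARSEME-FR:MWE) pairs copied from the start of the example.
rows = [
    ("Jean", "1:NOUN|EN-PERS.final|_"),
    ("fait", "2:VERB|EP-VID|_"),
    ("partie", "2"),
    ("du", "*"),
    ("conseil", "3:NOUN|EN-ORG.final|_;4:NOUN|EP-_|LEX"),
    ("général", "3;4"),
]


def group_by_annotation(
    rows: List[Tuple[str, str]]
) -> Tuple[Dict[int, List[str]], Dict[int, str]]:
    """Collect the token forms and the label of each annotation."""
    members = defaultdict(list)
    labels = {}
    for form, column in rows:
        if column in ("*", "_"):      # token outside any annotation
            continue
        for code in column.split(";"):
            ident, sep, label = code.partition(":")
            members[int(ident)].append(form)
            if sep:                   # only the first component has a label
                labels[int(ident)] = label
    return dict(members), labels


members, labels = group_by_annotation(rows)
```

With these rows, identifier 2 groups _fait_ and _partie_, and identifiers 3 and 4 both cover _conseil_ and _général_, which shows how a token can belong to a named entity and an MWE at the same time.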