... | @@ -8,17 +8,21 @@ The corpus is released using a variant of the [PARSEME Shared task 2018](http:// |
- the first 10 columns follow the [CoNLL-U format](http://universaldependencies.org/format.html), and encode morphological and syntactic annotations
- the eleventh column contains the MWE and NE annotations

Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the [cupt format description page](http://multiword.sourceforge.net/cupt-format) for details. Since _cupt_ is based on CoNLL-U, please also check the Universal Dependencies [CoNLL-U format description page](http://universaldependencies.org/format.html) and the recommendations for the [CoNLL-U Plus extended format](http://universaldependencies.org/ext-format.html), with which we aim to be compatible. In short, a _cupt_ file contains sentence-split text: each sentence is represented with one token per line, blank lines separate sentences, and comment lines starting with a hash (#) carry sentence metadata such as the raw text and the sentence ID. The first line of the _cupt_ file contains special metadata listing the names of the columns. Each token line contains 11 columns: the first 10 hold linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.), and the 11th holds the MWE and NE annotations. Tokens roughly correspond to words, except for multiword tokens represented as ranges (see UD's [tokenization guide](http://universaldependencies.org/u/overview/tokenization.html) and PARSEME's page on [words and tokens](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=wordsandtokens)).
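As an illustration of the layout described above, here is a minimal Python sketch of a reader for such files. It is not an official PARSEME-FR tool. It assumes that the column-name line has the CoNLL-U Plus form `# global.columns = ...` and that columns are tab-separated; the names `read_cupt`, `is_range` and `SAMPLE` are hypothetical, and the toy sentence is invented for the example (it is not taken from the corpus and carries no MWE/NE annotation).

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: the column-name line follows the CoNLL-U Plus convention.
HEADER_PREFIX = "# global.columns ="

Sentence = Tuple[List[str], List[Dict[str, str]]]


def read_cupt(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (comments, tokens) for each sentence of a cupt-like file."""
    columns = None
    comments: List[str] = []
    tokens: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(HEADER_PREFIX):
            # Header metadata: the names of the columns.
            columns = line[len(HEADER_PREFIX):].split()
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            # Sentence-level metadata such as raw text or sentence ID.
            comments.append(line[1:].strip())
        else:
            if columns is None:
                raise ValueError("token line found before the column header")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise ValueError(f"expected {len(columns)} columns: {line!r}")
            tokens.append(dict(zip(columns, fields)))
    if tokens:  # last sentence when the file lacks a final blank line
        yield comments, tokens


def is_range(token: Dict[str, str]) -> bool:
    """True for multiword-token range lines such as ID '3-4'."""
    return "-" in token["ID"]


def _row(tok_id: str, form: str) -> str:
    # Toy token line: only ID and FORM are filled in, '*' in column 11.
    return "\t".join([tok_id, form] + ["_"] * 8 + ["*"])


SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME-FR:MWE",
    "# text = Il va au marché",
    _row("1", "Il"), _row("2", "va"), _row("3-4", "au"),
    _row("3", "à"), _row("4", "le"), _row("5", "marché"),
    "",
])

if __name__ == "__main__":
    for comments, tokens in read_cupt(SAMPLE.splitlines()):
        print(comments, [t["FORM"] for t in tokens if not is_range(t)])
```

Keeping each token as a dictionary keyed by the header's column names means the same reader works whether the last column is named _PARSEME:MWE_ or _PARSEME-FR:MWE_.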

### Format of columns 1 to 10: Morpho-syntactic annotation schemes

The PARSEME-FR annotated corpus comes in two variants for the morpho-syntactic annotations:
- the original dependency scheme for the [Sequoia deep corpus](https://deep-sequoia.inria.fr/), namely the FTBdep dependency scheme
- the [Sequoia corpus converted to Universal Dependencies, version 2.3](https://github.com/UniversalDependencies/UD_French-Sequoia/tree/master)

### Interaction between syntactic annotation and MWE status

The syntactic representation and the MWE/NE annotations do interact. In particular, the syntactically irregular MWEs receive a specific representation in the syntactic dependency tree (with "dep_cpd" labels for the FTBdep scheme, and "fixed" for the UD scheme). See the specific page on [Interaction between syntactic annotation and MWE status](interaction_syntax_mwe).

### Format of column 11: MWE/NE annotations

In PARSEME's original _cupt_ format, the 11th column contains MWE annotations. In PARSEME-FR's _cupt_ variant, we extend it to (a) also represent NEs in addition to MWEs and (b) add extra information about the POS and criteria underlying the category labels, as detailed below. Therefore, the name of this column in the header metadata is _PARSEME-FR:MWE_ instead of _PARSEME:MWE_.
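
The column name in the header can be used to tell the two variants apart. The following Python sketch shows one way to do this. It assumes that the header line has the CoNLL-U Plus form `# global.columns = ...`; the function names `mwe_column_name` and `is_parseme_fr_variant` are hypothetical.

```python
# Assumption: the column-name line follows the CoNLL-U Plus convention.
HEADER_PREFIX = "# global.columns ="


def mwe_column_name(header_line: str) -> str:
    """Return the name of the 11th column declared in the header line."""
    if not header_line.startswith(HEADER_PREFIX):
        raise ValueError("not a column header line")
    names = header_line[len(HEADER_PREFIX):].split()
    if len(names) < 11:
        raise ValueError("header declares fewer than 11 columns")
    return names[10]


def is_parseme_fr_variant(header_line: str) -> bool:
    """True if column 11 is named PARSEME-FR:MWE rather than PARSEME:MWE."""
    return mwe_column_name(header_line) == "PARSEME-FR:MWE"
```

A reader could call `is_parseme_fr_variant` on the first line of a file before deciding how to interpret the values of the 11th column.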
... | @@ -96,6 +100,4 @@ The sentence contains an example of a word (support verb "effectuait") belonging |
27 . ... *
```