Corpus format description · Wiki · PARSEME-FR / PARSEME-FR-public

THIS PAGE BELONGS TO THE PUBLIC DOC OF PARSEME-FR

The PARSEME-FR annotated corpus adds an extra annotation layer for multiword expressions (MWEs) and named entities (NEs) on top of the French Sequoia treebank, following the project's internal annotation guidelines (in French).

The corpus is released using a variant of the PARSEME Shared task 2018 format, called cupt (short for Conll-U+Parseme-Tsv). Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the cupt format description page for details. Since cupt is based on Conll-U, please also check the Universal Dependencies Conll-U format description page and the recommendations for Conll-U Plus extended format, which we aim to be compatible with.

In short, a cupt file contains sentence-split text: each sentence is represented with one token per line, blank lines separate sentences, and comment lines starting with a hash (#) carry sentence metadata such as the raw text and the sentence ID. The first line of the cupt file contains special metadata listing the name of each column. Each token line contains 11 columns, corresponding to linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc). Tokens roughly correspond to words, except for multiword tokens, which are represented as ranges (see UD's tokenization guide and PARSEME's page on words and tokens). Documentation about the first 10 columns can be found here >> MARIE insert link <<. In PARSEME's original cupt format, the 11th column contains MWE annotations. In PARSEME-FR's cupt variant, we extend it to (a) represent NEs in addition to MWEs and (b) add extra information about the POS and the criteria underlying the category labels, as detailed below. Therefore, the name of this column in the header metadata is PARSEME-FR:MWE instead of PARSEME:MWE.
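The file layout described above can be read with a few lines of Python. This is a minimal sketch, not an official reader: it assumes that columns are tab-separated (as in Conll-U) and that the header line follows the Conll-U Plus convention `# global.columns = ...`. The function name `read_cupt`, the helper `_row` and the sample data are illustrative only, and the `_` placeholders stand in for columns that this page does not show.

```python
def read_cupt(lines):
    """Yield one (comments, tokens) pair per sentence.

    comments: comment lines of the sentence, without the leading '#'.
    tokens:   one dict per token line, mapping column names to values.
    """
    columns = None
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            body = line[1:].strip()
            if body.startswith("global.columns"):
                # Assumed header syntax: "# global.columns = ID FORM ..."
                columns = body.split("=", 1)[1].split()
            else:
                comments.append(body)
        else:
            if columns is None:
                raise ValueError("token line found before the column header")
            fields = line.split("\t")  # assumption: tab-separated columns
            if len(fields) != len(columns):
                raise ValueError(f"expected {len(columns)} columns: {line!r}")
            tokens.append(dict(zip(columns, fields)))
    if tokens:
        # The last sentence may not be followed by a blank line.
        yield comments, tokens


def _row(idx, form, code):
    """Build an 11-column token line with '_' for the columns not shown here."""
    return "\t".join([str(idx), form] + ["_"] * 8 + [code])


HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC PARSEME-FR:MWE")

# Illustrative sample only, not taken from the released corpus.
SAMPLE = "\n".join([
    HEADER,
    "# text = Jean fait partie",
    _row(1, "Jean", "1:NOUN|EN-PERS.final|_"),
    _row(2, "fait", "2:VERB|EP-VID|_"),
    _row(3, "partie", "2"),
    "",
])

for comments, tokens in read_cupt(SAMPLE.splitlines()):
    print(comments)
    for token in tokens:
        print(token["ID"], token["FORM"], token["PARSEME-FR:MWE"])
```

Reading the column names from the header, rather than hard-coding positions, keeps the sketch usable for both the PARSEME:MWE and the PARSEME-FR:MWE variants.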

As with PARSEME:MWE, the 11th column, named PARSEME-FR:MWE, contains one of the following three options:

an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. 2-3 du)
an underscore '_' if the MWE/NE annotation is unspecified or missing
a list of semicolon-separated codes if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see Lexicalized components and open slots in the PARSEME annotation guidelines).
- for all the components of a MWE/NE except the (linearly) first one, the code is simply an identifier:
  - the identifier of a MWE/NE is an integer, greater than or equal to 1, and is unique within the sentence: the only requirement for identifiers is that all the components of a MWE/NE must have codes starting with the same identifier and that no other MWE/NE in the sentence uses it.
- for the (linearly) first component of a MWE/NE, the code consists of an identifier followed by a colon ':' and a POS|CATEGORY|CRITERIA label:
  - POS|CATEGORY|CRITERIA labels provide information about the MWE/NE. These labels are composed of three fields separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...; for instance, ADP|MWE|IRREG describes a MWE (not a NE) whose part of speech is ADP and for which the criterion IRREG has been used):
    1. POS is a tag representing the part of speech of the whole MWE/NE, or a "_" if the MWE was classified as regular. For details, please refer to the page describing the heuristics used to classify MWEs as regular or irregular and to assign a POS to irregular ones.
    2. CATEGORY is a tag corresponding to a category that depends on the type of entity being annotated. It contains a prefix and a suffix, separated by a dash.
      - The prefix of the tag indicates whether this is a MWE (EP for expression polylexicale) or a NE (EN for entité nommée).
      - The suffix depends on the prefix as follows:
        
        - For verbal MWEs (POS is VERB, prefix is EP), the suffix corresponds to the PARSEME 1.1 verbal MWE categories.
        - For non-verbal MWEs (POS is not VERB, prefix is EP), the suffix is unspecified (_).
        - For named entities (prefix is EN), the suffix corresponds to one of the 5 NE categories CARLOS ADD LINK annotated (PERSon, LOCation, ORGanization, PRODuct, EVEnt), with a sub-suffix indicating whether the category is PRIMitive or FINAL.
    3. CRITERIA: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, this corresponds to a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed a MWE. The criteria acronyms correspond to those in the guidelines, e.g. CRAN, ID, PRED, LEX, etc.
- If the current line contains a lexicalized component that is not the first one in the current MWE/NE, the code contains the identifier only, as described above, and no label.
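The rules above can be turned into a small parser for the value of the PARSEME-FR:MWE column. The following is a sketch under the assumptions stated on this page only (semicolon-separated codes, a colon between identifier and label, pipe-separated label fields, comma-separated criteria). The names `Label`, `Code` and `parse_mwe_column` are hypothetical and not part of any PARSEME tool.

```python
from typing import NamedTuple, Optional, Tuple


class Label(NamedTuple):
    pos: str                   # POS of the whole MWE/NE, or "_"
    category: str              # e.g. "EP-VID" or "EN-ORG.final"
    criteria: Tuple[str, ...]  # empty when the field is "_"


class Code(NamedTuple):
    ident: int                 # MWE/NE identifier, unique within the sentence
    label: Optional[Label]     # only present on the first component


def parse_mwe_column(value):
    """Parse one PARSEME-FR:MWE cell.

    Returns [] for '*' (not part of a MWE/NE), None for '_' (unspecified),
    and a list of Code tuples otherwise.
    """
    if value == "*":
        return []
    if value == "_":
        return None
    codes = []
    for chunk in value.split(";"):
        ident, colon, rest = chunk.partition(":")
        label = None
        if colon:
            # First component: identifier followed by POS|CATEGORY|CRITERIA.
            pos, category, criteria = rest.split("|")
            crit = () if criteria == "_" else tuple(criteria.split(","))
            label = Label(pos, category, crit)
        codes.append(Code(int(ident), label))
    return codes


print(parse_mwe_column("3:NOUN|EN-ORG.final|_;4:NOUN|EP-_|LEX"))
print(parse_mwe_column("3;4"))
```

Returning `None` for `_` and an empty list for `*` keeps the distinction between "annotation missing" and "annotated as not part of any MWE/NE".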

Example

TODO ADD REAL EXAMPLE

Here is an example sentence in the PARSEME-FR cupt format described above.

# text = Jean fait partie du conseil général de l'Ardèche.
# sent_id = 123
# source_sent_id = http://deep-sequoia.inria.fr/download/sequoia-8.2.tgz sequoia-8.2/sequoia.surf.conll 123
1 Jean     ... 1:NOUN|EN-PERS.final|_
2 fait     ... 2:VERB|EP-VID|_
3 partie   ... 2
4-5 du     ... *
4 de       ... *
5 le       ... *
6 conseil  ... 3:NOUN|EN-ORG.final|_;4:NOUN|EP-_|LEX
7 général  ... 3;4
8 de       ... 3
9 l'       ... 3
10 Ardèche ... 3;5:NOUN|EN-LOC.final|_
11 .       ... *
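
To see which words belong to which MWE/NE in the example above, the codes can be grouped by identifier. The sketch below hard-codes the forms and codes of the example (the columns elided with "..." are left out), and the function name `group_entities` is illustrative.

```python
# Forms and PARSEME-FR:MWE codes copied from the example sentence.
TOKENS = [
    ("Jean", "1:NOUN|EN-PERS.final|_"),
    ("fait", "2:VERB|EP-VID|_"),
    ("partie", "2"),
    ("du", "*"),
    ("de", "*"),
    ("le", "*"),
    ("conseil", "3:NOUN|EN-ORG.final|_;4:NOUN|EP-_|LEX"),
    ("général", "3;4"),
    ("de", "3"),
    ("l'", "3"),
    ("Ardèche", "3;5:NOUN|EN-LOC.final|_"),
    (".", "*"),
]


def group_entities(tokens):
    """Map each MWE/NE identifier to its label and its component forms."""
    entities = {}
    for form, column in tokens:
        if column in ("*", "_"):
            continue  # not part of a MWE/NE, or annotation unspecified
        for code in column.split(";"):
            ident, _, label = code.partition(":")
            entry = entities.setdefault(int(ident), {"label": None, "forms": []})
            if label:
                # Only the first component carries the label.
                entry["label"] = label
            entry["forms"].append(form)
    return entities


for ident, entry in sorted(group_entities(TOKENS).items()):
    print(ident, entry["label"], " ".join(entry["forms"]))
```

The grouping shows the nesting in the example: identifier 3 (EN-ORG) covers "conseil général de l' Ardèche", while identifiers 4 (a non-verbal MWE, "conseil général") and 5 (EN-LOC, "Ardèche") are embedded in it.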
