The corpus is released using a variant of the PARSEME Shared task 2018 format, called cupt (short for Conll-U+Parseme-Tsv), in which:
the first 10 columns follow the Conll-U format, and encode morphological and syntactic annotations
the eleventh column contains the MWE and NE annotations
Here we give a minimal description of this format, so that the documentation is self-contained. Please refer to the cupt format description page for details. Since cupt is based on Conll-U, please also check the Universal Dependencies Conll-U format description page and the recommendations for the Conll-U Plus extended format, with which we aim to be compatible. In short, a cupt file contains sentence-split text: each sentence is represented with one token per line, blank lines separate sentences, and comment lines preceded by a hash (#) carry sentence metadata such as the raw text and the sentence ID. The first line of the cupt file contains special metadata listing the names of the columns. Each token line contains 11 columns: the first 10 encode linguistic information about the token's form, morphology and syntax (ID, FORM, LEMMA, UPOS, etc.), and the eleventh encodes the MWE/NE annotations. Tokens roughly correspond to words, except for multiword tokens, which are represented as ranges (see UD's tokenization guide and PARSEME's page on words and tokens).
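The file layout described above can be read with a few lines of code. The following is a minimal sketch, not an official reader: it assumes that the column-name metadata line follows the Conll-U Plus convention (`# global.columns = ...`), that columns are tab-separated as in Conll-U, and that the names `read_cupt`, `DEFAULT_COLUMNS` and `SAMPLE` are ours. The toy sentence is made up for illustration and is not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed fallback column names: the 10 standard Conll-U columns
# plus the PARSEME-FR annotation column.
DEFAULT_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
                   "HEAD", "DEPREL", "DEPS", "MISC", "PARSEME-FR:MWE"]


def read_cupt(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], List[Dict[str, str]]]]:
    """Yield one (metadata, tokens) pair per sentence."""
    columns = list(DEFAULT_COLUMNS)
    meta: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Header line listing the column names (Conll-U Plus convention).
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Sentence-level comment such as "# sent_id = ..." or "# text = ...".
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield meta, tokens
            meta, tokens = {}, []
        else:
            # Token line: one tab-separated value per column.
            tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:
        yield meta, tokens


def _row(tok_id: str, form: str, mwe: str = "*") -> str:
    # Build an 11-column line with unspecified morpho-syntactic columns.
    return "\t".join([tok_id, form] + ["_"] * 8 + [mwe])


# Made-up toy input containing one multiword token (range 1-2).
SAMPLE = "\n".join([
    "# global.columns = " + " ".join(DEFAULT_COLUMNS),
    "# sent_id = toy-1",
    "# text = du pain",
    _row("1-2", "du"), _row("1", "de"), _row("2", "le"), _row("3", "pain"),
    "",
])

for meta, tokens in read_cupt(SAMPLE.splitlines()):
    print(meta["sent_id"], [t["FORM"] for t in tokens])
```

Note that the multiword token range (1-2) is kept as an ordinary entry in the token list; callers that only want syntactic words can skip entries whose ID contains a hyphen.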
Columns 1 to 10: Morpho-syntactic annotation schemes
The PARSEME-FR annotated corpus comes in two variants for the morpho-syntactic annotations:
the original dependency scheme for the sequoia deep corpus, namely the FTBdep dependency scheme
Interaction between syntactic annotation and MWE status
The syntactic representation and the MWE/NE annotations do interact.
MWEs are categorized as either "syntactically regular" or "syntactically irregular". The former get a "normal" syntactic representation, whereas the latter get a specific flat representation in the syntactic dependency tree (with "dep_cpd" labels in the FTBdep scheme, and "fixed" in the UD scheme). See the specific page on Interaction between syntactic annotation and MWE status.
Column 11: MWE/NE annotations
In PARSEME's original cupt format, the 11th column contains MWE annotations. In PARSEME-FR's cupt variant, we extend it to (a) also represent NEs in addition to MWEs and (b) add extra information about the POS and criteria underlying the category labels, as detailed below. Therefore, the name of this column in the header metadata is PARSEME-FR:MWE instead of PARSEME:MWE.
Similarly to PARSEME:MWE, the 11th column (PARSEME-FR:MWE) contains one of the following three options:
an asterisk '*' for words that are not part of a MWE/NE and for multiword tokens (e.g. 2-3 du)
an underscore '_' if the MWE/NE annotation is unspecified or missing
a list of semicolon-separated CODES if the current word is part of one or more MWEs/NEs. Codes are only assigned to the lexicalized components of a MWE/NE (see Lexicalized components and open slots in the PARSEME annotation guidelines).
for all the components of a MWE/NE except the (linearly) first one, the CODE is simply an IDENTIFIER:
the IDENTIFIER of a MWE/NE is an integer greater than or equal to 1 that is unique within the sentence: the only requirement for IDENTIFIERS is that all the components of a MWE/NE must have CODES starting with the same IDENTIFIER, and that no other MWE/NE in the sentence uses it.
for the (linearly) first component of a MWE/NE, the code consists of an identifier followed by a colon ':' and a LABEL:
LABELS provide information about the MWE/NE and are composed of a POS field, a CATEGORY field and a CRITERIA field, separated by a pipe '|' character (i.e. POS|CATEGORY|CRITERION1,CRITERION2...; for instance, ADP|MWE|IRREG describes a MWE (not a NE) whose part of speech is ADP and for which the criterion IRREG has been used):
POS is a tag representing the part of speech of the whole MWE/NE, using the tagset of the syntactic annotation scheme (either UD or FTBdep), except for some MWEs that were classified as syntactically regular, in which case the POS is irrelevant ("_" is used). Please refer to the interaction page for the heuristics used to classify MWEs as regular or irregular, and to assign a POS to the irregular ones.
Named entities always have the POS for proper nouns.
CATEGORY corresponds to the type of unit being annotated. The category either starts with "NE" for named entities, or with "MWE" for multi-word expressions that are not named entities.
For named entities, the category starts with "NE-", followed by the type of named entity (PERSon, LOCation, ORGanization, PRODuct, EVEnt), followed by ".final" or ".prim" to indicate whether the type matches the interpretation of the NE in context (.final) or only its a priori interpretation (.prim). See NE categories for more details.
for instance a person named entity will get the category NE-PERS.final
a location used in context for an organization (as in "they played against Chicago") will lead to two NE annotations: one with category NE-LOC.prim and one with category NE-ORG.final
CRITERIA: For verbal MWEs and NEs, this field is unspecified (_). For non-verbal MWEs, this corresponds to a comma-separated list of the criteria used by annotators to decide that this combination of words is indeed a MWE. The criteria acronyms correspond to those in the guidelines, e.g. CRAN, ID, PRED, LEX, etc.
If the current line contains a lexicalized component which is not the first one in the current MWE, the MWE code contains the MWE identifier only, as described above, and no MWE category label.
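The rules above for the 11th column can be summarized as a small decoding routine. This is a sketch under our own assumptions: the names `Code`, `parse_mwe_column` and `split_ne_category` are hypothetical, the two non-annotated values ('*' and '_') are both mapped to an empty list, and named entity categories are assumed to use the "NE-" prefix as in NE-PERS.final.

```python
from typing import List, NamedTuple, Optional, Tuple


class Code(NamedTuple):
    identifier: int            # MWE/NE identifier, unique within the sentence
    pos: Optional[str]         # None when the POS field is "_" or there is no LABEL
    category: Optional[str]    # None for non-first components (no LABEL)
    criteria: List[str]        # empty when the CRITERIA field is "_" or absent


def parse_mwe_column(value: str) -> List[Code]:
    """Decode one PARSEME-FR:MWE cell into a list of codes."""
    if value in ("*", "_"):
        # Not part of a MWE/NE, or annotation unspecified.
        return []
    codes = []
    for code in value.split(";"):
        ident, sep, label = code.partition(":")
        if not sep:
            # Non-first component: the CODE is the IDENTIFIER only.
            codes.append(Code(int(ident), None, None, []))
            continue
        # First component: IDENTIFIER:POS|CATEGORY|CRITERIA
        pos, category, criteria = label.split("|")
        codes.append(Code(
            int(ident),
            None if pos == "_" else pos,
            category,
            [] if criteria == "_" else criteria.split(","),
        ))
    return codes


def split_ne_category(category: str) -> Optional[Tuple[str, str]]:
    """Split e.g. 'NE-LOC.prim' into ('LOC', 'prim'); None for non-NE categories."""
    if not category.startswith("NE-"):
        return None
    ne_type, _, status = category[len("NE-"):].partition(".")
    return ne_type, status


print(parse_mwe_column("1:PROPN|NE-PERS.prim|_;2:PROPN|NE-ORG.final|_"))
print(split_ne_category("NE-ORG.final"))
```

Keeping the identifier-only case in the same `Code` type makes it easy to group the components of a MWE/NE afterwards, since every code carries its identifier regardless of position.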
Here is an example sentence in the PARSEME-FR cupt format described above, showing only columns 1 (ID), 2 (FORM) and 11 (MWE/NE annotation).
E.g. "Peugeot" is annotated as a final ORG named entity (NE-ORG.final), with identifier 2, and also as a primary PERS named entity with identifier 1.
"tout au plus" is annotated as a MWE, more precisely tokens "tout", "à", "le" and "plus" are annotated with identifier 3 ("au" is a multi-word token which is not annotated). It has "ADV" as part-of-speech, meaning it behaves as an adverb, but it is considered as irregular from the syntactic point of view. The criterion that was used to annotate it is "IRREG".
The sentence contains an example of a word (support verb "effectuait") belonging to two LVCs (tokens 21+23 and tokens 21+26 each form a LVC).
# text = Chez Peugeot tout au plus on savait que Jean Gapé ne faisait plus partie du conseil général mais effectuait divers travaux et diverses tâches.
# sent_id = 123
1 Chez ... *
2 Peugeot ... 1:PROPN|NE-PERS.prim|_;2:PROPN|NE-ORG.final|_
3 tout ... 3:ADV|MWE|IRREG
4-5 au ... *
4 à ... 3
5 le ... 3
6 plus ... 3
7 on ... *
8 savait ... *
9 que ... *
10 Jean ... 4:PROPN|NE-PERS.final|_
11 Gapé ... 4
12 ne ... *
13 faisait ... 8:_|MWE-VID|_
14 plus ... *
15 partie ... 8
16-17 du ... *
16 de ... *
17 le ... *
18 conseil ... 5:_|MWE|LEX
19 général ... 5
20 mais ... *
21 effectuait ... 6:_|LVC|_;7:_|LVC|_
22 divers ... *
23 travaux ... 6
24 et ... *
25 diverses ... *
26 tâches ... 7
27 . ... *
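To recover the annotated units from such a sentence, the codes can be grouped by identifier. The sketch below assumes (ID, FORM, column 11) triples as input; the function name `group_annotations` is ours, and `ROWS` is a hand-copied excerpt of a few tokens from the example, with "au" as the form of the multiword token 4-5.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_annotations(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[int, Tuple[Optional[str], List[str]]]:
    """Map each identifier to (LABEL, forms of its lexicalized components)."""
    forms: Dict[int, List[str]] = defaultdict(list)
    labels: Dict[int, str] = {}
    for _tok_id, form, column in rows:
        if column in ("*", "_"):
            # Not annotated; this also skips multiword token ranges such as 4-5.
            continue
        for code in column.split(";"):
            ident, sep, label = code.partition(":")
            forms[int(ident)].append(form)
            if sep:
                # Only the (linearly) first component carries the LABEL.
                labels[int(ident)] = label
    return {i: (labels.get(i), forms[i]) for i in sorted(forms)}


# Excerpt of the example sentence: (ID, FORM, PARSEME-FR:MWE).
ROWS = [
    ("2", "Peugeot", "1:PROPN|NE-PERS.prim|_;2:PROPN|NE-ORG.final|_"),
    ("3", "tout", "3:ADV|MWE|IRREG"),
    ("4-5", "au", "*"),
    ("4", "à", "3"),
    ("5", "le", "3"),
    ("6", "plus", "3"),
    ("13", "faisait", "8:_|MWE-VID|_"),
    ("14", "plus", "*"),
    ("15", "partie", "8"),
]

for ident, (label, components) in group_annotations(ROWS).items():
    print(ident, label, " ".join(components))
```

Because a word can carry several codes, the same form may appear under more than one identifier, as "Peugeot" does here for its .prim and .final named entity annotations.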