PARSEME-FR annotation guidelines - v1.0
- Notations
- Interaction with tokenization
- Background: verbal MWEs of PARSEME and distinction between named entities and MWEs
- Top decision tree, which directs the annotator to the guide for verbal MWEs, the guide for named entities, or the guide for other MWEs
Background: verbal MWEs of PARSEME and distinction between named entities and MWEs
Verbal expressions
Verbal multi-word expressions were the focus of the PARSEME shared task 1.1 (2018), organized within the international PARSEME (COST) project. The PARSEME v1.1 guide for verbal MWEs was designed and used to produce annotations for 20 languages, including French. For PARSEME-FR, we have thus focused on other MWEs (non-verbal MWEs). Members of the French spin-off project PARSEME-FR were heavily involved in the multilingual PARSEME guide, so both guides are similar in spirit.
Nominal expressions: distinguishing "named entities" from other MWEs
For nominal multi-word expressions, we use a primary distinction concerning the naming convention that links the expression and the entity or entities the expression can refer to. The starting intuition is that one can distinguish:
- (1) entity names: some nominal MWEs work as the direct name of a specific entity (for instance Anna Duval)
- (2) versus instantiable concept names, working as the name of a concept, which can be used to refer to instances of this concept (e.g. neural network).
In this latter case, knowing the defining characteristics of the concept enables one to use it for future instances, without having to learn any new naming convention. This contrasts with entity names: in order to use the name Anna Duval for a new person, one needs to learn a new naming convention linking the name to this new person, and the characteristics of the person play almost no role (to be precise, with such an example the name only tells us that the person should be a woman). Note that:
- an entity name may well be ambiguous (e.g. several people bearing the same name); the key differentiating trait between (1) and (2) is whether or not there must be a naming convention at the level of each entity (Kleiber, 2007)
- for concept names there is of course also a naming convention (why the noun table is used for a table), but it is defined at the level of the class of entities, not at the level of each entity. In a given context, an NP headed by table may refer to a specific table t, but without any naming convention for this particular table.
This distinction between entity name and instantiable concept name is reminiscent of the proper noun versus common noun distinction, but the latter is not so easy to define precisely. Of course, lexical items that are exclusively used for directly naming entities (e.g. the first and last names of people) are easily classified as proper nouns (sometimes called pure proper nouns). This is why Ehrmann (2008) roughly defines proper nouns as the "désignation d’une entité précise par le biais d’une description dont le sens joue un rôle mineur par rapport à la dénomination, opérant directement, du référent" ("the designation of a precise entity via a description whose meaning plays a minor role with respect to the denomination of the referent, which operates directly"). But an abundant literature shows that the proper / common noun distinction is difficult to characterize in linguistic terms (we refer primarily to Kleiber (2001; 2007) and Ehrmann (2008) for a state of the art). Indeed, within names of specific entities, we can distinguish:
- (1a) entity names composed of lexical items that are dedicated to naming entities (pure proper nouns), such as Italy, Anna Duval, Microsoft
- (1b) entity names that have a descriptive basis, such as the International League against Racism and Anti-Semitism or the Massif central (literally the central massif): the naming convention between the entity and the name is sociologically typical of a proper noun (the name of an association, of a geographical item), but also clearly results from the compatibility of the entity characteristics and the meaning of the lexical items
- (1c) but also names which serve to designate unique abstract entities, such as abstract simple nouns (taxidermy) or abstract MWEs (Euclidean geometry, machine translation): because only one entity can be called that way, they too can be viewed as entity names, for which the speakers have to learn the naming convention.
However, cases (1c) are traditionally not viewed as proper nouns. Kleiber (1996) argues that proper nouns function to name a particular entity within a specified class (a particular person within the class of persons).
Within PARSEME-FR, we have chosen to keep this tradition of considering (1b) cases as proper nouns, and (1c) cases as common nouns. We distinguish between:
- cases (1a)/(1b), which are generally considered in NLP as named entities and associated with a semantic type (person, organization, etc.). Although the term is confusing (the linguistic expression is an entity name, not a named entity), we will use it in the following for entity names, as is usual in the NLP community. We annotate these as named entities (NE), using a dedicated guide, provided they are of one of the following semantic types: PERSON, ORGANIZATION, LOCATION, HUMAN PRODUCT, EVENT (as these often happen to be named with a pure proper noun).
- moreover, for named entities, we annotate both multiword cases (Anna Duval) and single-token cases (Italy, Anna): from an application point of view, it would be a pity to ignore the latter.
- for cases (2) and (1c) (which are not intuitively considered proper nouns), we use another guide and an MWE tag (to be understood as non-NE multi-word expression).
Note there are also names referring to unique concrete entities such as the sun or the moon (often called "unica"), whose status is widely debated. We tag these as named entities (e.g. in I can see you thanks to the moon), unless used to refer to an instance of a class (as in several planets have moons).
Our annotation process uses a top decision tree that directs the annotator towards the guide for verbal MWEs, the guide for NE and the guide for other MWEs.
Top decision tree
In running text, annotators spot candidate linguistic expressions that might fall into the named entity or MWE category. In case of doubt, annotators must follow the decision tree provided below.
Candidate expressions
There are two types of candidates for a potential annotation:
- (1) a single token or a sequence of tokens that the annotator perceives as potentially the name of an entity of semantic type PERSON, ORGANIZATION, LOCATION, HUMAN PRODUCT or EVENT
- (2) a sequence of several tokens whose meaning is at first sight obtained idiosyncratically and/or whose components cannot vary freely (at the morphological or lexical level, substitutions that are normally possible are not acceptable for this sequence, or produce an unexpected change of meaning)
Note that for some candidates, it might be unclear at first whether they will be tagged as named entity or MWE, and what their exact span is. The annotators should decide using the decision tree.
Decision tree
For a given candidate expression c:
0. Does c have the distribution of a verb (or VP, or sentence)?
- NO => continue
- YES => go to the guide for verbal MWEs (external link) if c is multiword, otherwise EXIT
1. Does c have the distribution of a noun (or NP)?
- NO => go to the guide for non-verbal MWEs if c is multiword, otherwise EXIT
- YES => continue
2. [SPECIF_REF] : Is c used in context to refer to a single specific entity of the discourse world?
- NO => go to the guide for non-verbal MWEs if c is multiword, otherwise EXIT
- indeed, if c does not refer to a single specific entity, c is not an NE
- Examples:
- Generic interpretation:
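- Une arme blanche est une arme tranchante, perforante ou contondante dont la mise en œuvre n'est due qu'à la force humaine… (An edged weapon is a cutting, piercing or blunt weapon operated solely by human force…)
- Le conseil départemental est l'assemblée délibérante d'un département (The departmental council is the deliberative assembly of a department)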
- Use of a plural to refer to all objects of the class defined by c:
- Edged weapons are prohibited on a plane
- Red-haired people are rare
- Tables made of wood last longer
- Use of a plural to refer to several objects of a class:
- J’ai acheté deux stylos plume. (I bought two fountain pens)
- A Maisons-Alfort il y a plusieurs Pierres Martins (In Maisons-Alfort there are several Pierre Martins) => open question: is it a problem to send this case to the MWE guide? Should it be treated as a special case?
- YES or UNSURE => CONTINUE
- Examples:
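- Il a utilisé une arme blanche (He used an edged weapon)
- J’ai vendu ma voiture à Anna Duval (I sold my car to Anna Duval)
- Le désormais célèbre réalisateur du documentaire Merci patron ! a réussi à rattraper un retard de presque dix points (The now famous director of the documentary Merci patron ! managed to make up a deficit of almost ten points)
- Elle a enseigné la physique quantique (She taught quantum physics)
- La famille Dupont a déménagé (The Dupont family has moved)
- Les Duponts ont déménagé (The Duponts have moved; refers to a family)
- Le conseil départemental a voté le budget le vendredi dernier (The departmental council voted the budget last Friday)
- J’ai vu un petit chaperon rouge sur la table (I saw a little red riding hood on the table; reference to a headdress)
- J’ai vu un petit chaperon rouge s’enfuir (I saw a little red riding hood run away; metonymic reference to a child)
- Le petit chaperon rouge de l’histoire célèbre m’a toujours été sympathique (I have always liked the little red riding hood of the famous story; reference to the specific character of the story)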
3. [CONCEPT_NAMING_CONV] Can the expression c be used to refer to another entity e', without any need for another naming convention between c and e' (simply based on matching properties between c and e')?
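- NO because no other e' exists => CONTINUE
- Le président de la République française en 2013 (The president of the French Republic in 2013)
- Le désormais célèbre réalisateur du documentaire Merci patron ! a réussi à rattraper un retard de presque dix points (The now famous director of the documentary Merci patron ! managed to make up a deficit of almost ten points)
- Elle a enseigné la physique quantique (She taught quantum physics)
- Le petit chaperon rouge de l’histoire célèbre m’a toujours été sympathique (I have always liked the little red riding hood of the famous story; reference to the specific character of the story)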
- NO => CONTINUE
- Examples of new naming convention needed:
- J’ai vendu ma voiture à Anna Duval (I sold my car to Anna Duval)
- Les Duponts ont déménagé (The Duponts have moved; to refer to another family, one needs another naming convention)
- YES => go to the guide for non-verbal MWEs if c is multiword, otherwise EXIT
- no other naming convention needed for:
- Il a utilisé une arme blanche (He used an edged weapon)
- La juge d'instruction a demandé une confrontation (The investigating judge requested a confrontation)
- J’ai vu un petit chaperon rouge sur la table (I saw a little red riding hood on the table; reference to a headdress)
- J’ai vu un petit chaperon rouge s’enfuir (I saw a little red riding hood run away; metonymic reference to a child)
4. [SEM_TYPE] Is the entity e referred to by c a PERSON, ORGANIZATION, LOCATION, HUMAN PRODUCT or EVENT?
- NO => go to the guide for non-verbal MWEs if c is multiword, otherwise EXIT
- YES => go to the NE guide
- If the NE guide concludes that c is not an NE, go to the guide for non-verbal MWEs if c is multiword, otherwise EXIT
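For readers who prefer to see the routing logic spelled out, the decision tree above can be sketched as a small Python function. This is only an illustration, not part of the PARSEME-FR tooling: the names `Candidate`, `Outcome` and `route`, and all field names, are hypothetical, and the answers to tests 0 to 4 are assumed to be supplied by the annotator. The sketch does not model the fallback from the NE guide back to the guide for non-verbal MWEs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    VERBAL_MWE_GUIDE = "guide for verbal MWEs"
    NON_VERBAL_MWE_GUIDE = "guide for non-verbal MWEs"
    NE_GUIDE = "NE guide"
    EXIT = "EXIT"


# Semantic types accepted by test 4 [SEM_TYPE].
NE_SEMANTIC_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "HUMAN PRODUCT", "EVENT"}


@dataclass
class Candidate:
    """Annotator judgments about a candidate c (hypothetical field names)."""
    is_multiword: bool
    verbal_distribution: bool                       # test 0
    nominal_distribution: bool                      # test 1
    specific_reference: Optional[bool] = None       # test 2; None means UNSURE
    reusable_without_new_convention: bool = False   # test 3; True means YES
    semantic_type: Optional[str] = None             # test 4


def _mwe_guide_or_exit(c: Candidate) -> Outcome:
    # Single-token candidates that are not NEs are simply dropped.
    return Outcome.NON_VERBAL_MWE_GUIDE if c.is_multiword else Outcome.EXIT


def route(c: Candidate) -> Outcome:
    """Return the guide a candidate is sent to, following tests 0 to 4."""
    if c.verbal_distribution:                        # test 0
        return Outcome.VERBAL_MWE_GUIDE if c.is_multiword else Outcome.EXIT
    if not c.nominal_distribution:                   # test 1
        return _mwe_guide_or_exit(c)
    if c.specific_reference is False:                # test 2: UNSURE continues
        return _mwe_guide_or_exit(c)
    if c.reusable_without_new_convention:            # test 3
        return _mwe_guide_or_exit(c)
    if c.semantic_type in NE_SEMANTIC_TYPES:         # test 4
        return Outcome.NE_GUIDE
    return _mwe_guide_or_exit(c)


# "Anna Duval" in "J'ai vendu ma voiture à Anna Duval"
anna = Candidate(is_multiword=True, verbal_distribution=False,
                 nominal_distribution=True, specific_reference=True,
                 reusable_without_new_convention=False, semantic_type="PERSON")
print(route(anna).value)  # NE guide
```

Treating UNSURE at test 2 as `None` keeps the "YES or UNSURE => CONTINUE" branch explicit: only a firm NO sends the candidate away from the NE path.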