|
|
**`THIS PAGE BELONGS TO THE` _`PUBLIC`_ `DOC OF PARSEME-FR`**
|
|
|
|
|
|
# Decision trees for named entity (NE) annotation
|
|
|
|
|
|
[Go to the top of the annotation guidelines](Guide-annotation-ep-en)
|
|
|
[Go to the general principles of NE annotation](ep_et_en)
|
|
|
[Go to the discussion on the challenges behind NE annotation](defis-en)
|
|
|
[Go to the MWE annotation guide](Criteres)
|
|
|
|
|
|
<!--
|
|
|
## Color code
|
|
|
In the following, different colors are used to display examples:
|
|
|
- <font color="red">Red</font> is used for counter-examples, that is, expressions which look like VMWEs but are not one, whatever the language.
|
|
|
-->
|
|
|
|
|
|
NE annotation should be done in 3 main steps:
|
|
|
|
|
|
- **Step 1** - identify a **NE candidate** _c_, henceforth called the _candidate sequence_. At this step some doubts may remain as to the precise left and right span of the candidate (e.g. as to the inclusion of classifiers, as in _la *laiterie* Besnier_, or of adverbials, as in _La Croix-Rouge française *de Blois*_). These doubts may be dispelled in Step 2 (tests Acron and WebPage) or in Step 3.
|
|
|
- **Step 2** - perform the (fuzzy) **NE identification** tests
|
|
|
- **Step 3** - if _c_ has been identified as a fuzzy NE in step 2, apply the **span tests**
|
|
|
|
|
|
<!--------------------------------------------------------------------------------------------->
|
|
|
## Step 1 - choosing a candidate sequence
|
|
|
choose a candidate sequence _c_ such that _c_ is a nominal group and _c_ names an entity of one of the relevant [types](ep_et_en#-2-les-types-den) ([PERS](ep_et_en#-21-noms-de-personne-pers), [LOC](ep_et_en#-22-noms-de-lieu-loc), [ORG](ep_et_en#-23-noms-dorganisation-org-incluant-les-humains-collectifs), [PROD](ep_et_en#-24-produits-humains-prod) or [EVE](ep_et_en#-25-ev%C3%A9nements-nomm%C3%A9s-eve))
|
|
|
apply [UniqueRef](#test-1-uniqueref-unique-referent)(_c_)
|
|
|
**NO** => _c_ is **not a NE**
|
|
|
**YES** => go to [Step 2](#step-2-identifying-a-named-entity).
|
|
|
|
|
|
<!--------------------------------------------------------------------------------------------->
|
|
|
## Step 2 - identifying a named entity
|
|
|
Let _c_ denote the candidate sequence identified in [Step 1](#step-1-choosing-a-candidate-sequence), and _t_ the text being annotated.
|
|
|
|
|
|
apply [ObviousProper](#test-2-obviousproper-obvious-proper-name)(_c_,_t_)
|
|
|
**YES** => _c_ is a fuzzy NE, go to [Step 3](#step-3-establishing-the-span-of-a-named-entity)
|
|
|
**NO** => apply [NameConv](#test-3-nameconv-naming-convention)(_c_,_t_)
|
|
|
preciseNE => c is a **NE with a precise span**
|
|
|
fuzzyNE => go to [Step 3](#step-3-establishing-the-span-of-a-named-entity)
|
|
|
noNE => c is **not a NE**
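The Step 2 dispatch above can be sketched in Python as follows. This is only an illustration: the answers to ObviousProper and NameConv remain human judgments and are passed in as plain values, and the names `Verdict` and `step2` are hypothetical, not part of the guidelines.

```python
from enum import Enum


class Verdict(Enum):
    NOT_NE = "not a NE"
    FUZZY_NE = "fuzzy NE, go to Step 3"
    PRECISE_NE = "NE with a precise span"


def step2(obvious_proper: bool, name_conv: str) -> Verdict:
    """Combine the annotator's answers for Step 2.

    obvious_proper: answer to ObviousProper(c, t).
    name_conv: outcome of NameConv(c, t), one of 'preciseNE', 'fuzzyNE'
               or 'noNE'; it is only consulted when ObviousProper fails.
    """
    if obvious_proper:
        return Verdict.FUZZY_NE
    outcomes = {
        "preciseNE": Verdict.PRECISE_NE,
        "fuzzyNE": Verdict.FUZZY_NE,
        "noNE": Verdict.NOT_NE,
    }
    return outcomes[name_conv]
```

Note that a positive ObviousProper answer never yields a precise span directly: the candidate always goes through Step 3.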
|
|
|
|
|
|
<!--
|
|
|
**NO** => apply [DefDesc](#DefDesc)(_c_,_t_)
|
|
|
**NO** => _c_ is not a NE
|
|
|
-->
|
|
|
|
|
|
|
|
|
<!--------------------------------------------------------------------------------------------->
|
|
|
## Step 3 - establishing the span of a named entity
|
|
|
Let _c_ denote the candidate sequence identified as a fuzzy NE in [Step 2](#step-2-identifying-a-named-entity). Some doubts may persist as to including some components of _c_ within the span of the final NE. To decide if the precise span of _c_ is valid apply the following decision tree.
|
|
|
|
|
|
Let _c_ denote the candidate sequence possibly with a determiner _d_, a classifier _cl_ and/or an adverbial _a_.
|
|
|
|
|
|
**if** ([Acron](#test-5-acron-acronym)(c,t) **or** [WebPage](#test-6-webpage-dedicated-web-page)(c) **or** [MinSpan](#test-7-minspan-minimal-span)(c)) **then**
|
|
|
c is a NE with a precise span
|
|
|
**else**
|
|
|
Exclude the determiner _d_ and the adverbial _a_ from _c_
|
|
|
**if** ([RelevUpper](#test-4-relevupper-relevant-uppercase)(c,t)) **then**
|
|
|
c is a **NE with a precise span**
|
|
|
**else**
|
|
|
**if** (there exists an occurrence c' identical to c such that [RelevUpper](#test-4-relevupper-relevant-uppercase)(c',t)) **then**
|
|
|
c is a **NE with a precise span**
|
|
|
**else**
|
|
|
**if** (c contains only a classifier) **then** //e.g. _mairie_, _conseil général_
|
|
|
c is **not a NE**
|
|
|
**else**
|
|
|
c=[SpanPerCat](#test-8-spanpercat-span-per-category)(c)
|
|
|
c is a **NE with a precise span**
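The decision tree above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Candidate` class, its field names and the `step3` function are hypothetical; every test answer is supplied by the annotator as a boolean (the RelevUpper answers are meant for _c_ once the determiner and adverbial are excluded); `span_per_cat` stands for the SpanPerCat test and is passed in as a function.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    core: str = ""        # the naming part, e.g. "Besnier"
    determiner: str = ""  # d, e.g. "la"
    classifier: str = ""  # cl, e.g. "laiterie"
    adverbial: str = ""   # a, e.g. "de Blois"

    @property
    def text(self) -> str:
        parts = (self.determiner, self.classifier, self.core, self.adverbial)
        return " ".join(p for p in parts if p)


def step3(c: Candidate, acron: bool, web_page: bool, min_span: bool,
          relev_upper: bool, relev_upper_elsewhere: bool,
          span_per_cat: Callable[[Candidate], Candidate]) -> Optional[Candidate]:
    """Return the NE with its precise span, or None if c is not a NE."""
    if acron or web_page or min_span:
        return c  # the span of c is kept as is
    # Exclude the determiner d and the adverbial a from c.
    c = replace(c, determiner="", adverbial="")
    if relev_upper or relev_upper_elsewhere:
        return c
    if not c.core:  # c contains only a classifier, e.g. "mairie"
        return None
    return span_per_cat(c)
```

For instance, with all tests failing and a `span_per_cat` that keeps the classifier, `Candidate(core="Besnier", determiner="la", classifier="laiterie")` is reduced to _laiterie Besnier_.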
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 1 [UniqueRef] - unique referent
|
|
|
Does the sequence name a unique object in the discourse world?
|
|
|
<!--- in an autonomous manner, i.e. without having to take the linguistic context (other than the place and date of the utterance) into account? -->
|
|
|
<!-- but indentifying the referent of the latter expression requires the linguistic context. Test passed for _François Ruffin_ but not for _Le désormais célèbre réalisateur de Merci patron!_.-->
|
|
|
|
|
|
Examples:
|
|
|
* "C’est un véritable petit exploit qu’a accompli _François Ruffin_. _Le désormais célèbre réalisateur du documentaire Merci patron !_ a réussi à rattraper un retard de presque dix points" - both _François Ruffin_ (PERS) and _Le désormais célèbre réalisateur du documentaire Merci patron !_ (no NE) refer to a unique referent in the discourse world. Test passed.
|
|
|
* _le Président de la République s'est rendu à Clamart_ - _le Président de la République_ (no NE) and _Clamart_ (NE) each name a unique entity in the discourse world (a person and a place, respectively). Test passed.
|
|
|
* _le Petit Chaperon Rouge a rencontré le loup_ - _le Petit Chaperon Rouge_ (NE) names a unique person in the discourse world (even if it has no equivalent in the real world). Test passed.
|
|
|
* _Le conseil départemental est l'assemblée délibérante d'un département_ - _conseil départemental_ (no NE) refers to a whole class of institutions rather than to a unique institution. Test not passed.
|
|
|
* _Le conseil départemental a voté le budget le vendredi dernier_ - given the place and time of utterance, _conseil départemental_ (no NE) refers to a unique instance (a precise institution). Test passed.
|
|
|
* _Angiox est un médicament suédois_ - _Angiox_ is considered here as the name of an invention (a molecule) or of a trademark, so it refers to a unique instance. Test passed.
|
|
|
* _J'ai pris 2 Angiox avant de dormir_ - _Angiox_ refers to several products of this brand. Test not passed.
|
|
|
* _A Maisons-Alfort il y a plusieurs Pierres Martins_ - _Pierres Martins_ (no NE) refers to several persons. Test not passed.
|
|
|
|
|
|
<!-- obsolete: what matters is indeed the interpretation in context
|
|
|
**ADDED DURING THE MATHIEU/MARIE ADJUDICATION**:
|
|
|
Does the reference to a unique referent need to hold without context?
|
|
|
It seems we need to add this precision.
|
|
|
"does the candidate refer to a unique entity in the world, and would it also do so without the context?"
|
|
|
Hence for sequences used for short names of a named entity, one has to test whether the short name is understandable without context.
|
|
|
The test is complicated by the fact that a given name may well be ambiguous ("Pierre Martin" is the name of several persons!).
|
|
|
|
|
|
**Solution 1**: apply UniqueRef strictly out of context, while taking the possible ambiguity of the name into account. Then:
|
|
|
* "Traité" alone cannot be annotated, since the context is needed to know which treaty is meant.
|
|
|
* "Parti communiste" cannot be annotated, since the real name is "Parti communiste français".
|
|
|
* "ministère des Affaires étrangères": there is the problem of inserted nationality adjectives (so there is not really a unique reference), but the sequence is in fact ambiguous: it can be used for any ministry of foreign affairs ("Je suis allée en Italie, au ministère des affaires étrangères") but it is also the name of the French organisation. Or else we consider "ministère des affaires étrangères" an ambiguous name and annotate all its occurrences?
|
|
|
* For "Commission", there is a somewhat stronger impression that out of context it does refer to the European Commission, but strictly speaking one can construct a text in which "Commission" alone is the short form of another commission...
|
|
|
|
|
|
**Solution 2**: also accept short forms of NEs, even if some context is needed to get the precise reference. Slippery slope!
|
|
|
-->
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 2 [ObviousProper] - obvious proper name
|
|
|
|
|
|
Is the candidate sequence obviously a proper name? I.e. is the annotator confident that a naming convention exists for the sequence? For large classes of NEs (except person names, which do not exhibit synonyms), this test can be reformulated in a way similar to the [LEX](Criteres-lexicaux#41-figement-des-%C3%A9l%C3%A9ments-lexicalement-pleins-t%C3%AAte-ou-compl%C3%A9ment-lex-et-term) test: when one of the components of the candidate sequence is replaced by a synonym or a hypernym, is the reference to the initial entity lost?
|
|
|
|
|
|
Examples:
|
|
|
* _Pierre Martin habite à Maisons-Alfort_ - both _Pierre Martin_ (PERS) and _Maisons-Alfort_ (LOC) are clearly proper names. Notably, _Habitations-Alfort_ does not preserve the same referent. Test passed.
|
|
|
* _A Maisons-Alfort il y a plusieurs personnes nommées Pierre Martin_ - both _Pierre Martin_ (PERS) and _Maisons-Alfort_ (LOC) are clearly proper names. Test passed.
|
|
|
* _Structures d’insertion par l’activité économique en Vendée_ - it is not obvious that _Structures d’insertion par l’activité économique_ (no NE) is subject to a naming convention. The LEX-like test is hard to apply. Test not passed.
|
|
|
* _La fédération des entreprises d'insertion du Pays de la Loire_ - it is not obvious that _la fédération des entreprises d'insertion_ (ORG) is subject to a naming convention. The LEX-like test is hard to apply. Test not passed.
|
|
|
|
|
|
|
|
|
<!--
|
|
|
------------------------------------------------
|
|
|
### Test 3 [DefDesc] - definite description
|
|
|
|
|
|
Is the candidate sequence a definite description? A sequence is a definite description if its referent cannot be identified on the basis of the sequence alone, but requires empirical (i.e. extra-linguistic) knowledge instead.
|
|
|
|
|
|
Examples:
|
|
|
* _Le désormais célèbre réalisateur du documentaire Merci patron ! a réussi à rattraper un retard de presque dix points_ - empirical knowledge is needed to understand that a film has a director and to know the director of this precise film (Merci patron !). Test passed.
|
|
|
* _Le président de la République exerce la plus haute fonction du pouvoir exécutif de la République française._ - _le président de la République_ (no NE) has a unique referent but the sentence defines it, so no previous knowledge is needed to identify it. Test not passed.
|
|
|
* _Le Conseil de l'Union européenne a été informé des économies réalisées par le FMI._ - empirical knowledge is needed to understand that an institution like the EU is led by a council - _Conseil de l'Union européenne_ (ORG) - consisting of all prime ministers. Test passed.
|
|
|
-->
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 3 [NameConv] - naming convention
|
|
|
|
|
|
Does a naming convention apply to the candidate sequence?
|
|
|
This test is organised as a decision tree. Let _c_ denote the candidate sequence and _t_ the text being annotated.
|
|
|
|
|
|
<!-- ``` -->
|
|
|
**if** ([RelevUpper](#test-4-relevupper-relevant-uppercase)(_c_,_t_)) **then**
|
|
|
**return** fuzzyNE
|
|
|
**else**
|
|
|
**if** (there exists a variant _v_ of _c_ in _t_ such that [RelevUpper](#test-4-relevupper-relevant-uppercase)(_v_,_t_)) **then**
|
|
|
**return** fuzzyNE
|
|
|
**else**
|
|
|
**if** ([Acron](#test-5-acron-acronym)(_c_,_t_) **or** [WebPage](#test-6-webpage-dedicated-web-page)(_c_)) **then**
|
|
|
**return** preciseNE
|
|
|
**return** noNE
|
|
|
<!-- ``` -->
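The NameConv tree above can be sketched in Python as follows. This is an illustration only: the four test answers are human judgments passed in as booleans, and the function and parameter names are hypothetical.

```python
def name_conv(relev_upper: bool, variant_relev_upper: bool,
              acron: bool, web_page: bool) -> str:
    """Return 'fuzzyNE', 'preciseNE' or 'noNE' for a candidate c in a text t.

    relev_upper:         RelevUpper(c, t)
    variant_relev_upper: some variant v of c in t satisfies RelevUpper(v, t)
    acron:               Acron(c, t)
    web_page:            WebPage(c)
    """
    if relev_upper:
        return "fuzzyNE"
    if variant_relev_upper:
        return "fuzzyNE"
    if acron or web_page:
        return "preciseNE"
    return "noNE"
```

The order of the checks matters: a relevant uppercase, on _c_ or on one of its variants, takes precedence over Acron and WebPage, so such a candidate is returned as fuzzy even when it also has an acronym or a dedicated web page.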
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 4 [RelevUpper] - relevant uppercase
|
|
|
|
|
|
Is the candidate sequence spelled with an initial uppercase letter to signal a proper name, rather than for other reasons? If unsure, answer no.
|
|
|
|
|
|
Examples:
|
|
|
* _Il a évoqué l'affaire des disparus du Beach_ - _affaire des disparus du Beach_ (EVE/no NE) is not spelled with an initial uppercase letter. Test not passed.
|
|
|
* _Affaire des disparus du Beach: 100 morts_ - _Affaire des disparus du Beach_ (EVE/no NE) is spelled with an initial uppercase letter because it starts a sentence. Test not passed.
|
|
|
* _J'ai eu l'honneur d'être reçu par Monsieur le Président du conseil en juin dernier_ - _Monsieur_ (no NE) and _Président du conseil_ (no NE) are in the middle of a sentence, but they are spelled with an uppercase letter for honorific reasons. Test not passed.
|
|
|
* _QUID DE L'AFFAIRE DES DISPARUS DU BEACH_? - _AFFAIRE DES DISPARUS DU BEACH_ is spelled with an initial uppercase letter because the whole sequence is in uppercase. Test not passed.
|
|
|
* _Il a évoqué l'Affaire des disparus du Beach_ - _Affaire des disparus du Beach_ (EVE) is spelled with an initial uppercase letter in the middle of a sentence, not for honorific reasons. We hypothesise that the author signals a proper name. Test passed.
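The surface part of this test can be approximated by a rough heuristic, sketched below in Python. It is not a substitute for the annotator's judgment: the function name is hypothetical, only the three "other reasons" illustrated in the examples are covered (sentence-initial position, all-uppercase text, honorific use), and the honorific case cannot be detected automatically, so it is passed in as a flag.

```python
def uppercase_looks_relevant(candidate: str, sentence: str,
                             honorific: bool = False) -> bool:
    """Heuristic approximation of RelevUpper; when unsure, answer no."""
    # No initial uppercase letter at all.
    if not candidate or not candidate[0].isupper():
        return False
    # The uppercase may only be due to the sentence-initial position.
    if sentence.lstrip().startswith(candidate):
        return False
    # The whole sentence is in uppercase.
    if sentence.isupper():
        return False
    # Honorific uppercase, as judged by the annotator.
    if honorific:
        return False
    return True
```

On the examples above, only the mid-sentence _Affaire des disparus du Beach_ passes.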
|
|
|
|
|
|
<!--
|
|
|
------------------------------------------------
|
|
|
### Test 5 [InitUpper] - initial uppercase
|
|
|
|
|
|
Is the candidate sequence spelled with an initial uppercase letter?
|
|
|
|
|
|
Examples:
|
|
|
* _Il a évoqué l'Affaire des disparus du Beach_ - test passed for _Affaire des disparus du Beach_ (EVE)
|
|
|
* _Il a évoqué l'affaire des disparus du Beach_ - test not passed for _affaire des disparus du Beach_ (EVE/no NE)
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 6 [HonoUpper] - honorific uppercase
|
|
|
|
|
|
Is the candidate sequence spelled with an initial uppercase letter for clearly honorific reasons? This question is fuzzy and may sometimes be hard to answer, so some borderline cases are hard to avoid.
|
|
|
|
|
|
Examples:
|
|
|
* _J'ai eu l'honneur d'être reçu par Monsieur le Président du conseil en juin dernier_ - _Monsieur_ and _Président du conseil_ (no NE) are in the middle of a sentence, and are spelled with an uppercase letter for honorific reasons. Test passed.
|
|
|
* _Il a suivi des cours à l'Université du temps libre_ - _Université_ (ORG) is in the middle of a sentence but is not spelled with an uppercase letter for honorific reasons. Test not passed.
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 7 [SentBeg] - beginning of the sentence
|
|
|
|
|
|
Does the candidate sequence occur at the beginning of the sentence?
|
|
|
|
|
|
Examples:
|
|
|
* _Affaire des disparus du Beach: 100 morts_ - test passed for _Affaire des disparus du Beach_ (EVE/no NE)
|
|
|
* _Il a évoqué l'Affaire des disparus du Beach_ - test not passed for _Affaire des disparus du Beach_ (NE)
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 8 [SentMid] - middle of the sentence
|
|
|
|
|
|
Does the candidate sequence occur in the middle of the sentence?
|
|
|
|
|
|
Examples:
|
|
|
* _Affaire des disparus du Beach: 100 morts_ - test not passed for _Affaire des disparus du Beach_ (EVE/no NE)
|
|
|
* _Il a évoqué l'Affaire des disparus du Beach_ - test passed for _Affaire des disparus du Beach_ (NE)
|
|
|
-->
|
|
|
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 5 [Acron] - acronym
|
|
|
|
|
|
Does the candidate sequence have an acronym in the given text?
|
|
|
|
|
|
Examples:
|
|
|
* _Le Club Cynophile du Blaisois organise son concours demain. Toute l'équipe du CCB vous attend._ - test passed for _Club Cynophile du Blaisois_ and _CCB_ (ORG).
|
|
|
* _Structures d’insertion par l’activité économique (SIAE) en Vendée_ - test passed for _Structures d’insertion par l’activité économique_ (no NE) but not for _Structures d’insertion par l’activité économique en Vendée_ (no NE)
|
|
|
* _L'insertion par l'activité économique (IAE) est une des composantes de ce que l'on appelle aujourd'hui l'Économie Sociale et Solidaire (ESS)_ - test passed for _insertion par l'activité économique_ (no NE) and _Économie Sociale et Solidaire_ (no NE).
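A first-pass check for this test can be sketched in Python as follows. It is a heuristic under stated assumptions, not part of the guidelines: the acronym is assumed to be built from the initials of the content words, the stopword list is a small hypothetical sample, and the function names are illustrative. Acronyms formed differently will be missed, so the annotator's reading of the text remains decisive.

```python
import re
import unicodedata

# Hypothetical, non-exhaustive list of French function words skipped in acronyms.
STOPWORDS = {"de", "du", "des", "d", "l", "la", "le", "les",
             "par", "et", "en", "à", "au", "aux"}


def _initial(word: str) -> str:
    """First letter of a word, without accent and in uppercase."""
    decomposed = unicodedata.normalize("NFD", word[0])
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.upper()


def initials(candidate: str) -> str:
    """Initials of the content words, e.g. 'Club Cynophile du Blaisois' -> 'CCB'."""
    words = [w for w in re.split(r"[\s'’\-]+", candidate) if w]
    return "".join(_initial(w) for w in words if w.lower() not in STOPWORDS)


def has_acronym(candidate: str, text: str) -> bool:
    """True if the initials of the candidate occur as a standalone token in text."""
    acronym = initials(candidate)
    if len(acronym) < 2:
        return False
    pattern = r"(?<!\w)" + re.escape(acronym) + r"(?!\w)"
    return re.search(pattern, text) is not None
```

With the first example above, `initials("Club Cynophile du Blaisois")` gives `CCB`, which occurs in the second sentence of the text.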
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 6 [WebPage] - dedicated web page
|
|
|
|
|
|
Is there an official web page or a Wikipedia page whose title is the candidate sequence?
|
|
|
|
|
|
Examples:
|
|
|
* _Mairie de Paris_ has a Wikipedia webpage titled with this precise sequence. Test passed.
|
|
|
* _Structures d’insertion par l’activité économique (SIAE) en Vendée_ - test passed for _Structures d’insertion par l’activité économique_ (there is a Wikipedia page) but not for _Structures d’insertion par l’activité économique en Vendée_
|
|
|
* _La fédération des entreprises d'insertion Pays de la Loire_ - it has its official webpage. Test passed.
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 7 [MinSpan] - minimal span
|
|
|
|
|
|
Does the candidate sequence _c_ have the minimal span, i.e. is it true that a shorter span than _c_ no longer refers to the same entity as _c_? Note that this test may be context-specific, e.g. inhabitants of Blois might say _aller à la République_ to mean _aller à la place de la République_, but this information is not available to the wider population of French speakers. In case of doubt, assume that the test is not passed.
|
|
|
|
|
|
Examples:
|
|
|
* _Le Havre - La Rochelle en bus_ - _Le Havre_ and _La Rochelle_ cannot be referred to by _Havre_ and _Rochelle_. Test passed.
|
|
|
* _palais Jacques Coeur_ cannot be referred to as _Jacques Coeur_. Test passed.
|
|
|
* _place de l'Etoile_ can also be referred to as _l'Etoile_. Test not passed.
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 8 [SpanPerCat] - span per category
|
|
|
|
|
|
Note 1: This test does **not** apply to **every** NE of the given categories; it applies only when the other, more fine-grained tests have previously failed.
|
|
|
Note 2: According to the decision tree in [Step 3](#step-3-establishing-the-span-of-a-named-entity), this test is applied to candidate sequences for which the [MinSpan](#test-7-minspan-minimal-span) test failed, i.e. a shorter span can still refer to the same entity.
|
|
|
|
|
|
What is the **final type** of the candidate sequence _c_?
|
|
|
|
|
|
* **PERS** => exclude the classifier, e.g. _les frères [Dupond]_, _la famille [Champion]_, _le professeur [Władysław Strzemiński]_, _Mme [de la Clairy]_, _le président [Giscard d'Estaing]_
|
|
|
|
|
|
* **LOC**:
|
|
|
* exclude the classifier if one of the following
|
|
|
+ région e.g. _la région [Ile-de-France]_
|
|
|
+ département e.g. _département [Indre-et-Loire]_
|
|
|
+ ville e.g. _la ville de [Clermont-Ferrand]_
|
|
|
* otherwise keep the classifier, e.g.
|
|
|
+ rue, avenue, degré, escalier e.g. [_rue de la Paix_], [_avenue Jean-Jaurès_], [_escalier Denis Papin_], [_degré Saint-Laumer_]
|
|
|
+ place, square, rond-point e.g. [_place Victor Hugo_], [_square Léon Blum_], [_rond-point Charles-de-Gaulle_]
|
|
|
+ mer e.g. [_mer Baltique_], [_mer Égée_]
|
|
|
+ lac, étang e.g. [_lac Pavin_], [_étang Neuf_]
|
|
|
+ école e.g. [_école Notre-Dame_]
|
|
|
+ salle e.g. [_salle Jean-Mathieu_]
|
|
|
+ laiterie e.g. [_laiterie Besnier_], [_laiterie SOGECO_]
|
|
|
+ hôtel e.g. _l'hôtel [Formule 1]_
|
|
|
+ église e.g. [_église Notre-Dame_]
|
|
|
+ col e.g. [_col du Tourmalet_]
|
|
|
+ etc.
|
|
|
|
|
|
* **ORG**:
|
|
|
* exclude the classifier if one of the following
|
|
|
+ those that would be excluded for the primary type (if different from the final type), e.g. _la ville de [Clermont-Ferrand] a voté gauche_ (primary type: LOC, so _ville_ is excluded)
|
|
|
+ entreprise, e.g. _l'entreprise [Mc Kowal]_
|
|
|
+ société e.g. _la société [Lyon Tech]_
|
|
|
* otherwise keep the classifier, e.g.
|
|
|
* **PROD** => exclude the classifier, e.g. _le paquebot [Angelina Lauro]_
|
|
|
* **EVE** => exclude the classifier, e.g. _ouragan [El Niño]_
|
|
|
|
|
|
**Return** _c_.
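The per-category rules above can be summarised as a small lookup, sketched here in Python. The names are hypothetical, and the classifier sets contain only the classifiers explicitly listed above as excluded; any other LOC or ORG classifier is kept, following the "otherwise keep the classifier" rule.

```python
from typing import Optional

# Final types whose classifier is always excluded.
ALWAYS_EXCLUDE = {"PERS", "PROD", "EVE"}
# Classifiers listed above as excluded for LOC and ORG.
LOC_EXCLUDED = {"région", "département", "ville"}
ORG_EXCLUDED = {"entreprise", "société"}


def keeps_classifier(classifier: str, final_type: str,
                     primary_type: Optional[str] = None) -> bool:
    """True if the classifier stays inside the span of the NE."""
    if final_type in ALWAYS_EXCLUDE:
        return False
    if final_type == "LOC":
        return classifier not in LOC_EXCLUDED
    if final_type == "ORG":
        # A classifier excluded for the primary type is excluded here too.
        if primary_type and primary_type != "ORG":
            if not keeps_classifier(classifier, primary_type):
                return False
        return classifier not in ORG_EXCLUDED
    raise ValueError(f"unknown NE type: {final_type}")
```

For example, `keeps_classifier("ville", "ORG", primary_type="LOC")` is false, which matches _la ville de [Clermont-Ferrand] a voté gauche_, whereas `keeps_classifier("rue", "LOC")` is true, as in [_rue de la Paix_].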
|
|
|
|
|
|
|