**`THIS PAGE BELONGS TO THE` _`PUBLIC`_ `DOC OF PARSEME-FR`**

# Decision trees for named entity (NE) annotation

# Decision trees for checking naming conventions behind potential named entities

[Go to the top of the annotation guidelines](Guide-annotation-PARSEME_FR-chapeau)

[Go to the general principles of NE annotation](ep_et_en)

[Go to the discussion on the challenges behind NE annotation](defis-en)

[Go to the MWE annotation guide](Criteres)

<!--
|
|
|
## Color code
|
|
|
In the following, different colors are used to display examples:
|
|
|
- <font color="red">Red</font> is used for counter-examples, that is, expressions which look like VMWEs but are not one, whatever the language.
|
|
|
-->
|
|
|
|
|
|
NE annotation should be done in 3 main steps:

- **Step 1** - identify a **NE candidate** _c_, henceforth called the _candidate sequence_. At this stage some doubts may remain as to the precise left and right span of the candidate (e.g. as to the inclusion of classifiers, as in _la *laiterie* Besnier_, or of adverbials, as in _La Croix-Rouge française *de Blois*_). These doubts may be dispelled in Step 2 (tests Acron and WebPage) or in Step 3.
- **Step 2** - perform the (fuzzy) **NE identification** tests
- **Step 3** - if _c_ has been identified as a fuzzy NE in Step 2, apply the **span tests**

This decision tree is entered from the [general decision tree](https://gitlab.lis-lab.fr/PARSEME-FR/PARSEME-FR-public/wikis/Guide-annotation-PARSEME_FR-chapeau), in which a _candidate sequence c_ has been identified. At this stage some doubts may remain as to the precise left and right span of the candidate (e.g. as to the inclusion of classifiers, as in _la *laiterie* Besnier_, or of adverbials, as in _La Croix-Rouge française *de Blois*_). These doubts may be dispelled in Step 1 (tests Acron and WebPage) or in Step 2.
|
|
|
|
|
|
<!--------------------------------------------------------------------------------------------->
|
|
|
## Step 1 - choosing a candidate sequence

Choose a candidate sequence _c_ such that _c_ is a nominal group and _c_ names an entity of one of the relevant [types](ep_et_en#-2-les-types-den) ([PERS](ep_et_en#-21-noms-de-personne-pers), [LOC](ep_et_en#-22-noms-de-lieu-loc), [ORG](ep_et_en#-23-noms-dorganisation-org-incluant-les-humains-collectifs), [PROD](ep_et_en#-24-produits-humains-prod) or [EVE](ep_et_en#-25-ev%C3%A9nements-nomm%C3%A9s-eve)).

apply [UniqueRef](#test-1-uniqueref-unique-referent)(_c_)

**NO** => _c_ is **not a NE**

**YES** => go to [Step 2](#step-2-identifying-a-named-entity).

- **Step 1** - perform the (fuzzy) **NE identification** tests
- **Step 2** - if _c_ has been identified as a fuzzy NE in Step 1, apply the **span tests**
|
|
|
|
|
|
<!--------------------------------------------------------------------------------------------->
|
|
|
## Step 2 - identifying a named entity
|
|
|
Let _c_ denote the candidate sequence identified in [Step 1](#step-1-choosing-a-candidate-sequence), and _t_ the text being annotated.
|
|
|
## Step 1 - identifying a naming convention
|
|
|
Let _c_ denote the candidate sequence identified while following the [general decision tree](https://gitlab.lis-lab.fr/PARSEME-FR/PARSEME-FR-public/wikis/Guide-annotation-PARSEME_FR-chapeau), and _t_ the text being annotated.
|
|
|
|
|
|
apply [ObviousProper](#test-1-obviousproper-obvious-proper-name)(_c_,_t_)

**YES** => _c_ is a fuzzy NE, go to [Step 3](#step-3-establishing-the-span-of-a-named-entity)

...

fuzzyNE => go to [Step 3](#step-3-establishing-the-span-of-a-named-entity)

noNE => _c_ is **not a NE**
|
|
|
|
|
|
<!--
|
|
|
**NO** => apply [DefDesc](#DefDesc)(_c_,_t_)
|
|
|
**NO** => _c_ is not a NE
|
|
|
-->
|
|
|
|
|
|
|
|
|
<!--------------------------------------------------------------------------------------------->
|
|
|
## Step 3 - establishing the span of a named entity
|
|
|
Let _c_ denote the candidate sequence identified as a fuzzy NE in [Step 2](#step-2-identifying-a-named-entity). Some doubts may persist as to whether some components of _c_ belong within the span of the final NE. To decide whether the precise span of _c_ is valid, apply the following decision tree.

## Step 2 - establishing the span of a named entity

Let _c_ denote the candidate sequence identified as a fuzzy NE in [Step 1](#step-1-identifying-a-naming-convention). Some doubts may persist as to whether some components of _c_ belong within the span of the final NE. To decide whether the precise span of _c_ is valid, apply the following decision tree.

Let _c_ denote the candidate sequence, possibly with a determiner _d_, a classifier _cl_ and/or an adverbial _a_.

...

_c_ is a **NE with a precise span**
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 1 [ObviousProper] - obvious proper name
|
|
|
|
|
|
Is the candidate sequence obviously a proper name? That is, is the annotator confident that a naming convention exists for the sequence? For large classes of NEs (except person names, which do not have synonyms), this test can be reformulated in a way similar to the [LEX](Criteres-lexicaux#41-figement-des-%C3%A9l%C3%A9ments-lexicalement-pleins-t%C3%AAte-ou-compl%C3%A9ment-lex-et-term) test: when one of the components of the candidate sequence is replaced by a synonym or a hypernym, is the reference to the initial entity lost?

Examples:

* _Pierre Martin habite à Maisons-Alfort_ - both _Pierre Martin_ (PERS) and _Maisons-Alfort_ (LOC) are clearly proper names. Notably, replacing the latter with _Habitations-Alfort_ would not preserve the same referent. Test passed.
* _A Maisons-Alfort il y a plusieurs personnes nommées Pierre Martin_ - both _Pierre Martin_ (PERS) and _Maisons-Alfort_ (LOC) are clearly proper names. Test passed.
* _Structures d’insertion par l’activité économique en Vendée_ - it is not obvious that _Structures d’insertion par l’activité économique_ (no NE) is subject to a naming convention. The LEX-like test is hard to apply. Test not passed.
* _La fédération des entreprises d'insertion du Pays de la Loire_ - it is not obvious that _la fédération des entreprises d'insertion_ (ORG) is subject to a naming convention. The LEX-like test is hard to apply. Test not passed.
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 2 [DiscNameConv] - discovering a naming convention
|
|
|
|
|
|
Does a naming convention, initially unknown (or unclear) to the annotator, apply to the candidate sequence?
|
|
|
This test is organised as a decision tree. Let _c_ denote the candidate sequence and _t_ the text being annotated.
|
|
|
|
|
|
<!-- ``` -->
|
|
|
**if** ([RelevUpper](#test-3-relevupper-relevant-uppercase)(_c_,_t_)) **then**

**return** fuzzyNE

**else**

**if** (there exists a variant _v_ of _c_ in _t_ such that [RelevUpper](#test-3-relevupper-relevant-uppercase)(_v_,_t_)) **then**

**return** fuzzyNE

**else**

**if** ([Acron](#test-5-acron-acronym)(_c_,_t_) **or** [WebPage](#test-6-webpage-dedicated-web-page)(_c_)) **then**

**return** preciseNE

**return** noNE
|
|
|
<!-- ``` -->
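
The decision tree above can be sketched in Python as follows. This is only an illustration and not part of the guidelines: the names `Outcome` and `disc_name_conv` are hypothetical, the three tests are passed in as callbacks because they rely on the annotator's judgement, and the nesting of the branches is assumed from the order of the lines above.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    """Possible results of the test (labels taken from the decision tree)."""
    FUZZY_NE = "fuzzyNE"
    PRECISE_NE = "preciseNE"
    NO_NE = "noNE"


def disc_name_conv(
    c: str,
    t: str,
    variants: Iterable[str],
    relev_upper: Callable[[str, str], bool],
    acron: Callable[[str, str], bool],
    web_page: Callable[[str], bool],
) -> Outcome:
    """Hypothetical sketch of the tree for candidate c in text t.

    `variants` holds the variants of c found in t; how they are
    collected is left to the caller (an assumption of this sketch).
    """
    # Branch 1: the candidate itself carries a relevant uppercase.
    if relev_upper(c, t):
        return Outcome.FUZZY_NE
    # Branch 2: some variant of the candidate in the text does.
    if any(relev_upper(v, t) for v in variants):
        return Outcome.FUZZY_NE
    # Branch 3: an acronym or a dedicated web page settles the span too.
    if acron(c, t) or web_page(c):
        return Outcome.PRECISE_NE
    return Outcome.NO_NE
```

Passing the tests as callbacks keeps the control flow separate from the judgements themselves, which remain the annotator's responsibility.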
|
|
|
|
|
|
------------------------------------------------
|
|
|
### Test 3 [RelevUpper] - relevant uppercase
|
|
|
|
|
|
Is the candidate sequence spelled with an initial uppercase letter to signal a proper name, rather than for other reasons? If unsure, answer no.
|
|
|
|
|
|
Examples:
|
|
|
* _Il a évoqué l'affaire des disparus du Beach_ - _affaire des disparus du Beach_ (EVE/no NE) is not spelled with an initial uppercase letter. Test not passed.
* _Affaire des disparus du Beach: 100 morts_ - _Affaire des disparus du Beach_ (EVE/no NE) is spelled with an initial uppercase letter because it starts a sentence. Test not passed.
* _J'ai eu l'honneur d'être reçu par Monsieur le Président du conseil en juin dernier_ - _Monsieur_ (no NE) and _Président du conseil_ (no NE) are in the middle of a sentence, but they are spelled with an initial uppercase letter for honorific reasons. Test not passed.
* _QUID DE L'AFFAIRE DES DISPARUS DU BEACH?_ - _AFFAIRE DES DISPARUS DU BEACH_ is spelled with an initial uppercase letter because the whole sequence is in uppercase. Test not passed.
* _Il a évoqué l'Affaire des disparus du Beach_ - _Affaire des disparus du Beach_ (EVE) is spelled with an initial uppercase letter in the middle of a sentence, not for honorific reasons. We hypothesise that the author is signalling a proper name. Test passed.
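
As a rough illustration, the three non-signalling reasons shown in the examples above (sentence-initial position, all-uppercase text, honorific use) can be screened with a small heuristic. This sketch cannot replace the annotator's judgement: the function name `relev_upper` and the `HONORIFICS` list are hypothetical, and the function answers "no" whenever it cannot decide, in line with the "if unsure, answer no" rule.

```python
# Hypothetical, deliberately tiny list of honorific heads (an assumption).
HONORIFICS = {"monsieur", "madame", "président", "présidente"}


def relev_upper(c: str, sentence: str) -> bool:
    """Heuristic approximation of the RelevUpper test for candidate c."""
    # No initial uppercase at all: nothing to interpret.
    if not c or not c[0].isupper():
        return False
    start = sentence.find(c)
    if start == -1:
        return False  # candidate not found: unsure, so answer no
    # Uppercase explained by the sentence-initial position.
    if sentence[:start].strip() == "":
        return False
    # Uppercase explained by the whole sentence being in capitals.
    if sentence.isupper():
        return False
    # Uppercase explained by honorific use.
    if c.split()[0].lower() in HONORIFICS:
        return False
    return True


print(relev_upper("Affaire des disparus du Beach",
                  "Il a évoqué l'Affaire des disparus du Beach"))
```

The checks are ordered from the cheapest to the most lexicon-dependent; a real tool would need sentence segmentation and a much richer honorific lexicon.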
|
|
|
|
|
|
<!--------------------------------------------------------------------------------------------->
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<!------------------------------------------------
|
|
|
### Test 1 [UniqueRef] - unique referent
|
|
|
Does the sequence name a unique object in the discourse world?
|
|
|
<!--- in an autonomous manner, i.e. without having to take the linguistic context (other than the place and date of the utterance) into account? -->
|
...
|
|
* _Angiox est un médicament suédois_ - Angiox is considered here as the name of an invention (molecule) or of a trade mark, it refers to a unique instance. Test passed.
|
|
|
* _J'ai pris 2 Angiox avant de dormir_ - Angiox refers to several products of this mark. Test not passed.
|
|
|
* _A Maisons-Alfort il y a plusieurs Pierres Martins_ - _Pierres Martins_ (no NE) refers to several persons. Test not passed.
|
|
|
------------------------------------------------>
|
|
|
|
|
|
<!-- obsolete: it is indeed the interpretation in context that matters

**ADDED DURING THE MATHIEU/MARIE ADJUDICATION**:

...

**Solution 2**: diminutives of NEs are also annotated, even if a little context is needed to obtain the precise reference. Slippery ground!
|
|
|
-->
|
|
|
|
|
|
<!--
|
...
|
|
* _Le Conseil de l'Union européenne a été informé des économies réalisées par le FMI._ - empirical knowledge is needed to understand that an institution like the EU is led by a council - _Conseil de l'Union européenne_ (ORG) - consisting of all prime ministers. Test passed.
|
|
|
-->
|
|
|
|
|
|
<!--
|
|
|
------------------------------------------------
|
...