Commit 43274cf2 authored by ybrenning

Initial commit

parent 85467447
%% Cell type:markdown id:afc75a8b-e333-438d-96d2-1aaac1156fef tags:
# CoNLL-U in Python
This notebook contains a very brief demonstration of the CoNLL-U format and corresponding Python library.
[CoNLL-U](https://universaldependencies.org/format.html) is a standard file format used to represent syntactic annotations of sentences for NLP tasks.
Each file contains one or more sentences, where each sentence is represented by:
- Optional comment lines starting with `#` (e.g., sentence text, sentence ID)
- One line per word/token, with 10 tab-separated fields
- Underscore (`_`) denotes unspecified values
An example sentence from a `.conll` file might look like this:
%% Cell type:markdown id:e291ced6-e816-4c11-8f14-8a1036f4b5d2 tags:
```console
# text = Caderousse resta un instant étourdi sous le poids de cette supposition .
# sent_id = 0
1 Caderousse Caderousse PROPN _ _ 2 nsubj _ start_char=0|end_char=10
2 resta rester VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ start_char=11|end_char=16
3 un un DET _ Definite=Ind|Gender=Masc|Number=Sing|PronType=Art 4 det _ start_char=17|end_char=19
4 instant instant NOUN _ Gender=Masc|Number=Sing 2 obj _ start_char=20|end_char=27
5 étourdi étourdir VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass 4 acl _ start_char=28|end_char=35
6 sous sous ADP _ _ 8 case _ start_char=36|end_char=40
7 le le DET _ Definite=Def|Gender=Masc|Number=Sing|PronType=Art 8 det _ start_char=41|end_char=43
8 poids poids NOUN _ Gender=Masc|Number=Sing 2 obl:mod _ start_char=44|end_char=49
9 de de ADP _ _ 11 case _ start_char=50|end_char=52
10 cette ce DET _ Gender=Fem|Number=Sing|PronType=Dem 11 det _ start_char=53|end_char=58
11 supposition supposition NOUN _ Gender=Fem|Number=Sing 8 nmod _ start_char=59|end_char=70
12 . . PUNCT _ _ 2 punct _ start_char=71|end_char=72
```
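To make the ten-column layout concrete, the following sketch splits a single token line by hand. The column names follow the CoNLL-U format description; `parse_token_line` is a hypothetical helper written only for illustration (it is not part of the `conllu` library) and it ignores details such as multiword token ranges or a literal `_` used as a word form.

``` python
# Column names as listed in the CoNLL-U format description.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one tab-separated token line into a dict (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    # An underscore marks an unspecified value; map it to None.
    return {name: (None if value == "_" else value) for name, value in zip(FIELDS, values)}


# Token 2 of the example sentence, with the tab separators written out explicitly.
line = (
    "2\tresta\trester\tVERB\t_\t"
    "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t"
    "0\troot\t_\tstart_char=11|end_char=16"
)
token = parse_token_line(line)
print(token["form"], token["upos"], token["xpos"])
```

All values stay plain strings here; splitting `feats` and `misc` into key-value pairs is left out to keep the sketch short.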
%% Cell type:markdown id:341ab9fd-37ab-492c-b347-637b48877d01 tags:
First, make sure the corresponding [Python library](https://pypi.org/project/conllu/) is installed, then import it:
%% Cell type:code id:7a535674-659b-4aa4-84f0-9c931dadaff6 tags:
``` python
!pip install conllu
```
%% Cell type:code id:406299bd-97c5-4ed6-a825-032e4bd90dcd tags:
``` python
import conllu
```
%% Cell type:code id:764d7cd0-a89e-42c3-bb4a-ff98e7947c88 tags:
``` python
file_path = "data/Le_comte_de_Monte-Cristo,_Tome_I.tok.dev.conll"
with open(file_path, "r", encoding="utf-8") as f:
    content = f.read()
sents = conllu.parse(content)
first_sent = sents[0]
print(f"Loaded {len(sents)} sentences.\n\nExample sentence:\n{first_sent}")
```
%% Output
Loaded 526 sentences.
Example sentence:
TokenList<Caderousse, resta, un, instant, étourdi, sous, le, poids, de, cette, supposition, ., metadata={text: "Caderousse resta un instant étourdi sous le poids de cette supposition .", sent_id: "0"}>
%% Cell type:markdown id:aaab26be-c773-4d0c-9284-7e40b377623c tags:
Each sentence is parsed as a list of tokens, each of which is represented as a Python dictionary. We can take a closer look at the properties of the token `resta` from the sentence above:
%% Cell type:code id:c403bd8b-9c15-48ec-8dfc-40de7b6c00fc tags:
``` python
tok = first_sent[1]
tok
```
%% Output
{'id': 2,
'form': 'resta',
'lemma': 'rester',
'upos': 'VERB',
'xpos': None,
'feats': {'Mood': 'Ind',
'Number': 'Sing',
'Person': '3',
'Tense': 'Past',
'VerbForm': 'Fin'},
'head': 0,
'deprel': 'root',
'deps': None,
'misc': {'start_char': '11', 'end_char': '16'}}
%% Cell type:code id:a4897a96-d1c5-4e24-988c-122ad5257db0 tags:
``` python
tok["lemma"]
```
%% Output
'rester'
%% Cell type:markdown id:41805d2a-dd0f-490f-bd03-a009d44b4dc8 tags:
Each token provides the following fields (the two remaining columns, `deps` and `misc`, appear in the output above):
- `id`
- `form` (the word)
- `lemma`
- `upos` (universal POS tag)
- `xpos` (language-specific POS tag)
- `feats` (features like gender, number, etc.)
- `head` (governor word ID)
- `deprel` (dependency relation)
A more detailed explanation of each of these fields can be found [here](https://universaldependencies.org/format.html).
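Since `head` stores the ID of the governor, the dependency structure can be recovered by looking tokens up by their `id`. The sketch below uses plain dictionaries that stand in for the first four tokens of the example sentence, so it runs without the library; because parsed tokens support the same dictionary-style access, the same approach should carry over to a parsed sentence. `governor_form` is a hypothetical helper name.

``` python
# Plain dicts standing in for the first four tokens of the example sentence.
tokens = [
    {"id": 1, "form": "Caderousse", "upos": "PROPN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "resta", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "un", "upos": "DET", "head": 4, "deprel": "det"},
    {"id": 4, "form": "instant", "upos": "NOUN", "head": 2, "deprel": "obj"},
]

# Index tokens by ID so that a head ID can be resolved to its token.
by_id = {tok["id"]: tok for tok in tokens}


def governor_form(tok):
    """Return the form of a token's governor, or 'ROOT' when head is 0."""
    return "ROOT" if tok["head"] == 0 else by_id[tok["head"]]["form"]


# One (dependent, relation, governor) triple per token.
edges = [(tok["form"], tok["deprel"], governor_form(tok)) for tok in tokens]
for dependent, relation, governor in edges:
    print(f"{dependent} --{relation}--> {governor}")
```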
%% Cell type:markdown id:a0923af5-9d2c-4f66-86f9-590ec5d4a170 tags:
The `serialize()` method can be used to convert a `TokenList` back into CoNLL-U format:
%% Cell type:code id:e3a66dcd-5291-4278-a498-54d1d741de7a tags:
``` python
sents[0].serialize()
```
%% Output
'# text = Caderousse resta un instant étourdi sous le poids de cette supposition .\n# sent_id = 0\n1\tCaderousse\tCaderousse\tPROPN\t_\t_\t2\tnsubj\t_\tstart_char=0|end_char=10\n2\tresta\trester\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t0\troot\t_\tstart_char=11|end_char=16\n3\tun\tun\tDET\t_\tDefinite=Ind|Gender=Masc|Number=Sing|PronType=Art\t4\tdet\t_\tstart_char=17|end_char=19\n4\tinstant\tinstant\tNOUN\t_\tGender=Masc|Number=Sing\t2\tobj\t_\tstart_char=20|end_char=27\n5\tétourdi\tétourdir\tVERB\t_\tGender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass\t4\tacl\t_\tstart_char=28|end_char=35\n6\tsous\tsous\tADP\t_\t_\t8\tcase\t_\tstart_char=36|end_char=40\n7\tle\tle\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\t8\tdet\t_\tstart_char=41|end_char=43\n8\tpoids\tpoids\tNOUN\t_\tGender=Masc|Number=Sing\t2\tobl:mod\t_\tstart_char=44|end_char=49\n9\tde\tde\tADP\t_\t_\t11\tcase\t_\tstart_char=50|end_char=52\n10\tcette\tce\tDET\t_\tGender=Fem|Number=Sing|PronType=Dem\t11\tdet\t_\tstart_char=53|end_char=58\n11\tsupposition\tsupposition\tNOUN\t_\tGender=Fem|Number=Sing\t8\tnmod\t_\tstart_char=59|end_char=70\n12\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\tstart_char=71|end_char=72\n\n'
%% Cell type:markdown id:4f6e86c7-1cd7-41ec-9524-06ad0e6defc2 tags:
If the file is very large, it may help to use `parse_incr` instead of `parse` in order to read it incrementally:
%% Cell type:code id:8857d51b-3989-4ca0-b477-c22d823ff3f1 tags:
``` python
with open(file_path, "r", encoding="utf-8") as f:
    sent_iter = conllu.parse_incr(f)
    for i, sent in enumerate(sent_iter):
        for tok in sent:
            print(tok["form"], end=" ")
        print("\n")
        # Only read first three sentences for demo
        if i == 2:
            break
```
%% Output
Caderousse resta un instant étourdi sous le poids de cette supposition .
«Oh ! dit -il à le bout d' un instant , et en prenant son chapeau qu' il posa sur le mouchoir rouge noué autour de sa tête , nous allons bien le savoir .
-- Et comment cela ?
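The idea behind incremental reading can be illustrated without the library: sentences are separated by blank lines, so a generator can hand out one sentence at a time and never hold the whole file in memory. The sketch below is a simplified stand-in under that assumption, not how `parse_incr` is actually implemented; `iter_sentence_blocks` is a hypothetical name, and the sample input is a truncated toy with only two columns per token.

``` python
import io


def iter_sentence_blocks(lines):
    """Lazily yield one sentence at a time as a list of its lines (hypothetical helper)."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            # A blank line terminates the current sentence.
            yield block
            block = []
    if block:
        # Last sentence when the input lacks a trailing blank line.
        yield block


# Toy input: two truncated sentences, each followed by a blank line.
sample = io.StringIO(
    "# sent_id = 0\n1\tCaderousse\n2\tresta\n\n"
    "# sent_id = 1\n1\t--\n2\tEt\n\n"
)
blocks = list(iter_sentence_blocks(sample))
print(len(blocks), blocks[0][0])
```

Any iterable of lines works as input, including an open file handle, which is what makes the approach suitable for large files.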
"""
Basic sketch of the `metric` script.
The script should be capable of performing two different tasks.
Given two corpora $C_1$ and $C_2$:
1. Calculate individual "scores" based on a custom metric.
2. Calculate a similarity score between the two corpora based on this metric.
The metric will be calculated based on the distributions of various linguistic
features, all of which should be obtainable given a corpus' CoNLL-U file.
In order to test the metric, we first apply it to pre-existing corpora,
with the long-term goal of applying it to the task of generated text
evaluation.
Some information on possible evaluation metrics for generated text can be
found in the README file of this repository.
"""
FILEPATH_1 = ""
FILEPATH_2 = ""


def read_conllu(filepath):
    """
    Read a corpus (text) from a .conllu file.
    See also:
    - [CoNLL-U Format](https://universaldependencies.org/format.html)
    - [Python package](https://pypi.org/project/conllu/)
    """
    raise NotImplementedError()


def calculate_abs_score(corpus):
    """
    Given some CoNLL-U corpus, calculate its "score"
    based on the distributions of various linguistic features.
    """
    raise NotImplementedError()


def calculate_rel_score(corpus_1, corpus_2):
    """
    Given two CoNLL-U corpora, calculate a "relative score",
    i.e., a similarity metric between the two texts based on
    the compared distributions of various linguistic features.
    """
    raise NotImplementedError()


def main():
    corpus_1 = read_conllu(FILEPATH_1)
    corpus_2 = read_conllu(FILEPATH_2)
    score_1 = calculate_abs_score(corpus_1)
    score_2 = calculate_abs_score(corpus_2)
    similarity = calculate_rel_score(corpus_1, corpus_2)


if __name__ == "__main__":
    main()
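One possible way to fill in the two scoring stubs is shown below, purely as an illustration. Using UPOS tags as the only feature and comparing the distributions with Jensen-Shannon divergence are assumptions made for this sketch, not the metric the script is meant to settle on. `upos_distribution` and `js_similarity` are hypothetical names, and the corpora are toy lists of sentences whose tokens are plain dicts.

``` python
import math
from collections import Counter


def upos_distribution(corpus):
    """Relative frequency of each UPOS tag over all tokens in the corpus."""
    counts = Counter(tok["upos"] for sent in corpus for tok in sent)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base 2), so 1.0 means identical."""
    def kl(a, m):
        # Terms with a[k] == 0 contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    # Put both distributions on the same set of tags.
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    # Mixture distribution used by the Jensen-Shannon divergence.
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 1.0 - (0.5 * kl(p, m) + 0.5 * kl(q, m))


# Toy corpora: one sentence each, tokens reduced to their UPOS tag.
corpus_a = [[{"upos": "DET"}, {"upos": "NOUN"}, {"upos": "VERB"}]]
corpus_b = [[{"upos": "DET"}, {"upos": "NOUN"}, {"upos": "NOUN"}]]

dist_a = upos_distribution(corpus_a)
dist_b = upos_distribution(corpus_b)
print(dist_a)
print(js_similarity(dist_a, dist_b))
```

With base-2 logarithms the divergence lies between 0 and 1, so the similarity does as well: identical distributions give 1.0 and distributions with no tags in common give 0.0.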