"This notebook contains a very brief demonstration of the CoNLL-U format and corresponding Python library.\n",
"\n",
"[CoNLL-U](https://universaldependencies.org/format.html) is a standard file format used to represent syntactic annotations of sentences for NLP tasks. \n",
"Each file contains one or more sentences, where each sentence is represented by:\n",
"First, make sure the corresponding [Python library](https://pypi.org/project/conllu/) is installed and run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a535674-659b-4aa4-84f0-9c931dadaff6",
"metadata": {},
"outputs": [],
"source": [
"!pip install conllu"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "406299bd-97c5-4ed6-a825-032e4bd90dcd",
"metadata": {},
"outputs": [],
"source": [
"import conllu"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "764d7cd0-a89e-42c3-bb4a-ff98e7947c88",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 526 sentences.\n",
"\n",
"Example sentence:\n",
"TokenList<Caderousse, resta, un, instant, étourdi, sous, le, poids, de, cette, supposition, ., metadata={text: \"Caderousse resta un instant étourdi sous le poids de cette supposition .\", sent_id: \"0\"}>\n"
"Each sentence is parsed as a list of tokens, each of which is represented as a Python dictionary. We can take a closer look at the properties of the token `resta` from the sentence above:"
"Generally, each token contains the following information in one way or another:\n",
"\n",
"- `id`\n",
"- `form` (the word)\n",
"- `lemma`\n",
"- `upos` (universal POS tag)\n",
"- `xpos` (language-specific POS tag)\n",
"- `feats` (features like gender, number, etc.)\n",
"- `head` (governor word ID)\n",
"- `deprel` (dependency relation)\n",
"\n",
"A more detailed explanation of each of these fields can be found [here](https://universaldependencies.org/format.html)."
]
},
{
"cell_type": "markdown",
"id": "a0923af5-9d2c-4f66-86f9-590ec5d4a170",
"metadata": {},
"source": [
"The `serialize()` method can be used to convert a `TokenList` back into CoNLL-U format:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "e3a66dcd-5291-4278-a498-54d1d741de7a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'# text = Caderousse resta un instant étourdi sous le poids de cette supposition .\\n# sent_id = 0\\n1\\tCaderousse\\tCaderousse\\tPROPN\\t_\\t_\\t2\\tnsubj\\t_\\tstart_char=0|end_char=10\\n2\\tresta\\trester\\tVERB\\t_\\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\\t0\\troot\\t_\\tstart_char=11|end_char=16\\n3\\tun\\tun\\tDET\\t_\\tDefinite=Ind|Gender=Masc|Number=Sing|PronType=Art\\t4\\tdet\\t_\\tstart_char=17|end_char=19\\n4\\tinstant\\tinstant\\tNOUN\\t_\\tGender=Masc|Number=Sing\\t2\\tobj\\t_\\tstart_char=20|end_char=27\\n5\\tétourdi\\tétourdir\\tVERB\\t_\\tGender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass\\t4\\tacl\\t_\\tstart_char=28|end_char=35\\n6\\tsous\\tsous\\tADP\\t_\\t_\\t8\\tcase\\t_\\tstart_char=36|end_char=40\\n7\\tle\\tle\\tDET\\t_\\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\\t8\\tdet\\t_\\tstart_char=41|end_char=43\\n8\\tpoids\\tpoids\\tNOUN\\t_\\tGender=Masc|Number=Sing\\t2\\tobl:mod\\t_\\tstart_char=44|end_char=49\\n9\\tde\\tde\\tADP\\t_\\t_\\t11\\tcase\\t_\\tstart_char=50|end_char=52\\n10\\tcette\\tce\\tDET\\t_\\tGender=Fem|Number=Sing|PronType=Dem\\t11\\tdet\\t_\\tstart_char=53|end_char=58\\n11\\tsupposition\\tsupposition\\tNOUN\\t_\\tGender=Fem|Number=Sing\\t8\\tnmod\\t_\\tstart_char=59|end_char=70\\n12\\t.\\t.\\tPUNCT\\t_\\t_\\t2\\tpunct\\t_\\tstart_char=71|end_char=72\\n\\n'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sents[0].serialize()"
]
},
{
"cell_type": "markdown",
"id": "4f6e86c7-1cd7-41ec-9524-06ad0e6defc2",
"metadata": {},
"source": [
"If the file is very large, it may help to use `parse_incr` instead of `parse` in order to read it incrementally:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "8857d51b-3989-4ca0-b477-c22d823ff3f1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Caderousse resta un instant étourdi sous le poids de cette supposition . \n",
"\n",
"«Oh ! dit -il à le bout d' un instant , et en prenant son chapeau qu' il posa sur le mouchoir rouge noué autour de sa tête , nous allons bien le savoir . \n",
"\n",
"-- Et comment cela ? \n",
"\n"
]
}
],
"source": [
"with open(file_path, \"r\", encoding=\"utf-8\") as f:\n",