Multi Column Files Format
The Multi Column Files (mcf) format is the text format used to represent text and its annotations. Every line of an mcf corresponds to an atomic unit of text (loosely called a word). Each column describes one attribute of that unit. Columns are separated by tab characters, and the number of columns in an mcf is unbounded. Columns can be associated with labels via an mcd file; associating a column with a label makes its content accessible through Word Features.
The list of labels is the following:
- FORM form of the word
- CPOS coarse part of speech
- POS part of speech
- LEMMA lemma
- FEATS other linguistic features (usually morphological)
- GOV relative position of the governor (-n indicates that the governor is n words to the left, n indicates that it is n words to the right)
- LABEL label of the syntactic dependency
- SENT_SEG indicates whether the word is the last word in the sentence (in the example below, 1 marks a sentence-final word and 0 any other word)
- A to Z other labels used to represent other useful information (word duration, speaker, ...)
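To illustrate the GOV convention, here is a minimal Python sketch that turns relative governor offsets into absolute word positions. The function name `governor_indices` is hypothetical and not part of any tool. Treating a GOV of 0 as "no governor" is an assumption drawn from the example below, where root words carry 0.

```python
def governor_indices(gov_offsets):
    """Map each word position to the absolute position of its governor.

    gov_offsets[i] is the GOV value of word i: a negative value means the
    governor is to the left, a positive value means it is to the right.
    Assumption: 0 marks a word without a governor (root), mapped to None.
    """
    result = []
    for position, offset in enumerate(gov_offsets):
        if offset == 0:
            result.append(None)
        else:
            result.append(position + offset)
    return result


# GOV values of the three words "la diane chantait"
print(governor_indices([1, 1, 0]))  # [1, 2, None]
```

Positions are counted from 0 here, so "la" (position 0) is governed by "diane" (position 1), which is governed by "chantait" (position 2).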
Here is an example of two sentences represented as an mcf. The first column corresponds to FORM, the second to POS, the third to LEMMA, the fourth to GOV, the fifth to LABEL, and the last to SENT_SEG.
FORM | POS | LEMMA | GOV | LABEL | SENT_SEG |
---|---|---|---|---|---|
la | det | le | 1 | det | 0 |
diane | nc | diane | 1 | suj | 0 |
chantait | v | chanter | 0 | root | 0 |
dans | prep | dans | -1 | mod | 0 |
la | det | le | 1 | det | 0 |
cour | nc | cour | -2 | obj | 0 |
des | prep | des | -1 | dep | 0 |
casernes | nc | caserne | -1 | obj | 0 |
. | poncts | . | -6 | eos | 1 |
et | coo | et | 0 | root | 0 |
le | det | le | 1 | det | 0 |
vent | nc | vent | 3 | suj | 0 |
du | prep | du | -1 | dep | 0 |
matin | nc | matin | -1 | obj | 0 |
soufflait | v | souffler | -5 | dep_coord | 0 |
sur | prep | sur | -1 | mod | 0 |
les | det | le | 1 | det | 0 |
lanternes | nc | lanterne | -2 | obj | 0 |
. | poncts | . | -9 | eos | 1 |
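The following Python sketch shows one way such a file could be read. It assumes the column order of the example above (FORM, POS, LEMMA, GOV, LABEL, SENT_SEG) is hard-coded rather than read from an mcd file, whose syntax is not described here. The names `COLUMNS`, `read_mcf` and `split_sentences` are hypothetical.

```python
# Assumed column order, taken from the example above (not read from an mcd file).
COLUMNS = ["FORM", "POS", "LEMMA", "GOV", "LABEL", "SENT_SEG"]


def read_mcf(text, columns=COLUMNS):
    """Turn tab-separated mcf text into a list of {label: value} dicts."""
    words = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        words.append(dict(zip(columns, fields)))
    return words


def split_sentences(words):
    """Group words into sentences; a SENT_SEG value of "1" closes a sentence."""
    sentences, current = [], []
    for word in words:
        current.append(word)
        if word.get("SENT_SEG") == "1":
            sentences.append(current)
            current = []
    if current:  # trailing words without a closing SENT_SEG marker
        sentences.append(current)
    return sentences


# Four consecutive lines excerpted from the example above, tab-separated.
sample = (
    "casernes\tnc\tcaserne\t-1\tobj\t0\n"
    ".\tponcts\t.\t-6\teos\t1\n"
    "et\tcoo\tet\t0\troot\t0\n"
    "le\tdet\tle\t1\tdet\t0\n"
)

sentences = split_sentences(read_mcf(sample))
print([[w["FORM"] for w in s] for s in sentences])
# [['casernes', '.'], ['et', 'le']]
```

All values are kept as strings; converting GOV to an integer is left to the caller, since a column may be absent or empty in other files.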