... | ... | @@ -2,18 +2,6 @@ |
|
|
|
|
|
The Multi Column File (mcf) format is the text format used to represent text and its annotations. Every line of an mcf corresponds to an atomic unit of text (loosely called a word). Each column describes an attribute of that unit. Columns are separated by tab characters. The number of columns in an mcf is unbounded. Columns can be associated with a **label** via an [mcd](mcd) file. Associating a column with a [label](column_labels) makes its content accessible through [Word Features](features).
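As a rough illustration of the layout described above, the following Python sketch splits mcf text into one list of column values per line. The function name `parse_mcf` is hypothetical and not part of any existing tool, and skipping blank lines is an assumption, since the format description does not say how they are treated.

```python
def parse_mcf(text):
    """Split mcf text into rows of column values, one row per line.

    Hypothetical helper for illustration only.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: blank lines carry no token and are skipped.
            continue
        # Columns are separated by tab characters; their number is not fixed.
        rows.append(line.split("\t"))
    return rows


sample = "la\tdet\tle\t1\tdet\t0\n"
rows = parse_mcf(sample)
print(rows[0])
```

Each row is kept as a plain list because the meaning of a column is only known once an mcd file attaches a label to it.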
|
|
|
|
|
|
The list of labels is the following:
|
|
|
|
|
|
* **FORM** form of the word
|
|
|
* **CPOS** coarse part of speech
|
|
|
* **POS** part of speech
|
|
|
* **LEMMA** lemma
|
|
|
* **FEATS** other linguistic features (usually morphological)
|
|
|
* **GOV** relative position of the governor (-n means that the governor is n words to the left; n means that it is n words to the right)
|
|
|
* **LABEL** label of the syntactic dependency
|
|
|
* **SENT_SEG** indicates whether the word is the last word of the sentence
|
|
|
* **A** to **Z** other labels used to represent other useful information (word duration, speaker, ...)
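The labels above can be attached to parsed columns, and the relative **GOV** value turned into an absolute position, along the lines of the sketch below. The column order in `COLUMNS` is only an assumption made for illustration (in practice the mcd file defines it), and `label_row` and `governor_index` are hypothetical names.

```python
# Assumed column order for this sketch; a real mcd file would define it.
COLUMNS = ["FORM", "POS", "LEMMA", "GOV", "LABEL", "SENT_SEG"]


def label_row(row, columns=COLUMNS):
    """Pair each column value of one mcf line with its label."""
    return dict(zip(columns, row))


def governor_index(position, gov):
    """Return the absolute position of the governor.

    GOV is an offset relative to the word itself: a negative value points
    to the left, a positive value to the right.
    """
    return position + int(gov)


word = label_row(["la", "det", "le", "1", "det", "0"])
print(word["FORM"], governor_index(0, word["GOV"]))  # prints "la 1"
```

Column values stay strings until they are needed, so only **GOV** is converted to an integer here.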
|
|
|
|
|
|
Here is an example of two sentences represented as an mcf. The first column corresponds to **FORM**, the second to **POS**, the third to **LEMMA**, the fourth to **GOV**, the fifth to **LABEL**, and the last to **SENT_SEG**.
|
|
|
|
|
|
la | det | le | 1 | det | 0
|
... | ... | |