Getting Started
The program comes in two modes: train and decode. You can view all the possible program arguments with `macaon train -h` and `macaon decode -h`.
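For instance:

```sh
# Show the available options for each mode.
macaon train -h
macaon decode -h
```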
It is possible to use the program directly to train models and annotate input, but it is a bit tedious, as you would have to type in a lot of arguments.
It is recommended to use `macaon_data` instead, a collection of scripts that organize your models. This documentation assumes that you are using `macaon_data`.
If you understand the project structure well enough, you can of course write your own scripts instead.
Organizing your training corpora:
The format used as both input and output of macaon is the CoNLL-U Plus Format.
If a model is trained to perform tokenization, it will take raw text as input.
The scripts in `macaon_data` expect your data to be organized in the style of Universal Dependencies: one directory per corpus, each directory containing train, dev and test files.
Filenames are important: they must respectively match `*train*\.conllu`, `*dev*\.conllu` and `*test*\.conllu`.
If you want to use a Universal Dependencies corpus as your training data, you just have to download UD and extract it somewhere on your computer.
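For example, after extracting UD, a layout like the following matches the expected structure (the root path is hypothetical; the treebank and file names follow the standard UD naming, which already matches the patterns above):

```
ud-treebanks/
├── UD_French-GSD/
│   ├── fr_gsd-ud-train.conllu
│   ├── fr_gsd-ud-dev.conllu
│   └── fr_gsd-ud-test.conllu
└── UD_English-EWT/
    ├── en_ewt-ud-train.conllu
    ├── en_ewt-ud-dev.conllu
    └── en_ewt-ud-test.conllu
```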
Setting up macaon_data:
Simply edit the file `macaon_data/UD_any/config` so that `UD_ROOT=` points to the directory containing your corpora directories.
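Continuing with the hypothetical layout above, the relevant line of the config file would look like:

```sh
# macaon_data/UD_any/config
# Hypothetical path; point it at the directory that contains your corpus directories.
UD_ROOT=~/ud-treebanks/
```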
Structure of `macaon_data/UD_any`:
- `bin`: contains all of your trained models.
- `data`: will be copied inside every model directory; it is in charge of generating Transition Sets.
- `prepareExperiment.sh`: script that creates a directory in `bin` for your model, allowing it to be trained.
- `train.sh`: trains a model that has been prepared by `prepareExperiment.sh`.
- `evaluate.sh`: evaluates a model that has been trained by `train.sh` (a typical sequence of these three scripts is sketched after this list).
- `batches.py`: a file that you can use to define multiple experiments. To be used as an argument to `launchBatches.py`.
- `launchBatches.py`: script that allows you to run multiple experiments at the same time. Can be used to launch OAR or Slurm jobs.
- Every other directory: contains a Reading Machine file that you can train using `train.sh`.
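To show how these scripts fit together, here is a minimal sketch of one experiment. The corpus name, Reading Machine directory, and experiment name are hypothetical, and the exact arguments each script expects are an assumption; check each script's usage message before running it.

```sh
cd macaon_data/UD_any

# Prepare a model directory under bin/ (hypothetical arguments: corpus name,
# Reading Machine directory, experiment name).
./prepareExperiment.sh UD_French-GSD myMachine myExperiment

# Train the prepared model.
./train.sh myExperiment

# Evaluate the trained model.
./evaluate.sh myExperiment
```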