Getting Started

The program comes in two modes: train and decode.
You can view all the possible program arguments with macaon train -h and macaon decode -h.
It is possible to use the program directly to train models and annotate input, but this is a bit tedious, as you would have to type in a lot of arguments.
It is recommended to use macaon_data instead, a collection of scripts that organize your models. This documentation assumes that you are using macaon_data.
If you understand the project structure well enough, you can of course write your own scripts instead.
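
For reference, this is how you would print the help for each mode:

```
macaon train -h
macaon decode -h
```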

Organizing your training corpora:

The format used as both the input and output of macaon is the CoNLL-U Plus format.
If a model is trained to perform tokenization, it takes raw text as input.
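
As an illustration, here is a minimal CoNLL-U Plus excerpt using the ten standard UD columns (tab-separated; the global.columns header declares the column inventory, which may differ in your own corpora):

```
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# text = It works.
1	It	it	PRON	_	_	2	nsubj	_	_
2	works	work	VERB	_	_	0	root	_	_
3	.	.	PUNCT	_	_	2	punct	_	_
```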

The scripts in macaon_data expect your data to be organized in the style of Universal Dependencies: one directory per corpus, each directory containing train, dev and test files.
Filenames are important: they must match *train*\.conllu, *dev*\.conllu and *test*\.conllu respectively.
If you want to use a Universal Dependencies corpus as your training data, you just have to download UD and extract it somewhere on your computer.
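
For example, an extracted UD treebank already follows this layout (the treebank name here is just an illustration):

```
UD/
└── UD_French-GSD/
    ├── fr_gsd-ud-train.conllu
    ├── fr_gsd-ud-dev.conllu
    └── fr_gsd-ud-test.conllu
```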

Setting up macaon_data:

Simply edit the file macaon_data/UD_any/config so that UD_ROOT= points to the directory containing your corpus directories.
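
With the example layout above, the config line would look like this (the path itself is illustrative; adapt it to your machine):

```
# Directory that contains the corpus directories (e.g. UD_French-GSD)
UD_ROOT=/home/you/UD/
```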

Structure of macaon_data/UD_any:

• bin: contains all of your trained models.
• data: copied inside every model directory; in charge of generating Transition Sets.
• prepareExperiment.sh: script that creates a directory in bin for your model, allowing it to be trained.
• train.sh: trains a model that has been prepared by prepareExperiment.sh.
• evaluate.sh: evaluates a model that has been trained by train.sh (a typical sequence is sketched after this list).
• batches.py: a file that you can use to define multiple experiments. To be used as an argument to launchBatches.py.
• launchBatches.py: script that allows you to run multiple experiments at the same time. Can be used to launch OAR or Slurm jobs.
• templates/*: each contains a Reading Machine file that you can train using train.sh.
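
Putting these together, a typical experiment runs prepareExperiment.sh, then train.sh, then evaluate.sh. The sketch below only shows that order; the actual arguments each script expects are not covered here, so check each script's usage before running it:

```
cd macaon_data/UD_any
./prepareExperiment.sh ...   # create a directory in bin/ for the model
./train.sh ...               # train the prepared model
./evaluate.sh ...            # evaluate the trained model on the test set
```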

Next steps:

Back to main page