Getting Started

The program comes in two modes: train and decode.
You can view all the available program arguments with macaon train -h and macaon decode -h.
It is possible to use the program directly to train models and annotate input, but this is tedious, as you would have to type in many arguments.
It is recommended to use macaon_data instead, a collection of scripts that organize your models. This documentation assumes that you are using macaon_data.
If you understand the project structure well enough, you can obviously write your own scripts instead.
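For example, to list the available options for each mode:

    macaon train -h
    macaon decode -h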

Organizing your training corpora:

The format used as both input and output of macaon is the CoNLL-U Plus format.
If a model is trained to perform tokenization, it will take raw text as input.
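For illustration, a CoNLL-U Plus file declares its columns on the first line and then lists one tab-separated token per line. The column set and sentence below are just an example; real corpora may declare a different subset of columns:

    # global.columns = ID FORM UPOS HEAD DEPREL
    1	The	DET	2	det
    2	cat	NOUN	3	nsubj
    3	sleeps	VERB	0	root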

The scripts in macaon_data expect your data to be organized in the style of Universal Dependencies: one directory per corpus, each directory containing train, dev and test files.
Filenames are important: they must match *train*\.conllu, *dev*\.conllu and *test*\.conllu respectively.
If you want to use a Universal Dependencies corpus as your training data, you just have to download UD and extract it somewhere on your computer.
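For example, a corpora directory might look like this (the corpus names are illustrative; note that the file names match the patterns above):

    corpora/
    ├── UD_English-EWT/
    │   ├── en_ewt-ud-train.conllu
    │   ├── en_ewt-ud-dev.conllu
    │   └── en_ewt-ud-test.conllu
    └── UD_French-GSD/
        ├── fr_gsd-ud-train.conllu
        ├── fr_gsd-ud-dev.conllu
        └── fr_gsd-ud-test.conllu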

Setting up macaon_data:

Simply edit the file macaon_data/UD_any/config so that UD_ROOT= points to the directory containing your corpora directories.
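For example, with the layout above (the path is hypothetical):

    UD_ROOT=/home/user/corpora/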

Structure of macaon_data/UD_any:

  • bin: contains all of your trained models.
  • data: copied into every model directory; in charge of generating Transition Sets.
  • prepareExperiment.sh: creates a directory in bin for your model, allowing it to be trained (a typical sequence is sketched after this list).
  • train.sh: trains a model that has been prepared by prepareExperiment.sh.
  • evaluate.sh: evaluates a model that has been trained by train.sh.
  • batches.py: a file in which you can define multiple experiments. To be used as an argument to launchBatches.py.
  • launchBatches.py: a script that runs multiple experiments at the same time. Can be used to launch OAR or Slurm jobs.
  • Every other directory: contains a Reading Machine file that you can train using train.sh.
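A minimal sketch of a typical workflow with these scripts, assuming they are run from macaon_data/UD_any. The placeholder arguments are assumptions, not documented usage; run each script without arguments or read its source to see what it actually expects:

    cd macaon_data/UD_any
    ./prepareExperiment.sh <corpus> <machine> <experiment>  # assumed arguments
    ./train.sh <experiment>                                 # assumed argument
    ./evaluate.sh <experiment>                              # assumed argument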

Next steps:

Back to main page