Getting Started

The program comes in two modes: train and decode.
You can view all the possible program arguments with macaon train -h and macaon decode -h.
It is possible to use the program directly to train models and annotate input, but this is a bit tedious, as you would have to type in a lot of arguments.
It is recommended to use macaon_data instead, a collection of scripts that organize your models. This documentation assumes that you are using macaon_data.
If you understand the project structure well enough, you can of course write your own scripts instead.
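
For reference, this is how you would print the help for each mode:

```
macaon train -h
macaon decode -h
```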

Organizing your training corpora:

The format used as both the input and output of macaon is the CoNLL-U Plus format.
If a model is trained to perform tokenization, it takes raw text as input.
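
As an illustration, here is a minimal CoNLL-U Plus excerpt using the ten standard UD columns (tab-separated; the global.columns header declares the column inventory, which may differ in your own corpora):

```
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# text = It works.
1	It	it	PRON	_	_	2	nsubj	_	_
2	works	work	VERB	_	_	0	root	_	_
3	.	.	PUNCT	_	_	2	punct	_	_
```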

The scripts in macaon_data expect your data to be organized in the style of Universal Dependencies: one directory per corpus, each directory containing train, dev and test files.
Filenames are important: they must match *train*\.conllu, *dev*\.conllu and *test*\.conllu respectively.
If you want to use a Universal Dependencies corpus as your training data, you just have to download UD and extract it somewhere on your computer.
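
For example, an extracted UD treebank already follows this layout (the treebank name here is just an illustration):

```
UD/
└── UD_French-GSD/
    ├── fr_gsd-ud-train.conllu
    ├── fr_gsd-ud-dev.conllu
    └── fr_gsd-ud-test.conllu
```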

Setting up macaon_data:

Simply edit the file macaon_data/UD_any/config so that UD_ROOT= points to the directory containing your corpus directories.
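
With the example layout above, the config line would look like this (the path itself is illustrative; adapt it to your machine):

```
# Directory that contains the corpus directories (e.g. UD_French-GSD)
UD_ROOT=/home/you/UD/
```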

Structure of macaon_data/UD_any:

• bin: contains all of your trained models.
• data: copied inside every model directory; in charge of generating Transition Sets.
• prepareExperiment.sh: script that creates a directory in bin for your model, allowing it to be trained.
• train.sh: trains a model that has been prepared by prepareExperiment.sh.
• evaluate.sh: evaluates a model that has been trained by train.sh (a typical sequence is sketched after this list).
• batches.py: a file that you can use to define multiple experiments. To be used as an argument to launchBatches.py.
• launchBatches.py: script that allows you to run multiple experiments at the same time. Can be used to launch OAR or Slurm jobs.
• templates/*: each contains a Reading Machine file that you can train using train.sh.
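
Putting these together, a typical experiment runs prepareExperiment.sh, then train.sh, then evaluate.sh. The sketch below only shows that order; the actual arguments each script expects are not covered here, so check each script's usage before running it:

```
cd macaon_data/UD_any
./prepareExperiment.sh ...   # create a directory in bin/ for the model
./train.sh ...               # train the prepared model
./evaluate.sh ...            # evaluate the trained model on the test set
```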

Next steps:

Back to main page