# Getting Started
    
The program comes in two modes: *train* and *decode*.\
You can view all the available program arguments with `macaon train -h` and `macaon decode -h`.\
You can use the program directly to train models and annotate input, but this is tedious because you would have to type in a lot of arguments.\
It is recommended to use [macaon_data](https://gitlab.lis-lab.fr/franck.dary/new_macaon_data) instead, a collection of scripts for organizing your models. This documentation assumes you are using *macaon_data*.\
If you understand the project structure well enough, you can of course write your own scripts instead.
    
## Organizing your training corpora:
    
The format used for both input and output of macaon is the [CoNLL-U Plus Format](https://universaldependencies.org/ext-format.html).\
If a model is trained to perform tokenization, it takes raw text as input.
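
For reference, a minimal CoNLL-U fragment looks like this (the sentence and its annotations are illustrative, not actual macaon output). Each token line has ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC, with `_` for empty values:

```
# text = Hello world
1	Hello	hello	INTJ	_	_	0	root	_	_
2	world	world	NOUN	_	_	1	vocative	_	_
```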
    
The scripts in *macaon_data* expect your data to be organized in the style of [Universal Dependencies](https://universaldependencies.org/): one directory per corpus, each directory containing *train*, *dev* and *test* files.\
Filenames are important: they must match `*train*.conllu`, `*dev*.conllu` and `*test*.conllu` respectively.\
If you want to use a *Universal Dependencies* corpus as your training data, you just have to download UD and extract it somewhere on your computer.
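
As a sketch, a UD-style layout can be set up like this (the corpus and file names below are examples; any names matching the patterns above will do):

```shell
# One directory per corpus; file names must match the *train*.conllu,
# *dev*.conllu and *test*.conllu patterns (names here are examples).
mkdir -p corpora/UD_French-GSD
touch corpora/UD_French-GSD/fr_gsd-ud-train.conllu \
      corpora/UD_French-GSD/fr_gsd-ud-dev.conllu \
      corpora/UD_French-GSD/fr_gsd-ud-test.conllu
ls corpora/UD_French-GSD
```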
    
## Setting up macaon_data:
    
Simply edit the file `macaon_data/UD_any/config` so that `UD_ROOT=` points to the directory containing your corpus directories.
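
For example, if your treebanks were extracted under `/home/me/ud-treebanks` (an example path, adjust to your setup), the `config` file would contain a line like:

```shell
# macaon_data/UD_any/config — example path, adjust to your setup
UD_ROOT=/home/me/ud-treebanks/
```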
    
## Structure of macaon_data/UD_any:
    
* `bin`: contains all of your trained models.
* `data`: copied inside every model directory; in charge of generating [Transition Sets](transitionSet.md).
* `prepareExperiment.sh`: creates a directory in `bin` for your model, allowing it to be trained.
* `train.sh`: trains a model that has been prepared by `prepareExperiment.sh`.
* `evaluate.sh`: evaluates a model that has been trained by `train.sh`.
* `batches.py`: a file in which you can define multiple experiments, to be used as an argument to `launchBatches.py`.
* `launchBatches.py`: runs multiple experiments at the same time; can be used to launch *oar* or *slurm* jobs.
* Every other directory: contains a [Reading Machine](readingMachine.md) file that you can train using `train.sh`.
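
Putting these scripts together, a typical experiment follows the sketch below. The exact arguments of each script are not documented here, so the `<angle-bracket>` placeholders are assumptions; run a script without arguments or read its source to see what it actually expects:

```
# Pseudocode sketch — placeholders are assumptions, not documented flags.
cd macaon_data/UD_any
./prepareExperiment.sh <arguments>   # creates a directory for your model in bin
./train.sh <arguments>               # trains the prepared model
./evaluate.sh <arguments>            # evaluates the trained model
```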
    
## Next steps:
    
* [Training a machine](training.md)
* [Defining your own machine](readingMachine.md)

[Back to main page](../README.md)