# Getting Started
    
The program comes in two modes: *train* and *decode*.\
You can view all the available program arguments with `macaon train -h` and `macaon decode -h`.\
You can use the program directly to train models and annotate input, but this is tedious because you would have to type in a lot of arguments.\
It is recommended to use [macaon_data](https://gitlab.lis-lab.fr/franck.dary/new_macaon_data) instead, a collection of scripts for organizing your models. This documentation assumes you are using *macaon_data*.\
If you understand the project structure well enough, you can of course write your own scripts instead.
    
## Organizing your training corpora:
    
The format used for both input and output of macaon is the [CoNLL-U Plus Format](https://universaldependencies.org/ext-format.html).\
If a model is trained to perform tokenization, it takes raw text as input.
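
For reference, a minimal CoNLL-U fragment looks like this (the sentence and its annotations are illustrative, not actual macaon output). Each token line has ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC, with `_` for empty values:

```
# text = Hello world
1	Hello	hello	INTJ	_	_	0	root	_	_
2	world	world	NOUN	_	_	1	vocative	_	_
```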
    
The scripts in *macaon_data* expect your data to be organized in the style of [Universal Dependencies](https://universaldependencies.org/): one directory per corpus, each directory containing *train*, *dev* and *test* files.\
Filenames are important: they must match `*train*.conllu`, `*dev*.conllu` and `*test*.conllu` respectively.\
If you want to use a *Universal Dependencies* corpus as your training data, you just have to download UD and extract it somewhere on your computer.
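
As a sketch, a UD-style layout can be set up like this (the corpus and file names below are examples; any names matching the patterns above will do):

```shell
# One directory per corpus; file names must match the *train*.conllu,
# *dev*.conllu and *test*.conllu patterns (names here are examples).
mkdir -p corpora/UD_French-GSD
touch corpora/UD_French-GSD/fr_gsd-ud-train.conllu \
      corpora/UD_French-GSD/fr_gsd-ud-dev.conllu \
      corpora/UD_French-GSD/fr_gsd-ud-test.conllu
ls corpora/UD_French-GSD
```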
    
## Setting up macaon_data:
    
Simply edit the file `macaon_data/UD_any/config` so that `UD_ROOT=` points to the directory containing your corpus directories.
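
For example, if your treebanks were extracted under `/home/me/ud-treebanks` (an example path, adjust to your setup), the `config` file would contain a line like:

```shell
# macaon_data/UD_any/config — example path, adjust to your setup
UD_ROOT=/home/me/ud-treebanks/
```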
    
## Structure of macaon_data/UD_any:
    
* `bin`: contains all of your trained models.
* `data`: copied inside every model directory; in charge of generating [Transition Sets](transitionSet.md).
* `prepareExperiment.sh`: creates a directory in `bin` for your model, allowing it to be trained.
* `train.sh`: trains a model that has been prepared by `prepareExperiment.sh`.
* `evaluate.sh`: evaluates a model that has been trained by `train.sh`.
* `batches.py`: a file in which you can define multiple experiments, to be used as an argument to `launchBatches.py`.
* `launchBatches.py`: runs multiple experiments at the same time; can be used to launch *oar* or *slurm* jobs.
* Every other directory: contains a [Reading Machine](readingMachine.md) file that you can train using `train.sh`.
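
Putting these scripts together, a typical experiment follows the sketch below. The exact arguments of each script are not documented here, so the `<angle-bracket>` placeholders are assumptions; run a script without arguments or read its source to see what it actually expects:

```
# Pseudocode sketch — placeholders are assumptions, not documented flags.
cd macaon_data/UD_any
./prepareExperiment.sh <arguments>   # creates a directory for your model in bin
./train.sh <arguments>               # trains the prepared model
./evaluate.sh <arguments>            # evaluates the trained model
```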
    
## Next steps:
    
* [Training a machine](training.md)
* [Defining your own machine](readingMachine.md)

[Back to main page](../README.md)