# Macaon Documentation
## Overview
Macaon is a trainable piece of software whose purpose is to annotate text.\
It has been built to perform any combination of the following annotations :
* Tokenization
* POS tagging
* Feats tagging
* Dependency parsing
* Sentence segmentation
The focus is on customization :
* Choose the order in which the predictions are made, along with the mode of annotation :
  * Sequential (pipeline) mode : the whole input is processed at a certain annotation level **n** before being processed at level **n+1**.
  * Incremental mode : each word is processed at every annotation level before moving on to the next word.
* Precisely choose which parts of the input text will be used as features by the classifier.
## [Installation](documentation/install.md)
## [Getting Started](documentation/gettingStarted.md)
TODO
# Getting Started
The program comes in two modes : *train* and *decode*.\
You can view all the possible program arguments with `macaon train -h` and `macaon decode -h`.\
It is possible to directly use the program to train models and annotate input, but it is a bit tedious as you would have to type in a lot of arguments.\
It is recommended to use [macaon_data](https://gitlab.lis-lab.fr/franck.dary/new_macaon_data) instead, a collection of scripts to organize your models. It is assumed in this documentation that you are using *macaon_data*.\
If you understand the project structure well enough, you can obviously write your own scripts instead.
## Organizing your training corpora :
The format used as both input and output of macaon is the [CoNLL-U Plus Format](https://universaldependencies.org/ext-format.html).\
If a model is trained to perform tokenization, it will take raw text as input.
The scripts in *macaon_data* expect your data to be organized in the style of [Universal Dependencies](https://universaldependencies.org/) : one directory per corpus, each directory containing *train*, *dev* and *test* files.\
Filenames are important : they must match `*train*\.conllu`, `*dev*\.conllu` and `*test*\.conllu` respectively.\
If you want to use a *Universal Dependencies* corpus as your training data, you just have to download UD and extract it somewhere on your computer.
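For example, with the *French-GSD* treebank from *Universal Dependencies*, the expected layout could look like this (directory and file names follow the usual UD naming and are only illustrative) :
```
UD_ROOT/
└── UD_French-GSD/
    ├── fr_gsd-ud-train.conllu
    ├── fr_gsd-ud-dev.conllu
    └── fr_gsd-ud-test.conllu
```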
## Setting up macaon_data :
Simply edit the file `macaon_data/UD_any/config` so that `UD_ROOT=` points to the directory containing your corpora directories.
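For instance, if the treebanks were extracted to `/home/user/ud-treebanks-v2.x` (a placeholder path), the relevant line of `config` would read :
```
UD_ROOT=/home/user/ud-treebanks-v2.x/
```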
## Structure of macaon_data/UD_any :
* `bin` : contains all of your trained models.
* `data` : will be copied inside every model directory; it is in charge of generating [Transition Sets](transitionSet.md).
* `prepareExperiment.sh` : script that creates a directory in `bin` for your model, allowing it to be trained.
* `train.sh` : trains a model that has been prepared by `prepareExperiment.sh`.
* `evaluate.sh` : evaluates a model that has been trained by `train.sh`.
* `batches.py` : a file that you can use to define multiple experiments. To be used as an argument to `launchBatches.py`.
* `launchBatches.py` : script that allows you to run multiple experiments at the same time. Can be used to launch *oar* or *slurm* jobs.
* `Every other directory` : contains a [Reading Machine](readingMachine.md) file that you can train using `train.sh`.
## Next steps :
* [Training a machine](training.md)
* [Defining your own machine](readingMachine.md)
[Back to main page](../README.md)
# Installation
## Requirements :
* GNU/Linux OS
* CMake >= 3.16.4
* C++20 compiler such as g++ >= 9.2
* LibTorch version 1.5 cxx11 ABI : [link](https://pytorch.org/get-started/locally/)
* Boost >= 1.53.0 with program_options : [link](https://www.boost.org/doc/libs/1_73_0/more/getting_started/unix-variants.html)
## Download :
https://gitlab.lis-lab.fr/franck.dary/new_macaon
## Compilation :
`$ cd macaon`\
`$ mkdir build`\
`$ cd build`\
`$ cmake -DCMAKE_INSTALL_PREFIX=/path/to/install ..`\
`$ make -j && make install`
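If CMake does not find LibTorch on its own, it may help to point it to the directory where you extracted LibTorch through `CMAKE_PREFIX_PATH` (the path below is a placeholder) :\
`$ cmake -DCMAKE_INSTALL_PREFIX=/path/to/install -DCMAKE_PREFIX_PATH=/path/to/libtorch ..`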
[Back to main page](../README.md)
# Reading Machine
A reading machine can be thought of as a kind of deterministic finite automaton where :
* The machine is made of *states*, *transitions* and a *strategy*.
* The machine works on a *configuration*, representing the input text to be annotated and its annotations so far.
* A *transition* is an endofunction on the set of *configurations* : it makes a small change, such as adding a single annotation.
* The *strategy* is a function that takes the current *state* and the chosen *transition* as inputs, and returns the new *state*.
* The machine contains a classifier that is trained to predict the next *transition* to take, given the current *configuration*.
* At each step, and until the *configuration* is final :
1. The classifier will predict the next *transition* to take.
2. This *transition* will be applied to the current *configuration*, thus yielding the new *configuration*.
3. The *strategy* will determine the new *state*.
## Configuration :
A configuration is the current state of the analysis : it is made of the input text along with all of the annotations predicted so far.\
It is said to be final when all the input text has been processed.
## File format :
A reading machine is defined in a `.rm` file (given as an argument to `macaon train`).\
Here is an example of a Reading Machine doing POS tagging, dependency parsing and sentence segmentation in an incremental fashion (POS tag one word, then attach it to the dependency tree, then decide whether or not to end the sentence, then move the focus to the next word and repeat) :
```
Name : Tagger, Parser and Segmenter incremental Machine
Classifier : taggerparser
{
Transitions : {tagger,data/tagger.ts parser,data/parser.ts segmenter,data/segmenter.ts}
LossMultiplier : {segmenter,10.0}
Network type : Modular
StateName : Out{64}
Context : Buffer{-3 -2 -1 0 1 2} Stack{} Columns{FORM} LSTM{1 1 0 1} In{64} Out{64}
Context : Buffer{-3 -2 -1 0} Stack{1 0} Columns{UPOS} LSTM{1 1 0 1} In{64} Out{64}
Focused : Column{ID} NbElem{1} Buffer{-1 0 1 2} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
Focused : Column{FORM} NbElem{13} Buffer{-1 0 1 2} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
Focused : Column{EOS} NbElem{1} Buffer{-1} Stack{} LSTM{1 1 0 1} In{64} Out{64}
Focused : Column{DEPREL} NbElem{1} Buffer{} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
DepthLayerTree : Columns{DEPREL} Buffer{} Stack{2 1 0} LayerSizes{3} LSTM{1 1 0.0 1} In{64} Out{64}
InputDropout : 0.5
MLP : {2048 0.3 2048 0.3}
End
Optimizer : Adam {0.0002 0.9 0.999 0.00000001 0.00001 true}
}
Predictions : UPOS HEAD DEPREL EOS
Strategy
{
Block : End{cannotMove}
tagger parser * 0
parser segmenter SHIFT 0
parser segmenter RIGHT 0
parser parser * 0
segmenter tagger * 1
}
```
This format is composed of several parts :
* Name : The name of your machine.
* Classifier : The name of your classifier, followed by its definition between braces. See [Classifier](classifier.md).
* Predictions : Names of the columns that are predicted by your machine.
* Strategy : The strategy of your machine, defined between braces. See [Strategy](strategy.md).
[Back to main page](../README.md)
TODO
# Training
The easiest way to train a [Reading Machine](readingMachine.md) is to use the scripts provided by *macaon_data*.\
For example, if one would like to train a *parser* called *myFrenchParser* on the UD treebank French-GSD :\
`$ cd macaon_data/UD_any`\
`$ ./prepareExperiment.sh UD_French-GSD parser myFrenchParser`\
`$ ./train.sh tsv bin/myFrenchParser`
## prepareExperiment.sh
The purpose of this script is simply to generate a new experiment directory inside *bin/*.\
The usage is `./prepareExperiment.sh corpusName templateName experimentName`.
## train.sh
This script will train your model by calling `macaon train` with the correct arguments.\
The usage is `./train.sh mode experimentPath arguments`, where :
* mode is `txt` if your model does tokenization, `tsv` if it doesn't.
* experimentPath is the relative path to your model.
* arguments is a list of arguments to give to `macaon train` (it can be empty).
Example : `$ ./train.sh tsv bin/myFrenchParser -n 30 --batchSize 128`.
For a list of available arguments execute `macaon train -h`.\
You can inspect how a model has been trained by looking at the file `bin/yourModel/train.info`.\
You can stop training and resume it at any time, which also allows you to increase the number of epochs.
## evaluate.sh
This script will evaluate your trained model against the test corpora using the official [CoNLL 2018 Shared Task](http://universaldependencies.org/conll18/evaluation.html) eval script.\
Under the hood it is a call to `macaon decode`. The usage is `./evaluate.sh mode experimentPath arguments`, where :
* mode is `txt` if your model does tokenization, `tsv` if it doesn't.
* experimentPath is the relative path to your model.
* arguments is a list of arguments to give to `macaon decode` (it can be empty).
Example : `$ ./evaluate.sh tsv bin/myFrenchParser`
## Using your trained model
Once a model has been trained, you can use it to annotate text.\
If your model doesn't do tokenization, your input file must be formatted in the [CoNLL-U Plus Format](https://universaldependencies.org/ext-format.html). Otherwise, your input file must be raw UTF-8 text.\
To use your trained model `myFrenchParser` to annotate the text in the file `myFrenchFile.conllu` :
* `$ macaon decode --model bin/myFrenchParser --inputTSV myFrenchFile.conllu`
The annotated file will be printed to the standard output.
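For reference, a minimal CoNLL-U Plus input for a model that does not tokenize could look like the following (the sentence and the column list are only illustrative; fields that are not yet annotated are left as `_`) :
```
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1	Le	_	_	_	_	_	_	_	_
2	chat	_	_	_	_	_	_	_	_
3	dort	_	_	_	_	_	_	_	_
4	.	_	_	_	_	_	_	_	_
```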
[Back to main page](../README.md)
TODO