This page describes the basic usage of the software. Once you are familiar with it, you can learn to Customize your machines.
\section introduction Introduction
Macaon is a tool designed to perform fundamental Natural Language Processing tasks on tokenized text input, such as:
- Part of speech tagging
- Morphosyntactic tagging
- Lemmatization
- Dependency parsing
This processing is done in a greedy and incremental way:
- Greedy, because at each step only one decision is considered: the one maximizing the local score function.
- Incremental, because when multiple tasks are performed (like the four above: POS tagging, morphosyntactic tagging, lemmatization and parsing), they are not treated sequentially as in traditional systems (first POS tagging on the whole input, then morphosyntactic tagging on the whole input, and so on); instead, each word of the input is processed at every level before the system moves on to the next word.
Sequential way: the whole input is annotated at one level, then traversed again for the next level, and so on.
Incremental way: each word is annotated at every level before the system moves on to the next word.
Macaon is designed to be a simple-to-use tool; the following sections explain how to install and use it.
\section installation Installation
First of all, Macaon relies on the following libraries:
- Boost program_options
- Dynet (you can install it with MKL as a backend to enable multi-threading)
- Fasttext
Make sure to download and install them all.
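As an illustration only, on a Debian-based system with typical source builds (package names, paths and build options may differ on your machine, and Dynet offers MKL-related cmake options if you want the MKL backend), installing the dependencies can look like this:
sudo apt-get install libboost-program-options-dev
git clone https://github.com/clab/dynet && cd dynet
mkdir build && cd build
cmake .. -DEIGEN3_INCLUDE_DIR=/path/to/eigen && make -j && sudo make install
cd ../..
git clone https://github.com/facebookresearch/fastText && cd fastText
mkdir build && cd build
cmake .. && make -j && sudo make install
cd ../..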
Then download the Macaon source code and install it:
cd macaon
mkdir build
cd build
cmake .. && make -j && sudo make install
Macaon should compile and install itself.
Then you need to download the macaon_data repository, which contains the corpora and the trained tools.
You will also need to compile the tools:
cd macaon_data
cd tools
make
Then you have to set the environment variable 'MACAON_DIR' to the path where macaon_data is installed:
echo "export MACAON_DIR=/path/to/macaon_data" >> ~/.bashrc
bash
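You can then check, in the new shell, that the variable points to the right place:
echo $MACAON_DIR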
Now everything is installed and you can proceed to the training section.
\section training Training
Go to the desired language directory:
cd macaon_data/fr
If this is your first time training a model for this language, you will have to create the datasets:
cd data
cd morpo-lexicon
make
cd ..
cd treebank
make
cd ../..
Now you can use the train.sh script to train a model to fit the dataset present in data/treebank. The script takes two arguments:
- The name of the folder containing the description of the machine.
- The name of the model that will be trained.
For instance, you can train a model called tagger1 by calling:
./train.sh tagger tagger1
Training of 'Tagger Machine' :
[dynet] random seed: 100
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
Tagger topology : (242->300->25)
Iteration 1/5 :
Tagger accuracy : train(89.58%) dev(96.81%) SAVED
Iteration 2/5 :
Tagger accuracy : train(97.68%) dev(97.30%) SAVED
Iteration 3/5 :
Tagger accuracy : train(98.32%) dev(97.46%) SAVED
Iteration 4/5 :
Tagger accuracy : train(98.71%) dev(97.46%)
Iteration 5/5 :
Tagger accuracy : train(98.92%) dev(97.41%)
Important information that will appear during training:
- The topology of each Multi-Layer Perceptron used. In this example we have one MLP with an input layer of 242 neurons, a single hidden layer of 300 neurons and an output layer of 25 neurons.
- For each epoch/iteration, the accuracy of every classifier on the training set and on the development set is displayed.
- For each iteration and each classifier, 'SAVED' appears if this version of the classifier has been saved to disk. This happens when the current classifier beats its previous accuracy record.
For more information about the allowed options of the train.sh script, you can consult the help:
./train.sh -h
After the training is complete, the trained model will be stored in the bin directory.
You can train as many models of a machine as you want; they will not interfere with each other.
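For example, you could train a second model of the tagger machine alongside tagger1 (the name tagger2 below is just an illustration):
./train.sh tagger tagger2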
\section evaluation Evaluation
If you want to evaluate trained models against their test dataset, just navigate to the eval folder and launch the script:
cd macaon_data/fr/eval
./eval.sh tagger
Evaluation of tagger1 ... Done !
tagger 97.60 0.00 0.00 3.79 0.00 0.00 0.00 31110
The script can evaluate multiple models at once:
./eval.sh tagger tagger1 parser2
The result of each evaluation is stored in a file named after the language, here fr.res:
cat fr.res
tool pos morpho lemma uas las srec sacc nbWords
tagger 97.60 0.00 0.00 3.79 0.00 0.00 0.00 31110
tagger1 96.70 0.00 0.00 3.79 0.00 0.00 0.00 31110
Where:
- tool is the name of the tool being evaluated.
- pos is the accuracy in part of speech tagging.
- morpho is the accuracy in morphosyntactic tagging.
- lemma is the accuracy in lemmatization.
- uas is the accuracy in governor prediction.
- las is the accuracy in syntactic function and governor prediction.
- srec is the recall in end of sentence prediction.
- sacc is the accuracy in end of sentence prediction.
- nbWords is the number of tokens in the test dataset.
\section a Using your trained models
To use a previously trained model, simply launch the corresponding script in the bin folder of the desired language.
For instance, to process an input file 'file.mcf' with our trained model 'tagger1':
cd macaon_data/fr/bin
./maca_tm_tagger1 file.mcf file.mcd
Where file.mcd is a file describing the columns of file.mcf.
For more convenience, you can add your favorite language's bin directory to your PATH environment variable:
echo 'export PATH=$PATH:$MACAON_DIR/fr/bin' >> ~/.bashrc
bash
This way you can call any trained model from anywhere without typing the absolute path.
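For instance, with the bin directory on your PATH, the tagger1 model trained above can be run from any directory:
maca_tm_tagger1 file.mcf file.mcd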