[TOC]

This page describes the basic usage of the software. Once you are familiar with it, you can learn to:

* [Customize your machines](docs/machine.md)

\section Introduction

__Macaon__ is software designed to perform fundamental Natural Language Processing tasks on tokenized text input, such as:

* Part-of-speech tagging
* Morphosyntactic tagging
* Lemmatization
* Dependency parsing

This processing is done in a _greedy_ and _incremental_ way:

* _Greedy_ because at each step only one decision is considered: the one that maximizes the local score function.
* _Incremental_ because when multiple tasks are performed (like the four above: POS tagging, morphosyntactic tagging, lemmatization and parsing), they are not run sequentially as in traditional systems (first POS tagging for the whole input, then morphosyntactic tagging for the whole input, and so on). Instead, each word of the input is processed at every level before the system moves on to the next word.

Sequential way:

<img src="/home/franck/macaon/docs/sequential.svg" width="300px"/>

Incremental way:

<img src="/home/franck/macaon/docs/incremental.svg" width="300px"/>

__Macaon__ is designed to be simple to use. The following sections explain how to install and use it.

\section Installation

First of all, __Macaon__ relies on the following libraries:

* [Boost program_options](https://www.boost.org/doc/libs/1_55_0/doc/html/bbv2/installation.html)
* [Dynet](https://dynet.readthedocs.io/en/latest/install.html) (you can install it with MKL as a backend to enable multi-threading)
* [Fasttext](https://github.com/facebookresearch/fastText)

Make sure to download and install them all.

Then download the [macaon](https://gitlab.lis-lab.fr/franck.dary/macaon) source code and install it:

    cd macaon
    mkdir build
    cd build
    cmake .. && make -j && sudo make install

_Macaon_ should then compile and install itself.

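Before running the build commands above, it can help to confirm that the build tools are on your PATH. The following is only a minimal sketch and is not part of Macaon: `missing_tools` is a hypothetical helper name, and it only checks the `cmake` and `make` commands used above. It cannot tell whether the Boost, Dynet or Fasttext libraries are installed.

```shell
# Hypothetical helper (not shipped with Macaon):
# print the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the two build commands used in the installation steps.
missing=$(missing_tools cmake make)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "cmake and make are available."
fi
```

If a tool is reported as missing, install it with your system's package manager before running cmake.
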
Then you need to download the [data repository](https://gitlab.lis-lab.fr/franck.dary/macaon_data) containing the corpora and the trained tools. You will need to compile the tools:

    cd macaon_data
    cd tools
    make

Then set the environment variable 'MACAON_DIR' to the path where macaon_data is installed:

    echo "export MACAON_DIR=/path/to/macaon_data" >> ~/.bashrc
    bash

Now everything is installed and you can proceed to the training section.

\section Training

Go to the desired language directory:

    cd macaon_data/fr

If this is your first time training a model for this language, you will have to create the datasets:

    cd data
    make

Now you can use the train.sh script to train a model to fit the dataset present in data/treebank. The script takes two arguments:

* The name of the folder containing the description of the machine.
* The name of the model that will be trained.

For instance, you can train a model called tagger1 by calling:

    ./train.sh tagger tagger1
    Training of 'Tagger Machine' :
    [dynet] random seed: 100
    [dynet] allocating memory: 512MB
    [dynet] memory allocation done.
    Tagger topology : (242->300->25)
    Iteration 1/5 :
        Tagger accuracy : train(89.58%) dev(96.81%) SAVED
    Iteration 2/5 :
        Tagger accuracy : train(97.68%) dev(97.30%) SAVED
    Iteration 3/5 :
        Tagger accuracy : train(98.32%) dev(97.46%) SAVED
    Iteration 4/5 :
        Tagger accuracy : train(98.71%) dev(97.46%)
    Iteration 5/5 :
        Tagger accuracy : train(98.92%) dev(97.41%)

Important information that appears during training:

* The topology of each Multi-Layer Perceptron (MLP) used. In this example there is one MLP with an input layer of 242 neurons, a single hidden layer of 300 neurons and an output layer of 25 neurons.
* For each epoch (iteration), the accuracy of every classifier on the training set and on the development set.
* For each iteration and each classifier, 'SAVED' appears if this version of the classifier has been saved to disk.
This happens when the current classifier beats its previous accuracy record.

For more information about the options accepted by the train.sh script, you can consult its help:

    ./train.sh -h

After training is complete, the trained model is stored in the bin directory. You can train as many models of a machine as you want; they will not interfere with each other.

\section Evaluation

If you want to evaluate trained models against their test dataset, navigate to the eval folder and launch the script:

    cd macaon_data/fr/eval
    ./eval.sh tagger
    Evaluation of tagger1 ... Done !
    tagger 97.60 0.00 0.00 3.79 0.00 0.00 0.00 31110

The script can evaluate multiple models at once:

    ./eval.sh tagger tagger1 parser2

The result of each evaluation is stored in the file named language.res (fr.res in this example):

    cat fr.res
    tool     pos    morpho  lemma  uas   las   srec  sacc  nbWords
    tagger   97.60  0.00    0.00   3.79  0.00  0.00  0.00  31110
    tagger1  96.70  0.00    0.00   3.79  0.00  0.00  0.00  31110

Where:

* tool is the name of the tool being evaluated.
* pos is the accuracy in part-of-speech tagging.
* morpho is the accuracy in morphosyntactic tagging.
* lemma is the accuracy in lemmatization.
* uas (unlabeled attachment score) is the accuracy in governor prediction.
* las (labeled attachment score) is the accuracy in syntactic function and governor prediction.
* srec is the recall in end-of-sentence prediction.
* sacc is the accuracy in end-of-sentence prediction.
* nbWords is the number of tokens in the test dataset.

\section a Using your trained models

To use a previously trained model, simply launch the corresponding script in the bin folder of the desired language. For instance, to process an input 'file.mcf' with our trained model 'tagger1':

    cd macaon_data/fr/bin
    ./maca_tm_tagger1 file.mcf file.mcd

Here file.mcd is a file describing the columns of file.mcf.

For convenience, you can add the bin directory of your favorite language to your PATH environment variable. The single quotes keep the variables unexpanded until ~/.bashrc is read:

    echo 'PATH="$PATH:$MACAON_DIR/fr/bin"' >> ~/.bashrc
    bash

You can then call any trained model from anywhere without typing its absolute path.
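
Running the echo command above more than once appends the same entry to ~/.bashrc each time. As a minimal sketch of an alternative, the lines below could be placed in ~/.bashrc so that the directory is added to PATH only if it is not already there. The function name `path_append` is hypothetical, and /path/to/macaon_data is the same placeholder as in the installation section, used here as a fallback when MACAON_DIR is not set.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$PATH:$1" ;;     # otherwise append it at the end
    esac
}

# MACAON_DIR is the variable set during installation; 'fr' is the example language.
path_append "${MACAON_DIR:-/path/to/macaon_data}/fr/bin"
export PATH
```

Because the check runs every time the file is read, opening nested shells does not make PATH grow.
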