Commit d4b9cf3a authored by Franck Dary

Wrote documentation
[TOC]
This page describes the basic usage of the software. Once you are familiar with it, you can learn to [Customize your machines](docs/machine.md).
\section intro Introduction
__Macaon__ is a software package designed to perform fundamental Natural Language Processing tasks on tokenized text input, such as:
* Part of speech tagging
* Morphosyntactic tagging
* Lemmatization
* Dependency parsing
Such processing is done in a _greedy_ and _incremental_ way:
* _Greedy_ because at each step, only one decision is considered: the one maximizing our local score function.
* _Incremental_ because when multiple tasks are performed (like the four above: POS tagging, morphological tagging, lemmatization and parsing), they are not treated sequentially as in traditional systems (first POS tagging for the whole input, then morphological tagging for the whole input, and so on). Instead, each word of the input is processed at every level by the system before moving on to the next word.
Sequential way:
<img src="docs/sequential.svg" width="300px"/>
Incremental way:
<img src="docs/incremental.svg" width="300px"/>
__Macaon__ is designed to be a simple-to-use tool; in the following chapters we explain how to install and use it.
\section install Installation
First of all, __Macaon__ relies on the following libraries:
* [Boost program_options](https://www.boost.org/doc/libs/1_55_0/doc/html/bbv2/installation.html)
* [Dynet](https://dynet.readthedocs.io/en/latest/install.html) (you can install it with MKL as a backend to enable multi-threading)
* [Fasttext](https://github.com/facebookresearch/fastText)
Make sure to download and install them all.
Then download the source code of [macaon](https://gitlab.lis-lab.fr/franck.dary/macaon) and install it:
```
cd macaon
mkdir build
cd build
cmake .. && make -j && sudo make install
```
_Macaon_ should compile and install itself.
Then you need to download the [data repository](https://gitlab.lis-lab.fr/franck.dary/macaon_data) containing the corpora and the trained tools.
You will need to compile the tools:
```
cd macaon_data
cd tools
make
```
Then you have to set the environment variable 'MACAON_DIR' to the path where macaon_data is installed:
```
echo "export MACAON_DIR=/path/to/macaon_data" >> ~/.bashrc
bash
```
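As a quick sanity check (a sketch only, assuming macaon_data was cloned with its tools directory as above), you can verify that MACAON_DIR resolves before moving on:

```shell
# Verify that MACAON_DIR points at a directory containing the tools folder
if [ -d "$MACAON_DIR/tools" ]; then
    echo "MACAON_DIR ok"
else
    echo "MACAON_DIR is not set correctly" >&2
fi
```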
Now everything is installed and you can proceed to the training section.
\section training Training
Go to the desired language directory:
```
cd macaon_data/fr
```
If this is your first time training a model for this language, you will have to create the datasets:
```
cd data
cd morpho-lexicon
make
cd ..
cd treebank
make
cd ../..
```
Now you can use the train.sh script to train a model to fit the dataset present in data/treebank.
The script takes two arguments:
* The name of the folder containing the description of the machine.
* The name of the model that will be trained.
For instance, you can train a model called tagger1 by calling:
```
./train.sh tagger tagger1
```
```
Training of 'Tagger Machine' :
[dynet] random seed: 100
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
Tagger topology : (242->300->25)
Iteration 1/5 :
Tagger accuracy : train(89.58%) dev(96.81%) SAVED
Iteration 2/5 :
Tagger accuracy : train(97.68%) dev(97.30%) SAVED
Iteration 3/5 :
Tagger accuracy : train(98.32%) dev(97.46%) SAVED
Iteration 4/5 :
Tagger accuracy : train(98.71%) dev(97.46%)
Iteration 5/5 :
Tagger accuracy : train(98.92%) dev(97.41%)
```
Important information that will appear during training:
* The topology of each Multi-Layer Perceptron used. In this example we have one MLP with an input layer of 242 neurons, a single hidden layer of 300 neurons and an output layer of 25 neurons.
* For each epoch/iteration, the accuracy of every classifier on the training set and on the development set is displayed.
* For each iteration and each classifier, 'SAVED' appears if this version of the classifier has been saved to disk. This happens when the current classifier beats its former accuracy record.
For more information about the allowed options of the train.sh script, you can consult the help:
```
./train.sh -h
```
After the training is complete, the trained model will be stored in the bin directory.
You can train as many models of a machine as you want; they will not interfere with each other.
\section eval Evaluation
If you want to evaluate trained models against their testing dataset, just navigate to the eval folder and launch the script:
```
cd macaon_data/fr/eval
./eval.sh tagger
```
```
Evaluation of tagger1 ... Done !
tagger 97.60 0.00 0.00 3.79 0.00 0.00 0.00 31110
```
The script can evaluate multiple models at once:
```
./eval.sh tagger tagger1 parser2
```
The result of each evaluation is stored in the file named language.res:
```
cat fr.res
tool pos morpho lemma uas las srec sacc nbWords
tagger 97.60 0.00 0.00 3.79 0.00 0.00 0.00 31110
tagger1 96.70 0.00 0.00 3.79 0.00 0.00 0.00 31110
```
Where:
* tool is the name of the tool being evaluated.
* pos is the accuracy in part of speech tagging.
* morpho is the accuracy in morphosyntactic tagging.
* lemma is the accuracy in lemmatization.
* uas is the accuracy in governor prediction.
* las is the accuracy in syntactic function and governor prediction.
* srec is the recall in end of sentence prediction.
* sacc is the accuracy in end of sentence prediction.
* nbWords is the number of tokens in the test dataset.
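Since the columns of language.res are whitespace-separated, they can be pulled apart with standard tools. A small sketch (assuming the fr.res layout shown above) that prints each model's POS accuracy:

```shell
# Skip the header line, then print tool name and POS accuracy (columns 1 and 2)
awk 'NR > 1 { printf "%s: %s%%\n", $1, $2 }' fr.res
```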
\section a Using your trained models
To use a previously trained model, simply launch the corresponding script in the bin folder of the desired language.
For instance, to process an input 'file.mcf' with our trained model 'tagger1':
```
cd macaon_data/fr/bin
./maca_tm_tagger1 file.mcf file.mcd
```
Where file.mcd is a file describing the columns of file.mcf.
For more convenience, you can add your favorite language's bin directory to your PATH environment variable:
```
echo 'export PATH=$PATH:$MACAON_DIR/fr/bin' >> ~/.bashrc
bash
```
This way you can call any trained model from anywhere without typing the absolute path.
The commit also updates the Doxyfile so that these pages are rendered by Doxygen:

```
-TOC_INCLUDE_HEADINGS   = 0
+TOC_INCLUDE_HEADINGS   = 1

-USE_MDFILE_AS_MAINPAGE =
+USE_MDFILE_AS_MAINPAGE = docs/basics.md

-HTML_STYLESHEET        = docs/doxygen.css
+HTML_STYLESHEET        =

-HTML_EXTRA_STYLESHEET  =
+HTML_EXTRA_STYLESHEET  = docs/macaon.css
```
[TOC]
This page explains how a machine is encoded by files and how you can alter its hyperparameters.
\section b Machine Template
Inside every language directory you will see folders named after tools, like 'tagger', 'morpho', 'lemmatizer', 'parser', 'tagparser'.
Each of these folders contains the full description of an untrained TransitionMachine:
```
ls macaon_data/fr/tagger/
machine.tm tagger.as tagger.dicts test.bd
signature.cla tagger.cla tagger.fm train.bd
```
When a folder like this one is used for training (./train.sh), it is copied into the bin folder and trained.
In the following chapters we will explain what these files stand for and how to modify them.
\section c .bd files
They stand for Buffer Description.
Each line of the .bd file defines one tape of the multi-tape buffer.
Example:
```
#Name ref/hyp dict Policy Must print?#
############################################
FORM ref form FromZero 1
POS hyp pos FromZero 1
SGN hyp sgn FromZero 1
```
The columns:
* __Name__ is the name of this tape.
* __ref/hyp__ stands for reference/hypothesis. A tape is a reference if its content is given as input.
* __dict__ is the name of the dictionary that will store the elements of this tape.
* __Policy__ is how the dictionary is constructed:
  * __FromZero__ if the dictionary starts empty.
  * __Modifiable__ if the dictionary keeps its current values, and can modify them.
  * __Final__ if the dictionary keeps its current values, and cannot modify them.
* __Must print__ is whether or not this tape must be printed as an output of the program.
There are two .bd files per machine: one describing the buffer during training and the other describing the buffer during testing.
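As an illustration (a sketch only, using the sample .bd content above), the second column makes it easy to see which tapes are given as input and which must be predicted:

```shell
# Count reference vs. hypothesis tapes in a .bd file; '#' lines are headers
grep -v '^#' train.bd | awk '{ count[$2]++ } END { for (k in count) print k, count[k] }' | sort
```

On the example above this prints `hyp 2` and `ref 1`.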
\section d .dicts file
Each line of this file describes a dictionary that will be used by the machine.
Example:
```
#Name Dimension Mode #
############################
bool 02 Embeddings
int 05 Embeddings
letters 10 Embeddings
pos 15 Embeddings
form 30 Embeddings
sgn 10 Embeddings
```
The columns:
* __Name__ is the name of this dictionary.
* __Dimension__ is the number of float values used to encode each entry of this dictionary.
* __Mode__ is whether the entries are encoded in a __OneHot__ fashion or as __Embeddings__.
\section e .as files
They stand for ActionSet.
Each line of a .as file is the name of an action that the corresponding classifier is capable of.
Example:
```
WRITE 0 POS adj
WRITE 0 POS adv
WRITE 0 POS advneg
```
There are multiple kinds of actions:
* __WRITE x y z__ : write __z__ on the tape __y__, at column __x__ relative to the Config head.
* __RULE x ON y z__ : apply the rule __z__ to the tape __y__ and store the result on the tape __x__, both at the column pointed to by the Config head.
* __SHIFT__ : place the Config head in the stack.
* __REDUCE__ : pop the Config stack.
* __EOS__ : tag the Config head as an end of sentence, and pop the stack.
* __ROOT__ : tag the Config head as root of the sentence, and pop the stack.
* __LEFT x__ : add a dependency of label __x__ from the Config head to the top of the stack, and pop the stack.
* __RIGHT x__ : add a dependency of label __x__ from the top of the stack to the Config head.
An ActionSet can have a default action. In this case the file must begin with:
```
Default : x
```
Where __x__ is the name of the default action, the one that will be applied when no other action can be applied.
\section f .fm files
They stand for FeatureModel.
Each line of a .fm file is the name of one feature that will be used to transform a Configuration into a feature description, for a specific Classifier.
Example:
```
b.-3.POS
b.0.FORM.U
b.0.FORM.LEN
b.0.FORM.PART.-1
b.0.FORM.PART.-3.-1
b.0.FORM.PART.+1
s.0.ldep.LABEL
tc.0
```
Where:
* __b.-3.POS__ : the content of the tape POS, at the column at index -3 relative to the head.
* __b.0.FORM.U__ : whether or not the content of the tape FORM, at the column under the head, starts with an uppercase letter.
* __b.0.FORM.LEN__ : the number of letters of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.-1__ : the last letter of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.-3.-1__ : the last three letters of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.+1__ : the second letter of the content of the tape FORM, at the column under the head.
* __s.0.ldep.LABEL__ : the content of the tape LABEL, at the column of the closest left dependent of the column whose index is the value of the top of the stack.
* __tc.0__ : the previous action that was predicted by this Classifier.
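To make the PART features concrete, here is a small bash sketch (the word "mangent" is just an example, not tied to any dataset) of the substrings those features would extract from a FORM cell:

```shell
word="mangent"      # hypothetical content of the FORM tape under the head
echo "${word: -1}"  # last letter, as in b.0.FORM.PART.-1        -> t
echo "${word: -3}"  # last three letters, as in b.0.FORM.PART.-3.-1 -> ent
echo "${word:1:1}"  # second letter, as in b.0.FORM.PART.+1      -> a
```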
\section g .cla files
Each .cla file describes a Classifier of the TransitionMachine.
There are three types of Classifier.
The ones of type __Prediction__, for example:
```
Name : Tagger
Type : Prediction
Oracle : tagger
Feature Model : tagger.fm
Action Set : tagger.as
Topology : (300,RELU,0.3)
```
Where:
* __Name__ is the name of this Classifier.
* __Type__ is the type of this Classifier.
* __Oracle__ is the name of the oracle that will be used to train this Classifier.
* __Feature Model__ is the name of the FeatureModel used by this Classifier.
* __Action Set__ is the name of the ActionSet of this Classifier.
* __Topology__ is the topology of the underlying Multi-Layer Perceptron. It is a list of hidden layers; each hidden layer is written as the number of neurons, the activation function and the dropout rate.
This is the type of Classifier that relies on a neural network to make predictions; it requires training.
The ones of type __Information__, for example:
```
Name : Tagger
Type : Information
Oracle : signature
Oracle Filename : ../../data/morpho-lexicon/fP
```
Where:
* __Name__ is the name of this Classifier.
* __Type__ is the type of this Classifier.
* __Oracle__ is the name of the oracle that will be used by this Classifier.
* __Oracle Filename__ is the name of the file that the oracle can use to make its predictions.
This is the type of Classifier used to add information to the experiment, like the lemmas of words for instance.
And finally there are the ones of type __Forced__, which are only able to predict one Action.
\section h .tm files
A .tm file describes a TransitionMachine.
There must only be one .tm file per folder, and it must be called machine.tm.
Example:
```
Name : Tagger Machine
Dicts : tagger.dicts
%CLASSIFIERS
tagger tagger.cla
signature signature.cla
%STATES
signature1 signature
tagger1 tagger
%TRANSITIONS
signature1 tagger1 MULTIWRITE 0
tagger1 signature1 WRITE +1
```
Where:
* __Name__ is the name of this machine.
* __Dicts__ is the file describing the dictionaries used by the machine.
* __%CLASSIFIERS__ starts the classifiers section:
  * Each line contains the name of a Classifier and its corresponding file.
* __%STATES__ starts the states section:
  * Each line contains the name of a state and its corresponding classifier.
* __%TRANSITIONS__ starts the transitions section:
  * Each line describes a transition with: the starting state, the ending state, the corresponding type of action and the relative movement of the head.
The initial state of the machine is the state that is defined first.
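Since the initial state matters, here is a quick sketch (assuming the machine.tm layout above) to list the declared states in order, the first being the initial one:

```shell
# Print the lines between %STATES and %TRANSITIONS, excluding both markers
sed -n '/^%STATES/,/^%TRANSITIONS/p' machine.tm | sed '1d;$d'
```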