Commit d4b9cf3a authored by Franck Dary

Wrote documentation
[TOC]
This page describes the basic usage of the software. Once you are familiar with it, you can learn to [Customize your machines](docs/machine.md).
\section intro Introduction
__Macaon__ is a software package designed to perform fundamental Natural Language Processing tasks on tokenized text input, such as:
* Part of speech tagging
* Morphosyntactic tagging
* Lemmatization
* Dependency parsing
Such processing is done in a _greedy_ and _incremental_ way:
* _Greedy_ because at each step, only one decision is considered: the one maximizing our local score function.
* _Incremental_ because when multiple tasks are performed (like the four above: POS tagging, morphological tagging, lemmatization and parsing), they are not treated sequentially as in traditional systems (first POS tagging for the whole input, then morphological tagging for the whole input, and so on). Instead, each word of the input is processed at every level by the system before moving on to the next word.
Sequential way:
<img src="docs/sequential.svg" width="300px"/>
Incremental way:
<img src="docs/incremental.svg" width="300px"/>
__Macaon__ is designed to be a simple-to-use tool; in the following chapters we explain how to install and use it.
\section install Installation
First of all, __Macaon__ relies on the following libraries:
* [Boost program_options](https://www.boost.org/doc/libs/1_55_0/doc/html/bbv2/installation.html)
* [Dynet](https://dynet.readthedocs.io/en/latest/install.html) (you can install it with MKL as a backend to enable multi-threading)
* [Fasttext](https://github.com/facebookresearch/fastText)
Make sure to download and install them all.
Then download the source code of [macaon](https://gitlab.lis-lab.fr/franck.dary/macaon) and install it:
```
cd macaon
mkdir build
cd build
cmake .. && make -j && sudo make install
```
_Macaon_ should compile and install itself.
Then you need to download the [data repository](https://gitlab.lis-lab.fr/franck.dary/macaon_data) containing the corpora and the trained tools.
You will need to compile the tools:
```
cd macaon_data
cd tools
make
```
Then you have to set the environment variable 'MACAON_DIR' to the path where macaon_data is installed:
```
echo "export MACAON_DIR=/path/to/macaon_data" >> ~/.bashrc
bash
```
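As a quick sanity check (a sketch only, assuming macaon_data was cloned with its tools directory as above), you can verify that MACAON_DIR resolves before moving on:

```shell
# Verify that MACAON_DIR points at a directory containing the tools folder
if [ -d "$MACAON_DIR/tools" ]; then
    echo "MACAON_DIR ok"
else
    echo "MACAON_DIR is not set correctly" >&2
fi
```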
Now everything is installed and you can proceed to the training section.
\section training Training
Go to the desired language directory:
```
cd macaon_data/fr
```
If this is your first time training a model for this language, you will have to create the datasets:
```
cd data
cd morpho-lexicon
make
cd ..
cd treebank
make
cd ../..
```
Now you can use the train.sh script to train a model to fit the dataset present in data/treebank.
The script takes two arguments:
* The name of the folder containing the description of the machine.
* The name of the model that will be trained.
For instance, you can train a model called tagger1 by calling:
```
./train.sh tagger tagger1
```
```
Training of 'Tagger Machine' :
[dynet] random seed: 100
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
Tagger topology : (242->300->25)
Iteration 1/5 :
Tagger accuracy : train(89.58%) dev(96.81%) SAVED
Iteration 2/5 :
Tagger accuracy : train(97.68%) dev(97.30%) SAVED
Iteration 3/5 :
Tagger accuracy : train(98.32%) dev(97.46%) SAVED
Iteration 4/5 :
Tagger accuracy : train(98.71%) dev(97.46%)
Iteration 5/5 :
Tagger accuracy : train(98.92%) dev(97.41%)
```
Important information that will appear during training:
* The topology of each Multi-Layer Perceptron used. In this example we have one MLP with an input layer of 242 neurons, a single hidden layer of 300 neurons and an output layer of 25 neurons.
* For each epoch/iteration, the accuracy of every classifier on the training set and on the development set is displayed.
* For each iteration and each classifier, 'SAVED' appears if this version of the classifier has been saved to disk. This happens when the current classifier beats its former accuracy record.
For more information about the allowed options of the train.sh script, you can consult the help:
```
./train.sh -h
```
After the training is complete, the trained model will be stored in the bin directory.
You can train as many models of a machine as you want; they will not interfere with each other.
\section eval Evaluation
If you want to evaluate trained models against their testing dataset, just navigate to the eval folder and launch the script:
```
cd macaon_data/fr/eval
./eval.sh tagger
```
```
Evaluation of tagger1 ... Done !
tagger 97.60 0.00 0.00 3.79 0.00 0.00 0.00 31110
```
The script can evaluate multiple models at once:
```
./eval.sh tagger tagger1 parser2
```
The result of each evaluation is stored in the file named language.res:
```
cat fr.res
tool pos morpho lemma uas las srec sacc nbWords
tagger 97.60 0.00 0.00 3.79 0.00 0.00 0.00 31110
tagger1 96.70 0.00 0.00 3.79 0.00 0.00 0.00 31110
```
Where:
* tool is the name of the tool being evaluated.
* pos is the accuracy in part of speech tagging.
* morpho is the accuracy in morphosyntactic tagging.
* lemma is the accuracy in lemmatization.
* uas is the accuracy in governor prediction.
* las is the accuracy in syntactic function and governor prediction.
* srec is the recall in end of sentence prediction.
* sacc is the accuracy in end of sentence prediction.
* nbWords is the number of tokens in the test dataset.
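Since the columns of language.res are whitespace-separated, they can be pulled apart with standard tools. A small sketch (assuming the fr.res layout shown above) that prints each model's POS accuracy:

```shell
# Skip the header line, then print tool name and POS accuracy (columns 1 and 2)
awk 'NR > 1 { printf "%s: %s%%\n", $1, $2 }' fr.res
```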
\section a Using your trained models
To use a previously trained model, simply launch the corresponding script in the bin folder of the desired language.
For instance, to process an input 'file.mcf' with our trained model 'tagger1':
```
cd macaon_data/fr/bin
./maca_tm_tagger1 file.mcf file.mcd
```
Where file.mcd is a file describing the columns of file.mcf.
For more convenience, you can add your favorite language's bin directory to your PATH environment variable:
```
echo 'export PATH=$PATH:$MACAON_DIR/fr/bin' >> ~/.bashrc
bash
```
This way you can call any trained model from anywhere without typing the absolute path.
The commit also updates the Doxyfile so that these pages are rendered by Doxygen:

```
-TOC_INCLUDE_HEADINGS   = 0
+TOC_INCLUDE_HEADINGS   = 1

-USE_MDFILE_AS_MAINPAGE =
+USE_MDFILE_AS_MAINPAGE = docs/basics.md

-HTML_STYLESHEET        = docs/doxygen.css
+HTML_STYLESHEET        =

-HTML_EXTRA_STYLESHEET  =
+HTML_EXTRA_STYLESHEET  = docs/macaon.css
```
[TOC]
This page explains how a machine is encoded by files and how you can alter its hyperparameters.
\section b Machine Template
Inside every language directory you will see folders named after tools, like 'tagger', 'morpho', 'lemmatizer', 'parser', 'tagparser'.
Each of these folders contains the full description of an untrained TransitionMachine:
```
ls macaon_data/fr/tagger/
machine.tm tagger.as tagger.dicts test.bd
signature.cla tagger.cla tagger.fm train.bd
```
When a folder like this one is used for training (./train.sh), it is copied into the bin folder and trained.
In the following chapters we will explain what these files stand for and how to modify them.
\section c .bd files
They stand for Buffer Description.
Each line of the .bd file defines one tape of the multi-tape buffer.
Example:
```
#Name ref/hyp dict Policy Must print?#
############################################
FORM ref form FromZero 1
POS hyp pos FromZero 1
SGN hyp sgn FromZero 1
```
The columns:
* __Name__ is the name of this tape.
* __ref/hyp__ stands for reference/hypothesis. A tape is a reference if its content is given as input.
* __dict__ is the name of the dictionary that will store the elements of this tape.
* __Policy__ is how the dictionary is constructed:
  * __FromZero__ if the dictionary starts empty.
  * __Modifiable__ if the dictionary keeps its current values, and can modify them.
  * __Final__ if the dictionary keeps its current values, and cannot modify them.
* __Must print__ is whether or not this tape must be printed as an output of the program.
There are two .bd files per machine: one describing the buffer during training and the other describing the buffer during testing.
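As an illustration (a sketch only, using the sample .bd content above), the second column makes it easy to see which tapes are given as input and which must be predicted:

```shell
# Count reference vs. hypothesis tapes in a .bd file; '#' lines are headers
grep -v '^#' train.bd | awk '{ count[$2]++ } END { for (k in count) print k, count[k] }' | sort
```

On the example above this prints `hyp 2` and `ref 1`.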
\section d .dicts file
Each line of this file describes a dictionary that will be used by the machine.
Example:
```
#Name Dimension Mode #
############################
bool 02 Embeddings
int 05 Embeddings
letters 10 Embeddings
pos 15 Embeddings
form 30 Embeddings
sgn 10 Embeddings
```
The columns:
* __Name__ is the name of this dictionary.
* __Dimension__ is the number of float values used to encode each entry of this dictionary.
* __Mode__ is whether the entries are encoded in a __OneHot__ fashion or as __Embeddings__.
\section e .as files
They stand for ActionSet.
Each line of a .as file is the name of an action that the corresponding classifier is capable of.
Example:
```
WRITE 0 POS adj
WRITE 0 POS adv
WRITE 0 POS advneg
```
There are multiple kinds of actions:
* __WRITE x y z__ : write __z__ on the tape __y__, at column __x__ relative to the Config head.
* __RULE x ON y z__ : apply the rule __z__ to the tape __y__ and store the result on the tape __x__, both at the column pointed to by the Config head.
* __SHIFT__ : place the Config head in the stack.
* __REDUCE__ : pop the Config stack.
* __EOS__ : tag the Config head as an end of sentence, and pop the stack.
* __ROOT__ : tag the Config head as root of the sentence, and pop the stack.
* __LEFT x__ : add a dependency of label __x__ from the Config head to the top of the stack, and pop the stack.
* __RIGHT x__ : add a dependency of label __x__ from the top of the stack to the Config head.
An ActionSet can have a default action. In this case the file must begin with:
```
Default : x
```
Where __x__ is the name of the default action, the one that will be applied when no other action can be applied.
\section f .fm files
They stand for FeatureModel.
Each line of a .fm file is the name of one feature that will be used to transform a Configuration into a feature description, for a specific Classifier.
Example:
```
b.-3.POS
b.0.FORM.U
b.0.FORM.LEN
b.0.FORM.PART.-1
b.0.FORM.PART.-3.-1
b.0.FORM.PART.+1
s.0.ldep.LABEL
tc.0
```
Where:
* __b.-3.POS__ : the content of the tape POS, at the column at index -3 relative to the head.
* __b.0.FORM.U__ : whether or not the content of the tape FORM, at the column under the head, starts with an uppercase letter.
* __b.0.FORM.LEN__ : the number of letters of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.-1__ : the last letter of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.-3.-1__ : the last three letters of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.+1__ : the second letter of the content of the tape FORM, at the column under the head.
* __s.0.ldep.LABEL__ : the content of the tape LABEL, at the column of the closest left dependent of the column whose index is the value of the top of the stack.
* __tc.0__ : the previous action that was predicted by this Classifier.
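To make the PART features concrete, here is a small bash sketch (the word "mangent" is just an example, not tied to any dataset) of the substrings those features would extract from a FORM cell:

```shell
word="mangent"      # hypothetical content of the FORM tape under the head
echo "${word: -1}"  # last letter, as in b.0.FORM.PART.-1        -> t
echo "${word: -3}"  # last three letters, as in b.0.FORM.PART.-3.-1 -> ent
echo "${word:1:1}"  # second letter, as in b.0.FORM.PART.+1      -> a
```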
\section g .cla files
Each .cla file describes a Classifier of the TransitionMachine.
There are three types of Classifier.
The ones of type __Prediction__, for example:
```
Name : Tagger
Type : Prediction
Oracle : tagger
Feature Model : tagger.fm
Action Set : tagger.as
Topology : (300,RELU,0.3)
```
Where:
* __Name__ is the name of this Classifier.
* __Type__ is the type of this Classifier.
* __Oracle__ is the name of the oracle that will be used to train this Classifier.
* __Feature Model__ is the name of the FeatureModel used by this Classifier.
* __Action Set__ is the name of the ActionSet of this Classifier.
* __Topology__ is the topology of the underlying Multi-Layer Perceptron. It is a list of hidden layers; each hidden layer is written as the number of neurons, the activation function and the dropout rate.
This is the type of Classifier that relies on a neural network to make predictions; it requires training.
The ones of type __Information__, for example:
```
Name : Tagger
Type : Information
Oracle : signature
Oracle Filename : ../../data/morpho-lexicon/fP
```
Where:
* __Name__ is the name of this Classifier.
* __Type__ is the type of this Classifier.
* __Oracle__ is the name of the oracle that will be used by this Classifier.
* __Oracle Filename__ is the name of the file that the oracle can use to make its predictions.
This is the type of Classifier used to add information to the experiment, like the lemmas of words for instance.
And finally there are the ones of type __Forced__, which are only able to predict one Action.
\section h .tm files
A .tm file describes a TransitionMachine.
There must only be one .tm file per folder, and it must be called machine.tm.
Example:
```
Name : Tagger Machine
Dicts : tagger.dicts
%CLASSIFIERS
tagger tagger.cla
signature signature.cla
%STATES
signature1 signature
tagger1 tagger
%TRANSITIONS
signature1 tagger1 MULTIWRITE 0
tagger1 signature1 WRITE +1
```
Where:
* __Name__ is the name of this machine.
* __Dicts__ is the file describing the dictionaries used by the machine.
* __%CLASSIFIERS__ starts the classifiers section:
  * Each line contains the name of a Classifier and its corresponding file.
* __%STATES__ starts the states section:
  * Each line contains the name of a state and its corresponding classifier.
* __%TRANSITIONS__ starts the transitions section:
  * Each line describes a transition with: the starting state, the ending state, the corresponding type of action and the relative movement of the head.
The initial state of the machine is the state that is defined first.
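Since the initial state matters, here is a quick sketch (assuming the machine.tm layout above) to list the declared states in order, the first being the initial one:

```shell
# Print the lines between %STATES and %TRANSITIONS, excluding both markers
sed -n '/^%STATES/,/^%TRANSITIONS/p' machine.tm | sed '1d;$d'
```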