From a4b28a3e96ff64a21b5956c158244e0f2396351c Mon Sep 17 00:00:00 2001
From: Franck Dary <franck.dary@lis-lab.fr>
Date: Tue, 2 Jun 2020 13:12:01 +0200
Subject: [PATCH] Updated documentation

---
 documentation/classifier.md     | 58 ++++++++++++++++++++++++++----
 documentation/gettingStarted.md |  2 +-
 documentation/install.md        |  3 ++
 documentation/readingMachine.md | 51 ++++++++++++++++----------
 documentation/strategy.md       | 64 ++++++++++++++++++++++++++++++++-
 documentation/transitionSet.md  | 30 ++++++++++++++++
 6 files changed, 181 insertions(+), 27 deletions(-)

diff --git a/documentation/classifier.md b/documentation/classifier.md
index c6b5744..4fed932 100644
--- a/documentation/classifier.md
+++ b/documentation/classifier.md
@@ -59,16 +59,16 @@ There are two mandatory modules :
 	```
 * `InputDropout : scalar`\
 	Dropout (between 0.0 and 1.0) to apply to the input of the MLP.
-	Example `InputDropout : 0.5`
+	E.g. `InputDropout : 0.5`
 
 And then there is a list of optional modules you can choose from :
 * `StateName : Out{embeddingSize}`\
 	An embedding of size *embeddingSize* representing the name of the current state.
 * `Context : Buffer{$1} Stack{$2} Columns{$3} $4{$5 $6 $7 $8} In{$9} Out{$10}`\
 	An embedding capturing a relative context around the machine's current word index.
-	* $1 : List of relative buffer indexes to capture. Ex `{-3 -2 -1 0 1 2}`.
-	* $2 : List of stack indexes to capture. Ex `{2 1 0}`.
-	* $3 : List of column names to capture. Ex `{FORM UPOS}`.
+	* $1 : List of relative buffer indexes to capture. E.g. `{-3 -2 -1 0 1 2}`.
+	* $2 : List of stack indexes to capture. E.g. `{2 1 0}`.
+	* $3 : List of column names to capture. E.g. `{FORM UPOS}`.
 	* $4 : Type of recurrent module to use to generate the context embedding, LSTM or GRU.
 	* $5 : Use bidirectional RNN ? 1 or 0.
 	* $6 : Number of RNN layers to use (minimum 1).
@@ -79,10 +79,10 @@ And then there is a list of optional modules you can choose from :
 * `Focused Column{$1} NbElem{$2} Buffer{$3} Stack{$4} $5{$6 $7 $8 $9} In{$10} Out{$11}`\
 	An embedding capturing a specific string, viewed as a sequence of elements.\
 	If Column = FORM elements are the letters, if Column = FEATS elements are the traits.
-	* $1 : Column name to capture. Ex `{FORM}`.
+	* $1 : Column name to capture. E.g. `{FORM}`.
 	* $2 : Maximum number of elements (example max number of letters in a word).
-	* $3 : List of relative buffer indexes to capture. Ex `{-3 -2 -1 0 1 2}`.
-	* $4 : List of stack indexes to capture. Ex `{2 1 0}`.
+	* $3 : List of relative buffer indexes to capture. E.g. `{-3 -2 -1 0 1 2}`.
+	* $4 : List of stack indexes to capture. E.g. `{2 1 0}`.
 	* $5 : Type of recurrent module to use to generate the context embedding, LSTM or GRU.
 	* $6 : Use bidirectional RNN ? 1 or 0.
 	* $7 : Number of RNN layers to use (minimum 1).
@@ -90,3 +90,47 @@ And then there is a list of optional modules you can choose from :
 	* $9 : 1 to concatenate all of the RNN hidden states, 0 to only use the last RNN hidden state.
 	* $10 : Size of the embeddings used to feed the RNN.
 	* $11 : Size of the hidden states of the RNN.
+* `DepthLayerTree : Columns{$1} Buffer{$2} Stack{$3} LayerSizes{$4} $5{$6 $7 $8 $9} In{$10} Out{$11}`\
+	For each captured index, compute an embedding of the syntactic tree rooted at this index (see the example declarations after this list).
+	* $1 : List of column names to capture. E.g. `{DEPREL UPOS}`.
+	* $2 : List of relative buffer indexes to capture. E.g. `{-3 -2 -1}`.
+	* $3 : List of stack indexes to capture. E.g. `{2 1}`.
+	* $4 : List of sizes : the length of the list is the maximum depth of the tree, and each size is the maximum number of children at the corresponding depth. E.g. `{3 6}`.
+	* $5 : Type of recurrent module to use to generate the context embedding, LSTM or GRU.
+	* $6 : Use bidirectional RNN ? 1 or 0.
+	* $7 : Number of RNN layers to use (minimum 1).
+	* $8 : Dropout to use after RNN hidden layers. Must be 0 if number of layers is 1.
+	* $9 : 1 to concatenate all of the RNN hidden states, 0 to only use the last RNN hidden state.
+	* $10 : Size of the embeddings used to feed the RNN.
+	* $11 : Size of the hidden states of the RNN.
+* `History : NbElem{$1} $2{$3 $4 $5 $6} In{$7} Out{$8}`\
+	An embedding representing the history of the previous transitions.
+	* $1 : Take into account only the last $1 transitions.
+	* $2 : Type of recurrent module to use to generate the context embedding, LSTM or GRU.
+	* $3 : Use bidirectional RNN ? 1 or 0.
+	* $4 : Number of RNN layers to use (minimum 1).
+	* $5 : Dropout to use after RNN hidden layers. Must be 0 if number of layers is 1.
+	* $6 : 1 to concatenate all of the RNN hidden states, 0 to only use the last RNN hidden state.
+	* $7 : Size of the embeddings used to feed the RNN.
+	* $8 : Size of the hidden states of the RNN.
+
+* `RawInput : Left{$1} Right{$2} $3{$4 $5 $6 $7} In{$8} Out{$9}`\
+	An embedding representing a window of the raw text input, centered around the current character index.
+	* $1 : Size of the left window (how many characters to the left of the center the machine sees). E.g. 5.
+	* $2 : Size of the right window (how many characters to the right of the center the machine sees). E.g. 5.
+	* $3 : Type of recurrent module to use to generate the context embedding, LSTM or GRU.
+	* $4 : Use bidirectional RNN ? 1 or 0.
+	* $5 : Number of RNN layers to use (minimum 1).
+	* $6 : Dropout to use after RNN hidden layers. Must be 0 if number of layers is 1.
+	* $7 : 1 to concatenate all of the RNN hidden states, 0 to only use the last RNN hidden state.
+	* $8 : Size of the embeddings used to feed the RNN.
+	* $9 : Size of the hidden states of the RNN.
+* `SplitTrans : $1{$2 $3 $4 $5} In{$6} Out{$7}`\
+	An embedding representing the currently applicable split transitions (see [Transition Set](transitionSet.md)).
+	* $1 : Type of recurrent module to use to generate the context embedding, LSTM or GRU.
+	* $2 : Use bidirectional RNN ? 1 or 0.
+	* $3 : Number of RNN layers to use (minimum 1).
+	* $4 : Dropout to use after RNN hidden layers. Must be 0 if number of layers is 1.
+	* $5 : 1 to concatenate all of the RNN hidden states, 0 to only use the last RNN hidden state.
+	* $6 : Size of the embeddings used to feed the RNN.
+	* $7 : Size of the hidden states of the RNN.
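+
+To illustrate, here is a sketch of how these optional modules can be declared inside a classifier definition. The lines below are adapted from the example machine in [Reading Machine](readingMachine.md); the captured columns, indexes and sizes are illustrative values, not recommendations :
+
+```
+DepthLayerTree : Columns{DEPREL} Buffer{} Stack{2 1 0} LayerSizes{3} LSTM{1 1 0.0 1} In{64} Out{64}
+History : NbElem{10} LSTM{1 1 0 1} In{64} Out{64}
+RawInput : Left{5} Right{5} LSTM{1 1 0.0 1} In{32} Out{32}
+SplitTrans : LSTM{1 1 0.0 1} In{64} Out{64}
+```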
diff --git a/documentation/gettingStarted.md b/documentation/gettingStarted.md
index 95f0310..97c3325 100644
--- a/documentation/gettingStarted.md
+++ b/documentation/gettingStarted.md
@@ -28,7 +28,7 @@ Simply edit the file `macaon_data/UD_any/config` so that `UD_ROOT=` points to th
 * `evaluate.sh` : evaluate a model that has been trained by `train.sh`.
 * `batches.py` : a file that you can use to define multiple experiments. To be used as an argument to `launchBatches.py`.
 * `launchBatches.py` : script that allows you to run multiple experiments at the same time. Can be used to launch *oar* or *slurm* jobs.
-* `Every other directory` : contains a [Reading Machine](readingMachine.md) file that you can train using `train.sh`.
+* `templates/*` : each of these directories contains a [Reading Machine](readingMachine.md) file that you can train using `train.sh`.
 
 ## Next steps :
 
diff --git a/documentation/install.md b/documentation/install.md
index 38ca0bc..709149b 100644
--- a/documentation/install.md
+++ b/documentation/install.md
@@ -7,6 +7,9 @@
 * LibTorch version 1.5 cxx11 ABI : [link](https://pytorch.org/get-started/locally/)
 * Boost >= 1.53.0 with program_options : [link](https://www.boost.org/doc/libs/1_73_0/more/getting_started/unix-variants.html)
 
+## Optional : 
+* [Word2Vec](https://github.com/tmikolov/word2vec), used by default by `macaon_data/UD_any/train.sh` to initialize the embeddings.
+
 ## Download :
 https://gitlab.lis-lab.fr/franck.dary/macaon
 
diff --git a/documentation/readingMachine.md b/documentation/readingMachine.md
index a2108d3..5286e7e 100644
--- a/documentation/readingMachine.md
+++ b/documentation/readingMachine.md
@@ -19,43 +19,58 @@ It is said to be final when all the input text has been processed.
 ## File format :
 
 A reading machine is defined in a `.rm` file (or given as argument to `macaon train`).\
-Here is an example of a Reading Machine doing POS tagging, dependency parsing and sentence segmentation in an incremental fashion (POS tag one word, then attach it to the dependency tree, then cut the sentence or don't, then change the focus to the next word and repeat) :
+Here is an example of a Reading Machine doing tokenization, POS tagging, morphological tagging, dependency parsing and sentence segmentation in a sequential fashion :
 
 ```
-Name : Tagger, Parser and Segmenter incremental Machine
-Classifier : taggerparser
+Name : Tokenizer, Tagger, Morpho and Parser Machine
+Classifier : tokeparser
 {
-  Transitions : {tagger,data/tagger.ts parser,data/parser.ts segmenter,data/segmenter.ts}
-  LossMultiplier : {segmenter,10.0}
+  Transitions : {tokenizer,data/tokenizer.ts tagger,data/tagger.ts morpho,data/morpho_parts.ts parser,data/parser_eager_rel_strict.ts segmenter,data/segmenter.ts}
+  LossMultiplier : {segmenter,3.0}
   Network type : Modular
-  StateName : Out{64}
-  Context : Buffer{-3 -2 -1 0 1 2} Stack{} Columns{FORM} LSTM{1 1 0 1} In{64} Out{64}
-  Context : Buffer{-3 -2 -1 0} Stack{1 0} Columns{UPOS} LSTM{1 1 0 1} In{64} Out{64}
-  Focused : Column{ID} NbElem{1} Buffer{-1 0 1 2} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
+  StateName : Out{1024}
+  Context : Buffer{-3 -2 -1 1 2} Stack{} Columns{FORM} LSTM{1 1 0 1} In{64} Out{64}
+  Context : Buffer{-3 -2 -1 0 1 2} Stack{1 0} Columns{UPOS} LSTM{1 1 0 1} In{64} Out{64}
+  Focused : Column{ID} NbElem{1} Buffer{-1 0 1} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
   Focused : Column{FORM} NbElem{13} Buffer{-1 0 1 2} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
-  Focused : Column{EOS} NbElem{1} Buffer{-1} Stack{} LSTM{1 1 0 1} In{64} Out{64}
+  Focused : Column{FEATS} NbElem{13} Buffer{-1 0 1 2} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
+  Focused : Column{EOS} NbElem{1} Buffer{-1 0} Stack{} LSTM{1 1 0 1} In{64} Out{64}
   Focused : Column{DEPREL} NbElem{1} Buffer{} Stack{2 1 0} LSTM{1 1 0 1} In{64} Out{64}
   DepthLayerTree : Columns{DEPREL} Buffer{} Stack{2 1 0} LayerSizes{3} LSTM{1 1 0.0 1} In{64} Out{64}
+  History : NbElem{10} LSTM{1 1 0 1} In{64} Out{64}
+  RawInput : Left{5} Right{5} LSTM{1 1 0.0 1} In{32} Out{32}
+  SplitTrans : LSTM{1 1 0.0 1} In{64} Out{64}
   InputDropout : 0.5
   MLP : {2048 0.3 2048 0.3}
   End
-  Optimizer : Adam {0.0002 0.9 0.999 0.00000001 0.00001 true}
+  Optimizer : Adam {0.0003 0.9 0.999 0.00000001 0.00002 true}
 }
-Predictions : UPOS HEAD DEPREL EOS
+Splitwords : data/splitwords.ts
+Predictions : ID FORM UPOS FEATS HEAD DEPREL EOS
 Strategy
 {
-  Block : End{cannotMove}
-  tagger parser * 0
-  parser segmenter SHIFT 0
-  parser segmenter RIGHT 0
-  parser parser * 0
-  segmenter tagger * 1
+  Block : End{cannotMove}
+  tokenizer tokenizer ENDWORD 1
+  tokenizer tokenizer SPLIT 1
+  tokenizer tokenizer * 0
+  Block : End{cannotMove}
+  tagger tagger * 1
+  Block : End{cannotMove}
+  morpho morpho NOTHING 1
+  morpho morpho * 0
+  Block : End{cannotMove}
+  parser segmenter eager_SHIFT 0
+  parser segmenter eager_RIGHT_rel 0
+  parser parser * 0
+  segmenter parser * 1
 }
 ```
 
 This format is composed of several parts :
 * Name : The name of your machine.
 * Classifier : The name of your classifier, followed by its definition between braces. See [Classifier](classifier.md).
+* Splitwords : [Transition Set](transitionSet.md) file that contains the transitions used for multi-word tokenization.\
+It is only mandatory if the machine performs tokenization. This file is automatically generated by `train.sh`.
 * Predictions : Names of the columns that are predicted by your machine.
 * Strategy, followed by its definition between braces. See [Strategy](strategy.md).
 
diff --git a/documentation/strategy.md b/documentation/strategy.md
index 1333ed7..414d837 100644
--- a/documentation/strategy.md
+++ b/documentation/strategy.md
@@ -1 +1,63 @@
-TODO
+# Strategy
+
+A strategy defines the workflow of its [Reading Machine](readingMachine.md).\
+More precisely, it lets you decide the *order* in which predictions are made (e.g. sentence segmentation before or after POS tagging ?).\
+It is flexible enough to allow for *sequential* vs *incremental* modes :
+* Sequential : All the text is processed at level *n* (e.g. POS tagging) before being processed at level *n+1* (e.g. dependency parsing). This is the most common mode, often called the *pipeline* model.
+* Incremental : Word *w* is processed at all levels of analysis (e.g. tokenization, POS tagging, attachment to the syntactic tree...) before moving on to word *w+1*.
+
+Example of a strategy (sequential mode : tokenization, POS tagging, morphological tagging, dependency parsing and sentence segmentation) :
+
+```
+Strategy
+{
+  Block : End{cannotMove}
+  tokenizer tokenizer ENDWORD 1
+  tokenizer tokenizer SPLIT 1
+  tokenizer tokenizer * 0 
+  Block : End{cannotMove} 
+  tagger tagger * 1
+  Block : End{cannotMove}
+  morpho morpho NOTHING 1
+  morpho morpho * 0
+  Block : End{cannotMove}
+  parser segmenter eager_SHIFT 0
+  parser segmenter eager_RIGHT_rel 0
+  parser parser * 0
+  segmenter parser * 1 
+}
+```
+
+Here we have a strategy composed of 4 *blocks*.\
+Every block is parametrized by its end condition `cannotMove`, which means that the machine stays in this block until the condition is met (the word index has reached the end of the tapes and cannot move further).\
+When the end condition of the current block is met, the word index is reset to 0, and the next block is entered.\
+When a new block is entered, the state is set to the origin state of the first defined transition.\
+When there is no next block, the analysis is done.
+
+Inside a block, the transitions between states are defined.\
+Here, the last block contains 4 transitions.\
+A transition is defined by `originState destinationState transitionName movement`.\
+In the example, the transition `parser segmenter eager_SHIFT 0`\
+means that if the current state is `parser` and the classifier predicts a transition of type `eager_SHIFT` (the shift action of arc-eager transition-based parsing), then the new state is `segmenter` and the word index does not move (relative movement of 0).
+
+Here is an example of another strategy (incremental mode : tokenization, POS tagging, morphological tagging, dependency parsing and sentence segmentation) :
+
+```
+Strategy
+{
+  Block : End{cannotMove}
+  tokenizer tagger ENDWORD 0
+  tokenizer tagger SPLIT 0
+  tokenizer tokenizer * 0
+  tagger morpho * 0
+  morpho parser NOTHING 0
+  morpho morpho * 0
+  parser segmenter eager_SHIFT 0
+  parser segmenter eager_RIGHT_rel 0
+  parser parser * 0
+  segmenter tokenizer * 1
+}
+```
+
+There is only one block because the word index is never reset to 0; the workflow simply moves to the right, one word at a time.
+
diff --git a/documentation/transitionSet.md b/documentation/transitionSet.md
index 1333ed7..6893847 100644
--- a/documentation/transitionSet.md
+++ b/documentation/transitionSet.md
@@ -1 +1,31 @@
+# Transition Set
+
+Transition Set files (.ts) are the link between the [Classifier](classifier.md) and the [Strategy](strategy.md).\
+Each state is linked to a Transition Set file (see [Classifier](classifier.md)), so that when the classifier makes a prediction,\
+the index of the most activated output neuron corresponds to an index in the Transition Set file, and thus to a Transition.\
+When the classifier has predicted the next Transition, the [Strategy](strategy.md) is used to determine the next state and the relative movement of the word index.
+
+Each line of a `.ts` file is of the form `[<stateName>] transitionName [arguments]`, where elements in brackets are optional.\
+`stateName` restricts the transition to the given state.
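+
+For instance, assuming one `WRITE` transition per part-of-speech tag (the transitions themselves are described below), a minimal sketch of a tagger `.ts` file could look like this (the tag set is purely illustrative) :
+
+```
+WRITE b.0 UPOS NOUN
+WRITE b.0 UPOS VERB
+WRITE b.0 UPOS ADJ
+```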
+
+## Transitions
+
+Here is the list of all available transitions along with their possible arguments :
+* `WRITE $1.$2 $3 $4`\
+	Write a string to a cell. E.g. `WRITE b.0 UPOS VERB`
+	* $1 : `b` or `s`, writing to the buffer or the stack.
+	* $2 : Relative index to write into.
+	* $3 : The name of the tape to write into.
+	* $4 : The string to write.
+* `ADD $1.$2 $3 $4`\
+	Append a string to the content of a cell. E.g. `ADD b.0 FEATS Gender=Fem`
+	* $1 : `b` or `s`, writing to the buffer or the stack.
+	* $2 : Relative index to write into.
+	* $3 : The name of the tape to write into.
+	* $4 : The string to append.
+* `eager_SHIFT`\
+	Shift transition in arc-eager transition-based parsing. Push the current word index onto the stack.
+* `standard_SHIFT`\
+	Shift transition in arc-standard transition-based parsing. Push the current word index onto the stack.
+
 TODO
-- 
GitLab