This page explains how a machine is encoded by files and how you can alter its hyperparameters.
\section b Machine Template
Inside every language directory you will see folders named after tools like 'tagger', 'morpho', 'lemmatizer', 'parser', 'tagparser'.
Each of these folders contains the full description of an untrained TransitionMachine :
ls macaon_data/fr/tagger/
machine.tm tagger.as tagger.dicts test.bd
signature.cla tagger.cla tagger.fm train.bd
When a folder like this one is used for training (./train.sh), it is copied into the bin folder and trained.
In the following sections we explain what these files stand for and how to modify them.
\section c .bd files
They stand for Buffer Description.
Each line of the .bd file defines one tape of the multi-tape buffer.
Example :
#Name ref/hyp dict Policy Must print?#
############################################
FORM ref form FromZero 1
POS hyp pos FromZero 1
SGN hyp sgn FromZero 1
The columns :
- Name is the name of this tape.
- ref/hyp stands for reference/hypothesis. A tape is a reference tape if its content is given as input.
- dict is the name of the dictionary that will store the elements of this tape.
- Policy is how the dictionary needs to be constructed :
  - FromZero if the dictionary starts empty.
  - Modifiable if the dictionary keeps its current values, and they can be modified.
  - Final if the dictionary keeps its current values, and they cannot be modified.
- Must print is whether or not this tape must be printed as an output of the program.
There are 2 .bd files per machine, one describing the buffer during training and the other describing the buffer during testing.
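The column layout above can be read with a few lines of Python. This is only a sketch and not the parser used by the tool: the names Tape and parse_bd are hypothetical, and it assumes that lines starting with # are comments, as the header of the example suggests.

```python
from typing import List, NamedTuple


class Tape(NamedTuple):
    name: str
    is_reference: bool  # True for "ref", False for "hyp"
    dict_name: str
    policy: str
    must_print: bool


POLICIES = {"FromZero", "Modifiable", "Final"}


def parse_bd(text: str) -> List[Tape]:
    """Turn the content of a .bd file into one Tape per description line."""
    tapes = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: empty lines and lines starting with '#' carry no tape.
        if not line or line.startswith("#"):
            continue
        name, kind, dict_name, policy, must_print = line.split()
        if kind not in ("ref", "hyp"):
            raise ValueError(f"unknown tape kind: {kind}")
        if policy not in POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        tapes.append(Tape(name, kind == "ref", dict_name, policy, must_print == "1"))
    return tapes


EXAMPLE_BD = """\
#Name ref/hyp dict Policy Must print?#
############################################
FORM ref form FromZero 1
POS hyp pos FromZero 1
SGN hyp sgn FromZero 1
"""

tapes = parse_bd(EXAMPLE_BD)
for tape in tapes:
    print(tape)
```

Rejecting unknown values in the ref/hyp and Policy columns makes a typo in the file visible immediately instead of at training time.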
\section d .dicts file
Each line of this file describes a dictionary that will be used by the machine.
Example :
#Name Dimension Mode #
############################
bool 02 Embeddings
int 05 Embeddings
letters 10 Embeddings
pos 15 Embeddings
form 30 Embeddings
sgn 10 Embeddings
The columns :
- Name is the name of this dictionary.
- Dimension is the number of float values used to encode each entry of this dictionary.
- Mode is whether the entries are encoded in a OneHot fashion or as Embeddings.
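As an illustration of the three columns, the sketch below reads a .dicts file into a Python dictionary. The function name parse_dicts is hypothetical, and treating lines that start with # as comments is an assumption based on the example header.

```python
from typing import Dict, Tuple

MODES = {"Embeddings", "OneHot"}


def parse_dicts(text: str) -> Dict[str, Tuple[int, str]]:
    """Map each dictionary name to its (dimension, mode) pair."""
    dicts = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: empty lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        name, dimension, mode = line.split()
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode}")
        # int() accepts the zero-padded dimensions of the example ("02" -> 2).
        dicts[name] = (int(dimension), mode)
    return dicts


EXAMPLE_DICTS = """\
#Name Dimension Mode #
############################
bool 02 Embeddings
int 05 Embeddings
letters 10 Embeddings
pos 15 Embeddings
form 30 Embeddings
sgn 10 Embeddings
"""

dicts = parse_dicts(EXAMPLE_DICTS)
print(dicts["form"])
```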
\section e .as files
They stand for ActionSet.
Each line of a .as file is the name of an action that the corresponding classifier is capable of.
Example :
WRITE 0 POS adj
WRITE 0 POS adv
WRITE 0 POS advneg
There are multiple kinds of actions :
- WRITE x y z : write z on the tape y on column x relative to the Config head.
- RULE x ON y z : apply the rule z on the tape y and store the result on the tape x. All of this relative to the columns pointed by the Config head.
- SHIFT : place the Config head in the stack.
- REDUCE : pop the Config stack.
- EOS : Tag the Config head as an end of sentence, and pop the stack.
- ROOT : Tag the Config head as root of sentence, and pop the stack.
- LEFT x : Add a dependency from the Config head to the top of the stack, of label x, and pop the stack.
- RIGHT x : Add a dependency from the top of the stack to the Config head, of label x.
An ActionSet can have a default action. In this case the file must begin with :
Default : x
Where x is the name of the default action, the one that will be applied when no other action can be applied.
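The sketch below reads a .as file and separates the optional default action from the list of actions. The function names are hypothetical. It assumes that the Default line, when present, is the first non-empty line, and it does not add the default action to the list of actions on its own. The second input, with SHIFT as default, is made up for the illustration.

```python
from typing import List, Optional, Tuple


def parse_action_set(text: str) -> Tuple[Optional[str], List[str]]:
    """Return (default action or None, list of action names)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    default = None
    # The file "must begin with" the Default line, so only the first line is checked.
    if lines and lines[0].startswith("Default"):
        default = lines[0].partition(":")[2].strip()
        lines = lines[1:]
    return default, lines


def split_action(action: str) -> Tuple[str, List[str]]:
    """Split an action name into its kind (WRITE, RULE, ...) and its arguments."""
    kind, *arguments = action.split()
    return kind, arguments


# The example of this section: no default action.
default, actions = parse_action_set("WRITE 0 POS adj\nWRITE 0 POS adv\nWRITE 0 POS advneg\n")
print(default, actions)

# Hypothetical file with a default action.
default2, actions2 = parse_action_set("Default : SHIFT\nSHIFT\nREDUCE\n")
print(default2, actions2)
```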
\section f .fm files
They stand for FeatureModel.
Each line of a .fm file is the name of one feature that will be used to transform a Configuration into a feature description, for a specific Classifier.
Example :
b.-3.POS
b.0.FORM.U
b.0.FORM.LEN
b.0.FORM.PART.-1
b.0.FORM.PART.-3.-1
b.0.FORM.PART.+1
s.0.ldep.LABEL
tc.0
Where :
- b.-3.POS : the content of the tape POS, at the column that is at index -3 relative to the head.
- b.0.FORM.U : whether or not the content of the tape FORM, at the column under the head, starts with an uppercase letter.
- b.0.FORM.LEN : the number of letters of the content of the tape FORM, at the column under the head.
- b.0.FORM.PART.-1 : the last letter of the content of the tape FORM, at the column under the head.
- b.0.FORM.PART.-3.-1 : the last three letters of the content of the tape FORM, at the column under the head.
- b.0.FORM.PART.+1 : the second letter of the content of the tape FORM, at the column under the head.
- s.0.ldep.LABEL : the content of the tape LABEL, relatively to the column of the closest left dependent of the column whose index is the value of the top of the stack.
- tc.0 is the previous action that was predicted by this Classifier.
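To make the U, LEN and PART suffixes concrete, the sketch below evaluates them on a single word. It is not the feature extractor of the tool. The index convention is inferred from the three PART examples above: indexes start at 0, negative indexes count from the end of the word, and PART.a.b covers the letters from a to b included. The function name form_feature and the behaviour on words that are too short are assumptions.

```python
from typing import Union


def form_feature(form: str, suffix: str) -> Union[bool, int, str]:
    """Evaluate what follows 'FORM.' in a feature name on the word 'form'."""
    if suffix == "U":
        # True if the first letter is uppercase (False for an empty word).
        return form[:1].isupper()
    if suffix == "LEN":
        return len(form)
    parts = suffix.split(".")
    if parts[0] == "PART" and len(parts) in (2, 3):
        start = int(parts[1])   # int() accepts "+1" as well as "-3"
        end = int(parts[-1])    # same as start when only one index is given
        if start < 0:
            start += len(form)
        if end < 0:
            end += len(form)
        # Assumption: indexes falling before the word are clamped to its start.
        return form[max(start, 0):max(end, -1) + 1]
    raise ValueError(f"unsupported feature suffix: {suffix}")


word = "Maison"
for suffix in ("U", "LEN", "PART.-1", "PART.-3.-1", "PART.+1"):
    print(suffix, form_feature(word, suffix))
```

On the word Maison this gives True, 6, n, son and a, which matches the descriptions of the list above.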
\section g .cla files
Each .cla file describes a Classifier of the TransitionMachine.
There are three types of Classifier.
The ones of type Prediction, example :
Name : Tagger
Type : Prediction
Oracle : tagger
Feature Model : tagger.fm
Action Set : tagger.as
Topology : (300,RELU,0.3)
Where :
- Name is the name of this Classifier.
- Type is the type of this Classifier.
- Oracle is the name of the oracle that will be used to train this Classifier.
- Feature Model is the name of the FeatureModel used by this Classifier.
- Action Set is the name of the ActionSet of this Classifier.
- Topology is the topology of the underlying Multi-Layer Perceptron. It is a list of hidden layers, each hidden layer is written as : the number of neurons, the activation function and the dropout rate.
This is the type of Classifier that relies on a neural network to make predictions, so it requires training.
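The Topology value can be decoded into a list of hidden layers as sketched below. The function name parse_topology is hypothetical. The example only shows a single layer, so the way several layers are written is an assumption here: the sketch accepts consecutive parenthesised triples such as (500,RELU,0.3)(200,RELU,0.1).

```python
import re
from typing import List, Tuple

# One hidden layer: (number of neurons, activation function, dropout rate)
LAYER = re.compile(r"\(\s*(\d+)\s*,\s*(\w+)\s*,\s*([0-9.]+)\s*\)")


def parse_topology(text: str) -> List[Tuple[int, str, float]]:
    """Decode a Topology value into (neurons, activation, dropout) triples."""
    layers = []
    for neurons, activation, dropout in LAYER.findall(text):
        layers.append((int(neurons), activation, float(dropout)))
    if not layers:
        raise ValueError(f"no hidden layer found in: {text!r}")
    return layers


layers = parse_topology("(300,RELU,0.3)")
print(layers)
```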
The ones of the type Information, example :
Name : Tagger
Type : Information
Oracle : signature
Oracle Filename : ../../data/morpho-lexicon/fP
Where :
- Name is the name of this Classifier.
- Type is the type of this Classifier.
- Oracle is the name of the oracle that will be used by this Classifier.
- Oracle Filename is the name of the file that the oracle can use to make its predictions.
This is the type of Classifier used to add information to the experiment, like the lemmas of words for instance.
And finally there are the ones of type Forced, which are only able to predict one Action.
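The Key : Value layout of the examples above can be read as sketched below. The function names are hypothetical. Each line is split on its first colon only, so that a value such as a file path is kept whole.

```python
from typing import Dict


def parse_cla(text: str) -> Dict[str, str]:
    """Read the 'Key : Value' lines of a .cla file into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"expected 'Key : Value', got: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def requires_training(fields: Dict[str, str]) -> bool:
    """Only the Prediction type relies on a neural network that must be trained."""
    return fields["Type"] == "Prediction"


EXAMPLE_CLA = """\
Name : Tagger
Type : Prediction
Oracle : tagger
Feature Model : tagger.fm
Action Set : tagger.as
Topology : (300,RELU,0.3)
"""

fields = parse_cla(EXAMPLE_CLA)
print(fields["Type"], requires_training(fields))
```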
\section h .tm files
A .tm file describes a TransitionMachine.
There must be exactly one .tm file per folder, and it must be called machine.tm.
Example :
Name : Tagger Machine
Dicts : tagger.dicts
%CLASSIFIERS
tagger tagger.cla
signature signature.cla
%STATES
signature1 signature
tagger1 tagger
%TRANSITIONS
signature1 tagger1 MULTIWRITE 0
tagger1 signature1 WRITE +1
Where :
- Name is the name of this machine.
- Dicts is the file describing the dictionaries used by the machine.
- %CLASSIFIERS is the start of the classifiers section :
  - Each line contains the name of a Classifier and its corresponding file.
- %STATES is the start of the states section :
  - Each line contains the name of a state and its corresponding classifier.
- %TRANSITIONS is the start of the transitions section :
  - Each line describes a transition with : the starting state, the ending state, the corresponding type of action and the relative movement of the head.
The initial state of the machine is the state that is defined first.
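The structure of machine.tm can be summarised by the sketch below, which reads the example of this section and returns its initial state. It is not the loader used by the tool, and the names parse_tm and initial_state are hypothetical. It relies on Python dictionaries keeping their insertion order, so the first state defined is the first key.

```python
from typing import Any, Dict


def parse_tm(text: str) -> Dict[str, Any]:
    """Read a machine.tm file into header, classifiers, states and transitions."""
    machine = {"header": {}, "classifiers": {}, "states": {}, "transitions": []}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("%"):
            section = line[1:].lower()  # "%STATES" -> "states"
            continue
        if section is None:
            # Lines before the first % section are 'Key : Value' pairs.
            key, _, value = line.partition(":")
            machine["header"][key.strip()] = value.strip()
        elif section == "classifiers":
            name, filename = line.split()
            machine["classifiers"][name] = filename
        elif section == "states":
            name, classifier = line.split()
            if classifier not in machine["classifiers"]:
                raise ValueError(f"state {name} uses unknown classifier {classifier}")
            machine["states"][name] = classifier
        elif section == "transitions":
            start, end, action_type, movement = line.split()
            machine["transitions"].append((start, end, action_type, int(movement)))
        else:
            raise ValueError(f"unknown section: {section}")
    return machine


def initial_state(machine: Dict[str, Any]) -> str:
    """The initial state is the state that is defined first."""
    return next(iter(machine["states"]))


EXAMPLE_TM = """\
Name : Tagger Machine
Dicts : tagger.dicts
%CLASSIFIERS
tagger tagger.cla
signature signature.cla
%STATES
signature1 signature
tagger1 tagger
%TRANSITIONS
signature1 tagger1 MULTIWRITE 0
tagger1 signature1 WRITE +1
"""

machine = parse_tm(EXAMPLE_TM)
print(initial_state(machine))
```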