[TOC]

This page explains how a machine is encoded by files and how you can alter its hyperparameters.

\section b Machine Template

Inside every language directory you will see folders named after tools like 'tagger', 'morpho', 'lemmatizer', 'parser', 'tagparser'.

Each of these folders contains the full description of an untrained TransitionMachine :

ls macaon_data/fr/tagger/
machine.tm tagger.as tagger.dicts test.bd
signature.cla tagger.cla tagger.fm train.bd

When a folder like this one is used for training (./train.sh), it is copied into the bin folder and trained.

In the following sections we explain what these files stand for and how to modify them.

\section c .bd files

They stand for Buffer Description.

Each line of the .bd file defines one tape of the multi-tape buffer.

Example :

#Name ref/hyp dict Policy Must print?#
############################################
FORM ref form FromZero 1
POS hyp pos FromZero 1
SGN hyp sgn FromZero 1

The columns :
* __Name__ is the name of this tape.
* __ref/hyp__ stands for reference/hypothesis. A tape is a reference if its content is given as input.
* __dict__ is the name of the dictionary that will store the elements of this tape.
* __Policy__ is how the dictionary needs to be constructed :
  * __FromZero__ if the dictionary starts empty.
  * __Modifiable__ if the dictionary keeps its current values, and can modify them.
  * __Final__ if the dictionary keeps its current values, and cannot modify them.
* __Must print__ is whether or not this tape must be printed as an output of the program.

There are two .bd files per machine, one describing the buffer during training (train.bd) and the other describing the buffer during testing (test.bd).
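
As an illustration of the column layout described above, here is a minimal Python sketch that reads the example into tape records. It is not the parser used by the tool. The names `Tape` and `parse_bd` are hypothetical, and the sketch assumes that lines starting with `#` are comments, that columns are separated by whitespace, and that a __Must print__ value of 1 means "yes".

```python
from dataclasses import dataclass

POLICIES = ("FromZero", "Modifiable", "Final")


@dataclass
class Tape:
    name: str
    is_reference: bool  # True for "ref", False for "hyp"
    dict_name: str
    policy: str
    must_print: bool


def parse_bd(text):
    """Parse the content of a .bd file into a list of Tape records."""
    tapes = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: empty lines and lines starting with '#' define no tape.
        if not line or line.startswith("#"):
            continue
        name, kind, dict_name, policy, must_print = line.split()
        if kind not in ("ref", "hyp"):
            raise ValueError(f"unknown tape kind: {kind}")
        if policy not in POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        # Assumption: "1" means the tape is printed, anything else means it is not.
        tapes.append(Tape(name, kind == "ref", dict_name, policy, must_print == "1"))
    return tapes


EXAMPLE_BD = """\
#Name ref/hyp dict Policy Must print?#
############################################
FORM ref form FromZero 1
POS hyp pos FromZero 1
SGN hyp sgn FromZero 1
"""

tapes = parse_bd(EXAMPLE_BD)
for tape in tapes:
    kind = "reference" if tape.is_reference else "hypothesis"
    print(tape.name, kind, tape.dict_name)
```

With the example above, only FORM is a reference tape. POS and SGN are hypothesis tapes.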

\section d .dicts file

Each line of this file describes a dictionary that will be used by the machine.

Example :

#Name Dimension Mode #
############################
bool 02 Embeddings
int 05 Embeddings
letters 10 Embeddings
pos 15 Embeddings
form 30 Embeddings
sgn 10 Embeddings

The columns :
* __Name__ is the name of this dictionary.
* __Dimension__ is the number of float values used to encode each entry of this dictionary.
* __Mode__ is whether the entries are encoded in a __OneHot__ fashion or as __Embeddings__.
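
The three columns can be read with a few lines of Python. This is only a sketch of the format. The function name `parse_dicts` is hypothetical, and the handling of `#` lines as comments is an assumption based on the example.

```python
MODES = ("OneHot", "Embeddings")


def parse_dicts(text):
    """Map each dictionary name to its (dimension, mode) pair."""
    dicts = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: empty lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        name, dimension, mode = line.split()
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode}")
        # int() accepts the zero-padded values of the example, such as "02".
        dicts[name] = (int(dimension), mode)
    return dicts


EXAMPLE_DICTS = """\
#Name Dimension Mode #
############################
bool 02 Embeddings
int 05 Embeddings
letters 10 Embeddings
pos 15 Embeddings
form 30 Embeddings
sgn 10 Embeddings
"""

dicts = parse_dicts(EXAMPLE_DICTS)
for name, (dimension, mode) in dicts.items():
    print(f"{name}: {dimension} floats per entry, {mode}")
```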

\section e .as files

They stand for ActionSet.

Each line of a .as file is the name of an action that the corresponding classifier is capable of.

Example :

WRITE 0 POS adj
WRITE 0 POS adv
WRITE 0 POS advneg

There are multiple kinds of actions :
* __WRITE x y z__ : write __z__ on the tape __y__ on column __x__ relative to the Config head.
* __RULE x ON y z__ : apply the rule __z__ on the tape __y__ and store the result on the tape __x__. Both tapes are accessed at the column pointed to by the Config head.
* __SHIFT__ : push the Config head onto the stack.
* __REDUCE__ : pop the Config stack.
* __EOS__ : Tag the Config head as an end of sentence, and pop the stack.
* __ROOT__ : Tag the Config head as root of sentence, and pop the stack.
* __LEFT x__ : Add a dependency from the Config head to the top of the stack, of label __x__, and pop the stack.
* __RIGHT x__ : Add a dependency from the top of the stack to the Config head, of label __x__.

An ActionSet can have a default action. In this case the file must begin with :

Default : x

Where __x__ is the name of the default action, the one that will be applied when no other action can be applied.
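
The following Python sketch shows one way to read such a file, separating the optional default action from the list of actions. It is an illustration only. The name `parse_action_set` is hypothetical, and the example content (a `SHIFT` default and the labels `det` and `obj`) is invented for the demonstration. Whether the default action must also appear in the list is not specified here, so the sketch simply returns it separately.

```python
def parse_action_set(text):
    """Return (default_action, actions) from the content of a .as file.

    Each action is a (name, arguments) pair, for instance
    ("WRITE", ["0", "POS", "adj"]).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    default = None
    # The default action, if any, is declared on the first line.
    if lines and lines[0].startswith("Default"):
        default = lines[0].partition(":")[2].strip()
        lines = lines[1:]
    actions = []
    for line in lines:
        name, *arguments = line.split()
        actions.append((name, arguments))
    return default, actions


# Hypothetical file content, not taken from an actual machine.
EXAMPLE_AS = """\
Default : SHIFT
SHIFT
REDUCE
LEFT det
RIGHT obj
"""

default, actions = parse_action_set(EXAMPLE_AS)
print("default:", default)
for name, arguments in actions:
    print(name, arguments)
```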

\section f .fm files

They stand for FeatureModel.

Each line of a .fm file is the name of one feature that will be used to transform a Configuration into a feature description, for a specific Classifier.

Example :

b.-3.POS
b.0.FORM.U
b.0.FORM.LEN
b.0.FORM.PART.-1
b.0.FORM.PART.-3.-1
b.0.FORM.PART.+1
s.0.ldep.LABEL
tc.0

Where :
* __b.-3.POS__ : the content of the tape POS, at the column that is at index -3 relative to the head.
* __b.0.FORM.U__ : whether or not the content of the tape FORM, at the column under the head, starts with an uppercase letter.
* __b.0.FORM.LEN__ : the number of letters of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.-1__ : the last letter of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.-3.-1__ : the last three letters of the content of the tape FORM, at the column under the head.
* __b.0.FORM.PART.+1__ : the second letter of the content of the tape FORM, at the column under the head.
* __s.0.ldep.LABEL__ : the content of the tape LABEL, at the column of the closest left dependent of the column whose index is the value at the top of the stack.
* __tc.0__ is the previous action that was predicted by this Classifier.
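
To make the FORM features more concrete, the sketch below computes them for a single word in Python. It is an interpretation of the descriptions above, not the tool's implementation. The function `form_feature` is hypothetical. It assumes that PART indexes are zero-based, that negative indexes count from the end of the word, that a two-index PART is an inclusive range, and that an index falling outside the word yields an empty string.

```python
def form_feature(word, feature):
    """Evaluate a b.<index>.FORM.<...> feature on the word found at that column."""
    fields = feature.split(".")  # e.g. ["b", "0", "FORM", "PART", "-3", "-1"]
    kind, arguments = fields[3], fields[4:]
    if kind == "U":
        return word[:1].isupper()
    if kind == "LEN":
        return len(word)
    if kind != "PART":
        raise ValueError(f"unsupported feature: {feature}")

    length = len(word)
    # Turn negative indexes into positions counted from the start of the word.
    indexes = [i if i >= 0 else length + i for i in map(int, arguments)]
    if len(indexes) == 1:
        position = indexes[0]
        return word[position] if 0 <= position < length else ""
    first, last = indexes
    first = max(first, 0)
    if last < first:
        return ""
    return word[first:last + 1]  # inclusive range


word = "Maison"
for feature in ("b.0.FORM.U", "b.0.FORM.LEN", "b.0.FORM.PART.-1",
                "b.0.FORM.PART.-3.-1", "b.0.FORM.PART.+1"):
    print(feature, "->", form_feature(word, feature))
```

For the word `Maison`, the five features evaluate to `True`, `6`, `n`, `son` and `a`.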

\section g .cla files

Each .cla file describes a Classifier of the TransitionMachine.

There are three types of Classifier.

The ones of type __Prediction__, example :

Name : Tagger
Type : Prediction
Oracle : tagger
Feature Model : tagger.fm
Action Set : tagger.as
Topology : (300,RELU,0.3)

Where :
* __Name__ is the name of this Classifier.
* __Type__ is the type of this Classifier.
* __Oracle__ is the name of the oracle that will be used to train this Classifier.
* __Feature Model__ is the name of the FeatureModel used by this Classifier.
* __Action Set__ is the name of the ActionSet of this Classifier.
* __Topology__ is the topology of the underlying Multi-Layer Perceptron. It is a list of hidden layers, where each hidden layer is written as : the number of neurons, the activation function and the dropout rate.

This type of Classifier relies on a neural network to make predictions, so it requires training.
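
Since the Topology is the main hyperparameter of a Prediction Classifier, it can help to see how the string decomposes. The Python sketch below extracts the layers from a Topology value. The function `parse_topology` is hypothetical, and the way several layers are written one after the other (here, consecutive parenthesised groups) is an assumption, since the example above only shows a single layer.

```python
import re


def parse_topology(text):
    """Return a list of (neurons, activation, dropout) tuples."""
    layers = []
    # Each parenthesised group describes one hidden layer.
    for group in re.findall(r"\(([^()]*)\)", text):
        neurons, activation, dropout = (field.strip() for field in group.split(","))
        layers.append((int(neurons), activation, float(dropout)))
    return layers


print(parse_topology("(300,RELU,0.3)"))
# Assumed notation for two hidden layers:
print(parse_topology("(300,RELU,0.3)(100,RELU,0.1)"))
```

Under this reading, `(300,RELU,0.3)` is a single hidden layer of 300 neurons with a RELU activation and a dropout rate of 0.3.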

The ones of type __Information__, example :

Name : Tagger
Type : Information
Oracle : signature
Oracle Filename : ../../data/fP

Where :
* __Name__ is the name of this Classifier.
* __Type__ is the type of this Classifier.
* __Oracle__ is the name of the oracle that will be used by this Classifier.
* __Oracle Filename__ is the name of the file that the oracle can use to make its predictions.

This is the type of Classifier used to add information to the experiment, like the lemmas of words for instance.

And finally there are the ones of type __Forced__, which are only able to predict one Action.
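
A .cla file is a list of `Key : value` lines, so it can be loaded into a dictionary and checked against the fields listed above. The Python sketch below does this for the Prediction and Information types. It is an illustration, with hypothetical names (`parse_cla`, `REQUIRED_FIELDS`). The fields expected by a Forced Classifier are not described on this page, so the sketch only requires __Name__ and __Type__ for it.

```python
REQUIRED_FIELDS = {
    "Prediction": ["Name", "Type", "Oracle", "Feature Model", "Action Set", "Topology"],
    "Information": ["Name", "Type", "Oracle", "Oracle Filename"],
}


def parse_cla(text):
    """Read the 'Key : value' lines of a .cla file into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"expected 'Key : value', got: {line!r}")
        fields[key.strip()] = value.strip()
    # Assumption: other types (such as Forced) only need Name and Type.
    required = REQUIRED_FIELDS.get(fields.get("Type"), ["Name", "Type"])
    missing = [key for key in required if key not in fields]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return fields


EXAMPLE_CLA = """\
Name : Tagger
Type : Prediction
Oracle : tagger
Feature Model : tagger.fm
Action Set : tagger.as
Topology : (300,RELU,0.3)
"""

classifier = parse_cla(EXAMPLE_CLA)
print(classifier["Type"], classifier["Feature Model"], classifier["Topology"])
```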

\section h .tm files

A .tm file describes a TransitionMachine.

There must be exactly one .tm file per folder, and it must be called machine.tm.

Example :

Name : Tagger Machine
Dicts : tagger.dicts
%CLASSIFIERS
tagger tagger.cla
signature signature.cla
%STATES
signature1 signature
tagger1 tagger
%TRANSITIONS
signature1 tagger1 MULTIWRITE 0
tagger1 signature1 WRITE +1

Where :
* __Name__ is the name of this machine.
* __Dicts__ is the file describing the dictionaries used by the machine.
* __%CLASSIFIERS__ is the start of the classifiers section :
  * Each line contains the name of a Classifier and its corresponding file.
* __%STATES__ is the start of the states section :
  * Each line contains the name of a state and its corresponding classifier.
* __%TRANSITIONS__ is the start of the transitions section :
  * Each line describes a transition with : the starting state, the ending state, the corresponding type of action and the relative movement of the head.

The initial state of the machine is the state that is defined first.
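
The structure of machine.tm can be summarised by the Python sketch below, which reads the header, the three sections and the initial state, then checks that states and transitions only refer to declared names. It is not the loader used by the tool. The function `parse_tm` and the layout of the returned dictionary are hypothetical.

```python
def parse_tm(text):
    """Parse the content of a machine.tm file into a dictionary."""
    machine = {"header": {}, "classifiers": {}, "states": {}, "transitions": []}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("%"):
            section = line[1:]
        elif section is None:
            # Lines before the first section are 'Key : value' pairs.
            key, _, value = line.partition(":")
            machine["header"][key.strip()] = value.strip()
        elif section == "CLASSIFIERS":
            name, filename = line.split()
            machine["classifiers"][name] = filename
        elif section == "STATES":
            state, classifier = line.split()
            machine["states"][state] = classifier
        elif section == "TRANSITIONS":
            start, end, action_type, movement = line.split()
            machine["transitions"].append((start, end, action_type, int(movement)))
        else:
            raise ValueError(f"unknown section: {section}")

    # Consistency checks between the sections.
    for state, classifier in machine["states"].items():
        if classifier not in machine["classifiers"]:
            raise ValueError(f"state {state} uses unknown classifier {classifier}")
    for start, end, _, _ in machine["transitions"]:
        if start not in machine["states"] or end not in machine["states"]:
            raise ValueError(f"transition {start} -> {end} uses an unknown state")

    # The initial state is the first one defined (dictionaries keep insertion order).
    machine["initial_state"] = next(iter(machine["states"]), None)
    return machine


EXAMPLE_TM = """\
Name : Tagger Machine
Dicts : tagger.dicts
%CLASSIFIERS
tagger tagger.cla
signature signature.cla
%STATES
signature1 signature
tagger1 tagger
%TRANSITIONS
signature1 tagger1 MULTIWRITE 0
tagger1 signature1 WRITE +1
"""

machine = parse_tm(EXAMPLE_TM)
print("initial state:", machine["initial_state"])
for transition in machine["transitions"]:
    print(transition)
```

On the example, the machine starts in `signature1`, moves to `tagger1` without moving the head, then returns to `signature1` while moving the head one column forward.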