This repository contains a baseline MWE + supersense tagger to address the DiMSUM challenge on lexico-semantic segmentation and tagging. Details about the shared task are available [here](https://dimsum16.github.io/). The corpus files present in this folder in `dimsum` format were retrieved from the shared task's [git repository](https://github.com/dimsum16/dimsum-data).
The baseline is a simple neural system composed of a feedforward network implemented in Keras. To deal with word-level tagging and variable-length sentences, we simply apply a sliding window of +/-k words around each target word as features, and predict the tag of each word independently.
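As an illustration, here is a minimal sketch of the windowing idea (the window size, padding symbol, and function name are assumptions for illustration, not the actual implementation):

```python
# Minimal sketch of the sliding-window featurisation (hypothetical names; the
# real baseline may differ in details such as window size and padding symbol).
K = 2            # assumed context size: +/-2 words around the target
PAD = "<PAD>"    # assumed padding symbol for sentence boundaries

def window_features(sentence, i, k=K):
    """Return the 2k+1 word forms centred on position i, padded at the edges."""
    padded = [PAD] * k + sentence + [PAD] * k
    return padded[i : i + 2 * k + 1]

# Example: features for the word "kicked" in a 4-word sentence
print(window_features(["He", "kicked", "the", "bucket"], 1))
# ['<PAD>', 'He', 'kicked', 'the', 'bucket']
```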
The output is a coarse concatenation of BIO-style MWE tags and supersense tags. This concatenation allows predicting both tasks with a single model, although the complexity of the tagset probably hinders the model's performance.
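For instance, assuming an underscore as separator (as in the `B_v.stative` tags mentioned below), the coarse tags could be built and split as follows (a sketch, not the actual code):

```python
# Sketch of the coarse tag encoding (the underscore separator is an assumption
# based on tags such as "B_v.stative" mentioned later in this README).
def join_tags(mwe_tag, supersense):
    return f"{mwe_tag}_{supersense}" if supersense else mwe_tag

def split_tags(coarse_tag):
    mwe_tag, _, supersense = coarse_tag.partition("_")
    return mwe_tag, supersense

print(join_tags("B", "v.stative"))   # B_v.stative
print(split_tags("B_v.stative"))     # ('B', 'v.stative')
```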
### How to use the baseline
A single script is used to train and test the model. To train the model, you give a training corpus filename and the names of the model and dictionary files to be written:
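```bash
./baseline.py train dimsum-data/dimsum16.train baseline-model.h5 baseline-dicts.json
```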
The `train` keyword indicates the mode of the script and must be either `train` or `test`:
- `train`: trains from `corpusfile` and writes `modelfile` and `dictsfile`
- `test`: reads `modelfile` and `dictsfile` and tags `corpusfile` (ignoring gold labels if present)
The `corpusfile` (here `dimsum-data/dimsum16.train`) contains a corpus in DiMSUM format (tab-separated, UTF-8). MWE labels are in the 5th field, in BIO form, and supersense labels are in the 8th field.
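For reference, a minimal sketch of how the relevant fields can be read (field positions follow the description above; the position of the word form is an assumption):

```python
# Sketch: extract MWE and supersense labels from a DiMSUM file (tab-separated,
# UTF-8). Blank lines separate sentences; all other fields are ignored here.
with open("dimsum-data/dimsum16.train", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:                       # blank line = sentence boundary
            continue
        fields = line.split("\t")
        word = fields[1]                   # 2nd field (assumed): word form
        mwe_tag = fields[4]                # 5th field: BIO-style MWE tag
        supersense = fields[7]             # 8th field: supersense label
```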
The model file (here `baseline-model.h5`) will contain a Keras model in HDF5 format after training.
The dictionary file (here `baseline-dicts.json`) will contain the vocabularies of input and output words/labels in JSON format.
In other words, in `train` mode, the training corpus is an input and the model and dictionary files are outputs.
To test the baseline, the same script is used with similar parameters, but with a different interpretation:
```bash
./baseline.py test dimsum-data/dimsum16.test.blind baseline-model.h5 baseline-dicts.json
```
The `corpusfile` (here `dimsum-data/dimsum16.test.blind`) contains the corpus for which the model predicts MWEs and supersenses. If there are annotations in the test corpus, they will be overwritten. The baseline model can predict incompatible tag sequences (e.g. an `I` tag with no preceding `B`) and will not fill in the 6th field with MWE indices.
The model file (here `baseline-model.h5`) and dictionary file (here `baseline-dicts.json`) were generated by the baseline training procedure described above. In other words, in `test` mode, the test corpus, model file, and dictionary file are inputs, and the output is written in DiMSUM format to the command line (stdout).
### How to evaluate the baseline
The DiMSUM shared task provides a script called `dimsum-data/scripts/dimsumeval.py`; however, this script is poorly documented and very strict about the BIO format. Therefore, we first convert the result to the [CUPT](https://multiword.sourceforge.net/cupt-format/) format and then evaluate it separately for MWEs and supersenses:
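The exact commands depend on the conversion and evaluation scripts actually provided; a hypothetical sketch of the pipeline, with placeholder script names, could look like this:

```bash
# Hypothetical pipeline with placeholder script names; adapt to the conversion
# and evaluation scripts actually provided with this repository and by PARSEME.
./baseline.py test dimsum-data/dimsum16.test.blind baseline-model.h5 baseline-dicts.json > predictions.dimsum
./dimsum_to_cupt.py predictions.dimsum > predictions.cupt    # hypothetical DiMSUM-to-CUPT converter
./dimsum_to_cupt.py dimsum-data/dimsum16.test > gold.cupt    # gold test corpus, if available
./evaluate_mwe.py --gold gold.cupt --pred predictions.cupt          # PARSEME-style MWE evaluation
./evaluate_supersenses.py --gold gold.cupt --pred predictions.cupt  # supersense evaluation
```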
These evaluation scripts are adapted from the PARSEME shared task; see [their website](https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_04_LAW-MWE-CxG_2018___lb__COLING__rb__&subpage=CONF_50_Evaluation_metrics) for details.
### How to improve the baseline
**Input features.** The system currently uses one-hot representations of the most frequent word forms as features. This can be improved in several ways, for instance:
- Replace one-hot encoding by pre-trained static word embeddings (`word2vec` or `fasttext`); see the sketch after this list
- Replace one-hot encoding by contextual embeddings (`BERT`)
- Apply techniques to deal with OOVs, such as `<UNK>` embedding learning, character-based representations, or character-based convolution
- Use other available information such as POS tags and lemmas
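For instance, here is a minimal sketch of how pre-trained static vectors could replace the one-hot inputs (the file name, dimensionality, and vocabulary handling are assumptions for illustration):

```python
import numpy as np
from tensorflow import keras

# Sketch: build an embedding matrix from pre-trained vectors (e.g. a fastText
# ".vec" text file: one word per line followed by its vector). The file name,
# dimensionality, and the word2idx dictionary are illustrative assumptions.
EMB_DIM = 300
word2idx = {"<PAD>": 0, "<UNK>": 1}           # assumed vocabulary, index 0 = padding

vectors = {}
with open("wiki-news-300d.vec", encoding="utf-8") as f:
    next(f)                                   # skip the fastText header line
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
        word2idx.setdefault(parts[0], len(word2idx))

emb_matrix = np.zeros((len(word2idx), EMB_DIM), dtype="float32")
for word, idx in word2idx.items():
    if word in vectors:
        emb_matrix[idx] = vectors[word]

# Frozen embedding layer that could replace the one-hot input of the baseline
embedding = keras.layers.Embedding(
    input_dim=len(word2idx),
    output_dim=EMB_DIM,
    embeddings_initializer=keras.initializers.Constant(emb_matrix),
    trainable=False,
)
```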
**Model architecture.** The system uses a very (perhaps too) simple network to predict the tags. Here are some ideas on how to improve it:
- Replace the MLP by a sequence model such as an LSTM or GRU, or even stack 2 or 3 LSTM/GRU layers (see the sketch after this list)
- Use a more complex architecture using attention and/or transition systems
- Tune hyper-parameters on part of the training set used for development (or using cross-validation)
- Implement early stopping, improve regularisation, etc.
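As an illustration, here is a minimal sketch of a BiLSTM tagger that reads whole sentences instead of fixed windows (all sizes and layer choices are illustrative assumptions, not a prescribed architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of a sequence model replacing the windowed MLP. All sizes are
# illustrative assumptions; inputs are padded sentences of word indices and
# the output is one coarse tag per token.
VOCAB_SIZE, NUM_TAGS, EMB_DIM, HIDDEN = 10000, 200, 100, 128

inputs = keras.Input(shape=(None,), dtype="int32")             # (batch, sentence length)
x = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(x)
outputs = layers.Dense(NUM_TAGS, activation="softmax")(x)      # one tag per token

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```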
**Output representation.** The coarse tags of the output are the concatenation of BIO-style MWE tags and supersense tags. This creates a very large tagset and completely disjoint distributions for labels that share part of the information (e.g. completely different distributions for `B_v.stative` and `I_v.stative`, although they share the supersense). This can be improved in several ways:
- Use Viterbi decoding to prevent malformed BIO sequences
- Use two separate classifiers, one for MWEs and one for supersenses
- Use a more sophisticated architecture to perform multi-task learning and prediction, so that part of the information is shared by both tasks (see the sketch after this list)
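For example, here is a minimal sketch of a multi-task variant with a shared encoder and two output heads (all names and sizes are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of a two-headed model: a shared encoder, one softmax head for BIO MWE
# tags and one for supersenses. Sizes are illustrative, not the real tagset counts.
VOCAB_SIZE, EMB_DIM, HIDDEN = 10000, 100, 128
NUM_MWE_TAGS, NUM_SUPERSENSES = 3, 42

inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(x)

mwe_out = layers.Dense(NUM_MWE_TAGS, activation="softmax", name="mwe")(x)
sst_out = layers.Dense(NUM_SUPERSENSES, activation="softmax", name="supersense")(x)

model = keras.Model(inputs, [mwe_out, sst_out])
model.compile(
    optimizer="adam",
    loss={"mwe": "sparse_categorical_crossentropy",
          "supersense": "sparse_categorical_crossentropy"},
)
```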
Moreover, experimental questions can be studied within this problem:
- What is the influence of discontinuous MWEs?
- What is the influence of nested MWEs?
The combination of all these questions creates a huge space of possibilities. Target two or three questions, justify your choice, and implement the modifications to test their impact on the model's performance. You will then present your findings in the final report.