# Baseline MWE + supersense tagger
This repository contains a baseline MWE + supersense tagger to address the DiMSUM challenge on lexico-semantic segmentation and tagging. Details about the shared task are available here. The corpus files present in this folder in `dimsum` format were retrieved from the shared task's git repository.
The baseline is a simple neural system composed of a feedforward network implemented in keras. To deal with word-level tagging and variable-length sentences, we simply apply a sliding window of +/-k words around the target word as features and predict the tag of each word independently.
The output is a coarse concatenation of BIO-style MWE tags and supersense tags. This concatenation allows predicting both tasks with a single model, although the complexity of the tagset probably hinders the model's performance.
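To make the window idea concrete, here is a minimal sketch (a hypothetical helper, not code from `baseline.py`) of how the +/-k window around a target word could be extracted:

```python
K = 2  # assumed window half-size; the actual value used by the baseline may differ

def window_features(sentence, i, k=K, pad="<PAD>"):
    """Return the 2k+1 word forms centred on position i, padded at sentence edges."""
    padded = [pad] * k + sentence + [pad] * k
    return padded[i:i + 2 * k + 1]

sentence = ["I", "looked", "it", "up", "online"]
print(window_features(sentence, 1))  # ['<PAD>', 'I', 'looked', 'it', 'up']
# The target for "looked" would be a single coarse label such as "B_v.cognition",
# i.e. a BIO MWE tag and a supersense concatenated into one tag.
```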
## How to use the baseline
A single script is used to train and test the model. To train the model, you give a training corpus filename and the names of the model and dictionary files to be written:
```bash
./baseline.py train dimsum-data/dimsum16.train baseline-model.h5 baseline-dicts.json
```
The `train` keyword indicates the mode of the script and must be `train` or `test`:

- `train`: trains from `corpusfile` and writes `modelfile` and `dictsfile`
- `test`: reads `modelfile` and `dictsfile` and tags `corpusfile` (ignores gold labels if present)
The `corpusfile` (here `dimsum-data/dimsum16.train`) contains a corpus in DimSum format (tab-separated, UTF-8). MWE labels are in the 5th field in BIO form, and supersense labels are in the 8th field.

The model file (here `baseline-model.h5`) will contain a keras h5 model after training.

The dictionary file (here `baseline-dicts.json`) will contain the vocabularies of input and output words/labels in JSON format.

In other words, in `train` mode, the training corpus is an input and the model and dictionary files are outputs.
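For reference, a rough sketch (not the repository's exact code) of how such a tab-separated file can be read, keeping only the fields mentioned above:

```python
def read_dimsum(path):
    """Read a DimSum-format file into sentences of (word, MWE tag, supersense) triples."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line = sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            fields = line.split("\t")
            word = fields[1]             # 2nd field: token form (assumed, CoNLL-style layout)
            mwe_tag = fields[4]          # 5th field: BIO-style MWE tag
            supersense = fields[7]       # 8th field: supersense label (may be empty)
            current.append((word, mwe_tag, supersense))
    if current:
        sentences.append(current)
    return sentences
```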
To test the baseline, the same script is used with similar parameters, but with a different interpretation:
```bash
./baseline.py test dimsum-data/dimsum16.test.blind baseline-model.h5 baseline-dicts.json
```
The `corpusfile` (here `dimsum-data/dimsum16.test.blind`) contains the corpus for which the model predicts MWEs and supersenses. If there are annotations in the test corpus, they will be overwritten. The baseline model can predict incompatible tag sequences and will not fill in the 6th field with MWE indices.

The model file (here `baseline-model.h5`) and dictionary file (here `baseline-dicts.json`) were generated by the baseline training procedure described above. In other words, in `test` mode, the test corpus, model file, and dictionary file are inputs, and the output is written in DiMSUM format to the command line (stdout), so you will usually want to redirect it to a file.
## How to evaluate the baseline
The DiMSUM shared task provides a script called `dimsum-data/scripts/dimsumeval.py`; however, this script is poorly documented and very picky about the BIO format. Therefore, we first convert the result to the CUPT format and then evaluate it separately for MWEs and supersenses:
```bash
eval/dimsum2cupt.py --dimsum-file dimsum-data/dimsum16.test.pred > dimsum-data/dimsum16.test.pred.cupt

# Supersense P/R/F
eval/evaluate-ss.py --gold dimsum-data/dimsum16.train.cupt --pred dimsum-data/dimsum16.test.pred.cupt

# MWE P/R/F for exact match (MWE-based) and fuzzy match (Token-based) + some additional stats
eval/evaluate-mwe.py --gold dimsum-data/dimsum16.train.cupt --pred dimsum-data/dimsum16.test.pred.cupt
```
These evaluation scripts are adapted from the PARSEME shared task; check their website for details.
## How to improve the baseline
- **Input features**: The system currently uses one-hot representations of the most frequent word forms as features. This can be improved in several ways, for instance (a sketch of the first option follows this list):
  - Replace one-hot encoding by randomly initialised dense embeddings (e.g. a keras `Embedding` layer)
  - Replace one-hot encoding by pre-trained static word embeddings (`word2vec` or `fasttext`)
  - Replace one-hot encoding by contextual embeddings (`BERT`)
  - Apply techniques to deal with OOVs, such as `<UNK>` embedding learning, character-based representations, or character-based convolution
  - Use other available information such as POS tags and lemmas
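As an illustration of the first option, here is a hedged sketch using `tensorflow.keras` with made-up vocabulary, window, and tagset sizes (none of these values come from `baseline.py`):

```python
from tensorflow import keras

VOCAB_SIZE, EMB_DIM, WINDOW, N_TAGS = 10000, 100, 5, 200  # assumed sizes

inputs = keras.Input(shape=(WINDOW,), dtype="int32")            # word indices of the window
x = keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)         # dense, trainable embeddings
x = keras.layers.Flatten()(x)                                   # concatenate the window embeddings
x = keras.layers.Dense(128, activation="relu")(x)
outputs = keras.layers.Dense(N_TAGS, activation="softmax")(x)   # one coarse tag per target word
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```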
- **Model architecture**: The system uses a very (too) simple network to predict the tags. Here are some ideas on how to improve it (a sketch of the first option follows this list):
  - Replace the MLP by a sequence model such as an LSTM or GRU, or even stack 2 or 3 LSTM/GRU layers
  - Use a more complex architecture using attention and/or transition systems
  - Tune hyper-parameters on a part of the training set held out for development (or using cross-validation)
  - Implement early stopping, improve regularisation, etc.
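For the first option, a possible sketch (assumed sizes, not the repository's code) of a bidirectional LSTM that tags a whole padded sentence instead of one window at a time:

```python
from tensorflow import keras

VOCAB_SIZE, EMB_DIM, MAX_LEN, N_TAGS = 10000, 100, 60, 200  # assumed sizes

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")                     # padded word indices
x = keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)   # index 0 reserved for padding
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
outputs = keras.layers.Dense(N_TAGS, activation="softmax")(x)             # one tag per token
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```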
- **Output representation**: The coarse tags of the output are the concatenation of BIO-style MWE tags and supersense tags. This creates a very large tagset and completely disjoint distributions for labels sharing part of the information (e.g. completely different distributions for `B_v.stative` and `I_v.stative`, although they share the supersense). This can be improved in many possible ways (a sketch of a multi-task variant follows this list):
  - Use Viterbi decoding to prevent malformed BIO sequences
  - Use two separate classifiers, one for MWEs and one for supersenses
  - Use a more sophisticated architecture to perform multi-task learning and prediction, so that part of the information is shared by both tasks
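One way to combine the last two ideas is a shared encoder with two softmax heads, one per task. This is only a sketch with assumed tagset sizes, not the baseline's architecture:

```python
from tensorflow import keras

VOCAB_SIZE, EMB_DIM, MAX_LEN = 10000, 100, 60   # assumed sizes
N_MWE_TAGS, N_SUPERSENSES = 6, 42               # assumed tagset sizes

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
mwe_out = keras.layers.Dense(N_MWE_TAGS, activation="softmax", name="mwe")(x)            # BIO MWE tags
ss_out = keras.layers.Dense(N_SUPERSENSES, activation="softmax", name="supersense")(x)   # supersenses

model = keras.Model(inputs, [mwe_out, ss_out])
model.compile(optimizer="adam",
              loss={"mwe": "sparse_categorical_crossentropy",
                    "supersense": "sparse_categorical_crossentropy"})
```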
Moreover, experimental questions can be studied within this problem:
- What is the influence of discontinuous MWEs?
- What is the influence of nested MWEs?
The combination of all these questions creates a huge space of possibilities. Target two to three questions, justify their choice, and implement the modifications to test their impact on the model's performance. You will then present your findings in the final report.