Baseline MWE + supersense tagger
----------------------------
This repository contains a baseline MWE + supersense tagger addressing the DiMSUM shared task on lexico-semantic segmentation and tagging. Details about the shared task are available [here](https://dimsum16.github.io/). The corpus files in `dimsum` format present in this folder were retrieved from the shared task's [git repository](https://github.com/dimsum16/dimsum-data).
The baseline is a simple neural system: a feedforward network implemented in Keras. To handle word-level tagging and variable-length sentences, we simply use a sliding window of +/-k words around each target word as features, and predict the tag of each word independently.
The output is a coarse concatenation of BIO-style MWE tags and supersense tags. This concatenation allows predicting both tasks with a single model, although the complexity of the tagset probably hinders the model's performance.
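To make the windowing and the combined tags concrete, here is a minimal, self-contained sketch of the idea with a toy sentence, an invented vocabulary, and window size k=1; `baseline.py` below does the same thing with real vocabularies, one-hot encoding of the window, and k=3:
```python
# Toy illustration of the sliding-window features and the combined
# MWE###supersense output tags (all words, tags and IDs are invented).
PAD, UNK = 0, 1
vocab_in = {"the": 2, "dog": 3, "gave": 4, "up": 5}   # word form -> ID
vocab_out = {"O###_": 0, "O###n.animal": 1, "B###v.stative": 2, "I###v.stative": 3}

sentence = ["the", "dog", "gave", "up"]
tags = ["O###_", "O###n.animal", "B###v.stative", "I###v.stative"]

k = 1  # window of k words before and after the target
ids = [PAD] * k + [vocab_in.get(w, UNK) for w in sentence] + [PAD] * k
for i, tag in enumerate(tags):
    window = ids[i:i + 2 * k + 1]   # features for target word i
    target = vocab_out[tag]         # single class combining MWE tag and supersense
    print(window, "->", target)
```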
### How to use the baseline
A single script is used to train and test the model. To train the model, you give a training corpus filename and the names of the model and dictionary files to be written:
```bash
./baseline.py train dimsum-data/dimsum16.train baseline-model.h5 baseline-dicts.json
```
The first argument (`train` above) selects the mode of the script and must be `train` or `test`:
- train: trains from corpusfile and writes modelfile and dictsfile
- test: reads modelfile and dictsfile and tags corpusfile (ignores gold labels if present)
The corpusfile (here `dimsum-data/dimsum16.train`) contains a corpus in DimSum format (9 tab-separated fields per token, UTF-8; see the `header` list in `baseline.py`). MWE labels are in the 5th field in BIO form, and supersense labels are in the 8th field.
The model file (here `baseline-model.h5`) will contain a keras h5 model after training.
The dictionary file (here `baseline-dicts.json`) will contain the vocabularies of input and output words/labels in JSON format.
In other words, in `train` mode, the training corpus is an input and the model and dictionary files are outputs.
To test the baseline, the same script is used with the same kind of parameters, but they are interpreted differently:
```bash
./baseline.py test dimsum-data/dimsum16.test.blind baseline-model.h5 baseline-dicts.json
```
The corpusfile (here `dimsum-data/dimsum16.test.blind`) contains the corpus for which the model predicts MWEs and supersenses. If annotations are present in the test corpus, they are ignored and overwritten. Note that the baseline can predict incompatible (malformed BIO) tag sequences and does not fill in the 6th field with MWE indices.
The model file (here `baseline-model.h5`) and dictionary file (here `baseline-dicts.json`) were generated by the training procedure described above. In other words, in `test` mode, the test corpus, model file, and dictionary file are inputs, and the tagged corpus is written in DiMSUM format to standard output; redirect it to a file (e.g. `> dimsum-data/dimsum16.test.pred`) to use it in the evaluation step below.
### How to evaluate the baseline
The DiMSUM shared task provides a script called `dimsum-data/scripts/dimsumeval.py`; however, this script is poorly documented and very strict about the BIO format. Therefore, we first convert the result to the [CUPT](https://multiword.sourceforge.net/cupt-format/) format and then evaluate MWEs and supersenses separately:
```bash
eval/dimsum2cupt.py --dimsum-file dimsum-data/dimsum16.test.pred > dimsum-data/dimsum16.test.pred.cupt
# Supersense P/R/F
eval/evaluate-ss.py --gold dimsum-data/dimsum16.train.cupt --pred dimsum-data/dimsum16.test.pred.cupt
# MWE P/R/F for exact match (MWE-based) and fuzzy match (Token-based) + some additional stats
eval/evaluate-mwe.py --gold dimsum-data/dimsum16.train.cupt --pred dimsum-data/dimsum16.test.pred.cupt
```
These evaluation scripts are adapted from the PARSEME shared task; see [their website](https://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_04_LAW-MWE-CxG_2018___lb__COLING__rb__&subpage=CONF_50_Evaluation_metrics) for details.
### How to improve the baseline
* **Input features** The system currently uses one-hot representations of the most frequent word forms as features. This can be improved in several ways, for instance:
- Replace one-hot encoding by randomly initialised dense embeddings (e.g. a Keras `Embedding` layer, see the sketch below)
- Replace one-hot encoding by pre-trained static word embeddings (`word2vec` or `fasttext`)
- Replace one-hot encoding by contextual embeddings (`BERT`)
- Apply techniques to deal with OOVs, such as `<UNK>` embedding learning, character-based representations, or character-based convolution
- Use other available information such as POS tags and lemmas
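As an illustration of the first idea, the one-hot window could be replaced by a trainable `Embedding` layer. The sketch below is a hypothetical variant of the baseline network; the vocabulary size, embedding dimension and tagset size are made-up values, and `prepare_data` would have to feed raw word IDs instead of one-hot vectors:
```python
# Hypothetical sketch: trainable embeddings instead of one-hot window features.
# The network now takes the 2*window_size+1 word IDs of the window directly.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout

vocab_in_len = 10000        # assumed input vocabulary size (incl. padding and UNK)
n_tags = 200                # assumed size of the combined output tagset
nb_feats = 2 * 3 + 1        # window_size = 3, as in baseline.py

model = Sequential([
    Embedding(input_dim=vocab_in_len, output_dim=64, input_length=nb_feats),
    Flatten(),              # concatenate the 7 word embeddings into one vector
    Dense(512, activation='relu'),
    Dropout(0.3),
    Dense(n_tags, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```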
* **Model architecture** The system uses a very (too) simple network to predict the tags. Here are some ideas on how to improve it:
- Replace the MLP by a sequence model such as an LSTM or GRU, or even stack 2 or 3 LSTM/GRU layers (see the BiLSTM sketch below)
- Use a more complex architecture using attention and/or transition systems
- Tune hyper-parameters on part of the training set used for development (or using cross-validation)
- Implement early stopping, improve regularisation, etc.
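As an example of the sequence-model direction, here is a hypothetical sketch of a sentence-level BiLSTM tagger trained with early stopping. It assumes the data preparation is changed so that each training example is a whole padded sentence of word IDs (shape `nb_sentences x max_len`) with one one-hot tag per position; all sizes below are made up:
```python
# Hypothetical sketch: BiLSTM tagger over whole (padded) sentences + early stopping.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from tensorflow.keras.callbacks import EarlyStopping

vocab_in_len, n_tags, max_len = 10000, 200, 60   # assumed sizes

model = Sequential([
    Embedding(input_dim=vocab_in_len, output_dim=64, mask_zero=True, input_length=max_len),
    Bidirectional(LSTM(128, return_sequences=True)),     # one hidden state per token
    TimeDistributed(Dense(n_tags, activation='softmax')),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Stop training when the validation loss stops improving, keeping the best weights.
early = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
# model.fit(train_x, train_y, epochs=50, batch_size=32,
#           validation_split=0.1, callbacks=[early])
```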
* **Output representation** The coarse output tags are the concatenation of BIO-style MWE tags and supersense tags. This creates a very large tagset and completely disjoint distributions for labels that share part of their information (e.g. completely different distributions for `B_v.stative` and `I_v.stative` although they share the supersense). This can be improved in many ways, for instance:
- Use Viterbi decoding to prevent malformed BIO sequences
- Use two separate classifiers, one for MWEs and one for supersenses
- Use a more sophisticated multi-task architecture, so that part of the information is shared by both tasks (see the two-head sketch below)
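For example, the single softmax over concatenated tags could be replaced by two heads sharing a hidden layer, one per task. The sketch below uses the Keras functional API with made-up sizes, and assumes `prepare_data` is adapted to return two separate label arrays (`y_mwe`, `y_ss` are hypothetical names):
```python
# Hypothetical sketch: multi-task model with one shared hidden layer and
# two softmax heads, one for MWE tags and one for supersenses.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout

in_size = 7 * 10000                    # assumed size of the one-hot window input
n_mwe_tags, n_supersenses = 10, 50     # assumed sizes of the two separate tagsets

inputs = Input(shape=(in_size,))
hidden = Dropout(0.3)(Dense(512, activation='relu')(inputs))
mwe_out = Dense(n_mwe_tags, activation='softmax', name='mwe')(hidden)
ss_out = Dense(n_supersenses, activation='softmax', name='supersense')(hidden)

model = Model(inputs=inputs, outputs=[mwe_out, ss_out])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(train_x, {'mwe': y_mwe, 'supersense': y_ss}, epochs=10, batch_size=32)
```
With two heads, `B_v.stative` and `I_v.stative` no longer compete as unrelated classes: the supersense head sees all occurrences of `v.stative` regardless of the MWE tag.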
Moreover, experimental questions can be studied within this problem:
- What is the influence of discontinuous MWEs?
- What is the influence of nested MWEs?
The combination of all these questions creates a huge space of possibilities. Target two to three questions, justify their choice, and implement the modifications to test their impact on the model's performance. You will then present your findings in the final report.
#!/usr/bin/env python3
import sys
import sklearn.preprocessing
import argparse
import json
import conllu # pip3 install --user conllu
from collections import Counter
import tensorflow.keras as keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout
import numpy as np
import pdb
########################################################################
header = ["id", "form", "lemma", "upos", "mwe", "mweid", "misc", "supersense", "source"]
window_size = 3 # number of words before/after each target
min_count = 4 # minimum number of occurrences to keep form in input vocab
field_in = "form"
field_out = "mwe###supersense"
PADDING_ID=0
UNK_ID=1
########################################################################
def get_histograms(corpusfilename, header, combined=[]):
    """Calculate histograms (nb. of occurrences per item) for all fields.
    `combined`: list of '###'-separated field names for combined field statistics
    """
    with open(corpusfilename, "r", encoding="utf8") as corpusfile:
        hist = {x: Counter() for x in header + combined}
        for tokenlist in conllu.parse_incr(corpusfile, fields=header):
            for token in tokenlist:
                for x in header:  # simple per-field counts
                    hist[x].update([token[x]])
                for c in combined:  # combined-field counts, e.g. "mwe###supersense"
                    hist[c].update(["###".join(token[j] for j in c.split("###"))])
        return hist
########################################################################
def generate_vocab(hist, field, min_count, start=0):
"""Keep items occurring >= min_count times in field's histogram, assign int IDs to kept items"""
itemid = start
vocab_filtered = {}
for x, c in hist[field].items() :
if c >= min_count :
vocab_filtered[x] = itemid
itemid += 1
return vocab_filtered
########################################################################
def invert_vocab(vocab):
inverted = []
for i in sorted(vocab.items(), key=lambda it: it[1]):
inverted.append(i[0])
return inverted
########################################################################
def tokenlist_to_ids(tokenlist, field, vocab, padding=0):
seq = []
for token in tokenlist :
seq.append("###".join(token[x] for x in field.split("###")))
return [PADDING_ID]*padding + \
[vocab.get(s, UNK_ID) for s in seq] + \
[PADDING_ID]*padding
########################################################################
def corpus_to_ids(corpusfilename, header, field_in, vocab_in, field_out, vocab_out, window_size):
with open(corpusfilename,"r",encoding="utf8") as corpusfile:
for tokenlist in conllu.parse_incr(corpusfile, fields=header):
seq_in = tokenlist_to_ids(tokenlist, field_in, vocab_in, window_size)
seq_out = tokenlist_to_ids(tokenlist, field_out, vocab_out, 0)
yield seq_in, seq_out
########################################################################
def prepare_data(corpusfilename, header, field_in, vocab_in, field_out,
vocab_out, window_size, vocab_in_len):
train_x_list = []
train_y_list = []
nb_feats = window_size*2+1 # nb of input words seen at a time
in_size = vocab_in_len * nb_feats
for (seq_in, seq_out) in corpus_to_ids(corpusfilename, header, field_in,
vocab_in, field_out, vocab_out, window_size):
for i_window in range(len(seq_out)):
in_words = seq_in[i_window:i_window+nb_feats]
in_onehot = to_categorical(in_words, num_classes=vocab_in_len)
train_x_list.append(in_onehot.reshape(in_size))
train_y_list.append(to_categorical(seq_out[i_window], num_classes=len(vocab_out)))
return np.array(train_x_list), np.array(train_y_list), in_size
########################################################################
def train_save_model(corpusfilename, modelfilename, dictsfilename, field_in, field_out, min_count, window_size, header):
combined = [x for x in [field_in, field_out] if "###" in x]
hist = get_histograms(corpusfilename, header, combined)
# start=2: item ID 0 is reserved for padding, 1 for unknown word
vocab_in = generate_vocab(hist, field_in, min_count, 2)
vocab_out = generate_vocab(hist, field_out, 0, 0)
vocab_in_len = len(vocab_in)+2 # +2 for padding + unk
# save dict to json file
print("In/out vocab len:", vocab_in_len,len(vocab_out), file=sys.stderr)
json.dump((vocab_in,vocab_out),open(dictsfilename,"w",encoding="utf8"))
train_x, train_y, in_size = prepare_data(corpusfilename, header, field_in,
vocab_in, field_out, vocab_out,
window_size, vocab_in_len)
model = Sequential()
model.add(Dense(units=512, activation='relu', input_dim=in_size))
model.add(Dropout(0.3))
model.add(Dense(units=len(vocab_out), activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(train_x, train_y, epochs=2, batch_size=32, validation_split=0.1)
model.save(modelfilename)
########################################################################
def load_model_predict(corpusfilename,modelfilename,dictsfilename,field_in, field_out, window_size, header):
vocab_in, vocab_out = json.load(open(dictsfilename,"r",encoding="utf8"))
invert_out = invert_vocab(vocab_out)
model = load_model(modelfilename)
test_x, test_y, in_size = prepare_data(corpusfilename, header, field_in, vocab_in, field_out, vocab_out, window_size, len(vocab_in)+2)
    labels = model.predict(test_x, verbose=0).argmax(axis=1)  # predicted tag ID for each token
i_label = 0
with open(corpusfilename,"r",encoding="utf8") as corpusfile:
for tokenlist in conllu.parse_incr(corpusfile, fields=header):
            for token in tokenlist:
                mwe, ss = invert_out[labels[i_label]].split("###")
                # Ugly workaround to avoid incompatible taggings:
                # uncomment the next line to force all MWE tags to "O"
                # token["mwe"] = "O"
                token["mwe"] = mwe
                token["mweid"] = None  # field will be ignored when converting to CUPT
                token["supersense"] = ss
                i_label += 1
print(tokenlist.serialize().replace("_\t","\t"),end="")
########################################################################
if __name__ == '__main__':
if len(sys.argv)!=5:
usage="""Usage:
{} mode corpusfile modelfile dictsfile
mode is one of "train" or "test"
- train: trains from corpusfile and writes modelfile and dictsfile
- test: reads modelfile and dictsfile and tags corpusfile (ignores gold labels if present)
corpusfile contains a corpus in DimSum format (tab-separated, UTF-8)
modelfile contains a keras h5 model
dictsfile contains vocabularies in json format""".format(sys.argv[0])
print(usage,file=sys.stderr)
        sys.exit(1)
##########################
mode = sys.argv[1]
corpusfilename = sys.argv[2]
modelfilename = sys.argv[3]
dictsfilename = sys.argv[4]
if mode == "train":
train_save_model(corpusfilename,modelfilename,dictsfilename,field_in, field_out, min_count, window_size, header)
elif mode == "test":
load_model_predict(corpusfilename,modelfilename,dictsfilename,field_in, field_out, window_size, header)
    else:
        print('Error: mode must be "train" or "test", found "{}"'.format(mode), file=sys.stderr)
        sys.exit(1)