Benoit Favre authored
Topic classifier for biomedical articles
Multilabel topic classifier for medical articles.
This system learns a topic classifier for articles labelled with multiple topics. The included model uses a variant of BERT pre-trained on medical texts and fine-tunes it on task instances.
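To illustrate what "multilabel" means here, the following minimal sketch turns each article's list of topics into a multi-hot target vector, with one slot per known topic. The function names `build_label_index` and `to_multi_hot` and the topic names are hypothetical, chosen for illustration; this is not necessarily how the trainer encodes labels internally.

```python
def build_label_index(articles):
    """Map every topic seen in the data to a stable integer position."""
    topics = sorted({topic for article in articles for topic in article["topics"]})
    return {topic: i for i, topic in enumerate(topics)}


def to_multi_hot(topics, label_index):
    """Return a vector with 1.0 at the position of each topic, 0.0 elsewhere."""
    vector = [0.0] * len(label_index)
    for topic in topics:
        vector[label_index[topic]] = 1.0
    return vector


# Made-up articles; only the "topics" field matters for this sketch.
articles = [
    {"topics": ["Treatment", "Diagnosis"]},
    {"topics": ["Prevention"]},
]
label_index = build_label_index(articles)
targets = [to_multi_hot(a["topics"], label_index) for a in articles]
print(label_index)
print(targets)
```

Unlike single-label classification, several positions can be set at once, so a model for this task would typically score each topic independently.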
Data
Input data is expected to be a JSON-formatted file containing a list of articles. Each article should have a title field, an abstract field, and a topics field containing a list of topics.
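As a sketch of the expected layout, the snippet below builds a small input in that shape and checks that every article carries the three fields. The article contents and the helper `validate_articles` are made up for illustration and are not part of this repository.

```python
import json

REQUIRED_FIELDS = ("title", "abstract", "topics")


def validate_articles(articles):
    """Return a list of problems found; an empty list means the input looks valid."""
    if not isinstance(articles, list):
        return ["top-level JSON value must be a list of articles"]
    problems = []
    for i, article in enumerate(articles):
        if not isinstance(article, dict):
            problems.append(f"article {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in article:
                problems.append(f"article {i}: missing field '{field}'")
        topics = article.get("topics")
        if topics is not None and not isinstance(topics, list):
            problems.append(f"article {i}: 'topics' must be a list")
    return problems


# Made-up example data in the expected shape.
example = [
    {
        "title": "A hypothetical study title",
        "abstract": "A hypothetical abstract describing the study.",
        "topics": ["Treatment", "Diagnosis"],
    },
    {
        "title": "Another hypothetical title",
        "abstract": "Another hypothetical abstract.",
        "topics": ["Prevention"],
    },
]

# Round-trip through JSON text, as the file on disk would be.
text = json.dumps(example, indent=2)
print(validate_articles(json.loads(text)))
```

Running a check like this before training can catch missing fields early, rather than partway through an epoch.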
Installing
virtualenv -p python3 env
source env/bin/activate
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
Training
python trainer.py [options]
optional arguments:
-h, --help show this help message and exit
--gpus GPUS
--nodes NODES
--name NAME
--fast_dev_run
--train_filename TRAIN_FILENAME
--learning_rate LEARNING_RATE
--batch_size BATCH_SIZE
--epochs EPOCHS
--valid_size VALID_SIZE
--max_len MAX_LEN
--bert_flavor BERT_FLAVOR
--selected_features SELECTED_FEATURES
Example training command line:
python trainer.py --gpus=-1 --name test1 --train_filename ../scrappers/data/20200529/litcovid.json
pytorch-lightning provides a TensorBoard logger. You can inspect the training logs with:
tensorboard --logdir lightning_logs
Then point your browser to http://localhost:6006/.
Generating predictions
python predict.py --checkpoint checkpoints/epoch\=0-val_loss\=0.2044.ckpt --test_filename ../scrappers/data/20200529/cord19-metadata.json > predicted.json