BART is a modular toolkit for anaphora resolution that supports state-of-the-art statistical approaches to the task and enables efficient feature engineering.

There are two modes for using BART: (a) BART Server and (b) Terminal application.

Use the BART Server if you need end-to-end coreference as a black box: if you want to automatically annotate textual documents with coreference chains, without going into any details of coreference itself. Section (1) below explains how to do that (English only).

Use the full-scale application if you want to train a new model. Sections (2-7) below explain briefly how to do that; send us an email if you need more detailed instructions or encounter any difficulties.

NB: you don't have to start the BART Server to use BART as a Terminal application; these are completely separate modes.

If you are using BART for your research, please cite the following papers:

@inproceedings{bart,
  Author = {Yannick Versley and Simone Paolo Ponzetto and Massimo Poesio and Vladimir Eidelman and Alan Jern and Jason Smith and Xiaofeng Yang and Alessandro Moschitti},
  Booktitle = {{Proceedings of the 2008 Conference of the Association for Computational Linguistics}},
  Pages = {9--12},
  Title = {{BART}: a modular toolkit for coreference resolution},
  Year = {2008}
}

@InProceedings{conll2012bart,
  author = {Olga Uryupina and Alessandro Moschitti and Massimo Poesio},
  title = {{BART goes multilingual: The UniTN~/~Essex submission to the CoNLL-2012 Shared Task}},
  booktitle = {{Proceedings of the Sixteenth Conference on Computational Natural Language Learning (CoNLL 2012)}},
  year = 2012
}

===========================================

(1) Using BART as a web demo, UNIX or MacOS:

- in your terminal, run the following:

$ source setup.sh
$ java -Xmx1024m elkfed.webdemo.BARTServer

This will launch the BART demo. Now open your browser and go to http://localhost:8125/index.jsp

You will see the output on the screen. In the webdemo_temp directory, BART will create MMAX files for your document, using a randomly generated name.

(1a) Sending requests to the BART server from a terminal

Once the BART server is running (cf. (1) above), you can also send requests from your terminal:

- prepare some file with textual input
- run the following:

$ lwp-request -m POST -c 'text/html; charset=ISO8859-15' http://localhost:8125/BARTDemo/ShowText/process/ < your_file_with_txt_input

You will get XML output in your terminal and all the MMAX files in the webdemo_temp directory.
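If lwp-request (part of the Perl libwww-perl tools) is not installed on your machine, an equivalent request can be sent with curl. This is only a sketch, assuming curl is available; the endpoint and charset are the same as above:

$ curl -X POST -H 'Content-Type: text/html; charset=ISO8859-15' --data-binary @your_file_with_txt_input http://localhost:8125/BARTDemo/ShowText/process/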
(1b) Distributed models

We distribute BART with two models. The default model is a reimplementation of Soon et al. (2001), trained on the OntoNotes English data. The alternative model is the system we (UniTN + UniEssex) have used for our CoNLL submission. It is slower but more accurate. To use this model, replace "config/config.properties" with "config/config.properties.conll". To get back to the faster model, replace "config/config.properties" with "config/config.properties.noconll".

Be careful not to overwrite the distributed models by running "XMLExperiment" with no settings xml-file specified (see (2) below). If that happens, delete all the files in models/coref and replace them with the ones from the release.

----------

(2) Retraining/retesting BART on a sample corpus:

$ source setup.sh
$ java -Xmx2000m -Delkfed.corpus=sample elkfed.main.XMLExperiment presets/soonbaseline.xml

This will run a full train-and-test experiment on the dataset called "sample". The dataset's parameters (language, path to the train/test data) are set in config/config.properties. The experimental parameters (ML settings, features) are specified in presets/soonbaseline.xml.

As a result of the full run, the MUC scores will be displayed on the screen (per document and total). In the testing directory, the "response" level will be added to all the documents (check "./sample/english-preprocessed/test/markables").

While it is possible to run BART with no corpus and/or no experimental parameter file specified, we strongly recommend that you not do that, as it will overwrite our distributed models (cf. 1b above).

Once you've trained a model, you can test it once again on any dataset:

$ java -Xmx2000m -Delkfed.corpus=sample elkfed.main.XMLAnnotator presets/soonbaseline.xml

This will run just the testing part of the XMLExperiment.

----------

(3) Running BART on a Windows machine

To run BART on a Windows machine, you have to set the CLASSPATH properly. Instead of this:

$ source setup.sh

try this:

> set CLASSPATH=.;dist/BART.jar;libs/*

For example, to start the BART Server on a Windows machine, you need to do the following:

> set CLASSPATH=.;dist/BART.jar;libs/*
> java -Xmx1024m elkfed.webdemo.BARTServer

----------

(4) Running BART on a new dataset, English only

BART expects its input to be in the MMAX format. (Hint: if your documents are plain-text, use the BART Server to generate temporary MMAX files for you, cf. 1a above.)

To start working on an English dataset, you need at least tokens (Basedata/*_words.xml) and coreference levels (markables/*coref_level.xml). An example of such a minimally prepared English dataset can be found in sample/english.
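For orientation, a Basedata tokens file in the MMAX format typically looks like the sketch below (the tokens and ids are made up for illustration; take the exact XML header and DTD reference from the files in sample/english):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE words SYSTEM "words.dtd">
<words>
<word id="word_1">BART</word>
<word id="word_2">resolves</word>
<word id="word_3">coreference</word>
<word id="word_4">.</word>
</words>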
Prepare your dataset, modify config/config.properties to specify the paths to your training and test data and your language ("english") -- see the "sample1" corpus in config/config.properties.

Edit config/config.properties to turn on the preprocessing pipeline:

-------replace this ------------
#runPipeline = true
runPipeline=false
-------with this ---------------
runPipeline = true
#runPipeline=false
--------------------------------

You can now launch BART's preprocessing pipeline to generate all the other levels for you:

$ java -Xmx3000m -Delkfed.corpus=sample1 elkfed.main.PreProcess

BART implements several preprocessing pipelines, to be specified in config/config.properties:

ParserPipeline -- this is the default pipeline; specify this in most cases.

NoNLPCarafePipeline -- this pipeline assumes that you have pre-generated "enamex" levels for each document. It will generate markables by porting them from the enamex levels and assigning missing properties. Use this pipeline in two cases: (a) if you want to use an external mention detection tool, e.g. CARAFE, or (b) if you want to test the system on gold markables (in this case you will have to generate "enamex" levels from "coref" levels yourself first).

CoNLLClosedPipeline -- this pipeline assumes that you have pre-generated all the levels except "markables" and only runs the BART mention detector. Use this pipeline to integrate your own preprocessing tools.

Once you have preprocessed your dataset, turn off the pipeline in config/config.properties:

#runPipeline = true
runPipeline=false

Now you can run XMLExperiment on your dataset (cf. (2) above).

----------

(5) Running BART on a new dataset, other languages

We do not support preprocessing for languages other than English. So, to run BART on another language, you first have to preprocess your data yourself, generating all the necessary markable levels, including the "markable" level that contains the information on the mentions.

In sample/generic-min, we show the minimal amount of information to be provided to BART to run any experiment. In sample/generic-max, we show the same documents, but with much more information encoded, both via MMAX levels and via attributes on the "markable" level. If your processing tools allow you to include any of these, the performance will go up. (NB: both the generic-min and generic-max folders contain English examples for illustrative purposes; however, we do not recommend running BART in the non-English mode on English data, since many English-specific improvements will be discarded.)

Prepare your dataset in the MMAX format, making sure that you include at least all the information shown in the sample/generic-min example (that is: tokens in Basedata/*words.xml, coreference levels, pos levels, and markable levels specifying markable_id and span for each markable). In your config/config.properties, specify the paths to your training and testing data and set the language for your corpus to "generic". Make sure that the pipeline is turned off (cf. 4 above). You can run XMLExperiment now.

----------

(6) Presets and precompiled configurations

Your BART run is controlled by two parameter files: config/config.properties and the experimental xml file. The former specifies data-specific settings (paths, preprocessing etc.). The latter specifies experiment-specific settings (feature set etc.).

Before running any experiments, make sure that your config/config.properties contains correct paths to both your training and testing data, as well as the language. All these parameters can be specified for each corpus individually by using corpus alias names (e.g. "sample", "sample1" etc. in the distributed files). Make sure that the pipeline is turned on/off as needed and that the pipeline name is specified correctly (cf. 4).

Other configuration parameters to check include:

------------
## treat possessive NPs as "[NP's]", not "[NP]'s" (set true for conll, false for ACE)
fullpossessives=false

## output singleton entities in the response (false for conll, muc; true for ace)
singletons=true

## set this to true normally, but to false for conll/ontonotes
domarkablecleanup=true
------------

We provide two variants of config/config.properties: config.properties.conll (to be used for OntoNotes) and config.properties.noconll (to be used for ACE, MUC, ARRAU and other corpora). Replace config/config.properties with one of them depending on your dataset (you will still have to specify the paths and the language for your data).

Experimental xmls specify your experimental settings (model, ML parameters and features). We have created a set of ready-to-use experimental xmls in "presets". The "all.xml" preset contains all the feature groups implemented so far. You can specify your experimental xml file when invoking XMLExperiment (or XMLAnnotator) -- cf. (2) above.

NB: please re-read (2) on running XMLExperiment without specifying your experimental xml file explicitly.
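For instance, a run that trains and tests with the full feature set on the sample corpus would look like the sketch below (the corpus alias and memory setting are only illustrative; adjust them to your own data):

$ source setup.sh
$ java -Xmx2000m -Delkfed.corpus=sample elkfed.main.XMLExperiment presets/all.xml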
----------

(7) More experiments with BART

You can add new features, models, etc.

BART can be compiled from NetBeans. BART can also be compiled from the terminal:

$ ant

If the distribution doesn't compile, try the following:

$ source setup.sh
$ javac -d build/classes/ -sourcepath src/ src/elkfed/mmax/pipeline/BerkeleyParser.java
$ ant

If it still doesn't compile, send us an email, attaching the error log.

-----------

===========================================

September 2016, BART version 2.4 features:

* New language plugins for French and Basque
* New machine learning feature extractors for thread-awareness in processing online forums

Full details of the classes changed can be found in the file ClassesChanges_2.0vs2.4.txt in the directory next to this README. An extensive set of READMEs documenting various processes making use of the new version can be found in the subdirectory READMEv2.4.

For questions on the new functionality contact mkabadjov @ yahoo.co.uk