SENSEI processing environment
=============================

This pipeline is a set of annotation services which can be run against the
conversation repository or run as standalone REST services.


Install requirements
--------------------

The REST services require bottle.py, which can be installed with:

  pip install --user -r requirements.txt -U


REST services
-------------

Core modules can be exposed as REST services which provide annotations
through a simple HTTP protocol. The backend uses the bottle.py framework and a
server of your choice. The default WSGIRefServer is quite slow, so you may want
to install "paste" instead.

For instance, you can run the rest/test.py script, which reverses character
strings.

  rest/test.py

It replies to three URLs. The first, /, returns a short textual help for using
the service.

  curl http://localhost:1234

It also recognizes GET requests of the form /test/<text>, where <text> is the
string you want to reverse.

  curl http://localhost:1234/test/hello

Finally, for large inputs, it recognizes POST queries where the input string is
specified in a "text" form parameter.

  curl --form "text=hello" http://localhost:1234/test
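
The same three requests can also be built from Python. The following is only a
sketch using the standard library: the helper names (get_url, post_request,
fetch) are hypothetical, and it assumes the service accepts a URL-encoded
"text" form field, whereas curl --form sends multipart form data. Inputs
containing slashes or other special characters are probably safer to send with
POST.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:1234"  # host and port taken from the curl examples


def get_url(text, service="test", base_url=BASE_URL):
    """Build the GET URL /<service>/<text>, percent-encoding the text."""
    return "%s/%s/%s" % (base_url, service, urllib.parse.quote(text, safe=""))


def post_request(text, service="test", base_url=BASE_URL):
    """Build a POST request carrying the input in a "text" form parameter."""
    # Assumption: the service reads URL-encoded forms as well as multipart ones.
    data = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request("%s/%s" % (base_url, service), data=data)


def fetch(request_or_url):
    """Send the request and return the decoded body (needs a running service)."""
    with urllib.request.urlopen(request_or_url) as response:
        return response.read().decode("utf-8")
```

With rest/test.py running, fetch(get_url("hello")) and
fetch(post_request("hello")) should both return the reversed string.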

A generic REST service is also available. It runs a custom command, feeds it
input through stdin and collects the output from stdout. The following example
echoes the input as output. The name parameter sets the name of the service in
the URL.

  rest/generic.py --port 1234 --name "cat" --command "cat"
  curl http://localhost:1234/cat/hello
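
The stdin/stdout contract described above can be illustrated with a few lines
of Python. This is a sketch of the idea, not the actual rest/generic.py code:
the run_once helper is hypothetical, and it assumes the command is given as a
shell string, as in the --command option.

```python
import subprocess


def run_once(command, text):
    """Run `command` in a shell, send `text` on stdin, return its stdout."""
    result = subprocess.run(
        command,
        shell=True,               # the command is a shell string
        input=text,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange str instead of bytes
        check=True,               # raise if the command exits with an error
    )
    return result.stdout
```

For example, run_once("cat", "hello") returns the input unchanged, like the
"cat" service above. Running arbitrary shell strings this way is exactly why
such a service should not be exposed to untrusted clients.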

The command can also be made persistent (run in the background) and fed line by
line through stdin, and its output read line by line from stdout. Note that for
this to work, the command needs to flush stdout after each input.

  rest/generic.py --port 1234 --name "awk" --command "awk '{x+=1;print x,\$0;fflush()}'" --persistent True
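
A persistent command written in Python has to follow the same rule as the awk
one-liner: one output line per input line, flushed immediately. The function
below is a minimal sketch of such a command (number_lines is a hypothetical
name), prefixing each line with a running count.

```python
def number_lines(infile, outfile):
    """Prefix each input line with a running count, flushing after each one."""
    count = 0
    for line in infile:
        count += 1
        outfile.write("%d %s\n" % (count, line.rstrip("\n")))
        # Without this flush the output may stay in a buffer and the caller,
        # which waits for one line per input, would block.
        outfile.flush()
```

A script that calls number_lines(sys.stdin, sys.stdout) could then be passed
to --command in place of the awk program.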

WARNING: the generic REST service does not enforce any kind of security.

See the rest/test.py implementation for an example of writing your own REST
services with the provided framework.


Repository integration
----------------------

Repository-integrated services poll the repository for new documents, process
them and push back annotation sets. To get access to the repository, you can
create a tunnel with the following command (provided a proper SSH key is set
up):

  util/repository.py --tunnel

The util/repository.py script has many more commands available. It can also be
used as a Python module, as a fully functional repository client.

Once the tunnel is set up, you can try the test annotator, which polls the
repository and adds phony annotations. That script lets you choose the host and
port of the repository.

  repo/test.py

This script gets all documents which don't have the AMU_hasTest feature and
puts generated annotations in the AMU_Test annotation set. It then sets the
AMU_hasTest feature to True.

After a few documents are processed, you can kill the script and run the
cleaner which deletes all annotation sets created by the previous script.

  repo/clean.py --presence_feature AMU_hasTest --annotation_set AMU_Test

The first way to integrate a new processing module with the repository is to
use the generic annotator. It grabs all documents which don't have a given
feature and passes them as JSON, through stdin, to a custom command. It then
reads the generated annotations as JSON from the command's stdout, creates a
new annotation set with the result and marks the document as processed.

Note that for the repository to accept the new annotation, it must conform to
the repository's expectations and contain the right fields. Otherwise, an error
is returned.

For example, you can write a script which computes the length of the json
representation of a document, and creates a new "checksum" annotation with it.

  cat script.sh
  awk '{print "[{\"type\": \"checksum\", \"features\": {\"value\":"length()"}, \"start\": 0, \"end\": 0}]"}'

  repo/generic.py --command ./script.sh --mark_feature "AMU_hasChecksum" --annotation "AMU_Checksum"
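
The same command can be written in Python. The sketch below mirrors the awk
script under the assumption that each document arrives as a single line of
JSON on stdin; the function names are hypothetical. Note that Python's len()
counts characters, which may differ from what a given awk reports for
non-ASCII input.

```python
import json
import sys


def checksum_annotations(line):
    """Return the JSON annotation list for one JSON-encoded document line."""
    annotation = {
        "type": "checksum",
        "features": {"value": len(line)},  # length of the JSON representation
        "start": 0,
        "end": 0,
    }
    return json.dumps([annotation])


def main(infile=sys.stdin, outfile=sys.stdout):
    """Emit one annotation list per input line."""
    for line in infile:
        outfile.write(checksum_annotations(line.rstrip("\n")) + "\n")
        outfile.flush()  # required if the command is kept running in the background
```

A script that calls main() could be passed to repo/generic.py through
--command, in place of ./script.sh.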

The command can be run once for each document, or just once in the background.
When doing so, the command is fed line by line and its output is read line by
line. For this mode to work, the command MUST flush stdout after processing an
input. See the generic REST service example for more details.

The second way to integrate a new processing module with the repository is to
subclass the repository.AnnotationGenerator class in Python. See the
repo/test.py script for an example.

  import repository  # util/repository.py must be importable

  class Annotator(repository.AnnotationGenerator):
      def __init__(self):
          # query for documents which lack the marking feature
          query = '_MISSING_=MarkingFeature&_MAX_=20'
          super(Annotator, self).__init__(query)
          ... # initialize object

      def process_document(self, client, document):
          doc_id = document['content']['id']
          print(doc_id)
          ... # generate annotation
          client.put_annotation_set(doc_id, 'AnnotationName', ...)
          # mark the document as processed so that it is not selected again
          client.put_features(doc_id, {'MarkingFeature': True})