SUMMARIZATION MODULE
====================

This module generates a synopsis of the conversation you give it. It learns
features from annotations of previous synopses and applies them to a new
conversation. This file lists the requirements and explains what to change if
you want to improve the method or adapt it to your own data.

HOW TO
======

See example.py.

NEEDS
=====

- Python 2.7
- Icsiboost

FILES & ANNOTATIONS
===================

source/syn.annot
----------------

Synopses are annotated like so:

    topic conversation_ID Annotator text <a class="instance"
    variable="$SLOT_NAME" style="color:cyan" title="$SLOT_NAME"
    href="#"> slot_value </a> text.

/!\ If you create new slot names, make sure to update icsiboost.names. /!\

A sketch for parsing these annotations is given at the end of this file.

conversation files
------------------

The TSV format was defined for storing Decoda annotations. Each word is one
line with the following fields:

    <filename> <global-wordnum> <wordnum-in-sentence> <word> NULL <postag>
    NULL NULL <dependency-label> <governor-wordnum> <text-id> <lemma>
    <morphology> <speaker> 0.0 0.0 0.0 _ <mention> <features>
    <coreference-label>

A reading sketch is given at the end of this file.

predsyn.py
----------

Generates the data you will need to learn the features. For each phrase it
gives you:

- name of the conversation
- word (text)
- postag
- lemma
- named entity
- parent
- parent pos
- dependency label
- topic
- length
- sentence number
- speaker

summarizer.py
-------------

Generates the final synopsis based on all the previous data. Example:

    import summarizer
    #conv = summarizer.Word()
    summarizer.summarize(conversation, threshold, convID)

Here, threshold is a fixed cutoff from which to start considering results.
In our experiments, 0.02 was optimal for the icsiboost system.
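
Parsing syn.annot (sketch)
--------------------------

To read syn.annot programmatically, something along these lines should work.
This is only a sketch based on the format described above; the function name,
return format, and assumed column order (topic, conversation ID, annotator,
then text) are illustrative, not part of the module.

    import re

    # Matches the <a class="instance" ...> spans described above and
    # captures the slot name (group 1) and the slot value (group 2).
    SLOT_RE = re.compile(
        r'<a class="instance" variable="([^"]+)"[^>]*>\s*(.*?)\s*</a>')

    def parse_synopsis_line(line):
        # Assumed column order: topic, conversation ID, annotator, text.
        topic, conv_id, annotator, text = line.split(None, 3)
        slots = SLOT_RE.findall(text)      # [(slot_name, slot_value), ...]
        plain = SLOT_RE.sub(r'\2', text)   # synopsis text, markup stripped
        return topic, conv_id, annotator, plain, slots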
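
Reading conversation files (sketch)
-----------------------------------

A minimal sketch for reading a conversation file in the Decoda TSV format
above. The field names mirror the column list but are labels chosen here,
not names defined by the module.

    TSV_FIELDS = ["filename", "global_wordnum", "wordnum_in_sentence",
                  "word", "null1", "postag", "null2", "null3",
                  "dependency_label", "governor_wordnum", "text_id",
                  "lemma", "morphology", "speaker", "f1", "f2", "f3",
                  "underscore", "mention", "features", "coreference_label"]

    def read_conversation(path):
        # Yields one dict per word, keyed by the field names above.
        with open(path) as f:
            for line in f:
                columns = line.rstrip("\n").split("\t")
                if len(columns) == len(TSV_FIELDS):  # skip malformed lines
                    yield dict(zip(TSV_FIELDS, columns))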
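
predsyn.py output (sketch)
--------------------------

For reference, the per-phrase record that predsyn.py is described as
producing can be assembled as in the sketch below, reusing the word dicts
from the TSV sketch. The function and its signature are assumptions; see
predsyn.py for the real code.

    def feature_row(conv_name, w, named_entity, parent, parent_pos,
                    topic, length, sentence_num):
        # Column order mirrors the predsyn.py list above; w is one word
        # dict produced by read_conversation().
        return [conv_name, w["word"], w["postag"], w["lemma"],
                named_entity, parent, parent_pos, w["dependency_label"],
                topic, length, sentence_num, w["speaker"]]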
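
Usage (sketch)
--------------

A fuller usage sketch, building on the TSV sketch above. The exact object
expected for conversation is defined by the module (see example.py); that
the word records are accepted directly, and that summarize() returns the
synopsis text, are assumptions made here.

    import summarizer

    conv_id = "conversation_001"   # illustrative ID, not a real file
    conversation = list(read_conversation(conv_id + ".tsv"))
    # 0.02 is the threshold reported above as optimal for icsiboost.
    synopsis = summarizer.summarize(conversation, 0.02, conv_id)
    print(synopsis)   # assumes summarize() returns the synopsis text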