From: A system for de-identifying medical message board text
Feature | Example |
---|---|
MMB non-structure features | |
token | Kathy |
token lower-cased | kathy |
length | 5 |
case | isLower=True, isCapitalized=False, … |
suffix/prefix | suffix2=hy, prefix2=ka, suffix3=thy, … |
distance from beginning/end | w/in1FromEdge=True, w/in2FromEdge=True, … |
in word list | isProperName=True, isCommon=False, isUsername=False, … |
possibly in word list | editDist1ProperName=True, editDist2ProperName=True, … |
Features of the two preceding and two following tokens are also included | … |
MMB structure features | |
tf-idf over message boards | inTop10=False, inTop1%=False, … |
tf-idf over user posts | inTop10=False, inTop1%=True, … |
border of paragraph likelihood | inTop5=True, inTop10%=True, … |
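
The non-structure features in the table can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `proper_names` word list, the `offset:name` key format for context features, and the reading of "edge" as the start or end of the token sequence are all assumptions made for illustration. The structure features (tf-idf ranks, paragraph-border likelihood) are omitted because they depend on corpus-level statistics that the table does not specify.

```python
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_features(tokens: List[str], i: int, proper_names: Set[str]) -> Dict[str, object]:
    """Features of the single token at position i (hypothetical feature names)."""
    tok = tokens[i]
    low = tok.lower()
    # Assumption: "edge" means the first or last position of the token sequence.
    edge = min(i, len(tokens) - 1 - i)
    feats: Dict[str, object] = {
        "token": tok,
        "lower": low,
        "length": len(tok),
        "isLower": tok.islower(),
        "isCapitalized": tok[:1].isupper(),
        "isProperName": low in proper_names,
    }
    for n in (2, 3):
        # Affixes are taken from the lower-cased token, as in the table's examples.
        feats[f"suffix{n}"] = low[-n:]
        feats[f"prefix{n}"] = low[:n]
    for d in (1, 2):
        feats[f"w/in{d}FromEdge"] = edge <= d
        # "Possibly in word list": some list entry lies within edit distance d.
        feats[f"editDist{d}ProperName"] = any(
            edit_distance(low, name) <= d for name in proper_names
        )
    return feats


def with_context(tokens: List[str], i: int, proper_names: Set[str],
                 window: int = 2) -> Dict[str, object]:
    """Add the features of up to `window` tokens on each side, keyed by signed offset."""
    feats = token_features(tokens, i, proper_names)
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or not 0 <= j < len(tokens):
            continue
        for name, value in token_features(tokens, j, proper_names).items():
            feats[f"{off:+d}:{name}"] = value
    return feats
```

Calling `with_context(["Thanks", ",", "Kathy", "!"], 2, {"kathy", "john"})` returns the features of "Kathy" together with those of its neighbours under keys such as `-1:token` and `+1:token`. A real system would replace the toy word list with the proper-name, common-word, and username lists named in the table.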