Table 2 Overview of features used by our system

From: A system for de-identifying medical message board text

| Feature | Example |
| --- | --- |
| MMB non-structure features | |
| token | Kathy |
| token lower-cased | kathy |
| length | 5 |
| case | isLower=True, isCapitalized=False, … |
| suffix/prefix | suffix2=hy, prefix2=ka, suffix3=thy, … |
| distance from beginning/end | w/in1FromEdge=True, w/in2FromEdge=True, … |
| in word list | isProperName=True, isCommon=False, isUsername=False, … |
| possibly in word list | editDist1ProperName=True, editDist2ProperName=True, … |
| Also include features of two previous and following tokens | … |
| MMB structure features | |
| tf-idf over message boards | inTop10=False, inTop1%=False, … |
| tf-idf over user posts | inTop10=False, inTop1%=True, … |
| border of paragraph likelihood | inTop5=True, inTop10%=True, … |
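
The non-structure features in the table can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `offset:key` naming for context features, the 0-based edge distance, the choice to take affixes from the lower-cased token and case flags from the surface form, and the tiny proper-name list are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, int, bool]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def single_token_features(
    tokens: List[str], i: int, proper_names: Set[str]
) -> Dict[str, Feature]:
    """Features of tokens[i] alone (hypothetical helper, names follow the table)."""
    tok = tokens[i]
    low = tok.lower()
    feats: Dict[str, Feature] = {
        "token": tok,
        "tokenLower": low,
        "length": len(tok),
        # Assumption: case flags are read from the surface form of the token.
        "isLower": tok.islower(),
        "isCapitalized": tok[:1].isupper(),
        "isProperName": low in proper_names,
    }
    # Assumption: affixes are taken from the lower-cased token.
    for n in (2, 3):
        if len(low) >= n:
            feats[f"prefix{n}"] = low[:n]
            feats[f"suffix{n}"] = low[-n:]
    # Assumption: edge distance is the 0-based distance to the nearer end.
    edge = min(i, len(tokens) - 1 - i)
    for k in (1, 2):
        feats[f"w/in{k}FromEdge"] = edge <= k
        feats[f"editDist{k}ProperName"] = any(
            edit_distance(low, name) <= k for name in proper_names
        )
    return feats


def window_features(
    tokens: List[str], i: int, proper_names: Set[str], window: int = 2
) -> Dict[str, Feature]:
    """Features of tokens[i] plus those of up to `window` neighbours on each side."""
    feats: Dict[str, Feature] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for key, value in single_token_features(tokens, j, proper_names).items():
                feats[f"{offset:+d}:{key}"] = value
    return feats


tokens = ["thanks", "Kathy", "for", "the", "advice"]
names = {"kathy", "cathy"}  # stand-in for a real proper-name list
features = window_features(tokens, 1, names)
```

Here `features["+0:suffix2"]` holds the target token's two-character suffix, while keys such as `-1:token` and `+2:token` carry the neighbouring tokens' features. Neighbours that fall outside the message are simply omitted rather than padded, which is one of several reasonable choices.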