One of the challenges analyzing medical conditions in unstructured data is in determining whether they are reported as indications or side effects. Classifying different types of medical conditions enables a deeper understanding of disease manifestation and treatment risk, helping further research in this field. In our study, we collect medical forum posts, annotate different types of medical conditions, and classify medical conditions into indications and side effects using natural language processing and machine learning techniques.
Finding Different Types of Medical Conditions: From Data Generation to Automatic Classification
1. Finding Different Types of Medical Conditions:
From Data Generation to Automatic Classificationjj
Nathan Artz1, Stephen Doogan1, Jinho D. Choi, PhD2
1Real Life Sciences, New York, NY; 2Emory University, Atlanta, GA
rlsciences.com
info@rlsciences.com
quantitative.emory.edu
choi@mathcs.emory.edu
INTRODUCTION RESULTSAPPROACH IMPLICATIONS
• A key challenge to real world outcomes research lies in
analyzing medical conditions in unstructured data. Central to
this is determining whether conditions are reported as
symptoms of disease, treatment indications or side effects.
• In analyzing texts from patient reporting systems (e.g., FAERS,
social media), researchers often use medical dictionaries such
as MedDRA to produce ranked frequency counts of medical
conditions1; however, this approach fails to differentiate
treatment indications and side effect conditions, which is a
critical part of assessing treatment outcomes.
• Automated classification of medical conditions can provide a
deeper understanding of treatment outcomes and promote
efficient real world data investigations across a variety of patient
level data.
• In our study, we collected 19,313 spontaneous patient reports
about antibiotic treatment experiences from medical forums. We
annotated 760 sentences containing different types of medical
conditions, and classified 5,179 unique conditions into
indications and side effects across the set of reports using
natural language processing and machine learning techniques.
• Each of the 760 annotated sentences were given a binary label
indicating whether the condition played the role of an “indication”
or “side effect” of the given treatment. Specific treatment and
condition mentions in text were replaced by generic text labels
(i.e. _TREATMENT, _CONDITION) to prevent overfitting to the
antibiotic drug class.
• We used an SVM classifier with 5-fold cross validation and
averaged the outcomes of the folds to determine the F1 scores
(see Figure 2).
• We manually reviewed some of the most impactful features of
the SVM to see which are most important when differentiating
the two classes of conditions.
• For conditions playing the role of indication, we see interesting
features such as a Bigram(treatment, for) and
syntactic_parent(condition, for) as in “on steroids for the
inflammation”. For side effects, we observe features such as
SyntacticParent(condition, make) or SyntacticParent(condition,
give) as in “treatment made me sick” or “the drug gave me an
allergic reaction” which heavily influence the decision boundary
of the SVM.
• We believe this is a feasible approach to differentiating the roles
of medical conditions. This approach is suggested for use in
spontaneous reporting systems where many of the conditions
can be assumed to play side effect or indication roles.
• Our study can enable researchers, regulators, health insurers
and pharmaceutical companies to better monitor treatment risk
by filtering out instances of indications and automatically identify
side effects in spontaneous reports.
• Our approach can provide significant time savings in
reviewing spontaneous reports for the purpose of side effect
identification, enabling untapped data to be leveraged for
real world treatment outcomes research.
• Our follow on work includes applying these models across
doctor’s notes in electronic health records (EHRs), case
record forms within pharmaceutical companies and signal
detection across social media and published literature.
Can a machine differentiate between side effects and indications in spontaneous patient reports?
EXPERIMENT FEATURES F1 SCORE
1 78.23
84.26
83.30
2
3 Bag of Lemmas + Syntactic Features +
Syntactic Roles
Bag of Lemmas
Bag of Lemmas + Syntactic Features
Figure 2. F1 scores for three experiments using different feature sets
REFERENCES
1. Gurulingappa H, Toldo L, Rajput AM, Kors JA, Taweel A, Tayrouz Y. Automatic detection of adverse events to predict drug label
changes using text and data mining techniques. Phar. Drug Saf. 2013;22(11):1189-94.
2. Choi JD, McCallum A. Transition-based Dependency Parsing with Selectional Branching. ACL’13. 2013 1052-1062.
Figure 1. An overview of the classification approach
CONDITION
DEPENDENCY GRAPHS
Indication I am taking cipro for my infection
I got another infection after I was put on ciproSide Effect
“” “”
TEXT
taking taking
RAW
TEXT
STATISTICAL
MODEL
CLASSIFICATION
OUTPUT
taketake havehave
II IIciprocipro infectioninfectioninfectioninfection ciprocipro
cause timetheme themeagent agent
• Various combinations of syntactic and semantic features were
extracted from dependency graphs generated by an NLP toolkit,
ClearNLP2, and were used for generating a statistical model
trained by support vector machines (SVM). This statistical model is
used for classifying medical conditions in raw text. Figure 1 shows
the overview of our approach.
1
1
1 1
2
2
2