Sk ghi (wip) 22052014

Evaluating Methods for the
Identification of Cancer in
Free-Text Pathology
Reports Using Alternative
Machine Learning and Data
Preprocessing Approaches
Suranga Nath Kasthurirathne

Our problem
• Cancer case reporting to public health
registries is often:
– Delayed
– Incomplete

Our emphasis
• Use pathology reports
• Automate it (It actually works !)
Our solution
• Speed
• Accuracy
• Applicability to other surveillance activities
• Computationally efficient

Issues
• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources

Clarifications
When I say “We”:
• “We” in terms of decision making and
consultation usually means Dr. Grannis
• “We” in terms of implementation and code
mongering usually means Suranga

Solution/s
What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look for Vs. WHAT to look for

Manual review
• Functions as our source of truth
– What ?
– Why ?
Manually reviewed 1495 reports
Identified 371 (24.8%) positive cancer cases

Machine learning process
• Identification of keywords
– What ARE keywords ?
Metastasis, tumor, malignant, neoplasm, stage,
carcinoma and ca
• Identification of negation context
• Use of alternate data input formats
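
A minimal sketch of what keyword and negation-context identification could look like. The keyword list is the one on this slide; the negation cues, the three-token look-back window, and the function name `count_keywords` are illustrative assumptions, not the rules the project actually used.

```python
import re

# Keyword list taken from the slide.
KEYWORDS = ["metastasis", "tumor", "malignant", "neoplasm",
            "stage", "carcinoma", "ca"]

# Hypothetical negation cues; the deck does not list the real ones.
NEGATION_CUES = {"no", "not", "without", "negative", "denies"}

# Hypothetical window: a keyword is treated as negated when a cue
# appears within this many tokens before it.
WINDOW = 3


def count_keywords(report):
    """Return {keyword: (affirmed_count, negated_count)} for one report."""
    tokens = re.findall(r"[a-z]+", report.lower())
    counts = {k: [0, 0] for k in KEYWORDS}
    for i, tok in enumerate(tokens):
        if tok in counts:  # exact token match only, no stemming
            preceding = tokens[max(0, i - WINDOW):i]
            negated = any(p in NEGATION_CUES for p in preceding)
            counts[tok][1 if negated else 0] += 1
    return {k: tuple(v) for k, v in counts.items()}


features = count_keywords("No evidence of carcinoma. Malignant tumor present.")
print(features["carcinoma"])  # counted as negated
print(features["malignant"])  # counted as affirmed
```

Matching is on whole tokens, so "ca" does not fire inside "cancer", and word variants such as "malignancy" are not caught without extra handling.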

What were the different data input
formats used ?
• Raw data input
• Four state data input
What and Why ?
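
The deck does not define the two formats, so the sketch below is one plausible reading and should be treated as an assumption: "raw" keeps the per-keyword counts as they are, while "four state" collapses each keyword to one of four codes (absent, affirmed only, negated only, both). The code values and function names are hypothetical.

```python
def raw_feature(affirmed, negated):
    """Raw input (assumed): keep both counts unchanged."""
    return (affirmed, negated)


def four_state_feature(affirmed, negated):
    """Four-state input (assumed encoding):
    0 = keyword absent
    1 = present in affirmed context only
    2 = present in negated context only
    3 = present in both contexts
    """
    if affirmed == 0 and negated == 0:
        return 0
    if negated == 0:
        return 1
    if affirmed == 0:
        return 2
    return 3


# One keyword seen three times affirmed and once negated.
print(raw_feature(3, 1))
print(four_state_feature(3, 1))
```

Under this reading, the four-state form discards how often a keyword occurs and keeps only the context in which it occurs.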

Training / Testing
• What ?
• Why cross validation ?
• Alternative decision models
– So many options !
– Classification vs. Clustering analysis
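
A rough sketch of k-fold cross validation over the 1495 reviewed reports. The fold count of 10, the fixed seed, and the helper name `k_fold_splits` are assumptions; the deck does not say how many folds were used.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_splits(1495, k=10):
    # Train a model on `train`, evaluate it on `test`, then average the scores.
    pass
```

Every report lands in a test fold exactly once, so all of the manually reviewed data contributes to the evaluation without ever being tested by a model that was trained on it.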

To preserve my sanity, and because
we’re not stupid…
• We used Weka (Waikato Environment for
Knowledge Analysis)
– is a collection of machine learning algorithms
for data mining tasks
– is Open Source !

Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree
(Thanks Jamie !!!)

Results
• How do we measure our results ?
– Precision
• What % of positive predictions were correct?
– Recall
• What % of positive cases were caught?
– Accuracy
• What % of predictions were correct?
Precision Vs. Recall. The fine balance
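
The three measures follow directly from the counts of true/false positives and negatives. The sketch below uses made-up toy counts for illustration only; they are not results from this study. Returning 0.0 when a denominator is zero is a convention chosen here, not something the deck specifies.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct share of positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of positive cases caught
    accuracy = (tp + tn) / (tp + fp + fn + tn)        # share of all predictions correct
    return precision, recall, accuracy


# Toy counts: 8 true positives, 2 false positives,
# 4 false negatives, 86 true negatives.
p, r, a = precision_recall_accuracy(tp=8, fp=2, fn=4, tn=86)
print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```

Accuracy alone can mislead here: with 371 positives among 1495 reports, a model that labels every report negative is correct on 1124 of them (about 75.2% accuracy) while catching no cancer cases at all, which is why precision and recall are reported alongside it.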

Results contd.…
• RF and NB showed statistically significantly
lower values for precision
• SVM exhibited statistically significantly
lower results for recall
• SVM and NB produced statistically
significantly lower results for accuracy

Overall performance by
preprocessed input type
• Raw count is significantly better
than four state

Overall performance by decision
model
• Ensemble approach is significantly
better than individual algorithms
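
The deck does not say how the individual models were combined. One common option is a simple majority vote over each model's prediction, sketched below as an assumption; the function name and the tie-breaking rule are hypothetical.

```python
def majority_vote(votes, tie_breaker=True):
    """Combine per-model boolean predictions (True = cancer) for one report.

    On a tie, return `tie_breaker`. Defaulting to True favours recall,
    which is an assumption made for this sketch.
    """
    positives = sum(1 for v in votes if v)
    negatives = len(votes) - positives
    if positives == negatives:
        return tie_breaker
    return positives > negatives


# Six models, one vote each, for a single report.
votes = [True, True, False, True, False, True]
print(majority_vote(votes))
```

With an even number of models, ties are possible, so the tie-breaking policy matters: breaking ties towards "cancer" trades some precision for recall.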

Keywords ? sure, I have
a list…
Better identification of keywords
Shaun

Results
• The funder is happy… we think
• We wrote an abstract !
• Feature selection approaches for keyword
identification as an independent study
rotation

Our thanks to…
• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)
