What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look vs. WHAT to look for
• Functions as our source of truth
– What ?
– Why ?
Manually reviewed 1495 reports
Identified 371 (24.8%) positive cancer cases
Machine learning process
• Identification of keywords
– What ARE keywords ?
Metastasis, tumor, malignant, neoplasm, stage,
carcinoma and ca
• Identification of negation context
• Use of alternate data input formats
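As a rough illustration of the keyword and negation steps above, here is a minimal Python sketch. The keyword list comes from the slide. The negation cues, the three-token look-back window and the function name `count_keywords` are assumptions made for this example, not the method actually used.

```python
import re

# Keywords listed on the slide.
KEYWORDS = ["metastasis", "tumor", "malignant", "neoplasm", "stage", "carcinoma", "ca"]
# Hypothetical negation cues; the original cue list is not given.
NEGATION_CUES = {"no", "not", "without", "negative", "denies"}
# Hypothetical look-back window: how many preceding tokens to scan for a cue.
WINDOW = 3


def count_keywords(report):
    """Return {keyword: (affirmed_count, negated_count)} for one report.

    Matching is exact and case-insensitive, so plurals such as "tumors"
    are not counted in this simplified version.
    """
    tokens = re.findall(r"[a-z]+", report.lower())
    counts = {kw: [0, 0] for kw in KEYWORDS}
    for i, tok in enumerate(tokens):
        if tok in counts:
            preceding = tokens[max(0, i - WINDOW):i]
            negated = any(t in NEGATION_CUES for t in preceding)
            counts[tok][1 if negated else 0] += 1
    return {kw: tuple(c) for kw, c in counts.items()}


example = count_keywords("No evidence of carcinoma. Malignant tumor in left lobe.")
print(example["carcinoma"], example["tumor"])
```

A fixed token window is the simplest possible negation rule; it ignores sentence boundaries, which a real negation detector would need to respect.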
What were the different data input
formats used ?
• Raw data input
• Four state data input
What and Why ?
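The slides do not define the two formats, so the following Python sketch is only one plausible reading. It assumes "raw" means the per-keyword counts themselves and "four state" means collapsing each keyword to one of four categories (absent, affirmed, negated, both). The function names and state labels are hypothetical.

```python
def raw_features(counts):
    """Raw input (assumed): keep affirmed and negated counts as numbers."""
    features = []
    for kw in sorted(counts):
        affirmed, negated = counts[kw]
        features.extend([affirmed, negated])
    return features


def four_state_features(counts):
    """Four-state input (assumed): one categorical value per keyword."""
    features = []
    for kw in sorted(counts):
        affirmed, negated = counts[kw]
        if affirmed and negated:
            state = "both"
        elif affirmed:
            state = "affirmed"
        elif negated:
            state = "negated"
        else:
            state = "absent"
        features.append(state)
    return features


# Illustrative counts only: {keyword: (affirmed_count, negated_count)}
counts = {"carcinoma": (0, 2), "tumor": (3, 1), "stage": (0, 0)}
raw = raw_features(counts)
states = four_state_features(counts)
print(raw, states)
```

Under this reading, the four-state format discards how often a keyword occurs and keeps only the context in which it occurs.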
Training / Testing
• What ?
• Why cross validation ?
• Alternative decision models
– So many options !
– Classification vs. Clustering analysis
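For the cross-validation question in the training/testing step above, the idea is that every report is used for testing exactly once and for training in all other rounds. A minimal sketch in Python, assuming 10 folds and a plain shuffled (non-stratified) split; the fold count, seed and function name are illustrative choices, not taken from the slides.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1495 is the number of manually reviewed reports; k=10 is an assumption.
splits = list(k_fold_indices(1495, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With roughly a quarter of the reports positive, a stratified split that keeps the class ratio constant in each fold would be a reasonable refinement.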
To preserve my sanity, and because
we’re not stupid…
• We used Weka (Waikato Environment for Knowledge Analysis)
– is a collection of machine learning algorithms
for data mining tasks
– is Open Source !
Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree
(Thanks Jamie !!!)
• How do we measure our results ?
• What % of positive predictions were correct?
• What % of positive cases were caught?
• What % of predictions were correct?
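The three questions above correspond to precision, recall and accuracy. A short Python sketch of the standard definitions, computed from confusion-matrix counts; the counts in the example are made up for illustration and are not results from this study.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute the three measures from confusion-matrix counts.

    tp/fp: true/false positives, fn/tn: false/true negatives.
    """
    # What % of positive predictions were correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # What % of positive cases were caught?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # What % of all predictions were correct?
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Made-up counts, for illustration only.
p, r, a = precision_recall_accuracy(tp=80, fp=20, fn=10, tn=90)
print(f"precision={p:.3f} recall={r:.3f} accuracy={a:.3f}")
```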
Precision vs. recall: the fine balance
• RF and NB showed statistically significantly
lower values for precision
• SVM exhibited statistically significantly
lower results for recall
• SVM and NB produced statistically
significantly lower results for accuracy
Overall performance by
preprocessed input type
• Raw count is significantly better
than four state
Overall performance by decision model
• Ensemble approach is significantly
better than individual algorithms
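The slides do not say how the individual models were combined. One common way to build such an ensemble is a simple majority vote, sketched below in Python; the voting rule, the function name and the toy predictions are assumptions for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: list of equal-length label lists, one per model.
    Using an odd number of models avoids ties for binary labels.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Toy predictions from three hypothetical models (1 = cancer, 0 = no cancer).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model is wrong on a different report in this toy example, which is the situation where voting helps most.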