Evaluating Methods for the
Identification of Cancer in
Free-Text Pathology
Reports Using Alternative
Machine Learning and Data
Preprocessing Approaches
Suranga Nath Kasthurirathne
What does that even mean?
Our problem
• Cancer case reporting to public health registries is often:
– Delayed
– Incomplete
Our emphasis
• Use pathology reports
• Automate it (it actually works!)
Our solution
• Speed
• Accuracy
• Applicability to other surveillance activities
• Computational efficiency
Issues
• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources
Clarifications
When I say "We":
• "We" in terms of decision making and consultation usually means Dr. Grannis
• "We" in terms of implementation and code mongering usually means Suranga
Our basic approach
Solution/s
What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look for vs. WHAT to look for
Manual review
• Functions as our source of truth
– What?
– Why?
Manually reviewed 1495 reports
Identified 371 (24.8%) positive cancer cases
Machine learning process
• Identification of keywords
– What ARE keywords?
Metastasis, tumor, malignant, neoplasm, stage,
carcinoma and ca
• Identification of negation context (see the sketch below)
• Use of alternate data input formats
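The keyword and negation steps can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the tokenizer, the tiny single-token cue list, and the four-token lookback window. The project itself used NegEx (see the Improvements slides), which handles far more cues and scopes than this.

    // Minimal sketch: scan a report for the cancer keywords and flag
    // mentions preceded by a nearby negation cue. Cue list and window
    // size are illustrative assumptions, not the NegEx rule set.
    public class KeywordScan {
        static final String[] KEYWORDS = {
            "metastasis", "tumor", "malignant", "neoplasm", "stage", "carcinoma", "ca"
        };
        static final String[] CUES = { "no", "not", "without", "negative" };

        // True if a negation cue occurs within `window` tokens before `pos`.
        static boolean isNegated(String[] tokens, int pos, int window) {
            for (int i = Math.max(0, pos - window); i < pos; i++)
                for (String cue : CUES)
                    if (tokens[i].equals(cue)) return true;
            return false;
        }

        public static void main(String[] args) {
            String report = "Sections show no evidence of malignant cells; benign neoplasm noted";
            String[] tokens = report.toLowerCase().split("[\\s;,.]+");
            for (int i = 0; i < tokens.length; i++)
                for (String kw : KEYWORDS)
                    if (tokens[i].equals(kw))
                        System.out.println(kw + (isNegated(tokens, i, 4) ? " [negated]" : " [asserted]"));
        }
    }

On the sample sentence this prints "malignant [negated]" and "neoplasm [asserted]", which is exactly the distinction the four-state input format below needs.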
What were the different data input formats used?
• Raw data input
• Four state data input
What and why?
• Raw
• Four state (see the encoding sketch below)
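A hedged sketch of the difference between the two formats: "raw" is taken here to be the per-keyword occurrence count (the results slide calls it "raw count"), while the four state codes below (absent / asserted only / negated only / both) are an assumption for illustration; the exact encoding is the one defined in the paper.

    import java.util.Arrays;

    // Sketch of the two input formats built from a keyword scan.
    // counts[i] = total mentions of keyword i; negatedCounts[i] = mentions
    // flagged as negated. Four-state codes are illustrative assumptions.
    public class InputFormats {
        static int[] rawInput(int[] counts) {
            return counts; // feature vector = raw keyword counts
        }

        static int[] fourStateInput(int[] counts, int[] negatedCounts) {
            int[] states = new int[counts.length];
            for (int i = 0; i < counts.length; i++) {
                int asserted = counts[i] - negatedCounts[i];
                if (counts[i] == 0)             states[i] = 0; // absent
                else if (negatedCounts[i] == 0) states[i] = 1; // asserted only
                else if (asserted == 0)         states[i] = 2; // negated only
                else                            states[i] = 3; // both
            }
            return states;
        }

        public static void main(String[] args) {
            int[] counts        = {2, 0, 1, 0, 0, 3, 0}; // one slot per keyword
            int[] negatedCounts = {1, 0, 1, 0, 0, 0, 0};
            System.out.println(Arrays.toString(rawInput(counts)));        // [2, 0, 1, 0, 0, 3, 0]
            System.out.println(Arrays.toString(fourStateInput(counts, negatedCounts))); // [3, 0, 2, 0, 0, 1, 0]
        }
    }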
So basically
Training / Testing
• What?
• Why cross-validation?
• Alternative decision models
– So many options!
– Classification vs. clustering analysis
To preserve my sanity, and because
we're not stupid…
• We used Weka (Waikato Environment for
Knowledge Analysis)
– is a collection of machine learning algorithms
for data mining tasks
– is open source!
Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree
(Thanks, Jamie!)
(see the cross-validation sketch below)
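A sketch of running these six models in Weka with cross-validation. The ARFF file name, the 10 folds, and treating the second class value as "cancer positive" are assumptions for illustration; SMO is Weka's support vector machine implementation and IBk its k-nearest-neighbor classifier.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Evaluate each candidate model with 10-fold cross-validation and
    // report the three measures discussed on the Results slides.
    public class EvaluateModels {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("reports.arff").getDataSet(); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1); // cancer yes/no label

            Classifier[] models = {
                new Logistic(), new NaiveBayes(), new SMO(),
                new IBk(), new RandomForest(), new J48()
            };
            for (Classifier model : models) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(1));
                System.out.printf("%s  precision=%.3f recall=%.3f accuracy=%.3f%n",
                    model.getClass().getSimpleName(),
                    eval.precision(1), eval.recall(1), eval.pctCorrect() / 100.0);
            }
        }
    }

Cross-validation matters here because a single train/test split on 1,495 reports leaves little test data; averaging over folds gives a steadier estimate of each model's performance.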
Results
• How do we measure our results?
– Precision
• What % of positive predictions were correct?
– Recall
• What % of positive cases were caught?
– Accuracy
• What % of predictions were correct?
Precision vs. recall: the fine balance (formulas below)
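In confusion-matrix terms, with TP/FP (TN/FN) the true/false positive (negative) counts, the three measures are:

    \text{Precision} = \frac{TP}{TP + FP} \qquad
    \text{Recall} = \frac{TP}{TP + FN} \qquad
    \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Making a classifier stricter about calling a report positive tends to raise precision while lowering recall; that is the balance at stake when a missed cancer case costs more than a false alarm.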
Results (contd.)
• RF and NB showed statistically significantly lower precision
• SVM exhibited statistically significantly lower recall
• SVM and NB produced statistically significantly lower accuracy
Overall performance by
preprocessed input type
• Raw count input is significantly better than four-state input
Overall performance by decision
model
• The ensemble approach is significantly better than the individual algorithms (see the Vote sketch below)
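One common way to combine models in Weka is the Vote meta-classifier, sketched below; by default it averages the base learners' class-probability estimates. The particular three-model combination shown is illustrative only, not the specific ensemble behind this result.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.meta.Vote;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Pool several base learners with Weka's Vote meta-classifier and
    // evaluate the combination the same way as the individual models.
    public class EnsembleSketch {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("reports.arff").getDataSet(); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            Vote ensemble = new Vote();
            ensemble.setClassifiers(new Classifier[] {
                new Logistic(), new NaiveBayes(), new J48()
            });

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(ensemble, data, 10, new Random(1));
            System.out.printf("Vote ensemble accuracy = %.3f%n", eval.pctCorrect() / 100.0);
        }
    }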
Improvements
"Keywords? Sure, I have a list…" (Shaun)
• Better identification of keywords
Problems with NegEx…
Results
• The funder is happy… we think
• We wrote an abstract!
• Feature selection approaches for keyword
identification as an independent study
rotation
Our thanks to…
• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)
Questions?

Editor's Notes

• #3 Explain title
• #4 Describe actual problem; speak of registries, physician resources, etc.
• #5 What are we focusing on, i.e., what are we concerned about?
• #7 Who actually did what
• #8 General flowchart showing what happens
• #9 How we're trying to solve things