Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches
Suranga Nath Kasthurirathne
What does that even mean?
Our problem
• Cancer case reporting to public health registries is often:
– Delayed
– Incomplete
Our emphasis
• Use pathology reports
• Automate it (it actually works!)
Our solution
• Speed
• Accuracy
• Applicability to other surveillance activities
• Computationally efficient
Issues
• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources
Clarifications
When I say “We”:
• “We” in terms of decision making and consultation usually means Dr. Grannis
• “We” in terms of implementation and code mongering usually means Suranga
Our basic approach
Solution(s)
What improvements are we trying out?
• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look for vs. WHAT to look for
Manual review
• Functions as our source of truth
– What?
– Why?
Manually reviewed 1,495 reports
Identified 371 (24.8%) positive cancer cases
Machine learning process
• Identification of keywords
– What ARE keywords?
Metastasis, tumor, malignant, neoplasm, stage, carcinoma, and ca
• Identification of negation context
• Use of alternate data input formats
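As a rough illustration (not code from the study), counting these keywords in a report might look like the sketch below. Only the keyword list comes from the slide; everything else is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordCounter {
    // Keyword list taken from the slide above.
    static final String[] KEYWORDS = {
        "metastasis", "tumor", "malignant", "neoplasm", "stage", "carcinoma", "ca"
    };

    // Whole-word, case-insensitive occurrence counts for each keyword.
    static Map<String, Integer> count(String reportText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String kw : KEYWORDS) {
            Matcher m = Pattern.compile("\\b" + Pattern.quote(kw) + "\\b",
                    Pattern.CASE_INSENSITIVE).matcher(reportText);
            int n = 0;
            while (m.find()) n++;
            counts.put(kw, n);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "ca" is matched as a whole word only, so "carcinoma" counts once.
        System.out.println(count("Invasive carcinoma; no evidence of metastasis."));
    }
}
```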
What were the different data input formats used?
• Raw data input
• Four state data input
What and why?
• Raw
• Four state
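The deck names the two formats but not the state definitions. A hypothetical sketch, assuming the raw format is a per-keyword occurrence count and the four states collapse presence and negation context into one categorical value; the state labels below are my assumption, not from the slides.

```java
public class InputFormats {
    // Hypothetical states; the slide does not spell them out.
    enum KeywordState { ABSENT, PRESENT, NEGATED, UNCERTAIN }

    // Raw input: just the occurrence count of the keyword.
    static int raw(int count) {
        return count;
    }

    // Four-state input: collapse the count plus negation context
    // into a single categorical value.
    static KeywordState fourState(int count, boolean negated, boolean uncertain) {
        if (count == 0) return KeywordState.ABSENT;
        if (negated)    return KeywordState.NEGATED;
        if (uncertain)  return KeywordState.UNCERTAIN;
        return KeywordState.PRESENT;
    }

    public static void main(String[] args) {
        System.out.println(fourState(2, true, false)); // NEGATED
    }
}
```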
So basically
Training / Testing
• What?
• Why cross-validation? (see the sketch after the Weka slide below)
• Alternative decision models
– So many options!
– Classification vs. clustering analysis
To preserve my sanity, and because we’re not stupid…
• We used Weka (Waikato Environment for Knowledge Analysis)
– is a collection of machine learning algorithms for data mining tasks
– is open source!
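A minimal Weka sketch of 10-fold cross-validation. The file name and the choice of Logistic here are placeholders, not details from the deck.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        // "reports.arff" is a placeholder for the preprocessed report data.
        Instances data = new DataSource("reports.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class label in last column

        // 10-fold cross-validation: train on nine folds, test on the
        // held-out fold, rotate, then pool the results.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Logistic(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```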
Decision models used
• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree (thanks, Jamie!)
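All six are stock Weka classifiers. A sketch of instantiating them with default settings; the k for IBk is my placeholder, since the deck gives no parameter values.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class Models {
    // One instance of each candidate decision model, default Weka settings.
    static Classifier[] candidates() {
        return new Classifier[] {
            new Logistic(),      // logistic regression
            new NaiveBayes(),    // naive Bayes
            new SMO(),           // support vector machine (SMO training)
            new IBk(3),          // k-nearest neighbor; k = 3 is a placeholder
            new RandomForest(),  // random forest
            new J48()            // C4.5-style decision tree
        };
    }
}
```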
Results
• How do we measure our results?
– Precision
• What % of positive predictions were correct?
– Recall
• What % of positive cases were caught?
– Accuracy
• What % of predictions were correct?
Precision vs. recall: the fine balance
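For concreteness, the three measures computed from confusion-matrix counts; the numbers in main are made up, not results from the study.

```java
public final class Metrics {
    // tp/fp/fn/tn: true/false positives and negatives.
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    public static void main(String[] args) {
        // Illustrative counts only.
        System.out.printf("precision=%.3f recall=%.3f accuracy=%.3f%n",
                precision(90, 10), recall(90, 5), accuracy(90, 300, 10, 5));
    }
}
```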
Results (contd.)
• RF and NB showed statistically significantly lower precision
• SVM exhibited statistically significantly lower recall
• SVM and NB produced statistically significantly lower accuracy
Overall performance by
preprocessed input type
• Raw count is significantly better than four-state input
Overall performance by decision
model
• The ensemble approach is significantly better than individual algorithms
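A sketch of one way to build such an ensemble in Weka, using the Vote meta-classifier. The member models and the averaging rule are assumptions; the deck does not say how its ensemble combined votes.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class Ensemble {
    public static void main(String[] args) throws Exception {
        // "reports.arff" is a placeholder for the preprocessed report data.
        Instances data = new DataSource("reports.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Member models chosen for illustration only.
        Vote ensemble = new Vote();
        ensemble.setClassifiers(new Classifier[] {
            new Logistic(), new NaiveBayes(), new J48()
        });
        // Combine the members' class-probability estimates by averaging.
        ensemble.setCombinationRule(
                new SelectedTag(Vote.AVERAGE_RULE, Vote.TAGS_RULES));

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ensemble, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```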
Improvements
Keywords? Sure, I have a list…
Better identification of keywords (Shaun)
Problems with NegEx…
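NegEx (Chapman et al.) flags keywords that fall inside the scope of negation triggers such as “no evidence of”. Below is a deliberately crude sketch of the idea, nowhere near the full trigger list or scope rules, which also hints at why it can go wrong: a fixed lookback window can leak across sentence boundaries.

```java
import java.util.regex.Pattern;

public class SimpleNegation {
    // A few common negation triggers; real NegEx uses a much larger
    // trigger list plus explicit scope-termination rules.
    static final Pattern TRIGGERS =
        Pattern.compile("\\b(no|not|without|negative for|no evidence of)\\b",
                Pattern.CASE_INSENSITIVE);

    // Treat a keyword as negated if a trigger occurs within the ~40
    // characters preceding it - a crude stand-in for NegEx scoping.
    static boolean isNegated(String text, int keywordOffset) {
        int start = Math.max(0, keywordOffset - 40);
        return TRIGGERS.matcher(text.substring(start, keywordOffset)).find();
    }

    public static void main(String[] args) {
        String report = "No evidence of metastasis. The specimen shows invasive carcinoma.";
        System.out.println(isNegated(report, report.indexOf("metastasis"))); // true
        System.out.println(isNegated(report, report.indexOf("carcinoma")));  // false
    }
}
```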
Results
• The funder is happy… we think
• We wrote an abstract!
• Feature selection approaches for keyword identification as an independent study rotation (see the ranking sketch below)
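As one example of what that rotation might try, Weka can rank candidate keyword features by information gain against the manually reviewed labels. A sketch, not the study's method; "reports.arff" is a placeholder.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankKeywords {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("reports.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Rank candidate keyword features by information gain with
        // respect to the cancer/no-cancer label.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        // Skip the last index: selectedAttributes() appends the class attribute.
        int[] ranked = selector.selectedAttributes();
        for (int i = 0; i < ranked.length - 1; i++) {
            System.out.println(data.attribute(ranked[i]).name());
        }
    }
}
```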
Our thanks to…
• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)
Questions?
Slide notes
• Explain title
• Describe the actual problem; speak of registries, physician resources, etc.
• What are we focusing on, i.e. what are we concerned about?
• Who actually did what
• General flowchart showing what happens
• How we’re trying to solve things
