Classification of noisy free-text prostate
cancer pathology reports using natural
language processing (NLP)
Presenter: Anjani K. Dhrangadhariya
Joint work with Sebastian Otálora, Manfredo Atzori and Henning Müller
AIDP2021 @ICPR2020 – Jan 10, 2021
MedGIFT group, University of Applied Sciences Western Switzerland (HES-SO)
Project supported by European Union
Horizon 2020 grant agreement 825292
Pathology Reports
Images are taken from the web for educational purposes. All rights reserved with the respective owners.
(Un/semi)-structured reporting
Structured reporting
> Completeness, consistency and clarity
> Conformance to standards - Interoperability
> Accurate
> Comparison over health management timelines
> Intervention benefits analysis
> Better patient management, treatment decisions
> (Semi-) automated decision support
Swillens, J. E. M., et al. "Identification of barriers and facilitators in nationwide implementation of standardized structured reporting in pathology: a mixed method study."
Snoek, Annefleure, et al. "The impact of standardized structured reporting of pathology reports for breast cancer in the Netherlands."
Automation
Unavailability of structured reports
• Manual information extraction (IE)
• Time and resource consuming
Automation methods
• Create structured reports
• Organize reports and their respective digital pathology images into a structured proprietary database
Automation methods
• Natural language processing (NLP) methods: word and document embeddings, RNNs, transformers
• Extensively used for electronic health record (EHR) analysis and IE
• Not yet widely applied in clinical pathology
• Yala et al. classified breast cancer pathology reports into 20 classes using n-grams, reaching 97% accuracy
• Qiu et al. used a CNN to automatically extract ICD-O-3 topographic codes from a corpus of breast and lung cancer pathology reports, with a micro-F1 of 81%
Motivation
• Classify very noisy, publicly available prostate pathology reports into high-grade and low-grade using NLP methods
• High confidence
• Inspect the text representation and classifier for reliability
Limitations of prior work:
X Private datasets
X Do not investigate the reliability of machine learning approaches beyond performance metrics
Methods: Corpus
• Prostate adenocarcinoma clinical pathology reports from
The Cancer Genome Atlas (TCGA) PanCancer dataset
• 494 reports (404 non-empty)
• Manually annotated into high-grade and low-grade using
diagnostic information contained within them
                     High-grade   Low-grade
Gleason score        > 7          <= 7
Number of reports    171          233
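The Gleason threshold above can be written as a small function. This is only a sketch: the reports were annotated manually, and the function name `grade_from_gleason` and its arguments are hypothetical, not taken from the project code.

```python
def grade_from_gleason(primary: int, secondary: int) -> str:
    """Map a pair of Gleason patterns to the binary label used for the corpus."""
    score = primary + secondary  # the Gleason score is the sum of the two patterns
    return "high-grade" if score > 7 else "low-grade"


print(grade_from_gleason(4, 5))  # Gleason score 9
print(grade_from_gleason(3, 4))  # Gleason score 7
```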
Methods: Corpus characteristics
1. PDF format
2. Noisy
3. Variable structure
4. Variable length
5. Class imbalance
Methods: Preprocessing
PDF -> Text, using an optical character recognition (OCR) tool: http://jocr.sourceforge.net/
Methods: Preprocessing
Noise removed during denoising and preprocessing:
1. Special character trails
2. ASCII null characters
3. NLTK stop-words (SW)
4. Corpus-specific SW
5. Punctuation
NLTK = Natural Language Toolkit
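The five cleaning steps could look roughly like the sketch below. It uses only the Python standard library; the two stop-word sets are small stand-ins for the NLTK English list and the corpus-specific list (neither is reproduced in the slides), and the order of the steps is an assumption chosen so that stop-words are matched after punctuation has been stripped.

```python
import re
import string

# Stand-in stop-word lists (assumptions, not the lists used in the talk).
GENERIC_STOPWORDS = {"the", "of", "and", "is", "in", "a", "with"}
CORPUS_STOPWORDS = {"page", "report"}  # hypothetical corpus-specific entries

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def denoise(text: str) -> str:
    """Clean one OCR'd report and return a space-separated token string."""
    text = text.replace("\x00", " ")                       # ASCII null characters
    text = re.sub(r"([^A-Za-z0-9\s])\1{2,}", " ", text)    # trails of a repeated special character
    text = text.translate(PUNCT_TO_SPACE)                  # remaining punctuation
    stop = GENERIC_STOPWORDS | CORPUS_STOPWORDS
    return " ".join(t for t in text.lower().split() if t not in stop)


raw = "The prostate\x00 ---- Gleason score 4+5=9, page 2 of the report."
print(denoise(raw))
```

Note that stripping punctuation this way also splits expressions such as `4+5=9` into separate digits; whether that is desirable depends on how the Gleason cues should reach the classifier.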
Methods: Text representation
Count vectors
• Represent text in the form of word counts
• Tf-idf (term frequency – inverse document frequency)
• Count-based, weighted
• Weights each term in the document with respect to the corpus
• Higher weight: meaningful words
• Lower weight: filler words, stop-words
Methods: Text representation
• Semantic and contextual information is lost
• High dimensionality, sparsity
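To make the count-based weighting concrete, here is a minimal tf-idf sketch in plain Python. It uses the textbook form tf × log(N / df); libraries often apply smoothed variants, and the slides do not say which variant was used, so treat the formula and the toy documents as assumptions.

```python
from __future__ import annotations

import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "gleason score 4 5".split(),
    "gleason score 3 3".split(),
    "gleason grade 3 4".split(),
]
weights = tfidf(docs)
print(weights[0])
```

A term that appears in every document, such as `gleason` here, gets weight zero, while a term found in a single report is weighted highest. Each vector has one dimension per vocabulary term, which is where the high dimensionality and sparsity come from.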
Methods: Text representation
Semantic vectors
• Distributed representation of
paragraphs or documents
• Paragraph vectors
• Unsupervised
Paragraph vectors
1. Distributed memory model of
paragraph vectors (PV-DM)
2. Distributed bag of words model of
paragraph vectors (PV-DBOW)
Methods: Text representation
• Distributed memory (DM)
• Distributed bag of words (DBOW)
Methods
[Pipeline diagram, built up step by step over several slides]
1. Reports -> preprocessing -> clean reports
2. Clean reports -> training set and test set
3. Training set -> augmented training set (back-translation)
4. Vectorization: Tf-idf, PV-DM, PV-DBOW
5. Model training and evaluation with three classifiers: logistic regression (LR), support vector machine (SVM), k-nearest neighbours (KNN)
6. Best performing model -> high grade or low grade
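The corpus is imbalanced and the results compare models with and without oversampling, but the slides do not describe the oversampling procedure. The sketch below assumes plain random oversampling with replacement, applied to the training set only; the function name and the toy data are hypothetical.

```python
import random


def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies from the same class to reach the majority-class size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


reports = ["r1", "r2", "r3", "r4", "r5"]
grades = ["low", "low", "low", "high", "high"]
balanced_reports, balanced_grades = oversample(reports, grades)
print(balanced_grades.count("low"), balanced_grades.count("high"))
```

Keeping the test set untouched matters here: duplicating test reports would distort the reported metrics.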
Results – best model
[Figure: chart of precision (P), recall (R), F1 and ROC AUC, y-axis from 0.5 to 1.0, for tfidf LR, tfidf SVM, tfidf KNN, pvdbow SVM, pvdbow KNN, pvdbow LR (denoised, oversampled), pvdbow LR (no denoising) and pvdbow LR (no oversampling). The chart is repeated over several slides.]
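For reference, the four reported metrics (P, R, F1, ROC AUC) can be computed as in the sketch below. It treats high-grade as the positive class and uses made-up labels and scores; the slides do not state which class was treated as positive or whether scores were averaged over classes, so both choices are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive="high"):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores, positive="high"):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = ["high", "high", "low", "low"]
scores = [0.9, 0.4, 0.6, 0.1]          # hypothetical predicted probability of high-grade
y_pred = ["high" if s >= 0.5 else "low" for s in scores]
print(precision_recall_f1(y_true, y_pred), roc_auc(y_true, scores))
```

ROC AUC is computed from the scores rather than the thresholded predictions, which is why it can differ from the other three metrics on the same model.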
LIME Interpretability analysis
Strong cues for the high-grade
adenocarcinoma
• Gleason 4+5=9
• Gleason 4
• Gleason 5
LIME Interpretability analysis
Very strong cues for the low-grade adenocarcinoma
• Gleason grade 3+4
• Gleason grade 3+3
• Histologic grade g3
• Primary Gleason grade 3
• Secondary Gleason grade 4
LIME Interpretability analysis
Irrelevant cues for the high-grade adenocarcinoma
• Right?
• Left?
• Prostatic?
• 1.3.3.5?
LIME Interpretability analysis
Strong cues for the low-grade
adenocarcinoma
• Gleason score 3+4=7
• Histologic grade g3-4
NCI tumor grade fact sheet: histologic grade g3-4 denotes high-grade cancer
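The idea of asking which words drive a prediction can be illustrated with a much simpler technique than LIME: leave-one-word-out occlusion. The sketch below is not LIME (LIME fits a locally weighted linear model on many perturbed samples), and the toy classifier with its cue weights is a hypothetical stand-in for the trained model.

```python
def occlusion_importance(tokens, predict_proba):
    """Score each token by how much removing it lowers the predicted probability."""
    base = predict_proba(tokens)
    importance = {}
    for i, token in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        importance[token] = importance.get(token, 0.0) + (base - predict_proba(reduced))
    return importance


def toy_high_grade_probability(tokens):
    """Hypothetical stand-in classifier, not the model from the slides."""
    cues = {"4+5=9": 0.4, "right": 0.05}  # made-up cue weights
    return min(1.0, 0.3 + sum(cues.get(t, 0.0) for t in tokens))


scores = occlusion_importance(
    ["gleason", "4+5=9", "right", "lobe"], toy_high_grade_probability
)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

A non-zero score for a word like `right` is the kind of irrelevant cue the analysis above flags: the model leans on it even though it carries no grading information.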
Conclusion
• The binary classification approach was tested on high-grade and low-grade prostate adenocarcinoma
• Semantic representation performed better than count-based representation (23% better ROC AUC score)
• Reliability of the paragraph vector representation was inspected with LIME
• Future work: extracting
  • tumor staging terms
  • clinical measurements
  • prostate tissue anatomy information
Resources
Data, code and interpretability analysis
• GitHub: https://github.com/anjani-dhrangadhariya/pathology-report-classification.git
• TCGA dataset: http://www.cbioportal.org/study/clinicalData?id=prad_tcga_pan_can_atlas_2018
Thank you for your attention
Anjani Dhrangadhariya
anjani.dhrangadhariya@hevs.ch
https://www.linkedin.com/in/anjani-dhrangadhariya/
More information
http://medgift.hevs.ch/wordpress/
https://www.examode.eu/
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
bams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptxbams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptx
 

Classification of noisy free-text prostate cancer pathology reports using natural language processing (NLP) - Anjani K. Dhrangadhariya (HES-SO Valais-Wallis) - AIDP2021 workshop colocated at ICPR2021

  • 1. Classification of noisy free-text prostate cancer pathology reports using natural language processing (NLP). Presenter: Anjani K. Dhrangadhariya. Joint work with Sebastian Otálora, Manfredo Atzori and Henning Müller. AIDP2021 @ICPR2020 – Jan 10, 2021. MedGIFT group, University of Applied Sciences Western Switzerland (HES-SO). Project supported by European Union Horizon 2020 grant agreement 825292.
  • 2. Pathology Reports. Images are taken from the web for educational purposes. All rights reserved with the respective owners.
  • 4. Structured reporting • Completeness, consistency and clarity • Conformance to standards (interoperability) • Accurate • Comparison over health management timelines • Intervention benefits analysis • Better patient management, treatment decisions • (Semi-)automated decision support. References: Swillens, J. E. M., et al. "Identification of barriers and facilitators in nationwide implementation of standardized structured reporting in pathology: a mixed method study." Snoek, Annefleure, et al. "The impact of standardized structured reporting of pathology reports for breast cancer in the Netherlands."
  • 5. Automation. Unavailability of structured reports • Manual information extraction (IE) • Time and resource consuming. Automation methods • Create structured reports • Organize reports and their respective digital pathology images into a structured proprietary database
  • 6. Automation methods • Natural language processing (NLP) methods: word and document embeddings, RNNs, transformers • Extensively used for electronic health records (EHR) analysis and IE • Applicability has not fully penetrated clinical pathology • Yala et al. classified breast cancer pathology reports into 20 classes using n-grams, reaching 97% accuracy • Qiu et al. used a CNN to automatically extract ICD-O-3 topographic codes from a corpus of breast and lung cancer pathology reports with a micro-F1 of 81%
  • 7. Motivation • Classify very noisy, publicly available prostate pathology reports into high-grade and low-grade using NLP methods • High confidence • Inspect the text representation and classifier for reliability. Limitations of existing work: private datasets; reliability of machine learning approaches not investigated beyond performance metrics
  • 8. Methods: Corpus • Prostate adenocarcinoma clinical pathology reports from The Cancer Genome Atlas (TCGA) PanCancer dataset • 494 reports (404 non-empty) • Manually annotated into high-grade and low-grade using diagnostic information contained within them • High-grade: Gleason score > 7, 171 reports • Low-grade: Gleason score <= 7, 233 reports
  • 9. Methods: Corpus characteristics 1. PDF format 2. Noisy 3. Variable structure 4. Variable length 5. Class imbalance
  • 11. Methods: Preprocessing (denoising). Noise removed: 1. Special character trails 2. ASCII null characters 3. NLTK stop-words (SW) 4. Corpus-specific SW 5. Punctuation. NLTK = Natural Language Toolkit
  • 12. Methods: Text representation. Count vectors • Represent text in the form of word counts • Tf-idf (term frequency – inverse document frequency) • Count-based, weighted • Weights each term in the document with respect to the corpus • Upweights meaningful words • Downweights filler words and stopwords
  • 13. Methods: Text representation • Semantic and contextual information lost! • High dimensionality, sparsity
  • 14. Methods: Text representation. Semantic vectors • Distributed representation of paragraphs or documents • Paragraph vectors • Unsupervised. Paragraph vectors: 1. Distributed memory model of paragraph vectors (PV-DM) 2. Distributed bag of words model of paragraph vectors (PV-DBOW)
  • 15. Methods: Text representation • Distributed memory (DM) • Distributed bag of words (DBOW)
  • 16. Methods (pipeline diagram): Reports → Preprocessing → Clean reports → Training set and Test set
  • 17. Methods (pipeline diagram): as slide 16, plus Training set → Augmented training set
  • 18. Methods (pipeline diagram): as slide 17, with the augmentation step labelled back-translation
  • 19. Methods (pipeline diagram): as slide 18, plus vectorization. Vectors: 1. Tf-Idf 2. PV-DM 3. PV-DBOW
  • 20. Methods (pipeline diagram): as slide 19, plus model training & evaluation. Classifiers: 1. LR 2. SVM 3. KNN
  • 21. Methods (pipeline diagram): as slide 20, plus Best performing model → High grade or Low grade
  • 22-29. Results – best model [Bar chart, repeated on slides 22 to 29: precision (P), recall (R), F1 and ROC AUC on a 0.5 to 1.0 axis for tfidf LR, tfidf SVM, tfidf KNN, pvdbow SVM, pvdbow KNN, pvdbow LR (denoised, oversampled), pvdbow LR (no denoising) and pvdbow LR (no oversampling)]
  • 30. Results – best model
  • 31. Results – best model
  • 32. LIME Interpretability analysis. Strong cues for the high-grade adenocarcinoma • Gleason 4+5=9 • Gleason 4 • Gleason 5
  • 33. LIME Interpretability analysis. Very strong cues for the low-grade adenocarcinoma • Gleason grade 3+4 • Gleason grade 3+3 • Histologic grade g3 • Primary Gleason grade 3 • Secondary Gleason grade 4
  • 34. LIME Interpretability analysis. Irrelevant cues for the high-grade adenocarcinoma • Right? • Left? • Prostatic? • 1.3.3.5?
  • 35. LIME Interpretability analysis. Strong cues for the low-grade adenocarcinoma • Gleason score 3+4=7 • Histologic grade g3-4. Note (NCI Tumor grade fact-sheet): Histologic grade g3-4 denotes high-grade cancer
  • 36. Conclusion • The binary classification approach was tested on high-grade and low-grade prostate adenocarcinoma • Semantic representation performed better than count-based representation (23% better ROC AUC score) • Reliability of the paragraph vector representation was inspected with LIME • Future work: extracting tumor staging terms, clinical measurements and prostate tissue anatomy information
  • 37. Resources: data, code and interpretability analysis • GitHub: https://github.com/anjani-dhrangadhariya/pathology-report-classification.git • TCGA dataset: http://www.cbioportal.org/study/clinicalData?id=prad_tcga_pan_can_atlas_2018
  • 38. Thank you for your attention. Anjani Dhrangadhariya, anjani.dhrangadhariya@hevs.ch, https://www.linkedin.com/in/anjani-dhrangadhariya/ More information: http://medgift.hevs.ch/wordpress/ https://www.examode.eu/ Project supported by European Union Horizon 2020 grant agreement 825292

Editor's Notes

  1. Good afternoon, I will be presenting the work on classification of very noisy prostate cancer pathology reports using NLP which was jointly done with Sebastian Otalora, Manfredo Atzori, and Henning Mueller. An example pathology report, as seen in the title slide, is
  2. the primary form of communication between pathologists and clinicians. A pathologist microscopically examines a biological specimen for the presence of cancer-related morphology. After careful examination, the findings are summarized in a pathology report, which is sent back to the referring clinician. Based on the diagnostic and supporting information, the clinician plans a course of treatment for the patient.
  3. A pathology report contains vital diagnostic information in the form of cancer cell grade, histology grade or TNM stage, anatomy and tumor site information, along with other histologic measurements and descriptions. For decades, all this information has been summarized in pathology reports in an unstructured or, at best, semi-structured free-text form.
  4. Structured reporting, though difficult to adapt to initially, is gaining prominence in clinical practice for several reasons. It enforces completeness, consistency and clarity through standardized reporting formats. It ensures that anyone who writes such reports conforms to established international standards, which in turn enables interoperability. It improves reporting accuracy, because measurement units in such reports are standardized too. This enables accurate comparison over health management timelines, which makes it easier to conduct systematic, evidence-based intervention benefit analysis. Eventually this leads to better patient management, better treatment decisions and the possibility of testing semi-automated decision support.
  5. But because structured reports are not available, practitioners resort to manually generating them from the unstructured ones, which is undeniably resource-consuming. Instead of manual methods, automation could be used to construct structured reports from the unstructured ones and to organize these structured reports into a proprietary database.
  6. NLP methods have improved automated general text understanding, and these methods have been used extensively for electronic health record analysis. However, their applicability has not fully penetrated the clinical pathology domain. For example, method x was used for classifying 90,000 pathology reports, eventually achieving a high classification confidence. Method y was also used to classify a medium-size corpus into 20 classes and reached an F1 of 0.89.
  7. To demonstrate our approach, we use the publicly available prostate cancer pathology reports from the TCGA PanCancer dataset. The 404 non-empty reports were manually classified into high-grade and low-grade using the Gleason score information available from the dataset. A report was classified as high-grade if its Gleason diagnostic grade was above 7 and as low-grade otherwise. With this method, 171 reports were classified as high-grade and 233 as low-grade.
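The labelling rule described in this note fits in a few lines. The sketch below is illustrative only: the function name `grade_label` and the label strings are assumptions, not taken from the authors' code.

```python
def grade_label(gleason_score: int) -> str:
    """Map a total Gleason score to the binary label described in the talk."""
    # Rule from the talk: above 7 is high-grade, 7 or below is low-grade.
    return "high-grade" if gleason_score > 7 else "low-grade"


# Example: a score of exactly 7 falls on the low-grade side of the threshold.
labels = [grade_label(score) for score in (6, 7, 8, 9)]
```

Note that under this rule a 3+4=7 and a 4+3=7 report both end up in the low-grade class.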
  8. Before any text can be used for NLP tasks, it needs to be thoroughly preprocessed. The PDF reports were converted into machine-readable text using a freely available Java-based optical character recognition tool.
  9. This conversion adds considerable noise to the already noisy unstructured reports, which then require denoising. Special-character trails were removed automatically using heuristics, and ASCII null characters were removed manually. Next, predefined stop words were removed automatically using the NLTK stop-words corpus, and some corpus-specific stop words were also removed. Unnecessary punctuation was removed as well.
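The denoising steps listed in this note could look roughly like the following standard-library sketch. Several things here are assumptions made for illustration: the tiny `STOP_WORDS` set stands in for the NLTK and corpus-specific lists, a "trail" is taken to mean three or more consecutive special characters, and "+" and "=" are kept so that Gleason patterns such as 4+5=9 survive. The talk does not specify any of these details.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); the talk used the NLTK
# English stop words plus corpus-specific ones.
STOP_WORDS = {"the", "of", "and", "is", "a", "in", "to", "with"}

# Punctuation to strip; "+" and "=" are kept on purpose (assumption).
PUNCT_TO_DROP = "".join(c for c in string.punctuation if c not in "+=")


def denoise(text: str, stop_words=STOP_WORDS) -> str:
    text = text.replace("\x00", " ")            # ASCII null characters
    text = re.sub(r"[^\w\s]{3,}", " ", text)    # trails of 3+ special characters
    text = text.translate(str.maketrans(PUNCT_TO_DROP, " " * len(PUNCT_TO_DROP)))
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return " ".join(tokens)


cleaned = denoise("The Gleason score is 4+5=9 ..... \x00 in the prostate!!!")
```

Here `cleaned` contains only the lower-cased content words and the Gleason pattern.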
  10. Before any classifier can be trained, the natural-language corpus needs to be represented in a numerical, machine-readable format. We used count vectors, which represent text as word counts, and specifically tf-idf, which additionally weights each term in a document with respect to the corpus. Tf-idf upweights meaningful words and downweights filler stop words. It is the term frequency (the number of occurrences of a term i in document j) multiplied by the inverse document frequency, which is typically the logarithm of the total number of documents divided by the number of documents that contain the term.
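As a worked illustration of this weighting, here is a minimal pure-Python sketch. It assumes the plain textbook form idf = log(N / df) and whitespace tokenisation; library implementations commonly add smoothing or normalisation, and the talk does not say which variant was used. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


toy_docs = ["gleason score 9", "gleason score 6", "gleason grade 3"]
vectors = tfidf(toy_docs)
```

In this toy corpus "gleason" occurs in every document and receives weight 0, while "9" occurs in only one document and receives the largest weight in the first vector.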
  11. To take semantic information from a document into account and reduce vector dimensionality, we used two variations of paragraph vectors, which are themselves a variation of word vectors. These are the "distributed memory" model of paragraph vectors and the "distributed bag-of-words" model of paragraph vectors.
  12. A distributed memory model uses the document identifier and the surrounding terms to predict a target term and is trained in an unsupervised fashion. A distributed bag-of-words model uses the document identifier as input to predict words randomly sampled from the document. It is also trained in an unsupervised manner. Both representations retain semantic information about the encoded document.
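The difference between the two models is easiest to see in the training examples each one is fed. The sketch below only builds those (input, target) pairs; it does not train any embedding, and the function names, the window size and the sampling scheme are assumptions chosen for illustration.

```python
import random


def pv_dm_examples(doc_id, tokens, window=2):
    """PV-DM: (document id + surrounding words) -> target word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(((doc_id, tuple(context)), target))
    return examples


def pv_dbow_examples(doc_id, tokens, n_samples=3, seed=0):
    """PV-DBOW: document id alone -> words randomly sampled from the document."""
    rng = random.Random(seed)
    return [(doc_id, rng.choice(tokens)) for _ in range(n_samples)]


tokens = ["gleason", "score", "4+5=9"]
dm = pv_dm_examples("report_1", tokens, window=1)
dbow = pv_dbow_examples("report_1", tokens)
```

In the PV-DM pairs the input always carries context words next to the document identifier, whereas in the PV-DBOW pairs the identifier is the only input, which is why word order plays no role there.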
  13. Getting back to the approach: once the reports are preprocessed and denoised, the corpus of clean reports is separated into training and test sets.
  14. To tackle the class imbalance problem, text augmentation by back-translation was used to oversample the minority class.
  15. In the back-translation process, a document in the source language is first translated into a target language of choice and then translated back into the source language. This introduces a little variation into the augmented text reports. We used German as the target language for back-translation because English and German are both West Germanic languages.
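The back-translation step can be sketched independently of any particular translation system. In the example below, `translate(text, source, target)` is a hypothetical callable supplied by the caller; the talk does not name the translation tool that was used, and `fake_translate` is only a stand-in so that the sketch runs.

```python
from typing import Callable, List

Translator = Callable[[str, str, str], str]  # (text, source, target) -> text


def back_translate(text: str, translate: Translator,
                   source: str = "en", pivot: str = "de") -> str:
    """Translate source -> pivot -> source to obtain a paraphrased copy."""
    return translate(translate(text, source, pivot), pivot, source)


def augment_minority(minority_docs: List[str], n_extra: int,
                     translate: Translator) -> List[str]:
    """Create n_extra back-translated copies, cycling through the minority class."""
    return [back_translate(minority_docs[i % len(minority_docs)], translate)
            for i in range(n_extra)]


def fake_translate(text: str, source: str, target: str) -> str:
    # Stand-in only: tags the text instead of translating it.
    return f"[{source}->{target}] {text}"


extra = augment_minority(["gleason 4+5=9", "gleason 4+4=8"], 3, fake_translate)
```

As described in the notes, augmentation is applied to the training set only, so the test set stays untouched by synthetic text.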
  16. Then the previously mentioned vectors were extracted from the denoised over-sampled training corpus.
  17. These vectorized documents were used to train three machine learning classifiers (LR, SVM and KNN) for binary classification of the reports into high-grade vs. low-grade.
  18. The best-performing model-vector combination was then used to classify samples from the test set into high-grade vs. low-grade.
  19. It can be seen that denoising and oversampling considerably improved the performance of the best-performing model-vector combination.
  20. Overall, the semantic PV-DBOW vectors consistently and considerably outperformed the count-based tf-idf vectors, by 23% in ROC AUC score.
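For reference, the four metrics shown on the results slides can be computed as in the following standard-library sketch. Treating high-grade as the positive class is an assumption, as are the function names; the toy labels and scores are made up and are not results from the talk.

```python
def precision_recall_f1(y_true, y_pred, positive="high-grade"):
    """Precision, recall and F1 for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)


y_true = ["high-grade", "high-grade", "low-grade", "low-grade"]
y_pred = ["high-grade", "low-grade", "high-grade", "low-grade"]
prf = precision_recall_f1(y_true, y_pred)
auc = roc_auc([0.9, 0.4], [0.5, 0.1])
```

The pairwise form of ROC AUC is quadratic in the number of samples, which is acceptable for a corpus of a few hundred reports.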
  21. We inspected the best-performing model-vector combination using LIME to check whether it had actually learnt pathology-relevant features. From one of the inspected high-grade samples, it can be seen that the model picks up strong diagnostic clues from the report.
  22. For the next inspected sample, a low-grade one, the model identifies very strong clues that the report is low-grade, such as the Gleason score information and the histologic grade.
  23. But the model also falsely predicts a high-grade report as low-grade by picking up on very irrelevant clues.
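The idea behind this kind of inspection can be conveyed with a much simpler stand-in than LIME. LIME fits a local linear surrogate model on many randomly perturbed copies of a text; the sketch below instead uses plain leave-one-word-out ablation, and `toy_high_grade_proba` is a hypothetical classifier with made-up weights, not the trained model from the talk.

```python
def word_importance(text, predict_proba):
    """Score each word by how much removing it lowers the predicted probability."""
    tokens = text.split()
    base = predict_proba(" ".join(tokens))
    scores = {}
    for i, token in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        scores[token] = base - predict_proba(ablated)
    return scores


def toy_high_grade_proba(text):
    # Hypothetical stand-in model: a relevant cue and an irrelevant one.
    weights = {"4+5=9": 0.4, "left": 0.1}
    present = set(text.split())
    return min(1.0, 0.3 + sum(w for tok, w in weights.items() if tok in present))


scores = word_importance("gleason 4+5=9 left lobe", toy_high_grade_proba)
```

A non-zero score for a word such as "left" is the kind of signal that would flag reliance on an irrelevant cue.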