Classification of prostate cancer pathology reports using natural language processing

Classification of noisy free-text prostate cancer pathology reports using natural language processing (NLP)
Presenter: Anjani K. Dhrangadhariya
Joint work with Sebastian Otálora, Manfredo Atzori and Henning Müller
AIDP2021 @ICPR2020 – Jan 10, 2021
MedGIFT group, University of Applied Sciences Western Switzerland (HES-SO)
Project supported by the European Union Horizon 2020 grant agreement 825292

Pathology Reports
Images are taken from the web for educational purposes. All rights reserved with the respective owners.

(Un/semi)-structured reporting

Structured reporting
> Completeness, consistency and clarity
> Conformance to standards - Interoperability
> Accuracy
> Comparison over health management timelines
> Intervention benefits analysis
> Better patient management, treatment decisions
> (Semi-) automated decision support
Swillens, J. E. M., et al. "Identification of barriers and facilitators in nationwide implementation of standardized structured reporting in pathology: a mixed method study."
Snoek, Annefleure, et al. "The impact of standardized structured reporting of pathology reports for breast cancer in the Netherlands."

Automation
Unavailability of structured reports
• Manual information extraction (IE)
• Time- and resource-consuming
Automation methods
• Create structured reports
• Organize reports and their respective digital pathology images into a structured proprietary database

Automation methods
• Natural language processing (NLP) methods: word and document embeddings, RNNs, transformers
• Extensively used for electronic health record (EHR) analysis and IE
• These methods have not yet fully penetrated clinical pathology
• Yala et al. classified breast cancer pathology reports into 20 classes using n-grams, reaching 97% accuracy
• Qiu et al. used a CNN to automatically extract ICD-O-3 topographic codes from a corpus of breast and lung cancer pathology reports, with a micro-F1 of 81%

Motivation
✓ Classify very noisy, publicly available prostate pathology reports into high-grade and low-grade using NLP methods
✓ High confidence
✓ Inspect the text representation and classifier for reliability
X Private datasets
X Do not investigate the reliability of machine learning approaches beyond performance metrics

Methods: Corpus
• Prostate adenocarcinoma clinical pathology reports from The Cancer Genome Atlas (TCGA) PanCancer dataset (or corpus)
• 494 reports (404 non-empty)
• Manually annotated as high-grade or low-grade using the diagnostic information contained within them

                    High-grade   Low-grade
Gleason score       > 7          <= 7
Number of reports   171          233
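
The labelling rule in the table can be written as a small function. This is only a sketch of the threshold: the slides state that the reports were annotated manually, and the function name and its two-argument form (primary and secondary Gleason grade) are assumptions made for illustration.

```python
def grade_from_gleason(primary: int, secondary: int) -> str:
    """Map a Gleason pattern pair to the binary label used in the table.

    Hypothetical helper: the corpus itself was labelled by hand.
    """
    score = primary + secondary  # Gleason score is the sum of both grades
    # Table rule: score > 7 is high-grade, score <= 7 is low-grade
    return "high-grade" if score > 7 else "low-grade"


labels = [grade_from_gleason(p, s) for p, s in [(4, 5), (3, 4), (3, 3), (4, 4)]]
```

Under this rule a 3+4=7 report falls into the low-grade class, which matches the cues listed later in the interpretability slides.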

Methods: Corpus characteristics
1. PDF format
2. Noisy
3. Variable structure
4. Variable length
5. Class imbalance

Methods: Preprocessing
PDF to text using an optical character recognition tool: http://jocr.sourceforge.net/

Methods: Preprocessing
Noise removed during denoising and preprocessing:
1. Special-character trails
2. ASCII null characters
3. NLTK stop-words (SW)
4. Corpus-specific SW
5. Punctuation
NLTK = Natural Language Toolkit
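
The denoising steps listed on this slide could look roughly like the sketch below. It is not the authors' code: the two stop-word sets are tiny hypothetical stand-ins (the slides use the NLTK English list plus a corpus-specific list that is not shown), and the "three or more special characters" threshold for a trail is an assumption.

```python
import re
import string

# Hypothetical stand-ins for the NLTK and corpus-specific stop-word lists.
GENERIC_STOPWORDS = {"the", "of", "and", "is", "in", "a", "with"}
CORPUS_STOPWORDS = {"page", "report"}

# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def denoise(text: str) -> str:
    text = text.replace("\x00", " ")            # ASCII null characters
    text = re.sub(r"[^\w\s]{3,}", " ", text)    # trails of 3+ special characters
    text = text.translate(_PUNCT_TO_SPACE)      # remaining punctuation
    stopwords = GENERIC_STOPWORDS | CORPUS_STOPWORDS
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return " ".join(tokens)


clean = denoise("Gleason\x00 score ~~~~ of the PROSTATE: 4+5=9.")
```

Note that stripping punctuation splits "4+5=9" into separate tokens, so the order of the steps matters if Gleason expressions should survive intact.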

Methods: Text representation
Count vectors
• Represent text in the form of word counts
• Tf-idf (Term frequency – Inverse document frequency)
• Count-based, weighted
• Weights each term in the document with respect to the corpus
• Up-weights meaningful words
• Down-weights filler words and stop-words

• Semantic and contextual
information lost!
• High dimensionality, sparsity
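
To make the weighting concrete, here is a minimal pure-Python sketch of tf-idf. It assumes the textbook form tf × log(N / df); library implementations typically use smoothed variants, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed tf-idf)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["gleason score 4 5", "gleason score 3 3", "gleason grade 3"]
weights = tfidf(docs)
```

A term that appears in every document ("gleason" here) receives weight zero, while a term unique to one document ("4") is weighted highest. Each vector has one dimension per vocabulary term, which is where the high dimensionality and sparsity come from.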

Semantic vectors
• Distributed representation of
paragraphs or documents
• Paragraph vectors
• Unsupervised
Paragraph vectors
1. Distributed memory model of
paragraph vectors (PV-DM)
2. Distributed bag of words model of
paragraph vectors (PV-DBOW)

• Distributed memory (DM)
• Distributed bag of words (DBOW)

Methods
Reports → Preprocessing → Clean reports → Training set / Test set
Training set → Augmented training set (back-translation)
Vectorization
Vectors
1. Tf-Idf
2. PV-DM
3. PV-DBOW
Model training & evaluation
Classifiers
1. LR (logistic regression)
2. SVM (support vector machine)
3. KNN (k-nearest neighbours)
Best performing model → High grade / Low grade
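
The augmentation step can be sketched as oversampling the smaller class until both classes are the same size. Back-translation needs a translation model, so the `augment` argument below is a hypothetical hook and `identity_augment` is a placeholder that simply copies the text. This shows plain random oversampling, not the authors' back-translation pipeline.

```python
import random


def oversample_minority(texts, labels, augment, seed=0):
    """Add augmented copies of minority-class texts until classes balance."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, items in by_label.items():
        for _ in range(target - len(items)):
            out_texts.append(augment(rng.choice(items)))
            out_labels.append(label)
    return out_texts, out_labels


def identity_augment(text):
    # Placeholder: a real pipeline would translate the text to another
    # language and back to obtain a paraphrase.
    return text


texts = ["a", "b", "c", "d", "e"]
labels = ["low", "low", "low", "high", "high"]
new_texts, new_labels = oversample_minority(texts, labels, identity_augment)
```

Only the training set is augmented in the diagram; the test set stays untouched so that the evaluation reflects real reports.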

Results – best model
[Figure: bar chart of P (precision), R (recall), F1 and ROC AUC, on an axis from 0.5 to 1.0, for tfidf LR, tfidf SVM, tfidf KNN, pvdbow SVM, pvdbow KNN, pvdbow LR (denoised, oversampled), pvdbow LR (no denoising) and pvdbow LR (no oversampling)]
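
For reference, the four reported metrics can be computed as in the sketch below. The function names, the choice of "high" as the positive class and the toy predictions are assumptions for illustration. They are not values from the study.

```python
def precision_recall_f1(y_true, y_pred, positive="high"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def roc_auc(y_true, scores, positive="high"):
    """Probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = ["high", "high", "low", "low"]
y_pred = ["high", "low", "high", "low"]
scores = [0.9, 0.4, 0.6, 0.1]  # toy predicted probabilities of "high"
p, r, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

ROC AUC is computed from the ranking of scores and not from thresholded predictions, so it can differ from F1 on the same model.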

LIME (Local Interpretable Model-agnostic Explanations) interpretability analysis
Strong cues for the high-grade adenocarcinoma
• Gleason 4+5=9
• Gleason 4
• Gleason 5

Very strong cues for the low-grade adenocarcinoma
• Gleason grade 3+4
• Gleason grade 3+3
• Histologic grade g3
• Primary Gleason grade 3
• Secondary Gleason grade 4

Irrelevant cues for the high-grade adenocarcinoma
• Right?
• Left?
• Prostatic?
• 1.3.3.5?

Strong cues for the low-grade adenocarcinoma
• Gleason score 3+4=7
• Histologic grade g3-4
NCI Tumor grade fact-sheet: Histologic grade g3-4 denotes high-grade cancer
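
The idea of word-level cues can be illustrated with a much simpler stand-in for LIME: remove one token at a time and record how the predicted high-grade probability changes. This leave-one-word-out sketch is not the LIME algorithm, which fits a local surrogate model on many random perturbations. The toy classifier and its weights are invented for illustration.

```python
def word_importance(text, predict_high_prob):
    """Return (token, drop in high-grade probability when it is removed)."""
    tokens = text.split()
    base = predict_high_prob(" ".join(tokens))
    importances = []
    for i, token in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        importances.append((token, base - predict_high_prob(reduced)))
    return importances


def toy_predict_high_prob(text):
    # Hypothetical classifier: "5" pushes towards high-grade, "3" away.
    tokens = text.split()
    score = 0.5 + 0.2 * tokens.count("5") - 0.2 * tokens.count("3")
    return min(1.0, max(0.0, score))


cues = word_importance("gleason 4 5 right", toy_predict_high_prob)
```

A positive value marks a token that supports the high-grade prediction; tokens such as "right" that leave the prediction unchanged get zero. An inspection of this kind is what surfaces irrelevant cues like the ones listed above.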

Conclusion
✓ The binary classification approach was tested on high-grade and low-grade prostate adenocarcinoma
✓ Semantic representation performed better than count-based representation (23% better ROC AUC score)
✓ Reliability of the paragraph vector representation was inspected with LIME
✓ Future work: extracting
• tumor staging terms
• clinical measurements
• prostate tissue anatomy information

Resources
Data, code and interpretability analysis
✓ GitHub: https://github.com/anjani-dhrangadhariya/pathology-report-classification.git
✓ TCGA dataset: http://www.cbioportal.org/study/clinicalData?id=prad_tcga_pan_can_atlas_2018

Thank you for your attention
Anjani Dhrangadhariya
anjani.dhrangadhariya@hevs.ch
https://www.linkedin.com/in/anjani-dhrangadhariya/
More information
http://medgift.hevs.ch/wordpress/
https://www.examode.eu/
