Classification of prostate cancer pathology reports using natural language processing

This document summarizes research on classifying prostate cancer pathology reports into high-grade and low-grade categories using natural language processing. The best performing model was a logistic regression model trained on paragraph vector representations of reports, achieving an ROC AUC score of 0.91. An analysis of the model's interpretations found that it strongly associated terms like "Gleason 4+5=9" with high-grade cancer and "Gleason grade 3+3" with low-grade cancer. Future work will aim to extract additional clinical information like tumor staging from the reports.

Editor's Notes

#3 Pathology reports are the primary form of communication between pathologists and clinicians. A pathologist microscopically examines a biological specimen for the presence of cancer-related morphology. After careful examination, the findings are summarized in a pathology report, which is sent back to the referring clinician. Based on the diagnostic and supporting information, this clinician plans a course of treatment for the patient.
#4 These reports contain vital diagnostic information in the form of cancer cell grade, histologic grade or TNM stage, anatomy, and tumor site, along with other histologic measurements and descriptions. For decades, all this information has been recorded in unstructured or, at best, semi-structured free text.
#5 Structured reporting, though difficult to adopt initially, is gaining prominence in clinical practice for several reasons. It enforces completeness, consistency and clarity through standardized reporting formats. It ensures that whoever writes the reports conforms to established international standards, which in turn enables interoperability. It improves reporting accuracy because measurement units are standardized too. This enables accurate comparison over health management timelines and systematic, evidence-based analysis of intervention benefit, which eventually leads to better patient management, better treatment decisions, and the ability to test the feasibility of semi-automated decision support.
#6 But because structured reports are not available, practitioners resort to generating them manually from the unstructured ones, which is undeniably resource-consuming. Instead, automation could be used to construct structured reports from the unstructured ones and organize them in a proprietary database.
#7 NLP methods have improved automated understanding of general text and have been used extensively for electronic health record analysis. However, they have not yet fully penetrated the clinical pathology domain. For example, method x was used to classify 90,000 pathology reports, eventually achieving high classification confidence. Method y was used to classify a medium-sized corpus into 20 classes and reached an F1 of 0.89.
#9 To demonstrate our approach, we use the publicly available prostate cancer pathology reports from the TCGA PanCancer dataset. 404 non-empty reports were manually classified as high-grade or low-grade using the Gleason score information available in the dataset. A report was classified as high-grade if its Gleason score was above 7 and as low-grade otherwise. With this rule, 171 reports were classified as high-grade and 233 as low-grade.
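The labeling rule in this note can be sketched in a few lines. This is only an illustration: the function name `grade_label` and the example scores are hypothetical and not taken from the dataset.

```python
def grade_label(gleason_score):
    """Return 'high-grade' for a Gleason score above 7, else 'low-grade'."""
    return "high-grade" if gleason_score > 7 else "low-grade"


# Hypothetical Gleason scores, used only to exercise the rule.
scores = [6, 7, 8, 9]
labels = [grade_label(s) for s in scores]
print(labels)
```

Note that under this rule a score of exactly 7 falls into the low-grade class.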
#11 Before any text can be used for NLP tasks, it needs to be thoroughly preprocessed. The PDF reports were converted into machine-readable text using a freely available Java-based optical character recognition tool.
#12 This conversion adds considerable noise to the already noisy unstructured reports, which then require denoising. Trails of special characters were removed automatically using heuristics, and ASCII null characters were removed manually. Next, predefined stop words were removed automatically using the NLTK stop-word corpus, along with some corpus-specific stop words. Unnecessary punctuation was also removed.
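A minimal sketch of such a denoising pass is shown below. It is not the pipeline used in the talk: the small `STOP_WORDS` set stands in for the NLTK stop-word corpus and the corpus-specific terms (which the notes do not list), and keeping `+` and `=` so that Gleason patterns such as `4+5=9` survive is an assumption.

```python
import re
from typing import List

# Stand-in stop-word list (assumption); the notes use the NLTK stop-word
# corpus plus corpus-specific terms that are not given in the source.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "with"}


def denoise(text: str) -> List[str]:
    """Remove null characters, special characters, punctuation and stop words."""
    text = text.replace("\x00", " ")  # drop ASCII null characters
    # Replace anything that is not alphanumeric, '+', '=' or whitespace.
    text = re.sub(r"[^A-Za-z0-9+=\s]", " ", text)
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


sample = "The tumor is ~~~### Gleason 4+5=9,\x00 with perineural invasion."
tokens = denoise(sample)
print(tokens)
```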
#13 Before any classifier can be trained, the natural language corpus must be represented in a numerical, machine-readable format. We used count-based vectors that represent text by word counts, specifically tf-idf, which additionally weights each term in a document. Tf-idf up-weights informative words and down-weights filler stop words. It is the term frequency (the number of occurrences of term i in document j) multiplied by the inverse document frequency (the logarithm of the number of documents in the corpus divided by the number of documents that contain term i). This way, words that appear in nearly every document, such as articles and common verbs, get down-weighted.
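To make the weighting concrete, here is a small sketch of the textbook form, tf multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries commonly use smoothed or normalized variants, so the exact numbers may differ from what was used in the talk; the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy tokenized documents (hypothetical).
docs = [
    ["the", "gleason", "score", "tumor"],
    ["the", "margin", "negative"],
    ["the", "gleason", "grade"],
]
w = tf_idf(docs)
print(w[0])
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "tumor", which occurs in only one document, receives the largest weight in the first document.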
#15 To capture semantic information from a document and to reduce vector dimensionality and sparsity, we used two variants of paragraph vectors, which are themselves an extension of word vectors: the "distributed memory" model (PV-DM) and the "distributed bag-of-words" model (PV-DBOW).
#16 The distributed memory model uses the document identifier and the surrounding terms to predict a target term. The training task is unsupervised and uses a standard encoder-decoder model with an added document memory vector. The distributed bag-of-words model uses the document identifier as input to predict words randomly sampled from the document. It is also trained in an unsupervised manner. These representations retain semantic information about the encoded document.
#17 Getting back to the approach: once the reports are preprocessed and denoised, the corpus of clean reports is split into training and test sets.
#18 To tackle the class imbalance problem, text augmentation by back-translation was used to over-sample the minority class.
#19 In back-translation, a document in the source language is first translated into a target language of choice and then translated back into the source language. This introduces slight variation into the augmented reports. We used German as the target language because English and German are both West Germanic languages.
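The over-sampling step can be sketched independently of any translation service. In the sketch below, `augment` is a hypothetical callable that would perform the English to German to English round trip; `fake_back_translate` is only a placeholder so the example runs, and the binary-label setup and function names are assumptions.

```python
import random


def oversample_minority(texts, labels, augment, seed=0):
    """Add augmented minority-class texts until both classes have equal size."""
    rng = random.Random(seed)
    counts = {lab: labels.count(lab) for lab in set(labels)}
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    pool = [t for t, lab in zip(texts, labels) if lab == minority]
    # Each new sample is an augmented copy of a randomly chosen minority text.
    new_texts = [augment(rng.choice(pool)) for _ in range(deficit)]
    return texts + new_texts, labels + [minority] * deficit


def fake_back_translate(text):
    # Placeholder (assumption): a real pipeline would translate the text to
    # German and back to English with a machine translation system.
    return text + " (paraphrased)"


texts = ["report a", "report b", "report c", "report d", "report e"]
labels = ["low", "low", "low", "high", "high"]
aug_texts, aug_labels = oversample_minority(texts, labels, fake_back_translate)
print(aug_labels.count("low"), aug_labels.count("high"))
```

Augmentation of this kind is normally applied to the training split only, which matches the notes' reference to an over-sampled training corpus.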
#20 Then the previously described vectors were extracted from the denoised, over-sampled training corpus.
#21 These vectorized documents were used to train three machine learning classifiers for binary classification of the reports into high-grade vs. low-grade.
#22 The best-performing model-vector combination was then used to classify samples from the test set as high-grade or low-grade.
#23 The tf-idf vector representation with logistic regression gave mediocre performance, not crossing 70% on any of the metrics.
#31 Denoising and oversampling thus brought a substantial performance improvement to the best-performing model-vector combination.
#32 Overall, the semantic PV-DBOW vectors consistently and considerably outperformed the count-based tf-idf vectors, by 23% in ROC AUC score.
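For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen high-grade report receives a higher score than a randomly chosen low-grade one. The sketch below computes it directly from that definition; the labels and scores are made up, and in practice a library routine would normally be used.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = high-grade) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```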
#34 We inspected the best-performing model-vector combination using LIME to check whether it had actually learnt pathology-relevant features. In one inspected high-grade sample, the model picks up strong diagnostic clues from the report.
#35 In the next inspected sample, a low-grade one, the model identifies very strong clues that the report is low-grade, such as the Gleason score information and the histologic grade.
#36 But the model also falsely predicts a high-grade report as low-grade by picking up on irrelevant clues.