This document summarizes a study that mined biomedical literature to build a large multimodal dataset of rare cancer studies. The researchers harvested over 15,000 images and the corresponding journal articles related to rare cancers from public literature databases, and used both visual and textual classification approaches to identify images of humans, neoplastic tissue, and rare cancers. The textual approach, using TF-IDF features and SVMs, outperformed the visual CNN classifiers on all tasks. The result is the first dataset aimed at automatically extracting rare cancer images, helping to address the challenges of researching these less prevalent cancers.
1. Exploiting biomedical literature to mine out a large multimodal dataset of rare cancer studies
Anjani K. Dhrangadhariya et al.
MedGIFT group
University of Applied Sciences Western Switzerland (HES-SO)
Project supported by the European Union Horizon 2020 programme, grant agreement 825292
SPIE Medical Imaging 2020, 16.02.2020
2. Motivation
> Rare cancers: fewer than 15 cases per 100,000 people per year
> Account for 25% of cancer-related deaths
> Lower prevalence = fewer patients
> Fewer tumor samples for research
> Lack of robust clinical models
Puca, Loredana, et al. "Patient derived organoids to model rare prostate cancer phenotypes." Nature Communications 9.1 (2018): 1-10.
3. Data resource
• Challenges
1) Private datasets
2) Limited size
3) Single center / scanner
4) Small variability
5) Some contain only images / only text
6) No annotations, or only small manually annotated subsets
7) Difficult to compare results
6. Medical Subject Headings (MeSH)
• Hierarchically organized controlled vocabulary
• Used for cataloguing biomedical information
• 16 thematic categories (A = Anatomy, B = Organisms, …)
• Each term has a unique MeSH identifier
[Diagram: an example MeSH term paired with its MeSH tree code]
Lipscomb, Carolyn E. "Medical subject headings (MeSH)." Bulletin of the Medical Library Association 88.3 (2000): 265.
7. MeSH as annotation
• Manually annotated by National Library of Medicine (NLM) staff
• E.g., all studies about benign cancer are indexed under the MeSH annotation "Neoplasm"
• Serves as ground-truth annotation
• Not all PMC / PMC-OA records have MeSH annotations
8. Visual classification
• ImageCLEF medical image annotation challenge (running since 2013)
• A small annotated subset of PMC-OA (2,000 images) is used to train CNNs
• Images are classified into 31 modalities: PET, light microscopy, CT, etc.
• 90% accuracy; the state of the art stops at this superficial modality classification
Andrearczyk and Müller, "Deep Multimodal Classification of Image Types in Biomedical Journal Figures", CLEF 2018
13. Textual approach
[Diagram: titles and abstracts of MeSH-annotated records (classes MeSH_0 vs. MeSH_1) feed model training and evaluation; the best-performing model then classifies the titles and abstracts of records with no MeSH annotations.]
14. Pipeline
1) PMC-OA: all images
2) Getting DLMI images
3) Getting "human" images
4) Getting "neoplastic" images
5) Getting "rare cancer" images
At each curation step, the textual (title + abstract) approach is compared against the visual one, with MeSH terms serving as ground truth; see the sketch after this list.
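To make the sequential nature of the curation concrete, here is a minimal Python sketch of the five-step chain. The record fields and predicates are hypothetical placeholders, not the authors' actual code; the real predicates are the classifiers described on the following slides.

```python
# Hypothetical sketch of the five-step curation pipeline: each step keeps
# only the records whose predicate holds before handing off to the next.
def curate(records, steps):
    for label, keep in steps:
        records = [r for r in records if keep(r)]
        print(f"after '{label}' filter: {len(records)} records")
    return records

# Toy records; real records carry an image, title + abstract, and MeSH terms.
records = [{"modality": "DLMI", "human": True, "neoplastic": True, "rare": True},
           {"modality": "CT",   "human": True, "neoplastic": False, "rare": False}]

steps = [("DLMI",        lambda r: r["modality"] == "DLMI"),  # step 2
         ("human",       lambda r: r["human"]),               # step 3
         ("neoplastic",  lambda r: r["neoplastic"]),          # step 4
         ("rare cancer", lambda r: r["rare"])]                # step 5

curate(records, steps)
```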
15. Getting "human" images
[Diagram: the titles, abstracts, and MeSH sets of DLMI records are split into "human" and "not human" according to the rule below.]
Labeling rule, applied to each record's MeSH set:
• human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∈ {MeSH} and other B01 codes ∉ {MeSH}
• not human ⇔ B01.050.150.900.649.313.988.400.112.400.400 ∉ {MeSH}
Model training and evaluation (80% training set / 20% test set):
• Classifiers: 1) logistic regression, 2) support vector machine, 3) k-nearest neighbor
• Features: 1) tf-idf, 2) word vectors, 3) paragraph vectors
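A minimal sketch of this MeSH-tree-code rule in Python, assuming each record's annotations are available as a set of MeSH tree-code strings. Records carrying the Humans code plus other B01 (Organisms) codes fall outside both definitions above; here they are returned as None, which is this sketch's interpretation, not something the slides specify.

```python
HUMANS_CODE = "B01.050.150.900.649.313.988.400.112.400.400"  # MeSH tree code for "Humans"

def label_human(mesh_codes):
    """Label a record from its set of MeSH tree codes.

    "human"     iff the Humans code is present and no other B01 codes are;
    "not human" iff the Humans code is absent;
    None        for mixed-organism records (outside both definitions).
    """
    b01_codes = {c for c in mesh_codes if c.startswith("B01")}
    if HUMANS_CODE not in b01_codes:
        return "not human"
    if b01_codes == {HUMANS_CODE}:
        return "human"
    return None  # Humans plus other organisms: excluded from the labeled set

print(label_human({HUMANS_CODE, "C04.557"}))  # -> "human"
print(label_human({"B01.050.150"}))           # -> "not human"
```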
16. Getting "human" images
• Best-performing model, hyper-parameters and vectors: SVM with tf-idf bigrams
• This setup was used to classify the un-annotated (no-MeSH) DLMI records into "human" and "not human"
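A minimal scikit-learn sketch of the winning textual setup: tf-idf over unigrams and bigrams feeding an SVM, with the slide's 80/20 train/test split. The placeholder texts, the LinearSVC variant, and all hyper-parameters are assumptions; the slides do not specify them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder "title + abstract" strings with labels from the MeSH rule on
# the previous slide; the real study used thousands of PMC-OA records.
texts = ["human sarcoma case report", "murine xenograft tumor model",
         "patient cohort with rare carcinoma", "zebrafish melanoma screen",
         "clinical follow-up of human chordoma", "canine mammary tumor study"]
labels = ["human", "not human", "human", "not human", "human", "not human"]

# 80% training set / 20% test set, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels)

# Tf-idf over unigrams + bigrams feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

The fitted model can then be applied to the no-MeSH records, mirroring how the best-performing setup was used to label the un-annotated DLMI set.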
17. Pipeline (recap)
18. Getting "neoplastic" images
Labeling rule, applied to the "human" records:
• neoplastic ⇔ C04 ∈ {MeSH}
• not neoplastic ⇔ C04 ∉ {MeSH}
Model training and evaluation (80% training set / 20% test set):
• Classifiers: 1) logistic regression, 2) support vector machine, 3) k-nearest neighbor
• Features: 1) tf-idf, 2) word vectors, 3) paragraph vectors
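The "neoplastic" rule is simpler than the "human" one: every neoplasm term lives under the C04 (Neoplasms) branch of the MeSH tree, so a prefix check suffices. A one-function sketch, assuming the same set-of-tree-codes representation as before (reading "C04 ∈ {MeSH}" as "any code under C04"):

```python
def label_neoplastic(mesh_codes):
    """'neoplastic' iff any MeSH tree code is C04 or a C04 descendant."""
    if any(c == "C04" or c.startswith("C04.") for c in mesh_codes):
        return "neoplastic"
    return "not neoplastic"

print(label_neoplastic({"C04.557.450", "B01.050"}))  # -> "neoplastic"
print(label_neoplastic({"B01.050"}))                 # -> "not neoplastic"
```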
19. Getting "non-neoplastic" images
• Best-performing model, hyper-parameters and vectors: SVM with tf-idf bigrams
• This setup was used to classify the un-annotated (no-MeSH) "human" records into "neoplastic" and "not neoplastic"
20. Pipeline (recap)
21. Getting "rare cancer" images
• No MeSH terms exist for a "rare cancer" class
• A set of {rare cancer} terms is available from the National Center for Advancing Translational Sciences (NCATS)
https://rarediseases.info.nih.gov/diseases/diseases-by-category/1
Labeling rule, applied to the "neoplastic" records:
• rare cancer ⇔ (Title + Abstract) ∩ {rare cancer terms} ≠ Ø
• non-rare cancer otherwise
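A minimal sketch of this rule: a record is "rare cancer" iff its title + abstract mentions at least one NCATS rare-cancer term. The example terms below are a hypothetical sample of the NCATS list, and the case-insensitive substring matching is an assumption; the slides do not specify the matching details.

```python
# Illustrative subset of the NCATS rare-cancer terms (hypothetical sample;
# the full list: https://rarediseases.info.nih.gov/diseases/diseases-by-category/1).
RARE_CANCER_TERMS = {"chordoma", "adrenocortical carcinoma", "merkel cell carcinoma"}

def label_rare_cancer(title_abstract):
    """'rare cancer' iff (title + abstract) ∩ {rare cancer terms} != empty set."""
    text = title_abstract.lower()
    if any(term in text for term in RARE_CANCER_TERMS):
        return "rare cancer"
    return "non-rare cancer"

print(label_rare_cancer("A case of sacral chordoma in an adult patient"))  # -> "rare cancer"
print(label_rare_cancer("Breast cancer screening cohort"))                 # -> "non-rare cancer"
```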
22. Visual: "rare cancer"
Model training and evaluation:
• VGG19 with pretrained ImageNet weights
• Fine-tuned with and without image augmentation
• Classes: "rare cancer" vs. "non-rare cancer" (labels from the previous text classification step)
23. Visual: "rare cancer" (continued)
• The fine-tuned model was then used to classify unlabeled images into "rare cancer" and "non-rare cancer"
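A minimal Keras sketch (assuming TensorFlow 2.x) of the visual setup: VGG19 with pretrained ImageNet weights and a new binary head, with mirroring and random cropping as augmentation (the augmentations named in the speaker notes). The input size, head architecture, and optimizer are assumptions, since the slides do not specify them.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

# VGG19 convolutional base with pretrained ImageNet weights, frozen so only
# the new classification head is trained.
base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),      # images loaded slightly larger than VGG19's input
    layers.RandomFlip("horizontal"),        # mirroring (augmentation)
    layers.RandomCrop(224, 224),            # random crop to VGG19's 224x224 input size
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # binary: rare cancer vs. non-rare cancer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_ds, validation_data=val_ds)  # labeled image datasets not shown here
```

For the "without augmentation" run, the RandomFlip and RandomCrop layers would simply be dropped (with a plain resize to 224x224 instead).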
24. Results
"human" vs. "non-human" classification

Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.69      | 0.71   | 0.68
Textual   | SVM        | Tf-idf trigrams        | 0.89      | 0.90   | 0.90

25. Results (continued)
"neoplastic" vs. "non-neoplastic" classification

Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.68      | 0.65   | 0.64
Textual   | SVM        | Tf-idf bigrams         | 0.99      | 0.99   | 0.99

26. Results (continued)
"rare cancer" vs. "non-rare cancer" classification

Data type | Classifier | Feature                | Precision | Recall | F1-score
Visual    | VGG19      | With data augmentation | 0.62      | 0.77   | 0.69
27. Discussion: Textual vs. Visual
Textual approach
• Outperformed the visual approach for all tasks
• Tf-idf n-grams with an SVM performed best for both textual tasks
Visual approach
• Correctly classified some "human" test instances, with a recall of 0.71
• Performed worse on "neoplastic" identification
• "rare cancer" classification reached a recall of 0.77
28. Conclusion
• First study targeting automatic rare cancer image extraction
• The approach relies on visual deep learning and textual NLP
• Yields 15,028 diagnostic light microscopy (DLMI), human, rare cancer images plus the corresponding journal articles
[Diagram: the five-step pipeline, from PMC-OA (all data) through DLMI, "human", "neoplastic", and "rare cancer" filtering]
29. Thank you for your attention
More information:
http://medgift.hevs.ch
Contact:
anjani.dhrangadhariya@hevs.ch
Follow us:
https://twitter.com/MedGIFT_group
Editor's Notes
How are the biomedical publications stored in Medline represented in PubMed?
A PubMed record consists of a title and abstract, followed by the publication images shown as thumbnails, and a list of Medical Subject Headings (MeSH) annotations that act like keywords describing the publication.
All this text, the images, and the MeSH terms are strung together by the unique PubMed identifier (PMID). You can also notice a PMCID, the unique PubMed Central identifier, which links to the full text of the publication.
The images, text, and MeSH terms thus have a one-to-one association with each other.
PubMed records are manually annotated with MeSH terms by staff at the NLM.
What is the significance of attaching MeSH terms to a PubMed record?
MeSH annotation enforces uniformity and consistency across the terminology: all articles about benign cancer are indexed under the MeSH term "Neoplasm", and all studies involving patients are annotated with the MeSH term "Humans".
MeSH terms can therefore be considered gold-standard, or ground-truth, annotations for a publication.
Not all publications in PubMed have these manually attached MeSH terms.
Have these PMC-OA images been used elsewhere for image analysis?
Yes, an annotated subset of PMC-OA has already been used in the ImageCLEF medical image annotation challenge, a public challenge that has taken place since 2013.
This small annotated subset of 2,000 images was used to train CNNs to classify images into 31 modality classes, including PET, CT, light microscopy, et cetera.
This classification approach achieved an overall 90% accuracy for modality classification.
However, the approach only goes as far as this superficial modality classification task.
What about going beyond generic modality classification into more specialized image sets?
So what we did to navigate towards rare cancer sets was this:
Take all the PMC-OA images and classify them with the ImageCLEF setup into the 31 modality types.
Retain all the images classified as DLMI, or diagnostic light microscopy images.
We focus only on DLMI images because they are fundamental to rare cancer diagnostics.
All the retained DLMI images are linked to their respective titles, abstracts, and MeSH annotations, where available.
With this multimodal annotated dataset in hand, we propose an approach for sequential curation of article abstracts and images using MeSH terms, to eventually mine out a large multimodal set of rare cancer images and full texts.
This involves three successive binary classification tasks: we first filter the "human" from the "non-human" set, then separate the "neoplastic" from the "non-neoplastic" set, and finally separate "rare cancer" from "non-rare cancer".
Note that at each binary classification step we compare the visual and textual approaches separately and use MeSH terms as the ground-truth labels for the datasets.
For the visual classification tasks, images with two different MeSH classes were used to train and evaluate a VGG19 model with pretrained ImageNet weights, fine-tuned with and without image augmentation.
Data augmentation: image mirroring and cropping.
Why do we use VGG?
These fine-tuned models were then used to classify unlabeled images into their respective classes.
Let's get back to the pipeline, to further curate the previously retrieved DLMI dataset.
"Human" records were first filtered out from "non-human" records in the following way.
The best-performing model setup was used to classify the un-annotated DLMI records into "human" and "non-human".
Then "neoplastic", or tumor-related, records were separated from "non-neoplastic" records in a similar manner.
The best-performing model setup was used to classify the un-annotated records into "neoplasm" and "non-neoplasm".
That covered the annotated text dataset; the annotated image dataset was similarly classified using the VGG19 setup.
Finally, we separate the rare cancer dataset from the non-rare cancer dataset.
Unfortunately, there are no MeSH terms pertaining to "rare cancer", so we used a predefined set of rare cancer terms available from NCATS.
All the records recognized as "neoplasm" were retained and labeled "rare cancer" only if a rare cancer term from the NCATS set was present in the title and abstract.
After obtaining the "rare cancer" and "non-rare cancer" labels for images from the previous text classification, we used them to train and evaluate a VGG19 model for this binary classification task.
For the "human" classification task, the textual approach performed far better than the visual approach.
However, a recall of 0.71 hints that the visual classification model does learn something about retaining human images.
For the neoplasm classification task too, the textual approach performed better than the visual one; the visual approach did not produce good results for this task.
For the final task, a recall of 0.77 hints that the VGG19 model did learn something about retaining the "rare cancer" images, but there is much room for improvement.