Studying Public Medical Images from the Open Access
Literature and Social Networks for Model Training and
Knowledge Extraction
Vincent Andrearczyk
HES-SO, Switzerland
MMM 2020, 08.01.2020
Henning Müller, Vincent Andrearczyk, Oscar Jimenez, Anjani Dhrangadhariya,
Roger Schaer, and Manfredo Atzori
Motivation
• Deep learning has been a driving force for
improving many applications of image analysis
• Complex networks require large amounts of
training data
- Data diversity is important for generalizability
• Most medical data sets have strong class
imbalances (rare diseases)
- Rare diseases require data from multiple centers,
which makes the organization complex
• Many resources that include images have become
available in the past few years
- PubMed Central, TCIA, social networks, etc.
Objectives of this article
• Summarize existing approaches that harvest
public data
– Focusing on PubMed Central and social networks
• Highlight advantages and difficulties in exploiting
the data
– (+) Very diverse data
– (+) Rare cases are over-represented
– (-) Much preprocessing and filtering is required
• Outline the next steps required to fully use the data
PubMed Central
• Repository with the biomedical open access
literature, including images as files, etc.
– 3-4 images per article,
– Increasing number of articles
Methodology for finding articles
• Analysis of the ImageCLEF tasks and of work done
on these tasks using ImageCLEF data
– Over the past 12 years
– Data filtering steps were derived from this analysis
• Use of Google Scholar to add references
– Terms “medical image classification”, “publicly
accessible resources”, “medical literature”,
“machine learning” were combined
• Dynamically growing data sets were favored
• Journal papers were preferred over
conference publications
Image retrieval
• Allows searching for images with text
– Or semantic terms such as UMLS or MeSH
• Content-based image retrieval
Demner-Fushman, et al. (2012), Journal of Computing Science and Engineering
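As a minimal sketch of text-based image retrieval, the following builds an inverted index over figure captions and returns the figures whose caption contains every query term. The captions, figure ids, and function names are hypothetical illustrations, not the system described by Demner-Fushman et al.; a real system would add stemming, semantic terms (UMLS, MeSH), and ranking.

```python
from collections import defaultdict

PUNCTUATION = ".,;:()"

def build_index(captions):
    """Map each lowercase caption token to the set of figure ids containing it."""
    index = defaultdict(set)
    for fig_id, caption in captions.items():
        for token in caption.lower().split():
            index[token.strip(PUNCTUATION)].add(fig_id)
    return index

def search(index, query):
    """Return the sorted ids of figures whose caption contains every query term."""
    terms = [t.strip(PUNCTUATION) for t in query.lower().split()]
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

# Hypothetical captions, for illustration only.
captions = {
    "fig1": "Chest CT showing a lung nodule.",
    "fig2": "H&E stained section of lung tissue.",
    "fig3": "Flowchart of patient selection.",
}
index = build_index(captions)
print(search(index, "lung"))
print(search(index, "lung CT"))
```

Requiring all terms to match keeps the example simple; returning partial matches with a score would be the usual next step.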
Structuring the visual content
• Define types of images to make the literature
images classifiable
– Extremely large variety in most categories
– Many sub-categories are possible
– Categories with clinical relevance
are most important
– Allows removing noise
– Compound figures
are treated separately
[ImageCLEF 2013]
Challenges in the data
• Look-alikes
– Much strange content that needs to be removed
• Compound figures cannot easily be classified,
as they may contain aspects of several classes
– Cutting them into subfigures makes content
accessible
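One simple way to cut a compound figure into subfigures is to look for white gutters between panels. The sketch below is an assumption for illustration, not the method used in the cited work: it treats the image as a list of grayscale rows and splits at runs of all-white columns. The thresholds `white` and `min_gap` are hypothetical defaults, and real compound figures (no gutters, dark backgrounds, nested layouts) need more robust techniques.

```python
def split_on_white_columns(image, white=250, min_gap=2):
    """Split a grayscale image (list of equal-length rows, values 0-255)
    into (start, end) column ranges, end exclusive, separated by runs of
    at least `min_gap` all-white columns."""
    width = len(image[0])
    # A column is a gutter candidate if every pixel in it is near-white.
    is_white = [all(row[x] >= white for row in image) for x in range(width)]

    panels = []
    start = None  # first column of the current panel, if one is open
    gap = 0       # length of the current run of white columns
    for x in range(width):
        if is_white[x]:
            gap += 1
            # Close the panel once the white run is wide enough to be a gutter.
            if start is not None and gap == min_gap:
                panels.append((start, x - min_gap + 1))
                start = None
        else:
            if start is None:
                start = x
            gap = 0
    if start is not None:
        # Trim trailing white columns that were too few to form a gutter.
        panels.append((start, width - gap))
    return panels

# Two dark panels (columns 0-3 and 6-9) separated by a two-column white gutter.
row = [0, 0, 0, 0, 255, 255, 0, 0, 0, 0]
panels = split_on_white_columns([row, row])
print(panels)
```

The same function applied to the transposed image would find horizontal gutters, so a grid layout can be handled by splitting in one direction and then the other.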
Metadata available for PMC
• Text of the figure caption
– Relatively specific but often short
– Hard for compound figures that contain many parts
• Full text of the article
– Not specific to individual figures
– Location of the figure is available
• Article title and author-generated keywords
• Global MeSH terms (manually attached)
– Cover species and organs
• Not all metadata are available for every article (incomplete)
Tasks to make figures accessible
• Remove very small images and images with extreme
aspect ratios
• Classify figures into figure types
– Using image data and also text
– Remove non-relevant images, e.g. flowcharts
• Detect and cut compound figures into their parts
– Classify these into figure types again
• Filter human and animal tissue
• Filter specific organs of interest
• Find diseases or grading/staging
– Ground truth classes for machine learning
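The first task in the list above, removing very small images and extreme aspect ratios, can be sketched as a simple size check. The thresholds `min_side` and `max_aspect` and the file names are hypothetical values chosen for illustration; the source does not state which limits were used.

```python
def keep_image(width, height, min_side=100, max_aspect=4.0):
    """Return True if an image is large enough and not extremely elongated."""
    shorter, longer = min(width, height), max(width, height)
    if shorter < min_side:
        return False  # too small to carry useful visual content
    return longer / shorter <= max_aspect

# Hypothetical (width, height) pairs in pixels, for illustration only.
sizes = {
    "a.jpg": (512, 512),
    "b.jpg": (40, 40),      # too small
    "c.jpg": (1200, 120),   # extreme aspect ratio (10:1)
    "d.jpg": (300, 800),
}
kept = sorted(name for name, (w, h) in sizes.items() if keep_image(w, h))
print(kept)
```

Running this cheap check first reduces the number of images passed to the more expensive classification and compound-figure steps.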
Advantages of literature images
• Rare cases are often published in articles and
case descriptions
– Mostly extreme cases, shared to spread knowledge
about them
– Creates critical mass for rare diseases
• Images are from many laboratories and thus
contain many image variations
– Increases generalizability of learned models
• Exponentially increasing content
Problems with filtered images
• Many images might be missed by automatic
filtering
• Ground truth is not always reliable
• Images might not have clinical quality
– Grey level resolution
– No information on level/window setting
– Cropped images, arrows in images, other overlays
• Size of the images is often small for publications
• Scale of images is not known (can be detected)
Otalora et al. (2018) MICCAI 2018
An example of Twitter images
• Images and information posted by pathologists on
Twitter
• Create dataset of histopathology images
• Train machine learning algorithms
– Identify stains (H&E, IHC ...)
– Discriminate between different tissues
– Predict malignant tumors
• Limitations:
– Good results (AUROC 0.9) only for simple tasks: H&E
vs rest
Schaumberg et al. (2018), bioRxiv
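For readers unfamiliar with the AUROC figure quoted above, the following sketch computes it from binary labels and classifier scores by comparing every positive example with every negative one. The labels and scores are made-up illustration values, not data from Schaumberg et al.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example scores
    higher than a randomly chosen negative one, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = H&E, 0 = any other stain.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
value = auroc(labels, scores)
print(round(value, 3))
```

The pairwise form is quadratic in the number of examples; for large data sets a rank-based implementation is the usual choice.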
Next steps
• Quickly increasing content offers many possibilities
– Automatic pipelines need update mechanisms
that account for the latest imaging equipment
– Community efforts for data curation
• Distribute the class labels with confidence scores
via PMC
• Evaluate impact on machine learning tasks of
adding such diverse sources
Next steps
• We have been working on it!
– Mined 32,486 light microscopy images of rare
human cancers, Dhrangadhariya et al. (2020), SPIE 2020
– Automatic generalizable filtering pipeline
In preparation: Jimenez et al. (2020) Journal of the American Medical Informatics Association
– Benefits in deep learning clinical tasks … to come
Conclusions
• Images from public resources are complementary to
clinical images for machine learning
– Rare cases, much diversity
– Very large amount of data
• How can we obtain high-quality annotations with
limited effort (for example via active learning)?
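As one possible reading of the active learning idea mentioned above, the sketch below picks the images a classifier is least sure about, so that annotators label those first. It uses the least-confidence criterion, which is only one of several selection strategies; the image ids and probabilities are hypothetical.

```python
def least_confident(probabilities, budget):
    """Return the ids of the `budget` items whose top class probability is lowest."""
    # Uncertainty is 1 minus the highest predicted class probability.
    uncertainty = {item: 1.0 - max(probs) for item, probs in probabilities.items()}
    # Sort by decreasing uncertainty; break ties by id for a stable result.
    ranked = sorted(uncertainty, key=lambda item: (-uncertainty[item], item))
    return ranked[:budget]

# Hypothetical class probabilities predicted for unlabeled images.
predictions = {
    "img_a": [0.98, 0.01, 0.01],
    "img_b": [0.40, 0.35, 0.25],
    "img_c": [0.70, 0.20, 0.10],
    "img_d": [0.55, 0.44, 0.01],
}
to_annotate = least_confident(predictions, budget=2)
print(to_annotate)
```

In a full loop the newly annotated images would be added to the training set, the model retrained, and the selection repeated until the annotation budget is spent.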
Contact
• More information can be found at
– http://medgift.hevs.ch/
– http://publications.hevs.ch
• Contact:
– vincent.andrearczyk@hevs.ch
– henning.mueller@hevs.ch
