With the advent of convolutional neural networks in recent years, machine learning is becoming increasingly accessible to researchers from non-computer-science backgrounds. Furthermore, whole-slide imaging (WSI) pathology is gaining acceptance in both research and clinical practice, allowing more biomedical researchers to apply modern machine learning tools to their data. However, preparing data for machine learning remains a crucial step that requires hands-on expertise. In this presentation we show how to use open-source Python tools to load and transform digital slides and their annotations. As an example we use a set of annotated kidney pathology WSIs. We demonstrate how to load slides and annotations and how to save images suitable for a machine learning experiment.
DIGITALPATHOLOGYASSOCIATION.ORG #PATHVISION github.com/DSLituiev/slideslicer
Educational Goals
[know and reason about] categories of machine learning and computer vision tasks applied to digital pathology
[be able to] choose which format to use depending on your task
[be able to] prepare digital slides for classification, segmentation, and object detection using open-source tools in Python
Using NLP to mine pathology labels
Dx:
Right breast, core needle biopsy:
1. Focal atypical lobular hyperplasia.
2. Hyalinized fibroadenoma with associated microcalcifications.
Set of labels:
fibroadenoma
atypical lobular hyperplasia
calcifications
DCIS
LCIS
invasive breast cancer
Tools:
• tokenizers
• medical ontologies (e.g. UMLS)
• approaches: bag-of-words, n-grams, sequential
• classifiers: FastText, GBM, SVM, Logistic Regression
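A minimal sketch of the simplest variant of this idea: tokenize the report, build word n-grams, and look them up in a phrase table. The phrase table, function names, and matching rule are assumptions made for illustration; a real pipeline would draw synonyms from a medical ontology and would likely replace the lookup with a trained classifier.

```python
import re

# Hypothetical phrase table: each label maps to phrases that imply it.
# Hand-written here for illustration only.
LABEL_PHRASES = {
    "fibroadenoma": ["fibroadenoma"],
    "atypical lobular hyperplasia": ["atypical lobular hyperplasia"],
    "calcifications": ["calcification", "calcifications",
                       "microcalcification", "microcalcifications"],
    "DCIS": ["dcis", "ductal carcinoma in situ"],
    "LCIS": ["lcis", "lobular carcinoma in situ"],
    "invasive breast cancer": ["invasive ductal carcinoma",
                               "invasive lobular carcinoma",
                               "invasive carcinoma"],
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, max_n=4):
    """Return the set of all word n-grams up to max_n words, space-joined."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def extract_labels(report):
    """Return every label for which at least one phrase occurs in the report."""
    grams = ngrams(tokenize(report))
    return [label for label, phrases in LABEL_PHRASES.items()
            if any(phrase in grams for phrase in phrases)]


report = ("Right breast, core needle biopsy: "
          "1. Focal atypical lobular hyperplasia. "
          "2. Hyalinized fibroadenoma with associated microcalcifications.")
labels = extract_labels(report)
print(labels)
```

This lookup ignores negation ("no DCIS identified" would still yield DCIS), which is one reason to move on to the sequential approaches and classifiers listed above.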
Computer Vision Tasks in Digital Pathology
Classification: assign a categorical label to each image (e.g. acute rejection vs. normal)
Regression: assign a numeric value to each image (e.g. EGFR: 70)
One label per slide:
• Whole-slide images don't fit into the memory of a typical GPU (as of 2018), so the image needs to be fed in small patches
• Signal predictive of the target is often concentrated in small areas
• Requires weak supervision techniques to infer which patch the signal is coming from
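One common weak-supervision idea is multiple-instance learning: score every patch, then pool the patch scores into a single slide-level score. The sketch below shows only the pooling step with made-up patch probabilities. The function names and the top-k averaging rule are assumptions for illustration, not a method prescribed in this talk.

```python
def slide_score(patch_probs, top_k=1):
    """Pool per-patch probabilities into one slide-level score.

    Averages the top_k highest patch probabilities; top_k=1 is max pooling.
    """
    if not patch_probs:
        raise ValueError("need at least one patch")
    top = sorted(patch_probs, reverse=True)[:top_k]
    return sum(top) / len(top)


def most_suspicious_patch(patch_probs):
    """Index of the patch that contributes most under max pooling."""
    return max(range(len(patch_probs)), key=patch_probs.__getitem__)


# Made-up outputs of a patch-level model for one slide
patch_probs = [0.02, 0.05, 0.91, 0.10]
print(slide_score(patch_probs))            # slide-level score (max pooling)
print(most_suspicious_patch(patch_probs))  # which patch drives the score
```

During training, the slide label is compared with the pooled score, so only the highest-scoring patches receive a strong learning signal.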
Segmentation, object detection: detect structural elements
Potentially multiple contours per slide
Why do we need to preprocess slides?
Whole-slide images don't fit into GPU memory (as of 2018)
Slide images have to be chunked into smaller pieces
Image annotations (contours) have to be sliced in the same way
Whole-slide images are very sparse (tissue occupies only 5–10% of the slide for a needle biopsy)
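The chunking and sparsity points can be sketched in plain Python: enumerate a grid of tile corners, then keep only tiles that contain enough tissue. The function names, the toy mask, and the 50% threshold are assumptions for illustration; in practice the mask would come from thresholding a low-resolution thumbnail of the slide.

```python
def tile_origins(width, height, tile, stride=None):
    """Yield (x, y) upper-left corners of tiles lying fully inside the image."""
    stride = stride or tile
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            yield x, y


def tissue_fraction(mask, x, y, tile):
    """Fraction of truthy (tissue) pixels within one tile of a 2D mask."""
    rows = mask[y:y + tile]
    n_tissue = sum(sum(1 for v in row[x:x + tile] if v) for row in rows)
    return n_tissue / (tile * tile)


def tissue_tiles(mask, tile, min_fraction=0.5):
    """Keep only tiles whose tissue fraction reaches min_fraction."""
    height, width = len(mask), len(mask[0])
    return [(x, y) for x, y in tile_origins(width, height, tile)
            if tissue_fraction(mask, x, y, tile) >= min_fraction]


# Toy 4 x 6 tissue mask (1 = tissue, 0 = background)
mask = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
kept = tissue_tiles(mask, tile=2)
print(kept)
```

Filtering on a cheap low-resolution mask first avoids reading and saving the large majority of tiles that contain only background.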
Step 1: Know what your ML model needs
Which file format? (PNG, JPEG, TIFF, etc.)
What dimensions? (399x399, 256x256, variable dimensions)
What file/folder structure?
• one image folder per class (classification)
• paired images and masks (segmentation)
• MS-COCO format (object detection)
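The first two folder layouts can be sketched as path-building helpers. The folder names (`images`, `masks`), the file-name pattern, and the function names are assumptions for illustration; check what your downstream loader expects before adopting them.

```python
from pathlib import PurePosixPath


def classification_path(root, label, slide_id, x, y, ext="png"):
    """One folder per class: <root>/<label>/<slide>_x<x>_y<y>.<ext>"""
    return PurePosixPath(root) / label / f"{slide_id}_x{x}_y{y}.{ext}"


def segmentation_paths(root, slide_id, x, y, ext="png"):
    """Paired image and mask: same file name in sibling folders."""
    name = f"{slide_id}_x{x}_y{y}.{ext}"
    base = PurePosixPath(root)
    return base / "images" / name, base / "masks" / name


cls_path = classification_path("data", "normal", "slide01", 2048, 4096)
img_path, mask_path = segmentation_paths("data", "slide01", 2048, 4096)
print(cls_path)
print(img_path, mask_path)
```

Encoding the slide ID and patch coordinates in the file name keeps every patch traceable to its source location, which helps when splitting training and validation sets by slide.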
Step 5: Read an arbitrary patch with ROIs
Read ROIs from XML
>>> rreader = RoiReader(xml_filename)
Read and visualize a patch
>>> points = sample_points(contour, 1000)
>>> xc, yc = points[0]
>>> fig, ax, region, rois = rreader.plot_patch(xc, yc, 1024, subsample=4)
Read an image patch
>>> region = rreader.read_patch(xc, yc, 1024, scale=4)
Read matching ROIs for the patch
>>> patch_rois = rreader.get_patch_rois(xc, yc, 1024, scale=4,
...                                     cocorle=True, translate=True)
1024: source patch size is 1024 x 1024 pix
scale=4: down-sample the patch by a factor of 4 (resulting size is 256 x 256 pix)
translate=True: translate coordinates so that the upper left corner is (x=0, y=0)
cocorle=True: produce an MS-COCO RLE encoding
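To make the translate and down-sample steps concrete, here is a minimal stand-alone sketch of the coordinate arithmetic. It does not call slideslicer and is not its implementation. The function name is hypothetical, and it assumes (x0, y0) is the upper-left corner of the patch in slide coordinates.

```python
def to_patch_coords(contour, x0, y0, scale=1):
    """Map slide-level (x, y) vertices into the frame of a down-sampled patch.

    Assumes (x0, y0) is the patch's upper-left corner in slide coordinates
    and scale is the down-sampling factor.
    """
    return [((x - x0) / scale, (y - y0) / scale) for x, y in contour]


# Made-up contour vertices in slide coordinates
contour = [(5000, 7000), (5400, 7000), (5400, 7200)]
local = to_patch_coords(contour, x0=5000, y0=7000, scale=4)
print(local)
```

The sketch does not clip contours that extend past the patch border; a full pipeline has to intersect each contour with the patch rectangle as well.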
Convert ROIs to an MS-COCO-formatted dictionary or JSON
>>> patch_rois.to_dict()
>>> patch_rois.to_json()
Formatting slides into MS-COCO format
Why MS-COCO format?
A standard dataset for object detection tasks with its associated standard format
A number of open-source tools accept MS-COCO format as input
{'annotations': [...],
 'images': [...],
 'type': [...],
 'categories': [...],
 'info': [...]}
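A toy example of what such a dictionary can look like, including an uncompressed RLE mask. RLE here follows the COCO convention: pixels are read column by column, and the counts alternate between runs of 0s and runs of 1s, starting with 0s. The category name, file name, and helper function are hypothetical, and the optional `type` key from the skeleton above is omitted.

```python
import json


def mask_to_rle(mask):
    """Encode a binary 2D mask as uncompressed COCO-style RLE.

    Pixels are read in column-major order; counts alternate between
    runs of 0s and runs of 1s, always starting with the 0-run.
    """
    height, width = len(mask), len(mask[0])
    counts, current, run = [], 0, 0
    for x in range(width):
        for y in range(height):
            value = 1 if mask[y][x] else 0
            if value != current:
                counts.append(run)
                current, run = value, 0
            run += 1
    counts.append(run)
    return {"size": [height, width], "counts": counts}


# Toy 2 x 3 mask; three foreground pixels
mask = [
    [0, 1, 1],
    [0, 1, 0],
]
rle = mask_to_rle(mask)

coco = {
    "info": {"description": "toy example"},
    "images": [{"id": 1, "file_name": "patch_0001.png",
                "width": 3, "height": 2}],
    "categories": [{"id": 1, "name": "glomerulus"}],  # hypothetical class
    "annotations": [{
        "id": 1, "image_id": 1, "category_id": 1,
        "segmentation": rle,
        "bbox": [1, 0, 2, 2],  # [x, y, width, height]
        "area": 3,             # number of foreground pixels
        "iscrowd": 1,          # in the COCO dataset, RLE masks go with iscrowd=1
    }],
}
coco_json = json.dumps(coco)
print(rle)
```

Annotations link to images and categories by integer ID, so one JSON file can describe every patch cut from a set of slides.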
Conclusion and Further Considerations
Check if you have labels already
If yes, find an automated way to extract them
If not, create your own
Know what format you need for your downstream application
Work on communication between your clinical and computational collaborators.
Know what matters and what is possible (sometimes you wouldn't dream of it!)
Take breaks and drink water