Preparing Pathology WSI data for Machine Learning Experiments

DIGITALPATHOLOGYASSOCIATION.ORG #PATHVISION github.com/DSLituiev/slideslicer
Preparing Digital Slides for
Machine Learning Experiments
Dima Lituiev, PhD
University of California, San Francisco

Acknowledgments
• Bakar Computational Health Sciences Institute, UCSF
  • Dexter Hadley
  • Sung Jik Cha
• UC Berkeley
  • Ryan Chen
• UCSF Pathology
  • Zoltan Laszik
  • Dejan Dobi
  • Aaron Chin
  • Eliah Shamir
  • Yunn-Yi Chen
• UCSF Radiation Oncology
  • Catherine Park
  • Vasant Kearney
  • Stathis Gennatas

This talk is right for you if…
• You'd like to learn how to apply deep learning to your pathology data
• You have digitized slides (or use public datasets)
• You have some coding experience (or have colleagues who code for you)

Educational Goals
• [know and reason about] categories of machine learning and computer vision tasks applied to digital pathology
• [be able to] choose which format to use depending on your task
• [be able to] prepare digital slides for classification, segmentation, and object detection using open-source tools in Python

Please see the GitHub repository for a summary:
github.com/DSLituiev/slideslicer

Motivation & Intro

Why?
• Improve pathology diagnostics
• Recognize, outline, or count morphological structures and pathological changes in digital slides
• Train machine learning algorithms

Machine Learning for Digital Pathology
• Learning from pathology notes: Natural Language Processing
• Learning from slides: Computer Vision

Using NLP to mine pathology labels
Dx:
Right breast, core needle biopsy:
1. Focal atypical lobular hyperplasia.
2. Hyalinized fibroadenoma with associated microcalcifications.
Set of labels:
• fibroadenoma
• atypical lobular hyperplasia
• calcifications
• DCIS
• LCIS
• invasive breast cancer

Tools:
• tokenizers
• medical ontologies (e.g. UMLS)
• approaches: bag-of-words, n-grams, sequential
• classifiers: FastText, GBM, SVM, Logistic Regression
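
As a toy illustration of the label-mining idea, the sketch below tokenizes a diagnosis note and matches it against a small hand-written synonym dictionary. The dictionary, the function names `tokenize` and `mine_labels`, and the prefix-matching rule are assumptions made for this example; the dictionary only stands in for a real ontology lookup, and negation ("no evidence of DCIS") is not handled.

```python
import re

# Hypothetical synonym dictionary standing in for a medical ontology lookup
LABEL_SYNONYMS = {
    "fibroadenoma": ["fibroadenoma"],
    "atypical lobular hyperplasia": ["atypical lobular hyperplasia"],
    "calcifications": ["calcification", "microcalcification"],
    "DCIS": ["dcis", "ductal carcinoma in situ"],
    "LCIS": ["lcis", "lobular carcinoma in situ"],
}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def mine_labels(note):
    """Return the labels whose synonyms occur in the note (prefix match on words)."""
    text = " ".join(tokenize(note))
    found = []
    for label, synonyms in LABEL_SYNONYMS.items():
        # \b anchors the synonym at a word start; plural endings still match
        if any(re.search(r"\b" + re.escape(s), text) for s in synonyms):
            found.append(label)
    return found


note = ("Right breast, core needle biopsy: 1. Focal atypical lobular hyperplasia. "
        "2. Hyalinized fibroadenoma with associated microcalcifications.")
print(mine_labels(note))
```

A trained classifier over bag-of-words or n-gram features would replace the dictionary once enough labeled notes are available.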

Computer Vision Tasks in Digital Pathology
Classification: assign a categorical label to each image (e.g. acute rejection vs. normal)
Regression: assign a numeric value to each image (e.g. EGFR: 70)
One label per slide:
• Whole-slide images don't fit into the memory of a regular GPU (as of 2018), so the image needs to be fed in small patches
• Signal predictive of the target is often concentrated in small areas
• Requires weak supervision techniques to guess which patch the signal is coming from
Segmentation, Object detection: detect structural elements
Potentially multiple contours per slide

• Image / Patch Classification: assign a label to each image
• Object Detection and Localization: provide bounding boxes and labels of contained objects
• Semantic Segmentation: provide pixel-level labels
Example classes: glomerulus, tubuli

Manual slide annotation

Manual Annotation (SVS format)
Screenshot: Aperio

Annotations are stored as an XML file
Screenshot:
annotation XML file

Image preprocessing

Why do we need to preprocess slides?
• Whole-slide images don't fit into GPU memory (as of 2018)
• Slide images have to be chunked into smaller pieces
• Image annotations (contours) have to be sliced in the same way
• Whole-slide images are very sparse (tissue occupies only 5–10% of the slide for a needle biopsy)
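
A back-of-the-envelope calculation makes the memory argument concrete. The sketch assumes a slide of roughly 40,000 x 40,000 RGB pixels (the order of magnitude quoted in this talk) and 1024 x 1024 patches; the helper `image_bytes` is a hypothetical name used only here.

```python
def image_bytes(width, height, channels=3, bytes_per_value=1):
    """Size in bytes of a dense, uncompressed image array."""
    return width * height * channels * bytes_per_value


side = 40_000                                                # assumed slide side in pixels
raw_uint8 = image_bytes(side, side)                          # 8-bit RGB
as_float32 = image_bytes(side, side, bytes_per_value=4)      # float32 network input
patch_float32 = image_bytes(1024, 1024, bytes_per_value=4)   # one 1024 x 1024 patch
n_patches = (side // 1024) ** 2                              # non-overlapping full patches

print(f"whole slide, uint8:   {raw_uint8 / 1e9:.1f} GB")
print(f"whole slide, float32: {as_float32 / 1e9:.1f} GB")
print(f"one patch, float32:   {patch_float32 / 1e6:.1f} MB")
print(f"full patches per slide: {n_patches}")
```

The input tensor alone would exceed typical GPU memory before any activations are stored, while a single patch is only a few megabytes.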

Technical Tasks
• Tissue vs background?
• How to sample it efficiently?
• How to handle ROIs?
[Figure: example patches at loc (7800, 11485), size 1024 x 1024, and at loc (0, 0), size 256 x 256]

Step 1: Know what your ML model needs
• Which file format? (png, jpeg, tiff, etc.)
• What dimensions? (399x399, 256x256, variable dimensions)
• What file/folder structure?
  • one image folder per class (classification)
  • paired images and masks (segmentation)
  • MS-COCO format (object detection)

Step 2: Choose tools
XPath -- working with XML
shapely -- intersecting contours
opencv -- general purpose classical CV
PIL -- light-weight Python CV toolbox
openslide -- reading slides

Data Preparation with slideslicer
• Reading annotated digital pathology slides
• Automated annotation of tissue vs background
• Splitting ~300 MB slides into smaller patches suitable for training machine learning algorithms
• Extras:
  • slide de-identification
  • dataset splitting (train, test, val)
  • resizing/subsampling
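
One point worth illustrating about dataset splitting: patches cut from the same slide should land in the same set, otherwise near-identical tissue leaks between train and test. The sketch below assigns a set per slide from a hash of the slide name. It is an assumption-laden illustration, not slideslicer's own splitting routine; `assign_set` and its fraction parameters are hypothetical.

```python
import hashlib


def assign_set(slide_name, val_fraction=0.1, test_fraction=0.1):
    """Deterministically map a slide name to 'train', 'val', or 'test'."""
    digest = hashlib.md5(slide_name.encode("utf-8")).hexdigest()
    # First 8 hex digits give a reproducible pseudo-random number in [0, 1)
    u = int(digest[:8], 16) / 0x100000000
    if u < test_fraction:
        return "test"
    if u < test_fraction + val_fraction:
        return "val"
    return "train"


for name in ["0977c.svs", "some_pathology_slide.svs"]:
    print(name, "->", assign_set(name))
```

Because the assignment depends only on the slide name, re-running the preparation script never moves a slide between sets.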

Step 3: Reading slides and annotations
• Read SVS with OpenSlide
• Read annotation (XPath)
• Save annotations as a json file
>>> import openslide
>>> from slideslicer import RoiReader  # import path assumed; check the repository
>>> fnsvs = "some_pathology_slide.svs"
>>> slide = openslide.OpenSlide(fnsvs)
>>> rreader = RoiReader(fnsvs)
>>> rreader.save('my_annotation.json')
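
To show what reading the annotation XML involves, here is a minimal sketch using the XPath subset of Python's standard `xml.etree.ElementTree`. The XML layout (`Annotation` / `Regions` / `Region` / `Vertices` / `Vertex` with `X` and `Y` attributes) is an assumption modeled on Aperio-style annotation files, the sample content is made up, and `read_regions` is a hypothetical helper rather than part of slideslicer.

```python
import xml.etree.ElementTree as ET

# Made-up sample in an assumed Aperio-style layout
SAMPLE_XML = """<Annotations>
  <Annotation Id="1" Name="glom">
    <Regions>
      <Region Id="1" Text="open glom">
        <Vertices>
          <Vertex X="100" Y="200"/>
          <Vertex X="150" Y="200"/>
          <Vertex X="150" Y="260"/>
        </Vertices>
      </Region>
    </Regions>
  </Annotation>
</Annotations>"""


def read_regions(xml_text):
    """Collect id, name, and vertex list for every Region element."""
    root = ET.fromstring(xml_text)
    regions = []
    for region in root.findall(".//Region"):
        vertices = [(float(v.get("X")), float(v.get("Y")))
                    for v in region.findall("./Vertices/Vertex")]
        regions.append({"id": int(region.get("Id")),
                        "name": region.get("Text"),
                        "vertices": vertices})
    return regions


regions = read_regions(SAMPLE_XML)
print(regions[0]["name"], len(regions[0]["vertices"]), "vertices")
```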

• Inspect annotations as a pandas table:
>>> rreader.df
id  name       area       length
1   infl       1729228.5  8163.4
2   open glom  406998.5   2475.8
• Visualize ROIs:
>>> import matplotlib.pyplot as plt
>>> rreader.plot(labels=False)
>>> plt.legend(loc='center left',
...            bbox_to_anchor=(1, 0.5))

Step 4: Segment tissue vs background
>>> rreader = RoiReader(fnsvs,
...                     threshold_tissue=True,
...                     save=True)
NB: tissue pieces with no annotations in them are discarded by default
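
One common way to separate stained tissue from the bright slide background is Otsu thresholding on grayscale intensities. The sketch below is a plain-Python version of that technique for illustration; whether `threshold_tissue=True` uses exactly this method is not stated in the talk, and `otsu_threshold` is a hypothetical helper (in practice an opencv or scikit-image routine would be used on a down-sampled slide).

```python
def otsu_threshold(pixels):
    """Return the 0..255 threshold that maximizes between-class variance.

    Pixels with value <= threshold form the dark (tissue) class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # intensity sum of the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# Toy "slide": 90% bright background, 10% darker tissue
pixels = [240] * 90 + [100] * 10
t = otsu_threshold(pixels)
tissue = [p <= t for p in pixels]
print("threshold:", t, "tissue fraction:", sum(tissue) / len(pixels))
```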

Step 5: Sample patches
• Challenge: most of the slide is blank (no tissue)
• Need to select only points that contain tissue (and maybe a very few blank patches)
• Naïve sampling produces many empty patches and is costly

Optimized point sampling for needle biopsy
Finding the tightest bounding box for efficient sampling with the opencv and shapely packages

Optimized point sampling
>>> sample_points(contour,
...               n_points=1000,
...               # spacing=512,
...               mode='uniform_random')  # alternative: mode='grid'
Timing: O(n), 70 μs/point (700 ms per 10,000 points)
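
The idea behind sampling inside a tissue contour can be sketched with rejection sampling: draw uniform points in a bounding box and keep those that fall inside the polygon. This simplified version uses an axis-aligned box and a ray-casting test in plain Python; it is not the `sample_points` implementation, and both function names below are hypothetical. The tighter the box around the tissue, the fewer draws are rejected, which is why the tightest bounding box matters for a thin needle biopsy.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points_in_polygon(polygon, n_points, seed=0):
    """Uniform random points inside a polygon (assumes non-zero area)."""
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n_points:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points


triangle = [(0, 0), (10, 0), (0, 10)]
pts = sample_points_in_polygon(triangle, 50)
print(len(pts), "points, first:", pts[0])
```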

Step 6: Read an arbitrary patch with ROIs
• Read ROIs from XML
>>> rreader = RoiReader(xml_filename)
>>> points = sample_points(contour, 1000)
>>> xc, yc = points[0]
• Read and visualize a patch
>>> fig, ax, region, rois = rreader.plot_patch(xc, yc, 1024,
...                                            subsample=4)
• Read an image patch
>>> region = rreader.read_patch(xc, yc, 1024, scale=4)
• Read matching ROIs for the patch
>>> patch_rois = rreader.get_patch_rois(
...     xc, yc, 1024, scale=4,
...     cocorle=True, translate=True)
Arguments:
• 1024: source patch size (1024² pixels)
• scale=4: down-sample the patch by a factor of 4 (resulting size is 256² pixels)
• translate=True: translate coordinates so that the upper left corner is (x=0, y=0)
• cocorle=True: produce an MS-COCO RLE encoding
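
The translate and down-sample steps amount to simple coordinate arithmetic on the ROI vertices. The sketch below assumes that (xc, yc) is the patch center in slide coordinates, which the talk does not state explicitly; `to_patch_coords` is a hypothetical helper and does not clip vertices that fall outside the patch.

```python
def to_patch_coords(vertices, xc, yc, side, scale=1):
    """Map slide-level vertices into the coordinate frame of a down-sampled patch.

    Assumes (xc, yc) is the patch center and `side` is the source patch size.
    """
    x0 = xc - side / 2  # upper left corner of the patch in slide coordinates
    y0 = yc - side / 2
    return [((x - x0) / scale, (y - y0) / scale) for x, y in vertices]


# A vertex at the patch center lands at the center of the 256 x 256 output
coords = to_patch_coords([(1000, 1000), (488, 488)], xc=1000, yc=1000,
                         side=1024, scale=4)
print(coords)
```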

• Convert ROIs to an MS-COCO formatted dictionary or JSON
>>> patch_rois.to_dict()
>>> patch_rois.to_json()

Tiling and Subsampling
• Slice a ~40K x 40K pixel image into ingestible bites
$ python3 sample_from_slide.py \
    --target-side 1024 \
    --data-root "$OUTPUT_DIR" \
    "$XML"
• Subsample image patches and ROIs
$ DATADIR="/data/data_1024/all"
$ FACTOR=2
$ python3 subsample.py "$DATADIR" "$FACTOR"
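
Tiling itself reduces to enumerating patch origins on a regular grid. The generator below is a hedged sketch of that step, not the logic of `sample_from_slide.py`; `tile_origins` and its `stride` parameter are hypothetical, and partial tiles at the right and bottom edges are simply dropped.

```python
def tile_origins(width, height, side, stride=None):
    """Yield upper left corners of square tiles that fit fully inside the image."""
    stride = stride or side  # default: non-overlapping tiles
    for y in range(0, height - side + 1, stride):
        for x in range(0, width - side + 1, stride):
            yield (x, y)


print(list(tile_origins(2048, 1024, 1024)))
print(len(list(tile_origins(40_000, 40_000, 1024))), "tiles of 1024 x 1024")
```

Passing a stride smaller than the tile side produces overlapping tiles, which can help when structures are cut by tile borders.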

Region of Interest (ROI) formats
Contour Vertices
One-hot mask
Integer mask
Run-length encoding (RLE) mask

ROI formats
[Figure panels: Original; Vertices; Integer Mask; Channels of a One-Hot Binary Mask]

Run-length encoding (RLE)
Count the number of pixels between ROI boundaries in a flattened image.
[Figure: runs of +10, +60, +30, +20, +60, +20 pixels along the flattened mask]
RLE: 10, 60, 50, 60, 20, …
ASCII byte encoded RLE:
'lbe5<b?5K3O2M2N4...'
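
The counting scheme can be written out in a few lines. This sketch works on an already flattened binary mask and follows the MS-COCO convention that the first count refers to background (zeros), so a mask starting with foreground gets a leading 0. It produces the plain list of counts, not the compressed ASCII string, and the function names are chosen for this example.

```python
def rle_encode(flat_mask):
    """Run lengths of a flat binary mask, starting with a run of zeros."""
    counts = []
    current = 0  # runs alternate 0, 1, 0, 1, ... beginning with background
    run = 0
    for value in flat_mask:
        if value == current:
            run += 1
        else:
            counts.append(run)
            current = value
            run = 1
    counts.append(run)
    return counts


def rle_decode(counts):
    """Inverse of rle_encode: expand run lengths back into a flat mask."""
    flat = []
    value = 0
    for count in counts:
        flat.extend([value] * count)
        value = 1 - value
    return flat


flat = [0] * 10 + [1] * 60 + [0] * 50 + [1] * 60 + [0] * 20
counts = rle_encode(flat)
print(counts)
```

For a 2-D mask, MS-COCO flattens in column-major order before counting, so the mask has to be flattened accordingly before calling `rle_encode`.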

Region of Interest (ROI) formats
• Contour vertices: compact, slow to convert to mask
• One-hot mask: easy to ingest, hard to visualize
• Integer mask: easy to ingest, easy to visualize
• RLE mask: compact, fast to convert
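
The relation between the integer and one-hot mask formats can be shown on a tiny example. The sketch uses nested lists and a channels-first layout for self-containment; both choices and the function names are assumptions for illustration, and array code (e.g. numpy) would be used on real masks.

```python
def integer_to_one_hot(mask, n_classes):
    """Integer mask (H x W of class ids) -> list of n_classes binary channels."""
    return [[[1 if value == c else 0 for value in row] for row in mask]
            for c in range(n_classes)]


def one_hot_to_integer(channels):
    """Binary channels -> integer mask holding the index of the active channel."""
    height, width = len(channels[0]), len(channels[0][0])
    return [[max(range(len(channels)), key=lambda c: channels[c][i][j])
             for j in range(width)]
            for i in range(height)]


mask = [[0, 1],
        [2, 1]]
one_hot = integer_to_one_hot(mask, n_classes=3)
print(one_hot[1])                   # channel for class 1
print(one_hot_to_integer(one_hot))  # round trip back to the integer mask
```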

Formatting slides into MS-COCO format
• Why MS-COCO format?
  • A standard dataset for object detection tasks with its associated standard format
  • A number of open-source tools accept MS-COCO format as input
{'annotations': [...],
 'images':      [...],
 'type':        [...],
 'categories':  [...],
 'info':        [...]}

Formatting slides into MS-COCO format
$ XML="$HOME/Documents/some_dcis_slide.xml"      # slide annotation path
$ COCODIR="$HOME/Documents/coco_patches_dcis/"   # output root folder
# --rle:          include RLE mask
# --out-root:     output root folder
# --target-side:  output size (pixels)
# --magnlevel:    magnification 4^n
$ python3 sample_patches_lowres_coco.py \
    --rle \
    --out-root "$COCODIR" \
    --target-side 512 \
    --magnlevel 2 \
    "$XML"

Structure of MS-COCO dataset: JSON file
In [1]: coco.keys()
['annotations',
 'images',
 'type',
 'categories',
 'info']
In [2]: coco['images'][0]
{'file_name': '0977c-x28-y1020.png',
 'height': 512,
 'width': 512,
 'id': 15,
 'location-x': 28326,
 'location-y': 10204,
 'slide_name': '0977c.svs',
 'set': 'train'}
In [3]: coco['annotations'][0]
{'area': 1661.5,
 'bbox': [363.0, 75.0, 44.0, 50.0],
 'category_id': 1,
 'category_name': 'glom',
 'counts': 'lbe5<b?5K3O2M2N4...',
 'size': [512, 512],
 'id': 124,
 'image_id': 15,
 'iscrowd': 0,
 'segmentation': [[383,75,383,75,...]],
 'set': 'train',
 'slide_name': '0977c.svs'}
Images and annotations are linked by:
images.id <-> annotations.image_id
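
The link between the two lists can be followed with a small grouping step. The sketch below builds a lookup from image file name to its annotations; the first image and annotation reuse values shown on the slide, the second image is made up to show the empty case, and `annotations_by_image` is a hypothetical helper.

```python
from collections import defaultdict

coco = {
    "images": [
        {"id": 15, "file_name": "0977c-x28-y1020.png", "height": 512, "width": 512},
        {"id": 16, "file_name": "made-up-empty-patch.png", "height": 512, "width": 512},
    ],
    "annotations": [
        # bbox follows the MS-COCO convention [x, y, width, height]
        {"id": 124, "image_id": 15, "category_id": 1,
         "bbox": [363.0, 75.0, 44.0, 50.0]},
    ],
}


def annotations_by_image(coco):
    """Map each image file name to the list of annotations that reference it."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann)
    return {img["file_name"]: grouped.get(img["id"], [])
            for img in coco["images"]}


lookup = annotations_by_image(coco)
for file_name, anns in lookup.items():
    print(file_name, [a["id"] for a in anns])
```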

slideslicer adds custom fields (such as 'location-x', 'location-y', and 'slide_name') to track the location of a patch within the slide

Converting between formats
• Contour -> Mask
>>> mask = convert_contour2mask(contour)
• Mask -> Contour
>>> contour = convert_mask2contour(mask)
• Mask <-> MS-COCO RLE
>>> import numpy as np
>>> from pycocotools.mask import encode, decode
>>> # pycocotools expects a Fortran-ordered uint8 array
>>> coco_rle = encode(np.asfortranarray(mask, dtype=np.uint8))
>>> mask = decode(coco_rle)

Conclusion and Further Considerations
• Check if you have labels already
  • If yes, find an automated way to extract them
  • If not, create your own
• Know what format you need for your downstream application
• Work on communication between your clinical and computational collaborators
• Know what matters and what is possible (sometimes you wouldn't dream of it!)
• Take breaks and drink water

Download and install
https://github.com/DSLituiev/slideslicer
Check also tools for training keras models:
https://github.com/DSLituiev/kerastrainutils
Image augmentation toolset:
https://github.com/aleju/imgaug
Connect
GitHub: DSLituiev
Twitter: @DimaLituiev

Thank you!
