5. Breast Cancer Tumor Proliferation
Challenge
● Images of Tumors: Can be analyzed and given a
score for medical assessment.
● Tumor Score: Difficult to determine; requires a
trained eye.
● Currently assessed by Pathologists (M.D./D.O.).
● Dataset contains 500 images of breast cancer tissue,
each larger than 15 GB.
6. Context
•Breast cancer is a leading cause of cancer death in women.
•Survival rates improve with earlier detection, which motivates
faster diagnosis.
•Tumor cell proliferation is a strong indicator of a patient’s
prognosis.
•Currently, pathologists classify tumors based on proliferation by
counting the dividing cell nuclei in hematoxylin & eosin stained
slides by hand with a microscope.
•This manual counting suffers from inherent subjectivity.
17. Reference Paper:
“Automated Grading of Gliomas
using Deep Learning in Digital
Pathology Images”
Daniel L. Rubin, MD, MS Lab
Department of Radiology &
Department of Medicine
(Biomedical Informatics Research),
Stanford University
18. “Automated Grading of Gliomas using Deep
Learning in Digital Pathology Images”
1. Cut a “whole-slide” image into square “tiles” at 20x magnification.
2. Filter the “tiles” to remove any without tissue.
3. Cut the remaining “tiles” into smaller “samples”.
4. Assign a tumor score label to each sample based on the tumor score of the
“whole-slide” image.
5. Repeat 1-4 for all “whole-slide” images.
6. Train a convolutional neural network with the resulting dataset of labeled
“samples”.
7. Good results!
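Steps 1-4 above can be sketched for a single slide as follows. This is a minimal illustration using NumPy, not the paper's implementation: the function names, the brightness-based tissue filter and its thresholds, and the toy image sizes are all assumptions made for the example. Step 5 would call `slide_to_samples` once per slide, and step 6 (training the network) is left out.

```python
import numpy as np

def cut_into_squares(image, size):
    """Split an H x W x C array into non-overlapping size x size squares,
    dropping any partial squares at the right and bottom edges."""
    rows, cols = image.shape[0] // size, image.shape[1] // size
    return [
        image[r * size:(r + 1) * size, c * size:(c + 1) * size]
        for r in range(rows)
        for c in range(cols)
    ]

def has_tissue(tile, background=220, min_fraction=0.25):
    """Hypothetical filter: treat near-white pixels as empty glass and keep
    the tile only if enough pixels are darker than `background`."""
    gray = tile.mean(axis=2)
    return (gray < background).mean() >= min_fraction

def slide_to_samples(slide, slide_score, tile_size, sample_size):
    """Steps 1-4 for one slide: tile, filter, cut into samples, label."""
    samples = []
    for tile in cut_into_squares(slide, tile_size):
        if not has_tissue(tile):
            continue  # step 2: skip tiles without tissue
        for sample in cut_into_squares(tile, sample_size):
            # step 4: every sample inherits the whole-slide score
            samples.append((sample, slide_score))
    return samples

# Toy "slide": white background with one dark (tissue-like) quadrant.
slide = np.full((8, 8, 3), 255, dtype=np.uint8)
slide[:4, :4] = 100
dataset = slide_to_samples(slide, slide_score=2, tile_size=4, sample_size=2)
print(len(dataset), dataset[0][0].shape, dataset[0][1])  # 4 (2, 2, 3) 2
```

Only the top-left tile passes the filter, so the toy slide yields four labeled 2x2 samples. Real whole-slide images are far too large to load as one array, so a practical version would read tiles lazily from disk.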
21. Our Approach:
● Utilize Apache Spark to cut and filter all 500
labeled, extremely high-resolution tumor slide
images into 4.7 million smaller square
samples.
● Utilize Apache SystemML on top of Spark to
train a convolutional neural network on the
labeled samples.
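A quick back-of-the-envelope calculation suggests why a cluster engine is a reasonable fit here. It uses only the figures quoted in these slides (500 slides, more than 15 GB each, 4.7 million samples); the variable names are illustrative, and the totals are lower bounds or averages rather than measured values.

```python
num_slides = 500          # labeled whole-slide images
min_gb_per_slide = 15     # "more than 15 GB" each, so a lower bound
num_samples = 4_700_000   # samples produced after cutting and filtering

min_total_tb = num_slides * min_gb_per_slide / 1000  # decimal terabytes
avg_samples_per_slide = num_samples / num_slides

print(f"raw data: more than {min_total_tb} TB")                    # more than 7.5 TB
print(f"average samples per slide: {avg_samples_per_slide:.0f}")   # 9400
```

At more than 7.5 TB of raw images and roughly 9,400 samples per slide on average, the preprocessing is naturally parallel across slides, since each slide can be cut and filtered independently.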
23. What is Apache Spark?
● Apache Spark is a fast and general engine for large-scale data
processing.
● Combines ML, SQL, streaming, and other complex analytics.
● Extends Scala idioms, as well as R/Python DataFrame idioms, to
cluster computing.
● APIs for Scala, Java, Python, R.
● Simple to use!
● Much more information
at https://spark.apache.org/.
25. What is Apache SystemML?
● Apache SystemML is a machine learning system for running
distributed linear algebra on top of Apache Spark.
● Exposes high-level R-like & Python-like languages focused on
linear algebra.
● APIs for Python, Scala, Java.
● Much more information
at http://systemml.apache.org/.