Anahita Bhiwandiwalla – Intel Corporation
Karthik Vadla – Intel Corporation
A Predictive Analytics Workflow on DICOM Images using Apache Spark
Who are we?
Agenda
• Use-case overview
• Challenges
• What is our data?
• Deriving Insights
• Machine Learning
• Tools
• Spark-tk
• Demo
• spark-tensorflow-connector
• Performance
Use-case overview
• Vast amount of medical image data available
• Meta-data accompanying the image
– Patient information
– Status of condition
– Method of image capture
– Time to events (if part of a study)
Create distributed technology needed to efficiently
store, scan and analyze data in healthcare.
Challenges
• Getting data for experimentation is hard!
• Combine data from multiple public sources for simulation
• Multiple tools – image processing libraries, visualization
tools, distributed engines, ML libraries
• Compatibility across different tools
How do we make Machine Learning easy to use?
Data: DICOM
- Digital Imaging and
COMmunications in Medicine
- International standard format to
store, exchange, and transmit
medical images
- Standards for imaging modalities -
radiography, ultrasonography,
computed tomography (CT),
magnetic resonance imaging (MRI),
and radiation therapy.
- Protocols for image exchange, image
compression, 3-D visualization,
image presentation etc.
Data: Meta-data
• Image related: storage, date, relevant parts of
the image etc.
• Patient related: age, weight, height, DOB,
medical history
• Device related: manufacturer information,
operator, method of capture
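A minimal sketch of how a flat metadata record could be split into the three groups above. The keyword names are standard DICOM attribute keywords, but the choice of which keywords go in which group, the `group_metadata` helper, and the sample record are all hypothetical and only meant for illustration.

```python
# Hypothetical grouping of DICOM keywords into the three categories above.
IMAGE_KEYS = {"StudyDate", "Modality", "Rows", "Columns"}
PATIENT_KEYS = {"PatientAge", "PatientWeight", "PatientSize", "PatientBirthDate"}
DEVICE_KEYS = {"Manufacturer", "ManufacturerModelName", "OperatorsName"}


def group_metadata(record):
    """Split a flat {keyword: value} record into image, patient and device groups."""
    groups = {"image": {}, "patient": {}, "device": {}, "other": {}}
    for key, value in record.items():
        if key in IMAGE_KEYS:
            groups["image"][key] = value
        elif key in PATIENT_KEYS:
            groups["patient"][key] = value
        elif key in DEVICE_KEYS:
            groups["device"][key] = value
        else:
            groups["other"][key] = value
    return groups


# Made-up sample record, for illustration only.
sample = {
    "Modality": "CT",
    "StudyDate": "20170101",
    "PatientAge": "045Y",
    "PatientWeight": 70.0,
    "Manufacturer": "ExampleVendor",
    "SOPInstanceUID": "1.2.3",
}
grouped = group_metadata(sample)
```

Keeping the groups separate makes it straightforward to feed patient fields to one model and device fields to another.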
Deriving insights
• Can we develop smart filtering techniques based on
image meta-data?
• Can we conclude and predict certain conditions based
on patient meta-data?
• Can we classify image capture techniques and suggest
improvements based on device meta-data?
• Can we combine image & patient meta-data to conclude
and predict?
• Can analytics help automate/improve image capture
techniques?
Machine Learning
• Patient data
– Train classifier model on patient data to predict presence
of a condition
• Logistic Regression, Random Forest, Naïve Bayes
– Train regressor model on patient data to predict probability
of a condition
• Linear Regression, Random Forest
– Train survival analysis model on patient data to predict
time to event, compare risk of populations
• Accelerated Failure Time Model, Cox Proportional Hazards Model
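To make the survival-analysis item concrete, here is a small sketch of the quantity a Cox Proportional Hazards fit maximizes: the partial log-likelihood. It assumes a single covariate and no tied event times, runs on one machine, and uses made-up toy data; it is not the distributed implementation discussed in this talk.

```python
import math


def cox_partial_log_likelihood(beta, times, events, covariate):
    """Cox partial log-likelihood for one covariate, assuming no tied event times.

    times: observed time per subject
    events: 1 if the event was observed, 0 if censored
    covariate: one numeric covariate per subject
    """
    total = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            continue  # censored subjects contribute only through risk sets
        # Risk set: every subject still under observation at time t_i.
        risk = sum(
            math.exp(beta * covariate[j])
            for j, t_j in enumerate(times)
            if t_j >= t_i
        )
        total += beta * covariate[i] - math.log(risk)
    return total


# Made-up toy data: three subjects, all with observed events.
times = [2.0, 5.0, 9.0]
events = [1, 1, 1]
group = [1.0, 0.0, 0.0]  # e.g. 1 = exposed population, 0 = reference

ll_null = cox_partial_log_likelihood(0.0, times, events, group)
ll_pos = cox_partial_log_likelihood(1.0, times, events, group)
```

In this toy data the exposed subject fails first, so a positive coefficient scores higher than a coefficient of zero, which is how the model expresses a higher risk for that population.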
Machine Learning
• Image data
– Compute the Eigen decomposition of the image
– Compute the Covariance Matrix of the image
– Compute the Principal Component Analysis of the
image pixel data
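The three computations above are closely related: the principal components of the pixel data are the eigenvectors of its covariance matrix. A single-machine sketch in plain Python, assuming a tiny made-up pixel matrix and using power iteration to recover only the first component (a swapped-in technique for brevity, not necessarily what the library uses):

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of a 2-D list (rows = observations)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def top_principal_component(cov, iterations=200):
    """Dominant eigenvalue and eigenvector of a symmetric matrix via power iteration.

    Caveat: the fixed start vector must not be orthogonal to the dominant
    eigenvector, otherwise the iteration cannot find it.
    """
    d = len(cov)
    v = [1.0 / (k + 1) for k in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the (unit-length) vector v.
    length_sq = sum(x * x for x in v)
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    ) / length_sq
    return eigenvalue, v


# Made-up 4x2 pixel matrix: the second column is exactly twice the first.
pixels = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(pixels)
eigenvalue, component = top_principal_component(cov)
```

Because the two columns are perfectly correlated, all of the variance lies along one direction, proportional to (1, 2).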
Machine Learning
• Device data:
– Predict trends in image capture techniques
• Classification/Regression techniques
– Predict device time to failure
• Cox Proportional Hazards Model, Accelerated Failure Time
– Recommendations to device capture methods
• Collaborative filtering
Tools
• pydicom
– Python package for working with DICOM files such as medical images, reports, and radiotherapy objects
– Single threaded
• dcm4che
– Collection of open source applications and utilities for the healthcare enterprise.
– Developed in the Java programming language for performance and portability, supporting
deployment on JDK 1.6 and up
• matplotlib
– Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats
and interactive environments across platforms.
– Can be used in Python scripts, the Python and IPython shells, Jupyter notebooks, and web application servers
• Machine Learning libraries
– Classifiers, regressors, failure analysis
– Eigen decomposition
Spark-tk Introduction
• Open source library enhancing the Spark experience
• APIs for feature engineering, ETL, graph construction,
machine learning, image processing, scoring
• Abstraction level familiar to data scientists; removes
complexity of parallel processing
• Lower level Spark APIs seamlessly exposed
• Easy to build apps plugging in other tools/libraries
https://github.com/trustedanalytics/spark-tk
http://trustedanalytics.github.io/spark-tk/
Spark-tk components
• Frames
– Scalable data frame representation
– More intuitive than low level HDFS file and Spark
RDD/DataFrame/DataSet formats; schema inference
– APIs to manipulate the data frames for feature engineering and
exploration, such as joins and aggregations
– Run user-defined transformations and filters using distributed
processing
– Input to our models
– Easy to convert between Spark-tk Frames and Spark representations
Spark-tk components
• Graphs
– Scalable graph representation based on Frames for
vertices and edges
– In house distributed algorithms for graph analytics
using GraphX
– Supports importing/exporting to OrientDB’s scalable
graph databases – visualize, real-time graph querying
– Store massive graphs
Spark-tk components
• Machine Learning & Streaming
– Time series analysis
– Recommender systems
– Topic Modeling
– Clustering
– Classification/Regression
– Image processing
– Scikit Models
– Streaming via Scoring engine
How Spark-tk helps
• Brings in support for image analytics to Spark
– Support for ingesting and processing DICOM images in a
distributed environment
– Queries, filters, and analytics on image collections
– Integrated & distributed dcm4che3 via Spark
– Tested against pydicom
– Used matplotlib for visualization
• Machine learning for image and health records
– Created distributed Cox Survival Proportional Hazards Model to
compare sets of populations
– Recommend algorithm and hyper-parameters
Live Demo
spark-tensorflow-connector
• Developed library for loading and storing
TensorFlow records with Apache Spark
• Implements data import from the standard
TensorFlow record format (TFRecords) into
Spark SQL DataFrames, and data export from
DataFrames to TensorFlow records
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector
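For readers unfamiliar with the TFRecords container mentioned above, the sketch below frames and reads records in plain Python. Each record is a little-endian 64-bit length, a masked CRC-32C of that length, the payload, and a masked CRC-32C of the payload. The helper names are hypothetical, and the payloads here are placeholder bytes rather than serialized TensorFlow examples; this illustrates the file layout only, not the connector's own code.

```python
import struct


def _make_crc32c_table():
    """Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = _make_crc32c_table()


def crc32c(data):
    """CRC-32C checksum of a bytes object."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """TFRecord stores a rotated and offset CRC rather than the raw value."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    """Wrap one payload: length, CRC of length, payload, CRC of payload."""
    header = struct.pack("<Q", len(payload))
    return (
        header
        + struct.pack("<I", masked_crc(header))
        + payload
        + struct.pack("<I", masked_crc(payload))
    )


def read_records(blob):
    """Yield payloads from concatenated framed records, verifying both CRCs."""
    offset = 0
    while offset < len(blob):
        header = blob[offset:offset + 8]
        (length,) = struct.unpack("<Q", header)
        (length_crc,) = struct.unpack("<I", blob[offset + 8:offset + 12])
        payload = blob[offset + 12:offset + 12 + length]
        (payload_crc,) = struct.unpack(
            "<I", blob[offset + 12 + length:offset + 16 + length]
        )
        if length_crc != masked_crc(header) or payload_crc != masked_crc(payload):
            raise ValueError("corrupt record at offset %d" % offset)
        yield payload
        offset += 16 + length


# Placeholder payloads; real files hold serialized TensorFlow examples.
blob = frame_record(b"first") + frame_record(b"second")
payloads = list(read_records(blob))
```

Each record carries 16 bytes of framing overhead, which is why many small records cost more to store and scan than fewer large ones.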
Performance
[Chart: Data (GB) vs. Time (Min). X-axis: 0.25, 0.5, 1, 2, 4, 8, 16, 32 GB; Y-axis: 0 to 30 min]
Processing time increases with data (constant #Partitions = 1000)
Performance
[Chart: #Partitions vs. Time (Min). X-axis: 300, 450, 675, 1012, 2278, 3147, 5125, 7688 partitions; Y-axis: 24.5 to 29.5 min]
Increasing partitions reduced the time and then increased time (overhead)
Performance
[Chart: #Executors vs. Time (Min). X-axis: 7, 11, 24, 35, 56 executors; Y-axis: 0 to 100 min]
Increasing number of executors reduces the time significantly
Performance
[Chart: #Executor Cores vs. Time (Min). X-axis: 1, 2, 3, 4, 5 cores; Y-axis: 0 to 35 min]
Increasing executor cores did not significantly change time
Thank You.
https://github.com/trustedanalytics/spark-tk
http://trustedanalytics.github.io/spark-tk/
