Data Science in Healthcare -The University Malaya Medical Centre Breast Cancer Ecosystem
1. Data Science in Healthcare
The University Malaya Medical Centre Breast Cancer Ecosystem
Assoc. Prof. Dr Sarinder K. Dhillon
Data Science & Bioinformatics Lab
University of Malaya, Kuala Lumpur, Malaysia
sarinder@um.edu.my
http://sarinderkaur.com
https://umexpert.um.edu.my/sarinder
3. Data Science in Healthcare
• Transforming healthcare
• Data is used effectively
• Personalised medicine
• Predicting factors for improved decision making
• Historical data use in analytics
• Artificial intelligence enabled systems
• Electronic medical records/Electronic health records
4. Focus of Talk
University Malaya Medical Centre (UMMC) Data Science Healthcare Ecosystem
5. UMMC Data Science Healthcare Ecosystem
• Machine Learning for Prediction
• MyBCC (Malaysian Breast Cancer Cohort Study)
• Deep Learning for Classification
• Text Mining in Radiology
• UMMC Breast cancer clinical data & Electronic Medical Records
6. FIRST PROJECT
DEVELOPMENT OF ELECTRONIC MEDICAL RECORD FOR CLINICAL & RESEARCH PURPOSES:
Breast Cancer Clinical Data Management Using Point-of-Care and Multidisciplinary Data Collection
7. In the News
“The new technology would facilitate medical practitioners, including doctors and nurses to identify and share patients' medical consultation information, as well as prescription of medicines through an integrated system.”
“With Big Data, healthcare organisations have the ability of information exchange, leading to a 360° view of their patients, so doctors can give a more complete diagnosis.”
Bernama (2018, November 8). Health Minister: EMR system to be implemented at 145 hospitals within the next three years. The Star.
Brar, K. (2018, January 31). Aiding healthcare through data analytics. The Star.
16. Alternate Clinical Data Management Using Point-of-care and Multidisciplinary Data Collection
• Enhanced EMR provides point-of-care data in clinics, wards and MDT meetings
• Ability to produce reports on breast cancer survival in Malaysia for research, individual hospital performance for policymakers to track outcomes and provide direction in cancer control.
• A time effective approach while producing new knowledge through data mining and analysis – Research & Development
Nor, N. A. M., Taib, N. A., Saad, M., Zaini, H. S., Ahmad, Z., Ahmad, Y., & Dhillon, S. K. (2018). Development of electronic medical records for clinical and research purposes: the breast cancer module using an implementation framework in a middle income country - Malaysia. BMC Bioinformatics, 19(Suppl 13), 1–16. http://doi.org/10.1186/s12859-018-2406-9 (ISI-Indexed)
19. SECOND PROJECT
Application of machine learning to predict the important clinical prognostic factors affecting survival rate of breast cancer
• 8066 samples from University Malaya Medical Centre (UMMC) (1993-2018)
• 23 clinical factors
• 1 target variable (Life status)
27. RESULTS – SL – Decision tree
Decision tree for all data. It shows that patients with curable cancer, ≤ 1 positive lymph node (PLN) and ≤ 2 total axillary lymph nodes removed (TLN) had a 50% survival probability, while patients with pre-cancer, ≤ 1 PLN and ≤ 2 TLN had a 90% survival probability. Patients with metastatic cancer, > 6 PLN and > 6 TLN had only a 25% survival probability.
29. VALIDATION USING AJCC MANUAL 5th edition
Variables | AJCC | Machine learning
Classification of tumor size | < 2.0 cm; 1.0 < 4.0 cm; > 4.0 cm | < 2.5 cm; 2.5 < 4.8 cm; 4.8 < 11.0 cm; > 11.0 cm
Number of positive lymph nodes | ≤ 3; 3 < 6; > 6 | < 3; 3 < 9; > 9
31. METHODOLOGY – Unsupervised learning (UL)
1. Determine optimal number of clusters
• Using Gap Statistics method, factoextra library in R
• Labels of variables suggest the value of K (number of clusters)
• Run K-means and record distance, V
• V minimizes when K = n
• Visualize through scree plot
2. Determine most suitable clustering method
• Compare hierarchical, k-means and PAM (Partitioning Around Medoids) using clValid R library
• Measured using connectivity, Dunn and Silhouette
3. Perform hierarchical clustering
• Variables subdivided into groups by cutting at a desired similarity level
• Function dist() in R was used to calculate a dissimilarity matrix as an input to hclust()
• Dendrogram was visualised using fviz_dend() function in factoextra library
4. Validate the pattern of clusters
• Clinicians validate the patterns of variables in each cluster
• Results to be compared with real-time survival analysis
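The study ran these steps in R. Purely as an illustration of step 3, the following pure-Python sketch performs average-linkage agglomerative clustering on a precomputed dissimilarity matrix and stops when a requested number of clusters remains, which plays the role of cutting the dendrogram. The function name, the choice of average linkage and the toy matrix are assumptions made for this example and are not taken from the study.

```python
from itertools import combinations


def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a square dissimilarity matrix.

    dist[i][j] is the dissimilarity between items i and j.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise dissimilarity between two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy dissimilarities between five hypothetical variables (illustrative only).
toy = [
    [0.0, 0.1, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.8, 0.9, 0.9],
    [0.9, 0.8, 0.0, 0.2, 0.7],
    [0.8, 0.9, 0.2, 0.0, 0.6],
    [0.9, 0.9, 0.7, 0.6, 0.0],
]
groups = agglomerative(toy, n_clusters=3)
```

Stopping at a fixed number of clusters is one way to cut the tree; cutting at a dissimilarity threshold would need the merge heights to be recorded as well.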
32. RESULTS – UL
Optimal number of clusters = 6
Suitable clustering method = hierarchical clustering
METHOD | CONNECTIVITY | DUNN | SILHOUETTE
hierarchical | 16.45 | 0.72 | 0.38
kmeans | 16.34 | 0.68 | 0.38
pam | 20.70 | 0.34 | 0.16
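To make the Silhouette column easier to read, here is a minimal pure-Python sketch of the average silhouette width for a given partition. It is not the clValid implementation; the function name and the four-point toy data are hypothetical, and singleton clusters are scored 0 by convention. Connectivity and the Dunn index are not covered here.

```python
def mean_silhouette(dist, clusters):
    """Average silhouette width of a partition (needs at least two clusters).

    dist[i][j] is the dissimilarity between items i and j;
    clusters is a list of lists of item indices.
    """
    scores = []
    for own in clusters:
        for i in own:
            if len(own) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a: mean dissimilarity to the other members of the same cluster
            a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
            # b: smallest mean dissimilarity to any other cluster
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for other in clusters
                if other is not own
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Four hypothetical points on a line: two tight, well-separated pairs.
xs = [0, 1, 10, 11]
dist = [[abs(p - q) for q in xs] for p in xs]
good = mean_silhouette(dist, [[0, 1], [2, 3]])  # natural grouping
bad = mean_silhouette(dist, [[0, 2], [1, 3]])   # deliberately poor grouping
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate items that sit closer to another cluster than to their own.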
33. RESULTS – UL – Hierarchical clustering
V1: Age
V22: Total lymph nodes
V21: Hormonal therapy
V12: ER status
V13: PR status
V18: Method of axillary lymph node dissection
V2: Marital status
V7: Classification of breast cancer
V16: Surgery status
V19: Radiotherapy
V20: Chemotherapy
V3: Menopausal status
V8: Laterality
V6: Method of diagnosis
V15: Primary treatment type
V10: Grade of differentiation
V14: cerb2 status
V17: Type of surgery
V5: Race
V4: Presence of family history
V9: Cancer stage
V11: Tumor size
V23: Positive lymph nodes
40. OUR STUDY
Deep Learning: Transfer Learning
• Fine tuning: VGG16 Fine Tuning, ResNet50 Fine Tuning
• Feature extraction: VGG-FC, VGG-SVM, VGG-DT, VGG-RF, VGG-AB, ResNet-FC, ResNet-SVM, ResNet-DT, ResNet-RF, ResNet-AB
• We used VGG16 and ResNet50 pre-trained on Breast US image dataset
• We used convolutional layers of VGG16 and ResNet50 for feature extraction
• 12 deep learning models were used in our study
42. THE STATE OF THE ART OF DEEP LEARNING MODELS IN BREAST ULTRASOUND LESION CLASSIFICATION
References | Dataset | Deep learning models | Performance
[29] | 4254 benign, 3154 malignant | GoogLeNet | Accuracy: 91.23%, Sensitivity: 84.29%, Specificity: 96.07%
[30] | 135 benign, 92 malignant | Boltzmann | Accuracy: 93.4%, Sensitivity: 88.6%, Specificity: 97.1%
[31] | 100 benign, 100 malignant | Deep Polynomial network + SVM | Accuracy: 92.40%, Sensitivity: 92.67%, Specificity: 91.36%
[32] | 275 benign, 245 malignant | Stacked denoising Autoencoder | Accuracy: 82.4%, Sensitivity: 78.7%, Specificity: 85.7%
Current Study | 249 benign, 190 malignant | Attention VGG16 + ensembled loss | Accuracy: 93%, Sensitivity: 96%, Specificity: 90%
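For readers less familiar with the three metrics in this comparison, the short sketch below shows how they are derived from a binary confusion matrix. It assumes malignant is the positive class. The counts are hypothetical and are not taken from any of the studies listed.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn: malignant lesions classified correctly/incorrectly (assumed positive class).
    tn/fp: benign lesions classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)          # share of malignant cases detected
    specificity = tn / (tn + fp)          # share of benign cases correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical test-set counts, for illustration only.
acc, sens, spec = diagnostic_metrics(tp=45, fn=5, tn=40, fp=10)
```

With unbalanced classes, accuracy alone can hide a weak sensitivity or specificity, which is presumably why the three figures are reported together.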
43. FOURTH PROJECT OBJECTIVE
To develop a fully automated AI-enabled database platform with interactive visualisations for researchers and clinicians using The Malaysian Breast Cancer Survivorship Cohort (MyBCC) Study
• 5 timelines: Baseline, 6 Months, 1 Year, 3 Years, 5 Years
• 603 variables
• 909 samples
May 2019
47. FOURTH PROJECT– Automated machine learning
Algorithm 1. Cartesian product to select lifestyle and clinical factors from different tables
Select 13 lifestyle factors, life status and survival years from table mybcc, and 4 clinical factors from table clinical, where mybcc.RN = clinical.RN (select the same patient ID from both tables)
$$r = \sigma_{\text{mybcc.RN} = \text{clinical.RN}}\left(\pi_{\text{RN},\, l_1, l_2, l_3, \ldots, l_{13},\, \text{lifestatus},\, \text{survivalyears}}(\text{mybcc}) \times \pi_{\text{RN},\, c_1, c_2, c_3, c_4}(\text{clinical})\right)$$
Definition:
r = relational database
σ = selection
π = projection
× = Cartesian product
mybcc = data table, which contains lifestyle factors
clinical = data table, which contains clinical factors
RN = patient ID/ primary key in both tables
lifestatus = life status of the patients (Alive/Dead)
survivalyears = Overall survival years of the patients
l1,l2,l3…l13 = 13 lifestyle factors
c1,c2,c3,c4 = 4 clinical factors
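Algorithm 1 can be tried out with Python's built-in sqlite3 module. The sketch below is an assumption-laden illustration: the two tables carry only one lifestyle column (l1) and one clinical column (c1) as stand-ins for the 13 and 4 factors, and all rows are invented. It runs the projection, Cartesian product and selection literally, then shows that an ordinary inner join on RN returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mybcc (RN INTEGER PRIMARY KEY, l1 TEXT,
                        lifestatus TEXT, survivalyears REAL);
    CREATE TABLE clinical (RN INTEGER PRIMARY KEY, c1 TEXT);
    -- Invented rows for illustration only.
    INSERT INTO mybcc VALUES (1, 'active', 'Alive', 5.0),
                             (2, 'inactive', 'Dead', 2.5);
    INSERT INTO clinical VALUES (1, 'stage I'), (2, 'stage III'), (3, 'stage II');
""")

# Projection of each table, Cartesian product, then selection on matching RN.
product_then_select = conn.execute("""
    SELECT m.RN, m.l1, m.lifestatus, m.survivalyears, c.c1
    FROM (SELECT RN, l1, lifestatus, survivalyears FROM mybcc) AS m
    CROSS JOIN (SELECT RN, c1 FROM clinical) AS c
    WHERE m.RN = c.RN
    ORDER BY m.RN
""").fetchall()

# The same relation written as an inner join.
inner_join = conn.execute("""
    SELECT m.RN, m.l1, m.lifestatus, m.survivalyears, c.c1
    FROM mybcc AS m JOIN clinical AS c ON m.RN = c.RN
    ORDER BY m.RN
""").fetchall()
```

Patient 3 appears only in the clinical table and is therefore dropped by the selection, which is the behaviour the RN-matching condition describes.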
48. FOURTH PROJECT– Automated machine learning
Algorithm 2. Python-HTML integration for automated machine learning
a = query1 + ((pm1, pm2, …, pmn) + ps + ph)
Definition:
a = automated analysis
query1 = $$\sigma_{\text{mybcc.RN} = \text{clinical.RN}}\left(\pi_{\text{RN},\, l_1, l_2, l_3, \ldots, l_{13},\, \text{lifestatus},\, \text{survivalyears}}(\text{mybcc}) \times \pi_{\text{RN},\, c_1, c_2, c_3, c_4}(\text{clinical})\right)$$ (Refer to Algorithm 1)
pm1, pm2, …, pmn = Python modules
ps = Python script to run each analysis
ph = Python-HTML connection via cgitb
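The composition in Algorithm 2 can be pictured as a query result passed to an analysis script whose output is rendered as HTML. The sketch below is a simplified stand-in, not the platform's code: the function names and rows are hypothetical, and the analysis is just a count per life status. It deliberately avoids cgitb, which was deprecated and then removed from the standard library in Python 3.13, and builds the page with plain string formatting instead.

```python
from collections import Counter
from html import escape


def run_analysis(rows):
    """Stand-in for an analysis script (ps): count patients per life status."""
    return Counter(status for _, status in rows)


def render_page(title, counts):
    """Stand-in for the Python-HTML step (ph): render counts as an HTML table."""
    body = "".join(
        f"<tr><td>{escape(label)}</td><td>{n}</td></tr>"
        for label, n in sorted(counts.items())
    )
    return (
        f"<html><head><title>{escape(title)}</title></head><body>"
        f"<h1>{escape(title)}</h1>"
        f"<table><tr><th>Life status</th><th>Patients</th></tr>{body}</table>"
        "</body></html>"
    )


# Hypothetical (RN, lifestatus) rows standing in for the output of query1.
rows = [(1, "Alive"), (2, "Dead"), (3, "Alive")]
page = render_page("Life status <summary>", run_analysis(rows))
```

Escaping every value that reaches the page keeps database content from being interpreted as markup.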
49. FOURTH PROJECT– Automated machine learning
Algorithm 3. Model of the automated visualisation from database
v = query2 + (c1,c2,…,cn)
Definition:
v = visualisation
query2 = query1 + ((pm1,pm2,…pmn) + ps + ph)
(Refer to Algorithm 2)
c1,c2,…,cn = different types of charts
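As a rough illustration of the last step in Algorithm 3, the following sketch turns analysis output into one chart type, a bar chart, emitted as an SVG string using only the standard library. The function name, dimensions and counts are hypothetical, and the platform's actual charting approach is not described in the slides.

```python
from html import escape


def bar_chart_svg(counts, bar_width=40, gap=20, height=120):
    """Render a non-empty dict of label -> count as a minimal SVG bar chart."""
    top = max(counts.values()) or 1  # avoid dividing by zero if all counts are 0
    parts = []
    for k, (label, n) in enumerate(counts.items()):
        x = gap + k * (bar_width + gap)
        h = round(height * n / top)  # bar height scaled to the largest count
        parts.append(
            f'<rect x="{x}" y="{height - h}" width="{bar_width}" height="{h}"/>'
        )
        parts.append(f'<text x="{x}" y="{height + 15}">{escape(label)}</text>')
    width = gap + len(counts) * (bar_width + gap)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height + 20}">' + "".join(parts) + "</svg>"
    )


# Hypothetical counts standing in for the output of query2.
svg = bar_chart_svg({"Alive": 3, "Dead": 1})
```

The returned string can be embedded directly in an HTML page, so each additional chart type only needs its own rendering function.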
56. FIFTH PROJECT TEXT MINING
.in+0=========== REPORT TEXT ==========.br.br.br.br.br.br.brMRI THORACOLUMBAR +C of
13-Oct-2015:.br.brIndication.brMetastatic breast carcinoma. Currently complaint of weakness of the lower limb
bilaterally. TRO cord involvement/ compression..br.brSequences.brCoronal T1W, T2W.brSagittal T1W,T2W,
CISS3D.brAxial CISS3D, T2W.brPost gad- sag, axial.br.brFindings.brCorrelation made with previous CT dated
8.6.2015..br.brNormal spinal alignment..brMultilevel enhancing high signal intensity on T1/T2W/STIR is seen in
the thoracic and lumbar vertebral bodies, sacrum, both ilium and sternum..brReduced T12 vertebral body
height..brThey are expansion of the vertebral body at T3, T6, T9, T10 and T12 levels causing indentation of the
spinal cord worst at the T12 level with AP diameter of the spinal canal measures 0.8cm. .brThe rest of vertebral
body heights are preserved..brThe intervertebral disc returns normal signal..brThe spinal cord ends at L1. No
intramedullary lesion seen or high signals seen in spinal cord at T2WI. .br.brImpression.brBreast carcinoma
with .br1. Multiple vertebral body metastasis with multilevel spinal cord indentation..br2. Pathological
compression fracture and spinal canal stenosis at T12 level .br.br.brDrs Vithya / Nur Aishah / DR
NAZRI.br.br.br.br.br.br.brReport Written By: Dr MOHAMMAD NAZRI BIN MD. SHAH? 13-OCT-
2015 05:05 PM.brReport Approved By : Dr MOHAMMAD NAZRI BIN MD. SHAH? 13-OCT-2015 05:05
PM.br.br.br========== REPORT TEXT END =========.br.in-0
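A first text-mining step on a report like this is to strip the layout markers and split the body into its named sections. The sketch below is one possible approach under stated assumptions: it treats every ".br" as a line break, looks for the REPORT TEXT banners, and assumes the four section headings seen in this sample. The function name is hypothetical, and the input is an abridged, name-free excerpt of the report above.

```python
import re

# Section headings as they appear in the sample report (an assumption for other reports).
SECTION_NAMES = ("Indication", "Sequences", "Findings", "Impression")


def parse_report(raw):
    """Split a raw report string into {section name: text}."""
    # Replace the '.br' line-break markers with newlines.
    text = raw.replace(".br", "\n")
    # Keep only the part between the REPORT TEXT banners, if present.
    match = re.search(r"=+ REPORT TEXT =+(.*?)=+ REPORT TEXT END =+", text, re.S)
    if match:
        text = match.group(1)
    sections, current = {}, None
    for line in (ln.strip() for ln in text.splitlines()):
        if line in SECTION_NAMES:
            current = line          # start a new section
            sections[current] = []
        elif line and current:
            sections[current].append(line)
    return {name: " ".join(lines) for name, lines in sections.items()}


raw = (
    ".in+0=========== REPORT TEXT ==========.br.br"
    "Indication.brMetastatic breast carcinoma..br.br"
    "Findings.brNormal spinal alignment..brReduced T12 vertebral body height..br.br"
    "Impression.brBreast carcinoma with .br1. Multiple vertebral body metastasis "
    "with multilevel spinal cord indentation..br"
    "========== REPORT TEXT END =========.br.in-0"
)
parsed = parse_report(raw)
```

A plain string replacement would also break any genuine word that begins with ".br" after a full stop, so a production pipeline would need a stricter rule for the markers.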
58. Towards an AI Enabled Digital Platform for Modern Cancer Healthcare
• A Fully Automated EMR for BCM (iPesakit BCM)
• AI Enabled Breast Cancer Research EMR
• AI Powered Sentiment Analysis
• Research Databases
• Public Opinion
• Mobile Application for E-Communication between doctors, nurses and patients
• Monitoring & Predictions
59. Summary
• Data Science Is Reshaping Healthcare
• Techniques Such As Machine Learning, Deep Learning And Text Mining Are The Core of Data Science
• UMMC Is One of The Primary Hospitals In The Country To Embark On Data Science Projects
• The Data Science Pipeline Produced And Tested In These Projects Will Be Used In Other Asian Hospitals
THANK YOU FOR LISTENING
60. Professor. Dr. Nur Aishah Binti Mohd Taib, UM,
Dr. Nurul Aqilah Mohd Nor, Bioinformatics, UM,
Dr. Tania Islam, MyBCC Project Manager, UM,
Professor Pietro Lio, Adviser, University of Cambridge, UK,
Tan Wee Ming, Programmer, student, Bioinformatics, UM,
Dr. Elham Yousef Kalafi, Researcher, Bioinformatics, UM
Mogana Darshini Ganggayah, student, Bioinformatics, UM
Editor's Notes
There are 4 studies to be discussed.
Newspaper clips from NST and the Star in 2018
Acknowledges the importance of well-structured clinical big data management.
Highlights improving the quality of service in healthcare.
However…
The EMR innovation matches the UMMC direction and priorities towards the hospital’s mission, especially in bridging the gap between clinical practice and research through an efficient data system.
Involves multiple disciplines:
Clinicians and nurses as front-line staff who will deliver the innovation
System structure design by bioinformaticians, module content by clinicians, and implementation by the IT experts
Data capture had been manual and retrospective, done by tracing patients’ notes for clinical and treatment characteristics.
This method is expensive, with a high probability of missing values and inaccuracies. Reducing manual work through automated data capture systems increases workflow efficiency and improves research outcomes. In a typical clinical set-up, these primary data are used for surgical audits to measure hospital performance, while secondary-use data are used in epidemiological analysis in breast cancer outcome research.
Aggregating data from different sources in healthcare and research is important [17]. The effectiveness and data quality of records can be improved by enhancing the research database features. Elements needed for a successful clinical research database include engagement of clinicians, utility for research and the ability to integrate with legacy systems [18].
Advances in the scientific field, particularly in the medical domain, have led to an increase in clinical data production, which offers opportunities to enhance the clinical research sector. More initiatives are needed to connect computational systems such as electronic medical records (EMR) with research platforms. EMR is primarily designed to meet clinical practice needs for patient care. As the usage of EMR expands, there are more opportunities to extend the system through data interoperability to facilitate breast cancer clinical research activities.
Our objective is to extend EMR and develop a breast cancer research knowledgebase system for easy data access, secondary data use for data mining, and interoperability between multiple clinical departments. The rationale for establishing the proposed EMR system is to provide convenient data access to users in a typical clinical set-up. In this study, we introduce a breast cancer research clinical data management system integrated with EMR in the clinical environment of diagnosed breast cancer patients in UMMC.
We adopted the Quality Implementation Framework (QIF) because it synthesizes existing models and research support to provide a conceptual overview of the critical steps that comprise quality implementation.
There are five groups of crucial members in this EMR implementation: the project manager and the critical stakeholders, which include
the hospital management and governance team,
physician champions,
the system design and development team, and
the evaluation and quality assurance team.
The project manager is the lead person in facilitating the implementation steps, connecting the different implementation phases and coordinating the planning, design, development and testing phases between team members.
The hospital management and governance team from the Patient Information Department provides feedback on governance and policy matters pertaining to data sharing, privacy and confidentiality.
Physician champions have credibility with clinical staff and hospital administration, and promote the value of the innovation through stakeholder engagement. They are also the main point of reference from the clinical perspective, acting as experts on content and EMR functionality so that the digital workflow closely matches the actual clinical workflow.
The system designers (bioinformaticians) act as a liaison between the physician champions and the system development team (IT staff) in connecting ideas and suitable concepts. Bioinformaticians design the digital system workflow, templates and structure by gathering EMR requirements from the physician champions and putting them into technical form for the system developers to take into the development phase.
IT staff are responsible for building, customizing and deploying the breast cancer module, as well as maintaining the system, which is evaluated by the evaluation and quality assurance team (breast care nurses). Nurses conduct on-site system testing and performance reviews, and coordinate user training within the practice.
MyBCC : Malaysian Breast Cancer Survivorship Cohort
Breast Q : patients experience with the service, quality measurement
Gone Live since February 2016 VIDEO
We have been working closely with the clinical specialists as well as IT experts in coming out with the best solution that would benefit both clinicians and researchers.
Progressed with Onco department ; in e-prescribing chemotherapy, which links to the Pharm dept
Survey results based on system evaluation test survey conducted at Department of Surgery on 18/7/2017
- Clinicians spend less time interacting with patients
Lessons learnt :
-Personal Data Protection Act (PDPA)
Challenges :
-System testing
- User training
What do we propose?
The Ninth Malaysia Plan (9MP), a national budget focus for 2009, highlighted strengthening the Health Information System to improve point-of-care service and information access; however, to date the success rate has been low. There is an urgent need to develop a reliable, integrated and interoperable Health Information Management system using an implementation framework.
Enhanced EMR provides point-of-care data for clinical visits which allows data tracking over time, useful in identifying patients who are due for preventive visits and screenings, cancer-follow up, monitoring recurrence and death.
Ability to produce reports on BC survival in Malaysia to be used by stakeholders (clinicians, researchers, and the government) is essential for research, individual hospital performance for policymakers to track outcomes and provide direction in cancer control.
A time effective approach while producing new knowledge through data mining and analysis – Research & Development
The first one is the implementation of machine learning to predict the important factors influencing the survival rate of breast cancer. We used 6 different machine learning algorithms to evaluate the dataset, which contains 8066 patient records with 23 predictors.
There are 5 steps in the methodology: model evaluation, further modelling with random forest, variable selection, decision tree, and survival curves for validation.
In calibration analysis, all the algorithms show well calibrated measure for the provided dataset. The support vector machine classifier produced a sigmoid curve due to the margin property of hinge loss as it focused on hard samples closer to decision boundaries (the support vectors). The dataset for the prediction of breast cancer survival (‘all data’) seemed sufficiently reliable to proceed with the other steps, mainly because the calibration measures were closer to the diagonal or identity.
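The idea behind a calibration curve can be sketched in a few lines: predictions are grouped into probability bins, and the mean predicted probability in each bin is compared with the observed event rate. Points close to the diagonal indicate a well-calibrated model. The function name, the equal-width binning and the toy predictions below are assumptions for illustration and do not reproduce the study's analysis.

```python
def calibration_bins(probs, labels, n_bins=5):
    """Return (mean predicted probability, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 goes into the last bin.
        k = min(int(p * n_bins), n_bins - 1)
        bins[k].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            curve.append((mean_p, rate, len(members)))
    return curve


# Hypothetical predicted probabilities and true outcomes (1 = event occurred).
probs = [0.1, 0.1, 0.3, 0.7, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1]
curve = calibration_bins(probs, labels, n_bins=2)
```

Plotting the first element of each tuple against the second gives the reliability diagram; a sigmoid-shaped curve, as described for the support vector machine, shows up as points below the diagonal at low probabilities and above it at high ones.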
Both tumour size and positive lymph nodes are determinants of the stage of breast cancer according to AJCC and, as expected, both were identified as important variables in the variable selection process in this study. The tumour size cut-offs in this study were <2.5 cm, 2.5 < 4.8 cm, 4.8 < 11 cm, and >11 cm, while the AJCC manual categorises tumour size as less than or equal to 2 cm (T1) for Stage I, 2 – 4 cm (T2) for Stage II, and more than 4 cm (T3) for Stage III. The positive lymph node (PLN) cut-offs generated from the decision tree (DT) analysis in this study were <3, 3<9, and 9<18, whereas in the AJCC staging, PLN of less than or equal to 3 (N1) fell under Stage II breast cancer, PLN between 3 and 6 (N2) was categorised as Stage IIIA, and PLN exceeding 6 (N3) was under Stage IIIB.
The MyBCC study was started in 2012 as a cohort study of the lifestyle factors influencing the survival rate of breast cancer patients of multi-ethnic origin in Malaysia. There are 5 timelines: baseline (diagnosis), 6 months, 1 year, 3 years and 5 years. There are 603 variables/questions, and 800 patients had been recruited up to 2018.
Different types of visualizations for other variables are in progress.
Radiologists typically report their findings from mammogram or ultrasound in free text, and different radiologists have different ways of narrating the report. Thus, a structured radiology reporting system is needed in the medical field.