Knowledge graphs based extension of patients’ files to predict
hospitalization
Raphaël Gazzotti
Supervisors: Catherine Faron Zucker & Fabien Gandon
PhD Thesis Defense - April 2020
CURRENT STATUS
Problem of hospitalization:
● Rate of 191 hospitalizations
per 1000 inhabitants (12.7 million patients).
● On average a full hospitalization lasts:
○ 6.4 days in medicine, surgery, obstetrics.
○ 36.4 days in follow-up care and
rehabilitation.
○ 44.5 days in home hospitalization.
○ 57.0 days in psychiatry.
90%
Perceived as a failure by the physician.
Increase in the number of patients in a state of comorbidity (multiplicity of chronic diseases).
Feeling of abandonment of physicians
General practitioners problem:
Difficulty to plan actions against comorbidity:
● Lack of recommendations.
● Multiplication of treatments and side
effects.
● Risk of applying recommendations for
isolated diseases.
[Lacroix-Hugues et al. 2012]
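PRIMEGE [SINCE 2012] - DATASET
Balanced dataset. Sample record (French source text, English gloss in brackets):
● Gender: H [male]
● Consultation’s reason: vaccin-antitétanique [tetanus vaccination]
● ICPC2: A44
● ...
● History: Appendicite [appendicitis]
● Observations: EN CP - Bon état général - auscult pulm libre; bdc rég sans souffle - tympans ok- [EN CP - good general condition - clear lung auscultation; regular heart sounds without murmur - eardrums OK]
TOWARDS A DECISION SUPPORT TOOL FOR PHYSICIANS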
Preventive system that orders the risk factors involved in patient hospitalization:
1. Predict hospitalization => learn the risk factors.
2. Sort the risk factors found in patient’s file (personalised medicine).
3. Provide a simulation with a decision support tool called Health Predict.
General practitioner:
● Preserve and/or improve his patient’s health and autonomy.
● Predict as rapidly and easily as possible the probability of hospitalization of his patient.
● Avoid hospitalization of his patient as far as possible.
Patient:
● Preserve and/or improve one’s health and autonomy.
● Avoid one’s hospitalization as far as possible.
Computer scientist:
● Use of methods that do not jeopardize data confidentiality.
● Make sense of all the variables contained in medical records.
THESIS OVERVIEW
I. Predicting hospitalization on a basic representation of
electronic medical records
II. Predicting hospitalization based on electronic medical
records representation enrichment
A. How to extract knowledge relevant for the prediction
of the occurrence of an event?
B. Can ontological augmentations of the features
improve the prediction of the occurrence of an event?
III. Decision support application
IV. Conclusion
VECTOR REPRESENTATION OF TEXT DATA IN EMRs
We opted for a Bag-of-words model:
● Main information extracted without
requiring a large corpus.
● Attributes are not transformed.
● Integration of heterogeneous data
with concatenation.
Doc. 1  History of headaches and facial weakness.
Doc. 2  Severe structural defects of the right petrous temporal bone.
Doc. 3  Patient presented to the Emergency Room.
(a) Sample documents

Word      Term         Doc. 1  Doc. 2  Doc. 3
Word 1    History      1       0       0
Word 2    of           1       1       0
Word 3    headaches    1       0       0
Word 4    and          1       0       0
Word 5    facial       1       0       0
Word 6    weakness     1       0       0
Word 7    Severe       0       1       0
Word 8    structural   0       1       0
Word 9    defects      0       1       0
Word 10   the          0       1       1
Word 11   right        0       1       0
Word 12   petrous      0       1       0
Word 13   temporal     0       1       0
Word 14   bone         0       1       0
Word 15   Patient      0       0       1
Word 16   presented    0       0       1
Word 17   to           0       0       1
Word 18   Emergency    0       0       1
Word 19   Room         0       0       1
(b) Vocabulary used in the document collection, with (c) the word-document occurrence matrix
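The bag-of-words construction illustrated above can be sketched in a few lines of Python. This is a minimal illustration on the three sample documents, not the thesis implementation: the whitespace tokenizer, the binary (presence/absence) weighting and the function names are assumptions made for the example.

```python
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Assumed tokenizer: split on whitespace and strip surrounding punctuation.
    tokens = (tok.strip(".,;:") for tok in text.split())
    return [tok for tok in tokens if tok]


def build_vocabulary(docs: List[str]) -> Dict[str, int]:
    # Words are indexed in order of first appearance, as in table (b).
    vocab: Dict[str, int] = {}
    for doc in docs:
        for tok in tokenize(doc):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def occurrence_matrix(docs: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    # One row per word, one column per document; 1 if the word occurs in the document.
    matrix = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for tok in set(tokenize(doc)):
            matrix[vocab[tok]][j] = 1
    return matrix


docs = [
    "History of headaches and facial weakness.",
    "Severe structural defects of the right petrous temporal bone.",
    "Patient presented to the Emergency Room.",
]
vocab = build_vocabulary(docs)
matrix = occurrence_matrix(docs, vocab)
```

Each column of `matrix` is the vector of one document; heterogeneous data can then be integrated by concatenating further features to that vector.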
HANDLING TEMPORAL DIMENSION IN THE REPRESENTATION OF EMRs (1)
A patient's medical record can span a long period of time (several years):
● Use of non-sequential algorithms.
● This requires aggregating all of a patient's consultations.
Consultations kept for each patient:
● “Hospitalized” => All consultations occurring before hospitalization / return from hospitalization.
● “Not Hospitalized” => All their consultations.
● Hospitalization patterns detected with a regex.
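The aggregation rule above can be sketched as follows. The slides do not give the regular expression actually used, so `HOSPITALIZATION_PATTERN` is a hypothetical stand-in, and the record layout (ISO date, free-text note) is an assumption made for the example.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: the regex used in the thesis is not shown in the slides.
HOSPITALIZATION_PATTERN = re.compile(r"hospitali[sz]", re.IGNORECASE)


def aggregate_patient(consultations: List[Tuple[str, str]]) -> Tuple[str, bool]:
    """Return (aggregated text, hospitalized flag) for one patient.

    `consultations` holds (ISO date, note) pairs. Notes are kept in
    chronological order up to, but excluding, the first note that
    mentions a hospitalization.
    """
    kept: List[str] = []
    hospitalized = False
    for _date, note in sorted(consultations):  # ISO dates sort chronologically
        if HOSPITALIZATION_PATTERN.search(note):
            hospitalized = True
            break
        kept.append(note)
    return " ".join(kept), hospitalized


patient_a = [
    ("2015-03-02", "retour d'hospitalisation"),
    ("2014-01-10", "toux"),
    ("2014-06-01", "douleur thoracique"),
]
patient_b = [("2014-02-01", "vaccin"), ("2014-09-12", "asthénie")]
text_a, label_a = aggregate_patient(patient_a)
text_b, label_b = aggregate_patient(patient_b)
```

For a patient who is never hospitalized, every consultation is kept and the flag stays `False`.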
HANDLING TEMPORAL DIMENSION IN THE REPRESENTATION OF EMRs (2)
Sequential representation defined as:
● t_i = (x_i, y_i)
It handles both kinds of data (x_i):
● Permanent data
○ Family history
○ Personal history
○ Patient information
● Time-series data
○ Consultation specific
PREDICTING HOSPITALIZATION FROM EMRs: EXPERIMENTAL PROTOCOL
Validation of results by nested cross-validation:
● K = 10 for the outer loop.
● L = 2 for the inner loop, with hyperparameters tuned by random search (7 iterations).
Metric to evaluate vector representations:
● The F-measure F_tp,fp (Forman and Scholz (2010)), suited to the cross-validation context.
TP: True Positives
FP: False Positives
FN: False Negatives
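As a reminder of how the metric is computed, Forman and Scholz (2010) define F_tp,fp by summing TP, FP and FN over all folds before computing a single F-measure, F_tp,fp = 2·TP / (2·TP + FP + FN), whereas F_avg averages the per-fold F-measures. The sketch below contrasts the two; the fold counts are invented for illustration only.

```python
from typing import List, Tuple

Fold = Tuple[int, int, int]  # (TP, FP, FN) counted on one test fold


def f_measure(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    # Convention assumed here: F is 0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


def f_tp_fp(folds: List[Fold]) -> float:
    # Sum the counts over all folds, then compute one F-measure.
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    fn = sum(f[2] for f in folds)
    return f_measure(tp, fp, fn)


def f_avg(folds: List[Fold]) -> float:
    # Compute one F-measure per fold, then average them.
    return sum(f_measure(*f) for f in folds) / len(folds)


folds = [(8, 2, 0), (2, 0, 8)]  # illustrative counts, not thesis data
pooled = f_tp_fp(folds)
averaged = f_avg(folds)
```

The two values differ as soon as the folds are unbalanced, which is why the choice of aggregation matters when comparing representations.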
PREDICTING HOSPITALIZATION FROM EMRs: RESULTS
SVC     RF      Log     CRFs
0.819   0.831   0.850   0.834
Machine learning algorithms and hyperparameters optimized:
● SVC, C-Support vector classifier, libSVM [Chang and Lin 2011]:
○ The penalty parameter C, the kernel used and the kernel coefficient.
● RF, Random forest [Breiman 2001]:
○ Number of trees in the forest, the maximum depth of a tree, the minimum number of samples required to split an internal node, the minimum number of samples required to be at a leaf node and the maximum number of leaf nodes.
● LR (reported as Log), Logistic regression [McCullagh and Nelder 1989]:
○ The regularization coefficient C, the penalty used by the algorithm.
● CRFs, Conditional random fields [Sutton et al. 2012]:
○ The regularization coefficients c1 and c2 used by the solver L-BFGS.
No added value in terms of performance with a sequential representation.
TEXT REPRESENTATION & INJECTION OF DOMAIN KNOWLEDGE
[Figure: the BOW vector and the concept vector are concatenated into the resulting vector.]
Bag-of-words (BOW) representation:
● Main information extracted without
requiring a large corpus.
● Attributes are not transformed.
● Integration of heterogeneous data with
concatenation.
Injection of domain knowledge by
concatenation:
● BOW obtained from text fields.
● Concept vector from ontologies.
[Figure: the concept vector has one dimension per concept found in the data, e.g., 1: Organ failure, 2: Cardiovascular disease, ...]
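The injection by concatenation can be sketched as below. The binary encoding of concepts and the function names are assumptions made for this example; only the principle (append a concept vector to the BOW vector) comes from the slide.

```python
from typing import List, Sequence, Set


def concept_vector(found: Set[str], concept_index: Sequence[str]) -> List[int]:
    # One dimension per known concept: 1 if the concept was linked in the record.
    return [1 if concept in found else 0 for concept in concept_index]


def enrich(bow: List[int], found: Set[str], concept_index: Sequence[str]) -> List[int]:
    # Resulting vector = BOW vector followed by the concept vector.
    return bow + concept_vector(found, concept_index)


concept_index = ["Organ failure", "Cardiovascular disease"]
bow = [1, 0, 1]  # illustrative BOW vector for one patient
resulting = enrich(bow, {"Cardiovascular disease"}, concept_index)
```

Because the BOW attributes are left untouched, the same classifier can be trained with or without the appended concept dimensions, which makes the two representations directly comparable.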
ONTOLOGIES USED TO ENRICH THE REPRESENTATION OF EMRs
Open knowledge bases used:
● The French chapter of DBpedia with the property dcterms:subject.
● Wikidata with knowledge about drugs:
○ Properties: ‘subject has role’ (wdt:P2868), ‘medical condition treated’ (wdt:P2175) and ‘significant drug interaction’ (wdt:P769).
○ Property-concept pairs (e.g., meprobamate treats (wdt:P2175) headache).
Vocabularies specific to the health sector, as OWL-SKOS representations of:
● ATC (Anatomical, Therapeutic and Chemical):
○ e.g., ‘meprednisone’ (code H02AB15) has ‘Glucocorticoids, Systemic’ (code H02AB) as super class.
○ Concepts linked by the properties rdfs:subClassOf and atc:member_of are considered super classes.
● ICPC2 (International Classification of Primary Care):
○ e.g., ‘Symptom and complaints’ (H05) has ‘Ear’ (H) as super class (rdfs:subClassOf).
● NDF-RT (National Drug File - Reference Terminology):
○ Properties: ‘may_prevent’, ‘CI_with’, ‘may_treat’.
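To make the ATC hierarchy concrete, the sketch below derives the super classes of a code from its prefixes. This is a shortcut for illustration: the thesis follows rdfs:subClassOf and atc:member_of links in an OWL-SKOS representation, whereas this example only relies on the standard ATC code layout (levels of 1, 3, 4, 5 and 7 characters). The function name and the `max_depth` parameter are assumptions.

```python
from typing import List, Optional

# Lengths of ATC codes at levels 1 to 5 (e.g. H, H02, H02A, H02AB, H02AB15).
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_ancestors(code: str, max_depth: Optional[int] = None) -> List[str]:
    """Return the super classes of an ATC code, nearest first.

    `max_depth` limits how many levels are climbed, which mimics
    enriching with different hierarchical depths.
    """
    code = code.strip().upper()
    if len(code) not in ATC_LEVEL_LENGTHS:
        raise ValueError(f"not a valid ATC code length: {code!r}")
    ancestors = [code[:n] for n in reversed(ATC_LEVEL_LENGTHS) if n < len(code)]
    return ancestors if max_depth is None else ancestors[:max_depth]


all_levels = atc_ancestors("H02AB15")
one_level = atc_ancestors("H02AB15", max_depth=1)
```

With `max_depth=1`, meprednisone (H02AB15) is only generalized to H02AB, the ‘Glucocorticoids, Systemic’ class mentioned above.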
NAMED ENTITY RECOGNITION IN EMRs AND LINKING WITH ONTOLOGIES
Use of free text and structured data.
Not all drug-related properties necessarily
exist in Wikidata:
● RxNorm code
● cui code
● ATC code
NAMED ENTITY LINKING WITH DBPEDIA
Enrichment workflow used with DBpedia (limited set of concepts):
Speciality       Concept
Oncology         Neoplasm stubs, Oncology, Radiation therapy
Cardiovascular   Cardiovascular disease, Cardiac arrhythmia
Neuropathy       Neurovascular disease
Immunopathy      Malignant hemopathy, Autoimmune disease
Endocrinopathy   Medical condition related to obesity
Genopathy        Genetic diseases and disorders
Intervention     Surgical removal procedures, Organ failure
Emergencies      Medical emergencies, Cardiac emergencies
Text fields considered (+s, +s*):
● +s: Consider the text fields about: the patient's personal history, family history, allergies, environmental factors, past health problems, current health problems, reasons for consultations, diagnoses, medications, care procedures, reasons for prescribing medications, physician observations, symptoms and diagnosis.
● +s*: Consider the text fields about: the patient's personal history, allergies, environmental factors, current health problems, reasons for consultations, diagnoses, medications, care procedures, reasons for prescribing medications and physician observations.
EXAMPLE OF EXTRACTED DBPEDIA CONCEPTS
Patient 1
● French: prédom à gche - insuf vnse ou insuf cardiaque - pas signe de phlébite - - ne veut pas mettre de bas de contention et ne veut pas aumenter le lasilix... -
● English (translation): predom[inates] on the l[e]ft, venous or cardiac insuf[ficiency], no evidence of phlebitis, does not want to wear compression stockings and does not want to increase the lasix
● Concepts: Cardiovascular disease, Organ failure
Patient 2
● French: procédure FIV - - transfert embryon samedi dernière - a fait hyperstimulation ovarienne; rupture de kyste - - asthénie, - - dleur abdo, doulleur à la palpation ++ - - voit gynéco la semaine prochaine pr controle betahcg, echo -
● English (translation): In vitro fertilization procedure, embryo transfer last Saturday, did ovarian hyperstimulation, cyst rupture, asthenia, abdominal [pain], [pain] on palpation ++, will see a gyneco[logist] next week [for] a beta HCG, echo check-up
● Concepts: Neoplasm stubs
PREFIX dbpedia-fr: <http://fr.dbpedia.org/resource/>
PREFIX category-fr: <http://fr.dbpedia.org/resource/Catégorie:>
Linking example: "insuffisance cardiaque" -> dbpedia-fr:Insuffisance_cardiaque -> category-fr:Défaillance_d'organe
AUTOMATIC EXTRACTION WITH DBPEDIA
Use of features extracted from text of the medical reports.
Generation of vector representations based on annotations
from feature selection.
SPARQL QUERY FOR CONCEPT EXTRACTION IN DBPEDIA
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX cat: <http://fr.dbpedia.org/resource/Catégorie:>
SELECT ?skos_subject WHERE {
  SERVICE <http://fr.dbpedia.org/sparql> {
    # Constraint on the medical domain
    VALUES ?concept_constraint {
      cat:Maladie             # disease
      cat:Santé               # health
      cat:Génétique_médicale  # medical genetics
      cat:Médecine            # medicine
      cat:Urgence             # emergency
      cat:Traitement          # treatment
      cat:Anatomie            # anatomy
      cat:Addiction           # addiction
      cat:Bactérie            # bacteria
    }
    # <link_dbpedia_spotlight> is a placeholder for the linked resource URI
    <link_dbpedia_spotlight> dbpedia-owl:wikiPageRedirects{0,1} ?page.
    ?page dcterms:subject ?page_subject.
    ?page_subject skos:broader{0,10} ?concept_constraint.
    ?page_subject skos:prefLabel ?skos_subject.
    ?page owl:sameAs ?page_en.
    # Filter used to select the corresponding resource in the
    # English chapter of DBpedia
    FILTER(STRSTARTS(STR(?page_en), "http://dbpedia.org/resource/"))
  }
  SERVICE <http://dbpedia.org/sparql> {
    ?page_en a ?type
  }
  # Selection on specific types on the DBpedia EN chapter
  FILTER(?type in (dbo:Disease, dbo:Bacteria,
                   yago:WikicatViruses,
                   yago:WikicatRetroviruses,
                   yago:WikicatSurgicalProcedures,
                   yago:WikicatSurgicalRemovalProcedures))
}
VALIDATION OF THE CONCEPTS SELECTED TO ANNOTATE EMRs
Annotation of the 285 concepts by 3 experts:
● 2 General practitioners
● 1 Biologist
Inter-rater agreement (relevance of concepts):
● 0.51 / 0.21 between the GPs.
● Without terminological conflicts 0.66 / 0.52
between the GPs (243 Concepts).
● Average of 198 concepts annotated as
useful in studying hospitalization risk.
Scores are not significant.
Harder than it looks.
ON THE CORRELATION BETWEEN HUMAN AND MACHINE ANNOTATORS (1)
Correlation between annotated sets of 285
concepts from both human and machine
annotators.
The correlation metric ranges from 0 to 2:
● 0: Perfect correlation
● 1: No correlation
● 2: Perfect negative correlation
ū is the mean of elements of u.
Cosine distance uses the same formula
without the mean.
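The formula itself did not survive extraction. The standard correlation distance matches the description above (range 0 to 2, cosine distance once the means are dropped): d(u, v) = 1 - ((u - ū)·(v - v̄)) / (‖u - ū‖ ‖v - v̄‖). The sketch below implements that definition, assuming it is the one used on the slide; the annotation vectors are invented for illustration.

```python
import math
from typing import Sequence


def correlation_distance(u: Sequence[float], v: Sequence[float]) -> float:
    # Center both vectors on their means, then take 1 - cosine similarity.
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cu = [x - mean_u for x in u]
    cv = [y - mean_v for y in v]
    dot = sum(a * b for a, b in zip(cu, cv))
    norm = math.sqrt(sum(a * a for a in cu)) * math.sqrt(sum(b * b for b in cv))
    if norm == 0:
        raise ValueError("correlation distance is undefined for constant vectors")
    return 1.0 - dot / norm


# Illustrative relevance annotations (1 = concept judged relevant).
annotator = [1, 0, 1, 0]
same = correlation_distance(annotator, [1, 0, 1, 0])
unrelated = correlation_distance(annotator, [1, 1, 0, 0])
opposite = correlation_distance(annotator, [0, 1, 0, 1])
```

Identical annotations give 0, unrelated ones give 1 and exactly opposite ones give 2, which are the three reference points listed above.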
ON THE CORRELATION BETWEEN HUMAN AND MACHINE ANNOTATORS (2)
Maximum variation:
● Maximum of 0.681 between humans (A1/A2)
● Maximum of 0.485 between machines (M10/U1)
Machines’ specificities:
● U1 similar to M5, with a score of 0.12
Closer annotation with the machines.
[Figure: matrix of pairwise correlation distances between annotators. Legend thresholds: > 0.5; 0.25 < x < 0.5; < 0.25.]
PREDICTING HOSPITALIZATION FROM EMRs: EXPERIMENTAL PROTOCOL
Validation of results by nested cross-validation:
● Re-shuffle of the folds.
● K = 10 for the outer loop.
● L = 3 for the inner loop, with hyperparameters tuned by random search (150 iterations).
Metric to evaluate vector representations:
● The F-measure F_tp,fp (Forman and Scholz (2010)), suited to the cross-validation context.
TP: True Positives
FP: False Positives
FN: False Negatives
● baseline: BOW generated from medical records.
● +t: ICPC2 concepts.
● +s: DBpedia on all text fields.
● +s*: DBpedia focussed on patient’s own record.
● +c: ATC on different hierarchical depth levels.
● +wa: Wikidata, property ‘subject has role’.
● +wi: Wikidata, property ‘significant drug interaction’.
● +wm: Wikidata, property ‘medical condition treated’.
● +d: enrichment with NDF-RT:
○ +d_prevent => ‘may_prevent’ property.
○ +d_treat => ‘may_treat’ property.
○ +d_CI => ‘CI_with’ property.
INDIVIDUAL FEATURES RESULTS - F_tp,fp metric
Features set   SVC      RF       Log
baseline       0.8270   0.8533   0.8491
+t             0.8239   0.8522   0.8545
+s             0.8221   0.8522   0.8485
+s*            0.8339   0.8449   0.8514
+c_1           0.8235   0.8433   0.8453
+c_1-2         0.8254   0.8480   0.8510
+c_2           0.8348   0.8522   0.8505
+wa            0.8223   0.8468   0.8545
+wi            0.8149   0.8484   0.8501
+wm            0.8221   0.8453   0.8458
+d_prevent     0.8254   0.8506   0.8479
+d_treat       0.8338   0.8472   0.8481
+d_CI          0.8281   0.8498   0.8460
[Figure: F_avg of individual features under Log.]
COMBINED FEATURES RESULTS & CONVERGENCE CURVE
Features set        SVC      RF       Log
+t+s+c_2+wa+wi      0.8258   0.8486   0.8547
+t+s*+c_2+wa+wi     0.8239   0.8494   0.8543
+t+c_2+wa+wi        0.8140   0.8531   0.8571
Better result with combined features.
[Figures: convergence curves of F_tp,fp and F_avg on Log.]
Introducing automatic selection in conceptual vector representation
Possible alternative concept vector representation of EMRs by union or intersection of the concepts selected in each fold.
[Figure: K folds, each with feature selection or annotated concepts, combined by union or by intersection.]
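Combining the per-fold selections can be sketched with plain set operations. The concept names and fold contents below are invented for illustration, and the function name is an assumption; only the union/intersection principle comes from the slide.

```python
from typing import Iterable, List, Set, Tuple


def combine_selected(per_fold: Iterable[Iterable[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (union, intersection) of the concepts selected in each fold."""
    folds: List[Set[str]] = [set(selection) for selection in per_fold]
    if not folds:
        return set(), set()
    union = set().union(*folds)
    intersection = set.intersection(*folds)
    return union, intersection


# Illustrative selections, e.g. concepts kept by feature selection on three folds.
per_fold = [
    {"Cardiovascular disease", "Mood disorder", "Diabetes"},
    {"Cardiovascular disease", "Mood disorder"},
    {"Cardiovascular disease", "Viral infection"},
]
union, intersection = combine_selected(per_fold)
```

The union keeps every concept that was selected at least once, while the intersection keeps only the concepts selected in all folds, so the latter always yields the smaller vector.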
ANNOTATIONS RESULTS ON DBPEDIA - F_tp,fp metric
Features set   SVC      RF       Log
baseline       0.8270   0.8533   0.8491
+s             0.8221   0.8522   0.8485
+s*            0.8339   0.8449   0.8514
+s*T           0.8214   0.8492   0.8388
+s*∩           0.8262   0.8521   0.8432
+s*∪           0.8270   0.8467   0.8445
+s*m           0.8363   0.8547   0.8642
+sm            0.8384   0.8541   0.8689
● baseline: BOW generated from medical records.
● +s: On all text fields (selection of 14 concepts).
● +s*: Focussed on patient’s own record (selection of 14 concepts).
● +s*T: All concepts automatically extracted, focus on patient’s own record.
● +s*∩: Intersection of the concepts estimated relevant by experts, focus on patient’s own record.
● +s*∪: Union of the concepts estimated relevant by experts, focus on patient’s own record.
● +s*m: Feature selection algorithm (Lasso) step on concepts, focus on patient’s own record.
● +sm: Feature selection algorithm (Lasso) step on concepts, focus on all text fields.
Generalisation of vector concepts on +sm:
● Intersection: 0.8662 (14 concepts)
● Union: 0.8714 (51 concepts)
Best results achieved with machine annotations on all text fields.
[Figure: F_avg of annotated features from DBpedia under Log.]
STATISTICAL COMPARISON OF THE RESULTS - T TEST
Correction of dependent Student’s t test (Nadeau and Bengio (2003)):
● Comparison of the different statistical tests (Demšar (2006)).
● Feature sets compared: baseline vs. other on LR.
● Alpha at 0.05.
● Critical values: 1.83 (F1) / 1.86 (AUC).
● Results reported as t-value/p-value pairs.
Features set        t-value/p-value (on F1)   t-value/p-value (on AUC)
+wa                 -1.06/0.32                0.11/0.92
+t+s+c_2+wa+wi      -0.47/0.65                0.02/0.98
+t+s*+c_2+wa+wi     -0.52/0.62                0.10/0.92
+t+c_2+wa+wi        -0.69/0.51                -0.16/0.87
+sm                 -1.57/0.151               -0.77/0.46
+sm∩                -1.62/0.139               -0.58/0.58
+sm∪                -2.23/0.05                -0.81/0.44
Given two sets A and B of length n:
● x_j = A_j - B_j
● n_2 is the number of testing folds, here 1.
● n_1 is the number of training folds, here 9.
● σ² is the sample variance of x.
The t test on +sm∪ rejects the null hypothesis.
To validate it on the other feature sets, the experiments would have to be rerun X times.
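The test statistic is not legible in the extracted slide. The corrected resampled t-test of Nadeau and Bengio (2003) is usually written t = x̄ / sqrt((1/n + n_2/n_1) · σ²), with x̄ the mean of the differences x_j. The sketch below implements that form with the symbols defined above; the per-fold scores are invented and the function name is an assumption.

```python
import math
from statistics import mean, variance
from typing import Sequence


def corrected_resampled_t(
    a: Sequence[float],
    b: Sequence[float],
    n_train_folds: int = 9,
    n_test_folds: int = 1,
) -> float:
    """Corrected t statistic on paired per-fold scores a and b."""
    x = [aj - bj for aj, bj in zip(a, b)]
    n = len(x)
    sigma2 = variance(x)  # sample variance (n - 1 in the denominator)
    # The n_2 / n_1 term widens the variance to account for overlapping training sets.
    return mean(x) / math.sqrt((1.0 / n + n_test_folds / n_train_folds) * sigma2)


# Illustrative per-fold scores, not thesis data.
scores_a = [0.85, 0.86, 0.84, 0.87]
scores_b = [0.83, 0.85, 0.84, 0.84]
t_value = corrected_resampled_t(scores_a, scores_b)
```

The resulting statistic is compared with the critical value of a Student distribution with n - 1 degrees of freedom; its magnitude is always smaller than that of the uncorrected paired t statistic, which makes the test more conservative.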
OVERALL RESULTS & DISCUSSION
Concepts involved in the hospitalization prediction (among the 51 selected concepts of +sm):
● Source: Generic knowledge
○ Concept: Terme médical
○ Concept (translated): Medical terminology
● Source: Patient’s mental state
○ Concept: Antidépresseur, Dépression (psychiatrie), Psychopathologie, Sémiologie psychiatrique, Trouble de l’humeur
○ Concept (translated): Antidepressant, Major depressive disorder, Psychopathology, Psychiatric assessment, Mood disorder
● Source: Infectious disease
○ Concept: Infection ORL, Infection urinaire, Infection virale, Virologie médicale
○ Concept (translated): ENT infection, Urinary tract infection, Viral infection, Clinical virology
● Source: Cardiovascular system
○ Concept: Dépistage et diagnostic du système cardio-vasculaire, Maladie cardio-vasculaire, Physiologie du système cardio-vasculaire, Signe clinique du système cardio-vasculaire, Trouble du rythme cardiaque
○ Concept (translated): Screening and diagnosis of the cardiovascular system, Cardiovascular disease, Physiology of the cardiovascular system, Clinical sign of the cardiovascular system, Cardiac arrhythmia
● Source: Family history
○ Concept: Diabète, Terme médical
○ Concept (translated): Diabetes, Medical terminology
Use of technical terminology in a situation involving a complex medical case.
The semi-supervised step retrieves the most relevant concepts.
THE PHYSICIAN’S NEEDS
General practitioner:
● Preserve and/or improve her patient’s health and autonomy.
● Predict as rapidly and easily as possible the probability of hospitalization of her patient.
● Avoid hospitalization of her patient as far as possible.
Example scenario:
Dr. Nathalie predicting the hospitalization of her
patient Patrick (57 y.o.).
USE OF HEALTH PREDICT BY THE PHYSICIAN (1)
(Once connected to Health Predict through a plugin directly integrated into one’s consultation software)
To decide if patient Patrick needs to be hospitalized, Dr. Nathalie:
1. checks Patrick’s 5 risk factors that she can act upon in order to reduce the hospitalization probability.
2. selects the first two risk factors (those with the most impact) for a treatment. In A, she observes the decrease in the risk of hospitalization and in B she sees that the decrease in risk is 17%.
[Screenshot of the Health Predict interface with callouts 1, 2, A and B.]
USE OF HEALTH PREDICT BY THE PHYSICIAN (2)
To decide if patient Patrick needs to be hospitalized, Dr. Nathalie:
3. checks Patrick’s other risk factors in order to avoid side effects or contraindicated treatments.
4. identifies that some antidepressants can be contraindicated with the patient’s condition (atrial fibrillation, a risk factor identified in 1, pacemaker, paroxysmal tachycardia).
[Screenshot of the Health Predict interface with callouts 1, 3 and 4.]
THE PATIENT’S NEEDS
Patient:
● Preserve and/or improve one’s health and autonomy.
● Avoid one’s hospitalization as far as possible.
Example scenario:
Patient Patrick (57 y.o.) negotiating the hospitalization
preventing treatment with Dr. Nathalie.
USE OF HEALTH PREDICT BY THE PATIENT
(The physician, Dr. Nathalie, shows her monitor with the Health Predict plugin to her patient Patrick. She wants her patient to stop smoking and drinking. She also plans to deal with her patient’s depression.)
● Patrick doesn't feel ready to quit smoking and drinking at the same time.
● However, he may make an effort to stop smoking and can be supported by a professional (depression and drinking).
● He notices that these actions would make him 28% less likely to be hospitalized.
[Screenshot of the Health Predict interface with callouts A and B.]
CONCLUSION
● We have shown that the injection of domain knowledge succeeds in improving the prediction of hospitalization, in particular with the +sm∪ approach, for which the t-test rejects the null hypothesis.
● We proposed a semi-supervised process for selecting relevant concepts from knowledge graphs for a given task.
● We defined physicians’ needs and built an interface adapted to some of these needs that exploits the results of our research in AI. Usability tests of the interface still need to be carried out.
LIMITS & FUTURE WORK
Vector representation:
● Coupling of semantic relationship and text data.
Additional medical domain knowledge:
● Integration of other annotators on Wikidata and on
domain specific knowledge graphs.
Address the weaknesses of the approach using DBpedia (free text):
● Better management of abbreviations / spelling mistakes.
● Use of different levels of hierarchy.
● Detect the negation and the context of a medical concept.
● Identify complex medical expressions.
Selection of concepts on combined features.
AREA FOR FUTURE RESEARCH
Data related:
● Inclusion of biological values.
Personalized treatment:
● Suggestions of the most suitable treatment for the patient.
Consideration of other health issues:
● Study the risks associated with rehospitalization,
cardiovascular diseases, mental illness...
● Hospitalization risk related to Covid-19.
Thank you for your attention.
● Gazzotti, R., Faron-Zucker, C., Gandon, F., Lacroix-Hugues, V. and Darmon, D., 2019, January. Évaluation des améliorations de
prédiction d'hospitalisation par l'ajout de connaissances métier aux dossiers médicaux. In EGC 2019-Conférence Extraction et
Gestion des connaissances 2019.
● Gazzotti, R., Faron-Zucker, C., Gandon, F., Lacroix-Hugues, V. and Darmon, D., 2019, June. Injecting Domain Knowledge in
Electronic Medical Records to Improve Hospitalization Prediction. In European Semantic Web Conference (pp. 116-130). Springer,
Cham.
● Gazzotti, R., Noual, E., Faron-Zucker, C., Gandon, F., Giboin, A., Lacroix-Hugues, V. and Darmon, D., 2019, July. Interface d'Aide à la
Décision pour Prédire l'Hospitalisation de Patients et planifier les Actions Préventives pour Prévenir cet Événement.
● Gazzotti, R., Faron-Zucker, C., Gandon, F., Lacroix-Hugues, V. and Darmon, D., 2020, March. Injection of automatically selected
DBpedia subjects in electronic medical records to boost hospitalization prediction. In Proceedings of the 35th Annual ACM
Symposium on Applied Computing (SAC ’20).

Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
 
Top Rated Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
Top Rated  Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...Top Rated  Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
Top Rated Hyderabad Call Girls Erragadda ⟟ 9332606886 ⟟ Call Me For Genuine ...
 
Call Girls in Delhi Triveni Complex Escort Service(🔝))/WhatsApp 97111⇛47426
Call Girls in Delhi Triveni Complex Escort Service(🔝))/WhatsApp 97111⇛47426Call Girls in Delhi Triveni Complex Escort Service(🔝))/WhatsApp 97111⇛47426
Call Girls in Delhi Triveni Complex Escort Service(🔝))/WhatsApp 97111⇛47426
 
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
 
Call Girls Guntur Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Guntur  Just Call 8250077686 Top Class Call Girl Service AvailableCall Girls Guntur  Just Call 8250077686 Top Class Call Girl Service Available
Call Girls Guntur Just Call 8250077686 Top Class Call Girl Service Available
 

PhD Defense - Knowledge graphs based extension of patients’ files to predict hospitalization

  • 1. Knowledge graphs based extension of patients’ files to predict hospitalization Raphaël Gazzotti Supervisors: Catherine Faron Zucker & Fabien Gandon PhD Thesis Defense - April 2020
  • 2. CURRENT STATUS
      Problem of hospitalization:
      ● Rate of 191 hospitalizations per 1000 inhabitants (12.7 million patients).
      ● On average, a full hospitalization lasts:
        ○ 6.4 days in medicine, surgery, obstetrics.
        ○ 36.4 days in follow-up care and rehabilitation.
        ○ 44.5 days in home hospitalization.
        ○ 57.0 days in psychiatry.
      90% Perceived as a failure by the physician.
      Increase in the number of patients in a state of comorbidity (multiplicity of chronic diseases).
      Feeling of abandonment of physicians.
      General practitioners' problem: difficulty planning actions against comorbidity:
      ● Lack of recommendations.
      ● Multiplication of treatments and side effects.
      ● Risk of applying recommendations for isolated diseases.
  • 3. PRIMEGE [SINCE 2012] - DATASET
      [Lacroix-Hugues et al. 2012]
      Balanced dataset.
      Sample record:
        Gender:                H
        Consultation’s reason: vaccin-antitétanique
        ICPC2:                 A44
        ...
        History:               Appendicite
        Observations:          EN CP - Bon état général - auscult pulm libre; bdc rég sans souffle - tympans ok-
  • 4. TOWARDS A DECISION SUPPORT TOOL FOR PHYSICIANS
      Preventive system that orders the risk factors involved in patient hospitalization:
      1. Predict hospitalization => learn the risk factors.
      2. Sort the risk factors found in the patient’s file (personalised medicine).
      3. Provide a simulation with a decision support tool called Health Predict.
      General practitioner:
      ● Preserve and/or improve his patient’s health and autonomy.
      ● Predict as rapidly and easily as possible the probability of hospitalization of his patient.
      ● Avoid hospitalization of his patient as far as possible.
      Patient:
      ● Preserve and/or improve one’s health and autonomy.
      ● Avoid one’s hospitalization as far as possible.
      Computer scientist:
      ● Use of methods that do not jeopardize data confidentiality.
      ● Make sense of all the variables contained in medical records.
  • 5. THESIS OVERVIEW
      I. Predicting hospitalization on a basic representation of electronic medical records
      II. Predicting hospitalization based on electronic medical records representation enrichment
        A. How to extract knowledge relevant for the prediction of the occurrence of an event?
        B. Can ontological augmentations of the features improve the prediction of the occurrence of an event?
      III. Decision support application
      IV. Conclusion
  • 7. VECTOR REPRESENTATION OF TEXT DATA IN EMRs
      We opted for a bag-of-words model:
      ● Main information extracted without requiring a large corpus.
      ● Attributes are not transformed.
      ● Integration of heterogeneous data with concatenation.
      (a) Sample documents
        Doc. 1: History of headaches and facial weakness.
        Doc. 2: Severe structural defects of the right petrous temporal bone.
        Doc. 3: Patient presented to the Emergency Room.
      (b) Vocabulary used in the document collection and (c) word-document occurrence matrix
        Word      Term         Doc. 1  Doc. 2  Doc. 3
        Word 1    History      1       0       0
        Word 2    of           1       1       0
        Word 3    headaches    1       0       0
        Word 4    and          1       0       0
        Word 5    facial       1       0       0
        Word 6    weakness     1       0       0
        Word 7    Severe       0       1       0
        Word 8    structural   0       1       0
        Word 9    defects      0       1       0
        Word 10   the          0       1       1
        Word 11   right        0       1       0
        Word 12   petrous      0       1       0
        Word 13   temporal     0       1       0
        Word 14   bone         0       1       0
        Word 15   Patient      0       0       1
        Word 16   presented    0       0       1
        Word 17   to           0       0       1
        Word 18   Emergency    0       0       1
        Word 19   Room         0       0       1
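As a minimal sketch of the bag-of-words model on this slide, the snippet below rebuilds the vocabulary and the word-document occurrence matrix for the three sample documents. The letters-only tokenizer and the function name `build_occurrence_matrix` are assumptions made for illustration; the slides do not say how the text was tokenized.

```python
import re


def build_occurrence_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if word i occurs in document j."""
    # Assumed tokenizer: runs of ASCII letters, case preserved.
    tokenized = [re.findall(r"[A-Za-z]+", doc) for doc in docs]

    # Vocabulary in order of first appearance, as on the slide.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    token_sets = [set(tokens) for tokens in tokenized]
    matrix = [[1 if word in s else 0 for s in token_sets] for word in vocabulary]
    return vocabulary, matrix


docs = [
    "History of headaches and facial weakness.",
    "Severe structural defects of the right petrous temporal bone.",
    "Patient presented to the Emergency Room.",
]
vocabulary, matrix = build_occurrence_matrix(docs)
for word, row in zip(vocabulary, matrix):
    print(f"{word:<12}{row}")
```

Each row of `matrix` corresponds to one vocabulary word and each column to one document, so a patient-level vector is obtained by reading a column.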
  • 8. HANDLING TEMPORAL DIMENSION IN THE REPRESENTATION OF EMRs (1)
      A patient's medical record can be spread over a long period of time (several years):
      ● Use of non-sequential algorithms.
      ● This requires aggregating all of a patient's consultations.
      Consultations retained for each class of patients:
      ● “Hospitalized” => All consultations occurring before hospitalization / return from hospitalization.
      ● “Not Hospitalized” => All their consultations.
      ● Hospitalization patterns detected with a regex.
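The aggregation rule on this slide can be sketched as follows. This is only an illustration: the slides say a regex detects hospitalization but do not give it, so `HOSPITALIZATION_PATTERN`, the function name, and the sample records are all hypothetical.

```python
import re
from datetime import date

# Hypothetical pattern; the actual regex used in the thesis is not given.
HOSPITALIZATION_PATTERN = re.compile(r"hospitalis", re.IGNORECASE)


def consultations_to_aggregate(consultations):
    """consultations: list of (date, text) tuples for one patient.

    Returns (label, texts): for a hospitalized patient, only the consultations
    that precede the first one mentioning a hospitalization; otherwise all of them.
    """
    kept = []
    for _, text in sorted(consultations, key=lambda c: c[0]):
        if HOSPITALIZATION_PATTERN.search(text):
            return "Hospitalized", kept
        kept.append(text)
    return "Not Hospitalized", kept


# Invented records, for illustration only.
records = [
    (date(2015, 3, 2), "toux, fièvre"),
    (date(2015, 1, 10), "vaccin-antitétanique"),
    (date(2015, 6, 1), "retour d'hospitalisation"),
    (date(2015, 9, 1), "contrôle"),
]
label, texts = consultations_to_aggregate(records)
print(label, texts)
```

The retained texts would then be merged into a single document per patient before building the bag-of-words vector.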
  • 9. HANDLING TEMPORAL DIMENSION IN THE REPRESENTATION OF EMRs (2)
      Sequential representation defined as:
      ● t_i = (x_i, y_i)
      It handles both kinds of data in x_i:
      ● Permanent data
        ○ Family history
        ○ Personal history
        ○ Patient information
      ● Time-series data
        ○ Consultation-specific
  • 10. PREDICTING HOSPITALIZATION FROM EMRs: EXPERIMENTAL PROTOCOL
      Validation of results by nested cross-validation:
      ● K at 10 for the outer loop.
      ● L at 2 for the inner loop, hyperparameters by random search (7 iterations).
      Metric to evaluate vector representations:
      ● The F-measure Ftp,fp (Forman and Scholz (2010)), suited to the cross-validation context.
      TP: True Positives, FP: False Positives, FN: False Negatives.
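For reference, Ftp,fp as defined by Forman and Scholz pools the counts over all test folds before computing a single F-measure: Ftp,fp = 2·TP / (2·TP + FP + FN), with TP, FP and FN summed across folds. The sketch below contrasts it with Favg, the mean of the per-fold F-measures. The function names and the per-fold counts are invented for illustration.

```python
def f_measure(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f_tp_fp(fold_counts):
    """Pool TP, FP and FN over all folds, then compute one F-measure."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    return f_measure(tp, fp, fn)


def f_avg(fold_counts):
    """Average of the F-measures computed separately on each fold."""
    return sum(f_measure(*c) for c in fold_counts) / len(fold_counts)


# Invented (TP, FP, FN) counts for three test folds.
folds = [(8, 2, 2), (5, 0, 5), (0, 1, 1)]
print(f"Ftp,fp = {f_tp_fp(folds):.4f}")
print(f"Favg   = {f_avg(folds):.4f}")
```

Pooling avoids the instability of per-fold scores when a fold contains few or no positive predictions, which is why the two values can differ noticeably.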
  • 11. PREDICTING HOSPITALIZATION FROM EMRs: RESULTS
        SVC     RF      Log     CRFs
        0.819   0.831   0.850   0.834
      Machine learning algorithms and hyperparameters optimized:
      ● SVC, C-Support vector classifier, libSVM [Chang and Lin 2011]:
        ○ The penalty parameter C, the kernel used and the kernel coefficient.
      ● RF, Random forest [Breiman 2001]:
        ○ Number of trees in the forest, the maximum depth of the tree, the minimum number of samples required to split an internal node, the minimum number of samples required to be at a leaf node and the maximum number of leaf nodes.
      ● LR (Log), Logistic regression [McCullagh and Nelder 1989]:
        ○ The regularization coefficient C, the penalty used by the algorithm.
      ● CRFs, Conditional random fields [Sutton et al. 2012]:
        ○ The regularization coefficients c1 and c2 used by the solver L-BFGS.
      No added value in terms of performance with a sequential representation.
  • 14. TEXT REPRESENTATION & INJECTION OF DOMAIN KNOWLEDGE
      Bag-of-words (BOW) representation:
      ● Main information extracted without requiring a large corpus.
      ● Attributes are not transformed.
      ● Integration of heterogeneous data with concatenation.
      Injection of domain knowledge by concatenation:
      ● BOW obtained from text fields.
      ● Concept vector from ontologies.
      Figure: concepts found in the data (1: Organ failure, 2: Cardiovascular disease, ...) form the concept vector; BOW + concept vector = resulting vector.
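The concatenation step can be sketched in a few lines. This is an illustrative assumption rather than the thesis implementation: the concept vector is taken to be binary (1 if the concept was detected in the patient's record), and the names `concept_vector`, `enrich` and `concept_index` are hypothetical.

```python
def concept_vector(detected_concepts, concept_index):
    """Binary vector over a fixed, ordered list of concepts (assumed encoding)."""
    return [1 if concept in detected_concepts else 0 for concept in concept_index]


def enrich(bow_vector, detected_concepts, concept_index):
    """Resulting vector = BOW vector followed by the concept vector."""
    return list(bow_vector) + concept_vector(detected_concepts, concept_index)


concept_index = ["Organ failure", "Cardiovascular disease"]
bow = [1, 0, 0, 1]  # toy BOW vector for one patient
resulting = enrich(bow, {"Cardiovascular disease"}, concept_index)
print(resulting)
```

Because the concept features are simply appended, the original BOW attributes remain untouched and any classifier that accepted the BOW vector accepts the enriched one.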
  • 15. ONTOLOGIES USED TO ENRICH THE REPRESENTATION OF EMRs
      Open knowledge bases used:
      ● The French chapter of DBpedia with the property dcterms:subject.
      ● Wikidata with knowledge about drugs:
        ○ Properties: ‘subject has role’ (wdt:P2868), ‘medical condition treated’ (wdt:P2175) and ‘significant drug interaction’ (wdt:P769).
        ○ Property-concept pairs (e.g., meprobamate treats (wdt:P2175) headache).
      Vocabularies specific to the health sector, OWL-SKOS representations of:
      ● ATC (Anatomical, Therapeutic and Chemical):
        ○ e.g., ‘meprednisone’ (code H02AB15) has ‘Glucocorticoids, Systemic’ (code H02AB) as super class…
        ○ Concepts linked by the properties rdfs:subClassOf and atc:member_of are considered super classes.
      ● ICPC2 (International Classification of Primary Care):
        ○ e.g., ‘Symptom and complaints’ (H05) has ‘Ear’ (H) as super class (rdfs:subClassOf).
      ● NDF-RT (National Drug File - Reference Terminology):
        ○ Properties: ‘may_prevent’, ‘CI_with’, ‘may_treat’.
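To make the ATC example concrete, the sketch below derives the super classes of a code from the standard ATC code layout (levels of 1, 3, 4, 5 and 7 characters). This is a stand-in chosen for illustration: the thesis retrieves super classes from the OWL-SKOS representation through rdfs:subClassOf and atc:member_of, and the function name and `max_depth` parameter are hypothetical.

```python
# Character lengths of the five ATC levels, e.g. H / H02 / H02A / H02AB / H02AB15.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_ancestors(code, max_depth=None):
    """Return the super classes of an ATC code, closest first.

    max_depth limits how many levels above the code are returned.
    """
    prefixes = [code[:n] for n in ATC_LEVEL_LENGTHS if n < len(code)]
    ancestors = list(reversed(prefixes))
    return ancestors if max_depth is None else ancestors[:max_depth]


print(atc_ancestors("H02AB15"))
print(atc_ancestors("H02AB15", max_depth=1))
```

Varying `max_depth` gives feature sets at different hierarchical depths, which is the kind of variation the ATC-based enrichment explores.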
  • 16. NAMED ENTITY RECOGNITION IN EMRs AND LINKING WITH ONTOLOGIES
      Use of free text and structured data.
      Not all drug-related properties necessarily exist in Wikidata:
      ● RxNorm code
      ● CUI code
      ● ATC code
  • 17. NAMED ENTITY LINKING WITH DBPEDIA
      Enrichment workflow used with DBpedia (limited set of concepts):
        Speciality        Concept
        Oncology          Neoplasm stubs, Oncology, Radiation therapy
        Cardiovascular    Cardiovascular disease, Cardiac arrhythmia
        Neuropathy        Neurovascular disease
        Immunopathy       Malignant hemopathy, Autoimmune disease
        Endocrinopathy    Medical condition related to obesity
        Genopathy         Genetic diseases and disorders
        Intervention      Surgical removal procedures, Organ failure
        Emergencies       Medical emergencies, Cardiac emergencies
      Text fields considered (+s, +s*):
      ● +s: Consider the text fields about: the patient's personal history, family history, allergies, environmental factors, past health problems, current health problems, reasons for consultations, diagnoses, medications, care procedures, reasons for prescribing medications, physician observations, symptoms and diagnosis.
      ● +s*: Consider the text fields about: the patient's personal history, allergies, environmental factors, current health problems, reasons for consultations, diagnoses, medications, care procedures, reasons for prescribing medications and physician observations.
  • 18. EXAMPLE OF EXTRACTED DBPEDIA CONCEPTS
      Patient 1
        French: prédom à gche - insuf vnse ou insuf cardiaque - pas signe de phlébite - - ne veut pas mettre de bas de contention et ne veut pas aumenter le lasilix...
        English (translation): predom[inates] on the l[e]ft, venous or cardiac insuf[ficiency], no evidence of phlebitis, does not want to wear compression stockings and does not want to increase the lasix
        Concepts: Cardiovascular disease, Organ failure
      Patient 2
        French: - procédure FIV - - transfert embryon samedi dernière - a fait hyperstimulation ovarienne; rupture de kyste - - asthénie, - - dleur abdo, doulleur à la palpation ++ - - voit gynéco la semaine prochaine pr controle betahcg, echo -
        English (translation): In vitro fertilization procedure, embryo transfer last Saturday, did ovarian hyperstimulation, cyst rupture, asthenia, abdominal [pain], [pain] on palpation ++, will see a gyneco[logist] next week [for] a beta HCG, echo check-up
        Concepts: Neoplasm stubs
      Linking example:
        PREFIX dbpedia-fr: <http://fr.dbpedia.org/resource/>
        PREFIX category-fr: <http://fr.dbpedia.org/resource/Catégorie:>
        insuffisance cardiaque -> dbpedia-fr:Insuffisance_cardiaque -> category-fr:Défaillance_d'organe
  • 19. AUTOMATIC EXTRACTION WITH DBPEDIA
      Use of features extracted from the text of the medical reports.
      Generation of vector representations based on annotations from feature selection.
  • 20. SPARQL QUERY FOR CONCEPT EXTRACTION IN DBPEDIA
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
        PREFIX dcterms: <http://purl.org/dc/terms/>
        PREFIX yago: <http://dbpedia.org/class/yago/>
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX cat: <http://fr.dbpedia.org/resource/Catégorie:>

        SELECT ?skos_subject WHERE {
          SERVICE <http://fr.dbpedia.org/sparql> {
            # Constraint on the medical domain
            VALUES ?concept_constraint {
              cat:Maladie              # disease
              cat:Santé                # health
              cat:Génétique_médicale   # medical genetics
              cat:Médecine             # medicine
              cat:Urgence              # emergency
              cat:Traitement           # treatment
              cat:Anatomie             # anatomy
              cat:Addiction            # addiction
              cat:Bactérie             # bacteria
            }
            # <link_dbpedia_spotlight> is a placeholder for the resource IRI to look up.
            # The {m,n} path quantifiers are not standard SPARQL 1.1; they rely on
            # endpoint-specific support (e.g. Virtuoso).
            <link_dbpedia_spotlight> dbpedia-owl:wikiPageRedirects{0,1} ?page .
            ?page dcterms:subject ?page_subject .
            ?page_subject skos:broader{0,10} ?concept_constraint .
            ?page_subject skos:prefLabel ?skos_subject .
            ?page owl:sameAs ?page_en .
            # Filter used to select the corresponding resource in the English chapter of DBpedia
            FILTER(STRSTARTS(STR(?page_en), "http://dbpedia.org/resource/"))
          }
          SERVICE <http://dbpedia.org/sparql> {
            ?page_en a ?type
          }
          # Selection on specific types in the English chapter of DBpedia
          FILTER(?type IN (dbo:Disease, dbo:Bacteria, yago:WikicatViruses,
                           yago:WikicatRetroviruses, yago:WikicatSurgicalProcedures,
                           yago:WikicatSurgicalRemovalProcedures))
        }
  • 21. VALIDATION OF THE CONCEPTS SELECTED TO ANNOTATE EMRs
      Annotation of the 285 concepts by 3 experts:
      ● 2 General practitioners
      ● 1 Biologist
      Inter-rater agreement (relevance of concepts):
      ● 0.51 / 0.21 between the GPs.
      ● Without terminological conflicts: 0.66 / 0.52 between the GPs (243 concepts).
      ● Average of 198 concepts annotated as useful in studying hospitalization risk.
      Scores are not significant. Harder than it looks.
  • 22. ON THE CORRELATION BETWEEN HUMAN AND MACHINE ANNOTATORS (1)
      Correlation between annotated sets of 285 concepts from both human and machine annotators.
      The correlation metric ranges from 0 to 2:
      ● 0: Perfect correlation
      ● 1: No correlation
      ● 2: Perfect negative correlation
      ū is the mean of the elements of u. Cosine distance uses the same formula without the mean.
      Maximum variation:
      ● Maximum of 0.681 between humans
      ● Maximum of 0.485 between machines
      Machines’ specificities:
      ● U1 is similar to M5 (score of 0.12)
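The description above (range 0 to 2, mean-centred, cosine distance being the same formula without the mean) matches the usual correlation distance, 1 - ((u - ū)·(v - v̄)) / (‖u - ū‖ ‖v - v̄‖). The sketch below is a plain-Python version under that assumption; the function name and the toy annotation vectors are illustrative, not taken from the thesis.

```python
import math


def correlation_distance(u, v):
    """1 - Pearson correlation of u and v; ranges from 0 to 2.

    Undefined (division by zero) if either vector is constant.
    """
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return 1.0 - numerator / denominator


# Toy binary annotations (1 = concept judged relevant) over four concepts.
annotator_a = [1, 0, 1, 0]
print(correlation_distance(annotator_a, [1, 0, 1, 0]))  # identical annotations
print(correlation_distance(annotator_a, [0, 1, 0, 1]))  # opposite annotations
print(correlation_distance([1, 1, 0, 0], [1, 0, 1, 0]))  # unrelated annotations
```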
  • 23. ON THE CORRELATION BETWEEN HUMAN AND MACHINE ANNOTATORS (2)
      Maximum variation:
      ● Maximum of 0.681 between humans (A1/A2)
      ● Maximum of 0.485 between machines (M10/U1)
      Machines’ specificities:
      ● U1 is similar to M5 (score of 0.12)
      Closer annotation with the machines.
      Figure legend (colour scale): > 0.5; 0.25 < x < 0.5; < 0.25.
  • 25. PREDICTING HOSPITALIZATION FROM EMRs: EXPERIMENTAL PROTOCOL
      Validation of results by nested cross-validation:
      ● Re-shuffle of the folds.
      ● K at 10 for the outer loop.
      ● L at 3 for the inner loop, hyperparameters by random search (150 iterations).
      Metric to evaluate vector representations:
      ● The F-measure Ftp,fp (Forman and Scholz (2010)), suited to the cross-validation context.
      TP: True Positives, FP: False Positives, FN: False Negatives.
  • 26. INDIVIDUAL FEATURES RESULTS - Ftp,fp metric
      ● baseline: BOW generated from medical records.
      ● +t: ICPC2 concepts.
      ● +s: DBpedia on all text fields.
      ● +s*: DBpedia focussed on patient’s own record.
      ● +c: ATC on different hierarchical depth levels.
      ● +wa: Wikidata, property ‘subject has role’.
      ● +wi: Wikidata, property ‘significant drug interaction’.
      ● +wm: Wikidata, property ‘medical condition treated’.
      ● +d: enrichment with NDF-RT:
        ○ prevent => ‘may_prevent’ property
        ○ treat => ‘may_treat’ property
        ○ CI => ‘CI_with’ property.
        Features set   SVC      RF       Log
        baseline       0.8270   0.8533   0.8491
        +t             0.8239   0.8522   0.8545
        +s             0.8221   0.8522   0.8485
        +s*            0.8339   0.8449   0.8514
        +c1            0.8235   0.8433   0.8453
        +c1-2          0.8254   0.8480   0.8510
        +c2            0.8348   0.8522   0.8505
        +wa            0.8223   0.8468   0.8545
        +wi            0.8149   0.8484   0.8501
        +wm            0.8221   0.8453   0.8458
        +dprevent      0.8254   0.8506   0.8479
        +dtreat        0.8338   0.8472   0.8481
        +dCI           0.8281   0.8498   0.8460
  • 28. COMBINED FEATURES RESULTS & CONVERGENCE CURVE
        Features set       SVC      RF       Log
        +t+s+c2+wa+wi      0.8258   0.8486   0.8547
        +t+s*+c2+wa+wi     0.8239   0.8494   0.8543
        +t+c2+wa+wi        0.8140   0.8531   0.8571
      Better result with combined features.
      Figure: convergence curves, Ftp,fp and Favg on Log.
  • 29. INTRODUCING AUTOMATIC SELECTION IN CONCEPTUAL VECTOR REPRESENTATION
      K-Folds with feature selection or annotated concepts.
      Possible alternative concept vector representation of EMRs by union or intersection of selected concepts from each fold.
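A minimal sketch of the union and intersection step: each fold yields its own set of selected concepts, and the two combined sets define alternative concept vocabularies. The function name and the per-fold selections below are invented for illustration.

```python
def combine_selected_concepts(per_fold_selection):
    """Return (union, intersection) of the concepts selected in each fold."""
    sets = [set(selection) for selection in per_fold_selection]
    if not sets:
        return set(), set()
    return set().union(*sets), set.intersection(*sets)


# Invented selections for three folds.
selected = [
    ["Cardiovascular disease", "Diabetes", "Mood disorder"],
    ["Cardiovascular disease", "Viral infection"],
    ["Cardiovascular disease", "Diabetes"],
]
union, intersection = combine_selected_concepts(selected)
print(sorted(union))
print(sorted(intersection))
```

The intersection keeps only concepts selected in every fold (a smaller, more stable set), while the union keeps any concept selected at least once.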
  • 30. ANNOTATIONS RESULTS ON DBPEDIA - Ftp,fp metric
        Features set   SVC      RF       Log
        baseline       0.8270   0.8533   0.8491
        +s             0.8221   0.8522   0.8485
        +s*            0.8339   0.8449   0.8514
        +s*T           0.8214   0.8492   0.8388
        +s*∩           0.8262   0.8521   0.8432
        +s*∪           0.8270   0.8467   0.8445
        +s*m           0.8363   0.8547   0.8642
        +sm            0.8384   0.8541   0.8689
      ● baseline: BOW generated from medical records.
      ● +s: On all text fields (selection of 14 concepts).
      ● +s*: Focussed on patient’s own record (selection of 14 concepts).
      ● +s*T: all concepts automatically extracted, focus on patient’s own record.
      ● +s*∩: intersection of concepts estimated relevant by experts, focus on patient’s own record.
      ● +s*∪: concepts unanimously estimated relevant by experts, focus on patient’s own record.
      ● +s*m: Feature selection algorithm (Lasso) step on concepts, focus on patient’s own record.
      ● +sm: Feature selection algorithm (Lasso) step on concepts, focus on all text fields.
      Generalisation of vector concepts on +sm:
      ● Intersection: 0.8662 (14 concepts)
      ● Union: 0.8714 (51 concepts)
      Best results achieved with machine annotations on all text fields.
  • 31. Favg OF ANNOTATED FEATURES FROM DBPEDIA UNDER Log (figure)
  • 32. STATISTICAL COMPARISON OF THE RESULTS - T TEST
      Correction of the dependent Student’s t test (Nadeau and Bengio (2003)):
      ● Comparison of the different statistical tests (Demšar (2006)).
      ● Features sets compared: baseline vs. other on LR.
      ● Alpha at 0.05.
      ● Critical values: 1.83 (F1) / 1.86 (AUC).
      ● Results reported as t-value/p-value pairs.
        Features set       t-value/p-value (on F1)   t-value/p-value (on AUC)
        +wa                -1.06/0.32                0.11/0.92
        +t+s+c2+wa+wi      -0.47/0.65                0.02/0.98
        +t+s*+c2+wa+wi     -0.52/0.62                0.10/0.92
        +t+c2+wa+wi        -0.69/0.51                -0.16/0.87
        +sm                -1.57/0.151               -0.77/0.46
        +sm∩               -1.62/0.139               -0.58/0.58
        +sm∪               -2.23/0.05                -0.81/0.44
      Given two sets A and B of length n:
      ● x_j = A_j - B_j
      ● n2 is the number of testing folds, here 1.
      ● n1 is the number of training folds, here 9.
      ● σ² is the sample variance of x.
      T test on +sm∪ rejects the null hypothesis. To validate it on the other features sets, the experiments should be rerun X times.
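With the symbols defined on this slide, the corrected resampled t statistic of Nadeau and Bengio is t = mean(x) / sqrt((1/n + n2/n1) · σ²). The sketch below computes it with the standard library; the function name and the toy scores are illustrative assumptions, and no p-value is computed here.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(a_scores, b_scores, n_train_folds=9, n_test_folds=1):
    """Corrected resampled t statistic on paired per-fold scores A and B."""
    x = [a - b for a, b in zip(a_scores, b_scores)]  # x_j = A_j - B_j
    n = len(x)
    sigma2 = variance(x)  # sample variance of x (n - 1 denominator)
    correction = 1.0 / n + n_test_folds / n_train_folds
    return mean(x) / math.sqrt(correction * sigma2)


# Toy paired scores, for illustration only.
t_value = corrected_resampled_t([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(round(t_value, 4))
```

The extra n2/n1 term widens the variance estimate to account for the overlap between training sets across folds, which the plain paired t test ignores.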
  • 33. OVERALL RESULTS & DISCUSSION
      Concepts involved in the hospitalization prediction (among the 51 selected concepts of +sm):
      Source: Generic knowledge
        Concept: Terme médical
        Concept (translated): Medical terminology
      Source: Patient’s mental state
        Concept: Antidépresseur, Dépression (psychiatrie), Psychopathologie, Sémiologie psychiatrique, Trouble de l’humeur
        Concept (translated): Antidepressant, Major depressive disorder, Psychopathology, Psychiatric assessment, Mood disorder
      Source: Infectious disease
        Concept: Infection ORL, Infection urinaire, Infection virale, Virologie médicale
        Concept (translated): ENT infection, Urinary tract infection, Viral infection, Clinical virology
      Source: Cardiovascular system
        Concept: Dépistage et diagnostic du système cardio-vasculaire, Maladie cardio-vasculaire, Physiologie du système cardio-vasculaire, Signe clinique du système-cardiovasculaire, Trouble du rythme cardiaque
        Concept (translated): Screening and diagnosis of the cardiovascular system, Cardiovascular disease, Physiology of the cardiovascular system, Clinical sign of the cardiovascular system, Cardiac arrhythmia
      Source: Family history
        Concept: Diabète, Terme médical
        Concept (translated): Diabetes, Medical terminology
      Use of technical terminology in a situation involving a complex medical case.
      The semi-supervised step makes it possible to retrieve the most relevant concepts.
  • 35. THE PHYSICIAN’S NEEDS
      General practitioner:
      ● Preserve and/or improve her patient’s health and autonomy.
      ● Predict as rapidly and easily as possible the probability of hospitalization of her patient.
      ● Avoid hospitalization of her patient as far as possible.
      Example scenario: Dr. Nathalie predicting the hospitalization of her patient Patrick (57 y.o.).
  • 36. USE OF HEALTH PREDICT BY THE PHYSICIAN (1)
      (Once connected to Health Predict through a plugin directly integrated into her consultation software)
      To decide if patient Patrick needs to be hospitalized, Dr. Nathalie:
      1. checks Patrick’s 5 risk factors that she can act upon in order to reduce the hospitalization probability.
      2. selects the first two risk factors (those with the most impact) for a treatment.
      In A, she observes the decrease in the risk of hospitalization, and in B she sees that the decrease in risk is 17%.
  • 37. USE OF HEALTH PREDICT BY THE PHYSICIAN (2)
      To decide if patient Patrick needs to be hospitalized, Dr. Nathalie:
      1. checks Patrick’s other risk factors in order to avoid side effects or contraindicated treatments.
      2. identifies that some antidepressants can be contraindicated with the patient’s condition (atrial fibrillation, a risk factor identified in 1, pacemaker, paroxysmal tachycardia).
  • 38. THE PATIENT’S NEEDS
      Patient:
      ● Preserve and/or improve one’s health and autonomy.
      ● Avoid one’s hospitalization as far as possible.
      Example scenario: Patient Patrick (57 y.o.) negotiating the hospitalization-preventing treatment with Dr. Nathalie.
  • 39. USE OF HEALTH PREDICT BY THE PATIENT
      (The physician, Dr. Nathalie, shows her monitor with the Health Predict plugin to her patient Patrick. She wants her patient to stop smoking and drinking. She also plans to deal with her patient’s depression.)
      ● Patrick doesn't feel ready to quit smoking and drinking at the same time.
      ● However, he may make an effort to stop smoking and can be attended by a professional (depression and drinking).
      ● He notices that these actions would make him 28% less likely to be hospitalized.
  • 41. CONCLUSION
      ● We have shown that the injection of domain knowledge succeeds in improving the prediction of hospitalization, in particular with the +sm∪ approach, for which the t-test rejects the null hypothesis.
      ● We proposed a semi-supervised process for selecting relevant concepts from knowledge graphs for a given task.
      ● Definition of physicians’ needs and construction of an interface adapted to some of their needs that exploits the fruit of our research in AI. The usability tests of our interface still need to be carried out.
  • 42. LIMITS & FUTURE WORK
      Vector representation:
      ● Coupling of semantic relationship and text data.
      Additional medical domain knowledge:
      ● Integration of other annotators on Wikidata and on domain-specific knowledge graphs.
      Address the weaknesses of the approach using DBpedia (free text):
      ● Better management of abbreviations / spelling mistakes.
      ● Use of different levels of hierarchy.
      ● Detect the negation and the context of a medical concept.
      ● Identify complex medical expressions.
      Selection of concepts on combined features.
  • 43. AREA FOR FUTURE RESEARCH
      Data related:
      ● Inclusion of biological values.
      Personalized treatment:
      ● Suggestions of the most suitable treatment for the patient.
      Consideration of other health issues:
      ● Study the risks associated with rehospitalization, cardiovascular diseases, mental illness...
      ● Hospitalization risk related to Covid-19.
  • 44. Thank you for your attention.
      ● Gazzotti, R., Faron-Zucker, C., Gandon, F., Lacroix-Hugues, V. and Darmon, D., 2019, January. Évaluation des améliorations de prédiction d'hospitalisation par l'ajout de connaissances métier aux dossiers médicaux. In EGC 2019-Conférence Extraction et Gestion des connaissances 2019.
      ● Gazzotti, R., Faron-Zucker, C., Gandon, F., Lacroix-Hugues, V. and Darmon, D., 2019, June. Injecting Domain Knowledge in Electronic Medical Records to Improve Hospitalization Prediction. In European Semantic Web Conference (pp. 116-130). Springer, Cham.
      ● Gazzotti, R., Noual, E., Faron-Zucker, C., Gandon, F., Giboin, A., Lacroix-Hugues, V. and Darmon, D., 2019, July. Interface d'Aide à la Décision pour Prédire l'Hospitalisation de Patients et planifier les Actions Préventives pour Prévenir cet Événement.
      ● Gazzotti, R., Faron-Zucker, C., Gandon, F., Lacroix-Hugues, V. and Darmon, D., 2020, March. Injection of automatically selected DBpedia subjects in electronic medical records to boost hospitalization prediction. In Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC ’20).