ESWC2019 - Injecting domain knowledge in electronic medical records to improve hospitalization prediction

Injecting Domain Knowledge in Electronic Medical Records
to Improve Hospitalization Prediction
Gazzotti, R., Faron Zucker, C., Gandon, F.,
Lacroix-Hugues, V., & Darmon, D.
ESWC2019 - 06.06.19

MOTIVATION
Problem of hospitalization:
● Rate of 191 hospitalizations per 1000 inhabitants (12.7 million patients).
● On average, a full hospitalization lasts:
○ 6.4 days in medicine, surgery, obstetrics.
○ 36.4 days in follow-up care and rehabilitation.
○ 44.5 days in home hospitalization.
○ 57.0 days in psychiatry.
Hospitalization is perceived as a failure by the physician.
The number of patients in a state of comorbidity (multiplicity of chronic diseases) is increasing.
90%

RESEARCH QUESTIONS
Can ontological augmentations of the features improve the prediction of the occurrence of an event?
1) How to integrate domain knowledge into a vector representation used by a machine learning algorithm?
2) Does the addition of domain knowledge improve the prediction of a patient’s hospitalization?
3) Which domain knowledge, combined with which machine learning methods, provides the best prediction of a patient’s hospitalization?

PRIMEGE (SINCE 2012) - DATASET
Balanced dataset.
Gender | Consultation’s reason | ICPC2 | ... | History | Observations
H | vaccin-antitétanique | A44 | ... | Appendicite | EN CP - Bon état général - auscult pulm libre; bdc rég sans souffle - tympans ok-

Can ontological augmentations of the features improve the prediction of the occurrence of an event?
1) How to integrate domain knowledge into a vector representation used by a machine learning algorithm?

TEXTUAL REPRESENTATION & INJECTION OF DOMAIN KNOWLEDGE
Bag-of-words (BOW) representation:
● Main information extracted without requiring a large corpus.
● Attributes are not transformed.
● Integration of heterogeneous data with concatenation.
Injection of domain knowledge by concatenation:
● BOW obtained from text fields.
● Concept vector from ontologies.
[Figure: the BOW and the concept vector (one column per concept, e.g. "Organ failure", "Cardiovascular disease", ...) are concatenated into the resulting vector.]
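The concatenation described on this slide can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names (`build_vocabulary`, `to_vector`) and the toy tokens and concepts are hypothetical, and the real pipeline may count or weight terms differently.

```python
def build_vocabulary(items_per_patient):
    """Map every distinct item (word or concept) to a column index, in sorted order."""
    vocabulary = sorted({item for items in items_per_patient for item in items})
    return {item: index for index, item in enumerate(vocabulary)}


def to_vector(items, vocabulary):
    """Count occurrences of each known item; unknown items are ignored."""
    vector = [0] * len(vocabulary)
    for item in items:
        if item in vocabulary:
            vector[vocabulary[item]] += 1
    return vector


# Hypothetical toy data: tokens from text fields and concepts found per patient.
patient_words = [
    ["insuf", "cardiaque", "phlébite"],
    ["asthénie", "kyste"],
]
patient_concepts = [
    ["Cardiovascular disease", "Organ failure"],
    [],
]

word_vocab = build_vocabulary(patient_words)
concept_vocab = build_vocabulary(patient_concepts)

# Resulting vector = BOW vector followed by the concept vector.
resulting_vectors = [
    to_vector(words, word_vocab) + to_vector(concepts, concept_vocab)
    for words, concepts in zip(patient_words, patient_concepts)
]
```

Because the two parts are simply appended, the BOW columns are left untouched and each ontology adds its own block of columns.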

ALGORITHMS USED
Machine learning algorithms able to provide a native interpretation of their decisions:
● Support vector machine with a linear kernel (Chang and Lin (2011)).
● Logistic regression (McCullagh and Nelder (1989)).
● Random forests (Breiman (2001)).
Commonly used in risk prediction in the medical sector (Goldstein et al. (2017)):
● Logistic regression.
● Random forest.
Interpretation of an algorithm's decision:
● Reasons for hospitalizing a patient.
● Factors on which the GP can intervene to prevent this event.
The limited size of the dataset does not lend itself to neural network approaches.
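For the linear models listed here (logistic regression, linear-kernel SVM), one common way to read a decision is to rank features by weight times value. The sketch below assumes this reading; the function name, feature names and weights are made up for illustration and do not come from the study, and random forests would need a different mechanism.

```python
def top_contributions(weights, features, feature_names, k=3):
    """Return the k features with the largest absolute contribution (weight * value)."""
    contributions = [
        (name, weight * value)
        for name, weight, value in zip(feature_names, weights, features)
    ]
    contributions.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return contributions[:k]


# Hypothetical learned weights and one patient's feature vector.
feature_names = ["insuf", "cardiaque", "vaccin", "Organ failure"]
weights = [0.8, 0.5, -0.3, 1.2]
patient = [1, 1, 0, 1]

ranking = top_contributions(weights, patient, feature_names, k=2)
```

A positive contribution pushes the prediction toward "hospitalized", which is what lets a GP see which factors could be acted upon.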

DATASET - TEMPORAL DIMENSION
A patient's medical record can be spread over a long period of time (several years):
● Use of non-sequential algorithms.
● This involves aggregating all of a patient's consultations.
Consultations retained for each patient:
● “Hospitalized” => All consultations occurring before hospitalization / return from hospitalization.
● “Not Hospitalized” => All their consultations.
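The selection rule above can be sketched as follows. The function name and the record layout (a list of date and text pairs) are assumptions for illustration; the strict "before" comparison is also an assumption, and the "return from hospitalization" case is not modelled here.

```python
from datetime import date


def aggregate_consultations(consultations, hospitalization_date=None):
    """Concatenate the notes of the consultations kept for one patient.

    consultations: list of (date, text) pairs.
    hospitalization_date: None for a patient who was not hospitalized.
    """
    kept = [
        text
        for day, text in sorted(consultations)
        if hospitalization_date is None or day < hospitalization_date
    ]
    return " ".join(kept)


# Hypothetical consultations for one patient, in arbitrary order.
consultations = [
    (date(2015, 3, 2), "toux"),
    (date(2013, 1, 10), "vaccin"),
    (date(2016, 6, 1), "suivi"),
]

hospitalized_text = aggregate_consultations(consultations, date(2016, 1, 1))
not_hospitalized_text = aggregate_consultations(consultations)
```

The aggregated text is what a non-sequential model would then turn into a single vector per patient.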

2) Does the addition of domain knowledge improve the prediction of a patient's hospitalization?

ONTOLOGIES USED
Open knowledge bases used:
● The French chapter of DBpedia with the property dcterms:subject.
● Wikidata with knowledge about drugs:
○ Properties: ‘subject has role’ (wdt:P2868), ‘medical condition treated’ (wdt:P2175) and ‘significant drug interaction’ (wdt:P769).
○ Property-concept pairs (e.g., meprobamate treats (wdt:P2175) headache).
Vocabularies specific to the health sector, as OWL-SKOS representations:
● ATC (Anatomical, Therapeutic and Chemical):
○ e.g., ‘meprednisone’ (code H02AB15) has ‘Glucocorticoids, Systemic’ (code H02AB) as superclass.
○ Concepts linked by the properties rdfs:subClassOf and atc:member_of are considered superclasses.
● ICPC2 (International Classification of Primary Care):
○ e.g., ‘Symptom and complaints’ (H05) has ‘Ear’ (H) as superclass (rdfs:subClassOf).
● NDF-RT (National Drug File - Reference Terminology):
○ Properties: ‘may_prevent’, ‘CI_with’, ‘may_treat’.
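The ATC example above (H02AB15 under H02AB) can be mimicked without the ontology, because ATC codes encode their hierarchy in prefixes of 1, 3, 4, 5 and 7 characters. The sketch below uses prefix truncation as a stand-in for following rdfs:subClassOf links; the function name is hypothetical, and counting depth 1 as the direct superclass is an assumption.

```python
# Lengths of the code at each of the five ATC levels.
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)


def atc_superclasses(code, min_depth=1, max_depth=1):
    """Return the superclasses of an ATC code between two depths (1 = direct parent)."""
    ancestors = [code[:n] for n in ATC_PREFIX_LENGTHS if n < len(code)]
    ancestors.reverse()  # nearest superclass first
    return ancestors[min_depth - 1:max_depth]


direct_parent = atc_superclasses("H02AB15")           # depth 1 only
two_levels = atc_superclasses("H02AB15", 1, 2)        # depths 1 to 2
second_only = atc_superclasses("H02AB15", 2, 2)       # depth 2 only
```

Each returned code would become one column of the concept vector.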

EXAMPLE - NDF-RT, PATIENT UNDER TAHOR
@prefix : <http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# Atorvastatin is the active substance of Tahor.
:N0000022046 a owl:Class; rdfs:label "ATORVASTATIN";
    :UMLS_CUI "C0286651";
    rdfs:subClassOf [
        rdf:type owl:Restriction; owl:onProperty :may_prevent;
        owl:someValuesFrom :N0000000856 ];
    rdfs:subClassOf [
        rdf:type owl:Restriction; owl:onProperty :CI_with;
        owl:someValuesFrom :N0000010195 ];
    rdfs:subClassOf [
        rdf:type owl:Restriction; owl:onProperty :may_treat;
        owl:someValuesFrom :N0000001594 ].
# Labels of the concepts targeted by the three restrictions.
:N0000000856 rdfs:label "Coronary Artery Disease [Disease/Finding]".
:N0000010195 rdfs:label "Pregnancy [Disease/Finding]".
:N0000001594 rdfs:label "Hyperlipoproteinemias [Disease/Finding]".
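One way to turn the atorvastatin example into features is to emit one property-concept pair per restriction. The sketch below copies the three restrictions by hand into a dictionary; the dictionary, the function name and the "property:concept" naming are assumptions for illustration, and a real pipeline would read the restrictions from the ontology.

```python
# Restrictions transcribed by hand from the atorvastatin example (illustrative only).
NDF_RT = {
    "ATORVASTATIN": {
        "may_prevent": ["Coronary Artery Disease"],
        "CI_with": ["Pregnancy"],
        "may_treat": ["Hyperlipoproteinemias"],
    },
}


def drug_features(drug_label, properties):
    """Return 'property:concept' features for a drug, restricted to the given properties."""
    restrictions = NDF_RT.get(drug_label, {})
    return [
        f"{prop}:{concept}"
        for prop in properties
        for concept in restrictions.get(prop, [])
    ]


treat_only = drug_features("ATORVASTATIN", ["may_treat"])
all_three = drug_features("ATORVASTATIN", ["may_prevent", "may_treat", "CI_with"])
```

Selecting a single property at a time makes it possible to evaluate each kind of knowledge separately.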

ONTOLOGIES USED - MAPPING
[Figure: enrichment workflow used with DBpedia.]
Speciality       Concept
Oncology         Neoplasm stubs, Oncology, Radiation therapy
Cardiovascular   Cardiovascular disease, Cardiac arrhythmia
Neuropathy       Neurovascular disease
Immunopathy      Malignant hemopathy, Autoimmune disease
Endocrinopathy   Medical condition related to obesity
Genopathy        Genetic diseases and disorders
Intervention     Surgical removal procedures, Organ failure
Emergencies      Medical emergencies, Cardiac emergencies

EXAMPLE - DBPEDIA CONCEPTS
Patient 1:
● French: prédom à gche - insuf vnse ou insuf cardiaque - pas signe de phlébite - - ne veut pas mettre de bas de contention et ne veut pas aumenter le lasilix... -
● English (translation): predom[inates] on the left, venous or cardiac insuf[ficiency], no evidence of phlebitis, does not want to wear compression stockings and does not want to increase the lasix
● Concepts: Cardiovascular disease, Organ failure
Patient 2:
● French: procédure FIV - - transfert embryon samedi dernière - a fait hyperstimulation ovarienne; rupture de kyste - - asthénie, - - dleur abdo, doulleur à la palpation ++ - - voit gynéco la semaine prochaine pr controle betahcg, echo -
● English (translation): In vitro fertilization procedure, embryo transfer last Saturday, did ovarian hyperstimulation, cyst rupture, asthenia, abdominal [pain], [pain] on palpation ++, will see a gyneco[logist] next week [for] a beta HCG, echo check-up
● Concepts: Neoplasm stubs

INPUT ATTRIBUTES OF MACHINE LEARNING ALGORITHMS
● BOW generated from:
○ Reasons for the consultation and associated ICPC2 codes.
○ Diagnoses and associated ICPC2 and ICD10 codes.
○ Prescribed drugs and associated ATC codes with the reason for prescription.
○ Common health problems, symptoms, observations, care procedures.
○ The medical history, family history, patient's allergies, environmental factors, past health problems.
○ Prefix to distinguish source fields (e.g., distinction between family and personal history).
● Concept vector generated from:
○ The ATC drug code for Wikidata, NDF-RT and the ATC vocabulary.
○ The ICPC2 code of diagnoses and reasons for consultation for the ICPC2 vocabulary.
○ The text fields used to generate the BOW are used with DBpedia.
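The prefixing of source fields mentioned above can be sketched as follows. The field names, the "#" separator and the whitespace tokenization are hypothetical choices for illustration, not the ones used in the study.

```python
def prefixed_tokens(record):
    """Tag each token with the name of the field it comes from.

    record: dict mapping a field name to its free text.
    """
    tokens = []
    for field, text in record.items():
        for word in text.lower().split():
            tokens.append(f"{field}#{word}")
    return tokens


# Hypothetical record: the same word appears in two different fields.
record = {
    "history": "Appendicite diabète",
    "family_history": "diabète",
}

tokens = prefixed_tokens(record)
```

With the prefix, "diabète" in the personal history and "diabète" in the family history end up in two distinct BOW columns.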

3) Which domain knowledge, combined with which machine learning methods, provides the best prediction of a patient's hospitalization?

EXPERIMENTAL PROTOCOL
Validation of results by nested cross-validation:
● K = 10 for the outer loop.
● K = 3 for the inner loop; hyperparameters tuned by random search (150 iterations).
Metric to evaluate vector representations:
● The F-measure Ftp,fp (Forman and Scholz (2010)), suited to the cross-validation context:
F_{tp,fp} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}, with the counts summed over all folds.
TP: True Positives
FP: False Positives
FN: False Negatives
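A minimal sketch of the Ftp,fp computation: the TP, FP and FN counts of all test folds are pooled first, and a single F-measure is computed from the totals instead of averaging one F-score per fold. The function name and the fold counts below are made up for illustration.

```python
def f_tp_fp(fold_counts):
    """Compute F = 2*TP / (2*TP + FP + FN) from counts pooled over folds.

    fold_counts: list of (tp, fp, fn) tuples, one per test fold.
    """
    tp = sum(counts[0] for counts in fold_counts)
    fp = sum(counts[1] for counts in fold_counts)
    fn = sum(counts[2] for counts in fold_counts)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts for two outer folds.
folds = [(8, 2, 1), (7, 1, 3)]
score = f_tp_fp(folds)  # 2*15 / (2*15 + 3 + 4) = 30/37
```

Pooling avoids the undefined or unstable per-fold scores that can occur when a fold contains few positive examples.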

RESULTS & DISCUSSION - Ftp,fp metric
● baseline: BOW generated from medical records.
● +t: ICPC2 concepts.
● +s: DBpedia on all text fields.
● +s*: DBpedia focused on the patient’s own record.
● +c: ATC at different hierarchical depth levels.
● +wa: Wikidata, property ‘subject has role’.
● +wi: Wikidata, property ‘significant drug interaction’.
● +wm: Wikidata, property ‘medical condition treated’.
● +d: enrichment with NDF-RT:
○ prevent => ‘may_prevent’ property.
○ treat => ‘may_treat’ property.
○ CI => ‘CI_with’ property.
Features set   SVC      RF       Log
baseline       0.8270   0.8533   0.8491
+t             0.8239   0.8522   0.8545
+s             0.8221   0.8522   0.8485
+s*            0.8339   0.8449   0.8514
+c_1           0.8235   0.8433   0.8453
+c_1-2         0.8254   0.8480   0.8510
+c_2           0.8348   0.8522   0.8505
+wa            0.8223   0.8468   0.8545
+wi            0.8149   0.8484   0.8501
+wm            0.8221   0.8453   0.8458
+d_prevent     0.8254   0.8506   0.8479
+d_treat       0.8338   0.8472   0.8481
+d_CI          0.8281   0.8498   0.8460

RESULTS & DISCUSSION - CONVERGENCE CURVE
Features set       SVC      RF       Log
+t+s+c_2+wa+wi     0.8258   0.8486   0.8547
+t+s*+c_2+wa+wi    0.8239   0.8494   0.8543
+t+c_2+wa+wi       0.8140   0.8531   0.8571

LIMITS & FUTURE WORK
Vector representation:
● Coupling of semantic relationships and textual data.
Additional medical domain knowledge:
● Annotation / selection by general practitioners.
● Integration of the annotator of the French Bioportal (SIFR project).
Address the weaknesses of the approach using DBpedia:
● Better management of abbreviations / spelling mistakes.
● Use of more concepts.
● Detect the negation and the context of a medical concept.
● Identify complex medical expressions.

Thank you for your attention.
Any questions?
Contact me at:
raphael.gazzotti@inria.fr
