Injecting Domain Knowledge in Electronic Medical Records
to Improve Hospitalization Prediction
Gazzotti, R., Faron Zucker, C., Gandon, F.,
Lacroix-Hugues, V., & Darmon, D
ESWC2019 - 06.06.19
MOTIVATION
Problem of hospitalization:
● Rate of 191 hospitalizations per 1000 inhabitants (12.7 million patients).
● On average, a full hospitalization lasts:
○ 6.4 days in medicine, surgery, obstetrics.
○ 36.4 days in follow-up care and rehabilitation.
○ 44.5 days in home hospitalization.
○ 57.0 days in psychiatry.
Hospitalization is perceived as a failure by the physician.
Increase in the number of patients in a state of comorbidity (multiplicity of chronic diseases).
90%
MOTIVATION - SOFTWARE
RESEARCH QUESTIONS
Can ontological augmentations of the features improve the prediction of the occurrence of an event?
1) How to integrate domain knowledge into a vector representation used by a machine learning algorithm?
2) Does the addition of domain knowledge improve the prediction of a patient’s hospitalization?
3) Which domain knowledge, combined with which machine learning methods, provides the best prediction of a patient’s hospitalization?
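PRIMEGE (SINCE 2012) - DATASET
Balanced dataset.
Gender Consultation’s reason ICPC2 ... History Observations
H vaccin-antitétanique A44 ... Appendicite EN CP - Bon état général - auscult pulm libre; bdc rég sans souffle - tympans ok -
English (translation): male, tetanus vaccination, A44, ..., appendicitis, EN CP - good general condition - clear lung auscultation; regular heart sounds without murmur - eardrums ok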
1) How to integrate domain knowledge into a vector representation used by a machine learning algorithm?
TEXTUAL REPRESENTATION & INJECTION OF DOMAIN KNOWLEDGE
[Figure: a BOW vector and a concept vector are concatenated into the resulting vector]
Bag-of-words (BOW) representation:
● Main information extracted without requiring a large corpus.
● Attributes are not transformed.
● Integration of heterogeneous data with concatenation.
Injection of domain knowledge by concatenation:
● BOW obtained from text fields.
● Concept vector from ontologies.
[Figure: data annotated with concepts (1: Organ failure, 2: Cardiovascular disease) and the corresponding concept vector with entries for Cardiovascular disease, Organ failure, ...]
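The concatenation described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the concept list and the binary encoding of concepts are hypothetical choices, not the authors' implementation.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def concept_vector(concepts_found, concept_index):
    """Binary vector (assumption): 1 if the concept was linked to the record."""
    return [1 if concept in concepts_found else 0 for concept in concept_index]


# Hypothetical vocabulary and concept list, for illustration only.
vocabulary = ["insuf", "cardiaque", "phlébite", "lasilix"]
concept_index = ["Organ failure", "Cardiovascular disease", "Neoplasm stubs"]

text = "insuf vnse ou insuf cardiaque pas signe de phlébite"
concepts_found = {"Cardiovascular disease", "Organ failure"}

bow = bag_of_words(text, vocabulary)
concepts = concept_vector(concepts_found, concept_index)

# Injection of domain knowledge by concatenation of the two vectors.
resulting_vector = bow + concepts
print(resulting_vector)
```

Because the two parts are simply appended, the BOW attributes keep their positions and the concept attributes occupy the trailing positions of the resulting vector.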
ALGORITHMS USED
Machine learning algorithms able to provide a native interpretation of their decisions:
● Support vector machine with a linear kernel (Chang and Lin (2011)).
● Logistic regression (McCullagh and Nelder (1989)).
● Random forests (Breiman (2001)).
Commonly used in risk prediction in the medical sector (Goldstein et al. (2017)):
● Logistic regression.
● Random forests.
Interpretation of an algorithm's decision:
● Reasons for hospitalizing a patient.
● Factors on which the GP can intervene to prevent this event.
The limited size of the dataset does not lend itself to neural network approaches.
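To make "native interpretation" concrete, here is a minimal sketch of a logistic regression trained by plain gradient descent, whose learned weights can be read as per-feature evidence for or against hospitalization. The feature names, the toy data and the training settings are invented for illustration; the slides do not describe the authors' implementation at this level.

```python
import math


def train_logistic_regression(rows, labels, epochs=500, lr=0.1):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Toy data, invented for illustration: 1 = hospitalized, 0 = not hospitalized.
feature_names = ["Organ failure", "Cardiovascular disease", "vaccin"]
rows = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(rows, labels)

# Rank features by the magnitude of their weight; the sign gives the direction.
ranking = sorted(zip(feature_names, weights), key=lambda p: abs(p[1]), reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:+.3f}")
```

A positive weight pushes the prediction towards hospitalization and a negative one away from it, which is the kind of reading a general practitioner could act on.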
DATASET - TEMPORAL DIMENSION
A patient's medical record can be spread over a long period of time (several years):
● Use of non-sequential algorithms.
● This requires aggregating all of a patient's consultations.
Consultations retained for each patient:
● “Hospitalized” => All consultations occurring before hospitalization / return from hospitalization.
● “Not Hospitalized” => All their consultations.
[Figure: timeline of consultations up to the hospitalization]
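The aggregation rule for the temporal dimension might be sketched as follows. The record layout (`date`, `text`) and the sample consultations are hypothetical, and the cut-off is read here as "strictly before the hospitalization date", which is one possible interpretation of the slide.

```python
from datetime import date


def aggregate_consultations(consultations, hospitalization_date=None):
    """Merge a patient's consultation notes into one text.

    For a hospitalized patient, only consultations strictly before the
    hospitalization date are kept; otherwise all consultations are kept.
    """
    kept = [
        c for c in consultations
        if hospitalization_date is None or c["date"] < hospitalization_date
    ]
    kept.sort(key=lambda c: c["date"])
    return " ".join(c["text"] for c in kept)


# Hypothetical consultations, for illustration only.
consultations = [
    {"date": date(2015, 3, 2), "text": "vaccin-antitétanique"},
    {"date": date(2016, 7, 9), "text": "insuf cardiaque"},
    {"date": date(2017, 1, 20), "text": "retour hospitalisation"},
]

hospitalized = aggregate_consultations(consultations, date(2016, 12, 1))
not_hospitalized = aggregate_consultations(consultations)
print(hospitalized)
print(not_hospitalized)
```

Cutting at the hospitalization date keeps information that was only available afterwards out of the features used to predict the event.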
2) Does the addition of domain knowledge improve the prediction of a patient's hospitalization?
ONTOLOGIES USED
Open knowledge bases used:
● The French chapter of DBpedia with the property dcterms:subject.
● Wikidata with knowledge about drugs:
○ Properties: ‘subject has role’ (wdt:P2868), ‘medical condition treated’ (wdt:P2175) and ‘significant drug interaction’ (wdt:P769).
○ Property-concept pairs (e.g., meprobamate treats (wdt:P2175) headache).
Vocabularies specific to the health sector, as OWL-SKOS representations of:
● ATC (Anatomical, Therapeutic and Chemical):
○ e.g., ‘meprednisone’ (code H02AB15) has ‘Glucocorticoids, Systemic’ (code H02AB) as superclass.
○ Concepts linked by the properties rdfs:subClassOf and atc:member_of are considered superclasses.
● ICPC2 (International Classification of Primary Care):
○ e.g., ‘Symptom and complaints’ (H05) has ‘Ear’ (H) as superclass (rdfs:subClassOf).
● NDF-RT (National Drug File - Reference Terminology):
○ Properties: ‘may_prevent’, ‘CI_with’, ‘may_treat’.
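As an illustration of climbing the ATC hierarchy, the sketch below derives the superclasses of a code from its prefixes. It relies on the standard ATC level lengths (1, 3, 4, 5 and 7 characters) rather than on the rdfs:subClassOf and atc:member_of links of the OWL-SKOS representation, so it is a shortcut assumed here for brevity and not the method used by the authors; `max_steps` is a hypothetical parameter.

```python
# Character lengths of the five ATC levels, from anatomical group to substance.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_ancestors(code, max_steps=None):
    """Return the superclasses of an ATC code, nearest first."""
    shorter = [n for n in ATC_LEVEL_LENGTHS if n < len(code)]
    ancestors = [code[:n] for n in reversed(shorter)]
    return ancestors if max_steps is None else ancestors[:max_steps]


# 'meprednisone' (H02AB15) has 'Glucocorticoids, Systemic' (H02AB) as superclass.
print(atc_ancestors("H02AB15"))
print(atc_ancestors("H02AB15", max_steps=1))
```

Limiting the number of steps gives a simple way to experiment with different hierarchical depth levels when building the concept vector.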
EXAMPLE - NDF-RT, PATIENT UNDER TAHOR
# Tahor is a brand name of atorvastatin.
@prefix : <http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:N0000022046 a owl:Class ; rdfs:label "ATORVASTATIN" ;
    :UMLS_CUI "C0286651" ;
    # Each restriction links the drug to a disease or finding.
    rdfs:subClassOf [
        rdf:type owl:Restriction ; owl:onProperty :may_prevent ;
        owl:someValuesFrom :N0000000856 ] ;
    rdfs:subClassOf [
        rdf:type owl:Restriction ; owl:onProperty :CI_with ;
        owl:someValuesFrom :N0000010195 ] ;
    rdfs:subClassOf [
        rdf:type owl:Restriction ; owl:onProperty :may_treat ;
        owl:someValuesFrom :N0000001594 ] .

:N0000000856 rdfs:label "Coronary Artery Disease [Disease/Finding]" .
:N0000010195 rdfs:label "Pregnancy [Disease/Finding]" .
:N0000001594 rdfs:label "Hyperlipoproteinemias [Disease/Finding]" .
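One way to turn such restrictions into features is to emit one attribute per property-concept pair. The sketch below transcribes the three relations of the Turtle example by hand, using labels instead of IRIs; the `property=concept` naming and the `drug_features` helper are hypothetical, chosen only to illustrate the idea.

```python
# Relations transcribed from the Turtle example above (labels instead of IRIs).
NDF_RT_TRIPLES = [
    ("ATORVASTATIN", "may_prevent", "Coronary Artery Disease"),
    ("ATORVASTATIN", "CI_with", "Pregnancy"),
    ("ATORVASTATIN", "may_treat", "Hyperlipoproteinemias"),
]


def drug_features(drugs, triples, properties):
    """Return sorted 'property=concept' features for the prescribed drugs."""
    wanted = set(properties)
    prescribed = set(drugs)
    return sorted(
        f"{prop}={concept}"
        for drug, prop, concept in triples
        if drug in prescribed and prop in wanted
    )


# Keep only two of the three NDF-RT properties for this patient.
features = drug_features(["ATORVASTATIN"], NDF_RT_TRIPLES, ["may_treat", "CI_with"])
print(features)
```

Selecting the properties separately makes it possible to evaluate each of them as its own feature set.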
ONTOLOGIES USED - MAPPING
[Figure: enrichment workflow used with DBpedia]
Speciality        Concept
Oncology          Neoplasm stubs, Oncology, Radiation therapy
Cardiovascular    Cardiovascular disease, Cardiac arrhythmia
Neuropathy        Neurovascular disease
Immunopathy       Malignant hemopathy, Autoimmune disease
Endocrinopathy    Medical condition related to obesity
Genopathy         Genetic diseases and disorders
Intervention      Surgical removal procedures, Organ failure
Emergencies       Medical emergencies, Cardiac emergencies
EXAMPLE - DBPEDIA CONCEPTS
Patient 1
● French: prédom à gche - insuf vnse ou insuf cardiaque - pas signe de phlébite - - ne veut pas mettre de bas de contention et ne veut pas aumenter le lasilix... -
● English (translation): predom[inates] on the left, venous or cardiac insuf[ficiency], no evidence of phlebitis, does not want to wear compression stockings and does not want to increase the lasix
● Concepts: Cardiovascular disease, Organ failure
Patient 2
● French: procédure FIV - - transfert embryon samedi dernière - a fait hyperstimulation ovarienne; rupture de kyste - - asthénie, - - dleur abdo, doulleur à la palpation ++ - - voit gynéco la semaine prochaine pr controle betahcg, echo -
● English (translation): In vitro fertilization procedure, embryo transfer last Saturday, did ovarian hyperstimulation, cyst rupture, asthenia, abdominal [pain], [pain] on palpation ++, will see a gyneco[logist] next week [for] a beta HCG, echo check-up
● Concepts: Neoplasm stubs
INPUT ATTRIBUTES OF MACHINE LEARNING ALGORITHMS
● BOW generated from:
○ Reasons for the consultation and associated ICPC2 codes, diagnoses and associated ICPC2 and ICD10 codes, prescribed drugs and associated ATC codes with the reason for prescription, common health problems, symptoms, observations, care procedures.
○ The medical history, family history, patient's allergies, environmental factors, past health problems.
○ A prefix distinguishes source fields (e.g., to separate family history from personal history).
● Concept vector generated from:
○ The ATC drug code for Wikidata, NDF-RT and the ATC vocabulary.
○ The ICPC2 code of diagnoses and reasons for consultation for the ICPC2 vocabulary.
○ The text fields used to generate the BOW, for DBpedia.
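The field prefix mentioned on this slide could look like the sketch below, where every token is tagged with the name of the field it comes from. The field names and the `#` separator are hypothetical; the slides only state that a prefix is used.

```python
def prefixed_tokens(record):
    """Tag each token with its source field so identical words stay distinct."""
    tokens = []
    for field, text in record.items():
        tokens.extend(f"{field}#{word}" for word in text.lower().split())
    return tokens


# Hypothetical record: the same term appears in personal and family history.
record = {
    "history": "Appendicite diabète",
    "family_history": "diabète",
}

tokens = prefixed_tokens(record)
print(tokens)
```

With this tagging, a condition in the family history and the same condition in the patient's own history become two separate BOW attributes.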
3) Which domain knowledge, combined with which machine learning methods, provides the best prediction of a patient's hospitalization?
EXPERIMENTAL PROTOCOL
Validation of results by nested cross-validation:
● K = 10 for the outer loop.
● K = 3 for the inner loop, with hyperparameters selected by random search (150 iterations).
Metric to evaluate vector representations:
● The F-measure Ftp,fp (Forman and Scholz (2010)), suited to the cross-validation context:
$F_{tp,fp} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$, with the counts summed over the folds.
TP: True Positives
FP: False Positives
FN: False Negatives
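As a sketch of the metric, the function below computes Ftp,fp in the sense of Forman and Scholz (2010): true positives, false positives and false negatives are summed over all folds first, and a single F-measure is computed from the totals. The per-fold counts are invented for illustration and are not results from the study.

```python
def f_tp_fp(fold_counts):
    """F-measure computed from TP, FP and FN totals summed over all folds."""
    tp = sum(c["tp"] for c in fold_counts)
    fp = sum(c["fp"] for c in fold_counts)
    fn = sum(c["fn"] for c in fold_counts)
    denominator = 2 * tp + fp + fn
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical per-fold confusion counts, for illustration only.
folds = [
    {"tp": 8, "fp": 2, "fn": 1},
    {"tp": 7, "fp": 1, "fn": 3},
    {"tp": 9, "fp": 0, "fn": 2},
]

score = f_tp_fp(folds)
print(round(score, 4))
```

Summing the counts before computing the score avoids averaging per-fold F-measures, which can be undefined or unstable when a fold contains few positive cases.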
RESULTS & DISCUSSION - Ftp,fp metric
Feature sets:
● baseline: BOW generated from medical records.
● +t: ICPC2 concepts.
● +s: DBpedia on all text fields.
● +s*: DBpedia focused on the patient’s own record.
● +c: ATC at different hierarchical depth levels (+c1, +c1-2, +c2).
● +wa: Wikidata, property ‘subject has role’.
● +wi: Wikidata, property ‘significant drug interaction’.
● +wm: Wikidata, property ‘medical condition treated’.
● +d: enrichment with NDF-RT:
○ +d_prevent => ‘may_prevent’ property.
○ +d_treat => ‘may_treat’ property.
○ +d_CI => ‘CI_with’ property.
SVC: support vector machine with a linear kernel; RF: random forest; Log: logistic regression.
Features set    SVC     RF      Log
baseline        0.8270  0.8533  0.8491
+t              0.8239  0.8522  0.8545
+s              0.8221  0.8522  0.8485
+s*             0.8339  0.8449  0.8514
+c1             0.8235  0.8433  0.8453
+c1-2           0.8254  0.8480  0.8510
+c2             0.8348  0.8522  0.8505
+wa             0.8223  0.8468  0.8545
+wi             0.8149  0.8484  0.8501
+wm             0.8221  0.8453  0.8458
+d_prevent      0.8254  0.8506  0.8479
+d_treat        0.8338  0.8472  0.8481
+d_CI           0.8281  0.8498  0.8460
RESULTS & DISCUSSION - CONVERGENCE CURVE
[Figure: convergence curve]
Features set      SVC     RF      Log
+t+s+c2+wa+wi     0.8258  0.8486  0.8547
+t+s*+c2+wa+wi    0.8239  0.8494  0.8543
+t+c2+wa+wi       0.8140  0.8531  0.8571
LIMITS & FUTURE WORK
Vector representation:
● Coupling of semantic relationships and textual data.
Additional medical domain knowledge:
● Annotation / selection by general practitioners.
● Integration of the annotator of the French Bioportal (SIFR project).
Address the weaknesses of the approach using DBpedia:
● Better management of abbreviations / spelling mistakes.
● Use of more concepts.
● Detection of negation and of the context of a medical concept.
● Identification of complex medical expressions.
Thank you for your attention.
Any questions?
Contact me at:
raphael.gazzotti@inria.fr
