Refining Health Outcomes of
Interest using Formal Concept
Analysis and Semantic Query
Olivier Curé1, Henri Maurer2, Paea Le Pendu3, Nigam Shah3
1: CNRS LIGM lab, UPEM, France
2: Edinburgh University, UK
3: BMIR lab, Stanford University, USA
Applications need to select, extract, compare and
analyze groups of patients using Electronic
Health Records (EHRs).
This requires defining Health Outcomes of
Interest (HOIs), e.g. myocardial infarction,
chronic obstructive pulmonary disease.
With clinical text, these definitions should capture
variations of terms and ensure good precision
and recall of the text-mining process.
Problem setting (2)
It is not practical to define these HOIs
precisely with concept identifiers, e.g. UMLS
We provide a solution that produces and
refines HOI definitions from terms provided by
Our solution aims to propose sound and
complete definitions in a best-effort way.
Improve search results by expanding queries
with the transitive closure of the subsumption
relationship of ontology concepts.
Queries can be generalized (resp. specialized)
via expansions with ancestors (resp.
Ex: expanding a query with 'neoplasm' or
'tumor' when searching for 'cancer'.
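As a minimal sketch of this kind of expansion, the snippet below walks a toy is-a hierarchy to collect the transitive ancestors or descendants of a query term. The hierarchy, the function names and the `generalize` flag are illustrative assumptions, not part of the system described in these slides.

```python
from collections import deque

# Hypothetical toy is-a hierarchy: child -> direct parents.
PARENTS = {
    "carcinoma": {"cancer"},
    "melanoma": {"cancer"},
    "cancer": {"neoplasm"},
    "neoplasm": {"disease"},
}

def ancestors(concept, parents=PARENTS):
    """Return all transitive ancestors of `concept` (generalization)."""
    seen, queue = set(), deque([concept])
    while queue:
        for parent in parents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def descendants(concept, parents=PARENTS):
    """Return all transitive descendants of `concept` (specialization)."""
    # Invert the child -> parents map, then reuse the same traversal.
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)
    return ancestors(concept, children)

def expand_query(term, generalize=False):
    """Expand a term with its descendants, or its ancestors if generalizing."""
    related = ancestors(term) if generalize else descendants(term)
    return {term} | related
```

With this toy data, `expand_query("cancer", generalize=True)` adds "neoplasm" and "disease" to the query, while the default call adds the two narrower terms.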
Abstract conceptual descriptions from a set of
objects described by some attributes.
Used in machine learning and knowledge
A formal context is a triple (G, M, I): a set
of objects G, a set of attributes M, and a binary
relation I between G and M.
A formal context can be represented as a
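To make the triple (G, M, I) concrete, here is a small sketch with made-up objects and attributes. It renders the context as a cross table (one common representation in FCA, assumed here) and implements the two derivation operators. All names and data are hypothetical.

```python
# Hypothetical toy context: objects G, attributes M, incidence relation I.
G = ["o1", "o2", "o3"]
M = ["a", "b", "c"]
I = {("o1", "a"), ("o1", "b"), ("o2", "b"), ("o2", "c"), ("o3", "b")}

def common_attributes(objects):
    """Attributes shared by every object in `objects` (derivation A -> A')."""
    return {m for m in M if all((g, m) in I for g in objects)}

def common_objects(attributes):
    """Objects having every attribute in `attributes` (derivation B -> B')."""
    return {g for g in G if all((g, m) in I for m in attributes)}

def cross_table():
    """Render the context as a cross table: one row per object."""
    header = "   " + " ".join(M)
    rows = [
        f"{g} " + " ".join("x" if (g, m) in I else "." for m in M)
        for g in G
    ]
    return "\n".join([header] + rows)
```

A pair (A, B) is a formal concept exactly when `common_attributes(A) == B` and `common_objects(B) == A`.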
SQE: Relational database approach
We are using the ontologies stored in Stanford's
DB and its materialization of concept subsumption
(almost 14 million entries).
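The benefit of a materialized subsumption table is that expansion becomes a single indexed lookup rather than a recursive traversal. The sketch below illustrates the idea with an in-memory SQLite database; the table name, column names and concept identifiers are assumptions made for illustration and do not reflect the actual Stanford schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (descendant, ancestor) pair of the closure.
conn.execute("CREATE TABLE is_a_closure (descendant INTEGER, ancestor INTEGER)")
conn.executemany(
    "INSERT INTO is_a_closure VALUES (?, ?)",
    [(2, 1), (3, 1), (3, 2), (4, 1)],  # made-up concept identifiers
)
conn.execute("CREATE INDEX idx_ancestor ON is_a_closure (ancestor)")

def specializations(concept_id):
    """All descendants of `concept_id`, read with a single indexed lookup."""
    rows = conn.execute(
        "SELECT descendant FROM is_a_closure WHERE ancestor = ? ORDER BY descendant",
        (concept_id,),
    )
    return [row[0] for row in rows]
```

Because the closure is precomputed, the pair (3, 1) is stored explicitly even though concept 3 reaches concept 1 only through concept 2.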
FCA: objects and attributes of the formal
context are concept identifiers (UMLS concept
To improve relevance, an FCA-based pruning
approach is designed to identify potential
concepts among the discovered ones.
The formal context is composed of matching
concepts as objects and candidate concepts
Thus the binary relation corresponds to the
Ex: 10365: “hyperlipoproteinemia type iv” and 740154 : “disease, disorder or finding”
Standard FCA algorithms are used to define the FCA lattice.
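For intuition, the sketch below enumerates the formal concepts of a tiny context by closing every subset of objects. This naive enumeration is exponential and is not one of the standard algorithms (such as NextClosure) that a real system would use. The identifiers and the incidence data are made up, with matching concepts as objects and candidate concepts as attributes.

```python
from itertools import combinations

# Hypothetical incidence: matching concept (object) -> candidate concepts
# (attributes) related to it. Identifiers are made up.
CONTEXT = {
    "m1": {"d1", "d2"},
    "m2": {"d1", "d3"},
    "m3": {"d1", "d2", "d3"},
}
ATTRIBUTES = set().union(*CONTEXT.values())

def intent(objects):
    """Attributes shared by all `objects` (all attributes if empty)."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return shared

def extent(attributes):
    """Objects that carry every attribute in `attributes`."""
    return {g for g, attrs in CONTEXT.items() if attributes <= attrs}

def formal_concepts():
    """Naive enumeration: close every subset of objects, keep unique pairs."""
    concepts = set()
    for size in range(len(CONTEXT) + 1):
        for subset in combinations(sorted(CONTEXT), size):
            b = intent(subset)
            a = extent(b)
            concepts.add((frozenset(a), frozenset(b)))
    # Order from the top of the lattice (largest extent) downwards.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0])))
```

On this toy context the function returns four concepts, from the top concept (all three objects, sharing only "d1") down to the bottom concept (only "m3", carrying all three attributes).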
Qualifying a discovered concept is performed
using a top-down navigation of the FCA
For each formal concept <Ai,Bi>, we compute
the transitive closure of sub concepts of Ai
(resp. Bi), denoted LAi (resp. LBi).
If |LAi ∩ LBi| / |LBi| ≥ Θ, with Θ a predefined
pruning threshold, then Bi is a potential concept
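The pruning test can be sketched as a set-overlap ratio. The version below assumes the ratio measures the share of LBi that is also present in LAi, which is one reading of the rule, not a confirmed one. The function names, the handling of an empty LBi and the sample identifiers are illustrative.

```python
def coverage(l_a, l_b):
    """Fraction of L_Bi that is also in L_Ai; 0.0 when L_Bi is empty."""
    if not l_b:
        return 0.0
    return len(l_a & l_b) / len(l_b)

def is_potential(l_a, l_b, theta=0.75):
    """Apply the pruning test: keep B_i when the coverage reaches theta."""
    return coverage(l_a, l_b) >= theta

# Made-up closures of sub concepts, as sets of concept identifiers.
L_A = {1, 2, 3, 4}
L_B_kept = {2, 3, 4, 5}   # 3 of 4 covered
L_B_pruned = {2, 5}       # 1 of 2 covered
```

Here `is_potential(L_A, L_B_kept)` passes the Θ = 0.75 threshold while `is_potential(L_A, L_B_pruned)` does not.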
Legend: M = matching, D = discovered, P = potential, C = other concept
A search for hypercholesterolemia over 18 ontologies yields:
20 matching concepts (i.e., FCA objects)
102 discovered concepts (i.e., FCA attributes)
Generates an FCA lattice with 67 formal concepts
First formal concept satisfying a Θ=.75 pruning threshold is
at the 4th level of the lattice: only 4 concepts out of 16 LBi
are covered by LAi .
These 4 concepts have the following preferred labels:
“hypercholesterolemia”, “cholesterolosis”, “secondary
hypercholesterolemia” and “hyperlipidemia”.
We include interactions with the end user to validate our
Hence the domain expert has the final decision on
acceptance/rejection of a proposition.
Important issue: trade-off between user interactions and
precision/recall of results.
End-user can validate whenever she wants.
Interactions are performed in a web interface providing
additional information on the search (clinical text snippets,
number of patients).
i2b2 obesity NLP reference set used as an
evaluation data set
The gold standard is the set of results of a previous
experiment conducted at Stanford.
Evaluation in terms of specificity, sensitivity
and duration of computation (on commodity
An improvement of 2 and 3 % on resp. sensitivity and
Computation duration in terms of seconds on a
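For reference, sensitivity and specificity can be computed from sets of patient identifiers as sketched below. The identifiers are invented for illustration and are not results from this evaluation. The function names are hypothetical.

```python
def confusion_counts(predicted, gold, population):
    """Return (tp, fp, fn, tn) for sets of patient identifiers."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(population - predicted - gold)
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Invented example: 10 patients, 4 in the gold standard, 4 retrieved.
population = set(range(10))
gold = {0, 1, 2, 3}
predicted = {0, 1, 2, 4}
tp, fp, fn, tn = confusion_counts(predicted, gold, population)
```

With these invented sets, three of the four gold-standard patients are retrieved and one patient outside the gold standard is retrieved by mistake.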
More interesting is that some of our false negatives seem to be relevant to the
Some of these false negatives come from the matching and also the potential (i.e.
FCA-based) approaches:
Matching example :
“sitosterolemia” for hypercholesterolemia
“h/o: raised blood, familial hyperlipoproteinemia”, “fh: raised blood lipids” for
hypercholesterolemia, while the gold standard contains concepts such as
“hyperlipoproteinemia type ii”) concepts which confirms the relevance of using a
Note that among our true positives, depending on the use case, a significant
number of items have been retrieved from the potential concept set, i.e., using
our FCA statistical approach.
We have proposed a semi-automatic solution
for defining HOIs.
Approach uses SQE and FCA enriched with a
Our results are comparable to state of the art
It refines HOIs definitions efficiently with
Conduct user-driven evaluations with
clinicians and researchers.
Analyze acceptance/rejection of end-users in
Use active learning over past query
refinements to improve future queries.
Study our method's impact on mining EHR
clinical notes and on cohort-building tools.