Swat4 ls fca_slides


Published in: Education, Technology

  1. Refining Health Outcomes of Interest using Formal Concept Analysis and Semantic Query Expansion
     Olivier Curé (1), Henri Maurer (2), Paea Le Pendu (3), Nigam Shah (3)
     (1) CNRS LIGM lab, UPEM, France; (2) Edinburgh University, UK; (3) BMIR lab, Stanford University, USA
  2. Problem setting
     ● Applications need to select, extract, compare and analyze groups of patients using Electronic Health Records (EHRs).
     ● This requires defining Health Outcomes of Interest (HOIs), e.g. myocardial infarction, chronic obstructive pulmonary disease.
     ● With clinical text, these definitions should capture variations of terms and ensure good precision and recall of the text-mining process.
  3. Problem setting (2)
     ● It is not practical to define these HOIs precisely with concept identifiers, e.g. UMLS CUIs.
     ● We provide a solution that produces and refines HOI definitions from terms provided by the end-user.
     ● Our solution aims to propose sound and complete definitions in a best-effort way.
  4. Approach overview
     [Figure: pipeline diagram. Recoverable labels: Terminology3 DB, Bioportal, terms, concepts, Knowledge, Semantic Query Expansion, Formal Concept Analysis, Statistics Based Pruning, Diseases, Procedures, Devices, Drugs.]
  5. SQE
     ● Improve search results by expanding queries with the transitive closure of the subsumption relationship of ontology concepts.
     ● Queries can be generalized (resp. specialized) via expansions with ancestors (resp. descendants).
     ● Ex: expanding a query with 'neoplasm' or 'tumor' when searching for 'cancer'.
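The expansion described on the SQE slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PARENTS` hierarchy, its labels, and the function names `ancestors`, `descendants` and `expand_query` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: concept -> direct parents (is-a edges).
PARENTS = {
    "lung cancer": {"cancer"},
    "cancer": {"neoplasm"},
    "neoplasm": {"disease"},
    "disease": set(),
}


def ancestors(concept, parents=PARENTS):
    """All concepts reachable by following is-a edges upward."""
    seen, stack = set(), [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def descendants(concept, parents=PARENTS):
    """All concepts reachable by following is-a edges downward."""
    children = defaultdict(set)
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].add(child)
    seen, stack = set(), [concept]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def expand_query(term, generalize=False):
    """Specialize with descendants by default, generalize with ancestors."""
    extra = ancestors(term) if generalize else descendants(term)
    return {term} | extra
```

With this toy hierarchy, `sorted(expand_query("cancer", generalize=True))` gives `['cancer', 'disease', 'neoplasm']`, while the default call adds only the narrower term `'lung cancer'`.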
  6. FCA
     ● Abstracts conceptual descriptions from a set of objects described by some attributes.
     ● Used in machine learning and knowledge management.
     ● A formal context is a triple (G, M, I): a set of objects G, a set of attributes M, and a binary relation I between G and M.
     ● A formal context can be represented as a matrix.
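To make the matrix view concrete, here is a small sketch of a formal context stored as a binary matrix, with the two standard FCA derivation operators. The objects, attributes and matrix entries are hypothetical, and the function names are chosen for illustration only.

```python
# Hypothetical formal context (G, M, I) stored as a binary matrix:
# I[i][j] == 1 means object G[i] has attribute M[j].
G = ["g1", "g2", "g3"]
M = ["a", "b", "c"]
I = [
    [1, 1, 0],  # g1 has a, b
    [1, 0, 1],  # g2 has a, c
    [1, 1, 1],  # g3 has a, b, c
]


def common_attributes(objects):
    """Attributes shared by every object in `objects`."""
    rows = [I[G.index(g)] for g in objects]
    return {m for j, m in enumerate(M) if all(row[j] for row in rows)}


def common_objects(attributes):
    """Objects that have every attribute in `attributes`."""
    cols = [M.index(m) for m in attributes]
    return {g for i, g in enumerate(G) if all(I[i][j] for j in cols)}
```

A pair (A, B) is a formal concept when `common_attributes(A) == B` and `common_objects(B) == A`; in this toy context, `({"g1", "g3"}, {"a", "b"})` is one such pair.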
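  7. FCA (2)
     [Figure: concept lattice. Nodes, written as extent-intent pairs, from top (⊤) to bottom (⊥):]
     ● ⊤: {1,2,3,4,5,6}-{F1,F2}
     ● {1,2,3}-{CF1,F1,F2}
     ● {3,6}-{MF2,F1,F2}
     ● {4,5,6}-{BLF1,F1,F2}
     ● {1,2}-{CF1,F1,CF2,F2}
     ● {3}-{CF1,F1,MF2,F2}
     ● {6}-{BLF1,F1,MF2,F2}
     ● {4,5}-{BLF1,F1,BLF2,F2}
     ● ⊥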
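The nodes on the FCA (2) slide can be reproduced from a small context. The underlying table is not shown in the slides, so the `CONTEXT` below is an assumption, inferred by reading each attribute's extent off the listed nodes; ⊥ is interpreted here as the concept with an empty extent. The enumeration is deliberately naive and is not one of the standard FCA algorithms the deck mentions.

```python
from itertools import combinations

# Assumed context (object -> attributes), inferred from the lattice nodes.
CONTEXT = {
    1: {"F1", "F2", "CF1", "CF2"},
    2: {"F1", "F2", "CF1", "CF2"},
    3: {"F1", "F2", "CF1", "MF2"},
    4: {"F1", "F2", "BLF1", "BLF2"},
    5: {"F1", "F2", "BLF1", "BLF2"},
    6: {"F1", "F2", "BLF1", "MF2"},
}
ALL_ATTRS = set().union(*CONTEXT.values())


def intent(objects):
    """Attributes shared by every object in `objects`."""
    shared = set(ALL_ATTRS)
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def extent(attrs):
    """Objects that carry every attribute in `attrs`."""
    return {g for g, has in CONTEXT.items() if attrs <= has}


def formal_concepts():
    """Naive enumeration: close every subset of objects (fine for 6 objects)."""
    found = set()
    for size in range(len(CONTEXT) + 1):
        for subset in combinations(CONTEXT, size):
            b = intent(subset)
            found.add((frozenset(extent(b)), frozenset(b)))
    return found
```

Under this assumed context the enumeration yields nine formal concepts: the eight extent-intent pairs listed on the slide plus the bottom concept with an empty extent and all seven attributes.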
  8. Method
     ● SQE: relational database approach
       – We use the ontologies stored in Stanford's DB and its materialization of concept subsumption (almost 14 million entries).
     ● FCA: objects and attributes of the formal context are UMLS concept identifiers.
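When the subsumption closure is materialized in a relational table, expansion reduces to a single lookup with no recursion. The sketch below uses an in-memory SQLite database; the `subsumption(ancestor, descendant)` schema and the identifiers are hypothetical and do not describe the actual Stanford database.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table holding a materialized closure."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE subsumption (ancestor TEXT, descendant TEXT)")
    conn.executemany(
        "INSERT INTO subsumption VALUES (?, ?)",
        # Hypothetical ids; (C_A, C_C) is stored even though it is implied
        # by the other two rows, which is what materialization means here.
        [("C_A", "C_B"), ("C_B", "C_C"), ("C_A", "C_C")],
    )
    return conn


def expand(conn, concept_ids):
    """Return the given concepts plus all their stored descendants."""
    ids = list(concept_ids)
    if not ids:
        return set()
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT DISTINCT descendant FROM subsumption "
        f"WHERE ancestor IN ({placeholders})",
        ids,
    )
    return set(ids) | {row[0] for row in rows}
```

The trade-off is storage for speed: the table grows with the size of the closure, but each expansion is one indexed query rather than a graph traversal.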
  9. Method (3)
     ● To improve relevance, an FCA-based pruning approach identifies potential concepts among the discovered ones.
     ● The formal context is composed of matching concepts as objects and candidate concepts as attributes.
     ● Thus the binary relation corresponds to the subsumption relationship.
  10. Method (4)
     ● Ex: 10365: “hyperlipoproteinemia type iv” and 740154: “disease, disorder or finding”.
     ● Standard FCA algorithms are used to define the FCA lattice.
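A formal context whose binary relation is subsumption can be assembled from (descendant, ancestor) pairs. In this sketch the first pair reuses the two identifiers from the slide, and reading it as "740154 subsumes 10365" is an assumption; the identifiers `X1` and `M2` and the function name `build_context` are hypothetical.

```python
# (descendant, ancestor) subsumption pairs.
# Assumption: the slide's example means 740154 subsumes 10365.
# "X1" and "M2" are hypothetical identifiers added for illustration.
SUBSUMED_BY = [
    ("10365", "740154"),
    ("10365", "X1"),
    ("M2", "740154"),
]


def build_context(matching, pairs):
    """Map each matching concept (object) to the candidate concepts
    (attributes) that subsume it."""
    context = {g: set() for g in matching}
    for descendant, ancestor in pairs:
        if descendant in context:
            context[descendant].add(ancestor)
    return context
```

The resulting dictionary is an object-to-attributes view of the context and can be fed to any lattice construction routine.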
  11. Method (5)
     ● Qualifying a discovered concept is performed using a top-down navigation of the FCA lattice.
     ● For each formal concept $\langle A_i, B_i \rangle$, we compute the transitive closure of sub concepts of $A_i$ (resp. $B_i$), denoted $L_{A_i}$ (resp. $L_{B_i}$).
     ● If $\frac{|L_{B_i} \cap L_{A_i}|}{|L_{B_i}|} \geq \Theta$, with $\Theta$ a predefined pruning threshold, then $B_i$ is a potential concept.
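The pruning test on the Method (5) slide is a simple coverage ratio. A minimal sketch follows, assuming the two closures are already available as sets; the function names and the sample sets in the usage note are hypothetical.

```python
def coverage(l_a, l_b):
    """Fraction of L_B that is also in L_A: |L_B ∩ L_A| / |L_B|."""
    l_a, l_b = set(l_a), set(l_b)
    if not l_b:
        return 0.0  # convention chosen here for an empty L_B
    return len(l_b & l_a) / len(l_b)


def is_potential(l_a, l_b, theta):
    """True when the coverage reaches the pruning threshold theta."""
    return coverage(l_a, l_b) >= theta
```

For example, with `l_a = {"c1", "c2", "c3", "c9"}` and `l_b = {"c1", "c2", "c3", "c4"}`, three of the four concepts in `l_b` are covered, so the test passes for a threshold of 0.75 and fails for 0.8.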
  12. Method (6)
     ● Concept sets:
       – M: matching
       – D: discovered
       – P: potential
       – C: other concepts
  13. Example
     ● A search for “hypercholesterolemia” over 18 ontologies provides:
       – 20 matching concepts (i.e., FCA objects)
       – 102 discovered concepts (i.e., FCA attributes)
     ● This generates an FCA lattice with 67 formal concepts.
     ● The first formal concept satisfying a $\Theta = 0.75$ pruning threshold is at the 4th level of the lattice: only 4 concepts out of the 16 in $L_{B_i}$ are covered by $L_{A_i}$.
     ● These 4 concepts have the following preferred labels: “hypercholesterolemia”, “cholesterolosis”, “secondary hypercholesterolemia” and “hyperlipidemia”.
  14. Method (7)
     ● We include interactions with the end-user to validate our potential discoveries.
     ● Hence the domain expert has the final decision on acceptance or rejection of a proposition.
     ● Important issue: the trade-off between the number of user interactions and the precision/recall of results.
     ● The end-user can validate whenever she wants.
     ● Interactions are performed in a web interface providing additional information on the search (clinical text snippets, number of patients).
  15. Evaluation
     ● The i2b2 obesity NLP reference set is used as the evaluation data set.
     ● The gold standard is the set of results of a previous experiment conducted at Stanford.
     ● Evaluation is in terms of specificity, sensitivity and duration of computation (on commodity hardware).
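For reference, sensitivity and specificity can be computed from a predicted set and a gold-standard set over a known universe of items. This is a generic sketch of the two metrics, not the authors' evaluation code; the function names and the numbers in the usage note are hypothetical.

```python
def confusion_counts(predicted, gold, universe):
    """Return (tp, fp, fn, tn) for set-valued predictions."""
    predicted, gold, universe = set(predicted), set(gold), set(universe)
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(universe - predicted - gold)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """True positive rate: tp / (tp + fn)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """True negative rate: tn / (tn + fp)."""
    return tn / (tn + fp) if (tn + fp) else 0.0
```

With a universe of ten items `0..9`, a gold set `{0, 1, 2, 3}` and a predicted set `{1, 2, 3, 4}`, the counts are 3 true positives, 1 false positive, 1 false negative and 5 true negatives, giving a sensitivity of 0.75 and a specificity of 5/6.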
  16. Evaluation (2)
     ● An improvement of 2% in sensitivity and 3% in specificity.
     ● Computation takes on the order of seconds on a standard laptop.
  17. Evaluation (3)
     ● More interestingly, some of our false negatives seem to be relevant to the search.
     ● Some of these false negatives come from the matching approach and others from the potential (i.e., FCA-based) approach:
       – Matching example: “sitosterolemia” for hypercholesterolemia.
       – Potential examples: “h/o: raised blood, familial hyperlipoproteinemia” and “fh: raised blood lipids” for hypercholesterolemia, while the gold standard contains concepts such as “hyperlipoproteinemia type ii”. This confirms the relevance of using a semantic approach.
     ● Note that among our true positives, depending on the use case, a significant number of items were retrieved from the potential concept set, i.e., using our FCA statistical approach.
  18. Conclusion
     ● We have proposed a semi-automatic solution for defining HOIs.
     ● The approach uses SQE and FCA enriched with a statistical approach.
     ● Our results are comparable to state-of-the-art methods.
     ● It refines HOI definitions efficiently with relevant terms and concepts.
  19. Future work
     ● Conduct user-driven evaluations with clinicians and researchers.
     ● Analyze acceptance/rejection decisions of end-users in practical scenarios.
     ● Use active learning over past query refinements to improve future queries.
     ● Study our method's impact on mining EHR clinical notes and on cohort building tools.
  20. Thanks. Questions? ocure@univ-mlv.fr