  1. Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs and Symptoms in Electronic Health Record Text Notes
     Roy J. Byrd2, Steven R. Steinhubl1, Jimeng Sun2, Shahram Ebadollahi2, Zahra Daar1, Walter F. Stewart1
     1 Geisinger Medical Center, Center for Health Research, Danville, PA
     2 IBM, T.J. Watson Research Center, Hawthorne, NY
  2. Outline
     • Background and objectives
     • Datasets
     • Tools & Methods
     • Results
     • Discussion
       – Challenges
       – Opportunities
     • Summary
     • (Iterative annotation refinement)
  3. Background and Objectives
     • Background
       – Framingham criteria for HF published in 1971
       – Geisinger/IBM “PredMED” project on predictive modeling for early detection of HF, using longitudinal EHRs
     • Overall project objective: better understand the presentation of HF in the primary care setting, in order to facilitate its more rapid identification and treatment
     • Objective of this paper: build and validate NLP extractors for Framingham criteria (signs and symptoms) from EHR clinical notes, so that they may be suitable for downstream diagnostic applications
  4. Framingham HF Diagnostic Criteria
     MAJOR SYMPTOMS
     1. Paroxysmal Nocturnal Dyspnea (PND) or Orthopnea
     2. Neck Vein Distension (JVD)
     3. Rales
     4. Radiographic Cardiomegaly
     5. Acute Pulmonary Edema
     6. S3 Gallop
     7. Increased Central Venous Pressure (> 16 cm H2O at RA)
     8. Circulation Time of 25 seconds**
     9. Hepatojugular Reflux (HJR)
     10. Weight loss 4.5 kg in 5 days in response to treatment
     MINOR SYMPTOMS
     1. Bilateral Ankle Edema
     2. Nocturnal Cough
     3. Dyspnea on ordinary exertion
     4. Hepatomegaly
     5. Pleural effusion
     6. A decrease in vital capacity by 1/3 of the maximal value recorded**
     7. Tachycardia (> 120 BPM)
     ** Not extracted, since these criteria are not documented in routine clinical practice.
     N Engl J Med. 1971;285:1441-1446.
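The deck extracts the criteria but does not restate the diagnostic rule itself; as context, the standard rule from the 1971 paper requires at least two major criteria, or one major plus two minor. A minimal sketch of that rule over extracted criteria, using illustrative short names (not PredMED's actual label set):

```python
# Sketch of the Framingham diagnostic rule, assuming the standard threshold
# from the 1971 paper: >= 2 major criteria, or 1 major + 2 minor.
# Criterion names below are illustrative, not the deck's actual labels.
MAJOR = {"PND", "JVD", "Rales", "Cardiomegaly", "APEdema",
         "S3Gallop", "CVP", "HJR", "WeightLoss"}
MINOR = {"AnkleEdema", "NocturnalCough", "DOExertion",
         "Hepatomegaly", "PleuralEffusion", "Tachycardia"}

def framingham_positive(found):
    """True if a set of extracted criteria satisfies the Framingham rule."""
    major = len(found & MAJOR)
    minor = len(found & MINOR)
    return major >= 2 or (major >= 1 and minor >= 2)
```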
  5. (Sample downstream analysis)
     [Bar chart: “Reports of Framingham HF criteria in the year prior to diagnosis.” Y-axis: percent of patients with documented criteria; bars compare cases (N=4,644) against controls (N=45,981) for PND, Rales, JVD, pulmonary edema, cardiomegaly, ankle edema, and DOE.]
  6. Datasets
     • Clinical notes from longitudinal (2001-2010) EHR encounters for
       – 6,355 case patients
         • Meet operational criteria for HF**
       – 26,052 control patients
         • Clinic-, gender- and age-matched to cases
       – The case-control distinction is exploited in downstream applications; it’s not relevant for criteria extraction.
     • Development dataset
       – 65 encounter notes
         • Selected for density of Framingham criteria
         • Annotated by a clinical expert
     • Validation dataset
       – 400 encounter notes (200 cases & 200 controls)
         • Randomly selected
         • Annotated by consensus of 4 trained coders
         • N = 1,492 criteria
     ** Operational HF criteria: HF diagnosis on problem list; HF diagnosis in EHR for two outpatient encounters; two or more medications with ICD-9 code for HF; or one HF diagnosis and one medication with ICD-9 code for HF.
  7. Tools
     • LRW1 – LanguageWare Resource Workbench
       – Basic text processing
       – Dictionaries and grammars for recognizing criteria candidates
       – Text analysis engines for applying constraints and annotating criteria
     • UIMA2 – Unstructured Information Management Architecture
       – Execution pipeline, including I/O management
       – Text analysis engines
     • TextSTAT3 – Simple Text Analysis Tool
       – Concordance program, used for linguistic analysis
     [Diagram: a UIMA collection processing engine runs encounter documents through basic processing (paragraphs, sentences, tokenization, etc.), dictionary and grammar look-up, and text analysis engines, producing extracted criteria.]
     1 http://www.alphaworks.ibm.com/tech/lrw
     2 http://uima.apache.org
     3 http://neon.niederlandistik.fu-berlin.de/en/textstat
  8. Criteria Extraction Methods: Dictionaries
     • Framingham criteria vocabulary
       – Words and phrases used to mention the 15 Framingham criteria
         • edema, leg edema, oedema; shortness of breath, SOB
       – Size: ~75 “lemma forms” (main entries) and hundreds of variant forms
     • Segment header words
       – Patient History, Examination, Plan, Instruction
     • Negating words
       – Used to deny criteria
         • no, free of, ruled out
     • Counterfactual triggers
       – The criteria may not have occurred
         • if, should, as needed for
     • Miscellaneous classes
       – Weight loss phrases: lose weight, diurese
       – Time value words and phrases: day, week, month
       – Weight units: pound, kilogram
       – Diuretics: Bumex, Furosemide
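A minimal sketch of how such a dictionary might drive candidate look-up. The entries and the naive substring matching below are illustrative only; the real LanguageWare resources hold ~75 lemmas with hundreds of variants and match over tokens:

```python
# Hypothetical miniature of the criteria vocabulary: each criterion maps to a
# few surface variants; separate lists hold negation and counterfactual triggers.
CRITERIA = {
    "APEdema":    ["edema", "leg edema", "oedema"],
    "DOExertion": ["shortness of breath", "sob"],
}
NEGATORS = ["no", "free of", "ruled out"]
COUNTERFACTUAL_TRIGGERS = ["if", "should", "as needed for"]

def find_candidates(text):
    """Return (criterion, matched variant) pairs found in the text."""
    lowered = text.lower()
    return [(crit, var)
            for crit, variants in CRITERIA.items()
            for var in variants if var in lowered]
```

In practice substring matching over-fires (e.g. "sob" inside another word), which is one reason the pipeline tokenizes first and then filters candidates with text analysis engines.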
  9. Criteria Extraction Methods: Grammars
     • Shallow English syntax
       – Noun Phrases
         • some moderate DOE
       – Compound Noun Phrases
         • chest pain, DOE, or night cough
       – Prepositional Phrases
     • No full-sentential parses
       – Not needed for simple HF criteria
       – Unreliable sentence boundaries and syntax in clinical notes
     • Negated Scope
       – regular rate and rhythm without murmurs, clicks, gallops, or rubs
     • Counterfactual Scope
       – Patient should call if she experiences shortness of breath
     • Weight Loss
       – 20 pound weight loss in a week with diuretics
     • Tachycardia
       – tachy at 120 (to 130)
       – HR: 135
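Shallow patterns like the tachycardia and weight-loss examples above can be approximated with regular expressions. This is a hedged sketch for illustration; the deck's actual grammars are LanguageWare rules, not regexes:

```python
import re

# Regex approximations of two shallow patterns from the slide. Coverage is
# deliberately narrow; these are illustrations, not the project's grammars.
TACHY = re.compile(r"\b(?:tachy(?:cardia)?\s*(?:at|@)|hr:?)\s*(\d{2,3})", re.I)
WTLOSS = re.compile(
    r"\b(\d+(?:\.\d+)?)\s*(?:pound|lb|kg|kilogram)s?\s+weight\s+loss", re.I)

def heart_rate(text):
    """Return the first heart-rate value mentioned, or None."""
    m = TACHY.search(text)
    return int(m.group(1)) if m else None

def weight_loss_amount(text):
    """Return the first weight-loss amount mentioned, or None."""
    m = WTLOSS.search(text)
    return float(m.group(1)) if m else None
```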
  10. Criteria Extraction Methods: Text Analysis Engines (TAEs)
     • Rules to filter candidate criteria created from dictionaries and grammars
     • Deny criteria mentioned in negated contexts
       – regular rate and rhythm without murmurs, clicks, gallops, or rubs → S3Neg
     • Ignore criteria in counterfactual contexts
       – Patient should call if she experiences shortness of breath
     • Co-occurrence constraints
       – exercise HR: 135 doesn’t affirm Tachycardia
     • Disambiguation
       – edema is recognized as APEdema if near cxr, or in a “Radiology” note, or in a “Chest X-Ray” segment
     • Numeric constraints
       – she lost 5 pounds over a month doesn’t affirm WeightLoss
       – tachy @ 115 doesn’t affirm Tachycardia
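The numeric constraints reduce to threshold checks against the Framingham definitions (tachycardia above 120 BPM; weight loss of at least 4.5 kg within 5 days). A sketch, assuming those two thresholds are applied literally:

```python
# Sketch of TAE-style numeric filters; thresholds come from the Framingham
# criteria table (tachycardia > 120 BPM, weight loss 4.5 kg in 5 days).
LB_TO_KG = 0.453592  # pounds to kilograms

def affirm_tachycardia(bpm):
    """Tachycardia counts only above the Framingham threshold of 120 BPM."""
    return bpm is not None and bpm > 120

def affirm_weight_loss(amount, unit, days):
    """Weight loss counts only if >= 4.5 kg within 5 days."""
    kg = amount * LB_TO_KG if unit in ("pound", "lb") else amount
    return kg >= 4.5 and days <= 5
```

This reproduces the slide's examples: a heart rate of 115 and "lost 5 pounds over a month" both fail their thresholds.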
  11. Encounter Labeling Methods
     • We can label an encounter note with labels showing the criteria that the note mentions
       – The labels can be used by downstream analyses to gather information such as: “This patient exhibited those symptoms on that date.”
     • 2 methods:
       – Machine-learning
         • Using candidate criteria and scope annotations as features, …
         • use a [CHAID decision tree] classifier to assign criteria as labels.
       – Rule-based
         • Run the full extractor pipeline, then …
         • assign labels consisting of all unique criteria that survive filtering.
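The rule-based method's final step amounts to collecting the unique surviving criteria per note; a minimal sketch, with polarity names chosen here for illustration:

```python
# Sketch of the rule-based labeling step: an encounter's labels are the unique
# criteria that survive filtering, grouped by polarity (affirmed vs. denied).
def label_encounter(extractions):
    """extractions: iterable of (criterion, polarity) pairs surviving the
    filter pipeline; returns {polarity: sorted unique criteria}."""
    labels = {"affirmed": set(), "denied": set()}
    for criterion, polarity in extractions:
        labels[polarity].add(criterion)
    return {pol: sorted(crits) for pol, crits in labels.items()}
```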
  12. Results
  13. Evaluation Flow
     [Diagram: encounter documents pass through lexical look-up to produce lexical & scope annotations; a machine-learning path and an encounter-label-rules path each assign encounter labels, which are compared, along with the extracted criteria, in the label evaluation.]
     Metrics:
     • Precision (positive predictive value): #TruePositive / (#TruePositive + #FalsePositive)
     • Recall (sensitivity): #TruePositive / (#TruePositive + #FalseNegative)
     • F-score (the harmonic mean of precision and recall): (2 × Precision × Recall) / (Precision + Recall)
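The three metrics translate directly to code, computed from raw true/false positive and negative counts:

```python
# The evaluation metrics from the slide, as plain functions of error counts.
def precision(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Sensitivity: TP / (TP + FN)."""
    return tp / (tp + fn)

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)
```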
  14. Encounter Labeling Performance
                         Machine-learning method               Rule-based method
                      Recall     Precision  F-Score      Recall     Precision  F-Score
     Affirmed         0.675000   0.754190   0.712401     0.738532   0.899441   0.811083
     Denied           0.945556   0.905319   0.925000     0.987599   0.931915   0.958949
     Overall          0.896364   0.881144   0.888689     0.938462   0.926720   0.932554
     Overall 99% CI              (0.848 - 0.929)                    (0.900 - 0.964)
     Conclusion: machine-learning labeling does not significantly underperform rule-based labeling.
  15. Performance of Framingham Diagnostic Criteria Extraction
                         Precision  Recall     F-score    99% Confidence Interval (F-score)
     Overall (exact)    0.925234   0.896864   0.910828   (0.891 - 0.929)
     Overall (relaxed)  0.948239   0.919164   0.933475   (0.916 - 0.950)
     Affirmed           0.747801   0.789474   0.768072   (0.711 - 0.824)
     Denied             0.982857   0.928058   0.954672   (0.938 - 0.970)
     Note: performance on affirmed criteria is worse, possibly because of their greater syntactic diversity. For example, we don’t find:
       PleuralEffusion: blunting of the right costrophrenic angle
       DOExertion: she felt like she couldn’t get enough air in
  16. Precision and Recall for Individual Criteria
     [Figure not reproduced.]
  17. Analysis of 1,492 extracted criteria: PredMED extractions vs. Gold Standard annotations
     [Confusion matrix; cell alignment is not recoverable from the extracted text. Rows are PredMED extractions and columns are gold-standard annotations for the affirmed and negated forms of each criterion (ANKED, APED, DOE, HEP, HJR, JVD, NC, PLE, PND, RALE, RC, S3G, TACH, WTL), plus a false-negative row and false-positive column; the largest counts fall on the diagonal, e.g. ANKEDNeg 230, RALENeg 197, DOENeg 135, S3GNeg 131, HEPNeg 125, DOE 116.]
  18. Discussion
     • Challenges
       – Data quality: EHR text data is messy.
         • >10% (i.e., 26/237) of the errors are caused by misspellings & bad sentence boundaries
       – Human anatomy
         • We need a better solution than word co-occurrence constraints
       – Syntactic diversity of affirmed criteria
         • We need deeper syntactic and semantic analysis
       – Contradictions and redundancy
         • An issue for downstream analysis
     • Opportunities
       – We can apply similar techniques to other collections of criteria.
         • NY Heart Association
         • European Society of Cardiology
         • MedicalCriteria.com
       – Many specific criteria extractors can be re-used in other settings.
       – For downstream applications, see posters and presentations from our project at this conference.
  19. Summary
     • Extractors can identify affirmations and denials of Framingham HF criteria in EHR clinical notes with an overall F-score of 0.91.
     • Classifiers can label EHR encounters with the Framingham criteria they mention with an F-score of 0.93.
     • Information about HF criteria mentioned in EHR notes appears to be useful for downstream applications that seek to achieve early detection of HF.
  20. Backup: Iterative Annotation Refinement
  21. Iterative Annotation Refinement
     • What are the problems solved?
       – Annotations are required for training and evaluating criteria extractors.
       – Human annotators without guidelines have high precision but lower recall.
       – Domain experts’ intuitions (about the language for expressing criteria) are initially imprecise.
     • What is produced?
       – Annotated dataset
       – Annotation guidelines
       – Criteria extractors
       … all consistent with one another
  22. The Development Process: Iterative Annotation Refinement
     [Diagram. Initialization: the expert writes initial guidelines and annotates the encounter texts, while the linguist builds initial criteria extractors. Iteration: annotate texts with the current extractors, perform error analysis, discuss the language of HF criteria, then update the annotations, the guidelines, and the extractors. Results: consistent annotations, annotation guidelines, and criteria extractors.]
  23. User interface for the annotation tool, which was used to manage annotations during refinement.
  24. Performance improvement during development
     [Scatter plot: “Performance comparison” of precision (y-axis, 0.5-1.0) vs. recall (x-axis, 0.5-1.0), showing initial and final operating points for PredMED and for the clinical expert.]
  25. Iterative methods for creating annotations, guidelines, and extractors
     • Iterative Annotation Refinement
       – Extraction target: Framingham HF criteria
       – Result of using the method: annotations, guidelines, extractor
       – Sources of annotations compared in each iteration: expert and extractor
       – Arbiter for disagreements at each iteration: expert
       – Objective (and metric) for each iteration: improve extractor performance (F-score)
     • Annotation Induction (Chapman et al., J Biomed Inform 2006)
       – Extraction target: clinical conditions
       – Result: guidelines (in the form of an annotation schema)
       – Sources compared: expert and linguist
       – Arbiter: consensus
       – Objective: improve inter-annotator agreement (F-score)
     • CDKRM (Coden et al., J Biomed Inform 2009)
       – Extraction target: classes in the cancer disease model
       – Result: annotations, guidelines
       – Sources compared: 2 experts
       – Arbiter: consensus
       – Objective: improve inter-annotator agreement (agreement %)
     • TALLAL (Carrell et al., GHRI-IT poster, 2010)
       – Extraction target: PHI (protected health information) classes
       – Result: annotations, extractor
       – Sources compared: expert and extractor
       – Arbiter: expert
       – Objective: annotate full dataset (to the expert’s satisfaction)
