Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs in Electronic Health Record Notes BYRD

  • Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs and Symptoms in Electronic Health Record Text Notes
    Roy J. Byrd², Steven R. Steinhubl¹, Jimeng Sun², Shahram Ebadollahi², Zahra Daar¹, Walter F. Stewart¹
    ¹ Geisinger Medical Center, Center for Health Research, Danville, PA
    ² IBM, T.J. Watson Research Center, Hawthorne, NY
  • Outline
    • Background and objectives
    • Datasets
    • Tools & Methods
    • Results
    • Discussion
      – Challenges
      – Opportunities
    • Summary
    • (Iterative annotation refinement)
  • Background and Objectives
    • Background
      – Framingham criteria for HF published in 1971
      – Geisinger/IBM “PredMED” project on predictive modeling for early detection of HF, using longitudinal EHRs
    • Overall Project Objective
      – Better understand the presentation of HF in the primary care setting, in order to facilitate its more rapid identification and treatment
    • Objective of this paper
      – Build and validate NLP extractors for Framingham criteria (signs and symptoms) from EHR clinical notes, so that they may be suitable for downstream diagnostic applications
  • Framingham HF Diagnostic Criteria
    MAJOR SYMPTOMS
      1. Paroxysmal Nocturnal Dyspnea (PND) or Orthopnea
      2. Neck Vein Distension (JVD)
      3. Rales
      4. Radiographic Cardiomegaly
      5. Acute Pulmonary Edema
      6. S3 Gallop
      7. Increased Central Venous Pressure (> 16 cm H2O at RA)
      8. Circulation Time of ≥ 25 seconds**
      9. Hepatojugular Reflux (HJR)
      10. Weight loss ≥ 4.5 kg in 5 days in response to treatment
    MINOR SYMPTOMS
      1. Bilateral Ankle Edema
      2. Nocturnal Cough
      3. Dyspnea on ordinary exertion
      4. Hepatomegaly
      5. Pleural effusion
      6. A decrease in vital capacity by 1/3 of the maximal value recorded**
      7. Tachycardia (>120 BPM)
    ** Not extracted, since these criteria are not documented in routine clinical practice.
    N Engl J Med. 1971;285:1441-1446.
  • (Sample downstream analysis) Reports of Framingham HF criteria in the year prior to diagnosis
    [Figure: bar chart. Y-axis: Percent with Documented Criteria. Series: Cases (N=4,644) and Controls (N=45,981). Criteria on the x-axis: PND, Rales, JVD, Pulm Edema, CMegaly, Ankle Edema, DOE.]
  • Datasets
    • Clinical notes from longitudinal (2001-2010) EHR encounters for
      – 6,355 case patients
        • Meet operational criteria for HF**
      – 26,052 control patients
        • Clinic-, gender- and age-matched to cases
      – The case-control distinction is exploited in downstream applications; it’s not relevant for criteria extraction.
    • Development dataset
      – 65 encounter notes
        • Selected for density of Framingham criteria
        • Annotated by a clinical expert
    • Validation dataset
      – 400 encounter notes (200 cases & 200 controls)
        • Randomly selected
        • Annotated by consensus of 4 trained coders
        • N = 1492 criteria
    **Operational HF Criteria
      – HF diagnosis on problem list,
      – HF diagnosis in EHR for two outpatient encounters,
      – Two or more medications with ICD-9 code for HF, or
      – One HF diagnosis and one medication with ICD-9 code for HF
  • Tools
    • LRW¹ – LanguageWare Resource Workbench
      – Basic Text Processing
      – Dictionaries
      – Grammars
    • UIMA² – Unstructured Information Management Architecture
      – Execution Pipeline, including I/O management
      – Text Analysis Engines
    • TextSTAT³ – Simple Text Analysis Tool
      – Concordance program, used for linguistic analysis
    [Figure: UIMA Collection Processing Engine. Encounter Documents → Basic Processing (paragraphs, sentences, tokenization, etc.) → Dictionaries and Grammars for recognizing criteria candidates → Text Analysis Engines for applying constraints and annotating criteria → Extracted Criteria.]
    1http://www/ 2 3
  • Criteria Extraction Methods: Dictionaries
    • Framingham Criteria vocabulary
      – Words and phrases used to mention the 15 Framingham Criteria
      – edema, leg edema, oedema; shortness of breath, SOB
      – Size: ~75 “lemma forms” (main entries) and hundreds of variant forms
    • Segment Header words
      – Patient History, Examination, Plan, Instruction
    • Negating words
      – Used to deny criteria
        • no, free of, ruled out
    • Counterfactual triggers
      – The criteria may not have occurred
        • if, should, as needed for
    • Miscellaneous Classes
      – Weight loss phrases
        • lose weight, diurese
      – Time value words and phrases
        • day, week, month
      – Weight units
        • pound, kilogram
      – Diuretics
        • Bumex, Furosemide
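    The dictionary look-up described on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not the LanguageWare dictionaries used in the project: the lemma names `EDEMA` and `DYSPNEA`, the helper names, and the grouping of variants under each lemma are assumptions made for the example. Only the variant strings themselves come from the slide.

```python
import re

# Hypothetical lemma names; the variant strings are the examples from the slide.
# The real dictionaries hold ~75 lemma forms and hundreds of variants.
CRITERIA_VARIANTS = {
    "EDEMA": ["edema", "leg edema", "oedema"],
    "DYSPNEA": ["shortness of breath", "SOB"],
}


def build_matcher(variants):
    """Compile one case-insensitive pattern and a variant-to-lemma lookup table."""
    pairs = [(v, lemma) for lemma, vs in variants.items() for v in vs]
    # Longest variants first, so "leg edema" is preferred over "edema".
    pairs.sort(key=lambda p: len(p[0]), reverse=True)
    lookup = {v.lower(): lemma for v, lemma in pairs}
    alternation = "|".join(re.escape(v) for v, _ in pairs)
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, lookup


def find_candidates(text, pattern, lookup):
    """Return (matched text, lemma) pairs in order of appearance."""
    return [(m.group(0), lookup[m.group(0).lower()]) for m in pattern.finditer(text)]


PATTERN, LOOKUP = build_matcher(CRITERIA_VARIANTS)
candidates = find_candidates("Pt reports SOB and leg edema.", PATTERN, LOOKUP)
```

    Matches found this way are only candidates; whether a mention is affirmed, denied, or ignored is decided by later stages.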
  • Criteria Extraction Methods: Grammars
    • Shallow English syntax
      – Noun Phrases
        • some moderate DOE
      – Compound Noun Phrases
        • chest pain, DOE, or night cough
      – Prepositional Phrases
    • No full-sentential parses
      – Not needed for simple HF criteria
      – Unreliable sentence boundaries and syntax in clinical notes
    • Negated Scope
      – regular rate and rhythm without murmurs, clicks, gallops, or rubs
    • Counterfactual Scope
      – Patient should call if she experiences shortness of breath
    • Weight Loss
      – 20 pound weight loss in a week with diuretics
    • Tachycardia
      – tachy at 120 (to 130)
      – HR: 135
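    As a rough illustration of what the Weight Loss and Tachycardia grammars need to capture, the example phrases on this slide can be matched with regular expressions. This is a sketch under stated assumptions: the project used LanguageWare grammars rather than regular expressions, and the pattern names, function names, and the exact set of accepted spellings below are hypothetical.

```python
import re

# Hypothetical patterns covering only the phrasings shown on the slide.
HR_PATTERN = re.compile(
    r"\b(?:tachy(?:cardia|cardic)?|hr|heart rate)\b\s*(?:at|@|:|of)?\s*(\d{2,3})",
    re.IGNORECASE,
)
WEIGHT_LOSS_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(pound|lb|kilogram|kg)s?\s+weight loss\s+"
    r"(?:in|over)\s+(a|\d+)\s+(day|week|month)s?",
    re.IGNORECASE,
)


def extract_heart_rate(text):
    """Return the first heart-rate value mentioned, or None."""
    m = HR_PATTERN.search(text)
    return int(m.group(1)) if m else None


def extract_weight_loss(text):
    """Return amount, unit, and time span of a weight-loss phrase, or None."""
    m = WEIGHT_LOSS_PATTERN.search(text)
    if not m:
        return None
    count = 1 if m.group(3).lower() == "a" else int(m.group(3))
    return {
        "amount": float(m.group(1)),
        "unit": m.group(2).lower(),
        "period_count": count,
        "period_unit": m.group(4).lower(),
    }


hr = extract_heart_rate("tachy at 120 (to 130)")
loss = extract_weight_loss("20 pound weight loss in a week with diuretics")
```

    Capturing the numbers separately from the decision about whether they affirm a criterion keeps the grammar simple and leaves thresholds to a later filtering step.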
  • Criteria Extraction Methods: Text Analysis Engines (TAEs)
    • Rules to filter candidate criteria created from dictionaries and grammars.
    • Deny criteria mentioned in negated contexts
      – regular rate and rhythm without murmurs, clicks, gallops, or rubs → S3Neg
    • Ignore criteria in counterfactual contexts
      – Patient should call if she experiences shortness of breath
    • Co-occurrence constraints
      – exercise HR: 135 doesn’t affirm Tachycardia
    • Disambiguation
      – edema is recognized as APEdema, if near cxr, or in a “Radiology” note, or in a “Chest X-Ray” segment
    • Numeric constraints
      – she lost 5 pounds over a month doesn’t affirm WeightLoss
      – tachy @ 115 doesn’t affirm Tachycardia
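    The filtering rules on this slide might be sketched as below. This is a simplified illustration, not the project's UIMA Text Analysis Engines. It assumes that a trigger anywhere earlier in the same sentence puts the mention in scope, that "without" counts as a negator (as in the slide's example), and that the tachycardia threshold is the Framingham value of more than 120 BPM. All function names are hypothetical.

```python
import re

# Trigger words from the dictionaries slide, plus "without" from the example sentence.
NEGATORS = ("no", "without", "free of")
COUNTERFACTUALS = ("if", "should", "as needed for")


def _has_trigger(prefix, triggers):
    """True if any trigger occurs as a whole word or phrase in the prefix."""
    return any(
        re.search(r"\b" + re.escape(t) + r"\b", prefix, re.IGNORECASE)
        for t in triggers
    )


def classify_mention(sentence, mention_start):
    """Classify a candidate mention as 'ignored', 'denied', or 'affirmed'.

    Simplification: only text before the mention is inspected, so trailing
    triggers such as "ruled out" are not handled here.
    """
    prefix = sentence[:mention_start]
    if _has_trigger(prefix, COUNTERFACTUALS):
        return "ignored"
    if _has_trigger(prefix, NEGATORS):
        return "denied"
    return "affirmed"


def affirms_tachycardia(sentence, heart_rate, threshold=120):
    """Apply the co-occurrence and numeric constraints for Tachycardia."""
    if re.search(r"\bexercise\b", sentence, re.IGNORECASE):
        return False  # exercise heart rates do not count
    return heart_rate > threshold


s1 = "regular rate and rhythm without murmurs, clicks, gallops, or rubs"
s2 = "Patient should call if she experiences shortness of breath"
status_gallop = classify_mention(s1, s1.index("gallops"))
status_sob = classify_mention(s2, s2.index("shortness"))
```

    Checking counterfactual triggers before negators means a hypothetical mention is dropped entirely rather than recorded as a denial.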
  • Encounter Labeling Methods
    • We can label an encounter note with labels showing the criteria that the note mentions
      – The labels can be used by downstream analyses to gather information such as: “This patient exhibited those symptoms on that date.”
    • 2 Methods:
      – Machine-learning
        • Using candidate criteria and scope annotations as features, …
        • use a [CHAID decision tree] classifier to assign criteria as labels.
      – Rule-based
        • Run the full extractor pipeline, then …
        • Assign labels consisting of all unique criteria that survive filtering.
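    The rule-based method amounts to collecting the distinct criteria left after filtering. A minimal sketch, assuming each surviving mention arrives as a (criterion, status) pair and that denied criteria carry a "Neg" suffix as in the label names used elsewhere in this deck; the function name and the input format are hypothetical.

```python
def label_encounter(mentions):
    """Return the sorted, de-duplicated labels for one encounter note.

    mentions: iterable of (criterion, status) pairs, where status is
    'affirmed', 'denied', or 'ignored' (assumed output of the filters).
    """
    labels = set()
    for criterion, status in mentions:
        if status == "affirmed":
            labels.add(criterion)
        elif status == "denied":
            labels.add(criterion + "Neg")
        # 'ignored' mentions (e.g. counterfactual) contribute no label
    return sorted(labels)


labels = label_encounter([
    ("DOE", "affirmed"),
    ("DOE", "affirmed"),   # repeated mention collapses into one label
    ("S3G", "denied"),
    ("PND", "ignored"),
])
```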
  • Results
  • Evaluation Flow
    [Figure: evaluation flow. Boxes: Encounter Documents; Lexical Look-up + Scope; Lexical & Scope Annotations; Machine Learning; Rules; Encounter Labels; Encounter Label Evaluation; Criteria.]
    Metrics:
      – Precision (Positive Predictive Value): #TruePositive / (#TruePositive + #FalsePositive)
      – Recall (Sensitivity): #TruePositive / (#TruePositive + #FalseNegative)
      – F-Score (the harmonic mean of Precision and Recall): (2 x Precision x Recall) / (Precision + Recall)
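    The three metrics can be computed directly from true-positive, false-positive, and false-negative counts. The sketch below is illustrative only: the function names are hypothetical and the counts in the usage lines are made up for the example, not taken from the study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)


# Made-up counts, for illustration only.
p, r = precision_recall(tp=8, fp=2, fn=2)
f = f_score(p, r)
```

    Because the F-score is a harmonic mean, it is pulled toward the lower of the two values, so a system cannot score well by maximizing only precision or only recall.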
  • Encounter Labeling Performance
                              Machine-learning method           Rule-based method
                              Recall    Precision  F-Score      Recall    Precision  F-Score
    Affirmed                  0.675000  0.754190   0.712401     0.738532  0.899441   0.811083
    Denied                    0.945556  0.905319   0.925000     0.987599  0.931915   0.958949
    Overall                   0.896364  0.881144   0.888689     0.938462  0.926720   0.932554
    Overall 99% Conf. Int.    (0.848-0.929)                     (0.900-0.964)
    Conclusion: Machine-learning labeling does not significantly underperform rule-based labeling.
  • Performance of Framingham Diagnostic Criteria Extraction
                          Precision  Recall    F-score   99% Confidence Interval (F-score)
    Overall (exact)       0.925234   0.896864  0.910828  (0.891 - 0.929)
    Overall (relaxed)     0.948239   0.919164  0.933475  (0.916 - 0.950)
    Affirmed              0.747801   0.789474  0.768072  (0.711 - 0.824)
    Denied                0.982857   0.928058  0.954672  (0.938 - 0.970)
    Note: Performance on affirmed criteria is worse, possibly because of their greater syntactic diversity. For example, we don’t find:
      – PleuralEffusion: blunting of the right costophrenic angle
      – DOExertion: she felt like she couldn’t get enough air in
  • Precision and Recall for Individual Criteria
  • Analysis of 1492 extracted criteria: PredMED extractions vs. Gold Standard annotations
    [Table: confusion matrix. Rows are PredMED extractions, plus a final False Negative row; columns are Gold Standard annotations for the same criteria and their negated forms, plus a False Positive column. Column positions of the cell values were lost; the non-empty cells of each row are listed in their original order.]
      ANKED: 90, 6, 16
      ANKEDNeg: 230, 6
      APED: 8, 5, 2, 1, 22
      APEDNeg: 0
      DOE: 116, 17, 1, 3
      DOENeg: 3, 135, 2, 1
      HEP: 0, 1
      HEPNeg: 125
      HJR: 2, 1
      HJRNeg: 9
      JVD: 7, 2
      JVDNeg: 91
      NC: 2
      NCNeg: 43, 2
      PLE: 8
      PLENeg: 1
      PND: 1, 7, 2
      PNDNeg: 69
      RALE: 11, 1
      RALENeg: 197
      RC: 6
      RCNeg: 1
      S3G: 0
      S3GNeg: 131
      TACH: 1, 2
      TACHNeg: 0, 4
      WTL: 0
      False Negative: 6, 8, 5, 2, 6, 5, 1, 4, 1, 3, 2, 2, 7, 35, 2, 1, 1, 10
  • Discussion
    • Challenges
      – Data quality: EHR text data is messy.
        • >10% (i.e., 26/237) of the errors are caused by misspellings & bad sentence boundaries
      – Human anatomy
        • We need a better solution than word co-occurrence constraints
      – Syntactic diversity of affirmed criteria
        • We need deeper syntactic and semantic analysis
      – Contradictions and redundancy
        • An issue for downstream analysis
    • Opportunities
      – We can apply similar techniques to other collections of criteria.
        • NY Heart Association
        • European Society of Cardiology
      – Many specific criteria extractors can be re-used in other settings.
      – For downstream applications, see posters and presentations from our project at this conference
  • Summary
    • Extractors can identify affirmations and denials of Framingham HF criteria in EHR clinical notes with an overall F-Score of 0.91.
    • Classifiers can label EHR encounters with the Framingham criteria they mention with an F-Score of 0.93.
    • Information about HF criteria mentioned in EHR notes appears to be useful for downstream applications that seek to achieve early detection of HF.
  • Backup: Iterative Annotation Refinement
  • Iterative Annotation Refinement
    • What are the problems solved?
      – Annotations are required for training and evaluating criteria extractors.
      – Human annotators without guidelines have high precision but lower recall.
      – Domain experts’ intuitions (about the language for expressing criteria) are initially imprecise.
    • What is produced?
      – Annotated dataset
      – Annotation guidelines … that are consistent
      – Criteria extractors
  • The Development Process: Iterative Annotation Refinement
    [Figure: process diagram with three stages.]
    • Initialization
      – Expert: write initial guidelines
      – Linguist: build initial extractors
    • Iteration
      – Annotate texts with current extractors
      – Perform error analysis
      – Discuss the language of HF criteria
      – Update the annotations and the guidelines (Expert)
      – Update the extractors (Linguist)
    • Results
      – Expert Annotations
      – Annotation Guidelines
      – Criteria Extractors
    Input: Encounter Texts
  • User interface for the annotation tool, which was used to manage annotations during refinement.
  • Performance improvement during development
    [Figure: performance comparison plotting Precision (y-axis, 0.5 to 1) against Recall (x-axis, 0.5 to 1), with Initial and Final points for PredMED and for the Clinical Expert.]
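  • Iterative methods for creating annotations, guidelines, and extractors
    • Iterative Annotation Refinement
      – Extraction target: Framingham HF criteria
      – Result of using the method: Annotations, Guidelines, Extractor
      – Sources of annotations compared in each iteration: Expert and Extractor
      – Arbiter for disagreements at each iteration: Expert
      – Objective (and metric) for each iteration: Improve extractor performance (F-score)
    • Annotation Induction (Chapman, et al., J Biom Inf 2006)
      – Extraction target: Clinical conditions
      – Result of using the method: Guidelines (in the form of an annotation schema)
      – Sources of annotations compared in each iteration: Expert and Linguist
      – Arbiter for disagreements at each iteration: Consensus
      – Objective (and metric) for each iteration: Improve inter-annotator agreement (F-score)
    • CDKRM (Coden, et al., J Biom Inf 2009)
      – Extraction target: Classes in the cancer disease model
      – Result of using the method: Annotations, Guidelines
      – Sources of annotations compared in each iteration: 2 Experts
      – Arbiter for disagreements at each iteration: Consensus
      – Objective (and metric) for each iteration: Improve inter-annotator agreement (agreement %)
    • TALLAL (Carrell, et al., GHRI-IT poster, 2010)
      – Extraction target: PHI (protected health information) classes
      – Result of using the method: Annotations, Extractor
      – Sources of annotations compared in each iteration: Expert and Extractor
      – Arbiter for disagreements at each iteration: Expert
      – Objective (and metric) for each iteration: Annotate full dataset (to the expert’s satisfaction)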