Segmentation of Clinical Texts

This was presented at the 2014 IEEE Conference on Big Data

For details:
http://kavita-ganesan.com/content/general-supervised-approach-segmentation-clinical-texts

Citation:
Ganesan, Kavita, and Michael Subotin. "A General Supervised Approach to Segmentation of Clinical Texts."

  1. Kavita Ganesan & Michael Subotin. Presented at: 2014 IEEE Conference on Big Data
  2. All sorts of note types!
       • Admit notes
           ◦ documenting why the patient is being admitted
           ◦ baseline status, etc.
       • Progress notes
           ◦ progress during the course of hospitalization
       • Discharge notes
           ◦ conclusion of a hospital stay or series of treatments
       • Others
           ◦ Operative notes
           ◦ Procedure notes
           ◦ Delivery notes
           ◦ Emergency Department notes, etc.
  3. Sample note:
       PRIMARY CARE PHYSICIAN: Dr. XXXXX XXXXXXXX.
       CHIEF COMPLAINT: Injured right little toe.
       HISTORY OF PRESENT ILLNESS: This is a 63-year-old male with a past medical history of multiple myeloma who presents today after hitting his fifth toe of the right foot on a wood panel yesterday……
       Review of Systems: CONSTITUTIONAL: No fever, chills, or weight loss. RESPIRATORY: No cough, shortness of breath, or wheezing. CARDIOVASCULAR: No chest pain, chest pressure, or palpitations. ...............
       PAST MEDICAL HISTORY Multiple myeloma, peripheral neuropathy, hypertension..
       PAST SURGICAL HISTORY:- Stem cell transplant.
       SOCIAL HISTORY The patient formerly smoked tobacco; however, quit within the last 10 years.
       FAMILY HISTORY: Hypertension.
       ALLERGIES: ASPIRIN. ………
     Annotations on the note:
       • Purpose of visit
       • Patient’s current condition in narrative form
       • Ongoing issues, issues in the past
       • Information on allergies
  4. Sample note:
       PRIMARY CARE PHYSICIAN: Dr. XXXXX XXXXXXXX.
       CHIEF COMPLAINT: Injured right little toe.
       HISTORY OF PRESENT ILLNESS: This is a 63-year-old male with a past medical history of multiple myeloma who presents today after hitting his fifth toe of the right foot on a wood panel yesterday……
       Review of Systems: CONSTITUTIONAL: No fever, chills, or weight loss. RESPIRATORY: No cough, shortness of breath, or wheezing. CARDIOVASCULAR: No chest pain, chest pressure, or palpitations. ...............
       PAST MEDICAL HISTORY Multiple myeloma, peripheral neuropathy, hypertension..
       PAST SURGICAL HISTORY:- Stem cell transplant.
       SOCIAL HISTORY The patient formerly smoked tobacco; however, quit within the last 10 years.
       FAMILY HISTORY: Hypertension.
       ALLERGIES: ASPIRIN. ………
     Annotations on the note:
       • Purpose of visit
       • Patient’s current condition in narrative form
       • Ongoing issues, issues in the past
       • Information on allergies
     This is how most notes look:
       • some longer, some shorter
       • different set of headers, etc.
  5. Sample note (shown as three overlapping copies on the slide):
       PRIMARY CARE PHYSICIAN: Dr. XXXXX XXXXXXXX.
       CHIEF COMPLAINT: Injured right little toe.
       HISTORY OF PRESENT ILLNESS: This is a 63-year-old male with a past medical history of…
       Review of Systems: CONSTITUTIONAL: No fever, chills, or weight loss. CARDIOVASCULAR: No chest pain, chest pressure, or palpitations. ............... ………
       • Very unstructured
           ◦ formatting cues → inconsistent
           ◦ varies across physicians, notes, hospitals
       • Hard to analyze specific sections
           ◦ E.g. analyze allergies across a patient population
           ◦ Need to segment notes to extract all allergy info.
  6.   ◦ Information collected varies from note type to note type
           • Ex. info on progress notes vs. admit notes
       ◦ Contents & formatting can vary from hospital to hospital
           • Even within the same organization (e.g. Kaiser)
       ◦ Contents & formatting vary between physicians
           • Different styles, speed of typing, etc.
  7.   • If you are looking at a single note type, from a single hospital: then maybe
       • Not suitable as a general segmentation approach
       • Can easily break:
           ◦ on unseen note types and minor format variations
           ◦ Example:
               • regex based on all caps
               • regex based on seen headers only
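
To make the brittleness concrete, here is a minimal sketch of an all-caps regex rule. The pattern and the four lines (adapted from the sample note) are illustrative assumptions, not rules taken from the talk.

```python
import re

# Hypothetical rule: a section header is an all-caps phrase followed by a colon.
ALL_CAPS_HEADER = re.compile(r"^[A-Z][A-Z ]+:")

# Lines adapted from the sample note shown on the earlier slides.
lines = [
    "CHIEF COMPLAINT: Injured right little toe.",
    "Review of Systems:",
    "PAST MEDICAL HISTORY",
    "FAMILY HISTORY: Hypertension.",
]

# True where the rule fires; mixed-case and colon-less headers are missed.
detected = [bool(ALL_CAPS_HEADER.match(line)) for line in lines]

for line, hit in zip(lines, detected):
    print(f"{hit!s:5}  {line}")
```

The rule catches the two headers that happen to match its formatting assumption and misses the mixed-case and colon-less ones, which is the kind of minor format variation the slide warns about.
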
  8.   • Several works have explored supervised methods for segmenting clinical notes [Cho et al. 2003, Tepper et al. 2012, Apostolova et al. 2009]
       • Problem: methods not general!
           ◦ Cho et al. 2003: one model for each type of note
               • 20 note types → 20 models!
               • Not practical → must maintain each model
           ◦ Tepper et al. 2012: model had low adaptability to unseen documents
               • features used, training data used, etc.
  9.   • General segmentation approach for clinical texts
       • Requirements:
           ◦ Single model/approach for most note types
           ◦ Discount extreme non-standard formatting, e.g. tabular format
       • Segment:
           ◦ Header
           ◦ Top-level sections
           ◦ Footer
  10. Sample note:
        PRIMARY CARE PHYSICIAN: Dr. XXXXX XXXXXXXX.
        CHIEF COMPLAINT: Injured right little toe.
        HISTORY OF PRESENT ILLNESS: This is a 63-year-old male with a past medical history of multiple myeloma who presents today after hitting his fifth toe of the right foot on a wood panel yesterday……
        Review of Systems: CONSTITUTIONAL: No fever, chills, or weight loss. RESPIRATORY: No cough, shortness of breath, or wheezing. CARDIOVASCULAR: No chest pain, chest pressure, or palpitations. ...............
        PAST MEDICAL HISTORY Multiple myeloma, peripheral neuropathy, hypertension..
        PAST SURGICAL HISTORY:- Stem cell transplant.
        SOCIAL HISTORY The patient formerly smoked tobacco; however, quit within the last 10 years.
        FAMILY HISTORY: Hypertension.
        ALLERGIES: ASPIRIN. ………
      Segment labels shown on the note: Header (once), Top-level section (seven times)
  11.   • Supervised approach using L1-regularized logistic regression with a constraint combination approach
        • Idea: scan each line in a clinical document and label it as:
            ◦ BeginHeader
            ◦ ContHeader
            ◦ BeginSection
            ◦ ContSection
            ◦ Footer
        • Labels are predicted with a certain confidence
        • But there is a problem with using line-wise predictions as is:
            ◦ Label sequences may not make sense
            ◦ E.g. there may be a BeginHeader after a BeginSection → incorrect
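
The problem with independent line-wise predictions can be sketched in a few lines. The confidence values below are made up for illustration; in the approach described, they would come from the trained classifier.

```python
LABELS = ["BeginHeader", "ContHeader", "BeginSection", "ContSection", "Footer"]

# Hypothetical per-line confidences: one row per line, one column per label.
line_confidences = [
    [0.10, 0.05, 0.70, 0.10, 0.05],
    [0.45, 0.05, 0.05, 0.40, 0.05],
    [0.05, 0.05, 0.10, 0.75, 0.05],
]

def independent_labels(confidences):
    """Pick the most confident label for each line, ignoring its neighbours."""
    return [LABELS[max(range(len(LABELS)), key=row.__getitem__)]
            for row in confidences]

predicted = independent_labels(line_confidences)

# Flag the case named on the slide: a BeginHeader right after a BeginSection.
suspicious = [i for i in range(1, len(predicted))
              if predicted[i] == "BeginHeader" and predicted[i - 1] == "BeginSection"]

print(predicted)
print(suspicious)
```

Here the second line narrowly prefers BeginHeader over ContSection, so taking each line's best label on its own yields a sequence that does not make sense as a document.
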
  12.   • Post-processing: enforce sequence combination rules:
            ◦ First line of document: BeginHeader or BeginSection
            ◦ BeginHeader cannot come right after BeginHeader or ContHeader
            ◦ ContHeader must come after BeginHeader or ContHeader
            ◦ ContSection must come after BeginSection or ContSection
            ◦ Footer cannot come right after BeginHeader or ContHeader
        • Rules applied after all lines in the document are labeled
            ◦ Applied to consecutive label pairs
            ◦ Computed efficiently: Viterbi algorithm
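
The rules above can be sketched as a constrained Viterbi decode over per-line confidences. This is a minimal illustration, not the authors' implementation: it encodes only the five rules listed on the slide, assumes every confidence is strictly positive, and uses made-up confidence values.

```python
import math

LABELS = ["BeginHeader", "ContHeader", "BeginSection", "ContSection", "Footer"]
HEADER = ("BeginHeader", "ContHeader")
NEG_INF = float("-inf")

def allowed(prev, cur):
    """Pairwise rules from the slide; prev is None for the first line."""
    if prev is None:
        return cur in ("BeginHeader", "BeginSection")
    if cur == "BeginHeader":
        return prev not in HEADER
    if cur == "ContHeader":
        return prev in HEADER
    if cur == "ContSection":
        return prev in ("BeginSection", "ContSection")
    if cur == "Footer":
        return prev not in HEADER
    return True  # the slide lists no restriction on BeginSection

def constrained_decode(confidences):
    """Return the highest-scoring label sequence that satisfies the rules."""
    scores = [math.log(p) if allowed(None, lab) else NEG_INF
              for lab, p in zip(LABELS, confidences[0])]
    backpointers = []
    for row in confidences[1:]:
        new_scores, pointers = [], []
        for cur, p in zip(LABELS, row):
            candidates = [i for i, prev in enumerate(LABELS) if allowed(prev, cur)]
            best = max(candidates, key=lambda i: scores[i])
            new_scores.append(scores[best] + math.log(p))
            pointers.append(best)
        scores = new_scores
        backpointers.append(pointers)
    idx = max(range(len(LABELS)), key=lambda i: scores[i])
    path = [idx]
    for pointers in reversed(backpointers):
        idx = pointers[idx]
        path.append(idx)
    return [LABELS[i] for i in reversed(path)]

# Hypothetical confidences; taken line by line, the middle line would be BeginHeader.
line_confidences = [
    [0.10, 0.05, 0.70, 0.10, 0.05],
    [0.45, 0.05, 0.05, 0.40, 0.05],
    [0.05, 0.05, 0.10, 0.75, 0.05],
]
decoded = constrained_decode(line_confidences)
print(decoded)
```

Because ContSection may not directly follow BeginHeader, the decoder trades the slightly more confident BeginHeader on the middle line for ContSection, which gives a valid sequence with the best overall score.
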
  13. Data: Inpatient and Outpatient
        • Notes from 12 different enterprises
        • Some large enterprises
        • All sorts of note types
        • Some noisy sectioning, some clean
        • 100 radiology notes
        • Fairly clean sections
        • One hospital
        • All sorts of note types
        • Fairly well sectioned
        • 35,000 notes in total
        • 2000 randomly sampled notes (inpatient)
        • 100 radiology notes
        • Fairly clean sections
  14.   • Emphasis on training data
        • Variation in training data
            ◦ Use different note types for training
            ◦ Intuition: helps the model generalize well
        • Sample training data:
            ◦ Instead of using all training data from 2100 notes
            ◦ Generated subsets of training data of varying size and cross-validated on test sets
            ◦ Intuition: allows picking the best model
        • Best model used fewer than 700 notes (out of 2100)
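
The subset-selection idea can be sketched as a simple loop over candidate training-set sizes. Everything here is a hypothetical stand-in: `train_and_score` represents whatever training and validation procedure is used, and the toy scorer below just counts distinct note types so the example runs on its own.

```python
import random

def pick_best_training_subset(notes, sizes, train_and_score, seed=0):
    """Sample one subset per size and keep the one with the highest score."""
    rng = random.Random(seed)
    best = None
    for size in sizes:
        subset = rng.sample(notes, size)
        score = train_and_score(subset)
        if best is None or score > best[0]:
            best = (score, size, subset)
    return best

# Toy pool of notes with three note types (hp, ds, pn).
notes = [{"id": i, "type": t} for i, t in enumerate(["hp", "ds", "pn"] * 10)]

def toy_score(subset):
    """Stand-in for validation accuracy: number of note types covered."""
    return len({note["type"] for note in subset})

best_score, best_size, best_subset = pick_best_training_subset(
    notes, sizes=[2, 5, 10], train_and_score=toy_score
)
print(best_size, best_score)
```

In practice the scorer would train a model on the subset and report its accuracy on held-out notes; the point of the loop is only that the largest subset is not assumed to be the best one.
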
  15.   • 5 test sets
            ◦ 4 of the 5 test sets are from hospitals not in the train set → true estimate of accuracy
            ◦ Cover both inpatient and outpatient notes
            ◦ Cover different note types
            ◦ ~12,500 test notes
        • Primary evaluation metric: line-wise accuracy
            ◦ percentage of correctly predicted line labels
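
Line-wise accuracy as defined on this slide is straightforward to compute. A minimal sketch, with made-up gold and predicted labels:

```python
def linewise_accuracy(gold, predicted):
    """Percentage of lines whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of lines")
    if not gold:
        raise ValueError("cannot score an empty document")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)

# Illustrative labels for a four-line document; one line is mislabeled.
gold = ["BeginHeader", "BeginSection", "ContSection", "Footer"]
pred = ["BeginHeader", "BeginSection", "BeginSection", "Footer"]

accuracy = linewise_accuracy(gold, pred)
print(f"{accuracy:.2f}%")
```
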
  16. Important to have variety in training notes when building a general segmentation model
        Train set | 3-fold cross-validation accuracy | Unseen test accuracy
        Inp1HospB (300 - limited) | 96.70% | 67.00%
        Inp3HospD (300 - varied) | 96.58% | 88.23%
        • 1st model: limited variety (hp + discharge)
        • 2nd model: variety (11 types: hp, ds, pn…)
        • 3-fold cross-validation accuracy: high in both
        • Model with variety: higher accuracy on unseen test set
  17. Accuracy consistently > 90% across enterprises
        Client/Data | In/Outpatient | # Test Docs | Accuracy
        1. Inp1HospB | In | 300 | 92.58%
        2. Inp2HospC | In | 1000 | 93.29%
        3. Inp3HospD | In | 300 | 95.81%
        4. Rad1MixedHosps | Out | 9000 | 92.45%
        5. Rad2HospA | Out | 1902 | 93.67%
        Average | | | 93.56%
        • Average accuracy: 93.56%
        • Covers inpatient/outpatient
        • Single model, but performs well across enterprises
  18. Accuracy Breakdown for Inp2HospC
        Document Type | Accuracy
        1. History and Physical | 95.70%
        2. Physician Clinicals | 93.10%
        3. Discharge Summary | 94.00%
        4. Consult Note | 94.60%
        5. Short Stay Summary | 94.60%
        6. Operative Note | 92.20%
        7. Progress Note | 87.80%
        8. Cardiac Cath Report | 85.40%
        9. Procedure Note | 83.60%
        • Performs very well (> 90%) on types 1 to 6; reasonable (> 80%) on types 7 to 9
        • Model performs well across note types
        • Lowest performance: procedure notes, with low recall on segmenting “technique” sections
  19. # Notes vs. Accuracy
        [Figure: accuracy (y-axis, 86.00% to 94.00%) vs. # training notes (x-axis, 0 to 2000)]
        • Avg. accuracy peaks at 500 notes on all test sets
        • No benefit with more notes
        • No need for big data for a general model. We need good data from all that big data!
  20.   • Unigrams of each line (LineUnigram)
        • Relative position of line in document (PosInDoc)
            ◦ Top, Middle, Bottom
        • Known header features (KnownHeader)
            ◦ Find potential headers using a repository of seen headers
            ◦ Seen headers can have a canonical type, e.g. Past Medical History, Previous Med History → “PAST_MEDICAL_HISTORY”
            ◦ If potential headers are found, we include features:
                • Canonical type
                • Unigram & char n-gram of potential header
                • Caps/colon info: mixed case, all caps, lowercase
                • Length of potential header
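
A rough sketch of how these three feature groups might be extracted for a single line follows. The header repository, the thirds-based position buckets, the feature-name strings, and the tokenization are all assumptions made for illustration; char n-grams are omitted for brevity.

```python
# Hypothetical repository of seen headers mapped to canonical types.
KNOWN_HEADERS = {
    "past medical history": "PAST_MEDICAL_HISTORY",
    "previous med history": "PAST_MEDICAL_HISTORY",
    "allergies": "ALLERGIES",
}

def position_bucket(index, total):
    """PosInDoc: Top, Middle, or Bottom (assumed here to be thirds of the document)."""
    fraction = index / max(total - 1, 1)
    if fraction < 1 / 3:
        return "Top"
    if fraction < 2 / 3:
        return "Middle"
    return "Bottom"

def case_style(text):
    """Caps info for a potential header."""
    if text.isupper():
        return "all_caps"
    if text.islower():
        return "lowercase"
    return "mixed_case"

def line_features(line, index, total):
    """Return a set of string features for one line of a note."""
    tokens = [tok.strip(".,:;-").lower() for tok in line.split()]
    features = {f"unigram={tok}" for tok in tokens if tok}  # LineUnigram
    features.add(f"pos={position_bucket(index, total)}")    # PosInDoc
    lowered = line.lower()
    for header, canonical in KNOWN_HEADERS.items():          # KnownHeader
        if lowered.startswith(header):
            surface = line[:len(header)]
            rest = line[len(header):].lstrip()
            features.add(f"header_type={canonical}")
            features.add(f"header_case={case_style(surface)}")
            features.add(f"header_colon={rest.startswith(':')}")
            features.add(f"header_len={len(header.split())}")
            features.update(f"header_unigram={tok}" for tok in header.split())
            break
    return features

features = line_features("Previous Med History: hypertension.", index=5, total=10)
print(sorted(features))
```

Mapping both "Past Medical History" and "Previous Med History" to one canonical type lets the classifier share evidence across surface variants of the same section header.
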
  21. Feature Set | Avg. Accuracy | Improvement
        LineUnigram | 85.55% |
        LineUnigram+PosInDoc | 88.62% | +3.46%
        LineUnigram+PosInDoc+KnownHeader | 93.10% | +4.81%
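
The slide does not say how the Improvement column was computed. As a quick check, the reported figures are consistent with the accuracy gain divided by the new accuracy; this is an inference from the numbers, not something stated in the talk.

```python
# Average accuracies as reported on the slide.
accuracies = [
    ("LineUnigram", 85.55),
    ("LineUnigram+PosInDoc", 88.62),
    ("LineUnigram+PosInDoc+KnownHeader", 93.10),
]

rows = []
for (_, previous), (name, current) in zip(accuracies, accuracies[1:]):
    gain_points = round(current - previous, 2)           # absolute gain in points
    relative_to_new = round(100 * (current - previous) / current, 2)
    rows.append((name, gain_points, relative_to_new))

for name, gain_points, relative_to_new in rows:
    print(f"{name}: +{gain_points} points, +{relative_to_new}% relative to new accuracy")
```

In absolute terms, the gains are 3.07 and 4.48 percentage points respectively.
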
  22.   • Explored:
            ◦ Supervised approach to building a very general segmentation model for clinical texts
        • Evaluation showed:
            ◦ Model works well on notes across enterprises
            ◦ Model works across note types
        • Key to effectiveness:
            ◦ Variation in training data: all sorts of note types
            ◦ Training data selection strategy: sample and cross-validate
            ◦ Feature set: not explored in existing works
  23. Contact: Kavita Ganesan, ganesan.kavita@gmail.com, www.kavita-ganesan.com, www.text-analytics101.com