A Comparative Study of Supervised Learning as Applied to Acronym Expansion in Clinical Reports
Mahesh Joshi, Serguei Pakhomov, Ted Pedersen, Christopher G. Chute
University of Minnesota, Duluth
Mayo College of Medicine, Rochester
Overview
  • Acronyms are ambiguous in general, and in more specialized domains
  • Acronyms can be disambiguated by expansion
      • expansions act as senses or definitions
  • Acronym expansion can be viewed as word sense disambiguation
      • supervised learning from annotated examples
  • Features trump learning algorithms
      • unigrams dominant
AMIA - Top Google Results
  • American Medical Informatics Association
  • Association of Moving Image Archivists
  • Anglican Mission in America
  • Asociación Mutual Israelita Argentina
RN in Wikipedia
  • Registered Nurse
  • Royal Navy
  • Radio National
  • Radio Nederland
  • Richard Nixon
  • Registered Identification Number
  • Renovación Nacional
Acronym Ambiguity
  • Not just a problem for General English…
  • 33% of acronyms in UMLS are ambiguous (Liu et al., AMIA-2001)
  • 81% of acronyms in MEDLINE abstracts are ambiguous, with an average of 16 expansions (Liu et al., AMIA-2002)
We view AE as WSD
  • AE
      • sense 1: American Eagle
      • sense 2: Arab Emirates
      • sense 3: acronym expansion
  • WSD
      • sense 1: Washington School for the Deaf
      • sense 2: web server director
      • sense 3: word sense disambiguation
Methodology
  • Identify 16 ambiguous acronyms
      • 9 from Pakhomov et al., AMIA-2005
      • 7 newly annotated for this study
  • Manually annotate in clinical notes
      • 7,738 total instances from Mayo Clinic database of clinical notes
  • Use as training data for supervised learning
Acronyms (majority < 50%)
  • AC: Acromioclavicular; Antitussive with Codeine; Acid Controller; 10 more expansions
  • APC: Argon Plasma Coagulation; Adenomatous Polyposis Coli; Atrial Premature Contraction; 10 more expansions
  • LE: Limited Exam; Lower Extremity; Initials; 5 more expansions
  • PE: Pulmonary Embolism; Pressure Equalizing; Patient Education; 12 more expansions
Acronyms (50% < majority < 80%)
  • CP: Chest Pain; Cerebral Palsy; Cerebellopontine; 19 more expansions
  • HD: Huntington's Disease; Hemodialysis; Hospital Day; 9 more expansions
  • CF: Cystic Fibrosis; Cold Formula; Complement Fixation; 6 more expansions
  • MCI: Mild Cognitive Impairment; Methylchloroisothiazolinone; Microwave Communications, Inc.; 5 more expansions
  • ID: Infectious Disease; Identification; Idaho; Identified; 4 more expansions
  • LA: Long Acting; Person; Left Atrium; 5 more expansions
Acronyms (majority > 80%)
  • MI: Myocardial Infarction; Michigan; Unknown; 2 more expansions
  • ACA: Adenocarcinoma; Anterior Cerebral Artery; Anterior Communicating Artery; 3 more expansions
  • GE: Gastroesophageal; General Exam; Generose; General Electric
  • HA: Headache; Hearing Aid; Hydroxyapatite; 2 more expansions
  • FEN: Fluids, Electrolytes and Nutrition; Drug Fen Phen; Unknown
  • NSR: Normal Sinus Rhythm; Nasoseptal Reconstruction
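The three acronym groups are defined by how often the most frequent expansion occurs among the annotated instances. A minimal Python sketch of that grouping follows. The function names are hypothetical, the label counts are invented for illustration only, and the handling of a share of exactly 50% or 80% is an assumption, since the slides do not say which group such a case falls into.

```python
from collections import Counter


def majority_share(expansions):
    """Fraction of annotated instances that carry the most frequent expansion."""
    counts = Counter(expansions)
    return counts.most_common(1)[0][1] / len(expansions)


def distribution_group(share):
    """Map a majority share to one of the three groups used on the slides.

    Assumption: shares of exactly 0.5 or 0.8 go to the higher group.
    """
    if share < 0.5:
        return "majority < 50%"
    if share < 0.8:
        return "50% < majority < 80%"
    return "majority > 80%"


# Invented label counts, NOT the study's data.
toy_labels = ["Myocardial Infarction"] * 9 + ["Michigan"]
print(distribution_group(majority_share(toy_labels)))
```

The majority share is also the accuracy of a baseline that always picks the most frequent expansion, which is why it is a natural reference point for the classifiers compared later.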
Experimental Objectives
  • Compare performance of ML methods
      • Naïve Bayesian classifier
      • J48/C4.5 decision tree learner
      • Support vector machine (SMO)
  • Compare four different feature sets
      • POS tags from Brill-Hepple Tagger
      • Unigrams that occur 5 or more times
          • Flexible window of size 5 around target
      • Bigrams that occur 5 or more times
          • Flexible window of size 5 around target
      • Unigrams + Bigrams + POS tags
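To make the unigram feature set and one of the learners concrete, here is a minimal pure-Python sketch of a frequency cutoff (keep unigrams seen 5 or more times) followed by a multinomial Naive Bayes classifier with add-one smoothing. This is an illustration under stated assumptions and not the WEKA implementation used in the study: the function names are hypothetical, the smoothing choice is an assumption, and the training contexts are invented toy data.

```python
import math
from collections import Counter, defaultdict


def select_features(instances, min_count=5):
    """Keep unigrams that occur at least `min_count` times in the training contexts."""
    counts = Counter(word for words, _ in instances for word in words)
    return {word for word, count in counts.items() if count >= min_count}


def train_naive_bayes(instances, vocabulary):
    """instances: list of (context_words, expansion) pairs."""
    sense_counts = Counter(sense for _, sense in instances)
    word_counts = defaultdict(Counter)
    for words, sense in instances:
        word_counts[sense].update(w for w in words if w in vocabulary)
    return sense_counts, word_counts


def classify(words, model, vocabulary):
    """Return the expansion with the highest smoothed log-probability."""
    sense_counts, word_counts = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        # Add-one smoothing: every vocabulary word gets one pseudo-count.
        denom = sum(word_counts[sense].values()) + len(vocabulary)
        score = math.log(n / total)
        for w in words:
            if w in vocabulary:
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Invented toy contexts for the acronym CF, NOT the study's data.
training = ([(["carrier", "testing"], "Cystic Fibrosis")] * 5
            + [(["cough", "syrup"], "Cold Formula")] * 5)
vocabulary = select_features(training)
model = train_naive_bayes(training, vocabulary)
print(classify(["carrier"], model, vocabulary))
```

Words below the cutoff are ignored both in training and at classification time, so rare context words cannot influence the decision.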
Feature Extraction
  • Horizon: up to 5 content words to left and right of target
  • Boundaries: cross sentences, but not clinical notes
  • Skip stop words
  • Bigrams are pairs of contiguous content words
  • Example (CF is target; selected words shown in brackets):
      • Unigrams: “if she is [found] to be a [carrier], then they will [follow] with [CF] [carrier] [testing] in her [husband].”
      • Bigrams: “if she is found to be a carrier, then they will follow with [CF carrier testing] in her husband.”
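The flexible window described above can be sketched in a few lines of Python. This is a hedged illustration, not the NSPGate code: the stop list is a hypothetical one just large enough for the example sentence, the function names are invented, the target itself is left out of the unigram features (an assumption), and the sketch handles a single text rather than enforcing note boundaries.

```python
import re

# Hypothetical stop list, only large enough for the example sentence.
STOP_WORDS = {"if", "she", "is", "to", "be", "a", "then", "they",
              "will", "with", "in", "her"}


def tokenize(text):
    """Split text into alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def extract_features(text, target, horizon=5):
    """Return (unigrams, bigrams) from a flexible window around the target.

    The window takes up to `horizon` content words on each side, skipping
    stop words. Bigrams are pairs of adjacent tokens that both lie inside
    the window, so a stop word between two content words blocks the pair.
    """
    tokens = tokenize(text)
    t = tokens.index(target)  # first occurrence of the target acronym
    content = [i for i, tok in enumerate(tokens)
               if tok.lower() not in STOP_WORDS]
    left = [i for i in content if i < t][-horizon:]
    right = [i for i in content if i > t][:horizon]
    unigrams = [tokens[i] for i in left + right]
    keep = set(left) | {t} | set(right)
    bigrams = [(tokens[i], tokens[i + 1])
               for i in sorted(keep) if i + 1 in keep]
    return unigrams, bigrams


sentence = ("if she is found to be a carrier, then they will follow "
            "with CF carrier testing in her husband.")
unigrams, bigrams = extract_features(sentence, "CF")
print(unigrams)
print(bigrams)
```

Because the window counts content words rather than raw token positions, it reaches back to "found" even though twelve tokens separate it from the target; a fixed window of five tokens would have stopped at "then".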
Results (majority < 50%)
Results (50% < majority < 80%)
Results (majority > 80%)
Results (flexible window)
Conclusions
  • Overall expansion accuracy at or above 90% regardless of distribution
  • Differences in accuracy are largely due to features, not ML algorithms
  • Addition of bigrams and POS tags helps performance, but unigrams dominant
  • Flexible window improves upon fixed window feature selection
Future Work
  • Expand all acronyms in a text, not just a select few
      • expand based on prior expansions
      • utilize one sense per discourse constraint
  • Integrate supervised methods with knowledge-based approaches and clustering methods to reduce need for annotated examples
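The slides name the one sense per discourse constraint only as future work and do not say how it would be applied. One possible reading, sketched below in Python under that caveat, is to let all instances of an acronym within the same clinical note take the expansion predicted most often for that acronym in that note. The function name, the tuple layout, and the note identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def apply_one_sense_per_discourse(predictions):
    """Relabel each instance with the majority prediction for its note.

    predictions: list of (note_id, acronym, predicted_expansion) tuples.
    Returns a list in the same order with the expansion replaced by the
    most frequent prediction for that acronym within that note.
    """
    votes = defaultdict(Counter)
    for note_id, acronym, expansion in predictions:
        votes[(note_id, acronym)][expansion] += 1
    return [(note_id, acronym, votes[(note_id, acronym)].most_common(1)[0][0])
            for note_id, acronym, _ in predictions]


# Invented predictions, NOT output from the study's classifiers.
predicted = [
    ("note-1", "CF", "Cystic Fibrosis"),
    ("note-1", "CF", "Cold Formula"),
    ("note-1", "CF", "Cystic Fibrosis"),
    ("note-2", "CF", "Cold Formula"),
]
print(apply_one_sense_per_discourse(predicted))
```

A vote of this kind can only help when the per-instance classifier is right more often than wrong within a note, so it would need to be evaluated rather than assumed.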
Acknowledgments We would like to thank our annotators Barbara Abbott, Debra Albrecht and Pauline Funk.  This work was supported in part by the NLM Training Grant (T15 LM07041-19) and the NIH Roadmap Multidisciplinary Clinical Research Career Development Award (K12/NICHD)-HD49078. Dr. Pedersen has been partially supported by a National Science Foundation Faculty Early CAREER Development Award (#0092784).
Software Resources
  • NSPGate (from Duluth/Mayo): http://nspgate.sourceforge.net/
  • Ngram Statistics Package (from Duluth): http://ngram.sourceforge.net/
  • WSDGate (from Duluth/Mayo): http://wsdgate.sourceforge.net/
  • WEKA (from Waikato): http://www.cs.waikato.ac.nz/ml/weka/
  • GATE (from Sheffield): http://gate.ac.uk/

Amia06
