Towards Enhancing Anthrax Vaccine Safety With Natural Language Processing
Herman Tolentino MD1,2, Michael Matters PhD MPH2,3, Wikke Walop PhD4,5, Barbara Law MD4,5,
Wesley Tong6, Deepak Sagaram MBBS7, Fang Liu MS8, Paul Fontelo MD MPH8,
Katrin Kohl MD PhD MPH9,10, Daniel Payne PhD MSPH1
1 Vaccine Analytic Unit, National Immunization Program, CDC, Atlanta, GA 30333
2 Public Health Informatics Fellowship Program, Office of Workforce and Career Development, CDC, Atlanta, GA 30333
3 Division for Heart Disease and Stroke Prevention, National Center for Chronic Disease Prevention and Health Promotion, CDC, Atlanta, GA 30333
4 Immunization & Respiratory Infections Division, Centre for Infectious Disease Prevention & Control, Public Health Agency of Canada, Ottawa, Ontario K1A 0K9
5 Brighton Collaboration
6 Honours Biology and Pharmacology Programme, McMaster University, Hamilton, Ontario L8S 4L8
7 The University of Texas Health Science Center at Houston, TX 77030
8 Office of High Performance Computing and Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
9 Immunization Safety Office, CDC, Atlanta, GA 30333
ABSTRACT
Detecting vaccine adverse events is an important public health
activity that contributes to patient safety. Large amounts of
clinical information are locked up in unstructured free text
components of clinical reports. Reports of adverse events
following immunization (AEFI) from surveillance systems
contain free text that can be analyzed using natural language
processing (NLP). Current advances in computer and
information technology allow storage and processing of large
amounts of information, including free text. A collaborative
workgroup among the Brighton Collaboration (BC), Public
Health Agency of Canada (PHAC), the National Library of
Medicine (NLM) and the Vaccine Analytic Unit (VAU) was
formed to investigate the use of NLP to (1) extract structured
information from free text components of AEFI reports, and
(2) automate information retrieval and classification in
surveillance systems. The outputs are applicable to processing
of free text for anthrax vaccine safety.
METHODS
§ Collaboration among the Brighton Collaboration (BC), Public
Health Agency of Canada (PHAC), National Library of
Medicine (NLM) and Vaccine Analytic Unit (VAU). See
Figure 1.
§ Creation of AEFI free text corpus from de-identified
adverse event reports from PHAC.
§ Development of natural language processing (NLP) and
machine learning (ML) algorithms to represent information in
AEFI reports using concepts from NLM’s Unified Medical
Language System (UMLS). Two important steps are needed:
(1) spell checking to reduce “noise” from misspelled words
and abbreviations; and (2) concept tagging to represent key
words in free text with UMLS concepts and map them to AEFI
controlled vocabularies (MedDRA, COSTART, WHO-ART).
§ Derivation of a semantic distance metric to measure similarity
between UMLS concepts and enhance concept tagging.
§ Application of the algorithms to correct spelling errors and
extract UMLS concepts from free text reports.
§ Validation of machine learning training with test data.
§ Adaptation of a clustering algorithm to classify documents
based on semantic distance concept groups.
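The poster does not state the form of the semantic distance metric. As an illustration only, the sketch below assumes a shortest-path distance over a toy is-a hierarchy and a simple symmetric distance between two documents' concept sets, which a clustering algorithm could then consume. The hierarchy, the concept names, and both function names are hypothetical and are not taken from the UMLS or from the authors' implementation.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parent); not UMLS content.
PARENT = {
    "fever": "systemic_reaction",
    "chills": "systemic_reaction",
    "injection_site_swelling": "local_reaction",
    "injection_site_pain": "local_reaction",
    "systemic_reaction": "adverse_event",
    "local_reaction": "adverse_event",
}

def neighbors(concept):
    """Return the concepts one is-a edge away (the parent and any children)."""
    linked = {child for child, parent in PARENT.items() if parent == concept}
    if concept in PARENT:
        linked.add(PARENT[concept])
    return linked

def semantic_distance(a, b):
    """Number of edges on the shortest path between two concepts (breadth-first search)."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors(node) - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return float("inf")  # the two concepts are not connected

def document_distance(doc_a, doc_b):
    """Symmetric average nearest-concept distance between two non-empty concept sets."""
    def directed(src, dst):
        return sum(min(semantic_distance(s, d) for d in dst) for s in src) / len(src)
    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2

print(semantic_distance("fever", "chills"))
print(document_distance({"fever", "chills"}, {"injection_site_pain"}))
```

Under this assumed metric, sibling concepts such as fever and chills are closer to each other than to local injection-site reactions, which is the property a concept-group clustering step would rely on.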
RESULTS
§ The NLP steps for spell checking are shown in Figure 1; those
for concept extraction are shown in Figure 2.
§ Performance measures for spell checking and concept
extraction are shown in Tables 1 and 2.
§ Proportions of key words mapped to adverse event controlled
vocabularies are shown in Table 3.
§ A screenshot of processed (concept-tagged) free text showing
UMLS concept tags appears in Figure 2.
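The text does not name the performance measures reported in Tables 1 and 2. Precision, recall, and F-measure are a common choice for evaluating spell correction and concept extraction, so the sketch below assumes them. The function name and the example annotations are hypothetical and do not reproduce any result from the poster.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of system outputs against a set of reference annotations."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical annotations: (token position, corrected word) pairs.
gold = {(3, "swelling"), (7, "fever"), (9, "injection")}
predicted = {(3, "swelling"), (7, "fever"), (12, "arm")}
print(precision_recall_f1(predicted, gold))
```

Computing the same measures separately on the training and test sets gives the training-versus-testing comparison that the table captions describe.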
CONCLUSIONS
§ The matching of terms in free text to concepts is an
essential component of human and machine reasoning.
When applied to vaccine safety using UMLS concepts, it
enables computable and unencumbered representation of
terms and eventual extraction of structured information
such as vaccine adverse events.
§ Two important steps are needed to carry this out: (1)
reduction of “noise” from misspelled words and
abbreviations, and (2) tagging of key terms from free text
with UMLS concepts. Both require the use of natural
language processing and machine learning techniques.
§ The UMLS provides adequate coverage for mapping
clinical terms to concepts. The use of a semantic distance
metric enhances the concept extraction process.
§ Working in the context of a collaboration (1) ensures that
contextual issues are addressed and appropriate
knowledge domain expertise is leveraged and efficiently
utilized, and (2) demonstrates that the value of
collaborative problem-solving in public health knows no
boundaries.
Figure 1. Spell checker process flow showing different steps.
Disambiguation involves selection of one correction term from a list of
potential candidates obtained from lexical dictionaries.
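The candidate-generation and disambiguation steps named in the caption can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes string similarity from Python's standard `difflib` module as the ranking signal, with corpus frequency as a tie-breaker. The lexicon, its frequencies, and the function names are hypothetical.

```python
import difflib

# Hypothetical lexicon with corpus frequencies; a real system would use lexical dictionaries.
LEXICON = {"swelling": 120, "fever": 300, "injection": 250, "infection": 180, "redness": 90}

def candidates(word, n=3, cutoff=0.7):
    """Propose correction candidates that are lexically close to the word."""
    return difflib.get_close_matches(word.lower(), list(LEXICON), n=n, cutoff=cutoff)

def disambiguate(word):
    """Select one correction: keep known words, otherwise pick the best candidate."""
    word = word.lower()
    if word in LEXICON:
        return word
    options = candidates(word)
    if not options:
        return word  # no plausible correction, so leave the token unchanged
    # Rank by string similarity first, then by corpus frequency.
    return max(
        options,
        key=lambda c: (difflib.SequenceMatcher(None, word, c).ratio(), LEXICON[c]),
    )

def correct(text):
    """Correct each whitespace-separated token; output is lowercased."""
    return " ".join(disambiguate(token) for token in text.split())

print(correct("Sweling and feverr at injecton site"))
```

Tokens with no candidate above the cutoff are passed through untouched, which keeps the checker from "correcting" legitimate words that are simply missing from the lexicon.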
Figure 2. Concept extraction process flow showing
snapshot of a concept-tagged AEFI report
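Concept tagging of the kind shown in Figure 2 can be approximated by a greedy longest-match lookup of phrases in a concept table. The sketch below is an assumption for illustration rather than the method used in the study. The phrase table is hypothetical, and its identifiers are placeholders, not real UMLS concept identifiers.

```python
# Hypothetical phrase-to-concept table; identifiers are placeholders, not real UMLS CUIs.
CONCEPTS = {
    ("injection", "site", "swelling"): "CONCEPT_INJ_SITE_SWELLING",
    ("swelling",): "CONCEPT_SWELLING",
    ("fever",): "CONCEPT_FEVER",
}
MAX_LEN = max(len(phrase) for phrase in CONCEPTS)

def tag_concepts(text):
    """Greedy longest-match tagging: return (phrase, concept_id) pairs in text order."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    tags, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at token i, then shorter ones.
        for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + size])
            if phrase in CONCEPTS:
                tags.append((" ".join(phrase), CONCEPTS[phrase]))
                i += size
                break
        else:
            i += 1  # no concept starts at this token
    return tags

print(tag_concepts("Fever and injection site swelling, then swelling of arm."))
```

Preferring the longest match means "injection site swelling" is tagged as one specific concept instead of being reduced to the more general "swelling".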
Table 1. Performance measurements for spell checker during
training and testing
Table 2. Performance measurements for concept extraction
during training and testing
Table 3. Proportions of adverse event controlled vocabulary
mappings from free text AEFI reports for training and test data sets
This research was made possible through a grant by Oak Ridge Institute for Science and Education (ORISE) to the Centers for Disease Control
and Prevention Public Health Informatics Fellowship Program (PHIFP). Special acknowledgments to (1) The Brighton Collaboration for
making global research connections possible; (2) the Public Health Agency of Canada for valuable source data inputs; (3) the National
Library of Medicine for sharing UMLS expertise; and (4) Herman’s mentors: Dan Payne and Mike McNeil.