Use of SAS-Based Natural Language Processing to Identify Incident and Recurrent Malignancies
Justin A. Strauss, MA
Research Associate III, Kaiser Permanente Southern California
May 1, 2012 • 2012 HMORN Conference • Seattle, Washington
Co-Authors & Funding
• Chun R. Chao, PhD
• Marilyn L. Kwan, PhD
• Syed A. Ahmed, MD
• Joanne E. Schottinger, MD
• Virginia P. Quinn, PhD
Acknowledgements & Funding
• Mayra Martinez, Michelle McGuire, Melissa Preciado, Nirupa Ghai, and Jeff Slezak (KPSC); Lawrence Kushi (KPNC); Debra Ritzwoller (KPCO); Joan Warren (NCI); Jianyu Rao and Jiaoti Huang (UCLA)
• Funding was provided by KPSC Community Benefit and the Cancer Research Network
Malignancy Identification
• Malignancy identification is important for clinical and epidemiologic cancer research.
• Limited quality and availability of incident and recurrent malignancy data within health plans.
  • Delayed availability of incident malignancy data from cancer registries.
  • Few registries track cancer recurrences.
  • Manual chart abstraction slow and expensive.
  • Previous research has shown electronic diagnosis codes (e.g., ICD-9) to be unreliable.
Natural Language Processing
• Natural language processing (NLP) can be used to identify and extract information from electronic clinical text, including incident and recurrent malignancy data.
• Increasing opportunity for NLP with adoption of electronic clinical systems in patient care delivery.
• Despite its potential value in clinical and research settings, NLP usage has been relatively sparse. Contributing factors may include:
  • Technical complexity
  • Systems integration requirements
  • Habitual use of existing methods
SCENT Overview
• A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to identify incident and recurrent malignancies using text from pathology reports.
• SCENT is currently being implemented in two research studies at Kaiser Permanente Southern California (KPSC):
  • Intervention to improve medication adherence among breast cancer patients.
  • Differences in the prognosis of prostate cancer patients according to their genetic factors.
• Use of SAS programming minimizes implementation barriers and increases availability for multisite research.
Description of Methods
• SCENT identifies non-negated clinical concepts within pathology report text.
• Built using SAS Base (does not require Text Miner add-on).
  • Makes extensive use of SAS hash objects and regular expressions.
• Includes components for preprocessing, matching, negation and uncertainty detection, extracting diagnostic information (e.g., staging and Gleason score), and classifying report malignancy status.
• Flexibility to assign codes using a variety of coding systems.
  • Validation used a subset of SNOMED 3.x (~1000 concepts).
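The matching step can be illustrated with a minimal sketch. It is written in Python rather than SAS and is not SCENT's actual code: a Python dict stands in for a SAS hash object, and the rule that a concept matches when every one of its token patterns is found among a segment's tokens is an assumption. The example concept code and token patterns are borrowed from the process diagram slide; `CONCEPTS`, `tokenize`, and `match_concepts` are hypothetical names.

```python
import re

# Hypothetical concept dictionary: SNOMED-style code -> token patterns.
# A Python dict is a hash table, standing in here for a SAS hash object.
CONCEPTS = {
    "M-85033": [r"adenocarcinoma[ls]?", r"(intra)?duct(al)?", r"papillar(y|ies)"],
}

def tokenize(segment):
    """Lowercase a text segment and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", segment.lower())

def match_concepts(segment, concepts=CONCEPTS):
    """Return codes whose token patterns are ALL found among the segment's tokens.

    Assumed matching rule (not confirmed by the slides): token order is ignored.
    """
    tokens = tokenize(segment)
    return [
        code
        for code, patterns in concepts.items()
        if all(any(re.fullmatch(p, t) for t in tokens) for p in patterns)
    ]

print(match_concepts("moderately-differentiated ductal adenocarcinoma with papillary features."))
print(match_concepts("the tumor involves 0.6 cm of one core."))
```

Ignoring token order lets "ductal adenocarcinoma with papillary" match a concept described as "intraductal papillary adenocarcinoma", which is the behavior the diagram's example suggests.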
SCENT Process Diagram
[Figure: flow diagram of SCENT processing. Labels and examples recoverable from the diagram are listed below.]
• Inputs: Clinical Concepts (Excel) and Pathology Text (Research Database).
• Concept fields: Type (morphology, topography, or procedural); Code (SNOMED 3.X); Class (malignant, basaloid, benign, or N/A); Description (concept description).
• Text fields: Text (raw text segment from report); Line (sequential text segment identifier).
• Concept steps: Clean, Tokenize Words, Enhance with Regular Expressions, producing the Concept Dictionary (SAS).
• Text steps: Examine Segments (Preprocessed Text), Tokenize Words, Match Concepts (loop over tokens), Check Negation, Code Matches, Extract Data (Disease Extent, Tumor Staging, Gleason Score, Diagnostic Certainty).
• Example concept: Code M-85033, Description "intraductal papillary adenocarcinoma with invasion"; tokens [intraductal] [papillary] [adenocarcinoma] [with] [invasion]; regular expressions [adenocarcinoma[ls]?] [((intra)?duct(al)?)] [papillar(y|ies)].
• Example text segments: [moderately-differentiated ductal adenocarcinoma with papillary features.] [the tumor involves 0.6 cm of one core.]
• Example coded output: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features
• Example negation patterns: free (of|from); not? (support[a-z]*|identified); non(?!small|hodgkins)
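The negation check can be sketched from the three negation patterns visible in the diagram. This is a Python illustration, not SCENT's SAS code: treating a whole segment as negated when any pattern occurs anywhere in it is an assumption, the `\b` word boundaries are an addition, and `NEGATION_PATTERNS` and `is_negated` are hypothetical names. The diagram presumably shows only a subset of the patterns SCENT uses.

```python
import re

# The three negation patterns shown in the diagram; \b boundaries are added here.
NEGATION_PATTERNS = [
    r"\bfree (of|from)\b",
    r"\bnot? (support[a-z]*|identified)\b",
    r"\bnon(?!small|hodgkins)",  # "non..." prefix, except nonsmall / nonhodgkins
]
NEGATION_RE = re.compile("|".join(NEGATION_PATTERNS))

def is_negated(segment):
    """Return True if any negation pattern occurs in the segment (assumed rule)."""
    return NEGATION_RE.search(segment.lower()) is not None

for text in [
    "MARGINS FREE OF TUMOR",
    "FINDINGS DO NOT SUPPORT MALIGNANCY",
    "NONSMALL CELL CARCINOMA",
    "INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2",
]:
    print(text, "->", is_negated(text))
```

The negative lookahead in the third pattern keeps disease names such as nonsmall cell carcinoma and nonhodgkins lymphoma from being read as negations.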
Sample Report Coding
Preprocessed Text:
LEFT BREAST CORE BIOPSY TWO O CLOCK. <BR> INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2. <BR> NO CALCIFICATION IS IDENTIFIED. <BR> NO VASCULAR INVASION IS IDENTIFIED. <BR> HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
Coded Text:
<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK. <BR> INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2. <BR> NO CALCIFICATION IS IDENTIFIED. <BR> NO VASCULAR INVASION IS IDENTIFIED. <BR> HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
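For downstream use, the inline tags in the coded text can be read back into structured records. The sketch below is a reader-side Python illustration and not part of SCENT: the tag grammar is inferred only from the sample above, and `TAG_RE` and `extract_concepts` are hypothetical names.

```python
import re

# Tag layout inferred from the sample: <NLP SNM=code TYPE=x [CLASS=n]>text</NLP SNM=code>
TAG_RE = re.compile(
    r"<NLP SNM=(?P<code>\w+) TYPE=(?P<type>\w)(?: CLASS=(?P<cls>\w+))?>"
    r"(?P<text>.*?)</NLP SNM=(?P=code)>"
)

def extract_concepts(coded_text):
    """Return one dict per tagged concept; 'cls' is None when CLASS is absent."""
    return [m.groupdict() for m in TAG_RE.finditer(coded_text)]

sample = (
    "<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE "
    "<NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK. <BR> INVASIVE "
    "<NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2."
)
for concept in extract_concepts(sample):
    print(concept)
```

In the sample, only the morphology tag (TYPE=M) carries a CLASS attribute, so the pattern treats CLASS as optional.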
Validation Study
• To validate SCENT, trained chart abstractors reviewed electronic pathology reports.
• Random samples of breast (n=400) and prostate (n=400) cancer patients.
  • Patients diagnosed at KPSC between 2000 and 2007.
  • Reports included from six months post-diagnosis through end of 2008.
• In total, 206 breast and 186 prostate cancer patients contributed 490 and 425 eligible reports, respectively.
• SCENT classifications were compared with those of abstractors.
Classification Concordance
Columns are abstractor classifications, shown as % (N); rows are SCENT classifications.

SCENT Classification       Benign        Cancer Recurrence   Other Primary Cancer   Suspicious   Kappa
Breast Cancer (Total)      (436)         (32)                (18)                   (4)          0.96
  Benign                   99.8 (435)    -                   -                      25.0 (1)
  Cancer Recurrence        -             100.0 (32)          -                      -
  Other Primary Cancer     0.2 (1)       -                   100.0 (18)             50.0 (2)
  Suspicious               -             -                   -                      25.0 (1)
Prostate Cancer (Total)    (356)         (29)                (36)                   (4)          0.95
  Benign                   99.4 (354)    -                   5.6 (2)                -
  Cancer Recurrence        -             96.6 (28)           2.8 (1)                -
  Other Primary Cancer     0.6 (2)       3.4 (1)             91.7 (33)              -
  Suspicious               -             -                   -                      100.0 (4)

Note: incident contralateral breast malignancies were considered to be recurrences.
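The reported kappa values can be reproduced from the counts in the concordance table. The sketch below assumes the slide reports unweighted Cohen's kappa, which the slide does not state; `cohens_kappa`, `BREAST`, and `PROSTATE` are hypothetical names. Rows are SCENT classifications and columns are abstractor classifications, both in the order Benign, Cancer Recurrence, Other Primary Cancer, Suspicious.

```python
def cohens_kappa(matrix):
    """Unweighted Cohen's kappa for a square confusion matrix of counts."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1 - expected)

# Counts taken from the concordance table (490 breast and 425 prostate reports).
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]

print(round(cohens_kappa(BREAST), 2))    # 0.96
print(round(cohens_kappa(PROSTATE), 2))  # 0.95
```

Under this assumption, both results agree with the values shown on the slide.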
Conclusions
• Favorable results suggest SCENT can identify and extract information about primary and recurrent malignancies from pathology reports.
  • Rapid cancer case identification.
  • Improved measurement accuracy of common study endpoint.
• SCENT has the potential to expedite chart reviews by narrowing the search and highlighting relevant concepts.
• Generalized utility for extracting standardized disease scores and other clinical information.
• SCENT is proof of concept for SAS-based NLP that can be easily shared between institutions to support research.
Limitations & Next Steps
• SCENT has a number of limitations, including:
  • Unable to disambiguate and contextualize identified clinical concepts without part-of-speech (POS) tagging.
  • More susceptible to changes in text structure and increased linguistic variability than statistical NLP approaches.
  • General-purpose NLP (e.g., cTAKES) likely to perform better outside of pathology.
• Next steps include:
  • Release SCENT source code and requisite support files.
  • Optimize current functionality and assess feasibility of adding methods (e.g., POS tagging, n-grams, statistical classifiers).
  • Attempt to identify non-pathologically diagnosed malignancies using radiology reports and clinical progress notes.
  • Quantify cost savings associated with SCENT-assisted chart reviews.