1. May 1, 2012 • 2012 HMORN Conference • Seattle, Washington
Use of SAS-Based Natural Language Processing
to Identify Incident and Recurrent Malignancies
Justin A. Strauss, MA
Research Associate III
Kaiser Permanente Southern California
2. Co-Authors
• Chun R. Chao, PhD
• Marilyn L. Kwan, PhD
• Syed A. Ahmed, MD
• Joanne E. Schottinger, MD
• Virginia P. Quinn, PhD
3. Acknowledgements & Funding
• Mayra Martinez, Michelle McGuire, Melissa
Preciado, Nirupa Ghai, and Jeff Slezak
(KPSC); Lawrence Kushi (KPNC); Debra
Ritzwoller (KPCO); Joan Warren (NCI);
Jianyu Rao and Jiaoti Huang (UCLA)
• Funding was provided by KPSC
Community Benefit and the Cancer
Research Network
4. Malignancy Identification
• Malignancy identification is important for clinical
and epidemiologic cancer research.
• Limited quality and availability of incident and
recurrent malignancy data within health plans.
• Delayed availability of incident malignancy data from
cancer registries.
• Few registries track cancer recurrences.
• Manual chart abstraction slow and expensive.
• Previous research has shown electronic diagnosis codes
(e.g., ICD-9) to be unreliable.
5. Natural Language Processing
• Natural language processing (NLP) can be used to identify
and extract information from electronic clinical text,
including incident and recurrent malignancy data.
• Increasing opportunity for NLP with adoption of electronic
clinical systems in patient care delivery.
• Despite its potential value in clinical and research settings,
NLP usage has been relatively sparse. Contributing factors
may include:
• Technical complexity
• Systems integration requirements
• Habitual use of existing methods
6. SCENT Overview
• A SAS-based coding, extraction, and nomenclature tool
(SCENT) was developed to identify incident and recurrent
malignancies using text from pathology reports.
• SCENT is currently being implemented in two research
studies at Kaiser Permanente Southern California (KPSC):
• Intervention to improve medication adherence among breast
cancer patients.
• Differences in the prognosis of prostate cancer patients
according to their genetic factors.
• Use of SAS programming minimizes implementation
barriers and increases availability for multisite research.
7. Description of Methods
• SCENT identifies non-negated clinical concepts within
pathology report text.
• Built using SAS Base (does not require Text Miner add-on).
• Makes extensive use of SAS hash objects and regular expressions.
• Includes components for preprocessing, matching, negation
and uncertainty detection, extracting diagnostic information
(e.g., staging and Gleason score), and classifying report
malignancy status.
• Flexibility to assign codes using a variety of coding systems.
• Validation used a subset of SNOMED 3.x (~1000 concepts).
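The matching step described above can be illustrated with a minimal sketch. It is written in Python rather than SAS and is not the SCENT source code: a plain dictionary stands in for a SAS hash object, and the rule that every word pattern must match some token regardless of order is an assumption. The concept code, word patterns, and report text are taken from the M-85033 example in the process diagram; the function and field names are hypothetical.

```python
import re

# Hypothetical concept dictionary keyed by code (a stand-in for a SAS hash object).
# The code and word patterns come from the process diagram; the layout is assumed.
CONCEPTS = {
    "M-85033": {
        "type": "M",
        "word_patterns": [
            r"(intra)?duct(al)?",
            r"papillar(y|ies)",
            r"adenocarcinoma[ls]?",
        ],
    },
}


def tokenize(segment):
    """Lowercase a text segment and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", segment.lower())


def match_concept(tokens, concept):
    """Return (first, last) token indexes covered by the concept, or None.

    Assumption: every word pattern must fully match at least one token,
    and word order is ignored.
    """
    hits = []
    for pattern in concept["word_patterns"]:
        index = next(
            (i for i, tok in enumerate(tokens) if re.fullmatch(pattern, tok)),
            None,
        )
        if index is None:
            return None
        hits.append(index)
    return min(hits), max(hits)


tokens = tokenize(
    "Moderately-differentiated ductal adenocarcinoma with papillary features."
)
span = match_concept(tokens, CONCEPTS["M-85033"])
matched_text = " ".join(tokens[span[0]:span[1] + 1])
print(matched_text)
```

Under these assumptions the matched span is "ductal adenocarcinoma with papillary", which agrees with the tagged span in the diagram's coded output.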
8. SCENT Process Diagram
[Process diagram. Recoverable content:]
• Inputs:
• Clinical concepts (Excel), loaded into a concept dictionary (SAS). Fields: Type (morphology, topography, or procedural), Code (SNOMED 3.x), Class (malignant, basaloid, benign, or N/A), Description (concept description).
• Pathology text (research database). Fields: Text (raw text segment from report), Line (sequential text segment identifier).
• Concept steps: tokenize words, clean, enhance.
• Example concept: Code M-85033, Description "intraductal papillary adenocarcinoma with invasion".
• Tokens: [intraductal] [papillary] [adenocarcinoma] [with] [invasion]
• Enhanced word patterns: [((intra)?duct(al)?)] [papillar(y|ies)] [adenocarcinoma[ls]?]
• Text steps: regular expressions (preprocessing), tokenize words, loop match concepts with tokens, code matches, check negation, examine segments, extract data.
• Raw segments: [moderately-differentiated ductal adenocarcinoma with papillary] [features.] [the tumor involves 0.6 cm of one core.]
• Preprocessed text: [moderately-differentiated ductal adenocarcinoma with papillary features.] [the tumor involves 0.6 cm of one core.]
• Tokens: [moderately] [differentiated] [ductal] [adenocarcinoma] [with] [papillary] [features]
• Coded match: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features
• Example negation patterns:
• free (of|from)
• not? (support[a-z]*|identified)
• non(?!small|hodgkins)
• Extracted data: disease extent, tumor staging, Gleason score, diagnostic certainty.
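As an illustration of the negation check, the three example patterns from the diagram can be combined into one regular expression. This is a Python sketch, not the SCENT implementation. Two things are assumptions: the leading word boundary, and the rule that a whole segment counts as negated when any pattern is found in it. The diagram does not state how SCENT scopes negation to a concept. The function name is hypothetical.

```python
import re

# Example negation patterns copied from the process diagram.
NEGATION_PATTERNS = [
    r"free (of|from)",
    r"not? (support[a-z]*|identified)",
    r"non(?!small|hodgkins)",
]

# Assumption: anchor each pattern at a word boundary so that, for example,
# "non" is only tested at the start of a word.
NEGATION_RE = re.compile(r"\b(?:" + "|".join(NEGATION_PATTERNS) + r")")


def is_negated(segment):
    """Return True if any negation pattern occurs in the segment.

    Assumption: negation is judged for the whole segment, which is a
    simplification of whatever scoping SCENT applies.
    """
    return NEGATION_RE.search(segment.lower()) is not None


for text in [
    "margins are free of tumor",
    "findings do not support malignancy",
    "nonsmall cell carcinoma",
    "invasive ductal carcinoma",
]:
    print(f"{text!r}: negated={is_negated(text)}")
```

The negative lookahead in the third pattern is what keeps terms such as "nonsmall" and "nonhodgkins" from being read as negations.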
9. Sample Report Coding
Preprocessed Text
LEFT BREAST CORE BIOPSY TWO O CLOCK.
<BR>
INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.
<BR>
NO CALCIFICATION IS IDENTIFIED.
<BR>
NO VASCULAR INVASION IS IDENTIFIED.
<BR>
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
Coded Text
<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK.
<BR>
INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2.
<BR>
NO CALCIFICATION IS IDENTIFIED.
<BR>
NO VASCULAR INVASION IS IDENTIFIED.
<BR>
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
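The coded text above can be reproduced with a small tagging sketch. It is written in Python, not SAS, and uses exact phrase lookup instead of SCENT's token matching, which is a deliberate simplification. The three concepts and their SNM, TYPE, and CLASS values are copied from the sample report; the dictionary layout and function names are hypothetical.

```python
import re

# Concepts copied from the sample coded text; the layout is an assumption.
CONCEPTS = {
    "LEFT BREAST": {"snm": "T04030", "type": "T", "cls": None},
    "BIOPSY": {"snm": "P1140", "type": "P", "cls": None},
    "DUCTAL CARCINOMA": {"snm": "M85003", "type": "M", "cls": 3},
}

# One alternation, longest phrase first, so the text is scanned in a single
# pass and inserted tags are never re-matched.
PHRASE_RE = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(CONCEPTS, key=len, reverse=True))
    + r")\b"
)


def _wrap(match):
    """Wrap one matched phrase in the NLP tag format used in the sample."""
    concept = CONCEPTS[match.group(1)]
    attrs = f"SNM={concept['snm']} TYPE={concept['type']}"
    if concept["cls"] is not None:
        attrs += f" CLASS={concept['cls']}"
    return f"<NLP {attrs}>{match.group(1)}</NLP SNM={concept['snm']}>"


def tag_segment(segment):
    """Return the segment with every dictionary phrase wrapped in NLP tags."""
    return PHRASE_RE.sub(_wrap, segment)


for line in [
    "LEFT BREAST CORE BIOPSY TWO O CLOCK.",
    "INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.",
    "NO CALCIFICATION IS IDENTIFIED.",
]:
    print(tag_segment(line))
```

Segments without a dictionary phrase, such as "NO CALCIFICATION IS IDENTIFIED.", pass through unchanged, as in the sample.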
10. Validation Study
• To validate SCENT, trained chart abstractors reviewed
electronic pathology reports.
• Random samples of breast (n=400) and prostate
(n=400) cancer patients.
• Patients diagnosed at KPSC between 2000 and 2007.
• Reports included from six months post-diagnosis
through end of 2008.
• In total, 206 breast and 186 prostate cancer patients
contributed 490 and 425 eligible reports, respectively.
• SCENT classifications were compared with those of
abstractors.
11. Classification Concordance
Rows: SCENT classifications. Columns: abstractor classifications. Cells show column % (N).

                             Benign        Cancer        Other Primary   Suspicious   Kappa
                                           Recurrence    Cancer
Breast Cancer (Total N)      (436)         (32)          (18)            (4)          0.96
  Benign                     99.8 (435)    -             -               25.0 (1)
  Cancer Recurrence          -             100.0 (32)    -               -
  Other Primary Cancer       0.2 (1)       -             100.0 (18)      50.0 (2)
  Suspicious                 -             -             -               25.0 (1)
Prostate Cancer (Total N)    (356)         (29)          (36)            (4)          0.95
  Benign                     99.4 (354)    -             5.6 (2)         -
  Cancer Recurrence          -             96.6 (28)     2.8 (1)         -
  Other Primary Cancer       0.6 (2)       3.4 (1)       91.7 (33)       -
  Suspicious                 -             -             -               100.0 (4)

Note: incident contralateral breast malignancies were considered to be recurrences.
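The kappa values can be checked from the counts in the concordance table. The sketch below assumes the reported statistic is unweighted Cohen's kappa and that a dash in the table means a count of zero; neither is stated on the slide. Rows are SCENT classes and columns are abstractor classes, both ordered benign, cancer recurrence, other primary cancer, suspicious.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa for a square table of counts."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1 - expected)


# Counts transcribed from the concordance table (dashes read as 0).
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]

print(f"breast kappa:   {cohens_kappa(BREAST):.2f}")
print(f"prostate kappa: {cohens_kappa(PROSTATE):.2f}")
```

Under these assumptions the results round to 0.96 for breast and 0.95 for prostate, matching the table.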
12. SCENT Performance Metrics
                  Sensitivity*       Specificity*       PPV*               NPV*
Breast Cancer     1.00 (0.93-1.00)   0.99 (0.98-1.00)   0.94 (0.85-0.98)   1.00 (0.99-1.00)
Prostate Cancer   0.97 (0.89-0.99)   0.99 (0.98-1.00)   0.97 (0.89-0.99)   0.99 (0.98-1.00)
* Shown with Wilson 95% confidence interval. PPV = positive predictive value; NPV = negative predictive value.
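These metrics can be recomputed from the concordance counts with the Wilson score interval. The slides do not say how the four classes were collapsed to a binary outcome. The sketch below assumes that "cancer recurrence" and "other primary cancer" count as malignant and that "benign" and "suspicious" count as non-malignant; this grouping is an inference, although it reproduces the reported values. Dashes in the concordance table are read as zero, and the function names are hypothetical.

```python
from math import sqrt

# Rows are SCENT classes, columns are abstractor classes, both ordered:
# benign, cancer recurrence, other primary cancer, suspicious.
BREAST = [[435, 0, 0, 1], [0, 32, 0, 0], [1, 0, 18, 2], [0, 0, 0, 1]]
PROSTATE = [[354, 0, 2, 0], [0, 28, 1, 0], [2, 1, 33, 0], [0, 0, 0, 4]]

# Assumed grouping: indexes 1 and 2 (recurrence, other primary) are malignant.
MALIGNANT = {1, 2}


def collapse(table):
    """Collapse the 4x4 table to (tp, fp, fn, tn), abstractor as reference."""
    tp = fp = fn = tn = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            scent_pos, truth_pos = i in MALIGNANT, j in MALIGNANT
            if scent_pos and truth_pos:
                tp += count
            elif scent_pos:
                fp += count
            elif truth_pos:
                fn += count
            else:
                tn += count
    return tp, fp, fn, tn


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def metrics(table):
    """Return {name: (estimate, lower, upper)} for the four metrics."""
    tp, fp, fn, tn = collapse(table)
    pairs = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, *wilson_interval(k, n)) for name, (k, n) in pairs.items()}


for label, table in [("Breast", BREAST), ("Prostate", PROSTATE)]:
    for name, (est, lo, hi) in metrics(table).items():
        print(f"{label} {name}: {est:.2f} ({lo:.2f}-{hi:.2f})")
```

With this grouping, breast sensitivity is 50/50 and breast PPV is 50/53, which give 1.00 (0.93-1.00) and 0.94 (0.85-0.98) as on the slide.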
13. Conclusions
• Favorable results suggest SCENT can identify and extract
information about primary and recurrent malignancies from
pathology reports.
• Rapid cancer case identification.
• Improved measurement accuracy of common study endpoint.
• SCENT has the potential to expedite chart reviews by
narrowing the search and highlighting relevant concepts.
• Generalized utility for extracting standardized disease
scores and other clinical information.
• SCENT is proof of concept for SAS-based NLP that can be
easily shared between institutions to support research.
14. Limitations & Next Steps
• SCENT has a number of limitations, including:
• Unable to disambiguate and contextualize identified clinical concepts
without part-of-speech (POS) tagging.
• More susceptible to changes in text structure and increased linguistic
variability than statistical NLP approaches.
• General purpose NLP (e.g., cTAKES) likely to perform better outside of pathology.
• Next steps include:
• Release SCENT source code and requisite support files.
• Optimize current functionality and assess feasibility of adding methods
(e.g., POS tagging, n-grams, statistical classifiers).
• Attempt to identify non-pathologically diagnosed malignancies using
radiology reports and clinical progress notes.
• Quantify cost savings associated with SCENT-assisted chart reviews.