Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS

May 1, 2012 • 2012 HMORN Conference • Seattle, Washington

Use of SAS-Based Natural Language Processing
to Identify Incident and Recurrent Malignancies
Justin A. Strauss, MA
Research Associate III
Kaiser Permanente Southern California

Co-Authors & Funding
• Chun R. Chao, PhD
• Marilyn L. Kwan, PhD
• Syed A. Ahmed, MD
• Joanne E. Schottinger, MD
• Virginia P. Quinn, PhD

Acknowledgements & Funding
• Mayra Martinez, Michelle McGuire, Melissa
Preciado, Nirupa Ghai, and Jeff Slezak
(KPSC); Lawrence Kushi (KPNC); Debra
Ritzwoller (KPCO); Joan Warren (NCI);
Jianyu Rao and Jiaoti Huang (UCLA)
• Funding was provided by KPSC
Community Benefit and the Cancer
Research Network

Malignancy Identification
• Malignancy identification is important for clinical
and epidemiologic cancer research.
• Limited quality and availability of incident and
recurrent malignancy data within health plans.
• Delayed availability of incident malignancy data from
cancer registries.
• Few registries track cancer recurrences.
• Manual chart abstraction is slow and expensive.
• Previous research has shown electronic diagnosis codes
(e.g., ICD-9) to be unreliable.

Natural Language Processing
• Natural language processing (NLP) can be used to identify
and extract information from electronic clinical text,
including incident and recurrent malignancy data.
• Increasing opportunity for NLP with adoption of electronic
clinical systems in patient care delivery.
• Despite its potential value in clinical and research settings,
NLP usage has been relatively sparse. Contributing factors
may include:
• Technical complexity
• Systems integration requirements
• Habitual use of existing methods

SCENT Overview
• A SAS-based coding, extraction, and nomenclature tool
(SCENT) was developed to identify incident and recurrent
malignancies using text from pathology reports.

• SCENT is currently being implemented in two research
studies at Kaiser Permanente Southern California (KPSC):
• Intervention to improve medication adherence among breast
cancer patients.
• Differences in the prognosis of prostate cancer patients
according to their genetic factors.

• Use of SAS programming minimizes implementation
barriers and increases availability for multisite research.

Description of Methods
• SCENT identifies non-negated clinical concepts within
pathology report text.

• Built using SAS Base (does not require Text Miner add-on).
• Makes extensive use of SAS hash objects and regular expressions.

• Includes components for preprocessing, matching, negation
and uncertainty detection, extracting diagnostic information
(e.g., staging and Gleason score), and classifying report
malignancy status.

• Flexibility to assign codes using a variety of coding systems.
• Validation used a subset of SNOMED 3.x (~1000 concepts).
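
A minimal sketch of the dictionary-matching idea, written in Python rather than SAS for illustration only. It assumes a Python dict can stand in for a SAS hash object, that regular expressions collapse spelling variants to canonical tokens, and that concepts match as ordered token sequences. The three codes are borrowed from the sample report slide; the names VARIANTS, CONCEPTS, preprocess, and match_concepts are hypothetical and not taken from SCENT.

```python
import re

# Regex "enhancement": collapse spelling variants to one canonical token.
# These rules are illustrative assumptions, not SCENT's actual patterns.
VARIANTS = [
    (re.compile(r"biops(y|ies)"), "biopsy"),
    (re.compile(r"carcinomas?"), "carcinoma"),
    (re.compile(r"breasts?"), "breast"),
]

# Dict keyed by canonical token tuples (stand-in for a SAS hash object).
# Values are (SNOMED code, concept type).
CONCEPTS = {
    ("left", "breast"): ("T04030", "T"),
    ("biopsy",): ("P1140", "P"),
    ("ductal", "carcinoma"): ("M85003", "M"),
}
MAX_LEN = max(len(key) for key in CONCEPTS)


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and canonicalize variants."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    cleaned = []
    for tok in tokens:
        for pattern, canonical in VARIANTS:
            if pattern.fullmatch(tok):
                tok = canonical
                break
        cleaned.append(tok)
    return cleaned


def match_concepts(tokens):
    """Scan left to right, preferring the longest dictionary match."""
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in CONCEPTS:
                code, concept_type = CONCEPTS[key]
                hits.append((i, i + n, code, concept_type))
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return hits
```

Keyed lookup keeps each match attempt constant-time, which is presumably part of the appeal of hash objects for a dictionary of roughly 1000 concepts.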

SCENT Process Diagram
[Flow diagram; recoverable content summarized below. Step order is inferred from the layout.]
• Input: Clinical Concepts (Excel)
• Type: Morphology, topology, or procedural
• Code: SNOMED 3.X
• Class: Malignant, basaloid, benign, or N/A
• Description: Concept description
• Input: Pathology Text (Research Database)
• Text: Raw text segment from report
• Line: Sequential text segment identifier
• Concept branch: Tokenize Words, Clean Words, Enhance with Regular Expressions, Concept Dictionary (SAS)
• Example concept: Code M-85033, Description: intraductal papillary adenocarcinoma with invasion
• Tokens: [intraductal] [papillary] [adenocarcinoma] [with] [invasion]
• Enhanced tokens: [((intra)?duct(al)?)] [papillar (y|ies)] [adenocarcinoma[ls]?]
• Text branch: Preprocessed Text, Tokenize Words, Loop Match Concepts/Tokens, Check Negation, Code Matches, Examine Segments, Extract Data
• Example text segments: [moderately-differentiated ductal adenocarcinoma with papillary features.] [the tumor involves 0.6 cm of one core.]
• Tokens: [moderately] [differentiated] [ductal] [adenocarcinoma] with [papillary] [features]
• Coded text: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features
• Negation patterns shown:
• free (of|from)
• not? (support[a-z]*|identified)
• non(?!small|hodgkins)
• Extracted data: Disease Extent, Tumor Staging, Gleason Score, Diagnostic Certainty
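
The three negation patterns shown in the diagram can be exercised directly, since Python's `re` module accepts the same Perl-style syntax. The sketch below is in Python rather than SAS. It assumes lowercased input and a simple segment-level check; the word-boundary anchors and the names NEGATION_CUES and is_negated are additions for illustration, and SCENT's actual negation scope is not described in the slides.

```python
import re

# Cue patterns transcribed from the slide; the \b anchors are an added assumption.
NEGATION_CUES = [
    re.compile(r"\bfree (of|from)\b"),
    re.compile(r"\bnot? (support[a-z]*|identified)\b"),
    # Negative lookahead keeps disease names such as "nonsmall cell"
    # and "nonhodgkins" from being read as negations.
    re.compile(r"\bnon(?!small|hodgkins)"),
]


def is_negated(segment):
    """Return True if any negation cue occurs anywhere in the text segment."""
    text = segment.lower()
    return any(cue.search(text) for cue in NEGATION_CUES)
```

With only these three cues, a sentence such as "NO CALCIFICATION IS IDENTIFIED" is not flagged, so the full tool presumably carries a longer cue list than the diagram shows.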

Sample Report Coding
Preprocessed Text
LEFT BREAST CORE BIOPSY TWO O CLOCK.
 
INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.
 
NO CALCIFICATION IS IDENTIFIED.
 
NO VASCULAR INVASION IS IDENTIFIED.
 
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.

Coded Text
<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK.
 
INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2.
 
NO CALCIFICATION IS IDENTIFIED.
 
NO VASCULAR INVASION IS IDENTIFIED.
 
HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
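
Downstream code could read the coded concepts back out of the tagged text. The Python sketch below assumes the tag layout is exactly as in the coded sample (SNM, TYPE, and an optional CLASS attribute, with the SNM value repeated in the closing tag); the names TAG and extract_concepts are hypothetical.

```python
import re

# Matches <NLP SNM=... TYPE=... [CLASS=...]>text</NLP SNM=...>.
# \s+ tolerates tags that wrap across lines in the coded text.
TAG = re.compile(
    r"<NLP\s+SNM=(?P<snm>\w+)\s+TYPE=(?P<type>\w+)(?:\s+CLASS=(?P<cls>\w+))?>"
    r"(?P<text>.*?)"
    r"</NLP\s+SNM=(?P=snm)>",
    re.DOTALL,
)


def extract_concepts(coded_text):
    """Return one dict per tagged concept, in order of appearance."""
    return [
        {
            "snm": m["snm"],
            "type": m["type"],
            "class": m["cls"],  # None when the tag carries no CLASS attribute
            "text": " ".join(m["text"].split()),
        }
        for m in TAG.finditer(coded_text)
    ]
```

Applied to the coded sample, this yields three concepts: LEFT BREAST (T04030), BIOPSY (P1140), and DUCTAL CARCINOMA (M85003, class 3).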

Validation Study
• To validate SCENT, trained chart abstractors reviewed
electronic pathology reports.
• Random samples of breast (n=400) and prostate
(n=400) cancer patients.
• Patients diagnosed at KPSC between 2000 and 2007.
• Reports included from six months post-diagnosis
through end of 2008.
• In total, 206 breast and 186 prostate cancer patients
contributed 490 and 425 eligible reports, respectively.
• SCENT classifications were compared with those of
abstractors.

Classification Concordance
                          Abstractor Classifications
                          Benign          Cancer          Other Primary   Suspicious
                                          Recurrence      Cancer
SCENT Classifications     %       N       %       N       %       N       %       N       Kappa
Breast Cancer (Total)             (436)           (32)            (18)            (4)     0.96
  Benign                  99.8    435     -       -       -       -       25.0    1
  Cancer Recurrence       -       -       100.0   32      -       -       -       -
  Other Primary Cancer    0.2     1       -       -       100.0   18      50.0    2
  Suspicious              -       -       -       -       -       -       25.0    1
Prostate Cancer (Total)           (356)           (29)            (36)            (4)     0.95
  Benign                  99.4    354     -       -       5.6     2       -       -
  Cancer Recurrence       -       -       96.6    28      2.8     1       -       -
  Other Primary Cancer    0.6     2       3.4     1       91.7    33      -       -
  Suspicious              -       -       -       -       -       -       100.0   4

Note: incident contralateral breast malignancies were considered to be recurrences.
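
The kappa values can be checked from the counts in the concordance table. The Python sketch below assumes the statistic is an unweighted Cohen's kappa over the four categories, which the slides do not state explicitly; cohen_kappa, BREAST, and PROSTATE are illustrative names.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square agreement table of counts."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


# Rows: SCENT classification; columns: abstractor classification.
# Category order: benign, cancer recurrence, other primary cancer, suspicious.
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]

print(round(cohen_kappa(BREAST), 2))    # 0.96
print(round(cohen_kappa(PROSTATE), 2))  # 0.95
```

Under that assumption the computed values agree with the 0.96 and 0.95 reported for breast and prostate cancer.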

SCENT Performance Metrics
                 Sensitivity*       Specificity*       PPV*               NPV*
Breast Cancer    1.00 (0.93-1.00)   0.99 (0.98-1.00)   0.94 (0.85-0.98)   1.00 (0.99-1.00)
Prostate Cancer  0.97 (0.89-0.99)   0.99 (0.98-1.00)   0.97 (0.89-0.99)   0.99 (0.98-1.00)

* Shown with Wilson's 95% confidence interval.
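
A worked sketch of how these metrics and Wilson score intervals can be derived from the concordance counts, in Python. The binary grouping is an assumption inferred from the numbers, not stated on the slides: cancer recurrence and other primary cancer count as positive, while benign and suspicious count as negative. The function names and z = 1.96 are likewise illustrative.

```python
from math import sqrt


def wilson(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def metrics(tp, fp, fn, tn):
    """Point estimate and Wilson interval for each metric."""
    pairs = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, wilson(k, n)) for name, (k, n) in pairs.items()}


# Counts collapsed from the concordance table under the assumed grouping.
breast = metrics(tp=50, fp=3, fn=0, tn=437)
prostate = metrics(tp=63, fp=2, fn=2, tn=358)

for name, (estimate, (low, high)) in breast.items():
    print(f"{name}: {estimate:.2f} ({low:.2f}-{high:.2f})")
```

Under this grouping the breast PPV comes out to 0.94 (0.85-0.98) and the prostate sensitivity to 0.97 (0.89-0.99), matching the table.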

Conclusions
• Favorable results suggest SCENT can identify and extract
information about primary and recurrent malignancies from
pathology reports.
• Rapid cancer case identification.
• Improved measurement accuracy of a common study endpoint.

• SCENT has the potential to expedite chart reviews by
narrowing the search and highlighting relevant concepts.
• Generalized utility for extracting standardized disease
scores and other clinical information.
• SCENT is proof of concept for SAS-based NLP that can be
easily shared between institutions to support research.

Limitations & Next Steps
• SCENT has a number of limitations, including:
• Unable to disambiguate and contextualize identified clinical concepts
without part-of-speech (POS) tagging.
• More susceptible to changes in text structure and increased linguistic
variability than statistical NLP approaches.
• General purpose NLP (e.g., cTAKES) likely to perform better outside of pathology.

• Next steps include:
• Release SCENT source code and requisite support files.
• Optimize current functionality and assess feasibility of adding methods
(e.g., POS tagging, n-grams, statistical classifiers).
• Attempt to identify non-pathologically diagnosed malignancies using
radiology reports and clinical progress notes.
• Quantify cost savings associated with SCENT-assisted chart reviews.
