This portfolio describes my data analysis skills in text mining with R, applied to a text dataset of medical publications. The goal is to identify keywords in each publication's abstract that distinguish clinical from non-clinical publications.
2. INTRODUCTION
The medical literature is enormous. PubMed, a database of medical publications maintained by the U.S. National
Library of Medicine, has indexed over 23 million medical publications. Further, the rate of medical publication has
increased over time, and now there are nearly 1 million new publications in the field each year, or more than one
per minute.
The large size and fast-changing nature of the medical literature have increased the need for reviews, which search
databases like PubMed for papers on a particular topic and then report results from the papers found. While such
reviews are often performed manually, with multiple people reviewing each search result, this is tedious and
time-consuming. In this problem, we will see how text analytics can be used to automate the process of information
retrieval.
The dataset consists of 1861 rows and 3 columns. The first and second columns hold the title and abstract,
respectively, while the third column (variable trial) indicates whether the paper is a clinical trial testing a drug
therapy for cancer. This trial label was obtained by two people reviewing each search result and accessing the
actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and
metastatic breast cancer.
3. Example of Clinical Research Paper
Title: Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breast
cancer: phase III randomized GeparTrio trial.
Abstract: BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with
unfavorable outcome. We compared the response of nonresponding patients who continued the same treatment with that of
patients who switched to a well-tolerated non-cross-resistant regimen. METHODS: Previously untreated breast cancer
patients received two 3-week cycles of docetaxel at 75 mg/m(2), doxorubicin at 50 mg/m(2), and cyclophosphamide at 500
mg/m(2) per day (TAC). Patients whose tumors did not decrease in size by at least 50% were randomly assigned to four
additional cycles of TAC or to four cycles of vinorelbine at 25 mg/m(2) and capecitabine at 2000 mg/m(2) (NX). The outcome
was sonographic response, defined as a reduction in the product of the two largest perpendicular diameters by at least 50%. A
difference of 10% or less in the sonographic response qualified as noninferiority of the NX treatment. Pathological complete
response was defined as no invasive or in situ residual tumor masses in the breast and lymph nodes. Toxic effects were
assessed. All statistical tests were two-sided. RESULTS: Of 2090 patients enrolled in the GeparTrio study, 622 (29.8%) who did
not respond to two initial cycles of TAC were randomly assigned to an additional four cycles of TAC (n = 321) or to four cycles
of NX (n = 301). Sonographic response rate was 50.5% for the TAC arm and 51.2% for the NX arm. The difference of 0.7% (95%
confidence interval = -7.1% to 8.5%) demonstrated noninferiority of NX (P = .008). Similar numbers of patients in both arms
received breast-conserving surgery (184 [57.3%] in the TAC arm vs 180 [59.8%] in the NX arm) and had a pathological
complete response (5.3% vs 6.0%). Fewer patients in the NX arm than in the TAC arm had hematologic toxic effects, mucositis,
infections, and nail changes, but more had hand-foot syndrome and sensory neuropathy. CONCLUSION: Pathological complete
responses to both regimens were marginal. Among patients who did not respond to the initial neoadjuvant TAC treatment,
similar efficacy but better tolerability was observed by switching to NX than continuing with TAC.
4. Example of Non-Clinical Research Paper
Title: Long-term endometrial effects in postmenopausal women with early breast cancer participating in the Intergroup
Exemestane Study (IES)--a randomised controlled trial of exemestane versus continued tamoxifen after 2-3 years tamoxifen.
Abstract: BACKGROUND: The antiestrogen tamoxifen may have partial estrogen-like effects on the postmenopausal uterus.
Aromatase inhibitors (AIs) are increasingly used after initial tamoxifen in the adjuvant treatment of postmenopausal early
breast cancer due to their mechanism of action: a potential benefit being a reduction of uterine abnormalities caused by
tamoxifen. PATIENTS AND METHODS: Sonographic uterine effects of the steroidal AI exemestane were studied in 219 women
participating in the Intergroup Exemestane Study: a large trial in postmenopausal women with estrogen receptor-positive (or
unknown) early breast cancer, disease free after 2-3 years of tamoxifen, randomly assigned to continue tamoxifen or switch to
exemestane to complete 5 years adjuvant treatment. The primary end point was the proportion of patients with abnormal (>
or =5 mm) endometrial thickness (ET) on transvaginal ultrasound 24 months after randomisation. RESULTS: The analysis
included 183 patients. Two years after randomisation, the proportion of patients with abnormal ET was significantly lower in
the exemestane compared with tamoxifen arm (36% versus 62%, respectively; P = 0.004). This difference emerged within 6
months of switching treatment (43.5% versus 65.2%, respectively; P = 0.01) and disappeared within 12 months of treatment
completion (30.8% versus 34.7%, respectively; P = 0.67). CONCLUSION: Switching from tamoxifen to exemestane significantly
reverses endometrial thickening associated with continued tamoxifen.
5. OBJECTIVES:
What are some unique keywords for categorizing clinical and non-clinical research papers?
METHODOLOGIES:
Data Import
Bag of Corpus and Cleaning
Bi-Word Creation
Separating into Clinical and Non-Clinical Words
Converting into Data Frame
Data Visualization
6. Step 1: Data Import
library(tm)  # corpus handling and text cleaning
setwd("C:/Data Science/Datasets")
data <- read.csv("clinical_trial.csv", stringsAsFactors = FALSE)
data$trial <- as.factor(data$trial)
Step 2: Bag of Corpus and Cleaning Corpus
# Pool all abstracts of each class into one long document per class
clinical_abstract <- paste(subset(data, trial == 1)$abstract, collapse = " ")
nonclinical_abstract <- paste(subset(data, trial == 0)$abstract, collapse = " ")
# Document 1 = clinical, document 2 = non-clinical
all_abstract <- c(clinical_abstract, nonclinical_abstract)
all_abstract_corpus <- VCorpus(VectorSource(all_abstract))
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(removePunctuation))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, content_transformer(removeNumbers))
  # Remove English stopwords plus common abstract section labels
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "purpose", "objective",
                                          "objectives", "aim", "aims", "unlabelled",
                                          "introduction", "context", "goals of work"))
  # Collapse whitespace last, because the removals above leave gaps behind
  corpus <- tm_map(corpus, content_transformer(stripWhitespace))
  return(corpus)
}
all_abstract_corp <- clean_corpus(all_abstract_corpus)
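To make the effect of the cleaning steps concrete, here is a minimal sketch in base R that mimics them on a single sentence. It is only an illustration: `clean_text` and `toy_stopwords` are hypothetical names, and the tiny stopword list stands in for `stopwords("en")` from tm.

```r
# Toy illustration of the cleaning steps using base R only.
# `toy_stopwords` is a small hypothetical stand-in for tm's stopwords("en").
toy_stopwords <- c("the", "of", "to", "was", "in", "and", "at", "purpose")

clean_text <- function(x, stopwords) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)                  # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)                  # drop numbers
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # drop whole-word stopwords
  x <- gsub("\\s+", " ", x)                         # collapse whitespace
  trimws(x)
}

cleaned <- clean_text("PURPOSE: Docetaxel at 75 mg/m(2) was given to 321 patients.",
                      toy_stopwords)
print(cleaned)  # "docetaxel mgm given patients"
```

Note how a dose unit such as mg/m(2) collapses to the token mgm once punctuation and digits are stripped, which is why tokens of that form can appear among the bi-words.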
Step 3: Bi-Word Creation
library(RWeka)  # provides NGramTokenizer and Weka_control
# Tokenize into two-word sequences (bi-words)
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
# Rows are bi-words; column 1 is the clinical document, column 2 the non-clinical one
all_abstract_matrix <- as.matrix(TermDocumentMatrix(all_abstract_corp,
                                                    control = list(tokenize = tokenizer)))
Step 4: Clinical and Non-clinical Separation
# Bi-words that occur only in clinical abstracts
clinical_words <- subset(all_abstract_matrix, all_abstract_matrix[,1] > 0 &
                           all_abstract_matrix[,2] == 0)
# Bi-words that occur only in non-clinical abstracts
nonclinical_words <- subset(all_abstract_matrix, all_abstract_matrix[,1] == 0 &
                              all_abstract_matrix[,2] > 0)
Step 5: Converting into Dataframe for Clinical and Non-Clinical Words
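The data-frame conversion itself is not shown in the slides. The following is a minimal sketch of Steps 3 to 5 on two toy documents, written in base R so that it runs without RWeka or Java. The helper names `bigrams` and `to_df`, the toy texts, and the column names `biword` and `freq` are all assumptions made for illustration, not the original code.

```r
# Two toy "documents", already cleaned: pooled clinical and pooled non-clinical text.
docs <- c(clinical    = "pcr rate improved pcr rate risk factors",
          nonclinical = "risk factors bone turnover bone turnover")

# Base-R stand-in for the RWeka bigram tokenizer: pair each word with its successor.
bigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Term-by-document count matrix (rows = bi-words, columns = documents).
tokens <- lapply(docs, bigrams)
terms  <- sort(unique(unlist(tokens)))
counts <- sapply(tokens, function(tok) as.vector(table(factor(tok, levels = terms))))
rownames(counts) <- terms

# Step 4: keep bi-words that occur in one class only.
clinical_only    <- counts[counts[, "clinical"] > 0 & counts[, "nonclinical"] == 0, ,
                           drop = FALSE]
nonclinical_only <- counts[counts[, "clinical"] == 0 & counts[, "nonclinical"] > 0, ,
                           drop = FALSE]

# Step 5 (assumed form): one data frame per class, sorted by decreasing frequency.
to_df <- function(m, column) {
  df <- data.frame(biword = rownames(m), freq = m[, column],
                   row.names = NULL, stringsAsFactors = FALSE)
  df[order(-df$freq), ]
}
clinical_df    <- to_df(clinical_only, "clinical")
nonclinical_df <- to_df(nonclinical_only, "nonclinical")

print(head(clinical_df, 10))
print(head(nonclinical_df, 10))
```

In this toy example "risk factors" appears in both documents, so it is excluded from both data frames; only class-exclusive bi-words survive the separation step.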
Step 6: Data Visualization (using ggplot2)
The plots show the 10 most frequent bi-words found only in clinical and only in non-clinical research papers.
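The plotting code is not included in the slides. One plausible way to draw such a chart, assuming a data frame with one row per bi-word, is a faceted horizontal bar chart. In the sketch below `toy_df` and its counts are invented purely for illustration and are not results from the dataset; it also assumes ggplot2 3.3.0 or later, where `geom_col` accepts a categorical y aesthetic.

```r
library(ggplot2)

# Toy stand-in for the Step 5 output; the counts are invented for illustration only.
toy_df <- data.frame(
  biword = c("pcr rate", "rate improved", "bone turnover", "factors bone"),
  freq   = c(2, 1, 2, 1),
  class  = c("clinical", "clinical", "non-clinical", "non-clinical"),
  stringsAsFactors = FALSE
)

# Horizontal bars ordered by frequency, one panel per class.
p <- ggplot(toy_df, aes(x = freq, y = reorder(biword, freq))) +
  geom_col() +
  facet_wrap(~ class, scales = "free_y") +
  labs(x = "Frequency", y = "Bi-word")

print(p)
```

With the real data, the input would be the top 10 rows of each class's data frame bound together with a column identifying the class.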
9. SUMMARY
Clinical research papers are mostly dominated by measurement-unit
bi-words such as progression months, mg-day, pcr rate, mgm qw,
ttp-months and toxicities.
Non-clinical research papers are mostly dominated by general medical
terminology (rather than measurement units), such as: breast
carcinomas, zoledronic acid, symptom distress, response
chemotherapy, risk factors, bone turnover, cancer survivors.