SlideShare a Scribd company logo
1 of 26
Download to read offline
© 2013 ExcelR Solutions. All Rights Reserved
Text Mining
&
Clustering
© 2013 ExcelR Solutions. All Rights Reserved
Text Mining - Importance
• Avenues of textual unstructured data
− Call transcripts
− Email to customer service
− Social media outreach
− Speech transcripts
− Field agents, salespeople
− Interviews & surveys
Structured
20%
Unstructured
80%
© 2013 ExcelR Solutions. All Rights Reserved
Bag-of-Words
All the world’s a stage, and all the men and women merely players:
They have their exits and their entrances;
And one man in his time plays many parts…”
ENGLISH
Professor!!!
Statistician
World Stage Men Women Play Exit Entrance time
1 1 2 1 2 1 1 1
© 2013 ExcelR Solutions. All Rights Reserved
Terminology & Pre-processing
• Each row is called as a ‘Document’ & even an empty row is considered as a document
• Collection of all these documents is called as ‘Corpus’
• Quirks of languages
− Terms with typos (e.g., ‘musc’)
− Terms in lowercase, proper case & uppercase (e.g., usb, Usb, USB)
− Punctuations & special symbols (‘%’, ‘!’, ‘&’, etc.)
− Filler words, connectors, pronouns (‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.)
• Stemming – process of considering only stem words (e.g., jumping, jumped; stem-word
here is ‘jump’)
Let me show you “Amazon customer reviews”
© 2013 ExcelR Solutions. All Rights Reserved
DTM & TDM
• Let us understand 100 document corpus of Xbox
TF - Regular term counts
TFIDF - Discounts the TF by document frequency
DTM weighing
© 2013 ExcelR Solutions. All Rights Reserved
Corpus-Level Word Cloud
© 2013 ExcelR Solutions. All Rights Reserved
Positive Word Cloud
© 2013 ExcelR Solutions. All Rights Reserved
Negative Word Cloud
© 2013 ExcelR Solutions. All Rights Reserved
Clinical Trials Project
© 2013 ExcelR Solutions. All Rights Reserved
Clinical Trials – Text Mining
Stage 1
Stage 2
Stage 3
Stage 4
Stage 5 Stage 6
• Stage 1: Animals
• Stage 2: Humans - very few with that specific disease
• Stage 3: Humans - who have other diseases
• Stage 4: Humans - larger audience
• Stage 5: US FDA
• Stage 6: Adverse events
Stages
Stage 1 Stage 2 Stage 3
Stage 4 Stage 5 Stage 6
© 2013 ExcelR Solutions. All Rights Reserved
Clinical Trials – Project in brief
Business Objective: Increase the success rate of the clinical trials
Project Brief Description:
 Phase 1: Collected the data from open source forums such as “https://clinicaltrials.gov/”
 Phase 2: Data Cleansing on XML files by extracting relevant fields from the clinical trials
 Phase 3: Segregated the data into Structured & Unstructured data
 Phase 4: Performed Word Cloud & Sentiment Analysis on unstructured data to identify the
reasons for termination of clinical trials
Techniques used:
Term Frequency (TF), Term Frequency Inverse Document Frequency (TFIDF), Positive &
Negative Word cloud, Dendrogram, Semantic Network, k-Means clustering
© 2013 ExcelR Solutions. All Rights Reserved
• Key words standing out of the rest are Accrual, Enrollment, Slow, Safety, Efficacy, Sponsor,
Lack, Low etc.
• These words should be seen in the context to gain business value
• When we see this word cloud in conjunction with dendrogram, we notice that slow accrual,
slow enrollment, poor efficacy, sponsor funding seem to be the broad themes for
termination of clinical trials
Unigram Word Cloud & Dendrogram
© 2013 ExcelR Solutions. All Rights Reserved
Semantic network, Bigram
• Semantic network shows that the relationship between the words & the key themes
mentioned in previous slide are becoming relevant
• One key thing is safety concerns. At the first sight it sounds as if safety concerns were reason
for termination, but when we see it in context, more termination reasons say that there are
“No Safety Concerns”
• Bi-gram is used to see 2 words to extract business value & the key themes mentioned earlier
are more evident here
Bi-gram Word Cloud & Semantic Network
© 2013 ExcelR Solutions. All Rights Reserved
• Scree-plot or elbow plot shows that there is a clear bend at 2 clusters, hence we are
considering that there are 2 clusters (categories) that the data can be segregated into
Note: Analysis is done considering slight bend at 2nd cluster and considering steep bend at
4th cluster, however, it did not provide any meaningful insights
K-Means Clustering Scree Plot
© 2013 ExcelR Solutions. All Rights Reserved
• Word cloud is clearly highlighting that this cluster is speaking majorly about Accrual:
Term referring to the number of patients in a study or clinical trial
• Even the dendrogram clearly shows Accrual, Enrollment, Slow as a major cluster
Word Cloud & Dendrogram - First Cluster
© 2013 ExcelR Solutions. All Rights Reserved
• Few key highlights from this word cloud are early & premature termination
• Dendrogram mentions majority of things related to premature closure of clinical trials
Word Cloud & Dendrogram - Second Cluster
© 2013 ExcelR Solutions. All Rights Reserved
Web & Social Media Extraction
© 2013 ExcelR Solutions. All Rights Reserved
NLP - Agenda
01
02
03
04
05
LDA in Text Mining
Topic extraction
using LDA
Structured information
extraction
Sentiment extraction
in a narrative
Lexicons & Emotion
Mining
© 2013 ExcelR Solutions. All Rights Reserved
NLP
Data collection/
Information Retrieval
Feature extraction
Lexical analysis/
Entity analysis
Cleaning/
Normalization
Extraction of insight
© 2013 ExcelR Solutions. All Rights Reserved
Latent Dirichlet Allocation (LDA)
It assumes that each
document is a
mixture of a small
number of topics
Each document may
be viewed as
a mixture of various
topics
Each word’s presence
is attributable to one
of the document’s
topics
LDA can be viewed as
a Bayesian model,
where each item is
modeled as a result of
a mixture of
underlying set of
topics
LDA is a
generative model
(a model for
randomly
generating
observable data
values)
Observations
are words
collected into
documents
© 2013 ExcelR Solutions. All Rights Reserved
Latent Dirichlet allocation (LDA) Vs. Clustering
• Unsupervised learning algorithms
• Mixture model where a document can be
assigned to one or more topics
• Each topic is a culmination of multiple
documents
• Unsupervised learning algorithms
• Specify an optimal ‘k’ that allows us to extract
topics or segments from the data
• Does a raw partition of the data
• Resultant clusters are disjoint from each other
LDA Clustering (K-means)
• A popular example using term usage
• A man sees a boy with a telescope
• Who has the telescope?
• In this example a term’s usage leads to
confusion owing to its placement
• In the same way a sentence from a corpus
could infer a different meaning in conjunction
with another sentence
© 2013 ExcelR Solutions. All Rights Reserved
• Many sources of data contain large amount of artifacts that lend a lot of information
• Text data can be subjected to methods that can help mine structured information
• This is information retrieval using previously generated labeled data
Structured data extraction
Raw
Text
1
Parser2
Names
Entities3
© 2013 ExcelR Solutions. All Rights Reserved
• Lexicons serve as dictionaries for extracting sentiment from raw unlabeled data
• These are useful in estimating semantic orientation (polarity)
• They are applied to polarity prediction tasks and serve as a bag of words that
help assign a score/label to terms in text
Three Lexicons used in this session are:
– Bing (Developed by Professor Bing Liu)
– AFINN (Informatics and Mathematical Modelling, Technical University of Denmark)
– NRC (Dr. Saif M. Mohammad)
Lexicons
© 2013 ExcelR Solutions. All Rights Reserved
This methodology allows the extraction of the most negative to positive sentiment bearing
documents from text:
• negative <- s_v[which.min(afinn_s_v)]
• > negative
• [1] "I fully agree with you. This is the worst card ever and they are running a late charge fee scam here. This is what they do - on my first
statement, I made full payment. On the second statement, they charged me a late payment fee and interest. I wrote back to them to tell them
they made mistake but never heard back from them. I made full payment on the second statement which was around $26 and when the
third statement came, I was charged with another late payment fee and compounded interests. This time, I got on the phone and spoke to
some guy in India who starts off each statement with "So how do you want to make payment sir" After something like hearing that damn
statement for the 10th time, I got so pissed with trying to get an explanation on the late fees, I was transferred to a supervisor who then said
that my first payment got rejected. I then asked why was it rejected (for which they had no clue and said that I should check with my bank)
but got no where but what I was really pissed about was that the unpaid amount due in the first statement was never reflected on the second
statement like other typical credit cards from REPUTABLE banks. Had this been done, I would have known and paid in full with the
second payment! In the end, they said that they would waive the interests which was only a few dollars but gad to charge me the late fee of
$35 per month ($70 in total). Can you imagine if they run this "sweat shop" practice from India and suckered in 10,000 people with this
practice? That would have been $700,000 into their pockets without breaking a SWEAT! Worst still, they filed my 2 months delinquencies
with the credit unions and I lost 50-70 credit score points!! Bastards! ”
• > positive <- s_v[which.max(afinn_s_v)]
• > positive
• [1] ""Well, you got me beat. I have 11 currently. However, I do have several Chase cards (CSP, Marriott Rewards and the Freedom) and
to be honest I rarely use my Amazon.com Rewards due to better reward options elsewhere; Barclay's SallieMae is the best card for
Amazon.com purchases. 5% back on up to 750 in purchases each month. It also has 5% on gas/groceries on up to 250 on purchases, in
each category per month. I really do not like the small cap of 250 on groceries and use Amex BCE for that purpose, but its still an extra
benefit to have."
Emotion Mining
© 2013 ExcelR Solutions. All Rights Reserved
 This method allows for generating narrative time for the text and
generates how the positive and negative emotional valence has
been in the corpus
 The nrc framework allows to mine text that match with eight
emotions in the writing:
1. Trust
2. Anticipation
3. Joy
4. Sadness
5. Fear
6. Anger
7. Surprise
8. Disgust
 The same can be implemented for the different sources from
which content is generated for a client, and the outcomes can be
compared and evaluated for understating which source provides
what form of understanding
Arcs & Emotion
© 2013 ExcelR Solutions. All Rights Reserved
THANK YOU

More Related Content

What's hot

Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...
Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...
Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...Sri Ambati
 
Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...
Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...
Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...Sri Ambati
 
NLP applied to French legal decisions
NLP applied to French legal decisionsNLP applied to French legal decisions
NLP applied to French legal decisionsMichael BENESTY
 
A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet
A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet
A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet IJECEIAES
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Edureka!
 
Intelligent information extraction based on artificial neural network
Intelligent information extraction based on artificial neural networkIntelligent information extraction based on artificial neural network
Intelligent information extraction based on artificial neural networkijfcstjournal
 

What's hot (6)

Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...
Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...
Bharath Sudharsan, ArmadaHealth - NLP in Aid of Critical Health Decisions - H...
 
Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...
Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...
Keynote by Charles Elkan, Goldman Sachs - Machine Learning in Finance - The P...
 
NLP applied to French legal decisions
NLP applied to French legal decisionsNLP applied to French legal decisions
NLP applied to French legal decisions
 
A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet
A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet
A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
 
Intelligent information extraction based on artificial neural network
Intelligent information extraction based on artificial neural networkIntelligent information extraction based on artificial neural network
Intelligent information extraction based on artificial neural network
 

Similar to Text Mining and Clustering Analysis of Clinical Trial Data

People Analytics_Introduction
People Analytics_IntroductionPeople Analytics_Introduction
People Analytics_IntroductionEdith Soghomonyan
 
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insightsLaboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insightsCarla Marini
 
Seminar(Pattern Recognition)
Seminar(Pattern Recognition)Seminar(Pattern Recognition)
Seminar(Pattern Recognition)anurodhsinha
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Greg Makowski
 
eSource Stakeholders Group 18mar2016
eSource Stakeholders Group  18mar2016eSource Stakeholders Group  18mar2016
eSource Stakeholders Group 18mar2016Michael Ibara
 
Data Science - Experiments
Data Science - ExperimentsData Science - Experiments
Data Science - ExperimentsGaurav Marwaha
 
CodeLess Machine Learning
CodeLess Machine LearningCodeLess Machine Learning
CodeLess Machine LearningSharjeel Imtiaz
 
Quant Data Analysis
Quant Data AnalysisQuant Data Analysis
Quant Data AnalysisSaad Chahine
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Lukas Mandrake
 
Becoming Datacentric
Becoming DatacentricBecoming Datacentric
Becoming DatacentricTimothy Cook
 
Possible Essay Questions On Romeo And Juliet
Possible Essay Questions On Romeo And JulietPossible Essay Questions On Romeo And Juliet
Possible Essay Questions On Romeo And JulietJamie Jackson
 
Data Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityData Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityIkbal Ahmed
 
Understanding Users Through Ethnography and Modeling - STC Summit 2010
Understanding Users Through Ethnography and Modeling - STC Summit 2010Understanding Users Through Ethnography and Modeling - STC Summit 2010
Understanding Users Through Ethnography and Modeling - STC Summit 2010Jim Jarrett
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSubrata Saharia
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profilingShailja Khurana
 
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...Jon Stevens-Hall
 

Similar to Text Mining and Clustering Analysis of Clinical Trial Data (19)

People Analytics_Introduction
People Analytics_IntroductionPeople Analytics_Introduction
People Analytics_Introduction
 
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insightsLaboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
 
Seminar(Pattern Recognition)
Seminar(Pattern Recognition)Seminar(Pattern Recognition)
Seminar(Pattern Recognition)
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
 
eSource Stakeholders Group 18mar2016
eSource Stakeholders Group  18mar2016eSource Stakeholders Group  18mar2016
eSource Stakeholders Group 18mar2016
 
Data Science - Experiments
Data Science - ExperimentsData Science - Experiments
Data Science - Experiments
 
Chapter8.coding
Chapter8.codingChapter8.coding
Chapter8.coding
 
CodeLess Machine Learning
CodeLess Machine LearningCodeLess Machine Learning
CodeLess Machine Learning
 
Quant Data Analysis
Quant Data AnalysisQuant Data Analysis
Quant Data Analysis
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2Machine Learning Summary for Caltech2
Machine Learning Summary for Caltech2
 
Becoming Datacentric
Becoming DatacentricBecoming Datacentric
Becoming Datacentric
 
Possible Essay Questions On Romeo And Juliet
Possible Essay Questions On Romeo And JulietPossible Essay Questions On Romeo And Juliet
Possible Essay Questions On Romeo And Juliet
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
Data Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityData Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & Normality
 
Understanding Users Through Ethnography and Modeling - STC Summit 2010
Understanding Users Through Ethnography and Modeling - STC Summit 2010Understanding Users Through Ethnography and Modeling - STC Summit 2010
Understanding Users Through Ethnography and Modeling - STC Summit 2010
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...DevOps Enterprise Summit 2019 - How Swarming Enables EnterpriseSupport to wo...
DevOps Enterprise Summit 2019 - How Swarming Enables Enterprise Support to wo...
 

More from devipatnala1

data science course in delhi
data science course in delhidata science course in delhi
data science course in delhidevipatnala1
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhidevipatnala1
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 
best data science certification
best data science certificationbest data science certification
best data science certificationdevipatnala1
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabaddevipatnala1
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangaloredevipatnala1
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangaloredevipatnala1
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhidevipatnala1
 
data science courses in banglore
data science courses in bangloredata science courses in banglore
data science courses in bangloredevipatnala1
 
data science course
data science coursedata science course
data science coursedevipatnala1
 
data science course
data science coursedata science course
data science coursedevipatnala1
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangaloredevipatnala1
 
data science course
data science coursedata science course
data science coursedevipatnala1
 
data science online course
data science online coursedata science online course
data science online coursedevipatnala1
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabaddevipatnala1
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabaddevipatnala1
 
data science training in hyderabad
data science training in hyderabaddata science training in hyderabad
data science training in hyderabaddevipatnala1
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 
best data science certification
best data science certificationbest data science certification
best data science certificationdevipatnala1
 

More from devipatnala1 (20)

data science course in delhi
data science course in delhidata science course in delhi
data science course in delhi
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhi
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 
best data science certification
best data science certificationbest data science certification
best data science certification
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabad
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangalore
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangalore
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhi
 
data science courses in banglore
data science courses in bangloredata science courses in banglore
data science courses in banglore
 
data science course
data science coursedata science course
data science course
 
data science course
data science coursedata science course
data science course
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangalore
 
data science course
data science coursedata science course
data science course
 
data science online course
data science online coursedata science online course
data science online course
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabad
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabad
 
data science training in hyderabad
data science training in hyderabaddata science training in hyderabad
data science training in hyderabad
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 
best data science certification
best data science certificationbest data science certification
best data science certification
 

Recently uploaded

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 

Recently uploaded (20)

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 

Text Mining and Clustering Analysis of Clinical Trial Data

  • 1. © 2013 ExcelR Solutions. All Rights Reserved Text Mining & Clustering
  • 2. © 2013 ExcelR Solutions. All Rights Reserved Text Mining - Importance • Avenues of textual unstructured data − Call transcripts − Email to customer service − Social media outreach − Speech transcripts − Field agents, salespeople − Interviews & surveys Structured 20% Unstructured 80%
  • 3. © 2013 ExcelR Solutions. All Rights Reserved Bag-of-Words All the world’s a stage, and all the men and women merely players: They have their exits and their entrances; And one man in his time plays many parts…” ENGLISH Professor!!! Statistician World Stage Men Women Play Exit Entrance time 1 1 2 1 2 1 1 1
  • 4. © 2013 ExcelR Solutions. All Rights Reserved Terminology & Pre-processing • Each row is called as a ‘Document’ & even an empty row is considered as a document • Collection of all these documents is called as ‘Corpus’ • Quirks of languages − Terms with typos (e.g., ‘musc’) − Terms in lowercase, proper case & uppercase (e.g., usb, Usb, USB) − Punctuations & special symbols (‘%’, ‘!’, ‘&’, etc.) − Filler words, connectors, pronouns (‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.) • Stemming – process of considering only stem words (e.g., jumping, jumped; stem-word here is ‘jump’) Let me show you “Amazon customer reviews”
  • 5. © 2013 ExcelR Solutions. All Rights Reserved DTM & TDM • Let us understand 100 document corpus of Xbox TF - Regular term counts TFIDF - Discounts the TF by document frequency DTM weighing
  • 6. © 2013 ExcelR Solutions. All Rights Reserved Corpus-Level Word Cloud
  • 7. © 2013 ExcelR Solutions. All Rights Reserved Positive Word Cloud
  • 8. © 2013 ExcelR Solutions. All Rights Reserved Negative Word Cloud
  • 9. © 2013 ExcelR Solutions. All Rights Reserved Clinical Trials Project
  • 10. © 2013 ExcelR Solutions. All Rights Reserved Clinical Trials – Text Mining Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6 • Stage 1: Animals • Stage 2: Humans - very few with that specific disease • Stage 3: Humans - who have other diseases • Stage 4: Humans - larger audience • Stage 5: US FDA • Stage 6: Adverse events Stages Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6
  • 11. © 2013 ExcelR Solutions. All Rights Reserved Clinical Trials – Project in brief Business Objective: Increase the success rate of the clinical trials Project Brief Description:  Phase 1: Collected the data from open source forums such as “https://clinicaltrials.gov/”  Phase 2: Data Cleansing on XML files by extracting relevant fields from the clinical trials  Phase 3: Segregated the data into Structured & Unstructured data  Phase 4: Performed Word Cloud & Sentiment Analysis on unstructured data to identify the reasons for termination of clinical trials Techniques used: Term Frequency (TF), Term Frequency Inverse Document Frequency (TFIDF), Positive & Negative Word cloud, Dendrogram, Semantic Network, k-Means clustering
  • 12. © 2013 ExcelR Solutions. All Rights Reserved • Key words standing out of the rest are Accrual, Enrollment, Slow, Safety, Efficacy, Sponsor, Lack, Low etc. • These words should be seen in the context to gain business value • When we see this word cloud in conjunction with dendrogram, we notice that slow accrual, slow enrollment, poor efficacy, sponsor funding seem to be the broad themes for termination of clinical trials Unigram Word Cloud & Dendrogram
  • 13. © 2013 ExcelR Solutions. All Rights Reserved Semantic network, Bigram • Semantic network shows that the relationship between the words & the key themes mentioned in previous slide are becoming relevant • One key thing is safety concerns. At the first sight it sounds as if safety concerns were reason for termination, but when we see it in context, more termination reasons say that there are “No Safety Concerns” • Bi-gram is used to see 2 words to extract business value & the key themes mentioned earlier are more evident here Bi-gram Word Cloud & Semantic Network
  • 14. © 2013 ExcelR Solutions. All Rights Reserved • Scree-plot or elbow plot shows that there is a clear bend at 2 clusters, hence we are considering that there are 2 clusters (categories) that the data can be segregated into Note: Analysis is done considering slight bend at 2nd cluster and considering steep bend at 4th cluster, however, it did not provide any meaningful insights K-Means Clustering Scree Plot
  • 15. © 2013 ExcelR Solutions. All Rights Reserved • Word cloud is clearly highlighting that this cluster is speaking majorly about Accrual: Term referring to the number of patients in a study or clinical trial • Even the dendrogram clearly shows Accrual, Enrollment, Slow as a major cluster Word Cloud & Dendrogram - First Cluster
  • 16. © 2013 ExcelR Solutions. All Rights Reserved • Few key highlights from this word cloud are early & premature termination • Dendrogram mentions majority of things related to premature closure of clinical trials Word Cloud & Dendrogram - Second Cluster
  • 17. © 2013 ExcelR Solutions. All Rights Reserved Web & Social Media Extraction
  • 18. © 2013 ExcelR Solutions. All Rights Reserved NLP - Agenda 01 02 03 04 05 LDA in Text Mining Topic extraction using LDA Structured information extraction Sentiment extraction in a narrative Lexicons & Emotion Mining
  • 19. © 2013 ExcelR Solutions. All Rights Reserved NLP Data collection/ Information Retrieval Feature extraction Lexical analysis/ Entity analysis Cleaning/ Normalization Extraction of insight
  • 20. © 2013 ExcelR Solutions. All Rights Reserved Latent Dirichlet Allocation (LDA) It assumes that each document is a mixture of a small number of topics Each document may be viewed as a mixture of various topics Each word’s presence is attributable to one of the document’s topics LDA can be viewed as a Bayesian model, where each item is modeled as a result of a mixture of underlying set of topics LDA is a generative model (a model for randomly generating observable data values) Observations are words collected into documents
  • 21. © 2013 ExcelR Solutions. All Rights Reserved Latent Dirichlet allocation (LDA) Vs. Clustering • Unsupervised learning algorithms • Mixture model where a document can be assigned to one or more topics • Each topic is a culmination of multiple documents • Unsupervised learning algorithms • Specify an optimal ‘k’ that allows us to extract topics or segments from the data • Does a raw partition of the data • Resultant clusters are disjoint from each other LDA Clustering (K-means) • A popular example using term usage • A man sees a boy with a telescope • Who has the telescope? • In this example a term’s usage leads to confusion owing to its placement • In the same way a sentence from a corpus could infer a different meaning in conjunction with another sentence
  • 22. © 2013 ExcelR Solutions. All Rights Reserved • Many sources of data contain large amount of artifacts that lend a lot of information • Text data can be subjected to methods that can help mine structured information • This is information retrieval using previously generated labeled data Structured data extraction Raw Text 1 Parser2 Names Entities3
  • 23. © 2013 ExcelR Solutions. All Rights Reserved • Lexicons serve as dictionaries for extracting sentiment from raw unlabeled data • These are useful in estimating semantic orientation (polarity) • They are applied to polarity prediction tasks and serve as a bag of words that help assign a score/label to terms in text Three Lexicons used in this session are: – Bing (Developed by Professor Bing Liu) – AFINN (Informatics and Mathematical Modelling, Technical University of Denmark) – NRC (Dr. Saif M. Mohammad) Lexicons
  • 24. © 2013 ExcelR Solutions. All Rights Reserved This methodology allows the extraction of the most negative to positive sentiment bearing documents from text: • negative <- s_v[which.min(afinn_s_v)] • > negative • [1] "I fully agree with you. This is the worst card ever and they are running a late charge fee scam here. This is what they do - on my first statement, I made full payment. On the second statement, they charged me a late payment fee and interest. I wrote back to them to tell them they made mistake but never heard back from them. I made full payment on the second statement which was around $26 and when the third statement came, I was charged with another late payment fee and compounded interests. This time, I got on the phone and spoke to some guy in India who starts off each statement with "So how do you want to make payment sir" After something like hearing that damn statement for the 10th time, I got so pissed with trying to get an explanation on the late fees, I was transferred to a supervisor who then said that my first payment got rejected. I then asked why was it rejected (for which they had no clue and said that I should check with my bank) but got no where but what I was really pissed about was that the unpaid amount due in the first statement was never reflected on the second statement like other typical credit cards from REPUTABLE banks. Had this been done, I would have known and paid in full with the second payment! In the end, they said that they would waive the interests which was only a few dollars but gad to charge me the late fee of $35 per month ($70 in total). Can you imagine if they run this "sweat shop" practice from India and suckered in 10,000 people with this practice? That would have been $700,000 into their pockets without breaking a SWEAT! Worst still, they filed my 2 months delinquencies with the credit unions and I lost 50-70 credit score points!! Bastards! ” • > positive <- s_v[which.max(afinn_s_v)] • > positive • [1] ""Well, you got me beat. I have 11 currently. However, I do have several Chase cards (CSP, Marriott Rewards and the Freedom) and to be honest I rarely use my Amazon.com Rewards due to better reward options elsewhere; Barclay's SallieMae is the best card for Amazon.com purchases. 5% back on up to 750 in purchases each month. It also has 5% on gas/groceries on up to 250 on purchases, in each category per month. I really do not like the small cap of 250 on groceries and use Amex BCE for that purpose, but its still an extra benefit to have." Emotion Mining
  • 25. © 2013 ExcelR Solutions. All Rights Reserved  This method allows for generating narrative time for the text and generates how the positive and negative emotional valence has been in the corpus  The nrc framework allows to mine text that match with eight emotions in the writing: 1. Trust 2. Anticipation 3. Joy 4. Sadness 5. Fear 6. Anger 7. Surprise 8. Disgust  The same can be implemented for the different sources from which content is generated for a client, and the outcomes can be compared and evaluated for understating which source provides what form of understanding Arcs & Emotion
  • 26. © 2013 ExcelR Solutions. All Rights Reserved THANK YOU