© 2013 ExcelR Solutions. All Rights Reserved
Text Mining & Clustering
Text Mining - Importance
• Sources of unstructured text data
− Call transcripts
− Email to customer service
− Social media outreach
− Speech transcripts
− Field agents, salespeople
− Interviews & surveys
Share of data: structured 20%, unstructured 80%
Bag-of-Words
“All the world’s a stage, and all the men and women merely players:
They have their exits and their entrances;
And one man in his time plays many parts…”
English professor vs. statistician: the statistician's view is a table of term counts
Term:  World  Stage  Men  Women  Play  Exit  Entrance  Time
Count: 1      1      2    1      2     1     1         1
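The counts above can be reproduced in a few lines. This is a minimal Python sketch, not the deck's own code (its snippets are in R). The `STOP_WORDS` set and the `LEMMAS` map are hand-made assumptions for this one passage: words such as "merely" and "parts" are dropped only so that the result matches the eight columns on the slide.

```python
import re
from collections import Counter

passage = (
    "All the world's a stage, and all the men and women merely players: "
    "They have their exits and their entrances; "
    "And one man in his time plays many parts"
)

# Hypothetical, hand-made resources for this single passage (assumptions).
STOP_WORDS = {"all", "the", "a", "s", "and", "merely", "they", "have",
              "their", "one", "in", "his", "many", "parts"}
LEMMAS = {"man": "men", "players": "play", "plays": "play",
          "exits": "exit", "entrances": "entrance"}

def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, map to a base form, count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

counts = bag_of_words(passage)
for term in ["world", "stage", "men", "women", "play", "exit", "entrance", "time"]:
    print(term, counts[term])
```

Word order and grammar are discarded entirely; only the term counts survive, which is the statistician's view of the passage.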
Terminology & Pre-processing
• Each row is called a ‘document’; even an empty row counts as a document
• The collection of all these documents is called the ‘corpus’
• Quirks of languages
− Terms with typos (e.g., ‘musc’)
− Terms in lowercase, proper case & uppercase (e.g., usb, Usb, USB)
− Punctuation & special symbols (‘%’, ‘!’, ‘&’, etc.)
− Filler words, connectors, pronouns (‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.)
• Stemming: reducing each word to its stem (e.g., ‘jumping’ and ‘jumped’ both reduce to the stem ‘jump’)
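The pre-processing steps listed above can be sketched as a small pipeline. This is an illustrative Python sketch under stated assumptions: the stop-word list is a made-up subset, `crude_stem` is a toy suffix stripper standing in for a real stemmer, and typo correction is left out.

```python
import re

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"all", "for", "of", "my", "to", "the", "a", "is", "and"}
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Lowercase, drop punctuation and symbols, remove stop words, stem."""
    lowered = document.lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)  # punctuation never matches
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

# Three documents; the empty row still counts as a document.
corpus = ["My USB cable keeps failing!", "", "Jumping & jumped: 100% of the time"]
cleaned = [preprocess(doc) for doc in corpus]
print(cleaned)
```

The empty second row yields an empty token list but keeps its place in the corpus, so the document count is unchanged.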
Demo: Amazon customer reviews
DTM & TDM
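• Example: a corpus of 100 documents about the Xbox
• DTM (document-term matrix): one row per document, one column per term; the TDM (term-document matrix) is its transpose
DTM weighting
− TF: raw term counts
− TFIDF: discounts the TF by document frequency, so terms that occur in many documents weigh less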
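A small worked example may make the two weightings concrete. The Python sketch below builds a DTM from a made-up three-document corpus (not the Xbox corpus) and applies one common TFIDF variant, tf × log(N / df); other variants add smoothing or normalization, so treat the formula as an assumption.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (made-up data).
docs = [
    ["xbox", "controller", "great"],
    ["xbox", "controller", "broken"],
    ["xbox", "great", "great"],
]

vocab = sorted({term for doc in docs for term in doc})

# DTM with TF weighting: one row per document, one column per term.
dtm_tf = [[Counter(doc)[term] for term in vocab] for doc in docs]

# Document frequency: the number of documents each term occurs in.
df = [sum(1 for row in dtm_tf if row[j] > 0) for j in range(len(vocab))]

# TFIDF variant assumed here: tf * log(N / df).
n_docs = len(docs)
dtm_tfidf = [[tf * math.log(n_docs / df[j]) for j, tf in enumerate(row)]
             for row in dtm_tf]

# The TDM is the transpose of the DTM.
tdm_tf = [list(col) for col in zip(*dtm_tf)]

print(vocab)
for row in dtm_tfidf:
    print([round(w, 3) for w in row])
```

Because "xbox" occurs in every document, its TFIDF weight is zero everywhere, while "great" keeps a positive weight in the documents that use it.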
Corpus-Level Word Cloud
Positive Word Cloud
Negative Word Cloud
Clinical Trials Project
Clinical Trials – Text Mining
• Stage 1: Animals
• Stage 2: Humans - very few with that specific disease
• Stage 3: Humans - who have other diseases
• Stage 4: Humans - larger audience
• Stage 5: US FDA
• Stage 6: Adverse events
Clinical Trials – Project in brief
Business Objective: Increase the success rate of the clinical trials
Project Brief Description:
• Phase 1: Collected data from open sources such as https://clinicaltrials.gov/
• Phase 2: Cleansed the XML files by extracting the relevant fields from the clinical trial records
• Phase 3: Segregated the data into structured & unstructured data
• Phase 4: Applied word clouds & sentiment analysis to the unstructured data to identify the reasons for termination of clinical trials
Techniques used:
Term Frequency (TF), Term Frequency Inverse Document Frequency (TFIDF), Positive &
Negative Word cloud, Dendrogram, Semantic Network, k-Means clustering
Unigram Word Cloud & Dendrogram
• Keywords that stand out from the rest are Accrual, Enrollment, Slow, Safety, Efficacy, Sponsor, Lack, Low, etc.
• These words must be read in context to yield business value
• Reading the word cloud together with the dendrogram, slow accrual, slow enrollment, poor efficacy, and sponsor funding emerge as the broad themes behind the termination of clinical trials
Semantic network, Bigram
• The semantic network shows the relationships between the words and reinforces the key themes mentioned on the previous slide
• One key term is safety concerns. At first sight it looks as if safety concerns were a reason for termination, but in context most of these termination reasons say that there were “No Safety Concerns”
• Bigrams look at pairs of adjacent words to extract business value, and the key themes mentioned earlier are more evident here
Bi-gram Word Cloud & Semantic Network
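The "No Safety Concerns" point is easy to see once adjacent words are counted together. The Python sketch below uses three made-up termination reasons (not the project data) and a small hypothetical helper, `ngrams`.

```python
from collections import Counter

# Made-up termination reasons, for illustration only.
reasons = [
    "no safety concerns slow accrual",
    "terminated due to slow accrual",
    "no safety concerns sponsor decision",
]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = Counter(tok for r in reasons for tok in r.split())
bigrams = Counter(bg for r in reasons for bg in ngrams(r.split(), 2))

print(unigrams["safety"])            # how often the single word occurs
print(bigrams[("no", "safety")])     # how often it is preceded by "no"
print(bigrams.most_common(3))
```

In this toy data every occurrence of "safety" is preceded by "no", which the unigram counts alone cannot show.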
K-Means Clustering Scree Plot
• The scree plot (elbow plot) shows a clear bend at 2 clusters, so we take the data to fall into 2 clusters (categories)
Note: The analysis was done considering the slight bend at the 2nd cluster and the steep bend at the 4th cluster; however, it did not provide any meaningful insights
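The values behind a scree plot are the total within-cluster sums of squares for each candidate k. The Python sketch below computes them with a plain Lloyd's k-means on made-up 2-D points; the evenly spaced initialization and the data are assumptions for illustration, and in a text-mining setting each row of a weighted DTM would presumably play the role of a point.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with a deterministic, evenly spaced start."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_point(cl) if cl else centroids[c]
                     for c, cl in enumerate(clusters)]
    return centroids, clusters

def within_ss(centroids, clusters):
    """Total within-cluster sum of squared distances."""
    return sum(squared_distance(p, c)
               for c, cl in zip(centroids, clusters) for p in cl)

# Two well-separated made-up groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]

wss = {k: within_ss(*kmeans(points, k)) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 3))
```

The sum of squares drops sharply from k = 1 to k = 2 and barely changes afterwards; that bend is what the elbow reading looks for.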
Word Cloud & Dendrogram - First Cluster
• The word cloud shows that this cluster is mainly about accrual, a term for the number of patients in a study or clinical trial
• The dendrogram also shows Accrual, Enrollment, and Slow as a major cluster
Word Cloud & Dendrogram - Second Cluster
• The key highlights of this word cloud are early & premature termination
• Most of the terms in the dendrogram relate to the premature closure of clinical trials
Web & Social Media Extraction
NLP - Agenda
01 LDA in Text Mining
02 Topic extraction using LDA
03 Structured information extraction
04 Sentiment extraction in a narrative
05 Lexicons & Emotion Mining
NLP
• Data collection / Information retrieval
• Feature extraction
• Lexical analysis / Entity analysis
• Cleaning / Normalization
• Extraction of insight
Latent Dirichlet Allocation (LDA)
• It assumes that each document is a mixture of a small number of topics
• Each document may be viewed as a mixture of various topics
• Each word’s presence is attributable to one of the document’s topics
• LDA can be viewed as a Bayesian model, where each item is modeled as a mixture over an underlying set of topics
• LDA is a generative model (a model for randomly generating observable data values)
• Observations are words collected into documents
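The generative story above can be written out directly. The Python sketch below only generates a document from randomly drawn topics; it does not fit LDA to data. The vocabulary, the number of topics, and the Dirichlet parameters are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)

def sample_dirichlet(alpha):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probabilities):
    """Pick an index with the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

vocab = ["accrual", "enrollment", "slow", "safety", "efficacy", "sponsor"]
n_topics = 2

# Each topic is a probability distribution over the vocabulary.
topics = [sample_dirichlet([0.5] * len(vocab)) for _ in range(n_topics)]

def generate_document(n_words, alpha=0.5):
    # 1. Draw the document's own mixture of topics.
    theta = sample_dirichlet([alpha] * n_topics)
    words, assignments = [], []
    for _ in range(n_words):
        z = sample_index(theta)       # 2. pick one topic for this word
        w = sample_index(topics[z])   # 3. pick the word from that topic
        assignments.append(z)
        words.append(vocab[w])
    return theta, assignments, words

theta, assignments, words = generate_document(8)
print([round(t, 3) for t in theta])
print(list(zip(words, assignments)))
```

Fitting LDA runs this story in reverse: given only the words, it infers plausible topic mixtures and topic-word distributions.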
Latent Dirichlet Allocation (LDA) vs. Clustering
LDA
• Unsupervised learning algorithm
• Mixture model in which a document can be assigned to one or more topics
• Each topic is built up from multiple documents
Clustering (K-means)
• Unsupervised learning algorithm
• Specify an optimal ‘k’ that allows us to extract topics or segments from the data
• Makes a raw partition of the data
• Resultant clusters are disjoint from each other
A popular example of term usage
• “A man sees a boy with a telescope”
• Who has the telescope?
• In this example, the placement of a term makes its usage ambiguous
• In the same way, a sentence from a corpus can take on a different meaning in conjunction with another sentence
Structured data extraction
• Many data sources contain a large number of artifacts that carry a lot of information
• Text data can be subjected to methods that help mine structured information
• This is information retrieval using previously generated labeled data
Pipeline: 1 Raw text → 2 Parser → 3 Names, entities
Lexicons
• Lexicons serve as dictionaries for extracting sentiment from raw, unlabeled data
• They are useful for estimating semantic orientation (polarity)
• They are applied to polarity prediction tasks and serve as a bag of words that helps assign a score or label to terms in the text
The three lexicons used in this session are:
− Bing (developed by Professor Bing Liu)
− AFINN (Informatics and Mathematical Modelling, Technical University of Denmark)
− NRC (Dr. Saif M. Mohammad)
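Lexicon-based scoring amounts to a dictionary lookup followed by a sum. The Python sketch below uses a toy lexicon with AFINN-style integer scores; the words, the values, and the three documents are made up for illustration and are not taken from Bing, AFINN, or NRC.

```python
import re

# Toy lexicon with AFINN-style integer scores (made-up values).
TOY_LEXICON = {"worst": -3, "scam": -2, "late": -1,
               "best": 3, "rewards": 2, "benefit": 2}

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the tokens; unknown words score 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(t, 0) for t in tokens)

documents = [
    "This is the worst card ever, a late fee scam.",
    "The card arrived on Tuesday.",
    "Best card for purchases, great rewards and an extra benefit.",
]

scores = [sentiment_score(d) for d in documents]

# Pick out the extremes, in the spirit of R's which.min / which.max.
most_negative = documents[scores.index(min(scores))]
most_positive = documents[scores.index(max(scores))]

print(scores)
print(most_negative)
print(most_positive)
```

Words missing from the lexicon (here "great") contribute nothing, so the coverage of the chosen lexicon directly shapes the scores.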
Emotion Mining
This methodology extracts the most negative and the most positive sentiment-bearing documents from the text. Here s_v holds the documents and afinn_s_v their AFINN sentiment scores:
> negative <- s_v[which.min(afinn_s_v)]
> negative
[1] "I fully agree with you. This is the worst card ever and they are running a late charge fee scam here. This is what they do - on my first
statement, I made full payment. On the second statement, they charged me a late payment fee and interest. I wrote back to them to tell them
they made mistake but never heard back from them. I made full payment on the second statement which was around $26 and when the
third statement came, I was charged with another late payment fee and compounded interests. This time, I got on the phone and spoke to
some guy in India who starts off each statement with "So how do you want to make payment sir" After something like hearing that damn
statement for the 10th time, I got so pissed with trying to get an explanation on the late fees, I was transferred to a supervisor who then said
that my first payment got rejected. I then asked why was it rejected (for which they had no clue and said that I should check with my bank)
but got no where but what I was really pissed about was that the unpaid amount due in the first statement was never reflected on the second
statement like other typical credit cards from REPUTABLE banks. Had this been done, I would have known and paid in full with the
second payment! In the end, they said that they would waive the interests which was only a few dollars but gad to charge me the late fee of
$35 per month ($70 in total). Can you imagine if they run this "sweat shop" practice from India and suckered in 10,000 people with this
practice? That would have been $700,000 into their pockets without breaking a SWEAT! Worst still, they filed my 2 months delinquencies
with the credit unions and I lost 50-70 credit score points!! Bastards! "
> positive <- s_v[which.max(afinn_s_v)]
> positive
[1] "Well, you got me beat. I have 11 currently. However, I do have several Chase cards (CSP, Marriott Rewards and the Freedom) and
to be honest I rarely use my Amazon.com Rewards due to better reward options elsewhere; Barclay's SallieMae is the best card for
Amazon.com purchases. 5% back on up to 750 in purchases each month. It also has 5% on gas/groceries on up to 250 on purchases, in
each category per month. I really do not like the small cap of 250 on groceries and use Amex BCE for that purpose, but its still an extra
benefit to have."
Arcs & Emotion
• This method builds a narrative timeline for the text and shows how the positive and negative emotional valence has moved across the corpus
• The NRC framework mines text that matches eight emotions in the writing:
1. Trust
2. Anticipation
3. Joy
4. Sadness
5. Fear
6. Anger
7. Surprise
8. Disgust
• The same can be done for each source that generates content for a client, and the outcomes can be compared and evaluated to understand which source provides what form of understanding
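Both ideas on this slide, counting emotion words and tracking valence in narrative order, can be sketched in a few lines. The Python example below uses a made-up word-to-emotion map and a made-up valence map; neither is the real NRC lexicon, and the four sentences are invented.

```python
import re
from collections import Counter

EMOTIONS = ["trust", "anticipation", "joy", "sadness",
            "fear", "anger", "surprise", "disgust"]

# Hypothetical lexicons, for illustration only (not the NRC data).
TOY_EMOTION_LEXICON = {
    "worried": ["fear", "anticipation"],
    "scam": ["anger", "disgust"],
    "lost": ["sadness"],
    "refund": ["joy", "trust"],
    "happy": ["joy"],
}
TOY_VALENCE = {"worried": -1, "scam": -1, "lost": -1, "refund": 1, "happy": 1}

# Sentences in narrative order.
sentences = [
    "I was worried about the late fee.",
    "It felt like a scam and I lost points.",
    "Then the bank sent a refund.",
    "I am happy now.",
]

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

# Count how many words match each of the eight emotions.
emotion_counts = Counter({e: 0 for e in EMOTIONS})
for s in sentences:
    for t in tokens(s):
        emotion_counts.update(TOY_EMOTION_LEXICON.get(t, []))

# The arc is the valence of each sentence, in reading order.
arc = [sum(TOY_VALENCE.get(t, 0) for t in tokens(s)) for s in sentences]

print(dict(emotion_counts))
print(arc)
```

Running the same two computations per content source would give emotion profiles and arcs that can be compared side by side.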
THANK YOU

 
8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR
 
Islamabad Escorts | Call 03274100048 | Escort Service in Islamabad
Islamabad Escorts | Call 03274100048 | Escort Service in IslamabadIslamabad Escorts | Call 03274100048 | Escort Service in Islamabad
Islamabad Escorts | Call 03274100048 | Escort Service in Islamabad
 

data science certification

  • 1. Text Mining & Clustering
  • 2. Text Mining - Importance
    • Avenues of textual unstructured data
      − Call transcripts
      − Email to customer service
      − Social media outreach
      − Speech transcripts
      − Field agents, salespeople
      − Interviews & surveys
    • Share of data: Structured 20%, Unstructured 80%
  • 3. Bag-of-Words
    “All the world’s a stage, and all the men and women merely players:
    They have their exits and their entrances;
    And one man in his time plays many parts…”
    (Figure labels: ENGLISH Professor!!!, Statistician)
    World  Stage  Men  Women  Play  Exit  Entrance  Time
    1      1      2    1      2     1     1         1
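The counts in the bag-of-words table can be reproduced with a short Python sketch. The stop-word list and the word-to-stem map below are hypothetical and hand-picked only so that this one quote yields the slide's eight columns; the slide does not say which lists were actually used.

```python
import re
from collections import Counter

# Quote from the slide (a straight apostrophe is used for simplicity).
TEXT = (
    "All the world's a stage, and all the men and women merely players: "
    "They have their exits and their entrances; "
    "And one man in his time plays many parts"
)

# Hypothetical stop words, chosen to match the terms the slide leaves out.
STOP_WORDS = {"all", "the", "a", "s", "and", "they", "have", "their", "one",
              "in", "his", "merely", "many", "parts"}

# Hypothetical normalisation map standing in for a real stemmer/lemmatiser.
STEMS = {"man": "men", "players": "play", "plays": "play",
         "exits": "exit", "entrances": "entrance"}

def bag_of_words(text):
    """Lowercase, tokenise, drop stop words, normalise, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [STEMS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]
    return Counter(kept)

counts = bag_of_words(TEXT)
for term, n in counts.items():
    print(term, n)
```

Word order is discarded entirely: only the term counts survive, which is what makes the representation a "bag".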
  • 4. Terminology & Pre-processing
    • Each row is called a ‘Document’; even an empty row counts as a document
    • The collection of all these documents is called a ‘Corpus’
    • Quirks of languages
      − Terms with typos (e.g., ‘musc’)
      − Terms in lowercase, proper case & uppercase (e.g., usb, Usb, USB)
      − Punctuation & special symbols (‘%’, ‘!’, ‘&’, etc.)
      − Filler words, connectors, pronouns (‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.)
    • Stemming – reducing words to their stem (e.g., ‘jumping’ and ‘jumped’ both reduce to the stem ‘jump’)
    Let me show you “Amazon customer reviews”
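The pre-processing steps listed on this slide can be sketched in Python as follows. The stop-word set contains only the slide's own examples, and `naive_stem` is a crude, hypothetical suffix stripper used for illustration; it is not a real stemming algorithm, and typo correction (e.g., ‘musc’) is not attempted.

```python
import re

# Only the filler words named on the slide; real stop-word lists are much longer.
STOP_WORDS = {"all", "for", "of", "my", "to"}

def naive_stem(word):
    """Strip a trailing 'ing' or 'ed' if at least three letters remain."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Lowercase, drop punctuation and symbols, remove stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [naive_stem(tok) for tok in tokens if tok not in STOP_WORDS]

# Each string is one document; the empty string is still a document.
corpus = ["Usb, USB & usb!", "", "Jumping, jumped for 100% of my usb"]
cleaned = [preprocess(doc) for doc in corpus]
print(cleaned)
```

After cleaning, the three case variants of ‘usb’ collapse into a single term and the empty row is kept as an empty document rather than being dropped.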
  • 5. DTM & TDM (Document-Term Matrix & Term-Document Matrix)
    • Let us look at a 100-document corpus about Xbox
    • DTM weighting
      − TF: regular term counts
      − TFIDF: discounts the TF by document frequency
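A small Python sketch may make the two weightings concrete. The three tokenised documents are invented for illustration (they are not the slide's 100-document Xbox corpus), and `tf * log(N / df)` is only one common TF-IDF variant; the slide does not state which formula was used.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned documents.
docs = [
    ["xbox", "controller", "great"],
    ["xbox", "console", "great", "great"],
    ["xbox", "controller", "broken"],
]

vocab = sorted({term for doc in docs for term in doc})

# DTM with TF weighting: one row per document, one column per term.
dtm_tf = [[Counter(doc)[term] for term in vocab] for doc in docs]

# Document frequency: the number of documents that contain each term.
df = {term: sum(term in doc for doc in docs) for term in vocab}

# DTM with TF-IDF weighting (assumed variant: tf * log(N / df)).
n_docs = len(docs)
dtm_tfidf = [
    [tf * math.log(n_docs / df[term]) for tf, term in zip(row, vocab)]
    for row in dtm_tf
]

# The TDM is simply the transpose of the DTM.
tdm_tf = [list(column) for column in zip(*dtm_tf)]

print(vocab)
for row in dtm_tfidf:
    print([round(value, 3) for value in row])
```

Because ‘xbox’ occurs in every document, its TF-IDF weight is zero in this variant, which is the sense in which TF-IDF discounts terms that do not distinguish one document from another.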
  • 6. Corpus-Level Word Cloud
  • 7. Positive Word Cloud
  • 8. Negative Word Cloud
  • 9. Clinical Trials Project
  • 10. Clinical Trials – Text Mining
    Stages:
    • Stage 1: Animals
    • Stage 2: Humans - very few with that specific disease
    • Stage 3: Humans - who have other diseases
    • Stage 4: Humans - larger audience
    • Stage 5: US FDA
    • Stage 6: Adverse events
  • 11. Clinical Trials – Project in brief
    Business Objective: Increase the success rate of clinical trials
    Project Brief Description:
    • Phase 1: Collected the data from open source forums such as “https://clinicaltrials.gov/”
    • Phase 2: Cleansed the XML files by extracting the relevant fields from the clinical trials
    • Phase 3: Segregated the data into structured & unstructured data
    • Phase 4: Performed word cloud & sentiment analysis on the unstructured data to identify the reasons for termination of clinical trials
    Techniques used: Term Frequency (TF), Term Frequency-Inverse Document Frequency (TFIDF), positive & negative word clouds, dendrogram, semantic network, k-means clustering
  • 12. Unigram Word Cloud & Dendrogram
    • Keywords that stand out from the rest are Accrual, Enrollment, Slow, Safety, Efficacy, Sponsor, Lack, Low, etc.
    • These words must be read in context to yield business value
    • Viewing this word cloud together with the dendrogram, we notice that slow accrual, slow enrollment, poor efficacy and sponsor funding appear to be the broad themes behind termination of clinical trials
  • 13. Semantic network, Bigram (Bi-gram Word Cloud & Semantic Network)
    • The semantic network shows the relationships between the words, and the key themes mentioned in the previous slide become more relevant
    • One key item is safety concerns. At first sight it sounds as if safety concerns were a reason for termination, but seen in context, more termination reasons say that there are “No Safety Concerns”
    • Bigrams look at 2 words at a time to extract business value, and the key themes mentioned earlier are more evident here
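The “No Safety Concerns” point can be illustrated with a few lines of Python. The termination reasons below are invented for this sketch and are not taken from the clinical trials data.

```python
from collections import Counter

# Hypothetical, already-cleaned termination reasons.
reasons = [
    "slow accrual no safety concerns",
    "sponsor decision no safety concerns",
    "safety concerns",
    "slow accrual",
]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = Counter(tok for reason in reasons for tok in reason.split())
bigrams = Counter(bg for reason in reasons for bg in ngrams(reason.split(), 2))

print(unigrams["safety"])      # how often 'safety' appears on its own
print(bigrams["no safety"])    # how often it is preceded by 'no'
print(bigrams.most_common(3))
```

In this toy data the unigram count makes ‘safety’ look like a frequent termination reason, while the bigram counts show that most of those mentions are negated.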
  • 14. K-Means Clustering Scree Plot
    • The scree plot (elbow plot) shows a clear bend at 2 clusters, hence we consider that the data can be segregated into 2 clusters (categories)
    Note: The analysis was also carried out considering the slight bend at the 2nd cluster and the steep bend at the 4th cluster; however, it did not provide any meaningful insights
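How an elbow arises can be sketched in plain Python. The eight 2-D points and the deterministic choice of starting centroids are assumptions made for this illustration; the slide's analysis ran on the clinical trials text and its exact k-means settings are not stated.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wcss(points, k, iterations=20):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    n = len(points)
    # Assumption: evenly spaced data points serve as starting centroids.
    centroids = [points[i * n // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical data with two well-separated groups.
POINTS = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

wcss = {k: kmeans_wcss(POINTS, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, value)
```

Plotting these values against k gives the scree plot: the within-cluster sum of squares falls sharply from k = 1 to k = 2 and only marginally afterwards, so the bend sits at 2 clusters for this toy data.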
  • 15. Word Cloud & Dendrogram - First Cluster
    • The word cloud clearly highlights that this cluster is mainly about Accrual, a term referring to the number of patients in a study or clinical trial
    • The dendrogram also clearly shows Accrual, Enrollment and Slow as a major cluster
  • 16. Word Cloud & Dendrogram - Second Cluster
    • A few key highlights from this word cloud are early & premature termination
    • The dendrogram mostly shows terms related to premature closure of clinical trials
  • 17. Web & Social Media Extraction
  • 18. NLP - Agenda
    01 LDA in Text Mining
    02 Topic extraction using LDA
    03 Structured information extraction
    04 Sentiment extraction in a narrative
    05 Lexicons & Emotion Mining
  • 19. NLP
    • Data collection / Information Retrieval
    • Feature extraction
    • Lexical analysis / Entity analysis
    • Cleaning / Normalization
    • Extraction of insight
  • 20. Latent Dirichlet Allocation (LDA)
    • It assumes that each document is a mixture of a small number of topics
    • Each document may be viewed as a mixture of various topics
    • Each word’s presence is attributable to one of the document’s topics
    • LDA can be viewed as a Bayesian model, where each item is modeled as the result of a mixture over an underlying set of topics
    • LDA is a generative model (a model for randomly generating observable data values)
    • Observations are words collected into documents
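The generative story described on this slide can be sketched in Python. The two topics, their word probabilities and the Dirichlet parameter are hypothetical; fitting LDA to real text runs this process in reverse (inferring topics from observed words), which is not shown here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topics: each one is a probability distribution over words.
TOPICS = {
    "enrollment": {"slow": 0.4, "accrual": 0.4, "enrollment": 0.2},
    "funding": {"sponsor": 0.5, "funding": 0.3, "lack": 0.2},
}

def generate_document(n_words, alpha, rng):
    """Generate (word, topic) pairs following the LDA generative process."""
    names = list(TOPICS)
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position from the mixture.
        topic = rng.choices(names, weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append((word, topic))
    return theta, words

rng = random.Random(0)
theta, words = generate_document(8, alpha=[0.5, 0.5], rng=rng)
print([round(t, 3) for t in theta])
print(words)
```

Dirichlet parameters below 1 tend to produce mixtures dominated by a few topics, which matches the assumption that each document draws on a small number of topics.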
  • 21. Latent Dirichlet allocation (LDA) Vs. Clustering
    LDA
    • Unsupervised learning algorithm
    • Mixture model where a document can be assigned to one or more topics
    • Each topic spans multiple documents
    Clustering (K-means)
    • Unsupervised learning algorithm
    • Specify an optimal ‘k’ that allows us to extract topics or segments from the data
    • Does a raw partition of the data
    • Resultant clusters are disjoint from each other
    A popular example using term usage
    • “A man sees a boy with a telescope”
    • Who has the telescope?
    • In this example a term’s usage leads to confusion owing to its placement
    • In the same way, a sentence from a corpus could imply a different meaning in conjunction with another sentence
  • 22. Structured data extraction
    • Many sources of data contain a large number of artifacts that carry a lot of information
    • Text data can be subjected to methods that help mine structured information
    • This is information retrieval using previously generated labeled data
    Pipeline: 1. Raw Text, 2. Parser, 3. Named Entities
  • 23. Lexicons
    • Lexicons serve as dictionaries for extracting sentiment from raw unlabeled data
    • These are useful in estimating semantic orientation (polarity)
    • They are applied to polarity prediction tasks and serve as a bag of words that help assign a score/label to terms in text
    Three lexicons used in this session are:
      – Bing (Developed by Professor Bing Liu)
      – AFINN (Informatics and Mathematical Modelling, Technical University of Denmark)
      – NRC (Dr. Saif M. Mohammad)
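Lexicon-based polarity scoring can be sketched in Python as a dictionary lookup. The `TOY_LEXICON` below is hypothetical: it imitates the AFINN style of one integer valence per word, but its entries and values are invented and are not the real AFINN scores.

```python
import re

# Hypothetical AFINN-style lexicon: word -> integer valence.
TOY_LEXICON = {
    "worst": -3, "scam": -2, "late": -1,
    "best": 3, "benefit": 2, "honest": 2,
}

def polarity_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token found in the lexicon; others score 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0) for tok in tokens)

print(polarity_score("This is the worst card ever, a late fee scam"))
print(polarity_score("The best card, an extra benefit"))
```

A negative total suggests negative polarity and a positive total suggests positive polarity; words missing from the lexicon contribute nothing, which is why the choice of lexicon matters.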
  • 24. Emotion Mining
    This methodology extracts the most negative and the most positive sentiment-bearing documents from the text:
    > negative <- s_v[which.min(afinn_s_v)]
    > negative
    [1] "I fully agree with you. This is the worst card ever and they are running a late charge fee scam here. This is what they do - on my first statement, I made full payment. On the second statement, they charged me a late payment fee and interest. I wrote back to them to tell them they made mistake but never heard back from them. I made full payment on the second statement which was around $26 and when the third statement came, I was charged with another late payment fee and compounded interests. This time, I got on the phone and spoke to some guy in India who starts off each statement with "So how do you want to make payment sir" After something like hearing that damn statement for the 10th time, I got so pissed with trying to get an explanation on the late fees, I was transferred to a supervisor who then said that my first payment got rejected. I then asked why was it rejected (for which they had no clue and said that I should check with my bank) but got no where but what I was really pissed about was that the unpaid amount due in the first statement was never reflected on the second statement like other typical credit cards from REPUTABLE banks. Had this been done, I would have known and paid in full with the second payment! In the end, they said that they would waive the interests which was only a few dollars but gad to charge me the late fee of $35 per month ($70 in total). Can you imagine if they run this "sweat shop" practice from India and suckered in 10,000 people with this practice? That would have been $700,000 into their pockets without breaking a SWEAT! Worst still, they filed my 2 months delinquencies with the credit unions and I lost 50-70 credit score points!! Bastards!"
    > positive <- s_v[which.max(afinn_s_v)]
    > positive
    [1] "Well, you got me beat. I have 11 currently. However, I do have several Chase cards (CSP, Marriott Rewards and the Freedom) and to be honest I rarely use my Amazon.com Rewards due to better reward options elsewhere; Barclay's SallieMae is the best card for Amazon.com purchases. 5% back on up to 750 in purchases each month. It also has 5% on gas/groceries on up to 250 on purchases, in each category per month. I really do not like the small cap of 250 on groceries and use Amex BCE for that purpose, but its still an extra benefit to have."
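For readers following along in Python, the `which.min` / `which.max` selection above can be mirrored as below. The names `s_v` and `afinn_s_v` are borrowed from the slide; the three documents and their scores are hypothetical stand-ins, not output from the real corpus.

```python
# Hypothetical documents and their per-document sentiment scores.
s_v = [
    "This is the worst card ever.",
    "I have 11 currently.",
    "It is the best card and an extra benefit to have.",
]
afinn_s_v = [-3, 0, 5]

def index_of_min(values):
    """Position of the first minimum, like R's which.min (but 0-based)."""
    return min(range(len(values)), key=values.__getitem__)

def index_of_max(values):
    """Position of the first maximum, like R's which.max (but 0-based)."""
    return max(range(len(values)), key=values.__getitem__)

negative = s_v[index_of_min(afinn_s_v)]
positive = s_v[index_of_max(afinn_s_v)]
print(negative)
print(positive)
```

As in R, ties resolve to the first matching position, so only one document is returned even when several share the extreme score.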
  • 25. Arcs & Emotion
    • This method generates a narrative timeline for the text and shows how the positive and negative emotional valence has moved across the corpus
    • The NRC framework allows mining text for words that match eight emotions in the writing:
      1. Trust
      2. Anticipation
      3. Joy
      4. Sadness
      5. Fear
      6. Anger
      7. Surprise
      8. Disgust
    • The same can be implemented for the different sources from which content is generated for a client, and the outcomes can be compared and evaluated to understand which source provides what form of insight
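Counting the eight emotions can be sketched in Python with a word-to-emotions lookup. `TOY_NRC` is a hypothetical miniature in the spirit of the NRC lexicon; its words and emotion tags are invented for illustration and do not reproduce the real lexicon.

```python
import re
from collections import Counter

EMOTIONS = ("trust", "anticipation", "joy", "sadness",
            "fear", "anger", "surprise", "disgust")

# Hypothetical lexicon: each word maps to zero or more of the eight emotions.
TOY_NRC = {
    "honest": {"trust"},
    "reward": {"joy", "anticipation"},
    "scam": {"anger", "disgust"},
    "lost": {"sadness"},
}

def emotion_counts(text):
    """Count how many tokens in the text are tagged with each emotion."""
    counts = Counter({emotion: 0 for emotion in EMOTIONS})
    for tok in re.findall(r"[a-z]+", text.lower()):
        for emotion in TOY_NRC.get(tok, ()):
            counts[emotion] += 1
    return counts

profile = emotion_counts("An honest reward, not a scam. I lost nothing.")
for emotion in EMOTIONS:
    print(emotion, profile[emotion])
```

Computing such a profile per source, or per consecutive chunk of a long text, gives comparable emotion counts that can be set side by side or plotted over narrative time.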
  • 26. THANK YOU