Performing at your best: turning words into numbers and numbers into data-driven insights with Minitab, Python and Text Mining
2. © 2020 Minitab, LLC.
Meet the Presenter:
Mikhail Golovnya
Minitab Senior Advisory Data Scientist
Mikhail has been prototyping new machine learning algorithms and modeling automation for 20 years, and he has been a major contributor to technological improvements in the most important algorithms in Machine Learning: CART® Decision Trees, MARS® Non-linear Regression, TreeNet® gradient boosting, and Random Forests®. He holds master’s degrees in rocket science from Kharkov State Polytechnic University in Ukraine and in statistical computing from the University of Central Florida.
3. © 2020 Minitab, LLC.
The Challenge of Text Mining
► Data sets often have a character variable that
contains a possibly long text (user feedback,
comments, etc.)
► Such a variable will usually have as many distinct
values as there are records in the dataset – thus, it
cannot be used directly for modeling
► Core objective of Text Mining:
Find ways to extract numeric measures from a
text variable that can be used in quantitative
modeling
Wine Review
Excellent wine! HIGHLY Recomended
LOVE IT; AWESOME!
Too bitter, fordettable
Love this wine
Had better wine before
4. © 2020 Minitab, LLC.
Simple Text Statistics
► The following simple numeric summaries of the raw text itself can be extracted and used in quantitative
analysis as derived numeric variables
▪ Total count of words
▪ Total count of characters
▪ Average word length (in characters)
▪ Count of stop-words (commonly occurring words)
▪ Count of numeric words (series of digits)
▪ Count of words written in all upper-case
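The summaries listed above can be sketched in a few lines of Python. This is a minimal illustration, not the webinar's script: the function name `simple_text_stats`, the tiny `STOP_WORDS` set, and the punctuation-stripping rule are all assumptions made for the example.

```python
# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); a real script would use a much fuller list.
STOP_WORDS = {"a", "an", "the", "it", "this", "had", "before", "too", "is", "and"}


def simple_text_stats(text):
    """Return simple numeric summaries of one raw text value."""
    # Split on whitespace, then strip surrounding punctuation so that
    # "wine!" is measured as "wine".
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    words = [w for w in words if w]
    n_words = len(words)
    return {
        "word_count": n_words,
        "char_count": len(text),  # counts spaces and punctuation too
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "stop_word_count": sum(w.lower() in STOP_WORDS for w in words),
        "numeric_count": sum(w.isdigit() for w in words),
        "upper_count": sum(w.isupper() for w in words),
    }


stats = simple_text_stats("Excellent wine! HIGHLY Recomended")
print(stats)
```

Each key becomes one derived numeric variable per record. Whether characters should include spaces and punctuation is a design choice; here they are counted on the raw string.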
5. © 2020 Minitab, LLC.
Simple Stats
Wine Review
Excellent wine! HIGHLY Recomended
LOVE IT; AWESOME.
Too bitter, forgettable
Love this wine
Had beter wine before
7. © 2020 Minitab, LLC.
Text Cleaning Steps
► Raw text stats summarize the original text in its raw form
► The following steps (cleaning up) are normally employed to prepare a raw text variable for further
analysis
▪ Converting all characters to lower case only
▪ Removing all punctuation
▪ Removing all stop-words
▪ Correcting spelling errors
▪ Removing infrequent words
► More advanced analyses (semantic extraction, etc.) might omit some of the above steps
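The cleaning steps above can be sketched as follows. The stop-word list and the spelling map are hypothetical, hand-made stand-ins chosen to fit the wine review example; a real script would presumably use a fuller stop-word list and a proper spelling corrector. The `min_count` parameter is likewise an assumed name for the infrequent-word threshold.

```python
import string
from collections import Counter

# Hypothetical helpers for illustration only (assumptions, not the
# webinar's actual resources).
STOP_WORDS = {"it", "this", "had", "before", "too"}
SPELLING_FIXES = {"recomended": "recommended", "beter": "better"}


def clean_reviews(reviews, min_count=1):
    """Apply the cleaning steps to a list of raw text values."""
    drop_punct = str.maketrans("", "", string.punctuation)
    tokenized = []
    for text in reviews:
        # Lower-case everything and remove punctuation
        words = text.lower().translate(drop_punct).split()
        # Remove stop-words
        words = [w for w in words if w not in STOP_WORDS]
        # Correct known spelling errors
        words = [SPELLING_FIXES.get(w, w) for w in words]
        tokenized.append(words)
    # Count each word across the whole data set, then drop infrequent ones
    totals = Counter(w for words in tokenized for w in words)
    return [" ".join(w for w in words if totals[w] >= min_count)
            for words in tokenized]


raw = [
    "Excellent wine! HIGHLY Recomended",
    "LOVE IT; AWESOME.",
    "Too bitter, forgettable",
    "Love this wine",
    "Had beter wine before",
]
cleaned = clean_reviews(raw)
print(cleaned)
```

With `min_count=1` nothing is dropped as infrequent; raising it keeps only the words that occur at least that many times across all records.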
8. © 2020 Minitab, LLC.
Cleaning Up Process
Wine Review
Excellent wine! Highly Recomended
Love it; awesome.
Too bitter, forgettable
Love this wine
Had beter wine before
Wine Review
excellent wine highly recommended
love awesome
bitter forgettable
love wine
better wine
9. © 2020 Minitab, LLC.
Summary Statistics
► The following summary statistics can now be computed and visualized for a
“beautified” text variable
▪ Total word count for each word that “survived the beautification process”
▪ Inverse Document Frequency (IDF) for each word
$\mathrm{IDF} = \log\left(\frac{N}{\mathrm{DF}}\right)$
where N is the number of observations (documents) and
DF is the number of documents in which a given word occurs
A word present in all observations has IDF=0
A word present in only one observation has the largest
possible IDF
▪ Bar chart of the most frequently occurring words and their IDFs
▪ Word-cloud image of the most frequently occurring words
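As a rough sketch of the word counts and IDF values described above, assuming the text has already been cleaned: the function name `word_counts_and_idf` is hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math
from collections import Counter


def word_counts_and_idf(cleaned_texts):
    """Return (total count per word, IDF per word) for cleaned text values."""
    n_docs = len(cleaned_texts)
    total = Counter()     # how many times each word occurs overall
    doc_freq = Counter()  # DF: number of documents that contain the word
    for text in cleaned_texts:
        words = text.split()
        total.update(words)
        doc_freq.update(set(words))  # count each word once per document
    # Natural log is an assumption; the slide only says "log".
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    return total, idf


cleaned = [
    "excellent wine highly recommended",
    "love awesome",
    "bitter forgettable",
    "love wine",
    "better wine",
]
total, idf = word_counts_and_idf(cleaned)
for word, count in total.most_common(3):
    print(word, count, round(idf[word], 3))
```

Here "wine" occurs in 3 of the 5 reviews, so its IDF is log(5/3), while a word found in only one review gets the maximum value, log(5).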
13. © 2020 Minitab, LLC.
Extracting Sentiment Values
► A sentiment value is a number that summarizes the writer’s overall
attitude based on linguistic analysis of the text
▪ A positive sentiment value reflects a positive attitude
▪ A negative sentiment value reflects a negative attitude
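The slides do not say how the sentiment value is computed, so the following is only a toy illustration of the idea, not the webinar's method: each word is looked up in a small, made-up lexicon of scores and the scores are averaged. The `LEXICON` values and the function name are assumptions; real sentiment tools use far larger lexicons and handle negation and intensifiers.

```python
# Hypothetical word scores, invented for this example only.
LEXICON = {
    "excellent": 1.0,
    "love": 1.0,
    "awesome": 1.0,
    "recommended": 0.5,
    "bitter": -0.5,
    "forgettable": -1.0,
}


def sentiment_value(cleaned_text):
    """Average the lexicon scores of the words found (0.0 if none match)."""
    scores = [LEXICON[w] for w in cleaned_text.split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


for review in ["love awesome", "bitter forgettable", "better wine"]:
    print(review, "->", sentiment_value(review))
```

A review with no lexicon words scores 0.0 here, which treats "no evidence" as neutral; that is a simplification of this sketch.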
14. © 2020 Minitab, LLC.
Creating a Bag of Words
► For each word, create a new variable that reports how many times the word
occurs in the text field
► To avoid an explosion of new variables, the user might want to exclude
infrequent words
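A minimal bag-of-words sketch, assuming cleaned text as input: one count column is built per word, and words whose total count falls below a threshold are excluded. The names `bag_of_words` and `min_count` are hypothetical.

```python
from collections import Counter


def bag_of_words(cleaned_texts, min_count=1):
    """Return (vocabulary, rows), with one count per vocabulary word per text."""
    counts = [Counter(text.split()) for text in cleaned_texts]
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Keep only words that occur at least min_count times overall
    vocab = sorted(w for w, n in totals.items() if n >= min_count)
    # A Counter returns 0 for words that are absent from a text
    rows = [[c[w] for w in vocab] for c in counts]
    return vocab, rows


cleaned = [
    "excellent wine highly recommended",
    "love awesome",
    "bitter forgettable",
    "love wine",
    "better wine",
]
vocab, rows = bag_of_words(cleaned, min_count=2)
print(vocab)
for row in rows:
    print(row)
```

With `min_count=2` only "love" and "wine" survive, so five reviews yield two new numeric variables instead of nine.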
16. © 2020 Minitab, LLC.
Summary
► Reporting stage (text_summary.py)
▪ Word frequencies and IDFs
▪ Bar charts and word cloud
► Extracting stage (text_convert.py)
▪ Created original raw text statistics variables
▪ Cleaning up stage
▪ Created sentiment value variable
▪ Created bag of words variables
▪ Created singular vector variables
► We have solved the original text mining challenge:
all these numeric variables summarize the original text variable and can be
used in predictive modeling algorithms along with the rest of the predictors!
17. © 2020 Minitab, LLC.
Reporting Stage
► LET K1 = "reviews.csv" – input data set
► LET K2 = "Review" – text variable
► LET K3 = 1 – word count limit
► PYSC "text_summary.py" – reporting script
18. © 2020 Minitab, LLC.
Extracting Stage
► LET K1 = "reviews.csv" – input data set
► LET K2 = "Review" – text variable
► LET K3 = 1 – word count limit
► LET K5 = 5 – number of singular vectors
► LET K6 = "reviews_bow.csv" – bag of words dataset
► LET K7 = "reviews_svd.csv" – singular vector dataset
► LET K8 = "reviews_lds.csv" – word loadings
► PYSC "text_convert.py" – extracting script