© 2020 Minitab, LLC.
Meet the Presenter:
Mikhail Golovnya
Minitab Senior Advisory Data Scientist
Mikhail has been prototyping new machine learning algorithms and modeling automation for 20 years, and he has been a major contributor to technological improvements in the most important algorithms in machine learning: CART® Decision Trees, MARS® Non-linear Regression, TreeNet® gradient boosting, and Random Forests®. He holds master's degrees in rocket science from Kharkov State Polytechnic University in Ukraine and in statistical computing from the University of Central Florida.
The Challenge of Text Mining
► Data sets often have a character variable that
contains a possibly long text (user feedback,
comments, etc.)
► Such a variable will usually have as many distinct
values as there are records in the dataset – thus, it
cannot be used directly for modeling
► Core objective of Text Mining:
Find ways to extract numeric measures from a
text variable that can be used in quantitative
modeling
Wine Review
Excellent wine! HIGHLY Recomended
LOVE IT; AWESOME!
Too bitter, forgettable
Love this wine
Had better wine before
Simple Text Statistics
► The following simple numeric summaries of the raw text itself can be extracted and used in quantitative
analysis as derived numeric variables
▪ Total count of words
▪ Total count of characters
▪ Average word length (in characters)
▪ Count of stop-words (commonly occurring words)
▪ Count of numeric words (series of digits)
▪ Count of words written in all upper-case
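The summaries listed above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's actual script: the function name `simple_text_stats`, the tokenization rule (whitespace-separated tokens with edge punctuation stripped), and the tiny `STOP_WORDS` set are all assumptions made for the example.

```python
# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "it", "this", "is", "had", "too", "before"}

def simple_text_stats(text):
    """Return the six raw-text summaries for one text value."""
    # A "word" here is a whitespace-separated token with edge punctuation stripped.
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    words = [w for w in words if w]
    n_words = len(words)
    return {
        "word_count": n_words,
        "char_count": len(text),  # counts every character, including spaces
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "stop_word_count": sum(w.lower() in STOP_WORDS for w in words),
        "numeric_word_count": sum(w.isdigit() for w in words),
        "upper_case_word_count": sum(w.isupper() for w in words),
    }

stats = simple_text_stats("Excellent wine! HIGHLY Recomended")
print(stats)
```

Each key becomes one derived numeric column. Whether spaces and punctuation count toward the character total is a design choice; here they do.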
Simple Stats
Wine Review
Excellent wine! HIGHLY Recomended
LOVE IT; AWESOME.
Too bitter, forgettable
Love this wine
Had beter wine before
Text Cleaning Steps
► Raw text stats summarize the original text in its raw form
► The following steps (cleaning up) are normally employed to prepare a raw text variable for further
analysis
▪ Converting all characters to lower case only
▪ Removing all punctuation
▪ Removing all stop-words
▪ Correcting spelling errors
▪ Removing infrequent words
► More advanced analyses (semantic extraction, etc.) might omit some of the above steps
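A minimal sketch of these cleaning steps, applied to the wine reviews, might look as follows. The stop-word set and the spelling-fix dictionary are hypothetical stand-ins chosen for this example; a real pipeline would use a full stop-word list and a proper spell checker. The name `clean_corpus` and the `min_count` parameter are likewise assumptions.

```python
import string
from collections import Counter

# Hypothetical stop-word list and spelling fixes, for illustration only.
STOP_WORDS = {"it", "this", "had", "too", "before"}
SPELLING_FIXES = {"recomended": "recommended", "beter": "better"}

def clean_corpus(texts, min_count=1):
    """Apply the cleaning steps to a list of raw text values."""
    strip_punct = str.maketrans("", "", string.punctuation)
    cleaned = []
    for text in texts:
        words = text.lower().translate(strip_punct).split()  # lower case, no punctuation
        words = [w for w in words if w not in STOP_WORDS]     # drop stop-words
        words = [SPELLING_FIXES.get(w, w) for w in words]     # correct spelling
        cleaned.append(words)
    # Infrequent words can only be identified after counting over the whole corpus.
    counts = Counter(w for words in cleaned for w in words)
    return [" ".join(w for w in words if counts[w] >= min_count) for words in cleaned]

reviews = [
    "Excellent wine! Highly Recomended",
    "Love it; awesome.",
    "Too bitter, forgettable",
    "Love this wine",
    "Had beter wine before",
]
cleaned = clean_corpus(reviews)
print(cleaned)
```

With `min_count=1` nothing is dropped for being infrequent; raising it removes words that occur fewer than `min_count` times across all reviews.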
Cleaning Up Process
Wine Review (before cleaning)
Excellent wine! Highly Recomended
Love it; awesome.
Too bitter, forgettable
Love this wine
Had beter wine before
Wine Review (after cleaning)
excellent wine highly recommended
love awesome
bitter forgettable
love wine
better wine
Summary Statistics
► The following summary statistics can now be computed and visualized for a
“beautified” text variable
▪ Total word count for each word that “survived the beautification process”
▪ Inverse Document Frequency (IDF) for each word
$$\mathrm{IDF} = \log\frac{N}{\mathrm{DF}}$$
where N is the number of observations (documents) and
DF is the number of documents in which a given word occurs
A word present in all observations has IDF=0
A word present in only one observation has the largest
possible IDF
▪ Bar chart of the most frequently occurring words and their IDFs
▪ Word-cloud image of the most frequently occurring words
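The word counts and IDF values can be computed directly from the cleaned text. The sketch below is an illustration under two assumptions: each cleaned review is a space-separated string, and the logarithm is the natural log (the slide does not state the base). The function name `word_counts_and_idf` is made up for this example.

```python
import math
from collections import Counter

def word_counts_and_idf(documents):
    """Return (total word counts, IDF per word) for cleaned documents."""
    n_docs = len(documents)
    total_counts = Counter()   # how often each word occurs overall
    doc_freq = Counter()       # DF: in how many documents each word occurs
    for doc in documents:
        words = doc.split()
        total_counts.update(words)
        doc_freq.update(set(words))  # count each word once per document
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    return total_counts, idf

docs = [
    "excellent wine highly recommended",
    "love awesome",
    "bitter forgettable",
    "love wine",
    "better wine",
]
counts, idf = word_counts_and_idf(docs)
for word, count in counts.most_common(3):
    print(word, count, round(idf[word], 3))
```

Here "wine" occurs in 3 of the 5 reviews, so its IDF is log(5/3), while a word that occurs in a single review gets the maximum value, log(5).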
Summary Statistics
Word Counts
Word IDFs
Extracting Sentiment Values
► Sentiment value is a number that summarizes the writer's overall
attitude, based on linguistic analysis of the text
▪ Positive sentiment reflects positive attitude
▪ Negative sentiment reflects negative attitude
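The deck does not say how the sentiment value is computed. One simple possibility, shown here purely as an assumption, is a lexicon-based score: each word carries a polarity and the sentiment of a text is the average over the words found in the lexicon. The `LEXICON` entries and their scores are invented for illustration.

```python
# Hypothetical word polarities, for illustration only;
# real sentiment lexicons hold thousands of scored entries.
LEXICON = {
    "excellent": 1.0,
    "awesome": 1.0,
    "love": 0.8,
    "recommended": 0.6,
    "bitter": -0.6,
    "forgettable": -0.7,
}

def sentiment_value(text):
    """Average the lexicon scores of the scored words; 0.0 if none is found."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for review in ["love awesome", "bitter forgettable", "better wine"]:
    print(review, sentiment_value(review))
```

A word-level lexicon ignores context: "Had better wine before" is a negative remark, yet no single word in it is clearly negative, which is why production tools rely on richer linguistic analysis.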
Creating a Bag of Words
► For each word create a new variable that reports how many times the word
occurs in the text field
► To avoid explosion of new variables, the user might want to exclude
infrequent words
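A bag of words can be sketched as one count column per retained word. In this minimal example the function name `bag_of_words` and the `min_count` threshold (the minimum total number of occurrences a word needs to get its own column) are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(documents, min_count=1):
    """Return (vocabulary, rows): one count column per retained word."""
    tokenized = [doc.split() for doc in documents]
    totals = Counter(w for words in tokenized for w in words)
    # Keep only words that occur at least min_count times overall.
    vocabulary = sorted(w for w, c in totals.items() if c >= min_count)
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[w] for w in vocabulary])
    return vocabulary, rows

docs = [
    "excellent wine highly recommended",
    "love awesome",
    "bitter forgettable",
    "love wine",
    "better wine",
]
vocab, rows = bag_of_words(docs, min_count=2)
print(vocab)
for row in rows:
    print(row)
```

With `min_count=1` these five reviews already produce nine columns; with `min_count=2` only "love" and "wine" remain, which shows how the threshold keeps the number of new variables in check.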
Extracting Singular Vectors
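This slide carries only a title, so the following is an assumption rather than the deck's method: singular vectors are taken here to come from a singular value decomposition of the bag-of-words count matrix, which compresses many word columns into a few numeric scores per document. The sketch uses plain-Python power iteration to find only the leading singular triplet; the function name `leading_singular_triplet` is hypothetical, and in practice a linear algebra library would return several vectors at once.

```python
import math

def leading_singular_triplet(matrix, n_iter=200):
    """Power iteration for the largest singular value of a dense matrix.

    Returns (sigma, u, v): u holds one score per document (row) and
    v holds one loading per word (column).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-length start vector
    for _ in range(n_iter):
        # u = A v, then w = A^T u; normalizing w gives the next v.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero matrix: nothing to extract
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v

# Bag-of-words counts for the columns ["love", "wine"], one row per review.
bow = [[0, 1], [1, 0], [0, 0], [1, 1], [0, 1]]
sigma, doc_scores, word_loadings = leading_singular_triplet(bow)
print(sigma, doc_scores, word_loadings)
```

Under this reading, `doc_scores` would be the new numeric variable (one value per review) and `word_loadings` would show how strongly each word contributes to it.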
Summary
► Reporting stage (text_summary.py)
▪ Word frequencies and IDFs
▪ Bar charts and word cloud
► Extracting stage (text_convert.py)
▪ Created original raw text statistics variables
▪ Cleaning up stage
▪ Created sentiment value variable
▪ Created bag of words variables
▪ Created singular vector variables
► We have solved the original text mining challenge:
all these numeric variables summarize the original text variable and can be
used in predictive modeling algorithms along with the rest of the predictors!
Reporting Stage
► LET K1 = "reviews.csv" – input data set
► LET K2 = "Review" – text variable
► LET K3 = 1 – word count limit
► PYSC "text_summary.py" – reporting script
Extracting Stage
► LET K1 = "reviews.csv" – input data set
► LET K2 = "Review" – text variable
► LET K3 = 1 – word count limit
► LET K5 = 5 – number of singular vectors
► LET K6 = "reviews_bow.csv" – bag of words dataset
► LET K7 = "reviews_svd.csv" – singular vector dataset
► LET K8 = "reviews_lds.csv" – word loadings
► PYSC "text_convert.py" – extracting script
Our Approach: More Than Business Analytics… Solutions Analytics
► Software
▪ Data Analysis: Powerful statistical software everyone can use
▪ Predictive Modeling: Machine learning and predictive analytics software
▪ Visual Business Tools: Visual tools to process and product excellence
▪ Project Oversight: Start, track, manage and execute improvement projects with real-time dashboards
▪ Online Training: Master statistics and Minitab anywhere with online training
► Services
▪ Training: Learn first-hand by attending public trainings or customized trainings according to your requirements.
▪ Statistical Consulting: Personalized help with statistical challenges, from collecting the right data to interpreting analysis and more.
▪ Support: Assistance with installation, implementation, version updates and license management.
Solutions analytics is our integrated approach to providing software and services that enable organizations to
make better decisions that drive business excellence.
Upcoming Webinar Wednesdays
Continue learning and working efficiently with our free webinar series:
• A TEDx Coach’s Secrets To Developing Innovative Leaders
and Ensuring They Thrive at Your Organization – July 15
info.minitab.com/resources/webinars/webinar-wednesdays
Minitab Training is now virtual!
Learn more at minitab.com/training
Thank You!
From all of us at Minitab

Performing at your best turning words into numbers and numbers into data driven insights with Minitab, Python and Text Mining
