BEGINNING TEXT ANALYSIS
Barry DeCicco
Ann Arbor Chapter of the American Statistical
Association
April 22, 2020
CONTENTS
Sentiment Scoring with TextBlob.
Predicting Categories with
Machine Learning, using NLTK
and scikit-learn.
CREDITS
(UP FRONT!)
Almost everything I’ve learned about text analytics I
learned from posters at Medium.com, particularly their
section ‘Towards Data Science’.
Medium.com has a $5/month subscription, which for the
knowledge I’ve gained is a better value than most free
resources.
SENTIMENT SCORING
Using TextBlob
WHAT IS SENTIMENT SCORING?
This means assigning a positive/negative score to each
piece of text (e.g., comment in a survey, customer review
for a purchase, etc.).
These scores can then be tracked over time, or
associated with various cuts in the data (department,
division, product, customer demographic).
The tool used here will be the Python module TextBlob.
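As a rough illustration of the idea (not TextBlob's implementation), the sketch below scores each comment by averaging word polarities from a tiny hypothetical lexicon, then averages the scores by department. The lexicon values, the comments, and the department names are all made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mini-lexicon: word -> polarity in [-1, 1].
LEXICON = {"wonderful": 1.0, "love": 0.5, "pretty": 0.25, "poor": -0.5, "terrible": -1.0}

def score(text):
    """Average polarity of the lexicon words found in the text (0.0 if none)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return mean(hits) if hits else 0.0

# Hypothetical survey comments, each tagged with a department.
comments = [
    ("sales", "wonderful service"),
    ("sales", "love the new pricing"),
    ("support", "terrible wait times"),
    ("support", "poor follow up but pretty quick fix"),
]

# Collect the scores for each department, then average them.
by_department = defaultdict(list)
for department, text in comments:
    by_department[department].append(score(text))

average_by_department = {d: mean(s) for d, s in by_department.items()}
print(average_by_department)
```

The same grouping step works for any other cut of the data, such as product or survey month.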
TEXTBLOB
TextBlob is a Python package which does a lot of things
with text:
• Spelling correction
• Noun phrase extraction
• Part-of-speech tagging
• Tokenization (splitting text into words and sentences)
• Sentiment analysis
CREATING A TEXTBLOB
Install the package (for example, pip install textblob).
In a Python program, load it:
from textblob import TextBlob
Run it on some text:
# Note the misspelling "comortable" in the sample text.
text = "Absolutely wonderful - silky and sexy and comortable"
text_lower = text.lower()
blob_pre = TextBlob(text_lower)
blob = blob_pre.correct()  # apply spelling correction
sentiment = blob.sentiment
polarity = sentiment.polarity
subjectivity = sentiment.subjectivity
CREATING A TEXTBLOB - RESULTS
text: Absolutely wonderful - silky and sexy and comortable
text_lower: absolutely wonderful - silky and sexy and comortable
blob_pre: absolutely wonderful - silky and sexy and comortable
blob: absolutely wonderful - silk and sex and comfortable
sentiment: Sentiment(polarity=0.7, subjectivity=0.9)
polarity: 0.7 [on a scale of -1 to 1]
subjectivity: 0.9 [on a scale of 0 to 1]
BASIC STEPS IN TEXT ANALYTICS
If you have a data set with 10,000 comments, you have
close to 10,000 unique values for a variable. That makes
analysis futile in almost all cases.
Therefore the text values are tokenized:
• Break text into sentences,
• Break sentences into words,
• ‘Standardize’ the words (e.g., set to root form, singularizing
plurals and setting verbs to present tense, possibly
removing stop words).
TOKENIZATION
Most comments are unique, resulting in a variable with
mostly unique values. That generally makes analysis futile.
Therefore the text values are tokenized:
• Break text into sentences,
• Break sentences into words,
• ‘Standardize’ the words (e.g., set to root form, singularizing
plurals and setting verbs to present tense).
This converts 10,000 unique values into lists drawn from a much
smaller set of values. Each text field is now a list of standardized tokens.
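The three steps can be sketched with the standard library alone. This is a simplified stand-in for what NLTK does, assuming a naive sentence splitter and a tiny hypothetical stop-word list. Real tokenizers handle abbreviations, contractions, and punctuation far more carefully.

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "is", "it", "and", "i", "this", "to", "in"}

def split_sentences(text):
    """Naively split after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def tokenize(text):
    """Return the token list for one comment, with stop words removed."""
    tokens = []
    for sentence in split_sentences(text):
        tokens.extend(w for w in split_words(sentence) if w not in STOP_WORDS)
    return tokens

comment = "Love this dress! It is sooo pretty. I found it in a store."
print(split_sentences(comment))
print(tokenize(comment))
```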
COMMENTS ON TOKENIZATION
There are a variety of tools and methods/settings in Python
to tokenize. This presentation will use NLTK (the Natural
Language Toolkit).
There are trade-offs:
• Stemming trims words to a root, which is not necessarily
grammatically correct (‘riding’ => ‘rid’).
• Lemmatization attempts to find a good root (‘riding’ =>
‘ride’).
• Spelling correction is far from perfect, and can really slow
down a program, depending on the misspellings.
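To make the stemming versus lemmatization trade-off concrete, here is a toy sketch. It is not NLTK's stemmer or lemmatizer: `crude_stem` strips a few suffixes blindly, and `crude_lemmatize` looks words up in a small hypothetical dictionary. Both names and the dictionary contents are assumptions made for illustration.

```python
def crude_stem(word):
    """Toy stemmer: strip a few common suffixes with no dictionary check."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary; a real lemmatizer consults a full lexicon.
LEMMAS = {"riding": "ride", "rode": "ride", "hits": "hit", "dresses": "dress"}

def crude_lemmatize(word):
    """Toy lemmatizer: look the word up, fall back to the word itself."""
    return LEMMAS.get(word, word)

# Print each word with its stem and its lemma side by side.
for word in ["riding", "hits", "dresses", "rode"]:
    print(word, crude_stem(word), crude_lemmatize(word))
```

The stemmer is fast and needs no data, but it produces roots such as ‘rid’ and misses irregular forms such as ‘rode’. The lemmatizer gets those right only because the dictionary already knows them.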
NLTK PROCESSING
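# One-time setup (resource names may vary by NLTK version):
#   nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))  # assumed: NLTK's English stop-word list
lemmatizer = WordNetLemmatizer()

text = ("Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i "
        "did bc i never would have ordered it online bc it's petite. i bought a petite and am "
        "5'8\". i love the length on me- hits just a little below the knee. would definitely be a true "
        "midi on someone who is truly petite.")
text_fixed = re.sub(r"’", "'", text)  # fix an oddity in import (assumed here: curly apostrophes)
text_lower = text_fixed.lower()
word_tokens = nltk.word_tokenize(text_lower)
removing_stopwords = [word for word in word_tokens if word not in stop_words]
lemmatized_word = [lemmatizer.lemmatize(word) for word in removing_stopwords]  # default part of speech is noun
line = ' '.join(lemmatized_word)
print(line)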
NLTK PROCESSING - RESULTS
love dress 's sooo pretty happened find store 'm glad bc
never would ordered online bc 's petite bought petite 5 ' 8
'' love length me- hit little knee would definitely true midi
someone truly petite
COUNT VECTORIZATION
One way to approach the problem of predictors is to
create a set of predictors based on the tokens for each
comment.
A dictionary is compiled for the ‘words’ (tokens) in the set
of comments, and a number is assigned to each token.
The set of numbers and counts can be used as predictors
for each comment.
Two common ways are:
• Count vectorization.
• Tf-idf vectorization.
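A minimal plain-Python sketch of count vectorization follows. It assumes the comments are already tokenized, and the three sample comments are hypothetical. scikit-learn offers a `CountVectorizer` class for the same purpose, and this sketch only mirrors the idea, not that class's behavior.

```python
from collections import Counter

# Hypothetical tokenized comments (one token list per comment).
docs = [
    ["love", "dress", "pretty"],
    ["dress", "petite", "petite"],
    ["love", "length"],
]

# Dictionary: each distinct token is assigned a column number.
vocabulary = {token: i for i, token in enumerate(sorted({t for d in docs for t in d}))}

def count_vector(tokens):
    """Return the row of token counts for one comment, in vocabulary order."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]

matrix = [count_vector(d) for d in docs]
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one comment and each column is one token, so the matrix can be passed to a model as a set of numeric predictors.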
TF-IDF VECTORIZATION
An importance weight can be assigned to each token using
the Term Frequency-Inverse Document Frequency (TF-IDF)
method.
In this method, higher term counts within a comment
(‘document’) make the token more significant, but higher
counts for that token in the entire set of comments
(documents) make it less important.
TF-IDF VECTORIZATION (CONTINUED)
The concept is that a token which appears a lot in a given
comment (‘document’) gets upweighted: Term
Frequency.
However, the more commonly that token appears in the
overall set of comments, the more it gets downweighted: Inverse
Document Frequency.
For example, ‘the’, ‘and’, ‘or’ would generally get a very
low weight. This could be used to automatically disregard
stop words.
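The weighting can be sketched in a few lines, assuming the textbook form tf × idf, where tf is the token's share of the comment and idf = log(N / df), with N the number of comments and df the number of comments containing the token. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant, so their numbers will differ. The sample comments are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized comments with stop words left in on purpose.
docs = [
    ["the", "dress", "is", "pretty"],
    ["the", "dress", "is", "petite"],
    ["the", "length", "is", "perfect"],
]

n_docs = len(docs)
# Document frequency: number of comments that contain each token.
doc_freq = Counter(token for d in docs for token in set(d))

def tf_idf(tokens):
    """Map each token in one comment to tf * idf, with idf = log(N / df)."""
    counts = Counter(tokens)
    return {
        token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }

weights = tf_idf(docs[0])
# List the tokens of the first comment from lowest to highest weight.
for token, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{token:8s} {weight:.3f}")
```

In this sketch ‘the’ and ‘is’ appear in every comment, so their weight is zero, while ‘pretty’ appears in only one comment and gets the highest weight.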
EXAMPLE OF TF-IDF VECTORIZATION
When the data set is divided into 2/3 training data and 1/3
test data, the training data has 15,160 rows and 1 column
(the comment text).
After vectorization, there are 15,160 rows by 10,846
columns, one column per distinct token.
MACHINE LEARNING
At this point, the vectorized data can be used in any
machine learning method.
You can also explore the resulting models, to find out the
important tokens.
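As one possible illustration, the sketch below trains a small multinomial Naive Bayes classifier in plain Python on token counts and then lists the tokens that most favor one category. Naive Bayes is just one choice among many methods and is not necessarily the model used in this presentation. The labels, training comments, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training comments, already tokenized.
train = [
    (["love", "dress", "pretty"], "recommend"),
    (["love", "fit", "perfect"], "recommend"),
    (["poor", "fit", "returned"], "not_recommend"),
    (["poor", "quality", "returned"], "not_recommend"),
]

vocabulary = sorted({t for tokens, _ in train for t in tokens})
token_counts = defaultdict(Counter)   # label -> token -> count
label_counts = Counter()              # label -> number of comments
for tokens, label in train:
    token_counts[label].update(tokens)
    label_counts[label] += 1

def log_likelihood(token, label):
    """Laplace-smoothed log P(token | label)."""
    total = sum(token_counts[label].values())
    return math.log((token_counts[label][token] + 1) / (total + len(vocabulary)))

def predict(tokens):
    """Return the label with the highest log posterior; unseen tokens are ignored."""
    scores = {}
    for label in label_counts:
        prior = math.log(label_counts[label] / len(train))
        scores[label] = prior + sum(
            log_likelihood(t, label) for t in tokens if t in vocabulary
        )
    return max(scores, key=scores.get)

def top_tokens(label, other, k=2):
    """Tokens whose likelihood most favors `label` over `other`."""
    def ratio(token):
        return log_likelihood(token, label) - log_likelihood(token, other)
    return sorted(vocabulary, key=ratio, reverse=True)[:k]

print(predict(["love", "pretty", "dress"]))
print(predict(["poor", "quality"]))
print(top_tokens("recommend", "not_recommend"))
```

The log-likelihood ratio used in `top_tokens` is one simple way to see which tokens drive a prediction. Other models expose similar information through coefficients or feature importances.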
TOPIC MODELING
There are a number of methods to explore text to find
clusters and groups (‘topics’).
QUESTIONS?
REFERENCES
TextBlob:
• Introducing TextBlob (https://towardsdatascience.com/having-fun-with-textblob-7e9eed783d3f)
• Tutorial: QuickStart (https://textblob.readthedocs.io/en/dev/)
Sentiment Scoring:
• Statistical Sentiment-Analysis for Survey Data using Python (https://towardsdatascience.com/statistical-sentiment-analysis-for-survey-data-using-python-9c824ef0c9b0)
• Opinion Mining Of Survey Comments (https://towardsdatascience.com/https-medium-com-sacharath-opinion-mining-of-survey-comments-14e3fc902b10)
A comparison of methods:
• NLP Pipeline: Word Tokenization (Part 1) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3)
• NLP Pipeline: Part of Speech (Part 2) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-part-of-speech-part-2-b683c90e327d)
• NLP Pipeline: Lemmatization (Part 3) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-lemmatization-part-3-4bfd7304957)
• NLP Pipeline: Stemming (Part 4) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stemming-part-4-b60a319fd52)
• NLP Pipeline: Stop words (Part 5) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stop-words-part-5-d6770df8a936)
• NLP Pipeline: Sentence Tokenization (Part 6) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-sentence-tokenization-part-6-86ed55b185e6)
NLTK, Tokenizing, etc.:
• NLTK documentation (https://www.nltk.org/)
• Tutorial: Extracting Keywords with TF-IDF and Python’s Scikit-Learn (https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.Xp9NsZl7mUl)
• Tf-idf (https://en.wikipedia.org/wiki/Tf-idf)
• Scikit-learn site, ‘Working With Text Data’ (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
