Barry DeCicco presented on text analysis techniques including sentiment scoring using TextBlob and machine learning category prediction using NLTK and scikit-learn. The presentation covered tokenizing text into words and standardizing tokens, creating sentiment scores in TextBlob, and vectorizing tokenized text for use in machine learning models. Methods like count vectorization and TF-IDF weighting were discussed for creating feature vectors from token counts.
3. CREDITS
(UP FRONT!)
Almost everything I've learned about text analytics I learned from posters on Medium.com, particularly the 'Towards Data Science' publication.
Medium.com has a $5/year subscription, which, for the knowledge I've gained, is a better value than most free resources.
5. WHAT IS SENTIMENT SCORING?
This means assigning a positive/negative score to each
piece of text (e.g., comment in a survey, customer review
for a purchase, etc.).
These scores can then be tracked over time, or
associated with various cuts in the data (department,
division, product, customer demographic).
The tool used here will be the Python module TextBlob.
6. TEXTBLOB
TextBlob is a Python package which does a lot of things
with text:
Spelling correction
Noun phrase extraction
Part-of-speech tagging
Tokenization (splitting text into words and sentences)
Sentiment analysis
7. CREATING A TEXTBLOB
Install the package.
In a Python program, import it:
from textblob import TextBlob
Run it on some text:
8. CREATING A TEXTBLOB
text = "Absolutely wonderful - silky and sexy and
comortable“ [note misspelling]
text_lower=text.lower()
blob_pre = TextBlob(text_lower)
blob=blob_pre.correct()
sentiment = blob.sentiment
polarity = sentiment.polarity
subjectivity = sentiment.subjectivity
9. CREATING A TEXTBLOB - RESULTS
text:          Absolutely wonderful - silky and sexy and comortable
text_lower:    absolutely wonderful - silky and sexy and comortable
blob_pre:      absolutely wonderful - silky and sexy and comortable
blob:          absolutely wonderful - silk and sex and comfortable
sentiment:     Sentiment(polarity=0.7, subjectivity=0.9)
polarity:      0.7   [on a scale of -1 to 1]
subjectivity:  0.9
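Scaled up, the same scoring can be applied to a whole column of comments; a minimal sketch (mine, not from the deck) using pandas with placeholder data:
import pandas as pd
from textblob import TextBlob

df = pd.DataFrame({'Comment': ["Absolutely wonderful - silky and sexy and comfortable",
                               "Runs small, had to return it"]})   # placeholder data
# Polarity for each comment; .correct() is omitted here because spelling correction is slow on large data.
df['polarity'] = df['Comment'].apply(lambda t: TextBlob(str(t).lower()).sentiment.polarity)
print(df)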
14. TOKENIZATION
Most comments are unique, so a data set with 10,000 comments has close to 10,000 unique values for that variable. That generally makes analysis futile.
Therefore the text values are tokenized:
Break text into sentences,
Break sentences into words,
'Standardize' the words (e.g., reduce them to a root form, singularizing plurals, setting verbs to present tense, and possibly removing stop words).
This converts 10,000 unique values into a much smaller set of tokens. Each text field is now a list of standardized tokens.
15. COMMENTS ON TOKENIZATION
There are a variety of tools and methods/settings in Python for tokenizing. This presentation uses NLTK (the Natural Language Toolkit).
There are trade-offs (see the sketch below):
Stemming trims words to a root that is not necessarily a valid word ('riding' => 'rid').
Lemmatization attempts to find a proper dictionary root ('riding' => 'ride').
Spelling correction is far from perfect, and can really slow down a program, depending on the misspellings.
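For illustration, a minimal sketch (mine, not from the deck) contrasting NLTK's Porter stemmer with its WordNet lemmatizer; it assumes the WordNet data has been downloaded (nltk.download('wordnet')):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "dresses", "ordered"]:
    # Stemming applies suffix-stripping rules and can produce non-words ('studies' -> 'studi');
    # lemmatization looks words up in WordNet and returns dictionary forms ('studies' -> 'study').
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))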
16. NLTK PROCESSING
import re
import nltk
from nltk.stem import WordNetLemmatizer

stopwords = set(nltk.corpus.stopwords.words('english'))   # requires nltk.download('stopwords')
lemmatizer = WordNetLemmatizer()                          # requires nltk.download('wordnet')

text = "Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5'8\". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite."
text_fixed = re.sub(r"’", r"'", text)                     # fix an oddity in import (curly apostrophes)
text_lower = text_fixed.lower()
word_tokens = nltk.word_tokenize(text_lower)              # requires nltk.download('punkt')
removing_stopwords = [word for word in word_tokens if word not in stopwords]
lemmatized_word = [lemmatizer.lemmatize(word) for word in removing_stopwords]
line = ' '.join(map(str, lemmatized_word))
print(line)
17. NLTK PROCESSING - RESULTS
love dress 's sooo pretty happened find store 'm glad bc
never would ordered online bc 's petite bought petite 5 ' 8
'' love length me- hit little knee would definitely true midi
someone truly petite
18. COUNT VECTORIZATION
One way to approach the problem of predictors is to create a set of predictors based on the tokens for each comment.
A dictionary (vocabulary) is compiled from the 'words' (tokens) in the set of comments, and a number is assigned to each token.
The token numbers and their counts can then be used as predictors for each comment.
Two common ways are:
Count vectorization (see the sketch below),
TF-IDF vectorization.
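A minimal count-vectorization sketch (my example, not the presenter's code), using scikit-learn's CountVectorizer on a few made-up comments (get_feature_names_out requires scikit-learn 1.0+):
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "love this dress",
    "dress runs small",
    "love love love it",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)       # sparse matrix: one row per comment, one column per vocabulary token
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # raw token counts for each comment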
19. TF-IDF VECTORIZATION
An importance weight can be assigned to each token, using the Term Frequency - Inverse Document Frequency (TF-IDF) method.
In this method, a higher count for a token within a comment ('document') makes the token more significant, but a higher count for that token across the entire set of comments (documents) makes it less important.
20. TF-IDF VECTORIZATION (CONTINUED)
The concept is that a token which appears a lot in a given comment ('document') gets up-weighted: Term Frequency.
However, the more commonly that token appears across the overall set of comments, the more it gets down-weighted: Inverse Document Frequency.
For example, 'the', 'and', 'or' would generally get a very low weight. This can be used to automatically disregard stop words.
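The standard textbook weighting (not spelled out on the slide; scikit-learn's TfidfVectorizer adds smoothing and normalization on top of this) is:
tfidf(t, d) = tf(t, d) * log(N / df(t))
where tf(t, d) is the count of token t in comment d, N is the total number of comments, and df(t) is the number of comments containing t.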
21. EXAMPLE OF TF-IDF VECTORIZATION
When the data set is divided into 2/3 training data and 1/3 test data, the training data starts as 15,160 rows and 1 column (the raw comment text).
After vectorization, it is 15,160 rows by 10,846 columns (one column per vocabulary token).
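A sketch of how that split-then-vectorize step could look (mine, not the presenter's code; the file name 'reviews.csv' and column name 'Review Text' are placeholders):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

comments = pd.read_csv('reviews.csv')['Review Text'].dropna()     # placeholder file/column names
train_text, test_text = train_test_split(comments, test_size=1/3, random_state=42)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)   # learn the vocabulary and idf weights from training data only
X_test = vectorizer.transform(test_text)         # apply the same vocabulary and weights to the test data
print(X_train.shape)                             # (training rows, vocabulary size), e.g. 15,160 x 10,846 above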
22. MACHINE LEARNING
At this point, the vectorized data can be used with any machine learning method.
You can also explore the resulting models to find the important tokens (see the sketch below).
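For illustration, a self-contained sketch (mine, with made-up comments and labels) of fitting a classifier on TF-IDF features and reading the most influential tokens off its coefficients:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

comments = ["love this dress", "runs small returned it", "so comfortable and pretty",
            "poor quality fabric", "fits perfectly love it", "terrible stitching sent it back"]
labels = [1, 0, 1, 0, 1, 0]                      # 1 = positive/recommended, 0 = negative (made up)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

model = LogisticRegression()
model.fit(X, labels)

tokens = vectorizer.get_feature_names_out()
order = np.argsort(model.coef_[0])               # large positive coefficients push toward class 1
print("most negative tokens:", [tokens[i] for i in order[:3]])
print("most positive tokens:", [tokens[i] for i in order[-3:]])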
23. TOPIC MODELING
There are a number of methods for exploring text to find clusters and groups ('topics'); one common method is sketched below.
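One such method (my example; the slide does not name a specific technique) is Latent Dirichlet Allocation (LDA), available in scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = ["love the dress fits great", "dress runs small", "shipping was slow",
            "package arrived late", "great fit and fabric", "delivery took two weeks"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # ask for two topics
lda.fit(X)

tokens = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):             # top tokens for each discovered topic
    top = weights.argsort()[-4:][::-1]
    print(f"topic {topic_idx}:", [tokens[i] for i in top])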
26. REFERENCES
Sentiment Scoring:
Statistical Sentiment-Analysis for Survey Data using Python (https://towardsdatascience.com/statistical-sentiment-analysis-for-survey-data-using-python-9c824ef0c9b0)
Opinion Mining Of Survey Comments (https://towardsdatascience.com/https-medium-com-sacharath-opinion-mining-of-survey-comments-14e3fc902b10)
27. REFERENCES
A comparison of methods:
NLP Pipeline: Word Tokenization (Part 1) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3)
NLP Pipeline: Part of Speech (Part 2) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-part-of-speech-part-2-b683c90e327d)
NLP Pipeline: Lemmatization (Part 3) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-lemmatization-part-3-4bfd7304957)
NLP Pipeline: Stemming (Part 4) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stemming-part-4-b60a319fd52)
NLP Pipeline: Stop words (Part 5) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stop-words-part-5-d6770df8a936)
NLP Pipeline: Sentence Tokenization (Part 6) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-sentence-tokenization-part-6-86ed55b185e6)
28. REFERENCES
NLTK, Tokenizing, etc.:
NLTK documentation (https://www.nltk.org/)
Tutorial: Extracting Keywords with TF-IDF and Python's Scikit-Learn (https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.Xp9NsZl7mUl)
Tf-idf (https://en.wikipedia.org/wiki/Tf-idf)
Scikit-learn site, 'Working With Text Data' (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)