Barry DeCicco presented on text analysis techniques including sentiment scoring using TextBlob and machine learning category prediction using NLTK and scikit-learn. The presentation covered tokenizing text into words and standardizing tokens, creating sentiment scores in TextBlob, and vectorizing tokenized text for use in machine learning models. Methods like count vectorization and TF-IDF weighting were discussed for creating feature vectors from token counts.
3. CREDITS
(UP FRONT!)
Almost everything I've learned about text analytics I learned from posters on Medium.com, particularly the 'Towards Data Science' publication.
Medium.com has a $5/year subscription, which, for the knowledge I've gained, is a better value than most free resources.
5. WHAT IS SENTIMENT SCORING?
This means assigning a positive/negative score to each
piece of text (e.g., comment in a survey, customer review
for a purchase, etc.).
These scores can then be tracked over time, or
associated with various cuts in the data (department,
division, product, customer demographic).
The tool used here will be the Python module TextBlob.
6. TEXTBLOB
TextBlob is a Python package which does a lot of things
with text:
Spelling correction
Noun phrase extraction
Part-of-speech tagging
Tokenization (splitting text into words and sentences)
Sentiment analysis
7. CREATING A TEXTBLOB
Install the package.
In a Python program, import it:
from textblob import TextBlob
Run it on some text:
8. CREATING A TEXTBLOB
text = "Absolutely wonderful - silky and sexy and
comortable“ [note misspelling]
text_lower=text.lower()
blob_pre = TextBlob(text_lower)
blob=blob_pre.correct()
sentiment = blob.sentiment
polarity = sentiment.polarity
subjectivity = sentiment.subjectivity
9. CREATING A TEXTBLOB - RESULTS
text:          Absolutely wonderful - silky and sexy and comortable
text_lower:    absolutely wonderful - silky and sexy and comortable
blob_pre:      absolutely wonderful - silky and sexy and comortable
blob:          absolutely wonderful - silk and sex and comfortable
sentiment:     Sentiment(polarity=0.7, subjectivity=0.9)
polarity:      0.7   [on a scale of -1 to 1]
subjectivity:  0.9
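Scaled up, the same scoring can be applied to a whole column of comments; a minimal sketch (mine, not from the deck) using pandas with placeholder data:
import pandas as pd
from textblob import TextBlob

df = pd.DataFrame({'Comment': ["Absolutely wonderful - silky and sexy and comfortable",
                               "Runs small, had to return it"]})   # placeholder data
# Polarity for each comment; .correct() is omitted here because spelling correction is slow on large data.
df['polarity'] = df['Comment'].apply(lambda t: TextBlob(str(t).lower()).sentiment.polarity)
print(df)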
14. TOKENIZATION
Most comments are unique, so a data set with 10,000 comments has close to 10,000 unique values for that variable. That generally makes analysis futile.
Therefore the text values are tokenized:
Break text into sentences,
Break sentences into words,
'Standardize' the words (e.g., reduce them to a root form, singularizing plurals, setting verbs to present tense, and possibly removing stop words).
This converts 10,000 unique values into a much smaller set of tokens. Each text field is now a list of standardized tokens.
15. COMMENTS ON TOKENIZATION
There are a variety of tools and methods/settings in Python for tokenizing. This presentation uses NLTK (the Natural Language Toolkit).
There are trade-offs (see the sketch below):
Stemming trims words to a root that is not necessarily a valid word ('riding' => 'rid').
Lemmatization attempts to find a proper dictionary root ('riding' => 'ride').
Spelling correction is far from perfect, and can really slow down a program, depending on the misspellings.
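For illustration, a minimal sketch (mine, not from the deck) contrasting NLTK's Porter stemmer with its WordNet lemmatizer; it assumes the WordNet data has been downloaded (nltk.download('wordnet')):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "dresses", "ordered"]:
    # Stemming applies suffix-stripping rules and can produce non-words ('studies' -> 'studi');
    # lemmatization looks words up in WordNet and returns dictionary forms ('studies' -> 'study').
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))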
16. NLTK PROCESSING
import re
import nltk
from nltk.stem import WordNetLemmatizer

stopwords = set(nltk.corpus.stopwords.words('english'))   # requires nltk.download('stopwords')
lemmatizer = WordNetLemmatizer()                          # requires nltk.download('wordnet')

text = "Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5'8\". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite."
text_fixed = re.sub(r"’", r"'", text)                     # fix an oddity in import (curly apostrophes)
text_lower = text_fixed.lower()
word_tokens = nltk.word_tokenize(text_lower)              # requires nltk.download('punkt')
removing_stopwords = [word for word in word_tokens if word not in stopwords]
lemmatized_word = [lemmatizer.lemmatize(word) for word in removing_stopwords]
line = ' '.join(map(str, lemmatized_word))
print(line)
17. NLTK PROCESSING - RESULTS
love dress 's sooo pretty happened find store 'm glad bc
never would ordered online bc 's petite bought petite 5 ' 8
'' love length me- hit little knee would definitely true midi
someone truly petite
18. COUNT VECTORIZATION
One way to approach the problem of predictors is to create a set of predictors based on the tokens for each comment.
A dictionary (vocabulary) is compiled from the 'words' (tokens) in the set of comments, and a number is assigned to each token.
The token numbers and their counts can then be used as predictors for each comment.
Two common ways are:
Count vectorization (see the sketch below),
TF-IDF vectorization.
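A minimal count-vectorization sketch (my example, not the presenter's code), using scikit-learn's CountVectorizer on a few made-up comments (get_feature_names_out requires scikit-learn 1.0+):
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "love this dress",
    "dress runs small",
    "love love love it",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)       # sparse matrix: one row per comment, one column per vocabulary token
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())                           # raw token counts for each comment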
19. TF-IDF VECTORIZATION
An importance weight can be assigned to each token, using the Term Frequency - Inverse Document Frequency (TF-IDF) method.
In this method, a higher count for a token within a comment ('document') makes the token more significant, but a higher count for that token across the entire set of comments (documents) makes it less important.
20. TF-IDF VECTORIZATION (CONTINUED)
The concept is that a token which appears a lot in a given comment ('document') gets up-weighted: Term Frequency.
However, the more commonly that token appears across the overall set of comments, the more it gets down-weighted: Inverse Document Frequency.
For example, 'the', 'and', 'or' would generally get a very low weight. This can be used to automatically disregard stop words.
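The standard textbook weighting (not spelled out on the slide; scikit-learn's TfidfVectorizer adds smoothing and normalization on top of this) is:
tfidf(t, d) = tf(t, d) * log(N / df(t))
where tf(t, d) is the count of token t in comment d, N is the total number of comments, and df(t) is the number of comments containing t.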
21. EXAMPLE OF TF-IDF VECTORIZATION
When the data set is divided into 2/3 training data and 1/3 test data, the training data starts as 15,160 rows and 1 column (the raw comment text).
After vectorization, it is 15,160 rows by 10,846 columns (one column per vocabulary token).
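A sketch of how that split-then-vectorize step could look (mine, not the presenter's code; the file name 'reviews.csv' and column name 'Review Text' are placeholders):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

comments = pd.read_csv('reviews.csv')['Review Text'].dropna()     # placeholder file/column names
train_text, test_text = train_test_split(comments, test_size=1/3, random_state=42)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)   # learn the vocabulary and idf weights from training data only
X_test = vectorizer.transform(test_text)         # apply the same vocabulary and weights to the test data
print(X_train.shape)                             # (training rows, vocabulary size), e.g. 15,160 x 10,846 above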
22. MACHINE LEARNING
At this point, the vectorized data can be used with any machine learning method.
You can also explore the resulting models to find the important tokens (see the sketch below).
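For illustration, a self-contained sketch (mine, with made-up comments and labels) of fitting a classifier on TF-IDF features and reading the most influential tokens off its coefficients:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

comments = ["love this dress", "runs small returned it", "so comfortable and pretty",
            "poor quality fabric", "fits perfectly love it", "terrible stitching sent it back"]
labels = [1, 0, 1, 0, 1, 0]                      # 1 = positive/recommended, 0 = negative (made up)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

model = LogisticRegression()
model.fit(X, labels)

tokens = vectorizer.get_feature_names_out()
order = np.argsort(model.coef_[0])               # large positive coefficients push toward class 1
print("most negative tokens:", [tokens[i] for i in order[:3]])
print("most positive tokens:", [tokens[i] for i in order[-3:]])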
23. TOPIC MODELING
There are a number of methods for exploring text to find clusters and groups ('topics'); one common method is sketched below.
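One such method (my example; the slide does not name a specific technique) is Latent Dirichlet Allocation (LDA), available in scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = ["love the dress fits great", "dress runs small", "shipping was slow",
            "package arrived late", "great fit and fabric", "delivery took two weeks"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # ask for two topics
lda.fit(X)

tokens = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):             # top tokens for each discovered topic
    top = weights.argsort()[-4:][::-1]
    print(f"topic {topic_idx}:", [tokens[i] for i in top])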
26. REFERENCES
Sentiment Scoring:
Statistical Sentiment-Analysis for Survey Data using Python (https://towardsdatascience.com/statistical-sentiment-analysis-for-survey-data-using-python-9c824ef0c9b0)
Opinion Mining Of Survey Comments (https://towardsdatascience.com/https-medium-com-sacharath-opinion-mining-of-survey-comments-14e3fc902b10)
27. REFERENCES
A comparison of methods:
NLP Pipeline: Word Tokenization (Part 1) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3)
NLP Pipeline: Part of Speech (Part 2) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-part-of-speech-part-2-b683c90e327d)
NLP Pipeline: Lemmatization (Part 3) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-lemmatization-part-3-4bfd7304957)
NLP Pipeline: Stemming (Part 4) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stemming-part-4-b60a319fd52)
NLP Pipeline: Stop words (Part 5) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-stop-words-part-5-d6770df8a936)
NLP Pipeline: Sentence Tokenization (Part 6) by Edward Ma (https://medium.com/@makcedward/nlp-pipeline-sentence-tokenization-part-6-86ed55b185e6)
28. REFERENCES
NLTK, Tokenizing, etc.:
NLTK documentation (https://www.nltk.org/)
Tutorial: Extracting Keywords with TF-IDF and Python's Scikit-Learn (https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.Xp9NsZl7mUl)
Tf-idf (https://en.wikipedia.org/wiki/Tf-idf)
Scikit-learn site, 'Working With Text Data' (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)