Customer Service Analytics - Make Sense of All Your Data.pptx
Natural Language Processing sample code by Aiden
1. 2/5/2020 end-to-end NLP in Python Sample Code by Aiden
localhost:8888/nbconvert/html/Desktop/PythonRelated/NLP/nlp-in-python-tutorial-master/end-to-end NLP in Python Sample Code by Aiden.ipynb?download=false 1/35
0. Workflow
1. Start with a Question
2. Get & Clean the Data
3. Perform Exploratory Data Analysis
4. Apply Techniques
5. Share Insights
1. Data Cleaning
1.1 Introduction
Part one covers a necessary step of any data science project: data cleaning. Data cleaning is a time-consuming
and unenjoyable task, yet it is a very important one. Keep in mind "garbage in, garbage out":
feeding dirty data into a model will give us meaningless results.
Specifically, we'll be walking through:
1. Getting the data - in this case, we'll be scraping data from a website
2. Cleaning the data - we will walk through popular text pre-processing techniques
3. Organizing the data - we will organize the cleaned data in a form that is easy to feed into other
algorithms
The output of part one will be clean, organized data in two standard text formats:
1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format
1.2 Problem Statement
My goal is to look at transcripts of various comedians and note their similarities and differences. Specifically, I'd
like to know whether Trevor Noah's comedy style differs from that of other comedians, since he's the comedian who got
me interested in stand-up comedy.
1.3 Getting The Data
Scraps From The Loft (http://scrapsfromtheloft.com) keeps track of stand-up routine transcripts and makes
them available for non-profit and educational purposes. To decide which comedians to look into, I went on
IMDB and looked specifically at comedy specials released in the past 5 years. To narrow it down
further, I looked only at those with a rating above 7.5/10 and more than 2000 votes. If a comedian had
multiple specials that fit those requirements, I picked the most highly rated one. I ended up with 13 comedy
specials.
Packages for Web Scraping:
1. Requests - make HTTP requests
2. Beautiful Soup - parse HTML documents
Package for Python Data Processing:
1. Pickle - serialize Python objects
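As a small illustration of what pickle does in this workflow, the sketch below serializes a list of strings to bytes and restores it. The sample list is made up for illustration; the notebook itself writes one file per comedian instead of keeping the bytes in memory.

```python
import pickle

# Hypothetical stand-in for one scraped transcript (a list of paragraphs)
paragraphs = ["Thank you, thank you.", "So good to be here."]

# Serialize to bytes, then restore; pickle.dump / pickle.load do the same with files
blob = pickle.dumps(paragraphs)
restored = pickle.loads(blob)

print(restored == paragraphs)
```

Only unpickle data you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.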
In [1]: # Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="post-content").find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope
urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2018/11/21/trevor-noah-son-of-patricia-transcript/']

# Comedian names
comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe', 'trevor']
In [5]: # Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]
http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/
http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/
http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/
http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/
http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/
http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/
http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/
http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/
http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/
http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/
http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/
http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/
http://scrapsfromtheloft.com/2018/11/21/trevor-noah-son-of-patricia-transcript/
In [6]: # Pickle files for later use
# Make a new directory to hold the text files
# !mkdir transcripts
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)
In [7]: # Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
1.4 Cleaning The Data
When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing
with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as
text pre-processing techniques.
With text data, this cleaning process can go on forever, because there is always an exception to every cleaning step. So
we're going to follow the MVP (minimum viable product) approach: start simple and iterate. Below are some of the
things you can do to clean your data. We will execute just the common cleaning steps here; the rest
can be done later to improve our results.
Common data cleaning steps on all text:
Make text all lower case
Remove punctuation
Remove numerical values
Remove common nonsensical text (\n)
Tokenize text
Remove stop words
More data cleaning steps after tokenization:
Stemming / lemmatization
Parts of speech tagging
Create bi-grams or tri-grams
Deal with typos
And more...
Packages for round 1 cleaning:
1. re - regular expression operations/matching/substitution
2. string - common string operations/punctuation
Packages for round 2 cleaning:
1. nltk - stemming/lemmatization
2. gensim - tokenization, excluding stop words
In [14]: # We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
In [54]: # We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)
data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
In [18]: # Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)   # text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)  # words containing digits
    text = re.sub('[‘’“”…]', '', text)    # curly quotes and ellipses
    text = re.sub('\n', '', text)         # newlines
    return text

round1 = lambda x: clean_text_round1(x)
In [61]: # Let's take a look at the updated text
data_clean1 = pd.DataFrame(data_df.transcript.apply(round1))
data_clean1.iloc[0]
Out[61]: transcript    ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
Name: ali, dtype: object
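To see what each substitution in a first cleaning round of this kind does, here is a minimal sketch that applies the intended patterns, with the backslashes written out explicitly, to a made-up line. The sample string is an assumption for illustration and does not come from the transcripts.

```python
import re
import string

# Made-up sample line, not taken from any transcript
sample = "[Audience cheering] Hello, Chicago!\n It's 2015 and I'm 33 years old…"

text = sample.lower()
text = re.sub(r'\[.*?\]', '', text)                               # drop bracketed stage directions
text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
text = re.sub(r'\w*\d\w*', '', text)                              # drop words containing digits
text = re.sub('[‘’“”…]', '', text)                                # drop curly quotes and ellipses
text = re.sub('\n', '', text)                                     # drop newlines

tokens = text.split()
print(tokens)
```

Note that newlines are replaced with an empty string, so two words separated only by a newline would be glued together; replacing with a space instead is a possible refinement.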
In [52]: # Apply a second round of text cleaning techniques
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
#nltk.download('wordnet')
stemmer = SnowballStemmer('english')

def lemmatize_stemming(word):
    return stemmer.stem(WordNetLemmatizer().lemmatize(word, pos='v'))

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def clean_text_round2(text):
    '''Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
    Excluding stopwords
    '''
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    result = ' '.join(result)
    return result

round2 = lambda x: clean_text_round2(x)
In [62]: # Let's take a look at the updated text
data_clean2 = pd.DataFrame(data_clean1.transcript.apply(round2))
data_clean2.iloc[0]
Out[62]: transcript    ladi gentlemen welcom stage wong hello welcom thank thank come hello hello gonna shit caus like minut thank everybodi come excit excit year turn y...
Name: ali, dtype: object
1.5 Organizing The Data
The output of part one will be clean, organized data in two standard text formats:
1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format
Corpus
In [75]: # Let's add the comedians' full names as well
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais', 'Trevor Noah']
data_df['full_name'] = full_names

# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")
Document-Term Matrix
The text must be tokenized, meaning broken down into smaller pieces. The most common tokenization
technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row
will represent a different document and every column will represent a different word.
In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no
additional meaning to text such as 'a', 'the', etc.
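To make the document-term matrix concrete, here is a minimal sketch that builds one by hand with the standard library. It is not how CountVectorizer is implemented; the two documents and the tiny stop list are made up for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus: document name -> cleaned text
docs = {
    "doc_a": "the joke is a good joke",
    "doc_b": "a snake ate the taco",
}
stop_words = {"a", "the", "is"}  # tiny illustrative stop list

# Count the non-stop-word tokens of each document
counts = {name: Counter(w for w in text.split() if w not in stop_words)
          for name, text in docs.items()}

# Columns are the sorted vocabulary; rows are documents
vocab = sorted(set().union(*counts.values()))
dtm = {name: [c[w] for w in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows, which is why libraries usually store this matrix in a sparse format.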
In [80]: # We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean2.transcript)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean2.index
data_dtm
In [82]: # Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean2.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)
Out[80]:
         aaaaah  aaaaahhhhhhh  aaaahhhhh  aaah  abandon  abc  abil  abject  abl  ablebodi  ...
ali           0             0          0     0        0    0     0       0    2         0  ...
anthony       0             0          0     0        0    0     0       0    0         0  ...
bill          1             0          0     0        0    1     0       0    1         0  ...
bo            0             1          1     0        0    0     1       0    0         0  ...
dave          0             0          0     1        0    0     0       0    0         0  ...
hasan         0             0          0     0        0    0     0       0    1         0  ...
jim           0             0          0     0        0    0     0       0    1         2  ...
joe           0             0          0     0        0    0     0       0    2         0  ...
john          0             0          0     0        0    0     0       0    3         0  ...
louis         0             0          0     0        0    0     0       0    1         0  ...
mike          0             0          0     0        0    0     0       0    0         0  ...
ricky         0             0          0     0        0    0     1       1    2         0  ...
trevor        0             0          0     0        1    0     0       0    0         0  ...
13 rows × 5319 columns
2. Exploratory Data Analysis
2.1 Introduction
After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at
the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always
important to explore the data first.
When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include
finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the
same when working with text data. We are going to find some of the more obvious patterns with EDA before
identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for
each comedian:
1. Most common words - find these and create word clouds
2. Size of vocabulary - look at the number of unique words and also how quickly someone speaks
3. Amount of profanity - most common terms
2.2 Most Common Words
Analysis
In [84]: # Read in the document-term matrix
import pandas as pd
data = pd.read_pickle('dtm.pkl')
data = data.transpose()
In [87]: # Find the top 30 words said by each comedian
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))
#pprint(top_dict)
In [88]: # Print the top 14 words said by each comedian
for comedian, top_words in top_dict.items():
    print(comedian)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')
ali
like, know, dont, shit, gonna, come, wanna, gotta, time, husband, fuck, tell, right, women
---
anthony
joke, like, know, dont, say, thing, anthoni, tell, think, guy, peopl, time, fuck, love
---
bill
like, right, fuck, know, dont, gonna, yeah, come, shit, think, want, dude, peopl, thing
---
bo
know, think, like, love, fuck, stuff, repeat, want, dont, yeah, right, say, slut, peopl
---
dave
like, know, say, fuck, shit, peopl, didnt, ahah, dont, time, black, come, look, good
---
hasan
like, know, dont, want, look, love, time, shes, hasan, right, come, walk, fuck, say
---
jim
fuck, like, dont, right, know, come, think, say, thing, peopl, want, gun, theyr, good
---
joe
like, fuck, peopl, dont, think, know, gonna, theyr, shit, thing, right, hous, dude, look
---
john
like, know, dont, say, walk, clinton, right, time, think, littl, look, thing, peopl, caus
---
louis
like, know, dont, thing, life, peopl, gonna, think, shit, caus, say, happen, look, murder
---
mike
like, say, know, dont, think, jenni, caus, right, point, want, mean, gonna, come, friend
---
ricky
right, like, say, know, fuck, dont, yeah, thing, joke, think, year, peopl, didnt, littl
---
trevor
like, know, say, dont, snake, taco, peopl, come, yeah, right, want, thing, think, time
---
NOTE: At this point, we could go on and create word clouds. However, by looking at these top words, you can
see that some of them have very little meaning and could be added to a stop words list, so let's do just that.
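The selection idea can be sketched on toy data: count in how many routines a word shows up as a top word, then treat words above a threshold as extra stop words. The routine names, word lists and threshold below are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical top-word lists for three routines
top_words = {
    "r1": ["like", "know", "husband"],
    "r2": ["like", "know", "joke"],
    "r3": ["like", "snake", "taco"],
}

# Each word appears at most once per list, so the count is the number of routines
routine_counts = Counter(w for words in top_words.values() for w in words)

# Treat a word as a stop-word candidate if it is a top word in more than 2 routines
threshold = 2
candidates = [w for w, n in routine_counts.most_common() if n > threshold]
print(candidates)
```

Words that only one routine favors, such as "snake" or "taco" here, stay in the vocabulary, which is the point: they are the ones that distinguish the routines.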
In [90]: # Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each comedian
words = []
for comedian in data.columns:
    top = [word for (word, count) in top_dict[comedian]]
    for t in top:
        words.append(t)
In [105]: # Let's aggregate this list and identify the most common words along with how many routines they occur in
Counter(words).most_common()

# If more than 10 of the 13 comedians have it as a top word, exclude it from the list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 10]
add_stop_words
Out[105]: ['like', 'know', 'dont', 'come', 'time', 'right', 'peopl', 'think', 'say', 'look', 'want', 'thing']
In [106]: # Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words (as a list, which newer scikit-learn versions require)
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate document-term matrix
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
with open("cv_stop.pkl", "wb") as f:
    pickle.dump(cv, f)
data_stop.to_pickle("dtm_stop.pkl")
In [107]: # Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud
wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)
In [108]: # Reset the output dimensions
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 6]
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais', 'Trevor Noah']

# Create subplots for each comedian
for index, comedian in enumerate(data.columns):
    wc.generate(data_clean.transcript[comedian])
    plt.subplot(4, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(full_names[index])
plt.show()
Findings
Trevor Noah says "yeah" a lot and mocks himself. I guess his enthusiastic language is what I find funny.
A lot of the comedians use the F-word. Let's dig into that later.
2.3 Number of Words
Analysis
In [110]: # Find the number of unique words that each comedian uses
# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for comedian in data.columns:
    # Series.nonzero() is deprecated; go through NumPy instead
    uniques = data[comedian].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['comedian', 'unique_words'])
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort
Out[110]:
            comedian  unique_words
1   Anthony Jeselnik           732
9         Louis C.K.           817
12       Trevor Noah           874
6      Jim Jefferies           938
3         Bo Burnham           984
4     Dave Chappelle          1035
8       John Mulaney          1042
0           Ali Wong          1057
7          Joe Rogan          1057
10    Mike Birbiglia          1101
5       Hasan Minhaj          1168
2          Bill Burr          1215
11     Ricky Gervais          1218
In [111]: # Calculate the words per minute of each comedian
# Find the total number of words that a comedian uses
total_list = []
for comedian in data.columns:
    totals = sum(data[comedian])
    total_list.append(totals)

# Comedy special run times from IMDB, in minutes
run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79, 63]

# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by='words_per_minute')
data_wpm_sort
Out[111]:
            comedian  unique_words  total_words  run_times  words_per_minute
1   Anthony Jeselnik           732         2232         59         37.830508
3         Bo Burnham           984         2449         60         40.816667
0           Ali Wong          1057         2596         60         43.266667
9         Louis C.K.           817         2539         58         43.775862
6      Jim Jefferies           938         3612         77         46.909091
11     Ricky Gervais          1218         3827         79         48.443038
4     Dave Chappelle          1035         3247         67         48.462687
10    Mike Birbiglia          1101         3752         76         49.368421
5       Hasan Minhaj          1168         3652         73         50.027397
8       John Mulaney          1042         3129         62         50.467742
2          Bill Burr          1215         4208         80         52.600000
12       Trevor Noah           874         3609         63         57.285714
7          Joe Rogan          1057         3632         63         57.650794
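As a worked check of the words-per-minute column, the sketch below recomputes three rows from the total word counts and run times shown in the table above. Keep in mind that the totals count only the tokens that survived cleaning and stop word removal, so these are relative rates rather than literal speaking speeds.

```python
# (total_words, run_time_minutes) copied from the table above
routines = {
    "Anthony Jeselnik": (2232, 59),
    "Trevor Noah": (3609, 63),
    "Joe Rogan": (3632, 63),
}

# Words per minute is simply total words divided by run time
wpm = {name: total / minutes for name, (total, minutes) in routines.items()}

for name, rate in sorted(wpm.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rate:.2f}")
```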
In [112]: # Let's plot our findings
import numpy as np
y_pos = np.arange(len(data_words))
plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.comedian)
plt.title('Number of Unique Words', fontsize=20)
plt.subplot(1, 2, 2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')
plt.yticks(y_pos, data_wpm_sort.comedian)
plt.title('Number of Words Per Minute', fontsize=20)
plt.tight_layout()
plt.show()
Findings
Vocabulary
Ricky Gervais (British comedy) and Bill Burr (podcast host) use a lot of words in their comedy
Louis C.K. (self-deprecating comedy) and Anthony Jeselnik (dark humor) have a smaller vocabulary
Talking Speed
Joe Rogan (blue comedy) and Trevor Noah talk fast
Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slow
This is really interesting. Trevor Noah doesn't use many unique words in his comedy, but he talks fast. This may be
one reason why his comedy makes such an impression on audiences.
2.4 Amount of Profanity
Analysis
In [115]: # Earlier I said we'd revisit profanity. Let's take a look at the most common words again.
Counter(words).most_common()

# Let's isolate just these bad words
data_bad_words = data.transpose()[['fuck', 'shit']]
data_profanity = pd.concat([data_bad_words.fuck, data_bad_words.shit], axis=1)
data_profanity.columns = ['f_word', 's_word']
data_profanity
Out[115]:
         f_word  s_word
ali          20      36
anthony      19       9
bill        111      65
bo           40       7
dave         72      46
hasan        28      16
jim         126      20
joe         141      41
john          4       7
louis        22      28
mike          0       1
ricky        63       6
trevor        0      14
In [116]: # Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8]
for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)
    plt.xlim(-5, 155)
plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)
plt.show()
Findings
Averaging 2 F-Bombs Per Minute! - I don't like too much swearing, especially the f-word, which is
probably why I've never heard of Bill Burr, Joe Rogan and Jim Jefferies.
Clean Humor - It looks like profanity might be a good predictor of the type of comedy I like. Besides Trevor
Noah, my two other favorite comedians in this group are John Mulaney and Mike Birbiglia.
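As a rough check of the per-minute headline, the sketch below divides the f_word counts from the profanity table by the run times listed earlier for the three heaviest users. The counts cover only the stemmed token, so treat the rates as approximate.

```python
# f_word counts from the profanity table and run times (minutes) from the IMDB list
f_words = {"Bill Burr": 111, "Jim Jefferies": 126, "Joe Rogan": 141}
run_times = {"Bill Burr": 80, "Jim Jefferies": 77, "Joe Rogan": 63}

# F-words per minute for each comedian
per_minute = {name: f_words[name] / run_times[name] for name in f_words}

for name, rate in per_minute.items():
    print(f"{name}: {rate:.2f} per minute")
```

On these numbers only Joe Rogan exceeds two per minute; Bill Burr and Jim Jefferies land closer to one and a half.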
Side Note
What was our goal for the EDA portion of our journey? To be able to take an initial look at our data and see if
the results of some basic analysis made sense.
My conclusion - yes, it does, for a first pass. The results are interesting and make general sense, so we're going
to move on.
As a reminder to myself, the data science process is an iterative one. It's better to see some imperfect but
acceptable results that help you quickly decide whether your project is a dud or not, instead of having analysis
paralysis and never delivering anything.
3. Sentiment Analysis
3.1 Introduction
So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc.
These techniques could be applied to numeric data as well.
When it comes to text data, there are a few popular techniques that we'll be going through in the next few
notebooks, starting with sentiment analysis. A few key points to remember about sentiment analysis:
1. TextBlob Module: Linguistic researchers have labeled the sentiment of words based on their domain
expertise. The sentiment of a word can vary based on where it is in a sentence. The TextBlob module allows us
to take advantage of these labels.
2. Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity (there are more
labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
Subjectivity: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.
For more info on how TextBlob coded up its sentiment function, see https://planspace.org/20150607-textblob_sentiment/.
Other statistical methods such as Naive Bayes can also be used for sentiment analysis.
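To make the averaging idea concrete, here is a minimal sketch of lexicon-based scoring. The lexicon values and the function name `average_sentiment` are invented for illustration; they are not TextBlob's actual labels or API, and TextBlob applies additional rules (for example for negation and intensifiers) that this sketch ignores.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity). These values are
# invented for illustration and are NOT TextBlob's actual labels.
LEXICON = {
    "great": (0.8, 0.75),
    "lovely": (0.5, 0.75),
    "terrible": (-1.0, 1.0),
}

def average_sentiment(text):
    """Average the scores of the words found in the lexicon; (0.0, 0.0) if none."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not scores:
        return 0.0, 0.0
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

result = average_sentiment("what a lovely but terrible show")
print(result)
```

Words that are not in the lexicon contribute nothing, which is one reason long transcripts tend to average out to polarity values close to zero.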
Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.
3.2 Sentiment of Routine
In [119]: # We'll start by reading in the corpus, which preserves word order
import pandas as pd
data = pd.read_pickle('corpus.pkl')
# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity
data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data
Out[119]:
         transcript  full_name  polarity  subjectivity
ali      Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ...  Ali Wong  0.069359  0.482403
anthony  Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s...  Anthony Jeselnik  0.054285  0.559732
bill     [cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ...  Bill Burr  0.016479  0.537016
bo       Bo What? Old MacDonald had a farm E I E I O And on that farm he had a pig E I E I O Here a snort There a Old MacDonald had a farm E I E I O [Appla...  Bo Burnham  0.074514  0.539368
dave     This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ...  Dave Chappelle  -0.002690  0.513958
hasan    [theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you wa...  Hasan Minhaj  0.086856  0.460619
jim      [Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello...  Jim Jefferies  0.044224  0.523382
joe      [rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the fuck is ...  Joe Rogan  0.004968  0.551628
john     All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see yo...  John Mulaney  0.082355  0.484137
louis    Intro\nFade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily a...  Louis C.K.  0.056665  0.515796
mike     Wow. Hey, thank you. Thanks. Thank you, guys. Hey, Seattle. Nice to see you. Look at this. Look at us. We’re here. This is crazy. It’s insane. So ...  Mike Birbiglia  0.092927  0.518476
ricky    Hello. Hello! How you doing? Great. Thank you. Wow. Calm down. Shut the fuck up. Thank you. What a lovely welcome. I’m gonna try my hardest tonigh...  Ricky Gervais  0.066489  0.497313
trevor   A NETFLIX ORIGINAL COMEDY SPECIAL [distant traffic] LIVE NATION PRESENTS TREVOR NOAH [presenter] Beautiful people, put your hands together for Tre...  Trevor Noah  0.096365  0.479900
In [120]: # Let's plot the results
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 8]
for index, comedian in enumerate(data.index):
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    # Look the name up by label; positional [index] on a labeled Series is deprecated
    plt.text(x+.001, y+.001, data['full_name'].loc[comedian], fontsize=10)
    plt.xlim(-.01, .12)
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)
plt.show()
Findings
Positive Energy - Trevor Noah ranks No. 1 for positive tone (polarity) in this sample pool. This is an
inspiring finding, though not a surprising one on reflection. Personally, I prefer comedy that is not only
funny but also somewhat optimistic.
Objective Tone - As a practical person, I also pay attention to how objective the content of a show is.
Trevor Noah is more objective than most of the other comedians in the pool.
3.3 Sentiment of Routine Over Time
Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time
throughout each routine.
In [124]: # Split each routine into 10 parts
import numpy as np
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits into n equal parts, with a default of 10 equal parts.'''
    # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n)
    start = np.arange(0, length, size)

    # Pull out equally sized pieces of text and put it into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list

# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in data.transcript:
    split = split_text(t)
    list_pieces.append(split)
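To make the chunking behaviour concrete, here is a minimal standard-library sketch of the same idea on a toy string. The name `split_into_chunks` is hypothetical and not part of the notebook. It mirrors `split_text` above but slices with plain integers instead of NumPy, and it shows that when the length is not divisible by n, the trailing characters are left out. Note that slicing by character position can also cut a word in half at a chunk boundary.

```python
def split_into_chunks(text, n=10):
    """Split text into n equal-length pieces, dropping any leftover characters."""
    size = len(text) // n  # same value as math.floor(len(text) / n)
    return [text[i * size:(i + 1) * size] for i in range(n)]

# Toy input: 23 characters split into 5 pieces of 4 characters each
toy = "abcdefghijklmnopqrstuvw"
pieces = split_into_chunks(toy, n=5)
print(pieces)

# The last 3 characters do not fill a whole piece and are left out
leftover = toy[len("".join(pieces)):]
print(leftover)
```

For transcripts that are tens of thousands of characters long, the handful of dropped characters should matter little for the sentiment scores.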
In [126]: # Calculate the polarity for each piece of text
polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)
In [130]: # Show the plot for all comedians
plt.rcParams['figure.figsize'] = [28, 18]

for index, comedian in enumerate(data.index):
    plt.subplot(4, 4, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0,10), np.zeros(10))
    plt.title(data['full_name'].iloc[index])
    # Positional limits work across Matplotlib versions (the ymin/ymax keywords were deprecated in 3.0)
    plt.ylim(-.2, .3)

plt.show()
Findings
Trevor Noah stays generally positive throughout his routine. Similar comedians are Louis C.K. and Ali Wong.
On the other hand, you have some pretty different patterns here like Bo Burnham who gets happier as time
passes and Dave Chappelle who has some pretty down moments in his routine.
4. Topic Modeling
4.1 Introduction
Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find
various topics that are present in your corpus. Each document in the corpus will be made up of at least one
topic, if not multiple topics.
In this notebook, we will be covering the steps on how to do Latent Dirichlet Allocation (LDA), which is one of
many topic modeling techniques. It was specifically designed for text data.
To use a topic modeling technique, we need to provide (1) a document-term matrix and (2) the number of topics
we would like the algorithm to pick up.
Once the topic modeling technique is applied, our job as humans is to interpret the results and see if the mix of
words in each topic makes sense. If it doesn't, we can try changing the number of topics, the
terms in the document-term matrix, model parameters, or even try a different model.
4.2 Topic Modeling - Attempt #1 (All Text)
In [132]: # Let's read in our document-term matrix
import pandas as pd
import pickle
data = pd.read_pickle('dtm_stop.pkl')
In [134]: # Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse
tdm = data.transpose()
In [135]: # We're going to put the term-document matrix into a new gensim format,
# from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

# Gensim also requires a dictionary of all the terms and their respective
# location in the term-document matrix
cv = pickle.load(open("cv_stop.pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
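For intuition about the two objects built above, the sketch below reproduces their shape with plain Python on two made-up documents. It does not call gensim. It assumes only that a bag-of-words corpus is a sequence of documents, each a list of (term_id, count) pairs, and that `id2word` maps a column position back to its term. The toy vocabulary and counts are invented for illustration.

```python
# Toy vocabulary in the same shape as CountVectorizer.vocabulary_: term -> column index
vocabulary = {"joke": 0, "life": 1, "snake": 2}

# Toy document-term matrix: one row per document, one column per term
dtm = [
    [2, 0, 1],  # document 0
    [0, 3, 0],  # document 1
]

# Invert the vocabulary so a column index can be turned back into a term
id2word = {index: term for term, index in vocabulary.items()}

# Bag-of-words view: keep only the non-zero (term_id, count) pairs per document
corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count > 0]
    for row in dtm
]

print(id2word)
print(corpus)
```

The transpose in the notebook is needed because `Sparse2Corpus` treats columns as documents by default, while the pickled matrix stores documents as rows.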
In [136]: # Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()
Out[136]: [(0, '0.014*"fuck" + 0.009*"shit" + 0.008*"gonna" + 0.008*"yeah" + 0.006*"didnt" + 0.006*"caus" + 0.006*"tell" + 0.005*"good" + 0.005*"love" + 0.005*"life"'),
 (1, '0.021*"fuck" + 0.009*"joke" + 0.008*"yeah" + 0.007*"theyr" + 0.007*"love" + 0.007*"gonna" + 0.006*"littl" + 0.005*"shit" + 0.005*"didnt" + 0.005*"tell"')]
In [139]: # LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()
Out[139]: [(0, '0.009*"yeah" + 0.007*"fuck" + 0.007*"love" + 0.007*"didnt" + 0.006*"walk" + 0.006*"littl" + 0.005*"mean" + 0.005*"snake" + 0.005*"year" + 0.005*"life"'),
 (1, '0.017*"fuck" + 0.009*"gonna" + 0.009*"shit" + 0.006*"yeah" + 0.006*"caus" + 0.006*"good" + 0.006*"tell" + 0.006*"theyr" + 0.006*"didnt" + 0.006*"mean"'),
 (2, '0.029*"fuck" + 0.012*"gonna" + 0.011*"shit" + 0.010*"yeah" + 0.007*"theyr" + 0.006*"caus" + 0.005*"dude" + 0.005*"littl" + 0.005*"didnt" + 0.005*"tell"')]
In [140]: # LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()
Out[140]: [(0, '0.026*"fuck" + 0.010*"shit" + 0.010*"gonna" + 0.008*"yeah" + 0.008*"theyr" + 0.007*"didnt" + 0.006*"life" + 0.006*"good" + 0.006*"love" + 0.006*"tell"'),
 (1, '0.016*"joke" + 0.008*"anthoni" + 0.008*"tell" + 0.007*"guy" + 0.006*"fuck" + 0.006*"love" + 0.006*"grandma" + 0.006*"shark" + 0.005*"babi" + 0.005*"good"'),
 (2, '0.010*"yeah" + 0.008*"love" + 0.008*"snake" + 0.007*"fuck" + 0.007*"shit" + 0.006*"taco" + 0.005*"gonna" + 0.005*"feel" + 0.005*"littl" + 0.005*"friend"'),
 (3, '0.008*"caus" + 0.007*"gonna" + 0.007*"mean" + 0.007*"friend" + 0.007*"walk" + 0.006*"point" + 0.005*"jenni" + 0.005*"tell" + 0.005*"yeah" + 0.005*"didnt"')]
These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as
well.
4.3 Topic Modeling - Attempt #2 (Nouns Only)
One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.).
Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
In [141]: # Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
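The prefix test `pos[:2] == 'NN'` works because every Penn Treebank noun tag starts with NN (NN, NNS, NNP, NNPS), and the same idea extends to adjectives (JJ, JJR, JJS). Below is a small sketch on hand-tagged tokens, so it runs without NLTK or its models. The tagged sentence is invented for illustration and is not real `pos_tag` output, and the helper name `keep_by_prefix` is hypothetical.

```python
# Hand-written (word, tag) pairs in Penn Treebank style; not produced by NLTK
tagged = [
    ("comedians", "NNS"),
    ("tell", "VBP"),
    ("funny", "JJ"),
    ("jokes", "NNS"),
    ("in", "IN"),
    ("Chicago", "NNP"),
]

def keep_by_prefix(tagged_tokens, prefixes):
    """Keep words whose tag starts with any of the given two-letter prefixes."""
    return " ".join(word for word, tag in tagged_tokens if tag[:2] in prefixes)

nouns_only = keep_by_prefix(tagged, {"NN"})
nouns_and_adjectives = keep_by_prefix(tagged, {"NN", "JJ"})
print(nouns_only)
print(nouns_and_adjectives)
```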
In [179]: # Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
In [180]: # Create a new document-term matrix using only nouns
from collections import Counter
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
# Note: `words` is assumed to be defined earlier in the notebook; it is not created in this section
add_stop_words = [word for word, count in Counter(words).most_common() if count > 10]
add_stop_words += ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                   'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
add_stop_words = list(set(add_stop_words))
# Convert to a list, since recent scikit-learn releases reject a frozenset for stop_words
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
# On scikit-learn 1.0 and later, get_feature_names_out() replaces get_feature_names()
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names())
data_dtmn.index = data_nouns.index
In [153]: # Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())
In [154]: # Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()
Out[154]: [(0, '0.012*"gon" + 0.010*"fuck" + 0.009*"theyr" + 0.009*"life" + 0.008*"caus" + 0.007*"didnt" + 0.007*"year" + 0.006*"women" + 0.006*"person" + 0.006*"joke"'),
 (1, '0.009*"fuck" + 0.009*"gon" + 0.007*"didnt" + 0.007*"life" + 0.007*"year" + 0.007*"shes" + 0.006*"guy" + 0.006*"love" + 0.006*"night" + 0.005*"shit"')]
In [155]: # Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()
Out[155]: [(0, '0.014*"fuck" + 0.012*"gon" + 0.010*"life" + 0.009*"theyr" + 0.008*"year" + 0.007*"didnt" + 0.007*"joke" + 0.007*"shes" + 0.006*"guy" + 0.006*"caus"'),
 (1, '0.010*"gon" + 0.009*"fuck" + 0.009*"ahah" + 0.008*"women" + 0.007*"didnt" + 0.007*"shit" + 0.006*"year" + 0.005*"dude" + 0.005*"guy" + 0.005*"work"'),
 (2, '0.008*"gon" + 0.008*"friend" + 0.008*"didnt" + 0.007*"point" + 0.007*"caus" + 0.005*"night" + 0.005*"kind" + 0.005*"life" + 0.005*"person" + 0.005*"jenni"')]
In [156]: # Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()
Out[156]: [(0, '0.016*"joke" + 0.009*"stuff" + 0.009*"repeat" + 0.008*"guy" + 0.008*"love" + 0.007*"fuck" + 0.006*"contact" + 0.006*"littl" + 0.005*"gon" + 0.005*"tell"'),
 (1, '0.015*"gon" + 0.011*"fuck" + 0.010*"life" + 0.010*"theyr" + 0.009*"caus" + 0.007*"didnt" + 0.007*"shit" + 0.007*"women" + 0.006*"year" + 0.006*"kind"'),
 (2, '0.008*"life" + 0.008*"didnt" + 0.007*"shes" + 0.006*"point" + 0.006*"night" + 0.006*"friend" + 0.006*"person" + 0.006*"gon" + 0.006*"guy" + 0.005*"jenni"'),
 (3, '0.016*"fuck" + 0.013*"gon" + 0.010*"year" + 0.009*"theyr" + 0.008*"didnt" + 0.007*"life" + 0.006*"joke" + 0.006*"shes" + 0.006*"work" + 0.006*"littl"')]
4.4 Topic Modeling - Attempt #3 (Nouns and Adjectives)
In [157]: # Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    all_nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(all_nouns_adj)
In [160]: # Apply the nouns_adj function to the transcripts to filter only on nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))

# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())
data_dtmna.index = data_nouns_adj.index

# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())
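The `max_df=.8` argument asks CountVectorizer to ignore terms that appear in more than 80% of the documents. Here is a minimal sketch of that document-frequency rule in plain Python, on invented toy documents. The helper name `drop_common_terms` is hypothetical, and this only illustrates the rule; it is not scikit-learn's implementation.

```python
def drop_common_terms(documents, max_df=0.8):
    """Return the sorted vocabulary without terms found in more than max_df of the documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(term for term, df in doc_freq.items() if df <= max_df * n_docs)

# Invented toy documents: "dude" appears in all five, so it exceeds the 80% threshold
docs = [
    "dude joke wife",
    "dude taco snake",
    "dude joke gun",
    "dude parent joke",
    "dude food taco",
]
kept = drop_common_terms(docs)
print(kept)
```

With only 13 transcripts in our corpus, a threshold of 0.8 removes any term that shows up in 11 or more of them.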
In [161]: # Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()
Out[161]: [(0, '0.006*"joke" + 0.005*"dude" + 0.003*"parent" + 0.003*"sleep" + 0.003*"wife" + 0.003*"care" + 0.003*"cunt" + 0.003*"jenni" + 0.003*"parti" + 0.002*"funni"'),
 (1, '0.005*"taco" + 0.004*"dude" + 0.004*"ahah" + 0.004*"nigga" + 0.003*"repeat" + 0.003*"snake" + 0.003*"american" + 0.003*"comedi" + 0.003*"food" + 0.003*"hasan"')]
In [162]: # Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()
Out[162]: [(0, '0.009*"dude" + 0.008*"ahah" + 0.005*"nigga" + 0.004*"sleep" + 0.004*"wife" + 0.004*"shoot" + 0.004*"rap" + 0.003*"stori" + 0.003*"sudden" + 0.003*"jesus"'),
 (1, '0.009*"joke" + 0.005*"parent" + 0.003*"comedi" + 0.003*"jenni" + 0.003*"repeat" + 0.003*"clinton" + 0.003*"funni" + 0.003*"marri" + 0.003*"tweet" + 0.003*"andi"'),
 (2, '0.007*"dude" + 0.006*"taco" + 0.004*"food" + 0.004*"wan" + 0.004*"differ" + 0.004*"snake" + 0.004*"american" + 0.004*"date" + 0.003*"murder" + 0.003*"gun"')]
In [163]: # Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()
Out[163]: [(0, '0.005*"dude" + 0.005*"stori" + 0.005*"clinton" + 0.005*"repeat" + 0.004*"wife" + 0.004*"gun" + 0.004*"joke" + 0.003*"cunt" + 0.003*"slut" + 0.003*"care"'),
 (1, '0.009*"hasan" + 0.007*"parent" + 0.006*"brown" + 0.005*"marri" + 0.005*"dream" + 0.005*"birthday" + 0.005*"bike" + 0.004*"york" + 0.004*"door" + 0.004*"immigr"'),
 (2, '0.005*"joke" + 0.005*"jenni" + 0.004*"dude" + 0.003*"tweet" + 0.003*"murder" + 0.003*"cours" + 0.003*"parent" + 0.003*"bruce" + 0.003*"jenner" + 0.003*"date"'),
 (3, '0.007*"taco" + 0.006*"joke" + 0.006*"ahah" + 0.006*"nigga" + 0.006*"dude" + 0.005*"snake" + 0.004*"food" + 0.004*"american" + 0.004*"trevor" + 0.004*"wall"')]
4.5 Identify Topics in Each Document
Out of the 9 topic models we looked at, the one built on nouns and adjectives with 4 topics made the most sense.
So let's pull that down here and run it through some more iterations to get more fine-tuned topics.
In [166]: # Our final LDA model (for now)
from pprint import pprint

ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
pprint(ldana.print_topics())
[(0,
  '0.010*"ahah" + 0.008*"murder" + 0.006*"nigga" + 0.005*"young" + '
  '0.005*"dude" + 0.005*"rap" + 0.004*"date" + 0.004*"touch" + 0.004*"cours" + '
  '0.004*"suck"'),
 (1,
  '0.007*"jenni" + 0.006*"repeat" + 0.005*"andi" + 0.004*"slut" + '
  '0.004*"contact" + 0.004*"prolong" + 0.004*"husband" + 0.004*"song" + '
  '0.004*"comedi" + 0.003*"wan"'),
 (2,
  '0.011*"joke" + 0.005*"cunt" + 0.005*"clinton" + 0.004*"funni" + 0.004*"gun" '
  '+ 0.004*"parti" + 0.004*"tweet" + 0.003*"hate" + 0.003*"care" + '
  '0.003*"wife"'),
 (3,
  '0.010*"dude" + 0.006*"taco" + 0.004*"snake" + 0.004*"door" + 0.004*"sleep" '
  '+ 0.004*"wall" + 0.004*"stori" + 0.004*"parent" + 0.003*"presid" + '
  '0.003*"hasan"')]
These four topics look pretty decent. Let's settle on these for now.
Topic 0: african american, ghetto, crime
Topic 1: gender
Topic 2: politics
Topic 3: family
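When an LDA model is applied to a corpus, each document comes back as a list of (topic_id, probability) pairs, and that list can hold more than one pair when a document mixes topics. Below is a minimal sketch of picking the most probable topic per document. The probabilities are made up for illustration rather than taken from the model, and the helper name `dominant_topic` is hypothetical.

```python
# Made-up per-document topic distributions in the (topic_id, probability) shape
doc_topics = {
    "doc_a": [(3, 0.99)],
    "doc_b": [(0, 0.35), (2, 0.65)],
    "doc_c": [(1, 0.50), (3, 0.48)],
}

def dominant_topic(topic_probs):
    """Return the topic_id with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])[0]

assignments = [(dominant_topic(probs), name) for name, probs in doc_topics.items()]
print(assignments)
```

Taking the maximum keeps working when a document is spread over several topics, whereas unpacking a single pair only works when every document has exactly one.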
In [168]: # Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
pprint(list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index)))
[(1, 'ali'),
 (2, 'anthony'),
 (3, 'bill'),
 (1, 'bo'),
 (0, 'dave'),
 (3, 'hasan'),
 (2, 'jim'),
 (3, 'joe'),
 (2, 'john'),
 (0, 'louis'),
 (1, 'mike'),
 (2, 'ricky'),
 (3, 'trevor')]
For a first pass of LDA, these kind of make sense to me, so we'll call it a day for now.
Topic 0: african american [Dave, Louis]
Topic 1: gender [Ali, Bo, Mike]
Topic 2: politics [Anthony, Jim, John, Ricky]
Topic 3: family [Bill, Hasan, Joe, Trevor]
5. Text Generation
5.1 Introduction
Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We
can make a simple assumption that the next word is only dependent on the previous word - which is the basic
assumption of a Markov chain. Note: an LSTM (deep learning) is generally a better technique for text generation
than a Markov chain.
5.2 Select Text to Imitate
In this notebook, we're specifically going to generate text in the style of Trevor Noah, so as a first step, let's extract
the text from his comedy routine.
In [169]: # Read in the corpus, including punctuation!
import pandas as pd
data = pd.read_pickle('corpus.pkl')

# Extract only Trevor Noah's text
trevor_text = data.transcript.loc['trevor']
trevor_text[:200]
Out[169]: 'A NETFLIX ORIGINAL COMEDY SPECIAL [distant traffic] LIVE NATION PRESENTS TREVOR NOAH [presenter] Beautiful people, put your hands together for Trevor Noah. [shouting and whooping] [hip hop intro music'
5.3 Build a Markov Chain Function
We are going to build a simple Markov chain function that creates a dictionary:
The keys should be all of the words in the corpus
The values should be a list of the words that follow the keys
In [170]: from collections import defaultdict  # lets us append to a key that does not exist yet

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
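To see what this dictionary looks like, here is a small self-contained sketch on an invented ten-word sentence. The name `build_chain` is hypothetical. It follows the same logic as `markov_chain` above, except that it splits on any whitespace with `split()` rather than on single spaces.

```python
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that directly follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)

toy_chain = build_chain("the cat sat on the mat and the cat ran")
print(toy_chain)
```

Two details are worth noticing. Repeated followers are kept ("cat" is listed twice after "the"), so a random pick from the list favours frequent transitions. The final word "ran" never gets a key of its own, because nothing follows it.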
In [178]: # Create the dictionary for Trevor's routine
trevor_dict = markov_chain(trevor_text)
5.4 Create a Text Generator
We're going to create a function that generates sentences. It will take two things as inputs:
The dictionary you just created
The number of words you want generated
Here are some examples of generated sentences:
'Shape right turn– I also takes so that she’s got women all know that snail-trail.'
'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'
In [172]: import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''
    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return(sentence)
In [177]: generate_sentence(trevor_dict, 30)
Out[177]: 'Right, that’s moms. Dads will be somebody needs 25 billion dollars the guy was, they have vowels. I know what was your mom.” Ah… The mind of just put your.'
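One caveat: `generate_sentence` looks up `chain[word1]` directly, so it raises a KeyError if the walk reaches a word that has no entry in the dictionary, for example the final word of the transcript when it occurs nowhere else. Below is a sketch of one possible way to handle that, assuming we are happy to restart from a random key at such dead ends. The function name `generate_words`, the `seed` parameter and the toy chain are all hypothetical additions for illustration.

```python
import random

def generate_words(chain, count=15, seed=None):
    """Walk the chain for count words, restarting from a random key at dead ends."""
    rng = random.Random(seed)  # a fixed seed makes the walk repeatable
    keys = list(chain.keys())
    word = rng.choice(keys)
    words = [word.capitalize()]
    for _ in range(count - 1):
        followers = chain.get(word)
        # A word with no recorded followers is a dead end, so jump to a random key
        word = rng.choice(followers) if followers else rng.choice(keys)
        words.append(word)
    return " ".join(words) + "."

# Invented toy chain: "mat" appears only as a follower, so it is a dead end
toy_chain = {"the": ["cat", "mat"], "cat": ["sat"], "sat": ["on"], "on": ["the"]}
sentence = generate_words(toy_chain, count=8, seed=0)
print(sentence)
```

Passing a seed is optional. It is handy when you want to reproduce a particular generated sentence.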
6. Conclusion
Question: What makes Trevor Noah's comedy routine stand out?
Exploratory Data Analysis:
Top Words(Word Clouds)
He talks about love and his friends a lot.
Vocabulary Size(Bar Plot)
He ranked second highest in number of words per minute, which means he talks fast. He
ranked third lowest in number of unique words, which might be the reason that his
comedy is easier to understand than the others'.
Amount of Profanity(Scatter Plot)
He doesn't use the f-word in this sample, and rarely uses the s-word.
NLP Techniques:
Sentiment Analysis
He tends to be more positive and less opinionated.
Topic Modeling
His comedy involves the topics of family, friends and love.
Text Generation