end-to-end NLP in Python Sample Code by Aiden
0. Workflow
1. Start with a Question
2. Get & Clean the Data
3. Perform Exploratory Data Analysis
4. Apply Techniques
5. Share Insights
1. Data Cleaning
1.1 Introduction
Part one goes through a necessary step of any data science project: data cleaning. Data cleaning is a time-consuming
and unenjoyable task, yet it's a very important one. Keep in mind: "garbage in, garbage out".
Feeding dirty data into a model will give us meaningless results.
Specifically, we'll be walking through:
1. Getting the data - in this case, we'll be scraping data from a website
2. Cleaning the data - we will walk through popular text pre-processing techniques
3. Organizing the data - we will organize the cleaned data in a way that is easy to input into other
algorithms
The output of this part one will be clean, organized data in two standard text formats:
1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format
1.2 Problem Statement
My goal is to look at transcripts of various comedians and note their similarities and differences. Specifically, I'd
like to know if Trevor Noah's comedy style is different from that of other comedians, since he's the comedian that got
me interested in stand-up comedy.
1.3 Getting The Data
Scraps From The Loft (http://scrapsfromtheloft.com) keeps track of stand-up routine transcripts, and makes
them available for non-profit and educational purposes. To decide which comedians to look into, I went on
IMDB and looked specifically at comedy specials that were released in the past 5 years. To narrow it down
further, I looked only at those with greater than a 7.5/10 rating and more than 2000 votes. If a comedian had
multiple specials that fit those requirements, I would pick the most highly rated one. I ended up with 13 comedy
specials.
Packages for Web Scraping:
1. Requests - make HTTP requests
2. Beautiful Soup - parse HTML documents
Package for Python Data Processing:
1. Pickle - serialize Python objects
In [1]: # Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="post-content").find_all('p')]
    print(url)
    return text
# URLs of transcripts in scope
urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2018/11/21/trevor-noah-son-of-patricia-transcript/']

# Comedian names
comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan', 'ali', 'anthony', 'mike', 'joe', 'trevor']
In [5]: # Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

In [6]: # Pickle files for later use
# Make a new directory to hold the text files
# !mkdir transcripts
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [7]: # Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/
http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/
http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/
http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/
http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/
http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/
http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/
http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/
http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/
http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/
http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/
http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/
http://scrapsfromtheloft.com/2018/11/21/trevor-noah-son-of-patricia-transcript/
1.4 Cleaning The Data
When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing
with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as
text pre-processing techniques.
With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So,
we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of
things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest
can be done at a later point to improve our results.
Common data cleaning steps on all text:
Make text all lower case
Remove punctuation
Remove numerical values
Remove common nonsensical text (\n)
Tokenize text
Remove stop words
More data cleaning steps after tokenization:
Stemming / lemmatization
Parts of speech tagging
Create bi-grams or tri-grams (see the sketch after this list)
Deal with typos
And more...
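Bi-grams aren't actually built anywhere in this notebook, but if we wanted them later, scikit-learn's CountVectorizer can count word pairs as well as single words. A minimal sketch (the toy documents below are made up for illustration):

# Hypothetical sketch: counting bi-grams alongside single words with CountVectorizer.
# ngram_range=(1, 2) adds adjacent word pairs as extra columns in the document-term matrix.
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["thank you thank you for coming",
            "please welcome to the stage"]

cv_bigram = CountVectorizer(ngram_range=(1, 2), stop_words='english')
toy_dtm = cv_bigram.fit_transform(toy_docs)
print(cv_bigram.get_feature_names())  # single words plus pairs, e.g. 'thank coming'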
Packages for round 1 cleaning:
1. re - regular expression operations/matching/substitution
2. string - common string operations/punctuation
Packages for round 2 cleaning:
1. nltk - stemming / lemmatization
2. gensim - tokenization, excluding stopwords
In [14]: # We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
In [54]: # We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
In [18]: # Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation
    and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
In [61]: # Let's take a look at the updated text
data_clean1 = pd.DataFrame(data_df.transcript.apply(round1))
data_clean1.iloc[0]

Out[61]: transcript    ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
Name: ali, dtype: object
In [52]: # Apply a second round of text cleaning techniques
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
# nltk.download('wordnet')

stemmer = SnowballStemmer('english')

def lemmatize_stemming(word):
    return stemmer.stem(WordNetLemmatizer().lemmatize(word, pos='v'))

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def clean_text_round2(text):
    '''Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
    and exclude stopwords.'''
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    result = ' '.join(result)
    return result

round2 = lambda x: clean_text_round2(x)
In [62]: # Let's take a look at the updated text
data_clean2 = pd.DataFrame(data_clean1.transcript.apply(round2))
data_clean2.iloc[0]
Out[62]: transcript    ladi gentlemen welcom stage wong hello welcom thank thank come hello hello gonna shit caus like minut thank everybodi come excit excit year turn y...
Name: ali, dtype: object

1.5 Organizing The Data
The output of this part one will be clean, organized data in two standard text formats:
1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format
Corpus
In [75]: # Let's add the comedians' full names as well
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais', 'Trevor Noah']
data_df['full_name'] = full_names

# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")
Document-Term Matrix
The text must be tokenized, meaning broken down into smaller pieces. The most common tokenization
technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row
will represent a different document and every column will represent a different word.
In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no
additional meaning to text such as 'a', 'the', etc.
In [80]: # We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean2.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = data_clean2.index
data_dtm

In [82]: # Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean2.to_pickle('data_clean.pkl')
pickle.dump(cv, open("cv.pkl", "wb"))
Out[80]:
         aaaaah  aaaaahhhhhhh  aaaahhhhh  aaah  abandon  abc  abil  abject  abl  ablebodi  ...
ali           0             0          0     0        0    0     0       0    2         0  ...
anthony       0             0          0     0        0    0     0       0    0         0  ...
bill          1             0          0     0        0    1     0       0    1         0  ...
bo            0             1          1     0        0    0     1       0    0         0  ...
dave          0             0          0     1        0    0     0       0    0         0  ...
hasan         0             0          0     0        0    0     0       0    1         0  ...
jim           0             0          0     0        0    0     0       0    1         2  ...
joe           0             0          0     0        0    0     0       0    2         0  ...
john          0             0          0     0        0    0     0       0    3         0  ...
louis         0             0          0     0        0    0     0       0    1         0  ...
mike          0             0          0     0        0    0     0       0    0         0  ...
ricky         0             0          0     0        0    0     1       1    2         0  ...
trevor        0             0          0     0        1    0     0       0    0         0  ...

13 rows × 5319 columns

2. Exploratory Data Analysis
2.1 Introduction
After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at
the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always
important to explore the data first.
When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include
finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the
same when working with text data. We are going to find some more obvious patterns with EDA before
identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for
each comedian:
1. Most common words - find these and create word clouds
2. Size of vocabulary - look at the number of unique words and also how quickly someone speaks
3. Amount of profanity - revisit the most common curse words
2.2 Most Common Words
Analysis
In [84]: # Read in the document-term matrix
import pandas as pd
data = pd.read_pickle('dtm.pkl')
data = data.transpose()

In [87]: # Find the top 30 words said by each comedian
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

# pprint(top_dict)
In [88]: # Print the top 14 words said by each comedian
for comedian, top_words in top_dict.items():
    print(comedian)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')
ali
like, know, dont, shit, gonna, come, wanna, gotta, time, husband, fuck, tell, right, women
---
anthony
joke, like, know, dont, say, thing, anthoni, tell, think, guy, peopl, time, fuck, love
---
bill
like, right, fuck, know, dont, gonna, yeah, come, shit, think, want, dude, peopl, thing
---
bo
know, think, like, love, fuck, stuff, repeat, want, dont, yeah, right, say, slut, peopl
---
dave
like, know, say, fuck, shit, peopl, didnt, ahah, dont, time, black, come, look, good
---
hasan
like, know, dont, want, look, love, time, shes, hasan, right, come, walk, fuck, say
---
jim
fuck, like, dont, right, know, come, think, say, thing, peopl, want, gun, theyr, good
---
joe
like, fuck, peopl, dont, think, know, gonna, theyr, shit, thing, right, hous, dude, look
---
john
like, know, dont, say, walk, clinton, right, time, think, littl, look, thing, peopl, caus
---
louis
like, know, dont, thing, life, peopl, gonna, think, shit, caus, say, happen, look, murder
---
mike
like, say, know, dont, think, jenni, caus, right, point, want, mean, gonna, come, friend
---
ricky
right, like, say, know, fuck, dont, yeah, thing, joke, think, year, peopl, didnt, littl
---
trevor
like, know, say, dont, snake, taco, peopl, come, yeah, right, want, thing, think, time
---
NOTE: At this point, we could go on and create word clouds. However, by looking at these top words, you can
see that some of them have very little meaning and could be added to a stop words list, so let's do just that.
In [90]: # Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each comedian
words = []
for comedian in data.columns:
    top = [word for (word, count) in top_dict[comedian]]
    for t in top:
        words.append(t)

In [105]: # Let's aggregate this list and identify the most common words along with how many routines they occur in
Counter(words).most_common()

# If most of the comedians (more than 10 of the 13) have it as a top word, exclude it from the list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 10]
add_stop_words

Out[105]: ['like', 'know', 'dont', 'come', 'time', 'right', 'peopl', 'think', 'say', 'look', 'want', 'thing']

In [106]: # Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.transcript)
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
pickle.dump(cv, open("cv_stop.pkl", "wb"))
data_stop.to_pickle("dtm_stop.pkl")
In [107]: # Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

In [108]: # Reset the output dimensions
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 6]

full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais', 'Trevor Noah']

# Create subplots for each comedian
for index, comedian in enumerate(data.columns):
    wc.generate(data_clean.transcript[comedian])

    plt.subplot(4, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(full_names[index])

plt.show()
Findings
Trevor Noah says "yeah" a lot and makes fun of himself. I guess his enthusiastic language is what's funny to me.
A lot of people use the F-word. Let's dig into that later.
2.3 Number of Words
Analysis
In [110]: # Find the number of unique words that each comedian uses

# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for comedian in data.columns:
    uniques = data[comedian].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['comedian', 'unique_words'])
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort
Out[110]:
comedian unique_words
1 Anthony Jeselnik 732
9 Louis C.K. 817
12 Trevor Noah 874
6 Jim Jefferies 938
3 Bo Burnham 984
4 Dave Chappelle 1035
8 John Mulaney 1042
0 Ali Wong 1057
7 Joe Rogan 1057
10 Mike Birbiglia 1101
5 Hasan Minhaj 1168
2 Bill Burr 1215
11 Ricky Gervais 1218
In [111]: # Calculate the words per minute of each comedian

# Find the total number of words that a comedian uses
total_list = []
for comedian in data.columns:
    totals = sum(data[comedian])
    total_list.append(totals)

# Comedy special run times from IMDB, in minutes
run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79, 63]

# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by='words_per_minute')
data_wpm_sort
Out[111]:
comedian unique_words total_words run_times words_per_minute
1 Anthony Jeselnik 732 2232 59 37.830508
3 Bo Burnham 984 2449 60 40.816667
0 Ali Wong 1057 2596 60 43.266667
9 Louis C.K. 817 2539 58 43.775862
6 Jim Jefferies 938 3612 77 46.909091
11 Ricky Gervais 1218 3827 79 48.443038
4 Dave Chappelle 1035 3247 67 48.462687
10 Mike Birbiglia 1101 3752 76 49.368421
5 Hasan Minhaj 1168 3652 73 50.027397
8 John Mulaney 1042 3129 62 50.467742
2 Bill Burr 1215 4208 80 52.600000
12 Trevor Noah 874 3609 63 57.285714
7 Joe Rogan 1057 3632 63 57.650794
In [112]: # Let's plot our findings
import numpy as np

y_pos = np.arange(len(data_words))

plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.comedian)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 2, 2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')
plt.yticks(y_pos, data_wpm_sort.comedian)
plt.title('Number of Words Per Minute', fontsize=20)

plt.tight_layout()
plt.show()
Findings
Vocabulary
Ricky Gervais (British comedy) and Bill Burr (podcast host) use a lot of words in their comedy
Louis C.K. (self-deprecating comedy) and Anthony Jeselnik (dark humor) have a smaller vocabulary
Talking Speed
Joe Rogan (blue comedy) and Trevor Noah talk fast
Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slow
This is really interesting. Trevor Noah doesn't use many words in his comedy, but he talks fast. This may be
part of why his comedy is so engaging to the audience.
2.4 Amount of Profanity
Analysis
In [115]: # Earlier I said we'd revisit profanity. Let's take a look at the most common words again.
Counter(words).most_common()

# Let's isolate just these bad words
data_bad_words = data.transpose()[['fuck', 'shit']]
data_profanity = pd.concat([data_bad_words.fuck, data_bad_words.shit], axis=1)
data_profanity.columns = ['f_word', 's_word']
data_profanity
Out[115]:
f_word s_word
ali 20 36
anthony 19 9
bill 111 65
bo 40 7
dave 72 46
hasan 28 16
jim 126 20
joe 141 41
john 4 7
louis 22 28
mike 0 1
ricky 63 6
trevor 0 14
In [116]: # Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8]

for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)
plt.xlim(-5, 155)

plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)
plt.show()
Findings
Averaging 2 F-Bombs Per Minute! - I don't like too much swearing, especially the f-word, which is
probably why I've never heard of Bill Burr, Joe Rogan and Jim Jefferies.
Clean Humor - It looks like profanity might be a good predictor of the type of comedy I like. Besides Trevor
Noah, my two other favorite comedians in this group are John Mulaney and Mike Birbiglia.
Side Note
What was our goal for the EDA portion of our journey? To be able to take an initial look at our data and see if
the results of some basic analysis made sense.
My conclusion - yes, it does, for a first pass. The results are interesting and make general sense, so we're going
to move on.
As a reminder to myself, the data science process is an iterative one. It's better to see some non-perfect but
acceptable results to help you quickly decide whether your project is a dud or not, instead of having analysis
paralysis and never delivering anything.
3. Sentiment Analysis
3.1 Introduction
So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc.
These techniques could be applied to numeric data as well.
When it comes to text data, there are a few popular techniques that we'll be going through in the next few
notebooks, starting with sentiment analysis. A few key points to remember with sentiment analysis.
1. TextBlob Module: Linguistic researchers have labeled the sentiment of words based on their domain
expertise. A word's sentiment can also vary based on where it is in a sentence. The TextBlob module allows us
to take advantage of these labels.
2. Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity (there are more
labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
Subjectivity: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.
For more info on how TextBlob coded up its sentiment function, see https://planspace.org/20150607-textblob_sentiment/.
Other statistical methods, such as Naive Bayes, can also be used for sentiment analysis.
Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.
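As a quick illustration (not part of the original analysis), TextBlob can score a single sentence; the two example sentences below are made up:

from textblob import TextBlob

print(TextBlob("This show was absolutely hilarious").sentiment)  # expect positive polarity and high subjectivity
print(TextBlob("The special was released in 2017").sentiment)    # expect roughly neutral polarity and low subjectivity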
3.2 Sentiment of Routine
In [119]: # We'll start by reading in the corpus, which preserves word order
import pandas as pd
data = pd.read_pickle('corpus.pkl')

# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data
Out[119]:
         transcript                                                                                                                                             full_name         polarity   subjectivity
ali      Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have ...  Ali Wong          0.069359   0.482403
anthony  Thank you. Thank you. Thank you, San Francisco. Thank you so much. So good to be here. People were surprised when I told ’em I was gonna tape my s...  Anthony Jeselnik  0.054285   0.559732
bill     [cheers and applause] All right, thank you! Thank you very much! Thank you. Thank you. Thank you. How are you? What’s going on? Thank you. It’s a ...  Bill Burr         0.016479   0.537016
bo       Bo What? Old MacDonald had a farm E I E I O And on that farm he had a pig E I E I O Here a snort There a Old MacDonald had a farm E I E I O [Appla...  Bo Burnham        0.074514   0.539368
dave     This is Dave. He tells dirty jokes for a living. That stare is where most of his hard work happens. It signifies a profound train of thought, the ...  Dave Chappelle   -0.002690   0.513958
hasan    [theme music: orchestral hip-hop] [crowd roars] What’s up? Davis, what’s up? I’m home. I had to bring it back here. Netflix said, “Where do you wa...  Hasan Minhaj      0.086856   0.460619
jim      [Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello...  Jim Jefferies     0.044224   0.523382
joe      [rock music playing] [audience cheering] [announcer] Ladies and gentlemen, welcome Joe Rogan. [audience cheering and applauding] What the fuck is ...  Joe Rogan         0.004968   0.551628
john     All right, Petunia. Wish me luck out there. You will die on August 7th, 2037. That’s pretty good. All right. Hello. Hello, Chicago. Nice to see yo...  John Mulaney      0.082355   0.484137
louis    Intro Fade the music out. Let’s roll. Hold there. Lights. Do the lights. Thank you. Thank you very much. I appreciate that. I don’t necessarily a...   Louis C.K.        0.056665   0.515796
mike     Wow. Hey, thank you. Thanks. Thank you, guys. Hey, Seattle. Nice to see you. Look at this. Look at us. We’re here. This is crazy. It’s insane. So ...  Mike Birbiglia    0.092927   0.518476
ricky    Hello. Hello! How you doing? Great. Thank you. Wow. Calm down. Shut the fuck up. Thank you. What a lovely welcome. I’m gonna try my hardest tonigh...  Ricky Gervais     0.066489   0.497313
trevor   A NETFLIX ORIGINAL COMEDY SPECIAL [distant traffic] LIVE NATION PRESENTS TREVOR NOAH [presenter] Beautiful people, put your hands together for Tre...  Trevor Noah       0.096365   0.479900
In [120]: # Let's plot the results
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 8]

for index, comedian in enumerate(data.index):
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'][index], fontsize=10)
plt.xlim(-.01, .12)

plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)
plt.show()
Findings
Positive Energy - Trevor Noah is ranked No. 1 in positive/negative tone in this sample pool. This is
definitely an inspiring finding, but not surprising on reflection. Personally, I prefer comedy that is not only
funny but also somewhat optimistic.
Objective Tone - As a practical person, objectivity is another aspect I pay attention to in
the content of a show. Trevor Noah is more objective than most of the other comedians in the pool.
3.3 Sentiment of Routine Over Time
Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time
throughout each routine.
In [124]: # Split each routine into 10 parts
import numpy as np
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits into n equal parts, with a default of 10 equal parts.'''

    # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n)
    start = np.arange(0, length, size)

    # Pull out equally sized pieces of text and put it into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list

# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in data.transcript:
    split = split_text(t)
    list_pieces.append(split)
In [126]: # Calculate the polarity for each piece of text
polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)

In [130]: # Show the plot for all comedians
plt.rcParams['figure.figsize'] = [28, 18]

for index, comedian in enumerate(data.index):
    plt.subplot(4, 4, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0, 10), np.zeros(10))
    plt.title(data['full_name'][index])
    plt.ylim(ymin=-.2, ymax=.3)

plt.show()
Findings
Trevor Noah stays generally positive throughout his routine. Similar comedians are Louis C.K. and Ali Wong.
On the other hand, you have some pretty different patterns here like Bo Burnham who gets happier as time
passes and Dave Chappelle who has some pretty down moments in his routine.
4. Topic Modeling
4.1 Introduction
Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find
various topics that are present in your corpus. Each document in the corpus will be made up of at least one
topic, if not multiple topics.
In this notebook, we will be covering the steps on how to do Latent Dirichlet Allocation (LDA), which is one of
many topic modeling techniques. It was specifically designed for text data.
To use a topic modeling technique, we need to provide (1) a document-term matrix and (2) the number of topics
we would like the algorithm to pick up.
Once the topic modeling technique is applied, our job as a human is to interpret the results and see if the mix of
words in each topic make sense. If they don't make sense, we can try changing up the number of topics, the
terms in the document-term matrix, model parameters, or even try a different model.
4.2 Topic Modeling - Attempt #1 (All Text)
In [132]: # Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')

In [134]: # Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

tdm = data.transpose()

In [135]: # We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

# Gensim also requires a dictionary of all the terms and their respective location in the term-document matrix
cv = pickle.load(open("cv_stop.pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
In [136]: # Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

In [139]: # LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

In [140]: # LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()
These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as
well.
Out[136]: [(0, '0.014*"fuck" + 0.009*"shit" + 0.008*"gonna" + 0.008*"yeah" + 0.006*"didnt" + 0.006*"caus" + 0.006*"tell" + 0.005*"good" + 0.005*"love" + 0.005*"life"'), (1, '0.021*"fuck" + 0.009*"joke" + 0.008*"yeah" + 0.007*"theyr" + 0.007*"love" + 0.007*"gonna" + 0.006*"littl" + 0.005*"shit" + 0.005*"didnt" + 0.005*"tell"')]

Out[139]: [(0, '0.009*"yeah" + 0.007*"fuck" + 0.007*"love" + 0.007*"didnt" + 0.006*"walk" + 0.006*"littl" + 0.005*"mean" + 0.005*"snake" + 0.005*"year" + 0.005*"life"'), (1, '0.017*"fuck" + 0.009*"gonna" + 0.009*"shit" + 0.006*"yeah" + 0.006*"caus" + 0.006*"good" + 0.006*"tell" + 0.006*"theyr" + 0.006*"didnt" + 0.006*"mean"'), (2, '0.029*"fuck" + 0.012*"gonna" + 0.011*"shit" + 0.010*"yeah" + 0.007*"theyr" + 0.006*"caus" + 0.005*"dude" + 0.005*"littl" + 0.005*"didnt" + 0.005*"tell"')]

Out[140]: [(0, '0.026*"fuck" + 0.010*"shit" + 0.010*"gonna" + 0.008*"yeah" + 0.008*"theyr" + 0.007*"didnt" + 0.006*"life" + 0.006*"good" + 0.006*"love" + 0.006*"tell"'), (1, '0.016*"joke" + 0.008*"anthoni" + 0.008*"tell" + 0.007*"guy" + 0.006*"fuck" + 0.006*"love" + 0.006*"grandma" + 0.006*"shark" + 0.005*"babi" + 0.005*"good"'), (2, '0.010*"yeah" + 0.008*"love" + 0.008*"snake" + 0.007*"fuck" + 0.007*"shit" + 0.006*"taco" + 0.005*"gonna" + 0.005*"feel" + 0.005*"littl" + 0.005*"friend"'), (3, '0.008*"caus" + 0.007*"gonna" + 0.007*"mean" + 0.007*"friend" + 0.007*"walk" + 0.006*"point" + 0.005*"jenni" + 0.005*"tell" + 0.005*"yeah" + 0.005*"didnt"')]
4.3 Topic Modeling - Attempt #2 (Nouns Only)
One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.).
Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
In [141]: # Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)

In [179]: # Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')

# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))

In [180]: # Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = [word for word, count in Counter(words).most_common() if count > 10]
add_stop_words += ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                   'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
add_stop_words = list(set(add_stop_words))
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names())
data_dtmn.index = data_nouns.index

In [153]: # Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())
In [154]: # Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

In [155]: # Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

In [156]: # Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()
4.4 Topic Modeling - Attempt #3 (Nouns and Adjectives)
In [157]: # Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)
Out[154]: [(0, '0.012*"gon" + 0.010*"fuck" + 0.009*"theyr" + 0.009*"life" + 0.008*"caus" + 0.007*"didnt" + 0.007*"year" + 0.006*"women" + 0.006*"person" + 0.006*"joke"'), (1, '0.009*"fuck" + 0.009*"gon" + 0.007*"didnt" + 0.007*"life" + 0.007*"year" + 0.007*"shes" + 0.006*"guy" + 0.006*"love" + 0.006*"night" + 0.005*"shit"')]

Out[155]: [(0, '0.014*"fuck" + 0.012*"gon" + 0.010*"life" + 0.009*"theyr" + 0.008*"year" + 0.007*"didnt" + 0.007*"joke" + 0.007*"shes" + 0.006*"guy" + 0.006*"caus"'), (1, '0.010*"gon" + 0.009*"fuck" + 0.009*"ahah" + 0.008*"women" + 0.007*"didnt" + 0.007*"shit" + 0.006*"year" + 0.005*"dude" + 0.005*"guy" + 0.005*"work"'), (2, '0.008*"gon" + 0.008*"friend" + 0.008*"didnt" + 0.007*"point" + 0.007*"caus" + 0.005*"night" + 0.005*"kind" + 0.005*"life" + 0.005*"person" + 0.005*"jenni"')]

Out[156]: [(0, '0.016*"joke" + 0.009*"stuff" + 0.009*"repeat" + 0.008*"guy" + 0.008*"love" + 0.007*"fuck" + 0.006*"contact" + 0.006*"littl" + 0.005*"gon" + 0.005*"tell"'), (1, '0.015*"gon" + 0.011*"fuck" + 0.010*"life" + 0.010*"theyr" + 0.009*"caus" + 0.007*"didnt" + 0.007*"shit" + 0.007*"women" + 0.006*"year" + 0.006*"kind"'), (2, '0.008*"life" + 0.008*"didnt" + 0.007*"shes" + 0.006*"point" + 0.006*"night" + 0.006*"friend" + 0.006*"person" + 0.006*"gon" + 0.006*"guy" + 0.005*"jenni"'), (3, '0.016*"fuck" + 0.013*"gon" + 0.010*"year" + 0.009*"theyr" + 0.008*"didnt" + 0.007*"life" + 0.006*"joke" + 0.006*"shes" + 0.006*"work" + 0.006*"littl"')]
In [160]: # Apply the nouns_adj function to the transcripts to filter on nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))

# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())
data_dtmna.index = data_nouns_adj.index

# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [161]: # Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

In [162]: # Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()
Out[161]: [(0, '0.006*"joke" + 0.005*"dude" + 0.003*"parent" + 0.003*"sleep" + 0.003*"wife" + 0.003*"care" + 0.003*"cunt" + 0.003*"jenni" + 0.003*"parti" + 0.002*"funni"'), (1, '0.005*"taco" + 0.004*"dude" + 0.004*"ahah" + 0.004*"nigga" + 0.003*"repeat" + 0.003*"snake" + 0.003*"american" + 0.003*"comedi" + 0.003*"food" + 0.003*"hasan"')]

Out[162]: [(0, '0.009*"dude" + 0.008*"ahah" + 0.005*"nigga" + 0.004*"sleep" + 0.004*"wife" + 0.004*"shoot" + 0.004*"rap" + 0.003*"stori" + 0.003*"sudden" + 0.003*"jesus"'), (1, '0.009*"joke" + 0.005*"parent" + 0.003*"comedi" + 0.003*"jenni" + 0.003*"repeat" + 0.003*"clinton" + 0.003*"funni" + 0.003*"marri" + 0.003*"tweet" + 0.003*"andi"'), (2, '0.007*"dude" + 0.006*"taco" + 0.004*"food" + 0.004*"wan" + 0.004*"differ" + 0.004*"snake" + 0.004*"american" + 0.004*"date" + 0.003*"murder" + 0.003*"gun"')]
In [163]: # Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()
4.5 Identify Topics in Each Document
Out of the 9 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. So let's
pull that down here and run it through some more iterations to get more fine-tuned topics.
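Settling on 4 topics was an eyeball judgment based on reading the topic words. As an optional sanity check that is not done in this notebook, gensim's CoherenceModel can score candidate models; a rough sketch, assuming the corpusna and id2wordna objects defined earlier and the corpus-based 'u_mass' measure:

# Optional sketch (not in the original notebook): compare coherence for different numbers of topics
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Wrap the existing id-to-word mapping in a proper gensim Dictionary
dictionary_na = Dictionary.from_corpus(corpusna, id2word=id2wordna)

for k in [2, 3, 4, 5]:
    lda_k = models.LdaModel(corpus=corpusna, num_topics=k, id2word=id2wordna, passes=10)
    cm = CoherenceModel(model=lda_k, corpus=corpusna, dictionary=dictionary_na, coherence='u_mass')
    print(k, 'topics -> u_mass coherence:', cm.get_coherence())

# Higher (less negative) u_mass generally suggests more coherent topics, but it is only a
# rough guide - reading the topics still matters.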
In [166]: # Our final LDA model (for now)
from pprint import pprint

ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
pprint(ldana.print_topics())
Out[163]: [(0, '0.005*"dude" + 0.005*"stori" + 0.005*"clinton" + 0.005*"repeat" + 0.004*"wife" + 0.004*"gun" + 0.004*"joke" + 0.003*"cunt" + 0.003*"slut" + 0.003*"care"'), (1, '0.009*"hasan" + 0.007*"parent" + 0.006*"brown" + 0.005*"marri" + 0.005*"dream" + 0.005*"birthday" + 0.005*"bike" + 0.004*"york" + 0.004*"door" + 0.004*"immigr"'), (2, '0.005*"joke" + 0.005*"jenni" + 0.004*"dude" + 0.003*"tweet" + 0.003*"murder" + 0.003*"cours" + 0.003*"parent" + 0.003*"bruce" + 0.003*"jenner" + 0.003*"date"'), (3, '0.007*"taco" + 0.006*"joke" + 0.006*"ahah" + 0.006*"nigga" + 0.006*"dude" + 0.005*"snake" + 0.004*"food" + 0.004*"american" + 0.004*"trevor" + 0.004*"wall"')]

[(0,
  '0.010*"ahah" + 0.008*"murder" + 0.006*"nigga" + 0.005*"young" + '
  '0.005*"dude" + 0.005*"rap" + 0.004*"date" + 0.004*"touch" + 0.004*"cours" + '
  '0.004*"suck"'),
 (1,
  '0.007*"jenni" + 0.006*"repeat" + 0.005*"andi" + 0.004*"slut" + '
  '0.004*"contact" + 0.004*"prolong" + 0.004*"husband" + 0.004*"song" + '
  '0.004*"comedi" + 0.003*"wan"'),
 (2,
  '0.011*"joke" + 0.005*"cunt" + 0.005*"clinton" + 0.004*"funni" + 0.004*"gun" '
  '+ 0.004*"parti" + 0.004*"tweet" + 0.003*"hate" + 0.003*"care" + '
  '0.003*"wife"'),
 (3,
  '0.010*"dude" + 0.006*"taco" + 0.004*"snake" + 0.004*"door" + 0.004*"sleep" '
  '+ 0.004*"wall" + 0.004*"stori" + 0.004*"parent" + 0.003*"presid" + '
  '0.003*"hasan"')]
These four topics look pretty decent. Let's settle on these for now.
Topic 0: african american, ghetto, crime
Topic 1: gender
Topic 2: politics
Topic 3: family
In [168]: # Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
pprint(list(zip([a for [(a, b)] in corpus_transformed], data_dtmna.index)))
For a first pass of LDA, these kind of make sense to me, so we'll call it a day for now.
Topic 0: african american [Dave, Louis]
Topic 1: gender [Ali, Bo, Mike]
Topic 2: politics [Anthony, Jim, John, Ricky]
Topic 3: family [Bill, Hasan, Joe, Trevor]
5. Text Generation
5.1 Introduction
Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We
can make a simple assumption that the next word is only dependent on the previous word, which is the basic
assumption of a Markov chain. Note: an LSTM (deep learning) is a better technique for text generation than a
Markov chain.
[(1, 'ali'),
(2, 'anthony'),
(3, 'bill'),
(1, 'bo'),
(0, 'dave'),
(3, 'hasan'),
(2, 'jim'),
(3, 'joe'),
(2, 'john'),
(0, 'louis'),
(1, 'mike'),
(2, 'ricky'),
(3, 'trevor')]
5.2 Select Text to Imitate
In this notebook, we're specifically going to generate text in the style of Trevor Noah, so as a first step, let's extract
the text from his comedy routine.
In [169]: # Read in the corpus, including punctuation!
import pandas as pd
data = pd.read_pickle('corpus.pkl')

# Extract only Trevor Noah's text
trevor_text = data.transcript.loc['trevor']
trevor_text[:200]
5.3 Build a Markov Chain Function
We are going to build a simple Markov chain function that creates a dictionary (a toy illustration follows this list):
The keys should be all of the words in the corpus
The values should be a list of the words that follow the keys
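As a toy illustration (not taken from the routines), for the sentence "the cat sat on the mat" the dictionary would look like this:

# Hypothetical toy example of the word -> next-words dictionary built below
toy_chain = {
    'the': ['cat', 'mat'],   # 'the' is followed by 'cat' once and 'mat' once
    'cat': ['sat'],
    'sat': ['on'],
    'on':  ['the'],
}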
In [170]: from collections import defaultdict  # lets us append to a key that doesn't exist yet

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
    a key and each value as the list of words that come after the key in the text.'''

    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
Out[169]: 'A NETFLIX ORIGINAL COMEDY SPECIAL [distant traffic] LIVE NATION PRESENTS TREVOR NOAH [presenter] Beautiful people, put your hands together for Trevor Noah. [shouting and whooping] [hip hop intro music'
In [178]: # Create the dictionary for Trevor's routine, take a look at it
trevor_dict = markov_chain(trevor_text)
5.4 Create a Text Generator
We're going to create a function that generates sentences. It will take two things as inputs:
The dictionary you just created
The number of words you want generated
Here are some examples of generated sentences:
'Shape right turn– I also takes so that she’s got women all know that snail-trail.'
'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'
In [172]: import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
    along with the number of words you would like to see in your generated sentence.'''

    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence
In [177]: generate_sentence(trevor_dict, 30)
Out[177]: 'Right, that’s moms. Dads will be somebody needs 25 billion dollars the guy was, they have vowels. I know what was your mom.” Ah… The mind of just put your.'
6. Conclusion
Question: What makes Trevor Noah's comedy routine stand out?
Exploratory Data Analysis:
Top Words (Word Clouds)
He talks about love and his friends a lot.
Vocabulary Size (Bar Plot)
He ranked second highest in number of words per minute, which means he talks fast. He
ranked third lowest in number of unique words, which might be why his
comedy is easier to understand than the others'.
Amount of Profanity (Scatter Plot)
He doesn't use the f-word in this sample, and rarely uses the s-word.
NLP Techniques:
Sentiment Analysis
He tends to be more positive and less opinionated.
Topic Modeling
His comedy involves the topics of family, friends and love.
Text Generation
Big data analysis in python @ PyCon.tw 2013
 
BreizhCamp 2013 - Pimp my backend
BreizhCamp 2013 - Pimp my backendBreizhCamp 2013 - Pimp my backend
BreizhCamp 2013 - Pimp my backend
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 

Recently uploaded

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Recently uploaded (20)

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

Natural Language Processing sample code by Aiden

In [1]: # Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from scrapsfromtheloft.com
def url_to_transcript(url):
    '''Returns transcript data specifically from scrapsfromtheloft.com.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="post-content").find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope
urls = ['http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/',
        'http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/',
        'http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/',
        'http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/',
        'http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/',
        'http://scrapsfromtheloft.com/2018/11/21/trevor-noah-son-of-patricia-transcript/']

# Comedian names
comedians = ['louis', 'dave', 'ricky', 'bo', 'bill', 'jim', 'john', 'hasan',
             'ali', 'anthony', 'mike', 'joe', 'trevor']
In [5]: # Actually request transcripts (takes a few minutes to run)
transcripts = [url_to_transcript(u) for u in urls]

http://scrapsfromtheloft.com/2017/05/06/louis-ck-oh-my-god-full-transcript/
http://scrapsfromtheloft.com/2017/04/11/dave-chappelle-age-spin-2017-full-transcript/
http://scrapsfromtheloft.com/2018/03/15/ricky-gervais-humanity-transcript/
http://scrapsfromtheloft.com/2017/08/07/bo-burnham-2013-full-transcript/
http://scrapsfromtheloft.com/2017/05/24/bill-burr-im-sorry-feel-way-2014-full-transcript/
http://scrapsfromtheloft.com/2017/04/21/jim-jefferies-bare-2014-full-transcript/
http://scrapsfromtheloft.com/2017/08/02/john-mulaney-comeback-kid-2015-full-transcript/
http://scrapsfromtheloft.com/2017/10/21/hasan-minhaj-homecoming-king-2017-full-transcript/
http://scrapsfromtheloft.com/2017/09/19/ali-wong-baby-cobra-2016-full-transcript/
http://scrapsfromtheloft.com/2017/08/03/anthony-jeselnik-thoughts-prayers-2015-full-transcript/
http://scrapsfromtheloft.com/2018/03/03/mike-birbiglia-my-girlfriends-boyfriend-2013-full-transcript/
http://scrapsfromtheloft.com/2017/08/19/joe-rogan-triggered-2016-full-transcript/
http://scrapsfromtheloft.com/2018/11/21/trevor-noah-son-of-patricia-transcript/

In [6]: # Pickle files for later use

# Make a new directory to hold the text files
# !mkdir transcripts

for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [7]: # Load pickled files
data = {}
for i, c in enumerate(comedians):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
1.4 Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever - there's always an exception to every cleaning step. So we're going to follow the MVP (minimum viable product) approach: start simple and iterate. We'll execute just the common cleaning steps here; the rest can be done at a later point to improve our results.

Common data cleaning steps on all text:
Make text all lower case
Remove punctuation
Remove numerical values
Remove common non-sensical text (\n)
Tokenize text
Remove stop words

More data cleaning steps after tokenization:
Stemming / lemmatization
Parts of speech tagging
Create bi-grams or tri-grams
Deal with typos
And more...

Packages for round 1 cleaning:
1. re - regular expression operations/matching/substitution
2. string - common string operations/punctuation

Packages for round 2 cleaning:
1. nltk - stemming, lemmatization
2. gensim - tokenization, excluding stop words

In [14]: # We are going to change this to key: comedian, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
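Before applying them to the transcripts, here is a minimal sketch of what the round-one steps above do to a toy line of text (illustrative only; the sample string and its output are made up here and are separate from the notebook's own cleaning functions defined below):

import re
import string

sample = "[Audience cheering] It's 2020 -- WELCOME to the show!\n"
sample = sample.lower()                                              # make text all lower case
sample = re.sub('\[.*?\]', '', sample)                               # remove bracketed stage directions
sample = re.sub('[%s]' % re.escape(string.punctuation), '', sample)  # remove punctuation
sample = re.sub('\w*\d\w*', '', sample)                              # remove words containing numbers
sample = re.sub('\n', ' ', sample)                                   # remove newlines
print(sample.strip())   # roughly: "its   welcome to the show"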
In [54]: # We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()

In [18]: # Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation
    and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [61]: # Let's take a look at the updated text
data_clean1 = pd.DataFrame(data_df.transcript.apply(round1))
data_clean1.iloc[0]

Out[61]: transcript    ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get thi...
Name: ali, dtype: object
In [52]: # Apply a second round of text cleaning techniques
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
# nltk.download('wordnet')

stemmer = SnowballStemmer('english')

def lemmatize_stemming(word):
    return stemmer.stem(WordNetLemmatizer().lemmatize(word, pos='v'))

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def clean_text_round2(text):
    '''Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
    and exclude stop words.'''
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    result = ' '.join(result)
    return result

round2 = lambda x: clean_text_round2(x)

In [62]: # Let's take a look at the updated text
data_clean2 = pd.DataFrame(data_clean1.transcript.apply(round2))
data_clean2.iloc[0]

Out[62]: transcript    ladi gentlemen welcom stage wong hello welcom thank thank come hello hello gonna shit caus like minut thank everybodi come excit excit year turn y...
Name: ali, dtype: object
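As a quick illustration of what round two does (toy inputs invented here; the exact tokens kept depend on the gensim stop word list and the NLTK data installed):

print(lemmatize_stemming('cheering'))   # -> 'cheer' (lemmatize to the verb form, then stem)
print(clean_text_round2("We were all cheering and laughing at the jokes"))
# stop words and tokens of 3 characters or fewer are dropped and the rest are stemmed,
# e.g. something like 'cheer laugh joke'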
1.5 Organizing The Data

The output of this part will be clean, organized data in two standard text formats:

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

Corpus

In [75]: # Let's add the comedians' full names as well
full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais',
              'Trevor Noah']

data_df['full_name'] = full_names

# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

Document-Term Matrix

The text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text down into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word. In addition, with CountVectorizer we can remove stop words - common words such as 'a' and 'the' that add no additional meaning to the text.
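As a minimal sketch of the structure CountVectorizer produces (the two toy documents below are invented purely for illustration; the real matrix is built from the transcripts in the next cell):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

toy_docs = ['the joke was funny', 'the crowd loved the joke']   # toy documents
toy_cv = CountVectorizer(stop_words='english')                  # drops 'the', 'was', etc.
toy_matrix = toy_cv.fit_transform(toy_docs)
print(pd.DataFrame(toy_matrix.toarray(), columns=toy_cv.get_feature_names()))
#    crowd  funny  joke  loved
# 0      0      1     1      0
# 1      1      0     1      1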
In [80]: # We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean2.transcript)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = data_clean2.index   # the cleaned frame from round 2 above
data_dtm

Out[80]:
          aaaaah  aaaaahhhhhhh  aaaahhhhh  aaah  abandon  abc  abil  abject  abl  ablebodi  ...
ali            0             0          0     0        0    0     0       0    2         0  ...
anthony        0             0          0     0        0    0     0       0    0         0  ...
bill           1             0          0     0        0    1     0       0    1         0  ...
bo             0             1          1     0        0    0     1       0    0         0  ...
dave           0             0          0     1        0    0     0       0    0         0  ...
hasan          0             0          0     0        0    0     0       0    1         0  ...
jim            0             0          0     0        0    0     0       0    1         2  ...
joe            0             0          0     0        0    0     0       0    2         0  ...
john           0             0          0     0        0    0     0       0    3         0  ...
louis          0             0          0     0        0    0     0       0    1         0  ...
mike           0             0          0     0        0    0     0       0    0         0  ...
ricky          0             0          0     0        0    0     1       1    2         0  ...
trevor         0             0          0     0        1    0     0       0    0         0  ...

13 rows × 5319 columns

In [82]: # Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean2.to_pickle('data_clean.pkl')
pickle.dump(cv, open("cv.pkl", "wb"))

2. Exploratory Data Analysis
2.1 Introduction

After the data cleaning step, where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, some of the exploratory data analysis (EDA) techniques we can use include finding the average of the data set, the distribution of the data, the most common values, etc. The idea is the same when working with text data. We are going to find some of the more obvious patterns with EDA before identifying the hidden patterns with machine learning (ML) techniques. We are going to look at the following for each comedian:

1. Most common words - find these and create word clouds
2. Size of vocabulary - look at the number of unique words and also how quickly each comedian speaks
3. Amount of profanity - counts of the most common curse words

2.2 Most Common Words Analysis

In [84]: # Read in the document-term matrix
import pandas as pd

data = pd.read_pickle('dtm.pkl')
data = data.transpose()

In [87]: # Find the top 30 words said by each comedian
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

# pprint(top_dict)
In [88]: # Print the top 14 words said by each comedian
for comedian, top_words in top_dict.items():
    print(comedian)
    print(', '.join([word for word, count in top_words[0:14]]))
    print('---')

ali
like, know, dont, shit, gonna, come, wanna, gotta, time, husband, fuck, tell, right, women
---
anthony
joke, like, know, dont, say, thing, anthoni, tell, think, guy, peopl, time, fuck, love
---
bill
like, right, fuck, know, dont, gonna, yeah, come, shit, think, want, dude, peopl, thing
---
bo
know, think, like, love, fuck, stuff, repeat, want, dont, yeah, right, say, slut, peopl
---
dave
like, know, say, fuck, shit, peopl, didnt, ahah, dont, time, black, come, look, good
---
hasan
like, know, dont, want, look, love, time, shes, hasan, right, come, walk, fuck, say
---
jim
fuck, like, dont, right, know, come, think, say, thing, peopl, want, gun, theyr, good
---
joe
like, fuck, peopl, dont, think, know, gonna, theyr, shit, thing, right, hous, dude, look
---
john
like, know, dont, say, walk, clinton, right, time, think, littl, look, thing, peopl, caus
---
louis
like, know, dont, thing, life, peopl, gonna, think, shit, caus, say, happen, look, murder
---
mike
like, say, know, dont, think, jenni, caus, right, point, want, mean, gonna, come, friend
---
ricky
right, like, say, know, fuck, dont, yeah, thing, joke, think, year, peopl, didnt, littl
---
trevor
like, know, say, dont, snake, taco, peopl, come, yeah, right, want, thing, think, time
---
NOTE: At this point, we could go on and create word clouds. However, looking at these top words, you can see that some of them have very little meaning and could be added to a stop words list, so let's do just that.

In [90]: # Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 30 words for each comedian
words = []
for comedian in data.columns:
    top = [word for (word, count) in top_dict[comedian]]
    for t in top:
        words.append(t)

In [105]: # Let's aggregate this list and identify the most common words along with how many routines they occur in
Counter(words).most_common()

# If most of the comedians (more than 10 of the 13) have it as a top word, exclude it from the list
add_stop_words = [word for word, count in Counter(words).most_common() if count > 10]
add_stop_words

Out[105]: ['like', 'know', 'dont', 'come', 'time', 'right', 'peopl', 'think', 'say', 'look', 'want', 'thing']

In [106]: # Let's update our document-term matrix with the new list of stop words
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Read in cleaned data
data_clean = pd.read_pickle('data_clean.pkl')

# Add new stop words
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate document-term matrix
cv = CountVectorizer(stop_words=stop_words)
data_cv = cv.fit_transform(data_clean.transcript)
data_stop = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_stop.index = data_clean.index

# Pickle it for later use
import pickle
pickle.dump(cv, open("cv_stop.pkl", "wb"))
data_stop.to_pickle("dtm_stop.pkl")
In [107]: # Let's make some word clouds!
# Terminal / Anaconda Prompt: conda install -c conda-forge wordcloud
from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_words, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

In [108]: # Reset the output dimensions
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 6]

full_names = ['Ali Wong', 'Anthony Jeselnik', 'Bill Burr', 'Bo Burnham', 'Dave Chappelle', 'Hasan Minhaj',
              'Jim Jefferies', 'Joe Rogan', 'John Mulaney', 'Louis C.K.', 'Mike Birbiglia', 'Ricky Gervais',
              'Trevor Noah']

# Create subplots for each comedian
for index, comedian in enumerate(data.columns):
    wc.generate(data_clean.transcript[comedian])

    plt.subplot(4, 4, index+1)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(full_names[index])

plt.show()

Findings

Trevor Noah says "yeah" a lot and often makes fun of himself. I guess his enthusiastic language is what I find funny.
A lot of comedians use the F-word. Let's dig into that later.
2.3 Number of Words Analysis

In [110]: # Find the number of unique words that each comedian uses

# Identify the non-zero items in the document-term matrix, meaning that the word occurs at least once
unique_list = []
for comedian in data.columns:
    # Series.nonzero() is deprecated, so go through the underlying numpy array instead
    uniques = data[comedian].to_numpy().nonzero()[0].size
    unique_list.append(uniques)

# Create a new dataframe that contains this unique word count
data_words = pd.DataFrame(list(zip(full_names, unique_list)), columns=['comedian', 'unique_words'])
data_unique_sort = data_words.sort_values(by='unique_words')
data_unique_sort

Out[110]:
          comedian            unique_words
1         Anthony Jeselnik             732
9         Louis C.K.                   817
12        Trevor Noah                  874
6         Jim Jefferies                938
3         Bo Burnham                   984
4         Dave Chappelle              1035
8         John Mulaney                1042
0         Ali Wong                    1057
7         Joe Rogan                   1057
10        Mike Birbiglia              1101
5         Hasan Minhaj                1168
2         Bill Burr                   1215
11        Ricky Gervais               1218
In [111]: # Calculate the words per minute of each comedian

# Find the total number of words that a comedian uses
total_list = []
for comedian in data.columns:
    totals = sum(data[comedian])
    total_list.append(totals)

# Comedy special run times from IMDB, in minutes
run_times = [60, 59, 80, 60, 67, 73, 77, 63, 62, 58, 76, 79, 63]

# Let's add some columns to our dataframe
data_words['total_words'] = total_list
data_words['run_times'] = run_times
data_words['words_per_minute'] = data_words['total_words'] / data_words['run_times']

# Sort the dataframe by words per minute to see who talks the slowest and fastest
data_wpm_sort = data_words.sort_values(by='words_per_minute')
data_wpm_sort

Out[111]:
    comedian           unique_words  total_words  run_times  words_per_minute
1   Anthony Jeselnik            732         2232         59         37.830508
3   Bo Burnham                  984         2449         60         40.816667
0   Ali Wong                   1057         2596         60         43.266667
9   Louis C.K.                  817         2539         58         43.775862
6   Jim Jefferies               938         3612         77         46.909091
11  Ricky Gervais              1218         3827         79         48.443038
4   Dave Chappelle             1035         3247         67         48.462687
10  Mike Birbiglia             1101         3752         76         49.368421
5   Hasan Minhaj               1168         3652         73         50.027397
8   John Mulaney               1042         3129         62         50.467742
2   Bill Burr                  1215         4208         80         52.600000
12  Trevor Noah                 874         3609         63         57.285714
7   Joe Rogan                  1057         3632         63         57.650794
In [112]: # Let's plot our findings
import numpy as np

y_pos = np.arange(len(data_words))

plt.subplot(1, 2, 1)
plt.barh(y_pos, data_unique_sort.unique_words, align='center')
plt.yticks(y_pos, data_unique_sort.comedian)
plt.title('Number of Unique Words', fontsize=20)

plt.subplot(1, 2, 2)
plt.barh(y_pos, data_wpm_sort.words_per_minute, align='center')
plt.yticks(y_pos, data_wpm_sort.comedian)
plt.title('Number of Words Per Minute', fontsize=20)

plt.tight_layout()
plt.show()

Findings

Vocabulary
Ricky Gervais (British comedy) and Bill Burr (podcast host) use the largest vocabularies in their routines.
Louis C.K. (self-deprecating comedy) and Anthony Jeselnik (dark humor) have smaller vocabularies.

Talking Speed
Joe Rogan (blue comedy) and Trevor Noah talk fast.
Bo Burnham (musical comedy) and Anthony Jeselnik (dark humor) talk slowly.

This is really interesting. Trevor Noah doesn't use many unique words in his comedy, but he talks fast. That combination may be part of why his comedy lands so well with audiences.

2.4 Amount of Profanity Analysis
In [115]: # Earlier I said we'd revisit profanity. Let's take a look at the most common words again.
Counter(words).most_common()

# Let's isolate just these bad words
data_bad_words = data.transpose()[['fuck', 'shit']]
data_profanity = pd.concat([data_bad_words.fuck, data_bad_words.shit], axis=1)
data_profanity.columns = ['f_word', 's_word']
data_profanity

Out[115]:
         f_word  s_word
ali          20      36
anthony      19       9
bill        111      65
bo           40       7
dave         72      46
hasan        28      16
jim         126      20
joe         141      41
john          4       7
louis        22      28
mike          0       1
ricky        63       6
trevor        0      14
In [116]: # Let's create a scatter plot of our findings
plt.rcParams['figure.figsize'] = [10, 8]

for i, comedian in enumerate(data_profanity.index):
    x = data_profanity.f_word.loc[comedian]
    y = data_profanity.s_word.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+1.5, y+0.5, full_names[i], fontsize=10)
plt.xlim(-5, 155)

plt.title('Number of Bad Words Used in Routine', fontsize=20)
plt.xlabel('Number of F Bombs', fontsize=15)
plt.ylabel('Number of S Words', fontsize=15)
plt.show()

Findings

Averaging around two F-bombs per minute! - I don't like too much swearing, especially the F-word, which is probably why I had never heard of Bill Burr, Joe Rogan and Jim Jefferies before this analysis.
Clean humor - It looks like profanity might be a good predictor of the type of comedy I like. Besides Trevor Noah, my two other favorite comedians in this group are John Mulaney and Mike Birbiglia.
Side Note

What was our goal for the EDA portion of our journey? To be able to take an initial look at our data and see if the results of some basic analysis made sense.

My conclusion - yes, it does, for a first pass. The results are interesting and make general sense, so we're going to move on.

As a reminder to myself, the data science process is an iterative one. It's better to see some non-perfect but acceptable results that help you quickly decide whether your project is a dud or not, instead of getting stuck in analysis paralysis and never delivering anything.

3. Sentiment Analysis

3.1 Introduction

So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well. When it comes to text data, there are a few popular techniques that we'll be going through in the next few notebooks, starting with sentiment analysis.

A few key points to remember with sentiment analysis:

1. TextBlob Module: Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can also vary based on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.
2. Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
   Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
   Subjectivity: How subjective, or opinionated, a word is. 0 is fact. +1 is very much an opinion.

For more info on how TextBlob coded up its sentiment function, see https://planspace.org/20150607-textblob_sentiment/. Other statistical methods, such as Naive Bayes, can also be used for sentiment analysis.

Let's take a look at the sentiment of the various transcripts, both overall and throughout each comedy routine.
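As a minimal illustration of these two scores (toy sentences invented here, not lines from the transcripts; the exact numbers depend on the TextBlob version installed):

from textblob import TextBlob

print(TextBlob("I absolutely love this show, it is great!").sentiment)
# high positive polarity and high subjectivity - an opinionated, positive sentence

print(TextBlob("The show starts at nine and runs for an hour.").sentiment)
# polarity and subjectivity both near 0 - a neutral statement of fact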
3.2 Sentiment of Routine

In [119]: # We'll start by reading in the corpus, which preserves word order
import pandas as pd

data = pd.read_pickle('corpus.pkl')

# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['transcript'].apply(pol)
data['subjectivity'] = data['transcript'].apply(sub)
data
Out[119]:
         transcript                                                                  full_name         polarity   subjectivity
ali      Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello!...  Ali Wong          0.069359   0.482403
anthony  Thank you. Thank you. Thank you, San Francisco. Thank you so much...        Anthony Jeselnik  0.054285   0.559732
bill     [cheers and applause] All right, thank you! Thank you very much!...         Bill Burr         0.016479   0.537016
bo       Bo What? Old MacDonald had a farm E I E I O And on that farm he had...      Bo Burnham        0.074514   0.539368
dave     This is Dave. He tells dirty jokes for a living. That stare is where...     Dave Chappelle   -0.002690   0.513958
hasan    [theme music: orchestral hip-hop] [crowd roars] What's up? Davis...         Hasan Minhaj      0.086856   0.460619
jim      [Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen...    Jim Jefferies     0.044224   0.523382
joe      [rock music playing] [audience cheering] [announcer] Ladies and gentle...   Joe Rogan         0.004968   0.551628
john     All right, Petunia. Wish me luck out there. You will die on August 7th...   John Mulaney      0.082355   0.484137
louis    Intro... Fade the music out. Let's roll. Hold there. Lights. Do the...      Louis C.K.        0.056665   0.515796
mike     Wow. Hey, thank you. Thanks. Thank you, guys. Hey, Seattle. Nice to see...  Mike Birbiglia    0.092927   0.518476
ricky    Hello. Hello! How you doing? Great. Thank you. Wow. Calm down...            Ricky Gervais     0.066489   0.497313
trevor   A NETFLIX ORIGINAL COMEDY SPECIAL [distant traffic] LIVE NATION PRESENTS... Trevor Noah       0.096365   0.479900
In [120]: # Let's plot the results
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

for index, comedian in enumerate(data.index):
    x = data.polarity.loc[comedian]
    y = data.subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'][index], fontsize=10)
    plt.xlim(-.01, .12)

plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)
plt.show()
Findings

Positive energy - Trevor Noah ranks first on the positive/negative (polarity) axis in this sample pool. This is an encouraging finding, though not a surprising one on reflection: personally, I prefer comedy that is not only funny but also somewhat optimistic.
Objective tone - As a practical person, objectivity is another aspect I pay attention to in a show's content. Trevor Noah is more objective (less subjective) than most of the other comedians in the pool.

3.3 Sentiment of Routine Over Time

Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time throughout each routine.

In [124]: # Split each routine into 10 parts
import numpy as np
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits into n equal parts, with a default of 10 equal parts.'''

    # Calculate the length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n)
    start = np.arange(0, length, size)

    # Pull out equally sized pieces of text and put them into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list

# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in data.transcript:
    split = split_text(t)
    list_pieces.append(split)
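As a quick sanity check of split_text (a toy string invented here, not a transcript):

split_text('abcdefghij', n=5)
# ['ab', 'cd', 'ef', 'gh', 'ij'] - five equal chunks; any leftover characters at the end are dropped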
In [126]: # Calculate the polarity for each piece of text
polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)

In [130]: # Show the plot for all comedians
plt.rcParams['figure.figsize'] = [28, 18]

for index, comedian in enumerate(data.index):
    plt.subplot(4, 4, index+1)
    plt.plot(polarity_transcript[index])
    plt.plot(np.arange(0,10), np.zeros(10))
    plt.title(data['full_name'][index])
    plt.ylim(ymin=-.2, ymax=.3)

plt.show()

Findings

Trevor Noah stays generally positive throughout his routine. Similar comedians in this respect are Louis C.K. and Ali Wong.
On the other hand, you have some pretty different patterns here, like Bo Burnham, who gets happier as time passes, and Dave Chappelle, who has some pretty down moments in his routine.
4. Topic Modeling

4.1 Introduction

Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find the various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

In this notebook, we will be covering the steps on how to do Latent Dirichlet Allocation (LDA), which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.

Once the topic modeling technique is applied, our job as humans is to interpret the results and see if the mix of words in each topic makes sense. If they don't make sense, we can try changing the number of topics, the terms in the document-term matrix, the model parameters, or even try a different model.

4.2 Topic Modeling - Attempt #1 (All Text)

In [132]: # Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop.pkl')

In [134]: # Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

tdm = data.transpose()

In [135]: # We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

# Gensim also requires a dictionary of all the terms and their respective location in the term-document matrix
cv = pickle.load(open("cv_stop.pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
In [136]: # Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

Out[136]: [(0, '0.014*"fuck" + 0.009*"shit" + 0.008*"gonna" + 0.008*"yeah" + 0.006*"didnt" + 0.006*"caus" + 0.006*"tell" + 0.005*"good" + 0.005*"love" + 0.005*"life"'),
           (1, '0.021*"fuck" + 0.009*"joke" + 0.008*"yeah" + 0.007*"theyr" + 0.007*"love" + 0.007*"gonna" + 0.006*"littl" + 0.005*"shit" + 0.005*"didnt" + 0.005*"tell"')]

In [139]: # LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

Out[139]: [(0, '0.009*"yeah" + 0.007*"fuck" + 0.007*"love" + 0.007*"didnt" + 0.006*"walk" + 0.006*"littl" + 0.005*"mean" + 0.005*"snake" + 0.005*"year" + 0.005*"life"'),
           (1, '0.017*"fuck" + 0.009*"gonna" + 0.009*"shit" + 0.006*"yeah" + 0.006*"caus" + 0.006*"good" + 0.006*"tell" + 0.006*"theyr" + 0.006*"didnt" + 0.006*"mean"'),
           (2, '0.029*"fuck" + 0.012*"gonna" + 0.011*"shit" + 0.010*"yeah" + 0.007*"theyr" + 0.006*"caus" + 0.005*"dude" + 0.005*"littl" + 0.005*"didnt" + 0.005*"tell"')]

In [140]: # LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

Out[140]: [(0, '0.026*"fuck" + 0.010*"shit" + 0.010*"gonna" + 0.008*"yeah" + 0.008*"theyr" + 0.007*"didnt" + 0.006*"life" + 0.006*"good" + 0.006*"love" + 0.006*"tell"'),
           (1, '0.016*"joke" + 0.008*"anthoni" + 0.008*"tell" + 0.007*"guy" + 0.006*"fuck" + 0.006*"love" + 0.006*"grandma" + 0.006*"shark" + 0.005*"babi" + 0.005*"good"'),
           (2, '0.010*"yeah" + 0.008*"love" + 0.008*"snake" + 0.007*"fuck" + 0.007*"shit" + 0.006*"taco" + 0.005*"gonna" + 0.005*"feel" + 0.005*"littl" + 0.005*"friend"'),
           (3, '0.008*"caus" + 0.007*"gonna" + 0.007*"mean" + 0.007*"friend" + 0.007*"walk" + 0.006*"point" + 0.005*"jenni" + 0.005*"tell" + 0.005*"yeah" + 0.005*"didnt"')]

These topics aren't looking too great. We've tried modifying our parameters. Let's try modifying our terms list as well.
4.3 Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms that are from one part of speech (only nouns, only adjectives, etc.). Check out the UPenn tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

In [141]: # Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)

In [179]: # Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean.pkl')

# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))

In [180]: # Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = [word for word, count in Counter(words).most_common() if count > 10]
add_stop_words += ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                   'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
add_stop_words = list(set(add_stop_words))
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names())
data_dtmn.index = data_nouns.index

In [153]: # Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())
In [154]: # Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

Out[154]: [(0, '0.012*"gon" + 0.010*"fuck" + 0.009*"theyr" + 0.009*"life" + 0.008*"caus" + 0.007*"didnt" + 0.007*"year" + 0.006*"women" + 0.006*"person" + 0.006*"joke"'),
           (1, '0.009*"fuck" + 0.009*"gon" + 0.007*"didnt" + 0.007*"life" + 0.007*"year" + 0.007*"shes" + 0.006*"guy" + 0.006*"love" + 0.006*"night" + 0.005*"shit"')]

In [155]: # Let's try 3 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

Out[155]: [(0, '0.014*"fuck" + 0.012*"gon" + 0.010*"life" + 0.009*"theyr" + 0.008*"year" + 0.007*"didnt" + 0.007*"joke" + 0.007*"shes" + 0.006*"guy" + 0.006*"caus"'),
           (1, '0.010*"gon" + 0.009*"fuck" + 0.009*"ahah" + 0.008*"women" + 0.007*"didnt" + 0.007*"shit" + 0.006*"year" + 0.005*"dude" + 0.005*"guy" + 0.005*"work"'),
           (2, '0.008*"gon" + 0.008*"friend" + 0.008*"didnt" + 0.007*"point" + 0.007*"caus" + 0.005*"night" + 0.005*"kind" + 0.005*"life" + 0.005*"person" + 0.005*"jenni"')]

In [156]: # Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

Out[156]: [(0, '0.016*"joke" + 0.009*"stuff" + 0.009*"repeat" + 0.008*"guy" + 0.008*"love" + 0.007*"fuck" + 0.006*"contact" + 0.006*"littl" + 0.005*"gon" + 0.005*"tell"'),
           (1, '0.015*"gon" + 0.011*"fuck" + 0.010*"life" + 0.010*"theyr" + 0.009*"caus" + 0.007*"didnt" + 0.007*"shit" + 0.007*"women" + 0.006*"year" + 0.006*"kind"'),
           (2, '0.008*"life" + 0.008*"didnt" + 0.007*"shes" + 0.006*"point" + 0.006*"night" + 0.006*"friend" + 0.006*"person" + 0.006*"gon" + 0.006*"guy" + 0.005*"jenni"'),
           (3, '0.016*"fuck" + 0.013*"gon" + 0.010*"year" + 0.009*"theyr" + 0.008*"didnt" + 0.007*"life" + 0.006*"joke" + 0.006*"shes" + 0.006*"work" + 0.006*"littl"')]

4.4 Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [157]: # Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)
In [160]: # Apply the nouns_adj function to the transcripts to filter on nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))

# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())
data_dtmna.index = data_nouns_adj.index

# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [161]: # Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

Out[161]: [(0, '0.006*"joke" + 0.005*"dude" + 0.003*"parent" + 0.003*"sleep" + 0.003*"wife" + 0.003*"care" + 0.003*"cunt" + 0.003*"jenni" + 0.003*"parti" + 0.002*"funni"'),
 (1, '0.005*"taco" + 0.004*"dude" + 0.004*"ahah" + 0.004*"nigga" + 0.003*"repeat" + 0.003*"snake" + 0.003*"american" + 0.003*"comedi" + 0.003*"food" + 0.003*"hasan"')]

In [162]: # Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

Out[162]: [(0, '0.009*"dude" + 0.008*"ahah" + 0.005*"nigga" + 0.004*"sleep" + 0.004*"wife" + 0.004*"shoot" + 0.004*"rap" + 0.003*"stori" + 0.003*"sudden" + 0.003*"jesus"'),
 (1, '0.009*"joke" + 0.005*"parent" + 0.003*"comedi" + 0.003*"jenni" + 0.003*"repeat" + 0.003*"clinton" + 0.003*"funni" + 0.003*"marri" + 0.003*"tweet" + 0.003*"andi"'),
 (2, '0.007*"dude" + 0.006*"taco" + 0.004*"food" + 0.004*"wan" + 0.004*"differ" + 0.004*"snake" + 0.004*"american" + 0.004*"date" + 0.003*"murder" + 0.003*"gun"')]
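A side note on the max_df=.8 argument passed to CountVectorizer in cell [160] above: with a float value it drops any term that appears in more than that fraction of documents, which is another way to strip out corpus-wide filler words on top of the stop word list. A toy sketch of the behavior (the documents here are made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the show was funny', 'the show was long', 'the show was great']
cv = CountVectorizer(max_df=0.8)   # drop terms appearing in more than 80% of documents
cv.fit(docs)
print(sorted(cv.vocabulary_))      # ['funny', 'great', 'long'] -- 'the', 'show', 'was' are dropped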
In [163]: # Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

Out[163]: [(0, '0.005*"dude" + 0.005*"stori" + 0.005*"clinton" + 0.005*"repeat" + 0.004*"wife" + 0.004*"gun" + 0.004*"joke" + 0.003*"cunt" + 0.003*"slut" + 0.003*"care"'),
 (1, '0.009*"hasan" + 0.007*"parent" + 0.006*"brown" + 0.005*"marri" + 0.005*"dream" + 0.005*"birthday" + 0.005*"bike" + 0.004*"york" + 0.004*"door" + 0.004*"immigr"'),
 (2, '0.005*"joke" + 0.005*"jenni" + 0.004*"dude" + 0.003*"tweet" + 0.003*"murder" + 0.003*"cours" + 0.003*"parent" + 0.003*"bruce" + 0.003*"jenner" + 0.003*"date"'),
 (3, '0.007*"taco" + 0.006*"joke" + 0.006*"ahah" + 0.006*"nigga" + 0.006*"dude" + 0.005*"snake" + 0.004*"food" + 0.004*"american" + 0.004*"trevor" + 0.004*"wall"')]

4.5 Identify Topics in Each Document

Out of the 9 topic models we looked at, the 4-topic model built on nouns and adjectives made the most sense. So let's pull that one down here and run it through more iterations to get more fine-tuned topics.

In [166]: # Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=80)
pprint(ldana.print_topics())

[(0, '0.010*"ahah" + 0.008*"murder" + 0.006*"nigga" + 0.005*"young" + 0.005*"dude" + 0.005*"rap" + 0.004*"date" + 0.004*"touch" + 0.004*"cours" + 0.004*"suck"'),
 (1, '0.007*"jenni" + 0.006*"repeat" + 0.005*"andi" + 0.004*"slut" + 0.004*"contact" + 0.004*"prolong" + 0.004*"husband" + 0.004*"song" + 0.004*"comedi" + 0.003*"wan"'),
 (2, '0.011*"joke" + 0.005*"cunt" + 0.005*"clinton" + 0.004*"funni" + 0.004*"gun" + 0.004*"parti" + 0.004*"tweet" + 0.003*"hate" + 0.003*"care" + 0.003*"wife"'),
 (3, '0.010*"dude" + 0.006*"taco" + 0.004*"snake" + 0.004*"door" + 0.004*"sleep" + 0.004*"wall" + 0.004*"stori" + 0.004*"parent" + 0.003*"presid" + 0.003*"hasan"')]
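Since passes=80 takes a while to run, it can be worth persisting the fitted model so the topics don't have to be re-estimated every time the notebook is opened. A small sketch (the file name 'ldana_4topics.model' is just an example, and gensim writes a few companion files next to it):

# Save the fitted model to disk, then reload it later without retraining
ldana.save('ldana_4topics.model')
ldana_loaded = models.LdaModel.load('ldana_4topics.model')

# Inspect a single topic as (word, weight) pairs instead of parsing print_topics() strings
print(ldana_loaded.show_topic(0, topn=10))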
These four topics look pretty decent. Let's settle on these for now.

Topic 0: african american, ghetto, crime
Topic 1: gender
Topic 2: politics
Topic 3: family

In [168]: # Let's take a look at which topics each transcript contains
corpus_transformed = ldana[corpusna]
pprint(list(zip([a for [(a, b)] in corpus_transformed], data_dtmna.index)))

[(1, 'ali'),
 (2, 'anthony'),
 (3, 'bill'),
 (1, 'bo'),
 (0, 'dave'),
 (3, 'hasan'),
 (2, 'jim'),
 (3, 'joe'),
 (2, 'john'),
 (0, 'louis'),
 (1, 'mike'),
 (2, 'ricky'),
 (3, 'trevor')]

For a first pass of LDA, these assignments mostly make sense to me, so we'll call it a day for now.

Topic 0: african american [Dave, Louis]
Topic 1: gender [Ali, Bo, Mike]
Topic 2: politics [Anthony, Jim, John, Ricky]
Topic 3: family [Bill, Hasan, Joe, Trevor]

5. Text Generation

5.1 Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simple assumption that the next word depends only on the previous word, which is the basic assumption of a Markov chain.

Note: an LSTM (a deep learning model) is generally a better technique for text generation than a Markov chain.
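To make the "next word depends only on the previous word" assumption concrete before building the full chain in the next section, here is a tiny sketch of word-to-word transition counts for a made-up string (the string and variable names are illustrative only):

from collections import Counter

tiny = 'the cat saw the dog and the cat ran'
words = tiny.split()

# Count how often each word is followed by each possible next word
transitions = Counter(zip(words[:-1], words[1:]))
for (w1, w2), n in transitions.items():
    print(w1, '->', w2, n)   # e.g. the -> cat 2, cat -> saw 1, saw -> the 1, ...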
5.2 Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Trevor Noah, so as a first step, let's extract the text from his comedy routine.

In [169]: # Read in the corpus, including punctuation!
import pandas as pd
data = pd.read_pickle('corpus.pkl')

# Extract only Trevor Noah's text
trevor_text = data.transcript.loc['trevor']
trevor_text[:200]

Out[169]: 'A NETFLIX ORIGINAL COMEDY SPECIAL [distant traffic] LIVE NATION PRESENTS TREVOR NOAH [presenter] Beautiful people, put your hands together for Trevor Noah. [shouting and whooping] [hip hop intro music'

5.3 Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
The keys should be all of the words in the corpus
The values should be a list of the words that follow the keys

In [170]: from collections import defaultdict  # lets us append to a key that doesn't exist yet

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as a key
       and each value as the list of words that come after the key in the text.'''

    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
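As a quick check of the dictionary structure this function produces, here is a sketch on the same tiny made-up string used earlier (output shown approximately):

tiny_dict = markov_chain('the cat saw the dog and the cat ran')
print(tiny_dict)
# {'the': ['cat', 'dog', 'cat'], 'cat': ['saw', 'ran'], 'saw': ['the'],
#  'dog': ['and'], 'and': ['the']}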
In [178]: # Create the dictionary for Trevor's routine, take a look at it
trevor_dict = markov_chain(trevor_text)

5.4 Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
The dictionary you just created
The number of words you want generated

Here are some examples of generated sentences:

'Shape right turn– I also takes so that she’s got women all know that snail-trail.'
'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [172]: import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words,
       along with the number of words you would like to see in your generated sentence.'''

    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count - 1):
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [177]: generate_sentence(trevor_dict, 30)

Out[177]: 'Right, that’s moms. Dads will be somebody needs 25 billion dollars the guy was, they have vowels. I know what was your mom.” Ah… The mind of just put your.'
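A small usage sketch, assuming trevor_dict from the cell above: seeding Python's random module makes the output repeatable, and calling the function in a loop gives a few sentences at a time.

import random

random.seed(42)   # make the randomly generated sentences repeatable
for _ in range(3):
    print(generate_sentence(trevor_dict))   # default length of 15 words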
6. Conclusion

Question: What makes Trevor Noah's comedy routine stand out?

Exploratory Data Analysis:
  Top Words (Word Clouds): He talks about love and his friends a lot.
  Vocabulary Size (Bar Plot): He ranked second highest in number of words per minute, which means he talks fast. He ranked third lowest in number of unique words, which might be why his comedy is easier to understand than the others'.
  Amount of Profanity (Scatter Plot): He doesn't use the f-word in this sample, and rarely uses the s-word.

NLP Techniques:
  Sentiment Analysis: He tends to be more positive and less opinionated.
  Topic Modeling: His comedy involves the topics of family, friends and love.
  Text Generation: A simple Markov chain built from his transcript can generate new text in his style.