Esteban Ribero, Assignment #8 - MSDS 422 | Winter 2019
Language Modeling With an RNN
Purpose and summary of results
The purpose of this exercise is to test different language models for predicting movie review
sentiment using recurrent neural networks (RNN). In particular, we test the effect of different
pretrained word vectors and vocabulary sizes on the model's predictive accuracy.
The target is to predict the sentiment (positive or negative) of the movie reviews. We use a 2x2
experimental design to isolate the effects of the vector dimension and the vocabulary size on
training time and prediction accuracy. We use two versions of the GloVe global word vectors
(glove.6B.300d and glove.6B.50d) developed at Stanford using content from Wikipedia+Gigaword. Both
of these neural network embeddings contain 400k words, but they differ in the number of dimensions
associated with each word: the simpler one represents each word with a vector of size 50, while the
other uses a vector of size 300. These numeric vectors have been pretrained and carry with them the
meaning of the words in natural language, so they are a great way to represent words for language
models. We limit the vocabulary of each of these embeddings to the 10,000 or 100,000 most common
words, and we test the effect of these combinations on runtime and prediction accuracy.
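The four configurations compared in this exercise can therefore be summarized as follows (an illustrative listing only; the notebook below builds each configuration step by step rather than looping over such a structure):

# illustrative summary of the 2x2 experimental grid (not part of the original notebook)
experiments = [
    {'embedding': 'glove.6B.50d.txt',  'vector_dim': 50,  'vocab_size': 10000},
    {'embedding': 'glove.6B.50d.txt',  'vector_dim': 50,  'vocab_size': 100000},
    {'embedding': 'glove.6B.300d.txt', 'vector_dim': 300, 'vocab_size': 10000},
    {'embedding': 'glove.6B.300d.txt', 'vector_dim': 300, 'vocab_size': 100000},
]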
We found that the vocabulary size of the embeddings has little effect on the processing time
required to train the models and only a small effect on the models' prediction accuracy: the larger
vocabulary increased prediction accuracy by only 0.02 points. On the other hand, the larger word
vector had a significant effect on prediction accuracy, moving the models from predicting only
slightly above random choice (55% accuracy) to accurately predicting the sentiment of the review
73% of the time. The managerial implications of these findings are discussed at the end.
Loading the required packages
In [1]: 
Important functions for embedding and text parsing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # static plotting
from time import time #time counters
import os # operating system functions
import os.path # for manipulation of file path names
import re # regular expressions
from collections import defaultdict
import nltk
from nltk.tokenize import TreebankWordTokenizer
from sklearn.model_selection import train_test_split  # for random splitting of the data
import tensorflow as tf
To keep the code organized and provide clarity with the experimental design, we will first define a
set of important functions that we will call later. The following function resets the graph and
sets the random seed to create stable outputs across runs.
In [2]: 
The following utility function loads the pre-trained and downloaded embeddings to create the
features that will be passed to the RNN. It follows methods described in
https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
(https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer).
In [3]: 
The following function is used to parse a string of text by removing non-alphanumeric characters,
code characters, and stopwords (if we were to use them), lowercasing all words, and removing
unnecessary spaces. We will call this function in subsequent code when preparing the data.
RANDOM_SEED = 9999
def reset_graph(seed=RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
    else:
        word_to_embedding_dict = dict()
    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):
            split = line.split(' ')
            word = split[0]
            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation])
            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation
    # For unknown words, the representation is an empty vector
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND)
        return word_to_embedding_dict
In [4]: 
A utility function to get file names within a directory
In [5]: 
The function for reading and storing the data
In [6]: 
The data
The data is a collection of 500 positive and 500 negative movie reviews. The length of the reviews
ranges from 22 to 1,052 words. We will first gather and store the 500 negative reviews with the
code below. The data is stored in a list of lists, where each inner list represents a document as a
list of words.
REMOVE_STOPWORDS = False  # we won't remove stopwords for this exercise

def text_parse(string):
    codelist = ['\r', '\n', '\t']  # list of codes to be dropped
    # replace non-alphanumeric with space
    temp_string = re.sub('[^a-zA-Z]', ' ', string)
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + ' '
        temp_string = re.sub(stopstring, ' ', temp_string)
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)
    # convert uppercase to lowercase
    temp_string = temp_string.lower()
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for i in range(len(stoplist)):
            stopstring = ' ' + str(stoplist[i]) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)
    return(temp_string)
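For illustration, here is a quick, hypothetical usage example of text_parse (the sample string is made up and not part of the movie review data):

# hypothetical usage example of text_parse (sample string is illustrative only)
sample = "The movie was GREAT!\nI would watch it again -- 10/10."
print(text_parse(sample))  # punctuation and digits dropped, text lowercased, extra spaces collapsed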
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)
def read_data(filename):
    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank
    return data
In [7]: 
We now do the same for the positive reviews
In [8]: 
Since the reviews vary considerably in length, we will create lists of documents of at most 40 words
each. To do that, we will take the first 20 words and the last 20 words from each review and discard
everything in between. The result will be a list of 1000 lists (500 negative, 500 positive) with 40
words in each list.
In [9]: 
Defining the first language model
We will use the Glove.6B.50d embeddings with a vocabulary size of 10,000 words. We first load
the Glove embeddings using the load_embedding_from_disks function defined previously.
Processed 500 document files under movie-reviews-negative
Processed 500 document files under movie-reviews-positive
# gather data for the negative movie reviews
dir_name = 'movie-reviews-negative'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)
negative_documents = []
for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    negative_documents.append(words)
print('Processed {} document files under {}'.format(
    len(negative_documents), dir_name))
# gather data for the positive movie reviews
dir_name = 'movie-reviews-positive'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)
positive_documents = []
for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    positive_documents.append(words)
print('Processed {} document files under {}'.format(
    len(positive_documents), dir_name))
# constructing a list of 1000 lists with 40 words in each list
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))
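As a quick sanity check (a minimal sketch, assuming negative_documents, positive_documents, and documents have been built as above), we can verify the review length range and the shape of the truncated documents:

# sanity check on the document lists (illustrative only)
all_reviews = negative_documents + positive_documents
lengths = [len(doc) for doc in all_reviews]
print('Number of reviews:', len(all_reviews))
print('Shortest and longest review (words):', min(lengths), max(lengths))
print('Number of truncated documents:', len(documents))
print('Words per truncated document:', set(len(doc) for doc in documents))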
In [10]: 
In [11]: 
Now, we will reduce the size of the vocabulary to 10,000 words. Since the most common words are
listed first, we will select the rows between 0 and 10,000. The following code creates a limited
index for the embedding and clears the rest to free CPU RAM.
In [12]: 
Now we create a list of lists of lists with the embeddings. Every word in every review is now
represented by a vector of dimension 50 per the Glove.6B.50d embedding that we are using.
In [13]: 
We are now ready to create the training and test sets. We first make the embeddings a numpy array
to feed the RNN, and we create the labels: 0 for negatives and 1 for positives, given the order in
which we loaded the documents. Lastly, we use scikit-learn to randomly split the data into a
training set (80%) and a test set (20%).
Loading embeddings from embeddings/gloVe.6B/glove.6B.50d.txt
Embedding is of shape: (400001, 50)
embeddings_directory = 'embeddings/gloVe.6B'  # embeddings source
filename = 'glove.6B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
EVOCABSIZE = 10000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
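As an illustrative check (a minimal sketch; 'zyzzyva' is just an assumed out-of-vocabulary word), any word that is not among the 10,000 most common words maps to the final, all-zero row of the limited embedding:

# illustrative check: out-of-vocabulary words map to the appended all-zero row
oov_word = 'zyzzyva'  # assumed to be outside the top 10,000 words
oov_index = limited_word_to_index[oov_word]
print('Index assigned to the OOV word:', oov_index)  # equals EVOCABSIZE
print('Its embedding is all zeros:',
      not limited_index_to_embedding[oov_index].any())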
In [14]: 
Creating the graph
We will be using the same graph for all four language models. It contains a simple recurrent neural
network with 30 neurons. We will use the Adam optimizer with a learning rate of 0.0003 and
cross-entropy as the cost function to minimize.
In [15]: 
Executing the graph
We will train the model for 50 epochs with mini-batches of size 100. We will estimate the runtime
and collect the results for comparison with the other language models.
WARNING:tensorflow:From <ipython-input-15-81f75ef979ce>:9: BasicRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be replaced by that in Tensorflow 2.0.
embeddings_array = np.array(embeddings)
# Define the labels to be used: 500 negative (0) and 500 positive (1)
thumbs_down_up = np.concatenate((np.zeros((500), dtype=np.int32),
                                 np.ones((500), dtype=np.int32)), axis=0)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
reset_graph()
n_steps = embeddings_array.shape[1] # number of words per document
n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings
n_neurons = 30 # number of neurons
n_outputs = 2 # thumbs-down or thumbs-up
learning_rate = 0.0003
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
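For reference, since the warning above notes that BasicRNNCell is deprecated in favor of tf.keras.layers.SimpleRNNCell, a roughly equivalent model could be written with the Keras API as follows. This is only a hedged sketch using TensorFlow 2.x conventions; it is not the graph actually used in this exercise.

# hedged sketch: an approximate tf.keras equivalent of the graph above (TF 2.x style)
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(n_neurons, input_shape=(n_steps, n_inputs)),
    tf.keras.layers.Dense(n_outputs)  # logits for thumbs-down/thumbs-up
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])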
In [17]: 
In [18]: 
Defining the second language model
We will use the same Glove.6B.50d embeddings as before, but this time we increase the vocabulary
size tenfold to 100,000 words. We need to load the embeddings again since we deleted the
index_to_embedding variable to clear memory. Everything else is the same but with
EVOCABSIZE = 100,000.
In [19]: 
Train accuracy: 0.66 Test accuracy: 0.535
Total runtime in seconds: 6.46
Loading embeddings from embeddings/gloVe.6B/glove.6B.50d.txt
Embedding is of shape: (400001, 50)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
first_round_training_results = training_results
first_round_runtime = round(runtime, ndigits=3)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
In [20]: 
In [21]: 
We create new training and test sets with the adjusted embeddings using the larger vocabulary
size.
In [22]: 
Executing the graph
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
In [23]: 
In [24]: 
Defining the third language model
For the next two rounds we will use the Glove.6B.300d embeddings, which contain vectors of
dimension 300 instead of 50 for each word in the vocabulary. This is a much larger and more
comprehensive embedding; however, we will again restrict the size of the vocabulary for our
exercise to 10,000 in this round, and then 100,000 in the last round. Everything else is the same.
In [25]: 
In [26]: 
Train accuracy: 0.72 Test accuracy: 0.555
Total runtime in seconds: 6.326
Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
second_round_training_results = training_results
second_round_runtime = round(runtime, ndigits=3)
embeddings_directory = 'embeddings/gloVe.6B'  # embeddings source
filename = 'glove.6B.300d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
In [27]: 
In [28]: 
We create new training and test sets with the adjusted embeddings of dimension 300.
In [29]: 
In [30]: 
EVOCABSIZE = 10000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
reset_graph()
n_steps = embeddings_array.shape[1] # number of words per document
n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings
n_neurons = 30 # analyst specified number of neurons
n_outputs = 2 # thumbs-down or thumbs-up
learning_rate = 0.0003
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
In [31]: 
In [32]: 
Defining the fourth language model
We use a vocabulary of the 100,000 most common words and word vectors of dimension 300.
In [33]: 
In [34]: 
Train accuracy: 0.94 Test accuracy: 0.725
Total runtime in seconds: 19.574
Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
third_round_training_results = training_results
third_round_runtime = round(runtime, ndigits=3)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
In [35]: 
We create new training and test sets.
In [36]: 
In [37]: 
In [38]: 
Results
Train accuracy: 0.93 Test accuracy: 0.735
Total runtime in seconds: 19.637
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
fourth_round_training_results = training_results
fourth_round_runtime = round(runtime, ndigits=3)
In [44]: 
As can be seen in the table above, the first two models with word vectors of size 50 barely perform
above random choice, with final test set accuracies of 0.535 and 0.555. The models are clearly
under-fitting the data, given that the accuracy on the training set is also low (0.66 and 0.72).
Notice the difference in training set accuracy with the larger vocabulary size: a larger vocabulary
does help the model better fit the data, reducing bias, yet the performance is still not
satisfactory, and the increase in test set accuracy is only 2 percentage points. The difference in
runtime between the two vocabulary sizes is negligible and probably noise.
The models using word vectors of size 300 perform significantly better. Both models reach a final
accuracy above 72% on the test set and above 93% on the training set. Again, the vocabulary size
has a negligible effect on runtime and only a minor effect on test set accuracy, with the same
2-percentage-point increase for the larger vocabulary. The runtime while training the models is,
however, significantly higher with the larger word vectors: the models with word vectors of size
300 take about three times longer to train than those with word vectors of size 50. Let's take a
look at the learning curves.
In [45]: 
Out[44]:

                      Vector Dimension  Vocabulary Size  Runtime (Seconds)  Train Set Accuracy  Test Set Accuracy
GloVe.50 VSize 10k                  50            10000              6.460                0.66              0.535
GloVe.50 VSize 100k                 50           100000              6.326                0.72              0.555
GloVe.300 VSize 10k                300            10000             19.574                0.94              0.725
GloVe.300 VSize 100k               300           100000             19.637                0.93              0.735
d = {'Vector Dimension':[50,50,300,300],
'Vocabulary Size': [10000,100000,10000,100000],
'Runtime (Seconds)' :[first_round_runtime, second_round_runtime,
third_round_runtime, fourth_round_runtime],
'Train Set Accuracy' : [first_round_training_results[-1,0],
second_round_training_results[-1,0],
third_round_training_results[-1,0],
fourth_round_training_results[-1,0]],
'Test Set Accuracy': [first_round_training_results[-1,1],
second_round_training_results[-1,1],
third_round_training_results[-1,1],
fourth_round_training_results[-1,1]]}
results_table = pd.DataFrame(index = ['GloVe.50 VSize 10k',
'GloVe.50 VSize 100k',
'GloVe.300 VSize 10k',
'GloVe.300 VSize 100k'], data=d)
results_table
data = {'GloVe.50 VSize 10k' :first_round_training_results[:,1],
'GloVe.50 VSize 100k':second_round_training_results[:,1],
'GloVe.300 VSize 10k':third_round_training_results[:,1],
'GloVe.300 VSize 100k':fourth_round_training_results[:,1]}
learning_curves = pd.DataFrame(data=data)
In [46]: 
As can be seen in the graph above, the models with word vectors of size 50 struggle to learn
effectively over time. The learning rate of the RNN was set low (0.0003) to prevent overfitting,
but for the models with low-dimensional word vectors this setting hinders the model's ability to
learn. On the other hand, this learning rate appears to be appropriate for the high-dimensional
word vectors: those models learn effectively over time and plateau at around 0.72. Notice that the
top prediction accuracy on the test set (around 75%) is achieved at some point before the end of
training, so it is possible that the models started overfitting before training ended. Tweaking
some hyper-parameters might prevent this from happening and boost performance slightly, but it is
likely that a more complex RNN is needed to improve performance more significantly. The larger
vocabulary size does appear to help the model learn more effectively over time, but its effect is
not nearly as important as the dimensionality of the word vectors.
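One natural way to build that more complex RNN is to swap the basic cell for a gated cell such as an LSTM. The following is only a hedged sketch of that change, assuming the same placeholders and hyperparameters defined earlier; this variant was not run in this exercise and its accuracy is not reported here.

# hedged sketch: replacing the basic RNN cell with an LSTM cell (not run in this exercise)
reset_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(lstm_cell, X, dtype=tf.float32)
# LSTM states are a (c, h) tuple; use the hidden state h for classification
logits = tf.layers.dense(states.h, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
training_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)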
Conclusion
Pre-trained word vectors are a great solution for language models. They make the use of RNNs for
language processing practical, and they do not appear to be tremendously costly in terms of
processing time, at least for the simple RNN used in this exercise. Larger word vectors have a
significant advantage over shorter ones, as they provide more dimensions for the model to learn
from. This type of language model could be used effectively to classify customer reviews and call
complaint logs. Even if the accuracy of the model is not extremely high, it can still provide
guidance and automate the otherwise painful and costly task of manually reviewing and classifying
thousands of reviews. Simply classifying reviews into positive and negative is a first step, but
this could be combined with models for topic extraction, so managers can better address the
underlying problems behind negative reviews or leverage the positive aspects of products and
services highlighted in the reviews. A classification model to identify critical complaints could
also be developed so that those complaints can be passed on to customer representatives to be
addressed personally, reducing potential risks for the company and elevating the service to
customers.
learning_curves.plot.line()
ax = plt.xlim(-1,50)
ax = plt.title("Learning Curves")
ax = plt.xlabel("epoch"), plt.ylabel("Accuracy Score")
ax = plt.legend(loc="best")

More Related Content

Recently uploaded

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 

Recently uploaded (20)

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
Marius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
Expeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
Pixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
marketingartwork
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
Skeleton Technologies
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Language Modeling With RNN

  • 1. Esteban Ribero, Assignment #8 - MSDS 422 | Winter 2019 Language Modeling With an RNN Purpose and summary of results The purpose of this exercise is to test different language models for predicting movie review sentiment using recurrent neural networks (RNN). In particular, we test the effect of different pertained words vectors and vocabulary size on the model’s predictive accuracy. The target is to predict the sentiment (positive or negative) of the movie reviews. We use a 2x2 experimental design to isolate the effect of the vector dimension and the vocabulary size in training time and prediction accuracy. We use two versions of the global vectors GloVe (glove.6b.300d and glove.6b.50d) developed at Stanford using content from Wikipedia+Gigaword. Both of these neural network embeddings contain 400k words in them but they differ in the number of dimensions associated with each word. The simplest one uses a vector of size 50 to represent each word while the other one uses a vector of size 300. These numeric vectors have been pertained and carry with them the meaning of the words in natural language and so they are a great way to represent words for language models. We limit the size of the vocabulary for each of these embeddings to 10,000 or 100,000 of the most common words, and we test the effect of these combinations in runtime and prediction accuracy. We found that the vocabulary size of the embeddings has little effect on the processing time required to train the models and only a small effect on the models prediction accuracy. The larger vocabulary only increased prediction accuracy by 0.02 points. On the other hand, the larger word vector had a significant effect in prediction accuracy, moving the models from just predicting slightly above random choice (55% accuracy) to accurately predicting the sentiment of the review 73% of the time. The managerial implications of these findings are discussed at the end. Loading the required packages In [1]:  Important functions for embedding and text parsing import numpy as np import pandas as pd import matplotlib.pyplot as plt # static plotting from time import time #time counters import os # operating system functions import os.path # for manipulation of file path names import re # regular expressions from collections import defaultdict import nltk from nltk.tokenize import TreebankWordTokenizer from sklearn.model_selection import train_test_split #for random splitting of import tensorflow as tf
  • 2. To keep the code organized and provide clarity with the experimental design we will first define a set of important function that we will be calling out later. The following function resets the graph and sets the random seend to creates stable outputs across runs. In [2]:  The following utility function loads the pre-trained and downloaded embeddings to create the features that will be passed to the RNN. It follows methods described in https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer (https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer). In [3]:  The following function is used to parse a string of text by removing non-alphanumeric characters, code characters and stopwords (if we were to use them), and lowercase all words and removing unnecesary spaces. We will call this function in subsequent code when preparing the data RANDOM_SEED = 9999 def reset_graph(seed= RANDOM_SEED): tf.reset_default_graph() tf.set_random_seed(seed) np.random.seed(seed) def load_embedding_from_disks(embeddings_filename, with_indexes=True): if with_indexes: word_to_index_dict = dict() index_to_embedding_array = [] else: word_to_embedding_dict = dict() with open(embeddings_filename,'r',encoding='utf-8') as embeddings_file: for (i, line) in enumerate(embeddings_file): split = line.split(' ') word = split[0] representation = split[1:] representation = np.array( [float(val) for val in representation]) if with_indexes: word_to_index_dict[word] = i index_to_embedding_array.append(representation) else: word_to_embedding_dict[word] = representation # For unknown words, the representation is an empty vector _WORD_NOT_FOUND = [0.0] * len(representation) if with_indexes: _LAST_INDEX = i + 1 word_to_index_dict = defaultdict( lambda: _LAST_INDEX, word_to_index_dict) index_to_embedding_array = np.array( index_to_embedding_array + [_WORD_NOT_FOUND]) return word_to_index_dict, index_to_embedding_array else: word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND) return word_to_embedding_dict
  • 3. In [4]:  A utility function to get file names within a directory In [5]:  The function for reading and storing the data In [6]:  The data The data is a collection of 500 positive and 500 negative movie reviews. The length of the reviews goes from 22 words to 1052 words. We will first gather and store the 500 negative reviews with the code below. The data is stored in a list of list, where each list represents a document and a document is a list of words. REMOVE_STOPWORDS = False # we won't remove stopwords for this excercise def text_parse(string): codelist = ['r', 'n', 't'] #list of codes to be dropped # replace non-alphanumeric with space temp_string = re.sub('[^a-zA-Z]', ' ', string) # replace codes with space for i in range(len(codelist)): stopstring = ' ' + codelist[i] + ' ' temp_string = re.sub(stopstring, ' ', temp_string) # replace single-character words with space temp_string = re.sub('s.s', ' ', temp_string) # convert uppercase to lowercase temp_string = temp_string.lower() if REMOVE_STOPWORDS: # replace selected character strings/stop-words with space for i in range(len(stoplist)): stopstring = ' ' + str(stoplist[i]) + ' ' temp_string = re.sub(stopstring, ' ', temp_string) # replace multiple blank characters with one blank character temp_string = re.sub('s+', ' ', temp_string) return(temp_string) def listdir_no_hidden(path): start_list = os.listdir(path) end_list = [] for file in start_list: if (not file.startswith('.')): end_list.append(file) return(end_list) def read_data(filename): with open(filename, encoding='utf-8') as f: data = tf.compat.as_str(f.read()) data = data.lower() data = text_parse(data) data = TreebankWordTokenizer().tokenize(data) # The Penn Treebank return data
  • 4. In [7]:  We now do the same for the positive reviews In [8]:  Since the reviews vary considerably in length we will create lists of documents of max 40 words each. To do that we will take the first 20 words and the last 20 words from each reviews and get rid of everything in between. The result will be a list of 1000 lists (500 negative, 500 positive) with 40 words in each list. In [9]:  Defining the first language model We will use the Glove.6B.50d embeddings with a vocabulary size of 10,000 words. We first load the Glove embeddings using the load_embedding_from_disks function defined previously. Processed 500 document files under movie-reviews-negative Processed 500 document files under movie-reviews-positive # gather data for the negative movie reviews dir_name = 'movie-reviews-negative' filenames = listdir_no_hidden(path=dir_name) num_files = len(filenames) negative_documents = [] for i in range(num_files): words = read_data(os.path.join(dir_name, filenames[i])) negative_documents.append(words) print('Processed {} document files under {}'.format( len(negative_documents),dir_name)) # gather data for the positive movie reviews dir_name = 'movie-reviews-positive' filenames = listdir_no_hidden(path=dir_name) num_files = len(filenames) positive_documents = [] for i in range(num_files): words = read_data(os.path.join(dir_name, filenames[i])) positive_documents.append(words) print('Processed {} document files under {}'.format( len(positive_documents),dir_name)) # constructing a list of 1000 lists with 40 words in each list from itertools import chain documents = [] for doc in negative_documents: doc_begin = doc[0:20] doc_end = doc[len(doc) - 20: len(doc)] documents.append(list(chain(*[doc_begin, doc_end]))) for doc in positive_documents: doc_begin = doc[0:20] doc_end = doc[len(doc) - 20: len(doc)] documents.append(list(chain(*[doc_begin, doc_end])))
  • 5. In [10]:  In [11]:  Now, we will reduce the size of the vocabulary to 10,000 words. Since the most common words are listed first, we will select the rows between 0 and 10,000. The following code will create a limited index for the embedding and clear the rest to save CPU and RAM. In [12]:  Now we create a list of list of list with the embeddings. Every word in every review is now represented by a vector of dimension 50 per the Glove.6B.50 embedding that we are using. In [13]:  We are now ready to create the training and test set. We first make the embeddings a numpy array to feed the RNN, and we create the labels: 0 for negatives and 1 for positives given the order in which we loaded the documents. Lastly, we use Scikit-Learn to random splitting the data into training set (80%) and test set (20%) Loading embeddings from embeddings/gloVe.6Bglove.6B.50d.txt Embedding is of shape: (400001, 50) embeddings_directory = 'embeddings/gloVe.6B' #embeddings source filename = 'glove.6B.50d.txt' embeddings_filename = os.path.join(embeddings_directory, filename) print('Loading embeddings from', embeddings_filename) word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename, with_indexes=True) vocab_size, embedding_dim = index_to_embedding.shape print("Embedding is of shape: {}".format(index_to_embedding.shape)) EVOCABSIZE = 10000 #desired size of pre-defined embedding vocabulary def default_factory(): return EVOCABSIZE limited_word_to_index = defaultdict(default_factory, {k: v for k, v in word_to_index.items() if v < EVOCABSIZE}) limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:] limited_index_to_embedding = np.append(limited_index_to_embedding, index_to_embedding[index_to_embedding.shape[0] - 1, :]. reshape(1,embedding_dim),axis = 0) #unknown-word row = zeros del index_to_embedding # to clear some CPU RAM # create list of lists of lists for embeddings embeddings = [] for doc in documents: embedding = [] for word in doc: embedding.append( limited_index_to_embedding[limited_word_to_index[word]]) embeddings.append(embedding)
  • 6. In [14]:  Creating the graph We will be using the same graph for all four language models. It contains a simple Recurrent Neural Network with 30 neurons. We will use AdamOptimizer with a learning rate of 0.0003 and the cross entropy as the cost function to minimize. In [15]:  Executing the graph We will train the model in 50 epochs with mini-batches of size 100. We will estimate the runtime and collect the results for comparison with the other language models. WARNING:tensorflow:From <ipython-input-15-81f75ef979ce>:9: BasicRNNCell.__i nit__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be repl aced by that in Tensorflow 2.0. embeddings_array = np.array(embeddings) # Define the labels to be used 500 negative (0) and 500 positive (1) thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), np.ones((500), dtype = np.int32)), axis = 0) # Random splitting of the data in to training (80%) and test (20%) X_train, X_test, y_train, y_test = train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, random_state = RANDOM_SEED) reset_graph() n_steps = embeddings_array.shape[1] # number of words per document n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings n_neurons = 30 # number of neurons n_outputs = 2 # thumbs-down or thumbs-up learning_rate = 0.0003 X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) y = tf.placeholder(tf.int32, [None]) basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons) outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32) logits = tf.layers.dense(states, n_outputs) xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits) loss = tf.reduce_mean(xentropy) optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) training_op = optimizer.minimize(loss) correct = tf.nn.in_top_k(logits, y, 1) accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
Executing the graph

We will train the model for 50 epochs with mini-batches of size 100. We will estimate the runtime and collect the results for comparison with the other language models.

In [17]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.66 Test accuracy: 0.535
Total runtime in seconds: 6.46

In [18]:
# collecting the results
first_round_training_results = training_results
first_round_runtime = round(runtime, ndigits=3)

Defining the second language model

We will use the same Glove.6B.50d embeddings as before, but this time we increase the vocabulary size tenfold to 100,000 words. We need to load the embeddings again since we deleted the index_to_embedding variable to clear memory. Everything else is the same, but with EVOCABSIZE = 100,000.

In [19]:
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename,
                                                              with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))

Loading embeddings from embeddings/gloVe.6B/glove.6B.50d.txt
Embedding is of shape: (400001, 50)
In [20]:
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1, embedding_dim),
    axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM

In [21]:
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)

We create new training and test sets with the adjusted embeddings and the larger vocabulary size.

In [22]:
embeddings_array = np.array(embeddings)

# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(embeddings_array,
    thumbs_down_up, test_size=0.20, random_state=RANDOM_SEED)
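The vocabulary-limiting and document-embedding steps above repeat the first round with a different EVOCABSIZE, and the same pattern recurs in the remaining rounds. A small helper function (hypothetical; not in the original notebook) could wrap these steps, assuming load_embedding_from_disks and documents are defined as above:

def build_embeddings_array(embeddings_filename, evocabsize, documents):
    """Load GloVe vectors, keep the evocabsize most common words, and map
    every word of every document to its vector (zeros if out of vocabulary)."""
    word_to_index, index_to_embedding = load_embedding_from_disks(
        embeddings_filename, with_indexes=True)
    embedding_dim = index_to_embedding.shape[1]
    limited_word_to_index = defaultdict(lambda: evocabsize,
        {k: v for k, v in word_to_index.items() if v < evocabsize})
    limited_index_to_embedding = np.append(
        index_to_embedding[0:evocabsize, :],
        index_to_embedding[-1, :].reshape(1, embedding_dim),  # unknown-word row
        axis=0)
    del index_to_embedding  # drop the full array so it can be garbage-collected
    return np.array([[limited_index_to_embedding[limited_word_to_index[word]]
                      for word in doc] for doc in documents])

# Possible use, with the same path and vocabulary size as this round:
# embeddings_array = build_embeddings_array(embeddings_filename, 100000, documents)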
Executing the graph

In [23]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.72 Test accuracy: 0.555
Total runtime in seconds: 6.326

In [24]:
# collecting the results
second_round_training_results = training_results
second_round_runtime = round(runtime, ndigits=3)

Defining the third language model

For the next two rounds we will be using the Glove.6B.300d embeddings, which contain vectors of dimension 300 instead of 50 for each word in the vocabulary. This is a much larger and more comprehensive embedding; however, we will again restrict the vocabulary size for our exercise, to 10,000 words in this round and 100,000 in the last round. Everything else is the same.

In [25]:
embeddings_directory = 'embeddings/gloVe.6B'  # embeddings source
filename = 'glove.6B.300d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)

In [26]:
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename,
                                                              with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))

Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)
In [27]:
EVOCABSIZE = 10000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1, embedding_dim),
    axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM

In [28]:
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)

We create new training and test sets with the adjusted embeddings of dimension 300.

In [29]:
embeddings_array = np.array(embeddings)

# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(embeddings_array,
    thumbs_down_up, test_size=0.20, random_state=RANDOM_SEED)

In [30]:
reset_graph()

n_steps = embeddings_array.shape[1]   # number of words per document
n_inputs = embeddings_array.shape[2]  # dimension of pre-trained embeddings
n_neurons = 30                        # analyst-specified number of neurons
n_outputs = 2                         # thumbs-down or thumbs-up
learning_rate = 0.0003

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
In [31]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.94 Test accuracy: 0.725
Total runtime in seconds: 19.574

In [32]:
# collecting the results
third_round_training_results = training_results
third_round_runtime = round(runtime, ndigits=3)

Defining the fourth language model

This model uses a vocabulary of the 100,000 most common words and word vectors of dimension 300.

In [33]:
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename,
                                                              with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))

Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)

In [34]:
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1, embedding_dim),
    axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
In [35]:
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)

We create new training and test sets.

In [36]:
embeddings_array = np.array(embeddings)

# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(embeddings_array,
    thumbs_down_up, test_size=0.20, random_state=RANDOM_SEED)

In [37]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.93 Test accuracy: 0.735
Total runtime in seconds: 19.637

In [38]:
# collecting the results
fourth_round_training_results = training_results
fourth_round_runtime = round(runtime, ndigits=3)

Results
In [44]:
d = {'Vector Dimension': [50, 50, 300, 300],
     'Vocabulary Size': [10000, 100000, 10000, 100000],
     'Runtime (Seconds)': [first_round_runtime, second_round_runtime,
                           third_round_runtime, fourth_round_runtime],
     'Train Set Accuracy': [first_round_training_results[-1, 0],
                            second_round_training_results[-1, 0],
                            third_round_training_results[-1, 0],
                            fourth_round_training_results[-1, 0]],
     'Test Set Accuracy': [first_round_training_results[-1, 1],
                           second_round_training_results[-1, 1],
                           third_round_training_results[-1, 1],
                           fourth_round_training_results[-1, 1]]}

results_table = pd.DataFrame(index=['GloVe.50 VSize 10k', 'GloVe.50 VSize 100k',
                                    'GloVe.300 VSize 10k', 'GloVe.300 VSize 100k'],
                             data=d)
results_table

Out[44]:
                        Vector Dimension   Vocabulary Size   Runtime (Seconds)   Train Set Accuracy   Test Set Accuracy
GloVe.50 VSize 10k                    50             10000               6.460                 0.66               0.535
GloVe.50 VSize 100k                   50            100000               6.326                 0.72               0.555
GloVe.300 VSize 10k                  300             10000              19.574                 0.94               0.725
GloVe.300 VSize 100k                 300            100000              19.637                 0.93               0.735

As can be seen in the table above, the first two models, with word vectors of size 50, barely perform above random choice, with a final accuracy on the test set of 0.535 and 0.555. The models are clearly under-fitting the data, given that the accuracy on the training set is also low (0.66 and 0.72). Notice the difference in training-set accuracy with the larger vocabulary size: a larger vocabulary does help the model fit the data better, reducing bias, yet the performance is still not satisfactory, and the increase in test-set accuracy is only 2 percentage points. The difference in runtime between the two vocabulary sizes is negligible and probably noise.

The models using word vectors of size 300 perform significantly better. Their final accuracy is above 72% on the test set and above 93% on the training set. Again, the vocabulary size has a negligible effect on runtime and only a minor effect on test-set accuracy (a 1-percentage-point increase with the larger vocabulary). The runtime while training, however, is significantly higher with the larger word vectors: about three times higher with vectors of size 300 than with vectors of size 50.
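To make the 2x2 comparison explicit, the average effect of each factor can be computed directly from results_table. A short sketch (not part of the original notebook) that averages test-set accuracy and runtime over each factor:

# Average test-set accuracy and runtime by vector dimension and by vocabulary size
print(results_table.groupby('Vector Dimension')[['Test Set Accuracy',
                                                 'Runtime (Seconds)']].mean())
print(results_table.groupby('Vocabulary Size')[['Test Set Accuracy',
                                                'Runtime (Seconds)']].mean())
# Vector dimension: roughly 0.545 vs 0.730 test accuracy and 6.4 vs 19.6 seconds
# Vocabulary size:  roughly 0.630 vs 0.645 test accuracy, with near-identical runtimes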
Let's take a look at the learning curves.

In [45]:
data = {'GloVe.50 VSize 10k': first_round_training_results[:, 1],
        'GloVe.50 VSize 100k': second_round_training_results[:, 1],
        'GloVe.300 VSize 10k': third_round_training_results[:, 1],
        'GloVe.300 VSize 100k': fourth_round_training_results[:, 1]}
learning_curves = pd.DataFrame(data=data)

In [46]:
learning_curves.plot.line()
ax = plt.xlim(-1, 50)
ax = plt.title("Learning Curves")
ax = plt.xlabel("epoch"), plt.ylabel("Accuracy Score")
ax = plt.legend(loc="best")

As can be seen in the graph above, the models with word vectors of size 50 struggle to learn effectively over time. The learning rate of the RNN was set low (0.0003) to prevent overfitting, but for the models with low-dimensional word vectors this setting hinders their ability to learn. On the other hand, this learning rate appears to be appropriate for the high-dimensional word vectors: those models learn effectively over time and plateau at around 0.72. Notice that the top prediction accuracy on the test set (around 75%) is achieved at some point before the end of training, so it is possible that the models started overfitting before training ended. Tweaking some hyper-parameters might prevent this and boost performance slightly, but it is likely that a more complex RNN is needed to improve performance more substantially (a minimal sketch of such a variation follows the conclusion). The larger vocabulary size does appear to help the models learn more effectively over time, but its effect is not nearly as important as the dimensionality of the word vectors.

Conclusion

Pre-trained word vectors are a great solution for language models. They make the use of RNNs for language processing practical, and they do not appear to be overly costly in terms of processing time, at least for the simple RNN used in this exercise. Larger word vectors have a significant advantage over shorter ones, as they provide more dimensions for the model to learn from.

This type of language model could be used effectively to classify customer reviews and call-complaint logs. Even if the accuracy of the model is not extremely high, it can still provide guidance and automate the otherwise painful and costly task of manually reviewing and classifying thousands of reviews. Simply classifying reviews into positive and negative is a first step, but it could be combined with models for topic extraction so that managers can better address the underlying problems behind negative reviews or leverage the positive aspects of products and services highlighted in them. A classification model to identify critical complaints could also be developed, so that those complaints can be passed on to customer representatives and addressed personally, reducing potential risks for the company and elevating the service to customers.
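As noted above, a more complex recurrent architecture is a natural next step. One possibility is to swap the BasicRNNCell for an LSTM cell while leaving the rest of the graph unchanged; the sketch below is untested and was not part of this exercise:

# Sketch: the same graph as before, but with an LSTM cell instead of BasicRNNCell.
reset_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(lstm_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states.h, n_outputs)  # LSTM state is (c, h); use h

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
training_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))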