Esteban Ribero, Assignment #8 - MSDS 422 | Winter 2019
Language Modeling With an RNN
Purpose and summary of results
The purpose of this exercise is to test different language models for predicting movie review
sentiment using recurrent neural networks (RNN). In particular, we test the effect of different
pretrained word vectors and vocabulary sizes on the model's predictive accuracy.
The target is to predict the sentiment (positive or negative) of the movie reviews. We use a 2x2
experimental design to isolate the effects of the vector dimension and the vocabulary size on
training time and prediction accuracy. We use two versions of the GloVe global word vectors
(glove.6B.300d and glove.6B.50d) developed at Stanford using content from Wikipedia+Gigaword. Both
of these neural network embeddings contain 400k words, but they differ in the number of dimensions
associated with each word: the simpler one represents each word with a vector of size 50, while the
other uses a vector of size 300. These numeric vectors have been pretrained and carry with them the
meaning of the words in natural language, so they are a great way to represent words for language
models. We limit the vocabulary of each of these embeddings to the 10,000 or 100,000 most common
words, and we test the effect of these combinations on runtime and prediction accuracy.
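The four configurations compared in this exercise can therefore be summarized as follows (an illustrative listing only; the notebook below builds each configuration step by step rather than looping over such a structure):

# illustrative summary of the 2x2 experimental grid (not part of the original notebook)
experiments = [
    {'embedding': 'glove.6B.50d.txt',  'vector_dim': 50,  'vocab_size': 10000},
    {'embedding': 'glove.6B.50d.txt',  'vector_dim': 50,  'vocab_size': 100000},
    {'embedding': 'glove.6B.300d.txt', 'vector_dim': 300, 'vocab_size': 10000},
    {'embedding': 'glove.6B.300d.txt', 'vector_dim': 300, 'vocab_size': 100000},
]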
We found that the vocabulary size of the embeddings has little effect on the processing time
required to train the models and only a small effect on the models' prediction accuracy: the larger
vocabulary increased prediction accuracy by only 0.02 points. On the other hand, the larger word
vector had a significant effect on prediction accuracy, moving the models from predicting only
slightly above random choice (55% accuracy) to accurately predicting the sentiment of the review
73% of the time. The managerial implications of these findings are discussed at the end.
Loading the required packages
In [1]: 
Important functions for embedding and text parsing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # static plotting
from time import time #time counters
import os # operating system functions
import os.path # for manipulation of file path names
import re # regular expressions
from collections import defaultdict
import nltk
from nltk.tokenize import TreebankWordTokenizer
from sklearn.model_selection import train_test_split  # for random splitting of the data
import tensorflow as tf
To keep the code organized and provide clarity with the experimental design, we will first define a
set of important functions that we will call later. The following function resets the graph and
sets the random seed to create stable outputs across runs.
In [2]: 
The following utility function loads the pre-trained and downloaded embeddings to create the
features that will be passed to the RNN. It follows methods described in
https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
(https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer).
In [3]: 
The following function is used to parse a string of text by removing non-alphanumeric characters,
code characters, and stopwords (if we were to use them), lowercasing all words, and removing
unnecessary spaces. We will call this function in subsequent code when preparing the data.
RANDOM_SEED = 9999
def reset_graph(seed=RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
    else:
        word_to_embedding_dict = dict()
    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):
            split = line.split(' ')
            word = split[0]
            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation])
            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation
    # For unknown words, the representation is an empty vector
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND)
        return word_to_embedding_dict
In [4]: 
A utility function to get file names within a directory
In [5]: 
The function for reading and storing the data
In [6]: 
The data
The data is a collection of 500 positive and 500 negative movie reviews. The length of the reviews
ranges from 22 to 1,052 words. We will first gather and store the 500 negative reviews with the
code below. The data is stored in a list of lists, where each inner list represents a document as a
list of words.
REMOVE_STOPWORDS = False  # we won't remove stopwords for this exercise

def text_parse(string):
    codelist = ['\r', '\n', '\t']  # list of codes to be dropped
    # replace non-alphanumeric with space
    temp_string = re.sub('[^a-zA-Z]', ' ', string)
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + ' '
        temp_string = re.sub(stopstring, ' ', temp_string)
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)
    # convert uppercase to lowercase
    temp_string = temp_string.lower()
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for i in range(len(stoplist)):
            stopstring = ' ' + str(stoplist[i]) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)
    return(temp_string)
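For illustration, here is a quick, hypothetical usage example of text_parse (the sample string is made up and not part of the movie review data):

# hypothetical usage example of text_parse (sample string is illustrative only)
sample = "The movie was GREAT!\nI would watch it again -- 10/10."
print(text_parse(sample))  # punctuation and digits dropped, text lowercased, extra spaces collapsed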
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)
def read_data(filename):
    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank
    return data
In [7]: 
We now do the same for the positive reviews
In [8]: 
Since the reviews vary considerably in length, we will create lists of documents of at most 40 words
each. To do that, we will take the first 20 words and the last 20 words from each review and discard
everything in between. The result will be a list of 1000 lists (500 negative, 500 positive) with 40
words in each list.
In [9]: 
Defining the first language model
We will use the Glove.6B.50d embeddings with a vocabulary size of 10,000 words. We first load
the Glove embeddings using the load_embedding_from_disks function defined previously.
Processed 500 document files under movie-reviews-negative
Processed 500 document files under movie-reviews-positive
# gather data for the negative movie reviews
dir_name = 'movie-reviews-negative'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)
negative_documents = []
for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    negative_documents.append(words)
print('Processed {} document files under {}'.format(
    len(negative_documents), dir_name))
# gather data for the positive movie reviews
dir_name = 'movie-reviews-positive'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)
positive_documents = []
for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    positive_documents.append(words)
print('Processed {} document files under {}'.format(
    len(positive_documents), dir_name))
# constructing a list of 1000 lists with 40 words in each list
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))
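As a quick sanity check (a minimal sketch, assuming negative_documents, positive_documents, and documents have been built as above), we can verify the review length range and the shape of the truncated documents:

# sanity check on the document lists (illustrative only)
all_reviews = negative_documents + positive_documents
lengths = [len(doc) for doc in all_reviews]
print('Number of reviews:', len(all_reviews))
print('Shortest and longest review (words):', min(lengths), max(lengths))
print('Number of truncated documents:', len(documents))
print('Words per truncated document:', set(len(doc) for doc in documents))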
In [10]: 
In [11]: 
Now, we will reduce the size of the vocabulary to 10,000 words. Since the most common words are
listed first, we will select the rows between 0 and 10,000. The following code creates a limited
index for the embedding and clears the rest to free CPU RAM.
In [12]: 
Now we create a list of lists of lists with the embeddings. Every word in every review is now
represented by a vector of dimension 50 per the Glove.6B.50d embedding that we are using.
In [13]: 
We are now ready to create the training and test sets. We first make the embeddings a numpy array
to feed the RNN, and we create the labels: 0 for negatives and 1 for positives, given the order in
which we loaded the documents. Lastly, we use scikit-learn to randomly split the data into a
training set (80%) and a test set (20%).
Loading embeddings from embeddings/gloVe.6B/glove.6B.50d.txt
Embedding is of shape: (400001, 50)
embeddings_directory = 'embeddings/gloVe.6B'  # embeddings source
filename = 'glove.6B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
EVOCABSIZE = 10000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
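As an illustrative check (a minimal sketch; 'zyzzyva' is just an assumed out-of-vocabulary word), any word that is not among the 10,000 most common words maps to the final, all-zero row of the limited embedding:

# illustrative check: out-of-vocabulary words map to the appended all-zero row
oov_word = 'zyzzyva'  # assumed to be outside the top 10,000 words
oov_index = limited_word_to_index[oov_word]
print('Index assigned to the OOV word:', oov_index)  # equals EVOCABSIZE
print('Its embedding is all zeros:',
      not limited_index_to_embedding[oov_index].any())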
In [14]: 
Creating the graph
We will be using the same graph for all four language models. It contains a simple recurrent neural
network with 30 neurons. We will use the Adam optimizer with a learning rate of 0.0003 and
cross-entropy as the cost function to minimize.
In [15]: 
Executing the graph
We will train the model for 50 epochs with mini-batches of size 100. We will estimate the runtime
and collect the results for comparison with the other language models.
WARNING:tensorflow:From <ipython-input-15-81f75ef979ce>:9: BasicRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be replaced by that in Tensorflow 2.0.
embeddings_array = np.array(embeddings)
# Define the labels to be used: 500 negative (0) and 500 positive (1)
thumbs_down_up = np.concatenate((np.zeros((500), dtype=np.int32),
                                 np.ones((500), dtype=np.int32)), axis=0)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
reset_graph()
n_steps = embeddings_array.shape[1] # number of words per document
n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings
n_neurons = 30 # number of neurons
n_outputs = 2 # thumbs-down or thumbs-up
learning_rate = 0.0003
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
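For reference, since the warning above notes that BasicRNNCell is deprecated in favor of tf.keras.layers.SimpleRNNCell, a roughly equivalent model could be written with the Keras API as follows. This is only a hedged sketch using TensorFlow 2.x conventions; it is not the graph actually used in this exercise.

# hedged sketch: an approximate tf.keras equivalent of the graph above (TF 2.x style)
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(n_neurons, input_shape=(n_steps, n_inputs)),
    tf.keras.layers.Dense(n_outputs)  # logits for thumbs-down/thumbs-up
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])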
In [17]: 
In [18]: 
Defining the second language model
We will use the same Glove.6B.50d embeddings as before, but this time we increase the vocabulary
size tenfold to 100,000 words. We need to load the embeddings again since we deleted the
index_to_embedding variable to clear memory. Everything else is the same but with
EVOCABSIZE = 100,000.
In [19]: 
Train accuracy: 0.66 Test accuracy: 0.535
Total runtime in seconds: 6.46
Loading embeddings from embeddings/gloVe.6B/glove.6B.50d.txt
Embedding is of shape: (400001, 50)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
first_round_training_results = training_results
first_round_runtime = round(runtime, ndigits=3)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
In [20]: 
In [21]: 
We create new training and test sets with the adjusted embeddings using the larger vocabulary
size.
In [22]: 
Executing the graph
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
In [23]: 
In [24]: 
Defining the third language model
For the next two rounds we will use the Glove.6B.300d embeddings, which contain vectors of
dimension 300 instead of 50 for each word in the vocabulary. This is a much larger and more
comprehensive embedding; however, we will again restrict the size of the vocabulary for our
exercise to 10,000 in this round, and then 100,000 in the last round. Everything else is the same.
In [25]: 
In [26]: 
Train accuracy: 0.72 Test accuracy: 0.555
Total runtime in seconds: 6.326
Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
second_round_training_results = training_results
second_round_runtime = round(runtime, ndigits=3)
embeddings_directory = 'embeddings/gloVe.6B'  # embeddings source
filename = 'glove.6B.300d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
In [27]: 
In [28]: 
We create new training and test sets with the adjusted embeddings of dimension 300.
In [29]: 
In [30]: 
EVOCABSIZE = 10000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
reset_graph()
n_steps = embeddings_array.shape[1] # number of words per document
n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings
n_neurons = 30 # analyst specified number of neurons
n_outputs = 2 # thumbs-down or thumbs-up
learning_rate = 0.0003
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
In [31]: 
In [32]: 
Defining the fourth language model
We use a vocabulary of the 100,000 most common words and word vectors of dimension 300.
In [33]: 
In [34]: 
Train accuracy: 0.94 Test accuracy: 0.725
Total runtime in seconds: 19.574
Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
third_round_training_results = training_results
third_round_runtime = round(runtime, ndigits=3)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].
    reshape(1, embedding_dim), axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
In [35]: 
We create new training and test sets.
In [36]: 
In [37]: 
In [38]: 
Results
Train accuracy: 0.93 Test accuracy: 0.735
Total runtime in seconds: 19.637
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(
            limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
                     random_state=RANDOM_SEED)
init = tf.global_variables_initializer()
start_time = time()  # start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test
end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
# collecting the results
fourth_round_training_results = training_results
fourth_round_runtime = round(runtime, ndigits=3)
In [44]: 
As can be seen in the table above, the first two models with word vectors of size 50 barely perform
above random choice, with final test set accuracies of 0.535 and 0.555. The models are clearly
under-fitting the data, given that the accuracy on the training set is also low (0.66 and 0.72).
Notice the difference in training set accuracy with the larger vocabulary size: a larger vocabulary
does help the model better fit the data, reducing bias, yet the performance is still not
satisfactory, and the increase in test set accuracy is only 2 percentage points. The difference in
runtime between the two vocabulary sizes is negligible and probably noise.
The models using word vectors of size 300 perform significantly better. Both models reach a final
accuracy above 72% on the test set and above 93% on the training set. Again, the vocabulary size
has a negligible effect on runtime and only a minor effect on test set accuracy, with the same
2-percentage-point increase for the larger vocabulary. The runtime while training the models is,
however, significantly higher with the larger word vectors: the models with word vectors of size
300 take about three times longer to train than those with word vectors of size 50. Let's take a
look at the learning curves.
In [45]: 
Out[44]:

                      Vector Dimension  Vocabulary Size  Runtime (Seconds)  Train Set Accuracy  Test Set Accuracy
GloVe.50 VSize 10k                  50            10000              6.460                0.66              0.535
GloVe.50 VSize 100k                 50           100000              6.326                0.72              0.555
GloVe.300 VSize 10k                300            10000             19.574                0.94              0.725
GloVe.300 VSize 100k               300           100000             19.637                0.93              0.735
d = {'Vector Dimension':[50,50,300,300],
'Vocabulary Size': [10000,100000,10000,100000],
'Runtime (Seconds)' :[first_round_runtime, second_round_runtime,
third_round_runtime, fourth_round_runtime],
'Train Set Accuracy' : [first_round_training_results[-1,0],
second_round_training_results[-1,0],
third_round_training_results[-1,0],
fourth_round_training_results[-1,0]],
'Test Set Accuracy': [first_round_training_results[-1,1],
second_round_training_results[-1,1],
third_round_training_results[-1,1],
fourth_round_training_results[-1,1]]}
results_table = pd.DataFrame(index = ['GloVe.50 VSize 10k',
'GloVe.50 VSize 100k',
'GloVe.300 VSize 10k',
'GloVe.300 VSize 100k'], data=d)
results_table
data = {'GloVe.50 VSize 10k' :first_round_training_results[:,1],
'GloVe.50 VSize 100k':second_round_training_results[:,1],
'GloVe.300 VSize 10k':third_round_training_results[:,1],
'GloVe.300 VSize 100k':fourth_round_training_results[:,1]}
learning_curves = pd.DataFrame(data=data)
In [46]: 
As can be seen in the graph above, the models with word vectors of size 50 struggle to learn
effectively over time. The learning rate of the RNN was set low (0.0003) to prevent overfitting,
but for the models with low-dimensional word vectors this setting hinders the model's ability to
learn. On the other hand, this learning rate appears to be appropriate for the high-dimensional
word vectors: those models learn effectively over time and plateau at around 0.72. Notice that the
top prediction accuracy on the test set (around 75%) is achieved at some point before the end of
training, so it is possible that the models started overfitting before training ended. Tweaking
some hyper-parameters might prevent this from happening and boost performance slightly, but it is
likely that a more complex RNN is needed to improve performance more significantly. The larger
vocabulary size does appear to help the model learn more effectively over time, but its effect is
not nearly as important as the dimensionality of the word vectors.
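One natural way to build that more complex RNN is to swap the basic cell for a gated cell such as an LSTM. The following is only a hedged sketch of that change, assuming the same placeholders and hyperparameters defined earlier; this variant was not run in this exercise and its accuracy is not reported here.

# hedged sketch: replacing the basic RNN cell with an LSTM cell (not run in this exercise)
reset_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(lstm_cell, X, dtype=tf.float32)
# LSTM states are a (c, h) tuple; use the hidden state h for classification
logits = tf.layers.dense(states.h, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
training_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)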
Conclusion
Pre-trained word vectors are a great solution for language models. They make the use of RNNs for
language processing practical, and they do not appear to be tremendously costly in terms of
processing time, at least for the simple RNN used in this exercise. Larger word vectors have a
significant advantage over shorter ones, as they provide more dimensions for the model to learn
from. This type of language model could be used effectively to classify customer reviews and call
complaint logs. Even if the accuracy of the model is not extremely high, it can still provide
guidance and automate the otherwise painful and costly task of manually reviewing and classifying
thousands of reviews. Simply classifying reviews into positive and negative is a first step, but
this could be combined with models for topic extraction, so managers can better address the
underlying problems behind negative reviews or leverage the positive aspects of products and
services highlighted in the reviews. A classification model to identify critical complaints could
also be developed so that those complaints can be passed on to customer representatives to be
addressed personally, reducing potential risks for the company and elevating the service to
customers.
learning_curves.plot.line()
ax = plt.xlim(-1,50)
ax = plt.title("Learning Curves")
ax = plt.xlabel("epoch"), plt.ylabel("Accuracy Score")
ax = plt.legend(loc="best")

More Related Content

Recently uploaded

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 

Recently uploaded (20)

End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
Marius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
Expeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
Pixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
marketingartwork
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
Skeleton Technologies
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Language Modeling With RNN

  • 1. Esteban Ribero, Assignment #8 - MSDS 422 | Winter 2019 Language Modeling With an RNN Purpose and summary of results The purpose of this exercise is to test different language models for predicting movie review sentiment using recurrent neural networks (RNN). In particular, we test the effect of different pertained words vectors and vocabulary size on the model’s predictive accuracy. The target is to predict the sentiment (positive or negative) of the movie reviews. We use a 2x2 experimental design to isolate the effect of the vector dimension and the vocabulary size in training time and prediction accuracy. We use two versions of the global vectors GloVe (glove.6b.300d and glove.6b.50d) developed at Stanford using content from Wikipedia+Gigaword. Both of these neural network embeddings contain 400k words in them but they differ in the number of dimensions associated with each word. The simplest one uses a vector of size 50 to represent each word while the other one uses a vector of size 300. These numeric vectors have been pertained and carry with them the meaning of the words in natural language and so they are a great way to represent words for language models. We limit the size of the vocabulary for each of these embeddings to 10,000 or 100,000 of the most common words, and we test the effect of these combinations in runtime and prediction accuracy. We found that the vocabulary size of the embeddings has little effect on the processing time required to train the models and only a small effect on the models prediction accuracy. The larger vocabulary only increased prediction accuracy by 0.02 points. On the other hand, the larger word vector had a significant effect in prediction accuracy, moving the models from just predicting slightly above random choice (55% accuracy) to accurately predicting the sentiment of the review 73% of the time. The managerial implications of these findings are discussed at the end. Loading the required packages In [1]:  Important functions for embedding and text parsing import numpy as np import pandas as pd import matplotlib.pyplot as plt # static plotting from time import time #time counters import os # operating system functions import os.path # for manipulation of file path names import re # regular expressions from collections import defaultdict import nltk from nltk.tokenize import TreebankWordTokenizer from sklearn.model_selection import train_test_split #for random splitting of import tensorflow as tf
  • 2. To keep the code organized and provide clarity with the experimental design we will first define a set of important function that we will be calling out later. The following function resets the graph and sets the random seend to creates stable outputs across runs. In [2]:  The following utility function loads the pre-trained and downloaded embeddings to create the features that will be passed to the RNN. It follows methods described in https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer (https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer). In [3]:  The following function is used to parse a string of text by removing non-alphanumeric characters, code characters and stopwords (if we were to use them), and lowercase all words and removing unnecesary spaces. We will call this function in subsequent code when preparing the data RANDOM_SEED = 9999 def reset_graph(seed= RANDOM_SEED): tf.reset_default_graph() tf.set_random_seed(seed) np.random.seed(seed) def load_embedding_from_disks(embeddings_filename, with_indexes=True): if with_indexes: word_to_index_dict = dict() index_to_embedding_array = [] else: word_to_embedding_dict = dict() with open(embeddings_filename,'r',encoding='utf-8') as embeddings_file: for (i, line) in enumerate(embeddings_file): split = line.split(' ') word = split[0] representation = split[1:] representation = np.array( [float(val) for val in representation]) if with_indexes: word_to_index_dict[word] = i index_to_embedding_array.append(representation) else: word_to_embedding_dict[word] = representation # For unknown words, the representation is an empty vector _WORD_NOT_FOUND = [0.0] * len(representation) if with_indexes: _LAST_INDEX = i + 1 word_to_index_dict = defaultdict( lambda: _LAST_INDEX, word_to_index_dict) index_to_embedding_array = np.array( index_to_embedding_array + [_WORD_NOT_FOUND]) return word_to_index_dict, index_to_embedding_array else: word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND) return word_to_embedding_dict
  • 3. In [4]:  A utility function to get file names within a directory In [5]:  The function for reading and storing the data In [6]:  The data The data is a collection of 500 positive and 500 negative movie reviews. The length of the reviews goes from 22 words to 1052 words. We will first gather and store the 500 negative reviews with the code below. The data is stored in a list of list, where each list represents a document and a document is a list of words. REMOVE_STOPWORDS = False # we won't remove stopwords for this excercise def text_parse(string): codelist = ['r', 'n', 't'] #list of codes to be dropped # replace non-alphanumeric with space temp_string = re.sub('[^a-zA-Z]', ' ', string) # replace codes with space for i in range(len(codelist)): stopstring = ' ' + codelist[i] + ' ' temp_string = re.sub(stopstring, ' ', temp_string) # replace single-character words with space temp_string = re.sub('s.s', ' ', temp_string) # convert uppercase to lowercase temp_string = temp_string.lower() if REMOVE_STOPWORDS: # replace selected character strings/stop-words with space for i in range(len(stoplist)): stopstring = ' ' + str(stoplist[i]) + ' ' temp_string = re.sub(stopstring, ' ', temp_string) # replace multiple blank characters with one blank character temp_string = re.sub('s+', ' ', temp_string) return(temp_string) def listdir_no_hidden(path): start_list = os.listdir(path) end_list = [] for file in start_list: if (not file.startswith('.')): end_list.append(file) return(end_list) def read_data(filename): with open(filename, encoding='utf-8') as f: data = tf.compat.as_str(f.read()) data = data.lower() data = text_parse(data) data = TreebankWordTokenizer().tokenize(data) # The Penn Treebank return data
  • 4. In [7]:  We now do the same for the positive reviews In [8]:  Since the reviews vary considerably in length we will create lists of documents of max 40 words each. To do that we will take the first 20 words and the last 20 words from each reviews and get rid of everything in between. The result will be a list of 1000 lists (500 negative, 500 positive) with 40 words in each list. In [9]:  Defining the first language model We will use the Glove.6B.50d embeddings with a vocabulary size of 10,000 words. We first load the Glove embeddings using the load_embedding_from_disks function defined previously. Processed 500 document files under movie-reviews-negative Processed 500 document files under movie-reviews-positive # gather data for the negative movie reviews dir_name = 'movie-reviews-negative' filenames = listdir_no_hidden(path=dir_name) num_files = len(filenames) negative_documents = [] for i in range(num_files): words = read_data(os.path.join(dir_name, filenames[i])) negative_documents.append(words) print('Processed {} document files under {}'.format( len(negative_documents),dir_name)) # gather data for the positive movie reviews dir_name = 'movie-reviews-positive' filenames = listdir_no_hidden(path=dir_name) num_files = len(filenames) positive_documents = [] for i in range(num_files): words = read_data(os.path.join(dir_name, filenames[i])) positive_documents.append(words) print('Processed {} document files under {}'.format( len(positive_documents),dir_name)) # constructing a list of 1000 lists with 40 words in each list from itertools import chain documents = [] for doc in negative_documents: doc_begin = doc[0:20] doc_end = doc[len(doc) - 20: len(doc)] documents.append(list(chain(*[doc_begin, doc_end]))) for doc in positive_documents: doc_begin = doc[0:20] doc_end = doc[len(doc) - 20: len(doc)] documents.append(list(chain(*[doc_begin, doc_end])))
  • 5. In [10]:  In [11]:  Now, we will reduce the size of the vocabulary to 10,000 words. Since the most common words are listed first, we will select the rows between 0 and 10,000. The following code will create a limited index for the embedding and clear the rest to save CPU and RAM. In [12]:  Now we create a list of list of list with the embeddings. Every word in every review is now represented by a vector of dimension 50 per the Glove.6B.50 embedding that we are using. In [13]:  We are now ready to create the training and test set. We first make the embeddings a numpy array to feed the RNN, and we create the labels: 0 for negatives and 1 for positives given the order in which we loaded the documents. Lastly, we use Scikit-Learn to random splitting the data into training set (80%) and test set (20%) Loading embeddings from embeddings/gloVe.6Bglove.6B.50d.txt Embedding is of shape: (400001, 50) embeddings_directory = 'embeddings/gloVe.6B' #embeddings source filename = 'glove.6B.50d.txt' embeddings_filename = os.path.join(embeddings_directory, filename) print('Loading embeddings from', embeddings_filename) word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename, with_indexes=True) vocab_size, embedding_dim = index_to_embedding.shape print("Embedding is of shape: {}".format(index_to_embedding.shape)) EVOCABSIZE = 10000 #desired size of pre-defined embedding vocabulary def default_factory(): return EVOCABSIZE limited_word_to_index = defaultdict(default_factory, {k: v for k, v in word_to_index.items() if v < EVOCABSIZE}) limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:] limited_index_to_embedding = np.append(limited_index_to_embedding, index_to_embedding[index_to_embedding.shape[0] - 1, :]. reshape(1,embedding_dim),axis = 0) #unknown-word row = zeros del index_to_embedding # to clear some CPU RAM # create list of lists of lists for embeddings embeddings = [] for doc in documents: embedding = [] for word in doc: embedding.append( limited_index_to_embedding[limited_word_to_index[word]]) embeddings.append(embedding)
  • 6. In [14]:  Creating the graph We will be using the same graph for all four language models. It contains a simple Recurrent Neural Network with 30 neurons. We will use AdamOptimizer with a learning rate of 0.0003 and the cross entropy as the cost function to minimize. In [15]:  Executing the graph We will train the model in 50 epochs with mini-batches of size 100. We will estimate the runtime and collect the results for comparison with the other language models. WARNING:tensorflow:From <ipython-input-15-81f75ef979ce>:9: BasicRNNCell.__i nit__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be repl aced by that in Tensorflow 2.0. embeddings_array = np.array(embeddings) # Define the labels to be used 500 negative (0) and 500 positive (1) thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), np.ones((500), dtype = np.int32)), axis = 0) # Random splitting of the data in to training (80%) and test (20%) X_train, X_test, y_train, y_test = train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, random_state = RANDOM_SEED) reset_graph() n_steps = embeddings_array.shape[1] # number of words per document n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings n_neurons = 30 # number of neurons n_outputs = 2 # thumbs-down or thumbs-up learning_rate = 0.0003 X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) y = tf.placeholder(tf.int32, [None]) basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons) outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32) logits = tf.layers.dense(states, n_outputs) xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits) loss = tf.reduce_mean(xentropy) optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) training_op = optimizer.minimize(loss) correct = tf.nn.in_top_k(logits, y, 1) accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
Executing the graph

We will train the model for 50 epochs with mini-batches of size 100. We will estimate the runtime and collect the results for comparison with the other language models.

In [17]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.66 Test accuracy: 0.535
Total runtime in seconds: 6.46

In [18]:
# collecting the results
first_round_training_results = training_results
first_round_runtime = round(runtime, ndigits=3)

Defining the second language model

We will use the same Glove.6B.50d embeddings as before, but this time we increase the vocabulary size tenfold to 100,000 words. We need to load the embeddings again since we deleted the index_to_embedding variable to clear memory. Everything else is the same, but with EVOCABSIZE = 100,000.

In [19]:
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename,
                                                              with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))

Loading embeddings from embeddings/gloVe.6B/glove.6B.50d.txt
Embedding is of shape: (400001, 50)
In [20]:
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1, embedding_dim),
    axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM

In [21]:
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)

We create new training and test sets with the adjusted embeddings and the larger vocabulary size.

In [22]:
embeddings_array = np.array(embeddings)

# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(embeddings_array,
    thumbs_down_up, test_size=0.20, random_state=RANDOM_SEED)
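The vocabulary-limiting and document-embedding steps above repeat the first round with a different EVOCABSIZE, and the same pattern recurs in the remaining rounds. A small helper function (hypothetical; not in the original notebook) could wrap these steps, assuming load_embedding_from_disks and documents are defined as above:

def build_embeddings_array(embeddings_filename, evocabsize, documents):
    """Load GloVe vectors, keep the evocabsize most common words, and map
    every word of every document to its vector (zeros if out of vocabulary)."""
    word_to_index, index_to_embedding = load_embedding_from_disks(
        embeddings_filename, with_indexes=True)
    embedding_dim = index_to_embedding.shape[1]
    limited_word_to_index = defaultdict(lambda: evocabsize,
        {k: v for k, v in word_to_index.items() if v < evocabsize})
    limited_index_to_embedding = np.append(
        index_to_embedding[0:evocabsize, :],
        index_to_embedding[-1, :].reshape(1, embedding_dim),  # unknown-word row
        axis=0)
    del index_to_embedding  # drop the full array so it can be garbage-collected
    return np.array([[limited_index_to_embedding[limited_word_to_index[word]]
                      for word in doc] for doc in documents])

# Possible use, with the same path and vocabulary size as this round:
# embeddings_array = build_embeddings_array(embeddings_filename, 100000, documents)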
Executing the graph

In [23]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.72 Test accuracy: 0.555
Total runtime in seconds: 6.326

In [24]:
# collecting the results
second_round_training_results = training_results
second_round_runtime = round(runtime, ndigits=3)

Defining the third language model

For the next two rounds we will be using the Glove.6B.300d embeddings, which contain vectors of dimension 300 instead of 50 for each word in the vocabulary. This is a much larger and more comprehensive embedding; however, we will again restrict the vocabulary size for our exercise, to 10,000 words in this round and 100,000 in the last round. Everything else is the same.

In [25]:
embeddings_directory = 'embeddings/gloVe.6B'  # embeddings source
filename = 'glove.6B.300d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)

In [26]:
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename,
                                                              with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))

Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)
In [27]:
EVOCABSIZE = 10000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1, embedding_dim),
    axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM

In [28]:
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)

We create new training and test sets with the adjusted embeddings of dimension 300.

In [29]:
embeddings_array = np.array(embeddings)

# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(embeddings_array,
    thumbs_down_up, test_size=0.20, random_state=RANDOM_SEED)

In [30]:
reset_graph()

n_steps = embeddings_array.shape[1]   # number of words per document
n_inputs = embeddings_array.shape[2]  # dimension of pre-trained embeddings
n_neurons = 30                        # analyst-specified number of neurons
n_outputs = 2                         # thumbs-down or thumbs-up
learning_rate = 0.0003

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
In [31]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.94 Test accuracy: 0.725
Total runtime in seconds: 19.574

In [32]:
# collecting the results
third_round_training_results = training_results
third_round_runtime = round(runtime, ndigits=3)

Defining the fourth language model

This model uses a vocabulary of the 100,000 most common words and word vectors of dimension 300.

In [33]:
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = load_embedding_from_disks(embeddings_filename,
                                                              with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))

Loading embeddings from embeddings/gloVe.6B/glove.6B.300d.txt
Embedding is of shape: (400001, 300)

In [34]:
EVOCABSIZE = 100000  # desired size of pre-defined embedding vocabulary

def default_factory():
    return EVOCABSIZE

limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE, :]
limited_index_to_embedding = np.append(limited_index_to_embedding,
    index_to_embedding[index_to_embedding.shape[0] - 1, :].reshape(1, embedding_dim),
    axis=0)  # unknown-word row = zeros
del index_to_embedding  # to clear some CPU RAM
In [35]:
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)

We create new training and test sets.

In [36]:
embeddings_array = np.array(embeddings)

# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(embeddings_array,
    thumbs_down_up, test_size=0.20, random_state=RANDOM_SEED)

In [37]:
init = tf.global_variables_initializer()

start_time = time()  # start counter

n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2))  # to store results

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train.shape[0] // batch_size):
            X_batch = X_train[iteration*batch_size:(iteration+1)*batch_size, :]
            y_batch = y_train[iteration*batch_size:(iteration+1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        training_results[epoch, 0] = acc_train
        training_results[epoch, 1] = acc_test

end_time = time()  # stop counter
runtime = end_time - start_time  # calculating total runtime

print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))

Train accuracy: 0.93 Test accuracy: 0.735
Total runtime in seconds: 19.637

In [38]:
# collecting the results
fourth_round_training_results = training_results
fourth_round_runtime = round(runtime, ndigits=3)

Results
In [44]:
d = {'Vector Dimension': [50, 50, 300, 300],
     'Vocabulary Size': [10000, 100000, 10000, 100000],
     'Runtime (Seconds)': [first_round_runtime, second_round_runtime,
                           third_round_runtime, fourth_round_runtime],
     'Train Set Accuracy': [first_round_training_results[-1, 0],
                            second_round_training_results[-1, 0],
                            third_round_training_results[-1, 0],
                            fourth_round_training_results[-1, 0]],
     'Test Set Accuracy': [first_round_training_results[-1, 1],
                           second_round_training_results[-1, 1],
                           third_round_training_results[-1, 1],
                           fourth_round_training_results[-1, 1]]}

results_table = pd.DataFrame(index=['GloVe.50 VSize 10k', 'GloVe.50 VSize 100k',
                                    'GloVe.300 VSize 10k', 'GloVe.300 VSize 100k'],
                             data=d)
results_table

Out[44]:
                        Vector Dimension   Vocabulary Size   Runtime (Seconds)   Train Set Accuracy   Test Set Accuracy
GloVe.50 VSize 10k                    50             10000               6.460                 0.66               0.535
GloVe.50 VSize 100k                   50            100000               6.326                 0.72               0.555
GloVe.300 VSize 10k                  300             10000              19.574                 0.94               0.725
GloVe.300 VSize 100k                 300            100000              19.637                 0.93               0.735

As can be seen in the table above, the first two models, with word vectors of size 50, barely perform above random choice, with a final accuracy on the test set of 0.535 and 0.555. The models are clearly under-fitting the data, given that the accuracy on the training set is also low (0.66 and 0.72). Notice the difference in training-set accuracy with the larger vocabulary size: a larger vocabulary does help the model fit the data better, reducing bias, yet the performance is still not satisfactory, and the increase in test-set accuracy is only 2 percentage points. The difference in runtime between the two vocabulary sizes is negligible and probably noise.

The models using word vectors of size 300 perform significantly better. Their final accuracy is above 72% on the test set and above 93% on the training set. Again, the vocabulary size has a negligible effect on runtime and only a minor effect on test-set accuracy (a 1-percentage-point increase with the larger vocabulary). The runtime while training, however, is significantly higher with the larger word vectors: about three times higher with vectors of size 300 than with vectors of size 50.
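To make the 2x2 comparison explicit, the average effect of each factor can be computed directly from results_table. A short sketch (not part of the original notebook) that averages test-set accuracy and runtime over each factor:

# Average test-set accuracy and runtime by vector dimension and by vocabulary size
print(results_table.groupby('Vector Dimension')[['Test Set Accuracy',
                                                 'Runtime (Seconds)']].mean())
print(results_table.groupby('Vocabulary Size')[['Test Set Accuracy',
                                                'Runtime (Seconds)']].mean())
# Vector dimension: roughly 0.545 vs 0.730 test accuracy and 6.4 vs 19.6 seconds
# Vocabulary size:  roughly 0.630 vs 0.645 test accuracy, with near-identical runtimes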
Let's take a look at the learning curves.

In [45]:
data = {'GloVe.50 VSize 10k': first_round_training_results[:, 1],
        'GloVe.50 VSize 100k': second_round_training_results[:, 1],
        'GloVe.300 VSize 10k': third_round_training_results[:, 1],
        'GloVe.300 VSize 100k': fourth_round_training_results[:, 1]}
learning_curves = pd.DataFrame(data=data)

In [46]:
learning_curves.plot.line()
ax = plt.xlim(-1, 50)
ax = plt.title("Learning Curves")
ax = plt.xlabel("epoch"), plt.ylabel("Accuracy Score")
ax = plt.legend(loc="best")

As can be seen in the graph above, the models with word vectors of size 50 struggle to learn effectively over time. The learning rate of the RNN was set low (0.0003) to prevent overfitting, but for the models with low-dimensional word vectors this setting hinders their ability to learn. On the other hand, this learning rate appears to be appropriate for the high-dimensional word vectors: those models learn effectively over time and plateau at around 0.72. Notice that the top prediction accuracy on the test set (around 75%) is achieved at some point before the end of training, so it is possible that the models started overfitting before training ended. Tweaking some hyper-parameters might prevent this and boost performance slightly, but it is likely that a more complex RNN is needed to improve performance more substantially (a minimal sketch of such a variation follows the conclusion). The larger vocabulary size does appear to help the models learn more effectively over time, but its effect is not nearly as important as the dimensionality of the word vectors.

Conclusion

Pre-trained word vectors are a great solution for language models. They make the use of RNNs for language processing practical, and they do not appear to be overly costly in terms of processing time, at least for the simple RNN used in this exercise. Larger word vectors have a significant advantage over shorter ones, as they provide more dimensions for the model to learn from.

This type of language model could be used effectively to classify customer reviews and call-complaint logs. Even if the accuracy of the model is not extremely high, it can still provide guidance and automate the otherwise painful and costly task of manually reviewing and classifying thousands of reviews. Simply classifying reviews into positive and negative is a first step, but it could be combined with models for topic extraction so that managers can better address the underlying problems behind negative reviews or leverage the positive aspects of products and services highlighted in them. A classification model to identify critical complaints could also be developed, so that those complaints can be passed on to customer representatives and addressed personally, reducing potential risks for the company and elevating the service to customers.
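As noted above, a more complex recurrent architecture is a natural next step. One possibility is to swap the BasicRNNCell for an LSTM cell while leaving the rest of the graph unchanged; the sketch below is untested and was not part of this exercise:

# Sketch: the same graph as before, but with an LSTM cell instead of BasicRNNCell.
reset_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(lstm_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states.h, n_outputs)  # LSTM state is (c, h); use h

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
training_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))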