Predicting Email Duration Using SAS Text Miner
Table of Contents
1. Introduction
2. The Barbaric Beginnings
3. SAS Text Miner Introduction
4. SAS Text Miner Tools
5. Results
6. Additional Results
7. Conclusion and Future Studies
8. References
9. Appendix
Part 1: Introduction
This project started out as a text mining exploration to identify emails related to one specific type
of problem within the company's software. These emails were coming into the customer service
department of Lithium Technologies. The clients sending the emails might be describing a problem
or error they were experiencing with the product, or they might simply be asking a question. They
could also be making a request or asking for an action to be performed. My initial research goal
was to identify frequent issues so the company could decide whether to invest time in developing
more efficient methods of preventing these problems in the future.
My journey began with an initial data set of emails from the years 2008 to 2014. My initial goal
was to look for certain keywords to best predict whether an email belonged to a specific type of
problem. This problem related to a software bug that some clients were experiencing.
Discriminant analysis was used to see which keywords did the best job of segregating the emails
into their respective categories (whether or not they related to the specific software bug). Later on,
I realized this method was highly inefficient.
My next step was to use a program called SAS Text Miner for my analysis. This allowed me to
increase the significance of my research question. Instead of predicting whether an email
belonged in a specific category, I would now be able to predict an individual email's time until
resolution. In order to do so, I looked into three different types of models: linear regression, logistic
regression, and a decision tree. Some of these models required a categorical response variable,
but my data contained a continuous variable for the time it takes an email to resolve. To
accommodate this, I created a cutoff at the 75th percentile of my continuous total time variable.
Emails at or below the 75th percentile were classified as 0 (not taking a long time), while emails
in the top 25% were classified as 1 (taking a long time). This part of my project will be discussed
further in the SAS Text Miner Intro and Tools sections.
The entire data set contained about 65,000 emails in total (over the years 2008-2014). During
my analyses, I discovered that the best predictor of an email's duration was time itself; on average,
emails from 2008 took much longer to resolve than emails from 2014. Because of this, and because
I wanted my model to be as relevant and accurate as possible, I ended up using only emails from
2014 in my final analysis.
Part 2: The Barbaric Beginnings
At the start of my project, my goal was to predict whether an email related to a specific software
bug or not. To begin, I manually sorted through about 1,000 emails and determined whether or
not they related to the specific problem I was looking for. I then used a frequency word counter
to determine which keywords came up most frequently in the description portion of emails and
also in the subject header. The next step was to run a discriminant analysis on these keywords to
see how well they determined which bucket an email fell into.
There were certain high-frequency words that weren't necessarily meaningful for predicting
whether an email fell into a certain category. For example, a word like "twitter" came up very
often; however, it is not a good word for separating the categories because it is likely to show up
in all emails coming into the company's customer service department, and it is equally likely to
appear in emails within the category as in all other emails.
Figure 1: Jittered scatterplot of the Canonical1 scores of each email
Figure 1 shows a visual of the discriminant analysis. In this particular example, the blue dots
represent emails that describe the specific software bug I was interested in, while the red dots
represent everything else. An email's classification is based on which group's average Canonical1
score (X axis) its own score lies closest to; these averages are represented by the larger red and
blue circles on the plot. The Canonical1 scores are calculated as a linear combination of indicator
variables showing whether or not an email contained a specific word.
Figure 2: Discriminant analysis scoring coefficients example
Figure 2 shows a sample of the scoring coefficients for these indicator variables. Each variable
name represents an indicator for whether or not that word is included in the email. For example,
if we use the preceding coefficients to represent the entire data set, an email with the words
"community" and "issue", but without the words "respond", "forum", and "reproduce", would
have a Canonical1 score of about .624 + .363 + 0 + 0 + 0 = .987. This example email's
Canonical1 score would fall closer to the blue circle, so it would be predicted to be an email
relating to the specific type of software bug I was looking for.
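To make the arithmetic concrete, here is a minimal SAS sketch of the same calculation. The 0.624 and 0.363 coefficients come from the example above; the remaining coefficients are hypothetical placeholders, since their indicators are 0 for this email and they contribute nothing to the score.

/* worked example of the Canonical1 score for the email described above */
data _null_;
   /* word-presence indicators for the example email */
   community = 1; issue = 1; respond = 0; forum = 0; reproduce = 0;
   /* 0.624 and 0.363 are from Figure 2; 0.5, 0.4, 0.3 are placeholders */
   canonical1 = 0.624*community + 0.363*issue
              + 0.5*respond + 0.4*forum + 0.3*reproduce;
   put canonical1=;   /* prints canonical1=0.987 */
run;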
It is important to note that in Figure 1 there are some red dots that fall closer to the blue dot
average as represented by the blue circle; these are the incorrectly predicted emails. The same is
true for the blue dots that fall closer to the red circle. In the score summaries shown on Figure 1,
the percentage of emails that were misclassified is approximately 7.6%.
The method used to determine the proportion of emails relating to the specific software bug is
not very "traditional". I did not check an email's Canonical1 score and then determine which
category's average Canonical1 score was closer. Keep in mind, my initial goal was not to predict
whether individual emails related to the specific software bug I was looking for. Instead, I only
wanted to predict the overall percentage of these emails within a given data set.
In order to do this, I examined each individual email's predicted probability of relating to the
software bug. I would then take the average of these probabilities and use that value as the
overall percentage of emails relating to the specific problem. For example, suppose that we have
a data set of 4 emails. The probabilities of each email relating to the specific problem I want are:
.3, .05, .6, and .7. Traditionally, we would predict an email to relate to the software bug if its
probability is greater than .5, so in this example we would predict 2 out of the 4 emails to be
talking about the software bug. But using my strategy, the proportion of software bug emails
would be predicted as .41 (the average of the four probabilities). I found this method to be much
more accurate when predicting the overall proportions of the large data set I was working with.
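A minimal SAS sketch of the two approaches on the four-email example above (the data set and variable names are illustrative, not from the original analysis):

data probs;
   input p_bug;
   datalines;
0.3
0.05
0.6
0.7
;

proc sql;
   select mean(p_bug > 0.5) as thresholded_proportion, /* 2/4 = 0.50 */
          mean(p_bug)       as averaged_proportion     /* 1.65/4 = 0.4125 */
   from probs;
quit;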
In order to perform this discriminant analysis, I manually selected words of my choosing. I did
this by examining about 1,000 emails and determining whether or not they fell into the category I
was looking for. Once this was completed, I would use a frequency word counter to show how
often certain words would appear within the emails I was interested in. Finally, I would choose
the words I thought did a good job of segregating these emails from all others. However, I
wouldn’t always choose the best keywords. After running individual discriminant analyses I
would change the words with coefficient scores close to zero (this means they do a poor job of
segregating the two groups of emails). I would repeat this process until I obtained a small enough
misclassification rate on my sample of emails.
Part 3: SAS Text Miner Intro
This project began with the goal of predicting the overall percentage of emails describing a
specific type of problem. This information could be used by the company in order to decide
whether they should invest time in discovering a more efficient method to prevent this type of
problem in order to save time in the long run.
My overall goal was to save the company as much time as possible. The method with which I had
begun my project was not the most ideal way to approach this issue. It required me to sort
through an initial sample of emails by hand, it only allowed me to examine one specific type of
email, and it forced me to run multiple analyses in order to obtain the best set of keywords for
segregating the two groups. To further investigate how to save as much time as possible, I would
need a more efficient strategy.
Rather than limiting my research to just examining one specific type of problem, I decided to
find a way to predict how long an individual email will take to resolve. Could the length of time
an email takes to resolve be predicted by a model whose only predictor variable is the physical
text of that email? In order to build such a model I had to improve upon my previous strategy by
using SAS Text Miner to examine the unstructured text buried in the emails.
Figure 3 shows the basic/clean structure of the email data I worked with. It contains columns for
the subject line, body of the email (description), and total time till resolution (in hours).
Figure 3: Sample data set of emails
Part 4: SAS Text Miner Tools
Figure 4: Sample of SAS Text Miner Diagram
Figure 4 shows a simple example of the SAS Text Miner diagram for these text mining tools.
The diagram begins with the node on the left representing the data set. The following text
parsing node is similar to a word frequency counter; it counts the frequency of nouns, pronouns,
interjections, verbs, and so on. This information is then passed through the text filter node to
correct for misspellings and pluralization. The filter node also allows you to import a custom list
of synonyms specific to your data, which is a technique I used for this project.
After the text filter node, the data were run through three tools: text topics, text clusters, and the
text rule builder. Each of these tools provided me with predictor variables to use in my analysis. A
text topic/cluster is a collection of words that describes and characterizes a main theme or idea
within each email. For example, if I create two text topics for my data set of emails, one text topic
may describe emails where the client is talking about experiencing a certain bug in the software,
while the other may describe emails where the client isn't describing a problem but asking a
question. The text topic about the software bug may contain words such as "issue", "resolve", or
"fix". Similarly, the text topic about questions may contain words such as "ask", "wondering", or
"curious". SAS Text Miner allowed me to specify how many topics/clusters I wanted to create
from my data. It would then scan through every email in the data set and create the desired
number of text topics/clusters.
Figure 5: Text Topic output
Figure 5 above shows sample output of some of the words each text topic contained. The column
labeled # Docs represents the number of documents (emails) that fall under the corresponding
text topic. Some of the text topic words include URL links or email addresses. I refer to these as
"garbage" topics and threw them out, because they are likely to exist only within the specific data
set used to create the list of text topics; in other words, I am unable to generalize this information
to emails outside of the data set. The output from the text clusters is not shown because they
operate in a similar manner, grouping words that describe a group of emails. The main difference
between topics and clusters is that an individual email can be categorized into multiple text topics,
but only one text cluster.
The most significant SAS Text Miner tool for my analysis is called the text rule builder, which
operates slightly differently than the text topics/clusters. Text rules require a categorical response
variable, but I was dealing with a quantitative response variable (total time in hours). Therefore,
I created a binary response variable using a cutoff at the 75th percentile of total time till
resolution. If an individual email was in the upper 25% of total time, it was categorized as taking
a long time (this could be a type of email worth figuring out how to respond to more efficiently in
the future); all other emails were categorized as not taking a long time. A value of 1 was used if
the email fell in the upper 25% and a value of 0 otherwise.
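As a minimal sketch of this step (assuming a cleaned data set named totaltime with a total_time variable, as in the Appendix), the cutoff and indicator could be built as shown below; the production version in the Appendix hard-codes a separate cutoff for each half-year.

/* compute the 75th percentile of total time */
proc means data=totaltime noprint;
   var total_time;
   output out=cutoff p75=p75_time;
run;

/* flag emails above the cutoff: 1 = upper 25%, 0 = otherwise */
data flagged;
   if _n_ = 1 then set cutoff(keep=p75_time);
   set totaltime;
   ind75_time = (total_time > p75_time);
run;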
A text rule consists of 1 to 3 words. If an email contains all of the words of the rule, it will
qualify for the rule. If an email qualifies for a rule, it can either be predicted as taking a long time
(value of 1) or not (value of 0).
Figure 6: Text Rule Builder Output
Figure 6 shows output for the text rules. The rule column contains the word(s) for each
individual rule, and if an email contains these words, it qualifies for the rule. The 4th
rule shown
in Figure 6 contains the words “search” and “response”, its target value is 1, and has a true
positive/total value of 8/8. This means that out of all the emails, 8 of them contained the words
“search” and “response”. Of those, all 8 were categorized in the upper 25th
percentile of total
time till resolution, meaning that they took a long time. There were also “garbage” rules created
in this process such as the output rule number 10.
In order to determine whether an email was classified as taking a long time or not, I combined all
3 of the aforementioned tools. Because the data contain one text variable for the subject of the
email and another for the body of the email, I had to run 2 separate node trails, one for each. I
would later have to combine them into a single data set.
Combining the text topics and clusters was very straightforward because their format allowed
them to be merged easily. However, the text rules were a different story. The format they were
exported in didn’t allow merging them into the data. To work around this problem, I created
separate variables which indicated if an email qualified for a meaningful rule. This allowed me to
pick and choose which rules would get passed in. This way, “garbage” rules could be dropped
and only the rules that contained a high percentage of the target value would be included. I
classified the new variables into 8 separate categories:
- one_topPredict
- one_topSubPredict
- one_medPredict
- one_medSubPredict
- zero_topPredict
- zero_topSubPredict
- zero_medPredict
- zero_medSubPredict
The first variable on the list (one_topPredict) is a top predictor of emails that are categorized as a
1 (taking a long time). To create the one_topPredict variable, I examined all the rules built for the
description of the email that did a good job of predicting whether an email took a long time. If an
email qualified for one of these rules, it received a value of 1 for the one_topPredict variable;
otherwise it received a value of 0. The same logic applies to the other variables. When a variable
name contains "Sub", it relates to rules looking at the subject of the email. The difference between
"top" and "med" in the variable names refers to how high a percentage of emails was accurately
predicted. For example, if an email qualified under the one_topPredict variable, it would have a
97.7% chance of being categorized as a 1 (this percentage is the number of emails that qualified
for this variable and were categorized as a 1, divided by the total number of emails that qualified
for this variable). If an email qualified for one_medPredict, it would only have a 63.4% chance of
being categorized as a 1.
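As an illustration of how such a percentage can be checked, the sketch below computes the hit rate for one rule variable. It is a sketch only; it assumes a merged data set (here called scored) that contains both the rule flag and the 0/1 response.

/* proportion of qualifying emails that were actually categorized as 1 */
proc sql;
   select mean(ind75_time) as hit_rate format=percent8.1
   from scored
   where one_topPredict = 1;
quit;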
The final analytic data set contained several hundred variables. Each email was assigned an
indicator variable for whether or not it belonged to each text topic, and it was also assigned a raw
score variable for each text topic. If an email's raw score for a certain text topic was high enough,
it qualified for that topic. This resulted in 270 text topics for the body of the email and 114 text
topics for the subject of an email, each with its own indicator and raw score variables. In terms of
text clusters, each email was assigned singular value decomposition (SVD) scores; the SVD
scores of an email are used to determine the probability of the email falling into a certain cluster.
My final data set contained 50 SVD values and 190 text clusters for the body of an email along
with 30 SVD values and 11 text clusters for the subject of an email.
All of these variables were used as potential explanatory variables in a decision tree model, a
logistic regression model, and a linear regression model. For ease of interpretation from a
non-statistical perspective, I will mainly discuss the findings of the decision tree model in this
report. A decision tree is made up of a series of blocks and branches. The first block at the top of
the decision tree represents the entire data set I was using; thus, the first block had 25% of the
emails categorized as a 1 (taking a long time) and 75% categorized as a 0 (not taking a long time).
Figure 7: Beginning segment of my decision tree model
Figure 7 shows the beginning of the decision tree for this analysis (only part of the total tree).
The first block is split into two separate blocks by a condition. Each condition is selected based
on what will segregate the groups of emails the most; in other words, the condition that puts the
highest possible percentage of emails categorized as 1 in one group and the highest possible
percentage of emails categorized as 0 in the other. The condition for the very first block is
whether an email falls under the one_topPredict variable or not. If it does, it is classified into the
block on the left; otherwise it is classified into the block on the right. The blocks continue to split
on various conditions of the predictor variables until they reach their respective bottom rows.
The conditions for the decision tree branches were assigned manually; I did not have the
computer automatically assign conditions for me. I wanted to avoid assigning "garbage"
conditions to the decision tree branches, because these conditions only exist within this specific
data set and cannot be generalized to external emails. However, I did examine the recommended
conditions (generated by SAS Text Miner) that did the best job of segregating emails based on
my categorical response. This process allowed me to look at these conditions manually and use
only the ones I wanted. Once the decision tree reaches the bottom row of blocks, the emails are as
segregated as possible. In the final blocks, an email is predicted to have a certain probability of
being categorized as a 1 or 0; this probability is based on the block's percentage of emails within
each category.
Part 5: Results
The results of the decision tree can be used to predict my categorical variable of time duration.
The decision tree sorted through the hundreds of predictor variables and separated the data based
on whether they met certain conditions of these predictor variables. For example, if an email
qualified for a certain text topic, was also part of a specific cluster, and contained a positive value
for one of the custom rule variables, the decision tree would predict a specific probability of that
email falling into the upper 25% of total time duration.
Certain branches of the decision tree, referred to as "money makers", classify a high percentage
of emails as 1 (upper 25%) or 0 (lower 75%). These nodes also contain a decent number of
emails that meet the specified conditions. Figure 8 shows one of the "money maker" branches for
predicting that an email does not take a long time (value of 0):
Figure 8: Money maker for predicting emails not taking a long time
Out of the 5,518 emails (just the emails from the year 2014), 265 fell into this node. Of these 265
emails, 98.5% were categorized as 0 and 1.5% as 1. The conditions for this node can be seen on
the right-hand side. The top condition is the very first condition emails had to meet to be
classified into this node, followed by the subsequent conditions in sequential order. The variable
name of the second condition, "DescCluster_prob136", represents the probability of an email
falling into cluster 136 based on the description (body) of the email.
Figure 9: Money maker for predicting emails taking a long time
Figure 9 represents an example of a "money maker" that predicts a high percentage of emails
taking a long time. Notice that the number of emails falling under this node is lower than in the
previous example. This is to be expected, since emails taking a long time make up only 25% of
the data set as opposed to 75%.
There were several "money maker" blocks for each category of email (long versus not long
duration), which were combined into a final model for the decision tree. From these combined
results, 179 emails were classified as having a 97.7% probability of taking a long time and can be
considered the top tier of prediction. The second tier of prediction consisted of a separate 232
emails with a 62.9% probability of taking a long time. While 62.9% is not an extremely high
probability, this may be due to the nature of the data set, with emails taking a long time
representing only 25% of the data; in other words, I was able to find 232 emails that are more
than twice as likely as average to take a long time.
In regard to predicting whether an email will not take a long time, there were 464 emails
classified as having a 99.1% probability of not taking a long time. I didn't create a second tier for
predicting emails not taking a long time because they already represented 75% of the data. When
all 3 of these "prediction" groups are combined (179 + 232 + 464 = 875 of the 5,518 emails from
2014), they represent only about 15.8% of the data set. In other words, I was only able to find
meaningful predictions for 15.8% of the data I was looking at. While this is not as high as I had
hoped, it is certainly better than nothing.
Part 6: Additional Results
A total of 3 models were produced for my analyses: a decision tree, a logistic regression, and a
linear regression. Of these 3 models, only the decision tree has been discussed in detail so far,
because of its utility: it allowed me to create groups of emails with an extremely high percentage
of belonging to one category or the other, while the same can't be said for the other two models.
The logistic regression model predicted the probability of an email taking a long time
(categorized as a 1). This is similar to the purpose of my decision tree model; however, the
concept is not as easy to present to a company. It can be difficult to explain concepts such as log
odds and odds ratios to people with little statistical knowledge. It's much easier to explain that if
an email meets certain conditions, it will have a certain probability of taking a long time.
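As a small illustration of what the logistic model's output means, the sketch below converts a predicted log-odds value into a probability; the value 0.8 is made up for the example.

data _null_;
   logit = 0.8;                        /* hypothetical predicted log odds */
   p = exp(logit) / (1 + exp(logit));  /* probability of taking a long time */
   put p=;                             /* prints roughly p=0.69 */
run;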
While the decision tree is easier to explain, I was only able to significantly predict length of time
for about 16% of all the emails in the data set. The remaining emails are less straightforward in
terms of predictions of which category they belong in, but we may use a linear regression model
to get a slightly better prediction for these remaining emails.
Linear regression is an easy concept to explain to someone with no statistical knowledge. The
linear regression model used about 70 predictor variables which were selected from the same list
of several hundred variables used in the decision tree. Linear regression needs to use a
quantitative response variable. I ended up using the total time till resolution (the initial response
variable I started with).
The variable selection process for the linear regression was automated by the computer in order
to come up with the most useful model. Unlike the decision tree, I was not able to select which
predictors to leave out of the model, which means the computer could automatically select
"garbage" predictors to insert into my model. However, I could work around this by removing the
unwanted predictor variables from the initial data set before they were passed into the linear
regression node.
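A sketch of that workaround is below. The source data set name matches the merged set built in the Appendix; the dropped variable names are hypothetical examples of "garbage" predictors, not the actual ones removed.

data mydata.model_input;
   set mydata.TopicsWithRules_2014;
   drop DescTopic_raw17 SubCluster_prob4; /* hypothetical garbage predictors */
run;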
Figure 10: Mean Predicted values vs. Mean Target values on Total Time
Figure 10 shows a graph of predicted values vs. target values for average total time across the
depths of the data set. The data are measured at every depth interval of 5, as shown on the
horizontal axis; depth can be thought of as a percentage of the data. The predicted values shown
in the figure seem like a good fit for the actual target values of total time, but each point only
displays the average predicted value for a 5% interval of the data, meaning it is an average over a
group of roughly 250 emails. When the predicted values for the emails are examined individually,
they have high residuals overall. In other words, the linear regression model performed well on
averages over each 5% slice of my data, but when each email is examined individually the model
isn't very trustworthy. However, some knowledge is better than none.
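A sketch of the individual-level check described above, assuming a data set of regression-scored emails; the names here are assumptions (SAS Enterprise Miner typically prefixes predicted values with P_).

data resid_check;
   set scored_reg;                        /* assumed: regression-scored emails */
   residual = total_time - P_total_time;  /* actual minus predicted total time */
run;

proc means data=resid_check mean std min max;
   var residual;
run;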
Part 7: Conclusion and Future Studies
What are the actual uses of this information? The company has a certain number of people in the
customer service department who respond to these emails. Some of these employees have a lot of
experience, while others are more inexperienced. The most time-efficient method of responding
to emails would be to have the less experienced employees respond to emails that do not take a
long time, while the more experienced employees respond to all the other emails.
The main reasoning for the preceding idea is that the biggest time sink for the company would be
when an employee who is less experienced attempts to respond to an email which takes a long
time. Because the employee is less experienced, it will take an even longer amount of time for
the email to resolve. In order to save time, the text mining algorithm based on specific words
found in the subject line and body of the email could be run on every new email that comes into
the customer service department. If the email is predicted to not take a long time, it would be
assigned to a less experienced employee. If an email is predicted to take a long time, it would be
assigned to a more experienced employee. All of the “in-between” emails can be assigned to
whoever is free at the given time.
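A hypothetical routing rule based on this idea is sketched below. It is not part of the original analysis; the data set name, probability variable, and thresholds are all assumptions.

data routed;
   set scored_new;                      /* assumed: newly scored incoming emails */
   length route $12;
   if      p_long >= 0.90 then route = 'experienced'; /* predicted long duration  */
   else if p_long <= 0.10 then route = 'new hire';    /* predicted short duration */
   else                        route = 'first free';  /* the "in-between" emails  */
run;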
While the results from the decision tree only classify approximately 16% of the data, anything
helps when it comes to saving the company as much time as possible while responding to
customer service emails. That is why I believe it is a profitable strategy to classify emails with
the decision tree first and then run the linear regression model on the emails that are not strongly
predicted to belong in either category. Depending on the predicted value from the linear
regression model, we could assign the email to a newer employee or a more experienced
employee.
Thus, up to this point we have successfully established an efficient method for assigning emails
to employees of a company based on topics and clusters of text embedded in the email. But what
if we could identify which types of emails are taking a long time to resolve? This would allow
the company to develop more efficient methods for resolving the subject matter of these emails
in order to save time in the long run. Therefore, a next step for this project would be to examine
the emails with a high probability of being predicted to take a long time. Are there any patterns
in these emails? Are there any specific types of problems these emails are discussing?
In order to find this out, we must delve deeper into the various text topics/clusters found in the
email text. What are the words used in the text topics/clusters of the nodes where these
high-probability emails fall? What do those text topics/clusters mean in context? Do the majority
of emails in a specific node relate to a specific type of problem? To answer these questions, we
will need to identify the words associated with each text topic, cluster, and rule. The format of the
topics and rules makes this easy, but clusters are more difficult: the visualization of the decision
tree doesn't show the words associated with a cluster, only the cluster number. To work around
this, you must manually go back into the cluster node output to see the words associated with the
respective cluster number. These findings would be even more useful for a company in
discovering methods to resolve these types of emails more efficiently in the future.
Part 8: References
Sarma, Kattamuri S. Predictive Modeling with SAS Enterprise Miner: Practical Solutions for
Business Applications. Cary, NC: SAS Institute, 2007. Print.
Ville, Barry De, and Padraic Neville. Decision Trees for Analytics: Using SAS Enterprise Miner.
Cary, NC: SAS Institute, 2013. Print.
SAS Certification Prep Guide: Advanced Programming for SAS 9. Cary, NC: SAS Institute,
2011. Print.
SAS Certification Prep Guide: Base Programming for SAS 9. Cary, NC: SAS Institute, 2011.
Print.
Cohen, K. Bretonnel, and Lawrence Hunter. "Getting Started in Text Mining." PLoS
Computational Biology 4.1 (2008): n. pag. Web.
"Getting Started with SAS Enterprise Guide: Main Menu." Getting Started with SAS Enterprise
Guide: Main Menu. N.p., n.d. Web.
Part 9: Appendix
The purpose of this section is to provide reproducible steps for achieving the results described
above. I will start with the SAS code used to import and clean the data into a workable format.
To read the data in, I had to import many CSV files separately, because the large data sets caused
my computer to crash.
/* data from year 2010 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2010.csv'
dbms=csv
out=sample12010
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv2 2010.csv'
dbms=csv
out=sample22010
replace;
guessingrows=2000;
run;
/* data from year 2011 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2011.csv'
dbms=csv
out=sample12011
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv2 2011.csv'
dbms=csv
out=sample22011
replace;
guessingrows=2000;
run;
/* data from year 2012 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2012.csv'
dbms=csv
out=sample12012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv3 2012.csv'
dbms=csv
out=sample32012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv4 2012.csv'
dbms=csv
out=sample42012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv5 2012.csv'
dbms=csv
out=sample52012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv6 2012.csv'
dbms=csv
out=sample62012
replace;
guessingrows=2000;
run;
/* data from year 2013 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2013.csv'
dbms=csv
out=sample12013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv2 2013.csv'
dbms=csv
out=sample22013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv3 2013.csv'
dbms=csv
out=sample32013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv4 2013.csv'
dbms=csv
out=sample42013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv5 2013.csv'
dbms=csv
out=sample52013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv6 2013.csv'
dbms=csv
out=sample62013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv7 2013.csv'
dbms=csv
out=sample72013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv8 2013.csv'
dbms=csv
out=sample82013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv9 2013.csv'
dbms=csv
out=sample92013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv10 2013.csv'
dbms=csv
out=sample102013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv11 2013.csv'
dbms=csv
out=sample112013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv12 2013.csv'
dbms=csv
out=sample122013
replace;
guessingrows=2000;
run;
/* data from year 2014 */
proc import
datafile='E:\Lithium\Case History Status\csv2 2014.csv'
dbms=csv
out=sample22014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv3 2014.csv'
dbms=csv
out=sample32014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv4 2014.csv'
dbms=csv
out=sample42014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv5 2014.csv'
dbms=csv
out=sample52014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv7 2014.csv'
dbms=csv
out=sample72014
replace;
guessingrows=2000;
run;
/* Combining all of the data together from all years */
data alldata;
set sample12010 sample22010 sample12011 sample22011 sample12012
sample32012 sample42012 sample52012 sample62012 sample12013 sample22013
sample32013 sample42013 sample52013 sample62013 sample72013 sample82013
sample92013 sample102013 sample112013 sample122013 sample22014 sample32014
sample42014 sample52014 sample72014;
run;
proc export data=alldata
outfile='E:\Senior Project\Data\data.csv'
dbms=csv
replace;
run;
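/* An equivalent, more compact alternative (a sketch, assuming the same
   file-naming pattern as above): loop over the csv files with a small
   macro instead of writing one PROC IMPORT per file. */
%macro import_csv(file, out);
   proc import
      datafile="E:\Lithium\Case History Status\&file..csv"
      dbms=csv
      out=&out
      replace;
      guessingrows=2000;
   run;
%mend import_csv;

%import_csv(csv1 2010, sample12010)
%import_csv(csv2 2010, sample22010)
/* ...and so on for the remaining files... */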
/* Sorting the Data appropriately */
libname loc "Libraries\Documents";
proc import
datafile='E:\Senior Project\Data\data.csv'
dbms=csv
out=totaldata
replace;
guessingrows=2000;
run;
proc sort data=totaldata;
by Date_Time_Opened Subject Description Case_History_Status;
run;
proc export data=totaldata
outfile='E:\Senior Project\Data\alldatasorted.csv'
dbms=csv
replace;
run;
/* Combining the status durations of the emails'
in_progress status */
proc import
datafile='E:\Senior Project\Data\alldatasorted.csv'
dbms=csv
out=sorted
replace;
guessingrows=2000;
run;
/* assigning an ID variable to each individual email */
data sorted1;
set sorted;
by Date_Time_Opened Subject Description Case_History_Status;
dateopened=datepart(Date_Time_Opened);
retain ID 0;
if first.Description then ID=ID+1;
TotDuration+Duration;
if first.Case_History_Status then TotDuration=Duration;
if last.Case_History_Status then output;
run;
/* transposing the status durations in order to combine later */
proc transpose data=sorted1
out=sorted2;
ID Case_History_Status;
by ID dateopened Subject Description;
var TotDuration;
run;
/* combining the status durations for in_progress status */
data sorted4;
set sorted2;
In_Progress=sum(In_Progress,In_Progress__Engineering_,
In_Progress__Support_,In_Progress__Internal_,
In_Progress__TechOps_,In_Progress__Social_Dynamx_,
In_Progress__DATA_);
Delay=sum(Delay,Delayed);
run;
proc export data=sorted4
outfile='E:\Senior Project\Data\CombinedInProgress.csv'
dbms=csv
replace;
run;
/* assigning half year variables to my emails to
assist with organizing by time. This would allow me
to analyze emails within the year they were sent.
Also summing all statuses to retrieve total time variable. */
proc import
datafile='E:\Senior Project\Data\CombinedInProgress.csv'
dbms=csv
out=combined
replace;
guessingrows=35000;
run;
libname mylib "Desktop";
/* assigning half year values */
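/* Note: the numeric cutoffs below are SAS date values (days since 01JAN1960);
   for example, 18444 = '01JUL2010'd and 18628 = '01JAN2011'd, so each
   half-year boundary falls on January 1 or July 1. */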
data mylib.halfyears (encoding=asciiany);
set combined;
if dateopened<18444 then
halfyear=1;
if dateopened>=18444 and dateopened<18628 then
halfyear=2;
if dateopened>=18628 and dateopened<18809 then
halfyear=3;
if dateopened>=18809 and dateopened<18993 then
halfyear=4;
if dateopened>=18993 and dateopened<19175 then
halfyear=5;
if dateopened>=19175 and dateopened<19359 then
halfyear=6;
if dateopened>=19359 and dateopened<19540 then
halfyear=7;
if dateopened>=19540 and dateopened<19724 then
halfyear=8;
if dateopened>=19724 then
halfyear=9;
/* summing all status durations to retrieve total time */
total_time=sum(In_Progress,New,Updated_by_Customer,Waiting_for_Fix,
Work_Complete,Pending_Customer_Response,Scheduled_for_Production_Deploym,
Waiting_for_Upgrade,Awaiting_Customer_Approval,Delay,ER_Planned_for_Roadmap,
Waiting_for_Enhancement,Preparing_for_Production_Deploym,Delayed__Misc_,
Delayed__Production_Freeze_);
keep ID dateopened subject description halfyear total_time;
run;
proc export data=mylib.halfyears
outfile='E:\Senior Project\Data\total_time.csv'
dbms=csv
replace;
run;
/* this is where I created my categorical response variable.
I actually created 3 separate response variables: one for the
50th percentile, the 75th percentile and the 90th percentile.
I assigned these cutoff values within each half year in order
to adjust for the effect of time on the email's duration. In
my final analysis I ended up only using the 75th percentile. */
libname mylib "Desktop";
proc import
datafile='F:\Senior Project\Data\total_time.csv'
dbms=csv
out=totaltime
replace;
guessingrows=35000;
run;
/* Identifying my cutoff values (based on total_time within each half year) */
proc means data=totaltime mean median q3 p90 n;
var total_time;
by halfyear;
run;
/* assigning cutoff values for each of my 3 categorical responses */
data mylib.total_time_ (encoding=asciiany);
set totaltime;
ind50_time=0;
if halfyear=1 and total_time>164.7
then ind50_time=1;
if halfyear=2 and total_time>136.86
then ind50_time=1;
if halfyear=3 and total_time>99.56
then ind50_time=1;
if halfyear=4 and total_time>76.17
then ind50_time=1;
if halfyear=5 and total_time>45.4
then ind50_time=1;
if halfyear=6 and total_time>46.47
then ind50_time=1;
if halfyear=7 and total_time>46.96
then ind50_time=1;
if halfyear=8 and total_time>80.85
then ind50_time=1;
if halfyear=9 and total_time>70.18
then ind50_time=1;
ind75_time=0;
if halfyear=1 and total_time>822.3
then ind75_time=1;
if halfyear=2 and total_time>534
then ind75_time=1;
if halfyear=3 and total_time>382.5
then ind75_time=1;
if halfyear=4 and total_time>287.4
then ind75_time=1;
if halfyear=5 and total_time>160.9
then ind75_time=1;
if halfyear=6 and total_time>144.8
then ind75_time=1;
if halfyear=7 and total_time>161
then ind75_time=1;
if halfyear=8 and total_time>236.1
then ind75_time=1;
if halfyear=9 and total_time>185.6
then ind75_time=1;
ind90_time=0;
if halfyear=1 and total_time>2082.1
then ind90_time=1;
if halfyear=2 and total_time>1554
then ind90_time=1;
if halfyear=3 and total_time>1579.8
then ind90_time=1;
if halfyear=4 and total_time>977
then ind90_time=1;
if halfyear=5 and total_time>479.9
then ind90_time=1;
if halfyear=6 and total_time>402.8
then ind90_time=1;
if halfyear=7 and total_time>450.7
then ind90_time=1;
if halfyear=8 and total_time>625
then ind90_time=1;
if halfyear=9 and total_time>507.5
then ind90_time=1;
run;
/* I exported 3 total data sets. One for the entire
span from 2008 to 2014, one from 2012 to 2014, and
one of just 2014 emails. In my final analysis I ended
up just looking at the 2014 email data set */
data mylib.total_time_since2012 (encoding=asciiany);
set mylib.total_time_;
if halfyear>4;
run;
data mylib.total_time_2014 (encoding=asciiany);
set mylib.total_time_;
if halfyear=9;
run;
proc export data=mylib.halfyears
outfile='E:\Senior Project\Data\total_time.csv'
dbms=csv
replace;
run;
This concludes the data cleaning/manipulation section. The next section covers the code used
within SAS Text Miner: creating my custom synonym data set, merging my topics/clusters,
creating my rule variables, and merging them all together. The first page includes a picture of my
final SAS Text Miner diagram to provide an idea of what everything looked like. I will then
explain areas of the diagram in more detail along with the SAS code used in those areas.
Figure 11: SAS Text Miner Final Diagram
The top right of the diagram is the area where I performed my analysis. You can see a node each
for the decision tree, logistic regression, and linear regression. I had to use two data sets in this
section in order to use my 2 response variables separately; one data set contained my categorical
response and one contained my quantitative response. I compared all 3 models with a model
comparison node at the bottom of this area, which allowed me to identify the misclassification
rates of the categorical response models as well as compare ROC curves between the models.
The top left of the diagram represents the node trail used to create my custom set of synonyms
specifically for the jargon of the emails. Below is the code within the SAS code node to create
the data set of synonyms.
/* Creating my custom synonyms */
%textsyn( termds=emws2.textfilter_terms
, docds=&em_import_data
, outds=&em_import_transaction
, textvar=description
, mnpardoc=8
, mxchddoc=10
, synds=mydata.halfyearextsyns
, dict=mydata.engdict2
, maxsped=15
) ;
The middle portion of the diagram is the area where I created my topics/clusters and rules for the
body and subject line of the emails. The node trail on the left is for the body of the emails and the
node trail on the right is for the subject line of the emails. The second from the bottom SAS code
node is where I merge the text topics/clusters together. The SAS code node on the bottom of the
middle section is where I create my rule variables and merge them with my entire data set. The
code for both nodes is displayed below.
/* merging my topics/clusters */
proc sort data=emws2.texttopic_train;
by subject;
run;
proc sort data=emws2.texttopic2_train;
by subject;
run;
proc sort data=emws2.textcluster_train;
by subject;
run;
proc sort data=emws2.textcluster2_train;
by subject;
run;
libname mydata "/home/msanregret/sasuser.v94";
data mydata.bigmergedtopics;
merge emws2.texttopic_train
emws2.texttopic2_train
emws2.textcluster_train
emws2.textcluster2_train;
by subject;
run;
/* Separately creating my custom rule variables.
I will merge them all together later. */
proc sort data = EMWS2.TextRule_Train;
by subject;
run;
/* rule variables for the description of the email */
data description (keep = subject zero_topPredict zero_medPredict
one_topPredict);
set EMWS2.TextRule_Train;
zero_topPredict = 0;
zero_medPredict = 0;
one_topPredict = 0;
if w_ind75_time = 37 then
zero_topPredict = 1;
else if w_ind75_time >= 40 and w_ind75_time <= 44 then
zero_topPredict = 1;
else if w_ind75_time = 47 or w_ind75_time = 48 then
zero_medPredict = 1;
if w_ind75_time = 1 then
one_topPredict = 1;
else if w_ind75_time >= 3 and w_ind75_time <= 8 then
one_topPredict = 1;
else if w_ind75_time = 17 then
one_topPredict = 1;
else if w_ind75_time >= 12 and w_ind75_time <= 15 then
one_topPredict = 1;
else if w_ind75_time = 27 or w_ind75_time = 29 then
one_topPredict = 1;
run;
proc sort data = EMWS2.TextRule2_Train;
by subject;
run;
/* rule variables for the subject of the email */
data subject (keep = subject zero_topSubPredict zero_medSubPredict
one_topSubPredict one_medSubPredict);
set EMWS2.TextRule2_Train;
zero_topSubPredict = 0;
zero_medSubPredict = 0;
one_topSubPredict = 0;
one_medSubPredict = 0;
if w_ind75_time = 42 or w_ind75_time = 43 then
zero_topSubPredict = 1;
else if w_ind75_time = 45 or w_ind75_time = 47 then
zero_topSubPredict = 1;
else if w_ind75_time >= 48 then
zero_medSubPredict = 1;
if w_ind75_time = 1 or w_ind75_time = 3 then
one_topSubPredict = 1;
else if w_ind75_time = 4 or w_ind75_time = 6 then
one_topSubPredict = 1;
else if w_ind75_time >= 10 and w_ind75_time <= 14 then
one_topSubPredict = 1;
else if w_ind75_time >= 20 and w_ind75_time <= 24 then
one_topSubPredict = 1;
else if w_ind75_time = 18 or w_ind75_time = 25 then
one_medSubPredict = 1;
else if w_ind75_time >= 31 and w_ind75_time <= 37 then
one_medSubPredict = 1;
run;
libname mydata "/home/msanregret/sasuser.v94";
/* merging the rules together */
data mydata.rules;
merge description
subject;
by subject;
run;
/* merging the rules with my dataset */
data mydata.TopicsWithRules_2014;
merge mydata.bigmergedtopics
mydata.rules;
by subject;
run;
Network paperthesis1
Network paperthesis1Network paperthesis1
Network paperthesis1
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1
 
optimizing_site_performance
optimizing_site_performanceoptimizing_site_performance
optimizing_site_performance
 
AI and ML.pptx
AI and ML.pptxAI and ML.pptx
AI and ML.pptx
 
miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptx
 
Jt3616901697
Jt3616901697Jt3616901697
Jt3616901697
 
Types of Sentiment Analysis
Types of Sentiment AnalysisTypes of Sentiment Analysis
Types of Sentiment Analysis
 
How to Write Compelling Emails in a Channel Marketing Organization
How to Write Compelling Emails in a Channel Marketing OrganizationHow to Write Compelling Emails in a Channel Marketing Organization
How to Write Compelling Emails in a Channel Marketing Organization
 
Emailphishing(deep anti phishnet applying deep neural networks for phishing e...
Emailphishing(deep anti phishnet applying deep neural networks for phishing e...Emailphishing(deep anti phishnet applying deep neural networks for phishing e...
Emailphishing(deep anti phishnet applying deep neural networks for phishing e...
 
5 e mail marketing quick wins- cloud marketing manager
5 e mail marketing quick wins- cloud marketing manager5 e mail marketing quick wins- cloud marketing manager
5 e mail marketing quick wins- cloud marketing manager
 
S N A I L Final Presentation
S N A I L    Final  PresentationS N A I L    Final  Presentation
S N A I L Final Presentation
 
Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...
Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...
Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Top 9 tips for email marketing
Top 9 tips for email marketingTop 9 tips for email marketing
Top 9 tips for email marketing
 
Top-15-Best-Practices-for-Email-Marketers-with-AI.pdf
Top-15-Best-Practices-for-Email-Marketers-with-AI.pdfTop-15-Best-Practices-for-Email-Marketers-with-AI.pdf
Top-15-Best-Practices-for-Email-Marketers-with-AI.pdf
 
Report
ReportReport
Report
 

SAS Text Mining

using that word to predict whether an email falls into a certain category. For example, a word like “twitter” would come up very often; however, it is not a good word for separating the categories, because it is likely to show up in all emails coming into the company’s customer service department. It is equally likely to appear in emails that fall under the category and in all other emails.

Figure 1: Jittered scatterplot of the Canonical1 scores of each email

Figure 1 shows a visual of the discriminant analysis. In this particular example, the blue dots represent emails that describe the specific software bug I was interested in, while the red dots represent everything else. An email’s classification is based on how close its Canonical1 score (X axis) lies to each group’s average Canonical1 score. These averages are represented by the larger red and blue circles on the plot. The Canonical1 scores are calculated as a linear combination of indicator variables showing whether or not an email contained a specific word.
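For readers who want to see what an analysis of this kind looks like in code, a minimal sketch is shown below. The data set name, the has_* keyword indicators, and the bug_related label are hypothetical placeholders rather than the names used in the project; the intent is only to illustrate how a canonical discriminant analysis on keyword indicators could be run.

/* Hypothetical sketch: canonical discriminant analysis on 0/1 keyword indicators.  */
/* labeled_emails, the has_* flags, and bug_related (1 = bug-related, 0 = other)    */
/* are illustrative stand-ins for the project's actual variables.                   */
proc candisc data=labeled_emails out=canon_scores ncan=1;
   class bug_related;
   var has_community has_issue has_respond has_forum has_reproduce;
run;

/* canon_scores now contains Can1, each email's Canonical1 score; the two group     */
/* means of Can1 correspond to the large red and blue circles in Figure 1.          */
proc means data=canon_scores mean;
   class bug_related;
   var Can1;
run;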
Page 3

Figure 2: Discriminant analysis scoring coefficients example

Figure 2 shows a sample of the scoring coefficients for these indicator variables. Each variable name represents an indicator for whether or not that word is included in the email. For example, if we use the preceding coefficients to represent the entire data set, an email with the words “community” and “issue”, but without the words “respond”, “forum”, and “reproduce”, would have a Canonical1 score of about .624 + .363 + 0 + 0 + 0 = .987. This example email’s Canonical1 score falls closer to the blue circle, so it would be predicted to be an email relating to the specific type of software bug I was looking for. It is important to note that in Figure 1 there are some red dots that fall closer to the blue group average (the blue circle); these are the incorrectly predicted emails. The same is true for the blue dots that fall closer to the red circle. In the score summaries shown in Figure 1, the percentage of emails that were misclassified is approximately 7.6%.

The method I used to determine the proportion of emails relating to the specific software bug is not very “traditional.” I did not check the Canonical1 score of an email and then determine which category’s average Canonical1 score was closer. Keep in mind, my initial goal was not to predict whether individual emails related to the specific software bug I was looking for. Instead, I only wanted to predict the overall percentage of these emails within a given data set. To do this, I examined the probability assigned to each individual email of relating to the software bug, and then took the average of these probabilities as the overall percentage of emails relating to the specific problem. For example, suppose we have a data set of 4 emails, and the probabilities of each email relating to the specific problem are .3, .05, .6, and .7. Traditionally, we would predict an email to relate to the software bug if its probability is greater than .5, so in this example we would predict 2 out of the 4 emails to be talking about the software bug. Using my strategy instead, the proportion of software bug emails would be predicted as .41 (the average of the four probabilities). I found this method to be much more accurate when predicting the overall proportions of the large data set I was working with.

In order to perform this discriminant analysis, I manually selected words of my choosing. I did this by examining about 1,000 emails and determining whether or not they fell into the category I was looking for. Once this was completed, I used a frequency word counter to show how often certain words appeared within the emails I was interested in. Finally, I chose the words I thought did a good job of segregating these emails from all others. However, I wouldn’t always choose the best keywords. After running individual discriminant analyses I would swap out the words with coefficient scores close to zero (which means they do a poor job of segregating the two groups of emails). I repeated this process until I obtained a small enough misclassification rate on my sample of emails.
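Returning to the averaging strategy described above, the sketch below shows one way it could be carried out in SAS: score every email with posterior probabilities from a discriminant analysis, then average the bug-related probability. The data set names, the has_* indicators, and the bug_flag labels ('bug'/'other') are hypothetical placeholders, not the project's actual names.

/* Hypothetical sketch: estimate the overall proportion of bug-related emails by    */
/* averaging posterior probabilities rather than counting hard classifications.     */
proc discrim data=labeled_emails testdata=all_emails testout=scored;
   class bug_flag;                              /* 'bug' vs. 'other', illustrative  */
   var has_community has_issue has_respond has_forum has_reproduce;
run;

/* The testout data set holds one posterior-probability column per class level;     */
/* the mean of the 'bug' column is the estimated share of bug-related emails.       */
proc means data=scored mean;
   var bug;
run;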
Page 4

Part 3: SAS Text Miner Intro

This project began with the goal of predicting the overall percentage of emails describing a specific type of problem. The company could use this information to decide whether it should invest time in discovering a more efficient method of preventing this type of problem, in order to save time in the long run. My overall goal was to save the company as much time as possible.

The method with which I had begun my project was not an ideal way to approach this issue. It required me to sort through an initial sample of emails by hand, it only allowed me to examine the one specific type of email I was looking for, and it forced me to run multiple analyses in order to obtain the best set of keywords for segregating the two groups. To further investigate how to save as much time as possible, I needed to find a more efficient strategy.

Rather than limiting my research to one specific type of problem, I decided to find a way to predict how long an individual email will take to resolve. Could the length of time an email takes to resolve be predicted by a model whose only predictor variable is the text of that email? In order to build such a model, I had to improve upon my previous strategy by using SAS Text Miner to examine the unstructured text buried in the emails.

Figure 3 shows the basic, cleaned structure of the email data I worked with. It contains columns for the subject line, the body of the email (description), and the total time until resolution (in hours).

Figure 3: Sample data set of emails
Page 5

Part 4: SAS Text Miner Tools

Figure 4: Sample of SAS Text Miner Diagram

Figure 4 shows a simple example of the SAS Text Miner diagram for these text mining tools. The diagram begins with the node on the left, which represents the data set. The text parsing node that follows is similar to a word frequency counter; it counts the frequency of nouns, pronouns, interjections, verbs, and so on. This information is then passed through the text filter node to correct for misspellings and plural forms. The text filter node also allows you to import a custom list of synonyms specific to your data, a technique I used for this project.

After the text filter node, the data were run through 3 tools: text topics, text clusters, and text rule builders. Each of these tools provided me with predictor variables to use in my analysis. A text topic or cluster is a collection of words that describe and characterize a main theme or idea within each email. For example, if I create two text topics for my data set of emails, one text topic may describe emails where the client is experiencing a certain bug in the software, while the other may describe emails where the client isn't describing a problem but asking a question. The text topic about the software bug may contain words such as “issue”, “resolve”, or “fix”. Similarly, the text topic about questions may contain words such as “ask”, “wondering”, or “curious”. SAS Text Miner allowed me to specify how many topics/clusters I wanted to create with my data. It would then scan through every email in the data set and create the desired number of text topics/clusters.

Figure 5: Text Topic output
Page 6

Figure 5 above shows sample output of some of the words that each text topic contained. The column labeled # Docs represents the number of documents (emails) that fall under the corresponding text topic. Some of the text topic words include URL links or email addresses. I refer to these as “garbage” topics and threw them out, because they are likely to exist only within the specific data set used to create the list of text topics. In other words, I am unable to generalize this information to emails outside of the data set. The output from the text clusters is not shown because they operate in a similar manner, grouping words that describe a group of emails. The main difference between topics and clusters is that an individual email can be categorized into multiple text topics, but only one text cluster.

The most significant SAS Text Miner tool for my analysis is the text rule builder, which operates slightly differently than the text topics/clusters. Text rules require a categorical response variable, but I was dealing with a quantitative response variable (total time in hours). I therefore created a binary response variable by placing a cutoff at the 75th percentile of total time until resolution. If an individual email was in the upper 25th percentile of total time, it was categorized as taking a long time (this could be a type of email worth figuring out how to respond to more efficiently in the future). All other emails were categorized as not taking a long time. A value of 1 represented an email in the upper 25th percentile, and a value of 0 otherwise.

A text rule consists of 1 to 3 words. If an email contains all of the words of the rule, it qualifies for the rule, and it can then be predicted as either taking a long time (value of 1) or not (value of 0).

Figure 6: Text Rule Builder Output

Figure 6 shows output for the text rules. The rule column contains the word(s) for each individual rule; if an email contains these words, it qualifies for the rule. The 4th rule shown in Figure 6 contains the words “search” and “response”, its target value is 1, and it has a true positive/total value of 8/8. This means that out of all the emails, 8 of them contained the words “search” and “response”, and all 8 were categorized in the upper 25th percentile of total time until resolution, meaning they took a long time. There were also “garbage” rules created in this process, such as rule number 10 in the output.
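As a concrete illustration of how a rule of this kind behaves, the sketch below flags emails whose description contains both words of the “search”/“response” rule. This is only a hypothetical DATA-step re-implementation; the real rules were generated and scored by the Text Rule Builder node, and the emails2014 data set name is an illustrative placeholder.

/* Hypothetical sketch: flag emails whose description contains both words of the    */
/* "search" + "response" rule.                                                      */
data rule_check;
   set emails2014;                               /* hypothetical input data set     */
   /* find() returns the position of the substring, or 0 if absent; the 'i'         */
   /* modifier makes the search case-insensitive                                    */
   qualifies_rule4 = (find(description, 'search', 'i') > 0 and
                      find(description, 'response', 'i') > 0);
run;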
Page 7

In order to determine whether an email was classified as taking a long time or not, I combined all 3 of the aforementioned tools. Because the data contain a text-based variable for the subject of the email and another variable for the body of the email, I had to run 2 separate node trails, one for each, and later combine them into a single data set. Combining the text topics and clusters was straightforward because their format allowed them to be merged easily. The text rules, however, were a different story: the format they were exported in didn't allow merging them into the data. To work around this problem, I created separate variables indicating whether an email qualified for a meaningful rule. This allowed me to pick and choose which rules would get passed in, so “garbage” rules could be dropped and only the rules that contained a high percentage of the target value would be included. I classified the new variables into 8 separate categories:

- one_topPredict
- one_topSubPredict
- one_medPredict
- one_medSubPredict
- zero_topPredict
- zero_topSubPredict
- zero_medPredict
- zero_medSubPredict

The first variable on the list (one_topPredict) is a top predictor of emails that are categorized as a 1 (taking a long time). To create the one_topPredict variable, I examined all the rules built for the description of the email that did a good job of predicting whether an email took a long time. If an email qualified for one of these rules, it received a value of 1 under the one_topPredict variable; otherwise it received a value of 0. The same logic applies to the other variables. When a variable name contains “Sub”, the variable relates to rules looking at the subject of the email. The difference between “top” and “med” in the variable names refers to how high a percentage of emails were accurately predicted. For example, if an email qualified under the one_topPredict variable, it would have a 97.7% chance of being categorized as a 1 (this percentage is the number of emails that qualified for this variable and were categorized as a 1, divided by the total number of emails that qualified for this variable). If an email qualified for one_medPredict, it would only have a 63.4% chance of being categorized as a 1.

The final analytic data set contained several hundred variables. Each email was assigned an indicator variable for whether or not it belonged to each text topic, along with a raw score variable for each text topic. If the raw score value for an email was high enough for a certain text topic, the email qualified for that text topic. This resulted in 270 text topics for the body of the email and 114 text topics for the subject of the email, each with its own respective indicator and raw score variables. In terms of text clusters, each email was assigned singular value decomposition (SVD) scores, which are used to determine the probability of an email falling into a certain cluster. My final data set contained 50 SVD values and 190 text clusters for the body of an email, along with 30 SVD values and 11 text clusters for the subject of an email.

All of these variables were used as potential explanatory variables in a decision tree model, a logistic regression model, and a linear regression model.
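Once the final table is assembled, hit rates such as the 97.7% figure quoted for one_topPredict can be checked with a simple cross-tabulation. The sketch below assumes a merged data set called final_2014 (a hypothetical name) containing both the custom rule indicator and the ind75_time response.

/* Hypothetical sketch: among emails with one_topPredict = 1, the row percentage    */
/* for ind75_time = 1 should correspond to the ~97.7% figure reported above.        */
proc freq data=final_2014;
   tables one_topPredict * ind75_time / nocol nopercent;
run;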
Page 8

For the sake of ease of interpretation from a non-statistical perspective, I will mainly discuss the findings of the decision tree model in this report.

A decision tree is made up of a series of blocks and branches. The first block at the top of the decision tree represents the entire data set I was using. Thus, the first block had 25% of the emails categorized as a 1 (taking a long time) and 75% of the emails categorized as a 0 (not taking a long time).

Figure 7: Beginning segment of my decision tree model

Figure 7 shows the beginning of the decision tree for this analysis (only part of the total tree). The first block is split into two separate blocks by a condition. Each condition is selected based on what will segregate the groups of emails the most; in other words, what condition will put the highest possible percentage of emails categorized as 1 in one group and the highest possible percentage of emails categorized as 0 in the other group. The condition for the very first block is whether an email falls under the one_topPredict variable or not. If an email falls under the one_topPredict variable, it is classified into the block on the left; otherwise it is classified into the block on the right. The blocks continue to split on various conditions of the predictor variables until they reach their respective bottom rows.
Page 9

The conditions for the decision tree branches were assigned manually; I did not have the computer automatically assign conditions for me. I wanted to avoid assigning “garbage” conditions to the decision tree branches, because such conditions only exist within this specific data set and cannot be generalized to external emails. However, I did examine the recommended conditions (generated by SAS Text Miner) that did the best job of segregating emails based on my categorical response. This process allowed me to review these conditions manually and use only the ones I wanted.

Once the decision tree reaches the bottom row of blocks, the emails are as segregated as possible. In the final blocks, an email is predicted to have a certain probability of being categorized as a 1 or a 0. This probability is based on the block's percentage of emails within each category.
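The tree itself was built interactively in the Decision Tree node, so there is no single program that reproduces it exactly. For readers who prefer code, a roughly comparable tree could be grown with PROC HPSPLIT in SAS/STAT; the sketch below is only an approximation, and the data set name and the short predictor list are hypothetical stand-ins for the several hundred variables actually available.

/* Hypothetical sketch: a classification tree on the binary duration response.      */
/* final_2014 is illustrative; the two predictors are named in this report, but the */
/* real model considered hundreds of topic, cluster, and rule variables.            */
proc hpsplit data=final_2014 maxdepth=6;
   class ind75_time one_topPredict;
   model ind75_time = one_topPredict DescCluster_prob136;
   grow entropy;
   prune costcomplexity;
run;

Because the splits in the actual project were chosen by hand from the candidates SAS suggested, an automatically grown tree like this would not match Figure 7 exactly.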
Page 10

Part 5: Results

The results of the decision tree can be used to predict my categorical variable of time duration. The decision tree sorted through the hundreds of predictor variables and separated the data based on whether emails met certain conditions of these predictor variables. For example, if an email qualified for a certain text topic, was also part of a specific cluster, and contained a positive value for one of the custom rule variables, the decision tree would predict a specific probability of that email falling into the upper 25th percentile of total time duration.

Certain branches of the decision tree, which I refer to as “money makers”, classify a high percentage of emails as equal to 1 (upper 25th percentile) or 0 (lower 75th percentile). These nodes also contain a decent number of emails that meet the specified conditions. Figure 8 shows one of the “money maker” branches for predicting that an email does not take a long time (value of 0).

Figure 8: Money maker for predicting emails not taking a long time

Out of the 5,518 emails (just the emails from the year 2014), 265 fell into this node. Of these 265 emails, 98.5% were categorized as 0 and 1.5% were categorized as 1. The conditions for this node can be seen on the right-hand side. The top condition is the very first condition emails had to meet to be classified into this node, followed by the subsequent conditions in sequential order. The variable name of the second condition, “DescCluster_prob136”, represents the probability of an email falling into cluster 136 of the description of the email.
Page 11

Figure 9: Money maker for predicting emails taking a long time

Figure 9 shows an example of a “money maker” that predicts a high percentage of emails taking a long time. Notice that the number of emails that fall under this node is lower than in the previous example. This is to be expected, because emails taking a long time make up only 25% of the data set as opposed to 75%.

There were several “money maker” blocks for each category of email (long versus not long duration), and these were combined into a final model for the decision tree. From these combined results, 179 emails were classified as having a 97.7% probability of taking a long time; these can be considered the top tier of prediction. The second tier of prediction consisted of a separate 232 emails having a 62.9% probability of taking a long time. While 62.9% is not an extremely high probability, this may be due to the nature of the data set, in which emails taking a long time represent only 25% of the data; in other words, I was able to find 232 emails that are more than twice as likely as average to take a long time. In regards to predicting whether an email will not take a long time, 464 emails were classified as having a 99.1% probability of not taking a long time. I didn't create a second tier for predicting emails not taking a long time because they already represented 75% of the data.

When all 3 of these prediction groups are combined (179 + 232 + 464 = 875 of the 5,518 emails), they represent only about 15.8% of the data set. In other words, I was only able to find meaningful predictions for 15.8% of the data I was looking at. While this is not as high as I had hoped, it is certainly better than nothing.
Page 12

Part 6: Additional Results

A total of 3 models were produced for my analyses: a decision tree, a logistic regression, and a linear regression. Of these 3 models, only the decision tree has been discussed in this report, because of its utility: it allowed me to create groups of emails with an extremely high percentage of belonging to one category or the other, while the same can't be said for the other two models.

The logistic regression model predicted the probability of an email taking a long time (categorized with a value of 1). This is similar to the purpose of my decision tree model; however, the concept is not as easy to present to a company. It can be difficult to explain concepts such as log odds and odds ratios to people with little statistical knowledge. It's much easier to explain that if an email meets certain conditions, it will have a certain probability of taking a long time.

While the decision tree is easier to explain, I was only able to confidently predict length of time for about 16% of all the emails in the data set. The remaining emails are less straightforward to categorize, but we may use a linear regression model to get a slightly better prediction for them. Linear regression is an easy concept to explain to someone with no statistical background. The linear regression model used about 70 predictor variables, selected from the same list of several hundred variables used in the decision tree. Linear regression requires a quantitative response variable, so I used total time until resolution (the initial response variable I started with). The variable selection process for the linear regression was automated by the computer in order to come up with the most useful model. Unlike the decision tree, I was not able to select which predictors to leave out of the model, which means the computer could automatically select “garbage” predictors to insert into my model. However, I could work around this by removing the unwanted predictor variables from the initial data set before they were passed into the linear regression node.
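The regression models were fit through Enterprise Miner nodes rather than hand-written code, but equivalent base-SAS procedures exist. The sketch below shows one way the two regressions could be approximated; the data set name and the short predictor list are hypothetical placeholders, and the stepwise option in PROC GLMSELECT stands in for the node's automated variable selection.

/* Hypothetical sketch: logistic regression on the binary response                  */
proc logistic data=final_2014 descending;
   model ind75_time = one_topPredict zero_topPredict DescCluster_prob136;
run;

/* Hypothetical sketch: linear regression on total time with automated (stepwise)   */
/* variable selection                                                               */
proc glmselect data=final_2014;
   model total_time = one_topPredict zero_topPredict DescCluster_prob136
                      / selection=stepwise;
run;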
Page 13

Figure 10: Mean Predicted values vs. Mean Target values on Total Time

Figure 10 is a graph of predicted values vs. target values for average total time across the depths of the data set. The data are measured at every depth interval of 5, as shown on the horizontal axis; depth can be thought of as a percentage of the data. The predicted values shown in the figure seem like a good fit for the actual target values of total time, but the figure only displays the average predicted value for every 5% interval of the data, which means each point is the average prediction for a group of about 250 emails. When the predicted values for the emails are examined individually, they have high residual values overall. In other words, the linear regression model performed well over each 5% average within my data, but when each email is examined individually, the model isn't very trustworthy. However, some knowledge is better than none.
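A depth chart like Figure 10 can be reproduced outside Enterprise Miner by binning the scored emails into 5% groups and averaging. The sketch below assumes a hypothetical scored data set with the regression prediction stored in pred_total_time; only total_time is an actual variable name from the project.

/* Hypothetical sketch: compare mean predicted and mean actual total time within    */
/* 5% depth bins (20 groups), mirroring the horizontal axis of Figure 10.           */
proc rank data=scored_emails out=ranked groups=20 descending;
   var pred_total_time;          /* hypothetical predicted value from the model     */
   ranks depth_bin;              /* 0 = top 5% of predictions (depth 5), and so on  */
run;

proc means data=ranked mean;
   class depth_bin;
   var pred_total_time total_time;
run;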
Page 14

Part 7: Conclusion and Future Studies

What are the actual uses of this information? The company has a certain number of people in the customer service department who respond to these emails. Some of these employees have a lot of experience, while others are more inexperienced. The most time-efficient method of responding to emails would be to have the less experienced employees respond to emails that do not take a long time, while the more experienced employees respond to all the other emails. The main reasoning behind this idea is that the biggest time sink for the company occurs when a less experienced employee attempts to respond to an email that takes a long time; because the employee is less experienced, the email will take even longer to resolve.

In order to save time, the text mining algorithm, based on specific words found in the subject line and body of the email, could be run on every new email that comes into the customer service department. If the email is predicted not to take a long time, it would be assigned to a less experienced employee. If an email is predicted to take a long time, it would be assigned to a more experienced employee. All of the “in-between” emails can be assigned to whoever is free at the given time. While the results from the decision tree only classify approximately 16% of the data, anything helps when it comes to saving the company as much time as possible when responding to customer service emails. This is why I believe it is a profitable strategy to classify emails with the decision tree first and then run the linear regression model on emails that are not strongly predicted to belong to a certain category. Depending on the predicted value from the linear regression model, we could assign the email to a newer employee or a more experienced employee (a rough sketch of this triage logic is given at the end of this section).

Thus, up to this point we have established an efficient method for assigning emails to a company's employees based on topics and clusters of text embedded in the email. But what if we could identify which types of emails are taking a long time to resolve? This would allow the company to develop more efficient methods for resolving the subject matter of these emails, saving time in the long run. A next step for this project would therefore be to examine the emails with a high predicted probability of taking a long time. Are there any patterns in these emails? Are there specific types of problems these emails are discussing? To find out, we must delve into the various text topics/clusters found in the email text. What are the words used in the text topics/clusters of the nodes these high-probability emails fall into? What do those text topics/clusters mean in context? Do the majority of emails in a specific node relate to a specific type of problem? In order to do this, we will need to identify the words associated with each text topic, cluster, and rule. The format of the topics and rules makes this easy, but clusters are more difficult: the visualization of the decision tree doesn't show the words associated with a cluster, only the cluster number. To work around this, you must manually go back into the cluster node output to see the words associated with the respective cluster number. These findings would be even more useful for a company looking to discover methods for resolving these types of emails more efficiently in the future.
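To make the proposed triage concrete, the sketch below routes scored emails to a queue. The scored_emails data set and the p_long and pred_total_time variables are hypothetical names for the tree's predicted probability and the regression's predicted total time, and the probability thresholds are illustrative; only the 185.6-hour figure is taken from the project (the 75th-percentile cutoff used for the 2014 emails in the Appendix).

/* Hypothetical sketch: assign each scored email to a queue, first using the        */
/* decision tree probability, then falling back on the regression estimate.         */
data routed;
   set scored_emails;                          /* hypothetical scored data set      */
   length queue $ 12;
   if p_long >= 0.9 then queue = 'experienced';         /* strongly predicted long  */
   else if p_long <= 0.05 then queue = 'new_hire';      /* strongly predicted short */
   else if pred_total_time > 185.6 then queue = 'experienced'; /* regression-based  */
   else queue = 'new_hire';
run;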
Page 15

Part 8: References

Sarma, Kattamuri S. Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications. Cary, NC: SAS Institute, 2007. Print.

Ville, Barry de, and Padraic Neville. Decision Trees for Analytics Using SAS Enterprise Miner. Cary, NC: SAS Institute, 2013. Print.

SAS Certification Prep Guide: Advanced Programming for SAS 9. Cary, NC: SAS Institute, 2011. Print.

SAS Certification Prep Guide: Base Programming for SAS 9. Cary, NC: SAS Institute, 2011. Print.

Cohen, K. Bretonnel, and Lawrence Hunter. "Getting Started in Text Mining." PLoS Computational Biology 4.1 (2008). Web.

"Getting Started with SAS Enterprise Guide: Main Menu." N.p., n.d. Web.
Page 16

Part 9: Appendix

The purpose of this section is to provide reproducible steps for achieving the results I did. I will start with the SAS code used to import and clean the data until it was in a format I could work with. In order to read the data in, I had to import many CSV files separately, because the large data sets caused my computer to crash.

/* data from year 2010 */
proc import datafile='E:\Lithium\Case History Status\csv1 2010.csv'
   dbms=csv out=sample12010 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv2 2010.csv'
   dbms=csv out=sample22010 replace; guessingrows=2000; run;

/* data from year 2011 */
proc import datafile='E:\Lithium\Case History Status\csv1 2011.csv'
   dbms=csv out=sample12011 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv2 2011.csv'
   dbms=csv out=sample22011 replace; guessingrows=2000; run;

/* data from year 2012 */
proc import datafile='E:\Lithium\Case History Status\csv1 2012.csv'
   dbms=csv out=sample12012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv3 2012.csv'
   dbms=csv out=sample32012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv4 2012.csv'
   dbms=csv out=sample42012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv5 2012.csv'
   dbms=csv out=sample52012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv6 2012.csv'
   dbms=csv out=sample62012 replace; guessingrows=2000; run;

/* data from year 2013 */
proc import datafile='E:\Lithium\Case History Status\csv1 2013.csv'
   dbms=csv out=sample12013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv2 2013.csv'
   dbms=csv out=sample22013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv3 2013.csv'
   dbms=csv out=sample32013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv4 2013.csv'
   dbms=csv out=sample42013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv5 2013.csv'
   dbms=csv out=sample52013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv6 2013.csv'
   dbms=csv out=sample62013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv7 2013.csv'
   dbms=csv out=sample72013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv8 2013.csv'
   dbms=csv out=sample82013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv9 2013.csv'
   dbms=csv out=sample92013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv10 2013.csv'
   dbms=csv out=sample102013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv11 2013.csv'
   dbms=csv out=sample112013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv12 2013.csv'
   dbms=csv out=sample122013 replace; guessingrows=2000; run;

/* data from year 2014 */
proc import datafile='E:\Lithium\Case History Status\csv2 2014.csv'
   dbms=csv out=sample22014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv3 2014.csv'
   dbms=csv out=sample32014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv4 2014.csv'
   dbms=csv out=sample42014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv5 2014.csv'
   dbms=csv out=sample52014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv7 2014.csv'
   dbms=csv out=sample72014 replace; guessingrows=2000; run;
/* Combining all of the data together from all years */
data alldata;
   set sample12010 sample22010 sample12011 sample22011 sample12012
       sample32012 sample42012 sample52012 sample62012 sample12013
       sample22013 sample32013 sample42013 sample52013 sample62013
       sample72013 sample82013 sample92013 sample102013 sample112013
       sample122013 sample22014 sample32014 sample42014 sample52014
       sample72014;
run;

proc export data=alldata
   outfile='E:\Senior Project\Data\data.csv' dbms=csv replace;
run;

/* Sorting the Data appropriately */
libname loc "Libraries\Documents";

proc import datafile='E:\Senior Project\Data\data.csv'
   dbms=csv out=totaldata replace; guessingrows=2000; run;

proc sort data=totaldata;
   by Date_Time_Opened Subject Description Case_History_Status;
run;

proc export data=totaldata
   outfile='E:\Senior Project\Data\alldatasorted.csv' dbms=csv replace;
run;

/* Combining the status durations of the email's in_progress status */
proc import datafile='E:\Senior Project\Data\alldatasorted.csv'
   dbms=csv out=sorted replace; guessingrows=2000; run;

/* assigning an ID variable to each individual email */
data sorted1;
   set sorted;
   by Date_Time_Opened Subject Description Case_History_Status;
   dateopened=datepart(Date_Time_Opened);
   retain ID 0;
   if first.Description then ID=ID+1;
   TotDuration+Duration;
   if first.Case_History_Status then TotDuration=Duration;
   if last.Case_History_Status then output;
run;

/* transposing the status durations in order to combine later */
proc transpose data=sorted1 out=sorted2;
   ID Case_History_Status;
   by ID dateopened Subject Description;
   var TotDuration;
run;

/* combining the status durations for in_progress status */
data sorted4;
   set sorted2;
   In_Progress=sum(In_Progress,In_Progress__Engineering_,
                   In_Progress__Support_,In_Progress__Internal_,
                   In_Progress__TechOps_,In_Progress__Social_Dynamx_,
                   In_Progress__DATA_);
   Delay=sum(Delay,Delayed);
run;

proc export data=sorted4
   outfile='E:\Senior Project\Data\CombinedInProgress.csv' dbms=csv replace;
run;

/* assigning half year variables to my emails to assist with organizing by time.
   This would allow me to analyze emails within the year they were sent.
   Also summing all statuses to retrieve the total time variable. */
proc import datafile='E:\Senior Project\Data\CombinedInProgress.csv'
   dbms=csv out=combined replace; guessingrows=35000; run;

libname mylib "Desktop";

/* assigning half year values */
data mylib.halfyears (encoding=asciiany);
   set combined;
   if dateopened<18444 then halfyear=1;
   if dateopened>=18444 and dateopened<18628 then halfyear=2;
   if dateopened>=18628 and dateopened<18809 then halfyear=3;
   if dateopened>=18809 and dateopened<18993 then halfyear=4;
   if dateopened>=18993 and dateopened<19175 then halfyear=5;
   if dateopened>=19175 and dateopened<19359 then halfyear=6;
   if dateopened>=19359 and dateopened<19540 then halfyear=7;
   if dateopened>=19540 and dateopened<19724 then halfyear=8;
   if dateopened>=19724 then halfyear=9;

   /* summing all status durations to retrieve total time */
   total_time=sum(In_Progress,New,Updated_by_Customer,Waiting_for_Fix,
      Work_Complete,Pending_Customer_Response,Scheduled_for_Production_Deploym,
      Waiting_for_Upgrade,Awaiting_Customer_Approval,Delay,ER_Planned_for_Roadmap,
      Waiting_for_Enhancement,Preparing_for_Production_Deploym,Delayed__Misc_,
      Delayed__Production_Freeze_);
   keep ID dateopened subject description halfyear total_time;
run;

proc export data=mylib.halfyears
   outfile='E:\Senior Project\Data\total_time.csv' dbms=csv replace;
run;

/* this is where I created my categorical response variable. I actually created
   3 separate response variables: one for the 50th percentile, one for the 75th
   percentile, and one for the 90th percentile. I assigned these cutoff values
   within each half year in order to adjust for the effect of time on the email's
   duration. In my final analysis I ended up only using the 75th percentile. */
libname mylib "Desktop";

proc import datafile='F:\Senior Project\Data\total_time.csv'
   dbms=csv out=totaltime replace; guessingrows=35000; run;
/* Identifying my cutoff values */
proc means mean median q3 p90 n;
   by halfyear;
run;

/* assigning cutoff values for each of my 3 categorical responses */
data mylib.total_time_ (encoding=asciiany);
   set totaltime;

   ind50_time=0;
   if halfyear=1 and total_time>164.7 then ind50_time=1;
   if halfyear=2 and total_time>136.86 then ind50_time=1;
   if halfyear=3 and total_time>99.56 then ind50_time=1;
   if halfyear=4 and total_time>76.17 then ind50_time=1;
   if halfyear=5 and total_time>45.4 then ind50_time=1;
   if halfyear=6 and total_time>46.47 then ind50_time=1;
   if halfyear=7 and total_time>46.96 then ind50_time=1;
   if halfyear=8 and total_time>80.85 then ind50_time=1;
   if halfyear=9 and total_time>70.18 then ind50_time=1;

   ind75_time=0;
   if halfyear=1 and total_time>822.3 then ind75_time=1;
   if halfyear=2 and total_time>534 then ind75_time=1;
   if halfyear=3 and total_time>382.5 then ind75_time=1;
   if halfyear=4 and total_time>287.4 then ind75_time=1;
   if halfyear=5 and total_time>160.9 then ind75_time=1;
   if halfyear=6 and total_time>144.8 then ind75_time=1;
   if halfyear=7 and total_time>161 then ind75_time=1;
   if halfyear=8 and total_time>236.1 then ind75_time=1;
   if halfyear=9 and total_time>185.6 then ind75_time=1;

   ind90_time=0;
   if halfyear=1 and total_time>2082.1 then ind90_time=1;
   if halfyear=2 and total_time>1554 then ind90_time=1;
   if halfyear=3 and total_time>1579.8 then ind90_time=1;
   if halfyear=4 and total_time>977 then ind90_time=1;
   if halfyear=5 and total_time>479.9 then ind90_time=1;
   if halfyear=6 and total_time>402.8 then ind90_time=1;
   if halfyear=7 and total_time>450.7 then ind90_time=1;
   if halfyear=8 and total_time>625 then ind90_time=1;
   if halfyear=9 and total_time>507.5 then ind90_time=1;
run;
/* I exported 3 total data sets: one for the entire span from 2008 to 2014, one
   from 2012 to 2014, and one of just the 2014 emails. In my final analysis I
   ended up only looking at the 2014 email data set. */
data mylib.total_time_since2012 (encoding=asciiany);
   set mylib.total_time_;
   if halfyear>4;
run;

data mylib.total_time_2014 (encoding=asciiany);
   set mylib.total_time_;
   if halfyear=9;
run;

proc export data=mylib.halfyears
   outfile='E:\Senior Project\Data\total_time.csv' dbms=csv replace;
run;

This concludes the data cleaning/manipulation section. The next section covers the code used within SAS Text Miner: creating my custom synonym data set, merging my topics/clusters, creating my rule variables, and merging them all together. The first page includes a picture of my final SAS Text Miner diagram to provide an idea of what everything looked like; I then explain areas of the diagram in more detail, along with the SAS code used in those areas.

Figure 11: SAS Text Miner Final Diagram

The top right of the diagram is the area in which I performed my analysis. You can see a node for the decision tree, the logistic regression, and the linear regression.
I had to use two data sets in this area in order to use my 2 response variables separately: one data set contained my categorical response and the other contained my quantitative response. I compare all 3 models with a model comparison node at the bottom of this area, which allowed me to identify the misclassification rates of the categorical response models as well as to compare ROC curves between the models.

The top left of the diagram represents the node trail used to create my custom set of synonyms, specific to the jargon of the emails. Below is the code within the SAS code node that creates the data set of synonyms.

/* Creating my custom synonyms */
%textsyn( termds=emws2.textfilter_terms
        , docds=&em_import_data
        , outds=&em_import_transaction
        , textvar=description
        , mnpardoc=8
        , mxchddoc=10
        , synds=mydata.halfyearextsyns
        , dict=mydata.engdict2
        , maxsped=15
        ) ;

The middle portion of the diagram is the area where I created my topics/clusters and rules for the body and subject line of the emails. The node trail on the left is for the body of the emails and the node trail on the right is for the subject line of the emails. The second-from-the-bottom SAS code node is where I merge the text topics/clusters together. The SAS code node at the bottom of the middle section is where I create my rule variables and merge them with my entire data set. The code for both nodes is displayed below.

/* merging my topics/clusters */
proc sort data=emws2.texttopic_train; by subject; run;
proc sort data=emws2.texttopic2_train; by subject; run;
proc sort data=emws2.textcluster_train; by subject; run;
proc sort data=emws2.textcluster2_train; by subject; run;

libname mydata "/home/msanregret/sasuser.v94";

data mydata.bigmergedtopics;
   merge emws2.texttopic_train emws2.texttopic2_train
         emws2.textcluster_train emws2.textcluster2_train;
   by subject;
run;

/* Separately creating my custom rule variables. I will merge them all together later. */
proc sort data=EMWS2.TextRule_Train; by subject; run;

/* rule variables for the description of the email */
data description (keep = subject zero_topPredict zero_medPredict one_topPredict);
   set EMWS2.TextRule_Train;
   zero_topPredict = 0;
   zero_medPredict = 0;
   one_topPredict = 0;
   if w_ind75_time = 37 then zero_topPredict = 1;
   else if w_ind75_time >= 40 and w_ind75_time <= 44 then zero_topPredict = 1;
   else if w_ind75_time = 47 or w_ind75_time = 48 then zero_medPredict = 1;
   if w_ind75_time = 1 then one_topPredict = 1;
   else if w_ind75_time >= 3 and w_ind75_time <= 8 then one_topPredict = 1;
   else if w_ind75_time = 17 then one_topPredict = 1;
   else if w_ind75_time >= 12 and w_ind75_time <= 15 then one_topPredict = 1;
   else if w_ind75_time = 27 or w_ind75_time = 29 then one_topPredict = 1;
run;

proc sort data=EMWS2.TextRule2_Train; by subject; run;

/* rule variables for the subject of the email */
data subject (keep = subject zero_topSubPredict zero_medSubPredict
                     one_topSubPredict one_medSubPredict);
   set EMWS2.TextRule2_Train;
   zero_topSubPredict = 0;
   zero_medSubPredict = 0;
   one_topSubPredict = 0;
   one_medSubPredict = 0;
   if w_ind75_time = 42 or w_ind75_time = 43 then zero_topSubPredict = 1;
   else if w_ind75_time = 45 or w_ind75_time = 47 then zero_topSubPredict = 1;
   else if w_ind75_time >= 48 then zero_medSubPredict = 1;
   if w_ind75_time = 1 or w_ind75_time = 3 then one_topSubPredict = 1;
   else if w_ind75_time = 4 or w_ind75_time = 6 then one_topSubPredict = 1;
   else if w_ind75_time >= 10 and w_ind75_time <= 14 then one_topSubPredict = 1;
   else if w_ind75_time >= 20 and w_ind75_time <= 24 then one_topSubPredict = 1;
   else if w_ind75_time = 18 or w_ind75_time = 25 then one_medSubPredict = 1;
   else if w_ind75_time >= 31 and w_ind75_time <= 37 then one_medSubPredict = 1;
run;

libname mydata "/home/msanregret/sasuser.v94";

/* merging the rules together */
data mydata.rules;
   merge description subject;
   by subject;
run;

/* merging the rules with my dataset */
data mydata.TopicsWithRules_2014;
   merge mydata.bigmergedtopics mydata.rules;
   by subject;
run;