Predicting Email Duration Using SAS Text Miner
Table of Contents
1. Introduction
2. The Barbaric Beginnings
3. SAS Text Miner Introduction
4. SAS Text Miner Tools
5. Results
6. Additional Results
7. Conclusion and Future Studies
8. References
9. Appendix
Part 1: Introduction
This project started out as a text mining exploration to identify emails related to one specific type
of problem within the company's software. These emails were coming into the customer service
department of Lithium Technologies. The clients sending the emails might be describing a problem
or error they were experiencing with the product, or they might simply be asking a question. They
could also be making a request or asking for an action to be performed. My initial research goal
was to identify frequent issues so the company could decide whether to invest time in developing
more efficient methods of preventing these problems in the future.
My journey began with an initial data set of emails from the years 2008 to 2014. My initial goal
was to look for certain keywords to best predict whether an email belonged to a specific type of
problem. This problem related to a software bug that some clients were experiencing.
Discriminant analysis was used to see which keywords did the best job of segregating the emails
into their respective categories (whether or not they related to the specific software bug). Later on,
I realized this method was highly inefficient.
My next step was to use a program called SAS Text Miner for my analysis. This allowed me to
increase the significance of my research question. Instead of predicting whether an email
belonged in a specific category, I would now be able to predict an individual email's time until
resolution. In order to do so, I looked into three different types of models: linear regression, logistic
regression, and a decision tree. Some of these models required a categorical response variable,
but my data contained a continuous variable for the time it takes an email to resolve. To
accommodate this, I created a cutoff at the 75th percentile of my continuous total time variable.
Emails at or below the 75th percentile were classified as 0 (not taking a long time), while emails
in the top 25% were classified as 1 (taking a long time). This part of my project will be discussed
further in the SAS Text Miner Intro and Tools sections.
The entire data set contained about 65,000 emails in total (over the years 2008-2014). During
my analyses, I discovered that the best predictor of an email's duration was time itself; on average,
emails from 2008 took much longer to resolve than emails from 2014. Because of this, and because
I wanted my model to be as relevant and accurate as possible, I ended up using only emails from
2014 in my final analysis.
Part 2: The Barbaric Beginnings
At the start of my project, my goal was to predict whether an email related to a specific software
bug or not. To begin, I manually sorted through about 1,000 emails and determined whether or
not they related to the specific problem I was looking for. I then used a frequency word counter
to determine which keywords came up most frequently in the description portion of emails and
also in the subject header. The next step was to run a discriminant analysis on these keywords to
see how well they determined which bucket an email fell into.
There were certain high-frequency words that weren't necessarily meaningful for predicting
whether an email fell into a certain category. For example, a word like "twitter" came up very
often; however, it is not a good word for separating the categories because it is likely to show up
in all emails coming into the company's customer service department, and it is equally likely to
appear in emails within the category as in all other emails.
Figure 1: Jittered scatterplot of the Canonical1 scores of each email
Figure 1 shows a visual of the discriminant analysis. In this particular example, the blue dots
represent emails that describe the specific software bug I was interested in, while the red dots
represent everything else. An email's classification is based on which group's average Canonical1
score (X axis) its own score lies closest to; these averages are represented by the larger red and
blue circles on the plot. The Canonical1 scores are calculated as a linear combination of indicator
variables showing whether or not an email contained a specific word.
Figure 2: Discriminant analysis scoring coefficients example
Figure 2 shows a sample of the scoring coefficients for these indicator variables. Each variable
name represents an indicator for whether or not that word is included in the email. For example,
if we use the preceding coefficients to represent the entire data set, an email with the words
"community" and "issue", but without the words "respond", "forum", and "reproduce", would
have a Canonical1 score of about .624 + .363 + 0 + 0 + 0 = .987. This example email's
Canonical1 score would fall closer to the blue circle, so it would be predicted to be an email
relating to the specific type of software bug I was looking for.
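To make the arithmetic concrete, here is a minimal SAS sketch of the same calculation. The 0.624 and 0.363 coefficients come from the example above; the remaining coefficients are hypothetical placeholders, since their indicators are 0 for this email and they contribute nothing to the score.

/* worked example of the Canonical1 score for the email described above */
data _null_;
   /* word-presence indicators for the example email */
   community = 1; issue = 1; respond = 0; forum = 0; reproduce = 0;
   /* 0.624 and 0.363 are from Figure 2; 0.5, 0.4, 0.3 are placeholders */
   canonical1 = 0.624*community + 0.363*issue
              + 0.5*respond + 0.4*forum + 0.3*reproduce;
   put canonical1=;   /* prints canonical1=0.987 */
run;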
It is important to note that in Figure 1 there are some red dots that fall closer to the blue dot
average as represented by the blue circle; these are the incorrectly predicted emails. The same is
true for the blue dots that fall closer to the red circle. In the score summaries shown on Figure 1,
the percentage of emails that were misclassified is approximately 7.6%.
The method used to determine the proportion of emails relating to the specific software bug is
not very "traditional". I did not check an email's Canonical1 score and then determine which
category's average Canonical1 score was closer. Keep in mind, my initial goal was not to predict
whether individual emails related to the specific software bug I was looking for. Instead, I only
wanted to predict the overall percentage of these emails within a given data set.
In order to do this, I examined each individual email's predicted probability of relating to the
software bug. I would then take the average of these probabilities and use that value as the
overall percentage of emails relating to the specific problem. For example, suppose that we have
a data set of 4 emails. The probabilities of each email relating to the specific problem I want are:
.3, .05, .6, and .7. Traditionally, we would predict an email to relate to the software bug if its
probability is greater than .5, so in this example we would predict 2 out of the 4 emails to be
talking about the software bug. But using my strategy, the proportion of software bug emails
would be predicted as .41 (the average of the four probabilities). I found this method to be much
more accurate when predicting the overall proportions of the large data set I was working with.
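A minimal SAS sketch of the two approaches on the four-email example above (the data set and variable names are illustrative, not from the original analysis):

data probs;
   input p_bug;
   datalines;
0.3
0.05
0.6
0.7
;

proc sql;
   select mean(p_bug > 0.5) as thresholded_proportion, /* 2/4 = 0.50 */
          mean(p_bug)       as averaged_proportion     /* 1.65/4 = 0.4125 */
   from probs;
quit;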
In order to perform this discriminant analysis, I manually selected words of my choosing. I did
this by examining about 1,000 emails and determining whether or not they fell into the category I
was looking for. Once this was completed, I would use a frequency word counter to show how
often certain words would appear within the emails I was interested in. Finally, I would choose
the words I thought did a good job of segregating these emails from all others. However, I
wouldn’t always choose the best keywords. After running individual discriminant analyses I
would change the words with coefficient scores close to zero (this means they do a poor job of
segregating the two groups of emails). I would repeat this process until I obtained a small enough
misclassification rate on my sample of emails.
Part 3: SAS Text Miner Intro
This project began with the goal of predicting the overall percentage of emails describing a
specific type of problem. This information could be used by the company in order to decide
whether they should invest time in discovering a more efficient method to prevent this type of
problem in order to save time in the long run.
My overall goal was to save the company as much time as possible. The method with which I had
begun my project was not the most ideal way to approach this issue. It required me to sort
through an initial sample of emails by hand, it only allowed me to examine one specific type of
email, and it forced me to run multiple analyses in order to obtain the best set of keywords for
segregating the two groups. To further investigate how to save as much time as possible, I would
need a more efficient strategy.
Rather than limiting my research to just examining one specific type of problem, I decided to
find a way to predict how long an individual email will take to resolve. Could the length of time
an email takes to resolve be predicted by a model whose only predictor variable is the physical
text of that email? In order to build such a model I had to improve upon my previous strategy by
using SAS Text Miner to examine the unstructured text buried in the emails.
Figure 3 shows the basic/clean structure of the email data I worked with. It contains columns for
the subject line, body of the email (description), and total time till resolution (in hours).
Figure 3: Sample data set of emails
Part 4: SAS Text Miner Tools
Figure 4: Sample of SAS Text Miner Diagram
Figure 4 shows a simple example of the SAS Text Miner diagram for these text mining tools.
The diagram begins with the node on the left representing the data set. The following text
parsing node is similar to a word frequency counter; it counts the frequency of nouns, pronouns,
interjections, verbs, and so on. This information is then passed through the text filter node to
correct for misspellings and pluralization. The filter node also allows you to import a custom list
of synonyms specific to your data, which is a technique I used for this project.
After the text filter node, the data were run through three tools: text topics, text clusters, and the
text rule builder. Each of these tools provided me with predictor variables to use in my analysis. A
text topic/cluster is a collection of words that describes and characterizes a main theme or idea
within each email. For example, if I create two text topics for my data set of emails, one text topic
may describe emails where the client is talking about experiencing a certain bug in the software,
while the other may describe emails where the client isn't describing a problem but asking a
question. The text topic about the software bug may contain words such as "issue", "resolve", or
"fix". Similarly, the text topic about questions may contain words such as "ask", "wondering", or
"curious". SAS Text Miner allowed me to specify how many topics/clusters I wanted to create
from my data. It would then scan through every email in the data set and create the desired
number of text topics/clusters.
Figure 5: Text Topic output
Figure 5 above shows sample output of some of the words each text topic contained. The column
labeled # Docs represents the number of documents (emails) that fall under the corresponding
text topic. Some of the text topic words include URL links or email addresses. I refer to these as
"garbage" topics and threw them out, because they are likely to exist only within the specific data
set used to create the list of text topics; in other words, I am unable to generalize this information
to emails outside of the data set. The output from the text clusters is not shown because they
operate in a similar manner, grouping words that describe a group of emails. The main difference
between topics and clusters is that an individual email can be categorized into multiple text topics,
but only one text cluster.
The most significant SAS Text Miner tool for my analysis is called the text rule builder, which
operates slightly differently than the text topics/clusters. Text rules require a categorical response
variable, but I was dealing with a quantitative response variable (total time in hours). Therefore,
I created a binary response variable using a cutoff at the 75th percentile of total time till
resolution. If an individual email was in the upper 25% of total time, it was categorized as taking
a long time (this could be a type of email worth figuring out how to respond to more efficiently in
the future); all other emails were categorized as not taking a long time. A value of 1 was used if
the email fell in the upper 25% and a value of 0 otherwise.
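As a minimal sketch of this step (assuming a cleaned data set named totaltime with a total_time variable, as in the Appendix), the cutoff and indicator could be built as shown below; the production version in the Appendix hard-codes a separate cutoff for each half-year.

/* compute the 75th percentile of total time */
proc means data=totaltime noprint;
   var total_time;
   output out=cutoff p75=p75_time;
run;

/* flag emails above the cutoff: 1 = upper 25%, 0 = otherwise */
data flagged;
   if _n_ = 1 then set cutoff(keep=p75_time);
   set totaltime;
   ind75_time = (total_time > p75_time);
run;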
A text rule consists of 1 to 3 words. If an email contains all of the words of the rule, it will
qualify for the rule. If an email qualifies for a rule, it can either be predicted as taking a long time
(value of 1) or not (value of 0).
Figure 6: Text Rule Builder Output
Figure 6 shows output for the text rules. The rule column contains the word(s) for each
individual rule, and if an email contains these words, it qualifies for the rule. The 4th
rule shown
in Figure 6 contains the words “search” and “response”, its target value is 1, and has a true
positive/total value of 8/8. This means that out of all the emails, 8 of them contained the words
“search” and “response”. Of those, all 8 were categorized in the upper 25th
percentile of total
time till resolution, meaning that they took a long time. There were also “garbage” rules created
in this process such as the output rule number 10.
In order to determine whether an email was classified as taking a long time or not, I combined all
3 of the aforementioned tools. Because the data contain one text variable for the subject of the
email and another for the body of the email, I had to run 2 separate node trails, one for each. I
would later have to combine them into a single data set.
Combining the text topics and clusters was very straightforward because their format allowed
them to be merged easily. However, the text rules were a different story. The format they were
exported in didn’t allow merging them into the data. To work around this problem, I created
separate variables which indicated if an email qualified for a meaningful rule. This allowed me to
pick and choose which rules would get passed in. This way, “garbage” rules could be dropped
and only the rules that contained a high percentage of the target value would be included. I
classified the new variables into 8 separate categories:
- one_topPredict
- one_topSubPredict
- one_medPredict
- one_medSubPredict
- zero_topPredict
- zero_topSubPredict
- zero_medPredict
- zero_medSubPredict
The first variable on the list (one_topPredict) is a top predictor of emails that are categorized as a
1 (taking a long time). To create the one_topPredict variable, I examined all the rules built for the
description of the email that did a good job of predicting whether an email took a long time. If an
email qualified for one of these rules, it received a value of 1 for the one_topPredict variable;
otherwise it received a value of 0. The same logic applies to the other variables. When a variable
name contains "Sub", it relates to rules looking at the subject of the email. The difference between
"top" and "med" in the variable names refers to how high a percentage of emails was accurately
predicted. For example, if an email qualified under the one_topPredict variable, it would have a
97.7% chance of being categorized as a 1 (this percentage is the number of emails that qualified
for this variable and were categorized as a 1, divided by the total number of emails that qualified
for this variable). If an email qualified for one_medPredict, it would only have a 63.4% chance of
being categorized as a 1.
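As an illustration of how such a percentage can be checked, the sketch below computes the hit rate for one rule variable. It is a sketch only; it assumes a merged data set (here called scored) that contains both the rule flag and the 0/1 response.

/* proportion of qualifying emails that were actually categorized as 1 */
proc sql;
   select mean(ind75_time) as hit_rate format=percent8.1
   from scored
   where one_topPredict = 1;
quit;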
The final analytic data set contained several hundred variables. Each email was assigned an
indicator variable for whether or not it belonged to each text topic, and it was also assigned a raw
score variable for each text topic. If an email's raw score for a certain text topic was high enough,
it qualified for that topic. This resulted in 270 text topics for the body of the email and 114 text
topics for the subject of an email, each with its own indicator and raw score variables. In terms of
text clusters, each email was assigned singular value decomposition (SVD) scores; the SVD
scores of an email are used to determine the probability of the email falling into a certain cluster.
My final data set contained 50 SVD values and 190 text clusters for the body of an email along
with 30 SVD values and 11 text clusters for the subject of an email.
All of these variables were used as potential explanatory variables in a decision tree model, a
logistic regression model, and a linear regression model. For ease of interpretation from a
non-statistical perspective, I will mainly discuss the findings of the decision tree model in this
report. A decision tree is made up of a series of blocks and branches. The first block at the top of
the decision tree represents the entire data set I was using; thus, the first block had 25% of the
emails categorized as a 1 (taking a long time) and 75% categorized as a 0 (not taking a long time).
Figure 7: Beginning segment of my decision tree model
Figure 7 shows the beginning of the decision tree for this analysis (only part of the total tree).
The first block is split into two separate blocks by a condition. Each condition is selected based
on what will segregate the groups of emails the most; in other words, the condition that puts the
highest possible percentage of emails categorized as 1 in one group and the highest possible
percentage of emails categorized as 0 in the other. The condition for the very first block is
whether an email falls under the one_topPredict variable or not. If it does, it is classified into the
block on the left; otherwise it is classified into the block on the right. The blocks continue to split
on various conditions of the predictor variables until they reach their respective bottom rows.
The conditions for the decision tree branches were assigned manually; I did not have the
computer automatically assign conditions for me. I wanted to avoid assigning "garbage"
conditions to the decision tree branches, because these conditions only exist within this specific
data set and cannot be generalized to external emails. However, I did examine the recommended
conditions (generated by SAS Text Miner) that did the best job of segregating emails based on
my categorical response. This process allowed me to look at these conditions manually and use
only the ones I wanted. Once the decision tree reaches the bottom row of blocks, the emails are as
segregated as possible. In the final blocks, an email is predicted to have a certain probability of
being categorized as a 1 or 0; this probability is based on the block's percentage of emails within
each category.
Part 5: Results
The results of the decision tree can be used to predict my categorical variable of time duration.
The decision tree sorted through the hundreds of predictor variables and separated the data based
on whether they met certain conditions of these predictor variables. For example, if an email
qualified for a certain text topic, was also part of a specific cluster, and contained a positive value
for one of the custom rule variables, the decision tree would predict a specific probability of that
email falling into the upper 25% of total time duration.
Certain branches of the decision tree, referred to as "money makers", classify a high percentage
of emails as 1 (upper 25%) or 0 (lower 75%). These nodes also contain a decent number of
emails that meet the specified conditions. Figure 8 shows one of the "money maker" branches for
predicting that an email does not take a long time (value of 0):
Figure 8: Money maker for predicting emails not taking a long time
Out of the 5,518 emails (just the emails from the year 2014), 265 fell into this node. Of these 265
emails, 98.5% were categorized as 0 and 1.5% as 1. The conditions for this node can be seen on
the right-hand side. The top condition is the very first condition emails had to meet to be
classified into this node, followed by the subsequent conditions in sequential order. The variable
name of the second condition, "DescCluster_prob136", represents the probability of an email
falling into cluster 136 based on the description (body) of the email.
Figure 9: Money maker for predicting emails taking a long time
Figure 9 represents an example of a "money maker" that predicts a high percentage of emails
taking a long time. Notice that the number of emails falling under this node is lower than in the
previous example. This is to be expected, since emails taking a long time make up only 25% of
the data set as opposed to 75%.
There were several "money maker" blocks for each category of email (long versus not long
duration), which were combined into a final model for the decision tree. From these combined
results, 179 emails were classified as having a 97.7% probability of taking a long time and can be
considered the top tier of prediction. The second tier of prediction consisted of a separate 232
emails with a 62.9% probability of taking a long time. While 62.9% is not an extremely high
probability, this may be due to the nature of the data set, with emails taking a long time
representing only 25% of the data; in other words, I was able to find 232 emails that are more
than twice as likely as average to take a long time.
In regard to predicting whether an email will not take a long time, there were 464 emails
classified as having a 99.1% probability of not taking a long time. I didn't create a second tier for
predicting emails not taking a long time because they already represented 75% of the data. When
all 3 of these "prediction" groups are combined (179 + 232 + 464 = 875 of the 5,518 emails from
2014), they represent only about 15.8% of the data set. In other words, I was only able to find
meaningful predictions for 15.8% of the data I was looking at. While this is not as high as I had
hoped, it is certainly better than nothing.
Part 6: Additional Results
A total of 3 models were produced for my analyses: a decision tree, a logistic regression, and a
linear regression. Of these 3 models, only the decision tree has been discussed in detail so far,
because of its utility: it allowed me to create groups of emails with an extremely high percentage
of belonging to one category or the other, while the same can't be said for the other two models.
The logistic regression model predicted the probability of an email taking a long time
(categorized as a 1). This is similar to the purpose of my decision tree model; however, the
concept is not as easy to present to a company. It can be difficult to explain concepts such as log
odds and odds ratios to people with little statistical knowledge. It's much easier to explain that if
an email meets certain conditions, it will have a certain probability of taking a long time.
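As a small illustration of what the logistic model's output means, the sketch below converts a predicted log-odds value into a probability; the value 0.8 is made up for the example.

data _null_;
   logit = 0.8;                        /* hypothetical predicted log odds */
   p = exp(logit) / (1 + exp(logit));  /* probability of taking a long time */
   put p=;                             /* prints roughly p=0.69 */
run;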
While the decision tree is easier to explain, I was only able to significantly predict length of time
for about 16% of all the emails in the data set. The remaining emails are less straightforward in
terms of predictions of which category they belong in, but we may use a linear regression model
to get a slightly better prediction for these remaining emails.
Linear regression is an easy concept to explain to someone with no statistical knowledge. The
linear regression model used about 70 predictor variables which were selected from the same list
of several hundred variables used in the decision tree. Linear regression needs to use a
quantitative response variable. I ended up using the total time till resolution (the initial response
variable I started with).
The variable selection process for the linear regression was automated by the computer in order
to come up with the most useful model. Unlike the decision tree, I was not able to select which
predictors to leave out of the model, which means the computer could automatically select
"garbage" predictors to insert into my model. However, I could work around this by removing the
unwanted predictor variables from the initial data set before they were passed into the linear
regression node.
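A sketch of that workaround is below. The source data set name matches the merged set built in the Appendix; the dropped variable names are hypothetical examples of "garbage" predictors, not the actual ones removed.

data mydata.model_input;
   set mydata.TopicsWithRules_2014;
   drop DescTopic_raw17 SubCluster_prob4; /* hypothetical garbage predictors */
run;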
Figure 10: Mean Predicted values vs. Mean Target values on Total Time
Figure 10 shows a graph of predicted values vs. target values for average total time across the
depths of the data set. The data are measured at every depth interval of 5, as shown on the
horizontal axis; depth can be thought of as a percentage of the data. The predicted values shown
in the figure seem like a good fit for the actual target values of total time, but each point only
displays the average predicted value for a 5% interval of the data, meaning it is an average over a
group of roughly 250 emails. When the predicted values for the emails are examined individually,
they have high residuals overall. In other words, the linear regression model performed well on
averages over each 5% slice of my data, but when each email is examined individually the model
isn't very trustworthy. However, some knowledge is better than none.
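A sketch of the individual-level check described above, assuming a data set of regression-scored emails; the names here are assumptions (SAS Enterprise Miner typically prefixes predicted values with P_).

data resid_check;
   set scored_reg;                        /* assumed: regression-scored emails */
   residual = total_time - P_total_time;  /* actual minus predicted total time */
run;

proc means data=resid_check mean std min max;
   var residual;
run;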
Part 7: Conclusion and Future Studies
What are the actual uses of this information? The company has a certain number of people in the
customer service department who respond to these emails. Some of these employees have a lot of
experience, while others are more inexperienced. The most time-efficient method of responding
to emails would be to have the less experienced employees respond to emails that do not take a
long time, while the more experienced employees respond to all the other emails.
The main reasoning for the preceding idea is that the biggest time sink for the company would be
when an employee who is less experienced attempts to respond to an email which takes a long
time. Because the employee is less experienced, it will take an even longer amount of time for
the email to resolve. In order to save time, the text mining algorithm based on specific words
found in the subject line and body of the email could be run on every new email that comes into
the customer service department. If the email is predicted to not take a long time, it would be
assigned to a less experienced employee. If an email is predicted to take a long time, it would be
assigned to a more experienced employee. All of the “in-between” emails can be assigned to
whoever is free at the given time.
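A hypothetical routing rule based on this idea is sketched below. It is not part of the original analysis; the data set name, probability variable, and thresholds are all assumptions.

data routed;
   set scored_new;                      /* assumed: newly scored incoming emails */
   length route $12;
   if      p_long >= 0.90 then route = 'experienced'; /* predicted long duration  */
   else if p_long <= 0.10 then route = 'new hire';    /* predicted short duration */
   else                        route = 'first free';  /* the "in-between" emails  */
run;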
While the results from the decision tree only classify approximately 16% of the data, anything
helps when it comes to saving the company as much time as possible while responding to
customer service emails. That is why I believe it is a profitable strategy to classify emails with
the decision tree first and then run the linear regression model on the emails that are not strongly
predicted to belong in either category. Depending on the predicted value from the linear
regression model, we could assign the email to a newer employee or a more experienced
employee.
Thus, up to this point we have successfully established an efficient method for assigning emails
to employees of a company based on topics and clusters of text embedded in the email. But what
if we could identify which types of emails are taking a long time to resolve? This would allow
the company to develop more efficient methods for resolving the subject matter of these emails
in order to save time in the long run. Therefore, a next step for this project would be to examine
the emails with a high probability of being predicted to take a long time. Are there any patterns
in these emails? Are there any specific types of problems these emails are discussing?
In order to find this out, we must delve deeper into the various text topics/clusters found in the
email text. What are the words used in the text topics/clusters of the nodes where these
high-probability emails fall? What do those text topics/clusters mean in context? Do the majority
of emails in a specific node relate to a specific type of problem? To answer these questions, we
will need to identify the words associated with each text topic, cluster, and rule. The format of the
topics and rules makes this easy, but clusters are more difficult: the visualization of the decision
tree doesn't show the words associated with a cluster, only the cluster number. To work around
this, you must manually go back into the cluster node output to see the words associated with the
respective cluster number. These findings would be even more useful for a company in
discovering methods to resolve these types of emails more efficiently in the future.
Part 8: References
Sarma, Kattamuri S. Predictive Modeling with SAS Enterprise Miner: Practical Solutions for
Business Applications. Cary, NC: SAS Institute, 2007. Print.
Ville, Barry De, and Padraic Neville. Decision Trees for Analytics: Using SAS Enterprise Miner.
Cary, NC: SAS Institute, 2013. Print.
SAS Certification Prep Guide: Advanced Programming for SAS 9. Cary, NC: SAS Institute,
2011. Print.
SAS Certification Prep Guide: Base Programming for SAS 9. Cary, NC: SAS Institute, 2011.
Print.
Cohen, K. Bretonnel, and Lawrence Hunter. "Getting Started in Text Mining." PLoS
Computational Biology 4.1 (2008): n. pag. Web.
"Getting Started with SAS Enterprise Guide: Main Menu." Getting Started with SAS Enterprise
Guide: Main Menu. N.p., n.d. Web.
Part 9: Appendix
The purpose of this section is to provide reproducible steps for achieving the results described
above. I will start with the SAS code used to import and clean the data into a workable format.
To read the data in, I had to import many CSV files separately, because the large data sets caused
my computer to crash.
/* data from year 2010 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2010.csv'
dbms=csv
out=sample12010
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv2 2010.csv'
dbms=csv
out=sample22010
replace;
guessingrows=2000;
run;
/* data from year 2011 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2011.csv'
dbms=csv
out=sample12011
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv2 2011.csv'
dbms=csv
out=sample22011
replace;
guessingrows=2000;
run;
/* data from year 2012 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2012.csv'
dbms=csv
out=sample12012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv3 2012.csv'
dbms=csv
out=sample32012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv4 2012.csv'
dbms=csv
out=sample42012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv5 2012.csv'
dbms=csv
out=sample52012
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv6 2012.csv'
dbms=csv
out=sample62012
replace;
guessingrows=2000;
run;
/* data from year 2013 */
proc import
datafile='E:\Lithium\Case History Status\csv1 2013.csv'
dbms=csv
out=sample12013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv2 2013.csv'
dbms=csv
out=sample22013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv3 2013.csv'
dbms=csv
out=sample32013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv4 2013.csv'
dbms=csv
out=sample42013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv5 2013.csv'
dbms=csv
out=sample52013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv6 2013.csv'
dbms=csv
out=sample62013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv7 2013.csv'
dbms=csv
out=sample72013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv8 2013.csv'
dbms=csv
out=sample82013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv9 2013.csv'
dbms=csv
out=sample92013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv10 2013.csv'
dbms=csv
out=sample102013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv11 2013.csv'
dbms=csv
out=sample112013
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv12 2013.csv'
dbms=csv
out=sample122013
replace;
guessingrows=2000;
run;
/* data from year 2014 */
proc import
datafile='E:\Lithium\Case History Status\csv2 2014.csv'
dbms=csv
out=sample22014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv3 2014.csv'
dbms=csv
out=sample32014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv4 2014.csv'
dbms=csv
out=sample42014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv5 2014.csv'
dbms=csv
out=sample52014
replace;
guessingrows=2000;
run;
proc import
datafile='E:\Lithium\Case History Status\csv7 2014.csv'
dbms=csv
out=sample72014
replace;
guessingrows=2000;
run;
/* Combining all of the data together from all years */
data alldata;
set sample12010 sample22010 sample12011 sample22011 sample12012
sample32012 sample42012 sample52012 sample62012 sample12013 sample22013
sample32013 sample42013 sample52013 sample62013 sample72013 sample82013
sample92013 sample102013 sample112013 sample122013 sample22014 sample32014
sample42014 sample52014 sample72014;
run;
proc export data=alldata
outfile='E:\Senior Project\Data\data.csv'
dbms=csv
replace;
run;
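/* An equivalent, more compact alternative (a sketch, assuming the same
   file-naming pattern as above): loop over the csv files with a small
   macro instead of writing one PROC IMPORT per file. */
%macro import_csv(file, out);
   proc import
      datafile="E:\Lithium\Case History Status\&file..csv"
      dbms=csv
      out=&out
      replace;
      guessingrows=2000;
   run;
%mend import_csv;

%import_csv(csv1 2010, sample12010)
%import_csv(csv2 2010, sample22010)
/* ...and so on for the remaining files... */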
/* Sorting the Data appropriately */
libname loc "Libraries\Documents";
proc import
datafile='E:\Senior Project\Data\data.csv'
dbms=csv
out=totaldata
replace;
guessingrows=2000;
run;
proc sort data=totaldata;
by Date_Time_Opened Subject Description Case_History_Status;
run;
proc export data=totaldata
outfile='E:\Senior Project\Data\alldatasorted.csv'
dbms=csv
replace;
run;
/* Combining the status durations of the emails'
in_progress status */
proc import
datafile='E:\Senior Project\Data\alldatasorted.csv'
dbms=csv
out=sorted
replace;
guessingrows=2000;
run;
/* assigning an ID variable to each individual email */
data sorted1;
set sorted;
by Date_Time_Opened Subject Description Case_History_Status;
dateopened=datepart(Date_Time_Opened);
retain ID 0;
if first.Description then ID=ID+1;
TotDuration+Duration;
if first.Case_History_Status then TotDuration=Duration;
if last.Case_History_Status then output;
run;
/* transposing the status durations in order to combine later */
proc transpose data=sorted1
out=sorted2;
ID Case_History_Status;
by ID dateopened Subject Description;
var TotDuration;
run;
/* combining the status durations for in_progress status */
data sorted4;
set sorted2;
In_Progress=sum(In_Progress,In_Progress__Engineering_,
In_Progress__Support_,In_Progress__Internal_,
In_Progress__TechOps_,In_Progress__Social_Dynamx_,
In_Progress__DATA_);
Delay=sum(Delay,Delayed);
run;
proc export data=sorted4
outfile='E:\Senior Project\Data\CombinedInProgress.csv'
dbms=csv
replace;
run;
/* assigning half year variables to my emails to
assist with organizing by time. This would allow me
to analyze emails within the year they were sent.
Also summing all statuses to retrieve total time variable. */
proc import
datafile='E:\Senior Project\Data\CombinedInProgress.csv'
dbms=csv
out=combined
replace;
guessingrows=35000;
run;
libname mylib "Desktop";
/* assigning half year values */
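/* Note: the numeric cutoffs below are SAS date values (days since 01JAN1960);
   for example, 18444 = '01JUL2010'd and 18628 = '01JAN2011'd, so each
   half-year boundary falls on January 1 or July 1. */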
data mylib.halfyears (encoding=asciiany);
set combined;
if dateopened<18444 then
halfyear=1;
if dateopened>=18444 and dateopened<18628 then
halfyear=2;
if dateopened>=18628 and dateopened<18809 then
halfyear=3;
if dateopened>=18809 and dateopened<18993 then
halfyear=4;
if dateopened>=18993 and dateopened<19175 then
halfyear=5;
if dateopened>=19175 and dateopened<19359 then
halfyear=6;
if dateopened>=19359 and dateopened<19540 then
halfyear=7;
if dateopened>=19540 and dateopened<19724 then
halfyear=8;
if dateopened>=19724 then
halfyear=9;
/* summing all status durations to retrieve total time */
total_time=sum(In_Progress,New,Updated_by_Customer,Waiting_for_Fix,
Work_Complete,Pending_Customer_Response,Scheduled_for_Production_Deploym,
Waiting_for_Upgrade,Awaiting_Customer_Approval,Delay,ER_Planned_for_Roadmap,
Waiting_for_Enhancement,Preparing_for_Production_Deploym,Delayed__Misc_,
Delayed__Production_Freeze_);
keep ID dateopened subject description halfyear total_time;
run;
proc export data=mylib.halfyears
outfile='E:\Senior Project\Data\total_time.csv'
dbms=csv
replace;
run;
/* this is where I created my categorical response variable.
I actually created 3 separate response variables: one for the
50th percentile, the 75th percentile and the 90th percentile.
I assigned these cutoff values within each half year in order
to adjust for the effect of time on the email's duration. In
my final analysis I ended up only using the 75th percentile. */
libname mylib "Desktop";
proc import
datafile='F:\Senior Project\Data\total_time.csv'
dbms=csv
out=totaltime
replace;
guessingrows=35000;
run;
/* Identifying my cutoff values (based on total_time within each half year) */
proc means data=totaltime mean median q3 p90 n;
var total_time;
by halfyear;
run;
/* assigning cutoff values for each of my 3 categorical responses */
data mylib.total_time_ (encoding=asciiany);
set totaltime;
ind50_time=0;
if halfyear=1 and total_time>164.7
then ind50_time=1;
if halfyear=2 and total_time>136.86
then ind50_time=1;
if halfyear=3 and total_time>99.56
then ind50_time=1;
if halfyear=4 and total_time>76.17
then ind50_time=1;
if halfyear=5 and total_time>45.4
then ind50_time=1;
if halfyear=6 and total_time>46.47
then ind50_time=1;
if halfyear=7 and total_time>46.96
then ind50_time=1;
if halfyear=8 and total_time>80.85
then ind50_time=1;
if halfyear=9 and total_time>70.18
then ind50_time=1;
ind75_time=0;
if halfyear=1 and total_time>822.3
then ind75_time=1;
if halfyear=2 and total_time>534
then ind75_time=1;
if halfyear=3 and total_time>382.5
then ind75_time=1;
if halfyear=4 and total_time>287.4
then ind75_time=1;
if halfyear=5 and total_time>160.9
then ind75_time=1;
if halfyear=6 and total_time>144.8
then ind75_time=1;
if halfyear=7 and total_time>161
then ind75_time=1;
if halfyear=8 and total_time>236.1
then ind75_time=1;
if halfyear=9 and total_time>185.6
then ind75_time=1;
ind90_time=0;
if halfyear=1 and total_time>2082.1
then ind90_time=1;
if halfyear=2 and total_time>1554
then ind90_time=1;
if halfyear=3 and total_time>1579.8
then ind90_time=1;
if halfyear=4 and total_time>977
then ind90_time=1;
if halfyear=5 and total_time>479.9
then ind90_time=1;
if halfyear=6 and total_time>402.8
then ind90_time=1;
if halfyear=7 and total_time>450.7
then ind90_time=1;
if halfyear=8 and total_time>625
then ind90_time=1;
if halfyear=9 and total_time>507.5
then ind90_time=1;
run;
/* I exported 3 total data sets. One for the entire
span from 2008 to 2014, one from 2012 to 2014, and
one of just 2014 emails. In my final analysis I ended
up just looking at the 2014 email data set */
data mylib.total_time_since2012 (encoding=asciiany);
set mylib.total_time_;
if halfyear>4;
run;
data mylib.total_time_2014 (encoding=asciiany);
set mylib.total_time_;
if halfyear=9;
run;
proc export data=mylib.halfyears
outfile='E:\Senior Project\Data\total_time.csv'
dbms=csv
replace;
run;
This concludes the data cleaning/manipulation section. The next section covers the code used
within SAS Text Miner: creating my custom synonym data set, merging my topics/clusters,
creating my rule variables, and merging them all together. The first page includes a picture of my
final SAS Text Miner diagram to provide an idea of what everything looked like. I will then
explain areas of the diagram in more detail along with the SAS code used in those areas.
Figure 11: SAS Text Miner Final Diagram
The top right of the diagram is the area where I performed my analysis. You can see a node each
for the decision tree, logistic regression, and linear regression. I had to use two data sets in this
section in order to use my 2 response variables separately; one data set contained my categorical
response and one contained my quantitative response. I compared all 3 models with a model
comparison node at the bottom of this area, which allowed me to identify the misclassification
rates of the categorical response models as well as compare ROC curves between the models.
The top left of the diagram represents the node trail used to create my custom set of synonyms
specifically for the jargon of the emails. Below is the code within the SAS code node to create
the data set of synonyms.
/* Creating my custom synonyms */
%textsyn( termds=emws2.textfilter_terms
, docds=&em_import_data
, outds=&em_import_transaction
, textvar=description
, mnpardoc=8
, mxchddoc=10
, synds=mydata.halfyearextsyns
, dict=mydata.engdict2
, maxsped=15
) ;
The middle portion of the diagram is the area where I created my topics/clusters and rules for the
body and subject line of the emails. The node trail on the left is for the body of the emails and the
node trail on the right is for the subject line of the emails. The second from the bottom SAS code
node is where I merge the text topics/clusters together. The SAS code node on the bottom of the
middle section is where I create my rule variables and merge them with my entire data set. The
code for both nodes is displayed below.
/* merging my topics/clusters */
proc sort data=emws2.texttopic_train;
by subject;
run;
proc sort data=emws2.texttopic2_train;
by subject;
run;
proc sort data=emws2.textcluster_train;
by subject;
run;
proc sort data=emws2.textcluster2_train;
by subject;
run;
libname mydata "/home/msanregret/sasuser.v94";
data mydata.bigmergedtopics;
merge emws2.texttopic_train
emws2.texttopic2_train
emws2.textcluster_train
emws2.textcluster2_train;
by subject;
run;
/* Separately creating my custom rule variables.
I will merge them all together later. */
proc sort data = EMWS2.TextRule_Train;
by subject;
run;
/* rule variables for the description of the email */
data description (keep = subject zero_topPredict zero_medPredict
one_topPredict);
set EMWS2.TextRule_Train;
zero_topPredict = 0;
zero_medPredict = 0;
one_topPredict = 0;
if w_ind75_time = 37 then
zero_topPredict = 1;
else if w_ind75_time >= 40 and w_ind75_time <= 44 then
zero_topPredict = 1;
else if w_ind75_time = 47 or w_ind75_time = 48 then
zero_medPredict = 1;
if w_ind75_time = 1 then
one_topPredict = 1;
else if w_ind75_time >= 3 and w_ind75_time <= 8 then
one_topPredict = 1;
else if w_ind75_time = 17 then
one_topPredict = 1;
else if w_ind75_time >= 12 and w_ind75_time <= 15 then
one_topPredict = 1;
else if w_ind75_time = 27 or w_ind75_time = 29 then
one_topPredict = 1;
run;
proc sort data = EMWS2.TextRule2_Train;
by subject;
run;
/* rule variables for the subject of the email */
data subject (keep = subject zero_topSubPredict zero_medSubPredict
one_topSubPredict one_medSubPredict);
set EMWS2.TextRule2_Train;
zero_topSubPredict = 0;
zero_medSubPredict = 0;
one_topSubPredict = 0;
one_medSubPredict = 0;
if w_ind75_time = 42 or w_ind75_time = 43 then
zero_topSubPredict = 1;
else if w_ind75_time = 45 or w_ind75_time = 47 then
zero_topSubPredict = 1;
else if w_ind75_time >= 48 then
zero_medSubPredict = 1;
if w_ind75_time = 1 or w_ind75_time = 3 then
one_topSubPredict = 1;
else if w_ind75_time = 4 or w_ind75_time = 6 then
one_topSubPredict = 1;
else if w_ind75_time >= 10 and w_ind75_time <= 14 then
one_topSubPredict = 1;
else if w_ind75_time >= 20 and w_ind75_time <= 24 then
one_topSubPredict = 1;
else if w_ind75_time = 18 or w_ind75_time = 25 then
one_medSubPredict = 1;
else if w_ind75_time >= 31 and w_ind75_time <= 37 then
one_medSubPredict = 1;
run;
libname mydata "/home/msanregret/sasuser.v94";
/* merging the rules together */
data mydata.rules;
merge description
subject;
by subject;
run;
/* merging the rules with my dataset */
data mydata.TopicsWithRules_2014;
merge mydata.bigmergedtopics
mydata.rules;
by subject;
run;
Network paperthesis1
Network paperthesis1Network paperthesis1
Network paperthesis1
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1
 
optimizing_site_performance
optimizing_site_performanceoptimizing_site_performance
optimizing_site_performance
 
AI and ML.pptx
AI and ML.pptxAI and ML.pptx
AI and ML.pptx
 
miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptx
 
Jt3616901697
Jt3616901697Jt3616901697
Jt3616901697
 
Types of Sentiment Analysis
Types of Sentiment AnalysisTypes of Sentiment Analysis
Types of Sentiment Analysis
 
How to Write Compelling Emails in a Channel Marketing Organization
How to Write Compelling Emails in a Channel Marketing OrganizationHow to Write Compelling Emails in a Channel Marketing Organization
How to Write Compelling Emails in a Channel Marketing Organization
 
Emailphishing(deep anti phishnet applying deep neural networks for phishing e...
Emailphishing(deep anti phishnet applying deep neural networks for phishing e...Emailphishing(deep anti phishnet applying deep neural networks for phishing e...
Emailphishing(deep anti phishnet applying deep neural networks for phishing e...
 
5 e mail marketing quick wins- cloud marketing manager
5 e mail marketing quick wins- cloud marketing manager5 e mail marketing quick wins- cloud marketing manager
5 e mail marketing quick wins- cloud marketing manager
 
S N A I L Final Presentation
S N A I L    Final  PresentationS N A I L    Final  Presentation
S N A I L Final Presentation
 
Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...
Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...
Live Twitter Sentiment Analysis and Interactive Visualizations with PyLDAvis ...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Top 9 tips for email marketing
Top 9 tips for email marketingTop 9 tips for email marketing
Top 9 tips for email marketing
 
Top-15-Best-Practices-for-Email-Marketers-with-AI.pdf
Top-15-Best-Practices-for-Email-Marketers-with-AI.pdfTop-15-Best-Practices-for-Email-Marketers-with-AI.pdf
Top-15-Best-Practices-for-Email-Marketers-with-AI.pdf
 
Report
ReportReport
Report
 

SAS Text Mining

using that word to predict whether an email falls into a certain category. For example, a word like “twitter” would come up very often; however, it is not a good word for separating the categories, because it is likely to show up in all emails coming into the company’s customer service department. It is equally likely to appear in emails that fall under the category and in all other emails.

Figure 1: Jittered scatterplot of the Canonical1 scores of each email

Figure 1 shows a visual of the discriminant analysis. In this particular example, the blue dots represent emails that describe the specific software bug I was interested in, while the red dots represent everything else. An email’s classification is based on how close its Canonical1 score (X axis) lies to each group’s average Canonical1 score. These averages are represented by the larger red and blue circles on the plot. The Canonical1 scores are calculated as a linear combination of indicator variables showing whether or not an email contained a specific word.
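For readers who want to see what an analysis of this kind looks like in code, a minimal sketch is shown below. The data set name, the has_* keyword indicators, and the bug_related label are hypothetical placeholders rather than the names used in the project; the intent is only to illustrate how a canonical discriminant analysis on keyword indicators could be run.

/* Hypothetical sketch: canonical discriminant analysis on 0/1 keyword indicators.  */
/* labeled_emails, the has_* flags, and bug_related (1 = bug-related, 0 = other)    */
/* are illustrative stand-ins for the project's actual variables.                   */
proc candisc data=labeled_emails out=canon_scores ncan=1;
   class bug_related;
   var has_community has_issue has_respond has_forum has_reproduce;
run;

/* canon_scores now contains Can1, each email's Canonical1 score; the two group     */
/* means of Can1 correspond to the large red and blue circles in Figure 1.          */
proc means data=canon_scores mean;
   class bug_related;
   var Can1;
run;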
Page 3

Figure 2: Discriminant analysis scoring coefficients example

Figure 2 shows a sample of the scoring coefficients for these indicator variables. Each variable name represents an indicator for whether or not that word is included in the email. For example, if we use the preceding coefficients to represent the entire data set, an email with the words “community” and “issue”, but without the words “respond”, “forum”, and “reproduce”, would have a Canonical1 score of about .624 + .363 + 0 + 0 + 0 = .987. This example email’s Canonical1 score falls closer to the blue circle, so it would be predicted to be an email relating to the specific type of software bug I was looking for. It is important to note that in Figure 1 there are some red dots that fall closer to the blue group average (the blue circle); these are the incorrectly predicted emails. The same is true for the blue dots that fall closer to the red circle. In the score summaries shown in Figure 1, the percentage of emails that were misclassified is approximately 7.6%.

The method I used to determine the proportion of emails relating to the specific software bug is not very “traditional.” I did not check the Canonical1 score of an email and then determine which category’s average Canonical1 score was closer. Keep in mind, my initial goal was not to predict whether individual emails related to the specific software bug I was looking for. Instead, I only wanted to predict the overall percentage of these emails within a given data set. To do this, I examined the probability assigned to each individual email of relating to the software bug, and then took the average of these probabilities as the overall percentage of emails relating to the specific problem. For example, suppose we have a data set of 4 emails, and the probabilities of each email relating to the specific problem are .3, .05, .6, and .7. Traditionally, we would predict an email to relate to the software bug if its probability is greater than .5, so in this example we would predict 2 out of the 4 emails to be talking about the software bug. Using my strategy instead, the proportion of software bug emails would be predicted as .41 (the average of the four probabilities). I found this method to be much more accurate when predicting the overall proportions of the large data set I was working with.

In order to perform this discriminant analysis, I manually selected words of my choosing. I did this by examining about 1,000 emails and determining whether or not they fell into the category I was looking for. Once this was completed, I used a frequency word counter to show how often certain words appeared within the emails I was interested in. Finally, I chose the words I thought did a good job of segregating these emails from all others. However, I wouldn’t always choose the best keywords. After running individual discriminant analyses I would swap out the words with coefficient scores close to zero (which means they do a poor job of segregating the two groups of emails). I repeated this process until I obtained a small enough misclassification rate on my sample of emails.
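Returning to the averaging strategy described above, the sketch below shows one way it could be carried out in SAS: score every email with posterior probabilities from a discriminant analysis, then average the bug-related probability. The data set names, the has_* indicators, and the bug_flag labels ('bug'/'other') are hypothetical placeholders, not the project's actual names.

/* Hypothetical sketch: estimate the overall proportion of bug-related emails by    */
/* averaging posterior probabilities rather than counting hard classifications.     */
proc discrim data=labeled_emails testdata=all_emails testout=scored;
   class bug_flag;                              /* 'bug' vs. 'other', illustrative  */
   var has_community has_issue has_respond has_forum has_reproduce;
run;

/* The testout data set holds one posterior-probability column per class level;     */
/* the mean of the 'bug' column is the estimated share of bug-related emails.       */
proc means data=scored mean;
   var bug;
run;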
Page 4

Part 3: SAS Text Miner Intro

This project began with the goal of predicting the overall percentage of emails describing a specific type of problem. The company could use this information to decide whether it should invest time in discovering a more efficient method of preventing this type of problem, in order to save time in the long run. My overall goal was to save the company as much time as possible.

The method with which I had begun my project was not an ideal way to approach this issue. It required me to sort through an initial sample of emails by hand, it only allowed me to examine the one specific type of email I was looking for, and it forced me to run multiple analyses in order to obtain the best set of keywords for segregating the two groups. To further investigate how to save as much time as possible, I needed to find a more efficient strategy.

Rather than limiting my research to one specific type of problem, I decided to find a way to predict how long an individual email will take to resolve. Could the length of time an email takes to resolve be predicted by a model whose only predictor variable is the text of that email? In order to build such a model, I had to improve upon my previous strategy by using SAS Text Miner to examine the unstructured text buried in the emails.

Figure 3 shows the basic, cleaned structure of the email data I worked with. It contains columns for the subject line, the body of the email (description), and the total time until resolution (in hours).

Figure 3: Sample data set of emails
Page 5

Part 4: SAS Text Miner Tools

Figure 4: Sample of SAS Text Miner Diagram

Figure 4 shows a simple example of the SAS Text Miner diagram for these text mining tools. The diagram begins with the node on the left, which represents the data set. The text parsing node that follows is similar to a word frequency counter; it counts the frequency of nouns, pronouns, interjections, verbs, and so on. This information is then passed through the text filter node to correct for misspellings and plural forms. The text filter node also allows you to import a custom list of synonyms specific to your data, a technique I used for this project.

After the text filter node, the data were run through 3 tools: text topics, text clusters, and text rule builders. Each of these tools provided me with predictor variables to use in my analysis. A text topic or cluster is a collection of words that describe and characterize a main theme or idea within each email. For example, if I create two text topics for my data set of emails, one text topic may describe emails where the client is experiencing a certain bug in the software, while the other may describe emails where the client isn't describing a problem but asking a question. The text topic about the software bug may contain words such as “issue”, “resolve”, or “fix”. Similarly, the text topic about questions may contain words such as “ask”, “wondering”, or “curious”. SAS Text Miner allowed me to specify how many topics/clusters I wanted to create with my data. It would then scan through every email in the data set and create the desired number of text topics/clusters.

Figure 5: Text Topic output
Page 6

Figure 5 above shows sample output of some of the words that each text topic contained. The column labeled # Docs represents the number of documents (emails) that fall under the corresponding text topic. Some of the text topic words include URL links or email addresses. I refer to these as “garbage” topics and threw them out, because they are likely to exist only within the specific data set used to create the list of text topics. In other words, I am unable to generalize this information to emails outside of the data set. The output from the text clusters is not shown because they operate in a similar manner, grouping words that describe a group of emails. The main difference between topics and clusters is that an individual email can be categorized into multiple text topics, but only one text cluster.

The most significant SAS Text Miner tool for my analysis is the text rule builder, which operates slightly differently than the text topics/clusters. Text rules require a categorical response variable, but I was dealing with a quantitative response variable (total time in hours). I therefore created a binary response variable by placing a cutoff at the 75th percentile of total time until resolution. If an individual email was in the upper 25th percentile of total time, it was categorized as taking a long time (this could be a type of email worth figuring out how to respond to more efficiently in the future). All other emails were categorized as not taking a long time. A value of 1 represented an email in the upper 25th percentile, and a value of 0 otherwise.

A text rule consists of 1 to 3 words. If an email contains all of the words of the rule, it qualifies for the rule, and it can then be predicted as either taking a long time (value of 1) or not (value of 0).

Figure 6: Text Rule Builder Output

Figure 6 shows output for the text rules. The rule column contains the word(s) for each individual rule; if an email contains these words, it qualifies for the rule. The 4th rule shown in Figure 6 contains the words “search” and “response”, its target value is 1, and it has a true positive/total value of 8/8. This means that out of all the emails, 8 of them contained the words “search” and “response”, and all 8 were categorized in the upper 25th percentile of total time until resolution, meaning they took a long time. There were also “garbage” rules created in this process, such as rule number 10 in the output.
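As a concrete illustration of how a rule of this kind behaves, the sketch below flags emails whose description contains both words of the “search”/“response” rule. This is only a hypothetical DATA-step re-implementation; the real rules were generated and scored by the Text Rule Builder node, and the emails2014 data set name is an illustrative placeholder.

/* Hypothetical sketch: flag emails whose description contains both words of the    */
/* "search" + "response" rule.                                                      */
data rule_check;
   set emails2014;                               /* hypothetical input data set     */
   /* find() returns the position of the substring, or 0 if absent; the 'i'         */
   /* modifier makes the search case-insensitive                                    */
   qualifies_rule4 = (find(description, 'search', 'i') > 0 and
                      find(description, 'response', 'i') > 0);
run;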
Page 7

In order to determine whether an email was classified as taking a long time or not, I combined all 3 of the aforementioned tools. Because the data contain a text-based variable for the subject of the email and another variable for the body of the email, I had to run 2 separate node trails, one for each, and later combine them into a single data set. Combining the text topics and clusters was straightforward because their format allowed them to be merged easily. The text rules, however, were a different story: the format they were exported in didn't allow merging them into the data. To work around this problem, I created separate variables indicating whether an email qualified for a meaningful rule. This allowed me to pick and choose which rules would get passed in, so “garbage” rules could be dropped and only the rules that contained a high percentage of the target value would be included. I classified the new variables into 8 separate categories:

- one_topPredict
- one_topSubPredict
- one_medPredict
- one_medSubPredict
- zero_topPredict
- zero_topSubPredict
- zero_medPredict
- zero_medSubPredict

The first variable on the list (one_topPredict) is a top predictor of emails that are categorized as a 1 (taking a long time). To create the one_topPredict variable, I examined all the rules built for the description of the email that did a good job of predicting whether an email took a long time. If an email qualified for one of these rules, it received a value of 1 under the one_topPredict variable; otherwise it received a value of 0. The same logic applies to the other variables. When a variable name contains “Sub”, the variable relates to rules looking at the subject of the email. The difference between “top” and “med” in the variable names refers to how high a percentage of emails were accurately predicted. For example, if an email qualified under the one_topPredict variable, it would have a 97.7% chance of being categorized as a 1 (this percentage is the number of emails that qualified for this variable and were categorized as a 1, divided by the total number of emails that qualified for this variable). If an email qualified for one_medPredict, it would only have a 63.4% chance of being categorized as a 1.

The final analytic data set contained several hundred variables. Each email was assigned an indicator variable for whether or not it belonged to each text topic, along with a raw score variable for each text topic. If the raw score value for an email was high enough for a certain text topic, the email qualified for that text topic. This resulted in 270 text topics for the body of the email and 114 text topics for the subject of the email, each with its own respective indicator and raw score variables. In terms of text clusters, each email was assigned singular value decomposition (SVD) scores, which are used to determine the probability of an email falling into a certain cluster. My final data set contained 50 SVD values and 190 text clusters for the body of an email, along with 30 SVD values and 11 text clusters for the subject of an email.

All of these variables were used as potential explanatory variables in a decision tree model, a logistic regression model, and a linear regression model.
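Once the final table is assembled, hit rates such as the 97.7% figure quoted for one_topPredict can be checked with a simple cross-tabulation. The sketch below assumes a merged data set called final_2014 (a hypothetical name) containing both the custom rule indicator and the ind75_time response.

/* Hypothetical sketch: among emails with one_topPredict = 1, the row percentage    */
/* for ind75_time = 1 should correspond to the ~97.7% figure reported above.        */
proc freq data=final_2014;
   tables one_topPredict * ind75_time / nocol nopercent;
run;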
Page 8

For the sake of ease of interpretation from a non-statistical perspective, I will mainly discuss the findings of the decision tree model in this report.

A decision tree is made up of a series of blocks and branches. The first block at the top of the decision tree represents the entire data set I was using. Thus, the first block had 25% of the emails categorized as a 1 (taking a long time) and 75% of the emails categorized as a 0 (not taking a long time).

Figure 7: Beginning segment of my decision tree model

Figure 7 shows the beginning of the decision tree for this analysis (only part of the total tree). The first block is split into two separate blocks by a condition. Each condition is selected based on what will segregate the groups of emails the most; in other words, what condition will put the highest possible percentage of emails categorized as 1 in one group and the highest possible percentage of emails categorized as 0 in the other group. The condition for the very first block is whether an email falls under the one_topPredict variable or not. If an email falls under the one_topPredict variable, it is classified into the block on the left; otherwise it is classified into the block on the right. The blocks continue to split on various conditions of the predictor variables until they reach their respective bottom rows.
Page 9

The conditions for the decision tree branches were assigned manually; I did not have the computer automatically assign conditions for me. I wanted to avoid assigning “garbage” conditions to the decision tree branches, because such conditions only exist within this specific data set and cannot be generalized to external emails. However, I did examine the recommended conditions (generated by SAS Text Miner) that did the best job of segregating emails based on my categorical response. This process allowed me to review these conditions manually and use only the ones I wanted.

Once the decision tree reaches the bottom row of blocks, the emails are as segregated as possible. In the final blocks, an email is predicted to have a certain probability of being categorized as a 1 or a 0. This probability is based on the block's percentage of emails within each category.
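The tree itself was built interactively in the Decision Tree node, so there is no single program that reproduces it exactly. For readers who prefer code, a roughly comparable tree could be grown with PROC HPSPLIT in SAS/STAT; the sketch below is only an approximation, and the data set name and the short predictor list are hypothetical stand-ins for the several hundred variables actually available.

/* Hypothetical sketch: a classification tree on the binary duration response.      */
/* final_2014 is illustrative; the two predictors are named in this report, but the */
/* real model considered hundreds of topic, cluster, and rule variables.            */
proc hpsplit data=final_2014 maxdepth=6;
   class ind75_time one_topPredict;
   model ind75_time = one_topPredict DescCluster_prob136;
   grow entropy;
   prune costcomplexity;
run;

Because the splits in the actual project were chosen by hand from the candidates SAS suggested, an automatically grown tree like this would not match Figure 7 exactly.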
Page 10

Part 5: Results

The results of the decision tree can be used to predict my categorical variable of time duration. The decision tree sorted through the hundreds of predictor variables and separated the data based on whether emails met certain conditions of these predictor variables. For example, if an email qualified for a certain text topic, was also part of a specific cluster, and contained a positive value for one of the custom rule variables, the decision tree would predict a specific probability of that email falling into the upper 25th percentile of total time duration.

Certain branches of the decision tree, which I refer to as “money makers”, classify a high percentage of emails as equal to 1 (upper 25th percentile) or 0 (lower 75th percentile). These nodes also contain a decent number of emails that meet the specified conditions. Figure 8 shows one of the “money maker” branches for predicting that an email does not take a long time (value of 0).

Figure 8: Money maker for predicting emails not taking a long time

Out of the 5,518 emails (just the emails from the year 2014), 265 fell into this node. Of these 265 emails, 98.5% were categorized as 0 and 1.5% were categorized as 1. The conditions for this node can be seen on the right-hand side. The top condition is the very first condition emails had to meet to be classified into this node, followed by the subsequent conditions in sequential order. The variable name of the second condition, “DescCluster_prob136”, represents the probability of an email falling into cluster 136 of the description of the email.
Page 11

Figure 9: Money maker for predicting emails taking a long time

Figure 9 shows an example of a “money maker” that predicts a high percentage of emails taking a long time. Notice that the number of emails that fall under this node is lower than in the previous example. This is to be expected, because emails taking a long time make up only 25% of the data set as opposed to 75%.

There were several “money maker” blocks for each category of email (long versus not long duration), and these were combined into a final model for the decision tree. From these combined results, 179 emails were classified as having a 97.7% probability of taking a long time; these can be considered the top tier of prediction. The second tier of prediction consisted of a separate 232 emails having a 62.9% probability of taking a long time. While 62.9% is not an extremely high probability, this may be due to the nature of the data set, in which emails taking a long time represent only 25% of the data; in other words, I was able to find 232 emails that are more than twice as likely as average to take a long time. In regards to predicting whether an email will not take a long time, 464 emails were classified as having a 99.1% probability of not taking a long time. I didn't create a second tier for predicting emails not taking a long time because they already represented 75% of the data.

When all 3 of these prediction groups are combined (179 + 232 + 464 = 875 of the 5,518 emails), they represent only about 15.8% of the data set. In other words, I was only able to find meaningful predictions for 15.8% of the data I was looking at. While this is not as high as I had hoped, it is certainly better than nothing.
Page 12

Part 6: Additional Results

A total of 3 models were produced for my analyses: a decision tree, a logistic regression, and a linear regression. Of these 3 models, only the decision tree has been discussed in this report, because of its utility: it allowed me to create groups of emails with an extremely high percentage of belonging to one category or the other, while the same can't be said for the other two models.

The logistic regression model predicted the probability of an email taking a long time (categorized with a value of 1). This is similar to the purpose of my decision tree model; however, the concept is not as easy to present to a company. It can be difficult to explain concepts such as log odds and odds ratios to people with little statistical knowledge. It's much easier to explain that if an email meets certain conditions, it will have a certain probability of taking a long time.

While the decision tree is easier to explain, I was only able to confidently predict length of time for about 16% of all the emails in the data set. The remaining emails are less straightforward to categorize, but we may use a linear regression model to get a slightly better prediction for them. Linear regression is an easy concept to explain to someone with no statistical background. The linear regression model used about 70 predictor variables, selected from the same list of several hundred variables used in the decision tree. Linear regression requires a quantitative response variable, so I used total time until resolution (the initial response variable I started with). The variable selection process for the linear regression was automated by the computer in order to come up with the most useful model. Unlike the decision tree, I was not able to select which predictors to leave out of the model, which means the computer could automatically select “garbage” predictors to insert into my model. However, I could work around this by removing the unwanted predictor variables from the initial data set before they were passed into the linear regression node.
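The regression models were fit through Enterprise Miner nodes rather than hand-written code, but equivalent base-SAS procedures exist. The sketch below shows one way the two regressions could be approximated; the data set name and the short predictor list are hypothetical placeholders, and the stepwise option in PROC GLMSELECT stands in for the node's automated variable selection.

/* Hypothetical sketch: logistic regression on the binary response                  */
proc logistic data=final_2014 descending;
   model ind75_time = one_topPredict zero_topPredict DescCluster_prob136;
run;

/* Hypothetical sketch: linear regression on total time with automated (stepwise)   */
/* variable selection                                                               */
proc glmselect data=final_2014;
   model total_time = one_topPredict zero_topPredict DescCluster_prob136
                      / selection=stepwise;
run;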
Page 13

Figure 10: Mean Predicted values vs. Mean Target values on Total Time

Figure 10 is a graph of predicted values vs. target values for average total time across the depths of the data set. The data are measured at every depth interval of 5, as shown on the horizontal axis; depth can be thought of as a percentage of the data. The predicted values shown in the figure seem like a good fit for the actual target values of total time, but the figure only displays the average predicted value for every 5% interval of the data, which means each point is the average prediction for a group of about 250 emails. When the predicted values for the emails are examined individually, they have high residual values overall. In other words, the linear regression model performed well over each 5% average within my data, but when each email is examined individually, the model isn't very trustworthy. However, some knowledge is better than none.
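A depth chart like Figure 10 can be reproduced outside Enterprise Miner by binning the scored emails into 5% groups and averaging. The sketch below assumes a hypothetical scored data set with the regression prediction stored in pred_total_time; only total_time is an actual variable name from the project.

/* Hypothetical sketch: compare mean predicted and mean actual total time within    */
/* 5% depth bins (20 groups), mirroring the horizontal axis of Figure 10.           */
proc rank data=scored_emails out=ranked groups=20 descending;
   var pred_total_time;          /* hypothetical predicted value from the model     */
   ranks depth_bin;              /* 0 = top 5% of predictions (depth 5), and so on  */
run;

proc means data=ranked mean;
   class depth_bin;
   var pred_total_time total_time;
run;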
Page 14

Part 7: Conclusion and Future Studies

What are the actual uses of this information? The company has a certain number of people in the customer service department who respond to these emails. Some of these employees have a lot of experience, while others are more inexperienced. The most time-efficient method of responding to emails would be to have the less experienced employees respond to emails that do not take a long time, while the more experienced employees respond to all the other emails. The main reasoning behind this idea is that the biggest time sink for the company occurs when a less experienced employee attempts to respond to an email that takes a long time; because the employee is less experienced, the email will take even longer to resolve.

In order to save time, the text mining algorithm, based on specific words found in the subject line and body of the email, could be run on every new email that comes into the customer service department. If the email is predicted not to take a long time, it would be assigned to a less experienced employee. If an email is predicted to take a long time, it would be assigned to a more experienced employee. All of the “in-between” emails can be assigned to whoever is free at the given time. While the results from the decision tree only classify approximately 16% of the data, anything helps when it comes to saving the company as much time as possible when responding to customer service emails. This is why I believe it is a profitable strategy to classify emails with the decision tree first and then run the linear regression model on emails that are not strongly predicted to belong to a certain category. Depending on the predicted value from the linear regression model, we could assign the email to a newer employee or a more experienced employee (a rough sketch of this triage logic is given at the end of this section).

Thus, up to this point we have established an efficient method for assigning emails to a company's employees based on topics and clusters of text embedded in the email. But what if we could identify which types of emails are taking a long time to resolve? This would allow the company to develop more efficient methods for resolving the subject matter of these emails, saving time in the long run. A next step for this project would therefore be to examine the emails with a high predicted probability of taking a long time. Are there any patterns in these emails? Are there specific types of problems these emails are discussing? To find out, we must delve into the various text topics/clusters found in the email text. What are the words used in the text topics/clusters of the nodes these high-probability emails fall into? What do those text topics/clusters mean in context? Do the majority of emails in a specific node relate to a specific type of problem? In order to do this, we will need to identify the words associated with each text topic, cluster, and rule. The format of the topics and rules makes this easy, but clusters are more difficult: the visualization of the decision tree doesn't show the words associated with a cluster, only the cluster number. To work around this, you must manually go back into the cluster node output to see the words associated with the respective cluster number. These findings would be even more useful for a company looking to discover methods for resolving these types of emails more efficiently in the future.
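To make the proposed triage concrete, the sketch below routes scored emails to a queue. The scored_emails data set and the p_long and pred_total_time variables are hypothetical names for the tree's predicted probability and the regression's predicted total time, and the probability thresholds are illustrative; only the 185.6-hour figure is taken from the project (the 75th-percentile cutoff used for the 2014 emails in the Appendix).

/* Hypothetical sketch: assign each scored email to a queue, first using the        */
/* decision tree probability, then falling back on the regression estimate.         */
data routed;
   set scored_emails;                          /* hypothetical scored data set      */
   length queue $ 12;
   if p_long >= 0.9 then queue = 'experienced';         /* strongly predicted long  */
   else if p_long <= 0.05 then queue = 'new_hire';      /* strongly predicted short */
   else if pred_total_time > 185.6 then queue = 'experienced'; /* regression-based  */
   else queue = 'new_hire';
run;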
Page 15

Part 8: References

Sarma, Kattamuri S. Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications. Cary, NC: SAS Institute, 2007. Print.

Ville, Barry de, and Padraic Neville. Decision Trees for Analytics Using SAS Enterprise Miner. Cary, NC: SAS Institute, 2013. Print.

SAS Certification Prep Guide: Advanced Programming for SAS 9. Cary, NC: SAS Institute, 2011. Print.

SAS Certification Prep Guide: Base Programming for SAS 9. Cary, NC: SAS Institute, 2011. Print.

Cohen, K. Bretonnel, and Lawrence Hunter. "Getting Started in Text Mining." PLoS Computational Biology 4.1 (2008). Web.

"Getting Started with SAS Enterprise Guide: Main Menu." N.p., n.d. Web.
Page 16

Part 9: Appendix

The purpose of this section is to provide reproducible steps for achieving the results I did. I will start with the SAS code used to import and clean the data until it was in a format I could work with. In order to read the data in, I had to import many CSV files separately, because the large data sets caused my computer to crash.

/* data from year 2010 */
proc import datafile='E:\Lithium\Case History Status\csv1 2010.csv'
   dbms=csv out=sample12010 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv2 2010.csv'
   dbms=csv out=sample22010 replace; guessingrows=2000; run;

/* data from year 2011 */
proc import datafile='E:\Lithium\Case History Status\csv1 2011.csv'
   dbms=csv out=sample12011 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv2 2011.csv'
   dbms=csv out=sample22011 replace; guessingrows=2000; run;

/* data from year 2012 */
proc import datafile='E:\Lithium\Case History Status\csv1 2012.csv'
   dbms=csv out=sample12012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv3 2012.csv'
   dbms=csv out=sample32012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv4 2012.csv'
   dbms=csv out=sample42012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv5 2012.csv'
   dbms=csv out=sample52012 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv6 2012.csv'
   dbms=csv out=sample62012 replace; guessingrows=2000; run;

/* data from year 2013 */
proc import datafile='E:\Lithium\Case History Status\csv1 2013.csv'
   dbms=csv out=sample12013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv2 2013.csv'
   dbms=csv out=sample22013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv3 2013.csv'
   dbms=csv out=sample32013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv4 2013.csv'
   dbms=csv out=sample42013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv5 2013.csv'
   dbms=csv out=sample52013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv6 2013.csv'
   dbms=csv out=sample62013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv7 2013.csv'
   dbms=csv out=sample72013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv8 2013.csv'
   dbms=csv out=sample82013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv9 2013.csv'
   dbms=csv out=sample92013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv10 2013.csv'
   dbms=csv out=sample102013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv11 2013.csv'
   dbms=csv out=sample112013 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv12 2013.csv'
   dbms=csv out=sample122013 replace; guessingrows=2000; run;

/* data from year 2014 */
proc import datafile='E:\Lithium\Case History Status\csv2 2014.csv'
   dbms=csv out=sample22014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv3 2014.csv'
   dbms=csv out=sample32014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv4 2014.csv'
   dbms=csv out=sample42014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv5 2014.csv'
   dbms=csv out=sample52014 replace; guessingrows=2000; run;
proc import datafile='E:\Lithium\Case History Status\csv7 2014.csv'
   dbms=csv out=sample72014 replace; guessingrows=2000; run;
/* Combining all of the data together from all years */
data alldata;
   set sample12010 sample22010 sample12011 sample22011 sample12012
       sample32012 sample42012 sample52012 sample62012 sample12013
       sample22013 sample32013 sample42013 sample52013 sample62013
       sample72013 sample82013 sample92013 sample102013 sample112013
       sample122013 sample22014 sample32014 sample42014 sample52014
       sample72014;
run;

proc export data=alldata
   outfile='E:\Senior Project\Data\data.csv' dbms=csv replace;
run;

/* Sorting the Data appropriately */
libname loc "Libraries\Documents";

proc import datafile='E:\Senior Project\Data\data.csv'
   dbms=csv out=totaldata replace; guessingrows=2000; run;

proc sort data=totaldata;
   by Date_Time_Opened Subject Description Case_History_Status;
run;

proc export data=totaldata
   outfile='E:\Senior Project\Data\alldatasorted.csv' dbms=csv replace;
run;

/* Combining the status durations of the email's in_progress status */
proc import datafile='E:\Senior Project\Data\alldatasorted.csv'
   dbms=csv out=sorted replace; guessingrows=2000; run;

/* assigning an ID variable to each individual email */
data sorted1;
   set sorted;
   by Date_Time_Opened Subject Description Case_History_Status;
   dateopened=datepart(Date_Time_Opened);
   retain ID 0;
   if first.Description then ID=ID+1;
   TotDuration+Duration;
   if first.Case_History_Status then TotDuration=Duration;
   if last.Case_History_Status then output;
run;

/* transposing the status durations in order to combine later */
proc transpose data=sorted1 out=sorted2;
   ID Case_History_Status;
   by ID dateopened Subject Description;
   var TotDuration;
run;

/* combining the status durations for in_progress status */
data sorted4;
   set sorted2;
   In_Progress=sum(In_Progress,In_Progress__Engineering_,
                   In_Progress__Support_,In_Progress__Internal_,
                   In_Progress__TechOps_,In_Progress__Social_Dynamx_,
                   In_Progress__DATA_);
   Delay=sum(Delay,Delayed);
run;

proc export data=sorted4
   outfile='E:\Senior Project\Data\CombinedInProgress.csv' dbms=csv replace;
run;

/* assigning half year variables to my emails to assist with organizing by time.
   This would allow me to analyze emails within the year they were sent.
   Also summing all statuses to retrieve the total time variable. */
proc import datafile='E:\Senior Project\Data\CombinedInProgress.csv'
   dbms=csv out=combined replace; guessingrows=35000; run;

libname mylib "Desktop";

/* assigning half year values */
data mylib.halfyears (encoding=asciiany);
   set combined;
   if dateopened<18444 then halfyear=1;
   if dateopened>=18444 and dateopened<18628 then halfyear=2;
   if dateopened>=18628 and dateopened<18809 then halfyear=3;
   if dateopened>=18809 and dateopened<18993 then halfyear=4;
   if dateopened>=18993 and dateopened<19175 then halfyear=5;
   if dateopened>=19175 and dateopened<19359 then halfyear=6;
   if dateopened>=19359 and dateopened<19540 then halfyear=7;
   if dateopened>=19540 and dateopened<19724 then halfyear=8;
   if dateopened>=19724 then halfyear=9;

   /* summing all status durations to retrieve total time */
   total_time=sum(In_Progress,New,Updated_by_Customer,Waiting_for_Fix,
      Work_Complete,Pending_Customer_Response,Scheduled_for_Production_Deploym,
      Waiting_for_Upgrade,Awaiting_Customer_Approval,Delay,ER_Planned_for_Roadmap,
      Waiting_for_Enhancement,Preparing_for_Production_Deploym,Delayed__Misc_,
      Delayed__Production_Freeze_);
   keep ID dateopened subject description halfyear total_time;
run;

proc export data=mylib.halfyears
   outfile='E:\Senior Project\Data\total_time.csv' dbms=csv replace;
run;

/* this is where I created my categorical response variable. I actually created
   3 separate response variables: one for the 50th percentile, one for the 75th
   percentile, and one for the 90th percentile. I assigned these cutoff values
   within each half year in order to adjust for the effect of time on the email's
   duration. In my final analysis I ended up only using the 75th percentile. */
libname mylib "Desktop";

proc import datafile='F:\Senior Project\Data\total_time.csv'
   dbms=csv out=totaltime replace; guessingrows=35000; run;
/* Identifying my cutoff values */
proc means mean median q3 p90 n;
   by halfyear;
run;

/* assigning cutoff values for each of my 3 categorical responses */
data mylib.total_time_ (encoding=asciiany);
   set totaltime;

   ind50_time=0;
   if halfyear=1 and total_time>164.7 then ind50_time=1;
   if halfyear=2 and total_time>136.86 then ind50_time=1;
   if halfyear=3 and total_time>99.56 then ind50_time=1;
   if halfyear=4 and total_time>76.17 then ind50_time=1;
   if halfyear=5 and total_time>45.4 then ind50_time=1;
   if halfyear=6 and total_time>46.47 then ind50_time=1;
   if halfyear=7 and total_time>46.96 then ind50_time=1;
   if halfyear=8 and total_time>80.85 then ind50_time=1;
   if halfyear=9 and total_time>70.18 then ind50_time=1;

   ind75_time=0;
   if halfyear=1 and total_time>822.3 then ind75_time=1;
   if halfyear=2 and total_time>534 then ind75_time=1;
   if halfyear=3 and total_time>382.5 then ind75_time=1;
   if halfyear=4 and total_time>287.4 then ind75_time=1;
   if halfyear=5 and total_time>160.9 then ind75_time=1;
   if halfyear=6 and total_time>144.8 then ind75_time=1;
   if halfyear=7 and total_time>161 then ind75_time=1;
   if halfyear=8 and total_time>236.1 then ind75_time=1;
   if halfyear=9 and total_time>185.6 then ind75_time=1;

   ind90_time=0;
   if halfyear=1 and total_time>2082.1 then ind90_time=1;
   if halfyear=2 and total_time>1554 then ind90_time=1;
   if halfyear=3 and total_time>1579.8 then ind90_time=1;
   if halfyear=4 and total_time>977 then ind90_time=1;
   if halfyear=5 and total_time>479.9 then ind90_time=1;
   if halfyear=6 and total_time>402.8 then ind90_time=1;
   if halfyear=7 and total_time>450.7 then ind90_time=1;
   if halfyear=8 and total_time>625 then ind90_time=1;
   if halfyear=9 and total_time>507.5 then ind90_time=1;
run;
/* I exported 3 total data sets: one for the entire span from 2008 to 2014, one
   from 2012 to 2014, and one of just the 2014 emails. In my final analysis I
   ended up only looking at the 2014 email data set. */
data mylib.total_time_since2012 (encoding=asciiany);
   set mylib.total_time_;
   if halfyear>4;
run;

data mylib.total_time_2014 (encoding=asciiany);
   set mylib.total_time_;
   if halfyear=9;
run;

proc export data=mylib.halfyears
   outfile='E:\Senior Project\Data\total_time.csv' dbms=csv replace;
run;

This concludes the data cleaning/manipulation section. The next section covers the code used within SAS Text Miner: creating my custom synonym data set, merging my topics/clusters, creating my rule variables, and merging them all together. The first page includes a picture of my final SAS Text Miner diagram to provide an idea of what everything looked like; I then explain areas of the diagram in more detail, along with the SAS code used in those areas.

Figure 11: SAS Text Miner Final Diagram

The top right of the diagram is the area in which I performed my analysis. You can see a node for the decision tree, the logistic regression, and the linear regression.
I had to use two data sets in this area in order to use my 2 response variables separately: one data set contained my categorical response and the other contained my quantitative response. I compare all 3 models with a model comparison node at the bottom of this area, which allowed me to identify the misclassification rates of the categorical response models as well as to compare ROC curves between the models.

The top left of the diagram represents the node trail used to create my custom set of synonyms, specific to the jargon of the emails. Below is the code within the SAS code node that creates the data set of synonyms.

/* Creating my custom synonyms */
%textsyn( termds=emws2.textfilter_terms
        , docds=&em_import_data
        , outds=&em_import_transaction
        , textvar=description
        , mnpardoc=8
        , mxchddoc=10
        , synds=mydata.halfyearextsyns
        , dict=mydata.engdict2
        , maxsped=15
        ) ;

The middle portion of the diagram is the area where I created my topics/clusters and rules for the body and subject line of the emails. The node trail on the left is for the body of the emails and the node trail on the right is for the subject line of the emails. The second-from-the-bottom SAS code node is where I merge the text topics/clusters together. The SAS code node at the bottom of the middle section is where I create my rule variables and merge them with my entire data set. The code for both nodes is displayed below.

/* merging my topics/clusters */
proc sort data=emws2.texttopic_train; by subject; run;
proc sort data=emws2.texttopic2_train; by subject; run;
proc sort data=emws2.textcluster_train; by subject; run;
proc sort data=emws2.textcluster2_train; by subject; run;

libname mydata "/home/msanregret/sasuser.v94";

data mydata.bigmergedtopics;
   merge emws2.texttopic_train emws2.texttopic2_train
         emws2.textcluster_train emws2.textcluster2_train;
   by subject;
run;

/* Separately creating my custom rule variables. I will merge them all together later. */
proc sort data=EMWS2.TextRule_Train; by subject; run;

/* rule variables for the description of the email */
data description (keep = subject zero_topPredict zero_medPredict one_topPredict);
   set EMWS2.TextRule_Train;
   zero_topPredict = 0;
   zero_medPredict = 0;
   one_topPredict = 0;
   if w_ind75_time = 37 then zero_topPredict = 1;
   else if w_ind75_time >= 40 and w_ind75_time <= 44 then zero_topPredict = 1;
   else if w_ind75_time = 47 or w_ind75_time = 48 then zero_medPredict = 1;
   if w_ind75_time = 1 then one_topPredict = 1;
   else if w_ind75_time >= 3 and w_ind75_time <= 8 then one_topPredict = 1;
   else if w_ind75_time = 17 then one_topPredict = 1;
   else if w_ind75_time >= 12 and w_ind75_time <= 15 then one_topPredict = 1;
   else if w_ind75_time = 27 or w_ind75_time = 29 then one_topPredict = 1;
run;

proc sort data=EMWS2.TextRule2_Train; by subject; run;

/* rule variables for the subject of the email */
data subject (keep = subject zero_topSubPredict zero_medSubPredict
                     one_topSubPredict one_medSubPredict);
   set EMWS2.TextRule2_Train;
   zero_topSubPredict = 0;
   zero_medSubPredict = 0;
   one_topSubPredict = 0;
   one_medSubPredict = 0;
   if w_ind75_time = 42 or w_ind75_time = 43 then zero_topSubPredict = 1;
   else if w_ind75_time = 45 or w_ind75_time = 47 then zero_topSubPredict = 1;
   else if w_ind75_time >= 48 then zero_medSubPredict = 1;
   if w_ind75_time = 1 or w_ind75_time = 3 then one_topSubPredict = 1;
   else if w_ind75_time = 4 or w_ind75_time = 6 then one_topSubPredict = 1;
   else if w_ind75_time >= 10 and w_ind75_time <= 14 then one_topSubPredict = 1;
   else if w_ind75_time >= 20 and w_ind75_time <= 24 then one_topSubPredict = 1;
   else if w_ind75_time = 18 or w_ind75_time = 25 then one_medSubPredict = 1;
   else if w_ind75_time >= 31 and w_ind75_time <= 37 then one_medSubPredict = 1;
run;

libname mydata "/home/msanregret/sasuser.v94";

/* merging the rules together */
data mydata.rules;
   merge description subject;
   by subject;
run;

/* merging the rules with my dataset */
data mydata.TopicsWithRules_2014;
   merge mydata.bigmergedtopics mydata.rules;
   by subject;
run;