CS4642 - Data Mining & Information
Retrieval
Paper Based on KDDCup 2014 Submission
Group Members:
100227D - Jayaweera W.J.A.I.U.
100470N - Sajeewa G.K.M.C
100476M - Sampath P.L.B.
100612E - Wijewardane M.M.D.T.K.
Group Number : 13
Final Group Rank : 76
Description of Data
In this competition, five data files are available to competitors: donations (information about the donations to each project; provided only for projects in the training set), essays (project text posted by the teachers; provided for both the training and test sets), projects (information about each project; provided for both sets), resources (information about the resources requested for each project; provided for both sets) and outcomes (information about the outcomes of projects in the training set). Before starting the knowledge discovery process, we analyzed the provided data.
First of all, the number of records in each file was counted to get an idea of the amount of data available. The projects file has 664098 records, the essays file 664098 records, the outcomes file 619326 records, the resources file 3667217 records and the donations file 3097989 records. Our next task was to identify the criterion used to separate test data from training data. After reading the competition details we realized that projects posted on or after 2014-01-01 belong to the test set and projects posted before 2014-01-01 belong to the training set. Accordingly, 619326 projects are available for the training set and the remaining 44772 projects form the test set. For each project in the training set, the project description, essay data, resources requested, donations provided and outcome are given. For each project in the test set, only the project description, essay data and resources requested are given.
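As a sketch, the date-based split described above could be expressed in Pandas as follows. The toy rows here are illustrative stand-ins for the real 664098-row projects file; only the `projectid` and `date_posted` column names follow the competition schema.

```python
import pandas as pd

# A hypothetical miniature of the projects file (real file: 664098 rows).
projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3", "p4"],
    "date_posted": ["2013-05-01", "2013-12-31", "2014-01-01", "2014-03-15"],
})
projects["date_posted"] = pd.to_datetime(projects["date_posted"])

# Projects posted on/after 2014-01-01 form the test set; the rest are training.
cutoff = pd.Timestamp("2014-01-01")
train = projects[projects["date_posted"] < cutoff]
test = projects[projects["date_posted"] >= cutoff]
```

On the full data this split yields the 619326/44772 record counts mentioned above.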
Data Imbalance Problem
After gaining a brief understanding of the data provided, we started to analyze the training set. When we drew a graph of the projects' posted dates against the "is_exciting" attribute, we realized that there are no exciting projects before April 2010; the graph was completely skewed to the right side.
This leads to a data imbalance problem, as the number of exciting projects is very small compared to the number of non-exciting projects (exciting - 5.9274%). The histogram of exciting and non-exciting projects was as follows.
In the competition forum there was an explanation for this. It said that the organization might not have kept track of some of the requirements needed to decide 'is_exciting' for projects before 2010. We therefore suspected that the classifications given in the outcomes file for pre-2010 projects may not be correct, and we decided to use a down-sampling technique to handle the imbalanced data (removing projects posted before 2010). It is true that valuable information may be lost when projects are removed, but the accuracy gained by removing that data outweighed the loss of information. All the classifiers we used performed better after removing projects posted before 2010.
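The down-sampling step amounts to a single date filter on the labelled training frame. A minimal sketch, assuming the real `date_posted` and `is_exciting` column names but toy rows:

```python
import pandas as pd

# Hypothetical miniature of the labelled training data
# (real set: 619326 rows, ~5.93% exciting).
train = pd.DataFrame({
    "projectid": ["a", "b", "c", "d"],
    "date_posted": pd.to_datetime(
        ["2008-06-01", "2009-11-20", "2011-02-01", "2013-09-01"]),
    "is_exciting": ["f", "f", "t", "f"],
})

# Drop projects posted before 2010, whose labels may be unreliable.
train = train[train["date_posted"] >= pd.Timestamp("2010-01-01")]

exciting_pct = (train["is_exciting"] == "t").mean() * 100
```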
Preprocessing Data
First, we analyzed the characteristics of the data using statistical measurements. Using the data frame describe method we calculated the number of records, mean, standard deviation, minimum, maximum and quartile values for each attribute. Given below is the statistical summary of two attributes.
These statistical measurements gave us an idea of the distribution of each attribute.
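The summary step above is a single `describe` call. A small sketch on one hypothetical cost column (the column name follows the competition schema; the values are made up):

```python
import pandas as pd

# Toy cost column; describe() yields count, mean, std, min, quartiles, max.
df = pd.DataFrame({
    "total_price_excluding_optional_support": [100.0, 250.0, 400.0, 5000.0],
})
summary = df.describe()
```

The large gap between the mean (1437.5) and the median (325.0) in this toy column is exactly the kind of skew that motivated the outlier analysis below.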
Filling Missing Values
Initially we used the pad method (propagate the last valid observation forward) to fill missing values for all attributes. But we realized that we could achieve higher accuracy by selecting a filling method based on the type of each attribute. To do that, we first calculated the percentage of missing values per attribute. It was as follows.
The highest percentages of missing values were for the secondary focus subject and secondary focus area, because some projects have only a primary focus area and subject. We decided to fill missing secondary values with their respective primary values. We used linear interpolation for numeric attributes and the pad method for the others. Later, when we tuned the classifiers, we changed the method from pad to backfill (use the next valid observation) as it obtained higher accuracy than pad.
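The three filling strategies above can be sketched in Pandas as follows. The column names mirror the competition schema; the rows are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "primary_focus_subject": ["Literacy", "Math", "Music"],
    "secondary_focus_subject": [None, "Math", None],
    "students_reached": [30.0, np.nan, 90.0],
    "grade_level": ["Grades 3-5", None, "Grades 6-8"],
})

# Missing secondary subject falls back to the primary subject.
df["secondary_focus_subject"] = df["secondary_focus_subject"].fillna(
    df["primary_focus_subject"])

# Numeric gaps: linear interpolation.
df["students_reached"] = df["students_reached"].interpolate()

# Remaining categoricals: backfill (next valid observation).
df["grade_level"] = df["grade_level"].bfill()
```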
Removing Outliers
When we analyzed the data, we detected outliers in some of the attributes using scatter plots. There were outliers in the cost-related attributes, and we replaced them with the mean value of the attribute. Given below is the outlier analysis of the cost attribute.
The red-circled value can be considered an outlier, as it is far larger than the other values. These outliers caused problems when we discretized the data. To identify outliers in resources, we used the inter-quartile range as a measurement.
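The IQR fence can be sketched as below. One small variant from the text: here the replacement mean is computed over the non-outlier values only, which avoids the outlier dragging its own replacement upward; the values are illustrative.

```python
import pandas as pd

cost = pd.Series([100.0, 120.0, 110.0, 130.0, 10000.0])

# Inter-quartile-range fence: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = cost.quantile(0.25), cost.quantile(0.75)
iqr = q3 - q1
is_outlier = (cost < q1 - 1.5 * iqr) | (cost > q3 + 1.5 * iqr)

# Replace outliers with the mean of the remaining values.
cost.loc[is_outlier] = cost.loc[~is_outlier].mean()
```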
Label Encoding
We did not use all the attributes for prediction. We focused on repetitive features, as they help the classifier more when making predictions. Most of these repetitive features have string values rather than numerical values, and the available classifiers do not accept string-valued features. So we used a label encoder to transform the string values into integers between 0 and n-1, n being the number of different values the feature can take.
But classifiers expect continuous input and may interpret the encoded categories as ordered, which is not desired. To turn the categorical features into features usable with scikit-learn classifiers, we used one-hot encoding. The encoder transforms each categorical feature with k possible values into k binary features, only one of which is active for a particular sample. This improved classifier performance considerably: for example, the SGD classifier obtained an ROC score of about 0.55 without one-hot encoding and about 0.59 with it.
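The two-stage encoding reads roughly as follows with scikit-learn's `LabelEncoder` and `OneHotEncoder` (the subject values here are made up):

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

subjects = ["Math", "Literacy", "Music", "Math"]

# Stage 1: strings -> integers 0..n-1 (classes are sorted alphabetically).
le = LabelEncoder()
codes = le.fit_transform(subjects)

# Stage 2: each integer category -> k binary indicator columns.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
```

Each row of `onehot` has exactly one active column, so no spurious ordering between categories remains.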
Continuous Value Discretization
Project attributes such as school longitude, school latitude, zip code and total cost cannot be used directly for prediction, as they are unlikely to be repetitive. But this information cannot simply be discarded, as it may help the classifiers make decisions. To make these attributes more repetitive, we used discretization: we put the continuous values into bins and used the bin index as the attribute value. For example, we discretized longitude and latitude, dividing projects into five regions (bins), and used the region id instead of the raw coordinates. The discretization results for the total cost attribute are as follows.
We applied the same approach to the cost-related attributes, item count per project, total price of items per project, number of projects per teacher, etc.
This improved the repetitiveness of the attributes considerably, and more useful information was uncovered for the classifier to use.
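One way to realize this binning is Pandas' `cut`, which assigns each continuous value to one of a fixed number of equal-width bins and can return the bin index directly (the cost values below are illustrative):

```python
import pandas as pd

total_cost = pd.Series([95.0, 310.0, 480.0, 1200.0, 2600.0])

# Five equal-width bins; labels=False returns the bin index, which then
# replaces the raw value as the attribute.
bins = pd.cut(total_cost, bins=5, labels=False)
```

Quantile-based bins (`pd.qcut`) are an alternative when the values are heavily skewed, as cost attributes often are.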
Attribute Construction
Some of the features given in the data files cannot be used directly for various reasons (most of the time they are highly non-repetitive). We used some of these features to construct new features, by combining multiple features or transforming one into another. Given below is the list of derived attributes.
1. Month - the posted date of each project was given but is not repetitive. We derived a month attribute from the posted date and used it for prediction.
2. Essay length - for each project the corresponding essay was given, but it cannot be used directly for prediction. We therefore calculated the length of each essay after removing extra spaces within the essay text and used it as an attribute.
3. Need statement length
4. Projects per teacher - we calculated the number of projects per teacher by grouping the projects by 'teacher_acctid' and used it as an attribute.
5. Total items per project - we calculated the total number of items requested for each project from the details in the resources file and used it as an attribute.
6. Cost of total items per project - we calculated the total cost of the items requested for each project from the details in the resources file and used it as an attribute.
Several other derived attributes, such as date and short description length, were considered but did not yield a significant performance improvement.
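Derived attributes 1, 4, 5 and 6 above can be sketched with Pandas group-bys and a merge. The `teacher_acctid`, `item_quantity` and `item_unit_price` column names follow the competition schema; the rows are toy data:

```python
import pandas as pd

projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "teacher_acctid": ["t1", "t1", "t2"],
    "date_posted": pd.to_datetime(["2013-03-05", "2013-07-19", "2013-07-02"]),
})
resources = pd.DataFrame({
    "projectid": ["p1", "p1", "p2", "p3"],
    "item_quantity": [2, 1, 5, 3],
    "item_unit_price": [10.0, 40.0, 2.0, 7.0],
})

# 1. Month of posting.
projects["month"] = projects["date_posted"].dt.month

# 4. Projects per teacher.
projects["projects_per_teacher"] = (
    projects.groupby("teacher_acctid")["projectid"].transform("count"))

# 5. and 6. Item count and total item cost per project, merged back in.
resources["line_cost"] = resources["item_quantity"] * resources["item_unit_price"]
per_project = resources.groupby("projectid").agg(
    total_items=("item_quantity", "sum"),
    total_item_cost=("line_cost", "sum"),
).reset_index()
projects = projects.merge(per_project, on="projectid", how="left")
```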
Model Selection and Evaluation
We used three classifiers during the project: first a decision tree classifier, then logistic regression, and finally an SGD (stochastic gradient descent) classifier. We started with the tree classifier as it was easy to use. To evaluate classifier performance we initially used cross-validation, but we later realized that the competition uses the ROC (area under the curve) score for evaluation, so we used ROC scores as well. As we had several choices of classifier, we read several articles about their usage. From them we learned that decision trees normally do not perform well when there is a data imbalance problem, and that logistic regression is often used instead.
Logistic regression performed well on the given data, achieving an ROC score of about 0.61. To improve accuracy further we used the SGD classifier (logistic regression with SGD training). On one hand it is more efficient than plain logistic regression, so predictions can be made in less time; on the other hand it achieved higher accuracy than the regression classifier. With default parameters the SGD classifier achieved an ROC score of about 0.635. To tune the SGD classifier (to find the best parameter values) we performed a grid search and found optimum values for the number of iterations, penalty, shuffle and alpha parameters. Using those values we were able to improve the ROC score to 0.64.
Ensemble Methods
We tried a boosting algorithm to improve classifier performance, using the AdaBoost method (AdaBoostClassifier). The implementation provided by the scikit-learn library supported only the decision tree and SGD classifiers, so we could not use logistic regression directly. Instead we used the SGD classifier with boosting, but accuracy increased only by an insignificant amount.
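A minimal sketch of the boosting step follows. It uses `AdaBoostClassifier` with its default decision-stump base learner rather than the SGD base learner the team tried, because the base-learner parameter name has changed across scikit-learn versions; the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

# Synthetic two-class problem.
rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# AdaBoost: reweight samples each round so later stumps focus on mistakes.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
auc = roc_auc_score(y, boost.predict_proba(X)[:, 1])
```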
Further Improvements
The essays file contains a huge amount of text, but apart from essay length it was not used for prediction. We tried to extract essay features using TfidfVectorizer, but this was not successful due to memory constraints. As an alternative we tried hashing methods, but they reduced accuracy. We think classifier accuracy may improve further if features from the essay data are included in the training data. Broader use of ensemble methods could also improve the accuracy of predictions.
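The hashing approach mentioned above can be sketched with scikit-learn's `HashingVectorizer`, which maps tokens into a fixed-size feature space without keeping a vocabulary in memory; the essay texts here are invented:

```python
from sklearn.feature_extraction.text import HashingVectorizer

essays = [
    "Students in my classroom need books to practice reading.",
    "My students need a projector for science lessons.",
]

# A fixed 1024-dimensional hashed space; no vocabulary is stored,
# which avoids the memory blow-up TfidfVectorizer hit on the full file.
vec = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
X = vec.transform(essays)
```

The trade-off is that hash collisions merge unrelated tokens and the mapping is not invertible, which may explain the accuracy drop the team observed.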
Support Libraries Used
We used the ‘Pandas’ data analysis library to generate data frames from the provided comma-separated values files, which could then be used with the other data analysis and modeling tools. We also used ‘Pandas’ functions to generate bins for discretizing the attributes with less repetitive values and to merge data frames from several data sources.
We then used the ‘NumPy’ extension library to generate multidimensional arrays from ‘Pandas’ data frames and series, making it easy to access ranges of data (e.g. separating the training set indices from the test set) and to locate properties of the data such as the median and quartiles. ‘NumPy’ functions were also useful when combining derived attributes with existing ones.
‘Scikit-learn’ was the machine learning library we used to integrate data analysis, preprocessing, classification, regression and modeling tools into our implementation. From the tools provided with ‘Scikit-learn’ we used preprocessing tools such as ‘LabelEncoder’, ‘OneHotEncoder’ and ‘StandardScaler’, text feature extraction tools, classification tools such as ‘DecisionTreeClassifier’, ‘SGDClassifier’ and ‘LogisticRegression’, model selection and evaluation tools such as ‘GridSearchCV’, ensemble tools such as ‘AdaBoostClassifier’, and metrics such as ‘roc_auc_score’ to compute the area under the curve (AUC) from prediction scores, as mentioned above.

More Related Content

What's hot

Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)
Andrea Gazzarini
 
Optimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database DesignOptimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database Design
Waqas Tariq
 
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
IRJET Journal
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Alessandro Benedetti
 
Cloud workload analysis and simulation
Cloud workload analysis and simulationCloud workload analysis and simulation
Cloud workload analysis and simulation
Prabhakar Ganesamurthy
 
Review Mining of Products of Amazon.com
Review Mining of Products of Amazon.comReview Mining of Products of Amazon.com
Review Mining of Products of Amazon.com
Shobhit Monga
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Data Works MD
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
IRJET Journal
 

What's hot (8)

Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)
 
Optimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database DesignOptimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database Design
 
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
 
Cloud workload analysis and simulation
Cloud workload analysis and simulationCloud workload analysis and simulation
Cloud workload analysis and simulation
 
Review Mining of Products of Amazon.com
Review Mining of Products of Amazon.comReview Mining of Products of Amazon.com
Review Mining of Products of Amazon.com
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 

Viewers also liked

L’energia i la seva transformació
L’energia i la seva transformacióL’energia i la seva transformació
L’energia i la seva transformaciódominguezvalles
 
2008 election in mongolia
2008 election in mongolia2008 election in mongolia
2008 election in mongolia
Munkhnaran Avirmed
 
Water deficit
Water deficitWater deficit
Water deficit
Chersia
 
1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan
Pet Meds
 
Water deficit
Water deficitWater deficit
Water deficit
Chersia
 
Stream connectors
Stream connectorsStream connectors
Stream connectors
Chamath Sajeewa
 
Corporate laws
Corporate lawsCorporate laws
Corporate laws
charmingattraction
 

Viewers also liked (8)

L’energia i la seva transformació
L’energia i la seva transformacióL’energia i la seva transformació
L’energia i la seva transformació
 
2008 election in mongolia
2008 election in mongolia2008 election in mongolia
2008 election in mongolia
 
QualitySign_IVinogradova_v5
QualitySign_IVinogradova_v5QualitySign_IVinogradova_v5
QualitySign_IVinogradova_v5
 
Water deficit
Water deficitWater deficit
Water deficit
 
1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan
 
Water deficit
Water deficitWater deficit
Water deficit
 
Stream connectors
Stream connectorsStream connectors
Stream connectors
 
Corporate laws
Corporate lawsCorporate laws
Corporate laws
 

Similar to Group13 kdd cup_report_submitted

KDD Cup Research Paper
KDD Cup Research PaperKDD Cup Research Paper
KDD Cup Research Paper
Tharindu Ranasinghe
 
Summary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_DataSummary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_Data
Madeleine Organ
 
Big data project
Big data projectBig data project
Big data project
Kedar Kumar
 
Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1
Arpita Majumder
 
cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02
PRIYANKA MEHTA
 
BATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptxBATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptx
SurajRavi16
 
Unit 5
Unit   5Unit   5
Research proposal
Research proposalResearch proposal
Research proposal
Sadia Sharmin
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And Integrity
Gerrit Klaschke, CSM
 
Hyatt Hotel Group Project
Hyatt Hotel Group ProjectHyatt Hotel Group Project
Hyatt Hotel Group Project
Erik Bebernes
 
50120130406007
5012013040600750120130406007
50120130406007
IAEME Publication
 
CompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCICompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCI
Soham Kulkarni
 
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache SparkAttribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
IRJET Journal
 
13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf
andreyhapantenda
 
PM3 ARTICALS
PM3 ARTICALSPM3 ARTICALS
PM3 ARTICALS
ra na
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Sease
 
Automated Essay Grading using Features Selection
Automated Essay Grading using Features SelectionAutomated Essay Grading using Features Selection
Automated Essay Grading using Features Selection
IRJET Journal
 
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Chaudhry Hussain
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
sumitkumar600840
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdf
Danilo Cardona
 

Similar to Group13 kdd cup_report_submitted (20)

KDD Cup Research Paper
KDD Cup Research PaperKDD Cup Research Paper
KDD Cup Research Paper
 
Summary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_DataSummary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_Data
 
Big data project
Big data projectBig data project
Big data project
 
Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1
 
cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02
 
BATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptxBATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptx
 
Unit 5
Unit   5Unit   5
Unit 5
 
Research proposal
Research proposalResearch proposal
Research proposal
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And Integrity
 
Hyatt Hotel Group Project
Hyatt Hotel Group ProjectHyatt Hotel Group Project
Hyatt Hotel Group Project
 
50120130406007
5012013040600750120130406007
50120130406007
 
CompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCICompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCI
 
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache SparkAttribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
 
13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf
 
PM3 ARTICALS
PM3 ARTICALSPM3 ARTICALS
PM3 ARTICALS
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Automated Essay Grading using Features Selection
Automated Essay Grading using Features SelectionAutomated Essay Grading using Features Selection
Automated Essay Grading using Features Selection
 
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdf
 

Recently uploaded

Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 

Recently uploaded (20)

Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 

Group13 kdd cup_report_submitted

  • 1. CS4642 - Data Mining & Information Retrieval Paper Based on KDDCup 2014 Submission Group Members: 100227D - Jayaweera W.J.A.I.U. 100470N - Sajeewa G.K.M.C 100476M - Sampath P.L.B. 100612E - Wijewardane M.M.D.T.K. Group Number : 13 Final Group Rank : 76
  • 2. Description of Data
In this competition, five data files are available to competitors: donations (information about the donations to each project; provided only for projects in the training set), essays (project text posted by the teachers; provided for both the training and test sets), projects (information about each project; provided for both sets), resources (information about the resources requested for each project; provided for both sets) and outcomes (information about the outcomes of projects in the training set). Before starting the knowledge discovery process, the provided data were analyzed.
First, the number of records in each file was counted to get an idea of the amount of data available. The projects file has 664098 records, the essays file has 664098, the outcomes file has 619326, the resources file has 3667217 and the donations file has 3097989. Our next task was to identify the criterion used to separate test data from training data. After reading the competition details we realized that projects posted on or after 2014-01-01 belong to the test set and projects posted before 2014-01-01 belong to the training set. Accordingly, 619326 projects are available for the training set and the remaining 44772 projects form the test set. For each project in the training set, the project description, essay data, resources requested, donations received and outcome are given. For each project in the test set, only the project description, essay data and resources requested are given.
Data Imbalanced Problem
After gaining a brief understanding of the data provided, we started to analyze the training set. When we plotted the projects' posted dates against the "is_exciting" attribute, we realized that there are no exciting projects before April 2010; the graph was completely skewed to the right side.
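The train/test split by posted date can be sketched as follows. This is a minimal illustration on a tiny synthetic frame standing in for the real projects file; the column names follow the competition data but the rows are made up.

```python
import pandas as pd

# Synthetic stand-in for projects.csv (rows are illustrative only).
projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3", "p4"],
    "date_posted": ["2013-05-01", "2013-12-31", "2014-01-01", "2014-03-15"],
})
projects["date_posted"] = pd.to_datetime(projects["date_posted"])

# Projects posted before 2014-01-01 form the training set; the rest are test.
cutoff = pd.Timestamp("2014-01-01")
train = projects[projects["date_posted"] < cutoff]
test = projects[projects["date_posted"] >= cutoff]

print(len(train), len(test))
```

On the real files this same filter yields the 619326 / 44772 split mentioned above.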
This leads to a class imbalance problem, as the number of exciting projects is very small compared to the number of non-exciting projects (exciting: 5.9274%). The histogram of exciting and non-exciting projects was as follows.
  • 3. In the competition forum there was an explanation for this problem: the organization might not have kept track of some of the requirements needed to decide 'is_exciting' for projects before 2010. We therefore suspected that the classifications given in the outcomes file before 2010 may not be correct, and we decided to use a down-sampling technique to handle the imbalanced data (removing projects posted before 2010). It is true that valuable information may be lost when projects are removed, but the accuracy gained by removing that data outweighed the loss of information, and we were able to obtain higher accuracy by down-sampling the given data. All the classifiers we used performed better after removing projects posted before 2010.
Preprocessing Data
First we analyzed the characteristics of the mining data using statistical measurements. Using the data frame describe method we calculated the number of records, mean, standard deviation, minimum value, maximum value and the quartile values for each attribute. Given below is the statistical measurement of two attributes. These measurements gave us an idea of the distribution of each attribute.
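The down-sampling step amounts to one more date filter on the training frame. A minimal sketch, again on made-up rows with illustrative column names:

```python
import pandas as pd

# Synthetic training rows; 'is_exciting' labels before 2010 are suspect.
train = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "date_posted": pd.to_datetime(["2008-06-01", "2011-02-10", "2013-09-05"]),
    "is_exciting": ["f", "t", "f"],
})

# Drop projects posted before 2010 to remove the unreliable labels.
train = train[train["date_posted"] >= pd.Timestamp("2010-01-01")]

print(train["projectid"].tolist())
```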
  • 4. Filling Missing Values
Initially we used the pad method (propagate the last valid observation forward) to fill missing values of all the attributes. But we realized that we could achieve higher accuracy by selecting a filling method based on the type of the attribute. To do that, we first calculated the percentage of missing values for each attribute, which was as follows. The highest percentages of missing values were for the secondary focus subject and secondary focus area; this is because some projects have only a primary focus area and primary focus subject. We decided to fill missing secondary values with their respective primary values. We also used linear interpolation for numeric values, and the pad method for the other attributes. Later, when tuning the classifiers, we changed the method from pad to backfill (use the next valid observation) as it obtained higher accuracy than pad.
Remove Outliers
When we analyzed the data, outliers were detected in some of the attributes. We used scatter plots to identify them. There were outliers in the cost-related attributes, and we replaced them with the mean value of the corresponding attribute. Given below is the outlier analysis of the cost attribute.
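The per-type filling strategy above can be sketched with pandas. The frame and its values are synthetic; the column names imitate the competition's projects data.

```python
import numpy as np
import pandas as pd

# Synthetic rows with gaps in each kind of column.
df = pd.DataFrame({
    "primary_focus_subject": ["Math", "Science", "Music"],
    "secondary_focus_subject": ["Art", np.nan, np.nan],
    "total_price": [100.0, np.nan, 300.0],
    "poverty_level": ["high", np.nan, "low"],
})

# Missing secondary focus values fall back to the primary focus.
df["secondary_focus_subject"] = df["secondary_focus_subject"].fillna(
    df["primary_focus_subject"])

# Numeric attributes: linear interpolation between neighbours.
df["total_price"] = df["total_price"].interpolate(method="linear")

# Remaining attributes: backfill (next valid observation), which tuned
# better than pad in our experiments.
df["poverty_level"] = df["poverty_level"].bfill()
```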
  • 5. The red-circled value can be considered an outlier, as it is far larger than the other values. These outliers caused a lot of problems when we discretized the data. To identify outliers in resources, we used the interquartile range as a measurement.
Label Encoding
We did not use all the attributes for prediction. We focused on repetitive features, as they help the classifier more when making predictions. Most of these repetitive features/attributes have string values rather than numerical values, and the available classifiers do not accept string values for features. So we used a label encoder to transform those string values into integer values between 0 and n-1, n being the number of different values a feature can take. But classifiers expect continuous input and may interpret the categories as being ordered, which is not desired. To turn the categorical features into features usable with scikit-learn classifiers, we used one-hot encoding. The encoder transformed each categorical feature with k possible values into k binary features, with only one active per sample. This improved the performance of the classifiers to a great extent. For example, the SGD classifier obtained about a 0.55 ROC score without one-hot encoding, and about 0.59 with it.
Continuous Value Discretization
Project attributes such as school longitude, school latitude, zip code and total cost cannot be used directly for prediction, as they are unlikely to be repetitive. But this information cannot be discarded, as it may help the classifiers make decisions. To make these attributes more repetitive we used discretization: we put the continuous values into bins and used the bin index as the attribute. For example, we discretized longitude and latitude, dividing projects into five regions (bins), and used the region id instead of the raw longitude and latitude. The discretization results for the total cost attribute were as follows.
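The encoding and binning steps can be sketched together. The categorical and cost values below are invented stand-ins; the bin count is illustrative, not the tuned value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    "school_state": ["CA", "NY", "CA", "TX"],
    "total_price": [120.0, 540.0, 90.0, 1500.0],
})

# LabelEncoder maps each string category to an integer in [0, n-1] ...
le = LabelEncoder()
codes = le.fit_transform(df["school_state"])

# ... but a classifier may read those integers as ordered, so each
# category is expanded into k binary indicator columns (one-hot).
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()

# Continuous costs are discretized; the bin index replaces the raw value.
df["price_bin"] = pd.cut(df["total_price"], bins=3, labels=False)
```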
  • 6. We applied the same concept to the cost-related attributes, item count per project, total price of items per project, number of projects per teacher, etc. This improved the repetitiveness of the attributes to a great extent, and more useful information was discovered for the classifier to use.
Attribute Construction
Some of the features given in the data files cannot be used directly for various reasons (most often they are highly non-repetitive). We used some of these features to construct new features by combining multiple features or transforming one into another. Given below is the list of derived attributes.
1. Month - the posted date of the project was given but is not very repetitive, so we derived a month attribute from the posted date and used it for prediction.
2. Essay length - for each project the corresponding essay was given, but it cannot be used directly for prediction. We therefore calculated the length of each essay after removing extra spaces within the essay text and used it as an attribute.
3. Need statement length.
4. Projects per teacher - we calculated the number of projects per teacher by grouping the projects on 'teacher_acctid' and used it as an attribute.
5. Total items per project - we calculated the total number of items requested per project from the details provided in the resources file and used it as an attribute.
6. Cost of total items per project - we calculated the total cost of the items requested per project from the details provided in the resources file and used it as an attribute.
Several other derived attributes, such as date and short description length, were considered, but they did not yield a significant performance improvement.
Model Selection and Evaluation
We used three classifiers during the project. First we used a decision tree classifier, then logistic regression, and finally an SGD (stochastic gradient descent) classifier. We started with the tree classifier as it was easy to use.
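Three of the derived attributes above can be sketched with pandas. The rows and essay texts are invented; only 'teacher_acctid' and the general column shapes follow the real data.

```python
import pandas as pd

# Synthetic stand-ins for the projects and essays files.
projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "teacher_acctid": ["t1", "t1", "t2"],
    "date_posted": pd.to_datetime(["2012-03-01", "2012-07-15", "2013-01-20"]),
})
essays = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "essay": ["  We need  books ", "Science kits please", "Art supplies"],
})

# Month derived from the posted date.
projects["month"] = projects["date_posted"].dt.month

# Essay length after collapsing extra whitespace.
essays["essay_length"] = essays["essay"].str.split().str.join(" ").str.len()

# Projects per teacher via a groupby on 'teacher_acctid'.
projects["projects_per_teacher"] = (
    projects.groupby("teacher_acctid")["projectid"].transform("count")
)
```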
To evaluate the performance of the classifiers, we initially used cross-validation. But later we realized that the competition uses the ROC (area under the curve) score for evaluation, so we also used ROC scores to evaluate the performance of the classifiers. As we had several choices of classifier, we read several articles about their usage. From them we learned that decision trees normally do not perform well when there is a class imbalance problem, and that logistic regression is used instead. Logistic regression performed well with the given data and achieved about a 0.61 ROC score. To improve the accuracy further, we used the SGD classifier (logistic regression with SGD training). On one hand, it is more efficient than logistic regression, so predictions can be made in less time; on the other hand, it achieved higher accuracy than the regression classifier. With default parameters for the SGD classifier we were able to achieve about a 0.635 ROC score. To tune the SGD classifier (to find the best values
  • 7. for the parameters) we performed a grid search and found optimum values for the number of iterations, penalty, shuffle and alpha parameters. Using those values we were able to improve the accuracy up to a 0.64 ROC score.
Ensemble Methods
We tried to use a boosting algorithm to improve the performance of the classifier. Among the available methods we used AdaBoost (AdaBoostClassifier). The implementation provided by the scikit-learn library supports only the decision tree classifier and SGD classifier, so we were not able to use logistic regression directly; instead we tried the SGD classifier with the boosting algorithm. But the accuracy increased only by an insignificant amount.
Further Improvements
The essays file contains a huge amount of data, but apart from the essay length it was not used during prediction. We tried to extract essay features using TfidfVectorizer, but this was not successful due to memory constraints. As an alternative we tried hashing methods, but they reduced the accuracy obtained from the essay data. We think that the accuracy of the classifier may improve further if some features from the essay data are included in the training data. Further use of ensemble methods would also likely improve the accuracy of predictions.
Support Libraries Used
We used the 'Pandas' data analysis library to generate data frames from the provided comma-separated values files, which could then be used with the other data analysis and modeling tools we employed. We also used functions provided by the 'Pandas' library for generating bins (to discretize the attributes with less repetitive values) and for merging data frames from several data sources. We then used the 'NumPy' extension library to generate multidimensional arrays from 'Pandas' data frames and series, making it easy to access certain ranges of data (i.e. to separate the indices of the training set from the test set) and to locate properties of the data such as the median and quartiles.
Functions provided by the 'NumPy' library were also useful when combining derived attributes with existing ones. The 'Scikit-learn' machine learning library was what we used to integrate data analysis, preprocessing, classification, regression and modeling tools into our implementation. From the various tools provided by 'Scikit-learn' we used preprocessing tools such as 'LabelEncoder', 'OneHotEncoder', 'StandardScaler' and the text feature extraction tools; classification tools such as 'DecisionTreeClassifier', 'SGDClassifier' and 'LogisticRegression'; model selection and evaluation tools such as 'GridSearchCV'; ensemble tools such as 'AdaBoostClassifier'; and metrics such as 'roc_auc_score' to compute the area under the curve (AUC) from prediction scores, as mentioned above.