CS4642 - Data Mining & Information Retrieval 
Report based on KDD Cup 2014 Submission 
Siriwardena M. P. -100512X 
Upeksha W. D. -100552T 
Wijayarathna D. G. C. D. -100596F 
Wijayarathna Y. M. S. N. -100597J
1.0 Introduction 
DonorsChoose.org is an online charity that makes it easy to help students in need 
through school donations. Teachers in K-12 schools propose projects requesting 
materials to enhance the education of their students. When a project reaches its funding 
goal, they ship the materials to the school. 
The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that 
are exceptionally exciting to the business. While all projects on the site fulfill some kind 
of need, certain projects have a quality above and beyond what is typical. By identifying 
and recommending such projects early, they will improve funding outcomes, better the 
user experience, and help more students receive the materials they need to learn. 
Participants are provided with six data files containing the details of the projects in the 
training set and the test set. While the donations.csv and outcomes.csv files provide 
data only about the projects in the training set, the projects.csv, essays.csv and 
resources.csv files provide information about both the training set and the test set. 
The sampleSubmission.csv file provides a sample of the results that competitors are 
required to compute and submit for the competition. 
- donations.csv - contains information about the donations to each project. 
- essays.csv - contains project text posted by the teachers. 
- projects.csv - contains information about each project. 
- resources.csv - contains information about the resources requested for each project. 
- outcomes.csv - contains information about the outcomes of projects in the training set. 
- sampleSubmission.csv - contains the project ids of the test set and shows the 
submission format for the competition. 
The data is provided in a relational format and split by dates. Any project posted prior to 
2014-01-01 is in the training set; any project posted on or after that date is in the test 
set. Some projects in the test set may still be live and are ignored in the scoring. Which 
projects are still live is not disclosed, to avoid leaking their funding status. 
2.0 Classification Algorithms used 
To implement the above tasks we used the following classification algorithms, as 
implemented in the python scikit-learn library. By giving the same input features and 
parameters to each of these algorithms, we evaluated which algorithm gives the best 
results for this problem.
2.1 Stochastic Gradient Descent 
Stochastic Gradient Descent learns a linear function $f(x) = w^{T}x + b$ for a training 
set $(x_1, y_1), \ldots, (x_n, y_n)$. The target is to find an $f(x)$ that satisfies as many 
of the labels $y_i$ in the training set as possible. The method uses gradient descent to 
find the optimum values of the weights $w$ and the intercept $b$, i.e. the values that 
minimize the training loss. 
We tried this method with the SGDClassifier implemented in the python scikit-learn 
library. The following is the part of our code that does the classification using SGD. 
from sklearn.linear_model import SGDClassifier 

model = SGDClassifier(loss='modified_huber', penalty='l2') 
model.fit(train, outcomes == 't')          # train: sparse feature matrix, outcomes: class labels 
preds = model.predict_proba(test)[:, 1]    # probability of the label being 't' 
Here we create an SGDClassifier and fit it to the training data, where train is a sparse 
matrix of features and outcomes is the array of class labels. We can then predict the 
probability of the class label being ‘t’ for each project in the test set. 
When creating the SGDClassifier we tried different values for loss and penalty, and 
chose the most effective combination by observing the results with the ROC curve. 
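A minimal sketch of that search, assuming the 60/40 validation split described in 
section 3.1 (X_train, X_test, y_train, y_test); the candidate lists here are illustrative, 
and only losses that support predict_proba are included: 

from sklearn.linear_model import SGDClassifier 
from sklearn.metrics import roc_auc_score 

# 'log' in older scikit-learn releases; newer versions call it 'log_loss' 
for loss in ['log', 'modified_huber']:           # losses with predict_proba support 
    for penalty in ['l1', 'l2', 'elasticnet']: 
        model = SGDClassifier(loss=loss, penalty=penalty) 
        model.fit(X_train, y_train) 
        preds = model.predict_proba(X_test)[:, 1] 
        print(loss, penalty, roc_auc_score(y_test, preds)) 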
2.2 Decision Tree 
Decision trees are another very popular method used for classification. A decision tree 
is a predictive model that maps observations about an item to conclusions about the 
item's target value. 
We used the decision tree implementation in the scikit-learn library in our project. 
from sklearn import tree 

clf = tree.DecisionTreeClassifier()      # a classifier is required for predict_proba 
clf = clf.fit(train, outcomes == 't') 
preds = clf.predict_proba(test)[:, 1] 
However, with the resources we had, we could only build decision trees with 4-5 
features; we got a Python MemoryError when we tried to use a decision tree with a 
higher number of features. So we decided against using this method. 
2.3 Support Vector Machine 
A Support Vector Machine transforms the original training data into a higher-dimensional 
space and searches there for the optimal linear separating hyperplane. To implement the 
SVM, we used the python scikit-learn library. 
from sklearn import svm 

clf = svm.SVC(probability=True)          # probability=True is needed for predict_proba 
clf = clf.fit(train, outcomes == 't') 
preds = clf.predict_proba(test)[:, 1]
This method took extremely long to run, so we decided against using it as well. 
2.4 Logistic Regression 
The logistic function is 

$F(t) = \dfrac{1}{1 + e^{-t}}$ 

and, substituting $t = \beta_0 + \beta_1 x$, this is equivalent to 

$\dfrac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}$ 

If we plot the input value $\beta_0 + \beta_1 x$ against the output value $F(x)$, the input 
can take any value from negative infinity to positive infinity, whereas the output $F(x)$ is 
confined to values between 0 and 1 and hence is interpretable as a probability. 
If there are multiple explanatory variables, the expression $\beta_0 + \beta_1 x$ can be 
extended to $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m$. 
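The odds form above follows in one line from the definition: 

$\dfrac{F(x)}{1 - F(x)} = \dfrac{1/(1 + e^{-(\beta_0 + \beta_1 x)})}{e^{-(\beta_0 + \beta_1 x)}/(1 + e^{-(\beta_0 + \beta_1 x)})} = \dfrac{1}{e^{-(\beta_0 + \beta_1 x)}} = e^{\beta_0 + \beta_1 x}$ 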
To keep the model from overfitting the training set, we have to tune the variable ‘C’, 
which is the inverse of the regularization strength; smaller values of C specify stronger 
regularization, so C normally needs to be a fairly small value. From the ROC curve 
values for small values of C, we identified that C = 0.35 fits our model best. Here are 
the ROC curve values for the C values we tested. 
C       ROC curve value 
0.2     0.697740468829667 
0.31    0.69850667254796539 
0.35    0.69853160409851656 
0.4     0.69845107304459875 
In logistic regression the derived function can overfit the training set, meaning it may 
describe the training set very well but not the test data. To reduce this effect, we 
decided to divide the training set into several parts, create a separate 
LogisticRegression object for each part to get several sets of predictions, and then 
derive the final result by taking their mean. We tried dividing into 2, 3 and 4 parts; 
dividing into 2 gave better results than using a single training set, and also scored 
higher than dividing into 3 or 4. 
from sklearn.linear_model import LogisticRegression 

model = LogisticRegression(C=0.35) 
# fit on the first half of the training set 
model.fit(train.tocsr()[:train.shape[0]/2], outcomes[:outcomes.size/2] == 't') 
preds1 = model.predict_proba(test)[:, 1] 
# fit on the second half of the training set 
model.fit(train.tocsr()[train.shape[0]/2:], outcomes[outcomes.size/2:] == 't') 
preds2 = model.predict_proba(test)[:, 1] 
preds = (preds1 + preds2) / 2    # average the two sets of predictions
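As an illustration of the general pattern (a sketch under the same assumptions as 
above, not our exact submission code), the averaging scheme for k parts can be 
written as: 

import numpy as np 
from sklearn.linear_model import LogisticRegression 

def averaged_predictions(train_csr, outcomes, test, k=2, C=0.35): 
    # split the training rows into k contiguous parts, fit one model per 
    # part, and average the k sets of test-set predictions 
    bounds = np.linspace(0, train_csr.shape[0], k + 1).astype(int) 
    preds = np.zeros(test.shape[0]) 
    for start, end in zip(bounds[:-1], bounds[1:]): 
        model = LogisticRegression(C=C) 
        model.fit(train_csr[start:end], outcomes[start:end] == 't') 
        preds += model.predict_proba(test)[:, 1] 
    return preds / k 

With k = 2 this reproduces the two-part scheme above. 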
3.0 Data Preprocessing Techniques Used 
3.1 Cross Validation using ROC curve 
The evaluation method Kaggle follows is to calculate the area under the ROC curve for 
the result set we submit, against the actual result set they hold. The issue is that we 
were allowed only 5 submissions per day, so we needed another way to approximate 
our ROC value locally in order to tune the parameters of the algorithms and the 
features. We therefore used cross validation: we separated the training set into two 
parts, 60% as a new training set and 40% as a new test set, applied the training 
algorithms to the new training set, and validated the results against the new test set 
using ROC values. The following code shows this split, using logistic regression as the 
classifier. 

from sklearn import cross_validation 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import roc_auc_score 

# separate the training set into two parts (60% train / 40% test) 
X_train, X_test, y_train, y_test = cross_validation.train_test_split( 
    train, outcomes == 't', test_size=0.4, random_state=0) 
model = LogisticRegression(C=3.1)    # run the algorithm on the new training set 
model.fit(X_train, y_train) 
preds = model.predict_proba(X_test)[:, 1] 
roc_auc_score(y_test, preds)         # calculate the ROC value 
3.2 Selecting features to use 
To select features we analyzed the given dataset in several ways. We plotted graphs 
and assessed the relevance of each feature to the final outcome. By analyzing the data 
we found that some features reduce the final outcome while others significantly 
increase it. We also derived some data fields from the original data fields, and we 
broke some data fields into classes. The features we used are listed below. 
1. teacher_acctid 
2. schoolid 
3. school_city 
4. school_state 
5. school_metro 
6. school_district 
7. school_county 
8. school_charter 
9. school_magnet 
10. school_year_round 
11. school_nlns 
12. school_kipp 
13. school_charter_ready_promise 
14. teacher_prefix 
15. teacher_teach_for_america 
16. teacher_ny_teaching_fellow 
17. primary_focus_subject 
18. primary_focus_area 
19. resource_type 
20. grade_level 
21. eligible_double_your_impact_match 
22. eligible_almost_home_match 
We used the above features without any changes. We found most of the good features 
by trial and error. Before selecting some features, we graphed them against the 
percentage of projects that were exciting and not exciting.
(Graphs of individual features against the percentage of exciting projects appeared here.)
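As an illustration, one such graph can be produced along these lines, assuming a 
DataFrame df that joins projects.csv with outcomes.csv and has an is_exciting column 
of 't'/'f' labels (the setup names here are hypothetical): 

import matplotlib.pyplot as plt 

# fraction of exciting projects within each category of one feature 
rates = df.groupby('resource_type')['is_exciting'].apply(lambda s: (s == 't').mean()) 
rates.plot(kind='bar') 
plt.ylabel('fraction of exciting projects') 
plt.show() 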
3.3 Handling Null values 
We noticed that some features in projects.csv have a significant number of null values. 
Here are the percentages of null values we calculated for each feature in projects.csv. 
teacher_acctid : 0.0 
schoolid : 0.0 
school_ncesid : 6.4351 
school_latitude : 0.0 
school_longitude : 0.0 
school_city : 0.0 
school_state : 0.0 
school_zip : 0.0006 
school_metro : 12.3337 
school_district : 0.1427 
school_county : 0.0025 
school_charter : 0.0 
school_magnet : 0.0 
school_year_round : 0.0 
school_nlns : 0.0 
school_kipp : 0.0 
school_charter_ready_promise : 0.0 
teacher_prefix : 0.00066 
teacher_teach_for_america : 0.0 
teacher_ny_teaching_fellow : 0.0 
primary_focus_subject : 0.0058 
primary_focus_area : 0.0058 
secondary_focus_subject : 31.3045 
secondary_focus_area : 31.3045 
resource_type : 0.0067 
poverty_level : 0.0 
grade_level : 0.0013 
fulfillment_labor_materials : 5.2826 
total_price_excluding_optional_support : 0.0 
total_price_including_optional_support : 0.0 
students_reached : 0.02198 
eligible_double_your_impact_match : 0.0 
eligible_almost_home_match : 0.0 
date_posted : 0.0 
Some features, such as secondary_focus_subject and secondary_focus_area, have 
more than 30% null values, so we removed them from our training data because they 
don't give much information about the training set. For the features that have less than 
10% null values, we filled the null values with the most common value in the dataset. 
from collections import Counter 

for i in range(len(projectCatogorialColumns)): 
    # count the values in the column and pick the most common one 
    data = Counter(projects[projectCatogorialColumns[i]]) 
    projects[projectCatogorialColumns[i]] = \
        projects[projectCatogorialColumns[i]].fillna(data.most_common(1)[0][0]) 
3.4 Assigning ranges to numerical features 
There are some features, such as 
total_price_excluding_optional_support 
total_price_including_optional_support 
students_reached 
school_latitude 
school_longitude 
which hold integer and floating-point numerical values. It is highly unlikely that the 
same exact value appears again and again; for example, there may be only one school 
in the data set whose latitude is exactly 35.34534, but it is reasonable to treat schools 
in the range 32 < latitude < 36 as being in the same geographical area. So we defined 
such ranges for each numerical feature to improve the training data and avoid 
overfitting. The following functions perform this binning. 

import math 
import numpy as np 

def div_fulfillment_labor_materials(): 
    # max = 35.0, min = 9.0 
    arr = [] 
    nparr = np.array(projects.fulfillment_labor_materials) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 2)) 
    return arr 

def div_total_price_excluding_optional_support(): 
    # max = 10250017.0, min = 0.0 
    arr = [] 
    nparr = np.array(projects.total_price_excluding_optional_support) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 50)) 
    return arr 

def div_total_price_including_optional_support(): 
    # max = 12500020.73, min = 0.0 
    arr = [] 
    nparr = np.array(projects.total_price_including_optional_support) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 50)) 
    return arr 

def div_students_reached(): 
    # max = 999999.0, min = 0.0 
    arr = [] 
    nparr = np.array(projects.students_reached) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 80)) 
    return arr
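There is no corresponding function for school_latitude in the listing above; a 
hypothetical one in the same style (4-degree bins, so 32 ≤ latitude < 36 falls into one 
bucket) could look like this: 

def div_school_latitude(): 
    # hypothetical: 4-degree latitude bins 
    arr = [] 
    nparr = np.array(projects.school_latitude) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 4)) 
    return arr 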
3.5 Deriving new features from existing features 
Some features in the dataset don't give much information on their own, but combining 
them to derive new features can give better results. We identified some features that 
can be improved in this way. 
1. cost 
item_unit_price and item_quantity in the Resources dataset don't give much 
information when taken separately. But deriving a new feature 
cost = item_unit_price * item_quantity 
gives the total cost required for each project. 
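A minimal pandas sketch of this derivation (assuming resources.csv is loaded into a 
DataFrame; projectid is the grouping key present in the original data): 

import pandas as pd 

resources = pd.read_csv('resources.csv') 
resources['cost'] = resources['item_unit_price'] * resources['item_quantity'] 
cost_per_project = resources.groupby('projectid')['cost'].sum()   # total cost per project 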
2. month, week 
The date_posted feature in the Projects data set gives the date on which each project 
was posted, in yyyy-mm-dd format. Usually this is only used to separate the training 
data and the test data in the 
data set. But if we take the month separately and add it as a feature, it improves overall 
training, because projects posted in some months (like October and November) have a 
higher probability of getting their requests accepted. One logical reason is that most 
companies allocate funds for CSR and other charity work during particular seasons of 
the year. 
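A minimal sketch of this extraction, assuming projects is a pandas DataFrame (column 
names as in projects.csv): 

import pandas as pd 

dates = pd.to_datetime(projects['date_posted'])   # parse 'yyyy-mm-dd' strings 
projects['month'] = dates.dt.month                # 1-12 
projects['week'] = dates.dt.week                  # week of year (newer pandas: dt.isocalendar().week) 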
3. essay_length 
We calculated the length of the essay, assuming that a good description of a project 
should have a reasonable number of words. 
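A one-line sketch, assuming essays is a pandas DataFrame loaded from essays.csv: 

# word count of each essay; missing essays count as 0 
essays['essay_length'] = essays['essay'].fillna('').str.split().str.len() 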
3.6 Assigning integer labels for string categorical values 
There are some features with categorical values in the data set, and some of these 
values are strings. If we pass such data directly into the training algorithms, building 
the model takes a large amount of time and space, because the algorithm has to work 
with string matching and searching, which is very expensive. To avoid this, we 
assigned a unique integer value to each categorical value in the preprocessing stage. 
Because handling integers is much easier and faster than handling strings, this 
improves the speed and space requirements of the algorithm. 
from sklearn.preprocessing import LabelEncoder 

for i in range(0, projects.shape[1]): 
    le = LabelEncoder() 
    # e.g. the strings 'CA', 'NY', ... become the integers 0, 1, ... 
    projects[:, i] = le.fit_transform(projects[:, i])
3.7 Categorical variable binarization (One hot encoding) 
Instead of keeping categorical values vertically in a single feature column, we observed 
that it gives better results to expand each category horizontally into a new feature and 
assign 1 or 0 to each row. For example, a single column with the values urban, rural 
and suburban becomes three 0/1 columns. This needs more memory than before 
because it creates a sparse matrix, but it is computationally very easy for the training 
algorithm, as it only works with 1s and 0s. 
from sklearn.preprocessing import OneHotEncoder 

ohe = OneHotEncoder() 
projects = ohe.fit_transform(projects)   # sparse 0/1 matrix 
3.8 Feature extraction from essays (NLP section) 
Whether a project is exciting or not can be significantly affected by the free-text details 
given in the essays.csv file, such as the essay and the need statement. So we 
extracted important features from those details and used them in logistic regression to 
do the classification. 
To extract features from the essays, we used TfidfVectorizer, which is implemented in 
the scikit-learn text feature extraction library. Using TfidfVectorizer, we can convert a 
collection of raw documents into a set of tf-idf features. 
Tf-idf, which stands for term frequency-inverse document frequency, is a numeric 
statistic intended to reflect how important a word is to a document. The tf-idf value 
increases proportionally with the number of times a word appears in a document, but is 
offset by the word's frequency of occurrence across the whole corpus. 
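One common formulation is the smoothed variant that scikit-learn applies by default: 

$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\dfrac{1 + n}{1 + \mathrm{df}(t)} + 1$ 

where $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, $n$ is the number of 
documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. 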
TfidfVectorizer in scikit-learn's feature extraction module is equivalent to a 
CountVectorizer followed by a TfidfTransformer. 
CountVectorizer tokenizes the documents, counts the occurrences of each token, and 
returns the counts as a sparse matrix. TfidfTransformer then applies tf-idf 
normalization to that sparse matrix of occurrence counts. This scales down the impact 
of tokens that occur very frequently in a given corpus, which are empirically less 
informative than tokens that occur in only a small fraction of the training corpus. 
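A short sketch of this equivalence, assuming docs is a list of raw text strings; with 
default parameters the two pipelines produce the same matrix: 

from sklearn.feature_extraction.text import (CountVectorizer, 
                                             TfidfTransformer, 
                                             TfidfVectorizer) 

counts = CountVectorizer().fit_transform(docs)           # token counts, sparse 
tfidf_two_step = TfidfTransformer().fit_transform(counts) 
tfidf_one_step = TfidfVectorizer().fit_transform(docs)   # same result in one step 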
We used the following code, which creates the sparse matrices for the training set and 
the test set from the essay attribute in essays.csv. 

from sklearn.feature_extraction.text import TfidfVectorizer 

tfidf = TfidfVectorizer(min_df=4, max_features=1500) 
tfidf.fit(traindata[:, 5])             # column 5 holds the essay text 
tr = tfidf.transform(traindata[:, 5]) 
ts = tfidf.transform(testdata[:, 5]) 

min_df is a threshold for removing words that appear in too few documents. We 
checked the results using the ROC curve, changing min_df while keeping the other 
attributes constant. The following are the results we got. 
min_df   ROC curve value 
5        0.69853160409851656 
4        0.69853160409851656 
Both values were identical, so we used min_df = 4. 
TfidfVectorizer also builds a vocabulary that considers only the top max_features 
terms, ordered by term frequency across the corpus. We checked the results using 
different values for max_features; the results were as follows. 
max_features   ROC curve value 
500            0.69799790900465308 
1000           0.69853160409851656 
1500           0.69883818907894368 
2000           0.69864848813741032 
So we concluded that 1500 is the best value for max_features. 
