- 1. CS4642 - Data Mining & Information Retrieval. Report based on KDD Cup 2014 Submission. Siriwardena M. P. - 100512X, Upeksha W. D. - 100552T, Wijayarathna D. G. C. D. - 100596F, Wijayarathna Y. M. S. N. - 100597J
- 2. 1.0 Introduction

DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. Teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, DonorsChoose.org ships the materials to the school.

The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that are exceptionally exciting to the business. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, DonorsChoose.org can improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.

Participants are provided with six data files describing the projects in the training set and the test set. The donations.csv and outcomes.csv files cover only the projects in the training set, while projects.csv, essays.csv and resources.csv cover both the training set and the test set. sampleSubmission.csv shows the results that competitors have to calculate and submit.

  - donations.csv: contains information about the donations to each project.
  - essays.csv: contains project text posted by the teachers.
  - projects.csv: contains information about each project.
  - resources.csv: contains information about the resources requested for each project.
  - outcomes.csv: contains information about the outcomes of projects in the training set.
  - sampleSubmission.csv: contains the project ids of the test set and shows the submission format for the competition.

The data is provided in a relational format and split by date. Any project posted prior to 2014-01-01 is in the training set. Any project posted after is in the test set. Some projects in the test set may still be live and are ignored in the scoring.
Projects that are still live are not disclosed, to avoid leakage regarding their funding status.

2.0 Classification Algorithms Used

To implement the task above, we tried the following classification algorithms, using the implementations in the Python scikit-learn library. By giving the same inputs to each of these algorithms, we determined which algorithm gives the best results for this problem.
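As a minimal sketch of how candidate classifiers can be compared on the same validation labels, the snippet below computes the area under the ROC curve by hand, as the fraction of (positive, negative) pairs that a model ranks correctly. The function name `roc_auc`, the model names and the two score lists are hypothetical and serve only as illustration; they are not results from the competition data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical validation labels and scores from two made-up models.
labels = [True, False, True, False, False]
scores_by_model = {
    "model_a": [0.9, 0.2, 0.6, 0.7, 0.1],
    "model_b": [0.8, 0.3, 0.9, 0.2, 0.1],
}
auc_by_model = {name: roc_auc(labels, s) for name, s in scores_by_model.items()}
best_model = max(auc_by_model, key=auc_by_model.get)
```

The pairwise form is quadratic in the number of rows, so it is only meant to show what the score measures; on the full data set a library routine would be the practical choice.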
- 3. 2.1 Stochastic Gradient Descent

Stochastic Gradient Descent fits a linear function $f(x) = w^T x + b$ to the training set $(x_1, y_1), \ldots, (x_n, y_n)$. The goal is to find $f(x)$ such that it agrees with as many of the $y_i$ in the training set as possible. The method uses gradient descent, updating on one training sample at a time, to find the values of the weights $w$ and the intercept $b$ that minimize the training loss. We tried this method with SGDClassifier from the Python scikit-learn library. The following code performs the classification using SGD.

    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(loss='modified_huber', penalty='l2')
    model.fit(train, outcomes == 't')
    preds = model.predict_proba(test)[:, 1]

Here we create an SGDClassifier and fit it to the training data, where train is a sparse matrix of features and outcomes is the array of class labels. We can then predict the probability of the class label being 't' for each project in the test set. When creating the SGDClassifier, we tried different values for loss and penalty and chose the most effective combination by comparing the resulting ROC values.

2.2 Decision Tree

The decision tree is also a very popular classification method. A decision tree is a predictive model that maps observations about an item to conclusions about the item's target value. We used the decision tree implemented in the scikit-learn library.

    from sklearn import tree

    # DecisionTreeClassifier is used because only classifiers provide predict_proba
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(train, outcomes == 't')
    preds = clf.predict_proba(test)[:, 1]

With the resources we had, we could only build decision trees with 4-5 features. We got a Python MemoryError when we tried to use a decision tree with a higher number of features, so we decided against using this method.

2.3 Support Vector Machine

A Support Vector Machine transforms the original training data into a higher dimension and searches for the optimal linear separating hyperplane there. To implement the SVM, we used the Python scikit-learn library.
    from sklearn import svm

    # probability=True is required for SVC to provide predict_proba
    clf = svm.SVC(probability=True)
    clf = clf.fit(train, outcomes == 't')
    preds = clf.predict_proba(test)[:, 1]
- 4. This method took an extremely long time to run, so we decided against using it.

2.4 Logistic Regression

The logistic function is

$$F(t) = \frac{1}{1 + e^{-t}}$$

With $t = \beta_0 + \beta_1 x$, this is equivalent to

$$\frac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}$$

If we plot the input $\beta_0 + \beta_1 x$ against the output $F(x)$, the input can take any value from negative infinity to positive infinity, whereas the output $F(x)$ is confined to values between 0 and 1 and hence is interpretable as a probability. If there are multiple explanatory variables, the expression $\beta_0 + \beta_1 x$ can be extended to $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m$.

To keep the model from overfitting the training set, we have to tune the parameter C, which is the inverse of the regularization strength. Smaller values of C give stronger regularization. Comparing the ROC values for small values of C, we found that C = 0.35 fits our model best. Here are the ROC values for the values of C we tested.

    C       ROC value
    0.2     0.697740468829667
    0.31    0.69850667254796539
    0.35    0.69853160409851656
    0.4     0.69845107304459875

In logistic regression the derived function can still overfit the training set, which means it may describe the training set very well but not the test data. To reduce this effect, we divided the training set into several parts, trained a separate LogisticRegression model on each part, and obtained one set of predictions per part. The final result is the mean of these predictions. We tried dividing the training set into 2, 3 and 4 parts. Dividing into 2 parts gave better results than using a single training set, and also scored higher than 3 and 4 parts.

    from sklearn.linear_model import LogisticRegression

    train = train.tocsr()
    half = train.shape[0] // 2   # integer division so the result can be used as an index
    labels = (outcomes == 't')

    model = LogisticRegression(C=0.35)
    model.fit(train[:half], labels[:half])     # first half of the training set
    preds1 = model.predict_proba(test)[:, 1]
    model.fit(train[half:], labels[half:])     # second half, starting at `half` so no row is skipped
    preds2 = model.predict_proba(test)[:, 1]
    preds = (preds1 + preds2) / 2              # mean of the two sets of predictions
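The split-and-average scheme generalizes to any number of parts. The following is a minimal sketch using only the standard library; the helper names `split_bounds` and `average_predictions` and the sample numbers are hypothetical, and the model training itself is left out. It shows how to compute part boundaries that cover every training row and how to take the mean of the per-part predictions.

```python
def split_bounds(n_rows, n_parts):
    """Return (start, stop) index pairs that cover all rows without gaps."""
    base, extra = divmod(n_rows, n_parts)
    bounds, start = [], 0
    for part in range(n_parts):
        # The first `extra` parts get one additional row each.
        stop = start + base + (1 if part < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def average_predictions(prediction_sets):
    """Element-wise mean of several equally long prediction lists."""
    n_sets = len(prediction_sets)
    return [sum(values) / n_sets for values in zip(*prediction_sets)]


# Hypothetical example: 7 training rows split into 2 parts.
bounds = split_bounds(7, 2)
# Made-up predictions from two models, one per part, for three test rows.
preds = average_predictions([[0.2, 0.8, 0.5], [0.4, 0.6, 0.5]])
```

Each `(start, stop)` pair can be used directly as a slice, for example `train[start:stop]`, so consecutive parts share a boundary and no row is dropped between them.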
- 5. 3.0 Data Preprocessing Techniques Used

3.1 Cross Validation using ROC curve

Kaggle evaluates a submission by calculating the area under the ROC curve of the submitted predictions against the actual outcomes, which only Kaggle holds. However, we were allowed only 5 submissions per day, so we needed another way to approximate our ROC value locally in order to tune the algorithm parameters and the features. We used cross validation for this. We separated the training set into two parts: 60% as a new training set and 40% as a new test set. We applied the training algorithms to the new training set and validated the results against the new test set using ROC values.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases

    # separate training set into two parts
    X_train, X_test, y_train, y_test = train_test_split(
        train, outcomes == 't', test_size=0.4, random_state=0)
    model = LogisticRegression(C=3.1)   # run the algorithm on the new training set
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_test)[:, 1]
    roc_auc_score(y_test, preds)        # calculate ROC value

3.2 Selecting features to use

To select features, we analyzed the given data set in several ways. We plotted graphs to find the relevance of each feature to the final outcome. From this analysis we learned that some features lower the final score while others raise it significantly. We also derived some data fields from the original data fields and broke some data fields into classes. The features are listed below.

1. teacher_acctid
2. schoolid
3. school_city
4. school_state
5. school_metro
6. school_district
7. school_county
8. school_charter
9. school_magnet
10. school_year_round
11. school_nlns
12. school_kipp
13. school_charter_ready_promise
14. teacher_prefix
- 6.
15. teacher_teach_for_america
16. teacher_ny_teaching_fellow
17. primary_focus_subject
18. primary_focus_area
19. resource_type
20. grade_level
21. eligible_double_your_impact_match
22. eligible_almost_home_match
23. resource_type

We used the above features without any changes. We found most of the good features by trial and error. Before selecting some of the features, we graphed their values against the percentage of exciting and non-exciting projects.
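The percentages behind such graphs can be computed per feature value before plotting. The following is a minimal sketch with made-up rows; the helper name `exciting_rate_by_value` and the sample data are hypothetical, and it assumes each row pairs a feature value with a boolean flag saying whether the project was exciting.

```python
from collections import defaultdict


def exciting_rate_by_value(rows):
    """Percentage of exciting projects for each value of a categorical feature.

    rows: iterable of (feature_value, is_exciting) pairs.
    """
    totals = defaultdict(int)
    exciting = defaultdict(int)
    for value, is_exciting in rows:
        totals[value] += 1
        if is_exciting:
            exciting[value] += 1
    return {value: 100.0 * exciting[value] / totals[value] for value in totals}


# Made-up sample: (school_metro, is_exciting)
sample = [
    ("urban", True), ("urban", False), ("urban", False), ("urban", False),
    ("rural", True), ("rural", False),
]
rates = exciting_rate_by_value(sample)
```

A feature whose values show clearly different percentages is a candidate for inclusion; one whose values all sit near the overall rate carries little signal on its own.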
- 10. 3.3 Handling Null values

We noticed that some features in projects have a significant share of null values. Here are the percentages of null values we calculated for each feature in projects.

    Feature                                   Null values (%)
    teacher_acctid                            0.0
    schoolid                                  0.0
    school_ncesid                             6.4351
    school_latitude                           0.0
    school_longitude                          0.0
    school_city                               0.0
    school_state                              0.0
    school_zip                                0.0006
    school_metro                              12.3337
    school_district                           0.1427
    school_county                             0.0025
    school_charter                            0.0
    school_magnet                             0.0
    school_year_round                         0.0
    school_nlns                               0.0
    school_kipp                               0.0
    school_charter_ready_promise              0.0
    teacher_prefix                            0.00066
    teacher_teach_for_america                 0.0
    teacher_ny_teaching_fellow                0.0
    primary_focus_subject                     0.0058
    primary_focus_area                        0.0058
    secondary_focus_subject                   31.3045
    secondary_focus_area                      31.3045
    resource_type                             0.0067
    poverty_level                             0.0
    grade_level                               0.0013
    fulfillment_labor_materials               5.2826
    total_price_excluding_optional_support    0.0
    total_price_including_optional_support    0.0
    students_reached                          0.02198
    eligible_double_your_impact_match         0.0
    eligible_almost_home_match                0.0
    date_posted                               0.0
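Percentages like these can be obtained by counting the missing entries in each column. This is a minimal sketch, assuming rows are dictionaries and a missing value is either None or an empty string; the helper `null_percentages` and the sample rows are hypothetical.

```python
def null_percentages(rows, columns):
    """Percentage of missing values (None or empty string) per column.

    Assumes `rows` contains at least one row.
    """
    n_rows = len(rows)
    result = {}
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        result[column] = 100.0 * missing / n_rows
    return result


# Made-up rows for illustration only.
rows = [
    {"school_metro": "urban", "secondary_focus_area": None},
    {"school_metro": None, "secondary_focus_area": "Literacy & Language"},
    {"school_metro": "rural", "secondary_focus_area": None},
    {"school_metro": "urban", "secondary_focus_area": ""},
]
percentages = null_percentages(rows, ["school_metro", "secondary_focus_area"])
```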
- 11. Some features, like secondary_focus_subject and secondary_focus_area, have more than 30% null values. We removed such features from our training data because they do not give much information about the training set. For features that have fewer than 10% null values, we filled the null values with the most common value of that feature in the data set.

    from collections import Counter

    for column in projectCatogorialColumns:
        # count the non-null values and fill the gaps with the most common one
        data = Counter(projects[column].dropna())
        projects[column] = projects[column].fillna(data.most_common(1)[0][0])

3.4 Assigning ranges to numerical features

Some features hold integer or floating point values:

  - total_price_excluding_optional_support
  - total_price_including_optional_support
  - students_reached
  - school_latitude
  - school_longitude

It is highly unlikely that exactly the same value appears again and again. For example, there could be only one school in the data set with a latitude of exactly 35.34534. But it is reasonable to treat schools in the range 32 < latitude < 36 as being in the same geographical area. So we defined such ranges for each numerical feature, to improve the training data and avoid overfitting to it.
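The idea of replacing an exact value with the range it falls in can be sketched as follows. The range width of 4 degrees is an assumption chosen here to match the 32 to 36 example, and the helper `to_range` is hypothetical.

```python
import math


def to_range(value, width):
    """Map a numeric value to the integer index of its fixed-width range."""
    return math.floor(value / width)


# With a width of 4 degrees, latitudes from 32 up to (not including) 36
# share range index 8, while 36.1 falls into the next range.
latitudes = [35.34534, 32.5, 33.9, 36.1]
ranges = [to_range(lat, 4) for lat in latitudes]
```

A wider range groups more rows together and generalizes more, at the cost of losing detail, so the width is a tuning choice for each feature.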
- 12. The following functions assign numerical features to ranges by dividing each value by a fixed width and rounding down.

    import math
    import numpy as np

    def div_fulfillment_labor_materials():  # max = 35.0, min = 9.0
        values = np.array(projects.fulfillment_labor_materials)
        return [math.floor(float(v) / 2) for v in values]

    def div_total_price_excluding_optional_support():  # max = 10250017.0, min = 0.0
        values = np.array(projects.total_price_excluding_optional_support)
        return [math.floor(float(v) / 50) for v in values]

    def div_total_price_including_optional_support():  # max = 12500020.73, min = 0.0
        values = np.array(projects.total_price_including_optional_support)
        return [math.floor(float(v) / 50) for v in values]

    def div_students_reached():  # max = 999999.0, min = 0.0
        values = np.array(projects.students_reached)
        return [math.floor(float(v) / 80) for v in values]

3.5 Deriving new features from existing features

Some features in the data set do not give much information on their own, but combining them to derive new features may give better results. We identified the following features that can be improved this way.

1. cost: item_unit_price and item_quantity in the resources data set do not give much information when taken separately. But the derived feature cost = item_unit_price * item_quantity gives the total cost required for each project.

2. month, week: the date_posted feature in the projects data set gives the date on which each project was posted, in yyyy-mm-dd format. Usually this is used only to separate the training data from the test data. But if we take the month separately and add it as a feature, it improves overall training, because projects posted in some months (like October and November) have a high probability of having their requirements accepted. One plausible reason is that most companies allocate funds for CSR and other charity work in certain seasons of the year.

3. essay_length: we calculated the length of the essay, assuming that a good description of the project should have an acceptable number of words.

- 13. 3.6 Assigning integer labels for string categorical values

Some features in the data set hold categorical values, and some of these are strings. If we pass such data directly to the training algorithms, building the model takes a large amount of time and space, because it involves string matching and searching, which is very expensive. To avoid this, we assigned a unique integer to each categorical value in the preprocessing stage. Because handling integers is much easier and faster than handling strings, this improves the speed and space requirements of the algorithm.

    from sklearn.preprocessing import LabelEncoder

    # projects is a 2-D array here, with one column per categorical feature
    for i in range(projects.shape[1]):
        le = LabelEncoder()
        projects[:, i] = le.fit_transform(projects[:, i])
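The derived features of Section 3.5 can be sketched with the standard library. The record layout, the sample values, the use of ISO week numbers, the word-based essay length and the summing of costs over a project's resources are all assumptions made here for illustration; they are not necessarily how the features were computed for the submission.

```python
from datetime import datetime


def derive_features(date_posted, resources, essay):
    """Derive cost, month, week and essay_length for one project.

    date_posted: 'yyyy-mm-dd' string
    resources:   list of (item_unit_price, item_quantity) pairs
    essay:       essay text
    """
    posted = datetime.strptime(date_posted, "%Y-%m-%d").date()
    return {
        # cost = item_unit_price * item_quantity, summed over the resources
        "cost": sum(price * quantity for price, quantity in resources),
        "month": posted.month,
        "week": posted.isocalendar()[1],     # ISO week number (an assumption)
        "essay_length": len(essay.split()),  # length in words (an assumption)
    }


# Made-up project for illustration only.
features = derive_features(
    "2013-10-15",
    [(12.5, 4), (3.0, 10)],
    "My students need new books for our classroom library",
)
```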
- 14. 3.7 Categorical variable binarization (one-hot encoding)

Instead of keeping the categorical values vertically within a single feature, we observed that we get good results if we expand them horizontally into new features, one per category value, and assign 1 or 0 to each row. This needs more memory than before, because it creates many new columns (stored as a sparse matrix), but it is computationally very easy for the training algorithm because it works only with 1s and 0s.

    from sklearn.preprocessing import OneHotEncoder

    ohe = OneHotEncoder()
    projects = ohe.fit_transform(projects)

3.8 Feature extraction from essays (NLP section)

The free-text details given in essays.csv, such as the essay and the need statement, can have a significant effect on whether a project is exciting. So we extracted features from those details and used them in logistic regression for classification. To extract features from the essays, we used TfidfVectorizer from the scikit-learn text feature extraction library. TfidfVectorizer converts a collection of raw documents into a set of tf-idf features. Tf-idf, which stands for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document. The tf-idf value increases proportionally to the number of times a word appears in a document, but is offset by the frequency of the word in the corpus.
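To make the weighting concrete, here is a minimal sketch of a plain tf-idf computation using tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t. This textbook form is an assumption for illustration: scikit-learn's TfidfVectorizer applies a smoothed idf and normalizes each document vector by default, so its numbers differ. The helper `tf_idf` and the toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Made-up essays for illustration only.
docs = [
    "students need books books",
    "students need tablets",
    "students love music",
]
weights = tf_idf(docs)
```

In the first document, "students" appears in every essay and gets weight zero, while "books" appears twice in that essay and in no other, so it gets the largest weight.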
TfidfVectorizer in sklearn.feature_extraction.text is equivalent to the combination of CountVectorizer and TfidfTransformer. CountVectorizer tokenizes the documents, counts the occurrences of each token and returns the counts as a sparse matrix.
- 15. TfidfTransformer applies tf-idf normalization to the sparse matrix of occurrence counts. This scales down the impact of tokens that occur very frequently in the corpus and are therefore empirically less informative than tokens that occur in a small fraction of the training corpus. We used the following code to create sparse matrices for the training set and the test set from the essay attribute in essays.csv.

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer(min_df=4, max_features=1500)
    tfidf.fit(traindata[:, 5])              # learn the vocabulary from the training essays
    tr = tfidf.transform(traindata[:, 5])
    ts = tfidf.transform(testdata[:, 5])

min_df is a threshold used to remove words that appear in too few documents. We compared ROC values while changing min_df and keeping the other parameters constant. These are the results we got.

    min_df    ROC value
    5         0.69853160409851656
    4         0.69853160409851656

Both values gave the same score, and we settled on min_df = 4.

With max_features set, TfidfVectorizer builds a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. We compared results for different values of max_features.

    max_features    ROC value
    500             0.69799790900465308
    1000            0.69853160409851656
    1500            0.69883818907894368
    2000            0.69864848813741032

So we concluded that 1500 is the best value for max_features.
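The effect of the two parameters can be sketched without scikit-learn. The helper below, `build_vocabulary`, is hypothetical: it drops terms that occur in fewer than `min_df` documents and then keeps the `max_features` terms with the highest total count. This mirrors the documented meaning of the parameters but is not the library's implementation, and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter


def build_vocabulary(documents, min_df=1, max_features=None):
    """Select terms by document frequency, then by total term frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_freq = Counter(term for tokens in tokenized for term in tokens)
    # Keep only terms that appear in at least min_df documents.
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # Order by total count (descending), breaking ties alphabetically.
    kept.sort(key=lambda term: (-term_freq[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


# Made-up essays for illustration only.
docs = [
    "students need books books",
    "students need tablets",
    "students love books",
]
vocab_all = build_vocabulary(docs)
vocab_pruned = build_vocabulary(docs, min_df=2, max_features=2)
```

Here "tablets" and "love" are removed by `min_df=2` because each occurs in only one essay, and `max_features=2` then keeps the two most frequent remaining terms.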