CS4642 - Data Mining & Information Retrieval 
Report based on KDD Cup 2014 Submission 
Siriwardena M. P. -100512X 
Upeksha W. D. -100552T 
Wijayarathna D. G. C. D. -100596F 
Wijayarathna Y. M. S. N. -100597J
1.0 Introduction 
DonorsChoose.org is an online charity that makes it easy to help students in need 
through school donations. Teachers in K-12 schools propose projects requesting 
materials to enhance the education of their students. When a project reaches its funding 
goal, they ship the materials to the school. 
The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that 
are exceptionally exciting to the business. While all projects on the site fulfill some kind 
of need, certain projects have a quality above and beyond what is typical. By identifying 
and recommending such projects early, they will improve funding outcomes, better the 
user experience, and help more students receive the materials they need to learn. 
Participants are provided with six data files containing the details of the projects in the 
training set and the test set. While the donations.csv and outcomes.csv files provide 
data only about the projects in the training set, the projects.csv, essays.csv and 
resources.csv files provide information about both the training set and the test set. 
The sampleSubmission.csv file provides a sample of the results that competitors are 
required to calculate and submit for the competition. 
● donations.csv - contains information about the donations to each project. 
● essays.csv - contains project text posted by the teachers. 
● projects.csv - contains information about each project. 
● resources.csv - contains information about the resources requested for each project. 
● outcomes.csv - contains information about the outcomes of projects in the training set. 
● sampleSubmission.csv - contains the project ids of the test set and shows the 
submission format for the competition. 
The data is provided in a relational format and split by date. Any project posted prior to 
2014-01-01 is in the training set; any project posted after that date is in the test set. Some 
projects in the test set may still be live and are ignored in the scoring. Which projects are 
still live is not disclosed, to avoid leakage regarding their funding status. 
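For reference, the date-based split can be reproduced locally. The following is a minimal 
sketch, assuming the competition CSV files sit in the working directory; projectid, 
date_posted and the is_exciting label are columns of the provided files, while the variable 
names are ours. 

import pandas as pd 

# Load the project metadata and the training labels. 
projects = pd.read_csv('projects.csv', parse_dates=['date_posted']) 
outcomes = pd.read_csv('outcomes.csv') 

# Projects posted before 2014-01-01 form the training set; 
# everything posted later forms the test set. 
train_projects = projects[projects.date_posted < '2014-01-01'] 
test_projects = projects[projects.date_posted >= '2014-01-01'] 

# Attach the 'is_exciting' labels to the training projects. 
train = train_projects.merge(outcomes[['projectid', 'is_exciting']], on='projectid') 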
2.0 Classification Algorithms used 
To implement the above task we used the following classification algorithms, relying on 
the implementations provided in Python's scikit-learn library. By giving the same input 
features to each of these algorithms, we evaluated which algorithm gives the best results 
for this problem.
2.1 Stochastic Gradient Descent 
Stochastic Gradient Descent fits a linear function $f(x) = w^T x + b$ to a training set 
$(x_1, y_1), \ldots, (x_n, y_n)$. The target is to find $f(x)$ so that it predicts as many of the 
labels $y_i$ in the training set correctly as possible. The method uses gradient descent, 
evaluated on randomly chosen training examples, to find the values of the weight vector 
$w$ and the intercept $b$ that minimize the training loss. 
We tried this method with the SGDClassifier implemented in the Python scikit-learn 
library. The following is the part of the code which did the classification using SGD. 
from sklearn.linear_model import SGDClassifier 

# modified_huber loss supports probability estimates via predict_proba. 
model = SGDClassifier(loss='modified_huber', penalty='l2') 
model.fit(train, outcomes == 't') 
preds = model.predict_proba(test)[:, 1] 
Here we create an SGDClassifier and fit it with the training data set, where train is a 
sparse matrix of features and outcomes is the array of class labels. Then we can predict 
the probability of the class label being ‘t’ for each row of the test set. 
When creating the SGDClassifier we tried different values for loss and penalty, and 
decided on the most effective combination by observing the results using the ROC curve. 
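The report does not include the sweep itself; the following is a minimal sketch of how 
such a comparison could be run, reusing the 60/40 local validation described in section 
3.1 (variable names are ours). 

from sklearn.linear_model import SGDClassifier 
from sklearn.cross_validation import train_test_split 
from sklearn.metrics import roc_auc_score 

# Hold out 40% of the training data to score each loss/penalty pair. 
X_tr, X_va, y_tr, y_va = train_test_split(train, outcomes == 't', 
                                          test_size=0.4, random_state=0) 

# Only losses that support predict_proba are candidates here. 
for loss in ['modified_huber', 'log']: 
    for penalty in ['l2', 'l1', 'elasticnet']: 
        model = SGDClassifier(loss=loss, penalty=penalty) 
        model.fit(X_tr, y_tr) 
        preds = model.predict_proba(X_va)[:, 1] 
        print("%s %s %.5f" % (loss, penalty, roc_auc_score(y_va, preds))) 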
2.2 Decision Tree 
The decision tree is also a very popular method for classification: it is a predictive model 
which maps observations about an item to conclusions about the item's target value. 
We used the decision tree implementation from the scikit-learn library in our project. 
from sklearn import tree 

# A classifier (rather than a regressor) is needed for predict_proba. 
clf = tree.DecisionTreeClassifier() 
clf = clf.fit(train, outcomes == 't') 
preds = clf.predict_proba(test)[:, 1] 
But with the resources we had, we could only build decision trees with 4-5 features. We 
got a Python MemoryError when we tried to use a decision tree with a larger number of 
features, so we decided against using this method. 
2.3 Support Vector Machine 
A Support Vector Machine transforms the original training data into a higher dimension 
and searches within it for the optimal linear separating hyperplane. To implement the 
SVM we used the Python scikit-learn library. 
from sklearn import svm 

# probability=True is required for predict_proba, at the cost of extra training time. 
clf = svm.SVC(probability=True) 
clf = clf.fit(train, outcomes == 't') 
preds = clf.predict_proba(test)[:, 1]
This method took extremely long running times, so we decided against using it. 
2.4 Logistic Regression 
The logistic function is 

$$F(t) = \frac{1}{1 + e^{-t}}$$ 

and, substituting $t = \beta_0 + \beta_1 x$ and solving for the odds, this is equivalent to 

$$\frac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}$$ 

If we plot the input value $\beta_0 + \beta_1 x$ against the output value $F(x)$, the input can 
take any value from negative infinity to positive infinity, whereas the output $F(x)$ is 
confined to values between 0 and 1 and hence is interpretable as a probability. 
If there are multiple explanatory variables, the expression $\beta_0 + \beta_1 x$ can be 
extended to $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$. 
To keep the model from overfitting the training set, we have to tune the variable C, the 
inverse of the regularization strength. Normally C has to be a fairly small value. Based on 
the ROC curve values obtained for small values of C, we identified that C = 0.35 fits our 
model best. Here are the ROC curve values for the C values we tested. 
C      ROC curve value 
0.2    0.697740468829667 
0.31   0.69850667254796539 
0.35   0.69853160409851656 
0.4    0.69845107304459875 
In logistic regression the fitted function can overfit the training set, meaning it may 
describe the training set very well but not the test data. To reduce this effect, we decided 
to divide the training set into several parts, fit a separate LogisticRegression model on 
each part to obtain several sets of predictions, and then derive the final result by taking 
their mean. We tried dividing into 2, 3 and 4 parts. Dividing into 2 parts gave better results 
than using the training set as a whole, and also a higher ROC value than 3 or 4 parts. 
from sklearn.linear_model import LogisticRegression 

model = LogisticRegression(C=0.35) 
half = train.shape[0] // 2 
# Fit one model on the first half of the training rows ... 
model.fit(train.tocsr()[:half], outcomes[:outcomes.size // 2] == 't') 
preds1 = model.predict_proba(test)[:, 1] 
# ... and a second model on the remaining rows. 
model.fit(train.tocsr()[half:], outcomes[outcomes.size // 2:] == 't') 
preds2 = model.predict_proba(test)[:, 1] 
# Average the two prediction sets. 
preds = (preds1 + preds2) / 2
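The same split-and-average scheme can be written for an arbitrary number of parts, 
which makes the 2-vs-3-vs-4 comparison above easy to rerun. This is a sketch under our 
own naming, not code from the submission: 

import numpy as np 
from sklearn.linear_model import LogisticRegression 

def split_and_average(train, labels, test, k, C=0.35): 
    # Fit one LogisticRegression per contiguous chunk and average the predictions. 
    train = train.tocsr() 
    bounds = np.linspace(0, train.shape[0], k + 1).astype(int) 
    preds = np.zeros(test.shape[0]) 
    for i in range(k): 
        model = LogisticRegression(C=C) 
        model.fit(train[bounds[i]:bounds[i + 1]], labels[bounds[i]:bounds[i + 1]]) 
        preds += model.predict_proba(test)[:, 1] 
    return preds / k 

preds = split_and_average(train, outcomes == 't', test, k=2) 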
3.0 Data Preprocessing Techniques Used 
3.1 Cross Validation using ROC curve 
The evaluation method Kaggle follows is to calculate the area under the ROC curve for 
the result set we submit, against the actual result set they hold. The issue is that we were 
allowed only 5 submissions per day, so we needed another way to approximate our ROC 
value locally in order to tune the parameters of the algorithms and the features. So we 
used cross validation to test it. We separated the training set into two parts: 60% as a 
new training set and 40% as a new test set. We applied the training algorithms to the new 
training set and validated the results against the new test set using ROC values. 

from sklearn import cross_validation 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import roc_auc_score 

# Separate the training set into two parts (60% train / 40% validation). 
X_train, X_test, y_train, y_test = cross_validation.train_test_split( 
    train, outcomes == 't', test_size=0.4, random_state=0) 

# Run the algorithm on the new training set. 
model = LogisticRegression(C=3.1) 
model.fit(X_train, y_train) 
preds = model.predict_proba(X_test)[:, 1] 

# Calculate the ROC value on the held-out part. 
roc_auc_score(y_test, preds) 
3.2 Selecting features to use 
To select features we analyzed the given dataset in several ways. We plotted graphs and 
assessed the relevance of each feature to the final outcome. By analyzing the data we 
found that some features reduce the probability of a project being exciting, while other 
features significantly increase it. We also derived some data fields from the original data 
fields, and we broke some data fields into classes. The features we used are listed below. 
1. teacher_acctid 
2. schoolid 
3. school_city 
4. school_state 
5. school_metro 
6. school_district 
7. school_county 
8. school_charter 
9. school_magnet 
10. school_year_round 
11. school_nlns 
12. school_kipp 
13. school_charter_ready_promise 
14. teacher_prefix 
15. teacher_teach_for_america 
16. teacher_ny_teaching_fellow 
17. primary_focus_subject 
18. primary_focus_area 
19. resource_type 
20. grade_level 
21. eligible_double_your_impact_match 
22. eligible_almost_home_match 
We used the above features without any changes. We found most of the good features 
by trial and error. Before selecting some of these features we graphed them against the 
percentage of exciting and non-exciting projects. 
3.3 Handling Null values 
We noticed that some features in projects.csv have a significant number of null values. 
Here are the percentages we calculated for each feature in projects: 
teacher_acctid : 0.0 
schoolid : 0.0 
school_ncesid : 6.4351 
school_latitude : 0.0 
school_longitude : 0.0 
school_city : 0.0 
school_state : 0.0 
school_zip : 0.0006 
school_metro : 12.3337 
school_district : 0.1427 
school_county : 0.0025 
school_charter : 0.0 
school_magnet : 0.0 
school_year_round : 0.0 
school_nlns : 0.0 
school_kipp : 0.0 
school_charter_ready_promise : 0.0 
teacher_prefix : 0.00066 
teacher_teach_for_america : 0.0 
teacher_ny_teaching_fellow : 0.0 
primary_focus_subject : 0.0058 
primary_focus_area : 0.0058 
secondary_focus_subject : 31.3045 
secondary_focus_area : 31.3045 
resource_type : 0.0067 
poverty_level : 0.0 
grade_level : 0.0013 
fulfillment_labor_materials : 5.2826 
total_price_excluding_optional_support : 0.0 
total_price_including_optional_support : 0.0 
students_reached : 0.02198 
eligible_double_your_impact_match : 0.0 
eligible_almost_home_match : 0.0 
date_posted : 0.0 
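These percentages can be reproduced directly from the data; a minimal sketch, assuming 
projects is the pandas DataFrame loaded from projects.csv: 

import pandas as pd 

projects = pd.read_csv('projects.csv') 

# Percentage of null (NaN) values in each column of projects.csv. 
null_percentages = projects.isnull().mean() * 100 
print(null_percentages) 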
Some features, such as secondary_focus_subject and secondary_focus_area, have 
more than 30% null values, so we removed them from our training data because they 
don't give much information about the training set. For the features that have <10% null 
values, we filled the null values with the most common value of that feature in the 
dataset. 
from collections import Counter 

# Replace nulls in each categorical column with that column's most common value. 
for i in range(len(projectCatogorialColumns)): 
    data = Counter(projects[projectCatogorialColumns[i]]) 
    projects[projectCatogorialColumns[i]] = \
        projects[projectCatogorialColumns[i]].fillna(data.most_common(1)[0][0]) 
3.4 Assigning ranges to numerical features 
Some features, such as 
total_price_excluding_optional_support 
total_price_including_optional_support 
students_reached 
school_latitude 
school_longitude 
take integer and floating point numerical values, and it is highly unlikely that exactly the 
same value appears again and again. For example, there may be only one school in the 
data set whose latitude is exactly 35.34534, but it is reasonable to treat schools in the 
range 32 < latitude < 36 as being in the same geographical area. So we defined such 
ranges for each numerical feature, to improve the training data and avoid overfitting. The 
following functions bin these features by dividing each value by a fixed range width and 
taking the floor: 

import math 
import numpy as np 

def div_fulfillment_labor_materials(): 
    # max = 35.0, min = 9.0 
    arr = [] 
    nparr = np.array(projects.fulfillment_labor_materials) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 2)) 
    return arr 

def div_total_price_excluding_optional_support(): 
    # max = 10250017.0, min = 0.0 
    arr = [] 
    nparr = np.array(projects.total_price_excluding_optional_support) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 50)) 
    return arr 

def div_total_price_including_optional_support(): 
    # max = 12500020.73, min = 0.0 
    arr = [] 
    nparr = np.array(projects.total_price_including_optional_support) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 50)) 
    return arr 

def div_students_reached(): 
    # max = 999999.0, min = 0.0 
    arr = [] 
    nparr = np.array(projects.students_reached) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 80)) 
    return arr 
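The report does not show the corresponding function for school_latitude and 
school_longitude; following the 32 < latitude < 36 example above, a sketch with an 
assumed bin width of 4 degrees would be: 

def div_school_latitude(): 
    # Bin latitudes into 4-degree bands (the bin width is our assumption), 
    # so values between 32 and 36 fall into the same band. 
    arr = [] 
    nparr = np.array(projects.school_latitude) 
    for i in range(len(nparr)): 
        arr.append(math.floor(float(nparr[i]) / 4)) 
    return arr 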
3.5 Deriving new features from existing features 
Some features in the dataset don't give much information on their own, but combining 
them to derive new features may give better results. We identified some features that can 
be improved in this way. 
1. cost 
item_unit_price and item_quantity in the resources dataset don't give much information 
when taken separately, but deriving a new feature 
cost = item_unit_price * item_quantity 
gives the total cost required for each project. 
2. month, week 
The date_posted feature in the projects data set gives the date on which each project 
was posted, in yyyy-mm-dd format. 
Usually this date is only used to separate the training data from the test data. But if we 
take the month separately and add it as a feature, it improves overall training, because 
projects posted in some months (like October and November) have a higher probability 
of having their requests funded. One logical reason is that most companies allocate funds 
for CSR and other charity work in certain seasons of the year. 
3. essay_length 
We calculated the length of the essay, assuming that a good description of a project 
should have an acceptable number of words. 
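A minimal sketch of these three derivations, assuming pandas DataFrames resources, 
projects and essays loaded from the corresponding CSV files (the column names are as 
described above; the variable names are ours): 

import pandas as pd 

resources = pd.read_csv('resources.csv') 
projects = pd.read_csv('projects.csv', parse_dates=['date_posted']) 
essays = pd.read_csv('essays.csv') 

# 1. cost: unit price times quantity, summed per project. 
resources['cost'] = resources.item_unit_price * resources.item_quantity 
project_cost = resources.groupby('projectid')['cost'].sum() 

# 2. month, week: taken from the posting date. 
projects['month'] = projects.date_posted.dt.month 
projects['week'] = projects.date_posted.dt.week 

# 3. essay_length: number of words in the essay text. 
essays['essay_length'] = essays.essay.fillna('').str.split().str.len() 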
3.6 Assigning integer labels for string categorical values 
Some features in the data set take categorical values, and some of those values are 
strings. If we pass such data directly into the training algorithms, building the model will 
take a large amount of time and space, because the algorithm has to work with string 
matching and searching, which is very expensive. To avoid this, we assigned a unique 
integer to each categorical value in the preprocessing stage. Because handling integers 
is much easier and faster than handling strings, this improves the speed and space 
requirements of the algorithm. 
from sklearn.preprocessing import LabelEncoder 

# Encode each column's categorical values as integers 
# (projects is assumed to be a 2-D array of the selected columns at this point). 
for i in range(0, projects.shape[1]): 
    le = LabelEncoder() 
    projects[:, i] = le.fit_transform(projects[:, i])
3.7 Categorical variable binarization (one-hot encoding) 
Instead of keeping the categorical values vertically in single features, we observed that it 
gives good results if we expand each of them horizontally into new binary features and 
assign 1 or 0 to each row. This needs more memory than before, because it creates a 
sparse matrix, but it is computationally very easy for the training algorithm, since it works 
only with 1s and 0s. 
from sklearn.preprocessing import OneHotEncoder 

# Expand each integer-encoded categorical column into binary indicator columns. 
ohe = OneHotEncoder() 
projects = ohe.fit_transform(projects) 
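As a toy illustration (ours, not from the report), a single column holding three category 
codes expands into three indicator columns: 

import numpy as np 
from sklearn.preprocessing import OneHotEncoder 

codes = np.array([[0], [1], [2], [1]])          # one integer-encoded column 
encoded = OneHotEncoder().fit_transform(codes)  # sparse matrix of shape (4, 3) 
print(encoded.toarray()) 
# [[1. 0. 0.] 
#  [0. 1. 0.] 
#  [0. 0. 1.] 
#  [0. 1. 0.]] 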
3.8 Feature extraction from essays (NLP section) 
Whether a project is exciting or not can be significantly influenced by the free-text details 
given in the essays.csv file, such as the essay and the need statement. So we extracted 
important features from those details and used them in logistic regression to do the 
classification. 
To extract features from the essays, we used TfidfVectorizer, which is implemented in the 
scikit-learn text feature extraction library. Using TfidfVectorizer we can convert a 
collection of raw documents to a set of tf-idf features. 
Tf-idf, which stands for term frequency-inverse document frequency, is a numeric statistic 
that is intended to reflect how important a word is to a document. The tf-idf value 
increases proportionally with the number of times a word appears in a document, but is 
offset by the frequency of the word's occurrence across the corpus. 
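In the standard formulation (scikit-learn's implementation additionally smooths the idf 
term and normalizes each document vector), the statistic for a term $t$ in a document 
$d$, over a corpus of $N$ documents, is 

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$ 

where $\mathrm{tf}(t, d)$ is the number of times $t$ occurs in $d$ and $\mathrm{df}(t)$ is 
the number of documents that contain $t$. 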
TfidfVectorizer in sklearn.feature_extraction.text is equivalent to the combination of a 
CountVectorizer and a TfidfTransformer. 
CountVectorizer tokenizes the documents, counts the occurrences of each token, and 
returns the counts as a sparse matrix. 
TfidfTransformer applies tf-idf normalization to the sparse matrix of occurrence counts. 
This scales down the impact of tokens that occur very frequently in a given corpus and 
that are hence empirically less informative than features that occur in a small fraction of 
the training corpus. 
We used the following code, which creates sparse matrices for the training set and the 
test set from the essay attribute (column 5 of our data arrays) of essays.csv. 

from sklearn.feature_extraction.text import TfidfVectorizer 

# Fit the vectorizer on the training essays and transform both 
# training and test essays into tf-idf feature matrices. 
tfidf = TfidfVectorizer(min_df=4, max_features=1500) 
tfidf.fit(traindata[:, 5]) 
tr = tfidf.transform(traindata[:, 5]) 
ts = tfidf.transform(testdata[:, 5]) 
min_df is a threshold: the vectorizer ignores terms that appear in fewer than min_df 
documents. We checked the results using the ROC curve by changing min_df while 
keeping the other attributes constant. The following are the results we got. 
5 - 0.69853160409851656 
4 - 0.69853160409851656 
The two values gave identical scores, so we kept min_df = 4. 
When max_features is set, TfidfVectorizer builds a vocabulary that considers only the top 
max_features terms, ordered by term frequency across the corpus. We checked the 
results for different values of max_features; the following are the results. 
500 - 0.69799790900465308 
1000 - 0.69853160409851656 
1500 - 0.69883818907894368 
2000 - 0.69864848813741032 
So we concluded that 1500 is the best value for max_features. 
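A sketch of how such a sweep can be scored with the local validation from section 3.1; 
for brevity it scores the essay features alone, whereas the submission combined them 
with the other features (variable names are ours): 

from sklearn.cross_validation import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import roc_auc_score 

for max_features in [500, 1000, 1500, 2000]: 
    tfidf = TfidfVectorizer(min_df=4, max_features=max_features) 
    features = tfidf.fit_transform(traindata[:, 5])  # essay column only 
    X_tr, X_va, y_tr, y_va = train_test_split(features, outcomes == 't', 
                                              test_size=0.4, random_state=0) 
    model = LogisticRegression(C=0.35) 
    model.fit(X_tr, y_tr) 
    preds = model.predict_proba(X_va)[:, 1] 
    print("%d - %.17f" % (max_features, roc_auc_score(y_va, preds))) 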