1. The document describes a data mining competition hosted by DonorsChoose.org to identify school donation projects that are exceptionally exciting. It describes the provided data files and the classification algorithms tried, among which logistic regression performed best.
2. Extensive data preprocessing was applied, including feature selection, handling of null values, binning of numeric features, and text feature extraction from the project essays. Cross validation was used to evaluate models during development.
3. Logistic regression, with the training data divided into two parts, performed best, achieving a ROC value of 0.69853 with optimized hyperparameters.
CS4642 - Data Mining & Information Retrieval
Report based on KDD Cup 2014 Submission
Siriwardena M. P. -100512X
Upeksha W. D. -100552T
Wijayarathna D. G. C. D. -100596F
Wijayarathna Y. M. S. N. -100597J
1.0 Introduction
DonorsChoose.org is an online charity that makes it easy to help students in need
through school donations. Teachers in K-12 schools propose projects requesting
materials to enhance the education of their students. When a project reaches its funding
goal, they ship the materials to the school.
The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that
are exceptionally exciting to the business. While all projects on the site fulfill some kind
of need, certain projects have a quality above and beyond what is typical. By identifying
and recommending such projects early, they will improve funding outcomes, better the
user experience, and help more students receive the materials they need to learn.
Participants are provided with six data files containing the details of the projects in the training and test sets. The donations.csv and outcomes.csv files provide data only about the projects in the training set, while the projects.csv, essays.csv and resources.csv files provide information about both the training and test sets. The sampleSubmission.csv file provides a sample of the results that competitors need to calculate and submit for the competition.
- donations.csv - contains information about the donations to each project.
- essays.csv - contains project text posted by the teachers.
- projects.csv - contains information about each project.
- resources.csv - contains information about the resources requested for each project.
- outcomes.csv - contains information about the outcomes of projects in the training set.
- sampleSubmission.csv - contains the project ids of the test set and shows the submission format for the competition.
The data is provided in a relational format and is split by date: any project posted prior to 2014-01-01 is in the training set, and any project posted after that date is in the test set. Some projects in the test set may still be live and are ignored in the scoring; which projects are still live is not disclosed, to avoid leakage about their funding status.
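As a minimal sketch of this date-based split (our illustration; the original report does not show this step, and the pandas usage and file name are assumptions):

import pandas as pd

# Hypothetical loading step; column names follow the file descriptions above.
projects = pd.read_csv('projects.csv', parse_dates=['date_posted'])
train_projects = projects[projects['date_posted'] < '2014-01-01']
test_projects = projects[projects['date_posted'] >= '2014-01-01']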
2.0 Classification Algorithms used
For implementing the above task we used the following classification algorithms, through libraries implemented in Python scikit-learn. By giving the same parameters to each of these algorithms, we calculated which algorithm gives the best results for this problem.
2.1 Stochastic Gradient Descent
In Stochastic Gradient Descent, a linear function $f(x) = w^{T}x + b$ is fitted to the training set $(x_1, y_1), \ldots, (x_n, y_n)$. The target is to find an $f(x)$ that satisfies as many of the $y_i$ in the training set as possible. Gradient descent is used to find the values of the weights $w$ and the intercept $b$ that minimize the training loss.
We tried this method with the SGDClassifier implemented in the Python scikit-learn library. The following is the part of our code that performs the classification using SGD.

from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='modified_huber', penalty='l2')
model.fit(train, outcomes == 't')
preds = model.predict_proba(test)[:, 1]
Here we create an SGDClassifier and fit it to the training data, where train is a sparse feature matrix and outcomes is the array of class labels. We then predict the probability of the class label being 't' for each project in the test set.

When creating the SGDClassifier, we tried different values for loss and penalty, and decided on the most effective combination by observing the resulting ROC curves.
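A minimal sketch of such a loss/penalty sweep scored locally (our illustration, not the report's code; X_train, X_test, y_train and y_test are assumed to come from the local validation split described in Section 3.1):

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

# Only these two loss functions support predict_proba in SGDClassifier.
for loss in ['log', 'modified_huber']:
    for penalty in ['l1', 'l2', 'elasticnet']:
        model = SGDClassifier(loss=loss, penalty=penalty)
        model.fit(X_train, y_train)
        preds = model.predict_proba(X_test)[:, 1]
        print(loss, penalty, roc_auc_score(y_test, preds))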
2.2 Decision Tree
Decision trees are another very popular method for classification. A decision tree is a predictive model that maps observations about an item to conclusions about the item's target value.

We used the decision tree implementation in the scikit-learn library in our project.
from sklearn import tree

# A classifier is needed here, since predict_proba is not available on DecisionTreeRegressor
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train, outcomes == 't')
preds = clf.predict_proba(test)[:, 1]
However, with the resources we had, we could only build decision trees with 4-5 features; we got a Python MemoryError when we tried to use a decision tree with a higher number of features. So we decided against using this method.
2.3 Support Vector Machine
A Support Vector Machine transforms the original training data into a higher dimension and searches for the linear optimal separating hyperplane. For the implementation we again used the Python scikit-learn library.

from sklearn import svm

# probability=True is required for predict_proba to be available
clf = svm.SVC(probability=True)
clf = clf.fit(train, outcomes == 't')
preds = clf.predict_proba(test)[:, 1]
This method took extremely long running times, so we decided against using it as well.
2.4 Logistic Regression
The logistic function is

$$F(t) = \frac{1}{1 + e^{-t}}$$

and this is equivalent to

$$\frac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}$$

If we plot the input value $\beta_0 + \beta_1 x$ against the output value $F(x)$, the input can take any value from negative infinity to positive infinity, whereas the output $F(x)$ is confined to values between 0 and 1 and hence is interpretable as a probability.

If there are multiple explanatory variables, then the expression $\beta_0 + \beta_1 x$ can be revised to $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$.
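As a quick numeric illustration of this squashing property (ours, not the report's):

import numpy as np

def logistic(t):
    # Maps any real-valued input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

print(logistic(-10.0))  # ~0.0000454
print(logistic(0.0))    # 0.5
print(logistic(10.0))   # ~0.9999546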
To keep the model from overfitting the training set, we have to tune the variable C, the inverse of regularization strength; smaller values of C mean stronger regularization, so C is normally a fairly small value. Based on the ROC curve values for small C values, we identified C = 0.35 as the best fit for our model. Here are the ROC curve values for the C values we tested.
C      ROC curve value
0.2    0.697740468829667
0.31   0.69850667254796539
0.35   0.69853160409851656
0.4    0.69845107304459875
In logistic regression, the learned function can overfit the training set, meaning it may describe the training set very well but not the test data. To reduce this effect, we decided to divide the training set into several parts, create a separate LogisticRegression object for each part to obtain several sets of predictions, and then derive the final result by taking their mean. We tried dividing into 2, 3 and 4 parts. Dividing into 2 parts gave better results than using a single training set, and also scored higher than 3 or 4 parts.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=0.35)
# Train on the first half of the training set
model.fit(train.tocsr()[:train.shape[0] // 2], outcomes[:outcomes.size // 2] == 't')
preds1 = model.predict_proba(test)[:, 1]
# Train on the second half
model.fit(train.tocsr()[train.shape[0] // 2:], outcomes[outcomes.size // 2:] == 't')
preds2 = model.predict_proba(test)[:, 1]
# Average the two sets of predictions
preds = (preds1 + preds2) / 2
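A sketch that generalizes this averaging to k parts (our illustration; the submitted version used the two-part split shown above):

import numpy as np
from sklearn.linear_model import LogisticRegression

def averaged_predictions(train, outcomes, test, k=2, C=0.35):
    # Train one LogisticRegression per chunk of the training set
    # and average the predicted probabilities.
    bounds = np.linspace(0, train.shape[0], k + 1).astype(int)
    preds = np.zeros(test.shape[0])
    for i in range(k):
        model = LogisticRegression(C=C)
        model.fit(train.tocsr()[bounds[i]:bounds[i + 1]],
                  outcomes[bounds[i]:bounds[i + 1]] == 't')
        preds += model.predict_proba(test)[:, 1]
    return preds / k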
3.0 Data Preprocessing Techniques Used
3.1 Cross Validation using ROC curve
The evaluation method Kaggle follows is to calculate the area under the ROC curve for the submitted result set against the actual result set they hold. The issue was that we were allowed only 5 submissions per day, so we needed another way to approximate our ROC value locally in order to tune the parameters of the algorithms and the features. We therefore used cross validation: we separated the training set into two parts, 60% as a new training set and 40% as a new test set, applied the training algorithms to the new training set, and validated the results on the new test set using ROC values.

from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Separate the training set into two parts (60% train, 40% test)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    train, outcomes == 't', test_size=0.4, random_state=0)
model = LogisticRegression(C=3.1)  # run the algorithm on the new training set
model.fit(X_train, y_train)
preds = model.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, preds)  # calculate the ROC value
3.2 Selecting features to use
To select features we analyzed the given dataset in several ways. We plotted graphs and examined the relevance of each feature to the final outcome. By analyzing the data we came to know that some features reduce the final outcome while other features significantly increase it. We also derived some data fields from the original data fields, and we broke some data fields into classes. These features are listed below.
1. teacher_acctid
2. schoolid
3. school_city
4. school_state
5. school_metro
6. school_district
7. school_county
8. school_charter
9. school_magnet
10. school_year_round
11. school_nlns
12. school_kipp
13. school_charter_ready_promise
14. teacher_prefix
15. teacher_teach_for_america
16. teacher_ny_teaching_fellow
17. primary_focus_subject
18. primary_focus_area
19. resource_type
20. grade_level
21. eligible_double_your_impact_match
22. eligible_almost_home_match
We used the above features without any changes. We found most of the good features by trial and error; before selecting some features, we graphed them against the percentage of exciting versus non-exciting projects.
3.3 Handling Null values
We noticed that some features in projects.csv have a significant number of null values. Here are the percentages of null values we calculated for each feature:
teacher_acctid : 0.0
schoolid : 0.0
school_ncesid : 6.4351
school_latitude : 0.0
school_longitude : 0.0
school_city : 0.0
school_state : 0.0
school_zip : 0.0006
school_metro : 12.3337
school_district : 0.1427
school_county : 0.0025
school_charter : 0.0
school_magnet : 0.0
school_year_round : 0.0
school_nlns : 0.0
school_kipp : 0.0
school_charter_ready_promise : 0.0
teacher_prefix : 0.00066
teacher_teach_for_america : 0.0
teacher_ny_teaching_fellow : 0.0
primary_focus_subject : 0.0058
primary_focus_area : 0.0058
secondary_focus_subject : 31.3045
secondary_focus_area : 31.3045
resource_type : 0.0067
poverty_level : 0.0
grade_level : 0.0013
fulfillment_labor_materials : 5.2826
total_price_excluding_optional_support : 0.0
total_price_including_optional_support : 0.0
students_reached : 0.02198
eligible_double_your_impact_match : 0.0
eligible_almost_home_match : 0.0
date_posted : 0.0
Some features, like secondary_focus_subject and secondary_focus_area, have more than 30% null values, so we removed them from our training data because they do not give much information about the training set. For the features with less than 10% null values, we filled the nulls with the most common value of that feature in the dataset.
from collections import Counter

for i in range(len(projectCatogorialColumns)):
    # Fill nulls in each categorical column with its most common value
    data = Counter(projects[projectCatogorialColumns[i]])
    projects[projectCatogorialColumns[i]] = \
        projects[projectCatogorialColumns[i]].fillna(data.most_common(1)[0][0])
3.4 Assigning ranges to numerical features
There are some features, such as
total_price_excluding_optional_support
total_price_including_optional_support
students_reached
school_latitude
school_longitude
which contain integer and floating point numerical values. It is highly unlikely that the same exact value appears again and again. For example, there may be only one school in the data set whose latitude is exactly 35.34534, but it is reasonable to treat schools in the range 32 < latitude < 36 as being in the same geographical area. So we defined such ranges for each numerical feature to improve the training data and avoid overfitting. The following functions bin each numerical feature by dividing its values by a fixed bin width:

import math
import numpy as np

def div_fulfillment_labor_materials():
    # max = 35.0, min = 9.0
    arr = []
    nparr = np.array(projects.fulfillment_labor_materials)
    for i in range(len(nparr)):
        arr.append(math.floor(float(nparr[i]) / 2))
    return arr

def div_total_price_excluding_optional_support():
    # max = 10250017.0, min = 0.0
    arr = []
    nparr = np.array(projects.total_price_excluding_optional_support)
    for i in range(len(nparr)):
        arr.append(math.floor(float(nparr[i]) / 50))
    return arr

def div_total_price_including_optional_support():
    # max = 12500020.73, min = 0.0
    arr = []
    nparr = np.array(projects.total_price_including_optional_support)
    for i in range(len(nparr)):
        arr.append(math.floor(float(nparr[i]) / 50))
    return arr

def div_students_reached():
    # max = 999999.0, min = 0.0
    arr = []
    nparr = np.array(projects.students_reached)
    for i in range(len(nparr)):
        arr.append(math.floor(float(nparr[i]) / 80))
    return arr
3.5 Deriving new features from existing features
There are some features in the dataset which do not give much information on their own, but combining such features to derive new ones can give better results. We identified some features that can be improved this way; a short sketch of how the derived features can be computed follows the list.

1. cost
item_unit_price and item_quantity in the resources dataset do not give much information when taken separately. But deriving a new feature such that
cost = item_unit_price * item_quantity
gives the total cost required for each project.
2. month, week
The date_posted feature in the projects data set gives the date on which each project was posted, in yyyy-mm-dd format. Usually this field is only used to separate the training data from the test data. But if we take the month separately and add it as a feature, it improves overall training, because projects posted in some months (like October and November) have a higher probability of getting their requirements accepted. One logical reason is that most companies allocate funds for CSR and other charity work in particular seasons of the year.
3. essay_length
We calculated the length of each essay, on the assumption that a good description of a project should have an acceptable number of words.
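As promised above, here is one possible pandas sketch of these three derived features (resources, projects and essays are assumed to be DataFrames loaded from the corresponding CSV files, and the projectid grouping key is an assumption):

import pandas as pd

# cost: total requested cost per project
resources['cost'] = resources['item_unit_price'] * resources['item_quantity']
cost_per_project = resources.groupby('projectid')['cost'].sum()

# month: extracted from the yyyy-mm-dd date_posted field
projects['month'] = pd.to_datetime(projects['date_posted']).dt.month

# essay_length: word count of each essay
essays['essay_length'] = essays['essay'].fillna('').str.split().str.len()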
3.6 Assigning integer labels for string categorical values
Some features in the data set have categorical values, and some of these values are strings. If we pass such data directly into the training algorithms, building the training model takes a large amount of time and space, because it has to work with string matching and searching, which is very expensive. To avoid this, we assigned a unique integer to each categorical value in the preprocessing stage. Because handling integers is much easier and faster than handling strings, this improves the speed and space requirements of the algorithm.
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer labels
for i in range(0, projects.shape[1]):
    le = LabelEncoder()
    projects[:, i] = le.fit_transform(projects[:, i])
3.7 Categorical variable binarization (one-hot encoding)
Instead of keeping categorical values vertically in a single feature column, we observed that it gives good results to expand them horizontally as new features and assign 1 or 0 to each row. This needs more memory than before because it creates a sparse matrix, but it is computationally very easy for the training algorithm, since it then works only with 1s and 0s.
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
projects = ohe.fit_transform(projects)
3.8 Feature extraction from essays (NLP section)
Whether a project is exciting or not can be significantly affected by the free-text details given in essays.csv, such as the essay and the need statement. So we extracted important features from those details and used them in logistic regression to do the classification.
To extract features from the essays, we used the TfidfVectorizer implemented in scikit-learn's text feature extraction library. Using TfidfVectorizer, we can convert a collection of raw documents to a set of Tf-Idf features.

Tf-Idf, which means term frequency-inverse document frequency, is a numeric statistic that is intended to reflect how important a word is to a document. The Tf-Idf value increases proportionally to the number of times a word appears in a document, but is offset by the frequency of the word's occurrence in the corpus.

TfidfVectorizer in scikit-learn's feature extraction module is equivalent to the combination of a CountVectorizer and a TfidfTransformer. CountVectorizer tokenizes the documents, counts the occurrences of each token, and returns them as a sparse matrix. TfidfTransformer then applies Tf-Idf normalization to this sparse matrix of occurrence counts. Doing this scales down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than tokens that occur in a small fraction of the training corpus.
We used the code shown below to create sparse matrices for the training-set and test-set values of the essay attribute in essays.csv.

min_df is a threshold for removing words that occur in too few documents. We checked the ROC curve results for different min_df values while keeping the other attributes constant. Following are the results we got.

min_df  ROC value
5       0.69853160409851656
4       0.69853160409851656

So we concluded that min_df = 4 is the best value for it.
TfidfVectorizer builds a vocabulary that considers only the top max_features terms, ordered by term frequency across the corpus. We checked the results for different max_features values, and the following are the results.

max_features  ROC value
500           0.69799790900465308
1000          0.69853160409851656
1500          0.69883818907894368
2000          0.69864848813741032

So we concluded that 1500 is the best value for max_features.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=4, max_features=1500)
tfidf.fit(traindata[:, 5])             # fit the vocabulary on the essay column
tr = tfidf.transform(traindata[:, 5])  # training-set Tf-Idf features
ts = tfidf.transform(testdata[:, 5])   # test-set Tf-Idf features
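To use these text features alongside the other project features in logistic regression, the sparse matrices can be stacked horizontally. A minimal sketch of this combination step (our assumption of how the pieces fit together, not code from the report; train, test, tr and ts follow the variables used above):

import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# Hypothetical combination step: append the Tf-Idf text features to the
# one-hot-encoded project features before training.
train_all = sp.hstack([train, tr]).tocsr()
test_all = sp.hstack([test, ts]).tocsr()

model = LogisticRegression(C=0.35)
model.fit(train_all, outcomes == 't')
preds = model.predict_proba(test_all)[:, 1]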