CS4642 - Data Mining & Information
Retrieval
Paper Based on KDDCup 2014 Submission
Group Members:
100227D - Jayaweera W.J.A.I.U.
100470N - Sajeewa G.K.M.C
100476M - Sampath P.L.B.
100612E - Wijewardane M.M.D.T.K.
Group Number : 13
Final Group Rank : 76
Description of Data
In this competition, five data files are available to competitors: donations (information about the donations to each project; provided only for projects in the training set), essays (project text posted by the teachers; provided for both the training and test sets), projects (information about each project; provided for both sets), resources (information about the resources requested for each project; provided for both sets) and outcomes (information about the outcomes of projects in the training set). Before starting the knowledge discovery process, we analyzed the provided data.
First of all, the number of records in each file was counted to get an idea of the amount of data available. The projects file has 664098 records, the essays file 664098 records, the outcomes file 619326 records, the resources file 3667217 records and the donations file 3097989 records. Our next task was to identify the criterion used to separate test data from training data. After reading the competition details we realized that projects posted on or after 2014-01-01 belong to the test set and projects posted before 2014-01-01 belong to the training set. Accordingly, 619326 projects are available for the training set and the remaining 44772 projects form the test set. For each project in the training set, the project description, essay data, resources requested, donations provided and outcome are given. For each project in the test set, only the project description, essay data and resources requested are given.
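As a sketch, the date-based split described above could be expressed in Pandas as follows. The toy rows here are illustrative stand-ins for the real 664098-row projects file; only the `projectid` and `date_posted` column names follow the competition schema.

```python
import pandas as pd

# A hypothetical miniature of the projects file (real file: 664098 rows).
projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3", "p4"],
    "date_posted": ["2013-05-01", "2013-12-31", "2014-01-01", "2014-03-15"],
})
projects["date_posted"] = pd.to_datetime(projects["date_posted"])

# Projects posted on/after 2014-01-01 form the test set; the rest are training.
cutoff = pd.Timestamp("2014-01-01")
train = projects[projects["date_posted"] < cutoff]
test = projects[projects["date_posted"] >= cutoff]
```

On the full data this split yields the 619326/44772 record counts mentioned above.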
Data Imbalance Problem
After gaining a brief understanding of the data provided, we started to analyze the training set. When we drew a graph of the projects' posted dates against the "is_exciting" attribute, we realized that there are no exciting projects before April 2010; the graph was completely skewed to the right side.
This leads to a data imbalance problem, as the number of exciting projects is very small compared to the number of non-exciting projects (exciting - 5.9274%). The histogram of exciting and non-exciting projects was as follows.
In the competition forum there was an explanation for this. It said that the organization might not have kept track of some of the requirements needed to decide 'is_exciting' for projects before 2010. We therefore suspected that the classifications given in the outcomes file for pre-2010 projects may not be correct, and we decided to use a down-sampling technique to handle the imbalanced data (removing projects posted before 2010). It is true that valuable information may be lost when projects are removed, but the accuracy gained by removing that data outweighed the loss of information. All the classifiers we used performed better after removing projects posted before 2010.
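The down-sampling step amounts to a single date filter on the labelled training frame. A minimal sketch, assuming the real `date_posted` and `is_exciting` column names but toy rows:

```python
import pandas as pd

# Hypothetical miniature of the labelled training data
# (real set: 619326 rows, ~5.93% exciting).
train = pd.DataFrame({
    "projectid": ["a", "b", "c", "d"],
    "date_posted": pd.to_datetime(
        ["2008-06-01", "2009-11-20", "2011-02-01", "2013-09-01"]),
    "is_exciting": ["f", "f", "t", "f"],
})

# Drop projects posted before 2010, whose labels may be unreliable.
train = train[train["date_posted"] >= pd.Timestamp("2010-01-01")]

exciting_pct = (train["is_exciting"] == "t").mean() * 100
```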
Preprocessing Data
First, we analyzed the characteristics of the data using statistical measurements. Using the data frame describe method we calculated the number of records, mean, standard deviation, minimum, maximum and quartile values for each attribute. Given below is the statistical summary of two attributes.
These statistical measurements gave us an idea of the distribution of each attribute.
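The summary step above is a single `describe` call. A small sketch on one hypothetical cost column (the column name follows the competition schema; the values are made up):

```python
import pandas as pd

# Toy cost column; describe() yields count, mean, std, min, quartiles, max.
df = pd.DataFrame({
    "total_price_excluding_optional_support": [100.0, 250.0, 400.0, 5000.0],
})
summary = df.describe()
```

The large gap between the mean (1437.5) and the median (325.0) in this toy column is exactly the kind of skew that motivated the outlier analysis below.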
Filling Missing Values
Initially we used the pad method (propagate the last valid observation forward) to fill missing values for all attributes. But we realized that we could achieve higher accuracy by selecting a filling method based on the type of each attribute. To do that, we first calculated the percentage of missing values per attribute. It was as follows.
The highest percentages of missing values were for the secondary focus subject and secondary focus area, because some projects have only a primary focus area and subject. We decided to fill missing secondary values with their respective primary values. We used linear interpolation for numeric attributes and the pad method for the others. Later, when we tuned the classifiers, we changed the method from pad to backfill (use the next valid observation) as it obtained higher accuracy than pad.
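The three filling strategies above can be sketched in Pandas as follows. The column names mirror the competition schema; the rows are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "primary_focus_subject": ["Literacy", "Math", "Music"],
    "secondary_focus_subject": [None, "Math", None],
    "students_reached": [30.0, np.nan, 90.0],
    "grade_level": ["Grades 3-5", None, "Grades 6-8"],
})

# Missing secondary subject falls back to the primary subject.
df["secondary_focus_subject"] = df["secondary_focus_subject"].fillna(
    df["primary_focus_subject"])

# Numeric gaps: linear interpolation.
df["students_reached"] = df["students_reached"].interpolate()

# Remaining categoricals: backfill (next valid observation).
df["grade_level"] = df["grade_level"].bfill()
```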
Removing Outliers
When we analyzed the data, we detected outliers in some of the attributes using scatter plots. There were outliers in the cost-related attributes, and we replaced them with the mean value of the attribute. Given below is the outlier analysis of the cost attribute.
The red-circled value can be considered an outlier, as it is far larger than the other values. These outliers caused problems when we discretized the data. To identify outliers in resources, we used the inter-quartile range as a measurement.
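The IQR fence can be sketched as below. One small variant from the text: here the replacement mean is computed over the non-outlier values only, which avoids the outlier dragging its own replacement upward; the values are illustrative.

```python
import pandas as pd

cost = pd.Series([100.0, 120.0, 110.0, 130.0, 10000.0])

# Inter-quartile-range fence: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = cost.quantile(0.25), cost.quantile(0.75)
iqr = q3 - q1
is_outlier = (cost < q1 - 1.5 * iqr) | (cost > q3 + 1.5 * iqr)

# Replace outliers with the mean of the remaining values.
cost.loc[is_outlier] = cost.loc[~is_outlier].mean()
```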
Label Encoding
We did not use all the attributes for prediction. We focused on repetitive features, as they help the classifier more when making predictions. Most of these repetitive features have string values rather than numerical values, and the available classifiers do not accept string-valued features. So we used a label encoder to transform the string values into integers between 0 and n-1, n being the number of different values the feature can take.
But classifiers expect continuous input and may interpret the encoded categories as ordered, which is not desired. To turn the categorical features into features usable with scikit-learn classifiers, we used one-hot encoding. The encoder transforms each categorical feature with k possible values into k binary features, only one of which is active for a particular sample. This improved classifier performance considerably: for example, the SGD classifier obtained an ROC score of about 0.55 without one-hot encoding and about 0.59 with it.
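The two-stage encoding reads roughly as follows with scikit-learn's `LabelEncoder` and `OneHotEncoder` (the subject values here are made up):

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

subjects = ["Math", "Literacy", "Music", "Math"]

# Stage 1: strings -> integers 0..n-1 (classes are sorted alphabetically).
le = LabelEncoder()
codes = le.fit_transform(subjects)

# Stage 2: each integer category -> k binary indicator columns.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
```

Each row of `onehot` has exactly one active column, so no spurious ordering between categories remains.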
Continuous Value Discretization
Project attributes such as school longitude, school latitude, zip code and total cost cannot be used directly for prediction, as they are unlikely to be repetitive. But this information cannot simply be discarded, as it may help the classifiers make decisions. To make these attributes more repetitive, we used discretization: we put the continuous values into bins and used the bin index as the attribute value. For example, we discretized longitude and latitude, dividing projects into five regions (bins), and used the region id instead of the raw coordinates. The discretization results for the total cost attribute are as follows.
We applied the same approach to the cost-related attributes, item count per project, total price of items per project, number of projects per teacher, etc.
This improved the repetitiveness of the attributes considerably, and more useful information was uncovered for the classifier to use.
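One way to realize this binning is Pandas' `cut`, which assigns each continuous value to one of a fixed number of equal-width bins and can return the bin index directly (the cost values below are illustrative):

```python
import pandas as pd

total_cost = pd.Series([95.0, 310.0, 480.0, 1200.0, 2600.0])

# Five equal-width bins; labels=False returns the bin index, which then
# replaces the raw value as the attribute.
bins = pd.cut(total_cost, bins=5, labels=False)
```

Quantile-based bins (`pd.qcut`) are an alternative when the values are heavily skewed, as cost attributes often are.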
Attribute Construction
Some of the features given in the data files cannot be used directly for various reasons (most of the time they are highly non-repetitive). We used some of these features to construct new features, by combining multiple features or transforming one into another. Given below is the list of derived attributes.
1. Month - the posted date of each project was given but is not repetitive. We derived a month attribute from the posted date and used it for prediction.
2. Essay length - for each project the corresponding essay was given, but it cannot be used directly for prediction. We therefore calculated the length of each essay after removing extra spaces within the essay text and used it as an attribute.
3. Need statement length
4. Projects per teacher - we calculated the number of projects per teacher by grouping the projects by 'teacher_acctid' and used it as an attribute.
5. Total items per project - we calculated the total number of items requested for each project from the details in the resources file and used it as an attribute.
6. Cost of total items per project - we calculated the total cost of the items requested for each project from the details in the resources file and used it as an attribute.
Several other derived attributes, such as date and short description length, were considered but did not yield a significant performance improvement.
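Derived attributes 1, 4, 5 and 6 above can be sketched with Pandas group-bys and a merge. The `teacher_acctid`, `item_quantity` and `item_unit_price` column names follow the competition schema; the rows are toy data:

```python
import pandas as pd

projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "teacher_acctid": ["t1", "t1", "t2"],
    "date_posted": pd.to_datetime(["2013-03-05", "2013-07-19", "2013-07-02"]),
})
resources = pd.DataFrame({
    "projectid": ["p1", "p1", "p2", "p3"],
    "item_quantity": [2, 1, 5, 3],
    "item_unit_price": [10.0, 40.0, 2.0, 7.0],
})

# 1. Month of posting.
projects["month"] = projects["date_posted"].dt.month

# 4. Projects per teacher.
projects["projects_per_teacher"] = (
    projects.groupby("teacher_acctid")["projectid"].transform("count"))

# 5. and 6. Item count and total item cost per project, merged back in.
resources["line_cost"] = resources["item_quantity"] * resources["item_unit_price"]
per_project = resources.groupby("projectid").agg(
    total_items=("item_quantity", "sum"),
    total_item_cost=("line_cost", "sum"),
).reset_index()
projects = projects.merge(per_project, on="projectid", how="left")
```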
Model Selection and Evaluation
We used three classifiers during the project: first a decision tree classifier, then logistic regression, and finally an SGD (stochastic gradient descent) classifier. We started with the tree classifier as it was easy to use. To evaluate classifier performance we initially used cross-validation, but we later realized that the competition uses the ROC (area under the curve) score for evaluation, so we used ROC scores as well. As we had several choices of classifier, we read several articles about their usage. From them we learned that decision trees normally do not perform well when there is a data imbalance problem, and that logistic regression is often used instead.
Logistic regression performed well on the given data, achieving an ROC score of about 0.61. To improve accuracy further we used the SGD classifier (logistic regression with SGD training). On one hand it is more efficient than plain logistic regression, so predictions can be made in less time; on the other hand it achieved higher accuracy than the regression classifier. With default parameters the SGD classifier achieved an ROC score of about 0.635. To tune the SGD classifier (to find the best parameter values) we performed a grid search and found optimum values for the number of iterations, penalty, shuffle and alpha parameters. Using those values we were able to improve the ROC score to 0.64.
Ensemble Methods
We tried a boosting algorithm to improve classifier performance, using the AdaBoost method (AdaBoostClassifier). The implementation provided by the scikit-learn library supported only the decision tree and SGD classifiers, so we could not use logistic regression directly. Instead we used the SGD classifier with boosting, but accuracy increased only by an insignificant amount.
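A minimal sketch of the boosting step follows. It uses `AdaBoostClassifier` with its default decision-stump base learner rather than the SGD base learner the team tried, because the base-learner parameter name has changed across scikit-learn versions; the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

# Synthetic two-class problem.
rng = np.random.RandomState(0)
X = rng.randn(300, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# AdaBoost: reweight samples each round so later stumps focus on mistakes.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)
auc = roc_auc_score(y, boost.predict_proba(X)[:, 1])
```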
Further Improvements
The essays file contains a huge amount of text, but apart from essay length it was not used for prediction. We tried to extract essay features using TfidfVectorizer, but this was not successful due to memory constraints. As an alternative we tried hashing methods, but they reduced accuracy. We think classifier accuracy may improve further if features from the essay data are included in the training data. Broader use of ensemble methods could also improve the accuracy of predictions.
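The hashing approach mentioned above can be sketched with scikit-learn's `HashingVectorizer`, which maps tokens into a fixed-size feature space without keeping a vocabulary in memory; the essay texts here are invented:

```python
from sklearn.feature_extraction.text import HashingVectorizer

essays = [
    "Students in my classroom need books to practice reading.",
    "My students need a projector for science lessons.",
]

# A fixed 1024-dimensional hashed space; no vocabulary is stored,
# which avoids the memory blow-up TfidfVectorizer hit on the full file.
vec = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
X = vec.transform(essays)
```

The trade-off is that hash collisions merge unrelated tokens and the mapping is not invertible, which may explain the accuracy drop the team observed.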
Support Libraries Used
We used the ‘Pandas’ data analysis library to generate data frames from the provided comma-separated values files, which could then be used with the other data analysis and modeling tools. We also used ‘Pandas’ functions to generate bins for discretizing the attributes with less repetitive values and to merge data frames from several data sources.
We then used the ‘NumPy’ extension library to generate multidimensional arrays from ‘Pandas’ data frames and series, making it easy to access ranges of data (e.g. separating the training set indices from the test set) and to locate properties of the data such as the median and quartiles. ‘NumPy’ functions were also useful when combining derived attributes with existing ones.
‘Scikit-learn’ was the machine learning library we used to integrate data analysis, preprocessing, classification, regression and modeling tools into our implementation. From the tools provided with ‘Scikit-learn’ we used preprocessing tools such as ‘LabelEncoder’, ‘OneHotEncoder’ and ‘StandardScaler’, text feature extraction tools, classification tools such as ‘DecisionTreeClassifier’, ‘SGDClassifier’ and ‘LogisticRegression’, model selection and evaluation tools such as ‘GridSearchCV’, ensemble tools such as ‘AdaBoostClassifier’, and metrics such as ‘roc_auc_score’ to compute the area under the curve (AUC) from prediction scores, as mentioned above.

More Related Content

What's hot

Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)
Andrea Gazzarini
 
Optimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database DesignOptimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database Design
Waqas Tariq
 
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
IRJET Journal
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Alessandro Benedetti
 
Cloud workload analysis and simulation
Cloud workload analysis and simulationCloud workload analysis and simulation
Cloud workload analysis and simulation
Prabhakar Ganesamurthy
 
Review Mining of Products of Amazon.com
Review Mining of Products of Amazon.comReview Mining of Products of Amazon.com
Review Mining of Products of Amazon.com
Shobhit Monga
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Data Works MD
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
IRJET Journal
 

What's hot (8)

Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)
 
Optimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database DesignOptimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database Design
 
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
Mining Query Log to Suggest Competitive Keyphrases for Sponsored Search Via I...
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
 
Cloud workload analysis and simulation
Cloud workload analysis and simulationCloud workload analysis and simulation
Cloud workload analysis and simulation
 
Review Mining of Products of Amazon.com
Review Mining of Products of Amazon.comReview Mining of Products of Amazon.com
Review Mining of Products of Amazon.com
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 

Viewers also liked

L’energia i la seva transformació
L’energia i la seva transformacióL’energia i la seva transformació
L’energia i la seva transformaciódominguezvalles
 
2008 election in mongolia
2008 election in mongolia2008 election in mongolia
2008 election in mongolia
Munkhnaran Avirmed
 
Water deficit
Water deficitWater deficit
Water deficit
Chersia
 
1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan
Pet Meds
 
Water deficit
Water deficitWater deficit
Water deficit
Chersia
 
Stream connectors
Stream connectorsStream connectors
Stream connectors
Chamath Sajeewa
 
Corporate laws
Corporate lawsCorporate laws
Corporate laws
charmingattraction
 

Viewers also liked (8)

L’energia i la seva transformació
L’energia i la seva transformacióL’energia i la seva transformació
L’energia i la seva transformació
 
2008 election in mongolia
2008 election in mongolia2008 election in mongolia
2008 election in mongolia
 
QualitySign_IVinogradova_v5
QualitySign_IVinogradova_v5QualitySign_IVinogradova_v5
QualitySign_IVinogradova_v5
 
Water deficit
Water deficitWater deficit
Water deficit
 
1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan1-800-PetMeds September 2011 Business Plan
1-800-PetMeds September 2011 Business Plan
 
Water deficit
Water deficitWater deficit
Water deficit
 
Stream connectors
Stream connectorsStream connectors
Stream connectors
 
Corporate laws
Corporate lawsCorporate laws
Corporate laws
 

Similar to Group13 kdd cup_report_submitted

KDD Cup Research Paper
KDD Cup Research PaperKDD Cup Research Paper
KDD Cup Research Paper
Tharindu Ranasinghe
 
Summary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_DataSummary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_Data
Madeleine Organ
 
Big data project
Big data projectBig data project
Big data project
Kedar Kumar
 
Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1
Arpita Majumder
 
cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02
PRIYANKA MEHTA
 
BATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptxBATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptx
SurajRavi16
 
Unit 5
Unit   5Unit   5
Research proposal
Research proposalResearch proposal
Research proposal
Sadia Sharmin
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And Integrity
Gerrit Klaschke, CSM
 
Hyatt Hotel Group Project
Hyatt Hotel Group ProjectHyatt Hotel Group Project
Hyatt Hotel Group Project
Erik Bebernes
 
50120130406007
5012013040600750120130406007
50120130406007
IAEME Publication
 
CompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCICompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCI
Soham Kulkarni
 
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache SparkAttribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
IRJET Journal
 
13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf
andreyhapantenda
 
PM3 ARTICALS
PM3 ARTICALSPM3 ARTICALS
PM3 ARTICALS
ra na
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Sease
 
Automated Essay Grading using Features Selection
Automated Essay Grading using Features SelectionAutomated Essay Grading using Features Selection
Automated Essay Grading using Features Selection
IRJET Journal
 
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Chaudhry Hussain
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
sumitkumar600840
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdf
Danilo Cardona
 

Similar to Group13 kdd cup_report_submitted (20)

KDD Cup Research Paper
KDD Cup Research PaperKDD Cup Research Paper
KDD Cup Research Paper
 
Summary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_DataSummary_Classification_Algorithms_Student_Data
Summary_Classification_Algorithms_Student_Data
 
Big data project
Big data projectBig data project
Big data project
 
Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1Predictive modeling Paper-Team8 V0.1
Predictive modeling Paper-Team8 V0.1
 
cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02cloudworkloadanalysisandsimulation-140521153543-phpapp02
cloudworkloadanalysisandsimulation-140521153543-phpapp02
 
BATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptxBATCH 1 FIRST REVIEW-1.pptx
BATCH 1 FIRST REVIEW-1.pptx
 
Unit 5
Unit   5Unit   5
Unit 5
 
Research proposal
Research proposalResearch proposal
Research proposal
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And Integrity
 
Hyatt Hotel Group Project
Hyatt Hotel Group ProjectHyatt Hotel Group Project
Hyatt Hotel Group Project
 
50120130406007
5012013040600750120130406007
50120130406007
 
CompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCICompSci: 221 Winter 2017 Search Engine for UCI
CompSci: 221 Winter 2017 Search Engine for UCI
 
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache SparkAttribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
 
13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf13_Data Preprocessing in Python.pptx (1).pdf
13_Data Preprocessing in Python.pptx (1).pdf
 
PM3 ARTICALS
PM3 ARTICALSPM3 ARTICALS
PM3 ARTICALS
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Automated Essay Grading using Features Selection
Automated Essay Grading using Features SelectionAutomated Essay Grading using Features Selection
Automated Essay Grading using Features Selection
 
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
Final Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal Defence.pptxFinal...
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
data-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdfdata-science-lifecycle-ebook.pdf
data-science-lifecycle-ebook.pdf
 

Recently uploaded

Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
tzu5xla
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 

Recently uploaded (20)

Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理 原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
原版一比一爱尔兰都柏林大学毕业证(UCD毕业证书)如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 

Group13 kdd cup_report_submitted

  • 1. CS4642 - Data Mining & Information Retrieval Paper Based on KDDCup 2014 Submission Group Members: 100227D - Jayaweera W.J.A.I.U. 100470N - Sajeewa G.K.M.C 100476M - Sampath P.L.B. 100612E - Wijewardane M.M.D.T.K. Group Number : 13 Final Group Rank : 76
  • 2. Description of Data
In this competition, five data files are available to competitors: donations (information about the donations to each project; provided only for projects in the training set), essays (project text posted by the teachers; provided for both the training and test sets), projects (information about each project; provided for both sets), resources (information about the resources requested for each project; provided for both sets) and outcomes (information about the outcomes of projects in the training set). Before starting the knowledge discovery process, the provided data were analyzed.
First, the number of records in each file was counted to get an idea of the amount of data available. The projects file has 664098 records, the essays file has 664098, the outcomes file has 619326, the resources file has 3667217 and the donations file has 3097989. Our next task was to identify the criterion used to separate test data from training data. After reading the competition details we realized that projects posted on or after 2014-01-01 belong to the test set and projects posted before 2014-01-01 belong to the training set. Accordingly, 619326 projects are available for the training set and the remaining 44772 projects form the test set. For each project in the training set, the project description, essay data, resources requested, donations received and outcome are given. For each project in the test set, only the project description, essay data and resources requested are given.
Data Imbalanced Problem
After gaining a brief understanding of the data provided, we started to analyze the training set. When we plotted the projects' posted dates against the "is_exciting" attribute, we realized that there are no exciting projects before April 2010; the graph was completely skewed to the right side.
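The train/test split by posted date can be sketched as follows. This is a minimal illustration on a tiny synthetic frame standing in for the real projects file; the column names follow the competition data but the rows are made up.

```python
import pandas as pd

# Synthetic stand-in for projects.csv (rows are illustrative only).
projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3", "p4"],
    "date_posted": ["2013-05-01", "2013-12-31", "2014-01-01", "2014-03-15"],
})
projects["date_posted"] = pd.to_datetime(projects["date_posted"])

# Projects posted before 2014-01-01 form the training set; the rest are test.
cutoff = pd.Timestamp("2014-01-01")
train = projects[projects["date_posted"] < cutoff]
test = projects[projects["date_posted"] >= cutoff]

print(len(train), len(test))
```

On the real files this same filter yields the 619326 / 44772 split mentioned above.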
This leads to a class imbalance problem, as the number of exciting projects is very small compared to the number of non-exciting projects (exciting: 5.9274%). The histogram of exciting and non-exciting projects was as follows.
  • 3. In the competition forum there was an explanation for this problem: the organization might not have kept track of some of the requirements needed to decide 'is_exciting' for projects before 2010. We therefore suspected that the classifications given in the outcomes file before 2010 may not be correct, and we decided to use a down-sampling technique to handle the imbalanced data (removing projects posted before 2010). It is true that valuable information may be lost when projects are removed, but the accuracy gained by removing that data outweighed the loss of information, and we were able to obtain higher accuracy by down-sampling the given data. All the classifiers we used performed better after removing projects posted before 2010.
Preprocessing Data
First we analyzed the characteristics of the mining data using statistical measurements. Using the data frame describe method we calculated the number of records, mean, standard deviation, minimum value, maximum value and the quartile values for each attribute. Given below is the statistical measurement of two attributes. These measurements gave us an idea of the distribution of each attribute.
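The down-sampling step amounts to one more date filter on the training frame. A minimal sketch, again on made-up rows with illustrative column names:

```python
import pandas as pd

# Synthetic training rows; 'is_exciting' labels before 2010 are suspect.
train = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "date_posted": pd.to_datetime(["2008-06-01", "2011-02-10", "2013-09-05"]),
    "is_exciting": ["f", "t", "f"],
})

# Drop projects posted before 2010 to remove the unreliable labels.
train = train[train["date_posted"] >= pd.Timestamp("2010-01-01")]

print(train["projectid"].tolist())
```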
  • 4. Filling Missing Values
Initially we used the pad method (propagate the last valid observation forward) to fill missing values of all the attributes. But we realized that we could achieve higher accuracy by selecting a filling method based on the type of the attribute. To do that, we first calculated the percentage of missing values for each attribute, which was as follows. The highest percentages of missing values were for the secondary focus subject and secondary focus area; this is because some projects have only a primary focus area and primary focus subject. We decided to fill missing secondary values with their respective primary values. We also used linear interpolation for numeric values, and the pad method for the other attributes. Later, when tuning the classifiers, we changed the method from pad to backfill (use the next valid observation) as it obtained higher accuracy than pad.
Remove Outliers
When we analyzed the data, outliers were detected in some of the attributes. We used scatter plots to identify them. There were outliers in the cost-related attributes, and we replaced them with the mean value of the corresponding attribute. Given below is the outlier analysis of the cost attribute.
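The per-type filling strategy above can be sketched with pandas. The frame and its values are synthetic; the column names imitate the competition's projects data.

```python
import numpy as np
import pandas as pd

# Synthetic rows with gaps in each kind of column.
df = pd.DataFrame({
    "primary_focus_subject": ["Math", "Science", "Music"],
    "secondary_focus_subject": ["Art", np.nan, np.nan],
    "total_price": [100.0, np.nan, 300.0],
    "poverty_level": ["high", np.nan, "low"],
})

# Missing secondary focus values fall back to the primary focus.
df["secondary_focus_subject"] = df["secondary_focus_subject"].fillna(
    df["primary_focus_subject"])

# Numeric attributes: linear interpolation between neighbours.
df["total_price"] = df["total_price"].interpolate(method="linear")

# Remaining attributes: backfill (next valid observation), which tuned
# better than pad in our experiments.
df["poverty_level"] = df["poverty_level"].bfill()
```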
  • 5. The red-circled value can be considered an outlier, as it is far larger than the other values. These outliers caused a lot of problems when we discretized the data. To identify outliers in resources, we used the interquartile range as a measurement.
Label Encoding
We did not use all the attributes for prediction. We focused on repetitive features, as they help the classifier more when making predictions. Most of these repetitive features/attributes have string values rather than numerical values, and the available classifiers do not accept string values for features. So we used a label encoder to transform those string values into integer values between 0 and n-1, n being the number of different values a feature can take. But classifiers expect continuous input and may interpret the categories as being ordered, which is not desired. To turn the categorical features into features usable with scikit-learn classifiers, we used one-hot encoding. The encoder transformed each categorical feature with k possible values into k binary features, with only one active per sample. This improved the performance of the classifiers to a great extent. For example, the SGD classifier obtained about a 0.55 ROC score without one-hot encoding, and about 0.59 with it.
Continuous Value Discretization
Project attributes such as school longitude, school latitude, zip code and total cost cannot be used directly for prediction, as they are unlikely to be repetitive. But this information cannot be discarded, as it may help the classifiers make decisions. To make these attributes more repetitive we used discretization: we put the continuous values into bins and used the bin index as the attribute. For example, we discretized longitude and latitude, dividing projects into five regions (bins), and used the region id instead of the raw longitude and latitude. The discretization results for the total cost attribute were as follows.
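The encoding and binning steps can be sketched together. The categorical and cost values below are invented stand-ins; the bin count is illustrative, not the tuned value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    "school_state": ["CA", "NY", "CA", "TX"],
    "total_price": [120.0, 540.0, 90.0, 1500.0],
})

# LabelEncoder maps each string category to an integer in [0, n-1] ...
le = LabelEncoder()
codes = le.fit_transform(df["school_state"])

# ... but a classifier may read those integers as ordered, so each
# category is expanded into k binary indicator columns (one-hot).
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()

# Continuous costs are discretized; the bin index replaces the raw value.
df["price_bin"] = pd.cut(df["total_price"], bins=3, labels=False)
```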
  • 6. We applied the same concept to the cost-related attributes, item count per project, total price of items per project, number of projects per teacher, etc. This improved the repetitiveness of the attributes to a great extent, and more useful information was discovered for the classifier to use.
Attribute Construction
Some of the features given in the data files cannot be used directly for various reasons (most often they are highly non-repetitive). We used some of these features to construct new features by combining multiple features or transforming one into another. Given below is the list of derived attributes.
1. Month - the posted date of the project was given but is not very repetitive, so we derived a month attribute from the posted date and used it for prediction.
2. Essay length - for each project the corresponding essay was given, but it cannot be used directly for prediction. We therefore calculated the length of each essay after removing extra spaces within the essay text and used it as an attribute.
3. Need statement length.
4. Projects per teacher - we calculated the number of projects per teacher by grouping the projects on 'teacher_acctid' and used it as an attribute.
5. Total items per project - we calculated the total number of items requested per project from the details provided in the resources file and used it as an attribute.
6. Cost of total items per project - we calculated the total cost of the items requested per project from the details provided in the resources file and used it as an attribute.
Several other derived attributes, such as date and short description length, were considered, but they did not yield a significant performance improvement.
Model Selection and Evaluation
We used three classifiers during the project. First we used a decision tree classifier, then logistic regression, and finally an SGD (stochastic gradient descent) classifier. We started with the tree classifier as it was easy to use.
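Three of the derived attributes above can be sketched with pandas. The rows and essay texts are invented; only 'teacher_acctid' and the general column shapes follow the real data.

```python
import pandas as pd

# Synthetic stand-ins for the projects and essays files.
projects = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "teacher_acctid": ["t1", "t1", "t2"],
    "date_posted": pd.to_datetime(["2012-03-01", "2012-07-15", "2013-01-20"]),
})
essays = pd.DataFrame({
    "projectid": ["p1", "p2", "p3"],
    "essay": ["  We need  books ", "Science kits please", "Art supplies"],
})

# Month derived from the posted date.
projects["month"] = projects["date_posted"].dt.month

# Essay length after collapsing extra whitespace.
essays["essay_length"] = essays["essay"].str.split().str.join(" ").str.len()

# Projects per teacher via a groupby on 'teacher_acctid'.
projects["projects_per_teacher"] = (
    projects.groupby("teacher_acctid")["projectid"].transform("count")
)
```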
To evaluate the performance of the classifiers, we initially used cross-validation. But later we realized that the competition uses the ROC (area under the curve) score for evaluation, so we also used ROC scores to evaluate the performance of the classifiers. As we had several choices of classifier, we read several articles about their usage. From them we learned that decision trees normally do not perform well when there is a class imbalance problem, and that logistic regression is used instead. Logistic regression performed well with the given data and achieved about a 0.61 ROC score. To improve the accuracy further, we used the SGD classifier (logistic regression with SGD training). On one hand, it is more efficient than logistic regression, so predictions can be made in less time; on the other hand, it achieved higher accuracy than the regression classifier. With default parameters for the SGD classifier we were able to achieve about a 0.635 ROC score. To tune the SGD classifier (to find the best values
  • 7. for the parameters) we performed a grid search and found optimum values for the number of iterations, penalty, shuffle and alpha parameters. Using those values we were able to improve the accuracy up to a 0.64 ROC score.
Ensemble Methods
We tried to use a boosting algorithm to improve the performance of the classifier. Among the available methods we used AdaBoost (AdaBoostClassifier). The implementation provided by the scikit-learn library supports only the decision tree classifier and SGD classifier, so we were not able to use logistic regression directly; instead we tried the SGD classifier with the boosting algorithm. But the accuracy increased only by an insignificant amount.
Further Improvements
The essays file contains a huge amount of data, but apart from the essay length it was not used during prediction. We tried to extract essay features using TfidfVectorizer, but this was not successful due to memory constraints. As an alternative we tried hashing methods, but they reduced the accuracy obtained from the essay data. We think that the accuracy of the classifier may improve further if some features from the essay data are included in the training data. Further use of ensemble methods would also likely improve the accuracy of predictions.
Support Libraries Used
We used the 'Pandas' data analysis library to generate data frames from the provided comma-separated values files, which could then be used with the other data analysis and modeling tools we employed. We also used functions provided by the 'Pandas' library for generating bins (to discretize the attributes with less repetitive values) and for merging data frames from several data sources. We then used the 'NumPy' extension library to generate multidimensional arrays from 'Pandas' data frames and series, making it easy to access certain ranges of data (i.e. to separate the indices of the training set from the test set) and to locate properties of the data such as the median and quartiles.
Functions provided by the 'NumPy' library were also useful when combining derived attributes with existing ones. The 'Scikit-learn' machine learning library was what we used to integrate data analysis, preprocessing, classification, regression and modeling tools into our implementation. From the various tools provided by 'Scikit-learn' we used preprocessing tools such as 'LabelEncoder', 'OneHotEncoder', 'StandardScaler' and the text feature extraction tools; classification tools such as 'DecisionTreeClassifier', 'SGDClassifier' and 'LogisticRegression'; model selection and evaluation tools such as 'GridSearchCV'; ensemble tools such as 'AdaBoostClassifier'; and metrics such as 'roc_auc_score' to compute the area under the curve (AUC) from prediction scores, as mentioned above.