ACMS 20210: Project Final Write-up
Using Classification Algorithms to Classify Students into Achievement Levels based on School and
Home Factors
Introduction
There are many factors that influence a student’s academic success, including factors at school as well as at home. We wanted to obtain an interesting dataset related to student success and implement various classification algorithms on that dataset. Our goal in this project was to classify students into high, medium, and low performance classes based on chosen attributes. Success in this project would mean low error rates for our algorithms; that is, the program would rarely assign a student a performance class other than their actual one. We planned on implementing three classification algorithms: K-Nearest Neighbors (KNN), Naïve Bayes (NB), and Linear Discriminant Analysis (LDA), and comparing the error rates of the three. We also aimed to make it possible for the user to input data for a new student and run a predictive algorithm to classify that student.
Accomplishments
We were able to achieve most of our goals in the project. We chose ten attributes in the data set and
implemented KNN and NB on the data set to classify the 395 students. While we initially considered
implementing the third algorithm, LDA, we found the algorithm to be beyond our understanding of
statistics. Instead, we shifted our focus to testing different parameter modifications of the other two
algorithms to classify the students and compare error rates among the different configurations. Each algorithm had a key parameter that could vary within a given range: for KNN, ‘k’ could be any integer from 2 to the size of the dataset (for practicality, we chose 10 as an upper bound), and for Naïve Bayes, ‘m’ could be any decimal value between 0 and 1. We also tested two tiebreaking methods: randomly choosing a class from among those that were tied, or defaulting to the highest of the tied classes. Our error rates for these configurations ranged from 9-22%. While high, these rates have some reasonable explanations, including small sample size (n=395 observations), choice of attributes (the variables we chose simply may not be good predictors of academic outcomes), inherent algorithm error (present even in theory), and loss of information from converting some continuous variables into categorical ones. The
final step of our project was an interactive portion of the program which allows the user to input data for
one or more students and predict a class for that student using KNN or NB.
Implementation
To implement our program, we first inputted the data
from an online database into a struct called “Observation.”
This structure included the attributes (gender, age, travel time
to school, study time, number of classes failed, activity
participation, paid tutoring, free time, period one grade, and
period two grade) for each student. We converted many of the
continuous variables into categorical ones for simplification, such
as travel and study time (both converted to scales from 1 to 4).
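A minimal sketch of such a struct follows; the field names and encodings below are illustrative assumptions for demonstration, not our exact identifiers:

```cpp
// Illustrative sketch of the "Observation" record; field names and
// value encodings are assumptions, not the exact code.
struct Observation {
    char gender;       // 'F' or 'M'
    int age;           // in years
    int travel_time;   // categorical scale 1-4
    int study_time;    // categorical scale 1-4
    int failures;      // number of classes previously failed
    bool activities;   // extracurricular participation
    bool paid;         // paid tutoring
    int free_time;     // categorical scale
    int g1;            // period one grade
    int g2;            // period two grade
    int outcome;       // 0 = low, 1 = medium, 2 = high performance
};
```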
A main focus of our project was programming the algorithms ourselves. For KNN (in a dataset of n observations), this involved constructing an n×n distance matrix, ordering the observations by increasing distance, and looking up the true outcomes of the k observations with the shortest distances (i.e., the k nearest neighbors). A majority vote over the outcomes of those k nearest neighbors then determined the predicted class.
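The KNN steps above can be sketched as follows for a single query observation (our full program computes the whole n×n matrix); encoding features as numeric vectors and using Euclidean distance are assumptions for illustration:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <utility>
#include <vector>

using Features = std::vector<double>;

// Euclidean distance between two numeric feature vectors (an assumed metric).
double dist(const Features& a, const Features& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Predict the class (0 = low, 1 = medium, 2 = high) of query `q`:
// compute distances, sort neighbors, majority-vote over the k nearest.
int knn_predict(const std::vector<Features>& X, const std::vector<int>& y,
                const Features& q, int k) {
    std::vector<std::pair<double, int>> order;  // (distance, index)
    for (std::size_t i = 0; i < X.size(); ++i)
        order.push_back({dist(X[i], q), (int)i});
    std::sort(order.begin(), order.end());      // increasing distance
    std::array<int, 3> votes{0, 0, 0};
    for (int j = 0; j < k && j < (int)order.size(); ++j)
        votes[y[order[j].second]]++;
    // Tiebreak by defaulting to the highest tied class, one of the
    // rules described above.
    int best = 0;
    for (int c = 1; c < 3; ++c)
        if (votes[c] >= votes[best]) best = c;
    return best;
}
```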
For Naïve Bayes, the crux of the algorithm was to compute a probability (using prior and posterior probabilities) of an observation being classified into the low, medium, or high category of academic performance. For each algorithm, the result for each observation is stored in a results struct (the observation's predicted class, its actual class, and a boolean value for whether the prediction was correct). The specifications of the algorithm (including the algorithm type, parameter value, tiebreaker rule, and resultant error rate) were stored in the Trial_Desc (for "trial, descriptive") structure for comparison of algorithm performance after multiple configurations are run.

Figure 1. High-Level Flow Charts for Student Academic Classification
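The Naïve Bayes scoring can be sketched as below for categorical attributes; treating ‘m’ as an additive smoothing constant in the prior and conditional estimates is an assumption, not necessarily our exact formula:

```cpp
#include <cmath>
#include <vector>

// Naive Bayes sketch for categorical attributes: score each class by
// log P(c) + sum_j log P(x_j | c), with additive smoothing `m`
// (the role of `m` here is an assumption). Classes are 0/1/2 for
// low/medium/high; `num_values` is the number of levels per attribute.
int nb_predict(const std::vector<std::vector<int>>& X,
               const std::vector<int>& y,
               const std::vector<int>& q,
               double m, int num_values) {
    const int n = (int)X.size(), d = (int)q.size(), num_classes = 3;
    int best_class = 0;
    double best_score = -1e300;
    for (int c = 0; c < num_classes; ++c) {
        int nc = 0;
        for (int i = 0; i < n; ++i) if (y[i] == c) ++nc;
        // Smoothed log-prior for class c.
        double score = std::log((nc + m) / (n + num_classes * m));
        for (int j = 0; j < d; ++j) {
            int match = 0;  // rows of class c whose attribute j equals q[j]
            for (int i = 0; i < n; ++i)
                if (y[i] == c && X[i][j] == q[j]) ++match;
            score += std::log((match + m) / (nc + num_values * m));
        }
        if (score > best_score) { best_score = score; best_class = c; }
    }
    return best_class;
}
```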
To construct the user-input portion, we carefully developed error-checked
functions for the user to enter the data on the new student. We then added this
student's data to the full dataset and adjusted the KNN and NB algorithms to predict
only the class of this last observation (this was done to ease computational load; for
our purposes at this point in the program there was no need to predict the classes of
the other 395 observations).
Lessons and Takeaways
Our program was very complex compared to anything we had built previously
in class and lab assignments. It involved two major algorithms, each with multiple
steps and parameter configurations, and a user-input portion. Given this complexity,
we very quickly learned the importance of planning ahead to maintain the overall
goals of the project and of breaking a large problem into smaller parts. We created
flowcharts for each part of our program and refined them until they could be used as a
guide for coding, and used functions to tackle one piece of the problem at a time. We
also learned the importance of testing frequently; for our purposes, we did a lot of
testing with a reduced dataset of only ten observations to ensure the program was
behaving properly. This then required only minimal tuning when we implemented the
program with the full dataset of 395 observations. This testing enabled us to find
many logical errors in both our algorithms as well as the input portion of the program.
Tests, when successful, indicated that we could continue to the next part of our program. In addition, we
learned the importance of communicating as a team. We met at least twice a week outside of class to
discuss our progress and determine our next steps.
Changes and Extensions
Despite our overall satisfaction with our product, there were a few things we would do differently
next time. First, we would explore the use of different containers. We used vectors and matrices in our
program, but sets or maps might be a better way to implement classification algorithms. Since both of
these types are sorted by key, sorting error rates or ordering neighbors would be easier if each
observation were indicated by a key rather than a numerical index. Also, we would do more initial research
about how to best receive input from the user. We were originally under the impression that cin was a
good way to avoid errors, but when we tested the program, we found many problems with that strategy.
We then did research and found that inputting a string and converting it to another data type (mainly
integer) was a more robust way to input data. Overall, our project could have been improved by doing
some up-front research on container types and input handling.
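The string-then-convert strategy we describe can look roughly like this; the function name and the range check are illustrative, not our exact code:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Read a whole line, attempt the integer conversion, and re-prompt on
// failure, instead of letting a bad token corrupt the stream state the
// way a bare `cin >> value` can. Hypothetical helper for illustration.
int read_int_in_range(std::istream& in, std::ostream& out,
                      const std::string& prompt, int lo, int hi) {
    std::string line;
    while (true) {
        out << prompt;
        if (!std::getline(in, line)) return lo;  // EOF: fall back to lo
        std::istringstream ss(line);
        int value;
        char extra;
        // Accept only a clean integer with no trailing junk, within range.
        if (ss >> value && !(ss >> extra) && value >= lo && value <= hi)
            return value;
        out << "Invalid entry; please enter an integer between "
            << lo << " and " << hi << ".\n";
    }
}
```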
Our project is also rich with options for expansion and improvement. One possible extension
would allow for more than ten attributes to be used for classification and potentially allow the user to
choose the attributes for a given run of the algorithm. While we chose ten attributes to simplify the scope
of our project, the data set we used in our project contained 33 attributes, which may prove to predict
academic outcomes better. Some potential additional or substitute attributes could be “nursery”
(attended nursery school; binary yes/no), “health” (current health status; 1 = very bad to 5 = very good),
and “famsize” (family size; binary: greater than 3, or 3 or fewer). Further, our program is rigid in
that it specifies that ten attributes must be used, and that they must be the attributes we selected because
we have fixed a type for each. To make our program more flexible, we could use templates instead of
specifying the type of each attribute inputted by the user, so that the same function could be used for most
attributes. Another possible extension would be to develop additional algorithms to compare with KNN and
NB: contenders include quadratic discriminant analysis, Fisher’s linear discriminant, and logistic regression. We
could also allow the user to specify the configuration of an algorithm (algorithm, parameter, tiebreaker)
running on a whole data set, as they can for the input of a new student.
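The template idea might look roughly like this: one generic reader replaces a family of fixed-type input functions (the helper name is hypothetical; requires C++17 for std::optional):

```cpp
#include <optional>
#include <sstream>
#include <string>

// Convert one line of user text to any streamable attribute type.
// Returns nullopt when the conversion fails or trailing junk remains,
// so the caller can re-prompt. A sketch, not our actual code.
template <typename T>
std::optional<T> parse_attribute(const std::string& line) {
    std::istringstream ss(line);
    T value;
    char extra;
    if (ss >> value && !(ss >> extra)) return value;
    return std::nullopt;
}
```

With this, the same error-checked input loop could serve integer scales, grades, and decimal parameters alike, instead of one fixed-type function per attribute.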
Figure 2. Snapshot of Results of Many Algorithm Configurations
