ACMS 20210: Project Final Write-up
Using Classification Algorithms to Classify Students into Achievement Levels based on School and
Home Factors
Introduction
There are many factors that influence students' academic success, both within the
school and at home. We wanted to obtain an interesting dataset related to student success and
implement various classification algorithms on it. Our goal in this project was to classify
students into high, medium, and low performance classes based on chosen attributes.
Success in this project would mean low error rates for our algorithms; that is, the program would
rarely assign a student a performance class other than the student's actual one. We
planned on implementing three classification algorithms: K-Nearest Neighbors (KNN), Naïve Bayes
(NB), and Linear Discriminant Analysis (LDA), and comparing their error rates. We
also aimed to make it possible for the user to input data for a new student and run a predictive algorithm to
classify that student.
Accomplishments
We were able to achieve most of our goals in the project. We chose ten attributes in the data set and
implemented KNN and NB on the data set to classify the 395 students. While we initially considered
implementing the third algorithm, LDA, we found the algorithm to be beyond our understanding of
statistics. Instead, we shifted our focus to testing different parameter modifications of the other two
algorithms to classify the students and compare error rates among these different configurations. Each
algorithm had a key parameter that could vary within a given range: for KNN, 'k' could be any integer from 2
up to the size of the dataset, though for practicality we capped it at 10; for Naïve Bayes, 'm' could be any
decimal value between 0 and 1. We also tested two tiebreaking methods: randomly choosing among the
tied classes, or defaulting to the highest of them. Our error rates for these
configurations ranged from 9-22%. While high, these error rates have some reasonable explanations,
including the small sample size (n=395 observations), our choice of attributes (the variables we chose
may simply not be good predictors of academic outcomes), inherent algorithm error (even in theory),
and data simplification (loss of information from converting some continuous variables into categorical ones). The
final step of our project was an interactive portion of the program that allows the user to input data for
one or more students and predict a class for each using KNN or NB.
Implementation
To implement our program, we first read the data
from an online database into a struct called “Observation.”
This structure included the attributes (gender, age, travel time
to school, study time, number of classes failed, activity
participation, paid tutoring, free time, period one grade, and
period two grade) for each student. For simplification, we converted many of the
continuous variables into categorical ones, such
as travel and study time (both converted to scales from 1-4).
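A sketch of what such a record might look like; the field names and encodings below are illustrative assumptions, not the project's exact identifiers:

```cpp
// One student record, mirroring the attributes listed above.
// Field names and encodings are illustrative.
struct Observation {
    char gender;      // 'F' or 'M'
    int age;          // in years
    int travel_time;  // travel time to school, categorical scale 1-4
    int study_time;   // weekly study time, categorical scale 1-4
    int failures;     // number of classes failed
    bool activities;  // extracurricular participation
    bool paid;        // paid tutoring
    int free_time;    // free time after school, categorical scale
    int g1;           // period one grade
    int g2;           // period two grade
    int outcome;      // performance class: 0 = low, 1 = medium, 2 = high
};
```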
A main focus of our project was programming the
algorithms ourselves. For KNN (in a dataset of n observations),
this involved constructing an n×n distance matrix, ordering the
observations by increasing distance, and looking up the true
outcomes of the k observations with the shortest
distances (i.e., the k nearest neighbors). A majority vote
was then taken over the outcomes of those k
nearest neighbors.
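The KNN steps above can be sketched as follows, assuming each observation has already been encoded as a numeric feature vector; the distance measure and the "default to the higher class" tiebreak are illustrative choices:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Euclidean distance between two encoded observations (an assumption;
// other metrics would also work for the scheme described above).
double euclidean_distance(const std::vector<double>& a,
                          const std::vector<double>& b) {
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Predict the class (0 = low, 1 = medium, 2 = high) of `query`:
// compute distances, order by increasing distance, majority-vote
// over the k nearest. Ties default to the higher class.
int knn_predict(const std::vector<std::vector<double>>& data,
                const std::vector<int>& labels,
                const std::vector<double>& query, int k) {
    std::vector<std::pair<double, int>> dist;  // (distance, label)
    for (size_t i = 0; i < data.size(); ++i)
        dist.push_back({euclidean_distance(data[i], query), labels[i]});
    std::sort(dist.begin(), dist.end());       // increasing distance

    int votes[3] = {0, 0, 0};
    for (int i = 0; i < k && i < static_cast<int>(dist.size()); ++i)
        ++votes[dist[i].second];

    int best = 0;
    for (int c = 1; c < 3; ++c)
        if (votes[c] >= votes[best]) best = c;  // >= breaks ties upward
    return best;
}
```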
For Naïve Bayes, the crux of the algorithm was to compute a probability (using prior and
posterior probabilities) of each observation being classified into the low, medium, or high category
of academic performance. For each algorithm, the result for each observation is stored in a struct of
results (the observation's predicted class, actual class, and a boolean value for whether the prediction was
true). The specifications of the algorithm (including the algorithm type, parameter
value, tiebreaker rule, and resultant error rate) were stored in the Trial_Desc (for
"trial, descriptive") structure and kept for comparison of algorithm performance
after multiple configurations are run.

Figure 1. High Level Flow Charts for Student Academic Classification
To construct the user-input portion, we carefully developed error-checked
functions for the user to enter the data on the new student. We then added this
student's data to the full dataset and adjusted the KNN and NB algorithms to predict
only the class of this last observation (this was done to ease computational load; for
our purposes at this point in the program there was no need to predict the classes of
the other 395 observations).
Lessons and Takeaways
Our program was very complex compared to anything we had built previously
in class and lab assignments. It involved two major algorithms, each with multiple
steps and parameter configurations, and a user-input portion. Given this complexity,
we very quickly learned the importance of planning ahead to maintain the overall
goals of the project and breaking a large problem into smaller parts. We created
flowcharts for each part of our program and refined them until they could be used as a
guide for coding, and used functions to tackle one piece of the problem at a time. We
also learned the importance of testing frequently; for our purposes, we did a lot of
testing with a reduced dataset of only ten observations to ensure the program was
behaving properly. This then required only minimal tuning when we implemented the
program with the full dataset of 395 observations. This testing enabled us to find
many logical errors in both our algorithms as well as the input portion of the program.
Tests, when successful, indicated that we could continue to the next part of our program. In addition, we
learned the importance of communicating as a team. We met at least twice a week outside of class to
discuss our progress and determine our next steps.
Changes and Extensions
Despite our overall satisfaction with our product, there were a few things we would do differently
next time. First, we would explore the use of different containers. We used vectors and matrices in our
program, but sets or maps may be a better way to implement classification algorithms. Since both of these
types are sorted by key, tasks such as sorting error rates or ordering neighbors would be easier if a
key, rather than a numerical index, identified each observation. Also, we would do more initial research
on how best to receive input from the user. We were originally under the impression that cin was a
good way to avoid errors, but when we tested the program, we found many problems with that strategy.
We then did research and found that reading a string and converting it to another data type (mainly
integer) was a more secure way to take input. Overall, our project could have been improved by doing
some up-front research on container types and input handling.
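The string-then-convert strategy described above can be sketched as follows; the function names and the range check are illustrative, not our exact code:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Returns true and writes `out` only if `line` is a clean integer in
// [lo, hi]. std::stoi throws on non-numeric input, and `pos` lets us
// reject trailing junk like "3x" -- the problems raw `cin >> int` caused.
bool parse_int_in_range(const std::string& line, int lo, int hi, int& out) {
    try {
        size_t pos = 0;
        int value = std::stoi(line, &pos);
        if (pos == line.size() && value >= lo && value <= hi) {
            out = value;
            return true;
        }
    } catch (const std::exception&) {
        // fall through: not an integer at all
    }
    return false;
}

// Re-prompt until parsing succeeds; the loop our input functions used.
int read_int_in_range(const std::string& prompt, int lo, int hi) {
    std::string line;
    int value = 0;
    while (true) {
        std::cout << prompt;
        std::getline(std::cin, line);
        if (parse_int_in_range(line, lo, hi, value)) return value;
        std::cout << "Please enter an integer between " << lo
                  << " and " << hi << ".\n";
    }
}
```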
This project is also rich with options for expansion and improvement. One possible extension
would allow more than ten attributes to be used for classification and potentially let the user
choose the attributes for a given run of the algorithm. While we chose ten attributes to simplify the scope
of our project, the data set we used contained 33 attributes, which may prove to predict
academic outcomes better. Some potential additional or substitute attributes are “nursery”
(attended nursery school; binary yes/no), “health” (current health status; 1 = very bad to 5 = very good), and
“famsize” (family size; binary, greater than 3 or less than or equal to 3). Further, our program is rigid in
that it requires exactly ten attributes, and that they be the attributes we selected, because
we have fixed a type for each. To make our program more flexible, we could use templates instead of
specifying the type of each attribute inputted by the user, so that the same function could be used for most
attributes. Another possible extension would be to develop additional algorithms to compare to KNN and
NB; contenders include quadratic discriminant analysis, Fisher’s linear discriminant, and logistic regression. We
could also allow the user to specify the configuration of an algorithm (algorithm, parameter, tiebreaker)
when running on the whole data set, as they can for the input of a new student.
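The template idea mentioned above could look something like this hedged sketch: one function handles any attribute type that can be read from a stream, so adding an attribute would not require a new input routine.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Parse one attribute of any streamable type T from a line of user input.
// Succeeds only if the whole line (ignoring trailing whitespace) converts
// cleanly. A sketch of the flexibility idea, not our current program.
template <typename T>
bool parse_attribute(const std::string& line, T& out) {
    std::istringstream in(line);
    T value{};
    if (in >> value) {
        in >> std::ws;                 // permit trailing whitespace only
        if (in.eof()) {
            out = value;
            return true;
        }
    }
    return false;
}
```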
Figure 2. Snapshot of Results of
Many Algorithm Configurations