Student Performance Data Mining Project Report
PROJECT REPORT
STUDENT PERFORMANCE
(DATAMINING)
BS (SE)2017
GROUP MEMBER(S):
NAME: HAFSAHABIB 2017/COMP/BS(SE)-21597
NAME: MUNIBAJAVIAD 2017/COMP/BS(SE)-21621
SUPERVISOR:
MISS SADIA JAVED
29TH APRIL, 2019
DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
JINNAH UNIVERSITY FOR WOMEN
5-C NAZIMABAD, KARACHI 74600
Table of Contents
1. Introduction
2. Description of the problem and problem domain
3. Description of implemented data mining techniques/methods
3.1. Naïve Bayes Classifier
4. Data Set
4.1. Exploring the Data Set
4.1.1. General Distribution of Exam Scores
4.1.2. Exam scores based on the gender
4.1.3. Exam scores based on the Parent Level of Education
4.1.4. Exam scores based on the Lunch Type
4.1.5. Exam scores based on the Test Preparation Course
5. Implementation
5.1. Operators
6. Results and evaluation/discussion of the results
7. Future directions/ideas how to extend and enhance the technique
8. Conclusion
9. References
1. Introduction
Using the Students Performance in Exams dataset, we try to understand what affects the
exam scores. The data is limited, but visualizing it helps to spot the relations. We first
explore the data and then apply the Naïve Bayes classification technique for
evaluation purposes.
2. Description of the problem and problem domain
The aim is to understand the influence of factors such as the parents’ background and test
preparation on students’ performance.
Objectives
Check the dataset and tidy the data if needed.
Visualize the data to understand the effects of different factors on student performance.
Check the effectiveness of the test preparation course.
Identify the major factors influencing the test scores.
3. Description of implemented data mining techniques/methods
3.1. Naïve Bayes Classifier
Bayesian classifiers are statistical classifiers that predict class membership probabilities, such
as the probability that a given sample belongs to a particular class. Naïve Bayes algorithms
assume that the effect of an attribute value on a given class is independent of the values of the
other attributes. In practice, however, dependencies often exist among attributes; Bayesian
networks are graphical models that can describe such joint conditional probability distributions.
Bayesian classifiers are popular classification algorithms due to their simplicity, computational
efficiency and good performance on many real-world problems. Bayesian models are also fast
to train and to evaluate, and achieve high accuracy in many
domains.
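To make the independence assumption concrete, the following is a minimal sketch of a categorical Naïve Bayes classifier in plain Python. The toy rows, the function names train_nb and predict_nb, and the use of add-one (Laplace) smoothing are illustrative assumptions; they are not taken from the project's actual process.

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] -> count
    value_counts = defaultdict(Counter)
    # distinct values seen for each attribute (needed for smoothing)
    values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def predict_nb(model, row):
    """Return the class with the highest posterior score for one row."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # log P(class) + sum of log P(value | class); attributes are
        # treated as independent given the class (the "naive" assumption)
        score = math.log(n / total)
        for i, v in enumerate(row):
            count = value_counts[(i, label)][v]
            # add-one smoothing avoids zero probabilities (assumption)
            score += math.log((count + 1) / (n + len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (lunch, test preparation course) -> performance
rows = [("standard", "completed"), ("standard", "none"),
        ("free/reduced", "none"), ("free/reduced", "none"),
        ("standard", "completed"), ("free/reduced", "completed")]
labels = ["Good", "Good", "Bad", "Bad", "Good", "Good"]

model = train_nb(rows, labels)
prediction = predict_nb(model, ("free/reduced", "none"))
print(prediction)
```

Working in log space keeps the product of many small probabilities from underflowing; the class with the largest summed log score is the prediction.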
4. Data Set
Gender: Gender of the student (i.e. Male, Female)
Ethnicity: Ethnicity to which the student belongs (i.e. group A, B, C, D, E)
Parent level of Education: Education level of the parents/guardian of the student (i.e.
high school, bachelor’s degree, master’s degree, some college, associate’s degree)
Lunch: Standard of the lunch provided to the student in school (i.e. standard,
free/reduced)
Test preparation course: Whether the student took the preparation course (i.e. none,
completed)
Math score: Mathematics score of the student (from 0 to 100)
Reading score: Reading score of the student (from 0 to 100)
Writing score: Writing score of the student (from 0 to 100)
Student Performance: Overall performance of the student (i.e. Good, Average, Bad,
Worst)
4.1. Exploring the Data Set
First, we import the dataset and display its first few rows.
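As a rough illustration of this step outside the tool used in the project, the sketch below parses CSV text with Python's standard csv module and prints the first rows. The sample rows are made up, and the column names are an assumption about the file's header; the helper name load_rows is hypothetical.

```python
import csv
import io

# Hypothetical excerpt standing in for the real CSV file; values are made up
# and the header names are an assumption.
SAMPLE_CSV = """gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
female,group B,bachelor's degree,standard,none,70,75,72
male,group C,some college,free/reduced,completed,65,60,58
female,group D,master's degree,standard,completed,88,92,90
"""

def load_rows(source):
    """Parse CSV text from a file-like object into a list of dicts."""
    rows = list(csv.DictReader(source))
    for row in rows:
        # Scores arrive as strings; convert them to integers.
        for column in ("math score", "reading score", "writing score"):
            row[column] = int(row[column])
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows[:2]:  # display the first few rows
    print(row)
```

With the downloaded file, the same helper could be called on an open file handle instead of the in-memory sample.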
4.1.1. General Distribution of Exam Scores
There are 5 features which might affect the scores of each exam. The first thing to analyze is
how the scores are distributed within each exam (Math, Reading, and Writing). We
plot histograms to see if there are any differences in the score distributions.
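A histogram simply counts how many scores fall into each bin. The sketch below shows that counting step and prints a text histogram; the scores are hypothetical, and the bin width of 10 and the function name histogram are assumptions made for illustration.

```python
from collections import Counter

def histogram(scores, bin_width=10):
    """Count how many scores fall into each bin of the given width."""
    counts = Counter()
    for score in scores:
        # A score of 100 goes into the top bin rather than a bin of its own.
        start = min(score // bin_width * bin_width, 100 - bin_width)
        counts[start] += 1
    return dict(sorted(counts.items()))

# Hypothetical math scores, not taken from the dataset.
math_scores = [45, 52, 58, 61, 63, 67, 70, 72, 75, 78, 81, 88, 100]
bins = histogram(math_scores)
for start, count in bins.items():
    print(f"{start:3d}-{start + 10:3d} | {'#' * count}")
```

Running the same function on each of the three score columns gives three distributions that can be compared side by side.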
The scores are approximately normally (Gaussian) distributed. It is hard to draw any conclusion
from the graphs above: they all look very similar and we don't have enough data for the plots
to look smoother.
4.1.2. Exam scores based on the gender
Graphical representation of the exam scores by gender (i.e. Male, Female).
4.1.3. Exam scores based on the Parent Level of Education
The mean values are displayed as a table or a heat map.
It seems that a lower parental level of education has a negative impact on the exam scores.
Children of parents whose highest education level was college or high school have noticeably
lower exam scores than their peers. Similarly, parents with a master's or bachelor's degree have
children who score much better in the exams.
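A table of mean scores per parental education level, as described above, could be computed along the following lines. This is only a sketch: the rows and scores are hypothetical, and the short column names and the function name mean_scores_by are assumptions.

```python
from collections import defaultdict
from statistics import mean

def mean_scores_by(rows, group_key, score_keys):
    """Return {group value: {score name: mean score}} for the given rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)
    return {
        group: {key: mean(r[key] for r in members) for key in score_keys}
        for group, members in grouped.items()
    }

# Hypothetical rows; the scores are illustrative only.
rows = [
    {"parent_education": "high school", "math": 60, "reading": 64},
    {"parent_education": "high school", "math": 64, "reading": 66},
    {"parent_education": "master's degree", "math": 70, "reading": 76},
    {"parent_education": "master's degree", "math": 74, "reading": 78},
]
table = mean_scores_by(rows, "parent_education", ["math", "reading"])
for group, means in table.items():
    print(group, means)
```

The same function works for any categorical column, so it can also group by lunch type or test preparation course.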
4.1.4. Exam scores based on the Lunch Type
It might seem amusing to think that the type of lunch students have is correlated with their
exam scores. However, we can see from the dataset that there are two types of
lunch: standard and free/reduced. So the lunch type depends on the parents' financial situation
rather than on the type of dish. There might be some correlation here, so let's try to visualize it.
According to the above visualization, there is a large disparity in exam scores between students
who have a free/reduced lunch and those who have a standard lunch.
4.1.5. Exam scores based on the Test Preparation Course
The last thing we explore in this dataset is how completion of the test preparation
course affects the exam scores, using a heat map. This variable has only two
categories: none and completed.
5. Implementation
This dataset is clean and free of unwanted data, so we do not have to go through the process of
cleaning it. We apply the Naïve Bayes classification technique to our Student Performance data
set. The Naïve Bayes classifier is a well-known approach to supervised learning: a model is
trained on training data and then used to classify test data. The data set has 8 features and
1 label, named Student Performance.
Student Performance is the class label to be predicted. As no separate testing data is
provided, we split the dataset into training and testing sets in a 70:30 ratio.
We train the Naïve Bayes model on 70% of the dataset and then classify the remaining 30% of
the data. After that, we measure the performance parameters, i.e. accuracy, precision and recall,
to show how accurate the model is on the dataset.
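The 70:30 split can be sketched in a few lines of Python. The shuffling, the fixed seed and the function name split_data are assumptions made for illustration; the project's tool may sample the subsets differently.

```python
import random

def split_data(examples, train_ratio=0.7, seed=42):
    """Shuffle a copy of the examples and split it into training and test sets."""
    shuffled = list(examples)
    # A fixed seed makes the split reproducible (assumption).
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

examples = list(range(10))  # stand-in for the dataset rows
train, test = split_data(examples)
print(len(train), len(test))
```

Every example ends up in exactly one of the two subsets, so the model is never evaluated on rows it was trained on.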
5.1. Operators
The details of the operators that are used for the creation of the process are as follows:
Retrieve
This Operator can access stored information in the Repository and load them into the Process.
Set Role
This Operator is used to change the role of one or more Attributes.
Split data
This operator produces the desired number of subsets of the given Data Set.
Naïve Bayes
This Operator generates a Naive Bayes classification model.
Apply Model
This Operator applies a model on the given Data Set.
Performance
This operator is used for performance evaluation. It delivers a list of performance criteria
values. These performance criteria are automatically determined in order to fit the learning
task type.
6. Results and evaluation/discussion of the results
Confusion Matrix
The result of the process on the “Student Performance” data set is shown below in the form of
a confusion matrix. The table shows the accuracy, class precision and class recall.
The following criteria are reported for the classification task:
Accuracy
Precision
Recall
Accuracy is the percentage of correct predictions out of the total number of
examples. A correct prediction is an example where the value of the prediction attribute is equal
to the value of the label attribute.
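The definition of accuracy above, together with class precision and class recall, can be written out as a short sketch. The labels and predictions below are hypothetical and do not reproduce the project's results; the function names are assumptions.

```python
def accuracy(actual, predicted):
    """Fraction of examples whose prediction equals the label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def precision_recall(actual, predicted, cls):
    """Class precision and class recall for one class."""
    true_pos = sum(a == cls and p == cls for a, p in zip(actual, predicted))
    predicted_pos = sum(p == cls for p in predicted)
    actual_pos = sum(a == cls for a in actual)
    # Guard against division by zero when a class is never predicted or absent.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical labels and predictions, not the project's results.
actual    = ["Good", "Good", "Average", "Bad", "Average", "Good", "Bad", "Worst"]
predicted = ["Good", "Average", "Average", "Bad", "Average", "Good", "Bad", "Bad"]

acc = accuracy(actual, predicted)
prec_bad, rec_bad = precision_recall(actual, predicted, "Bad")
print(f"accuracy={acc:.2%} precision(Bad)={prec_bad:.2f} recall(Bad)={rec_bad:.2f}")
```

Class precision asks how many of the examples predicted as a class really belong to it, while class recall asks how many of the examples of that class were found.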
The accuracy of the Naïve Bayes model on the Student Performance data set is 92.64%.
7. Future directions/ideas how to extend and enhance the technique
Using this process or model, we can predict more about student performance and the
factors involved in it.
In future, this process could be implemented in any university to predict a student's
GPA in advance from their previous GPA.
In schools, we can identify the lowest-performing students so that teachers can
focus more on them.
8. Conclusion
We have already seen the insights from the data; the summary is given below:
135 students failed in mathematics, 90 students failed the reading examination, 114
students failed the writing examination and overall 103 students failed the examination.
Reading score and Writing score are positively linearly correlated, with a correlation
coefficient of approximately 0.95.
Students who belong to ethnicity group D performed very well.
The Test Preparation Course appears to be effective: fewer of the students who had
completed the course failed.
Students who have the standard lunch performed better than the others.
In the case of parental education level, parents with a master's or bachelor's degree have
children who score much better in the exams.
The accuracy of the Naïve Bayes classifier process on the Student Performance data set is
92.64%.
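The correlation coefficient mentioned in the summary above is the Pearson coefficient. A minimal sketch of how it can be computed is given below; the scores are hypothetical and the function name pearson is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores, not taken from the dataset.
reading = [72, 90, 95, 57, 78]
writing = [74, 88, 93, 44, 75]
r = pearson(reading, writing)
print(round(r, 2))
```

A value close to 1 indicates a strong positive linear relationship between the two score columns.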
9. References
https://www.kaggle.com/spscientist/students-performance-in-exams#StudentsPerformance.csv