FORECASTING A STUDENT’S EDUCATION 
FULFILLMENT USING REGRESSION ANALYSIS 
Submitted by 
RAM G ATHREYA 
Roll No.: 1202FOSS0019 
Reg. No.: 75812200021 
A PROJECT REPORT 
Submitted to the 
FACULTY OF SCIENCE AND HUMANITIES 
in partial fulfillment for the requirement of award of the degree of 
MASTER OF SCIENCE 
IN 
FREE / OPEN SOURCE SOFTWARE (CS-FOSS) 
CENTRE FOR DISTANCE EDUCATION 
ANNA UNIVERSITY 
CHENNAI 600 025 
AUGUST 2014
CENTRE FOR DISTANCE EDUCATION 
ANNA UNIVERSITY 
CHENNAI 600 025 
BONA FIDE CERTIFICATE 
Certified that this Project report titled “FORECASTING A STUDENT’S 
EDUCATION FULFILLMENT USING REGRESSION ANALYSIS” is the bona fide work of 
Mr. RAM G ATHREYA, who carried out the research under my supervision. I further 
certify that, to the best of my knowledge, the work reported herein does not form 
part of any other project report or dissertation on the basis of which a degree or 
award was conferred on an earlier occasion on this or any other candidate. 
RAM G ATHREYA 
Student at Anna University 

Dr. SRINIVASAN SUNDARARAJAN 
Professor
CERTIFICATE OF VIVA-VOCE-EXAMINATION 
This is to certify that Thiru/Mr. RAM G ATHREYA 
(Roll No. 1202FOSS0019; Register No. 75812200021) has been subjected to Viva-Voce 
Examination on 14 September 2014 at 9:30 AM at the study centre, The AU-KBC 
Research Centre, Madras Institute of Technology, Anna University, Chrompet, 
Chennai 600044. 
Internal Examiner 
Name : (in capital letters) 
Designation : 
Address : 

External Examiner 
Name : (in capital letters) 
Designation : 
Address : 

Study Centre Coordinator 
Name : (in capital letters) 
Designation : 
Address : 
Date :
ACKNOWLEDGEMENT 
I am highly indebted to my guide Dr. SRINIVASAN SUNDARARAJAN for his 
guidance, monitoring, constant supervision, kind co-operation and encouragement, 
which helped me complete this project. 
I would also like to express my special gratitude to the AU-KBC faculty members 
involved in the M.Sc. (CS-FOSS) course for their cordial support and guidance, and 
for providing the necessary information regarding the project. 
Finally, I thank the Centre for Distance Education, Anna University, for giving me 
the opportunity to do this project.
ABSTRACT 
Our government spends a substantial amount of resources on educating our 
children. Additionally, several welfare schemes, aimed especially at 
underprivileged children, have been introduced to ensure that all of them 
complete a basic level of education. In spite of these measures, many students do 
not complete their basic education. 
The aim of this project is to formulate a Supervised Learning Algorithm 
that will aid in identifying students who have a higher likelihood of not 
completing their education. 
To perform this task the algorithm will perform Logistic Regression 
Analysis on historical data of students from a given school. The historical data 
includes basic background information (features) such as gender, community, 
number of siblings, etc. It must be noted that the historical data also contains 
information on whether the student completed his/her education, which is the 
outcome we are interested in. Typically a student who finished his/her education 
is denoted by a value of 1 and a student who did not is denoted by a value 
of 0. 
Based on the training (historical) data a logistic classifier can be built. 
After learning from the training set, such a classifier will develop specific 
weightages for each of the features. These weightages can then be combined into 
an equation that can be used for prediction. 
That is, we can apply the equation to a current student (whose background 
we already know) to calculate the probability that he/she will complete his/her 
education.
Such an algorithm will be beneficial to government agencies since it can 
serve as an early warning system with which they can take more proactive action 
to prevent a student from dropping out. Policy makers can also use it as a tool to 
identify schools that are more vulnerable and direct their resources and energies 
toward helping them.
TABLE OF CONTENTS 
CHAPTER NO TITLE PAGE NO 
ACKNOWLEDGEMENT iv 
ABSTRACT v 
ABSTRACT IN TAMIL vii 
LIST OF FIGURES xii 
LIST OF TABLES xiii 
LIST OF ABBREVIATIONS xiv 
1 INTRODUCTION 1 
1.1 OVERVIEW OF THE PROJECT 1 
1.2 LITERATURE SURVEY 2 
1.3 PROPOSED SYSTEM 2 
1.4 SCOPE 2 
2 REQUIREMENT SPECIFICATION 4 
2.1 INTRODUCTION 4 
2.2 OVERALL DESCRIPTION 4 
2.2.1 PRODUCT PERSPECTIVE 5 
2.2.2 PRODUCT FUNCTIONS 5 
3 PROJECT REQUIREMENTS 7 
3.1 SOFTWARE REQUIREMENTS 7 
3.2 HARDWARE REQUIREMENTS 7 
4 SYSTEM DESIGN 9
4.1 METHODOLOGY 9 
4.2 ALGORITHM 9 
4.2.1 SUPERVISED LEARNING 10 
4.2.2 CLASSIFICATION 11 
4.2.3 LOGISTIC REGRESSION 13 
4.3 DATA COLLECTION 15 
4.3.1 FEATURE DETECTION 15 
4.3.1.1 PERSONAL 15 
4.3.1.2 ENVIRONMENTAL 15 
4.3.1.3 SCHOOL 16 
4.3.2 DATASET GENERATION 16 
4.4 MODELING 18 
4.4.1 HYPOTHESIS DEVELOPMENT 19 
4.4.2 GENERALIZATION ERROR 19 
4.5 VALIDATION 20 
4.5.1 DATASET PARTITIONING 21 
4.5.1.1 TRAINING DATASET 21 
4.5.1.2 CV DATASET 22 
4.5.2 COST FUNCTION 23 
4.5.3 ERROR METRICS 24 
4.5.3.1 TRAINING AND CV ERROR 25 
4.5.3.2 F1 SCORE 25 
4.5.3.3 W – SCORE 26 
4.5.4 LEARNING CURVES 27 
4.6 PREDICTION 29 
5 IMPLEMENTATION 31 
5.1 R 31
5.1.1 COST FUNCTION.R 31 
5.1.2 F1SCORE.R 31 
5.1.3 GENERATEDATASET.R 32 
5.1.4 GENERATEVECTOR.R 34 
5.1.5 INIT.R 36 
5.1.6 LEARNINGCURVE.R 37 
5.1.7 MYSQL.R 39 
5.1.8 PERCRANK.R 39 
5.1.9 PLOTLEARNINGCURVE.R 39 
5.1.10 PREDICTION.R 40 
5.1.11 PREDICTOR.R 40 
5.1.12 RANDOMIZEDATASET.R 41 
5.2 NODE.JS 41 
5.2.1 APP.JS 41 
5.2.2 PACKAGE.JSON 42 
5.2.3 ROUTES.JS 43 
5.2.4 INDEX.JADE 45 
5.2.5 PREDICT.JADE 47 
5.2.6 UPLOAD.JADE 52 
6 RESULTS 54 
6.1 DATASET UPLOAD 54 
6.2 UPLOAD RESULT 55 
6.3 PREDICTION 56 
7 CONCLUSIONS 57 
8 REFERENCES 58
LIST OF FIGURES 
FIGURE NO TITLE PAGE NO 
4.1 Logistic Regression Curve 
4.2 Dataset Generation 
4.3 Modeling 
4.4 Dataset Partitioning 
4.5 Developing Multiple Models 
4.6 Calculating Cross-Validation Errors 
4.7 Single Subject Learning 
4.8 Learning from Experience 
4.9 Score & Learning Time vs Experience 
4.10 Training & Cross-Validation Error Convergence 
4.11 Choosing the Best Model 
4.12 Prediction 
6.1 Upload Result 
6.2 Prediction Screen 
6.3 Predicting Student will not Dropout 
6.4 Predicting Student will Dropout
LIST OF TABLES 
TABLE NO TITLE PAGE NO 
4.1 Sample Dataset 17
LIST OF ABBREVIATIONS 
FOSS Free and Open Source Software 
IDE Integrated Development Environment 
OS Operating System 
PTR Pupil Teacher Ratio 
SCR Student Classroom Ratio
CHAPTER 1 
INTRODUCTION 
1.1 OVERVIEW OF THE PROJECT 
Dropout is a universal phenomenon of the education system in India. It is 
spread across all levels of education, in all parts of the country, and across 
socio-economic groups, and dropout rates are much higher for educationally 
backward states and districts. Girls in India tend to have higher dropout rates 
than boys. Similarly, children belonging to socially disadvantaged groups such as 
the Scheduled Castes and Scheduled Tribes have higher dropout rates in comparison 
to the general population. 
There are also regional and location-wise differences, and children living in 
rural areas are more likely to drop out of school. In order to reduce wastage and 
improve the efficiency of the education system, educational planners need to 
understand and identify the social groups that are more susceptible to dropout and 
the reasons for their dropping out. 
Keeping the above context in perspective, it would be helpful to develop a 
system or algorithm that can systematically identify vulnerable students who 
have a higher likelihood of dropping out of school. The goal of this project is to 
develop such an algorithm or system. 
Hopefully such an algorithm or system could assist educational planners and 
the administrative staff of educational institutions to better allocate resources 
and make better decisions, which could curb this growing dropout problem.
1.2 LITERATURE SURVEY 
The literature survey covers existing research and studies with respect to the 
dropout problem. They are grouped into three broad categories: 
1. Research Papers 
2. Surveys 
3. Government Reports 
The detailed list of resources researched during the literature survey is 
provided in the references section. 
1.3 PROPOSED SYSTEM 
The proposed system will implement an algorithm that takes student 
data as input and learns from it. This learned function, otherwise called the 
hypothesis, will serve as an approximate explanation of the data. Error metrics and 
validation techniques will be used to determine the accuracy of the hypothesis. 
The best hypothesis that fits the data will then be used for prediction. The final 
goal of the algorithm is to make reasonably accurate predictions on new unlabeled 
data, i.e., data for which the outcome is unknown. 
This system will be implemented in such a way that it can be operated from a 
web interface where the user can upload datasets as well as make predictions 
based on learned data. 
1.4 SCOPE
The algorithm developed is an exploratory proof-of-concept system that 
uses machine learning and statistical techniques to make predictions based on 
student data. The validity of the results is entirely dependent on the accuracy of 
the data and how the algorithm processes it. 
Since comprehensive student data was not available to make the algorithm 
as accurate as possible, this iteration of the system can only serve as a proof of 
concept of what is possible and cannot be directly used in the real world, in its 
present form, as a decision-making or policy-making tool.
CHAPTER 2 
REQUIREMENT SPECIFICATION 
2.1 INTRODUCTION 
A software requirements specification (SRS) defines the requirements of a 
software system. It is a description of the behavior of the system to be developed 
and may include a set of use cases. In addition, it contains non-functional 
requirements, which impose constraints on the design or implementation (such as 
performance requirements, quality standards, or design constraints). 
This project requires the storage and processing of medium to large volumes of 
data. Such datasets will initially be passed through the algorithm during a 
training phase, during which the algorithm learns from the training data. 
After training is completed the algorithm is required to make predictions for 
new unlabeled data based on what it learned from the training data. 
Additionally it would be helpful if the algorithm can be operated from a 
Web User Interface, which is more user-friendly than issuing commands from 
the command line. 
2.2 OVERALL DESCRIPTION 
This section outlines a holistic description of the project, including the 
different perspectives, constraints, and functional and non-functional 
requirements of the project.
2.2.1 PRODUCT PERSPECTIVE 
The system has four main tasks: 
• Data Collection 
• Modeling 
• Validation 
• Prediction 
In the data collection phase the data required for the 
algorithm is gathered, converted into a suitable form, and supplied to 
the system for learning. 
In the modeling phase the algorithm tries to generate models 
that explain the data that has been gathered. Machine Learning 
techniques are used in this phase to generate multiple models, of 
which the best is chosen in later stages. 
In the validation phase the different models are evaluated 
based on performance and the best among them is chosen as the 
candidate model to be used for prediction. 
Finally, in the prediction phase the chosen model is used for 
making actual real-world predictions. 
2.2.2 PRODUCT FUNCTIONS 
The system has two main functions: 
• Training 
• Prediction 
In the training phase the dataset is supplied to the algorithm, using 
which the best model is developed for prediction. 
In the prediction phase the learnt algorithm is actually put to use; 
that is, it is used to make predictions for unlabeled data. 
How these processes are implemented is explained in detail in 
subsequent sections.
CHAPTER 3 
PROJECT REQUIREMENTS 
The project requirement is to develop an algorithm that can classify 
students on whether they will complete their education or drop out. To achieve 
this, a system needs to be created that can be operated from a web user interface, 
through which the user can supply data for training or make predictions based on 
already trained data. 
3.1 SOFTWARE REQUIREMENTS 
The software requirements for this project are: 
• R – R is a free software programming language and environment 
for statistical computing and graphics. 
• Node.js – Node.js is a cross-platform runtime environment and 
library for running applications written in JavaScript outside the 
browser (for example, on the server). 
• NetBeans – NetBeans is an integrated development 
environment (IDE) primarily for developing with Java, but also with 
other languages, in particular PHP, C++, Node.js and HTML5. 
• RStudio – RStudio is a free and open source (FOSS) integrated 
development environment for R, a programming language for 
statistical computing and graphics. 
• Linux – Linux is a POSIX-compliant computer operating system 
(OS) assembled under the model of free and open source software. 
3.2 HARDWARE REQUIREMENTS
The hardware requirements define the minimum hardware that must 
be available to run the system: 
• A hardware system that can support the Linux operating system 
• 2 – 4 GB of RAM 
• Internet connectivity
CHAPTER 4 
SYSTEM DESIGN 
System design is the process of defining the architecture, components, 
modules, interfaces and data for a system to satisfy specified requirements. System 
design encompasses activities such as systems analysis, systems architecture and 
systems engineering. 
4.1 METHODOLOGY 
A software development methodology or system development methodology 
in software engineering is a framework that is used to structure, plan and control 
the process of developing a software system. 
This project consists of four distinct phases: 
• Data Collection 
• Modeling 
• Validation 
• Prediction 
4.2 ALGORITHM 
The system will use a Logistic Regression Classifier, which is a Supervised 
Machine Learning Algorithm. This algorithm will take student data as input and 
predict an outcome. Outcomes are binary, that is, either TRUE or 
FALSE. A TRUE value indicates that a student will drop out, while FALSE means 
the student will not. 
Since the algorithm returns only one of two possible outcomes it can 
also be called a binary/binomial classifier. 
4.2.1 SUPERVISED LEARNING 
Supervised learning is the machine-learning task of inferring a 
function from labeled training data. The training data consist of a set of 
training examples. Typically the training data for this project will consist of 
data about students based on features that will be defined later in this 
document. 
In supervised learning, each example is a pair consisting of an input 
object (typically a vector) and a desired output value (also called the 
supervisory signal). A supervised learning algorithm analyzes the training 
data and produces an inferred function, which can be used for mapping new 
examples. New examples are usually unlabeled data that we need to predict. 
An optimal scenario will allow for the algorithm to correctly determine the 
class labels for unseen instances. This requires the learning algorithm to 
generalize from the training data to unseen situations in a "reasonable" way. 
In order to solve a given problem of supervised learning, the system 
has to perform the following steps: 
1. Determine the type of training examples: The kind of data that is 
to be used as the training set needs to be determined first. In the case 
of handwriting analysis, for example, this might be a single 
handwritten character, an entire handwritten word, or an entire line 
of handwriting. 
2. Gather a training set: The training set needs to be representative 
of the real-world use of the function. Thus, a set of input objects is 
gathered and the corresponding outputs are also gathered, either from 
human experts or from measurements. 
3. Determine the input feature representation of the learned 
function: The accuracy of the learned function depends strongly on 
how the input object is represented. Typically, the input object is 
transformed into a feature vector, which contains a number of 
features that are descriptive of the object. The number of features 
should not be too large; but should contain enough information to 
accurately predict the output. 
4. Determine the learning algorithm: The correct learning algorithm 
that models the available data should be identified and applied. For 
example, the learning algorithm may be support vector machines or 
decision trees. 
5. Complete the design : Run the learning algorithm on the gathered 
training set. Some supervised learning algorithms require certain 
control parameters. These parameters may be adjusted by optimizing 
performance on a subset (called a validation set) of the training set, 
or via cross-validation. 
6. Evaluate the accuracy of the learned function: After parameter 
adjustment and learning, the performance of the resulting function 
should be measured on a test set that is separate from the training 
set. 
4.2.2 CLASSIFICATION 
In machine learning and statistics, classification is the problem of 
identifying to which of a set of categories (sub-populations) a new 
observation belongs, on the basis of a training set of data containing 
observations (or instances) whose category membership is known. The
individual observations are analyzed into a set of quantifiable properties, 
known as various explanatory variables, features, etc. These properties may 
variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), 
ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number 
of occurrences of a particular word in an email) or real-valued (e.g. a 
measurement of blood pressure). Some algorithms work only in terms of 
discrete data and require that real-valued or integer-valued data be 
discretized into groups (e.g. less than 5, between 5 and 10, or greater than 
10). An example would be assigning a given email into "spam" or "non-spam" 
classes or assigning a diagnosis to a given patient as described by 
observed characteristics of the patient (gender, blood pressure, presence or 
absence of certain symptoms, etc.). 
An algorithm that implements classification, especially in a concrete 
implementation, is known as a classifier. The term "classifier" sometimes 
also refers to the mathematical function, implemented by a classification 
algorithm, that maps input data to a category. 
In the terminology of machine learning, classification is considered 
an instance of supervised learning, i.e. learning where a training set of 
correctly identified observations is available. The corresponding 
unsupervised procedure is known as clustering or cluster analysis, and 
involves grouping data into categories based on some measure of inherent 
similarity (e.g. the distance between instances, considered as vectors in a 
multi-dimensional vector space). 
In statistics, where classification is often done with logistic 
regression or a similar procedure, the properties of observations are termed 
explanatory variables (or independent variables, regressors, etc.), and the
categories to be predicted are known as outcomes, which are considered to 
be possible values of the dependent variable. In machine learning, the 
observations are often known as instances, the explanatory variables are 
termed features (grouped into a feature vector), and the possible categories 
to be predicted are classes. There is also some argument over whether 
classification methods that do not involve a statistical model can be 
considered "statistical". 
4.2.3 LOGISTIC REGRESSION 
In statistics, logistic regression, or logit regression, is a type of 
probabilistic statistical classification model. It is used to predict the outcome 
of a categorical dependent variable (i.e., a class label) based on one or more 
predictor variables (features); that is, it is used in estimating the parameters 
of a qualitative response model. The probabilities describing the possible 
outcomes of a single trial are modeled, as a function of the explanatory 
(predictor) variables, using a logistic function. The term logistic regression is 
used to refer specifically to the problem in which the dependent variable is 
binary, that is, the number of available categories is two; problems with more 
than two categories are referred to as multinomial logistic regression. 
Logistic regression measures the relationship between a categorical 
dependent variable and one or more independent variables, which are 
usually (but not necessarily) continuous, by using probability scores as the 
predicted values of the dependent variable.
Fig 4.1 : Logistic Regression Curve 
The formula for Logistic Regression can be expressed as: 

F(x) = 1 / (1 + e^(−x)) 

Eq 4.1 : Logistic Regression Formula 
where: 
• F(x) is the output 
• x is the input 
• e is Euler’s number 
It must be noted that F(x) can only take values between 0 and 1 for 
any value of x in (−∞, ∞). Using the above equation we can define a threshold 
k ∈ (0, 1) such that all inputs with F(x) ≥ k are classified as true while those 
lesser are classified as false, or vice versa, thereby classifying the data into 
two distinct parts.
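As an illustrative sketch of the logistic function and the thresholding rule above (shown in JavaScript for brevity; the project itself implements the model in R):

```javascript
// Logistic (sigmoid) function: maps any real x into the interval (0, 1).
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// Classify an input by comparing F(x) against a threshold k in (0, 1).
function classify(x, k) {
  return sigmoid(x) >= k; // true = one class, false = the other
}

console.log(sigmoid(0));        // 0.5: the midpoint of the curve
console.log(classify(2, 0.5));  // true
console.log(classify(-2, 0.5)); // false
```

With k = 0.5 the decision boundary sits at x = 0, since F(0) = 0.5 exactly.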
4.3 DATA COLLECTION 
4.3.1 FEATURE DETECTION 
Based on the literature survey six features have been identified as 
major observable factors that can affect the final outcome regarding the 
education fulfillment of a student. 
The six features can be grouped into three categories: 
1. Personal Features 
2. Environmental Features 
3. School Features 
4.3.1.1 PERSONAL 
Personal features are those features that are based on the 
characteristics of the student or his/her parents, family background 
etc. The personal features that are being considered by the algorithm 
are: 
1. Gender: Values can be Male or Female 
2. Poverty: Values can be Yes or No 
3. Community: Values can be General, OBC, SC, ST 
4.3.1.2 ENVIRONMENTAL 
Environmental features are those features that are based on 
the student’s environment, locality, geography etc. The
environmental features that are being considered by the algorithm 
are: 
1. Rural: Values can be Yes or No 
4.3.1.3 SCHOOL 
School features are those features that are based on the 
characteristics of the school where the student studies. The school 
features that are being considered by the algorithm are: 
Pupil Teacher Ratio: Pupil–teacher ratio is the number of students 
who attend a school or university divided by the number of teachers 
in the institution. For example, a pupil–teacher ratio of 10:1 
indicates that there are 10 students for every one teacher. The term 
can also be reversed to create a teacher–pupil ratio. 
Student Classroom Ratio: Student – classroom ratio is the number 
of students per classroom in an education institution. For example, a 
student – classroom ratio of 40:1 indicates that there are 40 students 
for every classroom. 
1. Pupil Teacher Ratio: Values can be Low (1 Teacher : 
<30 Students), Medium (1 Teacher : 30 – 40 Students) and 
High (1 Teacher : 40+ Students) 
2. Student Classroom Ratio: Values can be Low (1 
Classroom : <30 Students), Medium (1 Classroom: 30 – 
40 Students) and High (1 Classroom: 40+ Students) 
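Before modeling, each categorical feature value must be mapped to a number. The project does this in generateVector.R; the sketch below shows one hypothetical encoding in JavaScript (the level lists match the values above, but the index-based scheme is an assumption, not the report's actual encoding):

```javascript
// Hypothetical numeric encoding of the six categorical features.
// Each value is mapped to its index in the corresponding level list.
const levels = {
  gender:    ["Male", "Female"],
  poverty:   ["No", "Yes"],
  community: ["General", "OBC", "SC", "ST"],
  rural:     ["No", "Yes"],
  ptr:       ["Low", "Medium", "High"],
  scr:       ["Low", "Medium", "High"]
};

// Turn one student record into a numeric feature vector.
function encode(student) {
  return Object.keys(levels).map(f => levels[f].indexOf(student[f]));
}

const vec = encode({ gender: "Female", poverty: "Yes", community: "SC",
                     rural: "Yes", ptr: "High", scr: "Medium" });
console.log(vec); // [1, 1, 2, 1, 2, 1]
```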
4.3.2 DATASET GENERATION
Based on statistics derived from the literature survey and the features 
mentioned above, the dataset for modeling is generated. The table given 
below summarizes the statistical findings compiled from the literature survey: 
Feature Value Distribution Dropout Chance 
Gender Male 52% 39% 
Gender Female 48% 41% 
Poverty Yes 22% 80% 
Poverty No 78% 27% 
Rural Yes 75% 45% 
Rural No 25% 20% 
Community General 30% 10% 
Community OBC 40% 48% 
Community SC 20% 64% 
Community ST 10% 69% 
PTR Low 20% 15% 
PTR Medium 30% 35% 
PTR High 50% 55% 
SCR Low 18% 22% 
SCR Medium 33% 25% 
SCR High 49% 60% 
Table 4.1 : Sample Dataset 
The above table shows the distribution of each feature in the student 
population and the corresponding dropout chance for each feature value within 
that population. For example, out of 100 students there are 52 male students and 
48 female students, and the chance that a female student drops out is 41%. 
The overall dropout percentage was found to be 40%; that is, 40% of the 
student population drops out of school. Using the above statistics a 
dataset can be generated for further analysis. 
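A minimal sketch of how such a synthetic row could be drawn, using only the Gender feature from the table (JavaScript for illustration; the project's actual generator, generateDataset.R, combines all six features and is not reproduced here):

```javascript
// Draw a gender from the population distribution: 52% Male, 48% Female.
function sampleGender(rand) {
  return rand() < 0.52 ? "Male" : "Female";
}

// Decide the dropout label from that value's dropout chance (39% / 41%).
function sampleDropout(gender, rand) {
  const chance = gender === "Male" ? 0.39 : 0.41;
  return rand() < chance ? 1 : 0; // 1 = drops out, 0 = completes education
}

// Generate n synthetic (gender, dropout) rows.
function generateRows(n, rand = Math.random) {
  const rows = [];
  for (let i = 0; i < n; i++) {
    const g = sampleGender(rand);
    rows.push({ gender: g, dropout: sampleDropout(g, rand) });
  }
  return rows;
}

const rows = generateRows(1000);
console.log(rows.length, rows[0]);
```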
Fig 4.2 : Dataset Generation 
4.4 MODELING 
Data modeling in software engineering is the process of creating a data 
model for an information system by applying formal data modeling techniques.
Fig 4.3 : Modeling 
4.4.1 HYPOTHESIS DEVELOPMENT 
A hypothesis (plural: hypotheses) is a proposed explanation for a 
phenomenon. A working hypothesis is a provisionally accepted hypothesis 
proposed for further research. In the context of Machine Learning the 
hypothesis is also called the Learned Function. 
In the context of this project the learned function is a working 
hypothesis that tries to explain the training dataset of students. Based on the 
observations/outcomes of the training dataset the learning algorithm will 
develop weightages for each of the features that have been selected. These 
weightages will then be used for predicting outcomes on a future dataset. 
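The role of the weightages can be sketched as follows: the weighted sum of the features, plus an intercept, is passed through the logistic function to give a probability (JavaScript for illustration; the weights below are made up, not learned values from the project):

```javascript
// Logistic (sigmoid) function.
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// theta[0] is the intercept; theta[i + 1] is the weightage of feature i.
// Returns the predicted probability of the positive outcome.
function hypothesis(theta, features) {
  let z = theta[0];
  for (let i = 0; i < features.length; i++) z += theta[i + 1] * features[i];
  return sigmoid(z);
}

const theta = [-1.0, 0.8, 0.5];         // hypothetical learned weightages
console.log(hypothesis(theta, [1, 1])); // probability for one student
```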
4.4.2 GENERALIZATION ERROR 
The generalization error of a machine-learning model is a function 
that measures how well a learning machine generalizes to unseen data. It is
measured as the distance between the error on the training set and the test 
set and is averaged over the entire set of possible training data that can be 
generated after each iteration of the learning process. It has this name 
because this function indicates the capacity of a machine that learns with 
the specified algorithm to infer a rule (or generalize). 
The theoretical model assumes a probability distribution of the 
examples, and a function giving the exact target. The model can also 
include noise in the examples (in the input and/or target output). The 
generalization error is usually defined as the expected value of the square of 
the difference between the learned function and the exact target (mean-squared 
error). 
The performance of a machine learning algorithm is measured by 
plots of the generalization error values through the learning process; these 
plots are called learning curves. 
4.5 VALIDATION 
In statistics, model validation is the process of deciding whether the 
numerical results quantifying hypothesized relationships between variables, 
obtained from machine learning analysis, are in fact acceptable as descriptions of 
the data. 
The validation process can involve analyzing the goodness of fit of the 
model, analyzing whether the model residuals are random, and checking whether 
the model's predictive performance deteriorates substantially when applied to data 
that were not used in model estimation.
4.5.1 DATASET PARTITIONING 
In model validation, for assessing the results of statistical analysis, the 
dataset is generally partitioned into two separate datasets. They are: 
1. Training Dataset 
2. Cross-Validation (CV) Dataset 
The model is typically trained on the training dataset and then tested 
on the cross-validation dataset, which contains examples that are 
independent of the training data. The actual training/cross-validation split 
is up to the person doing the analysis; usually a split of 80-20% 
(training-CV) or 70-30% is preferred so that there are enough 
examples for training the model. 
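The 80-20 partition described above can be sketched in a few lines (JavaScript for illustration; the project performs the split in R, and in practice the rows should be shuffled first, as the project's randomizeDataset.R does):

```javascript
// Split a dataset into training and cross-validation subsets.
// trainFraction = 0.8 gives the 80-20 split mentioned above.
function partition(rows, trainFraction = 0.8) {
  const cut = Math.floor(rows.length * trainFraction);
  return { train: rows.slice(0, cut), cv: rows.slice(cut) };
}

const data = Array.from({ length: 100 }, (_, i) => i);
const { train, cv } = partition(data);
console.log(train.length, cv.length); // 80 20
```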
Fig 4.4 : Dataset Partitioning 
4.5.1.1 TRAINING DATASET
A training set is a set of data used in various areas of 
information science to discover potentially predictive relationships. 
Training sets are used in artificial intelligence, machine learning, 
genetic programming, intelligent systems, and statistics. In all these 
fields, a training set has much the same role and is often used in 
conjunction with a test set. 
Fig 4.5 : Developing Multiple Models 
4.5.1.2 CV DATASET 
Cross-validation, sometimes called rotation estimation, is a 
model validation technique for assessing how the results of a 
statistical analysis will generalize to an independent data set. It is 
mainly used in settings where the goal is prediction, and one wants 
to estimate how accurately a predictive model will perform in 
practice. In a prediction problem, a model is usually given a dataset 
of known data on which training is run (training dataset), and a 
dataset of unknown data (or first seen data) against which the model 
is tested (testing dataset). The goal of cross validation is to define a
dataset to "test" the model in the training phase (i.e., the validation 
dataset), in order to limit problems like overfitting, give an insight 
on how the model will generalize to an independent data set (i.e., an 
unknown dataset, for instance from a real problem), etc. 
One round of cross-validation involves partitioning a sample 
of data into complementary subsets, performing the analysis on one 
subset (called the training set), and validating the analysis on the 
other subset (called the validation set or testing set). To reduce 
variability, multiple rounds of cross-validation are performed using 
different partitions, and the validation results are averaged over the 
rounds. 
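The idea of averaging validation results over multiple rounds can be illustrated with a small k-fold sketch (JavaScript for illustration only; `evaluate` is a hypothetical stand-in for fitting a model on the training subset and returning its error on the held-out subset):

```javascript
// One full cross-validation: split the rows into k folds, hold out each
// fold in turn, and average the validation errors over the k rounds.
function crossValidate(rows, k, evaluate) {
  const foldSize = Math.floor(rows.length / k);
  let total = 0;
  for (let i = 0; i < k; i++) {
    const test = rows.slice(i * foldSize, (i + 1) * foldSize);
    const train = rows.slice(0, i * foldSize)
                      .concat(rows.slice((i + 1) * foldSize));
    total += evaluate(train, test);
  }
  return total / k; // error averaged over the k rounds
}

// Toy evaluator: the "error" is just the mean of the held-out values.
const avg = crossValidate([1, 2, 3, 4, 5, 6], 3, (tr, te) =>
  te.reduce((s, v) => s + v, 0) / te.length);
console.log(avg); // 3.5
```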
Fig 4.6 : Calculating Cross-Validation Errors 
4.5.2 COST FUNCTION 
In mathematical optimization, statistics, decision theory and machine 
learning, a cost function or loss function is a function that maps an event or 
values of one or more variables onto a real number intuitively representing
some "cost" associated with the event. An optimization problem seeks to 
minimize a loss function. An objective function is either a loss function or 
its negative (sometimes called a reward function or a utility function), in 
which case it is to be maximized. 
In statistics, typically a loss function is used for parameter 
estimation, and the event in question is some function of the difference 
between estimated and true values for an instance of data. 
The cost function is expressed as:
J(θ) = (1 / 2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Eq 4.2 : Cost Function or Error Function
where:
 J is the cost
 m is the number of training examples
 h(x) is the hypothesis
 y is the actual value or the result vector
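Eq 4.2 translates directly into code. The following is an illustrative Python version (the project's own R implementation appears in Section 5.1.1):

```python
def cost_function(actual, predicted):
    """Eq 4.2: J = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(actual)
    return sum((h - y) ** 2 for h, y in zip(predicted, actual)) / (2 * m)

print(cost_function([1, 0, 1], [1, 0, 1]))  # perfect predictions -> 0.0
print(cost_function([1, 0], [0, 0]))        # one miss over m=2 -> 0.25
```

A perfect hypothesis yields zero cost; every error contributes its squared residual, scaled by 1/2m.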
4.5.3 ERROR METRICS 
Error metrics are systematic benchmarking measures used to calculate the accuracy or effectiveness of the system. The cost function described above is a good example of an error metric. The following error metrics are used to validate the generated models and to choose the best among them:
 Training and CV Error 
 F1 Score 
 W-Score
4.5.3.1 TRAINING AND CV ERROR 
Training error is the cost-function error of the trained model on the training set. That is, after training, the training dataset is supplied again to the model as input to make predictions. These predictions are compared against the actual outcomes in the dataset, and the error between the two is calculated using the cost function formula. The resulting value is the training error.
The cross-validation error is similar to the training error, except that it is calculated on the cross-validation set. The benefit here is that the cross-validation set is new data, containing none of the examples from the training set, and so it gives a better estimate of the accuracy of the system. Ideally the system's cross-validation error should be similar to the training error, in which case the model is a good estimate of the underlying data.
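Computing the two errors with one shared cost function can be sketched as follows. This is an illustrative Python sketch; `fit`, `predict`, and `cost` are hypothetical callables, here instantiated with a toy mean-predictor:

```python
def train_and_cv_error(fit, predict, train_set, cv_set, cost):
    """Fit on the training set only, then apply the same cost function
    to both the training set and the cross-validation set."""
    params = fit(train_set)
    train_err = cost([predict(params, x) for x, _ in train_set],
                     [y for _, y in train_set])
    cv_err = cost([predict(params, x) for x, _ in cv_set],
                  [y for _, y in cv_set])
    return train_err, cv_err

# Toy model that always predicts the mean label seen during training
fit = lambda data: sum(y for _, y in data) / len(data)
predict = lambda mean, x: mean
cost = lambda hs, ys: sum((h - y) ** 2 for h, y in zip(hs, ys)) / (2 * len(ys))
tr_err, cv_err = train_and_cv_error(fit, predict,
                                    [(0, 1.0), (0, 3.0)], [(0, 2.0)], cost)
```

The key point the sketch illustrates is that the model never sees the cross-validation examples during fitting, so `cv_err` estimates performance on genuinely new data.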
4.5.3.2 F1 SCORE
In statistical analysis of binary classification, the F1 score 
(also F-score or F-measure) is a measure of a test's accuracy. It 
considers both the precision p and the recall r of the test to compute 
the score: p is the number of correct results divided by the number of 
all returned results and r is the number of correct results divided by
the number of results that should have been returned. The F1 score 
can be interpreted as a weighted average of the precision and recall, 
where an F1 score reaches its best value at 1 and worst score at 0. 
F1 = 2 · Precision · Recall / (Precision + Recall)
Eq 4.3 : F1 Score
Precision = True Positives / (True Positives + False Positives)
Eq 4.4 : Precision
Recall = True Positives / (True Positives + False Negatives)
Eq 4.5 : Recall
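Eqs 4.3 to 4.5 can be computed directly from boolean label vectors, as in this illustrative Python sketch:

```python
def f1_score(actual, predicted):
    """Eqs 4.3-4.5 for boolean label vectors."""
    tp = sum(a and p for a, p in zip(actual, predicted))      # correct positives
    fp = sum(p and not a for a, p in zip(actual, predicted))  # spurious positives
    fn = sum(a and not p for a, p in zip(actual, predicted))  # missed positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 2 true positives, 1 false positive, 1 false negative -> F1 = 2/3
score = f1_score([True, True, False, True], [True, True, True, False])
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well by being good at only one of the two.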
4.5.3.3 W-SCORE
The W-Score combines the training and cross-validation errors into a single measure used to choose the best model; the model with the lowest W-Score is selected. The W-Score is expressed as:
W = (1 − f1) · (Σ Train Error / N_T) · (Σ CV Error / N_CV)
Eq 4.6 : W-Score
where:
 W – W-Score
 f1 – F1 Score
 N_T – Number of Training Examples
 N_CV – Number of Cross-Validation Examples
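Eq 4.6 can be sketched as below. This is an illustrative Python version; following the implementation in init.R, the sums of errors are divided by the number of error terms being averaged:

```python
def w_score(f1, train_errors, cv_errors):
    """Eq 4.6: W = (1 - f1) * (sum of train errors / N_T)
                            * (sum of CV errors / N_CV)."""
    return ((1 - f1)
            * sum(train_errors) / len(train_errors)
            * sum(cv_errors) / len(cv_errors))

# A high F1 and low average errors both drive W toward zero
w = w_score(0.9, [0.2, 0.4], [0.3, 0.5])  # ~0.012
```

The multiplicative form means a model is only rewarded when all three quantities are good at once: a high F1 shrinks the first factor, and low average errors shrink the other two.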
4.5.4 LEARNING CURVES 
Fig 4.7 : Single Subject Learning
Fig 4.8 : Learning from Experience
Fig 4.9 : Score & Learning Time vs Experience
A learning curve is a graphical representation of the increase of learning 
(vertical axis) with experience (horizontal axis). Although the curve for a single 
subject may be erratic (Fig 4.7), when a large number of trials are averaged, a 
smooth curve results, which can be described with a mathematical function (Fig 
4.8). Depending on the metric used for learning (or proficiency) the curve can 
either rise or fall with experience (Fig 4.9). 
Within the context of the project the horizontal axis will be training 
examples, which is basically derived from experience, and the vertical axis is the 
cost function error. Ideally the cost function error should decrease with increase in 
training examples. 
There are, however, two types of errors: the training error and the cross-validation error. As training examples are added, the training error rises gradually, because the model must explain a more diverse spectrum of examples without overfitting any of them; this rise should not be exponential. An efficient model should also perform just as well on new data as it does on the training dataset, so the cross-validation error must decrease as training examples increase.
Thus the ideal model shows a small increase in training error and a decrease in cross-validation error as training examples increase, with the two errors converging as shown in Fig 4.10.
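The behaviour described above can be sketched as follows. This illustrative Python sketch grows the training set and records both errors at each size, using a toy mean-predictor whose training error rises slightly while its held-out error falls, mirroring the ideal curves:

```python
def learning_curve(data, sizes, fit, cost):
    """For each training-set size m, fit on the first m examples and
    score the cost on both the training slice and the held-out rest."""
    train_errs, cv_errs = [], []
    for m in sizes:
        train, held_out = data[:m], data[m:]
        params = fit(train)
        train_errs.append(cost(params, train))
        cv_errs.append(cost(params, held_out))
    return train_errs, cv_errs

# Toy mean predictor and the halved mean-squared-error cost of Eq 4.2
fit = lambda d: sum(d) / len(d)
cost = lambda mean, d: sum((mean - y) ** 2 for y in d) / (2 * len(d))
tr, cv = learning_curve([1.0, 2.0, 3.0, 2.0, 2.0, 2.0], [2, 4], fit, cost)
```

With this toy data the training error goes from 0.125 to 0.25 while the held-out error drops from 0.375 to 0.0, the convergence pattern an ideal model exhibits.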
Fig 4.10 : Training & Cross – Validation Error Convergence 
Fig 4.11 : Choosing the Best Model 
4.6 PREDICTION 
Prediction is the final step in the process. After selecting the best model that fits the given dataset, the model can be put to use on actual, real-world unlabeled data; that is, it can be used to make predictions where the outcomes are not known. The prediction process begins with the algorithm being supplied unlabeled student data, from which it predicts an outcome: whether or not the student will drop out.
Fig 4.12 : Prediction
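The final thresholding step, mirroring what prediction.R does with the chosen cut-off z, can be sketched in this illustrative Python snippet:

```python
def predict_dropout(probability, z):
    """Flag a student as a likely dropout when the model's predicted
    probability reaches the threshold z chosen during model selection."""
    return probability >= z

print(predict_dropout(0.72, 0.6))  # True  -> student predicted to drop out
print(predict_dropout(0.40, 0.6))  # False -> student predicted to stay
```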
CHAPTER 5 
IMPLEMENTATION 
5.1 R 
5.1.1 COSTFUNCTION.R
costFunction <- function(dataset, prediction){
  dataset <- as.numeric(dataset);
  prediction <- as.numeric(prediction);
  m <- length(dataset);
  # Eq 4.2: J = (1 / 2m) * sum((h - y)^2)
  J <- 1 / (2 * m) * sum((dataset - prediction) ^ 2);
  return(J);
}
5.1.2 F1SCORE.R 
f1Score <- function(data, prediction){
  data <- as.numeric(data);
  prediction <- as.numeric(prediction);
  # True positives are actual positives that were also predicted positive
  true_positives <- sum(data & prediction);
  false_positives <- sum(prediction & !data);
  false_negatives <- sum(!prediction & data);
  precision <- true_positives / (true_positives + false_positives);
  recall <- true_positives / (true_positives + false_negatives);
  return(as.numeric(2 * precision * recall / (precision + recall)));
}
5.1.3 GENERATEDATASET.R 
generateDataset <- function(n, dropout_percentage){ 
source('generateVector.R'); 
source('percRank.R') 
#Gender List 
gender_list <- list(data = factor(c("Male", "Female")), 
dist = list(Male = 0.52, Female = 0.48), 
w = list(Male = 0.39, Female = 0.41)); 
#Poverty List 
poverty_list <- list(data = factor(c("Yes", "No")), 
dist = list(Yes = 0.22, No = 0.78), 
w = list(Yes = 0.80, No = 0.27)); 
#Community List 
community_list <- list(data = factor(c("General", "OBC", "SC", "ST")), 
dist = list(General = 0.30, OBC = 0.40, SC = 0.20, ST = 0.10), 
w = list(General = 0.10, OBC = 0.48, SC = 0.64, ST = 0.69)); 
#Rural List 
rural_list <- list(data = factor(c("Yes", "No")), 
dist = list(Yes = 0.75, No = 0.25), 
w = list(Yes = 0.45, No = 0.20));
#Pupil Teacher Ratio List 
ptr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE), 
dist = list(Low = 0.20, Medium = 0.30, High = 0.50), 
w = list(Low = 0.15, Medium = 0.35, High = 0.55)); 
#Student Classroom Ratio List 
scr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE), 
dist = list(Low = 0.18, Medium = 0.33, High = 0.49), 
w = list(Low = 0.22, Medium = 0.25, High = 0.60)); 
Gender <- generateVector(n, gender_list); 
Poverty <- generateVector(n, poverty_list); 
Community <- generateVector(n, community_list); 
Rural <- generateVector(n, rural_list); 
PTR <- generateVector(n, ptr_list); 
SCR <- generateVector(n, scr_list); 
getW <- function(list, vector, index){ 
value <- as.character(vector[index]); 
return(as.numeric(list$w[value])); 
} 
weightage_vector <- vector('numeric'); 
for(i in 1:n){ 
gender_weightage <- getW(gender_list, Gender, i); 
poverty_weightage <- getW(poverty_list, Poverty, i); 
community_weightage <- getW(community_list, Community, i);
rural_weightage <- getW(rural_list, Rural, i); 
weightage_vector[i] <- 
gender_weightage + 
poverty_weightage + 
community_weightage + 
rural_weightage + 
getW(ptr_list, PTR, i) + 
getW(scr_list, SCR, i) 
; 
} 
w_rank <- percRank(weightage_vector); 
Dropout <- w_rank >= (1 - dropout_percentage); 
data <- data.frame(Gender, Poverty, Community, Rural, PTR, SCR, 
Dropout); 
write.csv(file="data.csv", x=data) 
} 
5.1.4 GENERATEVECTOR.R 
generateVector <- function(n, list){ 
dist <- list$dist; 
p <- c(length(list$data)); 
#Generate probability series 
k = 1; 
for(i in dist){ 
if(k == 1){
p[k] = i; 
} 
else{ 
p[k] = i + p[k - 1]; 
} 
k = k + 1; 
} 
#Get index of value that will be added to the vector 
getIndex <- function(p, r){ 
k = 1; 
for(i in p){ 
if(r <= i){ 
break; 
} 
k = k + 1; 
} 
return(k); 
} 
#Generate Vector 
result <- factor(list$data); 
for(i in 1:n) { 
index <- getIndex(p, runif(1)); 
value <- list$data[index]; 
result[i] = value; 
} 
return(result);
} 
5.1.5 INIT.R 
setwd('/Users/ramathreya/Sites/foss-project/r'); 
source('generateDataset.R'); 
source('randomizeDataset.R'); 
source('predictor.R'); 
source('costFunction.R'); 
source('f1Score.R'); 
source('learningCurve.R'); 
source('plotLearningCurve.R'); 
partition <- 0.7; 
start <- 100; 
interval <- 500; 
dataset <- read.csv(file="input.csv"); 
n <- nrow(dataset); 
png('../public/plot.png'); 
opar <- par(no.readonly=TRUE) 
par(mfrow=c(3, 3)); 
z <- c();
train <- c(); 
cv <- c(); 
f1 <- c(); 
seq_range <- seq(0.1, 0.9, 0.1); 
for(i in seq_range){ 
curves <- learningCurve(dataset, start, n, interval, partition, "Dropout", 
predictor, i); 
plotLearningCurve(curves$m, curves$train, curves$test, c("Plot when Z is 
", i), "Training Examples", "Error"); 
train_last <- tail(curves$train, 1); 
cv_last <- tail(curves$test, 1); 
z <- c(z, i); 
cv <- c(cv, sum(abs(curves$test)) / length(curves$test)); 
train <- c(train, sum(abs(curves$train)) / length(curves$train)); 
f1 <- c(f1, sum(abs(curves$f1)) / length(curves$f1)); 
} 
w <- (1-f1) * train * cv; 
analysis <- data.frame(z, train, cv, f1, w); 
min_index <- which(w==min(w)); 
write.csv(seq_range[min_index], file="out.z") 
dev.off(); 
5.1.6 LEARNINGCURVE.R
learningCurve <- function(dataset, start, end, interval, partition, column, 
predictor, z){ 
train_plot <- c(); 
test_plot <- c(); 
x <- c(); 
f1 <- c(); 
for(i in seq(start, end, interval)){ 
m <- i * partition; 
training_dataset <- dataset[1:m, ]; 
test_dataset <- dataset[(m+1):i, ]; 
train_actual <- unlist(training_dataset[column]); 
test_actual <- unlist(test_dataset[column]); 
predictor_formula <- predictor(training_dataset); 
train_pred <- predict(predictor_formula, type="response", 
training_dataset) >= z; 
test_pred <- predict(predictor_formula, type="response", test_dataset) >= 
z; 
train_cost <- costFunction(train_actual, train_pred); 
test_cost <- costFunction(test_actual, test_pred); 
f1 <- c(f1, f1Score(test_actual, test_pred)); 
x <- c(x, i);
train_plot <- c(train_plot, train_cost); 
test_plot <- c(test_plot, test_cost); 
} 
return(list(train=train_plot, test=test_plot, m=x, f1=f1)); 
} 
5.1.7 MYSQL.R 
library(RMySQL) 
db = dbConnect(MySQL(), user='root', password='', dbname='mobile_crm', 
host='localhost') 
5.1.8 PERCRANK.R 
percRank <- function(x) trunc(rank(x)) / length(x) 
5.1.9 PLOTLEARNINGCURVE.R 
plotLearningCurve <- function(m, train_plot, test_plot, title, xlab, ylab, 
rnge=range(0, 0.15)){ 
plot(m, train_plot, type="l", col="red", xlab=NA, ylab=NA, ylim=rnge); 
par(new=TRUE); 
plot(m, test_plot, type="l", col="green", xlab=NA, ylab=NA, ylim=rnge, 
axes=FALSE); 
par(new=TRUE); 
legend('topright', c("Training", "C.V"), 
bty="n", lty=1, lwd=0.5, cex=0.5, 
col=c('red', 'green'));
title(title, 
xlab=xlab, 
ylab=ylab); 
} 
5.1.10 PREDICTION.R 
setwd('/Users/ramathreya/Sites/foss-project/r'); 
source('predictor.R'); 
dataset <- read.csv(file="input.csv"); 
z <- read.csv(file="out.z") 
z <- z[1, 'x']; 
predictor_formula <- predictor(dataset); 
input <- read.csv('predict-input.csv'); 
dataset <- rbind(dataset, input) 
l <- nrow(dataset); 
prediction <- predict(predictor_formula, newdata=dataset, 
type="response"); 
prediction <- (prediction[l] >= z); 
fileConn<-file("output") 
writeLines(c(toString(prediction)), fileConn) 
close(fileConn) 
5.1.11 PREDICTOR.R
predictor <- function(dataset){
  # Logistic regression on all six predictors; listing the factors with
  # `+` (rather than cbind) lets glm dummy-code them properly
  model <- glm(
    formula = Dropout ~ Gender + Poverty + Community + Rural + PTR + SCR,
    family = binomial,
    data = dataset);
  return(model);
}
5.1.12 RANDOMIZEDATASET.R 
randomizeDataset <- function(dataset){
  # Shuffle the rows by indexing with a random permutation
  return(dataset[sample(nrow(dataset)), ]);
}
5.2 NODE.JS 
5.2.1 APP.JS 
; 
var express = require('express'); 
var http = require('http'); 
var path = require('path');
var bodyParser = require('body-parser'); 
app = express(); 
app.configure(function() { 
app.set('views', __dirname + '/app/views'); 
app.set('view engine', 'jade'); 
app.use(express.static(path.join(__dirname, 'public'))); 
app.use(express.cookieParser()); 
app.use(express.methodOverride()); 
app.use(express.session({secret: 'keyboard cat'})); 
app.use(bodyParser.json()); 
app.use(express.json()); // to support JSON-encoded bodies 
app.use(express.urlencoded()); // to support URL-encoded bodies 
app.locals.basedir = path.join(__dirname, '/app/views'); 
app.use(app.router); 
app.basepath = __dirname; 
require('./routes')(); 
http.createServer(app).listen(3000, function() { 
console.log('Server Started'); 
}); 
}); 
5.2.2 PACKAGE.JSON 
{ 
"name": "foss-project", 
"scripts": {
"start": "node app" 
}, 
"dependencies": { 
"body-parser": "^1.5.2", 
"connect": "*", 
"express": "3.4.0", 
"formidable": "1.0.15", 
"jade": "*", 
"request": "2.x" 
}, 
"engines": { 
"node": "0.10.x", 
"npm": "1.2.x" 
} 
} 
5.2.3 ROUTES.JS 
; 
var formidable = require('formidable'), 
util = require('util'), 
fs = require('fs'), 
sys = require('sys'), 
exec = require('child_process').exec; 
module.exports = function() { 
app.get('/', function(req, res) { 
res.render('index'); 
});
app.post('/upload', function(req, res) { 
// parse a file upload 
var form = new formidable.IncomingForm(); 
form.parse(req, function(err, fields, files) { 
//Write to CSV file within r folder 
fs.readFile(files.upload.path, function(err, data) { 
var newPath = __dirname + "/r/input.csv"; 
fs.writeFile(newPath, data, function(err) { 
function puts(error, stdout, stderr) { 
res.render('upload'); 
} 
exec("Rscript r/init.R", puts); 
}); 
}); 
}); 
return; 
}); 
app.get('/predict', function(req, res) { 
res.render('predict'); 
}); 
app.post('/predict', function(req, res) { 
var json = JSON.parse(req.body.json); 
var key_string = '"",', value_string = '"",';
for(var i in json){ 
key_string += json[i].name + ','; 
value_string += json[i].value + ','; 
} 
key_string += 'Dropout';
value_string += '""';
// '\n' terminates the header row and the value row of the CSV
var string = key_string + '\n' + value_string + '\n';
fs.writeFile('r/predict-input.csv', string, function(err) { 
function puts(error, stdout, stderr) { 
fs.readFile('r/output', 'utf-8', function(err, data) { 
res.end(data); 
}); 
} 
exec("Rscript r/prediction.R", puts); 
}); 
}); 
}; 
5.2.4 INDEX.JADE 
doctype html 
html 
head 
title Dashboard 
meta(charset="UTF-8") 
meta(content='width=device-width, initial-scale=1, maximum-scale=1, 
user-scalable=no' name='viewport')
link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css") 
link(rel="stylesheet",href="css/font-awesome. 
min.css",type="text/css") 
link(rel="stylesheet",href="css/ionicons.min.css",type="text/css") 
link(rel="stylesheet",href="css/morris/morris.css",type="text/css") 
link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap- 
1.2.2.css",type="text/css") 
link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3- 
wysihtml5.min.css",type="text/css") 
link(rel="stylesheet",href="css/AdminLTE.css",type="text/css") 
body(class="skin-blue") 
header(class="header") 
a(href="/",class="logo") FOSS Project 
nav(class="navbar navbar-static-top",role="navigation") 
a(href="#",class="navbar-btn sidebar-toggle",data-toggle=" 
offcanvas",role="button") 
span(class="sr-only") Toggle Navigation 
span(class="icon-bar") 
span(class="icon-bar") 
span(class="icon-bar") 
div(class="wrapper row-offcanvas row-offcanvas-left") 
aside(class="left-side sidebar-offcanvas") 
section(class="sidebar") 
ul(class="sidebar-menu") 
li 
a(href="/") 
i(class="fa fa-upload") 
span Upload
li 
a(href="/predict") 
i(class="fa fa-search") 
span Predict 
aside(class="right-side") 
section(class="content-header") 
h1 Upload 
section 
div(class="box box-primary") 
form(action="/upload",enctype="multipart/form-data", 
method="post",role="form") 
div(class="box-body") 
div(class="form-group") 
input(type="file",name="upload",multiple="multiple") 
div(class="box-footer") 
button(type="submit",class="btn btn-primary") 
Upload 
script(src="js/jquery.js") 
script(src="js/bootstrap.min.js") 
script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js") 
script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en. 
js") 
script(src="js/AdminLTE/app.js") 
5.2.5 PREDICT.JADE
doctype html 
html 
head 
title Dashboard 
meta(charset="UTF-8") 
meta(content='width=device-width, initial-scale=1, maximum-scale=1, 
user-scalable=no' name='viewport') 
link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css") 
link(rel="stylesheet",href="css/font-awesome. 
min.css",type="text/css") 
link(rel="stylesheet",href="css/ionicons.min.css",type="text/css") 
link(rel="stylesheet",href="css/morris/morris.css",type="text/css") 
link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap- 
1.2.2.css",type="text/css") 
link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3- 
wysihtml5.min.css",type="text/css") 
link(rel="stylesheet",href="css/AdminLTE.css",type="text/css") 
body(class="skin-blue") 
header(class="header") 
a(href="/",class="logo") FOSS Project 
nav(class="navbar navbar-static-top",role="navigation") 
a(href="#",class="navbar-btn sidebar-toggle",data-toggle=" 
offcanvas",role="button") 
span(class="sr-only") Toggle Navigation 
span(class="icon-bar") 
span(class="icon-bar") 
span(class="icon-bar") 
div(class="wrapper row-offcanvas row-offcanvas-left")
aside(class="left-side sidebar-offcanvas") 
section(class="sidebar") 
ul(class="sidebar-menu") 
li 
a(href="/") 
i(class="fa fa-upload") 
span Upload 
li 
a(href="/predict") 
i(class="fa fa-search") 
span Predict 
aside(class="right-side") 
section(class="content-header") 
h1 Predict 
section 
div(class="box box-primary") 
form(action="#",enctype="multipart/form-data", 
method="post",role="form",id="form") 
div(class="box-body") 
div(class="form-group col-md-2") 
label Gender 
select(class="form-control",name="Gender") 
option(value="Male") Male 
option(value="Female") Female 
div(class="form-group col-md-2") 
label Poverty 
select(class="form-control",name="Poverty") 
option(value="Yes") Yes 
option(value="No") No
div(class="form-group col-md-2") 
label Community 
select(class="form-control",name="Community") 
option(value="General") General 
option(value="OBC") OBC 
option(value="SC") SC 
option(value="ST") ST 
div(class="form-group col-md-2") 
label Rural 
select(class="form-control",name="Rural") 
option(value="Yes") Yes 
option(value="No") No 
div(class="form-group col-md-2") 
label PTR 
select(class="form-control",name="PTR") 
option(value="Low") Low 
option(value="Medium") Medium 
option(value="High") High 
div(class="form-group col-md-2") 
label SCR 
select(class="form-control",name="SCR") 
option(value="Low") Low 
option(value="Medium") Medium 
option(value="High") High 
div(class="box-footer",style="margin-left: 5px;") 
button(type="button",class="btn btn-primary", 
id="submit") Predict 
label(id="outcome",style="margin-left: 10px;")
script(src="js/jquery.js") 
script(src="js/bootstrap.min.js") 
script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js") 
script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en. 
js") 
script(src="js/AdminLTE/app.js") 
script(type="text/javascript"). 
$(document).ready(function(){ 
$('#submit').click(function(){ 
var json = JSON.stringify($('#form').serializeArray()); 
$.ajax({ 
url: '/predict', 
method: 'post', 
data: { 
json: json 
}, 
success: function(response){ 
var label = $('#outcome'); 
if(response.indexOf("TRUE") >= 0){ 
label.css('color', 'red'); 
label.html('Student will Dropout'); 
} 
else{ 
label.css('color', 'green'); 
label.html('Student will Not Dropout'); 
} 
} 
}); 
});
}); 
5.2.6 UPLOAD.JADE 
doctype html 
html 
head 
title Dashboard 
meta(charset="UTF-8") 
meta(content='width=device-width, initial-scale=1, maximum-scale=1, 
user-scalable=no' name='viewport') 
link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css") 
link(rel="stylesheet",href="css/font-awesome. 
min.css",type="text/css") 
link(rel="stylesheet",href="css/ionicons.min.css",type="text/css") 
link(rel="stylesheet",href="css/morris/morris.css",type="text/css") 
link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap- 
1.2.2.css",type="text/css") 
link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3- 
wysihtml5.min.css",type="text/css") 
link(rel="stylesheet",href="css/AdminLTE.css",type="text/css") 
body(class="skin-blue") 
header(class="header") 
a(href="/",class="logo") FOSS Project 
nav(class="navbar navbar-static-top",role="navigation") 
a(href="#",class="navbar-btn sidebar-toggle",data-toggle=" 
offcanvas",role="button") 
span(class="sr-only") Toggle Navigation
span(class="icon-bar") 
span(class="icon-bar") 
span(class="icon-bar") 
div(class="wrapper row-offcanvas row-offcanvas-left") 
aside(class="left-side sidebar-offcanvas") 
section(class="sidebar") 
ul(class="sidebar-menu") 
li 
a(href="/") 
i(class="fa fa-upload") 
span Upload 
li 
a(href="/predict") 
i(class="fa fa-search") 
span Predict 
aside(class="right-side") 
section(class="content-header") 
h1 Learning Curves 
section 
div(class="box box-primary") 
iframe(src="plot.png",style="width: 600px; height: 
500px;",frameborder="0") 
script(src="js/jquery.js") 
script(src="js/bootstrap.min.js") 
script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js") 
script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en. 
js") 
script(src="js/AdminLTE/app.js")
CHAPTER 6 
RESULTS 
6.1 DATASET UPLOAD
6.2 UPLOAD RESULT 
Fig 6.1 : Upload Result 
6.3 PREDICTION 
Fig 6.2 : Prediction Screen
Fig 6.3 : Predicting Student will not Dropout 
Fig 6.4 : Predicting Student will Dropout
CHAPTER 7 
CONCLUSIONS 
The advent of Information Technology and the Internet has led to vast amounts of data being gathered and stored in multiple formats by multiple sources. Big corporations as well as Government Agencies are therefore attempting to tap into these vast troves of data to make better decisions and create efficient processes. Techniques such as Machine Learning and Neural Networks, commonly grouped under the term Big Data, are revolutionizing the way we analyze information and are adding real value.
This project was inspired by such technologies. The aim was to create an 
objective mechanism for solving the dropout problem that could be used for policy 
making. This algorithm could provide an objective solution by identifying 
vulnerable students who truly need help and thereby improve retention and 
completion rates in schools. 
Personally, it was a great opportunity for me to discover an area of programming that I had wanted to learn for some time. At the same time, getting the chance to solve a real-world problem that is vital to our society made it all the more worthwhile. I humbly admit that the algorithm developed is in no way perfect, but it was a determined attempt on my part to prove what is possible. Hopefully those who come after me will take this up and extend it to the point where it can be of use to Government Agencies and provide real value to students, who are the final beneficiaries of this system and the future of our nation.
CHAPTER 8 
REFERENCES 
RESEARCH PAPERS 
 Data Mining: A prediction for Student's Performance Using Classification 
Method (World Journal of Computer Application and Technology) 
 A comparative study for predicting student’s academic performance using 
Bayesian Network Classifiers (IOSR Journal of Engineering) 
 School Dropout across Indian States and UTs: An Econometric Study 
(International Research Journal of Social Sciences) 
 Mining Educational Data to Analyze Students’ Performance (International 
Journal of Advanced Computer Science and Applications) 
 Gender Issues and Dropout Rates in India: Major Barrier in Providing 
Education for All (Amirtham, N. S. & Kundupuzhakkal, S. / Educationia 
Confab) 
 Mining Educational Data Using Classification to Decrease Dropout Rate of 
Students (International Journal of Multidisciplinary Sciences and 
Engineering) 
 Predicting Students Academic Performance Using Education Data Mining 
(International Journal of Computer Science and Mobile Computing) 
 Prediction of student academic performance by an application of data 
mining techniques (2011 International Conference on Management and 
Artificial Intelligence) 
 Educational Data Mining: A Review of the State-of-the-Art (Transactions 
on Systems, Man, and Cybernetics)
SURVEYS 
 School Drop out: Patterns, Causes, Changes and Policies (UNESCO) 
 The Criticality of Pupil Teacher Ratio (Azim Premji Foundation) 
 Survey for Assessment of Dropout Rates at Elementary Level in 21 States 
(EdCIL) 
 Right to Education Report Card (Annual Status of Education Report 2011) 
 How High Are Dropout Rates in India? (Economic and Political Weekly 
March 17, 2007) 
GOVERNMENT REPORTS 
 Review, Examination and Validation of Data on Dropout in Karnataka 
(Department of Education Government of Karnataka) 
 Drop-out rate at primary level: A note based on DISE 2003-04 & 2004-05 
data (National Institute of Educational Planning and Administration) 
 Dropout in Secondary Education: A Study of Children Living in Slums of 
Delhi (National University of Educational Planning and Administration) 
BOOKS 
 Data Mining: Concepts and Techniques (Jiawei Han 
and Micheline Kamber) 
 R in Action (Robert I. Kabacoff)
LINKS 
 http://www.wikipedia.org 
 http://scholar.google.com 
 https://www.coursera.org/course/ml
Activity completion-report
 
A Blueprint For Success Case Studies Of Successful Pre-College Outreach Prog...
A Blueprint For Success  Case Studies Of Successful Pre-College Outreach Prog...A Blueprint For Success  Case Studies Of Successful Pre-College Outreach Prog...
A Blueprint For Success Case Studies Of Successful Pre-College Outreach Prog...
 
Pre-Program Evaluation Essay
Pre-Program Evaluation EssayPre-Program Evaluation Essay
Pre-Program Evaluation Essay
 
18-09-082. MONITORING AND EVALUATION OF WORK IMMERSION.pdf
18-09-082. MONITORING AND EVALUATION OF WORK IMMERSION.pdf18-09-082. MONITORING AND EVALUATION OF WORK IMMERSION.pdf
18-09-082. MONITORING AND EVALUATION OF WORK IMMERSION.pdf
 

More from Ram G Athreya

Enhancing Community Interactions with Data-Driven Chatbots--The DBpedia Chatbot
Enhancing Community Interactions with Data-Driven Chatbots--The DBpedia ChatbotEnhancing Community Interactions with Data-Driven Chatbots--The DBpedia Chatbot
Enhancing Community Interactions with Data-Driven Chatbots--The DBpedia Chatbot
Ram G Athreya
 
GSoC 2017 Proposal - Chatbot for DBpedia
GSoC 2017 Proposal - Chatbot for DBpedia GSoC 2017 Proposal - Chatbot for DBpedia
GSoC 2017 Proposal - Chatbot for DBpedia
Ram G Athreya
 
Human Computer Interaction - Final Report of a concept Car Infotainment System
Human Computer Interaction - Final Report of a concept Car Infotainment SystemHuman Computer Interaction - Final Report of a concept Car Infotainment System
Human Computer Interaction - Final Report of a concept Car Infotainment System
Ram G Athreya
 
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
Ram G Athreya
 
Semi-Automated Security Testing of Web applications
Semi-Automated Security Testing of Web applicationsSemi-Automated Security Testing of Web applications
Semi-Automated Security Testing of Web applications
Ram G Athreya
 
Feature driven agile oriented web applications
Feature driven agile oriented web applicationsFeature driven agile oriented web applications
Feature driven agile oriented web applications
Ram G Athreya
 

More from Ram G Athreya (6)

Enhancing Community Interactions with Data-Driven Chatbots--The DBpedia Chatbot
Enhancing Community Interactions with Data-Driven Chatbots--The DBpedia ChatbotEnhancing Community Interactions with Data-Driven Chatbots--The DBpedia Chatbot
Enhancing Community Interactions with Data-Driven Chatbots--The DBpedia Chatbot
 
GSoC 2017 Proposal - Chatbot for DBpedia
GSoC 2017 Proposal - Chatbot for DBpedia GSoC 2017 Proposal - Chatbot for DBpedia
GSoC 2017 Proposal - Chatbot for DBpedia
 
Human Computer Interaction - Final Report of a concept Car Infotainment System
Human Computer Interaction - Final Report of a concept Car Infotainment SystemHuman Computer Interaction - Final Report of a concept Car Infotainment System
Human Computer Interaction - Final Report of a concept Car Infotainment System
 
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
 
Semi-Automated Security Testing of Web applications
Semi-Automated Security Testing of Web applicationsSemi-Automated Security Testing of Web applications
Semi-Automated Security Testing of Web applications
 
Feature driven agile oriented web applications
Feature driven agile oriented web applicationsFeature driven agile oriented web applications
Feature driven agile oriented web applications
 

Recently uploaded

一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 

Recently uploaded (20)

一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 

Forecasting a Student's Education Fulfillment using Regression Analysis

  • 1. i FORECASTING A STUDENT’S EDUCATION FULFILLMENT USING REGRESSION ANALYSIS Submitted by RAM G ATHREYA Roll No.: 1202FOSS0019 Reg. No.: 75812200021 A PROJECT REPORT Submitted to the FACULTY OF SCIENCE AND HUMANITIES in partial fulfillment for the requirement of award of the degree of MASTER OF SCIENCE IN FREE / OPEN SOURCE SOFTWARE (CS-FOSS) CENTRE FOR DISTANCE EDUCATION ANNA UNIVERSITY CHENNAI 600 025 AUGUST 2014
  • 2. ii CENTRE FOR DISTANCE EDUCATION ANNA UNIVERSITY CHENNAI 600 025 BONA FIDE CERTIFICATE Certified that this Project report titled “FORECASTING A STUDENT’S EDUCATION FULFILLMENT USING REGRESSION ANALYSIS” is the bona fide work of Mr. RAM G ATHREYA, who carried out the research under my supervision. I certify further that, to the best of my knowledge, the work reported herein does not form part of any other Project report or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate. RAM G ATHREYA Dr. SRINIVASAN SUNDARARAJAN Student at Anna University Professor
  • 3. iii CERTIFICATE OF VIVA-VOCE EXAMINATION This is to certify that Thiru/Mr. RAM G ATHREYA (Roll No. 1202FOSS0019; Register No. 75812200021) has been subjected to Viva-Voce Examination on 14 September 2014 at 9:30 AM at the study centre, the AU-KBC Research Centre, Madras Institute of Technology, Anna University, Chrompet, Chennai 600044. Internal Examiner External Examiner Name : Name : (in capital letters) (in capital letters) Designation : Designation : Address : Address : Centre Coordinator Name : (in capital letters) Designation : Address : Date :
  • 4. iv ACKNOWLEDGEMENT I am highly indebted to my guide Dr. SRINIVASAN SUNDARARAJAN for his guidance, monitoring, constant supervision, kind co-operation and encouragement, which helped me complete this project. I would also like to express my special gratitude to the AU-KBC faculty involved in the M.Sc. (CS-FOSS) course for their cordial support and guidance, and for providing the information necessary for the project. Finally, I thank the Centre for Distance Education, Anna University for giving me the opportunity to do this project.
  • 5. v ABSTRACT Our government spends a substantial amount of resources on educating our children. Additionally, several welfare schemes, aimed especially at underprivileged children, have been introduced to ensure that all of them complete a basic level of education. In spite of these measures, many students do not complete their basic education. The aim of this project is to formulate a supervised learning algorithm that will aid in identifying students who have a higher likelihood of not completing their education. To perform this task the algorithm will apply logistic regression analysis to historical data of students from a given school. The historical data includes basic background information (features) such as gender, community, number of siblings, etc. It must be noted that the historical data also records whether each student completed his/her education, which is the outcome we are interested in. Typically, a student who finished education is denoted with a value of 1 and a student who did not with a value of 0. Based on the training (historical) data, a logistic classifier can be built. After learning from the training set, such a classifier will develop specific weightages for each of the features. These weightages can then be combined into an equation that can be used for prediction. That is, we can apply the equation to a current student (whose background we already know) to calculate the probability that he/she will complete his/her education.
  • 6. vi Such an algorithm will be beneficial to government agencies since it can serve as an early-warning system with which they can take more proactive action to prevent a student from dropping out. Policy makers can also use it as a tool to identify schools that are more vulnerable and direct their resources and energies toward helping them.
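The weighted-feature prediction described in the abstract can be sketched as follows. This is an illustrative sketch only: the feature names and weight values below are hypothetical placeholders, not the weightages actually learned in this project.

```javascript
// The logistic (sigmoid) function maps any real number into (0, 1),
// so its output can be read as a probability.
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// theta[0] is the intercept; theta[1..n] pair with features[0..n-1].
// Returns the modeled probability that the student completes education.
function completionProbability(theta, features) {
  let z = theta[0];
  for (let i = 0; i < features.length; i++) {
    z += theta[i + 1] * features[i];
  }
  return sigmoid(z); // 1 would mean certain completion, 0 certain dropout
}

// Hypothetical learned weights and one student's feature vector,
// e.g. [gender, community, numberOfSiblings] (illustrative only).
const theta = [0.5, 1.2, -0.4, -0.3];
const p = completionProbability(theta, [1, 0, 2]);
console.log(p.toFixed(3)); // prints 0.750
```

A probability above some chosen threshold (commonly 0.5) would then be reported as "likely to complete"; the thresholding and error metrics are discussed in later chapters of the report.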
  • 7. vii ABSTRACT IN TAMIL (pages vii and viii: the Tamil text of the abstract is garbled in this extraction and could not be recovered)
  • 9. ix TABLE OF CONTENTS CHAPTER NO TITLE PAGE NO ACKNOWLEDGEMENT iv ABSTRACT v ABSTRACT IN TAMIL vii LIST OF FIGURES xii LIST OF TABLES xiii LIST OF ABBREVIATIONS xiv 1 INTRODUCTION 1 1.1 OVERVIEW OF THE PROJECT 1 1.2 LITERATURE SURVEY 2 1.3 PROPOSED SYSTEM 2 1.4 SCOPE 2 2 REQUIREMENT SPECIFICATION 4 2.1 INTRODUCTION 4 2.2 OVERALL DESCRIPTION 4 2.2.1 PRODUCT PERSPECTIVE 5 2.2.2 PRODUCT FUNCTIONS 5 3 PROJECT REQUIREMENTS 7 3.1 SOFTWARE REQUIREMENTS 7 3.2 HARDWARE REQUIREMENTS 7 4 SYSTEM DESIGN 9
  • 10. x 4.1 METHODOLOGY 9 4.2 ALGORITHM 9 4.2.1 SUPERVISED LEARNING 10 4.2.2 CLASSIFICATION 11 4.2.3 LOGISTIC REGRESSION 13 4.3 DATA COLLECTION 15 4.3.1 FEATURE DETECTION 15 4.3.1.1 PERSONAL 15 4.3.1.2 ENVIRONMENTAL 15 4.3.1.3 SCHOOL 16 4.3.2 DATASET GENERATION 16 4.4 MODELING 18 4.4.1 HYPOTHESIS DEVELOPMENT 19 4.4.2 GENERALIZATION ERROR 19 4.5 VALIDATION 20 4.5.1 DATASET PARTITIONING 21 4.5.1.1 TRAINING DATASET 21 4.5.1.2 CV DATASET 22 4.5.2 COST FUNCTION 23 4.5.3 ERROR METRICS 24 4.5.3.1 TRAINING AND CV ERROR 25 4.5.3.2 F1 SCORE 25 4.5.3.3 W – SCORE 26 4.5.4 LEARNING CURVES 27 4.6 PREDICTION 29 5 IMPLEMENTATION 31 5.1 R 31
  • 11. xi 5.1.1 COST FUNCTION.R 31 5.1.2 F1SCORE.R 31 5.1.3 GENERATEDATASET.R 32 5.1.4 GENERATEVECTOR.R 34 5.1.5 INIT.R 36 5.1.6 LEARNINGCURVE.R 37 5.1.7 MYSQL.R 39 5.1.8 PERCRANK.R 39 5.1.9 PLOTLEARNINGCURVE.R 39 5.1.10 PREDICTION.R 40 5.1.11 PREDICTOR.R 40 5.1.12 RANDOMIZEDATASET.R 41 5.2 NODE.JS 41 5.2.1 APP.JS 41 5.2.2 PACKAGE.JSON 42 5.2.3 ROUTES.JS 43 5.2.4 INDEX.JADE 45 5.2.5 PREDICT.JADE 47 5.2.6 UPLOAD.JADE 52 6 RESULTS 54 6.1 DATASET UPLOAD 54 6.2 UPLOAD RESULT 55 6.3 PREDICTION 56 7 CONCLUSIONS 57 8 REFERENCES 58
  • 12. xii LIST OF FIGURES FIGURE NO TITLE PAGE NO 4.1 Logistic Regression Curve 4.2 Dataset Generation 4.3 Modeling 4.4 Dataset Partitioning 4.5 Developing Multiple Models 4.6 Calculating Cross-Validation Errors 4.7 Single Subject Learning 4.8 Learning from Experience 4.9 Score & Learning Time vs Experience 4.10 Training & Cross – Validation Error Convergence 4.11 Choosing the Best Model 4.12 Prediction 6.1 Upload Result 6.2 Prediction Screen 6.3 Predicting Student will not Dropout 6.4 Predicting Student will Dropout
  • 13. xiii LIST OF TABLES TABLE NO TITLE PAGE NO 4.1 Sample Dataset 17
  • 14. xiv LIST OF ABBREVIATIONS FOSS Free and Open Source Software IDE Integrated Development Environment OS Operating System PTR Pupil Teacher Ratio SCR Student Classroom Ratio
  • 15. 1 CHAPTER 1 INTRODUCTION 1.1 OVERVIEW OF THE PROJECT Dropout is a universal phenomenon of the education system in India, spread across all levels of education, in all parts of the country, and across socio-economic groups. Dropout rates are much higher in educationally backward states and districts. Girls in India tend to have higher dropout rates than boys. Similarly, children belonging to socially disadvantaged groups like Scheduled Castes and Scheduled Tribes have higher dropout rates in comparison to the general population. There are also regional and location-wise differences, and children living in rural areas are more likely to drop out of school. In order to reduce wastage and improve the efficiency of the education system, educational planners need to understand and identify the social groups that are more susceptible to dropout and the reasons for their dropping out. Keeping the above context in perspective, it would be helpful to develop a system or an algorithm that can systematically identify vulnerable students who have a higher likelihood of dropping out from school. The goal of this project is to develop such an algorithm or system. Such a system could assist educational planners and the administrative staff of educational institutions to better allocate resources and make better decisions, which could curb this growing dropout problem.
  • 16. 2 1.2 LITERATURE SURVEY The literature survey covers existing research and studies with respect to the dropout problem. They are grouped into three broad categories: 1. Research Papers 2. Surveys 3. Government Reports The detailed list of resources consulted during the literature survey is provided in the references section. 1.3 PROPOSED SYSTEM The proposed system will implement an algorithm that takes student data as input and learns from it. This learned function, otherwise called the hypothesis, will serve as an approximate explanation of the data. Error metrics and validation techniques will be used to determine the accuracy of the hypothesis. The hypothesis that best fits the data will then be used for prediction. The final goal of the algorithm is to make reasonably accurate predictions on new unlabeled data, that is, data for which the outcome is unknown. The system will be implemented in such a way that it can be operated from a web interface, where the user can upload datasets as well as make predictions based on learned data. 1.4 SCOPE
  • 17. 3 The algorithm developed is an exploratory proof-of-concept system that uses machine learning and statistical techniques to make predictions based on student data. The validity of the results is entirely dependent on the accuracy of the data and how the algorithm processes it. Since comprehensive student data was not available for tuning the algorithm as well as possible, this iteration of the system can only serve as a proof of concept of what is possible and cannot be directly used in the real world, in its present form, as a decision-making or policy-making tool.
  • 18. 4 CHAPTER 2 REQUIREMENT SPECIFICATION 2.1 INTRODUCTION A software requirements specification (SRS) defines the requirements of a software system. It is a description of the behavior of a system to be developed and may include a set of use cases. In addition, it contains non-functional requirements, which impose constraints on the design or implementation (such as performance requirements, quality standards, or design constraints). This project requires the storage and processing of medium to large volumes of data. Such datasets will be passed through the algorithm during an initial training phase, during which the algorithm will learn from the training data. After training is completed, the algorithm will be required to make predictions for new unlabeled data based on what it learned from the training data. Additionally, it would be helpful if the algorithm could be operated from a web user interface, which is more user-friendly than issuing commands from the command line. 2.2 OVERALL DESCRIPTION This section outlines a holistic description of the project, including the different perspectives, constraints, and functional and non-functional requirements of the project.
  • 19. 5 2.2.1 PRODUCT PERSPECTIVE The system has four main tasks:  Data Collection  Modeling  Validation  Prediction In the data collection phase the data required for the algorithm is gathered, converted into a suitable form, and supplied to the system for learning. In the modeling phase the algorithm generates models that try to explain the data that has been gathered. Machine learning techniques are used in this phase to generate multiple models, of which the best gets chosen in later stages. In the validation phase the different models are evaluated based on performance, and the best among them is chosen as the candidate model to be used for prediction. Finally, in the prediction phase the chosen model is used for making actual real-world predictions. 2.2.2 PRODUCT FUNCTIONS The system has two main functions:  Training
  • 20. 6  Prediction In the training phase the dataset is supplied to the algorithm, from which the best model is developed for prediction. In the prediction phase the learned algorithm is actually put to use, that is, it is used to make predictions for unlabeled data. How these processes are implemented is explained in detail in subsequent sections.
  • 21. 7 CHAPTER 3 PROJECT REQUIREMENTS The project requirement is to develop an algorithm that can classify students according to whether they will complete their education or drop out. To achieve this, a system needs to be created that can be operated from a web user interface, through which the user can supply data for training or make predictions based on already trained data. 3.1 SOFTWARE REQUIREMENTS The software requirements for this project are:  R – R is a free software programming language and software environment for statistical computing and graphics.  Node.js – Node.js is a cross-platform runtime environment and library for running applications written in JavaScript outside the browser (for example, on the server).  NetBeans – NetBeans is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C++, Node.js and HTML5.  RStudio – RStudio is a free and open source (FOSS) integrated development environment for R, a programming language for statistical computing and graphics.  Linux – Linux is a POSIX-compliant computer operating system (OS) assembled under the model of free and open source software. 3.2 HARDWARE REQUIREMENTS
  • 22. 8 The hardware requirements define a set of (minimum) hardware that must be available to run the system.  Hardware System that can support LINUX Operating System  2 – 4 GB of RAM  Internet Connectivity
  • 23. 9 CHAPTER 4 SYSTEM DESIGN System design is the process of defining the architecture, components, modules, interfaces and data for a system to satisfy specified requirements. It encompasses activities such as systems analysis, systems architecture and systems engineering. 4.1 METHODOLOGY A software development methodology (or system development methodology) in software engineering is a framework that is used to structure, plan and control the process of developing a software system. This project consists of four distinct phases:  Data Collection  Modeling  Validation  Prediction 4.2 ALGORITHM The system will use a logistic regression classifier, which is a supervised machine learning algorithm. This algorithm will take student data as input and predict an outcome. Outcomes are binary, that is, either TRUE or FALSE. A TRUE value indicates that a student will drop out, while FALSE means the student will not drop out.
  • 24. 10 Since the algorithm returns only one of two possible outcomes it can also be called a binary (binomial) classifier. 4.2.1 SUPERVISED LEARNING Supervised learning is the machine-learning task of inferring a function from labeled training data. The training data consist of a set of training examples. The training data for this project will consist of data about students, based on features that are defined later in this document. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. New examples are usually unlabeled data that we need to predict. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. In order to solve a given problem of supervised learning, the system has to perform the following steps: 1. Determine the type of training examples: The kind of data that is to be used as the training set needs to be determined first. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting. 2. Gather a training set: The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is
  • 25. 11 gathered and corresponding outputs are also gathered, either from human experts or from measurements. 3. Determine the input feature representation of the learned function: The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, but should contain enough information to accurately predict the output. 4. Determine the learning algorithm: The correct learning algorithm that models the available data should be identified and applied. For example, the learning algorithm may be support vector machines or decision trees. 5. Complete the design: Run the learning algorithm on the gathered training set. Some supervised learning algorithms require certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation. 6. Evaluate the accuracy of the learned function: After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set. 4.2.2 CLASSIFICATION In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. The
individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features, etc. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10). An example would be assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space).

In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or independent variables, regressors, etc.), and the
categories to be predicted are known as outcomes, which are considered to be possible values of the dependent variable. In machine learning, the observations are often known as instances, the explanatory variables are termed features (grouped into a feature vector), and the possible categories to be predicted are classes. There is also some argument over whether classification methods that do not involve a statistical model can be considered "statistical".

4.2.3 LOGISTIC REGRESSION

In statistics, logistic regression, or logit regression, is a type of probabilistic statistical classification model. It is used to predict the outcome of a categorical dependent variable (i.e., a class label) based on one or more predictor variables (features); that is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function.

The term logistic regression refers specifically to the problem in which the dependent variable is binary, that is, the number of available categories is two; problems with more than two categories are referred to as multinomial logistic regression. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable.
Fig 4.1 : Logistic Regression Curve

The formula for logistic regression can be expressed as:

F(x) = 1 / (1 + e^(-x))

Eq 4.1 : Logistic Regression Formula

where:
 F(x) is the output
 x is the input
 e is Euler's number

It must be noted that F(x) can take a value only between 0 and 1 for any value of x in (−∞, ∞). Using the above equation we can define a threshold k ∈ (0, 1) such that all inputs with F(x) ≥ k are classified as true while the rest are classified as false (or vice versa), thereby dividing the data into two distinct classes.
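As an illustration of Eq 4.1 and the threshold rule, here is a minimal sketch. It is written in Python purely for illustration (the project's actual implementation is the R code of Chapter 5), and the function names `logistic` and `classify` are chosen for this sketch only:

```python
import math

def logistic(x):
    # Eq 4.1: F(x) = 1 / (1 + e^(-x)), maps any real x into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, k=0.5):
    # Threshold rule: inputs with F(x) >= k fall in one class, the rest in the other
    return logistic(x) >= k

# logistic(0) is exactly 0.5; large positive x approaches 1, large negative x approaches 0
```

A threshold of k = 0.5 is the symmetric choice; the project instead searches over candidate thresholds (the variable z in init.R of Chapter 5) and keeps the one with the best score.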
4.3 DATA COLLECTION

4.3.1 FEATURE DETECTION

Based on the literature survey, six features have been identified as major observable factors that can affect the final outcome regarding the education fulfillment of a student. The six features can be grouped into three categories:

1. Personal Features
2. Environmental Features
3. School Features

4.3.1.1 PERSONAL

Personal features are those that are based on the characteristics of the student, his/her parents, family background etc. The personal features considered by the algorithm are:

1. Gender: Values can be Male or Female
2. Poverty: Values can be Yes or No
3. Community: Values can be General, OBC, SC, ST

4.3.1.2 ENVIRONMENTAL

Environmental features are those that are based on the student's environment, locality, geography etc. The
environmental features considered by the algorithm are:

1. Rural: Values can be Yes or No

4.3.1.3 SCHOOL

School features are those that are based on the characteristics of the school where the student studies.

Pupil Teacher Ratio: the number of students who attend a school or university divided by the number of teachers in the institution. For example, a pupil-teacher ratio of 10:1 indicates that there are 10 students for every teacher. The term can also be reversed to create a teacher-pupil ratio.

Student Classroom Ratio: the number of students per classroom in an educational institution. For example, a student-classroom ratio of 40:1 indicates that there are 40 students for every classroom.

The school features considered by the algorithm are:

1. Pupil Teacher Ratio (PTR): Values can be Low (1 Teacher : <30 Students), Medium (1 Teacher : 30-40 Students) and High (1 Teacher : 40+ Students)
2. Student Classroom Ratio (SCR): Values can be Low (1 Classroom : <30 Students), Medium (1 Classroom : 30-40 Students) and High (1 Classroom : 40+ Students)

4.3.2 DATASET GENERATION
Based on statistics derived from the literature survey and the features mentioned above, the dataset for modeling is generated. The table given below extrapolates statistical findings compiled from the literature survey:

Feature     Value     Distribution   Dropout Chance
Gender      Male      52%            39%
Gender      Female    48%            41%
Poverty     Yes       22%            80%
Poverty     No        78%            27%
Rural       Yes       75%            45%
Rural       No        25%            20%
Community   General   30%            10%
Community   OBC       40%            48%
Community   SC        20%            64%
Community   ST        10%            69%
PTR         Low       20%            15%
PTR         Medium    30%            35%
PTR         High      50%            55%
SCR         Low       18%            22%
SCR         Medium    33%            25%
SCR         High      49%            60%

Table 4.1 : Sample Dataset

The above table shows the distribution of each feature in the student population and the corresponding dropout chance of each feature within that population. For example, when considering 100 students there are 52
male students and 48 female students, and the chance that a female student drops out is 41%. The overall dropout percentage was found to be 40%; that is, 40% of the student population drop out of school. Using the above statistics a dataset can be generated for further analysis.

Fig 4.2 : Dataset Generation

4.4 MODELING

Data modeling in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.
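Before a model can be built, the dataset of Section 4.3.2 has to be sampled from the Table 4.1 distributions. A minimal Python sketch of that sampling follows (the project's own generator is generateDataset.R in Chapter 5); only the Gender row of Table 4.1 is used here, and `sample_feature` is a name invented for this sketch:

```python
import random

def sample_feature(dist, rng):
    # dist maps each value to its share of the population,
    # e.g. {"Male": 0.52, "Female": 0.48} from Table 4.1.
    # Draw a uniform number and walk the cumulative distribution.
    r = rng.random()
    cumulative = 0.0
    for value, share in dist.items():
        cumulative += share
        if r <= cumulative:
            return value
    return value  # guard against floating-point rounding at the upper edge

rng = random.Random(42)  # fixed seed so the sketch is reproducible
gender_dist = {"Male": 0.52, "Female": 0.48}
sample = [sample_feature(gender_dist, rng) for _ in range(10000)]
male_share = sample.count("Male") / len(sample)  # should be close to 0.52
```

Repeating this per feature, and then labeling each synthetic student via the weighted dropout chances, yields a dataset in the shape of Table 4.1's population.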
Fig 4.3 : Modeling

4.4.1 HYPOTHESIS DEVELOPMENT

A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. A working hypothesis is a provisionally accepted hypothesis proposed for further research. In the context of machine learning, the hypothesis is also called the learned function. In the context of this project, the learned function is a working hypothesis that tries to explain the training dataset of students. Based on the observations/outcomes of the training dataset, the learning algorithm will develop weightages for each of the features that have been selected. These weightages will then be used for predicting outcomes in a future dataset.

4.4.2 GENERALIZATION ERROR

The generalization error of a machine-learning model is a function that measures how well a learning machine generalizes to unseen data. It is
measured as the distance between the error on the training set and the error on the test set, averaged over the entire set of possible training data that can be generated after each iteration of the learning process. It has this name because this function indicates the capacity of a machine that learns with the specified algorithm to infer a rule (or generalize). The theoretical model assumes a probability distribution of the examples, and a function giving the exact target. The model can also include noise in the examples (in the input and/or target output). The generalization error is usually defined as the expected value of the square of the difference between the learned function and the exact target (mean-square error).

The performance of a machine learning algorithm is measured by plotting the generalization error values through the learning process; such plots are called learning curves.

4.5 VALIDATION

In statistics, model validation is the process of deciding whether the numerical results quantifying hypothesized relationships between variables, obtained from machine learning analysis, are in fact acceptable as descriptions of the data. The validation process can involve analyzing the goodness of fit of the model, analyzing whether the model residuals are random, and checking whether the model's predictive performance deteriorates substantially when applied to data that were not used in model estimation.
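The hold-out evaluation described in this chapter starts from a random split of the data. A minimal Python sketch of that split (the project's R workflow does the equivalent in init.R and learningCurve.R; the 70-30 fraction and the helper name `train_cv_split` are assumptions for this sketch):

```python
import random

def train_cv_split(rows, train_fraction=0.7, seed=0):
    # Shuffle a copy of the rows, then cut into a training set
    # and a held-out cross-validation set.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(100))          # stand-in for 100 student records
train, cv = train_cv_split(rows) # 70 for training, 30 held out
# No example appears in both sets, so the CV error estimates
# performance on genuinely unseen data.
```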
4.5.1 DATASET PARTITIONING

In model validation, for assessing the results of statistical analysis, the dataset is generally partitioned into two separate datasets:

1. Training Dataset
2. Cross-Validation (CV) Dataset

The model is typically trained on the training dataset and then tested on the cross-validation dataset, which contains examples that are independent of the training data. The actual training/cross-validation split is up to the person doing the analysis. Usually a split of 80-20 or 70-30 (training-CV) is preferred, so that the model has enough examples for training.

Fig 4.4 : Dataset Partitioning

4.5.1.1 TRAINING DATASET
  • 36. 22 A training set is a set of data used in various areas of information science to discover potentially predictive relationships. Training sets are used in artificial intelligence, machine learning, genetic programming, intelligent systems, and statistics. In all these fields, a training set has much the same role and is often used in conjunction with a test set. Fig 4.5 : Developing Multiple Models 4.5.1.2 CV DATASET Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset). The goal of cross validation is to define a
  • 37. 23 dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent data set (i.e., an unknown dataset, for instance from a real problem), etc. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Fig 4.6 : Calculating Cross-Validation Errors 4.5.2 COST FUNCTION In mathematical optimization, statistics, decision theory and machine learning, a cost function or loss function is a function that maps an event or values of one or more variables onto a real number intuitively representing
some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative (sometimes called a reward function or a utility function), in which case it is to be maximized. In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data.

The cost function is expressed as:

J(θ) = (1 / 2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Eq 4.2 : Cost Function or Error Function

where:
 J is the cost
 m is the number of training examples
 h_θ(x) is the hypothesis
 y is the actual value or the result vector

4.5.3 ERROR METRICS

Error metrics are systematic benchmarking measures that are used for calculating the accuracy or effectiveness of the system. The cost function described above is a good example of an error metric. The following error metrics are used for validation of the generated models and in choosing the best among them:
 Training and CV Error
 F1 Score
 W-Score

4.5.3.1 TRAINING AND CV ERROR

The training error is the cost-function error of the trained model on the training set. That is, after training the model, the training dataset is supplied again to the model as input to make predictions. These predictions are compared against the actual outcomes in the dataset, and the error between the two is calculated using the cost function formula; the resulting value is the training error.

The cross-validation error is calculated in the same way, except on the cross-validation set. The benefit here is that the cross-validation set is new data, containing none of the training examples, and thus gives a better estimate of the accuracy of the system. Ideally the system's cross-validation error should be similar to the training error, in which case the model is a good estimate of the underlying data.

4.5.3.2 F1 SCORE

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by
the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Eq 4.3 : F1 Score

Precision = True Positives / (True Positives + False Positives)

Eq 4.4 : Precision

Recall = True Positives / (True Positives + False Negatives)

Eq 4.5 : Recall

4.5.3.3 W-SCORE

The W-Score combines the training and cross-validation errors and is used to choose the best model: the best model is the one with the least W-Score. The W-Score is expressed as:

W = (1 − F1) · (Σ Train Error / NT) · (Σ CV Error / NCV)
Eq 4.6 : W-Score

where:
 W is the W-Score
 F1 is the F1 score
 NT is the number of training examples
 NCV is the number of cross-validation examples

4.5.4 LEARNING CURVES

Fig 4.7 : Single Subject Learning

Fig 4.8 : Learning from Experience

Fig 4.9 : Score & Learning Time vs Experience
A learning curve is a graphical representation of the increase of learning (vertical axis) with experience (horizontal axis). Although the curve for a single subject may be erratic (Fig 4.7), when a large number of trials are averaged a smooth curve results, which can be described with a mathematical function (Fig 4.8). Depending on the metric used for learning (or proficiency), the curve can either rise or fall with experience (Fig 4.9).

Within the context of the project, the horizontal axis is the number of training examples, which stands in for experience, and the vertical axis is the cost-function error. There are two errors to track: the training error and the cross-validation error. As training examples are added, the training error tends to increase gradually, because the model has to explain an increasingly diverse spectrum of examples without overfitting; however, it should not increase sharply. If the model generalizes well, it should perform just as well on new data as it does on the training dataset, so the cross-validation error should decrease as training examples increase. Thus, for the ideal model, the training error rises slightly and the cross-validation error falls as training examples increase, and the two errors converge as shown in (Fig 4.10).
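The learning-curve procedure above can be sketched as follows: train on growing prefixes of the training data, and record the training and CV error (Eq 4.2) at each size. This Python sketch uses a deliberately trivial "model" that predicts the subset mean, an assumption for illustration only; the project trains logistic regression at each step (learningCurve.R in Chapter 5):

```python
def mean_squared_cost(actual, predicted):
    # Eq 4.2: J = (1 / 2m) * sum((prediction - actual)^2)
    m = len(actual)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / (2 * m)

def learning_curve(train_y, cv_y, sizes):
    # For each training-set size m, "train" on the first m labels
    # and record (m, training error, cross-validation error).
    points = []
    for m in sizes:
        subset = train_y[:m]
        model = sum(subset) / m  # toy model: predict the subset mean
        train_err = mean_squared_cost(subset, [model] * m)
        cv_err = mean_squared_cost(cv_y, [model] * len(cv_y))
        points.append((m, train_err, cv_err))
    return points

train_y = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # toy dropout labels
cv_y = [1, 0, 1, 1]
curve = learning_curve(train_y, cv_y, sizes=[2, 5, 10])
```

Plotting the second and third elements of each point against the first gives exactly the two curves whose convergence is discussed above.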
Fig 4.10 : Training & Cross-Validation Error Convergence

Fig 4.11 : Choosing the Best Model

4.6 PREDICTION

Prediction is the final step in the process. After selecting the best model that fits the given dataset, the model can be put to use on actual, real-world unlabeled data. That is, it can be used to predict data for which the outcomes are not known.
The prediction process begins with the algorithm being supplied unlabeled student data, from which it predicts an outcome: whether or not the student will drop out.

Fig 4.12 : Prediction
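Before moving to the implementation, the model-selection metrics of Section 4.5.3 (Eqs 4.3 to 4.6) can be sketched together. This is a Python illustration of the formulas only; the project's own versions are f1Score.R and the W-Score computation in init.R (Chapter 5), and the data below is made up:

```python
def f1_score(actual, predicted):
    # Eqs 4.3-4.5: F1 from precision and recall over binary labels
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def w_score(f1, train_errors, cv_errors):
    # Eq 4.6: W = (1 - F1) * (mean training error) * (mean CV error);
    # the best model is the one with the smallest W
    avg_train = sum(train_errors) / len(train_errors)
    avg_cv = sum(cv_errors) / len(cv_errors)
    return (1 - f1) * avg_train * avg_cv

actual    = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
f1 = f1_score(actual, predicted)  # tp=2, fp=1, fn=1, so precision = recall = 2/3
```

Note how W rewards both accuracy (through 1 − F1) and low error on both datasets, so a model that overfits the training set is penalized by its CV-error factor.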
CHAPTER 5

IMPLEMENTATION

5.1 R

5.1.1 COST FUNCTION

costFunction <- function(dataset, prediction){
  dataset <- as.numeric(dataset);
  prediction <- as.numeric(prediction);
  m = length(dataset);
  # Eq 4.2: mean-square cost over m examples
  J = 1 / (2 * m) * sum((dataset - prediction) ^ 2);
  return(J);
}

5.1.2 F1SCORE.R

f1Score = function(data, prediction){
  data <- as.numeric(data);
  prediction <- as.numeric(prediction);
  # True positives: actually positive and predicted positive
  true_positives <- sum(data & prediction);
  # False positives: predicted positive but actually negative
  false_positives <- sum(!data & prediction);
  # False negatives: actually positive but predicted negative
  false_negatives <- sum(data & !prediction);
  precision <- true_positives / (true_positives + false_positives);
  recall <- true_positives / (true_positives + false_negatives);
  • 46. 32 return(as.numeric(2 * precision * recall / (precision + recall))); } 5.1.3 GENERATEDATASET.R generateDataset <- function(n, dropout_percentage){ source('generateVector.R'); source('percRank.R') #Gender List gender_list <- list(data = factor(c("Male", "Female")), dist = list(Male = 0.52, Female = 0.48), w = list(Male = 0.39, Female = 0.41)); #Poverty List poverty_list <- list(data = factor(c("Yes", "No")), dist = list(Yes = 0.22, No = 0.78), w = list(Yes = 0.80, No = 0.27)); #Community List community_list <- list(data = factor(c("General", "OBC", "SC", "ST")), dist = list(General = 0.30, OBC = 0.40, SC = 0.20, ST = 0.10), w = list(General = 0.10, OBC = 0.48, SC = 0.64, ST = 0.69)); #Rural List rural_list <- list(data = factor(c("Yes", "No")), dist = list(Yes = 0.75, No = 0.25), w = list(Yes = 0.45, No = 0.20));
  • 47. 33 #Pupil Teacher Ratio List ptr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE), dist = list(Low = 0.20, Medium = 0.30, High = 0.50), w = list(Low = 0.15, Medium = 0.35, High = 0.55)); #Student Classroom Ratio List scr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE), dist = list(Low = 0.18, Medium = 0.33, High = 0.49), w = list(Low = 0.22, Medium = 0.25, High = 0.60)); Gender <- generateVector(n, gender_list); Poverty <- generateVector(n, poverty_list); Community <- generateVector(n, community_list); Rural <- generateVector(n, rural_list); PTR <- generateVector(n, ptr_list); SCR <- generateVector(n, scr_list); getW <- function(list, vector, index){ value <- as.character(vector[index]); return(as.numeric(list$w[value])); } weightage_vector <- vector('numeric'); for(i in 1:n){ gender_weightage <- getW(gender_list, Gender, i); poverty_weightage <- getW(poverty_list, Poverty, i); community_weightage <- getW(community_list, Community, i);
  • 48. 34 rural_weightage <- getW(rural_list, Rural, i); weightage_vector[i] <- gender_weightage + poverty_weightage + community_weightage + rural_weightage + getW(ptr_list, PTR, i) + getW(scr_list, SCR, i) ; } w_rank <- percRank(weightage_vector); Dropout <- w_rank >= (1 - dropout_percentage); data <- data.frame(Gender, Poverty, Community, Rural, PTR, SCR, Dropout); write.csv(file="data.csv", x=data) } 5.1.4 GENERATEVECTOR.R generateVector <- function(n, list){ dist <- list$dist; p <- c(length(list$data)); #Generate probability series k = 1; for(i in dist){ if(k == 1){
  • 49. 35 p[k] = i; } else{ p[k] = i + p[k - 1]; } k = k + 1; } #Get index of value that will be added to the vector getIndex <- function(p, r){ k = 1; for(i in p){ if(r <= i){ break; } k = k + 1; } return(k); } #Generate Vector result <- factor(list$data); for(i in 1:n) { index <- getIndex(p, runif(1)); value <- list$data[index]; result[i] = value; } return(result);
  • 50. 36 } 5.1.5 INIT.R setwd('/Users/ramathreya/Sites/foss-project/r'); source('generateDataset.R'); source('randomizeDataset.R'); source('predictor.R'); source('costFunction.R'); source('f1Score.R'); source('learningCurve.R'); source('plotLearningCurve.R'); partition <- 0.7; start <- 100; interval <- 500; dataset <- read.csv(file="input.csv"); n <- nrow(dataset); png('../public/plot.png'); opar <- par(no.readonly=TRUE) par(mfrow=c(3, 3)); z <- c();
  • 51. 37 train <- c(); cv <- c(); f1 <- c(); seq_range <- seq(0.1, 0.9, 0.1); for(i in seq_range){ curves <- learningCurve(dataset, start, n, interval, partition, "Dropout", predictor, i); plotLearningCurve(curves$m, curves$train, curves$test, c("Plot when Z is ", i), "Training Examples", "Error"); train_last <- tail(curves$train, 1); cv_last <- tail(curves$test, 1); z <- c(z, i); cv <- c(cv, sum(abs(curves$test)) / length(curves$test)); train <- c(train, sum(abs(curves$train)) / length(curves$train)); f1 <- c(f1, sum(abs(curves$f1)) / length(curves$f1)); } w <- (1-f1) * train * cv; analysis <- data.frame(z, train, cv, f1, w); min_index <- which(w==min(w)); write.csv(seq_range[min_index], file="out.z") dev.off(); 5.1.6 LEARNINGCURVE.R
  • 52. 38 learningCurve <- function(dataset, start, end, interval, partition, column, predictor, z){ train_plot <- c(); test_plot <- c(); x <- c(); f1 <- c(); for(i in seq(start, end, interval)){ m <- i * partition; training_dataset <- dataset[1:m, ]; test_dataset <- dataset[(m+1):i, ]; train_actual <- unlist(training_dataset[column]); test_actual <- unlist(test_dataset[column]); predictor_formula <- predictor(training_dataset); train_pred <- predict(predictor_formula, type="response", training_dataset) >= z; test_pred <- predict(predictor_formula, type="response", test_dataset) >= z; train_cost <- costFunction(train_actual, train_pred); test_cost <- costFunction(test_actual, test_pred); f1 <- c(f1, f1Score(test_actual, test_pred)); x <- c(x, i);
  • 53. 39 train_plot <- c(train_plot, train_cost); test_plot <- c(test_plot, test_cost); } return(list(train=train_plot, test=test_plot, m=x, f1=f1)); } 5.1.7 MYSQL.R library(RMySQL) db = dbConnect(MySQL(), user='root', password='', dbname='mobile_crm', host='localhost') 5.1.8 PERCRANK.R percRank <- function(x) trunc(rank(x)) / length(x) 5.1.9 PLOTLEARNINGCURVE.R plotLearningCurve <- function(m, train_plot, test_plot, title, xlab, ylab, rnge=range(0, 0.15)){ plot(m, train_plot, type="l", col="red", xlab=NA, ylab=NA, ylim=rnge); par(new=TRUE); plot(m, test_plot, type="l", col="green", xlab=NA, ylab=NA, ylim=rnge, axes=FALSE); par(new=TRUE); legend('topright', c("Training", "C.V"), bty="n", lty=1, lwd=0.5, cex=0.5, col=c('red', 'green'));
  • 54. 40 title(title, xlab=xlab, ylab=ylab); } 5.1.10 PREDICTION.R setwd('/Users/ramathreya/Sites/foss-project/r'); source('predictor.R'); dataset <- read.csv(file="input.csv"); z <- read.csv(file="out.z") z <- z[1, 'x']; predictor_formula <- predictor(dataset); input <- read.csv('predict-input.csv'); dataset <- rbind(dataset, input) l <- nrow(dataset); prediction <- predict(predictor_formula, newdata=dataset, type="response"); prediction <- (prediction[l] >= z); fileConn<-file("output") writeLines(c(toString(prediction)), fileConn) close(fileConn) 5.1.11 PREDICTOR.R
predictor <- function(dataset){
  # Additive formula lets glm build dummy variables for each factor level
  model <- glm(
    formula = Dropout ~ Gender + Poverty + Community + Rural + PTR + SCR,
    family = binomial,
    data = dataset);
  return(model);
}

5.1.12 RANDOMIZEDATASET.R

randomizeDataset <- function(dataset){
  # Shuffle the rows of the dataset into a random order
  l <- nrow(dataset);
  result <- dataset[sample(seq(1, l), l), ];
  return(result);
}

5.2 NODE.JS

5.2.1 APP.JS

var express = require('express');
var http = require('http');
var path = require('path');
  • 56. 42 var bodyParser = require('body-parser'); app = express(); app.configure(function() { app.set('views', __dirname + '/app/views'); app.set('view engine', 'jade'); app.use(express.static(path.join(__dirname, 'public'))); app.use(express.cookieParser()); app.use(express.methodOverride()); app.use(express.session({secret: 'keyboard cat'})); app.use(bodyParser.json()); app.use(express.json()); // to support JSON-encoded bodies app.use(express.urlencoded()); // to support URL-encoded bodies app.locals.basedir = path.join(__dirname, '/app/views'); app.use(app.router); app.basepath = __dirname; require('./routes')(); http.createServer(app).listen(3000, function() { console.log('Server Started'); }); }); 5.2.2 PACKAGE.JSON { "name": "foss-project", "scripts": {
  • 57. 43 "start": "node app" }, "dependencies": { "body-parser": "^1.5.2", "connect": "*", "express": "3.4.0", "formidable": "1.0.15", "jade": "*", "request": "2.x" }, "engines": { "node": "0.10.x", "npm": "1.2.x" } } 5.2.3 ROUTES.JS ; var formidable = require('formidable'), util = require('util'), fs = require('fs'), sys = require('sys'), exec = require('child_process').exec; module.exports = function() { app.get('/', function(req, res) { res.render('index'); });
  • 58. 44 app.post('/upload', function(req, res) { // parse a file upload var form = new formidable.IncomingForm(); form.parse(req, function(err, fields, files) { //Write to CSV file within r folder fs.readFile(files.upload.path, function(err, data) { var newPath = __dirname + "/r/input.csv"; fs.writeFile(newPath, data, function(err) { function puts(error, stdout, stderr) { res.render('upload'); } exec("Rscript r/init.R", puts); }); }); }); return; }); app.get('/predict', function(req, res) { res.render('predict'); }); app.post('/predict', function(req, res) { var json = JSON.parse(req.body.json); var key_string = '"",', value_string = '"",';
for(var i in json){
      key_string += json[i].name + ',';
      value_string += json[i].value + ',';
    }
    key_string += 'Dropout';
    value_string += '""';
    // Join the header row and the value row with newlines to form the CSV
    var string = key_string + '\n' + value_string + '\n';
    fs.writeFile('r/predict-input.csv', string, function(err) {
      function puts(error, stdout, stderr) {
        fs.readFile('r/output', 'utf-8', function(err, data) {
          res.end(data);
        });
      }
      exec("Rscript r/prediction.R", puts);
    });
  });
};

5.2.4 INDEX.JADE

doctype html html head title Dashboard meta(charset="UTF-8") meta(content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no' name='viewport')
  • 60. 46 link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css") link(rel="stylesheet",href="css/font-awesome. min.css",type="text/css") link(rel="stylesheet",href="css/ionicons.min.css",type="text/css") link(rel="stylesheet",href="css/morris/morris.css",type="text/css") link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap- 1.2.2.css",type="text/css") link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3- wysihtml5.min.css",type="text/css") link(rel="stylesheet",href="css/AdminLTE.css",type="text/css") body(class="skin-blue") header(class="header") a(href="/",class="logo") FOSS Project nav(class="navbar navbar-static-top",role="navigation") a(href="#",class="navbar-btn sidebar-toggle",data-toggle=" offcanvas",role="button") span(class="sr-only") Toggle Navigation span(class="icon-bar") span(class="icon-bar") span(class="icon-bar") div(class="wrapper row-offcanvas row-offcanvas-left") aside(class="left-side sidebar-offcanvas") section(class="sidebar") ul(class="sidebar-menu") li a(href="/") i(class="fa fa-upload") span Upload
  • 61. 47 li a(href="/predict") i(class="fa fa-search") span Predict aside(class="right-side") section(class="content-header") h1 Upload section div(class="box box-primary") form(action="/upload",enctype="multipart/form-data", method="post",role="form") div(class="box-body") div(class="form-group") input(type="file",name="upload",multiple="multiple") div(class="box-footer") button(type="submit",class="btn btn-primary") Upload script(src="js/jquery.js") script(src="js/bootstrap.min.js") script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js") script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en. js") script(src="js/AdminLTE/app.js") 5.2.5 PREDICT.JADE
  • 62. 48 doctype html html head title Dashboard meta(charset="UTF-8") meta(content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no' name='viewport') link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css") link(rel="stylesheet",href="css/font-awesome. min.css",type="text/css") link(rel="stylesheet",href="css/ionicons.min.css",type="text/css") link(rel="stylesheet",href="css/morris/morris.css",type="text/css") link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap- 1.2.2.css",type="text/css") link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3- wysihtml5.min.css",type="text/css") link(rel="stylesheet",href="css/AdminLTE.css",type="text/css") body(class="skin-blue") header(class="header") a(href="/",class="logo") FOSS Project nav(class="navbar navbar-static-top",role="navigation") a(href="#",class="navbar-btn sidebar-toggle",data-toggle=" offcanvas",role="button") span(class="sr-only") Toggle Navigation span(class="icon-bar") span(class="icon-bar") span(class="icon-bar") div(class="wrapper row-offcanvas row-offcanvas-left")
aside(class="left-side sidebar-offcanvas")
  section(class="sidebar")
    ul(class="sidebar-menu")
      li
        a(href="/")
          i(class="fa fa-upload")
          span Upload
      li
        a(href="/predict")
          i(class="fa fa-search")
          span Predict
aside(class="right-side")
  section(class="content-header")
    h1 Predict
  section
    div(class="box box-primary")
      form(action="#",enctype="multipart/form-data",method="post",role="form",id="form")
        div(class="box-body")
          div(class="form-group col-md-2")
            label Gender
            select(class="form-control",name="Gender")
              option(value="Male") Male
              option(value="Female") Female
          div(class="form-group col-md-2")
            label Poverty
            select(class="form-control",name="Poverty")
              option(value="Yes") Yes
              option(value="No") No
          div(class="form-group col-md-2")
            label Community
            select(class="form-control",name="Community")
              option(value="General") General
              option(value="OBC") OBC
              option(value="SC") SC
              option(value="ST") ST
          div(class="form-group col-md-2")
            label Rural
            select(class="form-control",name="Rural")
              option(value="Yes") Yes
              option(value="No") No
          div(class="form-group col-md-2")
            label PTR
            select(class="form-control",name="PTR")
              option(value="Low") Low
              option(value="Medium") Medium
              option(value="High") High
          div(class="form-group col-md-2")
            label SCR
            select(class="form-control",name="SCR")
              option(value="Low") Low
              option(value="Medium") Medium
              option(value="High") High
        div(class="box-footer",style="margin-left: 5px;")
          button(type="button",class="btn btn-primary",id="submit") Predict
          label(id="outcome",style="margin-left: 10px;")
script(src="js/jquery.js")
script(src="js/bootstrap.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js")
script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en.js")
script(src="js/AdminLTE/app.js")
script(type="text/javascript").
  $(document).ready(function(){
    $('#submit').click(function(){
      var json = JSON.stringify($('#form').serializeArray());
      $.ajax({
        url: '/predict',
        method: 'post',
        data: { json: json },
        success: function(response){
          var label = $('#outcome');
          if(response.indexOf("TRUE") >= 0){
            label.css('color', 'red');
            label.html('Student will Dropout');
          } else {
            label.css('color', 'green');
            label.html('Student will Not Dropout');
          }
        }
      });
    });
  });

5.2.6 UPLOAD.JADE

doctype html
html
  head
    title Dashboard
    meta(charset="UTF-8")
    meta(content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no' name='viewport')
    link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css")
    link(rel="stylesheet",href="css/font-awesome.min.css",type="text/css")
    link(rel="stylesheet",href="css/ionicons.min.css",type="text/css")
    link(rel="stylesheet",href="css/morris/morris.css",type="text/css")
    link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap-1.2.2.css",type="text/css")
    link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3-wysihtml5.min.css",type="text/css")
    link(rel="stylesheet",href="css/AdminLTE.css",type="text/css")
  body(class="skin-blue")
    header(class="header")
      a(href="/",class="logo") FOSS Project
      nav(class="navbar navbar-static-top",role="navigation")
        a(href="#",class="navbar-btn sidebar-toggle",data-toggle="offcanvas",role="button")
          span(class="sr-only") Toggle Navigation
          span(class="icon-bar")
          span(class="icon-bar")
          span(class="icon-bar")
    div(class="wrapper row-offcanvas row-offcanvas-left")
      aside(class="left-side sidebar-offcanvas")
        section(class="sidebar")
          ul(class="sidebar-menu")
            li
              a(href="/")
                i(class="fa fa-upload")
                span Upload
            li
              a(href="/predict")
                i(class="fa fa-search")
                span Predict
      aside(class="right-side")
        section(class="content-header")
          h1 Learning Curves
        section
          div(class="box box-primary")
            iframe(src="plot.png",style="width: 600px; height: 500px;",frameborder="0")
    script(src="js/jquery.js")
    script(src="js/bootstrap.min.js")
    script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js")
    script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en.js")
    script(src="js/AdminLTE/app.js")
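The click handler in PREDICT.JADE serializes the form fields and posts them to /predict, then colours the outcome label depending on whether the server's reply contains the string "TRUE". That decision logic can be isolated into a small, testable helper; this is only a sketch, the function name interpretPrediction is hypothetical, and the exact reply format ("[1] TRUE" in the comment) is an assumption about R's console output rather than something taken from the project code:

```javascript
// Hypothetical helper mirroring the success callback in predict.jade.
// Assuming the backend relays R's printed logical value (e.g. "[1] TRUE"
// or "[1] FALSE"), the client only needs to check for the substring "TRUE".
function interpretPrediction(response) {
  if (response.indexOf("TRUE") >= 0) {
    return { color: "red", message: "Student will Dropout" };
  }
  return { color: "green", message: "Student will Not Dropout" };
}
```

A substring check is deliberately loose: it tolerates whatever prefix or whitespace the backend wraps around the logical value, at the cost of misfiring if the reply ever contains "TRUE" in another context.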
CHAPTER 6
RESULTS

6.1 DATASET UPLOAD

6.2 UPLOAD RESULT

Fig 6.1: Upload Result

6.3 PREDICTION

Fig 6.2: Prediction Screen

Fig 6.3: Predicting Student will not Dropout

Fig 6.4: Predicting Student will Dropout
CHAPTER 7
CONCLUSIONS

The advent of Information Technology and the Internet has led to vast amounts of data being gathered and stored in multiple formats by multiple sources. Both large corporations and Government Agencies are therefore attempting to tap into these vast troves of data to make better decisions and create more efficient processes. Techniques such as Machine Learning and Neural Networks, commonly grouped under the term Big Data, are revolutionizing the way we analyze information and are adding real value.

This project was inspired by such technologies. The aim was to create an objective mechanism for addressing the dropout problem that could be used in policy making. The algorithm can provide an objective solution by identifying vulnerable students who truly need help, thereby improving retention and completion rates in schools.

Personally, it was a great opportunity for me to explore an area of programming that I had wanted to learn for some time. At the same time, getting the chance to solve a real-world problem that is vital to our society made it all the more worthwhile. I humbly admit that the algorithm developed is by no means perfect, but it was a determined attempt on my part to show what is possible. I hope that others will take this work up and extend it to the point where it can be of use to Government Agencies and provide real value to students, who are the final beneficiaries of this system and the future of our nation.
CHAPTER 8
REFERENCES

RESEARCH PAPERS
 Data Mining: A Prediction for Student's Performance Using Classification Method (World Journal of Computer Application and Technology)
 A Comparative Study for Predicting Student's Academic Performance Using Bayesian Network Classifiers (IOSR Journal of Engineering)
 School Dropout across Indian States and UTs: An Econometric Study (International Research Journal of Social Sciences)
 Mining Educational Data to Analyze Students' Performance (International Journal of Advanced Computer Science and Applications)
 Gender Issues and Dropout Rates in India: Major Barrier in Providing Education for All (Amirtham, N. S. & Kundupuzhakkal, S., Educationia Confab)
 Mining Educational Data Using Classification to Decrease Dropout Rate of Students (International Journal of Multidisciplinary Sciences and Engineering)
 Predicting Students' Academic Performance Using Education Data Mining (International Journal of Computer Science and Mobile Computing)
 Prediction of Student Academic Performance by an Application of Data Mining Techniques (2011 International Conference on Management and Artificial Intelligence)
 Educational Data Mining: A Review of the State of the Art (Transactions on Systems, Man, and Cybernetics)
SURVEYS
 School Drop-out: Patterns, Causes, Changes and Policies (UNESCO)
 The Criticality of Pupil Teacher Ratio (Azim Premji Foundation)
 Survey for Assessment of Dropout Rates at Elementary Level in 21 States (EdCIL)
 Right to Education Report Card (Annual Status of Education Report 2011)
 How High Are Dropout Rates in India? (Economic and Political Weekly, March 17, 2007)

GOVERNMENT REPORTS
 Review, Examination and Validation of Data on Dropout in Karnataka (Department of Education, Government of Karnataka)
 Drop-out Rate at Primary Level: A Note Based on DISE 2003-04 & 2004-05 Data (National Institute of Educational Planning and Administration)
 Dropout in Secondary Education: A Study of Children Living in Slums of Delhi (National University of Educational Planning and Administration)

BOOKS
 Data Mining: Concepts and Techniques (Jiawei Han and Micheline Kamber)
 R in Action (Robert I. Kabacoff)
LINKS
 http://www.wikipedia.org
 http://scholar.google.com
 https://www.coursera.org/course/ml