This document compares methods for analyzing clustered educational data, including tree-based data mining algorithms and hierarchical linear modeling (HLM). On PISA 2018 data, a mixed-effects random forest achieved the highest prediction accuracy, though it inherits the interpretability limitations of tree models. HLM, by contrast, provides a principled framework for examining multilevel relationships and quantifying the impact of factors at each level. The study offers guidance for selecting suitable methods for analyzing clustered educational datasets.
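As a minimal illustration of the multilevel setting described above (not the study's own code), a random-intercept HLM can be fitted with statsmodels on synthetic school-clustered scores; the variable names (`school`, `ses`, `score`) and all data values here are invented for the sketch.

```python
# Random-intercept hierarchical linear model on synthetic clustered data.
# Students (level 1) are nested in schools (level 2); the school effect is
# captured by a random intercept per school.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_schools, per_school = 20, 30
school = np.repeat(np.arange(n_schools), per_school)
school_effect = rng.normal(0, 2, n_schools)[school]      # level-2 variation
ses = rng.normal(0, 1, n_schools * per_school)           # level-1 predictor
score = 50 + 3 * ses + school_effect + rng.normal(0, 5, n_schools * per_school)

data = pd.DataFrame({"school": school, "ses": ses, "score": score})
model = smf.mixedlm("score ~ ses", data, groups=data["school"]).fit()
print(model.params["ses"])  # slope estimate, near the true value of 3
```

The random intercept absorbs between-school variance that a single-level regression would wrongly attribute to residual noise; this is the core advantage HLM offers for clustered data.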
Dr. S. Saravana Kumar, "A Systematic Review on the Educational Data Mining and its Implementation in the Applications," United International Journal for Research & Technology (UIJRT), Volume 01, Issue 09, pp. 01-03, 2020. https://uijrt.com/articles/v1i9/UIJRTV1I90001.pdf
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL (IJCSIT)
Predicting student performance is a major concern for higher-education management. Such prediction helps identify and improve students' performance, and several factors may contribute to it. In the present study, we employ data mining processes, particularly classification, to enhance the quality of the higher-education system. Recently, combining classifiers has emerged as a new direction for improving classification accuracy. In this paper, we design and evaluate a fast learning algorithm that combines an AdaBoost ensemble with a simple genetic algorithm, called "Ada-GA", where the genetic algorithm is shown to improve the accuracy of the combined classifier. The Ada-GA algorithm proved considerably useful for identifying at-risk students early, especially in very large classes; this early prediction allows the instructor to provide appropriate advising to those students. The Ada-GA algorithm was implemented and tested on the ASSISTments dataset, and the results showed that it improved detection accuracy while reducing computational complexity.
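The boosting half of Ada-GA can be sketched with scikit-learn's stock AdaBoost (this is not the authors' Ada-GA implementation, and the genetic-algorithm search is omitted; the data here are synthetic):

```python
# AdaBoost on a synthetic at-risk / not-at-risk classification task.
# The default weak learner is a depth-1 decision tree (a "stump").
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))  # held-out accuracy
```

In Ada-GA, the genetic algorithm would additionally search over ensemble parameters and feature subsets to raise this accuracy while shrinking the model.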
Prediction of student performance has become an essential issue for improving the educational system. However, it has turned out to be a challenging task due to the huge quantity of data in the educational environment. Educational data mining is an emerging field that aims to develop techniques to manipulate and explore sizable educational data. Classification is one of the primary educational data mining approaches and the most widely used for predicting student performance and characteristics. In this work, three linear classification techniques (logistic regression, support vector machines (SVM), and stochastic gradient descent (SGD)) and three nonlinear classification methods (decision tree, random forest, and adaptive boosting (AdaBoost)) are explored and evaluated on a dataset from the ASSISTments system. K-fold cross-validation is used to evaluate the implemented techniques. The results demonstrate that the decision tree algorithm outperforms the other techniques, with an average accuracy of 0.7254, an average sensitivity of 0.8036, and an average specificity of 0.901. Furthermore, the importance of the utilized features is obtained and system performance is computed using the most significant features. The results reveal that the best performance is reached using the first 80 important features, with accuracy, sensitivity, and specificity of 0.7252, 0.8042, and 0.9016, respectively.
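A comparison of this kind can be sketched in a few lines of scikit-learn (an illustration on synthetic data, not the paper's ASSISTments experiment; the results will not match the figures quoted above):

```python
# Compare the six classifier families from the abstract with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

models = {
    "logistic":      LogisticRegression(max_iter=1000),
    "svm":           LinearSVC(),
    "sgd":           SGDClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "adaboost":      AdaBoostClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()  # mean 5-fold accuracy
    print(f"{name:>13}: {results[name]:.3f}")
```

Averaging over the k folds, as here, is what produces the "average accuracy / sensitivity / specificity" figures the abstract reports.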
Data Mining for Education
Ryan S.J.d. Baker, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
rsbaker@cmu.edu
Article to appear as
Baker, R.S.J.d. (in press) Data Mining for Education. To appear in McGaw, B., Peterson, P.,
Baker, E. (Eds.) International Encyclopedia of Education (3rd edition). Oxford, UK: Elsevier.
This is a pre-print draft. Final article may involve minor changes and different formatting.
Engineering Research Publication
Best International Journals, High Impact Journals,
International Journal of Engineering & Technical Research
ISSN : 2321-0869 (O) 2454-4698 (P)
www.erpublication.org
PREDICTING SUCCESS: AN APPLICATION OF DATA MINING TECHNIQUES TO STUDENT OUTCOMES (IJDKP)
This project examines the effectiveness of applying machine learning techniques to college student success, specifically with the intent of discovering and identifying the student characteristics and factors that show the strongest predictive capability with regard to successful graduation. The student data examined consist of first-time freshmen and transfer students who matriculated at California State University San Marcos between Fall 2000 and Fall 2010 and who either graduated successfully or discontinued their education. Operating on over 30,000 student observations, random forests are used to determine the relative importance of the student characteristics, with genetic algorithms performing feature selection and pruning. To improve the machine learning algorithm, cross-validated hyperparameter tuning was also implemented. Overall predictive strength is relatively high as measured by the Matthews Correlation Coefficient, and both intuitive and novel features that support the learning model are explored.
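The random-forest-importance plus tuned-and-MCC-scored part of that pipeline can be sketched as follows (a hedged illustration on synthetic data, not the CSUSM study's code; the genetic-algorithm feature search is omitted):

```python
# Random-forest feature importances, with hyperparameters tuned by grid
# search scored on the Matthews Correlation Coefficient (MCC).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring=make_scorer(matthews_corrcoef),  # MCC, as in the abstract
    cv=3,
)
search.fit(X, y)
ranked = np.argsort(search.best_estimator_.feature_importances_)[::-1]
print("best MCC:", round(search.best_score_, 3))
print("features, most important first:", ranked[:4])
```

MCC is a reasonable choice here because graduation data are typically imbalanced, and MCC penalizes a model that simply predicts the majority class.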
A Study on Learning Factor Analysis – An Educational Data Mining Technique fo... (IOSR-JCE)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Oversampling technique in student performance classification from engineering... (IJECE)
The first year is important for an engineering student's academic planning, as all first-year subjects are essential to an engineering foundation. Student performance prediction helps academics improve outcomes, and lets students check their own standing: if students are aware that their performance is low, they can work to improve it. This research focused on combining oversampling of minority-class data with various classifier models. The oversampling techniques were SMOTE, Borderline-SMOTE, SVM-SMOTE, and ADASYN, and four classifiers were applied: MLP, gradient boosting, AdaBoost, and random forest. The results showed that Borderline-SMOTE gave the best minority-class prediction with several of the classifiers.
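The core idea behind all four oversamplers can be shown in miniature with NumPy (this is a sketch of the SMOTE interpolation principle, not the imbalanced-learn library the variants come from, and the minority-class data are synthetic):

```python
# SMOTE-style oversampling: synthesize new minority samples by interpolating
# between a minority point and one of its k nearest minority neighbours.
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 4))  # toy minority-class feature matrix

def smote_like(X, n_new, k=3):
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # k nearest neighbours of X[i] within the minority class (self excluded)
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

new_points = smote_like(minority, n_new=30)
print(new_points.shape)  # 30 synthetic minority samples with 4 features
```

Borderline-SMOTE, SVM-SMOTE, and ADASYN differ mainly in *which* minority points they pick to interpolate from (those near the class boundary, or those hardest to classify), not in the interpolation step itself.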
Association rule discovery for student performance prediction using metaheuri... (CSANDIT)
With the increasing use of data mining techniques to improve the operation of educational systems, Educational Data Mining has been introduced as a new and fast-growing research area. Educational Data Mining aims to analyze data in educational environments in order to solve educational research problems. In this paper a new associative classification technique is proposed to predict students' final performance. Unlike several machine learning approaches such as ANNs and SVMs, associative classifiers maintain interpretability along with high accuracy. In this research work, we employ Honeybee Colony Optimization and Particle Swarm Optimization to extract association rules for student performance prediction as a multi-objective classification problem. Results indicate that the proposed swarm-based algorithm outperforms well-known classification techniques on the student performance prediction classification problem.
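Whatever search strategy discovers the rules, each candidate rule is scored on the same two classical metrics; here is a toy computation of them for a hypothetical rule "did_homework → passed" over invented student records (the swarm-based search itself is not shown):

```python
# Support and confidence of the association rule did_homework -> passed.
records = [
    {"did_homework": True,  "passed": True},
    {"did_homework": True,  "passed": True},
    {"did_homework": True,  "passed": False},
    {"did_homework": False, "passed": False},
    {"did_homework": False, "passed": True},
]

antecedent = [r for r in records if r["did_homework"]]
both = [r for r in antecedent if r["passed"]]

support = len(both) / len(records)        # P(antecedent AND consequent)
confidence = len(both) / len(antecedent)  # P(consequent | antecedent)
print(support, confidence)                # 0.4 and 2/3 for this toy data
```

Treating rule discovery as multi-objective, as the paper does, means the optimizer trades off such metrics (e.g. confidence vs. coverage) rather than maximizing a single score.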
Clustering analysis of learning style on Anggana high school students (TELKOMNIKA Journal)
The inability of students to absorb the knowledge conveyed by the teacher is not necessarily caused by a lack of understanding, nor by a teacher who is unable to teach, but by a mismatch of learning styles between students and teachers, which leaves students uncomfortable learning from a particular teacher. This also happens at senior high school (SHS/SMAN) 1 Anggana, motivating this research: to cluster student learning styles by applying the k-Means and Fuzzy C-Means data mining methods. The purpose was to assess the effectiveness of these learning-style clusters for developing students' absorptive capacity and improving their achievement. The clustering follows the standard data mining process: data cleaning, data selection, data transformation, data mining, pattern evaluation, and knowledge development.
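The k-Means step of such a study can be sketched as follows (an illustration on invented learning-style scores, not the SMAN 1 Anggana data; the three style dimensions are assumed for the example):

```python
# Group students into 3 clusters by visual/auditory/kinesthetic style scores.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 60 students x 3 style scores, drawn around three loose "style" centres
centres = np.array([[8, 2, 2], [2, 8, 2], [2, 2, 8]], dtype=float)
scores = np.vstack([c + rng.normal(0, 1, size=(20, 3)) for c in centres])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
print(np.bincount(km.labels_))  # students per learning-style cluster
```

Fuzzy C-Means differs in that each student receives a degree of membership in every cluster rather than a single hard label, which suits students with mixed learning styles.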
Data Mining Techniques for School Failure and Dropout System (Kumar Goud)
Abstract: Data mining techniques are applied to predict school failure and student dropout. The method uses real data on middle-school students to predict failure and dropout, and implements white-box classification strategies such as induction rules and decision trees. A decision tree is a decision-support tool that uses a tree-like graph to model decisions and their possible consequences: a flowchart-like structure in which each internal node represents a "test" on an attribute (the attributes being the students' real information, collected at school), each branch represents an outcome of the test, and each leaf node represents a class label. The paths from root to leaf represent classification rules, and the tree consists of three kinds of nodes: decision nodes, chance nodes, and end nodes. It is widely used in decision analysis. The technique's accuracy for predicting which students might fail or drop out is improved by first using all the available attributes and then selecting the most effective ones; attribute selection is done using the WEKA tool.
Keywords: dataset, classification, clustering.
Educational Data Mining is used to find interesting patterns in data taken from educational settings in order to improve teaching and learning. Assessing students' ability and performance with EDM methods in an e-learning environment for school-level math education in India has not been addressed in our literature review. Our method is a novel approach to providing quality math education, with assessments indicating a student's knowledge level in each lesson. This paper illustrates how the learning curve, an EDM visualization method, is used to compare rural and urban students' progress in learning mathematics in an e-learning environment. The experiment was conducted in two different schools in Tamil Nadu, India. After practicing the problems, the students took a test; their interaction data were collected and their performance analyzed in several aspects: knowledge-component level, time taken to solve a problem, and error rate. This work studies student actions to identify learning progress. The results show that the learning curve method greatly helps teachers visualize students' performance at a granular level, which is not possible manually. It also helps students know their skill level on completing each unit.
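The learning-curve computation implied above is simply error rate per practice opportunity; a minimal sketch on invented interaction logs for one knowledge component (the log format is assumed for illustration):

```python
# Error rate per practice opportunity for one knowledge component.
from collections import defaultdict

logs = [  # (opportunity number, answered correctly?)
    (1, False), (1, False), (1, True),
    (2, False), (2, True),  (2, True),
    (3, True),  (3, True),  (3, False),
    (4, True),  (4, True),  (4, True),
]

by_opportunity = defaultdict(list)
for opp, correct in logs:
    by_opportunity[opp].append(correct)

curve = {opp: 1 - sum(v) / len(v) for opp, v in sorted(by_opportunity.items())}
print(curve)  # error rate falls as practice accumulates, tracing the curve
```

Plotting `curve` for rural vs. urban cohorts side by side yields exactly the kind of granular comparison the abstract describes.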
Clustering Students of Computer in Terms of Level of Programming (Editor IJCATR)
Educational data mining (EDM) is one of the applications of data mining. It has two key domains, the student domain and the faculty domain, and different types of research have been done in both. In the existing system, faculty performance is calculated from two parameters: student feedback and the students' results in that subject. The existing system defines and compares two approaches for relative evaluation of faculty performance using data mining techniques: a multiple-classifier approach and a single-classifier approach. In the multiple-classifier approach, K-nearest neighbor (KNN) is used in the first classification step and rule-based classification in the second, while in the single-classifier approach only KNN is used in both steps. The proposed system instead analyses faculty performance using four parameters: student complaints about faculty, student review feedback for faculty, general student feedback, and student results. It applies opinion mining to analyze faculty performance and compute a score for each faculty member.
A SYSTEM OF SERIAL COMPUTATION FOR CLASSIFIED RULES PREDICTION IN NONREGULAR ... (IJAIA)
Objects or structures that are regular have uniform dimensions. Based on the concept of regular models, our previous research developed a regular ontology that models learning structures in a multiagent system for uniform pre-assessments in a learning environment. This regular ontology led to the modelling of a classified-rules learning algorithm that predicts the actual number of rules needed for inductive learning processes and decision making in a multiagent system. But not all processes or models are regular, so this paper presents a system of polynomial equations that can estimate and predict the required number of rules of a non-regular ontology model given some defined parameters.
Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction (IJTSRD)
Data mining techniques play an important role in data analysis. For the construction of a classification model that can predict the performance of students, particularly for engineering branches, a decision tree algorithm from the data mining toolbox has been used in this research. A number of factors may affect the performance of students. In this paper, we used educational data mining to predict students' final grades based on their performance, classifying student data with the ID3 (Iterative Dichotomiser 3) decision tree algorithm. Khin Khin Lay and San San Nwe, "Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction," International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 3, Issue 5, August 2019. Paper URL: https://www.ijtsrd.com/papers/ijtsrd26545.pdf and https://www.ijtsrd.com/computer-science/data-miining/26545/using-id3-decision-tree-algorithm-to-the-student-grade-analysis-and-prediction/khin-khin-lay
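ID3 chooses each split by information gain (entropy reduction); scikit-learn's tree with the `"entropy"` criterion is the closest stock equivalent, so the following is a sketch of the idea rather than ID3 itself, using a built-in dataset in place of student grades:

```python
# Entropy-criterion decision tree, the split rule ID3 is built on.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(round(tree.score(X, y), 3))  # training accuracy of the shallow tree
```

Classical ID3 additionally restricts itself to categorical attributes and multiway splits, whereas scikit-learn builds binary trees over numeric features; for grade bands encoded as categories the two behave similarly.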
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES (ACIJ)
Named Entity Recognition which is an important subject of Natural Language Processing is a key technology of information extraction, information retrieval, question answering and other text processing applications. In this study, we evaluate previously well-established association measures as an initial
attempt to extract two-worded named entities in a Turkish corpus. Furthermore we propose a new association measure, and compare it with the other methods. The evaluation of these methods is performed by precision and recall measures.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING (IJCSA)
Text document clustering is one of the fastest-growing research areas because of the huge amount of information available in electronic form. A number of techniques have been proposed for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms perform a localized search while navigating, summarizing, and organizing information, whereas a globally optimal solution can be obtained by applying high-speed, high-quality optimization algorithms that search the entire solution space. In this paper, a brief survey of optimization approaches to text document clustering is carried out.
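The local-search baseline that such optimization approaches try to beat can be sketched as TF-IDF vectors clustered with k-means (toy documents invented for the example; the survey's own methods are not shown):

```python
# TF-IDF + k-means: the standard localized-search document clustering baseline.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "student exam grades prediction",
    "student performance grades model",
    "football match goals scored",
    "football league goals table",
]
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the two education docs and the two sports docs separate
```

Global optimizers (genetic algorithms, particle swarms, and the like) replace the k-means refinement loop with a population-based search over cluster assignments, at higher computational cost.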
The main objective of this paper is to develop a basic prototype model that can determine and extract unknown knowledge (patterns, concepts, and relations) related to multiple factors from past database records of specific students. Data mining is the science and engineering of extracting previously undiscovered patterns from huge sets of data, and its techniques are helpful for decision making as well as for discovering patterns in data. In this paper a student eligibility prediction system using rule-based classification is proposed, to predict the eligibility of students from their details with high prediction accuracy. Educational institutes generate a tremendous amount of data; this paper outlines the idea of predicting a particular student's placement eligibility by operating on the stored data, and proposes an efficient prediction algorithm based on fuzzy techniques.
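A crisp (non-fuzzy) rule-based eligibility check in the spirit of that proposal might look like this; the field names and thresholds are invented for illustration and are not taken from the paper:

```python
# Toy rule-based placement-eligibility classifier.
def placement_eligible(student):
    rules = [
        student["cgpa"] >= 6.5,       # academic cut-off (hypothetical)
        student["backlogs"] == 0,     # no pending backlogs
        student["attendance"] >= 75,  # minimum attendance percentage
    ]
    return all(rules)

print(placement_eligible({"cgpa": 7.2, "backlogs": 0, "attendance": 82}))  # True
print(placement_eligible({"cgpa": 7.2, "backlogs": 1, "attendance": 82}))  # False
```

A fuzzy version, as the paper proposes, would replace each hard threshold with a membership function and combine the degrees of satisfaction, so a student just below one cut-off is not automatically rejected.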
A Study on Learning Factor Analysis – An Educational Data Mining Technique fo...iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Oversampling technique in student performance classification from engineering...IJECEIAES
The first year of an engineering student was important to take proper academic planning. All subjects in the first year were essential for an engineering basis. Student performance prediction helped academics improve their performance better. Students checked performance by themselves. If they were aware that their performance are low, then they could make some improvement for their better performance. This research focused on combining the oversampling minority class data with various kinds of classifier models. Oversampling techniques were SMOTE, BorderlineSMOTE, SVMSMOTE, and ADASYN and four classifiers were applied using MLP, gradient boosting, AdaBoost and random forest in this research. The results represented that Borderline-SMOTE gave the best result for minority class prediction with several classifiers.
Association rule discovery for student performance prediction using metaheuri...csandit
According to the increase of using data mining tech
niques in improving educational systems
operations, Educational Data Mining has been introd
uced as a new and fast growing research
area. Educational Data Mining aims to analyze data
in educational environments in order to
solve educational research problems. In this paper
a new associative classification technique
has been proposed to predict students final perform
ance. Despite of several machine learning
approaches such as ANNs, SVMs, etc. associative cla
ssifiers maintain interpretability along
with high accuracy. In this research work, we have
employed Honeybee Colony Optimization
and Particle Swarm Optimization to extract associat
ion rule for student performance prediction
as a multi-objective classification problem. Result
s indicate that the proposed swarm based
algorithm outperforms well-known classification tec
hniques on student performance prediction
classification problem.
ASSOCIATION RULE DISCOVERY FOR STUDENT PERFORMANCE PREDICTION USING METAHEURI...cscpconf
According to the increase of using data mining techniques in improving educational systems
operations, Educational Data Mining has been introduced as a new and fast growing research
area. Educational Data Mining aims to analyze data in educational environments in order to
solve educational research problems. In this paper a new associative classification technique
has been proposed to predict students final performance. Despite of several machine learning
approaches such as ANNs, SVMs, etc. associative classifiers maintain interpretability along
with high accuracy. In this research work, we have employed Honeybee Colony Optimization
and Particle Swarm Optimization to extract association rule for student performance prediction
as a multi-objective classification problem. Results indicate that the proposed swarm based
algorithm outperforms well-known classification techniques on student performance prediction
classification problem.
Clustering analysis of learning style on anggana high school studentTELKOMNIKA JOURNAL
The inability of students to absorb the knowledge conveyed by the teacher is’nt caused by the inability of understanding and by the teacher which isn’t able to teach too, but because of the mismatch of learning styles between students and teachers, so that students feel uncomfortable in learning to a particular teacher. It also happens in senior high school (SHS/SMAN) 1 Anggana, so it is necessary to do this research, to analyze cluster (group) of student learning style by applying data mining method that is k-Means and Fuzzy C-Means. The purpose was to know the effectiveness of this learning style cluster on the development of absorptive power and improving student achievement. The method used to cluster the learning style with data mining process starts from the data cleaning stage, data selection, data transformation, data mining, pattern evolution, and knowledge development.
Data Mining Techniques for School Failure and Dropout SystemKumar Goud
Abstract: Data mining techniques are applied to predict school failure and dropout among students. The method uses real data on middle-school students to predict failure and dropout. It implements white-box classification strategies, such as induction rules and decision trees. A decision tree is a decision support tool that uses a tree-like graph to model decisions and their possible consequences. It is a flowchart-like structure in which each internal node represents a "test" on an attribute (the attributes being real information about students collected in middle or higher education), each branch represents the outcome of the test, and each leaf node represents a class label. The paths from root to leaf represent classification rules, and the tree consists of three kinds of nodes: decision nodes, chance nodes, and end nodes. It is widely used in decision analysis. The technique's accuracy in predicting which students might fail or drop out is improved by first using all the available attributes and then selecting the most effective ones. Attribute selection is done using the WEKA tool.
Keywords: dataset, classification, clustering.
Educational Data Mining is used to find interesting patterns in data taken from educational settings to improve teaching and learning. Assessing students' ability and performance with EDM methods in an e-learning environment for school-level math education in India has not been identified in our literature review. Our method is a novel approach to providing quality math education with assessments indicating a student's knowledge level in each lesson. This paper illustrates how the Learning Curve, an EDM visualization method, is used to compare rural and urban students' progress in learning mathematics in an e-learning environment. The experiment was conducted in two different schools in Tamil Nadu, India. After practicing the problems, the students took a test; their interaction data were collected and their performance was analyzed in different aspects: knowledge component level, time taken to solve a problem, and error rate. This work studies student actions to identify learning progress. The results show that the learning curve method helps teachers visualize students' performance at a granular level, which is not possible manually. It also helps students know their skill level when they complete each unit.
Clustering Students of Computer in Terms of Level of ProgrammingEditor IJCATR
Educational data mining (EDM) is one of the applications of data mining. In educational data mining, there are two key domains, i.e. the student domain and the faculty domain. Different types of research work have been done in both domains.
In the existing system, faculty performance is calculated on the basis of two parameters, i.e. student feedback and student results in that subject. The existing system defines and compares two approaches for the relative evaluation of faculty performance using data mining techniques: a multiple-classifier approach and a single-classifier approach. In the multiple-classifier approach, K-nearest neighbor (KNN) is used in the first step and rule-based classification in the second step, while in the single-classifier approach only KNN is used in both steps of classification.
The proposed system, however, will analyze faculty performance using four parameters, i.e. student complaints about faculty, student review feedback for faculty, student feedback, and student results.
The proposed system will use opinion mining techniques to analyze the performance of faculty and calculate a score for each faculty member.
A SYSTEM OF SERIAL COMPUTATION FOR CLASSIFIED RULES PREDICTION IN NONREGULAR ...ijaia
Objects or structures that are regular take uniform dimensions. Based on the concepts of regular models,
our previous research work has developed a system of a regular ontology that models learning structures
in a multiagent system for uniform pre-assessments in a learning environment. This regular ontology has
led to the modelling of a classified rules learning algorithm that predicts the actual number of rules needed
for inductive learning processes and decision making in a multiagent system. But not all processes or
models are regular. Thus this paper presents a system of polynomial equations that can estimate and predict
the required number of rules of a non-regular ontology model given some defined parameters.
Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Predictionijtsrd
Data mining techniques play an important role in data analysis. For the construction of a classification model that can predict the performance of students, particularly in engineering branches, a decision tree algorithm from data mining has been used in this research. A number of factors may affect the performance of students. In this paper, we used educational data mining to predict students' final grades based on their performance, applying the ID3 (Iterative Dichotomiser 3) decision tree algorithm. Khin Khin Lay | San San Nwe, "Using ID3 Decision Tree Algorithm to the Student Grade Analysis and Prediction", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd26545.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/26545/using-id3-decision-tree-algorithm-to-the-student-grade-analysis-and-prediction/khin-khin-lay
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESacijjournal
Named Entity Recognition, an important subject of Natural Language Processing, is a key technology for information extraction, information retrieval, question answering, and other text processing applications. In this study, we evaluate previously well-established association measures as an initial attempt to extract two-word named entities in a Turkish corpus. Furthermore, we propose a new association measure and compare it with the other methods. The evaluation of these methods is performed using precision and recall measures.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
Text document clustering is one of the fastest-growing research areas because of the availability of a huge amount of information in electronic form. A number of techniques have been proposed for clustering documents such that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide a localized search for effectively navigating, summarizing, and organizing information. A globally optimal solution can be obtained by applying high-speed, high-quality optimization algorithms, which perform a globalized search in the entire solution space. In this paper, a brief survey of optimization approaches to text document clustering is presented.
The main objective of this paper is to develop a basic prototype model which can determine and extract unknown knowledge (patterns, concepts, and relations) related to multiple factors from past database records of specific students. Data mining is the science and engineering of extracting previously undiscovered patterns from huge sets of data. Data mining techniques are helpful for decision making as well as for discovering patterns in data. In this paper, a student eligibility prediction system using rule-based classification is proposed to predict the eligibility of students based on their details with high prediction accuracy. Educational institutes generate a tremendous amount of data. This paper outlines the idea of predicting a particular student's placement eligibility by performing operations on the stored data, and proposes an efficient fuzzy-rule-based algorithm for prediction.
A COMPARATIVE ANALYSIS OF DATA MINING METHODS AND HIERARCHICAL LINEAR MODELING USING PISA 2018 DATA
International Journal of Database Management Systems (IJDMS) Vol.15, No.2/3, June 2023
DOI: 10.5121/ijdms.2023.15301
A COMPARATIVE ANALYSIS OF DATA MINING METHODS AND HIERARCHICAL LINEAR MODELING USING PISA 2018 DATA
Wenting Weng¹ and Wen Luo²
¹Krieger School of Arts and Sciences, Johns Hopkins University, Baltimore, USA
²Department of Educational Psychology, Texas A&M University, College Station, USA
ABSTRACT
Educational research often encounters clustered data sets, where observations are organized into
multilevel units, consisting of lower-level units (individuals) nested within higher-level units (clusters).
However, many studies in education utilize tree-based methods like Random Forest without considering the
hierarchical structure of the data sets. Neglecting the clustered data structure can result in biased or
inaccurate results. To address this issue, this study aimed to conduct a comprehensive survey of three tree-
based data mining algorithms and hierarchical linear modeling (HLM). The study utilized the Programme
for International Student Assessment (PISA) 2018 data to compare different methods, including non-mixed-effects tree models (e.g., Random Forest) and mixed-effects tree models (e.g., the random effects expectation-maximization recursive partitioning method and mixed-effects Random Forest), as well as the HLM approach.
Based on the findings of this study, mixed-effects Random Forest demonstrated the highest prediction
accuracy, while the random effects expectation-maximization recursive partitioning method had the lowest
prediction accuracy. However, it is important to note that tree-based methods limit deep interpretation of
the results. Therefore, further analysis is needed to gain a more comprehensive understanding. In
comparison, the HLM approach retains its value in terms of interpretability. Overall, this study offers
valuable insights for selecting and utilizing suitable methods when analyzing clustered educational
datasets.
KEYWORDS
Data Mining, Clustered Data, Mixed-effects, Random Forest, HLM, Hierarchical Linear Modeling, PISA
1. INTRODUCTION
Clustered or hierarchical data exhibits a multilevel structure where observations are sampled from
lower-level units (individuals) nested within higher-level units (clusters). This type of data
includes attributes at both the individual and cluster levels, enabling the exploration of variations
among individuals within and between clusters. Observations within the same cluster tend to
share more similarities than those from different clusters. Considering both similarities and
differences across clusters is crucial and can lead to more accurate results in research. Clustered
data sets are commonly encountered in educational research, such as the Programme for
International Student Assessment (PISA) data, which measures the academic achievements of
fifteen-year-old students in reading, mathematics, and science. Scholars have studied PISA data
using a clustered structure (e.g., [1], [2]).
In 1984, Breiman et al. [3] introduced tree-based methods called classification and regression
trees (CART). CART is a non-parametric approach that can handle large data sets with a large number of attributes without requiring preselection. CART is particularly robust in handling
outliers, unlike some traditional statistical methods such as linear regression. However, in certain
circumstances (e.g., when observations are modified), CART may produce unstable results,
leading to high variability and poor predictive performance [4]. To address the instability issue,
Breiman [5] proposed a tree-based ensemble method called Random Forest (RF). RF combines a
large number of regression trees with the goal of improving predictions. RF has been successfully
applied in educational research to predict students' learning performance (e.g., [6]). However, RF
only considers the fixed effects of attributes, even when the data has a clustered structure. To
overcome this limitation, a new method called the random effects expectation-maximization recursive partitioning method (RE-EM tree) was proposed based on CART by Sela and Simonoff
[7]. This method takes into account the random effects within a clustered data structure.
Subsequently, another approach called mixed-effects Random Forest (MERF) was introduced,
which incorporates random effects into RF [8]. This allows for the consideration of both fixed
and random effects of attributes, providing a more comprehensive analysis of clustered data.
This paper aims to conduct a comprehensive survey of various tree-based data mining algorithms
and hierarchical linear modeling (HLM), which is one of the most widely used approaches for
analyzing clustered educational data sets. The comparative study focuses on comparing non-
mixed-effects tree models (i.e., RF) with mixed-effects tree models (i.e., RE-EM tree, MERF), as
well as the HLM approach. By evaluating the advantages and disadvantages of each method, this
comparison will provide valuable insights for selecting and adopting appropriate methods in the
analysis of clustered educational data sets.
In the subsequent sections of the paper, we provide a concise overview of the non-mixed-effects
tree-based method (RF), the mixed-effects tree-based methods (RE-EM tree, MERF), and the
HLM approach. We then present a comparative study to determine the optimal method by
utilizing the PISA 2018 clustered data set. Finally, we report the results obtained and engage in a
thorough discussion of the findings.
2. THEORETICAL FRAMEWORK
Educational Data Mining (EDM) is a rapidly growing field that focuses on analyzing data within
an educational context using various Data Mining (DM) techniques and tools [25]. Tree-based
methods have been commonly employed in educational research. These methods have been
utilized in various studies to analyze educational data and gain insights into different aspects of
the educational context. For example, Decision Tree has been applied to predict student outcomes
such as academic success, dropout risks, and online persistence in web-supported courses (e.g.,
[26], [27]). Additionally, Random Forest has been utilized in predicting learning performance and
detecting instances of online cheating behavior among students [28]. These tree-based methods
offer valuable tools for extracting knowledge from educational datasets and facilitating data-
driven decision-making in the field of education.
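As a minimal illustration of the kind of tree-based prediction described above, the sketch below fits a shallow decision tree to a synthetic "at risk" label. The attribute names (attendance, GPA) and thresholds are purely hypothetical, not taken from any study cited here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)

# Synthetic student records: attendance rate and prior GPA drive a
# binary "at risk" label (illustrative attributes, not real data).
n = 1000
attendance = rng.uniform(0.4, 1.0, size=n)
gpa = rng.uniform(0.0, 4.0, size=n)
at_risk = ((attendance < 0.7) & (gpa < 2.0)).astype(int)

X = np.column_stack([attendance, gpa])
X_train, X_test, y_train, y_test = train_test_split(
    X, at_risk, test_size=0.3, random_state=0)

# A shallow tree keeps the fitted model a readable set of if-then
# rules, which is the "white-box" appeal of decision trees in EDM.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", round(clf.score(X_test, y_test), 3))
```

Because the label is a conjunction of two thresholds, a depth-3 tree recovers it almost exactly; real student data would of course be far noisier.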
Hierarchical linear modeling (HLM) is widely recognized as the predominant statistical method
utilized in educational research, particularly in the analysis of multilevel research data. It has
found extensive application in various educational studies, including investigations into the
effects of technology usage on student learning achievement [29]. HLM offers a powerful
framework for examining the relationships between variables at different levels of analysis,
allowing researchers to account for the hierarchical structure of educational data and assess the
impact of various factors on student outcomes. Its versatility and capability to handle nested data
make it a popular choice for researchers seeking to delve into the complexities of educational
phenomena.
2.1. Tree-based Method: Random Forest
Random Forest (RF), introduced by Breiman [9], has gained widespread use in prediction and
classification tasks (e.g., [10]), even in scenarios with high-dimensional data [11]. RF is a
collection of regression trees that combines the bagging procedure with randomization in variable
splitting. Bagging, as proposed by Breiman [5], involves generating random bootstrap samples
from the original data. The bootstrap samples are generated by repeatedly drawing from the
original data set, with each sample having the same size as the original data. Each tree in the RF
is constructed by randomly selecting features from the bootstrap samples. The predictions of RF
are determined by averaging the outputs of the individual trees.
The main challenge of RF lies in its interpretability due to the composition of multiple regression
trees. However, RF can still provide insights into the relevance of input attributes. When training
an RF model, the out-of-bag (OOB) observations are not included in the bootstrap samples. These
OOB observations are utilized to evaluate the model's accuracy by calculating the OOB error.
This error measure is also helpful in selecting optimal values for tuning parameters, such as the
number of randomly selected attributes considered for each split [5].
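The bagging, random attribute selection, and OOB error estimation just described can be sketched with scikit-learn on synthetic data; the data-generating function and all parameter values below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: 500 observations, 5 attributes, nonlinear signal plus noise.
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# oob_score=True reuses the out-of-bag observations of each bootstrap
# sample to estimate generalization accuracy without a held-out set.
rf = RandomForestRegressor(
    n_estimators=200,   # number of bagged regression trees
    max_features=2,     # attributes randomly considered at each split
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB R^2:", round(rf.oob_score_, 3))
# Despite the ensemble's limited interpretability, attribute relevance
# is still available through the impurity-based importances.
print("importances:", np.round(rf.feature_importances_, 3))
```

Sweeping `max_features` while watching the OOB error is one common way to choose the number of randomly selected attributes per split, as noted above.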
2.2. Mixed-Effects Methods
2.2.1. Hierarchical Linear Modeling
Hierarchical linear modeling (HLM), or multilevel modeling, is a widely utilized method for
analyzing clustered data, which involves nested structures where individuals (lower-level units)
are grouped within clusters (higher-level units). This approach is commonly applied in
educational research, where individuals are sampled from classes and schools (e.g., [12]). In a
two-level model, one level explores the relationships among the lower-level units, while the other
level examines how these relationships vary across the higher-level units [13]. For instance,
consider a random intercept model, which can be expressed as follows:
Y_ij = β_0j + β_1j X_ij + e_ij (1)

where:
Y_ij = response variable value for the ith individual nested within the jth cluster unit;
β_0j = intercept for the jth cluster unit;
β_1j = regression slope associated with the attribute for the jth cluster unit;
X_ij = attribute value of X for the ith individual in the jth cluster unit;
e_ij = random error for the ith individual in the jth cluster unit.

In the model formula (1), β_0j can be written as:

β_0j = γ_00 + u_0j (2)

where:
γ_00 = mean intercept across all clustered units, which is a fixed effect;
u_0j = a random effect of the jth cluster unit on the intercept.

A combined model can be created using Equation (1) and Equation (2):

Y_ij = γ_00 + β_1j X_ij + u_0j + e_ij (3)
In this random intercept only model, the parameters are estimated via the variance components σ² and τ_00. σ² represents the unexplained variation at the lower level when controlling for the attribute X, while τ_00 is the unexplained variation at the higher level.
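A random-intercept model of this form can be fit with `statsmodels`; the sketch below uses simulated two-level data (cluster sizes, effect sizes, and variable names are all assumptions for illustration), with the cluster variance τ_00 in `cov_re` and the residual variance σ² in `scale`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic two-level data: 40 clusters (e.g., schools) of 25 students.
n_clusters, n_per = 40, 25
cluster = np.repeat(np.arange(n_clusters), n_per)
u0 = rng.normal(scale=2.0, size=n_clusters)       # random intercepts u_0j
x = rng.normal(size=n_clusters * n_per)           # student-level attribute X_ij
y = 50.0 + 3.0 * x + u0[cluster] + rng.normal(scale=1.0, size=cluster.size)

df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

# Random-intercept model: y_ij = gamma_00 + beta_1 * x_ij + u_0j + e_ij
model = smf.mixedlm("y ~ x", df, groups=df["cluster"]).fit()
print(model.summary())
# Variance components: model.cov_re holds tau_00 (between clusters),
# model.scale holds sigma^2 (within clusters).
```

With the true values γ_00 = 50, β_1 = 3, τ_00 = 4, and σ² = 1 used in the simulation, the fitted fixed effects and variance components should land close to these targets.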
2.2.2. RE-EM Tree
Sela and Simonoff [7] introduced the random effects expectation-maximization recursive
partitioning method (RE-EM tree), which is specifically designed to handle clustered and
longitudinal data. This method utilizes CART [3] as the underlying regression tree algorithm. In
Sela and Simonoff [7], we have individuals or objects i = 1, ..., I sampled at times t = 1, ..., T_i. An observation of an individual at a single time is referred to as (i, t). An individual can have multiple observations across different times. For each observation, we have a vector of J attributes, x_it = (x_it1, ..., x_itJ). The attributes may be constant among individuals over time or differ across time and individuals. To detect differences for individuals over time, we have a known design matrix Z_it and a vector of unknown individual-specific random-effects intercepts b_i being uncorrelated with the attributes. A general effects model can be written as:

y_it = Z_it b_i + f(x_it1, ..., x_itJ) + ε_it (4)

ε_i = (ε_i1, ..., ε_iT_i)′ ~ Normal(0, R_i) (5)

and

b_i ~ Normal(0, D) (6)

The ε_it are random errors that are independent and not associated with the random effects, b_i. R_i is a non-diagonal matrix that allows an autocorrelation structure within the errors for an individual. The RE-EM tree uses a tree structure to estimate f as well as the individual-specific random intercept b_i. Compared with a linear mixed-effects model (where f(x_it) = x_it β), the RE-EM tree has more flexible assumptions, which admit that the functional form of f is normally unknown. The RE-EM tree can also better handle missing values and overfitting issues. The estimation process of a RE-EM tree is shown below [7]:
1. Initially set the estimated random effects, b̂_i, to zero.
2. Run iterations through the steps a–c until the estimated random effects, b̂_i, converge,
judged by the change in the likelihood or restricted likelihood function being less than the
tolerance value.
a. Fit a regression tree to the data to predict the adjusted response, y_it − Z_it b̂_i, using the
attributes, x_it = (x_it1, …, x_itJ), for objects i = 1, ..., I at times t = 1, ..., T_i. The tree
includes a set of indicator features, I(x_it ∈ g_p), where g_p ranges over all the
terminal nodes in the tree.
b. Estimate the linear mixed-effects model, y_it = Z_it b_i + I(x_it ∈ g_p) μ_p + ε_it, using
the response variable and the attributes.
c. Extract the estimated random effects, b̂_i, from the estimated linear mixed-effects
model.
3. Replace the predicted values of the response variable at each terminal node of the tree in
step 2a with the population-level predicted mean response from the linear mixed-
effects model in step 2b.
Any tree algorithm can be applied in step 2a. Sela and Simonoff [7] implemented the CART tree
algorithm, based on the R package rpart, in step 2a and developed the R package REEMtree.
The RE-EM tree algorithm maximizes the reduction in the sum of squares when splitting a node.
Maximum likelihood or restricted maximum likelihood (REML) can be used in step 2b. The
splitting process continues as long as the improvement in the proportion of variability accounted
for by the tree exceeds a threshold called the complexity parameter (cp), which determines the
optimal size of the tree. In the example of Sela and Simonoff [7], the value of the complexity
parameter (cp) was set to at least 0.001, and the minimum number of observations in a node was
set to 20. A 10-fold cross-validation was applied to prune the tree once the initial tree was settled.
The final tree was the split with the largest cp value whose cross-validation error was less than
one standard error above the minimized value. The RE-EM tree allows for autocorrelation within
individuals, which may yield more effective models compared with models that assume no autocorrelation structure
[7].
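The alternating estimation in steps 1–3 can be sketched as a loop. This is an illustrative stand-in only: fit_fixed() replaces the regression-tree fit of step 2a with a plain mean predictor, and the mixed-model step is reduced to per-cluster mean residuals, so reem_fit() is not the actual CART/lme estimation used by REEMtree:

```python
# Toy sketch of the RE-EM alternation for a random-intercept model.
# fit_fixed() stands in for the regression-tree fit of step 2a (here it
# simply returns the mean of the adjusted response), and the per-cluster
# mean residual stands in for the mixed-model random-intercept estimate
# extracted in steps 2b-2c.
def fit_fixed(y_adj):
    return sum(y_adj) / len(y_adj)

def reem_fit(y, clusters, tol=1e-8, max_iter=100):
    b = {c: 0.0 for c in set(clusters)}  # step 1: initialize random effects to zero
    mu = 0.0
    for _ in range(max_iter):
        # step 2a: fit the fixed part to the response corrected for b_i
        y_adj = [yi - b[c] for yi, c in zip(y, clusters)]
        mu = fit_fixed(y_adj)
        # steps 2b-2c: re-estimate each cluster's random intercept
        new_b = {}
        for c in b:
            resid = [yi - mu for yi, ci in zip(y, clusters) if ci == c]
            new_b[c] = sum(resid) / len(resid)
        converged = max(abs(new_b[c] - b[c]) for c in b) < tol
        b = new_b
        if converged:
            break
    return mu, b

mu, b = reem_fit([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])
print(mu, b["a"], b["b"])  # 2.5 -1.0 1.0
```

The loop mirrors the back-and-forth of step 2: each pass refits the fixed part on the random-effect-corrected response, then refreshes the random intercepts until they stabilize.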
2.2.3. Mixed-Effects Random Forest
Hajjem et al. [14] expanded upon the CART algorithm [3] and introduced a mixed-effects
regression tree (MERT) approach for handling clustered data with a continuous outcome. MERT
utilizes the expectation-maximization (EM) algorithm to estimate the random components.
Subsequently, a standard tree is applied to estimate the fixed effects after removing the random
component. This approach enables the examination of non-linear relationships between the fixed
components and response values.
To enhance prediction accuracy, Hajjem et al. [8] further developed a mixed-effects Random
Forest (MERF), where a Random Forest replaces the regression tree. This advancement
incorporates the benefits of ensemble learning to improve predictions in the presence of random
effects. Additionally, Hajjem et al. [15] extended the MERT approach to handle non-Gaussian
response variables, introducing a generalized mixed-effects regression tree (GMERT) that can
address classification problems.
The MERF algorithm can be defined as follows:
y_i = f(X_i) + Z_i b_i + ε_i, i = 1, …, n (7)
b_i ~ N(0, D), ε_i ~ N(0, R_i) (8)
V_i = Cov(y_i) = Z_i D Z_i' + R_i (9)
where y_i = (y_i1, …, y_in_i)' is the n_i × 1 vector of responses for the n_i observations in cluster i,
X_i = [x_i1, …, x_in_i]' is the n_i × p matrix of fixed effects attributes, and f is estimated using Breiman's
Random Forest [9]. Z_i = [z_i1, …, z_in_i]' represents the n_i × q matrix of random effects attributes for
cluster i, b_i is the q × 1 vector of random effects coefficients for cluster i, and ε_i is the n_i × 1
vector of errors. D is the covariance matrix of b_i, while R_i is the covariance matrix of ε_i. In the
MERF algorithm, the random component Z_i b_i is assumed to enter the response linearly, with b_i
assumed to be independent and normally distributed. The covariance matrix of the response is
assumed to be V_i = Cov(y_i) = Z_i D Z_i' + R_i, and V = Cov(y) = diag(V_1, …, V_n), where
y = (y_1', …, y_n')'. Another assumption is that the clusters are independent of one another.
Fitting the MERF allows us to predict new observations in the clusters considering the
cluster-level random effects. The correlation is assumed to occur only via the between-cluster
variations, where R_i is diagonal (R_i = σ² I_n_i, i = 1, …, n).
The overall steps of the MERF algorithm, as described in Hajjem et al. [8], can be outlined as
follows:
1. Set r = 0 and the initial values for the parameters: b̂_i(0) = 0, σ̂²(0) = 1, and
D̂(0) = I_q.
2. Set r = r + 1. Update the response corrected for the random effects, y_i*(r), the random forest
of the fixed effects, f̂(·)(r), and the random component, b̂_i(r):
(i) Set y_i*(r) = y_i − Z_i b̂_i(r−1), i = 1,…, n.
(ii) Build a RF with y_ij*(r) as the response and x_ij as the corresponding training set of
attributes, i = 1,…, n, j = 1,…, n_i. The bootstrap training samples are repeatedly
drawn from the training set (y_ij*(r), x_ij).
(iii) Estimate f̂(x_ij)(r) using the out-of-bag prediction of the RF, that is, estimate each
f̂(x_ij)(r) using only the trees whose bootstrap samples do not contain observation (i, j).
(iv) Set b̂_i(r) = D̂(r−1) Z_i' V̂_i(r−1)⁻¹ (y_i − f̂(X_i)(r)), i = 1,…, n, where V̂_i(r−1) = Z_i D̂(r−1) Z_i' + σ̂²(r−1) I_n_i,
for i = 1,…,n.
3. Update σ̂²(r) and D̂(r) following
σ̂²(r) = N⁻¹ Σ_i { ε̂_i(r)' ε̂_i(r) + σ̂²(r−1) [n_i − σ̂²(r−1) trace(V̂_i(r−1)⁻¹)] },
D̂(r) = n⁻¹ Σ_i { b̂_i(r) b̂_i(r)' + [D̂(r−1) − D̂(r−1) Z_i' V̂_i(r−1)⁻¹ Z_i D̂(r−1)] },
where ε̂_i(r) = y_i − f̂(X_i)(r) − Z_i b̂_i(r) and N = Σ_i n_i.
4. Iterate the previous steps until convergence. Apply the generalized log-likelihood (GLL)
criterion to confirm the convergence:
GLL(f, b_i | y) = Σ_i { [y_i − f(X_i) − Z_i b_i]' R_i⁻¹ [y_i − f(X_i) − Z_i b_i] + b_i' D⁻¹ b_i + log|D| + log|R_i| }
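For the special case of one random intercept per cluster with R_i = σ²I and scalar D = τ², the GLL criterion reduces to a sum of scalar terms. A stdlib Python sketch under that assumption (the inputs here are illustrative, not fitted values):

```python
import math

def gll_random_intercept(y, f, b, clusters, sigma2, tau2):
    """GLL for one random intercept per cluster, with R_i = sigma2 * I and
    D = tau2 (both scalars). y and f are per-observation lists, b maps
    cluster -> intercept, clusters gives each observation's cluster label."""
    total = 0.0
    for c in set(clusters):
        idx = [k for k, ci in enumerate(clusters) if ci == c]
        resid_sq = sum((y[k] - f[k] - b[c]) ** 2 for k in idx)
        total += resid_sq / sigma2            # (y - f - Zb)' R^-1 (y - f - Zb)
        total += b[c] ** 2 / tau2             # b' D^-1 b
        total += math.log(tau2)               # log|D|
        total += len(idx) * math.log(sigma2)  # log|R_i| = n_i * log(sigma2)
    return total

# Perfect fit, zero intercepts, unit variances: every term vanishes
print(gll_random_intercept([1.0, 2.0], [1.0, 2.0], {"s1": 0.0},
                           ["s1", "s1"], 1.0, 1.0))  # 0.0
```

Tracking this quantity across iterations gives the convergence check of step 4.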
When predicting a new observation j from a known cluster i, we can use the population-averaged
RF prediction, f̂(x_ij), plus the random component, Z_ij b̂_i. If a new observation is from an unknown
cluster not included in the sample, we use only the population-averaged RF prediction.
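That prediction rule amounts to a dictionary lookup with a population-level fallback; b_hat below is a hypothetical mapping of fitted cluster intercepts, not output from an actual fit:

```python
def merf_predict(f_hat, b_hat, cluster):
    """Population-averaged RF prediction plus the cluster's random intercept
    when the cluster was seen in training; population prediction otherwise."""
    return f_hat + b_hat.get(cluster, 0.0)

b_hat = {"school_1": 12.5}  # hypothetical fitted random intercepts
print(merf_predict(480.0, b_hat, "school_1"))   # 492.5 (known cluster)
print(merf_predict(480.0, b_hat, "school_99"))  # 480.0 (unknown cluster)
```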
3. METHODS
3.1. Data
For this study, the PISA 2018 data set provided by the Organization for Economic Co-operation
and Development (OECD) was utilized. The PISA 2018 survey aimed to assess the knowledge
and skills of 15-year-old students in the areas of mathematics, reading, and science across 79
participating countries and regions. Additionally, 52 countries administered a questionnaire
regarding students' familiarity with information and communications technologies (ICT). In this
particular study, the focus was solely on the students' reading competencies (PV1READ) as the
response variable.
After addressing missing values, two countries with differing numbers of observations were
selected for analysis: Kazakhstan (n = 10,040) and the United States (n = 2,592). In this study, a
total of 31 attributes were considered, encompassing ICT-related attributes, reading attributes,
and other relevant student information. Table 1 provides a list of these attributes along with brief
descriptions.
Table 1. Attributes Information
Attribute Name Description
PV1READ Student reading performance score (WLE)
ICTHOME ICT available at home
ICTSCH ICT available at school
ICTRES ICT resources (WLE)
INTICT Student interest in ICT (WLE)
COMPICT Perceived ICT competence (WLE)
AUTICT Perceived autonomy related to ICT use (WLE)
SOCIAICT ICT as a topic in social interaction (WLE)
ICTCLASS Subject-related ICT use during lessons (WLE)
ICTOUTSIDE Subject-related ICT use outside of lessons (WLE)
ENTUSE ICT use for leisure outside of school (WLE)
HOMESCH Use of ICT for schoolwork activities outside of school (WLE)
USESCH Use of ICT at school in general (WLE)
PERFEED Perceived Feedback from teachers (WLE)
EMOSUPS Parental emotional support perceived by student (WLE)
LMINS Learning time (minutes per week)
ESCS Index of economic, social and cultural status (WLE)
UNDREM Meta-cognition: understanding and remembering
METASUM Meta-cognition: summarizing
METASPAM Meta-cognition: assess credibility
HEDRES Home educational resources (WLE)
STIMREAD Teachers' stimulation of reading engagement perceived by student (WLE)
ADAPTIVITY Adaptation of instruction (WLE)
TEACHINT Perceived teacher's interest in teaching (WLE)
JOYREAD Joy/Like reading (WLE)
SCREADCOMP Self-concept of reading: Perception of competence (WLE)
SCREADDIFF Self-concept of reading: Perception of difficulty (WLE)
PISADIFF Perception of difficulty of the PISA test (WLE)
PERCOMP Perception of competitiveness at school (WLE)
PERCOOP Perception of cooperation at school (WLE)
ATTLNACT Attitude towards school: learning activities (WLE)
BELONG Subjective well-being: Sense of belonging to school (WLE)
It is worth noting that certain attributes in the PISA 2018 data set were derived using transformed
weighted likelihood estimates (WLE) techniques [16].
The formula of transformation is as below:
θ' = (θ − μ_OECD) / σ_OECD
where θ' is the final metric of the WLE scores after transformation, θ is the original WLE in
logits, μ_OECD is the mean score based on the equally weighted OECD country samples, and σ_OECD
is the standard deviation of the initial WLEs for the OECD samples.
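Expressed in code, the standardization is a one-liner; the numbers below are made up purely for illustration:

```python
def wle_transform(theta, oecd_mean, oecd_sd):
    """Standardize an original WLE score (in logits) against the OECD
    mean and standard deviation."""
    return (theta - oecd_mean) / oecd_sd

# Hypothetical logit score with hypothetical OECD mean and SD
print(wle_transform(2.0, 0.5, 0.5))  # 3.0
```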
The PISA 2018 applied plausible values for each student reading competency. Plausible values
refer to a possible range of student competencies. Wu [17] noted that "instead of obtaining a point
estimate for θ, a range of possible values for a student's θ, with an associated probability for each
of these values, is estimated. Plausible values are random draws from this (estimated) distribution
for a student's θ. This distribution is referred to as the posterior distribution for a student's θ." (p.
116).
In this study, several attributes were selected that pertained to student engagement with teachers.
These attributes encompassed aspects such as teachers' ability to stimulate reading engagement
(STIMREAD), students' perception of teacher feedback (PERFREED), and students' perception
of their teacher's interest in teaching (TEACHINT). Additionally, attributes related to students'
meta-cognitive skills in reading were considered, including attributes such as understanding and
remembering (UNDREM), summarizing (METASUM), assessing credibility (METASPAM), and
enjoyment of reading (JOYREAD).
Other attributes related to learning included the amount of time spent on test language learning
(LMINS), student adaptivity in test language lessons (ADAPTIVITY), and students' self-concept
of reading, which encompassed their perception of competence (SCREADCOMP) and difficulty
(SCREADDIFF). The study also took into account students' perception of the difficulty of the
PISA 2018 test (PISADIFF).
Regarding students' background information, various attributes were analyzed. The index of
student economic, social, and cultural status (ESCS) in the PISA 2018 data set was computed,
taking into consideration factors such as parents' highest level of education, highest occupational
status (HISEI), and home possessions (e.g., number of books). Other attributes included
household possessions such as home educational resources (HEDRES) and parental emotional
support (EMOSUPS).
To examine the impact of the school environment on student learning, attributes representing
students' perceptions of the school were considered. These attributes encompassed students'
perception of school competitiveness (PERCOMP), school cooperation (PERCOOP), attitude
towards school (ATTLNACT), and the school climate as assessed by the scale measuring
students' sense of belonging to school (BELONG).
3.2. Data Analysis
Two countries' data were extracted from the raw data set and treated as separate individual data
sets. Prior to analysis, these data sets underwent a cleaning process to remove missing and noisy
data points. Each data set was then divided into a 70% training set and a 30% testing set using
random resampling without replacement within clusters. The training data sets were utilized to
construct the RF regression, RE-EM tree, MERF, and HLM models. On the other hand, the
testing data sets were not involved in the model development phase but were used to assess the
performance of the models created during the training phase. In applying RF regression, RE-EM
tree, MERF, and HLM, each clustered data set took into account the fixed effects of the selected
attributes as well as the variability associated with the schools.
3.2.1. Building a RF model
The randomForest package [18] in R (version 3.5.2) was applied to implement the RF algorithm.
The following hyperparameters of RF were applied in the tuning process:
1) Number of trees (ntreeTry). The default setting of number of trees (ntreeTry = 500)
was adopted. In this study, 500 trees were sufficient to produce solid results.
2) The stepFactor is the value by which the number of features sampled when
constructing each tree (mtry) is inflated or deflated. This value was set as 1.5.
3) The improvement value in the minimum out-of-bag (OOB) error (improve) to continue
the search was set as 0.01.
4) Number of features sampled when constructing each tree (mtry). The default value of
mtry was calculated using the formula, mtry = number of attributes / 3. The starting value of mtry
follows mtry = default value / stepFactor. The ending value of mtry follows mtry = default value
* stepFactor. Therefore, we used tuneRF function to confirm the best value of mtry based on the
OOB error. In both the Kazakhstan and USA data sets, the tuning process showed that mtry = 7
was the optimal value.
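The search range implied by these settings is simple arithmetic; the sketch below assumes integer rounding, which may differ in detail from what tuneRF does internally:

```python
def mtry_search_range(n_attributes, step_factor=1.5):
    """Default mtry for regression is n_attributes / 3; the tuning search
    runs from default / step_factor up to default * step_factor."""
    default = max(1, n_attributes // 3)
    start = max(1, round(default / step_factor))
    end = round(default * step_factor)
    return start, default, end

# 31 attributes, as in this study
print(mtry_search_range(31))  # (7, 10, 15)
```

With 31 attributes the search runs over roughly 7 to 15 features per split, and the tuning in both data sets settled at the lower end of that range.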
3.2.2. Building a RE-EM Model
The REEMtree package [19] in R (version 3.5.2) was applied in the analyses. In the RE-EM tree
analyses, 10-fold cross validation was applied when building the models, and complexity
parameter (cp) was set as 0.01 for pruning the trees in order to select the optimal tree size based
on the lowest cross validation error.
3.2.3. Building a MERF Model
The merf package in Python (version 3.8) was used to run the MERF regression. In this study, we
set 300 trees generated in the random forest and 50 as the maximum number of iterations until
convergence for both sampling data sets.
3.2.4. Applying HLM
The HLM method was conducted in R (version 3.5.2) using the package lme4 [20]. The adjusted
and conditional Intraclass Correlation Coefficients (ICC) were first computed for each data set to
estimate the variance explained by the school clustered structure. A random intercept model was
employed for this study.
3.3. Evaluation Criteria
Once the RF regression, RE-EM tree, MERF, and HLM models were constructed, the testing data
sets were employed to assess the performance of these models. Various evaluation metrics were
utilized to measure the disparities between the predicted values and the actual values, including
the mean square error (MSE), mean absolute error (MAE), mean absolute percent error (MAPE),
and Accuracy (calculated as 100% minus MAPE). These metrics have been widely employed in
previous research studies to evaluate model performance (e.g., [21]). Below are the formulas of
MSE, MAE, and MAPE:
MSE = (1/n) Σ_{i=1..n} (y_i − ŷ_i)²
MAE = (1/n) Σ_{i=1..n} |y_i − ŷ_i|
MAPE = (1/n) Σ_{i=1..n} |(y_i − ŷ_i) / y_i|
where n is the sample size, y_i is the actual value, and ŷ_i is the predicted value. Smaller values of
MSE, MAE, and MAPE indicate smaller discrepancies between the estimated model and the
actual data, indicating better model performance.
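The metrics can be implemented directly (MAPE kept as a proportion, matching the scale used in Tables 2 and 3, so Accuracy = (1 − MAPE) × 100%); the values below are illustrative only:

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # reported as a proportion, not a percent
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def accuracy(actual, predicted):
    # Accuracy = 100% minus MAPE (MAPE on the percent scale)
    return 100.0 * (1.0 - mape(actual, predicted))

# Illustrative values only
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
print(round(mape(y_true, y_pred), 2))      # 0.05
print(round(accuracy(y_true, y_pred), 2))  # 95.0
```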
4. RESULTS
Based on the findings, the baseline models revealed intraclass correlations of 0.387 for
Kazakhstan and 0.15 for the United States. This indicates that 38.7% of the variation in student
reading achievement in Kazakhstan can be attributed to school effects, while for the United
States, the school effects account for 15% of the variation in student reading scores.
For the United States dataset, the random intercept model identified seven significant ICT-related
attributes (HOMESCH, INTICT, AUTICT, SOCIAICT, ICTCLASS, ICTHOME, and ICTSCH)
and three significant teacher-related attributes (PERFEED, STIMREAD, and TEACHINT) that
influenced student reading achievement. Significant impacts on student reading were also
observed for student reading-related attributes (UNDREM, METASUM, METASPAM,
SCREADCOMP, and JOYREAD), as well as other attributes such as EMOSUPS, HEDRES,
ESCS, PISADIFF, PERCOOP, and BELONG. The overall HLM model achieved an accuracy of
88.22% for the United States.
In contrast, the HLM model for Kazakhstan yielded different significant attributes. Attributes
such as ENTUSE, USESCH, COMPICT, and ICTRES significantly influenced student reading
scores in Kazakhstan, while HOMESCH, AUTICT, and ICTSCH were found to be insignificant.
Other significant attributes for predicting Kazakhstan students' reading performance included
LMINS, ADAPTIVITY, and SCREADDIFF, which were not significant in the United States
dataset. ESCS and BELONG were found to be insignificant for Kazakhstan students' reading
performance. Overall, the HLM model for Kazakhstan achieved an accuracy of 89.8%.
Regarding the RF models, they explained 49.43% of the variance in the United States dataset and
53.17% of the variance in the Kazakhstan dataset. The top five important attributes in the RF
model for the United States were METASPAM, PISADIFF, ESCS, JOYREAD, and METASUM.
In the Kazakhstan dataset, the most important attributes were METASUM, UNDREM, PISADIFF,
METASPAM, and SCREADDIFF. The accuracy of the RF models was 92.61% for the United
States and 93.72% for Kazakhstan.
Comparatively, the RE-EM tree models achieved lower accuracies, with 86.72% for the United
States and 89.03% for Kazakhstan. The RE-EM tree structures, as shown in Figure 1 and Figure
2, were simpler for the United States dataset compared to the Kazakhstan dataset. METASPAM,
PISADIFF, and METASUM were significant attributes contributing to the modeling structures
for both datasets.
Figure 1. The United States RE-EM Tree Model Result. It shows the significant attributes and their
thresholds. Those attributes are METASPAM, PISADIFF, METASUM, UNDREM, ESCS.
Figure 2. The Kazakhstan RE-EM Tree Model Result. It shows the significant attributes and their
thresholds. Those attributes are METASUM, PISADIFF, SCREADDIFF, INTICT, USESCH,
METASPAM, ICTCLASS.
The MERF models performed the best among the different methods for both datasets, achieving
accuracies of 93.16% for the United States and 94.38% for Kazakhstan. Other evaluation metrics
also indicated that the MERF models outperformed the other methods (see Table 2 and Table 3).
Table 2. The Evaluation Metrics Result of Each Model for the United States Data
MSE MAE MAPE ACCURACY
RF 2371.006 34.6963 0.0739 92.61%
RE-EM Tree 6238.66 62.8526 0.1328 86.72%
MERF 2207.5367 20.2245 0.0684 93.16%
HLM 4956.902 56.0686 0.1178 88.22%
Table 3. The Evaluation Metrics Result of Each Model for the Kazakhstan Data
MSE MAE MAPE ACCURACY
RF 1295.416 25.6777 0.0628 93.72%
RE-EM Tree 3227.529 45.0954 0.1097 89.03%
MERF 1143.1682 14.6682 0.0562 94.38%
HLM 2837.556 42.138 0.102 89.8%
Figures 3 and 4 further illustrate the influence of METASPAM, PISADIFF, and METASUM on
students' reading performance, consistent with the results from the RF models. Moreover, the
MERF models slightly improved accuracy compared to the RF models in both datasets.
Figure 3. The Importance of Attributes in MERF Model for the United States Data.
Figure 4. The Importance of Attributes in MERF Model for the Kazakhstan Data.
5. DISCUSSION
Among the methods applied, MERF proved to be the most accurate for both the United States and
Kazakhstan datasets. MERF combines the advantages of the RF method, such as reducing
overfitting, being less sensitive to outliers, easy parameter setting, and automatic variable
importance generation. It is particularly suitable for clustering data as it considers both fixed and
random effects of variables. The accurate predictions generated by MERF, using a bagging
scheme, are valuable for predicting students' learning outcomes. A previous study by Pellagatti et
al. [22] successfully applied a similar method called generalized mixed-effects Random Forest
(GMERF) for predicting university student dropout.
However, MERF, like RF, has a major drawback in its "black box" nature, making it challenging
to interpret the relationships between predictor and response variables. The ensemble tree
structures hinder the interpretation of each tree, making it difficult to discern the exact directions
and magnitudes of variables' impacts, although variable importance information is available. In
this regard, the CART-based RE-EM tree method provides more interpretability of the results.
RE-EM tree combines the advantages of both regression tree and linear mixed-effects regression
algorithms. It is robust to outliers, as the tree-splitting process can isolate outliers in individual
tree nodes. Additionally, RE-EM tree does not require preselected variables in high-dimensional
datasets, providing flexibility in capturing data patterns. However, the method may generate
unstable decision trees due to different splitting approaches adopted by the tree structure.
When comparing data mining methods with HLM in educational clustering data settings, data
mining methods like MERF and RE-EM tree perform better for high-dimensional data, as they do
not require specifying a functional form and can handle missing data values more effectively. The
choice between MERF and RE-EM tree depends on the research study's objectives or
applications. For instance, when developing an early alert system for identifying student dropouts
or predicting course grades, MERF or GMERF can yield accurate predictions. These methods
may also have great potential for use in other technologies in the future, such as intelligent
tutoring systems, educational games, and recommender systems. On the other hand, when the
main objective is to examine relationships among variables in big data for education, collected
from technology systems or multiple sources, RE-EM tree may be more appropriate considering
its interpretability.
Additionally, HLM remains a useful method for educational clustering data, especially when the
data is not high-dimensional and does not have significant issues with outliers or missing values.
For example, Xu et al. [23] applied HLM to investigate the relationship between students' ICT
usage and learning performance in mathematics, science, and reading. Hew et al. [24] used HLM
to predict student satisfaction with massive open online courses. Our study results demonstrated
the advantage of applying HLM, which even showed slightly higher accuracy than the RE-EM
tree model.
6. CONCLUSION
This study offers a comprehensive comparison of four statistical methods, namely RF, RE-EM
tree, MERF, and HLM, in analyzing clustered educational data. The findings shed light on the
strengths and limitations of each method and provide valuable guidance for researchers in the
education field. Specifically, the study highlights the potential benefits of utilizing mixed-effects
data mining methods like RE-EM tree and MERF to enhance model accuracy when dealing with
clustered data structures. Researchers can leverage these insights to make informed decisions
regarding the selection and application of statistical methods in their own studies.
One limitation of this study is its exclusive focus on educational data, specifically the PISA 2018
dataset. Future studies should expand their scope by testing these statistical methods on diverse
datasets from other fields to validate the findings. Additionally, there is a need for further
development of the algorithms to address their limitations in terms of interpretability. Improving
the transparency and understanding of the models is crucial for their broader application and
practical utility.
REFERENCES
[1] X. Hu, Y. Gong, C. Lai, & F. K. Leung, “The relationship between ICT and student literacy in
mathematics, reading, and science across 44 countries: A multilevel analysis,” Computers &
Education, 125, 1-13, 2018.
[2] S. Park & W. Weng, “The relationship between ICT-related factors and student academic
achievement and the moderating effect of country economic index across 39 countries," Educational
Technology & Society, 23(3), 1-15, 2020.
[3] L. Breiman, J. H. Friedman, R. A .Olshen, & C. J. Stone, Classification and Regression Trees.
Wadsworth and Brooks/Cole: Monterey, CA, USA, 1984.
[4] T. Hastie, R. Tibshirani, & J. Friedman, "Unsupervised learning," in The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, 485-585, 2009.
[5] L. Breiman, “Bagging predictors,” Machine learning, 24(2), 123-140, 1996.
[6] A. Sandoval, C. Gonzalez, R. Alarcon, K. Pichara, & M. Montenegro, “Centralized student
performance prediction in large courses based on low-cost variables in an institutional context,” The
Internet and Higher Education, 37, 76-89, 2018.
[7] R. J. Sela & J. S. Simonoff, “RE-EM trees: a data mining approach for longitudinal and clustered
data,” Machine learning, 86(2), 169-207, 2012.
[8] A. Hajjem, F. Bellavance, & D. Larocque, “Mixed-effects random forest for clustered data,” Journal
of Statistical Computation and Simulation, 84(6), 1313-1328, 2014.
[9] L. Breiman, “Random forests,” Machine learning, 45(1), 5-32, 2001.
[10] M. Fernández-Delgado, M. Mucientes, B. Vázquez-Barreiros, & M. Lama, “Learning analytics for
the prediction of the educational objectives achievement,” in IEEE Frontiers in Education
Conference (FIE) Proceedings, Oct. 2014, pp. 1-4.
[11] X. Chen & H. Ishwaran, “Random forests for genomic data analysis,” Genomics, 99(6), 323-329,
2012.
[12] J. R. Winitzky-Stephens, & J. Pickavance, “Open educational resources and student course
outcomes: A multilevel analysis,” International Review of Research in Open and Distributed
Learning, 18(4), 35-49, 2017.
[13] H. Woltman, A. Feldstain, J. C. MacKay, & M. Rocchi, “An introduction to hierarchical linear
modelling,” Tutorials in quantitative methods for psychology, 8(1), 52-69, 2012.
[14] A. Hajjem, F. Bellavance, & D. Larocque, “Mixed effects regression trees for clustered
data,” Statistics & probability letters, 81(4), 451-459, 2011.
[15] A. Hajjem, D. Larocque, & F. Bellavance, “Generalized mixed effects regression trees,” Statistics &
Probability Letters, 126, 114-118, 2017.
[16] T. A. Warm, "Weighted likelihood estimation of ability in item response theory,"
Psychometrika, 54(3), 427-450, 1989.
[17] M. Wu, “The role of plausible values in large-scale surveys,” Studies in Educational
Evaluation, 31(2-3), 114-128, 2005.
[18] A. Liaw, & M. Wiener, “Classification and regression by randomForest,” R news, 2(3), 18-22, 2002.
[19] R. J. Sela, J. S. Simonoff, & W. Jing, "Package 'REEMtree': Regression Trees with Random
Effects for Longitudinal (Panel) Data," R package, 2021.
[20] D. Bates, M. Maechler, B. Bolker, & S. Walker, "Fitting linear mixed-effects models using lme4,"
Journal of Statistical Software, 67(1), 1-48, 2015.
[21] A. De Myttenaere, B. Golden, B. Le Grand, & F. Rossi, "Mean absolute percentage error for regression
models,” Neurocomputing, 192, 38-48, 2016.
[22] M. Pellagatti, C. Masci, F. Ieva, & A. M. Paganoni, "Generalized mixed‐effects random
forest: A flexible approach to predict university student dropout,” Statistical Analysis and Data
Mining: The ASA Data Science Journal, 14(3), 241-257, 2021.
[23] X. Hu, Y. Gong, C. Lai, & F. K. Leung, “The relationship between ICT and student literacy in
mathematics, reading, and science across 44 countries: A multilevel analysis,” Computers &
Education, 125, 1-13, 2018.
[24] K. F. Hew, X. Hu, C. Qiao, & Y. Tang, “What predicts student satisfaction with MOOCs: A
gradient boosting trees supervised machine learning and sentiment analysis approach,” Computers &
Education, 145, 103724, 2020.
[25] R. Jindal & M. D. Borah, “A survey on educational data mining and research trends,” International
Journal of Database Management Systems, 5(3), 53, 2013.
[26] A. Hershkovitz & R. Nachmias, “Online persistence in higher education web-supported courses,”
The Internet and Higher Education, 14(2), 98-106, 2011.
[27] R. Asif, A. Merceron, S. A. Ali, & N. G. Haider, "Analyzing undergraduate students' performance
using educational data mining,” Computers & Education, 113, 177-194, 2017.
[28] J. L. Hung, B. E. Shelton, J. Yang, & X. Du, “Improving predictive modeling for at-risk student
identification: A multistage approach,” IEEE Transactions on Learning Technologies, 12(2), 148-
157, 2019.
[29] W. Weng & W. Luo, “Exploring the influence of students’ ICT use on mathematics and science
moderated by school-related factors,” Journal of Computers in Mathematics and Science Teaching,
41(2), 163-185, 2022.
AUTHORS
Wenting Weng is an instructional designer at Johns Hopkins University. She pursued her Ph.D. from
Texas A&M University. Her research interests include educational data mining, learning analytics, and
emerging educational technology, such as game-based learning and artificial intelligence in education.
Wen Luo is a Professor at the Department of Educational Psychology, Texas A&M University. Her
research interests include growth modeling of longitudinal data, modeling of data with complex multilevel
structures, and quantitative methods for teacher and program evaluations.