Data Mining on Educational Domain

Nikhil Rajadhyax
Department of Information Technology
Shree Rayeshwar Institute of Engineering & Information Technology
Shiroda, Goa, India
e-mail: nikhil.rajadhyax@gmail.com

Prof. Rudresh Shirwaikar
Department of Information Technology
Shree Rayeshwar Institute of Engineering & Information Technology
Shiroda, Goa, India
e-mail: rudreshshirwaikar@gmail.com

Abstract – Educational data mining (EDM) is defined as the area of scientific inquiry centered around the development of methods for making discoveries within the unique kinds of data that come from educational settings, and using those methods to better understand students and the settings in which they learn.
Data mining enables organizations to use their current reporting capabilities to uncover and understand hidden patterns in vast databases. As a result of this insight, institutions are able to allocate resources and staff more effectively.
In this paper, we present a real-world experiment conducted at Shree Rayeshwar Institute of Engineering and Information Technology (SRIEIT) in Goa, India. We identified the related subjects in an undergraduate syllabus and the strength of their relationships. We also classified students into categories such as good, average and poor according to the marks they scored, by building a decision tree that predicts the performance of students and thereby helps weaker students to improve academically.
We also found clusters of students, which help in analyzing students' performance and in improving the teaching of a particular subject.

Keywords – Data Mining, Education Domain, India, Association Rule Mining, Pearson Correlation Coefficient.

I. INTRODUCTION
The advent of information technology in various fields has led to large volumes of data being stored in various formats such as records, files, documents, images, sound, videos, scientific data and many new data formats. The data collected from different applications require a proper method of extracting knowledge from large repositories for better decision making. Knowledge discovery in databases (KDD), often called data mining, aims at the discovery of useful information from large collections of data [1]. The main function of data mining is to apply various methods and algorithms in order to discover and extract patterns in stored data.
There is increasing research interest in using data mining in education. This emerging field, called Data Mining on Educational Domain, is concerned with developing methods that discover knowledge from data originating from educational environments. Educational Data Mining uses many techniques such as Decision Trees, Neural Networks, Naïve Bayes, K-Nearest Neighbor, and many others.
Using these techniques, many kinds of knowledge can be discovered, such as association rules, classifications and clusterings. The discovered knowledge can be used for prediction regarding the overall performance of a student. The main objective of this paper is to use data mining methodologies to study students' academic performance.

II. METHODOLOGY

A. Background
SRIEIT's undergraduate degree programme (B.E.) consists of three fields of specialization (i.e. Information Technology, Electronics and Telecommunication, and Computer Engineering). Each field of specialization offers students many subjects over eight semesters within a period of four years. A student belongs to a batch and the batch is offered a number of subjects.
The performance of students in the different courses offered provides a measure of the students' ability to meet the lecturer's and the institution's expectations. The overall marks obtained by students in the different subjects are used in our experiment to find related subjects. The main objective is to determine the relationships that exist between the different courses offered, as this is required for optimizing the organization of courses in the syllabi. This problem is solved in two steps:
1. Identify the possibly related subjects.
2. Determine the strength of their relationships and determine the strongly related subjects.
In the first step, we used association rule mining [1] to identify possibly related two-subject combinations in the syllabi, which also reduces our search space. In the second step, we applied the Pearson Correlation Coefficient [2] to determine the strength of the relationships of the subject combinations identified in the first step.
Our experiment is based on the following hypothesis:
"Typically, a student obtains similar marks for related subjects".
This assumption is justified by the fact that similar marks obtained by a student for different subjects imply that the student is meeting the expectations of the courses in a similar manner. This fact implies an inherent relationship between the courses.
For this experiment, we selected students of the 2009-2010 batches in semesters 3-6, with 60 students in each of the three fields of specialization. The first step, finding possibly related subjects, requires considering 2-subject combinations. To do this we applied Association Rule Mining [1]. Association Rule Mining and its application are discussed in sections B and C.

B. Association Rule Mining
Firstly, 2-subject combinations were obtained using the Apriori algorithm on a database of the form shown in TABLE II. Then
association rules were applied to the output of the Apriori algorithm.
Association rules are an important class of regularities that exist in databases. The classic application of association rules is market basket analysis [1], which analyzes how items purchased by customers are associated.
TABLE I illustrates a sales database with the items purchased in each transaction. An example association rule is as follows:

    pen => ink [support 75%, confidence 75%]

This rule says that 75% of customers buy pen and ink together, and that those who buy a pen also buy ink 75% of the time.

TABLE I. INSTANCE OF A SALES DATABASE

    Transaction ID    Items
    111               pen, ink, milk
    112               pen, ink
    113               pen, ink, juice
    114               pen, milk

Formally, the association rule mining model can be stated as follows. Let I = {i_1, i_2, ..., i_m} be a set of items. Let D be a set of transactions (the database), where each transaction d is a set of items such that d ⊆ I. An association rule is an implication of the form X -> Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X -> Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule has support s in D if s% of the transactions in D contain X ∪ Y.
Given a set of transactions D (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than or equal to the user-specified minimum support (called minsup) and minimum confidence (called minconf).

C. Application of Association Rule Mining
At SRIEIT, a student earns either a "pass" grade (that is, the student meets the minimum requirements for successful completion of the subject) or a "failure" grade (that is, the student fails to meet the minimum requirements for successful completion of the subject) for every subject the student followed. We consider a transaction table consisting of students with their passed subjects (see TABLE II).
Our goal is to find the relationship between two subjects (i.e. subject_i and subject_j where i ≠ j) using association rule mining. That is, we find association rules of the following format that meet certain selection criteria:

    subject_i -> subject_j, where i ≠ j

TABLE II. DATABASE INSTANCE OF STUDENT

    Student    Passed Subjects
    S1         subject1, subject2, subject3
    S2         subject1, subject2, subject4
    S3         subject1, subject5

In creating the database, we considered only passed subjects, because no subject had a 100% failure rate. To identify all possibly related subjects (not necessarily subjects with high pass rates), we ignored the support and considered only the confidence measure. The confidence was sufficient to determine the possibly related subjects (for instance, in the above rule, confidence gives the percentage of students who passed subject_i and also passed subject_j). We took the average pass rate as the minimum confidence (minconf).

D. Pearson Correlation Coefficient
The Pearson Correlation Coefficient (r) measures the strength of the linear relationship between two continuous variables. We computed r and selected a threshold value (i.e. γ) to determine strong relationships.
The Pearson Correlation Coefficient (r) is computed as follows:

\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\, S_X S_Y} \]

where:
- X, Y are two continuous variables observed as n pairs (X_i, Y_i),
- S_X and S_Y are the standard deviations of X and Y, and
- \bar{X} and \bar{Y} are the mean values of X and Y.
The value of r is such that -1 ≤ r ≤ +1. The + and - signs indicate positive and negative linear correlations, respectively. If there is no linear correlation or only a weak linear correlation, r is close to 0. A correlation greater than 0.5 is generally described as strong, whereas a correlation less than 0.5 is generally described as weak.

E. Application of Pearson Correlation Coefficient
After experimentation we selected 0.5 as the threshold value (i.e. γ = 0.5), a suitable estimate for determining a strong relationship. A subject combination (say subject_i and subject_j where i ≠ j) is considered to have a strong relationship when r ≥ 0.5 for the subject_i, subject_j permutation.

F. Classification Algorithm
Here we use a decision tree to classify the data; the tree is obtained with the ID3 algorithm. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. A decision tree starts with a root node. From this node, each node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible decision scenario and its outcome. We provide the collected data to the algorithm to create a model called a classifier. Once the classifier is built, we can use it to classify any student and predict that student's performance.
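The two-step procedure of sections B to E (filter 2-subject rules by confidence, then measure the strength of the surviving pairs with r) can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions: the transactions are the rows printed in TABLE I and TABLE II, while the marks in `MARKS` and the value of `MINCONF` are hypothetical placeholders, and all function names are our own rather than part of the experiment. On the four transactions of TABLE I, the rule pen => ink evaluates to a support of 0.75 and a confidence of 0.75.

```python
from itertools import permutations
from math import sqrt

# TABLE I: the four sales transactions (IDs 111-114).
TABLE_I = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"pen", "ink", "juice"},
    {"pen", "milk"},
]

# TABLE II: passed subjects of students S1-S3.
PASSED = [
    {"subject1", "subject2", "subject3"},
    {"subject1", "subject2", "subject4"},
    {"subject1", "subject5"},
]


def support(db, items):
    """Fraction of transactions in db that contain every item in items."""
    items = set(items)
    return sum(1 for t in db if items <= t) / len(db)


def confidence(db, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return support(db, {lhs, rhs}) / support(db, {lhs})


def candidate_rules(db, minconf):
    """All ordered pairs (a, b), a != b, whose rule a -> b reaches minconf."""
    items = sorted(set().union(*db))
    return [(a, b) for a, b in permutations(items, 2)
            if confidence(db, a, b) >= minconf]


def pearson_r(xs, ys):
    """Pearson r using sample standard deviations (divisor n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


# Hypothetical values, for illustration only (not data from the experiment).
MINCONF = 0.6   # stand-in for the average pass rate
GAMMA = 0.5     # threshold chosen in section E
MARKS = {       # marks of five imaginary students per subject
    "subject1": [55, 62, 70, 48, 81],
    "subject2": [58, 60, 74, 50, 79],
    "subject3": [70, 45, 66, 72, 51],
}

print("pen => ink:",
      support(TABLE_I, {"pen", "ink"}), confidence(TABLE_I, "pen", "ink"))

# Step 1: candidate pairs by confidence. Step 2: keep those with r >= GAMMA.
for a, b in candidate_rules(PASSED, MINCONF):
    if a in MARKS and b in MARKS:
        r = pearson_r(MARKS[a], MARKS[b])
        label = "strong" if r >= GAMMA else "weak"
        print(f"{a} -> {b}: r = {r:.3f} ({label})")
```

Support is deliberately not used as a filter in `candidate_rules`, mirroring the choice described in section C of considering only the confidence measure.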
ID3 is a simple decision tree learning algorithm. The basic idea of ID3 is to construct the decision tree by a top-down, greedy search through the given sets, testing each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, we introduce a metric: information gain. To find an optimal way to classify a learning set we need some function which provides the most balanced splitting. The information gain metric is such a function. Given a data table that contains attributes and the class of each record, we can measure the homogeneity (or heterogeneity) of the table based on the classes. The index used to measure the degree of impurity is entropy.
The entropy is calculated as follows:

\[ E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \]

where c is the number of classes and p_i is the proportion of examples in S that belong to class i.
The splitting criterion used for the nodes of the tree is information gain: the attribute with the highest information gain is chosen as the best attribute for a particular node. The information gain, Gain(S, A), of an attribute A relative to a collection of examples S is defined as

\[ \mathrm{Gain}(S, A) = E(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, E(S_v) \]

where S_v is the subset of S for which attribute A has value v.
The ID3 algorithm is as follows:
- Create a root node for the tree.
- If all examples are positive, return the single-node tree Root, with label = +.
- If all examples are negative, return the single-node tree Root, with label = -.
- If the set of predicting attributes is empty, return the single-node tree Root, with label = most common value of the target attribute in the examples.
- Otherwise begin
    - A = the attribute that best classifies the examples.
    - Decision tree attribute for Root = A.
    - For each possible value, v_i, of A,
        - Add a new tree branch below Root, corresponding to the test A = v_i.
        - Let Examples(v_i) be the subset of examples that have the value v_i for A.
        - If Examples(v_i) is empty
            - Then below this new branch add a leaf node with label = most common target value in the examples.
            - Else below this new branch add the subtree ID3(Examples(v_i), Target_Attribute, Attributes - {A}).
- End
- Return Root

Here we used the attendance and marks of 60 students from each of the 3 branches (see TABLE IV).

G. Clustering Algorithm
DBSCAN is a density-based spatial clustering algorithm. By density-based we mean that clusters are defined as connected regions where data points are dense. If the density falls below a given threshold, the data are regarded as noise.
DBSCAN requires three inputs:
1. The data source.
2. A parameter, MinPts, which is the minimum number of points needed to define a cluster.
3. A distance parameter, Eps: if there are at least MinPts points within distance Eps of a point, that point is a core point of a cluster.

Core Object: an object with at least MinPts objects within its Eps-neighborhood.
Border Object: an object that lies on the border of a cluster.
N_Eps(p): {q belongs to D | dist(p, q) <= Eps}
Directly Density-Reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p belongs to N_Eps(q) and |N_Eps(q)| >= MinPts.
Density-Reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i.
Density-Connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

The algorithm starts with an arbitrary point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster.
If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points found within the ε-neighborhood are added, as is their own ε-neighborhood when they are themselves core points. This process continues until the cluster is completely found. Then a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.
Here too we used the attendance and marks of 60 students from each of the 3 branches (see TABLE III).

III. RESULT

A. Observations
The output of association rule mining, followed by the Pearson correlation coefficient, provided us with the possibly related 2-subject combinations and the strength of their relationships. The subjects reviewed and the strongly related subjects are listed in Appendix A and Appendix E respectively.
The results obtained through clustering yielded important knowledge and insights that can be used for improving the performance of students. The clustering produced different clusters, for example cluster1: students who attended classes regularly and scored high marks, and cluster2: students who attended regularly but scored low marks. This result helps to judge whether the marks scored in a subject actually depend on attendance or not. It also helps to identify weak students in a particular subject. This will help the teachers to improve the performance of the
students who are weak in those particular subjects (see Appendix F).

TABLE III. INSTANCE OF A DATABASE FOR CLUSTERING

    Stud_id    attendance    marks
    1          93            20
    2          100           41
    3          100           41
    4          100           25
    5          87            46
    .          .             .
    .          .             .

The result obtained from classification is a classifier in the form of a decision tree, which classifies an unseen student in order to predict the performance of that student. The prediction will help the teachers to pay attention to poor and average students in order to enhance their academic capabilities.
The results of clustering and classification are given in Appendix F and Appendix G respectively.

TABLE IV. INSTANCE OF A DATABASE FOR CLASSIFICATION

    Stud_id    Dept    Attendance    Marks    Performance
    1          ETC     Y             310      AVERAGE
    2          IT      N             450      GOOD
    3          COMP    Y             500      GOOD
    4          IT      Y             230      POOR
    .          .       .             .        .
    .          .       .             .        .

After applying the classification algorithm we get a decision tree whose structure depends on the information gain (see Appendix G).

B. Significance of Results
The results obtained through our experiment yielded important knowledge and insights that can be used for improving the quality of the educational programmes. Some of these insights are outlined below:

• Preconceived notion of a relationship between mathematics subjects and programming subjects:
There existed a general notion that mathematics subjects and programming subjects are correlated. However, our experiments showed that there is no strong relationship between these subjects. That is, passing or failing a mathematics subject does not determine the ability to pass or fail a programming subject, and vice versa.

• Assist in determining pre-requisite subjects:
When determining prerequisites it is advantageous to know whether a strong relationship exists between subjects. A student may fail a particular subject (say subjectA) and proceed to take further subjects (say subjectB and subjectC). However, the student may not have acquired the necessary knowledge and skills (i.e. pre-requisite knowledge) for passing subjectB and subjectC. Hence, the student may fail with a high probability and waste the student's and the institution's resources. If a large percentage of the students who fail a pre-requisite subject (i.e. subjectA) also fail the subsequent subject (i.e. subjectB), then these subjects are strongly related (subjectA is strongly related to subjectB), and this is captured in our experimental results.

• Project Course:
At the end of the 6th semester, SRIEIT focuses on students completing a project as a team. The main objective of the course is to apply knowledge gained from other subjects to solve a real-world problem. Our experiment will therefore help students to select project ideas that are based on a present subject and related to past subjects.

REFERENCES
[1] Jiawei Han and Micheline Kamber, "Data Mining - Concepts and Techniques", Second Edition.
[2] W.M.R. Tissera, R.I. Athauda, H.C. Fernando, "Discovery of strongly related subjects in the undergraduate syllabi using data mining".
[3] Agrawal, R., and Srikant, R., "Fast algorithms for mining association rules", VLDB-94, 1994.
[4] Agrawal, R., Imielinski, T., Swami, A., "Mining association rules between sets of items in large databases", SIGMOD-1993, 1993.
[5] Rosset, S., Murad, U., Neumann, E., Idan, Y., and Pinkas, G., "Discovery of Fraud Rules for Telecommunications - Challenges and Solutions", KDD-99, 1999.
[6] Agrawal, R., Imielinski, T., Swami, A., "Mining association rules between sets of items in large databases", SIGMOD-1993, 1993.
[7] Bayardo, R., Agrawal, R., "Mining the most interesting rules", KDD-99, 1999.

Appendix

A. List of subject id and their titles

    id        Subject
    IT31      Applied Mathematics III
    IT32      Numerical Methods
    IT33      Analog And Digital Communication
    IT34      Computer Organization And Architecture
    IT35      Data Structures Using C
    IT36      System Analysis And Design
    IT41      Discrete Mathematical Structures
    IT42      Signals And Systems
    IT43      Computer Hardware And Troubleshooting
    IT44      Microprocessors And Interfaces
    IT45      Design And Analysis Of Algorithms
    IT46      Object Oriented Programming System
    IT51      Introduction To Data Communication
    IT52      Digital Signal Processing
    IT53      Software Engineering
    IT54      Intelligent Agents
    IT55      Operating Systems
    IT56      Database Management System
    IT61      Entrepreneurship Development
    IT62      Theory Of Computation
    IT63      Computer Networks
    IT64      Computer Graphics
    IT65      Web Technology
    IT66      Software Testing And Quality Assurance
    ETC31     Applied Mathematics III
    ETC32     Digital System Design
    ETC33     Network Analysis And Synthesis
    ETC34     Electronic Devices And Circuits
    ETC35     Managerial Economics
    ETC36     Computer Oriented Numerical Techniques
    ETC41     Applied Mathematics IV
    ETC42     Signals And Systems
    ETC43     Electrical Technology
    ETC44     Electromagnetic Field And Waves
    ETC45     Linear Integrated Circuits
    ETC46     Data Structures Using C++
    ETC51     Probability Theory And Random Processes
    ETC52     Control System Engineering
    ETC53     Communication Engineering 1
    ETC54     Microprocessors
    ETC55     Digital Signal Processing
    ETC56     Transmission Lines And Waveguides
    ETC61     Communication Engineering 2
    ETC62     Peripheral Devices And Interfacing
    ETC63     Power Electronics
    ETC64     Antenna And Wave Propagation
    ETC65     Electronic Instrumentation
    ETC66     VLSI Technologies And Design
    COMP31    Applied Mathematics III
    COMP32    Basics Of C++
    COMP33    Principles Of Programming Languages
    COMP34    Computer Oriented Numerical Techniques
    COMP35    Logic Design
    COMP36    Integrated Electronics
    COMP41    Discrete Mathematical Structures
    COMP42    Data Structures
    COMP43    Computer Organization
    COMP44    Electronic Measurements
    COMP45    System Analysis And Design
    COMP46    Object Oriented Programming & Design Using C++
    COMP51    Organizational Behavior And Cyber Law
    COMP52    Automata Language And Computation
    COMP53    Microprocessors And Microcontrollers
    COMP54    Computer Hardware Design
    COMP55    Database Management System
    COMP56    Operating System
    COMP61    Modern Algorithm Design Foundation
    COMP62    Object Oriented Software Engineering
    COMP63    Artificial Intelligence
    COMP64    Computer Graphics
    COMP65    Device Interface And PC Maintenance
    COMP66    Data Communications

B. Subjects offered in the IT stream for semesters 3-6

    3rd Semester    4th Semester    5th Semester    6th Semester
    IT31            IT41            IT51            IT61
    IT32            IT42            IT52            IT62
    IT33            IT43            IT53            IT63
    IT34            IT44            IT54            IT64
    IT35            IT45            IT55            IT65
    IT36            IT46            IT56            IT66

C. Subjects offered in the ETC stream for semesters 3-6

    3rd Semester    4th Semester    5th Semester    6th Semester
    ETC31           ETC41           ETC51           ETC61
    ETC32           ETC42           ETC52           ETC62
    ETC33           ETC43           ETC53           ETC63
    ETC34           ETC44           ETC54           ETC64
    ETC35           ETC45           ETC55           ETC65
    ETC36           ETC46           ETC56           ETC66

D. Subjects offered in the COMP stream for semesters 3-6

    3rd Semester    4th Semester    5th Semester    6th Semester
    COMP31          COMP41          COMP51          COMP61
    COMP32          COMP42          COMP52          COMP62
    COMP33          COMP43          COMP53          COMP63
    COMP34          COMP44          COMP54          COMP64
    COMP35          COMP45          COMP55          COMP65
    COMP36          COMP46          COMP56          COMP66

E. Strongly related subjects in the respective streams with r ≥ γ (γ = 0.5)

[Figure: IT Stream]
[Figure: COMP Stream]
[Figure: ETC Stream]

F. Clusters obtained in the respective streams

[Figure: IT Stream]
[Figure: Computer Engg. Stream]
[Figure: ETC Stream]

G. Decision Tree obtained as a result of classification

[Figure: decision tree]
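The splitting criterion behind a tree such as the one in Appendix G can be illustrated on the four rows printed in TABLE IV. The Python sketch below is a hedged illustration rather than the code used in the experiment: the function names are our own, only the categorical attributes Dept and Attendance are considered (the continuous Marks column is left out, since ID3 as described in section F tests discrete attribute values), and four rows are far too few to say anything about the real tree.

```python
from collections import Counter
from math import log2

# The four rows printed in TABLE IV (Marks omitted, see the note above).
ROWS = [
    {"Dept": "ETC", "Attendance": "Y", "Performance": "AVERAGE"},
    {"Dept": "IT", "Attendance": "N", "Performance": "GOOD"},
    {"Dept": "COMP", "Attendance": "Y", "Performance": "GOOD"},
    {"Dept": "IT", "Attendance": "Y", "Performance": "POOR"},
]


def entropy(examples, target="Performance"):
    """E(S): impurity of the class labels in examples, in bits."""
    counts = Counter(row[target] for row in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(examples, attribute, target="Performance"):
    """Gain(S, A): entropy of S minus the weighted entropy of its subsets."""
    total = len(examples)
    remainder = 0.0
    for value in {row[attribute] for row in examples}:
        subset = [row for row in examples if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder


# ID3 would place the attribute with the highest gain at the root.
gains = {a: information_gain(ROWS, a) for a in ("Dept", "Attendance")}
root = max(gains, key=gains.get)
print(f"E(S) = {entropy(ROWS):.3f}")
for attribute, gain in gains.items():
    print(f"Gain(S, {attribute}) = {gain:.3f}")
print("root attribute on this sample:", root)
```

On this small sample the labels split as two GOOD, one AVERAGE and one POOR, so E(S) is 1.5 bits, and Dept separates the classes better than Attendance, so it would be chosen as the root.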