Clustering Students Based on Previous Academic Performance

Kartik N. Shah, Srinivasulu Kothuru and S. Vairamuthu
School of Computing Science and Engineering, VIT University

International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 3, May-Jun 2013, pp. 935-939

Abstract
Educational data mining is a popular research area for studying the behavior of students based on their past performance. Since education is a basic need that must be available to everyone, the study of student behavior plays a vital role. By grouping students on the basis of their performance, we can build a strong team to represent an institute or university in competitions. We can also pay extra attention to students with poor performance, through additional lectures and by motivating them to study better. In this paper, we cluster students with similar behavior based on their past academic performance, using the Canberra distance as the similarity measure. We use VIT University data for the performance analysis.

Keywords: Educational Data Mining, Clustering Students, Grouping Method, Student Behavior.

1. Introduction
Educational Data Mining (EDM) is the study of raw data from educational institutes and universities to extract information that can be used by educational software developers, students, teachers, other educational researchers and parents. Machine learning, statistical and data mining (DM) algorithms are all part of EDM and are used to analyze the various types of educational data. EDM is concerned with understanding student behavior and then creating methods, techniques and algorithms that can help improve existing systems so that students learn better. Web education, or e-learning, is the era in which the Internet is used to deliver education to students.
Web education has created a new era of learning and of understanding students' learning patterns. This information is a gold mine for educational data mining. Prediction of students' performance is a very useful part of EDM, and the predictions can be used in many different contexts in universities. We can identify exceptional students for scholarships or stipends, which is an essential part of postgraduate admissions, and we can also identify students who are likely to perform poorly in exams. Such predictions allow university management to take remedial measures at an early stage, which helps the institute or university produce excellent students. Nowadays many universities follow web-based education, and in this case it is useful to classify a student's performance. Data Mining (DM) techniques can be used to achieve these objectives.

A supervised classification algorithm needs a training sample for each class, containing a known collection of data points of the class of interest. Classification is then based on how close a data point is to the training samples, which are representative of the classes of interest to the analyst. Classification methods that rely on training samples are known as supervised classification methods; they are widely used for image classification.

The three steps involved in supervised classification are:
1. Training stage: the analyst identifies regions whose class is already known, and this known information is used to classify the remaining, unknown data. In the case of multiple matches, the best match with a training sample is taken.
2. Classification stage: this is the main stage, in which classification is carried out using the available training data. All unknown instances are labeled based on that training data.
3. 
Output stage: the final stage is concerned with using the data obtained from the classification stage in various applications.

2. Related Work
In the past, students' performance was measured by the number of courses they had registered for and the Grade Point Average (GPA) or Cumulative Grade Point Average (CGPA) they had secured. However, this traditional method did not guarantee that students had achieved the qualifications required for the next stage. Nowadays, a student's performance is evaluated on the basis of marks given by faculty in written examinations. Most universities follow a mark-based approach with relative grading: all students receive marks (between 1 and 100), which are categorized into grades (A, B, C, D, E, F or N), then into nominal scores (1, 2, 3, ..., 10), and finally into linguistic terms such as "Pass" or "Fail". In this study, a weighted sum of assessments (i.e., each exam has a different weight) is used to compute the numerical
score of each student based on the university rules: Quiz (Q), Common Assessment Test (CAT) or Internal (I), Final (F), and Performance Appraisals (P). The total of all of these measures is treated as the final result of the student.

For intelligent tutoring systems, constraint relaxations and sequential pattern mining can be used to acquire knowledge automatically. Students' grades can be predicted using similarity measures such as the Sum of Absolute Differences and the Sum of Squared Differences, and grades can then be assigned in universities that follow a relative grading scheme. To label students' behavior, text replays can be used; they are more precise and faster. There can be many different objectives for classifying students based on their characteristics. One approach is a two-step classification process: the first step studies the accuracy of single classifiers on the data, then the best ones are chosen and combined with weak classifiers in a voting method. In statistical classification, individual results are grouped into similar results and the majority is considered for classification. Some statistical algorithms are linear discriminant analysis, kernel methods and k-nearest neighbors. A decision tree is a set of conditions organized in a hierarchical structure. It is a predictive model in which, to classify an instance, we follow a path in the tree down to a leaf node; reaching a leaf node classifies the instance into that class. A decision tree can be converted into a set of classification rules. The best-known decision tree algorithm is C4.5. Rule induction is an area of machine learning in which IF-THEN production rules are extracted from a set of observations.
In rule induction, a state corresponds to a candidate rule, and operators correspond to the generalization and specialization operations that transform one candidate rule into another. Examples of rule induction algorithms are CN2, the Supervised Inductive Algorithm (SIA), a genetic algorithm using real-valued genes (Corcoran) and a grammar-based genetic programming algorithm (GGP). Neural networks can also be used for rule induction. Examples of neural network algorithms are the multilayer perceptron (with conjugate gradient-based training), radial basis function neural networks (RBFN), incremental RBFN, decremental RBFN, a hybrid Genetic Algorithm Neural Network (GANN) and Neural Network Evolutionary Programming (NNEP).

3. Proposed Framework
Different measures of similarity or distance are convenient for different types of analysis, and many techniques are available to cluster similar types of students. Any grading or classification technique can be extended to clustering: when we classify students with some algorithm, the students who fall into the same class can be assumed to share similar behavior or characteristics, so we can group the students of each class together and form a single cluster based on the criteria used for classification. We will cluster students' records based on their academic performance in previous data. Some of the distance and similarity measures for numerical data are listed below.
We will use one such distance measure to cluster the students based on their academic performance. The measures below are defined between a known reference point KNOWN(p, q, r) and an unknown point UNKNOWN(x_j, y_j, z_j).

Table 1: Similarity and distance measures for numerical data

Euclidean distance:          sqrt((p - x_j)^2 + (q - y_j)^2 + (r - z_j)^2)
Manhattan distance:          |p - x_j| + |q - y_j| + |r - z_j|
Squared Euclidean distance:  (p - x_j)^2 + (q - y_j)^2 + (r - z_j)^2
Bray-Curtis distance:        (|p - x_j| + |q - y_j| + |r - z_j|) / (|p + x_j| + |q + y_j| + |r + z_j|)
Canberra distance:           |p - x_j| / (|p| + |x_j|) + |q - y_j| / (|q| + |y_j|) + |r - z_j| / (|r| + |z_j|)
Chessboard distance:         max(|p - x_j|, |q - y_j|, |r - z_j|)
Cosine distance:             1 - (p*x_j + q*y_j + r*z_j) / (sqrt(p^2 + q^2 + r^2) * sqrt(x_j^2 + y_j^2 + z_j^2))

Table 1 shows similarity and distance measures that are useful for numerical data. Many more measures are available, such as the Normalized Squared Euclidean distance and the Correlation distance. We can choose one of these measures and then apply the classification process to create clusters of similar students. Nowadays every university follows its own evaluation pattern, and our approach can be modified accordingly when calculating similarity. We take the Canberra distance as the similarity measure and cluster the students based on the marks they obtained in CAT I, CAT II and the quizzes. We study this analysis on the records of VIT University students. If we want to apply the same technique for
a different university, we can adjust the implementation details of the technique accordingly. We can also extend it to analyze final-year students based on their previous years. Consider the following details of the students.

Table 2: Students' records

CAT1   CAT2   QUIZ   REG_NO
34     36      8      1
14.3   29      9      2
44.5   47     13      3
40     45     14      4
41     46     13      5
45.5   46     15      6
37.5   43     12      7
28.5   43     10      8
38.5   45     13      9
28.5   40      6     10
40     47     14     11
35     40      9     12
19     30      8     13
43.5   46.5   13     14
42.5   44.5   13     15
36     43     12     16
40     40     14     17
41.5   45     14     18
31.5   40      8     19
42     47     13     20
36     38     13     21
42.5   46     14     22
42     33.5    9     23
30     35     13     24
29     30.5    9     25
20     25      7     26
35     41     14     27
38     40     13     28
41     47     15     29
40.5   46     14     30
47.5   43     15     31
25     40      6     32
0      0       0     33
42     41     12     34
28     29      8     35

After applying the Canberra distance algorithm, we can cluster the data in Table 2 and see which students fall into which category. Every university follows its own evaluation pattern; at VIT University, CAT I and CAT II each carry a weight of 15%, and three quizzes are conducted, each with a weight of 5%. In the records above, the marks of all three quizzes have been added together for simplicity. The marks are then scaled by the weight of each exam, and all details are applied together to the algorithm. We provide reference points for clustering, based on which the students will be grouped. The following groups are used to classify the students.

Table 3: Reference marks for clustering

Group A: CAT I = 45, CAT II = 45, Quiz = 15
Group B: CAT I = 40, CAT II = 40, Quiz = 13
Group C: CAT I = 34, CAT II = 34, Quiz = 9
Group D: CAT I = 28, CAT II = 28, Quiz = 8
Unusual behavior or failure: remaining students

After applying the algorithm, we obtain the following clusters.
Table 4 shows the details of the clusters after applying the algorithm.

Table 4: Output clusters

Group A: 3, 4, 6, 11, 14, 15, 18, 22, 29, 30, 31
Group B: 4, 5, 7, 9, 11, 14, 15, 17, 18, 20, 21, 27, 28, 30, 34
Group C: 1, 12, 23
Group D: 25, 35
Unusual behavior or failure: 2, 8, 10, 13, 16, 19, 24, 26, 28, 32, 33

4. Implementation
We implemented this algorithm using .NET technology and the C# language. It takes all students' records and the reference marks (the criteria for clustering) as input and outputs the registration numbers of the students belonging to each cluster. Figure 1 shows the implementation details.

Figure 1: Implementation

5. Analysis
After applying the algorithm, we can cluster students based on the criteria shown above and map the results in a line chart as follows.
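Before turning to the charts, the clustering step itself can be sketched in a few lines of Python. This is only an illustration of the procedure (the paper's implementation is in C# and is not shown); in particular, the cutoff that routes distant records to the "unusual behavior or failure" group is an assumed value chosen for illustration, since the paper does not state how that group is detected.

```python
# Sketch of the clustering step: each student record (CAT1, CAT2, QUIZ)
# is assigned to the Table 3 reference group with the smallest Canberra
# distance. The cutoff for the "unusual behavior" group is an assumption.

def canberra(a, b):
    """Canberra distance: sum of |a_i - b_i| / (|a_i| + |b_i|) over features."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:          # skip the term when both coordinates are zero
            total += abs(x - y) / denom
    return total

# Reference marks from Table 3: (CAT I, CAT II, Quiz)
REFERENCES = {
    "A": (45.0, 45.0, 15.0),
    "B": (40.0, 40.0, 13.0),
    "C": (34.0, 34.0, 9.0),
    "D": (28.0, 28.0, 8.0),
}

def assign_group(record, cutoff=0.2):
    """Label of the nearest reference group, or 'Unusual' if the record is
    farther than `cutoff` from every reference (cutoff is an assumption)."""
    label, dist = min(
        ((g, canberra(record, ref)) for g, ref in REFERENCES.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= cutoff else "Unusual"

if __name__ == "__main__":
    # A few records from Table 2: REG_NO -> (CAT1, CAT2, QUIZ)
    students = {3: (44.5, 47, 13), 13: (19, 30, 8), 23: (42, 33.5, 9)}
    for reg_no, marks in students.items():
        print(reg_no, assign_group(marks))   # 3 -> A, 13 -> Unusual, 23 -> C
```

Note that each Canberra term |x - y| / (|x| + |y|) is unchanged when both coordinates are multiplied by the same positive factor, so applying the per-exam weights (15% for each CAT, 5% per quiz) before computing the distance does not alter the per-feature contributions.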
For more precision, we can normalize the marks to obtain comparable points. The resulting line chart shows that, even when the individual weight of each exam is taken into account, the data can still be clustered properly. From the line charts it is clear that the students of a group follow the same trend, so the Canberra distance algorithm can group students with similar academic performance in previous exams.

[Line charts of CAT I, CAT II and QUIZ marks for the students of Group A (REG_3, REG_4, REG_6, REG_11)]

6. Future Work
In this paper, we have used marks as the measure for clustering students. There are many other attributes by which students could be clustered, such as practical knowledge, class behavior, talent in a particular field, or family background. We have not considered these measures here, as it is very difficult to assess each student on them, but the clustering technique can be enhanced with the above criteria.

7. Conclusion
From the above analysis, we conclude that students can be clustered using the Canberra distance as the similarity and distance measure. In this study, we used VIT University data as a reference to check the validity of the results. The algorithm can be extended to any university that evaluates students numerically (i.e., based on marks). Most universities follow a marks pattern in which marks are later converted into grades, so this analysis can be applied to most universities' student records.

8. Acknowledgement
The authors would like to thank the School of Computer Science and Engineering, VIT University, for giving them the opportunity to carry out this project and also for providing them with the requisite resources and infrastructure for carrying out the research.

References
Kartik N.
Shah, Shantanu Santoki, Himanshu Ghetia, L. Ramanthan, "Mining on Student's Records to Predict the Behavior of Students," IEEE International Conference on Research and Development Prospects on Engineering and Technology, March 29-30, 2013, Vol. 4, pp. 54-57.

R. Baker, "Data mining for education," in International Encyclopedia of Education, B. McGaw, P. Peterson, and E. Baker, Eds., 3rd ed. Oxford, U.K.: Elsevier, 2010.

T. Barnes, M. Desmarais, C. Romero, and S. Ventura, presented at the 2nd Int. Conf. on Educational Data Mining, Cordoba, Spain, 2009.

F. Castro, A. Vellido, A. Nebot, and F. Mugica, "Applying data mining techniques to e-learning problems," in Evolution of Teaching and Learning Paradigms in Intelligent Environment (Studies in Computational Intelligence), vol. 62, L. C. Jain, R. Tedman, and D. Tedman, Eds. New York: Springer-Verlag, 2007, pp. 183-221.

Minaei-Bidgoli, B., Punch, W., "Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System," Genetic and Evolutionary Computation, Part II, 2003, pp. 2252-2263.

Romero, C., Ventura, S., "Educational Data Mining: a Survey from 1995 to 2005," Expert Systems with Applications, 33(1), 2007, pp. 135-146.

J. Mostow and J. Beck, "Some useful tactics to modify, map and mine data from intelligent tutors," J. Nat. Lang. Eng., vol. 12, no. 2, pp. 195-208, 2006.

C. Antunes, "Acquiring background knowledge for intelligent tutoring systems," in Proc. 2nd Int. Conf. on Educational Data Mining, 2008, pp. 18-27.

Baker, R.S.J.D. and De Carvalho, A.M.J.A., "Labeling Student Behavior Faster and More Precisely with Text Replays," in Proc. 1st Int. Conf. on Educational Data Mining, 2008, pp. 38-47.
Nguyen Thai Nghe, Paul Janecek, and Peter Haddawy, "A Comparative Analysis of Techniques for Predicting Academic Performance," 37th ASEE/IEEE Frontiers in Education Conference, October 10-13, 2007, Milwaukee, WI, T2G-7.

Muhammad Sufyian Bin Mohd Azmi, "Academic Performance Prediction Based On Voting Technique," IEEE, 2011.

Otero, J., Sánchez, L., "Induction of Descriptive Fuzzy Classifiers with the Logitboost Algorithm," Soft Computing, 10(9), 2005, pp. 825-835.

Quinlan, J.R., C4.5: Programs for Machine Learning. Morgan Kaufman, 1993.

Delgado, M., Gibaja, E., Pegalajar, M.C., Pérez, O., "Predicting Students' Marks from Moodle Logs using Neural Network Models," Current Developments in Technology-Assisted Education, Badajoz, 2006, pp. 586-590.

Clark, P., Niblett, T., "The CN2 Induction Algorithm," Machine Learning, 3(4), 1989, pp. 261-283.

Venturini, G., "SIA: A Supervised Inductive Algorithm with Genetic Search for Learning Attribute-based Concepts," Conf. on Machine Learning, 1993, pp. 280-296.

Corcoran, A.L., Sen, S., "Using Real-valued Genetic Algorithms to Evolve Rule Sets for Classification," Conference on Evolutionary Computation, Orlando, 1994, pp. 120-124.

Espejo, P.G., Romero, C., Ventura, S., Hervás, C., "Induction of Classification Rules with Grammar-Based Genetic Programming," Conference on Machine Intelligence, 2005, pp. 596-601.

Moller, M.F., "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning," Neural Networks, 6(4), 1993, pp. 525-533.

Broomhead, D.S., Lowe, D., "Multivariable Functional Interpolation and Adaptative Networks," Complex Systems, 11, 1988, pp. 321-355.

Yao, X., "Evolving Artificial Neural Networks," Proceedings of the IEEE, 87(9), 1999, pp. 1423-1447.

Martínez, F.J., Hervás, C., Gutiérrez, P.A., Martínez, A.C., Ventura, S., "Evolutionary Product-Unit Neural Networks for Classification,"
Conference on Intelligent Data Engineering and Automated Learning, 2006, pp. 1320-1328.