International Journal of Computational Engineering Research (ijceronline.com) Vol. 3, Issue 1


 Comparison of Imputation Techniques after Classifying the Dataset Using
           kNN Classifier for the Imputation of Missing Data
                          Ms. R. Malarvizhi (1), Dr. Antony Selvadoss Thanamani (2)
    (1) Research Scholar, Research Department of Computer Science, NGM College, 90 Palghat Road, Pollachi, Bharathiar University, Coimbatore.
    (2) Professor and HOD, Research Department of Computer Science, NGM College, 90 Palghat Road, Pollachi, Bharathiar University, Coimbatore.

Abstract:
         Missing data must be imputed using the techniques available. In this paper four imputation techniques are compared on datasets grouped using a k-NN classifier, and the results are compared in terms of percentage accuracy. The imputation techniques Median Substitution and Standard Deviation show better results than Linear Regression and Mean Substitution.

Keywords: Mean Substitution, Median Substitution, Linear Regression, Standard Deviation, k-NN Classifier.

1. Introduction
         Missing data imputation fills in missing values using the data that is available. Improper imputation produces biased results, so proper attention is needed when imputing missing values. Accuracy is measured as a percentage. The data must be pre-processed before values are imputed with the imputation techniques. A k-NN classifier is used to classify the dataset into several groups using the given training dataset. Each imputation technique is then applied separately to each group, checked for accuracy, and the results are compared.

2. Missing Data Mechanisms
         In statistical analysis, values in a data set are missing completely at random (MCAR) if the events that lead to any particular item being missing are independent both of observable variables and of unobservable parameters of interest. Missing at random (MAR) means that the cause of the missingness does not depend on the missing data itself. Not missing at random (NMAR) means the data are missing for a specific reason related to the unobserved values.

3. Imputation Methods
        Imputation is the process of replacing missing values with estimates derived from the available data. Both supervised and unsupervised imputation techniques exist. Techniques such as Listwise Deletion and Pairwise Deletion instead delete the entire row; the motivation of this paper is to impute missing values rather than delete them. In this paper, unsupervised imputation techniques are used and the results are compared in terms of percentage accuracy.

3.1. Mean Substitution
         Mean Substitution replaces each missing value with the mean of the available data. The mean can be calculated using the formula

                            X̄ = ΣX / N

where:
           X̄ is the symbol for the mean,
           Σ is the symbol for summation,
           X is the symbol for the scores, and
           N is the symbol for the number of scores.
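As a minimal sketch (not code from the paper), mean substitution can be implemented with `NaN` marking the missing entries; the helper name is hypothetical:

```python
import numpy as np

def mean_substitution(values):
    """Replace missing entries (NaN) with the mean of the observed values."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)          # sum of observed scores / number of observed scores
    return np.where(np.isnan(values), mean, values)

filled = mean_substitution([4.0, np.nan, 6.0, 8.0])
print(filled)  # the NaN becomes the mean of 4, 6 and 8, i.e. 6.0
```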

3.2. Median Substitution
           Median Substitution replaces each missing value with the median of the grouped data. For grouped data the median can be calculated using the formula

                            Median = L + (h/f)(n/2 - c)

where
L is the lower class boundary of the median class,
h is the size of the median class, i.e. the difference between the upper and lower class boundaries of the median class,
f is the frequency of the median class,

||Issn 2250-3005(online)||                                      ||January || 2013                                  Page 101

c is the cumulative frequency of the class preceding the median class, and
n/2 is the total number of observations divided by 2.
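The grouped-data median formula can be sketched directly (the helper and the example class boundaries are hypothetical, chosen only to illustrate the terms defined above):

```python
def grouped_median(L, h, f, n, c):
    """Median = L + (h / f) * (n / 2 - c) for grouped frequency data."""
    return L + (h / f) * (n / 2 - c)

# Example: median class 10-20 (L=10, h=10) with frequency f=12,
# n=40 observations in total, c=15 observations before the median class.
print(grouped_median(10, 10, 12, 40, 15))  # 10 + (10/12)*(20-15)
```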

3.3. Standard Deviation
         The standard deviation measures the spread of the data about the mean value. It is useful for comparing sets of data which may have the same mean but a different range. The standard deviation is given by the formula

                            s = sqrt( Σ(Xi - X̄)² / N )

where X1, ..., XN are the observed values of the sample items, X̄ is the mean value of these observations, and the denominator N stands for the size of the sample.
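The formula can be sketched in Python. Note that the paper does not detail how the standard deviation is turned into an imputed value, so this only computes the statistic itself; the function name is hypothetical:

```python
import math

def std_dev(values):
    """Population standard deviation: sqrt(sum((x - mean)^2) / N)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

print(std_dev([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # mean is 5, so this prints 2.0
```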

3.4. Linear Regression
         Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other a dependent variable. The fitted line has the form

                            Y = a + bX

where X is the explanatory variable and Y is the dependent variable; b is the slope of the line and a is the intercept.
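As an illustrative sketch (not code from the paper), a least-squares fit of Y = a + bX can be computed as below; a fitted line like this can then predict a missing dependent value from its observed explanatory value:

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + b*X; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope b = covariance(X, Y) / variance(X); intercept a = mean(Y) - b * mean(X)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on Y = 1 + 2X
print(a, b)  # 1.0 2.0
```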

4. Database and Data Pre-Processing
          The dataset consists of 10000 records taken from a UK census conducted in 2001 and has 6 variables. The data is pre-processed before the experiment: the original dataset is modified by deleting some values, with missing percentages of 2, 5, 10, 15, and 20. The data is then classified into several groups using the k-NN classifier. Each group is taken separately for the experiment; the imputation techniques are implemented one by one, and performance is measured by comparing the result with the original database in terms of accuracy.
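The masking step can be sketched as follows (a hypothetical helper; the paper does not give its procedure in code). Deleting a known fraction of values from a copy of the data lets each imputed result be checked against the original later:

```python
import random

def mask_values(records, fraction, seed=0):
    """Randomly delete a given fraction of cell values (set them to None)."""
    rng = random.Random(seed)
    masked = [row[:] for row in records]
    cells = [(i, j) for i, row in enumerate(masked) for j in range(len(row))]
    for i, j in rng.sample(cells, int(len(cells) * fraction)):
        masked[i][j] = None
    return masked

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
masked = mask_values(data, 0.25)  # 25% of the 12 cells -> 3 values removed
print(sum(v is None for row in masked for v in row))  # 3
```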

5. K-Nearest Neighbor Classifiers
          The idea in k-Nearest Neighbor methods is to identify the k samples in the training set whose independent variables x are most similar to those of a new sample u, and to use these k samples to classify u into a class v. Assuming the underlying function f is smooth, a reasonable idea is to look for training samples that are near u (in terms of the independent variables) and to compute v from the class values y of those samples. The distance or dissimilarity between samples is measured using Euclidean distance.

          The simplest case is k = 1, where we find the training sample closest to u (the nearest neighbor) and set v = y, where y is the class of that neighboring sample. In k-NN generally, the nearest k neighbors of u are found and a majority decision rule classifies the new sample. The advantage is that higher values of k provide smoothing that reduces the risk of over-fitting due to noise in the training data. In typical applications k is in the units or tens rather than in the hundreds or thousands.
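The majority-vote rule described above can be sketched as follows (function names and example data are hypothetical):

```python
import math
from collections import Counter

def knn_classify(train, u, k=3):
    """Classify u by majority vote among its k nearest training samples.
    train is a list of (features, label) pairs; distance is Euclidean."""
    neighbors = sorted(train, key=lambda s: math.dist(s[0], u))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B"), ((1, 0), "A")]
print(knn_classify(train, (0.5, 0.5), k=3))  # all 3 nearest neighbors are "A"
```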

6. Experimental Analysis
        In our research the database is grouped into 3, 6, and 9 groups for the experiment. The imputation techniques described above (Mean Substitution, Median Substitution, Standard Deviation, and Linear Regression) are implemented in each group, with missing percentages of 2, 5, 10, 15, and 20.

Table 1 reports the percentage accuracy for the various groupings; for accuracy, each imputed data set is compared with the original dataset. Linear Regression and Mean Substitution give similarly poor results, while Median Substitution and Standard Deviation give similarly better results. When the number of groups is larger there is some improvement in the results.

Table 2 reports the overall average accuracy.
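The paper does not specify exactly how per-value accuracy is computed; one plausible sketch, assuming an imputed value counts as correct when it falls within a tolerance of the original value, is:

```python
def imputation_accuracy(original, imputed, missing_idx, tol=0.5):
    """Percentage of imputed values that match the original values
    (within a tolerance, since imputed values are estimates)."""
    hits = sum(abs(original[i] - imputed[i]) <= tol for i in missing_idx)
    return 100.0 * hits / len(missing_idx)

original = [10.0, 12.0, 11.0, 30.0]
imputed  = [10.0, 11.8, 11.2, 20.0]   # positions 1-3 were missing and imputed
print(imputation_accuracy(original, imputed, [1, 2, 3]))  # 2 of 3 within tolerance
```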





                  Table 1. Comparison of Imputation Techniques in Groups Classified Using the k-NN Classifier
       Percentage          3 GROUPS                    6 GROUPS                    9 GROUPS
       of Missing   Mean  Med  Std  Linear      Mean  Med  Std  Linear      Mean  Med  Std  Linear
                    Sub   Sub  Dev  Regress     Sub   Sub  Dev  Regress     Sub   Sub  Dev  Regress

             2       67    70   70    70         69    71   71    66         71    73   72    67
             5       69    72   72    69         75    76   76    75         74    77   77    75
            10       66    71   71    66         72    78   78    72         71    80   80    71
            15       64    72   72    64         66    77   77    66         66    79   79    66
            20       60    72   72    60         55    78   77    56         50    80   80    52

                  Table 2: Performance of the Above Imputation Methods in Terms of Percentage Accuracy

                     Percentage         Mean              Median          Standard         Linear
                     of Missing      Substitution      Substitution      Deviation       Regression
                          2               69                71               71               68
                          5               73                75               75               73
                         10               70                76               76               70
                         15               65                76               76               65
                         20               55                77               76               56
                      Average            66.4               75              74.8             66.4


                                          Figure 1: Comparison of Imputation Methods
          [Bar chart of percentage accuracy (0 to 90) for Mean Sub, Med Sub, Std Dev, and Linear Reg at missing percentages of 2%, 5%, 10%, 15%, and 20%.]

                                                Figure 1 shows the differences graphically.

7. Conclusions and Future Enhancement
         In this paper, the imputation techniques Median Substitution and Standard Deviation show the better results, so they can be used for further research. Our research also finds that when the data is grouped into more groups, there is a marked improvement in the percentage accuracy. In future, algorithms can be developed for grouping the data; after grouping, the two better-performing imputation methods can be applied.





