Digit Recognizer (Machine Learning)
Internship Project Report

Submitted to: Persistent System Limited, Product Engineering Unit-1 for Data Mining, Pune
Submitted By:- Amit Kumar, PGPBA, Praxis Business School, Kolkata
Project Mentor:- Mr. Yogesh Badhe (Technical Specialist, Product Engineering Unit-1)
Start Date of Internship: 16th July 2012
End Date of Internship: 15th October 2012
Report Date: 15th October 2012
Preface

This report documents the work done during the summer internship at Persistent System Limited, Pune, on classifying handwritten digits, under the supervision of Mr. Yogesh Badhe. The report gives an overview of the tasks completed during the internship, with technical details. The results obtained are then discussed and analyzed. I have tried my best to keep the report simple yet technically correct. I hope I have succeeded in my attempt.

Amit Kumar
ACKNOWLEDGEMENT

Simply put, I could not have done this work without the generous help I received from the Data Mining Team. The work culture at Persistent System Limited is truly motivating. Everybody here is such a friendly and cheerful companion that work stress never gets in the way. I would especially like to thank Mr. Mukund Deshpande, who gave me this project so that I could learn and understand the business implications of statistical algorithms. I am also thankful to Mr. Yogesh Badhe, who helped me from understanding the project through to building the statistical model. He not only advised me on the project but also listened to my arguments in our discussions. I am also very thankful to Ms. Deepti Takale, who helped me a lot in absorbing the statistical concepts.

Amit Kumar
Abstract

The report presents the three tasks completed during the summer internship at Persistent System Limited, which are listed below:
1. Understanding the problem objective and its business implications
2. Understanding the data and building the model
3. Evaluating the model

All these tasks were completed successfully and the results were in line with expectations. All the tasks needed a systematic approach, from studying the behavior of the data, through applying the algorithm, to evaluating the model. The most challenging part was acquiring the domain knowledge needed to understand the behavior of the data. Once the data had been prepared, we applied a statistical algorithm for model building. This is one of the major areas and requires fundamental, conceptual knowledge of advanced statistics.

Amit Kumar
Introduction:- This project is taken from Kaggle, a platform for predictive modeling and analytics competitions. Organizations and researchers post their data there, and statisticians and data scientists from all over the world compete to produce the best models.

Problem Statement:- Each image shows a handwritten digit and is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel value associated with it, indicating the lightness or darkness of that pixel, with higher values meaning darker. The pixel value is an integer between 0 and 255, inclusive. Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix (indexing from zero).

Goal of the Competition:- Take an image of a single handwritten digit and determine what that digit is.

Approach for the model building:-

Develop the Analysis Plan:- To establish the conceptual model, we need to understand the selected techniques and the model implementation issues. Here we establish a predictive model based on the random forest algorithm. The model considerations are the sample size and the required type of variable (metric versus nonmetric).

Evaluation of Underlying Techniques:- All multivariate techniques rely on underlying assumptions, both statistical and conceptual, that substantially affect their ability to represent multivariate relationships. For techniques based on statistical inference, the assumptions of multivariate normality, linearity, independence of the error terms, and equality of variance in a dependence relationship must all be met. Since our target variable is categorical and we use a tree-based classifier, we do not need to check linearity or any independence relation.

Estimate the Model and Assess Overall Model Fit:- With the assumptions satisfied, the analysis proceeds to the actual estimation of the model and an assessment of overall model fit. In the estimation process we can choose options to meet specific characteristics of the data or to maximize the fit to the data. After the model is estimated, the overall model fit is evaluated to ascertain whether it achieves acceptable levels on the statistical criteria. Many times, the model will be respecified in an attempt to achieve a better level of overall fit or explanation.

Interpret the Variate:- With an acceptable level of model fit, interpreting the variate reveals the nature of the relationship. The interpretation of effects for individual variables is made by examining the estimated weight for each variable in the variate.

Validate the Model:- Before accepting the results, we must subject them to one final set of diagnostic analyses that assess the degree of generalization of the results by the available validation methods. The attempt to validate the model is directed towards demonstrating the generalizability of the results to the total population. These diagnostic analyses add little to the interpretation of the results but can be viewed as insurance that the results are the most descriptive of the data.
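Returning to the pixel naming scheme in the problem statement, the decomposition x = i * 28 + j can be illustrated with a small sketch in R. The helper names pixel_position and image_matrix are hypothetical and are not part of the competition material; the sketch only assumes the zero-based row and column convention described above.

```r
# Hypothetical helper: map a zero-based pixel index x (0..783)
# to its zero-based row i and column j, where x = i * 28 + j.
pixel_position <- function(x, width = 28) {
  stopifnot(all(x >= 0), all(x < width * width))
  list(row = x %/% width, col = x %% width)
}

# Hypothetical helper: arrange the 784 pixel values of one image
# (pixel0 ... pixel783) into a 28 x 28 matrix, filled row by row.
# Note that R matrices are one-based, so pixelx sits at [i + 1, j + 1].
image_matrix <- function(pixels, width = 28) {
  matrix(as.numeric(unlist(pixels)), nrow = width, ncol = width, byrow = TRUE)
}

pos <- pixel_position(c(0, 28, 29, 783))
print(pos)
```

With such a matrix, one image from the training set could be inspected visually, for example with R's image() function, although the orientation may need to be flipped for display.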
Required statistical concepts for this project:-

Data Mining:- Also known as Knowledge Discovery in Databases, data mining is a field at the intersection of computer science and statistics that attempts to discover patterns in large data sets. It uses methods from artificial intelligence, machine learning, statistics and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for future use.

Decision Tree:- A decision tree can be used to predict a pattern or to classify data. It is commonly used in data mining. The goal of the decision tree algorithm is to create a model that predicts the target variable from several input variables. In a decision tree, each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be learned by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset, which is called recursive partitioning. The general approach for building a tree is top-down induction.

Decision trees used in data mining are of two main types:
1. Classification Tree:- When the predicted outcome is the class to which the data belongs.
2. Regression Tree:- When the predicted outcome can be considered a real number.

The term classification and regression tree (CART) analysis is an umbrella term used to refer to both of the above procedures.

Some other techniques construct more than one decision tree, such as bagging, random forest and boosted trees. We have used the random forest for this project. The algorithms used for constructing decision trees usually work top-down by choosing, at each step, the next best variable to use in splitting the set of items. “Best” is defined by how well the variable splits the set into homogeneous subsets that have the same value of the target variable. Different algorithms use different formulae for measuring “best”. These are the mathematical functions through which we measure impurity.
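As an illustration of such impurity functions, the sketch below implements two commonly used measures, the Gini impurity and the entropy, and scores a candidate split by the weighted impurity of the two resulting subsets. The report does not state which measure was used, so the choice of these two is an assumption, and the function names are hypothetical.

```r
# Gini impurity: 1 - sum(p_k^2), where p_k is the share of class k.
gini_impurity <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Entropy: -sum(p_k * log2(p_k)), skipping classes with zero share.
entropy <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Weighted impurity of the two subsets produced by the test
# "feature <= threshold". Lower values mean more homogeneous subsets.
split_impurity <- function(feature, labels, threshold,
                           impurity = gini_impurity) {
  left  <- labels[feature <= threshold]
  right <- labels[feature > threshold]
  n <- length(labels)
  (length(left) / n) * impurity(left) +
    (length(right) / n) * impurity(right)
}

# Toy example: a pixel value that separates the two classes perfectly.
toy_feature <- c(1, 2, 3, 4)
toy_labels  <- c(0, 0, 1, 1)
print(gini_impurity(toy_labels))
print(split_impurity(toy_feature, toy_labels, threshold = 2))
```

A tree-growing algorithm would evaluate many candidate variables and thresholds in this way and keep the split with the lowest weighted impurity.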
Random Forest:- Recently there has been a lot of interest in “ensemble learning” methods that generate many classifiers and aggregate their results. Two well-known methods are boosting and bagging of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting. In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values.

The algorithm:- The random forests algorithm (for both classification and regression) is as follows:
1. Draw ntree bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split from among those variables. (Bagging can be thought of as the special case of random forests obtained when mtry = p, the number of predictors.)
3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).

An estimate of the error rate can be obtained, based on the training data, by the following:
1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls “out-of-bag”, or OOB, data) using the tree grown with the bootstrap sample.
2. Aggregate the OOB predictions. (On average, each data point would be out-of-bag around 36% of the time, so aggregate these predictions.) Calculate the error rate, and call it the OOB estimate of error rate.
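The building blocks of the steps above can be sketched in base R without fitting any trees. The sketch draws one bootstrap sample, identifies its out-of-bag rows, samples mtry candidate predictors for a node, and aggregates class votes by majority. The sizes used here (10000 rows, 784 predictors, mtry equal to the square root of the number of predictors) and the helper name majority_vote are illustrative assumptions, not values taken from the project.

```r
set.seed(1)

n <- 10000   # assumed number of training rows (illustrative)
p <- 784     # number of predictors (one per pixel)

# Step 1: one bootstrap sample, drawn with replacement.
bootstrap_idx <- sample(n, n, replace = TRUE)

# Rows never drawn are "out-of-bag" for the tree grown on this sample.
oob_idx <- setdiff(seq_len(n), bootstrap_idx)
oob_fraction <- length(oob_idx) / n

# The chance that a given row is left out is (1 - 1/n)^n,
# which approaches exp(-1), roughly 0.368, as n grows.
theoretical_fraction <- (1 - 1 / n)^n

# Step 2: at a node, only mtry randomly chosen predictors are
# considered for the split (assumed here: floor(sqrt(p))).
mtry <- floor(sqrt(p))
candidate_vars <- sample(p, mtry)

# Step 3: hypothetical helper that aggregates the class votes of the
# trees for one observation by simple majority.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

print(oob_fraction)
print(theoretical_fraction)
print(majority_vote(c("3", "8", "3", "3", "5")))
```

The simulated out-of-bag fraction should land close to the theoretical value, which is where the figure of around 36% quoted above comes from.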
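Source Code:-

# makes the random forest submission
library(randomForest)

# load the training and test sets
train <- read.csv("../data/train.csv", header=TRUE)
test <- read.csv("../data/test.csv", header=TRUE)

# the first column holds the digit label; the remaining columns are the pixels
labels <- as.factor(train[,1])
train <- train[,-1]

# fit the forest and predict the test set in the same call
rf <- randomForest(train, labels, xtest=test, ntree=1000)
predictions <- levels(labels)[rf$test$predicted]

# write one predicted digit per line
write(predictions, file="rf_benchmark.csv", ncolumns=1)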
Solutions:- From the training data set we draw five random samples (20 percent each) and build five different models. Since our data set is large, we need to combine the results of the respective models to validate the overall model accuracy. The approach can vary; for example, we could build a model on 80 percent of the training data and keep the remaining 20 percent to validate the model.

Model-1 (Random Forest Algorithm, sample = train.csv)
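The 80/20 alternative mentioned in the Solutions paragraph could be sketched roughly as follows. This is not the procedure used in the project; the helper names holdout_split and confusion_and_accuracy, the seed, and the data frame name digits in the commented usage are all hypothetical.

```r
# Hypothetical helper: split row numbers into a training part
# (default 80 percent) and a validation part (the rest).
holdout_split <- function(n_rows, train_fraction = 0.8, seed = 42) {
  set.seed(seed)
  n_train <- floor(train_fraction * n_rows)
  train_idx <- sample(n_rows, n_train)
  list(train = train_idx, valid = setdiff(seq_len(n_rows), train_idx))
}

# Hypothetical helper: confusion matrix and overall accuracy for
# digit labels 0..9, given actual and predicted classes.
confusion_and_accuracy <- function(actual, predicted, classes = 0:9) {
  actual    <- factor(actual, levels = classes)
  predicted <- factor(predicted, levels = classes)
  cm <- table(actual = actual, predicted = predicted)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}

# Assumed usage with the randomForest package, where digits is the
# training file read with read.csv and the label is in column 1:
# library(randomForest)
# s    <- holdout_split(nrow(digits))
# rf   <- randomForest(digits[s$train, -1],
#                      as.factor(digits[s$train, 1]), ntree = 1000)
# pred <- predict(rf, digits[s$valid, -1])
# confusion_and_accuracy(digits[s$valid, 1], as.character(pred))

# Small self-contained check with made-up labels.
s <- holdout_split(100)
res <- confusion_and_accuracy(actual = c(0, 1, 2, 2),
                              predicted = c(0, 1, 2, 1))
print(length(s$train))
print(res$accuracy)
```

The accuracy on the held-out 20 percent gives an estimate of how well the model generalizes, which can be compared with the out-of-bag estimate reported by the forest itself.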
Interpretation of the Model:- The prediction of the model on the test data set is taken as the collective result of all five models, with a weighted average used to determine the best-fitting result. Based on the confusion matrix and the recursive partitioning of the data set used to build the model, the average confidence is 0.954. Here we have avoided overfitting because, in the random forest algorithm, the data used to build each tree is drawn at random and without bias. An interesting point is that the random forest can handle missing values and does not require pruning. In a random forest, roughly one third (about 36%) of the samples are not selected in each bootstrap sample; these are called out-of-bag (OOB) samples. Using the OOB samples as input to the corresponding tree, predictions are made.

Bibliography:-
Multivariate Data Analysis by Hair, Black & Tatham
http://www.webchem.science.ru.nl:8080/PRiNS/rF.pdf
http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html
http://www.statmethods.net/interface/workspace.html