The caret Package: A Unified Interface for Predictive Models

The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and visualizing models.

An example from classification of music genres is used to illustrate the functionality on a real data set and to benchmark the benefits of parallel processing with several types of models.

Bio: Max is a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut. He has been applying models in the molecular diagnostic and pharmaceutical industries for over 15 years. He is the author of several R packages including the caret package that provides a simple and consistent interface to over 100 predictive models available in R.

Max has taught courses on modeling within Pfizer and externally. Recently, he taught modeling classes for the American Chemical Society, the Indian Ministry of Information Technology and Predictive Analytics World. He is a co-author of the forthcoming Springer book "Applied Predictive Modeling".

Transcript

  • 1. The caret Package: A Unified Interface for Predictive Models
    Max Kuhn
    Pfizer Global R&D, Nonclinical Statistics, Groton, CT
    max.kuhn@pfizer.com
    May 12, 2011
  • 2. Shameless Plug #1: Courses
    I'll be teaching two R classes here for Predictive Analytics World:
    - R Bootcamp (October 16)
    - R for Predictive Modeling: A Hands-On Introduction (October 17)
    http://www.predictiveanalyticsworld.com/newyork/2011/
  • 3. Motivation
    Theorem (No Free Lunch): In the absence of any knowledge about the prediction problem, no model can be said to be uniformly better than any other.
    Given this, it makes sense to use a variety of different models to find one that best fits the data. R has many packages for predictive modeling (aka machine learning, aka pattern recognition)...
  • 4. Model Function Consistency
    Since there are many modeling packages written by different people, there are some inconsistencies in how models are specified and predictions are made. For example, many models have only one method of specifying the model (e.g. the formula method only).
    The table below shows the syntax to get probability estimates from several classification models:

    obj class   package  predict function syntax
    lda         MASS     predict(obj)  (no options needed)
    glm         stats    predict(obj, type = "response")
    gbm         gbm      predict(obj, type = "response", n.trees)
    mda         mda      predict(obj, type = "posterior")
    rpart       rpart    predict(obj, type = "prob")
    Weka        RWeka    predict(obj, type = "probability")
    LogitBoost  caTools  predict(obj, type = "raw", nIter)
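    By contrast, models fit through caret share one prediction syntax. A minimal sketch of the unified interface (illustrative only; it uses the built-in iris data rather than anything from the talk):

    > library(caret)
    > set.seed(1)
    > fit <- train(Species ~ ., data = iris, method = "rpart")
    > ## The same call works no matter which underlying package fit the model:
    > head(predict(fit, iris, type = "prob"))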
  • 5. The caret Package
    The caret package was developed to:
    - create a unified interface for modeling and prediction
    - streamline model tuning using resampling
    - provide a variety of "helper" functions and classes for day-to-day model building tasks
    - increase computational efficiency using parallel processing
    First commits within Pfizer: 6/2005
    First version on CRAN: 10/2007
    Website: http://caret.r-forge.r-project.org
    JSS paper: www.jstatsoft.org/v28/i05/paper
    4 package vignettes (82 pages total)
  • 6. Example Data: TunedIT Music Challenge
    http://tunedit.org/challenge/music-retrieval/genres
    Using 191 descriptors, classify 12495 musical segments into one of 6 genres: Blues, Classical, Jazz, Metal, Pop, Rock. Use these data to predict a large test set of music segments.
  • 7. Example Data: TunedIT Music Challenge
    The predictors and class variables are contained in a data frame called music.

    > head(music[, 1:5])
          TC      SC     SC_V     ASE1     ASE2
    1 2.5788  481.45  76989.0 -0.12334 -0.11578
    2 2.7195 1405.30 825380.0 -0.17655 -0.18323
    3 2.5351  601.09 686240.0 -0.13940 -0.13251
    4 2.4465  637.73 122580.0 -0.14995 -0.14802
    5 2.5657  776.86 124010.0 -0.16863 -0.16112
    6 2.7737  447.09   8531.9 -0.16128 -0.15742
    > head(music$GENRE)
    [1] Pop       Blues     Pop       Jazz      Jazz      Classical
    Levels: Blues Classical Jazz Metal Pop Rock
  • 8. Data Splitting
    createDataPartition conducts stratified random splits.

    > ## Create a test set with 25% of the data
    > set.seed(1)
    > inTrain <- createDataPartition(music$GENRE, p = .75)[[1]]
    > length(inTrain)
    [1] 9373
    > head(inTrain)
    [1]  2  7 14 20 22 47

    This produces a list with one element per resample. The list elements are the integer row indices of the resampled set.
  • 9. Data Splitting
    > trainDescr <- music[ inTrain, -ncol(music)]
    > testDescr  <- music[-inTrain, -ncol(music)]
    > trainClass <- music$GENRE[ inTrain]
    > testClass  <- music$GENRE[-inTrain]
    > prop.table(table(music$GENRE))

         Blues  Classical       Jazz      Metal        Pop       Rock
    0.12773109 0.27563025 0.24033613 0.07394958 0.12605042 0.15630252

    > prop.table(table(trainClass))
    trainClass
         Blues  Classical       Jazz      Metal        Pop       Rock
    0.12770724 0.27557879 0.24037128 0.07393577 0.12610690 0.15630001

    Other functions: createFolds, createMultiFolds, createResample
  • 10. Data Pre-Processing Methods
    preProcess calculates values from one data set that can then be applied to any other data set (e.g. the training set, unknown samples).
    Current methods: centering, scaling, the spatial sign transformation, PCA or ICA "signal extraction", imputation (via bagging or k-nearest neighbors), and Box-Cox transformations.

    > ## Determine means and sds
    > procValues <- preProcess(trainDescr, method = c("center", "scale"))
    > procValues
    > ## Use the predict methods to do the adjustments
    > trainScaled <- predict(procValues, trainDescr)
    > testScaled  <- predict(procValues, testDescr)

    preProcess can also be called within other functions, such as train, for each resampling iteration.
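    The other methods plug in the same way. A minimal sketch combining Box-Cox transformations with k-nearest-neighbor imputation (descrWithNAs is a hypothetical predictor frame with missing values; note that "knnImpute" also centers and scales, since the neighbor distances require it):

    > ## descrWithNAs: hypothetical predictors containing missing values
    > bcImpute <- preProcess(descrWithNAs,
    +                        method = c("BoxCox", "knnImpute"))
    > descrFilled <- predict(bcImpute, descrWithNAs)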
  • 11. Model Tuning Using Resampling
    Define sets of model parameter values to evaluate;
    for each parameter set do
        for each resampling iteration do
            Hold out specific samples;
            Fit the model on the remainder;
            Predict the hold-out samples;
        end
        Calculate the average performance across hold-out predictions;
    end
    Determine the optimal parameter set;
  • 12. Model Tuning
    train uses resampling to tune and/or evaluate candidate models.

    > set.seed(1)
    > rbfSVM <- train(x = trainDescr, y = trainClass,
    +                 method = "svmRadial",
    +                 ## center and scale
    +                 preProc = c("center", "scale"),
    +                 ## Length of default tuning parameter grid
    +                 tuneLength = 8,
    +                 ## Repeated cross-validation resampling
    +                 trControl = trainControl(method = "repeatedcv",
    +                                          repeats = 5),
    +                 ## Pick the best model using resampled Kappa
    +                 metric = "Kappa",
    +                 ## Pass arguments to ksvm
    +                 fit = FALSE)
  • 13. Model Tuning
    > print(rbfSVM, printCall = FALSE)

    9373 samples
     191 predictors
       6 classes: Blues, Classical, Jazz, Metal, Pop, Rock

    Pre-processing: centered, scaled
    Resampling: Cross-Validation (10 fold, repeated 5 times)
    Summary of sample sizes: 8437, 8435, 8434, 8435, 8437, 8436, ...
    Resampling results across tuning parameters:

       C     Accuracy  Kappa  Accuracy SD  Kappa SD
       0.25  0.916     0.895  0.00953      0.0119
       0.5   0.938     0.923  0.00824      0.0103
       1     0.956     0.945  0.00641      0.008
       2     0.964     0.955  0.00614      0.00766
       4     0.968     0.961  0.0061       0.00761
       8     0.969     0.962  0.00623      0.00777
      16     0.969     0.962  0.00633      0.0079
      32     0.969     0.962  0.0063       0.00786

    Tuning parameter sigma was held constant at a value of 0.00518.
    Kappa was used to select the optimal model using the largest value.
    The final values used for the model were C = 16 and sigma = 0.00518.
  • 14. Model Tuning
    > class(rbfSVM)
    [1] "train"
    > class(rbfSVM$finalModel)
    [1] "ksvm"
    attr(,"package")
    [1] "kernlab"
  • 15. Model Tuning
    - train uses as many "tricks" as possible to reduce the number of model fits (e.g. using sub-models). Here, it uses the kernlab function sigest to analytically estimate the RBF scale parameter.
    - Currently, there are options for 110 models (see ?train for a list)
    - Allows user-defined search grids, performance metrics and selection rules (see the sketch after this list)
    - Easily integrates with any parallel processing framework that can emulate lapply
    - Formula and non-formula interfaces
    - Methods: predict, print, plot, varImp, resamples, xyplot, densityplot, histogram, stripplot, ...
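    As an illustration, a custom search grid can be passed through train's tuneGrid argument. A minimal sketch (the grid values are made up; note that caret releases from this era expected grid column names prefixed with a dot, e.g. .sigma and .C, while current versions do not):

    > ## Hypothetical grid values for the radial SVM
    > svmGrid <- expand.grid(sigma = c(0.002, 0.005, 0.01),
    +                        C = 2^(-2:5))
    > ## Same call as before, with an explicit grid instead of tuneLength
    > rbfSVM2 <- train(x = trainDescr, y = trainClass,
    +                  method = "svmRadial",
    +                  preProc = c("center", "scale"),
    +                  tuneGrid = svmGrid,
    +                  trControl = trainControl(method = "repeatedcv",
    +                                           repeats = 5),
    +                  metric = "Kappa")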
  • 16. Plots
    plot(rbfSVM, xTrans = function(x) log2(x))

    [Figure: resampled accuracy (repeated cross-validation) against log2(Cost); accuracy climbs from about 0.92 and plateaus near 0.97.]
  • 17. Plots
    densityplot(rbfSVM, metric = "Kappa", pch = "|")

    [Figure: density plot of the resampled Kappa values.]
  • 18. Prediction and Performance Assessment
    The predict method can be used to get results for other data sets:

    > svmPred <- predict(rbfSVM, testDescr)
    > str(svmPred)
     Factor w/ 6 levels "Blues","Classical",..: 3 2 6 3 5 6 5 1 2 6 ...
    > svmProbs <- predict(rbfSVM, testDescr, type = "prob")
    > head(svmProbs)
           Blues  Classical       Jazz      Metal        Pop       Rock
    1 0.03109657 0.51176742 0.31534778 0.05645315 0.02457019 0.06076489
    2 0.00000000 0.98948148 0.01051852 0.00000000 0.00000000 0.00000000
    3 0.05631158 0.03418600 0.07429845 0.14161385 0.20161666 0.49197345
    4 0.09363752 0.15474426 0.32233519 0.14328794 0.14338776 0.14260733
    5 0.07702743 0.09083003 0.16349012 0.17140600 0.23710395 0.26014248
    6 0.06928080 0.03574326 0.08477684 0.13890564 0.18578909 0.48550437
  • 19. Prediction and Performance Assessment
    > confusionMatrix(svmPred, testClass)
    Confusion Matrix and Statistics

               Reference
    Prediction  Blues Classical Jazz Metal Pop Rock
      Blues       395         0    0     3   1    1
      Classical     0       841   21     0   1    2
      Jazz          4        20  724     9   4    8
      Metal         0         0    0   214   2    0
      Pop           0         0    0     3 378    6
      Rock          0         0    5     2   7  471

    Overall Statistics

                   Accuracy : 0.9683
                     95% CI : (0.9615, 0.9742)
        No Information Rate : 0.2758
        P-Value [Acc > NIR] : < 2.2e-16
                      Kappa : 0.9605
     Mcnemar's Test P-Value : NA
  • 20. Prediction and Performance Assessment
    Statistics by Class:

                          Class: Blues Class: Classical Class: Jazz Class: Metal Class: P
    Sensitivity                 0.9900           0.9768      0.9653      0.92641     0.96
    Specificity                 0.9982           0.9894      0.9810      0.99931     0.99
    Pos Pred Value              0.9875           0.9723      0.9415      0.99074     0.97
    Neg Pred Value              0.9985           0.9911      0.9890      0.99415     0.99
    Prevalence                  0.1278           0.2758      0.2402      0.07399     0.12
    Detection Rate              0.1265           0.2694      0.2319      0.06855     0.12
    Detection Prevalence        0.1281           0.2771      0.2463      0.06919     0.12

    [The Pop and Rock columns run off the right edge of the slide.]
  • 21. Comparing Models
    We can use the resampling results to make formal and informal comparisons between models. Based on the work of:
    - Hothorn et al. "The design and analysis of benchmark experiments." Journal of Computational and Graphical Statistics (2005) vol. 14 (3), pp. 675-699
    - Eugster et al. "Exploratory and inferential analysis of benchmark experiments." Ludwig-Maximilians-Universität München, Department of Statistics, Tech. Rep. (2008) vol. 30
  • 22. Comparing Models
    > set.seed(1)
    > rfFit <- train(x = trainDescr, y = trainClass,
    +                method = "rf", tuneLength = 5,
    +                trControl = trainControl(method = "repeatedcv",
    +                                         repeats = 5,
    +                                         verboseIter = FALSE),
    +                metric = "Kappa")
    > set.seed(1)
    > plsFit <- train(x = trainDescr, y = trainClass,
    +                 method = "pls", tuneLength = 20,
    +                 preProc = c("center", "scale", "BoxCox"),
    +                 trControl = trainControl(method = "repeatedcv",
    +                                          repeats = 5,
    +                                          verboseIter = FALSE),
    +                 metric = "Kappa")
  • 23. Comparing Models
    > resamps <- resamples(list(rf = rfFit, pls = plsFit, svm = rbfSVM))
    > print(summary(resamps))

    Call:
    summary.resamples(object = resamps)

    Models: rf, pls, svm
    Number of resamples: 50

    Accuracy
          Min. 1st Qu. Median   Mean 3rd Qu.   Max.
    rf  0.9200  0.9328 0.9370 0.9370  0.9424 0.9499
    pls 0.8348  0.8488 0.8554 0.8554  0.8631 0.8806
    svm 0.9478  0.9648 0.9691 0.9694  0.9752 0.9819

    Kappa
          Min. 1st Qu. Median   Mean 3rd Qu.   Max.
    rf  0.9003  0.9162 0.9215 0.9215  0.9282 0.9376
    pls 0.7932  0.8106 0.8192 0.8190  0.8286 0.8507
    svm 0.9350  0.9561 0.9615 0.9619  0.9691 0.9774
  • 24. Comparing Models
    > diffs <- diff(resamps, metric = "Kappa")
    > print(summary(diffs))

    Call:
    summary.diff.resamples(object = diffs)

    p-value adjustment: bonferroni
    Upper diagonal: estimates of the difference
    Lower diagonal: p-value for H0: difference = 0

    Kappa
        rf        pls       svm
    rf            0.10245   -0.04043
    pls < 2.2e-16           -0.14288
    svm < 2.2e-16 < 2.2e-16
  • 25. Parallel Coordinate Plots
    parallel(resamps, metric = "Kappa")

    [Figure: parallel coordinate plot of the resampled Kappa values for svm, rf and pls.]
  • 26. Box Plots
    bwplot(resamps, metric = "Kappa")

    [Figure: box plots of the resampled Kappa values for svm, rf and pls.]
  • 27. Dot Plots of Average Differences
    dotplot(diffs)

    [Figure: dot plot of the average Kappa differences (rf - svm, rf - pls, pls - svm) with 0.983 multiplicity-adjusted confidence intervals.]
  • 28. Feature Selection
    There are many predictive models with built-in feature selection (e.g. trees, the lasso, MARS, etc). caret also contains a few functions for supervised feature selection via "wrappers". The two wrapper techniques in caret are:
    - recursive feature elimination (RFE)
    - filtering using simple, univariate statistics
    This can be tricky and can be fraught with bias. See Ambroise and McLachlan (2002) for an example.
  • 29. Recursive Feature Selection
    This is basically backwards selection: we rank the predictors by importance, then cull the least important. We create a performance profile across the subset sizes and pick the best one. The final model is refit using only that subset. The feature selection step must be cross-validated!
  • 30. Recursive Feature Elimination
    for each resampling iteration do
        Partition original data into training and hold-back sets via resampling;
        Train the model on the training set using all predictors;
        Predict the held-back samples;
        Calculate variable importance or rankings;
        for each subset size Si, i = 1 ... S do
            Keep the Si most important variables;
            Train the model on the training set using Si predictors;
            Predict the held-back samples;
        end
    end
    Calculate the performance profile over the Si using the held-back samples;
    Determine the appropriate number of predictors;
    Estimate the final list of predictors to keep in the final model;
    Fit the final model based on the optimal Si using the original data set;
  • 31. Recursive Feature Selection
    The rfe function is a framework for doing this. There are several pre-defined sets of functions for certain models (and a wrapper for train). For illustration, let's run a few regression models across the subset sizes:

    > data(BloodBrain)
    > varSizes <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 80, 80)
    > x <- bbbDescr[, -nearZeroVar(bbbDescr)]
    > x <- x[, -findCorrelation(cor(x), .9)]
    > set.seed(1)
    > lmProfile <- rfe(x, logBBB,
    +                  sizes = varSizes,
    +                  rfeControl = rfeControl(functions = lmFuncs,
    +                                          number = 200,
    +                                          verbose = FALSE))
    > rfProfile <- rfe(x, logBBB,
    +                  sizes = varSizes,
    +                  rfeControl = rfeControl(functions = rfFuncs,
    +                                          number = 200,
    +                                          verbose = FALSE))
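    The filter approach from slide 28 has an analogous framework in the sbf ("selection by filter") function. A minimal sketch on the same data (rfSBF and the sbfControl arguments are as in current caret; treat the resampling choices as illustrative):

    > set.seed(1)
    > ## Univariate filtering, then a random forest on the surviving predictors
    > rfFilter <- sbf(x, logBBB,
    +                 sbfControl = sbfControl(functions = rfSBF,
    +                                         method = "cv",
    +                                         number = 10))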
  • 32. Backwards Selection Results

    [Figure: RMSE profiles versus the number of variables (0-80) for linear regression and random forests; the random forest profile levels off near an RMSE of 0.6 while linear regression stays higher.]
  • 33. Opportunities for Parallel Processing
    Recall the algorithm for selecting models via resampling:

    Define sets of model parameter values to evaluate;
    for each parameter set do
        for each resampling iteration do
            Hold out specific samples;
            Fit the model on the remainder;
            Predict the hold-out samples;
        end
        Calculate the average performance across hold-out predictions;
    end
    Determine the optimal parameter set;
  • 34. Opportunities for Parallel Processing
    In this process, M models are fit to B resampled data sets. There is (usually*) no connection between these models, so they could be run within different processes on the same computer or over separate computers. Can we get any benefit from parallel processing?

    * There are some exceptions where sub-models are evaluated without further re-fitting. For example, if we can fit a PLS model with 10 components, we can get the results from models with 1-9 components for free. We'll call this the "sub-model" trick.
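    For reference, in newer caret versions this parallelism is picked up from any registered foreach backend; a minimal sketch using doParallel (this mechanism postdates the 2011 talk):

    > library(doParallel)
    > cl <- makeCluster(4)    ## four worker processes
    > registerDoParallel(cl)
    > ## train() calls made from here on fit their resamples in parallel
    > stopCluster(cl)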
  • 35. An Example: Boosted Trees
    We trained a medium sized data set (n = 4,500) to tune a gradient boosting machine (GBM) model sequentially and in parallel. We fit models with four different values of the interaction depth and 10 different values for the number of boosting iterations.
    It turns out that, for each value of the interaction depth, we can fit one model with the largest number of iterations and get the predictions from smaller models at no cost. This means we need to fit four models (with different interaction depths) for 50 bootstrap samples. We'll partition these 200 model fits onto different processes in a few ways to see if parallelization helps.
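    A sketch of the kind of tuning grid being described (the values are made up to match the 4 x 10 layout; current caret also expects an n.minobsinnode column for gbm, and trainX/trainY stand in for the undisclosed benchmark data):

    > ## Hypothetical 4 x 10 grid: 4 depths, 10 boosting iteration counts
    > gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5, 7),
    +                        n.trees = (1:10) * 100,
    +                        shrinkage = 0.1,
    +                        n.minobsinnode = 10)
    > gbmFit <- train(x = trainX, y = trainY,
    +                 method = "gbm",
    +                 tuneGrid = gbmGrid,
    +                 ## 50 bootstrap resamples, as in the benchmark
    +                 trControl = trainControl(method = "boot", number = 50))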
  • 36. Execution Times: An Example with Support Vector Machines
    SVM regression models with 5 candidate values of the cost parameter and 50 bootstrap iterations can be tested on the same data. caret uses sub-models wherever possible to be efficient but, unlike boosted trees, support vector machines cannot be exploited in this way.
    The GBM and SVM computations were performed using a sequential process and parallel processing with 1 to 16 "worker nodes".
  • 37. Execution Time Results

    [Figure: training time versus number of processors (1-16) for the GBM and SVM models, sequential versus parallel.]
  • 38. Speedups
    speedup = sequential time / parallel time

    [Figure: speedup versus number of processors (1-16) for the GBM and SVM models.]
  • 39. Results
    There is a benefit to adding more workers for these calculations. The optimal speedup would be W, where W is the number of workers. We are not optimal, but we can cut the execution time down by 4-5 fold. The SVM model benefited more than the GBM model, perhaps because GBM was fitting fewer models (using sub-models).
  • 40. Other Functions and Classes
    - nearZeroVar: a function to remove predictors that are sparse and highly unbalanced
    - findCorrelation: a function to remove the optimal set of predictors to achieve low pair-wise correlations
    - predictors: class for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models) (currently 57 methods)
    - confusionMatrix, sensitivity, specificity, posPredValue, negPredValue: classes for assessing classifier performance
    - varImp: classes for assessing the aggregate effect of a predictor on the model equations (currently 20 methods)
  • 41. Other Functions and Classes
    - knnreg: nearest-neighbor regression
    - plsda, splsda: PLS discriminant analysis
    - icr: independent component regression
    - pcaNNet: nnet:::nnet with an automatic PCA pre-processing step
    - bagEarth, bagFDA: bagging with MARS and FDA models
    - normalize2Reference: RMA-like processing of Affy arrays using a training set
    - spatialSign: class for transforming numeric data (x' = x / ||x||)
    - maxDissim: a function for maximum dissimilarity sampling
    - featurePlot: a wrapper for several lattice functions
  • 42. Shameless Plug #2: Other Packages
    A few others that I'm working on...
    - sparseLDA: lasso-type regularization for LDA
    - Cubist: Quinlan's model trees
    - C5.0: Quinlan's decision trees (I could use some C help here)
    - FuseBox: a framework for combining ensembles of models
  • 43. Thanks
    Kirk Mettler, Bruno and the NYC Predictive Analytics organizers; R Core; Pfizer's statistics leadership for providing the time and support to create R packages; and the caret contributors: Jed Wing, Steve Weston, Andre Williams, Chris Keefer and Allan Engelhardt.
  • 44. Session Info
    R version 2.11.1 (2010-05-31), x86_64-apple-darwin9.8.0
    Base packages: base, datasets, graphics, grDevices, methods, splines, stats, tools, utils
    Other packages: caret 4.87, class 7.3-2, cluster 1.12.3, codetools 0.2-2, digest 0.4.2, e1071 1.5-24, gbm 1.6-3.1, kernlab 0.9-12, lattice 0.18-8, plyr 1.2.1, reshape 0.8.3, survival 2.35-8, weaver 1.16.0
    Loaded via a namespace (and not attached): grid 2.11.1

    This presentation was created with a Mac Pro using LaTeX and R's Sweave function at 16:02 on Wednesday, May 11, 2011.
