R Reference Card for Data Mining


A list of R packages and functions for data mining.

Published in: Technology

  1. R Reference Card for Data Mining
     by Yanchang Zhao, yanchang@rdatamining.com, January 3, 2013
     The latest version is available at http://www.RDataMining.com. The same link also leads to the document R and Data Mining: Examples and Case Studies.
     The package names are in parentheses.

     Association Rules & Frequent Itemsets
       APRIORI Algorithm
         a level-wise, breadth-first algorithm which counts transactions to find frequent itemsets
         apriori(): mine associations with APRIORI algorithm (arules)
       ECLAT Algorithm
         employs equivalence classes, depth-first search and set intersection instead of counting
         eclat(): mine frequent itemsets with the Eclat algorithm (arules)
       Packages
         arules: mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules. It includes two algorithms, Apriori and Eclat.
         arulesViz: visualizing association rules

     Sequential Patterns
       Functions
         cspade(): mining frequent sequential patterns with the cSPADE algorithm (arulesSequences)
         seqefsub(): searching for frequent subsequences (TraMineR)
       Packages
         arulesSequences: add-on for arules to handle and mine frequent sequences
         TraMineR: mining, describing and visualizing sequences of states or events

     Classification & Prediction
       Decision Trees
         ctree(): conditional inference trees, recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework (party)
         rpart(): recursive partitioning and regression trees (rpart)
         mob(): model-based recursive partitioning, yielding a tree with fitted models associated with each terminal node (party)
       Random Forest
         cforest(): random forest and bagging ensemble (party)
         randomForest(): random forest (randomForest)
         varimp(): variable importance (party)
         importance(): variable importance (randomForest)
       Neural Networks
         nnet(): fit single-hidden-layer neural network (nnet)
       Support Vector Machine (SVM)
         svm(): train a support vector machine for regression, classification or density-estimation (e1071)
         ksvm(): support vector machines (kernlab)
       Performance Evaluation
         performance(): provide various measures for evaluating performance of prediction and classification models (ROCR)
         roc(): build a ROC curve (pROC)
         auc(): compute the area under the ROC curve (pROC)
         ROC(): draw a ROC curve (DiagnosisMed)
         PRcurve(): precision-recall curves (DMwR)
         CRchart(): cumulative recall charts (DMwR)
       Packages
         rpart: recursive partitioning and regression trees
         party: recursive partitioning
         randomForest: classification and regression based on a forest of trees using random inputs
         rpartOrdinal: ordinal classification trees, deriving a classification tree when the response to be predicted is ordinal
         rpart.plot: plots rpart models with an enhanced version of plot.rpart in the rpart package
         ROCR: visualize the performance of scoring classifiers
         pROC: display and analyze ROC curves

     Regression
       Functions
         lm(): linear regression
         glm(): generalized linear regression
         nls(): non-linear regression
         predict(): predict with models
         residuals(): residuals, the difference between observed values and fitted values
         gls(): fit a linear model using generalized least squares (nlme)
         gnls(): fit a nonlinear model using generalized least squares (nlme)
       Packages
         nlme: linear and nonlinear mixed effects models

     Clustering
       Partitioning based Clustering
         partition the data into k groups first and then try to improve the quality of clustering by moving objects from one group to another
         kmeans(): perform k-means clustering on a data matrix
         kmeansCBI(): interface function for kmeans (fpc)
         kmeansruns(): call kmeans for the k-means clustering method and includes estimation of the number of clusters and finding an optimal solution from several starting points (fpc)
         pam(): the Partitioning Around Medoids (PAM) clustering method (cluster)
         pamk(): the Partitioning Around Medoids (PAM) clustering method with estimation of number of clusters (fpc)
         cluster.optimal(): search for the optimal k-clustering of the dataset (bayesclust)
         clara(): Clustering Large Applications (cluster)
         fanny(x,k,...): compute a fuzzy clustering of the data into k clusters (cluster)
         kcca(): k-centroids clustering (flexclust)
         ccfkms(): clustering with Conjugate Convex Functions (cba)
         apcluster(): affinity propagation clustering for a given similarity matrix (apcluster)
         apclusterK(): affinity propagation clustering to get K clusters (apcluster)
         cclust(): Convex Clustering, incl. k-means and two other clustering algorithms (cclust)
         KMeansSparseCluster(): sparse k-means clustering (sparcl)
         tclust(x,k,alpha,...): trimmed k-means with which a proportion alpha of observations may be trimmed (tclust)
       Hierarchical Clustering
         a hierarchical decomposition of data in either bottom-up (agglomerative) or top-down (divisive) way
         hclust(d, method, ...): hierarchical cluster analysis on a set of dissimilarities d using the method for agglomeration
         birch(): the BIRCH algorithm that clusters very large data with a CF-tree (birch)
         pvclust(): hierarchical clustering with p-values via multi-scale bootstrap resampling (pvclust)
         agnes(): agglomerative hierarchical clustering (cluster)
         diana(): divisive hierarchical clustering (cluster)
         mona(): divisive hierarchical clustering of a dataset with binary variables only (cluster)
         rockCluster(): cluster a data matrix using the Rock algorithm (cba)
         proximus(): cluster the rows of a logical matrix using the Proximus algorithm (cba)
         isopam(): Isopam clustering algorithm (isopam)
         LLAhclust(): hierarchical clustering based on likelihood linkage analysis (LLAhclust)
         flashClust(): optimal hierarchical clustering (flashClust)
         fastcluster(): fast hierarchical clustering (fastcluster)
         cutreeDynamic(), cutreeHybrid(): detection of clusters in hierarchical clustering dendrograms (dynamicTreeCut)
         HierarchicalSparseCluster(): hierarchical sparse clustering (sparcl)
       Model based Clustering
         Mclust(): model-based clustering (mclust)
         HDDC(): a model-based method for high dimensional data clustering (HDclassif)
         fixmahal(): Mahalanobis Fixed Point Clustering (fpc)
         fixreg(): Regression Fixed Point Clustering (fpc)
         mergenormals(): clustering by merging Gaussian mixture components (fpc)
       Density based Clustering
         generate clusters by connecting dense regions
         dbscan(data,eps,MinPts,...): generate a density based clustering of arbitrary shapes, with neighborhood radius set as eps and density threshold as MinPts (fpc)
         pdfCluster(): clustering via kernel density estimation (pdfCluster)
       Other Clustering Techniques
         mixer(): random graph clustering (mixer)
         nncluster(): fast clustering with restarted minimum spanning tree (nnclust)
         orclus(): ORCLUS subspace clustering (orclus)
       Plotting Clustering Solutions
         plotcluster(): visualisation of a clustering or grouping in data (fpc)
         bannerplot(): a horizontal barplot visualizing a hierarchical clustering (cluster)
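As an illustration of the kmeans() and hclust(d, method, ...) entries in the Clustering section, here is a minimal sketch that uses only functions shipped with base R. The choice of the built-in iris data, three clusters, and average linkage are assumptions made for the example, not recommendations from the card.

```r
# Minimal sketch: partitioning and hierarchical clustering with base R (stats).
# Assumptions: built-in iris data, k = 3, average linkage.
x <- scale(iris[, 1:4])               # standardise the four numeric columns

set.seed(42)                          # k-means starts from random centres
km <- kmeans(x, centers = 3, nstart = 10)
table(km$cluster, iris$Species)       # compare clusters with the known species

d  <- dist(x)                         # Euclidean dissimilarities
hc <- hclust(d, method = "average")   # agglomerative clustering
groups <- cutree(hc, k = 3)           # cut the dendrogram into 3 groups
table(groups)
```

Calling plot(hc) afterwards draws the dendrogram, which helps when judging how many groups to cut.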
  2. Clustering (continued)
       Cluster Validation
         silhouette(): compute or extract silhouette information (cluster)
         cluster.stats(): compute several cluster validity statistics from a clustering and a dissimilarity matrix (fpc)
         clValid(): calculate validation measures for a given set of clustering algorithms and number of clusters (clValid)
         clustIndex(): calculate the values of several clustering indexes, which can be independently used to determine the number of clusters existing in a data set (cclust)
         NbClust(): provide 30 indices for cluster validation and determining the number of clusters (NbClust)
       Packages
         cluster: cluster analysis
         fpc: various methods for clustering and cluster validation
         mclust: model-based clustering and normal mixture modeling
         birch: clustering very large datasets using the BIRCH algorithm
         pvclust: hierarchical clustering with p-values
         apcluster: Affinity Propagation Clustering
         cclust: Convex Clustering methods, including k-means algorithm, On-line Update algorithm and Neural Gas algorithm and calculation of indexes for finding the number of clusters in a data set
         cba: Clustering for Business Analytics, including clustering techniques such as Proximus and Rock
         bclust: Bayesian clustering using spike-and-slab hierarchical model, suitable for clustering high-dimensional data
         biclust: algorithms to find bi-clusters in two-dimensional data
         clue: cluster ensembles
         clues: clustering method based on local shrinking
         clValid: validation of clustering results
         clv: cluster validation techniques, contains popular internal and external cluster validation methods for outputs produced by package cluster
         bayesclust: tests/searches for significant clusters in genetic data
         clustvarsel: variable selection for model-based clustering
         clustsig: significant cluster analysis, tests to see which (if any) clusters are statistically different
         clusterfly: explore clustering interactively
         clusterSim: search for optimal clustering procedure for a data set
         clusterGeneration: random cluster generation
         clusterCons: calculate the consensus clustering result from re-sampled clustering experiments with the option of using multiple algorithms and parameter
         gcExplorer: graphical cluster explorer
         hybridHclust: hybrid hierarchical clustering via mutual clusters
         Modalclust: hierarchical modal Clustering
         iCluster: integrative clustering of multiple genomic data types
         EMCC: evolutionary Monte Carlo (EMC) methods for clustering
         rEMM: extensible Markov Model (EMM) for data stream clustering

     Outlier Detection
       Functions
         boxplot.stats()$out: list data points lying beyond the extremes of the whiskers
         lofactor(): calculate local outlier factors using the LOF algorithm (DMwR or dprep)
         lof(): a parallel implementation of the LOF algorithm (Rlof)
       Packages
         extremevalues: detect extreme values in one-dimensional data
         mvoutlier: multivariate outlier detection based on robust methods
         outliers: some tests commonly used for identifying outliers
         Rlof: a parallel implementation of the LOF algorithm

     Time Series Analysis
       Construction & Plot
         ts(): create time-series objects (stats)
         plot.ts(): plot time-series objects (stats)
         smoothts(): time series smoothing (ast)
         sfilter(): remove seasonal fluctuation using moving average (ast)
       Decomposition
         decomp(): time series decomposition by square-root filter (timsac)
         decompose(): classical seasonal decomposition by moving averages (stats)
         stl(): seasonal decomposition of time series by loess (stats)
         tsr(): time series decomposition (ast)
         ardec(): time series autoregressive decomposition (ArDec)
       Forecasting
         arima(): fit an ARIMA model to a univariate time series (stats)
         predict.Arima(): forecast from models fitted by arima (stats)
         auto.arima(): fit best ARIMA model to univariate time series (forecast)
         forecast.stl(), forecast.ets(), forecast.Arima(): forecast time series using stl, ets and arima models (forecast)
       Packages
         forecast: displaying and analysing univariate time series forecasts
         timsac: time series analysis and control program
         ast: time series analysis
         ArDec: time series autoregressive-based decomposition
         ares: a toolbox for time series analyses using generalized additive models
         dse: tools for multivariate, linear, time-invariant, time series models

     Text Mining
       Functions
         Corpus(): build a corpus, which is a collection of text documents (tm)
         tm_map(): transform text documents, e.g., stemming, stopword removal (tm)
         tm_filter(): filtering out documents (tm)
         TermDocumentMatrix(), DocumentTermMatrix(): construct a term-document matrix or a document-term matrix (tm)
         Dictionary(): construct a dictionary from a character vector or a term-document matrix (tm)
         findAssocs(): find associations in a term-document matrix (tm)
         findFreqTerms(): find frequent terms in a term-document matrix (tm)
         stemDocument(): stem words in a text document (tm)
         stemCompletion(): complete stemmed words (tm)
         termFreq(): generate a term frequency vector from a text document (tm)
         stopwords(language): return stopwords in different languages (tm)
         removeNumbers(), removePunctuation(), removeWords(): remove numbers, punctuation marks, or a set of words from a text document (tm)
         removeSparseTerms(): remove sparse terms from a term-document matrix (tm)
         textcat(): n-gram based text categorization (textcat)
         SnowballStemmer(): Snowball word stemmers (Snowball)
         LDA(): fit a LDA (latent Dirichlet allocation) model (topicmodels)
         CTM(): fit a CTM (correlated topics model) model (topicmodels)
         terms(): extract the most likely terms for each topic (topicmodels)
         topics(): extract the most likely topics for each document (topicmodels)
       Packages
         tm: a framework for text mining applications
         lda: fit topic models with LDA
         topicmodels: fit topic models with LDA and CTM
         RTextTools: automatic text classification via supervised learning
         tm.plugin.dc: a plug-in for package tm to support distributed text mining
         tm.plugin.mail: a plug-in for package tm to handle mail
         RcmdrPlugin.TextMining: GUI for demonstration of text mining concepts and tm package
         textir: a suite of tools for inference about text documents and associated sentiment
         tau: utilities for text analysis
         textcat: n-gram based text categorization
         YjdnJlp: Japanese text analysis by Yahoo! Japan Developer Network

     Social Network Analysis and Graph Mining
       Functions
         graph(), graph.edgelist(), graph.adjacency(), graph.incidence(): create graph objects respectively from edges, an edge list, an adjacency matrix and an incidence matrix (igraph)
         plot(), tkplot(): static and interactive plotting of graphs (igraph)
         gplot(), gplot3d(): plot graphs (sna)
         V(), E(): vertex/edge sequence of igraph (igraph)
         are.connected(): check whether two nodes are connected (igraph)
         degree(), betweenness(), closeness(): various centrality scores (igraph, sna)
         add.edges(), add.vertices(), delete.edges(), delete.vertices(): add and delete edges and vertices (igraph)
         neighborhood(): neighborhood of graph vertices (igraph, sna)
         get.adjlist(): adjacency lists for edges or vertices (igraph)
         nei(), adj(), from(), to(): vertex/edge sequence indexing (igraph)
         cliques(): find cliques, i.e., complete subgraphs (igraph)
         clusters(): maximal connected components of a graph (igraph)
         %->%, %<-%, %--%: edge sequence indexing (igraph)
         get.edgelist(): return an edge list in a two-column matrix (igraph)
         read.graph(), write.graph(): read and write graphs from and to files (igraph)
       Packages
         sna: social network analysis
         igraph: network analysis and visualization
         statnet: a set of tools for the representation, visualization, analysis and simulation of network data
         egonet: ego-centric measures in social network analysis
         snort: social network-analysis on relational tables
         network: tools to create and modify network objects
         bipartite: visualising bipartite networks and calculating some (ecological) indices
         blockmodeling: generalized and classical blockmodeling of valued networks
         diagram: visualising simple graphs (networks), plotting flow diagrams
         NetCluster: clustering for networks
         NetData: network data for McFarland’s SNA R labs
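The ts(), decompose(), stl() and arima() entries under Time Series Analysis can be exercised together with base R alone. The sketch below is illustrative only: the built-in AirPassengers series, the log transform and the seasonal ARIMA(0,1,1)(0,1,1) order are assumptions chosen for the example, and forecasting goes through the generic predict(), which dispatches to the Arima method listed in the card.

```r
# Minimal sketch: decomposition and forecasting with the stats package.
# Assumptions: built-in AirPassengers data, log scale, "airline" model order.
ap <- log(AirPassengers)               # monthly series, 1949 to 1960

dec <- decompose(ap)                   # classical decomposition by moving averages
fit_stl <- stl(ap, s.window = "periodic")  # loess-based seasonal decomposition

fit <- arima(ap, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc <- predict(fit, n.ahead = 12)       # list with point forecasts and std. errors
forecast_1961 <- exp(fc$pred)          # back-transform to the original scale
```

plot(dec) and plot(fit_stl) show the trend, seasonal and remainder components side by side.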
  3. Social Network Analysis and Graph Mining (continued)
       Packages (continued)
         NetIndices: estimating network indices, including trophic structure of foodwebs in R
         NetworkAnalysis: statistical inference on populations of weighted or unweighted networks
         tnet: analysis of weighted, two-mode, and longitudinal networks
         triads: triad census for networks

     Spatial Data Analysis
       Functions
         geocode(): geocodes a location using Google Maps (ggmap)
         qmap(): quick map plot (ggmap)
         get_map(): queries the Google Maps, OpenStreetMap, or Stamen Maps server for a map at a certain location (ggmap)
         gvisGeoChart(), gvisGeoMap(), gvisIntensityMap(), gvisMap(): Google geo charts and maps (googleVis)
         GetMap(): download a static map from the Google server (RgoogleMaps)
         ColorMap(): plot levels of a variable in a colour-coded map (RgoogleMaps)
         PlotOnStaticMap(): overlay plot on background image of map tile (RgoogleMaps)
         TextOnStaticMap(): plot text on map (RgoogleMaps)
       Packages
         plotGoogleMaps: plot spatial data as HTML map mashup over Google Maps
         RgoogleMaps: overlay on Google map tiles in R
         plotKML: visualization of spatial and spatio-temporal objects in Google Earth
         ggmap: Spatial visualization with Google Maps and OpenStreetMap
         clustTool: GUI for clustering data with spatial information
         SGCS: Spatial Graph based Clustering Summaries for spatial point patterns
         spdep: spatial dependence: weighting schemes, statistics and models

     Statistics
       Summarization
         summary(): summarize data
         describe(): concise statistical description of data (Hmisc)
         boxplot.stats(): box plot statistics
       Analysis of Variance
         aov(): fit an analysis of variance model (stats)
         anova(): compute analysis of variance (or deviance) tables for one or more fitted model objects (stats)
       Statistical Test
         t.test(): Student’s t-test (stats)
         prop.test(): test of equal or given proportions (stats)
         binom.test(): exact binomial test (stats)
       Mixed Effects Models
         lme(): fit a linear mixed-effects model (nlme)
         nlme(): fit a nonlinear mixed-effects model (nlme)
       Principal Components and Factor Analysis
         princomp(): principal components analysis (stats)
         prcomp(): principal components analysis (stats)
       Other Functions
         var(), cov(), cor(): variance, covariance, and correlation (stats)
         density(): compute kernel density estimates (stats)
       Packages
         nlme: linear and nonlinear mixed effects models

     Graphics
       Functions
         plot(): generic function for plotting (graphics)
         barplot(), pie(), hist(): bar chart, pie chart and histogram (graphics)
         boxplot(): box-and-whisker plot (graphics)
         stripchart(): one dimensional scatter plot (graphics)
         dotchart(): Cleveland dot plot (graphics)
         qqnorm(), qqplot(), qqline(): QQ (quantile-quantile) plot (stats)
         coplot(): conditioning plot (graphics)
         splom(): conditional scatter plot matrices (lattice)
         pairs(): a matrix of scatterplots (graphics)
         cpairs(): enhanced scatterplot matrix (gclus)
         parcoord(): parallel coordinate plot (MASS)
         cparcoord(): enhanced parallel coordinate plot (gclus)
         paracoor(): parallel coordinates plot (denpro)
         parallelplot(): parallel coordinates plot (lattice)
         densityplot(): kernel density plot (lattice)
         contour(), filled.contour(): contour plot (graphics)
         levelplot(), contourplot(): level plots and contour plots (lattice)
         smoothScatter(): scatterplots with smoothed densities color representation; capable of visualizing large datasets (graphics)
         sunflowerplot(): a sunflower scatter plot (graphics)
         assocplot(): association plot (graphics)
         mosaicplot(): mosaic plot (graphics)
         matplot(): plot the columns of one matrix against the columns of another (graphics)
         fourfoldplot(): a fourfold display of a 2 × 2 × k contingency table (graphics)
         persp(): perspective plots of surfaces over the x-y plane (graphics)
         cloud(), wireframe(): 3d scatter plots and surfaces (lattice)
         interaction.plot(): two-way interaction plot (stats)
         iplot(), ihist(), ibar(), ipcp(): interactive scatter plot, histogram, bar plot, and parallel coordinates plot (iplots)
         pdf(), postscript(), win.metafile(), jpeg(), bmp(), png(), tiff(): save graphs into files of various formats
         gvisAnnotatedTimeLine(), gvisAreaChart(), gvisBarChart(), gvisBubbleChart(), gvisCandlestickChart(), gvisColumnChart(), gvisComboChart(), gvisGauge(), gvisGeoChart(), gvisGeoMap(), gvisIntensityMap(), gvisLineChart(), gvisMap(), gvisMerge(), gvisMotionChart(), gvisOrgChart(), gvisPieChart(), gvisScatterChart(), gvisSteppedAreaChart(), gvisTable(), gvisTreeMap(): various interactive charts produced with the Google Visualisation API (googleVis)
         gvisMerge(): merge two googleVis charts into one (googleVis)
       Packages
         ggplot2: an implementation of the Grammar of Graphics
         googleVis: an interface between R and the Google Visualisation API to create interactive charts
         lattice: a powerful high-level data visualization system, with an emphasis on multivariate data
         vcd: visualizing categorical data
         denpro: visualization of multivariate, functions, sets, and data
         iplots: interactive graphics

     Data Manipulation
       Functions
         transform(): transform a data frame
         scale(): scaling and centering of matrix-like objects
         t(): matrix transpose
         aperm(): array transpose
         sample(): sampling
         table(), tabulate(), xtabs(): cross tabulation (stats)
         stack(), unstack(): stacking vectors
         split(), unsplit(): divide data into groups and reassemble
         reshape(): reshape a data frame between “wide” and “long” format (stats)
         merge(): merge two data frames; similar to database join operations
         aggregate(): compute summary statistics of data subsets (stats)
         by(): apply a function to a data frame split by factors
         melt(), cast(): melt and then cast data into the reshaped or aggregated form you want (reshape)
         complete.cases(): find complete cases, i.e., cases without missing values
         na.fail, na.omit, na.exclude, na.pass: handle missing values
       Packages
         reshape: flexibly restructure and aggregate data
         data.table: extension of data.frame for fast indexing, ordered joins, assignment, and grouping and list columns
         gdata: various tools for data manipulation

     Data Access
       Functions
         save(), load(): save and load R data objects
         read.csv(), write.csv(): import from and export to .CSV files
         read.table(), write.table(), scan(), write(): read and write data
         write.matrix(): write a matrix or data frame (MASS)
         readLines(), writeLines(): read/write text lines from/to a connection, such as a text file
         sqlQuery(): submit an SQL query to an ODBC database (RODBC)
         sqlFetch(): read a table from an ODBC database (RODBC)
         sqlSave(), sqlUpdate(): write or update a table in an ODBC database (RODBC)
         sqlColumns(): enquire about the column structure of tables (RODBC)
         sqlTables(): list tables on an ODBC connection (RODBC)
         odbcConnect(), odbcClose(), odbcCloseAll(): open/close connections to ODBC databases (RODBC)
         dbSendQuery(): execute an SQL statement on a given database connection (DBI)
         dbConnect(), dbDisconnect(): create/close a connection to a DBMS (DBI)
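To show how a few of the Data Manipulation functions fit together, the following sketch chains complete.cases(), aggregate() and merge() using base R only. The two small data frames and their column names (id, group, value, label) are made up for illustration.

```r
# Minimal sketch: drop incomplete rows, summarise by group, then join labels.
# Assumption: the data frames below are hypothetical toy data.
df <- data.frame(
  id    = 1:6,
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(1.5, NA, 3.0, 4.5, 2.0, 6.0)
)

ok <- df[complete.cases(df), ]         # keep rows without missing values

# Mean of value within each group (formula interface of aggregate)
means <- aggregate(value ~ group, data = ok, FUN = mean)

labels <- data.frame(
  group = c("a", "b", "c"),
  label = c("alpha", "beta", "gamma")
)
merged <- merge(means, labels, by = "group")  # similar to a database join
print(merged)
```

merge() performs an inner join by default; all.x = TRUE or all.y = TRUE keeps unmatched rows from one side.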
  4. Data Access (continued)
       Packages
         RODBC: ODBC database access
         DBI: a database interface (DBI) between R and relational DBMS
         RMySQL: interface to the MySQL database
         RJDBC: access to databases through the JDBC interface
         RSQLite: SQLite interface for R
         ROracle: Oracle database interface (DBI) driver
         RpgSQL: DBI/RJDBC interface to PostgreSQL database
         RODM: interface to Oracle Data Mining
         xlsReadWrite: read and write Excel files
         WriteXLS: create Excel 2003 (XLS) files from data frames

     Big Data
       Functions
         as.ffdf(): coerce a dataframe to an ffdf (ff)
         read.table.ffdf(), read.csv.ffdf(): read data from a flat file to an ffdf object (ff)
         write.table.ffdf(), write.csv.ffdf(): write an ffdf object to a flat file (ff)
         ffdfappend(): append a dataframe or an ffdf to an existing ffdf (ffbase)
         big.matrix(): create a standard big.matrix, which is constrained to available RAM (bigmemory)
         read.big.matrix(): create a big.matrix by reading from an ASCII file (bigmemory)
         write.big.matrix(): write a big.matrix to a file (bigmemory)
         filebacked.big.matrix(): create a file-backed big.matrix, which may exceed available RAM by using hard drive space (bigmemory)
         mwhich(): expanded “which”-like functionality (bigmemory)
       Packages
         ff: memory-efficient storage of large data on disk and fast access functions
         ffbase: basic statistical functions for package ff
         filehash: a simple key-value database for handling large data
         g.data: create and maintain delayed-data packages
         BufferedMatrix: a matrix data storage object held in temporary files
         biglm: regression for data too large to fit in memory
         bigmemory: manage massive matrices with shared memory and memory-mapped files
         biganalytics: extend the bigmemory package with various analytics
         bigtabulate: table-, tapply-, and split-like functionality for matrix and big.matrix objects

     Parallel Computing
       Functions
         foreach(...) %dopar%: looping in parallel (foreach)
         registerDoSEQ(), registerDoSNOW(), registerDoMC(): register respectively the sequential, SNOW and multicore parallel backend with the foreach package (foreach, doSNOW, doMC)
         sfInit(), sfStop(): initialize and stop the cluster (snowfall)
         sfLapply(), sfSapply(), sfApply(): parallel versions of lapply(), sapply(), apply() (snowfall)
       Packages
         multicore: parallel processing of R code on machines with multiple cores or CPUs
         snow: simple parallel computing in R
         snowfall: usability wrapper around snow for easier development of parallel R programs
         snowFT: extension of snow supporting fault tolerant and reproducible applications, and easy-to-use parallel programming
         Rmpi: interface (Wrapper) to MPI (Message-Passing Interface)
         rpvm: R interface to PVM (Parallel Virtual Machine)
         nws: provide coordination and parallel execution facilities
         foreach: foreach looping construct for R
         doMC: foreach parallel adaptor for the multicore package
         doSNOW: foreach parallel adaptor for the snow package
         doMPI: foreach parallel adaptor for the Rmpi package
         doParallel: foreach parallel adaptor for the parallel package
         doRNG: generic reproducible parallel backend for foreach Loops
         GridR: execute functions on remote hosts, clusters or grids
         fork: R functions for handling multiple processes

     Generating Reports
         Sweave(): mixing text and R/S code for automatic report generation (utils)
         knitr: a general-purpose package for dynamic report generation in R
         R2HTML: making HTML reports
         R2PPT: generating Microsoft PowerPoint presentations

     Interface to Weka
         Package RWeka is an R interface to Weka, and enables the use of the following Weka functions in R.
         Association rules: Apriori(), Tertius()
         Regression and classification: LinearRegression(), Logistic(), SMO()
         Lazy classifiers: IBk(), LBR()
         Meta classifiers: AdaBoostM1(), Bagging(), LogitBoost(), MultiBoostAB(), Stacking(), CostSensitiveClassifier()
         Rule classifiers: JRip(), M5Rules(), OneR(), PART()
         Regression and classification trees: J48(), LMT(), M5P(), DecisionStump()
         Clustering: Cobweb(), FarthestFirst(), SimpleKMeans(), XMeans(), DBScan()
         Filters: Normalize(), Discretize()
         Word stemmers: IteratedLovinsStemmer(), LovinsStemmer()
         Tokenizers: AlphabeticTokenizer(), NGramTokenizer(), WordTokenizer()

     Editors/GUIs
         Tinn-R: a free GUI for R language and environment
         RStudio: a free integrated development environment (IDE) for R
         rattle: graphical user interface for data mining in R
         Rpad: workbook-style, web-based interface to R
         RPMG: graphical user interface (GUI) for interactive R analysis sessions
         gWidgets: a toolkit-independent API for building interactive GUIs
         Red-R: An open source visual programming GUI interface for R
         R AnalyticFlow: a software which enables data analysis by drawing analysis flowcharts
         latticist: a graphical user interface for exploratory visualisation

     Other R Reference Cards
         R Reference Card, by Tom Short
           http://rpad.googlecode.com/svn-history/r76/Rpad_homepage/R-refcard.pdf or
           http://cran.r-project.org/doc/contrib/Short-refcard.pdf
         R Reference Card, by Jonathan Baron
           http://cran.r-project.org/doc/contrib/refcard.pdf
         R Functions for Regression Analysis, by Vito Ricci
           http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
         R Functions for Time Series Analysis, by Vito Ricci
           http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf

     RDataMining Website, Package, Twitter & Groups
         RDataMining Website: http://www.rdatamining.com
         Group on LinkedIn: http://group.rdatamining.com
         Group on Google: http://group2.rdatamining.com
         Twitter: http://twitter.com/rdatamining
         RDataMining Package: http://www.rdatamining.com/package
           http://package.rdatamining.com
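The foreach(...) %dopar% and registerDoSEQ() entries under Parallel Computing can be tried without a cluster. The sketch below assumes the foreach package is installed (it is not part of base R) and registers the sequential backend, so the loop body runs in the current process; registering a backend such as doSNOW or doMC instead would run the same loop in parallel.

```r
# Minimal sketch: a foreach loop with the sequential backend.
# Assumption: the foreach package is installed, e.g. install.packages("foreach").
library(foreach)

registerDoSEQ()                        # sequential backend; no workers needed

# Each iteration returns one value; .combine = c collects them into a vector
squares <- foreach(i = 1:5, .combine = c) %dopar% {
  i^2
}
print(squares)
```

Because the loop body is unchanged when the backend changes, code can be developed sequentially and parallelised later.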