SlideShare a Scribd company logo
1 of 72
Download to read offline
Rtips. Revival 2014! 
Paul E. Johnson <pauljohn @ ku.edu> 
March 24, 2014 
The original Rtips started in 1999. It became dicult to update because of limitations in the 
software with which it was created. Now I know more about R, and have decided to wade in 
again. In January, 2012, I took the FaqManager HTML output and converted it to LATEX with 
the excellent open source program pandoc, and from there I've been editing and updating it in 
LYX. 
From here on out, the latest html version will be at http://pj.freefaculty.org/R/Rtips. 
html and the PDF for the same will be 
http://pj.freefaculty.org/R/Rtips.pdf. 
You are reading the New Thing! 
The
rst chore is to cut out the old useless stu that was no good to start with, correct 
mistakes in translation (the quotation mark translations are particularly dangerous, but also 
there is trouble with ~, $, and -. 
Original Preface 
(I thought it was cute to call this StatsRus but the Toystore's lawyer called and, well, you 
know. . . ) 
If you need a tip sheet for R, here it is. 
This is not a substitute for R documentation, just a list of things I had trouble remembering 
when switching from SAS to R. 
Heed the words of Brian D. Ripley, One enquiry came to me yesterday which suggested that 
some users of binary distributions do not know that R comes with two Guides in the doc/manual 
directory plus an FAQ and the help pages in book form. I hope those are distributed with all 
the binary distributions (they are not made nor installed by default). Windows-speci
c versions 
are available. Please run help.start() in R! 
Contents 
1 Data Input/Output 5 
1.1 Bring raw numbers into R (05/22/2012) . . . . . . . . . . . . . . . . . . . . . . . 5 
1.2 Basic notation on data access (12/02/2012) . . . . . . . . . . . . . . . . . . . . . 6 
1.3 Checkout the new Data Import/Export manual (13/08/2001) . . . . . . . . . . . 6 
1.4 Exchange data between R and other programs (Excel, etc) (01/21/2009) . . . . . 6 
1.5 Merge data frames (04/23/2004) . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 
1.6 Add one row at a time (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 9 
1.7 Need yet another dierent kind of merge for data frames (11/08/2000) . . . . . . 9 
1.8 Check if an object is NULL (06/04/2001) . . . . . . . . . . . . . . . . . . . . . . 10 
1
1.9 Generate random numbers (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . 10 
1.10 Generate random numbers with a
xed mean/variance (06/09/2000) . . . . . . . 11 
1.11 Use rep to manufacture a weighted data set (30/01/2001) . . . . . . . . . . . . . 11 
1.12 Convert contingency table to data frame (06/09/2000) . . . . . . . . . . . . . . . 12 
1.13 Write: data in text
le (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . 12 
2 Working with data frames: Recoding, selecting, aggregating 12 
2.1 Add variables to a data frame (or list) (02/06/2003) . . . . . . . . . . . . . . . . 12 
2.2 Create variable names on the 
y (10/04/2001) . . . . . . . . . . . . . . . . . . 13 
2.3 Recode one column, output values into another column (12/05/2003) . . . . . . . 13 
2.4 Create indicator (dummy) variables (20/06/2001) . . . . . . . . . . . . . . . . . . 16 
2.5 Create lagged values of variables for time series regression (05/22/2012) . . . . . 16 
2.6 How to drop factor levels for datasets that don't have observations with those 
values? (08/01/2002) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 
2.7 Select/subset observations out of a dataframe (08/02/2012) . . . . . . . . . . . . 17 
2.8 Delete
rst observation for each element in a cluster of observations (11/08/2000) 18 
2.9 Select a random sample of data (11/08/2000) . . . . . . . . . . . . . . . . . . . . 18 
2.10 Selecting Variables for Models: Don't forget the subset function (15/08/2000) . . 19 
2.11 Process all numeric variables, ignore character variables? (11/02/2012) . . . . . . 19 
2.12 Sorting by more than one variable (06/09/2000) . . . . . . . . . . . . . . . . . . 19 
2.13 Rank within subgroups de
ned by a factor (06/09/2000) . . . . . . . . . . . . . . 20 
2.14 Work with missing values (na.omit, is.na, etc) (15/01/2012) . . . . . . . . . . . . 20 
2.15 Aggregate values, one for each line (16/08/2000) . . . . . . . . . . . . . . . . . . 21 
2.16 Create new data frame to hold aggregate values for each factor (11/08/2000) . . 21 
2.17 Selectively sum columns in a data frame (15/01/2012) . . . . . . . . . . . . . . . 21 
2.18 Rip digits out of real numbers one at a time (11/08/2000) . . . . . . . . . . . . . 21 
2.19 Grab an item from each of several matrices in a List (14/08/2000) . . . . . . . . 22 
2.20 Get vector showing values in a dataset (10/04/2001) . . . . . . . . . . . . . . . . 22 
2.21 Calculate the value of a string representing an R command (13/08/2000) . . . . 22 
2.22 Which can grab the index values of cases satisfying a test (06/04/2001) . . . . 22 
2.23 Find unique lines in a matrix/data frame (31/12/2001) . . . . . . . . . . . . . . . 23 
3 Matrices and vector operations 23 
3.1 Create a vector, append values (01/02/2012) . . . . . . . . . . . . . . . . . . . . 23 
3.2 How to create an identity matrix? (16/08/2000) . . . . . . . . . . . . . . . . . . 24 
3.3 Convert matrix m to one long vector (11/08/2000) . . . . . . . . . . . . . . . . 24 
3.4 Creating a peculiar sequence (1 2 3 4 1 2 3 1 2 1) (11/08/2000) . . . . . . . . . . 24 
3.5 Select every n'th item (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 25 
3.6 Find index of a value nearest to 1.5 in a vector (11/08/2000) . . . . . . . . . . . 25 
3.7 Find index of nonzero items in vector (18/06/2001) . . . . . . . . . . . . . . . . . 25 
3.8 Find index of missing values (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . 26 
3.9 Find index of largest item in vector (16/08/2000) . . . . . . . . . . . . . . . . . . 26 
3.10 Replace values in a matrix (22/11/2000) . . . . . . . . . . . . . . . . . . . . . . . 26 
3.11 Delete particular rows from matrix (06/04/2001) . . . . . . . . . . . . . . . . . . 27 
3.12 Count number of items meeting a criterion (01/05/2005) . . . . . . . . . . . . . . 27 
3.13 Compute partial correlation coecients from correlation matrix (08/12/2000) . . 27 
3.14 Create a multidimensional matrix (R array) (20/06/2001) . . . . . . . . . . . . . 28 
3.15 Combine a lot of matrices (20/06/2001) . . . . . . . . . . . . . . . . . . . . . . . 28 
3.16 Create neighbormatrices according to speci
c logics (20/06/2001) . . . . . . . 28 
3.17 Matching two columns of numbers by a key variable (20/06/2001) . . . . . . 29 
3.18 Create Upper or Lower Triangular matrix (06/08/2012) . . . . . . . . . . . . . . 29 
3.19 Calculate inverse of X (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . . 30 
2
3.20 Interesting use of Matrix Indices (20/06/2001) . . . . . . . . . . . . . . . . . . . 31 
3.21 Eigenvalues example (20/06/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . 31 
4 Applying functions, tapply, etc 32 
4.1 Return multiple values from a function (12/02/2012) . . . . . . . . . . . . . . . . 32 
4.2 Grab p values out of a list of signi
cance tests (22/08/2000) . . . . . . . . . . . 32 
4.3 ifelse usage (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 
4.4 Apply to create matrix of probabilities, one for each cell (14/08/2000) . . . . . 32 
4.5 Outer. (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 
4.6 Check if something is a formula/function (11/08/2000) . . . . . . . . . . . . . . . 33 
4.7 Optimize with a vector of variables (11/08/2000) . . . . . . . . . . . . . . . . . . 33 
4.8 slice.index, like in S+ (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 33 
5 Graphing 33 
5.1 Adjust features with par before graphing (18/06/2001) . . . . . . . . . . . . . . . 33 
5.2 Save graph output (03/21/2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 
5.3 How to automatically name plot output into separate
les (10/04/2001) . . . . . 36 
5.4 Control papersize (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 
5.5 Integrating R graphs into documents: LATEX and EPS or PDF (20/06/2001) . . . 37 
5.6 Snapshot graphs and scroll through them (31/12/2001) . . . . . . . . . . . . . 37 
5.7 Plot a density function (eg. Normal) (22/11/2000) . . . . . . . . . . . . . . . . . 37 
5.8 Plot with error bars (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 37 
5.9 Histogram with density estimates (14/08/2000) . . . . . . . . . . . . . . . . . . . 37 
5.10 How can I overlay several line plots on top of one another? (09/29/2005) . . . 37 
5.11 Create matrix of graphs (18/06/2001) . . . . . . . . . . . . . . . . . . . . . . . 39 
5.12 Combine lines and bar plot? (07/12/2000) . . . . . . . . . . . . . . . . . . . . . . 39 
5.13 Regression scatterplot: add
tted line to graph (03/20/2014) . . . . . . . . . . . 40 
5.14 Control the plotting character in scatterplots? (11/08/2000) . . . . . . . . . . . . 40 
5.15 Scatterplot: Control Plotting Characters (men vs women, etc)g (11/11/2002) . . 41 
5.16 Scatterplot with size/color adjustment (12/11/2002) . . . . . . . . . . . . . . . . 41 
5.17 Scatterplot: adjust size according to 3rd variable (06/04/2001) . . . . . . . . . . 42 
5.18 Scatterplot: smooth a line connecting points (02/06/2003) . . . . . . . . . . . . . 42 
5.19 Regression Scatterplot: add estimate to plot (18/06/2001) . . . . . . . . . . . . . 42 
5.20 Axes: controls: ticks, no ticks, numbers, etc (22/11/2000) . . . . . . . . . . . . . 42 
5.21 Axes: rotate labels (06/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 
5.22 Axes: Show formatted dates in axes (06/04/2001) . . . . . . . . . . . . . . . . . 43 
5.23 Axes: Reverse axis in plot (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . 43 
5.24 Axes: Label axes with dates (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . 44 
5.25 Axes: Superscript in axis labels (11/08/2000) . . . . . . . . . . . . . . . . . . . . 44 
5.26 Axes: adjust positioning (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . 44 
5.27 Add error arrows to a scatterplot (30/01/2001) . . . . . . . . . . . . . . . . . . 44 
5.28 Time Series: how to plot several lines in one graph? (06/09/2000) . . . . . . . 45 
5.29 Time series: plot
tted and actual data (11/08/2000) . . . . . . . . . . . . . . . . 45 
5.30 Insert text into a plot (22/11/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 45 
5.31 Plotting unbounded variables (07/12/2000) . . . . . . . . . . . . . . . . . . . . . 45 
5.32 Labels with dynamically generated content/math markup (16/08/2000) . . . . . 45 
5.33 Use math/sophisticated stu in title of plot (11/11/2002) . . . . . . . . . . . . . 46 
5.34 How to color-code points in scatter to reveal missing values of 3rd variable? 
(15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 
5.35 lattice: misc examples (12/11/2002) . . . . . . . . . . . . . . . . . . . . . . . . . 46 
5.36 Make 3d scatterplots (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 46 
5.37 3d contour with line style to re
ect value (06/04/2001) . . . . . . . . . . . . . . . 47 
3
5.38 Animate a Graph! (13/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 
5.39 Color user-portion of graph background dierently from margin (06/09/2000) . . 47 
5.40 Examples of graphing code that seem to work (misc) (11/16/2005)g . . . . . . . 48 
6 Common Statistical Chores 51 
6.1 Crosstabulation Tables (01/05/2005) . . . . . . . . . . . . . . . . . . . . . . . . . 51 
6.2 t-test (18/07/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 
6.3 Test for Normality (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 
6.4 Estimate parameters of distributions (12/02/2012) . . . . . . . . . . . . . . . . . 52 
6.5 Bootstrapping routines (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 52 
6.6 BY subgroup analysis of data (summary or model for subgroups)(06/04/2001) . 52 
7 Model Fitting (Regression-type things) 53 
7.1 Tips for specifying regression models (12/02/2002) . . . . . . . . . . . . . . . . . 53 
7.2 Summary Methods, grabbing results inside an output object . . . . . . . . . . . 53 
7.3 Calculate separate coecients for each level of a factor (22/11/2000) . . . . . . . 53 
7.4 Compare
ts of regression models (F test subset B's =0) (14/08/2000) . . . . . . 54 
7.5 Get Predicted Values from a model with predict() (11/13/2005) . . . . . . . . . . 55 
7.6 Polynomial regression (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 57 
7.7 Calculate p value for an F stat from regression (13/08/2000) . . . . . . . . . . 57 
7.8 Compare
ts (F test) in stepwise regression/anova (11/08/2000) . . . . . . . . . 57 
7.9 Test signi
cance of slope and intercept shifts (Chow test?) . . . . . . . . . . . . . 58 
7.10 Want to estimate a nonlinear model? (11/08/2000) . . . . . . . . . . . . . . . . . 58 
7.11 Quasi family and passing arguments to it. (12/11/2002) . . . . . . . . . . . . . . 58 
7.12 Estimate a covariance matrix (22/11/2000) . . . . . . . . . . . . . . . . . . . . . 58 
7.13 Control number of signi
cant digits in output (22/11/2000) . . . . . . . . . . . . 59 
7.14 Multiple analysis of variance (06/09/2000) . . . . . . . . . . . . . . . . . . . . . . 59 
7.15 Test for homogeneity of variance (heteroskedasticity) (12/02/2012) . . . . . . . . 59 
7.16 Use nls to estimate a nonlinear model (14/08/2000) . . . . . . . . . . . . . . . . 60 
7.17 Using nls and graphing things with it (22/11/2000) . . . . . . . . . . . . . . . . . 60 
7.18 2Log(L) and hypo tests (22/11/2000) . . . . . . . . . . . . . . . . . . . . . . . 60 
7.19 logistic regression with repeated measurements (02/06/2003) . . . . . . . . . . . 61 
7.20 Logit (06/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 
7.21 Random parameter (Mixed Model) tips (01/05/2005) . . . . . . . . . . . . . . . . 61 
7.22 Time Series: basics (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 
7.23 Time Series: misc examples (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . 62 
7.24 Categorical Data and Multivariate Models (04/25/2004) . . . . . . . . . . . . . . 62 
7.25 Lowess. Plot a smooth curve (04/25/2004) . . . . . . . . . . . . . . . . . . . . . . 62 
7.26 Hierarchical/Mixed linear models. (06/03/2003) . . . . . . . . . . . . . . . . . . 62 
7.27 Robust Regression tools (07/12/2000) . . . . . . . . . . . . . . . . . . . . . . . . 63 
7.28 Durbin-Watson test (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 
7.29 Censored regression (04/25/2004) . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 
8 Packages 63 
8.1 What packages are installed on Paul's computer? . . . . . . . . . . . . . . . . . . 63 
8.2 Install and load a package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 
8.3 List Loaded Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 
8.4 Where is the default R library folder? Where does R look for packages in a 
computer? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 
8.5 Detach libraries when no longer needed (10/04/2001) . . . . . . . . . . . . . . . . 66 
4
9 Misc. web resources 66 
9.1 Navigating R Documentation (12/02/2012) . . . . . . . . . . . . . . . . . . . . . 66 
9.2 R Task View Pages (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 
9.3 Using help inside R(13/08/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 
9.4 Run examples in R (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 
10 R workspace 67 
10.1 Writing, saving, running R code (31/12/2001) . . . . . . . . . . . . . . . . . . . . 67 
10.2 .RData, .RHistory. Help or hassle? (31/12/2001) . . . . . . . . . . . . . . . . . . 68 
10.3 Save  Load R objects (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . 68 
10.4 Reminders for object analysis/usage (11/08/2000) . . . . . . . . . . . . . . . . . 68 
10.5 Remove objects by pattern in name (31/12/2001) . . . . . . . . . . . . . . . . . . 68 
10.6 Save work/create a Diary of activity (31/12/2001) . . . . . . . . . . . . . . . . . 69 
10.7 Customized Rpro
le (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . 69 
11 Interface with the operating system 69 
11.1 Commands to system like change working directory (22/11/2000) . . . . . . . . 69 
11.2 Get system time. (30/01/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 
11.3 Check if a
le exists (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 69 
11.4 Find
les by name or part of a name (regular expression matching) (14/08/2001) 70 
12 Stupid R tricks: basics you can't live without 70 
12.1 If you are asking for help (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . 70 
12.2 Commenting out things in R
les (15/08/2000) . . . . . . . . . . . . . . . . . . . 71 
13 Misc R usages I
nd interesting 71 
13.1 Character encoding (01/27/2009) . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 
13.2 list names of variables used inside an expression (10/04/2001) . . . . . . . . . . . 71 
13.3 R environment in side scripts (10/04/2001) . . . . . . . . . . . . . . . . . . . . . 71 
13.4 Derivatives (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 
1 Data Input/Output 
1.1 Bring raw numbers into R (05/22/2012) 
This is truly easy. Suppose you've got numbers in a space-separated
lemyData, with column 
names in the
rst row (thats a header). Run 
myDataFrame  r e a d . t a b l e ( ` `myData ' ' , header=TRUE) 
If you type ?read.table it tells about importing
les with other delimiters. 
Suppose you have tab delimited data with blank spaces to indicate missing values. Do this: 
myDataFramer e a d . t a b l e ( myData , sep=n t  , n a . s t r i n g s=  , header=TRUE) 
Be aware than anybody can choose his/her own separator. I am fond of j because it seems 
never used in names or addresses (unlike just about any other character I've found). 
Suppose your data is in a compressed gzip
le, myData.gz, use R's gz
le function to decompress 
on the 
y. Do this: 
myDataFrame  r e a d . t a b l e ( g z f i l e ( myData.gz ) , header=TRUE) 
If you read columns in from separate
les, combine into a data frame as: 
5
v a r i a b l e 1  scan (  f i l e 1 ) 
v a r i a b l e 2  scan (  f i l e 2 ) 
mydata  cbind ( va r i abl e 1 , v a r i a b l e 2 ) 
#or use the equivalent: 
#mydata  data.frame(variable1 , variable2) 
#Optionally save dataframe in R object file with: 
wr i t e . t a b l e (mydata , f i l e=f i l ename 3 ) 
1.2 Basic notation on data access (12/02/2012) 
To access the columns of a data frame x with the column number, say x[,1], to get the
rst 
column. If you know the column name, say pjVar1, it is the same as x$pjVar1 or x[, pjVar1]. 
Grab an element in a list as x[[1]]. If you just run x[1] you get a list, in which there is a single 
item. Maybe you want that, but I bet you really want x[[1]]. If the list elements are named, 
you can get them with x$pjVar1 or x[[pjVar1]]. 
For instance, if a data frame is nebdata then grab the value in row 1, column 2 with: 
nebdata [ 1 , 2 ] 
## or to selectively take from column2 only when the column Volt equals ABal 
nebdata [ nebdata $Volt==ABal  , 2 ] 
( from Diego Kuonen) 
1.3 Checkout the new Data Import/Export manual (13/08/2001) 
With R{1.2, the R team released a manual about how to get data into R and out of R. That's 
the
rst place to look if you need help. It is now distributed with R. Run 
h e l p . s t a r t ( ) 
1.4 Exchange data between R and other programs (Excel, etc) (01/21/2009) 
Patience is the key. If you understand the format in which your data is currently held, chances 
are good you can translate it into R with more or less ease. 
Most commonly, people seem to want to import Microsoft Excel spreadsheets. I believe there 
is an ODBC approach to this, but I think it is only for MS Windows. 
In the gdata package, which was formerly part of gregmisc, there is a function that can use Perl 
to import an Excel spreadsheet. If you install gdata, the function to use is called read.xls. You 
have to specify which sheet you want. That works well enough if your data is set in a numeric 
format inside Excel. If it is set with the GENERAL type, I've seen the imported numbers turn 
up as all asterixes. 
Other packages have appeared to oer Excel import ability, such as xlsReadWrite. 
In the R-help list, I also see reference to an Excel program addon called RExcel that can 
install an option in the Excel menus called Put R Dataframe. The home for that project is 
http://sunsite.univie.ac.at/rcom/ 
I usually proceed like this. 
Step 1. Use Excel to edit the sheet so that it is RECTANGULAR. It should have variable 
names in row 1, and it has numbers where desired and the string NA otherwise. It must have 
NO embedded formulae or other Excel magic. Be sure that the columns are declared with the 
proper Excel format. Numerical information should have a numerical type, while text should 
have text as the type. Avoid General. 
6
Step 2. First, try the easy route. Try gdata's read.xls method. As long as you tell it which 
sheet you want to read out of your data set, and you add header=T and whatever other options 
you'd like in an ordinary read.table usage, then it has worked well for us. 
Step 3. Suppose step 2 did not work. Then you have something irregular in your Excel sheet 
and you should proceed as follows. Either open Excel and clean up your sheet and try step 2 
again, or take the more drastic step: Export the spread sheet to text format. File/Save As, 
navigate to csv or txt and choose any options like advanced format or text con
gurable. 
Choose the delimiter you want. One option is the tab key. When I'm having trouble, I use 
the bar symbol, j, because there's little chance it is in use inside a column of data. If your 
columns have addresses or such, usage of the COMMA as a delimiter is very dangerous. 
After you save that
le in text mode, open it in a powerful 
at text editor like Emacs and 
look through to make sure that 1) your variable names in the
rst row do not have spaces or 
quotation marks or special characters, and 2) look to the bottom to make sure your spreadsheet 
program did not insert any crud. If so, delete it. If you see commas in numeric variables, just 
use the Edit/Search/replace option to delete them. 
Then read in your tab separated data with 
r e a d . t a b l e ( ` ` f i l ename ' ' , header=T, sep=``nt ' ' ) 
I have tested the foreign package by importing an SPSS
le and it worked great. I've had 
great results importing Stata Data sets too. 
Here's a big caution for you, however. If your SPSS or Stata numeric variables have some value 
lables, say 98=No Answer and 99=Missing, then R will think that the variable is a factor, and 
it will convert it into a factor with 99 possible values. The foreign library commands for reading 
spss and dta
les have options to stop R from trying to help with factors and I'd urge you to 
read the manual and use them. 
If you use read.spss, for example, setting use.value.labels=F will stop R from creating factors 
at all. If you don't want to go that far, there's an option max.value.labels that you can set to 
5 or 10, and stop it from seeing 98=Missing and then creating a factor with 98 values. It will 
only convert variables that have fewer than 5 or 10 values. If you use read.dta (for Stata), you 
can use the option convert.factors=F. 
Also, if you are using read.table, you may have trouble if your numeric variables have any 
illegal values, such as letters. Then R will assume you really intend them to be factors and it 
will sometimes be tough to
x. If you add the option as.is=T, it will stop that cleanup eort 
by R. 
At one time, the SPSS import support in foreign did not work for me, and so I worked out a 
routine of copying the SPSS data into a text
le, just as described for Excel. 
I have a notoriously dicult time with SAS XPORT
les and don't even try anymore. I've seen 
several email posts by Frank Harrel in r-help and he has some pretty strong words about it. I 
do have one working example of importing the Annenberg National Election Study into R from 
SAS and you can review that at http://pj.freefaculty.org/DataSets/ANES/2002. I wrote a long 
boring explanation. Honestly, I think the best thing to do is to
nd a bridge between SAS and 
R, say use some program to convert the SAS into Excel, and go from there. Or just write the 
SAS data set to a
le and then read into R with read.table() or such. 
1.5 Merge data frames (04/23/2004) 
update:Merge is confusing! But if you study this, you will see everything in perfect clarity: 
x1  rnorm(100) 
x2  rnorm(100) 
x3  rnorm(100) 
7
x4  rnorm(100) 
ds1  data. f rame ( c i t y=rep ( 1 , 1 0 0 ) , x1=x1 , x2=x2 ) 
ds2  data. f rame ( c i t y=rep ( 2 , 1 0 0 ) , x1=x1 , x3=x3 , x4=x4 ) 
merge ( ds1 , ds2 , by.x=c ( ` ` c i t y ' ' , ` `x1 ' ' ) , by.y=c ( ` ` c i t y ' ' , ` `x1 ' ' ) , a l l=TRUE) 
The trick is to make sure R understands which are the common variables in the two datasets 
so it lines them up, and then all=T is needed to say that you don't want to throw away the 
variables that are only in one set or the other. Read the help page over and over, you eventually 
get it. 
More examples: 
exper iment  data. f rame ( t imes = c ( 0 , 0 , 1 0 , 1 0 , 2 0 , 2 0 , 3 0 , 3 0 ) , expval = c 
( 1 , 1 , 2 , 2 , 3 , 3 , 4 , 4 ) ) 
s imul  data. f rame ( t imes = c ( 0 , 1 0 , 2 0 , 3 0 ) , s imul = c ( 3 , 4 , 5 , 6 ) ) 
I want a merged datatset like: 
t imes expval s imul 
1 0 1 3 
2 0 1 3 
3 10 2 4 
4 10 2 4 
5 20 3 5 
6 20 3 5 
7 30 4 6 
8 30 4 6 
Suggestions 
merge ( experiment , s imul ) 
( from Brian D. Ripley ) 
does all the work for you. 
Or consider: 
exp. s im  data. f rame ( experiment , s imul=s imul $ s imul [match ( exper iment $ times , s imul $ 
t imes ) ] ) 
( from Jim Lemon) 
I have dataframes like this: 
s t a t e count1 pe r c ent1 
CA 19 0 . 3 4 
TX 22 0 . 3 5 
FL 11 0 . 2 4 
OR 34 0 . 4 2 
GA 52 0 . 6 2 
MN 12 0 . 1 7 
NC 19 0 . 3 4 
s t a t e count2 pe r c ent2 
FL 22 0 . 3 5 
MN 22 0 . 3 5 
CA 11 0 . 2 4 
TX 52 0 . 6 2 
And I want 
s t a t e count1 pe r c ent1 count2 pe r c ent2 
CA 19 0 . 3 4 11 0 . 2 4 
TX 22 0 . 3 5 52 0 . 6 2 
FL 11 0 . 2 4 22 0 . 3 5 
OR 34 0 . 4 2 0 0 
GA 52 0 . 6 2 0 0 
8
MN 12 0 . 1 7 22 0 . 3 5 
NC 19 0 . 3 4 0 0 
( from YuLing Wu ) 
In response, Ben Bolker said 
s t a t e 1  
c ( ` `CA' ' , ` `TX' ' , ` `FL ' ' , ` `OR' ' , ` `GA' ' , ` `MN' ' , ` `NC' ' ) 
count1  c ( 1 9 , 2 2 , 1 1 , 3 4 , 5 2 , 1 2 , 1 9 ) 
pe r c ent1  c (0 .34 , 0 .35 , 0 .24 , 0 .42 , 0 .62 , 0 .17 , 0 . 3 4 ) 
s t a t e 2  c ( ` `FL ' ' , ` `MN' ' , ` `CA' ' , ` `TX' ' ) 
count2  c ( 2 2 , 2 2 , 1 1 , 5 2 ) 
pe r c ent2  c (0 .35 , 0 .35 , 0 .24 , 0 . 6 2 ) 
data1  data. f rame ( s tat e1 , count1 , pe r c ent1 ) 
data2  data. f rame ( s tat e2 , count2 , pe r c ent2 ) 
datac  data1m  match ( data1 $ s tat e1 , data2 $ s tat e2 , 0 ) 
datac $ count2  i f e l s e (m==0 ,0 , data2 $ count2 [m] ) 
datac $ pe r c ent2  i f e l s e (m==0 ,0 , data2 $ pe r c ent2 [m] ) 
If you didn't want to keep all the rows in both data sets (but just the shared rows) you could 
use 
merge ( data1 , data2 , by=1) 
1.6 Add one row at a time (14/08/2000) 
Question: I would like to create an (empty) data frame withheadingsfor every column (column 
titles) and then put data row-by-row into this data frame (one row for every computation I will 
be doing), i.e. 
no. time temp pr e s sur e the headings 
1 0 100 80 f i r s t r e s u l t 
2 10 110 87 2nd r e s u l t . . . . . 
Answer: Depends if the cols are all numeric: if they are a matrix would be better. But if you 
insist on a data frame, here goes: 
If you know the number of results in advance, say, N, do this 
df  data. f rame ( time=numeric (N) , temp=numeric (N) , pr e s sur e=numeric (N) ) 
df [ 1 , ]  c ( 0 , 100 , 80) 
df [ 2 , ]  c (10 , 110 , 87) 
. . . 
or 
m  matrix ( nrow=N, ncol=3) 
colnames (m)  c ( time  , temp  , pr e s sur e ) 
m[ 1 , ]  c ( 0 , 100 , 80) 
m[ 2 , ]  c (10 , 110 , 87) 
The matrix form is better size it only needs to access one vector (a matrix is a vector with 
attributes), not three. 
If you don't know the
nal size you can use rbind to add a row at a time, but that is substantially 
less ecient as lots of re-allocation is needed. It's better to guess the size,
ll in and then rbind 
on a lot more rows if the guess was too small.(from Brian Ripley) 
1.7 Need yet another dierent kind of merge for data frames (11/08/2000) 
Convert these two
les 
9
Fi l e 1 
C A T 
Fi l e 2 
1 2 34 56 
2 3 45 67 
3 4 56 78 
( from Stephen Arthur ) 
Into a new data frame that looks like: 
C A T 1 2 34 56 
C A T 2 3 45 67 
C A T 3 4 56 78 
This works: 
r epcbind  func t i on (x , y ) f 
nx  nrow( x ) 
ny  nrow( y ) 
i f (nxny) 
x  apply (x , 2 , rep , l eng th=ny ) 
e l s e i f (nynx) 
y  apply (y , 2 , rep , l eng th=nx ) 
cbind (x , y ) 
g 
( from Ben Bolker ) 
1.8 Check if an object is NULL (06/04/2001) 
NULL does not mean that something does not exist. It means that it exists, and it is nothing. 
X  NULL 
This may be a way of clearing values assigned to X, or initializing a variable as nothing. 
Programs can check on whether X is null 
i f ( i s . n u l l ( x ) ) f #then...} 
If you load things, R does not warn you when they are not found, it records them as NULL. 
You have the responsibility of checking them. Use 
i s . n u l l ( l i s t $component ) 
to check a thing named component in a thing named list. 
Accessing non-existent dataframe columns with [ does give an error, so you could do that 
instead. 
data ( t r e e s ) 
t r e e s $ aardvark 
NULL 
t r e e s [ , aardvark  ] 
Error in [.data.frame(trees, , aardvark) : subscript out of bounds (from Thomas Lumley) 
1.9 Generate random numbers (12/02/2012) 
You want randomly drawn integers? Use Sample, like so: 
# If you mean sampling without replacement: 
sample ( 1 : 1 0 , 3 , r e p l a c e=FALSE) 
#If you mean with replacement: 
sample ( 1 : 1 0 , 3 , r e p l a c e=TRUE) 
( from Bi l l Simpson ) 
10
Included with R are many univariate distributions, for example the Gaussian normal, Gamma, 
Binomial, Poisson, and so forth. Run 
? r u n i f 
? rnorm 
?rgamma 
? r p o i s 
You will see a distribution's functions are a base name like norm with pre
x letters r, d, 
p, q. 
ˆ rnorm: draw pseudo random numbers from a normal 
ˆ dnorm: the density value for a given value of a variable 
ˆ pnorm: the cumulative probability density value for a given value 
ˆ qnorm: the quantile function: given a probability, what is the corresponding value of the 
variable? 
I made a long-ish lecture about this in my R workshop (http://pj.freefaculty.org/guides/ 
Rcourse/rRandomVariables) 
Multivariate distributions are not (yet) in the base of R, but they are in several packages, such 
as MASS and mvtnorm. Note, when you use these, it is necessary to specify a mean vector and 
a covariance matrix among the variables. Brian Ripley gave this example: with mvrnorm in 
package MASS (part of the VR bundle), 
mvrnorm( 2 , c ( 0 , 0 ) , matrix ( c (0 .25 , 0 .20 , 0 .20 , 0 . 2 5 ) , 2 ,2) ) 
If you don't want to use a contributed package to draw multivariate observations, you can 
approximate some using the univariate distributions in R itself. Peter Dalgaard observed a less 
general solution for this particular case would be 
rnorm( 1 , sd=s q r t (0 . 2 0 ) ) + rnorm( 2 , sd=s q r t (0 . 0 5 ) ) 
1.10 Generate random numbers with a
xed mean/variance (06/09/2000) 
If you generate random numbers with a given tool, you don't get a sample with the exact mean 
you specify. A generator with a mean of 0 will create samples with varying means, right? 
I don't know why anybody wants a sample with a mean that is exactly 0, but you can draw a 
sample and then transform it to force the mean however you like. Take a 2 step approach: 
R x  rnorm(100 , mean = 5 , sd = 2) 
R x  ( x  mean( x ) ) / s q r t ( var ( x ) ) 
R mean( x ) 
[ 1 ] 1 .385177e16 
R var ( x ) 
[ 1 ] 1 
and now create your sample with mean 5 and sd 2: 
R x  x*2 + 5 
R mean( x ) 
[ 1 ] 5 
R var ( x ) 
[ 1 ] 4 
( from Tor s ten.Hothorn ) 
1.11 Use rep to manufacture a weighted data set (30/01/2001) 
11
x  c ( 1 0 , 4 0 , 5 0 , 1 0 0 ) # income vector for instance 
 w  c ( 2 , 1 , 3 , 2 ) # the weight for each observation in x with the same 
 rep (x ,w) 
[ 1 ] 10 10 40 50 50 50 100 100 
( from P. Malewski ) 
That expands a single variable, but we can expand all of the columns in a dataset one at a time 
to represent weighted data 
Thomas Lumley provided an example: Most of the functions that have weights have frequency 
weights rather than probability weights: that is, setting a weight equal to 2 has exactly the 
same eect as replicating the observation. 
expanded.dataa s .da t a . f r ame ( l apply ( compres sed.data , 
func t i on ( x ) rep (x , compr e s s ed.data $we ight s ) ) ) 
1.12 Convert contingency table to data frame (06/09/2000) 
Given a 8 dimensional crosstab, you want a data frame with 8 factors and 1 column for fre-quencies 
of the cells in the table. 
R1.2 introduces a function as.data.frame.table() to handle this. 
This can also be done manually. Here's a function (it's a simple wrapper around expand.grid): 
d f i f y  func t i on ( ar r , value.name = value  , dn.names = names ( dimnames ( a r r ) ) ) f 
Ver s ion  $ Id : d f i f y . s f u n , v 1 . 1 1995 /10/09 1 6 : 0 6 : 1 2 d3a061 Exp $  
dn  dimnames ( a r r  a s . a r r a y ( a r r ) ) 
i f ( i s . n u l l (dn) ) 
s top ( Can ' t dataframei fy an ar ray wi thout dimnames ) 
names (dn)  dn.names 
ans  cbind ( expand.gr id (dn) , a s . v e c t o r ( a r r ) ) 
names ( ans ) [ ncol ( ans ) ]  value.name 
ans 
g 
The name is short for data-frame-i-fy. 
For your example, assuming your multi-way array has proper dimnames, you'd just do: 
my.data. f rame  d f i f y (my.array , value.name=``f r equency ' ' ) 
(from Todd Taylor) 
1.13 Write: data in text
le (31/12/2001) 
Say I have a command that produced a 28 x 28 data matrix. How can I output the matrix into 
a txt
le (rather than copy/paste the matrix)? 
wr i t e . t a b l e (mat , f i l e= f i l e n ame . t x t ) 
Note MASS library has a function write.matrix which is faster if you need to write a numerical 
matrix, not a data frame. Good for big jobs. 
2 Working with data frames: Recoding, selecting, aggregating 
2.1 Add variables to a data frame (or list) (02/06/2003) 
If dat is a data frame, the column x1 can be added to dat in (at least 4) methods, dat$x1, dat[ 
, x1], dat[x1], or dat[[x1]]. Observe 
12
dat  data. f rame ( a=c ( 1 , 2 , 3 ) ) 
 dat [ , x1  ]  c (12 , 23 , 44) 
 dat [ x2  ]  c (12 , 23 , 44) 
 dat [ [ x3  ] ]  c (12 , 23 , 44) 
 dat 
a x1 x2 x3 
1 1 12 12 12 
2 2 23 23 23 
3 3 44 44 44 
There are many other ways, including cbind(). 
Often I plan to calculate variable names within a program, as well as the values of the variables. 
I think of this as generating new column names on the 
y. In r-help, I asked I keep
nding 
myself in a situation where I want to calculate a variable name and then use it on the left hand 
side of an assignment. To me, this was a dicult problem. 
Brian Ripley pointed toward one way to add the variable to a data frame: 
i t e r a t i o n  9 
newname  pas t e ( run  , i t e r a t i o n , sep=) 
mydf [ newname ]  aColumn 
## or , in one step: 
mydf [ pas t e ( run  , i t e r a t i o n , sep=) ]  aColumn 
## for a list , same idea works , use double brackets 
myList [ [ pas t e ( run  , i t e r a t i o n , sep=) ] ]  aColumn 
And Thomas Lumley added:  If you wanted to do something of this sort for which the above 
didn't work you could also learn about substitute(): 
eva l ( s u b s t i t u t e (myList $newColumnaColumn) , 
l i s t (newColumn=as.name ( varName ) ) ) 
2.2 Create variable names on the 
y (10/04/2001) 
The previous showed how to add a column to a data frame on the 
y. What if you just want 
to calculate a name for a variable that is not in a data frame. The assign function can do that. 
Try this to create an object (a variable) named zzz equal to 32. 
 a s s i g n ( z z z  , 32) 
 z z z 
[ 1 ] 32 
In that case,I specify zzz, but we can use a function to create the variable name. Suppose you 
want a random variable name. Every time you run this, you get a new variable starting with 
a. 
a s s i g n ( pas t e ( a  , round ( rnorm( 1 , 50 ,12) , 2) , sep=) , 324) 
I got a44.05: 
 a44.05 
[ 1 ] 324 
2.3 Recode one column, output values into another column (12/05/2003) 
Please read the documentation for transform() and replace() and also learn how to use the 
magic of R vectors. 
The transform() function works only for data frames. Suppose a data frame is called mdf and 
you want to add a new variable newV that is a function of var1 and var2: 
13
mdf  t rans form (mdf , newV=l o g ( var1 ) + var2 ) ) 
I'm inclined to take the easy road when I can. Proper use of indexes in R will help a lot, 
especially for recoding discrete valued variables. Some cases are particularly simple because of 
the way arrays are processed. 
Suppose you create a variable, and then want to reset some values to missing. Go like this: 
x  rnorm(10000) x [ x  1 . 5 ]  NA 
And if you don't want to replace the original variable, create a new one
rst ( xnew - x ) and 
then do that same thing to xnew. 
You can put other variables inside the brackets, so if you want x to equal 234 if y equals 1, then 
x [ y==1 ]  234 
Suppose you have v1, and you want to add another variable v2 so that there is a translation. 
If v1=1, you want v2=4. If v1=2, you want v2=4. If v1=3, you want v2=5. This reminds me 
of the old days using SPSS and SAS. I think it is clearest to do: 
 v1  c ( 1 , 2 , 3 ) # now initialize v2  v2  rep( -9 , length(v1)) # now recode v2  
v2[v1= =1]  4 v2[v1= =2]4 v2[v1= =3]5 v2[1] 4 4 5 
Note that R's ifelse command can work too: 
xi f e l s e (x1.5 ,NA, x ) 
One user recently asked how to take data like a vector of names and convert it to numbers, and 
2 good solutions appeared: 
y  c ( OLDa , ALL , OLDc , OLDa , OLDb , NEW , OLDb , OLDa , ALL ,  
. . . ) e l  c ( OLDa , OLDb , OLDc , NEW , ALL) match (y , e l ) [ 1 ] 1 5 3 
1 2 4 2 1 5 NA 
or 
 f  f a c t o r (x , l e v e l s=c ( OLDa , OLDb , OLDc , NEW , ALL) ) a s . i n t e g e r ( f ) 
[ 1 ] 1 5 3 1 2 4 2 1 5 
I asked Mark Myatt for more examples: 
For example, suppose I get a bunch of variables coded on a scale 
1 = no 6 = yes 8 = t i e d 9 = mi s s ing 10 = not a p p l i c a b l e . 
Recode that into a new variable name with 0=no, 1=yes, and all else NA. 
It seems like the replace() function would do it for single values but you end up with empty 
levels in factors but that can be
xed by re-factoring the variable. Here is a basic recode() 
function: 
r e code  func t i on ( var , old , new) f 
x  r e p l a c e ( var , var==old , new) 
i f ( i s . f a c t o r ( x ) ) f a c t o r ( x ) 
e l s e x 
g 
For the above example: 
t e s t  c ( 1 , 1 , 2 , 1 , 1 , 8 , 1 , 2 , 1 , 1 0 , 1 , 8 , 2 , 1 , 9 , 1 , 2 , 9 , 1 0 , 1 ) 
t e s t t e s t  r e code ( t e s t , 1 , 0) 
t e s t  r e code ( t e s t , 2 , 1) 
t e s t  r e code ( t e s t , 8 , NA) 
t e s t  r e code ( t e s t , 9 , NA) 
t e s t  r e code ( t e s t , 10 , NA) t e s t 
Although it is probably easier to use replace(): 
14
t e s t  c ( 1 , 1 , 2 , 1 , 1 , 8 , 1 , 2 , 1 , 1 0 , 1 , 8 , 2 , 1 , 9 , 1 , 2 , 9 , 1 0 , 1 ) 
t e s t t e s t  r e p l a c e ( t e s t , t e s t ==8 j t e s t ==9 j t e s t ==10 , NA) 
t e s t  r e p l a c e ( t e s t , t e s t ==1 , 0) 
t e s t  r e p l a c e ( t e s t , t e s t ==2 , 1) t e s t 
I suppose a better function would take from and to lists as arguments: 
r e code  func t i on ( var , from , to ) f 
x  a s . v e c t o r ( var ) 
f o r ( i in 1 : l ength ( from) ) f 
x  r e p l a c e (x , x==from [ i ] , to [ i ] ) 
g 
i f ( i s . f a c t o r ( var ) ) f a c t o r ( x ) 
e l s e x 
g 
For the example: 
t e s t  c ( 1 , 1 , 2 , 1 , 1 , 8 , 1 , 2 , 1 , 1 0 , 1 , 8 , 2 , 1 , 9 , 1 , 2 , 9 , 1 0 , 1 ) 
t e s t t e s t  r e code ( t e s t , c ( 1 , 2 , 8 : 1 0 ) , c ( 0 , 1 ) ) 
t e s t 
and it still works with single values. 
Suppose somebody gives me ascale from 1 to 100, and I want to collapse it into 10 groups, how 
do I go about it? 
Mark says: Use cut() for this. This cuts into 10 groups: 
t e s t  t runc ( r u n i f (1000 ,1 ,100) ) 
groups cut ( t e s t , seq ( 0 , 1 0 0 , 1 0 ) ) 
t abl e ( t e s t , groups ) 
To get ten groups without knowing the minimum and maximum value you can use pretty(): 
groups  cut ( t e s t , pr e t ty ( t e s t , 1 0 ) ) 
t abl e ( t e s t , groups ) 
You can specify the cut-points: 
groups  cut ( t e s t , c ( 0 , 2 0 , 4 0 , 6 0 , 8 0 , 1 0 0 ) ) 
t abl e ( t e s t , groups ) 
And they don't need to be even groups: 
groups  cut ( t e s t , c ( 0 , 3 0 , 5 0 , 7 5 , 1 0 0 ) ) 
t abl e ( t e s t , groups ) 
Mark added, I think I will add this sort of thing to the REX pack. 
2003{12{01, someone asked how to convert a vector of numbers to characters, such as 
i f x [ i ]  250 then c o l [ i ] = `` red ' ' 
e l s e i f x [ i ]  500 then c o l [ i ] = `` blue ' ' 
and so forth. Many interesing answers appeared in R-help. A big long nested ifelse would work, 
as in: 
x . c o l  i f e l s e ( x  250 , red  , 
i f e l s e ( x500 , blue  , i f e l s e ( x750 , green  , black ) ) ) 
There were some nice suggestions to use cut, such as Gabor Grothendeick's advice: 
The following results in a character vector: 
 c o l o u r s  c ( red  , blue  , green  , back ) 
 c o l o u r s [ cut (x , c (Inf , 2 5 0 , 5 0 0 , 7 0 0 , I n f ) , r i g h t=F, lab=FALSE) ] 
While this generates a factor variable: 
 c o l o u r s  c ( red  , blue  , green  , black ) 
 cut (x , c (Inf , 2 5 0 , 5 0 0 , 7 0 0 , I n f ) , r i g h t=F, lab=c o l o u r s ) 
15
2.4 Create indicator (dummy) variables (20/06/2001) 
2 examples: 
c is a column, you want dummy variable, one for each valid value. First, make it a factor, then 
use model.matrix(): 
 x  c ( 2 , 2 , 5 , 3 , 6 , 5 ,NA) 
 xf  f a c t o r (x , l e v e l s =2:6) 
 model .mat r ix ( xf1 ) 
xf2 xf3 xf4 xf5 xf6 
1 1 0 0 0 0 
2 1 0 0 0 0 
3 0 0 0 1 0 
4 0 1 0 0 0 
5 0 0 0 0 1 
6 0 0 0 1 0 
a t t r ( ,  a s s i g n ) 
[ 1 ] 1 1 1 1 1 
(from Peter Dalgaard) 
Question: I have a variable with 5 categories and I want to create dummy variables for each 
category. 
Answer: Use row indexing or model.matrix. 
f f  f a c t o r ( sample ( l e t t e r s [ 1 : 5 ] , 25 , r e p l a c e=TRUE) ) 
diag ( n l e v e l s ( f f ) ) [ f f , ] 
#or 
model .mat r ix (f f  1) 
( from Brian D. Ripley ) 
2.5 Create lagged values of variables for time series regression (05/22/2012) 
Peter Dalgaard explained, the simple way is to create a new variable which shifts the response, 
i.e. 
y shf t  c ( y [1 ] , NA) # pad with missing 
summary( lm( y shf t  x + y ) ) 
Alternatively, lag the regressors: 
N  l eng th ( x ) 
xlag  c (NA, x [ 1 : (N1) ] ) 
ylag  c (NA, y [ 1 : (N1) ] ) 
summary( lm( y  xlag + ylag ) ) 
Dave Armstrong (personal communication, 2012/5/21) brought to my attention the following 
problem in cross sectional time series data. Simply inserting an NA will lead to disaster 
because we need to insert a lag within each unit. There is also a bad problem when the time 
points observed for the sub units are not all the same. He suggests the following 
dat  data. f rame ( 
ccode = c ( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 ) , 
year = c (1980 , 1982 , 1983 , 1984 , 1981:1984 , 1980:1982 , 1984) , 
x = seq ( 2 , 2 4 , by=2) ) 
dat $ obs  1 : nrow( dat ) 
dat $ lagobs  match ( pas t e ( dat $ ccode , dat $year1 , sep= . ) , 
pas t e ( dat $ ccode , dat $ year , sep= . ) ) 
dat $ l a g x  dat $x [ dat $ lagobs ] 
16
Run this example, be surprised, then email me if you
nd a better way. I haven't. This seems 
like a dicult problem to me and if I had to do it very often, I am pretty sure I'd have to 
re-think what it means to lag when there are years missing in the data. Perhaps this is a rare 
occasion where interpolation might be called for. 
2.6 How to drop factor levels for datasets that don't have observations with 
those values? (08/01/2002) 
The best way to drop levels, BTW, is 
pr obl em. f a c t o r  pr obl em. f a c t o r [ , drop=TRUE] 
( from Brian D. Ripley ) 
That has the same eect as running the pre-existing problem.factor through the function 
factor: 
pr obl em. f a c t o r  f a c t o r ( pr obl em. f a c t o r ) 
2.7 Select/subset observations out of a dataframe (08/02/2012) 
If you just want particular named or numbered rows or columns, of course, that's easy. Take 
columns x1, x2, and x3. 
datSubset1  dat [ , c ( x1  , x2  , x3 ) ] 
If those happen to be columns 44, 92, and 93 in a data frame, 
datSubset1  dat [ , c (44 , 92 , 93) ] 
Usually, we want observations that are conditional. 
Want to take observations for which variable Y is greater than A and less or equal than B: 
X[Y  A  Y  B ] 
Suppose you want observations with c=1 in df1. This makes a new data frame. 
df2  df1 [ df1 $ c==1 , ] 
and note that indexing is pretty central to using S (the language), so it is worth learning all the 
ways to use it. (from Brian Ripley) 
Or use match select values from the column d by taking the ones that match the values of 
another column, as in 
 d  t ( ar ray ( 1 : 2 0 , dim=c ( 2 , 1 0 ) ) ) 
 i  c ( 1 3 , 5 , 1 9 ) 
 d [match ( i , d [ , 1 ] ) , 2 ] 
[ 1 ] 14 6 20 
( from Peter Wolf ) 
Till Baumgaertel wanted to select observations for men over age 40, and sex was coded either 
m or M. Here are two working commands: 
# 1.) 
maleOver40  subs e t (d , sex %in% c ( m , M)  age  40) 
# 2.) 
maleOver40  d [ ( d$ sex==m j d$ sex==M)  d$ age 40 , ] 
To decipher that, do ?match and ?%in to
nd out about the %in% operator. 
If you want to grab the rows for which the variable subject is 15 or 19, try: 
df1 $ subj e c t %in% c ( 1 9 , 1 5 ) 
17
to get a True/False indication for each row in the data frame, and you can then use that output 
to pick the rows you want: 
i n d i c a t o r  df1 $ subj e c t %in% c ( 1 9 , 1 5 ) 
df1 [ indi c a t o r , ] 
How to deal with values that are already marked as missing? If you want to omit all rows for 
which one or more column is NA (missing): 
x2  na.omi t ( x ) 
produces a copy of the data frame x with all rows that contain missing data removed. The func-tion 
na.exclude could be used also. For more information on missings, check help : ?na.exclude. 
For exclusion of missing, Peter Dalgaard likes 
subs e t (x , c ompl e t e . c a s e s ( x ) ) or x [ c ompl e t e . c a s e s ( x ) , ] 
adding is.na(x) is preferable to x !=NA 
2.8 Delete
rst observation for each element in a cluster of observations 
(11/08/2000) 
Given data like: 
1 ABK 19910711 11 .1867461 0 .0000000 108 
2 ABK 19910712 11 .5298979 11 .1867461 111 
6 CSCO 19910102 0 .1553819 0 .0000000 106 
7 CSCO 19910103 0 .1527778 0 .1458333 166 
remove the
rst observation for each value of the sym variable (the one coded ABK,CSCO, 
etc). . If you just need to remove rows 1, 6, and 13, do: 
newhi lodata  hi l oda t a [c ( 1 , 6 , 1 3 ) , ] 
To solve the more general problem of omitting the
rst in each group, assuming sym is a 
factor, try something like 
newhi lodata  subs e t ( hi lodata , d i f f ( c ( 0 , a s . i n t e g e r ( sym) ) ) != 0) 
(actually, the as.integer is unnecessary because the c() will unclass the factor automagically) 
(from Peter Dalgaard) 
Alternatively, you could use the match function because it returns the
rst match. Suppose jm 
is the data set. Then: 
 match ( unique ( jm$sym) , jm$sym) 
[ 1 ] 1 6 13 
 jm  jm[ match( unique ( jm$sym) , jm$sym) , ] 
(from Douglas Bates) 
As Robert pointed out to me privately: duplicated() does the trick 
subs e t ( hi lodata , dupl i c a t ed ( sym) ) 
has got to be the simplest variant. 
2.9 Select a random sample of data (11/08/2000) 
sample (N, n , r e p l a c e=FALSE) 
and 
seq (N) [ rank ( r u n i f (N) )  n ] 
18

More Related Content

Viewers also liked

New microsoft office word document
New microsoft office word documentNew microsoft office word document
New microsoft office word documentAnees999
 
Presentacion de angie
Presentacion de angiePresentacion de angie
Presentacion de angielore75vivas
 
10 Things Your Competitors Are Blogging About
10 Things Your Competitors Are Blogging About10 Things Your Competitors Are Blogging About
10 Things Your Competitors Are Blogging AboutDuncan Gilman
 
Sintesis informativa 25 de enero 2013
Sintesis informativa 25 de enero 2013Sintesis informativa 25 de enero 2013
Sintesis informativa 25 de enero 2013megaradioexpress
 
4 Parte Seres Humanos National Geographic
4 Parte Seres Humanos National Geographic4 Parte Seres Humanos National Geographic
4 Parte Seres Humanos National GeographicRenny
 
Comment tirer de la connaissance marché des recherches google
Comment tirer de la connaissance marché des recherches googleComment tirer de la connaissance marché des recherches google
Comment tirer de la connaissance marché des recherches googleClustaar
 
Quantum physics
Quantum physicsQuantum physics
Quantum physicsraymond69
 
Cataratas NiáGara En Inverno
Cataratas NiáGara En InvernoCataratas NiáGara En Inverno
Cataratas NiáGara En InvernoJudith Naveda
 

Viewers also liked (15)

Mantenimiento De Equipos De Computo STIVEN PASTRANA
Mantenimiento De Equipos De Computo STIVEN PASTRANAMantenimiento De Equipos De Computo STIVEN PASTRANA
Mantenimiento De Equipos De Computo STIVEN PASTRANA
 
New microsoft office word document
New microsoft office word documentNew microsoft office word document
New microsoft office word document
 
Math pp
Math ppMath pp
Math pp
 
Trabajo tecnología
Trabajo tecnologíaTrabajo tecnología
Trabajo tecnología
 
Huracan
HuracanHuracan
Huracan
 
Presentacion de angie
Presentacion de angiePresentacion de angie
Presentacion de angie
 
10 Things Your Competitors Are Blogging About
10 Things Your Competitors Are Blogging About10 Things Your Competitors Are Blogging About
10 Things Your Competitors Are Blogging About
 
Sintesis informativa 25 de enero 2013
Sintesis informativa 25 de enero 2013Sintesis informativa 25 de enero 2013
Sintesis informativa 25 de enero 2013
 
4 Parte Seres Humanos National Geographic
4 Parte Seres Humanos National Geographic4 Parte Seres Humanos National Geographic
4 Parte Seres Humanos National Geographic
 
Comment tirer de la connaissance marché des recherches google
Comment tirer de la connaissance marché des recherches googleComment tirer de la connaissance marché des recherches google
Comment tirer de la connaissance marché des recherches google
 
Quantum physics
Quantum physicsQuantum physics
Quantum physics
 
Scis 9 28 09 Pep Rally Schedule
Scis 9 28 09 Pep Rally ScheduleScis 9 28 09 Pep Rally Schedule
Scis 9 28 09 Pep Rally Schedule
 
Las poleas!
Las poleas!Las poleas!
Las poleas!
 
Cataratas NiáGara En Inverno
Cataratas NiáGara En InvernoCataratas NiáGara En Inverno
Cataratas NiáGara En Inverno
 
1 my einglish
1 my einglish1 my einglish
1 my einglish
 

Similar to Rtips123

Getstart graphic
Getstart graphicGetstart graphic
Getstart graphicalldesign
 
Wise Document Translator Report
Wise Document Translator ReportWise Document Translator Report
Wise Document Translator ReportRaouf KESKES
 
Jcl reference mvs zos v1 r10
Jcl reference mvs zos v1 r10Jcl reference mvs zos v1 r10
Jcl reference mvs zos v1 r10Leo Goicochea
 
Automatización de sistemas de fabricación con PLC.pdf
Automatización de sistemas de fabricación con PLC.pdfAutomatización de sistemas de fabricación con PLC.pdf
Automatización de sistemas de fabricación con PLC.pdfSANTIAGO PABLO ALBERTO
 
Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...
Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...
Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...AmeryWalters
 
Tems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libre
Tems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libreTems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libre
Tems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libreTue Phan
 
Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...
Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...
Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...Safe Software
 

Similar to Rtips123 (20)

Learn matlab primer
Learn matlab primerLearn matlab primer
Learn matlab primer
 
Ashwin_Thesis
Ashwin_ThesisAshwin_Thesis
Ashwin_Thesis
 
Getstart graphic
Getstart graphicGetstart graphic
Getstart graphic
 
Wise Document Translator Report
Wise Document Translator ReportWise Document Translator Report
Wise Document Translator Report
 
FINAL PROJECT
FINAL PROJECTFINAL PROJECT
FINAL PROJECT
 
Introduction to Matlab
Introduction to MatlabIntroduction to Matlab
Introduction to Matlab
 
An Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked ExamplesAn Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked Examples
 
Jcl reference mvs zos v1 r10
Jcl reference mvs zos v1 r10Jcl reference mvs zos v1 r10
Jcl reference mvs zos v1 r10
 
Automatización de sistemas de fabricación con PLC.pdf
Automatización de sistemas de fabricación con PLC.pdfAutomatización de sistemas de fabricación con PLC.pdf
Automatización de sistemas de fabricación con PLC.pdf
 
Example
ExampleExample
Example
 
Heat source simulation
Heat source simulationHeat source simulation
Heat source simulation
 
Tim Telcik contour TIN
Tim Telcik contour TINTim Telcik contour TIN
Tim Telcik contour TIN
 
report
reportreport
report
 
Manual de KmPlot
Manual de KmPlotManual de KmPlot
Manual de KmPlot
 
Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...
Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...
Introduction to MATLAB Programming and Numerical Methods for Engineers 1st Ed...
 
Test
TestTest
Test
 
Tems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libre
Tems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libreTems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libre
Tems pocket 7.2_for_sony_ericsson_w995_and_w995a_--_user_s_manual-libre
 
Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...
Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...
Reporting Summary Information of Spatial Datasets and Non-Compliance Issues U...
 
11 019-maldonado-jesus-bericht
11 019-maldonado-jesus-bericht11 019-maldonado-jesus-bericht
11 019-maldonado-jesus-bericht
 
Sliderfns
SliderfnsSliderfns
Sliderfns
 

Recently uploaded

HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsMichael W. Hawkins
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Roland Driesen
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageMatteo Carbone
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLSeo
 
7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...Paul Menig
 
How to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityHow to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityEric T. Tung
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfAdmir Softic
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.Aaiza Hassan
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Centuryrwgiffor
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear RegressionRavindra Nath Shukla
 
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesMysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesDipal Arora
 
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...Aggregage
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Neil Kimberley
 
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxB.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxpriyanshujha201
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...amitlee9823
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...anilsa9823
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 

Recently uploaded (20)

HONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael HawkinsHONOR Veterans Event Keynote by Michael Hawkins
HONOR Veterans Event Keynote by Michael Hawkins
 
Forklift Operations: Safety through Cartoons
Forklift Operations: Safety through CartoonsForklift Operations: Safety through Cartoons
Forklift Operations: Safety through Cartoons
 
Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...Boost the utilization of your HCL environment by reevaluating use cases and f...
Boost the utilization of your HCL environment by reevaluating use cases and f...
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
 
7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...7.pdf This presentation captures many uses and the significance of the number...
7.pdf This presentation captures many uses and the significance of the number...
 
Mifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pills
Mifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pillsMifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pills
Mifty kit IN Salmiya (+918133066128) Abortion pills IN Salmiyah Cytotec pills
 
How to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityHow to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League City
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
 
M.C Lodges -- Guest House in Jhang.
M.C Lodges --  Guest House in Jhang.M.C Lodges --  Guest House in Jhang.
M.C Lodges -- Guest House in Jhang.
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Century
 
Regression analysis: Simple Linear Regression Multiple Linear Regression
Regression analysis:  Simple Linear Regression Multiple Linear RegressionRegression analysis:  Simple Linear Regression Multiple Linear Regression
Regression analysis: Simple Linear Regression Multiple Linear Regression
 
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesMysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
 
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023
 
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptxB.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
B.COM Unit – 4 ( CORPORATE SOCIAL RESPONSIBILITY ( CSR ).pptx
 
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
Call Girls Jp Nagar Just Call 👗 7737669865 👗 Top Class Call Girl Service Bang...
 
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
Lucknow 💋 Escorts in Lucknow - 450+ Call Girl Cash Payment 8923113531 Neha Th...
 
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 

Rtips123

  • 1. Rtips. Revival 2014! Paul E. Johnson <pauljohn @ ku.edu> March 24, 2014 The original Rtips started in 1999. It became dicult to update because of limitations in the software with which it was created. Now I know more about R, and have decided to wade in again. In January, 2012, I took the FaqManager HTML output and converted it to LATEX with the excellent open source program pandoc, and from there I've been editing and updating it in LYX. From here on out, the latest html version will be at http://pj.freefaculty.org/R/Rtips. html and the PDF for the same will be http://pj.freefaculty.org/R/Rtips.pdf. You are reading the New Thing! The
  • 2. rst chore is to cut out the old useless stu that was no good to start with, correct mistakes in translation (the quotation mark translations are particularly dangerous, but also there is trouble with ~, $, and -. Original Preface (I thought it was cute to call this StatsRus but the Toystore's lawyer called and, well, you know. . . ) If you need a tip sheet for R, here it is. This is not a substitute for R documentation, just a list of things I had trouble remembering when switching from SAS to R. Heed the words of Brian D. Ripley, One enquiry came to me yesterday which suggested that some users of binary distributions do not know that R comes with two Guides in the doc/manual directory plus an FAQ and the help pages in book form. I hope those are distributed with all the binary distributions (they are not made nor installed by default). Windows-speci
  • 3. c versions are available. Please run help.start() in R! Contents 1 Data Input/Output 5 1.1 Bring raw numbers into R (05/22/2012) . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Basic notation on data access (12/02/2012) . . . . . . . . . . . . . . . . . . . . . 6 1.3 Checkout the new Data Import/Export manual (13/08/2001) . . . . . . . . . . . 6 1.4 Exchange data between R and other programs (Excel, etc) (01/21/2009) . . . . . 6 1.5 Merge data frames (04/23/2004) . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.6 Add one row at a time (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.7 Need yet another dierent kind of merge for data frames (11/08/2000) . . . . . . 9 1.8 Check if an object is NULL (06/04/2001) . . . . . . . . . . . . . . . . . . . . . . 10 1
  • 4. 1.9 Generate random numbers (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . 10 1.10 Generate random numbers with a
  • 5. xed mean/variance (06/09/2000) . . . . . . . 11 1.11 Use rep to manufacture a weighted data set (30/01/2001) . . . . . . . . . . . . . 11 1.12 Convert contingency table to data frame (06/09/2000) . . . . . . . . . . . . . . . 12 1.13 Write: data in text
  • 6. le (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . 12 2 Working with data frames: Recoding, selecting, aggregating 12 2.1 Add variables to a data frame (or list) (02/06/2003) . . . . . . . . . . . . . . . . 12 2.2 Create variable names on the y (10/04/2001) . . . . . . . . . . . . . . . . . . 13 2.3 Recode one column, output values into another column (12/05/2003) . . . . . . . 13 2.4 Create indicator (dummy) variables (20/06/2001) . . . . . . . . . . . . . . . . . . 16 2.5 Create lagged values of variables for time series regression (05/22/2012) . . . . . 16 2.6 How to drop factor levels for datasets that don't have observations with those values? (08/01/2002) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.7 Select/subset observations out of a dataframe (08/02/2012) . . . . . . . . . . . . 17 2.8 Delete
  • 7. rst observation for each element in a cluster of observations (11/08/2000) 18 2.9 Select a random sample of data (11/08/2000) . . . . . . . . . . . . . . . . . . . . 18 2.10 Selecting Variables for Models: Don't forget the subset function (15/08/2000) . . 19 2.11 Process all numeric variables, ignore character variables? (11/02/2012) . . . . . . 19 2.12 Sorting by more than one variable (06/09/2000) . . . . . . . . . . . . . . . . . . 19 2.13 Rank within subgroups de
  • 8. ned by a factor (06/09/2000) . . . . . . . . . . . . . . 20 2.14 Work with missing values (na.omit, is.na, etc) (15/01/2012) . . . . . . . . . . . . 20 2.15 Aggregate values, one for each line (16/08/2000) . . . . . . . . . . . . . . . . . . 21 2.16 Create new data frame to hold aggregate values for each factor (11/08/2000) . . 21 2.17 Selectively sum columns in a data frame (15/01/2012) . . . . . . . . . . . . . . . 21 2.18 Rip digits out of real numbers one at a time (11/08/2000) . . . . . . . . . . . . . 21 2.19 Grab an item from each of several matrices in a List (14/08/2000) . . . . . . . . 22 2.20 Get vector showing values in a dataset (10/04/2001) . . . . . . . . . . . . . . . . 22 2.21 Calculate the value of a string representing an R command (13/08/2000) . . . . 22 2.22 Which can grab the index values of cases satisfying a test (06/04/2001) . . . . 22 2.23 Find unique lines in a matrix/data frame (31/12/2001) . . . . . . . . . . . . . . . 23 3 Matrices and vector operations 23 3.1 Create a vector, append values (01/02/2012) . . . . . . . . . . . . . . . . . . . . 23 3.2 How to create an identity matrix? (16/08/2000) . . . . . . . . . . . . . . . . . . 24 3.3 Convert matrix m to one long vector (11/08/2000) . . . . . . . . . . . . . . . . 24 3.4 Creating a peculiar sequence (1 2 3 4 1 2 3 1 2 1) (11/08/2000) . . . . . . . . . . 24 3.5 Select every n'th item (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.6 Find index of a value nearest to 1.5 in a vector (11/08/2000) . . . . . . . . . . . 25 3.7 Find index of nonzero items in vector (18/06/2001) . . . . . . . . . . . . . . . . . 25 3.8 Find index of missing values (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . 26 3.9 Find index of largest item in vector (16/08/2000) . . . . . . . . . . . . . . . . . . 26 3.10 Replace values in a matrix (22/11/2000) . . . . . . . . . . . . . . . . . . . . . . . 26 3.11 Delete particular rows from matrix (06/04/2001) . . . . . . . . . . . . . . . . . . 27 3.12 Count number of items meeting a criterion (01/05/2005) . . . . . . . . . . . . . . 27 3.13 Compute partial correlation coecients from correlation matrix (08/12/2000) . . 27 3.14 Create a multidimensional matrix (R array) (20/06/2001) . . . . . . . . . . . . . 28 3.15 Combine a lot of matrices (20/06/2001) . . . . . . . . . . . . . . . . . . . . . . . 28 3.16 Create neighbormatrices according to speci
  • 9. c logics (20/06/2001) . . . . . . . 28 3.17 Matching two columns of numbers by a key variable (20/06/2001) . . . . . . 29 3.18 Create Upper or Lower Triangular matrix (06/08/2012) . . . . . . . . . . . . . . 29 3.19 Calculate inverse of X (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . . 30 2
  • 10. 3.20 Interesting use of Matrix Indices (20/06/2001) . . . . . . . . . . . . . . . . . . . 31 3.21 Eigenvalues example (20/06/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4 Applying functions, tapply, etc 32 4.1 Return multiple values from a function (12/02/2012) . . . . . . . . . . . . . . . . 32 4.2 Grab p values out of a list of signi
  • 11. cance tests (22/08/2000) . . . . . . . . . . . 32 4.3 ifelse usage (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4 Apply to create matrix of probabilities, one for each cell (14/08/2000) . . . . . 32 4.5 Outer. (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.6 Check if something is a formula/function (11/08/2000) . . . . . . . . . . . . . . . 33 4.7 Optimize with a vector of variables (11/08/2000) . . . . . . . . . . . . . . . . . . 33 4.8 slice.index, like in S+ (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5 Graphing 33 5.1 Adjust features with par before graphing (18/06/2001) . . . . . . . . . . . . . . . 33 5.2 Save graph output (03/21/2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.3 How to automatically name plot output into separate
  • 12. les (10/04/2001) . . . . . 36 5.4 Control papersize (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.5 Integrating R graphs into documents: LATEX and EPS or PDF (20/06/2001) . . . 37 5.6 Snapshot graphs and scroll through them (31/12/2001) . . . . . . . . . . . . . 37 5.7 Plot a density function (eg. Normal) (22/11/2000) . . . . . . . . . . . . . . . . . 37 5.8 Plot with error bars (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.9 Histogram with density estimates (14/08/2000) . . . . . . . . . . . . . . . . . . . 37 5.10 How can I overlay several line plots on top of one another? (09/29/2005) . . . 37 5.11 Create matrix of graphs (18/06/2001) . . . . . . . . . . . . . . . . . . . . . . . 39 5.12 Combine lines and bar plot? (07/12/2000) . . . . . . . . . . . . . . . . . . . . . . 39 5.13 Regression scatterplot: add
  • 13. tted line to graph (03/20/2014) . . . . . . . . . . . 40 5.14 Control the plotting character in scatterplots? (11/08/2000) . . . . . . . . . . . . 40 5.15 Scatterplot: Control Plotting Characters (men vs women, etc)g (11/11/2002) . . 41 5.16 Scatterplot with size/color adjustment (12/11/2002) . . . . . . . . . . . . . . . . 41 5.17 Scatterplot: adjust size according to 3rd variable (06/04/2001) . . . . . . . . . . 42 5.18 Scatterplot: smooth a line connecting points (02/06/2003) . . . . . . . . . . . . . 42 5.19 Regression Scatterplot: add estimate to plot (18/06/2001) . . . . . . . . . . . . . 42 5.20 Axes: controls: ticks, no ticks, numbers, etc (22/11/2000) . . . . . . . . . . . . . 42 5.21 Axes: rotate labels (06/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.22 Axes: Show formatted dates in axes (06/04/2001) . . . . . . . . . . . . . . . . . 43 5.23 Axes: Reverse axis in plot (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . 43 5.24 Axes: Label axes with dates (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . 44 5.25 Axes: Superscript in axis labels (11/08/2000) . . . . . . . . . . . . . . . . . . . . 44 5.26 Axes: adjust positioning (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . 44 5.27 Add error arrows to a scatterplot (30/01/2001) . . . . . . . . . . . . . . . . . . 44 5.28 Time Series: how to plot several lines in one graph? (06/09/2000) . . . . . . . 45 5.29 Time series: plot
  • 14. tted and actual data (11/08/2000) . . . . . . . . . . . . . . . . 45 5.30 Insert text into a plot (22/11/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.31 Plotting unbounded variables (07/12/2000) . . . . . . . . . . . . . . . . . . . . . 45 5.32 Labels with dynamically generated content/math markup (16/08/2000) . . . . . 45 5.33 Use math/sophisticated stu in title of plot (11/11/2002) . . . . . . . . . . . . . 46 5.34 How to color-code points in scatter to reveal missing values of 3rd variable? (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.35 lattice: misc examples (12/11/2002) . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.36 Make 3d scatterplots (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.37 3d contour with line style to re ect value (06/04/2001) . . . . . . . . . . . . . . . 47 3
  • 15. 5.38 Animate a Graph! (13/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.39 Color user-portion of graph background dierently from margin (06/09/2000) . . 47 5.40 Examples of graphing code that seem to work (misc) (11/16/2005)g . . . . . . . 48 6 Common Statistical Chores 51 6.1 Crosstabulation Tables (01/05/2005) . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.2 t-test (18/07/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.3 Test for Normality (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.4 Estimate parameters of distributions (12/02/2012) . . . . . . . . . . . . . . . . . 52 6.5 Bootstrapping routines (14/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.6 BY subgroup analysis of data (summary or model for subgroups)(06/04/2001) . 52 7 Model Fitting (Regression-type things) 53 7.1 Tips for specifying regression models (12/02/2002) . . . . . . . . . . . . . . . . . 53 7.2 Summary Methods, grabbing results inside an output object . . . . . . . . . . . 53 7.3 Calculate separate coecients for each level of a factor (22/11/2000) . . . . . . . 53 7.4 Compare
  • 16. ts of regression models (F test subset B's =0) (14/08/2000) . . . . . . 54 7.5 Get Predicted Values from a model with predict() (11/13/2005) . . . . . . . . . . 55 7.6 Polynomial regression (15/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . 57 7.7 Calculate p value for an F stat from regression (13/08/2000) . . . . . . . . . . 57 7.8 Compare
  • 17. ts (F test) in stepwise regression/anova (11/08/2000) . . . . . . . . . 57 7.9 Test signi
  • 18. cance of slope and intercept shifts (Chow test?) . . . . . . . . . . . . . 58 7.10 Want to estimate a nonlinear model? (11/08/2000) . . . . . . . . . . . . . . . . . 58 7.11 Quasi family and passing arguments to it. (12/11/2002) . . . . . . . . . . . . . . 58 7.12 Estimate a covariance matrix (22/11/2000) . . . . . . . . . . . . . . . . . . . . . 58 7.13 Control number of signi
  • 19. cant digits in output (22/11/2000) . . . . . . . . . . . . 59 7.14 Multiple analysis of variance (06/09/2000) . . . . . . . . . . . . . . . . . . . . . . 59 7.15 Test for homogeneity of variance (heteroskedasticity) (12/02/2012) . . . . . . . . 59 7.16 Use nls to estimate a nonlinear model (14/08/2000) . . . . . . . . . . . . . . . . 60 7.17 Using nls and graphing things with it (22/11/2000) . . . . . . . . . . . . . . . . . 60 7.18 2Log(L) and hypo tests (22/11/2000) . . . . . . . . . . . . . . . . . . . . . . . 60 7.19 logistic regression with repeated measurements (02/06/2003) . . . . . . . . . . . 61 7.20 Logit (06/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 7.21 Random parameter (Mixed Model) tips (01/05/2005) . . . . . . . . . . . . . . . . 61 7.22 Time Series: basics (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 7.23 Time Series: misc examples (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . 62 7.24 Categorical Data and Multivariate Models (04/25/2004) . . . . . . . . . . . . . . 62 7.25 Lowess. Plot a smooth curve (04/25/2004) . . . . . . . . . . . . . . . . . . . . . . 62 7.26 Hierarchical/Mixed linear models. (06/03/2003) . . . . . . . . . . . . . . . . . . 62 7.27 Robust Regression tools (07/12/2000) . . . . . . . . . . . . . . . . . . . . . . . . 63 7.28 Durbin-Watson test (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 7.29 Censored regression (04/25/2004) . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 8 Packages 63 8.1 What packages are installed on Paul's computer? . . . . . . . . . . . . . . . . . . 63 8.2 Install and load a package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 8.3 List Loaded Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 8.4 Where is the default R library folder? Where does R look for packages in a computer? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 8.5 Detach libraries when no longer needed (10/04/2001) . . . . . . . . . . . . . . . . 66 4
  • 20. 9 Misc. web resources 66 9.1 Navigating R Documentation (12/02/2012) . . . . . . . . . . . . . . . . . . . . . 66 9.2 R Task View Pages (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 9.3 Using help inside R(13/08/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 9.4 Run examples in R (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 10 R workspace 67 10.1 Writing, saving, running R code (31/12/2001) . . . . . . . . . . . . . . . . . . . . 67 10.2 .RData, .RHistory. Help or hassle? (31/12/2001) . . . . . . . . . . . . . . . . . . 68 10.3 Save Load R objects (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . 68 10.4 Reminders for object analysis/usage (11/08/2000) . . . . . . . . . . . . . . . . . 68 10.5 Remove objects by pattern in name (31/12/2001) . . . . . . . . . . . . . . . . . . 68 10.6 Save work/create a Diary of activity (31/12/2001) . . . . . . . . . . . . . . . . . 69 10.7 Customized Rpro
  • 21. le (31/12/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . 69 11 Interface with the operating system 69 11.1 Commands to system like change working directory (22/11/2000) . . . . . . . . 69 11.2 Get system time. (30/01/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 11.3 Check if a
  • 22. le exists (11/08/2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 69 11.4 Find
  • 23. les by name or part of a name (regular expression matching) (14/08/2001) 70 12 Stupid R tricks: basics you can't live without 70 12.1 If you are asking for help (12/02/2012) . . . . . . . . . . . . . . . . . . . . . . . . 70 12.2 Commenting out things in R
  • 24. les (15/08/2000) . . . . . . . . . . . . . . . . . . . 71 13 Misc R usages I
  • 25. nd interesting 71 13.1 Character encoding (01/27/2009) . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 13.2 list names of variables used inside an expression (10/04/2001) . . . . . . . . . . . 71 13.3 R environment in side scripts (10/04/2001) . . . . . . . . . . . . . . . . . . . . . 71 13.4 Derivatives (10/04/2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 1 Data Input/Output 1.1 Bring raw numbers into R (05/22/2012) This is truly easy. Suppose you've got numbers in a space-separated
  • 26. lemyData, with column names in the
  • 27. rst row (thats a header). Run myDataFrame r e a d . t a b l e ( ` `myData ' ' , header=TRUE) If you type ?read.table it tells about importing
  • 28. les with other delimiters. Suppose you have tab delimited data with blank spaces to indicate missing values. Do this: myDataFramer e a d . t a b l e ( myData , sep=n t , n a . s t r i n g s= , header=TRUE) Be aware than anybody can choose his/her own separator. I am fond of j because it seems never used in names or addresses (unlike just about any other character I've found). Suppose your data is in a compressed gzip
  • 30. le function to decompress on the y. Do this: myDataFrame r e a d . t a b l e ( g z f i l e ( myData.gz ) , header=TRUE) If you read columns in from separate
  • 31. les, combine into a data frame as: 5
  • 32. v a r i a b l e 1 scan ( f i l e 1 ) v a r i a b l e 2 scan ( f i l e 2 ) mydata cbind ( va r i abl e 1 , v a r i a b l e 2 ) #or use the equivalent: #mydata data.frame(variable1 , variable2) #Optionally save dataframe in R object file with: wr i t e . t a b l e (mydata , f i l e=f i l ename 3 ) 1.2 Basic notation on data access (12/02/2012) To access the columns of a data frame x with the column number, say x[,1], to get the
  • 33. rst column. If you know the column name, say pjVar1, it is the same as x$pjVar1 or x[, pjVar1]. Grab an element in a list as x[[1]]. If you just run x[1] you get a list, in which there is a single item. Maybe you want that, but I bet you really want x[[1]]. If the list elements are named, you can get them with x$pjVar1 or x[[pjVar1]]. For instance, if a data frame is nebdata then grab the value in row 1, column 2 with: nebdata [ 1 , 2 ] ## or to selectively take from column2 only when the column Volt equals ABal nebdata [ nebdata $Volt==ABal , 2 ] ( from Diego Kuonen) 1.3 Checkout the new Data Import/Export manual (13/08/2001) With R{1.2, the R team released a manual about how to get data into R and out of R. That's the
  • 34. rst place to look if you need help. It is now distributed with R. Run h e l p . s t a r t ( ) 1.4 Exchange data between R and other programs (Excel, etc) (01/21/2009) Patience is the key. If you understand the format in which your data is currently held, chances are good you can translate it into R with more or less ease. Most commonly, people seem to want to import Microsoft Excel spreadsheets. I believe there is an ODBC approach to this, but I think it is only for MS Windows. In the gdata package, which was formerly part of gregmisc, there is a function that can use Perl to import an Excel spreadsheet. If you install gdata, the function to use is called read.xls. You have to specify which sheet you want. That works well enough if your data is set in a numeric format inside Excel. If it is set with the GENERAL type, I've seen the imported numbers turn up as all asterixes. Other packages have appeared to oer Excel import ability, such as xlsReadWrite. In the R-help list, I also see reference to an Excel program addon called RExcel that can install an option in the Excel menus called Put R Dataframe. The home for that project is http://sunsite.univie.ac.at/rcom/ I usually proceed like this. Step 1. Use Excel to edit the sheet so that it is RECTANGULAR. It should have variable names in row 1, and it has numbers where desired and the string NA otherwise. It must have NO embedded formulae or other Excel magic. Be sure that the columns are declared with the proper Excel format. Numerical information should have a numerical type, while text should have text as the type. Avoid General. 6
  • 35. Step 2. First, try the easy route. Try gdata's read.xls method. As long as you tell it which sheet you want to read out of your data set, and you add header=T and whatever other options you'd like in an ordinary read.table usage, then it has worked well for us. Step 3. Suppose step 2 did not work. Then you have something irregular in your Excel sheet and you should proceed as follows. Either open Excel and clean up your sheet and try step 2 again, or take the more drastic step: Export the spread sheet to text format. File/Save As, navigate to csv or txt and choose any options like advanced format or text con
  • 36. gurable. Choose the delimiter you want. One option is the tab key. When I'm having trouble, I use the bar symbol, j, because there's little chance it is in use inside a column of data. If your columns have addresses or such, usage of the COMMA as a delimiter is very dangerous. After you save that
  • 37. le in text mode, open it in a powerful at text editor like Emacs and look through to make sure that 1) your variable names in the
  • 38. rst row do not have spaces or quotation marks or special characters, and 2) look to the bottom to make sure your spreadsheet program did not insert any crud. If so, delete it. If you see commas in numeric variables, just use the Edit/Search/replace option to delete them. Then read in your tab separated data with r e a d . t a b l e ( ` ` f i l ename ' ' , header=T, sep=``nt ' ' ) I have tested the foreign package by importing an SPSS
  • 39. le and it worked great. I've had great results importing Stata Data sets too. Here's a big caution for you, however. If your SPSS or Stata numeric variables have some value lables, say 98=No Answer and 99=Missing, then R will think that the variable is a factor, and it will convert it into a factor with 99 possible values. The foreign library commands for reading spss and dta
  • 40. les have options to stop R from trying to help with factors and I'd urge you to read the manual and use them. If you use read.spss, for example, setting use.value.labels=F will stop R from creating factors at all. If you don't want to go that far, there's an option max.value.labels that you can set to 5 or 10, and stop it from seeing 98=Missing and then creating a factor with 98 values. It will only convert variables that have fewer than 5 or 10 values. If you use read.dta (for Stata), you can use the option convert.factors=F. Also, if you are using read.table, you may have trouble if your numeric variables have any illegal values, such as letters. Then R will assume you really intend them to be factors and it will sometimes be tough to
  • 41. x. If you add the option as.is=T, it will stop that cleanup eort by R. At one time, the SPSS import support in foreign did not work for me, and so I worked out a routine of copying the SPSS data into a text
  • 42. le, just as described for Excel. I have a notoriously dicult time with SAS XPORT
  • 43. les and don't even try anymore. I've seen several email posts by Frank Harrel in r-help and he has some pretty strong words about it. I do have one working example of importing the Annenberg National Election Study into R from SAS and you can review that at http://pj.freefaculty.org/DataSets/ANES/2002. I wrote a long boring explanation. Honestly, I think the best thing to do is to
  • 44. nd a bridge between SAS and R, say use some program to convert the SAS into Excel, and go from there. Or just write the SAS data set to a
  • 45. le and then read into R with read.table() or such. 1.5 Merge data frames (04/23/2004) update:Merge is confusing! But if you study this, you will see everything in perfect clarity: x1 rnorm(100) x2 rnorm(100) x3 rnorm(100) 7
  • 46. x4 rnorm(100) ds1 data. f rame ( c i t y=rep ( 1 , 1 0 0 ) , x1=x1 , x2=x2 ) ds2 data. f rame ( c i t y=rep ( 2 , 1 0 0 ) , x1=x1 , x3=x3 , x4=x4 ) merge ( ds1 , ds2 , by.x=c ( ` ` c i t y ' ' , ` `x1 ' ' ) , by.y=c ( ` ` c i t y ' ' , ` `x1 ' ' ) , a l l=TRUE) The trick is to make sure R understands which are the common variables in the two datasets so it lines them up, and then all=T is needed to say that you don't want to throw away the variables that are only in one set or the other. Read the help page over and over, you eventually get it. More examples: exper iment data. f rame ( t imes = c ( 0 , 0 , 1 0 , 1 0 , 2 0 , 2 0 , 3 0 , 3 0 ) , expval = c ( 1 , 1 , 2 , 2 , 3 , 3 , 4 , 4 ) ) s imul data. f rame ( t imes = c ( 0 , 1 0 , 2 0 , 3 0 ) , s imul = c ( 3 , 4 , 5 , 6 ) ) I want a merged datatset like: t imes expval s imul 1 0 1 3 2 0 1 3 3 10 2 4 4 10 2 4 5 20 3 5 6 20 3 5 7 30 4 6 8 30 4 6 Suggestions merge ( experiment , s imul ) ( from Brian D. Ripley ) does all the work for you. Or consider: exp. s im data. f rame ( experiment , s imul=s imul $ s imul [match ( exper iment $ times , s imul $ t imes ) ] ) ( from Jim Lemon) I have dataframes like this: s t a t e count1 pe r c ent1 CA 19 0 . 3 4 TX 22 0 . 3 5 FL 11 0 . 2 4 OR 34 0 . 4 2 GA 52 0 . 6 2 MN 12 0 . 1 7 NC 19 0 . 3 4 s t a t e count2 pe r c ent2 FL 22 0 . 3 5 MN 22 0 . 3 5 CA 11 0 . 2 4 TX 52 0 . 6 2 And I want s t a t e count1 pe r c ent1 count2 pe r c ent2 CA 19 0 . 3 4 11 0 . 2 4 TX 22 0 . 3 5 52 0 . 6 2 FL 11 0 . 2 4 22 0 . 3 5 OR 34 0 . 4 2 0 0 GA 52 0 . 6 2 0 0 8
  • 47. MN 12 0 . 1 7 22 0 . 3 5 NC 19 0 . 3 4 0 0 ( from YuLing Wu ) In response, Ben Bolker said s t a t e 1 c ( ` `CA' ' , ` `TX' ' , ` `FL ' ' , ` `OR' ' , ` `GA' ' , ` `MN' ' , ` `NC' ' ) count1 c ( 1 9 , 2 2 , 1 1 , 3 4 , 5 2 , 1 2 , 1 9 ) pe r c ent1 c (0 .34 , 0 .35 , 0 .24 , 0 .42 , 0 .62 , 0 .17 , 0 . 3 4 ) s t a t e 2 c ( ` `FL ' ' , ` `MN' ' , ` `CA' ' , ` `TX' ' ) count2 c ( 2 2 , 2 2 , 1 1 , 5 2 ) pe r c ent2 c (0 .35 , 0 .35 , 0 .24 , 0 . 6 2 ) data1 data. f rame ( s tat e1 , count1 , pe r c ent1 ) data2 data. f rame ( s tat e2 , count2 , pe r c ent2 ) datac data1m match ( data1 $ s tat e1 , data2 $ s tat e2 , 0 ) datac $ count2 i f e l s e (m==0 ,0 , data2 $ count2 [m] ) datac $ pe r c ent2 i f e l s e (m==0 ,0 , data2 $ pe r c ent2 [m] ) If you didn't want to keep all the rows in both data sets (but just the shared rows) you could use merge ( data1 , data2 , by=1) 1.6 Add one row at a time (14/08/2000) Question: I would like to create an (empty) data frame withheadingsfor every column (column titles) and then put data row-by-row into this data frame (one row for every computation I will be doing), i.e. no. time temp pr e s sur e the headings 1 0 100 80 f i r s t r e s u l t 2 10 110 87 2nd r e s u l t . . . . . Answer: Depends if the cols are all numeric: if they are a matrix would be better. But if you insist on a data frame, here goes: If you know the number of results in advance, say, N, do this df data. f rame ( time=numeric (N) , temp=numeric (N) , pr e s sur e=numeric (N) ) df [ 1 , ] c ( 0 , 100 , 80) df [ 2 , ] c (10 , 110 , 87) . . . or m matrix ( nrow=N, ncol=3) colnames (m) c ( time , temp , pr e s sur e ) m[ 1 , ] c ( 0 , 100 , 80) m[ 2 , ] c (10 , 110 , 87) The matrix form is better size it only needs to access one vector (a matrix is a vector with attributes), not three. If you don't know the
  • 48. nal size you can use rbind to add a row at a time, but that is substantially less ecient as lots of re-allocation is needed. It's better to guess the size,
  • 49. ll in and then rbind on a lot more rows if the guess was too small.(from Brian Ripley) 1.7 Need yet another dierent kind of merge for data frames (11/08/2000) Convert these two
  • 50. les 9
  • 51. Fi l e 1 C A T Fi l e 2 1 2 34 56 2 3 45 67 3 4 56 78 ( from Stephen Arthur ) Into a new data frame that looks like: C A T 1 2 34 56 C A T 2 3 45 67 C A T 3 4 56 78 This works: r epcbind func t i on (x , y ) f nx nrow( x ) ny nrow( y ) i f (nxny) x apply (x , 2 , rep , l eng th=ny ) e l s e i f (nynx) y apply (y , 2 , rep , l eng th=nx ) cbind (x , y ) g ( from Ben Bolker ) 1.8 Check if an object is NULL (06/04/2001) NULL does not mean that something does not exist. It means that it exists, and it is nothing. X NULL This may be a way of clearing values assigned to X, or initializing a variable as nothing. Programs can check on whether X is null i f ( i s . n u l l ( x ) ) f #then...} If you load things, R does not warn you when they are not found, it records them as NULL. You have the responsibility of checking them. Use i s . n u l l ( l i s t $component ) to check a thing named component in a thing named list. Accessing non-existent dataframe columns with [ does give an error, so you could do that instead. data ( t r e e s ) t r e e s $ aardvark NULL t r e e s [ , aardvark ] Error in [.data.frame(trees, , aardvark) : subscript out of bounds (from Thomas Lumley) 1.9 Generate random numbers (12/02/2012) You want randomly drawn integers? Use Sample, like so: # If you mean sampling without replacement: sample ( 1 : 1 0 , 3 , r e p l a c e=FALSE) #If you mean with replacement: sample ( 1 : 1 0 , 3 , r e p l a c e=TRUE) ( from Bi l l Simpson ) 10
  • 52. Included with R are many univariate distributions, for example the Gaussian normal, Gamma, Binomial, Poisson, and so forth. Run ? r u n i f ? rnorm ?rgamma ? r p o i s You will see a distribution's functions are a base name like norm with pre
  • 53. x letters r, d, p, q. ˆ rnorm: draw pseudo random numbers from a normal ˆ dnorm: the density value for a given value of a variable ˆ pnorm: the cumulative probability density value for a given value ˆ qnorm: the quantile function: given a probability, what is the corresponding value of the variable? I made a long-ish lecture about this in my R workshop (http://pj.freefaculty.org/guides/ Rcourse/rRandomVariables) Multivariate distributions are not (yet) in the base of R, but they are in several packages, such as MASS and mvtnorm. Note, when you use these, it is necessary to specify a mean vector and a covariance matrix among the variables. Brian Ripley gave this example: with mvrnorm in package MASS (part of the VR bundle), mvrnorm( 2 , c ( 0 , 0 ) , matrix ( c (0 .25 , 0 .20 , 0 .20 , 0 . 2 5 ) , 2 ,2) ) If you don't want to use a contributed package to draw multivariate observations, you can approximate some using the univariate distributions in R itself. Peter Dalgaard observed a less general solution for this particular case would be rnorm( 1 , sd=s q r t (0 . 2 0 ) ) + rnorm( 2 , sd=s q r t (0 . 0 5 ) ) 1.10 Generate random numbers with a
  • 54. xed mean/variance (06/09/2000) If you generate random numbers with a given tool, you don't get a sample with the exact mean you specify. A generator with a mean of 0 will create samples with varying means, right? I don't know why anybody wants a sample with a mean that is exactly 0, but you can draw a sample and then transform it to force the mean however you like. Take a 2 step approach: R x rnorm(100 , mean = 5 , sd = 2) R x ( x mean( x ) ) / s q r t ( var ( x ) ) R mean( x ) [ 1 ] 1 .385177e16 R var ( x ) [ 1 ] 1 and now create your sample with mean 5 and sd 2: R x x*2 + 5 R mean( x ) [ 1 ] 5 R var ( x ) [ 1 ] 4 ( from Tor s ten.Hothorn ) 1.11 Use rep to manufacture a weighted data set (30/01/2001) 11
  • 55. x c ( 1 0 , 4 0 , 5 0 , 1 0 0 ) # income vector for instance w c ( 2 , 1 , 3 , 2 ) # the weight for each observation in x with the same rep (x ,w) [ 1 ] 10 10 40 50 50 50 100 100 ( from P. Malewski ) That expands a single variable, but we can expand all of the columns in a dataset one at a time to represent weighted data Thomas Lumley provided an example: Most of the functions that have weights have frequency weights rather than probability weights: that is, setting a weight equal to 2 has exactly the same eect as replicating the observation. expanded.dataa s .da t a . f r ame ( l apply ( compres sed.data , func t i on ( x ) rep (x , compr e s s ed.data $we ight s ) ) ) 1.12 Convert contingency table to data frame (06/09/2000) Given a 8 dimensional crosstab, you want a data frame with 8 factors and 1 column for fre-quencies of the cells in the table. R1.2 introduces a function as.data.frame.table() to handle this. This can also be done manually. Here's a function (it's a simple wrapper around expand.grid): d f i f y func t i on ( ar r , value.name = value , dn.names = names ( dimnames ( a r r ) ) ) f Ver s ion $ Id : d f i f y . s f u n , v 1 . 1 1995 /10/09 1 6 : 0 6 : 1 2 d3a061 Exp $ dn dimnames ( a r r a s . a r r a y ( a r r ) ) i f ( i s . n u l l (dn) ) s top ( Can ' t dataframei fy an ar ray wi thout dimnames ) names (dn) dn.names ans cbind ( expand.gr id (dn) , a s . v e c t o r ( a r r ) ) names ( ans ) [ ncol ( ans ) ] value.name ans g The name is short for data-frame-i-fy. For your example, assuming your multi-way array has proper dimnames, you'd just do: my.data. f rame d f i f y (my.array , value.name=``f r equency ' ' ) (from Todd Taylor) 1.13 Write: data in text
  • 56. le (31/12/2001) Say I have a command that produced a 28 x 28 data matrix. How can I output the matrix into a txt
  • 57. le (rather than copy/paste the matrix)? wr i t e . t a b l e (mat , f i l e= f i l e n ame . t x t ) Note MASS library has a function write.matrix which is faster if you need to write a numerical matrix, not a data frame. Good for big jobs. 2 Working with data frames: Recoding, selecting, aggregating 2.1 Add variables to a data frame (or list) (02/06/2003) If dat is a data frame, the column x1 can be added to dat in (at least 4) methods, dat$x1, dat[ , x1], dat[x1], or dat[[x1]]. Observe 12
  • 58. dat data. f rame ( a=c ( 1 , 2 , 3 ) ) dat [ , x1 ] c (12 , 23 , 44) dat [ x2 ] c (12 , 23 , 44) dat [ [ x3 ] ] c (12 , 23 , 44) dat a x1 x2 x3 1 1 12 12 12 2 2 23 23 23 3 3 44 44 44 There are many other ways, including cbind(). Often I plan to calculate variable names within a program, as well as the values of the variables. I think of this as generating new column names on the y. In r-help, I asked I keep
  • 59. nding myself in a situation where I want to calculate a variable name and then use it on the left hand side of an assignment. To me, this was a dicult problem. Brian Ripley pointed toward one way to add the variable to a data frame: i t e r a t i o n 9 newname pas t e ( run , i t e r a t i o n , sep=) mydf [ newname ] aColumn ## or , in one step: mydf [ pas t e ( run , i t e r a t i o n , sep=) ] aColumn ## for a list , same idea works , use double brackets myList [ [ pas t e ( run , i t e r a t i o n , sep=) ] ] aColumn And Thomas Lumley added: If you wanted to do something of this sort for which the above didn't work you could also learn about substitute(): eva l ( s u b s t i t u t e (myList $newColumnaColumn) , l i s t (newColumn=as.name ( varName ) ) ) 2.2 Create variable names on the y (10/04/2001) The previous showed how to add a column to a data frame on the y. What if you just want to calculate a name for a variable that is not in a data frame. The assign function can do that. Try this to create an object (a variable) named zzz equal to 32. a s s i g n ( z z z , 32) z z z [ 1 ] 32 In that case,I specify zzz, but we can use a function to create the variable name. Suppose you want a random variable name. Every time you run this, you get a new variable starting with a. a s s i g n ( pas t e ( a , round ( rnorm( 1 , 50 ,12) , 2) , sep=) , 324) I got a44.05: a44.05 [ 1 ] 324 2.3 Recode one column, output values into another column (12/05/2003) Please read the documentation for transform() and replace() and also learn how to use the magic of R vectors. The transform() function works only for data frames. Suppose a data frame is called mdf and you want to add a new variable newV that is a function of var1 and var2: 13
  • 60. mdf t rans form (mdf , newV=l o g ( var1 ) + var2 ) ) I'm inclined to take the easy road when I can. Proper use of indexes in R will help a lot, especially for recoding discrete valued variables. Some cases are particularly simple because of the way arrays are processed. Suppose you create a variable, and then want to reset some values to missing. Go like this: x rnorm(10000) x [ x 1 . 5 ] NA And if you don't want to replace the original variable, create a new one
  • 61. rst ( xnew - x ) and then do that same thing to xnew. You can put other variables inside the brackets, so if you want x to equal 234 if y equals 1, then x [ y==1 ] 234 Suppose you have v1, and you want to add another variable v2 so that there is a translation. If v1=1, you want v2=4. If v1=2, you want v2=4. If v1=3, you want v2=5. This reminds me of the old days using SPSS and SAS. I think it is clearest to do: v1 c ( 1 , 2 , 3 ) # now initialize v2 v2 rep( -9 , length(v1)) # now recode v2 v2[v1= =1] 4 v2[v1= =2]4 v2[v1= =3]5 v2[1] 4 4 5 Note that R's ifelse command can work too: xi f e l s e (x1.5 ,NA, x ) One user recently asked how to take data like a vector of names and convert it to numbers, and 2 good solutions appeared: y c ( OLDa , ALL , OLDc , OLDa , OLDb , NEW , OLDb , OLDa , ALL , . . . ) e l c ( OLDa , OLDb , OLDc , NEW , ALL) match (y , e l ) [ 1 ] 1 5 3 1 2 4 2 1 5 NA or f f a c t o r (x , l e v e l s=c ( OLDa , OLDb , OLDc , NEW , ALL) ) a s . i n t e g e r ( f ) [ 1 ] 1 5 3 1 2 4 2 1 5 I asked Mark Myatt for more examples: For example, suppose I get a bunch of variables coded on a scale 1 = no 6 = yes 8 = t i e d 9 = mi s s ing 10 = not a p p l i c a b l e . Recode that into a new variable name with 0=no, 1=yes, and all else NA. It seems like the replace() function would do it for single values but you end up with empty levels in factors but that can be
  • 62. xed by re-factoring the variable. Here is a basic recode() function: r e code func t i on ( var , old , new) f x r e p l a c e ( var , var==old , new) i f ( i s . f a c t o r ( x ) ) f a c t o r ( x ) e l s e x g For the above example: t e s t c ( 1 , 1 , 2 , 1 , 1 , 8 , 1 , 2 , 1 , 1 0 , 1 , 8 , 2 , 1 , 9 , 1 , 2 , 9 , 1 0 , 1 ) t e s t t e s t r e code ( t e s t , 1 , 0) t e s t r e code ( t e s t , 2 , 1) t e s t r e code ( t e s t , 8 , NA) t e s t r e code ( t e s t , 9 , NA) t e s t r e code ( t e s t , 10 , NA) t e s t Although it is probably easier to use replace(): 14
  • 63. t e s t c ( 1 , 1 , 2 , 1 , 1 , 8 , 1 , 2 , 1 , 1 0 , 1 , 8 , 2 , 1 , 9 , 1 , 2 , 9 , 1 0 , 1 ) t e s t t e s t r e p l a c e ( t e s t , t e s t ==8 j t e s t ==9 j t e s t ==10 , NA) t e s t r e p l a c e ( t e s t , t e s t ==1 , 0) t e s t r e p l a c e ( t e s t , t e s t ==2 , 1) t e s t I suppose a better function would take from and to lists as arguments: r e code func t i on ( var , from , to ) f x a s . v e c t o r ( var ) f o r ( i in 1 : l ength ( from) ) f x r e p l a c e (x , x==from [ i ] , to [ i ] ) g i f ( i s . f a c t o r ( var ) ) f a c t o r ( x ) e l s e x g For the example: t e s t c ( 1 , 1 , 2 , 1 , 1 , 8 , 1 , 2 , 1 , 1 0 , 1 , 8 , 2 , 1 , 9 , 1 , 2 , 9 , 1 0 , 1 ) t e s t t e s t r e code ( t e s t , c ( 1 , 2 , 8 : 1 0 ) , c ( 0 , 1 ) ) t e s t and it still works with single values. Suppose somebody gives me ascale from 1 to 100, and I want to collapse it into 10 groups, how do I go about it? Mark says: Use cut() for this. This cuts into 10 groups: t e s t t runc ( r u n i f (1000 ,1 ,100) ) groups cut ( t e s t , seq ( 0 , 1 0 0 , 1 0 ) ) t abl e ( t e s t , groups ) To get ten groups without knowing the minimum and maximum value you can use pretty(): groups cut ( t e s t , pr e t ty ( t e s t , 1 0 ) ) t abl e ( t e s t , groups ) You can specify the cut-points: groups cut ( t e s t , c ( 0 , 2 0 , 4 0 , 6 0 , 8 0 , 1 0 0 ) ) t abl e ( t e s t , groups ) And they don't need to be even groups: groups cut ( t e s t , c ( 0 , 3 0 , 5 0 , 7 5 , 1 0 0 ) ) t abl e ( t e s t , groups ) Mark added, I think I will add this sort of thing to the REX pack. 2003{12{01, someone asked how to convert a vector of numbers to characters, such as i f x [ i ] 250 then c o l [ i ] = `` red ' ' e l s e i f x [ i ] 500 then c o l [ i ] = `` blue ' ' and so forth. Many interesing answers appeared in R-help. A big long nested ifelse would work, as in: x . c o l i f e l s e ( x 250 , red , i f e l s e ( x500 , blue , i f e l s e ( x750 , green , black ) ) ) There were some nice suggestions to use cut, such as Gabor Grothendeick's advice: The following results in a character vector: c o l o u r s c ( red , blue , green , back ) c o l o u r s [ cut (x , c (Inf , 2 5 0 , 5 0 0 , 7 0 0 , I n f ) , r i g h t=F, lab=FALSE) ] While this generates a factor variable: c o l o u r s c ( red , blue , green , black ) cut (x , c (Inf , 2 5 0 , 5 0 0 , 7 0 0 , I n f ) , r i g h t=F, lab=c o l o u r s ) 15
  • 64. 2.4 Create indicator (dummy) variables (20/06/2001) 2 examples: c is a column, you want dummy variable, one for each valid value. First, make it a factor, then use model.matrix(): x c ( 2 , 2 , 5 , 3 , 6 , 5 ,NA) xf f a c t o r (x , l e v e l s =2:6) model .mat r ix ( xf1 ) xf2 xf3 xf4 xf5 xf6 1 1 0 0 0 0 2 1 0 0 0 0 3 0 0 0 1 0 4 0 1 0 0 0 5 0 0 0 0 1 6 0 0 0 1 0 a t t r ( , a s s i g n ) [ 1 ] 1 1 1 1 1 (from Peter Dalgaard) Question: I have a variable with 5 categories and I want to create dummy variables for each category. Answer: Use row indexing or model.matrix. f f f a c t o r ( sample ( l e t t e r s [ 1 : 5 ] , 25 , r e p l a c e=TRUE) ) diag ( n l e v e l s ( f f ) ) [ f f , ] #or model .mat r ix (f f 1) ( from Brian D. Ripley ) 2.5 Create lagged values of variables for time series regression (05/22/2012) Peter Dalgaard explained, the simple way is to create a new variable which shifts the response, i.e. y shf t c ( y [1 ] , NA) # pad with missing summary( lm( y shf t x + y ) ) Alternatively, lag the regressors: N l eng th ( x ) xlag c (NA, x [ 1 : (N1) ] ) ylag c (NA, y [ 1 : (N1) ] ) summary( lm( y xlag + ylag ) ) Dave Armstrong (personal communication, 2012/5/21) brought to my attention the following problem in cross sectional time series data. Simply inserting an NA will lead to disaster because we need to insert a lag within each unit. There is also a bad problem when the time points observed for the sub units are not all the same. He suggests the following dat data. f rame ( ccode = c ( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 ) , year = c (1980 , 1982 , 1983 , 1984 , 1981:1984 , 1980:1982 , 1984) , x = seq ( 2 , 2 4 , by=2) ) dat $ obs 1 : nrow( dat ) dat $ lagobs match ( pas t e ( dat $ ccode , dat $year1 , sep= . ) , pas t e ( dat $ ccode , dat $ year , sep= . ) ) dat $ l a g x dat $x [ dat $ lagobs ] 16
  • 65. Run this example, be surprised, then email me if you
  • 66. nd a better way. I haven't. This seems like a dicult problem to me and if I had to do it very often, I am pretty sure I'd have to re-think what it means to lag when there are years missing in the data. Perhaps this is a rare occasion where interpolation might be called for. 2.6 How to drop factor levels for datasets that don't have observations with those values? (08/01/2002) The best way to drop levels, BTW, is pr obl em. f a c t o r pr obl em. f a c t o r [ , drop=TRUE] ( from Brian D. Ripley ) That has the same eect as running the pre-existing problem.factor through the function factor: pr obl em. f a c t o r f a c t o r ( pr obl em. f a c t o r ) 2.7 Select/subset observations out of a dataframe (08/02/2012) If you just want particular named or numbered rows or columns, of course, that's easy. Take columns x1, x2, and x3. datSubset1 dat [ , c ( x1 , x2 , x3 ) ] If those happen to be columns 44, 92, and 93 in a data frame, datSubset1 dat [ , c (44 , 92 , 93) ] Usually, we want observations that are conditional. Want to take observations for which variable Y is greater than A and less or equal than B: X[Y A Y B ] Suppose you want observations with c=1 in df1. This makes a new data frame. df2 df1 [ df1 $ c==1 , ] and note that indexing is pretty central to using S (the language), so it is worth learning all the ways to use it. (from Brian Ripley) Or use match select values from the column d by taking the ones that match the values of another column, as in d t ( ar ray ( 1 : 2 0 , dim=c ( 2 , 1 0 ) ) ) i c ( 1 3 , 5 , 1 9 ) d [match ( i , d [ , 1 ] ) , 2 ] [ 1 ] 14 6 20 ( from Peter Wolf ) Till Baumgaertel wanted to select observations for men over age 40, and sex was coded either m or M. Here are two working commands: # 1.) maleOver40 subs e t (d , sex %in% c ( m , M) age 40) # 2.) maleOver40 d [ ( d$ sex==m j d$ sex==M) d$ age 40 , ] To decipher that, do ?match and ?%in to
  • 67. nd out about the %in% operator. If you want to grab the rows for which the variable subject is 15 or 19, try: df1 $ subj e c t %in% c ( 1 9 , 1 5 ) 17
  • 68. to get a True/False indication for each row in the data frame, and you can then use that output to pick the rows you want: i n d i c a t o r df1 $ subj e c t %in% c ( 1 9 , 1 5 ) df1 [ indi c a t o r , ] How to deal with values that are already marked as missing? If you want to omit all rows for which one or more column is NA (missing): x2 na.omi t ( x ) produces a copy of the data frame x with all rows that contain missing data removed. The func-tion na.exclude could be used also. For more information on missings, check help : ?na.exclude. For exclusion of missing, Peter Dalgaard likes subs e t (x , c ompl e t e . c a s e s ( x ) ) or x [ c ompl e t e . c a s e s ( x ) , ] adding is.na(x) is preferable to x !=NA 2.8 Delete
  • 69. rst observation for each element in a cluster of observations (11/08/2000) Given data like: 1 ABK 19910711 11 .1867461 0 .0000000 108 2 ABK 19910712 11 .5298979 11 .1867461 111 6 CSCO 19910102 0 .1553819 0 .0000000 106 7 CSCO 19910103 0 .1527778 0 .1458333 166 remove the
  • 70. rst observation for each value of the sym variable (the one coded ABK,CSCO, etc). . If you just need to remove rows 1, 6, and 13, do: newhi lodata hi l oda t a [c ( 1 , 6 , 1 3 ) , ] To solve the more general problem of omitting the
  • 71. rst in each group, assuming sym is a factor, try something like newhi lodata subs e t ( hi lodata , d i f f ( c ( 0 , a s . i n t e g e r ( sym) ) ) != 0) (actually, the as.integer is unnecessary because the c() will unclass the factor automagically) (from Peter Dalgaard) Alternatively, you could use the match function because it returns the
  • 72. rst match. Suppose jm is the data set. Then: match ( unique ( jm$sym) , jm$sym) [ 1 ] 1 6 13 jm jm[ match( unique ( jm$sym) , jm$sym) , ] (from Douglas Bates) As Robert pointed out to me privately: duplicated() does the trick subs e t ( hi lodata , dupl i c a t ed ( sym) ) has got to be the simplest variant. 2.9 Select a random sample of data (11/08/2000) sample (N, n , r e p l a c e=FALSE) and seq (N) [ rank ( r u n i f (N) ) n ] 18
  • 73. is another general solution. (from Brian D. Ripley) 2.10 Selecting Variables for Models: Don't forget the subset function (15/08/2000) You can manage data directly by deleting lines or so forth, but subset() can be used to achieve the same eect without editing the data at all. Do ?select to
  • 74. nd out more. Subset is also an option in many statistical functions like lm. Peter Dalgaard gave this example, using the builtin dataset airquality. data ( a i r q u a l i t y ) names ( a i r q u a l i t y ) lm(Ozone. , data=subs e t ( a i r q u a l i t y , s e l e c t=Ozone :Month) ) lm(Ozone. , data=subs e t ( a i r q u a l i t y , s e l e c t=c (Ozone :Wind ,Month) ) ) lm(Ozone.Temp , data=subs e t ( a i r q u a l i t y , s e l e c t=Ozone :Month) ) The . on the RHS of lm means all variables and the subset command on the rhs picks out dierent variables from the dataset. x1:x2means variables between x1 and x2, inclusive. 2.11 Process all numeric variables, ignore character variables? (11/02/2012) This was a lifesaver for me. I ran simulations in C and there were some numbers and some character variables. The output went into lastline.txt and I wanted to import the data and leave some in character format, and then calculate summary indicators for the rest. data r e a d . t a b l e ( ` ` l a s t l i n e . t x t ' ' , header=TRUE, a s . i s = TRUE) i n d i c e s 1 : dim( data ) [ 2 ] i n d i c e s na.omi t ( i f e l s e ( i n d i c e s * sapply ( data , i s . n ume r i c ) , indi c e s ,NA) ) mean sapply ( data [ , i n d i c e s ] ,mean) sd sapply ( data [ , i n d i c e s ] , sd ) If you just want a new dataframe with only the numeric variables, do: sapply ( dataframe , i s . n ume r i c ) # or which ( sapply ( data. f rame , i s . n ume r i c ) ) (Thomas Lumley) 2.12 Sorting by more than one variable (06/09/2000) Can someone tell me how can I sort a list that contains duplicates (name) but keeping the duplicates together when sorting the values. name M 1234 8 1234 8 . 3 4321 9 4321 8 . 1 I also would like to set a cut-o, so that anything below a certain values will not be sorted.(from Kong, Chuang Fong) I take it that the cuto is on the value of M. OK, suppose it is the value of `co'. s o r t . i n d orde r (name , pmax( c o f f , M) ) # sorting index name name [ s o r t . i n d ] M M[ s o r t . i n d ] Notice how using pmax() for parallel maximum you can implement the cuto by raising all values below the mark up to the mark thus putting them all into the same bin as far as sorting is concerned. If your two variables are in a data frame you can combine the last two steps into one, of course. 19
  • 75. s o r t . i n d orde r ( dat $name , pmax( c o f f , dat $M) ) dat dat [ s o r t . i n d , ] In fact it's not long before you are doing it all in one step: dat dat [ orde r ( dat $name , pmax( c o f f , dat $M) ) , ] ( from Bi l l Venables ) I want the ability to sort a data frame lexicographically according to several variables Here's how: spshe e t [ orde r (name , age , z ip ) , ] ( from Peter Dalgaard ) 2.13 Rank within subgroups de
  • 76. ned by a factor (06/09/2000) Read the help for by() by ( x [ 2 ] , x$group , rank ) x$ group : A [ 1 ] 4 . 0 1 . 5 1 . 5 3 . 0 x$ group : B [ 1 ] 3 2 1 c ( by ( x [ 2 ] , x$group , rank ) , r e c u r s i v e=TRUE) A1 A2 A3 A4 B1 B2 B3 4 . 0 1 . 5 1 . 5 3 . 0 3 . 0 2 . 0 1 . 0 (from Brian D. Ripley) 2.14 Work with missing values (na.omit, is.na, etc) (15/01/2012) NA is a symbol, not just a pair of letters. Values for any type of variable, numeric or character, may be set to NA. Thus, in a sense, management of NA values is just like any other recoding exercise. Find the NAs and change them, or
  • 77. nd suspicious values and set them to NA. Matt Wiener (2005{01{05) oered this example, which inserts some NA's, and then converts them all to 0. temp1 matrix ( r u n i f (25) , 5 , 5) temp1 [ temp1 0 . 1 ] NA temp1 [ i s . n a ( temp1 ) ] 0 When routines encounter NA values, special procedures may be called for. For example, the mean function will return NA when it encounters any NA values. That's a cold slap in the face to most of us. To avoid that, run mean(x, na.rm=TRUE), so the NA's are automatically ignored. Many functions have that same default (check max, min). Other functions, like lm, will default to omit missings, unless the user has put other settings in place. The environment options() has settings that aect the way missings are handled. Suppose you are a cautious person. If you want lm to fail when it encounters an NA, then the lm argument na.action can be set to na.fail. Before that, however, it will be necessary to change or omit missings. To manually omit missings, then sure missings are omitted by telling lm to fail if it
  • 78. nds NAs. t h i s d f na.omi t ( i n s u r e [ , c ( 1 , 1 9 : 3 9 ) ] ) body.m lm(BI.PPrem . , data = t h i s d f , n a . a c t i o n = n a . f a i l ) The function is.na() returns a TRUE for missings, and that can be used to spot and change them. The table function defaults to exclude NA from the reported categories. To force it to report NA as a valid value, add the option exclude=NULL in the table command. 20
  • 79. 2.15 Aggregate values, one for each line (16/08/2000) Question: I want to read values from a text
  • 80. le - 200 lines, 32 oats per line - and calculate a mean for each of the 32 values, so I would end up with an x vector of 1{200 and a y vector of the 200 means. Peter Dalgaard says do this: y apply ( a s .ma t r ix ( r e a d . t a b l e ( my f i l e ) ) , 1 , mean) x seq ( along=y ) (possibly adding a couple of options to read.table, depending on the
  • 81. le format) 2.16 Create new data frame to hold aggregate values for each factor (11/08/2000) How about this: Zaggr egat e (X, f , sum) (assumes all X variables can be summed) Or: [If] X contains also factors. I have to select variables for which summary statistics have to be computed. So I used: Z data. f rame ( f=l e v e l s ( f ) , x1=a s . v e c t o r ( tapply ( x1 , f , sum) ) ) ( from Wolfgang Ko l l e r ) 2.17 Selectively sum columns in a data frame (15/01/2012) Given 10 20 23 44 33 10 20 33 23 67 and you want 10 20 56 67 100 try this, where it assumes that data.dat has two columns std and cf that we do not want to sum: datr e a d . t a b l e ( da t a .da t , header=TRUE) aggr egat e ( dat [ ,( 1 : 2 ) ] , by=l i s t ( s td=dat $ std , c f=dat $ c f ) , sum) note the
  • 82. rst two columns are excluded by [,(1:2)] and the by option preserves those values in the output. 2.18 Rip digits out of real numbers one at a time (11/08/2000) I want to take out the
  • 83. rst decimal place of each output, plot them based on their appearance frequencies. Then take the second decimal place, do the same thing. a l o g ( 1 : 1 0 0 0 ) d1f l o o r (10 * ( af loor ( a ) ) ) # first decimal par (mfrow=c ( 2 , 2 ) ) h i s t ( d1 , br eaks=c (1 : 9 ) ) t abl e ( d1 ) d2f l o o r (10 * (10 * af loor (10 *a ) ) ) # second decimal h i s t ( d2 , br eaks=c (1 : 9 ) ) t abl e ( d2 ) ( from Yudi Pawitan ) 21
  • 84. x 1:1000 ndig 6 ( i i a s . i n t e g e r (10^ ( ndig1 ) * l o g ( x ) ) ) [ 1 : 7 ] ( c i formatC( i i , f l a g=0 , wid= ndig ) ) [ 1 : 7 ] cm t ( sapply ( c i , func t i on ( cc ) s t r s p l i t ( cc ,NULL) [ [ 1 ] ] ) ) cm [ 1 : 7 , ] apply (cm, 2 , t abl e ) #-- Nice tables # The plots : par (mfrow= c ( 3 , 2 ) , lab = c ( 1 0 , 1 0 , 7 ) ) f o r ( i in 1 : ndig ) h i s t ( a s . i n t e g e r (cm[ , i ] ) , br eaks = .5 + 0 : 1 0 , main = pas t e ( Di s t r i b u t i o n o f , i , th d i g i t ) ) ( from Martin Maechler ) 2.19 Grab an item from each of several matrices in a List (14/08/2000) Let Z denote the list of matrices. All matrices have the same order. Suppose you need to take element [1,2] from each. l apply (Z , func t i on ( x ) x [ 1 , 2 ] ) should do this, giving a list. Use sapply if you want a vector. (Brian Ripley) 2.20 Get vector showing values in a dataset (10/04/2001) x l e v e l s s o r t ( unique ( x ) ) 2.21 Calculate the value of a string representing an R command (13/08/2000) In R, they use the termexpression to refer to a command that is written down as a character string, but is not yet submitted to the processor. That has to be parsed, and then evaluated. Example: St r ing2Eval A v a l i d R s tatement eva l ( par s e ( t ext = St r ing2Eval ) ) ( from Mark Myatt ) Using eval is something like saying to R, I'd consider typing this into the command line now, but I typed it before and it is a variable now, so couldn't you go get it and run it now? # Or eva l ( par s e ( t ext= l s ( ) ) ) # Or eva l ( par s e ( t ext = x [ 3 ] 5 ) ) ( from Peter Dalgaard ) Also check out substitute(), as.name() et al. for other methods of manipulating expressions and function calls 2.22 Which can grab the index values of cases satisfying a test (06/04/2001) To analyze large vectors of data using boxplot to
  • 85. nd outliers, try: which ( x==boxplot (x , range=1)$ out ) 22
  • 86. 2.23 Find unique lines in a matrix/data frame (31/12/2001) Jason Liao writes: I have 10000 integer triplets stored in A[1:10000, 1:3]. I would like to
  • 87. nd the unique triplets among the 10000 ones with possible duplications. Peter Dalgaard answers, As of 1.4.0 (!): unique ( a s .da t a . f r ame (A) ) 3 Matrices and vector operations 3.1 Create a vector, append values (01/02/2012) Some things in R are so startlingly simple and dierent from other programs that the experts have a dicult time understanding our misunderstanding. Mathematically, I believe a vector is a column in which all elements are of the same type (e.g., real numbers). (1,2,3) is a vector of integers as well as a column in the data frame sense. If you put dissimilar-typed items into an R column, it creates a vector that reduces the type of all values to the lowest type. Example: a c ( a , 1 , 2 , 3 , h e l l o ) i s . v e c t o r ( a ) [ 1 ] TRUE i s . c h a r a c t e r ( a ) [ 1 ] TRUE as .nume r i c ( a ) [ 1 ] NA 1 2 3 NA Warning message : NAs int roduc ed by c o e r c i on This shows that an easy way to create a vector is with the concatenate function, which is used so commonly it is known only as c. As long as all of your input is of the same type, you don't hit any snags. Note the is.vector() function will say this is a vector: v c ( 1 , 2 , 3) i s . v e c t o r ( v ) [ 1 ] TRUE i s . n ume r i c ( v ) [ 1 ] TRUE The worst case scenario is that it manufactures a list. The double brackets [[]] are the big signal that you really have a list: bb c ( 1 , c , 4 ) i s . v e c t o r (bb) [ 1 ] TRUE bb [ [ 1 ] ] [ 1 ] 1 [ [ 2 ] ] .Pr imi t i v e ( c ) [ [ 3 ] ] [ 1 ] 4 i s . c h a r a c t e r (bb) [ 1 ] FALSE i s . i n t e g e r (bb) [ 1 ] FALSE i s . l i s t (bb) [ 1 ] TRUE 23
  • 88. To make sure you get what you want, you can use the vector() function, and then tell it what type of elements you want in your vector. Please note it puts a false value in each position, not a missing. ve c t o r (mode= i n t e g e r ,10) [ 1 ] 0 0 0 0 0 0 0 0 0 0 ve c t o r (mode=double ,10) [ 1 ] 0 0 0 0 0 0 0 0 0 0 ve c t o r (mode= l o g i c a l ,10) [ 1 ] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE There are also methods to directly create the particular kind of vector you want (factor, char-acter, integer, etc). For example, if you want a real-valued vector, there is a shortcut method numeric() x numeric (10) x [ 1 ] 0 0 0 0 0 0 0 0 0 0 If you want to add values onto the end of a vector, you have options! If you want to add a single value onto the vector, try: r rnorm( 1 ) v append (v , r ) The length() command can not only tell you how long a vector is, but it can be used to RESIZE it. This command says if the length of v has reached the value vlen, then expand the length by doubling it. i f ( vl en==l ength ( v ) ) l ength ( v ) 2* l ength ( v ) These were mentioned in an email from Gabor Grothendieck to the list on 2005{01{04. 3.2 How to create an identity matrix? (16/08/2000) diag (n) Or, go the long way: nc ( 5 ) I matrix ( 0 , nrow=n , ncol=n) I [ row( I )==c o l ( I ) ] 1 ( from E.D. I s a i a ) 3.3 Convert matrixm to one long vector (11/08/2000) dim(m)NULL or c (m) ( from Peter Dalgaard ) 3.4 Creating a peculiar sequence (1 2 3 4 1 2 3 1 2 1) (11/08/2000) I don't know why this is useful, but it shows some row and matrix things: I ended up using Brian Ripley's method, as I got it
  • 89. rst and it worked, ie. A matrix ( 1 , n1 , n1) rA row(A) rA[ rA + c o l (A) n ] 24
  • 90. However, thanks to Andy Royle I have since discovered that there is a much more simple and sublime solution: n 5 sequence ( (n1) : 1 ) [ 1 ] 1 2 3 4 1 2 3 1 2 1 ( from Karen Kotschy ) 3.5 Select every n'th item (14/08/2000) extract every nth element from a very long vector, vec? We need to create an index vector to indicate which values we want to take. The seq function creates a sequence that can step from one value to another, by a certain increment. seq (n , l eng th ( vec ) , by=n) ( Brian Ripley ) seq (1 ,11628 , l ength=1000) will give 1000 evenly spaced numbers from 1:11628 that you can then index with. (from Thomas Lumley) My example. Use the index vector to take what we need: vec rnorm(1999) newvec vec [ seq ( 1 , l eng th ( vec ) , 200) ] newvec [ 1 ] 0 .2685562 1 .8336569 0 .1371113 0 .2204333 1.2798172 0 .3337282 [ 7 ] 0.2366385 0 .5060078 0 .9680530 1 .2725744 This shows the items in vec at indexes (1, 201, 401, . . . , 1801) 3.6 Find index of a value nearest to 1.5 in a vector (11/08/2000) n 1000 x s o r t ( rnorm(n) ) x0 1 . 5 dx abs (xx0) which ( dx==min( dx ) ) (from Jan Schelling) which ( abs ( x 1 . 5 )==min( abs ( x 1 . 5 ) ) ) (from Uwe Ligges) 3.7 Find index of nonzero items in vector (18/06/2001) which ( x !=0) ( from Uwe Ligge s ) r f i n d func t i on ( x ) seq ( along=x ) [ x != 0 ] ( from Brian D. Ripley ) Concerning speed and memory eciency I
  • 91. nd a s . l o g i c a l ( x ) is better than x !=0 and 25
  • 92. seq ( along=x ) [ a s . l o g i c a l ( x ) ] is better than which ( a s . l o g i c a l ( x ) ) thus which ( x !=0) is shortest and r f i n d func t i on ( x ) seq ( along=x ) [ a s . l o g i c a l ( x ) ] seems to be computationally most ecient (from Jens Oehlschlagel-Akiyoshi) 3.8 Find index of missing values (15/08/2000) Suppose the vector Pes has 600 observations. Don't do this: ( 1 : 6 0 0 ) [ i s . n a ( Pes ) ] The `approved' method is seq ( along=Pes ) [ i s . n a ( Pes ) ] In this case it does not matter as the subscript is of length 0, but it has oored enough library/- package writers to be worth thinking about. (from Brian Ripley) However, the solution I gave which ( i s . n a ( Pes ) ) is the one I still really recommend; it does deal with 0-length objects, and it keeps names when there are some, and it has an `arr.ind = FALSE' argument to return array indices instead of vector indices when so desired. (from Martin Maechler) 3.9 Find index of largest item in vector (16/08/2000) A[ which (A==max(A, na.rm=TRUE) ) ] 3.10 Replace values in a matrix (22/11/2000) tmat matrix ( rep (0 ,3 * 3) , ncol=3) tmat [ , 1 ] [ , 2 ] [ , 3 ] [ 1 , ] 0 0 0 [ 2 , ] 0 0 0 [ 3 , ] 0 0 0 tmat [ tmat==0 ] 1 tmat [ , 1 ] [ , 2 ] [ , 3 ] [ 1 , ] 1 1 1 [ 2 , ] 1 1 1 [ 3 , ] 1 1 1 ( from Jan Goebel ) If la is a data frame, you have to coerce the data into matrix form, with: l a a s .ma t r ix ( l a ) l a [ l a==0 ] 1 Try this: 26
  • 93. l a i f e l s e ( l a==0 , 1 , l a ) (from Martyn Plummer) 3.11 Delete particular rows from matrix (06/04/2001) x matrix ( 1 : 1 0 , , 2 ) x [ x [ , 1 ] %in% c ( 2 , 3 ) , ] x [ ! x [ , 1 ] %in% c ( 2 , 3 ) , ] (from Peter Malewski) mat [ ! (mat$ f i r s t %in% 7 13 : 71 5 ) , ] (from Peter Dalgaard ) 3.12 Count number of items meeting a criterion (01/05/2005) Apply length() to results of which() described in previous question, as in l eng th (which (v7) ) # or sum( v7 , na.rm=TRUE) If you apply sum() to a matrix, it will scan over all columns. To focus on a particular column, use subscripts like sum(v[,1]0) or such. If you want separate counts for columns, there's a method colSums() that will count separately for each column. 3.13 Compute partial correlation coecients from correlation matrix (08/12/2000) I need to compute partial correlation coecients between multiple variables (correlation be-tween two paired samples with the eects of all other variables partialled out)? (from Kaspar P ugshaupt) Actually, this is quite straightforward. Suppose that R is the correlation matrix among the variables. Then, Rinvs o l v e (R) Ddiag (1 / s q r t ( diag (Rinv ) ) ) P D %*% Rinv %*% D The o-diagonal elements of P are then the partial correlations between each pair of variables partialed for the others. (Why one would want to do this is another question.) (from John Fox). In general you invert the variance-covariance matrix and then rescale it so the diagonal is one. The o-diagonal elements are the negative partial correlation coecients given all other variables. pcor2 func t i on ( x ) f conc s o l v e ( var ( x ) ) r e s i d . s d 1/ s q r t ( diag ( conc ) ) pcc sweep ( sweep ( conc , 1 , r e s i d . s d , * ) , 2 , r e s i d . s d , * ) pcc g pcor2 ( cbind ( x1 , x2 , x3 ) ) see J. Whittaker's book Graphical models in applied multivariate statistics (from Martyn Plummer) This is the version I'm using now, together with a test for signi
  • 94. cance of each coecient (H0: coe=0): 27
  • 95. f . p a r c o r func t i on (x , t e s t = F, p = 0 . 0 5 ) f nvar ncol ( x ) ndata nrow( x ) conc s o l v e ( cor ( x ) ) r e s i d . s d 1/ s q r t ( diag ( conc ) ) pcc sweep ( sweep ( conc , 1 , r e s i d . s d , * ) , 2 , r e s i d . s d , * ) colnames ( pcc ) rownames ( pcc ) colnames ( x ) i f ( t e s t ) f t . d f ndata nvar t pcc / s q r t ( ( 1 pcc^ 2) / t . d f ) pcc l i s t ( c o e f s = pcc , s i g n i f i c a n t = t qt (1 (p/ 2) , df = t . d f ) ) g r e turn ( pcc ) g (from Kaspar P ugshaupt) 3.14 Create a multidimensional matrix (R array) (20/06/2001) Brian Ripley said: my.arrayar ray ( 0 , dim=c ( 1 0 , 5 , 6 , 8 ) ) will give you a 4-dimensional 10 x 5 x 6 x 8 array. Or a r r a y . t e s t ar ray ( 1 : 6 4 , c ( 4 , 4 , 4 ) ) a r r a y . t e s t [ 1 , 1 , 1 ] 1 a r r a y . t e s t [ 4 , 4 , 4 ] 64 3.15 Combine a lot of matrices (20/06/2001) If you have a list of matrices, it may be tempting to take them one by one and use rbind to stack them all together. Don't! It is really slow because of inecient memory allocation. Better to do d o . c a l l ( rbind , l i s tOfMa t r i c e s ) In the WorkingExamples folder, I wrote up a longish example of this, stackListItems.R. 3.16 Create neighbormatrices according to speci
  • 96. c logics (20/06/2001) Want a matrix of 0s and 1s indicating whether a cell has a neighbor at a location: N 3 x matrix ( 1 : (N^ 2) , nrow=N, ncol=N) r owd i f f func t i on (y , z ,mat ) abs ( row(mat ) [ y ]row(mat ) [ z ] ) c o l d i f f func t i on (y , z ,mat ) abs ( c o l (mat ) [ y ] col (mat ) [ z ] ) r o o k . c a s e func t i on (y , z ,mat ) f c o l d i f f (y , z ,mat )+r owd i f f (y , z ,mat )==1g b i s h o p . c a s e func t i on (y , z ,mat ) f c o l d i f f (y , z ,mat )==1 r owd i f f (y , z ,mat )==1g que en. c a s e func t i on (y , z ,mat ) f r o o k . c a s e (y , z ,mat ) j b i s h o p . c a s e (y , z ,mat ) g matrix ( as .nume r i c ( sapply (x , func t i on ( y ) sapply (x , r o ok. c a s e , y ,mat=x ) ) ) , ncol=N^ 2 , nrow =N^ 2) matrix ( as .nume r i c ( sapply (x , func t i on ( y ) sapply (x , bi shop. c a s e , y ,mat=x ) ) ) , ncol=N^ 2 , nrow=N^ 2) 28
  • 97. matrix ( as .nume r i c ( sapply (x , func t i on ( y ) sapply (x , que en. cas e , y ,mat=x ) ) ) , ncol=N^ 2 , nrow=N^ 2) (from Ben Bolker) 3.17 Matching two columns of numbers by a key variable (20/06/2001) The question was: I have a matrix of predictions from an proportional odds model (using the polr function in MASS), so the columns are the probabilities of the responses, and the rows are the data points. I have another column with the observed responses, and I want to extract the probabilities for the observed responses. As a toy example, if I have x matrix ( c ( 1 , 2 , 3 , 4 , 5 , 6 ) , 2 , 3 ) y c ( 1 , 3 ) and I want to extract the numbers in x[1,1] and x[2,3] (the columns being indexed from y), what do I do? Is x[cbind(seq(along=y), y)] what you had in mind? The key is de
  • 98. nitely matrix indexing. (from Brian Ripley) 3.18 Create Upper or Lower Triangular matrix (06/08/2012) The most direct route to do that is to create a matrix X, and then use the upper.tri() or lower.tri() functions to take out the parts that are needed. To do it in one step is trickly. Whenever it comes up in r-help, there is a dierent answer. Suppose you want a matrix like a^0 0 0 a^1 a^0 0 a^2 a^1 a^0 I don't know why you'd want that, but look at the dierences among answers this elicited. ma matrix ( rep ( c ( a^ ( 0 : n) , 0) , n+1) , nrow=n+1, ncol=n+1) ma[ u p p e r . t r i (ma) ] 0 (from Uwe Ligges) n 3 a 2 out diag (n+1) out [ l owe r . t r i ( out ) ] a^ apply (matrix ( 1 : n , ncol=1) ,1 , func t i on ( x ) c ( rep ( 0 , x ) , 1 : ( nx+1) ) ) [ l owe r . t r i ( out ) ] out (from Woodrow Setzer) If you want an upper triangular matrix, this use of ifelse with outer looks great to me: fun func t i on (x , y ) f i f e l s e (yx , x+y , 0) g fun ( 2 , 3 ) [ 1 ] 5 out e r ( 1 : 4 , 1 : 4 , fun ) [ , 1 ] [ , 2 ] [ , 3 ] [ , 4 ] [ 1 , ] 0 3 4 5 [ 2 , ] 0 0 5 6 [ 3 , ] 0 0 0 7 [ 4 , ] 0 0 0 0 29
  • 99. If I had to do this, this method is most understandable to me. Here set a=0.2. a 0 . 2 fun func t i on (x , y ) f i f e l s e ( y x , a^ ( x+y1) , 0) g out e r ( 1 : 4 , 1 : 4 , fun ) [ , 1 ] [ , 2 ] [ , 3 ] [ , 4 ] [ 1 , ] 0 .2000 0 .00000 0 . 0 e+00 0 . 0 0 e+00 [ 2 , ] 0 .0400 0 .00800 0 . 0 e+00 0 . 0 0 e+00 [ 3 , ] 0 .0080 0 .00160 3 .2e04 0 . 0 0 e+00 [ 4 , ] 0 .0016 0 .00032 6 .4e05 1 .28e05 During the Summer Stats Camp at KU in 2012, a user wanted to import a
  • 100. le full of numbers that were written by Mplus and then convert it to a lower triangular matrix. These need to end up in row order, like so: 1 0 0 0 0 2 3 0 0 0 4 5 6 0 0 7 8 9 10 0 11 12 13 14 15 The data is a raw text
  • 102. rst solution I found used a function to create symmetric matrices from the package miscTools (by Arne Henningsen and Ott Toomet). This creates a full matrix, and then I obliterate the upper right side by writing over with 0's. x scan ( t e ch3 . out ) l i b r a r y ( mi scTool s ) dat1 symMatrix ( x , nrow=90, byrow = TRUE) upperNoDiag u p p e r . t r i ( dat1 , diag = FALSE) dat1 [ upperNoDiag ] 0 f i x ( dat1 ) To
  • 103. nd out why that works, just type symMatrix. The relevant part of Arne's code is i f ( byrow != upper ) f r e s u l t [ u p p e r . t r i ( r e s u l t , diag = TRUE) ] data r e s u l t [ l owe r . t r i ( r e s u l t ) ] t ( r e s u l t ) [ l owe r . t r i ( r e s u l t ) ] g Once we
  • 104. ddle around with that idiom a bit, we see that the magic is in the way R's matrix receives a vector and automagically
  • 105. lls in the pieces. Because we want to
  • 106. ll the rows, it turns out we have to
  • 107. rst
  • 108. ll the columns of an upper triangular matrix and then transpose the result. x 1 : 6 xmat matrix ( 0 , 3 , 3 ) xmat [ u p p e r . t r i (xmat , diag = TRUE) ] x r e s u l t t ( xmat ) r e s u l t [ , 1 ] [ , 2 ] [ , 3 ] [ 1 , ] 1 0 0 [ 2 , ] 2 3 0 [ 3 , ] 4 5 6 This pre-supposes I've got the correct number of elements in x. The main advantage of sym- Matrix for us is that it checks the size of the vector against the size of the matrix and reports an error if they don't match up properly. 3.19 Calculate inverse of X (12/02/2012) solve(A) yields the inverse of A. 30
  • 109. Caution: Be sure you really want an inverse. To estimate regression coecients, the recom-mended method is not to solve (X0X)1X0y , but rather use a QR decomposition. There's a nice discussion of that in Generalized Additive Models by Simon Wood. 3.20 Interesting use of Matrix Indices (20/06/2001) Mehdi Ghafariyan said I have two vectors A=c(5,2,2,3,3,2) and B=c(2,3,4,5,6,1,3,2,4,3,1,5,1,4,6,1,4) and I want to make the following matrix using the information I have from the above vectors. 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0 so the
  • 110. rst vector says that I have 6 elements therefore I have to make a 6 by 6 matrix and then I have to read 5 elements from the second vector , and put 1 in [1,j] where j=2,3,4,5,6 and put zero elsewhere( i.e.~in [1,1]) and so on. Any idea how this can be done in R ? Use matrix indices: ac ( 5 , 2 , 2 , 3 , 3 , 2 ) bc ( 2 , 3 , 4 , 5 , 6 , 1 , 3 , 2 , 4 , 3 , 1 , 5 , 1 , 4 , 6 , 1 , 4 ) nl eng th ( a ) Mmatrix ( 0 , n , n) M[ cbind ( rep ( 1 : n , a ) ,b) ]1 ( from Peter Dalgaard ) 3.21 Eigenvalues example (20/06/2001) Tapio Nummi asked about double-checking results of spectral decomposition In what follows, m is this matrix: 0 .4015427 0 .08903581 0.2304132 0 .08903581 1 .60844812 0 .9061157 0.23041322 0 .9061157 2 .9692562 Brian Ripley posted: sm e i g en (m, sym=TRUE) sm $ va lue s [ 1 ] 3 .4311626 1 .1970027 0 .3510817 $ v e c t o r s [ , 1 ] [ , 2 ] [ , 3 ] [ 1 , ] 0.05508142 0.2204659 0 .9738382 [ 2 , ] 0 .44231784 0.8797867 0.1741557 [ 3 , ] 0 .89516533 0 .4211533 0 .1459759 V sm$ v e c t o r s t (V) %*% V [ , 1 ] [ , 2 ] [ , 3 ] [ 1 , ] 1 .000000e+00 1.665335e16 5.551115e17 [ 2 , ] 1.665335e16 1 .000000e+00 2 .428613e16 [ 3 , ] 5.551115e17 2 .428613e16 1 .000000e+00 V %*% diag (sm$ va lue s ) %*% t (V) [ , 1 ] [ , 2 ] [ , 3 ] 31
  • 111. [ 1 , ] 0 .40154270 0 .08903581 0.2304132 [ 2 , ] 0 .08903581 1 .60844812 0 .9061157 [ 3 , ] 0.23041320 0 .90611570 2 .9692562 4 Applying functions, tapply, etc 4.1 Return multiple values from a function (12/02/2012) Please note. The return function is NOT required at the end of a function. One can return a collection of objects in a list, as in fun func t i on ( z ) f ##whatever... l i s t ( x1 , x2 , x3 , x4 ) g One can also return a single object from a function, but include additional information as attributes of the single thing. For example, fun func t i on (x , y , z ) f m1 lm( y x + z ) a t t r (m1, f r i e n d ) j o e g I don't want to add the character variable joe to any regression models, but it is nice to know I could do that. 4.2 Grab p values out of a list of signi
  • 112. cance tests (22/08/2000) Suppose chisq1M11 is a list of htest objects, and you want a vector of p values. Kjetil Kjernsmo observed this works: apply ( cbind ( 1 : 1 0 0 0 ) , 1 , func t i on ( i ) chisq1M11 [ [ i ] ] $ p. v a lue ) And Peter Dalgaard showed a more elegant approach: sapply ( chisq1M11 , func t i on ( x ) x$ p. v a lue ) In each of these, a simple R function called function is created and applied to each item 4.3 ifelse usage (12/02/2012) If y is 0, leave it alone. Otherwise, set y equal to y*log(y) i f e l s e ( y==0 , 0 , y* l o g ( y ) ) ( from Ben Bolker ) 4.4 Apply to create matrix of probabilities, one for each cell (14/08/2000) Suppose you want to calculate the gamma density for various values of scale and shape. So you create vectors scales and shapes, then create a grid if them with expand.grid, then write a function, then apply it (this example courtesy of Jim Robison-Cox) : gr idS expand.gr id ( s c a l e s , shapes ) s u r v L i l e l func t i on ( s s ) sum(dgamma( Survival , s s [ 1 ] , s s [ 2 ] ) ) Li k e l apply ( gr idS , 1 , s u r v L i l e l ) Actually, I would use 32
  • 113. s c 8 : 1 2 ; sh 7:12 args expand.gr id ( s c a l e=sc , shape=sh ) matrix ( apply ( args , 1 , func t i on ( x ) sum(dgamma( Survival , s c a l e=x [ 1 ] , shape=x [ 2 ] , l o g=TRUE) ) ) , l ength ( s c ) , dimnames=l i s t ( s c a l e=sc , shape=sh ) ) ( from Brian Ripley ) 4.5 Outer. (15/08/2000) outer(x,y,f) does just one call to f with arguments created by stacking x and y together in the right way, so f has to be vectorised. (from Thomas Lumley) 4.6 Check if something is a formula/function (11/08/2000) Formulae have class formula, so inherits(obj, formula) looks best. (from Prof Brian D Ripley) And if you want to ensure that it is ~X+Z+W rather than Y ~ X+Z+W you can use i n h e r i t s ( obj , ` ` formula ' ' ) ( l ength ( obj )==2) (from Thomas Lumley) 4.7 Optimize with a vector of variables (11/08/2000) The function being optimized has to be a function of a single argument. If alf and bet are both scalars you can combine them into a vector and use opt . func func t i on ( arg ) t (Y(X[ , 1 ] * arg [ 1 ] + X[ , 2 ] * arg [ 2 ] ) ^ de l t a ) %*% c o v a r i a n c e .ma t r i x . i n v e r s e %*% (Y(X[ , 1 ] * arg [ 1 ] + X[ , 2 ] * arg [ 2 ] ) ^ de l t a ) ( from Douglas Bates ) 4.8 slice.index, like in S+ (14/08/2000) s l i c e . i n d e x func t i on (x , i ) f k dim( x ) [ i ] sweep (x , i , 1 : k , func t i on (x , y ) y ) g ( from Peter Dalgaard ) 5 Graphing 5.1 Adjust features with par before graphing (18/06/2001) par() is a multipurpose command to aect qualities of graphs. par(mfrow=c(3,2)), creates a
  • 114. gure that has a 3x2 matrix of plots, or par(mar=c(10, 5, 4, 2)), adjust margins. A usage example: par (mfrow=c ( 2 , 1 ) ) par (mar=c (10 , 5 , 4 , 2) ) x 1:10 pl o t ( x ) box ( f i g u r e , l t y=dashed ) par (mar=c ( 5 , 5 , 10 , 2) ) pl o t ( x ) box ( f i g u r e , l t y=dashed ) 33
  • 115. 5.2 Save graph output (03/21/2014) There are two ways to save and/or print graphs. You can create the graphs on screen, and then copy them into a
  • 116. le, or you can write them directly into a
  • 117. le. This is explained in greater depth in my R lecture notes, plot-2 (http://pj.freefaculty.org/guides/Rcourse/plot-2/ plot-2.pdf). Graphs that are written directly into a
  • 118. le are more likely to come out correctly, so lets concentrate there. Run this. x rnorm(100) png ( f i l e = mypng1.png , he i ght = 800 , width = 800) h i s t (x , main = I 'm running t h i s and won ' t s e e anything on the s c r e en ) d e v . o f f ( ) The dev.o() is a
  • 119. le saver statement. It means
  • 121. le. In the terminal, you see null device. Whatever, it worked. A
  • 122. le appears in the current working directory mypng-1.png. Look at that with an image viewer. Now try another one png ( f i l e = mypng2.png , he i ght = 300 , width = 800 , p o i n t s i z e = 8) h i s t (x , main = I 'm running t h i s and won ' t s e e anything on the s c r e en ) d e v . o f f ( ) Nothing appears on the screen. R knows the size you want (800 x 800 in pixels). I adjusted the pointsize for fun. You can play with settings. You don't want a png
  • 123. le? OK, lets try PDF, which is now the recommended format for making graphs that end up in research papers. We used to use postscript, so most of my documentation used to be about EPS. But now, PDF. So run this pdf ( f i l e = mypdf1.pdf , he i ght = 6 , width = 6 , o n e f i l e = FALSE, paper = s p e c i a l ) h i s t (x , main = I f the r e i s no PDF at the end o f thi s , I ' l l be mad) d e v . o f f ( ) NB: height and width are in inches, NOT pixesl, with the pdf device. I usually only use the png and pdf devices, but I have sometimes tested others, like SVG (scalable vector graphics) and
  • 124. g (
  • 125. les that can be edited with x
  • 126. g). I'm pushing the point that you create the output device, you run the plot commands, and then dev.o(). That works, but it is not useful for planning a plot. You need to see it on the screen, right? R plot functions draw on screen devices by default. As a result, the research work ow is to run the graph commands, look at them on the screen, and then go back and run png(: : :) or pdf(: : :) line, then re-run the graphing commands, concluding with dev.o(). I know students ignore this advice, and they think it is a hassle. They create graphs on screen and then use File - Save As or such from some GUI IDE. Sometimes, they end up with horrible looking graphs. Sizes and shapes turn out wrong. The reason is that the R graph they see on the screen is not seamlessly copied into a
  • 127. le output. You can reproduce that kind of trouble if you want to. When I'm in a hurry, this is what I do. First, I create an on screen device with about the right size. On all platforms, from an R console, this will do so dev.new ( he ight = 7 , width = 7) If you happen to be using RStudio, you will get an error message indicating that the function is not available. In that case, on Windows, replace dev.new() with windows(). On Mac, replace it with quartz(). On Linux, X11() is a safe bet. Your system may have other on-screen devices as well. I chose to make a square graph, you don't have to if you don't want to. After you make the graph look just right, you can save it as a PDF like so: 34
  • 128. dev.copy ( pdf , f i l e = my f i l e . p d f , he i ght = 7 , width = 7 , o n e f i l e = FALSE) d e v . o f f ( ) Back in the olden days of postscript, the one
  • 129. le = FALSE argument was vital to get the bounding box that put the encapsulated into EPS, rather than plain old postscript. I still use it now with PDF, although I don't think it has much eect. If you want to save the same as a picture image, say a png
  • 130. le with white background, do dev.copy ( png , f i l ename = myf i l e .png , he ight = 600 , width = 800 , bg = whi te ) d e v . o f f ( ) In my experience, the bg argument is ignored and the background always turns out transparent. Note you can also set the pointsize for text. Peter Dalgaard told me that on postscript devices, a point is 1/72nd of an inch. If you expect to shrink an image for
  • 131. nal presentation, set a big pointsize, like 20 or so, in order to avoid having really small type. The dev.copy approach works well OFTEN. Sometimes, however, it is bad. It cuts o the margin labels, it breaks legend boxes. So if you do that, well, you get what you deserve. I've learned this the hard way, but sending out papers with bad graphs and I did not notice until it was too late. R uses the term device to refer to the various graphical output formats. The command dev.o() is vital because it lets the device know you are
  • 132. nished drawing in there. There are many devices representing kinds of
  • 133. le output. You can choose among device types, postscript, png, jpeg, bitmap, etc. (windows users can choose windows meta
  • 134. le). Type ? devi c e to see a list of devices R
  • 135. nds on your system. There is a help page for each type of device, so you can run ?png, ?pdf, ?windows to learn more. After you create a device, check what you have on: d e v . l i s t ( ) And, if you have more than one device active, you tell which one to use with dev.set(n), for the n device. Suppose we want the third device: d e v . s e t ( 3 ) For me, the problem with this approach is that I don't usually know what I want to print until I see it. If I am using the jpeg device, there is no screen output, no picture to look at. So I have to make the plot, see what I want, turn on an output device and run the plot all over in order to save a
  • 136. le. It seems complicated, anyway. The pdf and postscript devices have options to output multiple graphs, one after the other. They can either be drawn as new pages inside the single
  • 138. le = TRUE), or one can use a customized
  • 139. le name to ask for it to create one
  • 140. le per graph. If you use dev.copy(), you may run into the problem that your graph elements have to be resized and they don't
  • 141. t together the way you expect. The default on-screen device (for me, X11(), for you maybe windows() or quartz()) size is 7in x 7in, but postscript size is 1/2 inch smaller than the usable papersize. That mismatch can spell danger. The dev.print() function, as far as I can see, is basically the same as dev.copy(), except it has two features. First, the graph is rescaled according to paper dimensions, and the fonts are rescaled accordingly. Due to the problem mentioned above, not everything gets rescaled perfectly, however, so take care. Second, it automatically turns o the device after it has printed/saved its result. d e v . p r i n t ( devi c e = p o s t s c r i p t , f i l e = yourFi leName.ps ) 35
  • 142. A default jpeg is 480x480, but you can change that: jpeg ( f i l ename= p l o t . j p g width = 460 , he ight = 480 , p o i n t s i z e = 12 , q u a l i t y = 85) As of R1.1, the dev.print() and dev.copy2eps() will work when called from a function, for example: ps func t i on ( f i l e=Rpl o t . eps , width=7, he i ght=7, . . . ) f dev. copy2eps ( f i l e=f i l e , width=width , he i ght=height , . . . ) g data ( c a r s ) pl o t ( c a r s ) ps ( ) The mismatch of size between devices even comes up when you want to print out an plot. This command will print to a printer: d e v . p r i n t ( he i ght=6, width=6, h o r i z o n t a l=FALSE) You might want to include pointsize=20 or whatever so the text is in proper proportion to the rest of your plot. One user observed, Unfortunately this will also make the hash marks too big and put a big gap between the axis labels and the axis titlenldotsfg, and in response Brian Ripley observed: The problem is your use of dev.print here: the ticks change but not the text size. dev.copy does not use the new pointsize: try x11 ( width = 3 , he i ght = 3 , p o i n t s i z e = 8) x11 ( width = 6 , he i ght = 6 , p o i n t s i z e = 16) d e v . s e t ( 2 ) pl o t ( 1 : 1 0 ) dev.copy ( ) Re-scaling works as expected for new plots but not re-played plots. Plotting directly on a bigger device is that answer: plots then scale exactly, except for perhaps default line widths and other things where rasterization eects come into play. In short, if you want postscript or pdf, use postscript() or pdf() directly. The rescaling of a windows() device works dierently, and does rescale the fonts. (I don't have anything more to say on that now) 5.3 How to automatically name plot output into separate
  • 143. les (10/04/2001) Use this notation for the
  • 144. le name: %03d is in the style of old-fashioned fortran format state-ments. It means use 3 digit numbers like 001, 002, etc. p o s t s c r i p t ( f i l e=tes t3%03d.ps , o n e f i l e = FALSE, paper = s p e c i a l ) will create postscript
  • 145. les called test3-001.ps, test3-002.ps and so on (from Thomas Lumley). pdf ( f i l e=tes t3%03d.pdf , o n e f i l e = FALSE, paper = s p e c i a l ) Don't forget d e v . o f f ( ) to
  • 147. les to disk. Same works with pdf or other devices. 5.4 Control papersize (15/08/2000) opt i ons ( pa pe r s i z e = l e t t e r ) 36
  • 148. whereas ps.options is only for the postscript device 5.5 Integrating R graphs into documents: LATEX and EPS or PDF (20/06/2001) Advice seems to be to output R graphs in a scalable vector format, pdf or eps. Then include the output
  • 149. les in the LATEX document. 5.6 Snapshot graphs and scroll through them (31/12/2001) Ordinarily, the graphics device throws away the old graph to make room for the new one. Have a look to ?recordPlot to see how to keep snapshots. To save snapshots, do myGraph r e c o rdPl o t ( ) # then r epl ayPl o t (myGraph) to see it again. On Windows you can choose history-recording in the menu and scroll through the graphs using the PageUp/Down keys. (from Uwe Ligges) 5.7 Plot a density function (eg. Normal) (22/11/2000) x seq (3, 3 , by = 0 . 1 ) probx dnorm(x , 0 , 1 ) pl o t (x , probx , type =`` l ' ' ) 5.8 Plot with error bars (11/08/2000) Here is how to do it. x c ( 1 , 2 , 3 , 4 , 5) y c (1 .1 , 2 .3 , 3 .0 , 3 .9 , 5 . 1 ) uc l c (1 .3 , 2 .4 , 3 .5 , 4 .1 , 5 . 3 ) l c l c ( .9 , 1 .8 , 2 .7 , 3 .8 , 5 . 0 ) pl o t (x , y , yl im = range ( c ( l c l , uc l ) ) ) ar rows (x , ucl , x , l c l , l ength = .05 , angl e = 90 , code = 3) #or segments (x , ucl , x , l c l ) ( from Bi l l Simpson ) 5.9 Histogram with density estimates (14/08/2000) If you want a density estimate and a histogram on the same scale, I suggest you try something like this: IQR d i f f ( summary( data ) [ c ( 5 , 2) ] ) de s t dens i ty ( data , width = 2*IQR) # or some smaller width , maybe , h i s t ( data , xl im = range ( de s t $x ) , xlab = x , ylab = dens i ty , p r o b a b i l i t y = TRUE) # --- this is the vital argument l i n e s ( des t , l t y = 2) ( from Bi l l Venables ) 5.10 How can I overlay several line plots on top of one another? (09/29/2005) This is a question where terminology is important. If you mean overlay as in overlay slides on a projector, it can be done. This literally super-imposes 2 graphs. Use par(new = TRUE) to stop the previous output from being erased, as in 37
  • 150. tmp1 pl o t ( acquaint x , type = ' l ' , yl im = c ( 0 , 1 ) , ylab = average pr opo r t i on , xlab = PERIOD , l t y = 1 , pch = 1 , main = ) par ( new = TRUE) tmp2 pl o t ( harmony x , type = ' l ' , yl im = c ( 0 , 1 ) , ylab = average pr opo r t i on , xlab = PERIOD , l t y = 2 , pch = 1 , main = ) par ( new = TRUE) tmp3 pl o t ( i d e n t i c a l x , type = ' l ' , yl im = c ( 0 , 1 ) , ylab = average pr opo r t i on , xlab = PERIOD , l t y = 3 , pch = 1 , main = ) Note the par() is used to overlay high level plotting commands. Here's an example for histograms: h i s t ( rnorm(100 , mean = 20 , sd = 12) , xl im = range ( 0 , 1 0 0 ) , yl im = range ( 0 , 50) ) par (new = TRUE) h i s t ( rnorm(100 , mean = 88 , sd = 2) , xl im = range ( 0 , 100) , yl im = range ( 0 , 50) ) The label at the bottom is all messed up. You can put the option xlab= to get rid of them. There are very few times when you actually want an overlay in that sense. Instead, you want to draw one graph, and then add points, lines, text, an so forth. That is what the R authors actually intend. First you create the baseline plot, the one that sets the axes values. Then add points, lines, etc. Here are some self-contained examples to give you the idea. Here's a regression example x rnorm(100) e rnorm (100) y 12 + 4*x +13 * e mylm lm( y x ) pl o t (x , y , main = My r e g r e s s i o n ) a b l i n e (mylm) From Roland Rau in r-help: h i s t ( rnorm(10000) , f r e q = FALSE) xva l s seq (5, 5 , l ength = 100) l i n e s ( x = xval s , y = dnorm( xva l s ) ) Here's another example that works great for scatterplots x1 1:10 x2 2:12 y1 x1+rnorm( l ength ( x1 ) ) y2 . 1 *x2+rnorm( l ength ( x2 ) ) pl o t ( x1 , y1 , xl im = range ( x1 , x2 ) , yl im = range ( y1 , y2 ) , pch = 1) po int s ( x2 , y2 , pch = 2) ( from Bi l l Simpson ) If you just want to plot several lines on a single plot, the easiest thing is to use matplot which is intended for that kind of thing. But plot can do it too. Here's another idea: form a new data.frame and pass it through as y: pl o t ( cbind ( t s 1 . t s , t s 2 . t s ) , xlab = time , ylab = , p l o t . t y p e = s i n g l e ) or better something like: pl o t ( cbind ( t s 1 . t s , t s 2 . t s ) , p l o t . t y p e = s i n g l e , c o l=c ( ye l low , red ) , l t y = c ( s o l i d , dot ted ) ) #colours and patterns It can also be helpful to contrast `c(ts1.ts,ts2.ts)' with `cbind(ts1.ts,ts2.ts)'. (from Guido Masarotto) For time series data, there is a special function for this: t s . p l o t ( t s 1 . t s , t s 2 . t s ) # same as above 38
  • 151. See help on plot.ts and ts.plot (from Brian D. Ripley) Often, I
  • 152. nd myself using the plot command to plot nothing, and then