# Basic R

### Basic R

1. Prof. Dr. Roberto Dantas de Pinho, roberto.pinho@mct.gov.br, 26/jul/2012. This presentation is based on courses by Dr. Paulo Justiniano Ribeiro Jr (UFPR) & Dr. Cosme Marcelo Furtado Passos da Silva (FIOCRUZ). SEXECASCAV|CGIN
2. A first R session; objects; data input; now that we have data...; some analyses; filter & select; saving your work; changing data; sums and aggregates; linear regression. And lots of other things along the way.
3. Install, configuration etc.; R internals, structure etc.; handling large datasets; fancy plots beyond the basics.
4. You can use R to evaluate some simple expressions. Just type: `1 + 2 + 3`, `2 + 3 * 4`, `3/2 + 1`, `4 * 3**3`. R is an environment and a language.
5. The R environment lets you submit commands and see results immediately. The R language is the set of rules and functions that may be run by the R environment. You may keep command sequences (scripts) for later use.
6. Several functions are available. A couple of simple examples: `sqrt(2)` is √2; `abs(-10)` is |-10|; `sin(pi)` is sin(π). `pi` is a constant in R; its value is already defined.
7. Results, input data, tables etc. are all stored in R as objects. Objects have a name, content and type, and are stored in memory. Ex.: create object "x" with the number 10: `x <- 10`. Show the content of x: `x`. R is case-sensitive: `abc` is different from `ABC`.
8. Try: `X <- sqrt(2)`, `Y = sin(pi)`, `Z = sqrt(X+Y)`. For assignments like these, `<-` and `=` are equivalent. In the above examples, X, Y and Z store the result of each operation. In R, there are always many ways of doing the same thing. We will try to focus on a single way of doing each task.
9. What is the value of C at the end of the script? `A = 1`, `B = 2`, `C = A + B`, `A = 5`, `B = 5`. Why?
11. A tool that makes it easier to use R. It manages work windows and gives easier access to objects, scripts, command history and plots.
12. Editing scripts & object view; console.
13. Object list & history; help, plots, files & packages.
14. An object that holds multiple values, all of a single type. Function `c( )` ("c" from concatenate) groups values to build a vector: `X = c(1,3,6)`. To access vector elements: `X[1]`, `X[3]`.
15. Operations may be performed and functions applied over the whole vector. Ex.: with `X = c(1,3,5)` and `Y = c(10,20,30)`, `X+Y` gives `[1] 11 23 35` and `sum(X)` gives `[1] 9`. How about `X + 100`? It gives `[1] 101 103 105`, due to the recycling rule.
16. When the size of an object required by an operation is different from the actual size, the available data is repeated as needed. As X has 3 elements, `X+100` is the same as `X + c(100,100,100)`.
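
The recycling rule can be checked directly in the console. This is a minimal sketch; the vectors `long` and `short` are illustrative names that do not appear in the slides.

```r
X <- c(1, 3, 5)

# A single value is repeated to match the length of X
r1 <- X + 100
r2 <- X + c(100, 100, 100)
print(identical(r1, r2))   # both forms give the same vector

# A shorter vector is repeated as many times as needed
long  <- 1:6
short <- c(10, 20)
r3 <- long + short         # same as 1:6 + c(10, 20, 10, 20, 10, 20)
print(r3)
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.
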
17. `X = 1:10` stores `[1] 1 2 3 4 5 6 7 8 9 10`; `X = seq(0,1,by=0.1)` stores `[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0`; `rep("a",5)` gives `[1] "a" "a" "a" "a" "a"`; after `names = c("fulano", "beltrano", "cicrano")`, `names` shows `[1] "fulano" "beltrano" "cicrano"`; after `letras = letters[1:5]`, `letras` shows `[1] "a" "b" "c" "d" "e"`; after `letras = LETTERS[1:5]`, `letras` shows `[1] "A" "B" "C" "D" "E"`.
18. numeric: `is.numeric( )`, `as.numeric( )`. integer: `is.integer( )`, `as.integer( )`. character: `is.character( )`, `as.character( )`. logical: `T` is `TRUE` and counts as 1; `F` is `FALSE` and counts as 0. `A == B` means "is A equal to B?"
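
The type checks and conversions listed above can be tried on a small vector. A short sketch with made-up values; the object names are illustrative.

```r
x <- c("1", "2", "3")         # a character vector
print(is.character(x))

n <- as.numeric(x)            # text converted to numbers
print(is.numeric(n))

i <- as.integer(n)            # numbers converted to integers
print(is.integer(i))

# Logical values count as 1 (TRUE) and 0 (FALSE) in arithmetic
flags <- c(TRUE, FALSE, TRUE)
total <- sum(flags)           # number of TRUE values
print(total)

# == asks "is A equal to B?" and returns a logical value
print(1 == 2)
```
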
19. A vector arranged in rows & columns: `m1 <- matrix(1:12, ncol = 3)`

             [,1] [,2] [,3]
        [1,]    1    5    9
        [2,]    2    6   10
        [3,]    3    7   11
        [4,]    4    8   12

20. `length(m1)` gives `[1] 12`; `dim(m1)` gives `[1] 4 3`; `nrow(m1)` gives `[1] 4`; `ncol(m1)` gives `[1] 3`.
21. `m1[1, 2]` gives `[1] 5`; `m1[2, 2]` gives `[1] 6`; `m1[ , 2]` gives `[1] 5 6 7 8`; `m1[3, ]` gives `[1] 3 7 11`; `m1[1,2] = 99` changes the value of the cell.
22. `m1[1:2, 2:3]`

             [,1] [,2]
        [1,]    5    9
        [2,]    6   10

23. `colnames(m1)` gives `NULL`; `rownames(m1)` gives `NULL`; after `colnames(m1) = c("C1","C2","C3")`, `m1[,"C1"]` gives `[1] 1 2 3 4`; `t(m1)` is the transpose of m1.
24. A "matrix" with many dimensions. Ex. with 3 dimensions: `ar1 <- array(1:24, dim = c(3, 4, 2))`. For a 3-dimensional array, you might visualize the 3rd dimension as a collection of matrices. The 1st matrix is `, , 1` and the 2nd matrix is `, , 2`:

        , , 1

             [,1] [,2] [,3] [,4]
        [1,]    1    4    7   10
        [2,]    2    5    8   11
        [3,]    3    6    9   12

        , , 2

             [,1] [,2] [,3] [,4]
        [1,]   13   16   19   22
        [2,]   14   17   20   23
        [3,]   15   18   21   24

25. How to work with this kind of data?

    | Ano | UF | Código do Órgão | Órgão | Código da UO | unidade orçamentária | função | subfunção | programa | ação | localizador | descrição da ação | valor P&D | valor ACTC |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 121 | 2056 | 1548 | | MODERNIZAÇÃO DO SISTEMA DE PLANEJAMENTO E GESTÃO DA SDCT | R\$ - | R\$ 16.655,00 |
    | 2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 121 | 2056 | 1549 | | PROGRAMA DE COOPERAÇÃO TÉCNICA E FINANCEIRA COM INSTIT. NAC. INTERN. GOVERNAMENTAIS E NÃO GOVERNAMENTAIS | R\$ - | R\$ 715.000,00 |
    | 2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 122 | 2009 | 2224 | | MANUTENÇÃO DO GABINETE DO SECRETÁRIO | R\$ - | R\$ 27.732,11 |
    | 2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 122 | 2009 | 2227 | | DEPARTAMENTO DE GESTÃO INTERNA | R\$ - | R\$ 2.266.169,90 |

26. Each column has its own data type. We will be using data.frames most of the time. `d = data.frame(letters[1:4], 1:4, 10.5)`

          letters.1.4. X1.4 X10.5
        1            a    1  10.5
        2            b    2  10.5
        3            c    3  10.5
        4            d    4  10.5

    We can change column names: after `colnames(d) = c("letra","num","valor")`, `colnames(d)` gives `[1] "letra" "num" "valor"`. `d$valor` selects column "valor" from d.
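
Column names can also be given when the data.frame is created, which avoids the automatic names. A small sketch reusing the column names from the slide:

```r
# Name the columns at creation time
d <- data.frame(letra = letters[1:4], num = 1:4, valor = 10.5)

str(d)               # one data type per column
print(d$valor)       # selects column "valor"
print(d[2, "num"])   # row 2 of column "num"
```

The single value `10.5` is repeated for all four rows, following the same recycling behaviour seen with vectors.
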
27. list and factor: later...
28. Several possible sources. We will see: keyboard (`x = scan( )`), Excel files, CSV files, SQL databases.
29. `require(XLConnect)`, `wb <- loadWorkbook("AC_PDACTCaula.xls")`, `plan1 <- readWorksheet(wb, sheet = 1)`, `str(plan1)`, `View(plan1)`.
30. `require(XLConnect)` loads package XLConnect. Packages are sets of functions and data that add capabilities to R. If the package is not installed: `setInternet2()` (only on Windows), then `install.packages("XLConnect", dep=T)`.
31. Creates an object "wb" that points to the Excel file: `wb <- loadWorkbook("AC_PDACTCaula.xls")`.
32. Load the first sheet's data into an object called "plan1": `plan1 <- readWorksheet(wb, sheet = 1)`. R functions identify parameters by name, by order, or by both.
33. Show the structure of the new object: `str(plan1)`. `str()` works with any R object. It is very useful. Show the data in a window: `View(plan1)`. In RStudio, you may click on an object in the objects list to the same effect.
34. `args(readWorksheet)` shows the available parameters: `object` (the workbook "wb"), `sheet` (number or name of the sheet), `startRow`, `startCol`, `endRow`, `endCol`, `header` (T or F: use the first line to name columns).
35. Comma-separated values: a very popular format for data interchange. Other separators are also popular: `;`, `<tab>`, `<space>`. Example:

        uf  ano   valido  somaactc     somapd
        AC  2009  1       34296430.67  3630841.04
        AC  2010  1       29397712.04  3579715.12
        AL  2009  1       12650160.51  8903714.41

36. Example:

        uf  ano   valido  somaactc     somapd
        AC  2009  1       34296430,67  3630841,04
        AC  2010  1       29397712,04  3579715,12
        AL  2009  1       12650160,51  8903714,41

    To read this file: `d = read.csv(file="AgregaUF20110930_b.txt", header=T, sep="\t", dec=",")`, where `header=T` uses the first line as column names, `sep="\t"` sets the separator to `<tab>` and `dec=","` sets the decimal mark to a comma.
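
The same call can be tried without the file on disk. This sketch passes the sample rows through the `text` argument of `read.csv`, which is a substitution made here for illustration; the slides read from a file.

```r
# Inline sample standing in for a tab-separated file with decimal commas
txt <- "uf\tano\tvalido\tsomaactc\tsomapd
AC\t2009\t1\t34296430,67\t3630841,04
AC\t2010\t1\t29397712,04\t3579715,12
AL\t2009\t1\t12650160,51\t8903714,41"

d <- read.csv(text = txt, header = TRUE, sep = "\t", dec = ",")

str(d)                 # somaactc and somapd are read as numbers
print(sum(d$somapd))
```

If `dec = ","` is left out, the two value columns are read as text instead of numbers, so `sum()` fails.
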
37. `str(d)`: structure; `summary(d)`: statistical summary; `head(d)`: first rows; `tail(d)`: last rows; `plot(d)`: standard plot.
38. `require(RODBC)`, `canal <- odbcConnect("base_ODBC", case="tolower", uid="user", pwd="password")`, `d <- sqlQuery(canal, "select * from table where year = 2010", as.is=T)`.
39. How to get the sum of values from a data.frame column? `sum(data.frame$column)`. Here, `sum(d$somapd)` gives `[1] NA`.
40. NA (Not Available): missing values. NaN (Not a Number): a value that cannot be represented as a number. Inf & -Inf: plus and minus infinity. Try: `c(-1,0,1)/0`.
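
A short sketch of how these special values behave, using made-up numbers:

```r
v <- c(-1, 0, 1) / 0
print(v)                       # -Inf NaN Inf

x <- c(10, NA, 30)
print(sum(x))                  # NA: one missing value makes the sum missing
print(sum(x, na.rm = TRUE))    # 40: missing values are dropped first

# is.na() is TRUE for both NA and NaN; is.nan() only for NaN
print(is.na(c(NA, NaN, 1)))
print(is.nan(c(NA, NaN, 1)))
```
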
41. Sum: `sum(d$somapd, na.rm=T)` gives `[1] 4836882446`. Mean: `mean(d$somapd, na.rm=T)`. Median: `median(d$somapd, na.rm=T)`. Standard deviation: `sd(d$somapd, na.rm=T)`.
42. For these examples: `milsa = read.csv("milsaText.txt", sep="\t", head=T, dec=".")`.
43. Absolute frequencies: `table(milsa$civil)`. Relative frequencies: `table(milsa$civil) / length(milsa$civil)` or `prop.table(table(milsa$civil))`. Pie chart: `pie(table(milsa$civil))`.
44. With `attach(milsa)`, columns can be used by name. Absolute frequencies: `table(civil)`. Relative frequencies: `table(civil) / length(civil)` or `prop.table(table(civil))`. Pie chart: `pie(table(civil))`. After: `detach(milsa)`.
45. Bar plot: `barplot(table(instrucao))`. Remember: I may save any result as an object to use it later: `instrucao.tb = table(instrucao)`, `barplot(instrucao.tb)`, `pie(instrucao.tb)`.
46. Try: `prop.table(filhos)`. Solution: `prop.table(table(filhos))`. Other solution: filter out elements with NA.
47. `mean(filhos, na.rm=T)`; `median(filhos, na.rm=T)`; `range(filhos, na.rm=T)`; `var(filhos, na.rm=T)` (variance); `sd(filhos, na.rm=T)` (standard deviation). Quantiles: `filhos.quartis = quantile(filhos, na.rm=T)`, which returns the 0%, 25%, 50%, 75% and 100% points. Interquartile range: `filhos.quartis[4] - filhos.quartis[2]`.
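
A sketch of the quantile and interquartile range calculation on made-up data. Picking the quartiles by name rather than by position is a choice made here, not something taken from the slides, and `IQR()` is the built-in equivalent.

```r
filhos <- c(0, 1, 2, 2, 3, NA, 5, 1)    # made-up values with one NA

q <- quantile(filhos, na.rm = TRUE)     # named 0%, 25%, 50%, 75%, 100%
print(q)

# Interquartile range: third quartile minus first quartile
iqr_manual  <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(filhos, na.rm = TRUE)
print(c(iqr_manual, iqr_builtin))
```
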
48. `plot(milsa)`, `plot(salario ~ ano)`, `hist(salario)`, `boxplot(salario)`, `stem(salario)`.
49. Selecting some rows: `milsaNovo = milsa[c(1,3,5,6), ]`. Selecting some columns: `milsaNovo = milsa[ , c(1,3,5)]` or `milsaNovo = milsa[ , c("funcionario", "instrucao", "salario")]`. Attention: `milsaNovo = milsa[c(1,3,5,6), ]` makes a new copy, while `milsa = milsa[c(1,3,5,6), ]` replaces the previous milsa.
50. Who earns above the median? `acimamediana = milsa[salario > median(salario), ]`. Who is married and has a higher education degree? `casadoEsuperior = milsa[civil=="casado" & instrucao == "Superior", ]`. AND: both must be true.
51. Who is married or has a higher education degree? `casadoOUsuperior = milsa[civil=="casado" | instrucao == "Superior", ]`. OR: at least one must be true.
52. NOT: `milsaLimpo = milsa[!is.na(salario), ]`. In English: the new table (`milsaLimpo`) equals (`=`) the old table (`milsa`), selecting (`[`) the rows where salary is not NA (`!is.na(salario)`) and all columns (`, ]`).
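
The AND, OR and NOT filters can be tried on a tiny stand-in table. The data below is made up, only the column names follow the slides, and columns are referenced as `milsa$...` so the sketch does not depend on `attach()`.

```r
# Made-up stand-in for milsa
milsa <- data.frame(
  civil     = c("casado", "solteiro", "casado", "solteiro"),
  instrucao = c("Superior", "1oGrau", "2oGrau", "Superior"),
  salario   = c(17.5, 8.0, NA, 15.2)
)

# AND: both conditions must hold
both <- milsa[milsa$civil == "casado" & milsa$instrucao == "Superior", ]

# OR: at least one condition must hold
either <- milsa[milsa$civil == "casado" | milsa$instrucao == "Superior", ]

# NOT: keep the rows whose salary is not missing
limpo <- milsa[!is.na(milsa$salario), ]

print(nrow(both))
print(nrow(either))
print(nrow(limpo))
```
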
53. How many are married? `sum(civil=="casado")` or `table(civil)["casado"]`. How many are married and have a higher education degree? `sum(civil=="casado" & instrucao == "Superior")` or `table(civil,instrucao)["casado","Superior"]`.
54. `milsaNovo` is equal to `milsa`, without rows 1, 2 & 5 and without columns 1 & 8: `milsaNovo = milsa[-c(1,2,5), -c(1,8)]`.
55. `which()` returns the rows where a condition is TRUE: `sup = which(instrucao=="Superior")` gives `[1] 19 24 31 33 34 36`. You may use it again later: `mean(milsa[sup,"salario"])` is the mean salary for those with higher education. Advantage: it is not a copy!
56. A random sample of 10 rows from milsa: `amostra = sample(x=nrow(milsa), size=10)` gives, for example, `[1] 12 29 1 3 17 14 26 33 20 31`. Mean salary for the sample: `mean(milsa[amostra,"salario"])`.
57. Sorting by number of children: `milsa[order(filhos),]`. Decreasing: `milsa[order(filhos, decreasing=T),]`. By number of children and then age: `milsa[order(filhos,ano),]`. 10 youngest: `head(milsa[order(ano),], 10)`. 10 oldest: `tail(milsa[order(ano),], 10)`.
58. Removing an object: `rm(milsaNovo)`. Removing every object: `rm(list = ls())`, where `ls()` is the list of current objects.
59. List objects are collections that may include different types of objects: `lis = list(A=1:10, B="Text", C = matrix(1:9,ncol=3))`. They are often used as parameters to functions or as result sets from them. `lis[1:2]` is a list with the first two objects from lis (A & B). `lis[[1]]` is the object stored at the first position of the list (the content of A), the same as `lis$A`.
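
The difference between single and double brackets is easy to verify. A minimal sketch using the list from the slide; `sub` and `first` are illustrative names.

```r
lis <- list(A = 1:10, B = "Text", C = matrix(1:9, ncol = 3))

sub   <- lis[1:2]     # single brackets: still a list, holding A and B
first <- lis[[1]]     # double brackets: the content of A itself

print(class(sub))
print(class(first))
print(identical(lis[[1]], lis$A))
```
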
60. Saving all objects: `save.image("file.RData")`. Saving selected objects: `save(x, y, file="file.RData")`. Loading: `load("file.RData")`. After several loads, objects with distinct names are all kept in memory.
61. Saving a ".R" script that reproduces the desired output. Advantages: it may be used to document the work performed; it may be run again over updated data to update the results. Hybrid model: save intermediate results that take a long time to process, and update them less often.
62. Add a column to a data.frame: `milsa$idade = milsa$ano + milsa$mes/12`.
63-67. [Figures: diagrams of two sets, X and Y; the first carries the label 6+3+5=14.]
68. Example: [figure showing two tables joined by "&"].
69. Only rows found in both data.frames: `merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all=F)`. All rows from data.frame x: `merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all.x=T)`.
70. All rows from data.frame y: `merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all.y=T)`. All rows from data.frames x & y: `merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all=T)`.
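
The four variants of `merge` can be compared on two tiny tables. The data is made up for illustration (including the `anos` column and the unmatched "Mestrado" and "Doutorado" rows); only the names `milsa`, `tabInst`, `instrucao` and `desc` come from the slides.

```r
# Made-up tables: one value in each has no match in the other
milsa <- data.frame(funcionario = 1:4,
                    instrucao = c("1oGrau", "2oGrau", "Superior", "Mestrado"))
tabInst <- data.frame(desc = c("1oGrau", "2oGrau", "Superior", "Doutorado"),
                      anos = c(8, 11, 15, 21))

# Rows found in both tables
inner <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc")
# All rows from x, with NA where y has no match
left  <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc", all.x = TRUE)
# All rows from y, with NA where x has no match
right <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc", all.y = TRUE)
# All rows from both tables
full  <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc", all = TRUE)

print(c(nrow(inner), nrow(left), nrow(right), nrow(full)))
```
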
71. From text to numeric: `d.f$novaColuna = as.numeric(d.f$coluna)`. From numeric to text: `d.f$novaColuna = as.character(d.f$coluna)`. From text or numeric to integer: `d.f$novaColuna = as.integer(d.f$coluna)`. Integers save memory.
72. Representation for categorical data. Nominal: "married", "single". Ordinal: "tall", "short". Factors save memory. They ensure proper treatment of these variables by many R functions.
73. Nominal: `milsa$fatorcivil = factor(milsa$civil, ordered=F)`, shown as `$ fatorcivil: Factor w/ 2 levels "casado","solteiro": 2 1 1 2 2 1 2 2 1 2`. Ordinal: `milsa$fatormes = factor(milsa$mes, ordered=T)`, shown as `$ fatormes: Ord.factor w/ 12 levels "0"<"1"<"2"<"3"<..: 4 11 6 11 8 1 1 5 11 7 ...`. It is possible to define a custom order: see `?factor`.
74. From factor to text: `d.f$novaColuna = as.character(d.f$colunaFator)`. From factor to numeric: `d.f$novaColuna = as.numeric(as.character(d.f$colunaFator))`. The internal representation of a factor is different from its text description.
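
Why the extra `as.character()` step matters can be seen with a factor whose labels are numbers. A short sketch with made-up values:

```r
f <- factor(c("10", "20", "10", "30"))

# Direct conversion returns the internal level codes, not the labels
codes <- as.numeric(f)

# Going through text first returns the numbers written in the labels
values <- as.numeric(as.character(f))

print(codes)     # 1 2 1 3
print(values)    # 10 20 10 30
```
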
75. Using `m1 <- matrix(1:12, ncol = 3)`. Sum of columns (one value for each column): `colSums(m1)` gives `[1] 10 26 42`, or `apply(m1,2,sum)` gives `[1] 10 26 42`.
76. Sum of rows (one value for each row): `rowSums(m1)` gives `[1] 15 18 21 24`, or `apply(m1,1,sum)` gives `[1] 15 18 21 24`. With `apply` you may use any function, even your own.
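
As a sketch of using your own function with `apply`, the hypothetical helper `amplitude` below (not part of the slides) returns the spread between the largest and smallest value:

```r
m1 <- matrix(1:12, ncol = 3)

# A user-defined function: largest value minus smallest value
amplitude <- function(v) max(v) - min(v)

by_col <- apply(m1, 2, amplitude)   # one value per column
by_row <- apply(m1, 1, amplitude)   # one value per row

print(by_col)
print(by_row)
```
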
77. `aggregate(salario ~ instrucao, data = milsa, mean)`

          instrucao   salario
        1    1oGrau  7.836667
        2    2oGrau 11.528333
        3  Superior 16.475000

78. `aggregate(salario ~ instrucao + civil, data = milsa, mean)`

          instrucao    civil   salario
        1    1oGrau   casado  7.044000
        2    2oGrau   casado 12.825000
        3  Superior   casado 17.783333
        4    1oGrau solteiro  8.402857
        5    2oGrau solteiro  8.935000
        6  Superior solteiro 15.166667

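
The formula interface of `aggregate` can be tried on a few made-up rows. Only the column names follow the slides, and the use of `max` in the second call is an illustration of passing a different summary function.

```r
# Made-up data
milsa <- data.frame(
  instrucao = c("1oGrau", "1oGrau", "Superior", "Superior"),
  civil     = c("casado", "solteiro", "casado", "casado"),
  salario   = c(6, 8, 16, 18)
)

# Mean salary for each education level
by_inst <- aggregate(salario ~ instrucao, data = milsa, FUN = mean)
print(by_inst)

# Highest salary for each combination of education level and marital status
by_both <- aggregate(salario ~ instrucao + civil, data = milsa, FUN = max)
print(by_both)
```
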
79. `model = lm(formula = salario ~ ano + instrucao, data = milsa)`, then `summary(model)`. Just one line!
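
To see what `lm` returns, here is a sketch on made-up data in which `salario` is an exact linear function of `ano`, so the fitted coefficients are known in advance. The data frame `d` and its values are assumptions for illustration only.

```r
# Made-up data: salario grows by exactly 0.5 for each unit of ano
d <- data.frame(ano = c(20, 25, 30, 35, 40))
d$salario <- 2 + 0.5 * d$ano

model <- lm(salario ~ ano, data = d)

print(coef(model))                                      # intercept and slope
print(predict(model, newdata = data.frame(ano = 50)))   # prediction for a new case
```

With real data the fit is not exact, and `summary(model)` adds standard errors, p-values and R-squared to these coefficients.
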
80. Prof. Dr. Roberto Dantas de Pinho, roberto.pinho@mct.gov.br. This presentation is based on courses by Dr. Paulo Justiniano Ribeiro Jr (UFPR) & Dr. Cosme Marcelo Furtado Passos da Silva (FIOCRUZ).