Basic R

Usage rights: CC Attribution-NonCommercial License

Basic R: Presentation Transcript

  • Prof. Dr. Roberto Dantas de Pinho, roberto.pinho@mct.gov.br, 26/jul/2012
    This presentation is based on courses by Dr. Paulo Justiniano Ribeiro Jr (UFPR) & Dr. Cosme Marcelo Furtado Passos da Silva (FIOCRUZ)
    SEXECASCAV|CGIN
  • A First R Session
    Objects
    Data input
    Now that we have data...
    Some analyses
    Filter & select
    Saving your work
    Changing data
    Sums and aggregates
    Linear regression
    And lots of other things along the way
  • Install, configuration etc.
    R internals, structure etc.
    Handling large datasets
    Fancy plots beyond the basics
  • You can use R to evaluate some simple expressions. Just type:
    1 + 2 + 3
    2 + 3 * 4
    3/2 + 1
    4 * 3**3    # ** is accepted as a synonym for ^
    R is an environment and a language.
  • The R environment allows you to submit commands and see results immediately.
    The R language is made up of the set of rules and functions that may be run by the R environment.
    You may keep command sequences (scripts) for later use.
  • Several functions are available. A couple of simple examples:
    sqrt(2)     square root of 2
    abs(-10)    absolute value of -10, i.e. 10
    sin(pi)     sine of π
    pi is a constant in R; its value is already defined.
  • Results, input data, tables etc. are all stored in R as objects.
    Objects have a name, content and type, and are stored in memory.
    Ex.:
    Create object "x" with the number 10:
    x <- 10
    Show the content of x:
    x
    R is case-sensitive: abc is different from ABC.
  • Try:
    X <- sqrt(2)
    Y = sin(pi)
    Z = sqrt(X+Y)
    For assignments like these, <- and = are equivalent.
    In the above examples, X, Y and Z store the results from each operation.
    In R, there are always many ways of doing the same thing. We will try to focus on a single way of doing each task.
  • What is the value of C at the end of the script?
    A = 1
    B = 2
    C = A + B
    A = 5
    B = 5
    Why?
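The quiz above can be checked directly. A minimal sketch, using the same object names as the slide: an assignment stores the value computed at that moment, so later changes to A and B do not propagate to C.

```r
A <- 1
B <- 2
C <- A + B   # C is computed now, from the current values: 1 + 2
A <- 5
B <- 5       # reassigning A and B afterwards does not touch C
print(C)     # [1] 3
```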
  • Tool that makes it easier to use R
    Manages work windows
    Easier access to objects, scripts, history of commands and plots
  • Editing scripts & object view
    Console
  • Object list & history
    Help, plots, files & packages
  • Objects that hold multiple values of a single data type.
    Function c( ) ("c" from concatenate) groups values to build a vector:
    X = c(1,3,6)
    To access vector elements:
    X[1]
    X[3]
  • Operations may be performed and functions applied over the whole vector. Ex.:
    X = c(1,3,5)
    Y = c(10,20,30)
    X+Y
    [1] 11 23 35
    sum(X)
    [1] 9
    How about X + 100?
    [1] 101 103 105
    due to the recycling rule
  • When the size of an object required by an operation is different from its actual size, the available data is repeated as needed.
    As X has 3 elements, X+100 is the same as X + c(100,100,100)
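A small sketch of the recycling behaviour described above. The vector `long` and the result names are hypothetical, introduced only for illustration.

```r
X <- c(1, 3, 5)
r1 <- X + 100                 # 100 is repeated to match the length of X
r2 <- X + c(100, 100, 100)    # the explicit equivalent
identical(r1, r2)             # TRUE

# A shorter vector is recycled too, when the lengths divide evenly:
long <- 1:6
r3 <- long * c(1, 10)         # same as 1:6 * c(1, 10, 1, 10, 1, 10)
print(r3)                     # [1]  1 20  3 40  5 60
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.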
  • > X = 1:10
    > X
    [1] 1 2 3 4 5 6 7 8 9 10
    > X = seq(0,1,by=0.1)
    > X
    [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
    > rep("a",5)
    [1] "a" "a" "a" "a" "a"
    > names = c("fulano", "beltrano", "cicrano")
    > names
    [1] "fulano" "beltrano" "cicrano"
    > letras = letters[1:5]
    > letras
    [1] "a" "b" "c" "d" "e"
    > letras = LETTERS[1:5]
    > letras
    [1] "A" "B" "C" "D" "E"
  • numeric: is.numeric( ), as.numeric( )
    integer: is.integer( ), as.integer( )
    character: is.character( ), as.character( )
    logical: T == TRUE == 1, F == FALSE == 0
    A == B means "is A equal to B?"
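To make the type functions above concrete, here is a short sketch on a hypothetical vector `x`; the variable names are illustrative only.

```r
x <- c(1.5, 2, 3)
is.numeric(x)          # TRUE
is.integer(x)          # FALSE: these values are stored as doubles
i <- as.integer(x)     # drops the fractional part: 1 2 3
s <- as.character(x)   # "1.5" "2" "3"
flags <- x > 1.8       # logical vector: FALSE TRUE TRUE
n_true <- sum(flags)   # TRUE counts as 1 and FALSE as 0, so this is 2
```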
  • A vector arranged in rows & columns
    m1 <- matrix(1:12, ncol = 3)
         [,1] [,2] [,3]
    [1,]    1    5    9
    [2,]    2    6   10
    [3,]    3    7   11
    [4,]    4    8   12
  • length(m1)
    [1] 12
    dim(m1)
    [1] 4 3
    nrow(m1)
    [1] 4
    ncol(m1)
    [1] 3
  • m1[1, 2]
    [1] 5
    m1[2, 2]
    [1] 6
    m1[ , 2]
    [1] 5 6 7 8
    m1[3, ]
    [1] 3 7 11
    m1[1,2] = 99 changes the value of the cell
  • m1[1:2, 2:3]
         [,1] [,2]
    [1,]    5    9
    [2,]    6   10
  • colnames(m1)
    NULL
    rownames(m1)
    NULL
    colnames(m1) = c("C1","C2","C3")
    m1[,"C1"]
    [1] 1 2 3 4
    t(m1) # transpose of m1
  • A "matrix" with many dimensions. Ex. with 3 dimensions:
    ar1 <- array(1:24, dim = c(3, 4, 2))
    , , 1    (1st matrix)
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12
    , , 2    (2nd matrix)
         [,1] [,2] [,3] [,4]
    [1,]   13   16   19   22
    [2,]   14   17   20   23
    [3,]   15   18   21   24
    For a 3-dimensional array, you might visualize the 3rd dimension as a collection of matrices.
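Indexing an array works like indexing a matrix, with one extra position per dimension. A brief sketch on the same `ar1`; the names `second` and `cell` are hypothetical.

```r
ar1 <- array(1:24, dim = c(3, 4, 2))
dim(ar1)               # 3 4 2
second <- ar1[, , 2]   # the second 3 x 4 matrix (values 13 to 24)
cell <- ar1[2, 3, 1]   # row 2, column 3 of the first matrix
print(cell)            # [1] 8
```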
  • How to work with this kind of data?
    Columns: year (Ano); state (UF); agency code (Código do Órgão); agency (Órgão); budget unit code (Código da UO); budget unit (unidade orçamentária); function (função); subfunction (subfunção); program (programa); action (ação); locator (localizador); action description (descrição da ação); R&D value (valor P&D); ACTC value (valor ACTC)
    2010; AC; 1; Adm direta e indireta; 1; Adm direta e indireta; 19; 121; 2056; 1548; MODERNIZAÇÃO DO SISTEMA DE PLANEJAMENTO E GESTÃO DA SDCT; R$ -; R$ 16.655,00
    2010; AC; 1; Adm direta e indireta; 1; Adm direta e indireta; 19; 121; 2056; 1549; PROGRAMA DE COOPERAÇÃO TÉCNICA E FINANCEIRA COM INSTIT. NAC. INTERN. GOVERNAMENTAIS E NÃO GOVERNAMENTAIS; R$ -; R$ 715.000,00
    2010; AC; 1; Adm direta e indireta; 1; Adm direta e indireta; 19; 122; 2009; 2224; MANUTENÇÃO DO GABINETE DO SECRETÁRIO; R$ -; R$ 27.732,11
    2010; AC; 1; Adm direta e indireta; 1; Adm direta e indireta; 19; 122; 2009; 2227; DEPARTAMENTO DE GESTÃO INTERNA; R$ -; R$ 2.266.169,90
  • Each column has its own data type.
    d = data.frame(letters[1:4], 1:4, 10.5)
      letters.1.4. X1.4 X10.5
    1            a    1  10.5
    2            b    2  10.5
    3            c    3  10.5
    4            d    4  10.5
    We will be using data.frames most of the time.
    We can change column names:
    colnames(d) = c("letra","num", "valor")
    colnames(d)
    [1] "letra" "num" "valor"
    d$valor # selects column "valor" from d
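As an alternative to renaming afterwards, column names can be given when the data.frame is created. A minimal sketch reusing the slide's names; the row filter at the end is an added illustration.

```r
# Name the columns directly in the call to data.frame()
d <- data.frame(letra = letters[1:4], num = 1:4, valor = 10.5)
str(d)                       # one line per column, with its type
d$valor                      # 10.5 10.5 10.5 10.5 (the single value is recycled)
subset_d <- d[d$num > 2, ]   # rows where num is 3 or 4
```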
  • list, factor: later...
  • Several possible sources. We will see:
    Keyboard: x = scan( )
    Excel files
    CSV files
    SQL databases
  • require(XLConnect)
    wb <- loadWorkbook("AC_PDACTCaula.xls")
    plan1 <- readWorksheet(wb, sheet = 1)
    str(plan1)
    View(plan1)
  • require(XLConnect) Loads package XLConnect Packages are sets of functions and data that add capabilities to R. If the package is not installed:setInternet2() #only on windowsinstall.packages("XLConnect", dep=T) SEXECASCAV|CGIN 30
  • Creates an object "wb" that points to the Excel file:
    wb <- loadWorkbook("AC_PDACTCaula.xls")
  • Load the first sheet's data into an object called "plan1":
    plan1 <- readWorksheet(wb, sheet = 1)
    R functions identify parameters by name, by order, or both.
  • Show the structure of the new object:
    str(plan1)
    str() works with any R object. It is very useful.
    Show the data in a window:
    View(plan1)
    In RStudio, you may click on an object in the objects list to the same effect.
  • args(readWorksheet) # shows available parameters
    function (object,   # workbook "wb"
              sheet,    # number or name of the sheet
              startRow, startCol, endRow, endCol,
              header    # T or F: use first line to name columns
    )
  • Comma-separated values
    Very popular format for data interchange
    Other separators are also popular: ; <tab> <space>
    Example:
    uf ano valido somaactc somapd
    AC 2009 1 34296430.67 3630841.04
    AC 2010 1 29397712.04 3579715.12
    AL 2009 1 12650160.51 8903714.41
  • Example:
    uf ano valido somaactc somapd
    AC 2009 1 34296430,67 3630841,04
    AC 2010 1 29397712,04 3579715,12
    AL 2009 1 12650160,51 8903714,41
    To read this file:
    d = read.csv(file="AgregaUF20110930_b.txt",
      header=T, # uses first line as column names
      sep="\t", # separator is <tab>
      dec=","   # decimal separator is a comma
    )
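The same call can be tried without the file. The sketch below passes the three rows shown above as a string through the `text` argument (forwarded by `read.csv` to `read.table`); the variable `raw` is hypothetical and stands in for the file contents.

```r
# The slide's sample rows, tab-separated, with decimal commas
raw <- "uf\tano\tvalido\tsomaactc\tsomapd
AC\t2009\t1\t34296430,67\t3630841,04
AC\t2010\t1\t29397712,04\t3579715,12
AL\t2009\t1\t12650160,51\t8903714,41"

d <- read.csv(text = raw, header = TRUE, sep = "\t", dec = ",")
str(d)          # somaactc and somapd are read as numeric
sum(d$somapd)   # sum over the three sample rows
```

Without `dec = ","`, the two value columns would be read as text rather than numbers.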
  • str(d)     # structure
    summary(d) # statistical summary
    head(d)    # first rows
    tail(d)    # last rows
    plot(d)    # standard plot
  • require(RODBC)
    canal <- odbcConnect("base_ODBC", case="tolower", uid="user", pwd="password")
    d <- sqlQuery(canal, "select * from table where year = 2010", as.is=T)
  • How to get the sum of values from a data.frame column?
    sum(data.frame$column)
    sum(d$somapd)
    [1] NA
  • NA (Not Available): missing values.
    NaN (Not a Number): a value that cannot be represented as a number.
    Inf & -Inf: plus and minus infinity.
    Try: c(-1,0,1)/0
  • Sum:
    sum(d$somapd, na.rm=T)
    [1] 4836882446
    Mean:
    mean(d$somapd, na.rm=T)
    Median:
    median(d$somapd, na.rm=T)
    Standard deviation:
    sd(d$somapd, na.rm=T)
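A small sketch of how the special values behave and why `na.rm` is needed. The vector `v` is hypothetical data with one missing entry.

```r
special <- c(-1, 0, 1) / 0      # -Inf  NaN  Inf

v <- c(10, NA, 30)
sum(v)                          # NA: one missing value makes the whole sum missing
total <- sum(v, na.rm = TRUE)   # 40: missing values are dropped first
is.na(v)                        # FALSE  TRUE FALSE
```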
  • For these examples:
    milsa = read.csv("milsaText.txt", sep="\t", head=T, dec=".")
  • Absolute frequencies
    table(milsa$civil)
    Relative frequencies
    table(milsa$civil) / length(milsa$civil)
    or
    prop.table(table(milsa$civil))
    Pie chart
    pie(table(milsa$civil))
  • With attach(milsa):
    Absolute frequencies
    table(civil)
    Relative frequencies
    table(civil) / length(civil)
    or
    prop.table(table(civil))
    Pie chart
    pie(table(civil))
    After: detach(milsa)
  • Bar plot:
    barplot(table(instrucao))
    Remember: any result may be saved as an object to use later.
    instrucao.tb = table(instrucao)
    barplot(instrucao.tb)
    pie(instrucao.tb)
  • Try:
    prop.table(filhos)
    Solution:
    prop.table(table(filhos))
    Other solution: filter out elements with NA
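To see why the first attempt fails, here is a sketch with a hypothetical `filhos` vector (the real column comes from the milsa file, which is not reproduced here).

```r
filhos <- c(0, 1, 2, NA, 2, 1, 2)   # hypothetical number of children, one missing

bad <- prop.table(filhos)    # divides each value by sum(filhos), which is NA
freq <- table(filhos)        # counts per value; NA is dropped by default
rel <- prop.table(freq)      # proportions of the non-missing values
sum(rel)                     # 1
```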
  • mean(filhos, na.rm=T)
    median(filhos, na.rm=T)
    range(filhos, na.rm=T)
    var(filhos, na.rm=T) # variance
    sd(filhos, na.rm=T) # standard deviation
    Quantiles:
    filhos.quartis = quantile(filhos, na.rm=T)
    Interquartile range (75% quantile minus 25% quantile):
    filhos.quartis[4] - filhos.quartis[2]
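By default `quantile()` returns five values (0%, 25%, 50%, 75%, 100%), so the quartiles sit at positions 2 and 4. A sketch on hypothetical data, compared against base R's `IQR()`:

```r
filhos <- c(0, 1, 2, NA, 3, 2, 1, NA, 5)   # hypothetical values

q <- quantile(filhos, na.rm = TRUE)
names(q)                                   # "0%" "25%" "50%" "75%" "100%"
iqr_manual <- unname(q[4] - q[2])          # 75% quantile minus 25% quantile
iqr_builtin <- IQR(filhos, na.rm = TRUE)   # same quantity, computed by base R
```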
  • plot(milsa)
    plot(salario ~ ano)
    hist(salario)
    boxplot(salario)
    stem(salario)
  • Selecting some rows:
    milsaNovo = milsa[c(1,3,5,6) , ]
    Selecting some columns:
    milsaNovo = milsa[ , c(1,3,5)]
    milsaNovo = milsa[ , c("funcionario", "instrucao", "salario")]
    Attention:
    New copy: milsaNovo = milsa[c(1,3,5,6) , ]
    Replaces previous: milsa = milsa[c(1,3,5,6) , ]
  • Who earns above the median?
    acimamediana = milsa[ salario > median(salario), ]
    Who is married and has a higher education degree?
    casadoEsuperior = milsa[ civil=="casado" & instrucao == "Superior", ]
    AND: both must be true
  • Who is married or has a higher education degree?
    casadoOUsuperior = milsa[ civil=="casado" | instrucao == "Superior", ]
    OR: at least one must be true
  • NOT milsaLimpo=milsa[!is.na(salario), ] In English:  New Table milsaLimpo  equals =  Old table milsa  Select [  Rows where  Salary is not NA ! is.na(salario)  And all columns , ] SEXECASCAV|CGIN 52
  • How many are married?
    sum(civil=="casado")
    or
    table(civil)["casado"]
    How many are married and have a higher education degree?
    sum(civil=="casado" & instrucao == "Superior")
    or
    table(civil,instrucao)["casado","Superior"]
  • milsaNovo is equal to milsa, without rows 1, 2 & 5 and without columns 1 & 8:
    milsaNovo = milsa[-c(1,2,5), -c(1,8)]
  • which(): the rows where a condition is TRUE
    sup = which(instrucao=="Superior")
    sup
    [1] 19 24 31 33 34 36
    May be used again later:
    mean(milsa[sup,"salario"]) # mean salary for those with higher education
    Advantage: sup holds only row numbers; it is not a copy of the data!
  • A random sample of 10 rows from milsa:
    amostra = sample(x=nrow(milsa), size=10)
    amostra
    [1] 12 29 1 3 17 14 26 33 20 31
    Mean salary for the sample:
    mean(milsa[amostra,"salario"])
  • By number of children:
    milsa[order(filhos),]
    Decreasing:
    milsa[order(filhos, decreasing=T),]
    By number of children and then age:
    milsa[order(filhos,ano),]
    10 youngest:
    head(milsa[order(ano),], 10)
    10 oldest:
    tail(milsa[order(ano),], 10)
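`order()` returns row positions, which are then used as a row index. A sketch on a hypothetical four-row table (`df` and its values are invented for illustration):

```r
df <- data.frame(nome   = c("a", "b", "c", "d"),
                 filhos = c(2, 0, 2, 1),
                 ano    = c(40, 25, 31, 36))

by_children     <- df[order(df$filhos), ]           # ties keep their original order
by_children_age <- df[order(df$filhos, df$ano), ]   # ties broken by age
youngest2       <- head(df[order(df$ano), ], 2)     # the two youngest rows

by_children$nome       # "b" "d" "a" "c"
by_children_age$nome   # "b" "d" "c" "a"
```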
  • Removing an object:
    rm(milsaNovo)
    Removing every object:
    rm(list = ls())
    ls(): list of current objects
  • List objects are collections that may include different types of objects.
    lis = list(A=1:10, B="Text", C = matrix(1:9,ncol=3))
    They are often used as parameters to functions or as result sets from them.
    lis[1:2]: a list with the first two objects from lis (A & B)
    lis[[1]]: the object stored at the first position of the list (the content of A). The same as lis$A
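The difference between single and double brackets is easiest to see by checking the class of each result. A short sketch on the same `lis`:

```r
lis <- list(A = 1:10, B = "Text", C = matrix(1:9, ncol = 3))

class(lis[1])                # "list": single brackets return a shorter list
class(lis[[1]])              # "integer": double brackets return the element itself
identical(lis[[1]], lis$A)   # TRUE
lis$C[2, 3]                  # 8: elements can be indexed further
```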
  • Saving all objects:
    save.image("file.RData")
    Saving selected objects:
    save(x, y, file="file.RData")
    Loading:
    load("file.RData")
    Several "loads": objects with distinct names are kept in memory
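A round trip can be tested without leaving files behind. This sketch assumes a temporary file from `tempfile()` in place of "file.RData"; the objects `x` and `y` are arbitrary examples.

```r
x <- 1:3
y <- "text"

f <- tempfile(fileext = ".RData")   # temporary path instead of "file.RData"
save(x, y, file = f)                # write the two selected objects
rm(x, y)                            # remove them from memory
loaded <- load(f)                   # restores them; returns the restored names
print(loaded)                       # [1] "x" "y"
```

An object already in memory with the same name as one in the file is overwritten by load().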
  • Saving a ".R" script that reproduces the desired output.
    Advantages:
    It may be used to document the work performed;
    It may be run again over updated data to update the results.
    Hybrid model:
    Save intermediate results that take a long time to process. Update them less often.
  • Add a column to a data.frame:
    milsa$idade = milsa$ano + milsa$mes/12
  • [Figures: diagrams of two sets labelled X and Y; the first carries the annotation 6 + 3 + 5 = 14.]
  • Only rows found in both data.frames:
    merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all=F)
    All rows from data.frame x:
    merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all.x=T)
  • All rows from data.frame y:
    merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all.y=T)
    All rows from data.frames x & y:
    merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all=T)
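The four variants differ only in which unmatched rows are kept. The sketch below uses two hypothetical tables: the `desc` key matches the slides, while the `anos` column and all values are invented for illustration.

```r
milsa <- data.frame(funcionario = 1:4,
                    instrucao   = c("1oGrau", "2oGrau", "Superior", "Mestrado"))
tabInst <- data.frame(desc = c("1oGrau", "2oGrau", "Superior", "Doutorado"),
                      anos = c(8, 11, 15, 21))

inner <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc")                # matches only
left  <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc", all.x = TRUE)  # all of x
right <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc", all.y = TRUE)  # all of y
full  <- merge(milsa, tabInst, by.x = "instrucao", by.y = "desc", all = TRUE)    # everything

c(nrow(inner), nrow(left), nrow(right), nrow(full))   # 3 4 4 5
```

Rows without a match on the other side are filled with NA, for example `anos` for "Mestrado" in `left`.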
  • From text to numeric:
    d.f$novaColuna = as.numeric(d.f$coluna)
    From numeric to text:
    d.f$novaColuna = as.character(d.f$coluna)
    From text or numeric to integer:
    d.f$novaColuna = as.integer(d.f$coluna)
    Integers save memory
  • Representation for categorical data:
    Nominal: "married", "single"
    Ordinal: "tall", "short"
    Factors save memory
    They ensure proper treatment of these variables by many R functions
  • Nominal:
    milsa$fatorcivil = factor(milsa$civil, ordered=F)
    $fatorcivil : Factor w/ 2 levels "casado","solteiro": 2 1 1 2 2 1 2 2 1 2
    Ordinal:
    milsa$fatormes = factor(milsa$mes, ordered=T)
    $fatormes : Ord.factor w/ 12 levels "0"<"1"<"2"<"3"<..: 4 11 6 11 8 1 1 5 11 7 ...
    It is possible to define a custom order: ?factor
  • From factor to text:
    d.f$novaColuna = as.character(d.f$colunaFator)
    From factor to numeric:
    d.f$novaColuna = as.numeric(as.character(d.f$colunaFator))
    The internal representation of a factor is different from its text description
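A short sketch of why the intermediate as.character() step matters. The factor `f` is hypothetical.

```r
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                 # 1 2 1 3: the internal level codes
values <- as.numeric(as.character(f))   # 10 20 10 30: the labels, as numbers
levels(f)                               # "10" "20" "30"
```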
  • Using: m1 <- matrix(1:12, ncol = 3)
    Sum of columns (one value for each column):
    colSums(m1)
    [1] 10 26 42
    or
    apply(m1,2,sum)
    [1] 10 26 42
  • Sum of rows (one value for each row):
    rowSums(m1)
    [1] 15 18 21 24
    or
    apply(m1,1,sum)
    [1] 15 18 21 24
    May use any function, even your own.
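As an illustration of passing your own function to apply(), here is a sketch with a hypothetical helper named `amplitude` (not part of the original slides) that returns the spread of a vector.

```r
m1 <- matrix(1:12, ncol = 3)

# Hypothetical helper: difference between the largest and smallest value
amplitude <- function(v) max(v) - min(v)

per_column <- apply(m1, 2, amplitude)   # 3 3 3
per_row    <- apply(m1, 1, amplitude)   # 8 8 8 8
```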
  • aggregate(salario ~ instrucao, data = milsa, mean)
      instrucao   salario
    1    1oGrau  7.836667
    2    2oGrau 11.528333
    3  Superior 16.475000
  • aggregate(salario ~ instrucao + civil, data = milsa, mean)
      instrucao    civil   salario
    1    1oGrau   casado  7.044000
    2    2oGrau   casado 12.825000
    3  Superior   casado 17.783333
    4    1oGrau solteiro  8.402857
    5    2oGrau solteiro  8.935000
    6  Superior solteiro 15.166667
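The formula reads "salario grouped by instrucao". A sketch on a hypothetical five-row table, small enough to check the means by hand:

```r
milsa <- data.frame(
  instrucao = c("1oGrau", "1oGrau", "2oGrau", "2oGrau", "Superior"),
  salario   = c(4, 6, 10, 12, 16)
)

res <- aggregate(salario ~ instrucao, data = milsa, FUN = mean)
print(res)   # one row per group: means 5, 11 and 16
```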
  • model = lm(formula = salario ~ ano + instrucao, data = milsa)
    summary(model)
    Just one line!!!
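To try lm() without the milsa file, the sketch below fits a single-predictor model on synthetic data. The true slope of 0.4, the intercept of 2 and the alternating noise are assumptions chosen for illustration only.

```r
# Synthetic data: salary grows 0.4 per year of age, plus small alternating noise
ano     <- 20:39
salario <- 2 + 0.4 * ano + rep(c(-0.5, 0.5), 10)

model <- lm(salario ~ ano)
coef(model)                # intercept and slope estimates
summary(model)$r.squared   # share of the variance explained
```

The estimated slope should come out close to 0.4, since the data were generated that way.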