Basic R

Published in Technology

Transcript

  • 1. Prof. Dr. Roberto Dantas de Pinho, roberto.pinho@mct.gov.br 26/jul/2012 This presentation is based on courses by Dr. Paulo Justiniano Ribeiro Jr (UFPR) & Dr. Cosme Marcelo Furtado Passos da Silva (FIOCRUZ) SEXECASCAV|CGIN 1
  • 2. A first R session
    Objects
    Data input
    Now that we have data...: some analyses; filter & select
    Saving your work
    Changing data
    Sums and aggregates
    Linear regression
    And lots of other things along the way
  • 3. Install, configuration etc.
    R internals, structure etc.
    Handling large datasets
    Fancy plots beyond the basics
  • 4. You can use R to evaluate some simple expressions. Just type:
        1 + 2 + 3
        2 + 3 * 4
        3/2 + 1
        4 * 3**3    # ** means the same as ^
    R is an environment and a language.
  • 5. The R environment allows you to submit commands and see the results immediately.
    The R language is the set of rules and functions that may be run by the R environment.
    You may keep command sequences (scripts) for later use.
  • 6. Several functions are available. A couple of simple examples:
        sqrt(2)     # square root of 2
        abs(-10)    # absolute value of -10
        sin(pi)     # sine of pi
    pi is a constant in R; its value is already defined.
  • 7. Results, input data, tables etc. are all stored in R as objects.
    Objects have a name, content and type, and are stored in memory. Ex.:
    Create object "x" with the number 10:
        x <- 10
    Show the content of x:
        x
    R is case-sensitive: abc is different from ABC.
  • 8. Try:
        X <- sqrt(2)
        Y = sin(pi)
        Z = sqrt(X+Y)
    For assignments like these, <- and = are equivalent.
    In the above examples, X, Y and Z store the results of each operation.
    In R, there are usually many ways of doing the same thing. We will try to focus on a single way of doing each task.
  • 9. What is the value of C at the end of the script?
        A = 1
        B = 2
        C = A + B
        A = 5
        B = 5
    Why?
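One way to check the answer to the question on slide 9 is simply to run the script. The sketch below adds only a print() call: an assignment stores the value computed at that moment, so later changes to A and B do not reach C.

```r
A = 1
B = 2
C = A + B   # C stores the value 3, not the expression "A + B"
A = 5
B = 5
print(C)    # still 3
```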
  • 11. RStudio: a tool that makes it easier to use R.
    Manages work windows.
    Easier access to objects, scripts, history of commands and plots.
  • 12. Editing scripts & object view; Console
  • 13. Object list & history; Help, plots, files & packages
  • 14. Vectors: objects that hold multiple values of a single type.
    Function c( ) ("c" from concatenate) groups values to build a vector:
        X = c(1,3,6)
    To access vector elements:
        X[1]
        X[3]
  • 15. Operations may be performed, and functions applied, over the whole vector. Ex.:
        X = c(1,3,5)
        Y = c(10,20,30)
        X+Y
        [1] 11 23 35
        sum(X)
        [1] 9
    How about X + 100?
        [1] 101 103 105
    This is due to the recycling rule.
  • 16. When the size of an object required by an operation is different from its actual size, the available data is repeated as needed.
    As X has 3 elements, X + 100 is the same as X + c(100,100,100).
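Recycling also applies when the shorter operand has more than one element. A small sketch, using a made-up four-element vector rather than the X from the slides:

```r
X = c(1, 3, 5, 7)      # hypothetical vector with 4 elements
r1 = X + 100           # 100 is repeated four times
r2 = X + c(10, 20)     # c(10, 20) is repeated twice: 11 23 15 27
r3 = X * c(1, 0)       # keeps every other element: 1 0 5 0
print(r2)
print(r3)
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning.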
  • 17. Other ways to build vectors:
        > X = 1:10
        > X
         [1]  1  2  3  4  5  6  7  8  9 10
        > X = seq(0,1,by=0.1)
        > X
         [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
        > rep("a",5)
        [1] "a" "a" "a" "a" "a"
        > names = c("fulano", "beltrano", "cicrano")
        > names
        [1] "fulano"   "beltrano" "cicrano"
        > letras = letters[1:5]
        > letras
        [1] "a" "b" "c" "d" "e"
        > letras = LETTERS[1:5]
        > letras
        [1] "A" "B" "C" "D" "E"
  • 18. Data types:
    numeric: is.numeric( ), as.numeric( )
    integer: is.integer( ), as.integer( )
    character: is.character( ), as.character( )
    logical: T == TRUE == 1, F == FALSE == 0
    A == B means "is A equal to B?"
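The type functions listed on slide 18 can be tried on a small vector. This is a sketch with made-up values; the comments state what each call returns.

```r
x = c(1.5, 2, 3)       # hypothetical values
is.numeric(x)          # TRUE
is.integer(x)          # FALSE: the values are stored as doubles
y = as.integer(x)      # drops the fractional part: 1 2 3
z = as.character(y)    # "1" "2" "3"
two = TRUE + TRUE      # 2, because TRUE counts as 1
n = sum(x > 1.8)       # 2: number of elements greater than 1.8
```

Because logical values count as 1 and 0, sum() over a comparison counts how many elements satisfy it.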
  • 19. Matrix: a vector arranged in rows & columns.
        m1 <- matrix(1:12, ncol = 3)
             [,1] [,2] [,3]
        [1,]    1    5    9
        [2,]    2    6   10
        [3,]    3    7   11
        [4,]    4    8   12
  • 20. Size of a matrix:
        length(m1)
        [1] 12
        dim(m1)
        [1] 4 3
        nrow(m1)
        [1] 4
        ncol(m1)
        [1] 3
  • 21. Accessing elements:
        m1[1, 2]
        [1] 5
        m1[2, 2]
        [1] 6
        m1[ , 2]
        [1] 5 6 7 8
        m1[3, ]
        [1] 3 7 11
        m1[1,2] = 99    # changes the value of the cell
  • 22. Selecting a block of rows and columns:
        m1[1:2, 2:3]
             [,1] [,2]
        [1,]    5    9
        [2,]    6   10
  • 23. Row and column names:
        colnames(m1)
        NULL
        rownames(m1)
        NULL
        colnames(m1) = c("C1","C2","C3")
        m1[,"C1"]
        [1] 1 2 3 4
        t(m1)    # transpose of m1
  • 24. Array: a "matrix" with many dimensions. Ex. with 3 dimensions:
        ar1 <- array(1:24, dim = c(3, 4, 2))
        , , 1                        (1st matrix)
             [,1] [,2] [,3] [,4]
        [1,]    1    4    7   10
        [2,]    2    5    8   11
        [3,]    3    6    9   12
        , , 2                        (2nd matrix)
             [,1] [,2] [,3] [,4]
        [1,]   13   16   19   22
        [2,]   14   17   20   23
        [3,]   15   18   21   24
    For a 3-dimensional array, you might visualize the 3rd dimension as a collection of matrices.
  • 25. How to work with this kind of data?
    Columns: Ano | UF | Código do Órgão | Órgão | Código da UO | unidade orçamentária | função | subfunção | programa | ação | localizador | descrição da ação | valor P&D | valor ACTC
        2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 121 | 2056 | 1548 | | MODERNIZAÇÃO DO SISTEMA DE PLANEJAMENTO E GESTÃO DA SDCT | R$ - | R$ 16.655,00
        2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 121 | 2056 | 1549 | | PROGRAMA DE COOPERAÇÃO TÉCNICA E FINANCEIRA COM INSTIT. NAC. INTERN. GOVERNAMENTAIS E NÃO GOVERNAMENTAIS | R$ - | R$ 715.000,00
        2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 122 | 2009 | 2224 | | MANUTENÇÃO DO GABINETE DO SECRETÁRIO | R$ - | R$ 27.732,11
        2010 | AC | 1 | Adm direta e indireta | 1 | Adm direta e indireta | 19 | 122 | 2009 | 2227 | | DEPARTAMENTO DE GESTÃO INTERNA | R$ - | R$ 2.266.169,90
  • 26. data.frame: each column has its own data type.
        d = data.frame(letters[1:4], 1:4, 10.5)
          letters.1.4. X1.4 X10.5
        1            a    1  10.5
        2            b    2  10.5
        3            c    3  10.5
        4            d    4  10.5
    We will be using data.frames most of the time.
    We can change column names:
        colnames(d) = c("letra","num", "valor")
        colnames(d)
        [1] "letra" "num"   "valor"
        d$valor    # selects column "valor" from d
  • 27. list and factor: later...
  • 28. Data input: several possible sources. We will see:
    Keyboard: x = scan( )
    Excel files
    CSV files
    SQL databases
  • 29. Reading an Excel file:
        require(XLConnect)
        wb <- loadWorkbook("AC_PDACTCaula.xls")
        plan1 <- readWorksheet(wb, sheet = 1)
        str(plan1)
        View(plan1)
  • 30. require(XLConnect) loads the package XLConnect.
    Packages are sets of functions and data that add capabilities to R.
    If the package is not installed:
        setInternet2()    # only on Windows, in older versions of R
        install.packages("XLConnect", dependencies = TRUE)
  • 31. Create an object "wb" that points to the Excel file:
        wb <- loadWorkbook("AC_PDACTCaula.xls")
  • 32. Load the data from the first sheet into an object called "plan1":
        plan1 <- readWorksheet(wb, sheet = 1)
    R functions identify parameters by name, by order, or both.
  • 33. Show the structure of the new object:
        str(plan1)
    str() works with any R object. It is very useful.
    Show the data in a window:
        View(plan1)
    In RStudio, you may click on an object in the objects list to the same effect.
  • 34. Showing the available parameters of a function:
        args(readWorksheet)   # shows available parameters
        function (object,     # workbook "wb"
                  sheet,      # number or name of the sheet
                  startRow,
                  startCol,
                  endRow,
                  endCol,
                  header      # T or F: use first line to name columns
                  )
  • 35. CSV: comma-separated values.
    Very popular format for data interchange.
    Other separators are also popular: ; <tab> <space>
    Example:
        uf  ano   valido  somaactc     somapd
        AC  2009  1       34296430.67  3630841.04
        AC  2010  1       29397712.04  3579715.12
        AL  2009  1       12650160.51  8903714.41
  • 36. Example with <tab> as separator and comma as decimal mark:
        uf  ano   valido  somaactc     somapd
        AC  2009  1       34296430,67  3630841,04
        AC  2010  1       29397712,04  3579715,12
        AL  2009  1       12650160,51  8903714,41
    To read this file:
        d = read.csv(file="AgregaUF20110930_b.txt",
                     header=T,   # uses first line as column names
                     sep="\t",   # separator is <tab>
                     dec=","     # decimals use comma
                     )
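The effect of sep and dec can be tried without the original file. The sketch below is an assumption-laden stand-in: it passes two of the sample rows from the slide as an inline string through the text argument instead of reading AgregaUF20110930_b.txt.

```r
# Two rows of the slide's sample, typed inline (tab-separated, decimal comma)
txt = "uf\tano\tvalido\tsomaactc\tsomapd\nAC\t2009\t1\t34296430,67\t3630841,04\nAL\t2009\t1\t12650160,51\t8903714,41"

d = read.csv(text = txt,
             header = TRUE,  # first line holds the column names
             sep = "\t",     # separator is <tab>
             dec = ",")      # decimal mark is a comma

str(d)         # somaactc and somapd are read as numbers
sum(d$somapd)
```

Without dec = ",", the two value columns would be read as text, and sum() would fail.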
  • 37. A first look at the data:
        str(d)       # structure
        summary(d)   # statistical summary
        head(d)      # first rows
        tail(d)      # last rows
        plot(d)      # standard plot
  • 38. Reading from an SQL database:
        require(RODBC)
        canal <- odbcConnect("base_ODBC", case="tolower", uid="user", pwd="password")
        d <- sqlQuery(canal, "select * from table where year = 2010", as.is=T)
  • 39. How to get the sum of the values in a data.frame column?
        sum(data.frame$column)
        sum(d$somapd)
        [1] NA
  • 40. Special values:
    NA (Not Available): missing values.
    NaN (Not a Number): a value that cannot be represented as a number.
    Inf & -Inf: plus and minus infinity.
    Try:
        c(-1,0,1)/0
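The special values can be inspected with R's test functions. A short sketch, where w is a made-up vector with one missing value:

```r
v = c(-1, 0, 1) / 0
print(v)        # -Inf NaN Inf
is.nan(v)       # FALSE TRUE FALSE
is.na(v)        # FALSE TRUE FALSE: NaN also counts as missing
is.finite(v)    # FALSE FALSE FALSE

w = c(10, NA, 30)                  # hypothetical data
total = sum(w)                     # NA: one missing value spoils the sum
total.ok = sum(w, na.rm = TRUE)    # 40: missing values are dropped first
```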
  • 41. Ignoring missing values with na.rm:
    Sum:
        sum(d$somapd, na.rm=T)
        [1] 4836882446
    Mean:
        mean(d$somapd, na.rm=T)
    Median:
        median(d$somapd, na.rm=T)
    Standard deviation:
        sd(d$somapd, na.rm=T)
  • 42. For these examples:
        milsa = read.csv("milsaText.txt", sep="\t", head=T, dec=".")
  • 43. Absolute frequencies:
        table(milsa$civil)
    Relative frequencies:
        table(milsa$civil) / length(milsa$civil)
    or
        prop.table(table(milsa$civil))
    Pie chart:
        pie(table(milsa$civil))
  • 44. With attach(milsa):
    Absolute frequencies:
        table(civil)
    Relative frequencies:
        table(civil) / length(civil)
    or
        prop.table(table(civil))
    Pie chart:
        pie(table(civil))
    After: detach(milsa)
  • 45. Bar plot:
        barplot(table(instrucao))
    Remember: you may save any result as an object to use it later.
        instrucao.tb = table(instrucao)
        barplot(instrucao.tb)
        pie(instrucao.tb)
  • 46. Try:
        prop.table(filhos)
    Solution:
        prop.table(table(filhos))
    Other solution: filter out elements with NA.
  • 47. Summary statistics:
        mean(filhos, na.rm=T)
        median(filhos, na.rm=T)
        range(filhos, na.rm=T)
        var(filhos, na.rm=T)   # variance
        sd(filhos, na.rm=T)    # standard deviation
    Quantiles:
        filhos.quartis = quantile(filhos, na.rm=T)
    Interquartile range (3rd quartile minus 1st quartile):
        filhos.quartis[4] - filhos.quartis[2]
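By default quantile() returns a named vector of five values: minimum, the three quartiles, and maximum. The sketch below uses a made-up filhos vector (the real milsa data is not reproduced here), picks the quartiles by name, and compares the result with the built-in IQR() function.

```r
filhos = c(0, 1, 2, 2, 3, NA, 5)    # hypothetical values, one missing

q = quantile(filhos, na.rm = TRUE)
names(q)                            # "0%" "25%" "50%" "75%" "100%"

iqr = q[["75%"]] - q[["25%"]]       # 3rd quartile minus 1st quartile
iqr.builtin = IQR(filhos, na.rm = TRUE)
all.equal(iqr, iqr.builtin)         # TRUE
```

Selecting by name rather than by position makes it harder to subtract the wrong pair of elements.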
  • 48. Some plots:
        plot(milsa)
        plot(salario ~ ano)
        hist(salario)
        boxplot(salario)
        stem(salario)
  • 49. Selecting some rows:
        milsaNovo = milsa[c(1,3,5,6) , ]
    Selecting some columns:
        milsaNovo = milsa[ , c(1,3,5)]
        milsaNovo = milsa[ , c("funcionario", "instrucao", "salario")]
    Attention:
    New copy:
        milsaNovo = milsa[c(1,3,5,6) , ]
    Replaces the previous object:
        milsa = milsa[c(1,3,5,6) , ]
  • 50. Who earns above the median?
        acimamediana = milsa[ salario > median(salario), ]
    Who is married and has a higher education degree?
        casadoEsuperior = milsa[ civil=="casado" & instrucao == "Superior", ]
    AND: both must be true.
  • 51. Who is married or has a higher education degree?
        casadoOUsuperior = milsa[ civil=="casado" | instrucao == "Superior", ]
    OR: at least one must be true.
  • 52. NOT:
        milsaLimpo = milsa[!is.na(salario), ]
    In English:
    New table: milsaLimpo
    equals: =
    Old table: milsa
    Select: [
    Rows where salary is not NA: !is.na(salario)
    And all columns: , ]
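The three logical operators can be tried on a miniature table. This is a sketch under stated assumptions: the four rows below are invented, and columns are written as milsa$column so the example runs without attach().

```r
# Hypothetical miniature of milsa
milsa = data.frame(
  civil     = c("casado", "solteiro", "casado", "solteiro"),
  instrucao = c("Superior", "Superior", "1oGrau", "2oGrau"),
  salario   = c(17.2, NA, 7.0, 9.5)
)

# AND: both conditions must be TRUE (1 row)
casadoEsuperior = milsa[milsa$civil == "casado" & milsa$instrucao == "Superior", ]

# OR: at least one condition must be TRUE (3 rows)
casadoOUsuperior = milsa[milsa$civil == "casado" | milsa$instrucao == "Superior", ]

# NOT: keep the rows where salario is not missing (3 rows)
milsaLimpo = milsa[!is.na(milsa$salario), ]

nrow(casadoEsuperior)
nrow(casadoOUsuperior)
nrow(milsaLimpo)
```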
  • 53. How many are married?
        sum(civil=="casado")
    or
        table(civil)["casado"]
    How many are married and have a higher education degree?
        sum(civil=="casado" & instrucao == "Superior")
    or
        table(civil,instrucao)["casado","Superior"]
  • 54. milsaNovo is equal to milsa, without rows 1, 2 & 5 and without columns 1 & 8:
        milsaNovo = milsa[-c(1,2,5), -c(1,8)]
  • 55. which(): the rows where a condition is TRUE.
        sup = which(instrucao=="Superior")
        [1] 19 24 31 33 34 36
    May use it again later:
        mean(milsa[sup,"salario"])    # mean salary for those with higher education
    Advantage: sup holds only row numbers; it is not a copy of the data.
  • 56. A random sample of 10 rows from milsa:
        amostra = sample(x=nrow(milsa), size=10)
        [1] 12 29 1 3 17 14 26 33 20 31
    Mean salary for the sample:
        mean(milsa[amostra,"salario"])
  • 57. Sorting.
    By number of children:
        milsa[order(filhos),]
    Decreasing:
        milsa[order(filhos, decreasing=T),]
    By number of children and then age:
        milsa[order(filhos,ano),]
    10 youngest:
        head(milsa[order(ano),], 10)
    10 oldest:
        tail(milsa[order(ano),], 10)
  • 58. Removing an object:
        rm(milsaNovo)
    Removing every object:
        rm(list = ls())
    ls(): list of current objects.
  • 59. List objects are collections that may include different types of objects.
        lis = list(A=1:10, B="Text", C = matrix(1:9,ncol=3))
    They are often used as parameters to functions or as result sets from them.
    lis[1:2]: a list with the first two objects from lis (A & B).
    lis[[1]]: the object stored at the first position of the list (the content of A). The same as lis$A.
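The difference between single and double brackets shows up when asking for the class of each result. A minimal sketch using the same list as the slide:

```r
lis = list(A = 1:10, B = "Text", C = matrix(1:9, ncol = 3))

names(lis)                    # "A" "B" "C"
class(lis[1:2])               # "list": single brackets return a shorter list
class(lis[[1]])               # "integer": double brackets return the content
identical(lis[[1]], lis$A)    # TRUE
length(lis[1:2])              # 2
```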
  • 60. Saving all objects:
        save.image("file.RData")
    Saving selected objects:
        save(x, y, file="file.RData")
    Loading:
        load("file.RData")
    After several loads, objects with distinct names are all kept in memory (an object with a repeated name is overwritten).
  • 61. Saving a ".R" script that reproduces the desired output. Advantages:
    It may be used to document the work performed;
    It may be run again over updated data to update the results.
    Hybrid model: save intermediate results that take a long time to process, and update them less often.
  • 62. Add a column to a data.frame:
        milsa$idade = milsa$ano + milsa$mes/12
  • 63-68. [Figures: data.frames X and Y; slide 63 is annotated 6+3+5=14]
  • 69. Only rows found in both data.frames:
        merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all=F)
    All rows from data.frame x:
        merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all.x=T)
  • 70. All rows from data.frame y:
        merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all.y=T)
    All rows from data.frames x & y:
        merge(x=milsa, y=tabInst, by.x="instrucao", by.y="desc", all=T)
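The four variants of merge() differ only in which unmatched rows they keep. The sketch below assumes two invented tables: a tiny milsa and a lookup table tabInst with a made-up column anosEstudo. Each table has one value the other lacks.

```r
# Hypothetical tables: "Mestrado" appears only in milsa, "Doutorado" only in tabInst
milsa = data.frame(funcionario = 1:4,
                   instrucao = c("1oGrau", "2oGrau", "Superior", "Mestrado"))
tabInst = data.frame(desc = c("1oGrau", "2oGrau", "Superior", "Doutorado"),
                     anosEstudo = c(8, 11, 15, 21))

both  = merge(x = milsa, y = tabInst, by.x = "instrucao", by.y = "desc", all = FALSE)
all.x = merge(x = milsa, y = tabInst, by.x = "instrucao", by.y = "desc", all.x = TRUE)
all.y = merge(x = milsa, y = tabInst, by.x = "instrucao", by.y = "desc", all.y = TRUE)
every = merge(x = milsa, y = tabInst, by.x = "instrucao", by.y = "desc", all = TRUE)

nrow(both)     # 3: only the matching rows
nrow(all.x)    # 4: "Mestrado" is kept, with NA in anosEstudo
nrow(all.y)    # 4: "Doutorado" is kept, with NA in funcionario
nrow(every)    # 5: all rows from both tables
```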
  • 71. From text to numeric:
        d.f$novaColuna = as.numeric(d.f$coluna)
    From numeric to text:
        d.f$novaColuna = as.character(d.f$coluna)
    From text or numeric to integer:
        d.f$novaColuna = as.integer(d.f$coluna)
    Integers save memory.
  • 72. Factors: a representation for categorical data.
    Nominal: "married", "single"
    Ordinal: "tall", "short"
    Factors save memory.
    They ensure proper treatment of these variables by many R functions.
  • 73. Nominal:
        milsa$fatorcivil = factor(milsa$civil, ordered=F)
        $ fatorcivil : Factor w/ 2 levels "casado","solteiro": 2 1 1 2 2 1 2 2 1 2
    Ordinal:
        milsa$fatormes = factor(milsa$mes, ordered=T)
        $ fatormes : Ord.factor w/ 12 levels "0"<"1"<"2"<"3"<..: 4 11 6 11 8 1 1 5 11 7 ...
    It is possible to define a custom order: ?factor
  • 74. From factor to text:
        d.f$novaColuna = as.character(d.f$colunaFator)
    From factor to numeric:
        d.f$novaColuna = as.numeric(as.character(d.f$colunaFator))
    The internal representation of a factor is different from its text description.
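The reason for going through as.character() becomes visible on a factor whose labels look like numbers. A small sketch with invented values:

```r
f = factor(c("10", "20", "10", "30"))    # hypothetical factor

levels(f)                              # "10" "20" "30"
codes = as.numeric(f)                  # 1 2 1 3: the internal level codes
values = as.numeric(as.character(f))   # 10 20 10 30: the values as displayed
```

Calling as.numeric() directly on the factor returns the position of each level, not the number shown on screen.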
  • 75. Using:
        m1 <- matrix(1:12, ncol = 3)
    Sum of columns (one value for each column):
        colSums(m1)
        [1] 10 26 42
    or
        apply(m1,2,sum)
        [1] 10 26 42
  • 76. Sum of rows (one value for each row):
        rowSums(m1)
        [1] 15 18 21 24
    or
        apply(m1,1,sum)
        [1] 15 18 21 24
    apply() may use any function, even your own.
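As an illustration of passing your own function to apply(), the sketch below defines a small helper named amplitude (a made-up name) that returns the difference between the largest and smallest value.

```r
m1 <- matrix(1:12, ncol = 3)

# Hypothetical helper: largest value minus smallest value
amplitude = function(v) max(v) - min(v)

by.col = apply(m1, 2, amplitude)    # one value per column: 3 3 3
by.row = apply(m1, 1, amplitude)    # one value per row: 8 8 8 8
print(by.col)
print(by.row)
```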
  • 77. Mean salary by education level:
        aggregate(salario ~ instrucao, data = milsa, mean)
          instrucao   salario
        1    1oGrau  7.836667
        2    2oGrau 11.528333
        3  Superior 16.475000
  • 78. Mean salary by education level and marital status:
        aggregate(salario ~ instrucao + civil, data = milsa, mean)
          instrucao    civil   salario
        1    1oGrau   casado  7.044000
        2    2oGrau   casado 12.825000
        3  Superior   casado 17.783333
        4    1oGrau solteiro  8.402857
        5    2oGrau solteiro  8.935000
        6  Superior solteiro 15.166667
  • 79. Linear regression:
        model = lm(formula = salario ~ ano + instrucao, data = milsa)
        summary(model)
    Just one line!!!
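The same pattern can be tried without the milsa file. The sketch below swaps in cars, a data set that ships with R (columns speed and dist), so the formula and the variables differ from the slide.

```r
# cars is a built-in data set, used here in place of milsa
model = lm(formula = dist ~ speed, data = cars)

coef(model)                   # intercept and slope
r2 = summary(model)$r.squared # share of variance explained by the model
```

summary(model) prints the full table of coefficients; coef() and summary(model)$r.squared extract individual pieces for later use.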
  • 80. Prof. Dr. Roberto Dantas de Pinho, roberto.pinho@mct.gov.br This presentation is based on courses by Dr. Paulo Justiniano Ribeiro Jr (UFPR) & Dr. Cosme Marcelo Furtado Passos da Silva (FIOCRUZ)