Data Analysis and Programming in R



  1. 1. Data Analysis and Programming in R • Eswar Sai Santosh Bandaru
  2. 2. R • What is R? • A programming language for statistical analysis and data mining • https://en.wikipedia.org/wiki/R_(programming_language) • Why R? • Effective data manipulation, storage, and graphical display • Free of cost and open source • Many packages contributed by experienced programmers and statisticians • https://cran.r-project.org/web/packages/available_packages_by_name.html • Simple and elegant code, easy to learn • Microsoft is integrating R into SQL Server • Problems: • Memory management: data sits in RAM • Speed • Many developments are under way to address these problems.
  4. 4. RStudio interface: Console • Run your code here
  5. 5. RStudio interface: Editor • Save and edit your code here
  6. 6. RStudio interface: Output • Plots and help
  7. 7. General things: • R is case sensitive • Shortcuts: • CTRL+ENTER (important): send code from the editor to the console and execute it • CTRL+2: move the cursor from the editor to the console • CTRL+1: move the cursor from the console to the editor • CTRL+UP in the console: retrieve previous commands • # (hash) is used for commenting code • CTRL+SHIFT+C: comment/uncomment a block of code
  8. 8. R as a calculator • + : Addition, e.g. 2+3 gives 5 • - : Subtraction, e.g. 4-5 gives -1 • * : Multiplication, e.g. 2*3 gives 6 • ^ or ** : Exponentiation, e.g. 2^3 or 2**3 gives 8 • / : Division, e.g. 17/3 gives 5.666667 • %% : Modulo (remainder), e.g. 17%%3 gives 2 • %/% : Integer division, e.g. 17%/%3 gives 5
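
The operators on this slide can be tried directly in the console. A minimal sketch; the variable names are only there for illustration and are not from the slides:

```r
# Arithmetic operators in R, each result stored under an illustrative name
sum_val   <- 2 + 3      # addition
diff_val  <- 4 - 5      # subtraction
prod_val  <- 2 * 3      # multiplication
pow_val   <- 2^3        # exponentiation (2**3 is equivalent)
quot_val  <- 17 / 3     # ordinary division
mod_val   <- 17 %% 3    # remainder after division
int_quot  <- 17 %/% 3   # integer (floor) division

print(c(sum_val, diff_val, prod_val, pow_val, quot_val, mod_val, int_quot))
```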
  9. 9. Assignments and expressions • "<-" is the assignment operator in R • a <- 3 assigns the value 3 to the variable a • Expressions • A combination of numbers, variables, and operators • E.g., 2+3*a/14 • Order of evaluation: brackets -> exponentiation -> unary minus -> multiplication/division (equal precedence, evaluated left to right) -> addition/subtraction • E.g., 7*9/13 gives 4.846154 • -2^0.5 gives -1.414214 (the exponent is applied before the minus sign) • (-2)^0.5 gives NaN • Q1
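
The effect of operator precedence is easiest to see by comparing an expression with and without brackets. A small sketch using the slide's own examples; the variable names are illustrative:

```r
a <- 3                      # assignment with <-

expr      <- 2 + 3 * a / 14 # 3 * a / 14 is evaluated first, then 2 is added
chained   <- 7 * 9 / 13     # * and / have equal precedence: (7 * 9) / 13
neg_root  <- -2^0.5         # parsed as -(2^0.5)
bracketed <- (-2)^0.5       # fractional power of a negative number: NaN

print(c(expr, chained, neg_root, bracketed))
```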
  10. 10. Data types • Numeric: real numbers, e.g., 1.24, -3.12, 1 • Integer: integer values, written with the suffix L, e.g., 24L • Character: e.g., 'a', "a", "Hello World!", "2" • Logical: Boolean type, TRUE (1) or FALSE (0), abbreviated T and F • Complex: a+bi, where a and b are real numbers • class() is used to check the class of a value • E.g., class(24) gives numeric • E.g., class(24L) gives integer
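
As a quick check, class() can be called on one value of each type. A minimal sketch; the vector name `types` is illustrative:

```r
# One example value per basic data type, checked with class()
types <- c(
  class(24),        # numeric
  class(24L),       # integer (note the L suffix)
  class("a"),       # character
  class(TRUE),      # logical
  class(2 + 3i)     # complex
)
print(types)
```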
  11. 11. Data structures • 4 main types: • Vectors • Matrices • Lists • Data frames • We will discuss vectors and data frames in today's session
  12. 12. Vectors: • A one-dimensional collection of objects of the same kind (same data type) • Vectors in R are similar to arrays in other programming languages • Notation: (1,2,3,4,5), where 1,2,3,4,5 are called elements • (1,2,3,4,5): numeric vector • ('a','b','c','d'): character vector • (T, F, T, T): logical vector • (1L,2L,3L): integer vector • (1,2,3,4,6): valid vector • (1,'a',3,'t'): invalid vector (but R doesn't throw an error, due to coercion)
  13. 13. Creating vectors • Basic ways: • Using c() • Using ":" • Using seq() • Using rep() • Using vector()
  14. 14. c(), the combine function • Syntax: • X <- c(1,2,4,78,90) creates a numeric vector X with elements 1,2,4,78,90 • Y <- c('a','b','c','d') creates a character vector Y with elements 'a','b','c','d' • Printing: • X # auto-printing • print(X) # explicit printing
  15. 15. Using ":" • x <- 20:50 • Creates an integer vector x with values from 20 to 50 in increments of 1 • Ending value > starting value: the default increment is +1 • y <- 50:20 • Creates an integer vector y with values from 50 down to 20 in increments of -1 • Ending value < starting value: the default increment is -1
  16. 16. seq() • X <- seq(2,50) • Creates a numeric vector from 2 to 50 with an increment of +1 • X <- seq(50,2) • Creates a numeric vector from 50 down to 2 with an increment of -1 • X <- seq(2,50,2) • Creates a numeric vector from 2 to 50 with an increment of +2: (2, 4, 6, 8, 10, …, 50) • The increment can also be negative if the starting element > the ending element • X <- seq('a','b',2) throws an error
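
The three creation methods covered so far, c(), ":" and seq(), can be compared side by side. A short sketch; the names `up`, `down`, `evens` and `desc` are illustrative, and the descending seq() call is an added example of a negative increment:

```r
X <- c(1, 2, 4, 78, 90)        # combine values into a numeric vector
Y <- c("a", "b", "c", "d")     # character vector

up    <- 20:50                 # 20, 21, ..., 50
down  <- 50:20                 # 50, 49, ..., 20
evens <- seq(2, 50, 2)         # 2, 4, ..., 50
desc  <- seq(50, 2, -2)        # 50, 48, ..., 2 (negative increment)

print(X)                       # explicit printing
evens                          # auto-printing at top level
```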
  17. 17. rep() • X <- rep(c(1,2,3), times=2) • Creates a numeric vector X: 1,2,3,1,2,3 • The vector gets repeated twice • rep(1:3, each=2) • Output: 1,1,2,2,3,3 • Each element in the vector gets repeated twice • rep(1:3, each=2, times=3) • Output: 1,1,2,2,3,3, 1,1,2,2,3,3, 1,1,2,2,3,3 • 2 steps • 1: each element gets repeated twice • 2: the resulting vector gets repeated three times • For other variations of rep, see ?rep
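
The difference between `times` and `each` shows up clearly when the results are printed together. A minimal sketch with illustrative variable names:

```r
whole_twice <- rep(c(1, 2, 3), times = 2)      # 1 2 3 1 2 3
each_twice  <- rep(1:3, each = 2)              # 1 1 2 2 3 3
both        <- rep(1:3, each = 2, times = 3)   # each_twice repeated 3 times

print(whole_twice)
print(each_twice)
print(both)
```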
  18. 18. Combining vectors • X <- c(1,2,3,4,5) • Y <- c(1,6,7,8) • Z <- c(X,Y) • Combines vectors X and Y and assigns the result to Z; output: 1,2,3,4,5,1,6,7,8 • Q1–Q8
  19. 19. vector() • X <- vector() creates an empty vector with the default data type, logical • X <- vector(…)
  20. 20. Subsetting vectors • X <- c('a', 'b', 'c', 'd', 'e', 'f') • Index: 1 2 3 4 5 6 • X[1]: 'a' • Unlike Python, Java, …, indexing starts from 1 in R
  21. 21. Subsetting vectors • X[5]: 'e'
  22. 22. Subsetting vectors • X[-1]: 'b' 'c' 'd' 'e' 'f' • Everything except the first element
  23. 23. Subsetting vectors • X[1:3]: 'a' 'b' 'c' • Prints the first three elements • Not the same as X[3:1], which gives 'c' 'b' 'a'
  24. 24. Subsetting vectors • X[-1:-2]: 'c' 'd' 'e' 'f', or equivalently X[-2:-1]: 'c' 'd' 'e' 'f'
  25. 25. Example • X[1:(length(X)-1)] • Prints every element except the last
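
The subsetting slides can be reproduced in one short script. A sketch using the six-letter vector from the slides; the result names are illustrative:

```r
X <- c("a", "b", "c", "d", "e", "f")

first        <- X[1]                  # indexing starts at 1
fifth        <- X[5]
no_first     <- X[-1]                 # negative index drops that element
first_three  <- X[1:3]
reversed     <- X[3:1]                # same elements, reverse order
drop_two     <- X[-1:-2]              # drops elements 1 and 2
all_but_last <- X[1:(length(X) - 1)]

print(all_but_last)
```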
  26. 26. Element-wise operations • (45, 20, 25, 3, 4) + (2, 6, 10, 1, 3) = (47, 26, 35, 4, 7)
  27. 27. Example: • x1 <- c(1,2,3), x2 <- c(6,7,8). What is x1+2*x2? • 2*(6,7,8) gives (12, 14, 16), since the 2 is recycled • (1,2,3) + (12,14,16) gives (13,16,19)
  28. 28. Recycling • 1:5 + 1 • Internally 1,2,3,4,5 + 1,1,1,1,1 (1 gets recycled 5 times to match the length of the longer vector, then the element-wise operation occurs) • 1:6 + c(1,2) • Internally 1,2,3,4,5,6 + 1,2,1,2,1,2 (c(1,2) gets recycled to match the length of the longer vector) • c(1,2,3,4,5,6,7) + c(1,2,3,4) (gives a warning, because 7 is not a multiple of 4) • Internally 1,2,3,4,5,6,7 + 1,2,3,4,1,2,3
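
Element-wise arithmetic and recycling can be verified with the slide's own numbers. A minimal sketch; the result names are illustrative, and suppressWarnings() is used here only so the uneven case runs quietly:

```r
x1 <- c(1, 2, 3)
x2 <- c(6, 7, 8)

combined  <- x1 + 2 * x2            # 2 is recycled, then added element-wise
plus_one  <- 1:5 + 1                # 1 is recycled five times
alternate <- 1:6 + c(1, 2)          # c(1, 2) is recycled three times

# Lengths 7 and 4: R still recycles, but normally warns that the longer
# length is not a multiple of the shorter one
uneven <- suppressWarnings(c(1, 2, 3, 4, 5, 6, 7) + c(1, 2, 3, 4))

print(combined)
print(uneven)
```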
  29. 29. Q12: Create vector q using element-wise operations
  30. 30. Subsetting a vector with a logical vector • Y <- c('a','b','c','d') • Y[c(T,T,F,T)] • 'a' 'b' 'd' (an element is selected if the corresponding value is TRUE, otherwise it is not) • Recycling • Y[c(T)] • The vector c(T) gets recycled until it matches the length of Y • Every element gets printed
  31. 31. Comparison operators • X <- c(1,2,3,4,5,6,7) • X > 4 (X greater than 4) • Outputs a logical vector with TRUE for values greater than 4 and FALSE for values less than or equal to 4 • Output: F,F,F,F,T,T,T • X[X>4] • Selects the elements of X that are greater than 4 • Output: 5,6,7
  32. 32. Conditional operators in R • x == y: checks for equality, outputs TRUE if equal, else FALSE • x != y: checks for inequality • x >= y: greater than or equal to • x <= y: less than or equal to • x < y: less than • x > y: greater than • Conditions can be combined using the & (and) and | (or) operators • Q13-Q16
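
Logical subsetting and comparison operators work together: a comparison produces a logical vector, which then selects elements. A short sketch; the combined conditions (`middle`, `edges`) are added examples, not from the slides:

```r
Y <- c("a", "b", "c", "d")
picked <- Y[c(TRUE, TRUE, FALSE, TRUE)]   # keeps positions marked TRUE
everything <- Y[c(TRUE)]                  # single TRUE is recycled

X <- c(1, 2, 3, 4, 5, 6, 7)
mask <- X > 4                             # logical vector, one value per element
big  <- X[mask]                           # elements greater than 4

middle <- X[X > 2 & X < 6]                # both conditions must hold
edges  <- X[X < 2 | X > 6]                # either condition may hold

print(mask)
print(big)
```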
  33. 33. Coercion • x <- c(1,2,'a',3) does not throw an error • The other elements in the vector get coerced to character • Output: '1','2','a','3' • Priority for coercion: character > numeric > logical • Logical values convert to 1 and 0 • Explicit coercion: • as.* functions • as.character(1:20) # e.g., customer IDs • x <- c('a','b','c','d') • as.numeric(x) produces NAs (with a warning) • Output: NA, NA, NA, NA
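
Both implicit and explicit coercion can be observed directly. A minimal sketch with illustrative names; suppressWarnings() is used only to keep the failed conversion quiet:

```r
mixed <- c(1, 2, "a", 3)            # numbers are coerced to character
with_logicals <- c(1, TRUE, FALSE)  # logicals are coerced to 1 and 0

ids <- as.character(1:3)            # explicit coercion, e.g. for ID columns

# Letters cannot be read as numbers, so every element becomes NA
letters_as_num <- suppressWarnings(as.numeric(c("a", "b", "c", "d")))

print(mixed)
print(letters_as_num)
```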
  34. 34. Some important functions • which(): gives the indices of the vector where the condition is satisfied • x <- c(10,2,4,5,0) • which(x>2) • Output: 1, 3, 4 • all(): returns TRUE if the condition is satisfied by all values in a vector • all(x>2): FALSE • any(): returns TRUE if the condition is satisfied by any value in a vector • any(x>2): TRUE
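
The three functions can be run on the slide's example vector. A small sketch; the result names are illustrative:

```r
x <- c(10, 2, 4, 5, 0)

idx     <- which(x > 2)   # positions where the condition holds
all_big <- all(x > 2)     # TRUE only if every element passes
any_big <- any(x > 2)     # TRUE if at least one element passes

print(idx)
print(c(all_big, any_big))
```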
  35. 35. Attributes • Attributes give additional information about the elements of a vector • E.g., names of elements, dimensions, levels • attributes(x): shows all the available attributes of x • If there are no attributes, R outputs NULL • We can assign attributes to a created vector • E.g., we can assign names to elements with the function names() • names(x) <- student_names • where student_names is a character vector containing the names of students
  36. 36. Subsetting using the names attribute • X['Cory'] prints the marks of Cory • Conceptually, R finds the index whose name is "Cory", as which() would, and then subsets by that index • X[c('Cory','James')] prints the marks of Cory and James • Q16
  37. 37. Updating a vector: what if Cory's marks get updated? • X[1] <- 35 • The element at index 1 gets updated to 35 • X[X<30 & X>25] <- 40 • All values greater than 25 and less than 30 get updated to 40 • X["Cory"] <- 67
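
Naming, name-based subsetting and the three kinds of update can be followed step by step. A sketch under assumptions: the marks and the third student name ("Maya") are made up for illustration; only Cory and James appear in the slides.

```r
X <- c(20, 28, 45)                         # hypothetical marks
names(X) <- c("Cory", "James", "Maya")     # "Maya" is an invented name

before <- X[c("Cory", "James")]            # subset by name

X[1] <- 35                                 # update by position
X[X < 30 & X > 25] <- 40                   # update by condition (note the single &)
X["Cory"] <- 67                            # update by name

print(attributes(X))                       # shows the names attribute
print(X)
```

The single `&` matters here: it compares element by element, whereas `&&` expects a single TRUE/FALSE on each side.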
  38. 38. is.na() and mean imputation • x <- c(1,2,4,NA,5,NA) • is.na(x): produces a logical vector, TRUE if the element is NA, else FALSE • Output: F F F T F T • How can the NAs be replaced with the mean value?
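
One possible answer to the question on this slide combines is.na() with logical subsetting. A minimal sketch, assuming the mean should be computed over the non-missing values only; the names `is_missing` and `x_mean` are illustrative:

```r
x <- c(1, 2, 4, NA, 5, NA)

is_missing <- is.na(x)              # TRUE where the value is NA
x_mean <- mean(x, na.rm = TRUE)     # mean of the observed values only

x[is_missing] <- x_mean             # overwrite each NA with that mean
print(x)
```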
  39. 39. Factors • factor() converts a vector into categorical data • X <- c(1,1,1,2,2,2,3,3,3) • sum(X): 18 • X <- factor(X) • sum(X): error • levels(X): the categories in X • Output: "1" "2" "3" • class(X) • Output: factor
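
The change in behaviour after converting to a factor can be demonstrated as follows. A short sketch; the factor is stored under a separate illustrative name `f` so both versions stay available, and try() is used only to capture the error instead of stopping the script:

```r
X <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
total <- sum(X)                       # works on the numeric vector

f <- factor(X)                        # now categorical
lv  <- levels(f)                      # the distinct categories, as strings
cls <- class(f)

# sum() is not meaningful for factors, so this call raises an error
sum_fails <- inherits(try(sum(f), silent = TRUE), "try-error")

print(lv)
print(sum_fails)
```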
  40. 40. table(): frequency table • Counts the number of times each element occurs in a vector • X <- c('a','a','a','b','b','c','c') • table(X): • a: 3 • b: 2 • c: 2 • Useful when plotting a barplot
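
A frequency table for the slide's example vector can be built and queried like this. A minimal sketch; `counts` is an illustrative name:

```r
X <- c("a", "a", "a", "b", "b", "c", "c")

counts <- table(X)        # one count per distinct value
print(counts)

a_count <- counts[["a"]]  # look up a single count by name
```

The resulting table could be passed to barplot() to draw the counts, which is the use the slide mentions.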
  41. 41. ls() and rm() • ls(): lists all the objects in the current R session (environment) • rm("d"): removes the object d • rm(list = ls()): removes all objects from the environment
  42. 42. Data frames: • Data frames are simply "tables" (rows and columns) • All values within a column must be of the same data type (hence all the vector operations are valid for each column) • Creation • X <- data.frame(data for column 1, data for column 2, …) • The columns get bound together side by side • Two-dimensional
  43. 43. Subsetting data frames…why? • Very useful for analyzing the data • As a data frame is two-dimensional, it has 2 indices: [row, column] • test[3,2]: refers to the element in the 3rd row, 2nd column • test[1:3,1:2]: first three rows, first two columns • Using column names • test$student_names: refers to the column student_names • It is a vector, so we can perform all vector operations on it • test["student_names"]: refers to the column student_names, returned as a one-column data frame • test["marks"]
  44. 44. Students with higher than average marks? • above_average <- (test$marks > mean(test$marks)) • test$student_names[above_average] • Two steps: • above_average is a logical vector • test$student_names[above_average] selects the students where the vector is TRUE
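
The data frame slides can be tied together in one example. A sketch under assumptions: the contents of `test` are invented for illustration (the slides do not show the actual data), and the names "Maya" and "Lee" are made up.

```r
# Hypothetical data; the column names follow the slides
test <- data.frame(
  student_names = c("Cory", "James", "Maya", "Lee"),
  marks = c(67, 40, 45, 80),
  stringsAsFactors = FALSE
)

cell    <- test[3, 2]        # 3rd row, 2nd column
block   <- test[1:3, 1:2]    # first three rows, first two columns
col_vec <- test$marks        # a plain vector
col_df  <- test["marks"]     # a one-column data frame

# Step 1: logical vector; step 2: use it to select names
above_average <- test$marks > mean(test$marks)
top_students  <- test$student_names[above_average]

print(top_students)
```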
  45. 45. Writing to a CSV file • write.csv(test, "test.csv") • The file gets saved to the working directory (folder) R is currently pointing to • To find the working directory, use getwd()
  46. 46. Reading a CSV file • setwd("directory path") • read.csv("file name") • Different functions read different file types • dir(): lists all files in the current directory
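
Writing and reading back a CSV file can be tested as a round trip. A minimal sketch, assuming a small made-up data frame; it writes to a temporary folder via tempdir() rather than the working directory so that nothing in your project is overwritten, and `row.names = FALSE` is an added choice to avoid an extra index column:

```r
# Hypothetical data frame for the round trip
test <- data.frame(
  student_names = c("Cory", "James"),
  marks = c(67, 40),
  stringsAsFactors = FALSE
)

path <- file.path(tempdir(), "test.csv")    # temporary location
write.csv(test, path, row.names = FALSE)    # write without row numbers

loaded <- read.csv(path, stringsAsFactors = FALSE)

str(loaded)     # structure: column names and types
head(loaded)    # first rows
```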
  47. 47. Data inspection • str(): shows the structure of an object • head(): shows the first rows • tail(): shows the last rows
  48. 48. Dates and times in R • Dates are stored internally as the number of days since 1970-01-01, while times are stored internally as the number of seconds since 1970-01-01
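
The internal representation can be inspected by converting a date or time to a number. A short sketch; the specific dates are arbitrary examples, and the time zone is fixed to UTC here so the second count does not depend on the local machine:

```r
d <- as.Date("1970-01-10")
days_since_epoch <- as.numeric(d)        # days after 1970-01-01

tm <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
secs_since_epoch <- as.numeric(tm)       # seconds after 1970-01-01 00:00:00 UTC

print(days_since_epoch)
print(secs_since_epoch)
```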
  49. 49. Data visualization in R: using R base graphics • 3 plotting systems: • base graphics • ggplot2 • lattice • Plot types: • Boxplots • Barplots • Histograms • Scatter plots
