Successfully reported this slideshow.
Upcoming SlideShare
×

# R part I

467 views

Published on

Praxis Weekend Analytics

Published in: Education, Technology
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

### R part I

1. 1. Let’s start with R > typos = c(2,3,0,3,1,0,0,1) > typos [1] 2 3 0 3 1 0 0 1 > mean(typos) [1] 1.25 > median(typos) [1] 1 > var(typos) [1] 1.642857 • • • • “typos” represent number of typing errors on different pages Note that each command is stored in history You can use UP arrow key to retrieve your previous command You have started using built-in functions
2. 2. Let’s start with R > typos.draft1 = c(2,3,0,3,1,0,0,1) > typos.draft2 = c(0,3,0,3,1,0,0,1) > typos.draft1 [1] 2 3 0 3 1 0 0 1 > typos.draft2 [1] 0 3 0 3 1 0 0 1 • Note the two different object names for two drafts • Period has been used as punctuation in object names • Both the object names represent a vector
3. 3. Let’s start with R > typos.draft1 = c(2,3,0,3,1,0,0,1) > typos.draft2 = typos.draft1 # make a copy > typos.draft2[1] = 0 # assign the first page 0 typing error > typos.draft2 [1] 0 3 0 3 1 0 0 1 • Note how we have created the same typos.draft2 • “#” has been used for comments • ‘()’ are for functions and ‘*+’ are for vectors
4. 4. Now try and check …. > typos.draft2 # print out the value [1] 0 3 0 3 1 0 0 1 > typos.draft2[2] # print 2nd pages' value [1] 3 > typos.draft2[4] # 4th page [1] 3 > typos.draft2[-4] # all but the 4th page [1] 0 3 0 1 0 0 1 > typos.draft2[c(1,2,3)] # print values for 1st, 2nd and 3rd. [1] 0 3 0 • Note the output of the last command. This is called Slicing.
5. 5. Numeric Vector • Simplest data structure in R • To set up a numeric vector named x assign values : > x <- c(23.0,17.0,12.5,11.0,17.0,12.0,14.5,9.0,11.0) > x [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0 Or > assign ("x", c(23.0,17.0,12.5,11.0,17.0,12.0,14.5,9.0,11.0)) > x [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0
6. 6. Numeric Vector or > rm(x) > c(23.0,17.0,12.5,11.0,17.0,12.0,14.5,9.0,11.0) -> x > x [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0 Look at the next assignment > y <- c(x,0,1) > y [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0 0.0 1.0 A vector y has been created with a copy of x with a zero and one at the end.
7. 7. Character Vector A character vector is a set of text values > weekdays <- c("Sun","Mon","Tues","Wed","Thurs","Fri","Sat") > weekdays [1] "Sun" "Mon" "Tues" "Wed" "Thurs" "Fri" "Sat"
8. 8. Positive Index • A positive index can appended in square brackets to the name of a vector • It helps to select subsets of the elements of a vector > x[2] [1] 17 > x[1:9] [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 > x[3:7] [1] 12.5 11.0 17.0 12.0 14.5 > x[c(2,5,7)] [1] 17.0 17.0 14.5 9.0 11.0 • How do you find the number of elements in a vector? > X [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 >length(x) [1] 9 9.0 11.0
9. 9. Negative Index • A negative index specifies the element(s) to be excluded rather than included > y<-x[-2] #Include all but the second element > y [1] 23.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0 > x [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0 • How do you exclude more than one element? > X [1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0 > y<-x[-(2:4)] > y [1] 23.0 17.0 12.0 14.5 9.0 11.0 > y<-x[-(c((2:4),9))] #exclude 2nd to 4th, and 9th elements > y [1] 23.0 17.0 12.0 14.5 9.0
10. 10. Now try and check …. > typos.draft2 # show all the values [1] 0 3 0 3 1 0 0 1 > max(typos.draft2) # what are worst pages? [1] 3 > typos.draft2 == 3 # Where are they? [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE • Note the use of ‘==‘ for comparing • But how do we get the indices (pages) having 3 typos? > which(typos.draft2 == 3) [1] 2 4 • You only get the index of the elements
11. 11. Now try and check …. > n = length(typos.draft2) # how many pages > pages = 1:n # how we get the page numbers > pages # pages is simply 1 to number of pages [1] 1 2 3 4 5 6 7 8 > pages[typos.draft2 == 3] # logical extraction. Very useful [1] 2 4 The idea is to create a new vector 1, 2, 3, …. keeping track of page numbers and then slicing off ones for which typos.draft2===3
12. 12. Now try and check …. > sum(typos.draft2) # How many typos? [1] 8 > sum(typos.draft2>0) # How many pages with typos? [1] 4 > typos.draft1 - typos.draft2 # difference between the two [1] 2 0 0 0 0 0 0 0 Well Done … Great!!
13. 13. Now try and check …. Suppose the daily closing price of your favourite stock for two weeks is 45,43,46,48,51,46,50,47,46,45 How do you keep track of this? > x = c(45,43,46,48,51,46,50,47,46,45) > x [1] 45 43 46 48 51 46 50 47 46 45 > mean(x) # the mean [1] 46.7 > median(x) # the median [1] 46 > max(x) # the maximum or largest value [1] 51 > min(x) # the minimum value [1] 43 Hope you are enjoying many interesting functions ………
14. 14. Now try and check …. Let’s add the next two weeks worth of data to x. This was 48,49,51,50,49,41,40,38,35,40 > x = c(x,48,49,51,50,49) # > length(x) # how long is x [1] 15 > x[16] = 41 # add value to > x[17:20] = c(40,38,35,40) > x [1] 45 43 46 48 51 46 50 47 append values to x now (it was 10) a specified index which is 16 # add to many specified indices 46 45 48 49 51 50 49 41 40 38 35 40 We did three different things to add to a vector. • We used the c (combine) operator to combine the previous value of x with the next week's numbers. • We then assigned directly to the 16th index. • Finally, we assigned to a slice of indices.
15. 15. Now try and check …. Suppose we want a 5-day moving average > day<-5 > mean(x[day:(day+4)]) [1] 48 > day:(day+4) [1] 5 6 7 8 9 How do you get running maximum or minimum till date? > cummax(x) # running [1] 45 45 46 48 51 51 > cummin(x) # running [1] 45 43 43 43 43 43 maximum 51 51 51 51 51 51 51 51 51 51 51 51 51 51 minimum 43 43 43 43 43 43 43 43 43 41 40 38 35 35
16. 16. Self-test Suppose you keep track of your mileage each time you fill up. At your last 8 fill-ups the mileage was 65311 65624 65908 66219 66499 66821 67145 67447 Enter these numbers into R. Use the function ‘diff’ on the data. What does it give? Use the max function to find the maximum number of miles between fill-ups, the mean function to find the average number of miles and the min function to get the minimum number of miles.
17. 17. Self-test Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes 17 16 20 24 22 15 21 15 17 22 Enter this into R. Use the function max to find the longest commute time, the function mean to find the average and the function min to find the minimum. The 24 was a mistake. It should have been 18. How can you fix this? Do so, and then find the new average. How many times was your commute 20 minutes or more? To answer this you can try (if you call your numbers commutes) > sum( commutes >= 20) What do you get? What percent of your commutes are less than 17 minutes? How can you answer this with R?
18. 18. Categorical Data A survey asks people if they smoke or not. The data is Yes, No, No, Yes, Yes We can enter this into R with the c() command, and summarize with the table command as > x=c("Yes","No","No","Yes","Yes") > table(x) x No Yes 2 3 The table command simply adds up the frequency of each unique value of the data.
19. 19. Categorical Data : Factor Categorical data is often used to classify data into various levels or factors. To make a factor is easy with the command factor or as.factor. > x #Print the values in x [1] "Yes" "No" "No" "Yes" "Yes" > factor(x) # print out value in factor(x) [1] Yes No No Yes Yes Levels: No Yes Note that levels have been printed.
20. 20. Categorical Data and Bar Chart A bar chart draws a bar with a height proportional to the count in the table. The height could be given by the frequency, or the proportion. Suppose, a group of 25 people are surveyed as to their beerdrinking preference. The categories were (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) import. The raw data is 3411343313212123231111431 > beer = scan() 1: 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1 26: Read 25 items > barplot(beer) # this isn't correct
21. 21. Categorical Data and Bar Chart There are 25 categories in the Bar Chart. But how many do we need?
22. 22. Categorical Data and Bar Chart > table(beer) beer 1 2 3 4 10 4 8 3 > barplot(table(beer)) # Yes, call with summarized data There are 4 categories now, y-axis shows frequency
23. 23. Categorical Data and Bar Chart > barplot(table(beer)/length(beer)) # divide by n for proportion There are 4 categories now, y-axis shows proportion
24. 24. Categorical Data and Pie Charts > beer.counts = table(beer) # store the table result > pie(beer.counts) # first pie -- kind of dull
25. 25. Categorical Data and Pie Charts names(beer.counts) = c("domesticn can","Domesticn bottle", + "Microbrew","Import") # give names > pie(beer.counts) # prints out names
26. 26. Categorical Data and Pie Charts pie(beer.counts,col=c("purple","green2","cyan","white"))
27. 27. Stem and Leaf chart Suppose you have the box score of a basketball game and and the following points per game for players on both teams 2 3 16 23 14 12 4 13 2 0 0 0 6 28 31 14 4 8 2 5 Create a Stem and Leaf Chart > scores = scan() 1: 2 3 16 23 14 12 4 13 2 0 0 0 6 28 31 14 4 8 2 5 21: Read 20 items > stem(scores) The decimal point is 1 digit(s) to the right of the | 0 | 000222344568 1 | 23446 2 | 38 3 | 1
28. 28. Stem and Leaf chart Suppose you have the box score of a basketball game and and the following points per game for players on both teams 2 3 16 23 14 12 4 13 2 0 0 0 6 28 31 14 4 8 2 5 Create a Stem and Leaf Chart > stem(scores,scale=2) The decimal point is 1 digit(s) to the right of the | 0 0 1 1 2 2 3 | | | | | | | 000222344 568 2344 6 3 8 1
29. 29. Making numeric data categorical Suppose, CEO yearly compensations are sampled and the following are found (in millions). 12 0.4 5 2 50 8 3 1 4 0.25 And we want to break that data into the intervals [0; 1]; (1; 5]; (5; 50] and name the same. > sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25) # enter data > cats = cut(sals,breaks=c(0,1,5,max(sals))) # specify the breaks > cats # view the values [1] (5,50] (0,1] (1,5] (1,5] (5,50] (5,50] (1,5] (0,1] (1,5] Levels: (0,1] (1,5] (5,50] > levels(cats) = c("poor","rich","rolling in it") # change labels > table(cats) cats poor rich rolling in it 3 4 3 (0,1]