5. Inclusion & Exclusion
summary(data[ ,c(-1,-2,-3,-4,-5)])
Excludes specific variables (negative positions drop columns 1 to 5)
summary(data[,c("ba","ms")])
Includes specific variables (selected by column name)
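A minimal sketch of both selection styles on a small made-up data frame. The object `toy`, its values, and the `id` column are hypothetical; only the column names `ba` and `ms` come from the slides.

```r
# Hypothetical toy data; only the names ba and ms are taken from the slides
toy <- data.frame(id = 1:4,
                  Location = c("A", "B", "A", "B"),
                  ba = c(100, 200, 150, 250),
                  ms = c(10, 20, 15, 25))

# Exclusion: negative positions drop columns 1 and 2
excluded <- toy[, -(1:2)]

# Inclusion: a character vector keeps only the named columns
included <- toy[, c("ba", "ms")]

summary(included)
```

Both calls keep the same two columns here; selecting by name is usually safer because it does not depend on column order.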
www.Sanaitics.com
9. tapply Function
A<-tapply( data$ba,data$Location,mean, na.rm=TRUE)
A
Breaks up a vector into groups defined by a factor and applies a function to each group
B<- tapply( data$ms,list(data$Location,data$Grade),
mean,na.rm=TRUE); B
Mean value for ‘ms’ grouped by categorical variables Location and
Grade of the same data set
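To make the grouping concrete, here is a small sketch on made-up data. The data frame `toy` and its values are hypothetical; the column names follow the slides.

```r
# Hypothetical toy data with one missing value in ba
toy <- data.frame(Location = c("A", "B", "A", "B", "A"),
                  Grade = c("GR1", "GR1", "GR2", "GR2", "GR1"),
                  ba = c(100, 200, NA, 400, 300))

# One factor: a named vector with one mean per Location
by_loc <- tapply(toy$ba, toy$Location, mean, na.rm = TRUE)

# Two factors: a Location x Grade matrix of means
by_loc_grade <- tapply(toy$ba, list(toy$Location, toy$Grade),
                       mean, na.rm = TRUE)

by_loc
by_loc_grade
```

A cell whose values are all missing (Location A, Grade GR2 in this toy data) comes back as NaN, because na.rm=TRUE leaves nothing to average.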
10. Using Aggregation Function
• Single variable, single factor, single function
• Single variable, single factor, multiple functions
• Multiple variables, single factor, multiple functions
• Single variable, multiple factors, multiple functions
• Multiple variables, multiple factors, multiple functions
11. Single Variable, Single Factor, Single Function
A<-aggregate(ba ~ Location,data=data, FUN = mean )
A
To calculate mean for variable ‘ba’ by Location variable
With the formula interface, aggregate ignores missing values by default (na.action=na.omit drops rows containing NA)
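The difference in missing-value handling can be seen on a small made-up example. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy data: group A has one missing ba value
toy <- data.frame(Location = c("A", "A", "B", "B"),
                  ba = c(100, NA, 200, 400))

# Formula interface: the row with NA is dropped before mean() is called
agg <- aggregate(ba ~ Location, data = toy, FUN = mean)

# tapply without na.rm passes the NA through, so group A's mean is NA
tap <- tapply(toy$ba, toy$Location, mean)

agg
tap
```

Note that with several variables on the left-hand side, such as cbind(ba,ms), a row that is missing in either variable is dropped for both.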
12. Single Variable, Single Factor, Multiple Functions
f<-function(x) c( mean=mean(x), median=median(x),
sd=sd(x))
B<-aggregate(ba ~ Location,data=data, FUN = f ); B
To calculate mean, median & S.D for variable ‘ba’ by Location
variable
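When FUN returns several values, the result stores them in a single matrix column rather than in separate columns. The sketch below, on hypothetical toy data, shows that structure and one common way to flatten it; the name `flat` is illustrative.

```r
# Hypothetical toy data
toy <- data.frame(Location = c("A", "A", "B", "B"),
                  ba = c(100, 300, 200, 400))

f <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))
B <- aggregate(ba ~ Location, data = toy, FUN = f)

# B has two columns: Location and a matrix column ba
ncol(B)
is.matrix(B$ba)

# Expand the matrix column into ordinary columns
flat <- do.call(data.frame, B)
names(flat)
```

After flattening, each statistic can be addressed directly, for example flat$ba.mean.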
13. Multiple Variables, Single Factor, Multiple Functions
f<-function(x) c( mean=round(mean(x),0),
median=round(median(x),0), sd=round(sd(x),0))
C<-aggregate(cbind(ba,ms) ~ Location,data=data, FUN=f )
C
14. Single Variable, Multiple Factors, Multiple Functions
f<-function(x) c( mean=round(mean(x),0),
median=round(median(x),0), sd=round(sd(x),0))
D<-aggregate(ba ~ Location+Grade+Function,data=data,
FUN = f ); D
15. Multiple Variables, Multiple Factors, Multiple Functions
f<-function(x) c(mean=round(mean(x),0),
sd=round(sd(x),0))
E<-aggregate (cbind(ba,ms) ~ Location+Grade+Function,
data=data, FUN = f ); E
17. ddply Function
library(plyr) # ddply and summarize come from the plyr package
ddply(data, .(Location, Grade), summarize, avg.ba =
mean(ba,na.rm=TRUE), sd.ms = sd(ms,na.rm=TRUE),
max.ba = max(ba,na.rm=TRUE))
Summarize by combination of variables and factors
18. Do you know?
ddply can take as little as one tenth of the time that the
aggregate function needs to process the same data
Read more about our research on efficient processing in R at
www.Sanaitics.com/research-paper.html
21. Generating Frequency Tables
table1 <- table(data$Location,data$Grade,data$Function)
ftable(table1)
table() ignores missing values by default. To include NA as a category in
counts, pass the table option exclude=NULL
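A short sketch of the effect on a made-up vector with one missing value. The data frame `toy` is hypothetical; `useNA` is another argument of table() that gives the same result here.

```r
# Hypothetical toy data with a missing Location
toy <- data.frame(Location = c("A", "B", NA, "A"))

default_counts <- table(toy$Location)                   # NA is dropped
with_na        <- table(toy$Location, exclude = NULL)   # NA is counted
with_na2       <- table(toy$Location, useNA = "ifany")  # same idea

default_counts
with_na
```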
22. Generating Frequency Tables
table2 <- xtabs(~Location+Grade+Function,data=data)
ftable(table2)
Allows formula style input
26. Bar Plot- Median salary for two grades
A<-aggregate(ba ~ Grade,data=data, FUN = median )
barplot(A$ba, names.arg = A$Grade, col="pink", xlab = "GRADE",
ylab = "median_salary", main = "SALARY DATA OF EMPLOYEES")
[Figure: bar plot titled "SALARY DATA OF EMPLOYEES"; x-axis GRADE (GR1, GR2), y-axis median_salary (ticks at 0, 5000, 10000, 15000)]
27. Standard Box-Whiskers Plot
boxplot(data$ba,range=0)
• range determines how far the plot whiskers extend out from the box
• If range is positive, the whiskers extend to the most extreme data
point that is no more than range times the interquartile range
from the box
• A value of zero makes the whiskers extend to the data extremes,
so in this case no points are drawn as outliers
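The effect of range can be checked without drawing anything, since boxplot() returns its statistics when called with plot=FALSE. This is a sketch on a made-up vector; `x`, `std` and `mod` are illustrative names.

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

std <- boxplot(x, range = 0, plot = FALSE)  # whiskers reach the extremes
mod <- boxplot(x, plot = FALSE)             # default range = 1.5

std$stats[c(1, 5)]  # whisker ends with range = 0
std$out             # points flagged as outliers (none expected)
mod$stats[c(1, 5)]  # whisker ends with the default range
mod$out             # points flagged as outliers
```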
28. Modified Box-Whiskers Plot
• Constructed to highlight outliers where Standard Boxplot fails
• Default in R, requires no special parameters
• The "dots" beyond the whiskers represent outliers
• There are a number of different rules for determining if a point is
an outlier, but the method that R and ggplot use is the "1.5 rule".
If a data point is either:
1. less than Q1 - 1.5*IQR
2. greater than Q3 + 1.5*IQR
then that point is classed as an "outlier". Each whisker extends to
the most extreme data point that lies within the "1.5" cut-off.
Note: IQR = Q3 - Q1
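The rule can be worked through by hand on a made-up vector. One assumption to note: base R's boxplot takes Q1 and Q3 from Tukey's hinges (fivenum), which can differ slightly from quantile(); the variable names below are illustrative.

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

hinges <- fivenum(x)[c(2, 4)]     # lower and upper hinge (Q1, Q3)
iqr    <- diff(hinges)            # IQR = Q3 - Q1

lower <- hinges[1] - 1.5 * iqr    # lower cut-off
upper <- hinges[2] + 1.5 * iqr    # upper cut-off

outliers <- x[x < lower | x > upper]
whiskers <- range(x[x >= lower & x <= upper])

outliers
whiskers
boxplot.stats(x)$out              # R's own result, for comparison
```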
29. Box Plot – Single Variable
boxplot(data$ba,col="coral1",main="boxplot for variable ba",
ylab="basic allowance range", xlab="ba")
30. Box Plot – Single Variable, Two Factors
boxplot(data$ba~data$Location+data$Grade,
col=c("orange"),main="boxplot", ylab="basic allowance range",
xlab="ba" )