A statistical examination of ozone variation between days.
Please visit this link for the poster: https://www.slideshare.net/KalaivananMurthy/pairwise-comparison-of-daily-ozone-concentration-in-tampastpetersburg-region-a-research-poster
Pairwise Comparison of Daily Ozone Concentration in Tampa-St.Petersburg Region (abstract, R code)
1. Pairwise Comparison of Daily Ozone Concentration in Tampa-St.Petersburg Region
Author: Kalaivanan Murthy
ABSTRACT
Day-to-day variation in ozone concentration has become evident in recent years, and it is supported by the
observation that human activity, mainly automobile emissions, varies by day over the course of a week.
This variation is thought to be cyclic, meaning that the pattern repeats itself every week. This project
examines that claim using statistical methods.
Hourly ozone data were downloaded from the EPA for seven stations spread across the Tampa-St.Petersburg
region for three years (2014, 2015 and 2016). The data are classified by five factors: site location,
year, month, day, and hour. An F-test based ANOVA (Analysis of Variance) on these five factors was performed
on the data, and the resulting residuals deviate strongly from the normal distribution. This deviation is
confirmed by the Anderson-Darling test. Since the residuals are not normally distributed, the normality
assumption behind parametric ANOVA does not hold. Hence, non-parametric methods are used here.
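The residual check described in this paragraph can be sketched as follows. This is a minimal illustration on simulated data, not the project's dataset: the data frame `sim`, its columns, and the skewed (exponential-like) values are assumptions made for the example. It uses `shapiro.test` from base R in place of the Anderson-Darling test so that it runs without extra packages; with the `nortest` package installed, `nortest::ad.test(res)` could be called on the same residual vector.

```r
# Simulated, right-skewed "ozone" values for two hypothetical factors
set.seed(1)
sim <- data.frame(
  day   = factor(rep(c("Mon", "Tue", "Wed"), each = 200)),
  hour  = factor(rep(0:23, length.out = 600)),
  value = rgamma(600, shape = 1, scale = 0.02)  # assumed skewed distribution
)

# Fit an additive ANOVA model and extract its residuals
fit <- aov(value ~ day + hour, data = sim)
res <- residuals(fit)

# Shapiro-Wilk test (stand-in for Anderson-Darling):
# a small p-value means the residuals deviate from normality
sw <- shapiro.test(res)
normality.violated <- sw$p.value <= 0.05
print(normality.violated)
```

When the flag is TRUE, the F-test p-values from `summary(fit)` rest on an assumption that the data do not satisfy, which is the motivation for switching to rank-based tests.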
To perform the non-parametric comparison, the data are grouped by two factors: day and hour.
The Kruskal-Wallis test, a non-parametric counterpart of one-way ANOVA, is used to test the hypothesis that
ozone varies among days, and it supports the hypothesis at the 5% significance level. The hypothesis is also
supported by the Wilcoxon rank-sum test, the two-sample counterpart of the Kruskal-Wallis test, which is used for
identifying group effects.
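A minimal sketch of the two rank-based tests named in this paragraph, using base R only. The groups `A`, `B`, `C` and the simulated values are assumptions for illustration; they are not the ozone data.

```r
set.seed(2)
# Three hypothetical groups; group C is shifted upward by a constant
grp   <- factor(rep(c("A", "B", "C"), each = 100))
value <- c(rexp(100, rate = 50),
           rexp(100, rate = 50),
           rexp(100, rate = 50) + 0.05)

# Kruskal-Wallis: does at least one group differ in location?
kw <- kruskal.test(value ~ grp)

# Wilcoxon rank-sum: the two-sample case, here comparing groups A and C
wx <- wilcox.test(value[grp == "A"], value[grp == "C"])

# TRUE means the null hypothesis is rejected at the 5% level
print(c(kruskal = kw$p.value, wilcoxon = wx$p.value) <= 0.05)
```

The Kruskal-Wallis test answers whether any group differs, while the rank-sum test compares exactly two groups at a time.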
To perform the pairwise comparison, the data are averaged by pooling the hourly observations for each day
of the week and hour of the day across the three years. The Friedman test, a non-parametric counterpart of
two-way ANOVA, is used to test the hypothesis that
‘there exists a significant difference between the pair of days being compared’. The Friedman test is
chosen because it compares the ‘day’ factor while treating hour as a blocking factor, which removes the
interference of hourly variation within a day. The results show that some days differ; the pairs of days that
differ are reported in the poster.
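The blocking idea can be sketched with a small simulated example. The data frame `agg`, the sinusoidal diurnal curve, and the constant offset added to `day2` are all assumptions for illustration; only `friedman.test` and its `response ~ group | block` formula come from base R.

```r
set.seed(3)
hours <- 0:23

# Hypothetical hourly means: a shared diurnal curve, with day2 slightly higher
diurnal <- 0.03 + 0.01 * sin((hours - 9) * pi / 12)
agg <- data.frame(
  day   = factor(rep(c("day1", "day2"), each = 24)),
  hour  = factor(rep(hours, times = 2)),
  value = c(diurnal, diurnal + 0.002) + rnorm(48, sd = 0.0002)
)

# One value per day and hour: an unreplicated complete block design.
# Ranking happens within each hour, so the diurnal swing cancels out.
fr <- friedman.test(value ~ day | hour, data = agg)
print(fr$p.value <= 0.05)
```

The diurnal swing in this example is several times larger than the day offset, yet the test still detects the offset because each hour is ranked separately.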
2. Pairwise Comparison of Daily Ozone Concentration in Tampa-St.Petersburg Region
Programmed by Kalaivanan Murthy
Programming Language: R
SOURCE CODE
#READ DATA AND BUILD DATAFRAME
data.raw=read.csv("C:/Users/Kalaivanan Murthy/Documents/DATA/tampa/to_database.txt",header=T,
stringsAsFactors=F)
s=stringr::str_split_fixed(data.raw$datetime,"[T-]",3)
date.s=strptime(s[,1],"%Y%m%d")
data.xc=data.frame(siteid=factor(data.raw$site),date.s,
day=factor(weekdays(date.s)),
month=factor(format(date.s,"%B")),
year=factor(format(date.s,"%Y")),
hour=factor(s[,2]),value=data.raw$value)
attach(data.xc)
str(data.xc)
#FACTOR DATA: SORT FACTOR LEVELS
siteid=factor(siteid,levels=c("840120571065","840120571035","840120570081","840120573002",
"840121030004","840121030018","840121035002"),
labels=c("T-USMC","T-Davis","T-EGSPrk","T-Sydney",
"SP-SPJrCol","SP-AzaPrk","SP-JCPrk"))
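# Reorder the levels with factor(...,levels=): assigning to levels() would only
# rename the alphabetically sorted levels and mislabel the data.
month=factor(month,levels=c("January","February","March","April",
"May","June","July","August",
"September","October","November","December"))
day=factor(day,levels=c("Monday","Tuesday","Wednesday","Thursday",
"Friday","Saturday","Sunday"))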
#SUMMARY STATISTICS
summary(value)
unique(value)[which.max(tabulate(match(value, unique(value))))] #mode
range(value)
sd(value)
boxplot(value,main="Box-and-whisker",xlab="ozone (ppm)",
horizontal=T, varwidth=1, cex.axis=2.5, cex.lab=2.5)
#ANOVA: INDEPENDENT FACTORS
anova.xc=aov(value~siteid+year+month+day+hour)
summary(anova.xc)
#MEANS BY GROUP (SITE,YEAR,MONTH,DAY,HOUR)
means.func=function(factor,trimx=0.05,roundx=4) {
means.x=tapply(value,factor,mean,trim=trimx)
return(round(means.x,digits=roundx))
}
means.func(year)
means.func(month)
means.func(day)
means.func(hour,roundx=3)
#PLOTS BY GROUP (SITE,YEAR,MONTH,DAY,HOUR)
plot.func=function(factor) {
stripchart(value~factor,method="stack",vertical=TRUE,
pch=1,cex=0.0001,xlab=as.character(substitute(factor)),ylab="ozone(ppm)",
ylim=c(0,0.05),main="Ozone Trend (2014-2016)",col="gray")
title(sub="Pre-Analysis Plot",adj=0,cex=0.1)
points(c(1:length(levels(factor))),tapply(value,factor,mean,trim=0.05),col=2,pch=8)
for (i in 1:length(levels(factor))) {
avg=tapply(value,factor,mean,trim=0.05)[i]
stdev=tapply(value,factor,sd)[i]
arrows(i,avg-stdev,i,avg+stdev,length=0.05,angle=90,code=3,col=16,lwd=0.6)
}
#text(c(1:length(levels(factor))),tapply(value,factor,mean,trim=0.05),
# labels=round(tapply(value,factor,mean,trim=0.05),4),pos=1)
abline(h=mean(value,trim=0.05),lty=2,lwd=1.2,col=2)
legend("bottomright",inset=0.02,
c("ozone (ppm)",paste(as.character(substitute(factor)),"mean",sep=" "),"overall mean"),
col=c(152,2,2),text.col="black",lty=c(0,0,2),pch=c(1,8,NA),bg="gray99")
}
plot.func(siteid)
plot.func(year)
plot.func(month)
plot.func(day)
plot.func(hour)
#TEST FOR NORMALITY: ANDERSON-DARLING
normality.func=function(vector.sample) { # vector.sample is a vector of residuals
qqnorm(vector.sample,datax=TRUE,cex.lab=1.5)
qqline(vector.sample,datax=TRUE)
p.norm=nortest::ad.test(vector.sample)$p.value # run the test once and reuse the p-value
norm=ifelse(p.norm<=0.05,
"Ha:Normality Violated","Ho:Normality Not Rejected")
paste(norm," ","p-value=",p.norm)
}
normality.func(anova.xc$residuals)
#KRUSKAL-WALLIS TEST: NON-PARAMETRIC 1-WAY TEST
krtest.func=function(kvalue,kgroup) { # kvalue is a numeric vector of values
# kgroup is a group factor
kr=kruskal.test(kvalue~kgroup)
kr.out=ifelse(kr$p.value<=0.05,
"Ha:Groups significantly differ","Ho:Groups are alike")
paste(kr.out," p-value=",kr$p.value) # return the decision together with the p-value
}
krtest.func(value,day)
krtest.func(value,month)
#FRIEDMAN TEST: NON-PARAMETRIC 2-WAY TEST
agg.x=aggregate(value~day+hour,FUN=mean) # mean ozone for each day-of-week and hour
kalday.func=function(xday1,xday2) {
df.x=data.frame(agg.x[agg.x$day %in% c(xday1,xday2),])
df.x$day=factor(df.x$day) # drop the unused day levels so the design is complete
fr.x=friedman.test(value~day|hour,data=df.x)
fr.out=ifelse(fr.x$p.value<=0.05,
"Ha:Different","Ho:Same")
return(paste(fr.out," p-value=",round(fr.x$p.value,3)))
}
pair.func=function() {
for (i in levels(day)) {
for (j in levels(day)) {
if (i!=j) { # skip comparing a day with itself
print(paste(i,"-",j,kalday.func(i,j)))
}
}}
}
pair.func()
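As an optional variation on the pairwise loop, the results can be collected into a table with each unordered pair tested once. The sketch below runs on simulated data: the three-day subset, the `offset` values, and the data frame `agg` are assumptions for illustration. The Holm adjustment via `p.adjust` is an addition that the original code does not perform; it is shown only as one way to account for running many pairwise tests.

```r
set.seed(4)
days  <- c("Monday", "Tuesday", "Wednesday")   # hypothetical subset of days
hours <- 0:23
diurnal <- 0.03 + 0.01 * sin((hours - 9) * pi / 12)
offset  <- c(Monday = 0, Tuesday = 0, Wednesday = 0.002)  # assumed day effects

# One simulated mean per day and hour (hour varies fastest)
agg <- expand.grid(hour = factor(hours), day = factor(days, levels = days))
agg$value <- diurnal[as.integer(agg$hour)] +
  unname(offset[as.character(agg$day)]) +
  rnorm(nrow(agg), sd = 0.0002)

# Friedman test for each unordered pair of days, blocking on hour
pairs <- combn(days, 2)
p.raw <- apply(pairs, 2, function(p) {
  pair.df <- droplevels(agg[agg$day %in% p, ])  # keep only the two days
  friedman.test(value ~ day | hour, data = pair.df)$p.value
})

# Holm-adjusted p-values (not part of the original analysis)
p.holm <- p.adjust(p.raw, method = "holm")

result <- data.frame(day1 = pairs[1, ], day2 = pairs[2, ],
                     p.raw = round(p.raw, 4), p.holm = round(p.holm, 4))
print(result)
```

With seven days there are 21 unordered pairs, so a table of this form is easier to scan than printed lines.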