Analysis of Activity Monitoring Data
John Slough II
15 Jan 2015
Report
A histogram with density curve of the total number of steps taken per day is shown below.
# set the working directory to the project folder (author's local path)
setwd("~/Desktop/Git/Coursera-Reproducible-Research/Project1")
options(scipen=999)  # prefer fixed over scientific notation when printing
activity=read.csv("activity.csv",header=TRUE)
activity$date=as.Date(activity$date)
library(plyr)
# total steps per day; a day containing missing values gets an NA total
byDay=ddply(activity,"date",summarize, sum=sum(steps))
meanDay=round(mean(byDay$sum,na.rm=TRUE))
medianDay=median(byDay$sum,na.rm=TRUE)
library(ggplot2)
ggplot(byDay, aes(x=sum))+
  geom_histogram(aes(y=..density..),binwidth=1000,
                 colour="black", fill="lightblue") +
  geom_density(alpha=.2, fill="#FF6666")+
  labs(title="Histogram and Density of Average Number of Steps per Day")+
  labs(x="average number of steps per day",y="density")
## Warning: Removed 8 rows containing non-finite values (stat_density).
[Figure: Histogram and Density of Average Number of Steps per Day. x-axis: average number of steps per day; y-axis: density.]
The mean and median number of steps per day are 10766 and 10765 respectively.
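The daily-total computation above can be checked on a small example. The sketch below is only an illustration: it uses a made-up data frame (`toy`, hypothetical values, not the activity data) and base R's `tapply()` in place of `ddply()`. It shows how a day with missing values ends up with an NA total and is then skipped by `na.rm=TRUE`.

```r
# Toy data (hypothetical values): three days, two intervals each.
toy <- data.frame(
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 10, 30, 20, 60)
)

# Total steps per day; sum() without na.rm returns NA for the first day.
totals <- tapply(toy$steps, toy$date, sum)

# Mean and median of the daily totals, ignoring the NA day.
mean_day   <- mean(totals, na.rm = TRUE)
median_day <- median(totals, na.rm = TRUE)
print(totals)
```

With these toy values the two complete days have totals of 40 and 80, so the mean and the median are both 60.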
Average Daily Activity Pattern
The following plot is a time series of the average number of steps taken in each 5-minute interval, averaged
across all days.
byInt=ddply(activity,"interval",summarize, avg=mean(steps,na.rm=TRUE))
ggplot(byInt, aes(x = interval, y = avg, group = 1))+
  geom_line(colour="purple")+
  labs(title="Time Series of Average Number of Steps per 5-minute Interval")+
  labs(x="5 minute interval",y="average number of steps")
[Figure: Time Series of Average Number of Steps per 5-minute Interval. x-axis: 5 minute interval; y-axis: average number of steps.]
which.max(byInt[,2])
## [1] 104
maxInt=round(byInt[104,])
byIntSort=byInt[order(byInt[,2],decreasing=TRUE),]
The interval with the maximum average number of steps is interval 835, with an average of 206 steps.
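The lookup above hard-codes row 104 after reading it off the `which.max()` output. A minimal sketch of the same lookup without the hard-coded index follows, using a made-up table (`by_int`, hypothetical values). The last lines assume that the interval labels encode clock time as HHMM, which is consistent with row 104 (103 five-minute steps after midnight) carrying the label 835.

```r
# Toy interval averages (hypothetical values).
by_int <- data.frame(interval = c(0, 5, 10, 15),
                     avg      = c(1.5, 40.2, 12.0, 3.3))

# Row holding the largest average, found by position.
max_row <- by_int[which.max(by_int$avg), ]
max_row$interval  # 5

# Assumption: labels are HHMM, so 835 would read as 08:35.
label <- 835L
clock <- sprintf("%02d:%02d", label %/% 100L, label %% 100L)
clock  # "08:35"
```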
Imputing Missing Values
miss=sum(is.na(activity))
n=nrow(activity)
prop=round(sum(is.na(activity))/nrow(activity)*100,1)
The total number of missing values in the dataset is 2304, which corresponds to 13.1% of the observations.
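As a small illustration of the counting step, the sketch below uses a made-up data frame (`toy`, hypothetical values). It counts missing values per column, which shows where the gaps sit, and computes the percentage from the mean of a logical vector.

```r
# Toy data (hypothetical values): two of five step counts are missing.
toy <- data.frame(steps    = c(NA, 4, NA, 7, 0),
                  interval = c(0, 5, 10, 15, 20))

# Missing values per column.
na_per_col <- colSums(is.na(toy))

# Count and percentage of rows with a missing step count.
n_missing   <- sum(is.na(toy$steps))
pct_missing <- round(100 * mean(is.na(toy$steps)), 1)
print(na_per_col)
```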
To fill in the missing data, I replace each missing value with the mean of the corresponding 5-minute
interval. This is done with the packages ‘plyr’ and ‘Hmisc’.
library(Hmisc)
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
# create new dataset with imputed values
activity.imputed = ddply(activity, "interval", mutate,
                         imputed.steps = impute(steps, mean))
# sort by date (column 2), then keep imputed.steps, date and interval
act.imp.order=activity.imputed[order(activity.imputed[,2],decreasing=FALSE),]
activity.imp=act.imp.order[,c(4,2,3)]
# as.integer() drops the fractional part of the imputed means
activity.imp$imputed.steps=as.integer(activity.imp$imputed.steps)
detach("package:Hmisc")
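The same interval-mean imputation can be sketched in base R, which avoids attaching `Hmisc` and the name masking shown above. This is an alternative illustration on a made-up data frame (`toy`, hypothetical values), not the method used in the report: `ave()` returns the group mean for every row, and `ifelse()` fills only the gaps.

```r
# Toy data (hypothetical values): the first day is entirely missing.
toy <- data.frame(
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 10, 30, 20, 60)
)

# Mean of each interval across days, ignoring NAs, repeated for every row.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values and fill only the missing ones.
toy$imputed.steps <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
print(toy)
```

Here interval 0 has observed values 10 and 20, so its gap is filled with 15; interval 5 has 30 and 60, so its gap is filled with 45.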
A histogram with density curve of the new dataset with imputed missing values is shown below.
byDay.imp=ddply(activity.imp,"date",summarize, sum=sum(imputed.steps))
# theme_set() changes the default theme for this and all later plots;
# it is called on its own line rather than added to the plot with +
theme_set(theme_bw())
ggplot(byDay.imp, aes(x=sum)) +
  geom_histogram(aes(y=..density..),binwidth=1000,
                 colour="black", fill="lightblue") +
  geom_density(alpha=.2, fill="#FF6666")+
  labs(title="Histogram and Density of Average Number of Steps per Day\n(with Imputed Missing Values)")+
  labs(x="average number of steps per day",y="density")
[Figure: Histogram and Density of Average Number of Steps per Day (with Imputed Missing Values). x-axis: average number of steps per day; y-axis: density.]
mean.imp=round(mean(byDay.imp$sum,na.rm=TRUE))
median.imp=median(byDay.imp$sum,na.rm=TRUE)
The mean and median number of steps per day from the new dataset are 10750 and 10641, respectively.
These values differ from the estimates from the original dataset by 16 and 124 for the mean and median,
respectively. From the histogram, it appears that replacing the missing values with the mean of the corresponding 5-minute
interval has concentrated the total number of steps per day around 10000 steps.
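One plausible contributor to the lower mean is the `as.integer()` conversion applied to the imputed column: it truncates toward zero, so every imputed interval mean loses its fractional part before the daily totals are formed. The sketch below uses made-up interval means (hypothetical values, not taken from the dataset) to show the effect on the total of one imputed day; `round()` appears only as a hypothetical alternative for comparison.

```r
# Hypothetical interval means used to fill one fully missing day.
interval_means <- c(1.7, 0.3, 47.9, 12.4)

# Day total with the fractional values kept.
total_exact <- sum(interval_means)

# as.integer() drops the fractional part of each value (1, 0, 47, 12).
total_truncated <- sum(as.integer(interval_means))

# Rounding to the nearest integer instead (2, 0, 48, 12).
total_rounded <- sum(round(interval_means))

c(exact = total_exact, truncated = total_truncated, rounded = total_rounded)
```

In this toy case truncation gives a day total of 60 and rounding gives 62, against an exact total of about 62.3.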
Differences in activity patterns between weekdays and weekends
The following plot, based on a new factor variable that separates weekends from weekdays, shows the
different activity patterns for these two groups.
# create factor variable for Weekend vs. Weekday
activity.imp$wend = as.factor(ifelse(weekdays(activity.imp$date) %in%
                                       c("Saturday","Sunday"), "Weekend", "Weekday"))
activity.WE=subset(activity.imp,wend=="Weekend")
activity.WD=subset(activity.imp,wend=="Weekday")
byWE.imp=ddply(activity.WE,"interval",summarize, avg=mean(imputed.steps))
byWD.imp=ddply(activity.WD,"interval",summarize, avg=mean(imputed.steps))
library(gridExtra)
ggWE=ggplot(byWE.imp, aes(x = interval, y = avg, group = 1))+ylim(0,250)+
  geom_line(colour="blue")+
  labs(title="Average Number of Steps per 5-minute Intervals\nWeekends")+
  labs(x="5 minute interval",y="average number of steps")
ggWE=ggWE+theme(plot.margin=unit(c(0,1,0,1), "cm"))
ggWD=ggplot(byWD.imp, aes(x = interval, y = avg, group = 1))+ylim(0,250)+
  geom_line(colour="salmon")+
  labs(title="Average Number of Steps per 5-minute Intervals\nWeekdays")+
  labs(x="5 minute interval",y="average number of steps")
ggWD=ggWD+theme(plot.margin=unit(c(0,1,0,1), "cm"))
grid.arrange(ggWE, ggWD, nrow=2, ncol=1)
[Figure: two stacked panels, "Average Number of Steps per 5-minute Intervals, Weekends" (top) and "Average Number of Steps per 5-minute Intervals, Weekdays" (bottom). x-axis: 5 minute interval; y-axis: average number of steps.]
We can see that the weekday pattern has a higher spike in the number of steps; however, the number of steps
appears to be higher throughout the day on weekends.
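The weekday/weekend factor in this section depends on `weekdays()` returning English day names, which holds only in an English locale. A minimal locale-independent sketch follows, assuming a short made-up vector of dates (`dates` is hypothetical): it uses the numeric `wday` field of `POSIXlt`, where 0 is Sunday and 6 is Saturday.

```r
# Hypothetical dates: a Friday, Saturday, Sunday and Monday in October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday through 6 for Saturday, regardless of locale.
day_num <- as.POSIXlt(dates)$wday

wend <- factor(ifelse(day_num %in% c(0, 6), "Weekend", "Weekday"),
               levels = c("Weekday", "Weekend"))
print(data.frame(dates, wend))
```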