Peer Assessment 1
Grant Oliveira
March 14, 2015
This is my take on Peer Assessment 1 for the fifth course in the Coursera Data Science Specialization,
“Reproducible Research”. It involves a simple data analysis but is meant more to demonstrate familiarity
with a reproducible research workflow using R Markdown and the knitr package. The assignment specifies that
the code must be shown for each step, so I’ll begin by setting the global option to echo code.
echo=TRUE
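In an R Markdown document this option is usually set once in a setup chunk. A minimal sketch, assuming the knitr package is installed (this is knitr's standard chunk-option setter, not necessarily the exact line used in this report):

```r
# Show the source code of every chunk in the knitted output.
# echo = TRUE is also knitr's default, so this mainly makes the intent explicit.
knitr::opts_chunk$set(echo = TRUE)
```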
This analysis requires the following packages:
• dplyr
• ggplot2
• reshape2
Next we’ll load the data, which is a dataset containing the readout from wearable tech monitoring the number
of steps taken in five-minute intervals. It has three variables:
• steps: Number of steps taken in a 5-minute interval with missing values coded as NA
• date: The date on which the measurement was taken in YYYY-MM-DD format
• interval: Identifier for the 5-minute interval in which the measurement was taken.
It’s stored in a CSV file with 17,568 total observations. Let’s load that now:
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip", destfile = "./repda
unzip("./repdata-data-activity.zip")
data <- read.csv("./activity.csv", colClasses=c("integer","Date","numeric"))
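To see what `colClasses` does here, a small sketch that reads a few made-up rows from a string instead of the real file (the `csv_text` contents are hypothetical; only the column names match the dataset):

```r
# Toy stand-in for activity.csv (hypothetical rows, same three columns).
csv_text <- 'steps,date,interval
NA,2012-10-01,0
117,2012-10-02,5
9,2012-10-02,10'

# colClasses converts each column while reading, so no later
# as.Date() or as.integer() calls are needed.
toy <- read.csv(text = csv_text,
                colClasses = c("integer", "Date", "numeric"))

str(toy)               # steps: int, date: Date, interval: num
sum(is.na(toy$steps))  # the literal NA is read as a missing value
```

Parsing the date column as `Date` up front is what later lets `weekdays()` and ggplot2's date axis work without extra conversion.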
Section 1:
The first question on the assignment asks us to calculate the total number of steps taken per day and then
plot it into a histogram. I like using dplyr for this kind of stuff, so if you don’t have it installed go ahead and
do that. Thank me later. I like ggplot2 for plotting, but that comes down to preference.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
groupSteps <- group_by(data, date)
steps <- summarise(groupSteps,
total = sum(steps, na.rm = TRUE))
library(ggplot2)
ggplot(steps, aes(date, total)) + geom_bar(stat = "identity", colour = "black", fill = "black", width =
[Figure: Total Number of Steps Taken Each Day. Bar chart with Date (Oct 01 to Dec 01) on the x-axis and Steps (0 to 20,000) on the y-axis.]
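The chart above has one bar per date, so strictly speaking it is a bar chart of daily totals rather than a histogram. A histogram would instead bin the daily totals themselves. A minimal base R sketch on made-up data (the `toy` data frame and its values are hypothetical):

```r
# Hypothetical toy data: three days, three intervals each.
toy <- data.frame(
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3)),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(NA, NA, NA, 0, 117, 9, 20, 0, 35)
)

# Total steps per day; na.rm = TRUE turns an all-NA day into a total of 0,
# which is also what summarise(..., sum(steps, na.rm = TRUE)) does.
daily_total <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Bin the daily totals: the x-axis is steps per day, the y-axis is a count of days.
h <- hist(daily_total, breaks = 5,
          main = "Total Number of Steps Taken Each Day",
          xlab = "Steps per day", ylab = "Number of days")
```

Note that days with no readings at all land in the lowest bin as zeros, which pulls the mean of the daily totals down.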
The assignment then asks us to calculate and report the mean and median of the total number of steps taken
per day:
summary(steps$total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 6778 10400 9354 12810 21190
Swag.
Section 2:
Section number two asks us to make a time series plot of the 5-minute interval (x-axis) and the average
number of steps taken, averaged across all days (y-axis):
data2 <- data[complete.cases(data),]
groupSteps2 <- group_by(data2, interval)
steps2 <- summarise(groupSteps2,
avg = mean(steps))
ggplot(steps2, aes(interval, avg)) + geom_line(colour = "black", fill = "black", width = 0.7) + labs(ti
[Figure: Average Steps By Interval. Line chart with Interval (0 to 2000) on the x-axis and Steps (0 to 200) on the y-axis.]
Then it asks which interval has the highest average value:
steps2[steps2$avg == max(steps2$avg),]
## Source: local data frame [1 x 2]
##
## interval avg
## 1 835 206.1698
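An equivalent way to pick out the peak is `which.max()`, which avoids comparing floating-point averages with `==`. A sketch on hypothetical data (the `toy` values are made up):

```r
# Hypothetical toy data: two days, three intervals each.
toy <- data.frame(
  interval = rep(c(830, 835, 840), times = 2),
  steps    = c(10, 200, 50, 30, 220, NA)
)

# Mean steps per interval, ignoring missing values.
avg <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# which.max() returns the position of the first maximum;
# names() recovers the interval identifier at that position.
busiest <- names(avg)[which.max(avg)]
busiest
avg[busiest]
```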
Section 3:
Section 3 first asks what the total number of missing values is in the data set:
sum(is.na(data))
## [1] 2304
Then it asks us to fill in these missing observations with some kind of data; the mean or median will work.
data3 <- data
meanSteps <- mean(data$steps, na.rm = TRUE)
data3[is.na(data3)] <- meanSteps
Then it asks us to create a histogram of the total number of steps per day and then calculate the mean and median
total steps per day. We’ll just crib the code from the first section.
groupSteps3 <- group_by(data3, date)
steps <- summarise(groupSteps3,
total = sum(steps))
ggplot(steps, aes(date, total)) + geom_bar(stat = "identity", colour = "black", fill = "black", width =
[Figure: Total Number of Steps Taken Each Day. Bar chart with Date (Oct 01 to Dec 01) on the x-axis and Steps (0 to 20,000) on the y-axis.]
Then we have to calculate and report the mean and median of the daily totals. Easy enough.
mean(steps$total)
## [1] 10766.19
median(steps$total)
## [1] 10766.19
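One pitfall worth spelling out for this step: `mean(x, na.rm = TRUE)` and `mean(!is.na(x))` look similar but compute different things. A small sketch on a made-up vector (the values are hypothetical):

```r
# Hypothetical step counts with two missing readings.
steps <- c(0, 10, NA, 20, NA, 30)

# Mean of the observed values only: (0 + 10 + 20 + 30) / 4.
fill_value <- mean(steps, na.rm = TRUE)

# Replace each NA with that mean.
imputed <- steps
imputed[is.na(imputed)] <- fill_value

# By contrast, this is the share of readings that are NOT missing (4 of 6),
# because mean() of a logical vector is the proportion of TRUE values.
share_observed <- mean(!is.na(steps))
```

Filling with the mean of the observed values leaves the overall mean unchanged, which is a quick sanity check after imputing.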
Section 4:
Finally, the assignment asks if there are different activity levels on weekdays vs. weekends. First we have to
make a new factor variable denoting whether each day is a weekend or weekday. I used gsub for each day,
and it’s a little tedious so if you have a more elegant solution I’m open to suggestion!
data4 <- data
data4 <- mutate(data4, weekdays = weekdays(date))
data4[,4] <- gsub("Monday", "Weekday", data4[,4])
data4[,4] <- gsub("Tuesday", "Weekday", data4[,4])
data4[,4] <- gsub("Wednesday", "Weekday", data4[,4])
data4[,4] <- gsub("Thursday", "Weekday", data4[,4])
data4[,4] <- gsub("Friday", "Weekday", data4[,4])
data4[,4] <- gsub("Saturday", "Weekend", data4[,4])
data4[,4] <- gsub("Sunday", "Weekend", data4[,4])
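On the request for something more elegant: one option is to classify all dates in a single vectorised expression. A sketch in base R (the `dates` vector is made up; in the report it would be `data4$date`, and `day_type` is a hypothetical name):

```r
# Hypothetical dates: a Friday, a Saturday, a Sunday and a Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday and, unlike
# weekdays(), does not depend on the system locale.
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# A two-level factor, as the assignment asks for.
day_type <- factor(ifelse(is_weekend, "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))
day_type
```

This replaces the seven `gsub` calls with one test, and it produces a real factor rather than a character column.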
Then we make the graph in much the same way we made the one in section 2. I put both lines on the same
graph because I felt it’s easier to compare that way than doing it in panels like Dr. Peng did.
data4 <- data4[complete.cases(data4),]
weekday <- filter(data4, data4[,4] == "Weekday")
groupWeekday <- group_by(weekday, interval)
newWeekday <- summarise(groupWeekday,
avg = mean(steps))
weekend <- filter(data4, data4[,4] == "Weekend")
groupWeekend <- group_by(weekend, interval)
newWeekend <- summarise(groupWeekend,
avg = mean(steps))
library(reshape2)
total <- cbind(newWeekday, newWeekend[,2])
colnames(total) <- c("Interval", "Weekday Average", "Weekend Average")
total <- melt(total, id.vars = "Interval")
ggplot(total, aes(Interval, value), group = variable) + geom_line(aes(color=variable, width = 0.7)) + la
[Figure: Average Steps By Interval. Line chart with Interval (0 to 2000) on the x-axis and Steps (0 to 200) on the y-axis, with one line each for Weekday Average and Weekend Average.]
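The filter, summarise, cbind and melt steps can also be collapsed into one grouped aggregation that returns the data already in long form. A base R sketch on made-up data (the `toy` data frame and the `dayType` column name are hypothetical):

```r
# Hypothetical toy data with a day-type column already attached.
toy <- data.frame(
  interval = rep(c(0, 5), times = 4),
  dayType  = rep(c("Weekday", "Weekday", "Weekend", "Weekend"), each = 2),
  steps    = c(10, 20, 30, 40, 0, 8, 4, 12)
)

# One row per (interval, dayType) pair. The formula interface drops rows
# with NA steps by default, much like the complete.cases() filter.
avg_long <- aggregate(steps ~ interval + dayType, data = toy, FUN = mean)
avg_long
```

The result has one column for the grouping variable and one for the averages, so it can be passed straight to ggplot with the day type mapped to colour. The dplyr equivalent would be to group by both columns before calling `summarise`.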
That should be everything! Thanks for reading and good luck in the rest of the class!