Peer Assessment 1
Grant Oliveira
March 14, 2015
This is my take on Peer Assessment 1 for the sixth course in the Coursera Data Science Specialization,
“Reproducible Research”. It involves a simple data analysis, but is meant more to demonstrate familiarity
with a reproducible research workflow using R Markdown and the knitr package. The assignment specifies that
the code must be shown for each step, so I’ll begin by setting the global option to echo code.
knitr::opts_chunk$set(echo = TRUE)
This analysis requires the following packages:
• dplyr
• ggplot2
• reshape2
Next we’ll load the data, which is the readout from a wearable device recording the number of steps taken
in 5-minute intervals. It has three variables:
• steps: Number of steps taken in a 5-minute interval, with missing values coded as NA
• date: The date on which the measurement was taken, in YYYY-MM-DD format
• interval: Identifier for the 5-minute interval in which the measurement was taken
It’s stored in a CSV file with 17,568 total observations. Let’s load that now:
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",
destfile = "./repdata-data-activity.zip")
unzip("./repdata-data-activity.zip")
data <- read.csv("./activity.csv", colClasses = c("integer", "Date", "numeric"))
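If you re-knit this document repeatedly, a small guard avoids re-downloading and re-unzipping the archive every time. A sketch, assuming the same filenames as above:

```r
# Only fetch and unzip if the CSV isn't already on disk
if (!file.exists("./activity.csv")) {
  if (!file.exists("./repdata-data-activity.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",
                  destfile = "./repdata-data-activity.zip")
  }
  unzip("./repdata-data-activity.zip")
}
data <- read.csv("./activity.csv", colClasses = c("integer", "Date", "numeric"))
```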
Section 1:
The first question on the assignment asks us to calculate the total number of steps taken per day and then
plot it as a histogram. I like using dplyr for this kind of thing, so if you don’t have it installed, go ahead and
do that. Thank me later. I like ggplot2 for plotting, but that comes down to preference.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
groupSteps <- group_by(data, date)
steps <- summarise(groupSteps,
total = sum(steps, na.rm = TRUE))
library(ggplot2)
ggplot(steps, aes(date, total)) + geom_bar(stat = "identity", colour = "black", fill = "black", width = 0.7) + labs(title = "Total Number of Steps Taken Each Day", x = "Date", y = "Steps")
[Figure: bar chart "Total Number of Steps Taken Each Day"; x-axis Date (Oct 01 to Dec 01), y-axis Steps (0 to 20,000)]
The assignment then asks us to calculate and report the mean and median of the total number of steps taken
per day:
summary(steps$total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 6778 10400 9354 12810 21190
Swag.
Section 2:
Section number two asks us to make a time series plot of the 5-minute interval (x-axis) and the average
number of steps taken, averaged across all days (y-axis):
data2 <- data[complete.cases(data),]
groupSteps2 <- group_by(data2, interval)
steps2 <- summarise(groupSteps2,
avg = mean(steps))
ggplot(steps2, aes(interval, avg)) + geom_line(colour = "black") + labs(title = "Average Steps By Interval", x = "Interval", y = "Steps")
[Figure: line plot "Average Steps By Interval"; x-axis Interval (0 to 2000), y-axis Steps (0 to 200)]
Then it asks which interval has the highest average value:
steps2[steps2$avg == max(steps2$avg),]
## Source: local data frame [1 x 2]
##
## interval avg
## 1 835 206.1698
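Equivalently, which.max indexes the row directly and sidesteps exact floating-point comparison against the maximum:

```r
# Row of steps2 with the largest average step count
steps2[which.max(steps2$avg), ]
```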
Section 3:
Section 3 first asks for the total number of missing values in the dataset:
sum(is.na(data))
## [1] 2304
Then it asks us to fill these observations with some kind of data; the mean or the median will work.
data3 <- data
stepMean <- mean(data3$steps, na.rm = TRUE) # mean steps per interval, ignoring NAs
data3$steps[is.na(data3$steps)] <- stepMean
Then it asks us to create a histogram of the total number of steps per day, then calculate the mean and median
total steps per day. We’ll just crib the code from the first section.
groupSteps3 <- group_by(data3, date)
steps3 <- summarise(groupSteps3,
total = sum(steps))
ggplot(steps3, aes(date, total)) + geom_bar(stat = "identity", colour = "black", fill = "black", width = 0.7) + labs(title = "Total Number of Steps Taken Each Day", x = "Date", y = "Steps")
[Figure: bar chart "Total Number of Steps Taken Each Day"; x-axis Date (Oct 01 to Dec 01), y-axis Steps (0 to 20,000)]
Then we have to calculate and report the mean and median. Easy enough.
mean(data3$steps)
## [1] 37.3826
median(data3$steps)
## [1] 0
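Filling every gap with one global mean is the simplest option, but it flattens the daily activity profile. A common refinement, sketched here with dplyr (same column names as above, not run against this exact data frame), imputes each NA with the mean for its own 5-minute interval:

```r
library(dplyr)

# Replace each missing value with the average for that interval across all days
data3b <- data %>%
  group_by(interval) %>%
  mutate(steps = ifelse(is.na(steps), mean(steps, na.rm = TRUE), steps)) %>%
  ungroup()
```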
Section 4:
Finally, the assignment asks whether activity patterns differ between weekdays and weekends. First we have to
make a new factor variable denoting whether each day is a weekend or a weekday. I used gsub for each day,
which is a little tedious, so if you have a more elegant solution I’m open to suggestions!
data4 <- data
data4 <- mutate(data4, weekdays = weekdays(date))
data4[,4] <- gsub("Monday", "Weekday", data4[,4])
data4[,4] <- gsub("Tuesday", "Weekday", data4[,4])
data4[,4] <- gsub("Wednesday", "Weekday", data4[,4])
data4[,4] <- gsub("Thursday", "Weekday", data4[,4])
data4[,4] <- gsub("Friday", "Weekday", data4[,4])
data4[,4] <- gsub("Saturday", "Weekend", data4[,4])
data4[,4] <- gsub("Sunday", "Weekend", data4[,4])
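In fact, here is one tidier version of the same thing I’ve seen suggested: a single ifelse against a weekend lookup (a sketch, operating on the same data4 as above):

```r
# Same result as the seven gsub calls above, in one line
data4$weekdays <- ifelse(weekdays(data4$date) %in% c("Saturday", "Sunday"),
                         "Weekend", "Weekday")
```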
Then we make the graph in much the same way we made the one in section 2. I put both lines on the same
graph because I felt it’s easier to compare that way than doing it in panels like Dr. Peng did.
data4 <- data4[complete.cases(data4),]
weekday <- filter(data4, data4[,4] == "Weekday")
groupWeekday <- group_by(weekday, interval)
newWeekday <- summarise(groupWeekday,
avg = mean(steps))
weekend <- filter(data4, data4[,4] == "Weekend")
groupWeekend <- group_by(weekend, interval)
newWeekend <- summarise(groupWeekend,
avg = mean(steps))
library(reshape2)
total <- cbind(newWeekday, newWeekend[,2])
colnames(total) <- c("Interval", "Weekday Average", "Weekend Average")
total <- melt(total, id.vars = "Interval")
ggplot(total, aes(Interval, value, group = variable)) + geom_line(aes(colour = variable)) + labs(title = "Average Steps By Interval", x = "Interval", y = "Steps")
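If you do prefer Dr. Peng’s panel layout over overlaid lines, facet_wrap gets you there from the same melted data (a sketch):

```r
# One panel per series instead of two overlaid lines
ggplot(total, aes(Interval, value)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 1) +
  labs(title = "Average Steps By Interval", x = "Interval", y = "Steps")
```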
[Figure: line plot "Average Steps By Interval" with two series, Weekday Average and Weekend Average; x-axis Interval (0 to 2000), y-axis Steps (0 to 200)]
That should be everything! Thanks for reading and good luck in the rest of the class!