What is said about ...
Data scientists spend from 50 to 80 percent of their time wrangling big data.
Source: NY Times
The remaining time, as little as 20 percent, goes to plotting and fitting models, so we focus on the wrangling part.
Data Analysis using R February 26, 2016 2/ 43
HukamSingh Rana Hitesh Kumar Sharma Data Analysis using R February 26, 2016 3/ 43
Data Analysis
Data analysis is a process that applies statistical methods to data to gain knowledge and insight.
Steps in Data Analysis
1. Store the data
2. Transform the data
3. Visualization
4. Model fitting
Data wrangling packages in R
• tidyr - to make the data tidy
• plyr - split-apply-combine
• dplyr - the next iteration of plyr
• reshape2 - to reshape the data
Tidy Data
Hadley Wickham defines tidy data as follows:
Every value belongs to a variable and an observation.
Variables are in columns. Observations are in rows.
Tidy Data
> load("D:/new/table1.rdata")
> table1
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
cases is the number of people diagnosed with TB per country per year.
Question: calculate the rate of TB cases per country per year.
Solution
> rate <- table1$cases / table1$population
Tidy Data
> load("D:/new/table2.rdata")
> table2
country year key value
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Question: calculate the rate of TB cases per country per year.
Tidy Data
> load("D:/new/table3.rdata")
> table3
country year rate
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Question: calculate the rate of TB cases per country per year.
Tidy data
> load("D:/new/table4.rdata")
> load("D:/new/table5.rdata")
> table4
country 1999 2000
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
> table5
country 1999 2000
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
1. tidyr package
The tidyr package is written by Hadley Wickham.
Important functions of this package are:
spread(): transform a long table into a wide table
gather(): transform a wide table into a long table
separate(): break one column into multiple columns
unite(): unite multiple columns into one column
• spread() returns a copy of your data set with the key and value columns removed.
• In their place, spread() adds a new column for each unique value of the key column.
• These unique values form the names of the new columns.
• spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
spread()
spread() turns a pair of key:value columns into a set of tidy columns.
> library(tidyr)
> spread(table2, key, value)
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (int)      (int)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
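Once the key and value columns are spread, the question posed earlier for table2 reduces to the same division used for table1. The sketch below is one possible way to do it. It rebuilds table2 by hand from the printed values so that it runs on its own (the slides load it from an .rdata file instead), and `wide` is a name chosen here for illustration.

```r
library(tidyr)

# table2 rebuilt by hand from the printed values (an assumption made so the
# sketch is self-contained).
table2 <- data.frame(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 4),
  year    = rep(rep(c(1999, 2000), each = 2), times = 3),
  key     = rep(c("cases", "population"), times = 6),
  value   = c(745, 19987071, 2666, 20595360,
              37737, 172006362, 80488, 174504898,
              212258, 1272915272, 213766, 1280428583),
  stringsAsFactors = FALSE
)

# One row per country and year, with cases and population side by side.
wide <- spread(table2, key, value)

# The rate is now a plain column-wise division, as with table1.
wide$rate <- wide$cases / wide$population
wide
```

Note that spread() is kept in current tidyr releases but is marked as superseded by pivot_wider(); the call above follows the slides.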
gather()
gather() is the reverse of spread(): it collects a set of columns into a key column and a value column.
> library(tidyr)
> gather(table1, key = "KEYS", value = "DataValue", 3:4)
Source: local data frame [12 x 4]

       country  year       KEYS  DataValue
        (fctr) (int)      (chr)      (int)
1  Afghanistan  1999      cases        745
2  Afghanistan  2000      cases       2666
3       Brazil  1999      cases      37737
4       Brazil  2000      cases      80488
5        China  1999      cases     212258
6        China  2000      cases     213766
7  Afghanistan  1999 population   19987071
8  Afghanistan  2000 population   20595360
9       Brazil  1999 population  172006362
10      Brazil  2000 population  174504898
11       China  1999 population 1272915272
12       China  2000 population 1280428583
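gather() is most useful for a table such as table4, where the column names (1999, 2000) are really values of a variable. The following is a minimal sketch, assuming table4 has the three columns printed earlier. The data frame is rebuilt by hand, and `long` and `wide_again` are names chosen here. It also illustrates that spread() undoes gather().

```r
library(tidyr)

# table4 rebuilt by hand from the printed values: TB cases per country,
# one column per year. check.names = FALSE keeps "1999" and "2000" as names.
table4 <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collect columns 2 and 3 into a key column ("year") and a value column
# ("cases"), giving one row per country and year.
long <- gather(table4, key = "year", value = "cases", 2:3)

# Spreading the long table again recovers the original wide layout.
wide_again <- spread(long, year, cases)
```

The same call with value = "population" would reshape table5, after which both long tables share the country and year columns.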
separate()
separate() divides one column into multiple columns.
Usage: separate(data, columnToBeSeparated, into = names of the new columns, sep = regex of the separator)
By default, the column is split wherever a non-alphanumeric character appears.
> table3
Source: local data frame [6 x 3]

      country  year              rate
       (fctr) (int)             (chr)
1 Afghanistan  1999      745/19987071
2 Afghanistan  2000     2666/20595360
3      Brazil  1999   37737/172006362
4      Brazil  2000   80488/174504898
5       China  1999 212258/1272915272
6       China  2000 213766/1280428583
separate()
> separate(table3, rate, into = c("cases", "population"))
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (chr)      (chr)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
separate()
> separate(table3, rate, into = c("cases", "population"), sep = "/")
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (chr)      (chr)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
separate()
convert = TRUE converts the new columns to a suitable type, here integer.
> t <- separate(table3, rate, into = c("cases", "population"), sep = "/", convert = TRUE)
> t
Source: local data frame [6 x 4]

      country  year  cases population
       (fctr) (int)  (int)      (int)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
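With numeric cases and population columns, the question posed earlier for table3 can be answered directly. A minimal sketch, assuming table3 has the columns printed above; the data frame is rebuilt by hand and `parts` is a name chosen here:

```r
library(tidyr)

# table3 rebuilt by hand from the printed values (an assumption made so the
# sketch is self-contained).
table3 <- data.frame(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 2),
  year    = rep(c(1999, 2000), times = 3),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583"),
  stringsAsFactors = FALSE
)

# Split "cases/population" at the slash; convert = TRUE turns the pieces
# from character into numbers so they can be divided.
parts <- separate(table3, rate, into = c("cases", "population"),
                  sep = "/", convert = TRUE)

# The rate as an actual number instead of a string.
parts$rate <- parts$cases / parts$population
parts
```

Without convert = TRUE the two new columns stay character, and the division fails with a "non-numeric argument" error.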
separate()
You can also pass an integer or a vector of integers to sep. separate() interprets the integers as positions to split at. Positive values start at 1 at the far left of the strings; negative values start at -1 at the far right of the strings.
> t1 <- separate(t, year, into = c("centuary", "year"), sep = 2)
> t1
Source: local data frame [6 x 5]

      country centuary  year  cases population
       (fctr)    (chr) (chr)  (int)      (int)
1 Afghanistan       19    99    745   19987071
2 Afghanistan       20    00   2666   20595360
3      Brazil       19    99  37737  172006362
4      Brazil       20    00  80488  174504898
5       China       19    99 212258 1272915272
6       China       20    00 213766 1280428583
unite()
unite() is the reverse of separate(). Here we combine centuary and year into a new column, year1.
> ut <- unite(t1, "year1", centuary, year, sep = "")
> ut
Source: local data frame [6 x 4]

      country year1  cases population
       (fctr) (chr)  (int)      (int)
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3      Brazil  1999  37737  172006362
4      Brazil  2000  80488  174504898
5       China  1999 212258 1272915272
6       China  2000 213766 1280428583
unite()
unite() can also rebuild the rate column from cases and population.
> ut1 <- unite(t1, "rate", cases, population, sep = "/")
> ut1
Source: local data frame [6 x 4]

      country centuary  year              rate
       (fctr)    (chr) (chr)             (chr)
1 Afghanistan       19    99      745/19987071
2 Afghanistan       20    00     2666/20595360
3      Brazil       19    99   37737/172006362
4      Brazil       20    00   80488/174504898
5       China       19    99 212258/1272915272
6       China       20    00 213766/1280428583
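The statement that unite() reverses separate() can be checked with a small round trip. This is only an illustrative sketch: the two-row data frame `rates` and the names `pieces` and `joined` are made up here, using values from the tables above.

```r
library(tidyr)

rates <- data.frame(
  country = c("Afghanistan", "Brazil"),
  rate    = c("745/19987071", "37737/172006362"),
  stringsAsFactors = FALSE
)

# Split the rate column at the slash, then glue the pieces back together
# with the same separator.
pieces <- separate(rates, rate, into = c("cases", "population"), sep = "/")
joined <- unite(pieces, "rate", cases, population, sep = "/")

# TRUE when the round trip reproduces the original strings.
identical(joined$rate, rates$rate)
```

The round trip only works when sep in unite() matches the separator that was removed; sep = " / " would add spaces that the original strings do not have.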
help
For more information, type:
> ?spread
> ?gather
> ?separate
> ?unite
2. dplyr: a grammar of data manipulation
dplyr is the next iteration of plyr (a tool for manipulating all kinds of data structures).
It provides a flexible grammar of data manipulation and focuses on data frames (tables).
Grammar of dplyr
select(): return a subset of the columns of a data frame
filter(): extract a subset of rows from a data frame based on logical conditions
arrange(): reorder the rows of a data frame
rename(): rename variables in a data frame
mutate(): add new variables/columns or transform existing variables
summarise(): generate summary statistics of different variables in the data frame
%>%: the "pipe" operator connects multiple verb actions together into a pipeline
Common properties of dplyr functions
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator.
The result of a function is a new data frame.
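These three properties can be seen in a short sketch on the built-in airquality data set. The threshold of 90 and the names `warm` and `slim` are arbitrary choices made here for illustration.

```r
library(dplyr)

# First argument: the data frame. Columns are named directly, without $.
warm <- filter(airquality, Temp > 90)
slim <- select(warm, Ozone, Temp, Month)

# Each verb returned a new data frame; the input is left untouched.
dim(airquality)
dim(slim)
```

Because every verb takes a data frame first and returns a data frame, the output of one call can be passed straight into the next.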
Data set
How can month numbers be converted into month names?
> airq <- airquality
> for (i in 5:9) {
+   l <- airq$Month == i
+   airq$Month[l] <- month.abb[i]
+ }
> str(airq)
'data.frame': 153 obs. of 6 variables:
 $ Ozone  : int 41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int 67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : chr "May" "May" "May" "May" ...
 $ Day    : int 1 2 3 4 5 6 7 8 9 10 ...
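The loop can also be replaced by a single vectorized step. This is an alternative to the approach on the slide, not the authors' code, and it relies only on base R:

```r
airq <- airquality

# month.abb is a built-in character vector c("Jan", "Feb", ..., "Dec"),
# so indexing it with the month numbers converts every row at once.
airq$Month <- month.abb[airq$Month]

# Number of observations per month name.
table(airq$Month)
```

Indexing works here because the month numbers 5 to 9 are valid positions in month.abb.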
select()
select(): extract columns from a data frame
> library(dplyr)
> names(airq)
[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
> subset <- select(airq, Solar.R:Temp)
> head(subset)
  Solar.R Wind Temp
1     190  7.4   67
2     118  8.0   72
3     149 12.6   74
4     313 11.5   62
5      NA 14.3   56
6      NA 14.9   66
select()
> subset <- select(airq, starts_with("So"))
> subset[1:5, ]
[1] 190 118 149 313  NA
select()
> subset <- select(airq, ends_with("mp"))
> subset[1:5, ]
[1] 67 72 74 62 56
filter()
filter(): extract a subset of the rows of a data frame
> rsubset <- filter(airq, Temp > 75)
> rsubset[1:5, ]
  Ozone Solar.R Wind Temp Month Day
1    45     252 14.9   81   May  29
2   115     223  5.7   79   May  30
3    37     279  7.4   76   May  31
4    NA     286  8.6   78   Jun   1
5    NA     186  9.2   84   Jun   4
filter()
> rsubset <- filter(airq, Temp > 70, Temp < 80)
> rsubset[1:5, ]
  Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72   May   2
2    12     149 12.6   74   May   3
3     7      NA  6.9   74   May  11
4    11     320 16.6   73   May  22
5   115     223  5.7   79   May  30
filter()
> rsubset <- filter(airq, Temp > 70, Temp < 80, Month %in% c("May", "Aug"))
> rsubset[1:5, ]
  Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72   May   2
2    12     149 12.6   74   May   3
3     7      NA  6.9   74   May  11
4    11     320 16.6   73   May  22
5   115     223  5.7   79   May  30
> tail(rsubset)
   Ozone Solar.R Wind Temp Month Day
11    31     244 10.9   78   Aug  19
12    44     190 10.3   78   Aug  20
13    21     259 15.5   77   Aug  21
14     9      36 14.3   72   Aug  22
15    NA     255 12.6   75   Aug  23
arrange()
arrange(): reorder rows according to one of the variables
> asub<-arrange(airq,Temp)
> asub[1:10,]
Ozone Solar.R Wind Temp Month Day
1 NA NA 14.3 56 May 5
2 6 78 18.4 57 May 18
3 NA 66 16.6 57 May 25
4 NA NA 8.0 57 May 27
5 18 65 13.2 58 May 15
6 NA 266 14.9 58 May 26
7 19 99 13.8 59 May 8
8 1 8 9.7 59 May 21
9 8 19 20.1 61 May 9
10 4 25 9.7 61 May 23
arrange()
> asub<-arrange(airq,desc(Temp))
> asub[1:10,]
Ozone Solar.R Wind Temp Month Day
1 76 203 9.7 97 Aug 28
2 84 237 6.3 96 Aug 30
3 118 225 2.3 94 Aug 29
4 85 188 6.3 94 Aug 31
5 NA 259 10.9 93 Jun 11
6 73 183 2.8 93 Sep 3
7 91 189 4.6 93 Sep 4
8 NA 250 9.2 92 Jun 12
9 97 267 6.3 92 Jul 8
10 97 272 5.7 92 Jul 9
rename()
rename():rename a variable
> resub<-rename(airq,NewTemp=Temp)
> head(resub)
Ozone Solar.R Wind NewTemp Month Day
1 41 190 7.4 67 May 1
2 36 118 8.0 72 May 2
3 12 149 12.6 74 May 3
4 18 313 11.5 62 May 4
5 NA NA 14.3 56 May 5
6 28 NA 14.9 66 May 6
rename()
> resub <- rename(airq, NewTemp = Temp, "New Wind" = Wind)
> head(resub)
  Ozone Solar.R New Wind NewTemp Month Day
1    41     190      7.4      67   May   1
2    36     118      8.0      72   May   2
3    12     149     12.6      74   May   3
4    18     313     11.5      62   May   4
5    NA      NA     14.3      56   May   5
6    28      NA     14.9      66   May   6
mutate()
mutate():add a new column to data frame
> msub<-mutate(airq,mdTemp=Temp-mean(Temp,na.rm = T))
> head(msub)
Ozone Solar.R Wind Temp Month Day mdTemp
1 41 190 7.4 67 May 1 -10.882353
2 36 118 8.0 72 May 2 -5.882353
3 12 149 12.6 74 May 3 -3.882353
4 18 313 11.5 62 May 4 -15.882353
5 NA NA 14.3 56 May 5 -21.882353
6 28 NA 14.9 66 May 6 -11.882353
transmute()
transmute(): similar to mutate(), but drops all variables that are not transformed
> tmsub <- transmute(airq, mdTemp = Temp - mean(Temp, na.rm = T), WindSquare = Wind * Wind)
> head(tmsub)
      mdTemp WindSquare
1 -10.882353      54.76
2  -5.882353      64.00
3  -3.882353     158.76
4 -15.882353     132.25
5 -21.882353     204.49
6 -11.882353     222.01
group_by()
group_by(): group a data frame by one or more variables
> # scramble the rows
> sairq<-airq[sample(1:153,153),]
> sairq[1:10,]
Ozone Solar.R Wind Temp Month Day
6 28 NA 14.9 66 May 6
85 80 294 8.6 86 Jul 24
99 122 255 4.0 89 Aug 7
80 79 187 5.1 87 Jul 19
68 77 276 5.1 88 Jul 7
89 82 213 7.4 88 Jul 28
12 16 256 9.7 69 May 12
86 108 223 8.0 85 Jul 25
153 20 223 11.5 68 Sep 30
73 10 264 14.3 73 Jul 12
group_by()
> grsub <- group_by(sairq, Month)
> summarize(grsub, tmean = mean(Temp, na.rm = T),
+           max(Wind), minSolar = min(Solar.R, na.rm = T))
Source: local data frame [5 x 4]

  Month    tmean max(Wind) minSolar
  (chr)    (dbl)     (dbl)    (int)
1   Aug 83.96774      15.5       24
2   Jul 83.90323      14.9        7
3   Jun 79.10000      20.7       31
4   May 65.54839      20.1        8
5   Sep 76.90000      16.6       14
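The slides scramble the rows before grouping, which suggests that the grouped summary does not depend on row order. A small sketch that checks this on the built-in airquality data; the seed and the names `shuffled`, `by_month_original` and `by_month_shuffled` are choices made here for illustration:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the shuffle repeatable
shuffled <- airquality[sample(nrow(airquality)), ]

# The same grouped summary on the original and on the shuffled rows.
by_month_original <- summarise(group_by(airquality, Month),
                               tmean = mean(Temp, na.rm = TRUE))
by_month_shuffled <- summarise(group_by(shuffled, Month),
                               tmean = mean(Temp, na.rm = TRUE))

# TRUE when both summaries agree month by month.
all.equal(by_month_original$tmean, by_month_shuffled$tmean)
```

Here Month is still numeric (5 to 9), so the groups come out in calendar order rather than in the alphabetical order shown on the slide.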
%>% "pipe operator"
The pipe operator combines multiple functions in a sequence: the value on its left becomes the first argument of the call on its right.
> f(x) %>% f(y)
> # equivalent to
> f(f(x), y)
> # if we use the "." placeholder
> f(x) %>% f(y, .)
> # equivalent to
> f(y, f(x))
> # nested calls such as
> third(second(first(x)))
> # can be written as
> first(x) %>% second() %>% third()
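The equivalences above can be tried with ordinary functions. A minimal sketch using base R functions in place of the abstract f, first, second and third; the vector `x` and the result names are made up here:

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

x <- c(1, 4, 9)

# Nested form: read from the inside out.
nested <- round(sqrt(sum(x)), 2)

# Piped form: read from left to right, one step at a time.
piped <- x %>% sum() %>% sqrt() %>% round(2)

# With the "." placeholder the piped value fills another argument position,
# here the number of digits instead of the value to round.
placed <- 2 %>% round(3.14159, .)
```

Both forms compute the same value; the piped form lists the steps in the order in which they are executed.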
%>% "pipe operator"
> airq %>% group_by(Month) %>% summarise(mtemp = mean(Temp, na.rm = T),
+                                          Smean = mean(Solar.R, na.rm = T))
Source: local data frame [5 x 3]

  Month    mtemp    Smean
  (chr)    (dbl)    (dbl)
1   Aug 83.96774 171.8571
2   Jul 83.90323 216.4839
3   Jun 79.10000 190.1667
4   May 65.54839 181.2963
5   Sep 76.90000 167.4333

More Related Content

Similar to Data wrangling IN R LANGUAGE

Files to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdf
Files to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdfFiles to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdf
Files to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdf
mdameer02
 
Pricipal Component Analysis Using R
Pricipal Component Analysis Using RPricipal Component Analysis Using R
Pricipal Component Analysis Using R
Karthi Keyan
 
Chapter 8 making sense of sample data
Chapter 8 making sense of sample dataChapter 8 making sense of sample data
Chapter 8 making sense of sample data
bathabilev
 
Plan601 e session 1 demo
Plan601 e session 1 demoPlan601 e session 1 demo
Plan601 e session 1 demo
rkottam
 

Similar to Data wrangling IN R LANGUAGE (17)

Files to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdf
Files to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdfFiles to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdf
Files to Use httpwww.cse.msu.edu~cse231ProjectsProject05Red.pdf
 
Pricipal Component Analysis Using R
Pricipal Component Analysis Using RPricipal Component Analysis Using R
Pricipal Component Analysis Using R
 
Analysis of indian weather data sets using data mining techniques
Analysis of indian weather data sets using data mining techniquesAnalysis of indian weather data sets using data mining techniques
Analysis of indian weather data sets using data mining techniques
 
Electrocardiogram Beat Classification using Discrete Wavelet Transform, Highe...
Electrocardiogram Beat Classification using Discrete Wavelet Transform, Highe...Electrocardiogram Beat Classification using Discrete Wavelet Transform, Highe...
Electrocardiogram Beat Classification using Discrete Wavelet Transform, Highe...
 
Chapter 8 making sense of sample data
Chapter 8 making sense of sample dataChapter 8 making sense of sample data
Chapter 8 making sense of sample data
 
Chapter 8 making sense of sample data
Chapter 8 making sense of sample dataChapter 8 making sense of sample data
Chapter 8 making sense of sample data
 
Chapter 8 making sense of sample data
Chapter 8 making sense of sample dataChapter 8 making sense of sample data
Chapter 8 making sense of sample data
 
Making sense of sample data
Making sense of sample dataMaking sense of sample data
Making sense of sample data
 
Making Sense of Sample Data
Making Sense of Sample DataMaking Sense of Sample Data
Making Sense of Sample Data
 
To find raise to five of any number
To find raise to five of any numberTo find raise to five of any number
To find raise to five of any number
 
Plan601 e session 1 demo
Plan601 e session 1 demoPlan601 e session 1 demo
Plan601 e session 1 demo
 
ANALYSIS OF PRODUCTION PERFORMANCE OF TAMILNADU NEWSPRINT AND PAPERS LTD – C...
ANALYSIS OF PRODUCTION PERFORMANCE OF  TAMILNADU NEWSPRINT AND PAPERS LTD – C...ANALYSIS OF PRODUCTION PERFORMANCE OF  TAMILNADU NEWSPRINT AND PAPERS LTD – C...
ANALYSIS OF PRODUCTION PERFORMANCE OF TAMILNADU NEWSPRINT AND PAPERS LTD – C...
 
Record linkage methods applied to population data deduplication
Record linkage methods applied to population data deduplicationRecord linkage methods applied to population data deduplication
Record linkage methods applied to population data deduplication
 
Statistical package for the extension science research by vinay
Statistical package for the extension science research by vinayStatistical package for the extension science research by vinay
Statistical package for the extension science research by vinay
 
16 descriptive statistics
16 descriptive statistics16 descriptive statistics
16 descriptive statistics
 
Data Analysis & Visualization using MS. Excel
Data Analysis & Visualization using MS. ExcelData Analysis & Visualization using MS. Excel
Data Analysis & Visualization using MS. Excel
 
XBASS v2.1 data entry steps
 XBASS v2.1 data entry steps XBASS v2.1 data entry steps
XBASS v2.1 data entry steps
 

More from LOVELY PROFESSIONAL UNIVERSITY

More from LOVELY PROFESSIONAL UNIVERSITY (19)

Enumerations, structure and class IN SWIFT
Enumerations, structure and class IN SWIFTEnumerations, structure and class IN SWIFT
Enumerations, structure and class IN SWIFT
 
Dictionaries IN SWIFT
Dictionaries IN SWIFTDictionaries IN SWIFT
Dictionaries IN SWIFT
 
Control structures IN SWIFT
Control structures IN SWIFTControl structures IN SWIFT
Control structures IN SWIFT
 
Arrays and its properties IN SWIFT
Arrays and its properties IN SWIFTArrays and its properties IN SWIFT
Arrays and its properties IN SWIFT
 
Array and its functionsI SWIFT
Array and its functionsI SWIFTArray and its functionsI SWIFT
Array and its functionsI SWIFT
 
practice problems on array IN SWIFT
practice problems on array IN SWIFTpractice problems on array IN SWIFT
practice problems on array IN SWIFT
 
practice problems on array IN SWIFT
practice problems on array  IN SWIFTpractice problems on array  IN SWIFT
practice problems on array IN SWIFT
 
practice problems on array IN SWIFT
practice problems on array IN SWIFTpractice problems on array IN SWIFT
practice problems on array IN SWIFT
 
practice problems on functions IN SWIFT
practice problems on functions IN SWIFTpractice problems on functions IN SWIFT
practice problems on functions IN SWIFT
 
10. funtions and closures IN SWIFT PROGRAMMING
10. funtions and closures IN SWIFT PROGRAMMING10. funtions and closures IN SWIFT PROGRAMMING
10. funtions and closures IN SWIFT PROGRAMMING
 
Variables and data types IN SWIFT
 Variables and data types IN SWIFT Variables and data types IN SWIFT
Variables and data types IN SWIFT
 
Soft skills. pptx
Soft skills. pptxSoft skills. pptx
Soft skills. pptx
 
JAVA
JAVAJAVA
JAVA
 
Unit 5
Unit 5Unit 5
Unit 5
 
Unit 4
Unit 4Unit 4
Unit 4
 
Unit 3
Unit 3Unit 3
Unit 3
 
STRINGS IN JAVA
STRINGS IN JAVASTRINGS IN JAVA
STRINGS IN JAVA
 
Unit 1
Unit 1Unit 1
Unit 1
 
COMPLETE CORE JAVA
COMPLETE CORE JAVACOMPLETE CORE JAVA
COMPLETE CORE JAVA
 

Recently uploaded

➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 

Recently uploaded (20)

Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...

Data Wrangling in R Language

  • 1. What is said about ...
    Data scientists spend from 50 to 80 percent of their time wrangling big data. (Source: NY Times)
    The remaining 20 percent or so goes to plotting and fitting models, so this talk focuses on the wrangling that takes up most of the time.
    Data Analysis using R, February 26, 2016
  • 2-6. Data Analysis
    Data analysis is a process that applies statistical methods to data to gain knowledge and insight.
    Steps in data analysis:
    1. Store the data
    2. Transform the data
    3. Visualization
    4. Model fitting
  • 7. Data wrangling packages in R
    • tidyr: to make the data tidy
    • plyr: split-apply-combine
    • dplyr: the next iteration of plyr
    • reshape2: to reshape the data
  • 8. Tidy Data
    Hadley Wickham defines tidy data as follows:
    Every value belongs to a variable and an observation.
    Variables are in columns.
    Observations are in rows.
  • 9. Tidy Data
    > load("D:/new/table1.rdata")
    > table1
          country year  cases population
    1 Afghanistan 1999    745   19987071
    2 Afghanistan 2000   2666   20595360
    3      Brazil 1999  37737  172006362
    4      Brazil 2000  80488  174504898
    5       China 1999 212258 1272915272
    6       China 2000 213766 1280428583
    Here cases is the number of people diagnosed with TB per country per year.
    Question: calculate the rate of TB cases per country per year.
  • 10. Solution
    > rate <- table1$cases / table1$population
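As a minimal sketch of the solution above, the block below rebuilds table1 by hand from the values printed on the slide (the original .rdata file is not available, so the data frame construction is an assumption) and computes the rate with one vectorised division. The `rate_per_10k` column is an illustrative addition, not part of the slides.

```r
# table1 rebuilt by hand from the slide so the example runs without the .rdata file
table1 <- data.frame(
  country    = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year       = c(1999L, 2000L, 1999L, 2000L, 1999L, 2000L),
  cases      = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583)
)

# Each variable has its own column, so the rate is a single element-wise division.
table1$rate <- table1$cases / table1$population

# Hypothetical extra column: cases per 10,000 people is easier to read than a raw proportion.
table1$rate_per_10k <- table1$rate * 10000

print(table1)
```

Because the table is tidy, no reshaping is needed before the calculation; the other layouts in the following slides need a tidyr step first.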
  • 11-12. Tidy Data
    > load("D:/new/table2.rdata")
    > table2
           country year        key      value
    1  Afghanistan 1999      cases        745
    2  Afghanistan 1999 population   19987071
    3  Afghanistan 2000      cases       2666
    4  Afghanistan 2000 population   20595360
    5       Brazil 1999      cases      37737
    6       Brazil 1999 population  172006362
    7       Brazil 2000      cases      80488
    8       Brazil 2000 population  174504898
    9        China 1999      cases     212258
    10       China 1999 population 1272915272
    11       China 2000      cases     213766
    12       China 2000 population 1280428583
    Question: calculate the rate of TB cases per country per year.
  • 13-14. Tidy Data
    > load("D:/new/table3.rdata")
    > table3
          country year              rate
    1 Afghanistan 1999      745/19987071
    2 Afghanistan 2000     2666/20595360
    3      Brazil 1999   37737/172006362
    4      Brazil 2000   80488/174504898
    5       China 1999 212258/1272915272
    6       China 2000 213766/1280428583
    Question: calculate the rate of TB cases per country per year.
  • 15. Tidy Data
    > load("D:/new/table4.rdata")
    > load("D:/new/table5.rdata")
    > table4
          country   1999   2000
    1 Afghanistan    745   2666
    2      Brazil  37737  80488
    3       China 212258 213766
    > table5
          country       1999       2000
    1 Afghanistan   19987071   20595360
    2      Brazil  172006362  174504898
    3       China 1272915272 1280428583
  • 16. 1. tidyr package
    The tidyr package is written by Hadley Wickham. Its important functions are:
    spread(): transform a long table into a wide table
    gather(): transform a wide table into a long table
    separate(): break one column into multiple columns
    unite(): unite multiple columns into one column
  • 17. spread()
    • spread() returns a copy of your data set that has had the key and value columns removed.
    • In their place, spread() adds a new column for each unique value of the key column.
    • These unique values form the column names of the new columns.
    • spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
  • 18-19. spread()
    spread() turns a pair of key:value columns into a set of tidy columns.
    > table2
           country year        key      value
    1  Afghanistan 1999      cases        745
    2  Afghanistan 1999 population   19987071
    3  Afghanistan 2000      cases       2666
    4  Afghanistan 2000 population   20595360
    5       Brazil 1999      cases      37737
    6       Brazil 1999 population  172006362
    7       Brazil 2000      cases      80488
    8       Brazil 2000 population  174504898
    9        China 1999      cases     212258
    10       China 1999 population 1272915272
    11       China 2000      cases     213766
    12       China 2000 population 1280428583
    > library(tidyr)
    > spread(table2, key, value)
    Source: local data frame [6 x 4]

          country  year  cases population
           (fctr) (int)  (int)      (int)
    1 Afghanistan  1999    745   19987071
    2 Afghanistan  2000   2666   20595360
    3      Brazil  1999  37737  172006362
    4      Brazil  2000  80488  174504898
    5       China  1999 212258 1272915272
    6       China  2000 213766 1280428583
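The spread() call above can be sketched end to end as follows. The table2 data frame is rebuilt by hand from the slide values, which is an assumption since the .rdata file is not available. The block also shows pivot_wider(), which newer tidyr releases (1.0.0 and later) recommend in place of spread(); that second call is an addition for comparison and is not part of the original slides.

```r
library(tidyr)

# table2 rebuilt by hand from the slide (long layout: one row per country/year/key)
table2 <- data.frame(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 4),
  year    = rep(c(1999L, 1999L, 2000L, 2000L), times = 3),
  key     = rep(c("cases", "population"), times = 6),
  value   = c(745, 19987071, 2666, 20595360,
              37737, 172006362, 80488, 174504898,
              212258, 1272915272, 213766, 1280428583),
  stringsAsFactors = FALSE
)

# spread(): each unique value of `key` becomes a column filled from `value`
wide_spread <- spread(table2, key, value)

# pivot_wider(): the same reshaping in the newer tidyr interface (assumes tidyr >= 1.0.0)
wide_pivot <- pivot_wider(table2, names_from = key, values_from = value)

# With cases and population in their own columns, the rate is a plain division.
wide_spread$rate <- wide_spread$cases / wide_spread$population
print(wide_spread)
```

Either call answers the question posed for table2, because the reshaped result has the same layout as table1.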
  • 20. gather()
    gather() is the reverse of spread().
    > library(tidyr)
    > gather(table1, key = "KEYS", value = "DataValue", 3:4)
    Source: local data frame [12 x 4]

           country  year       KEYS  DataValue
            (fctr) (int)      (chr)      (int)
    1  Afghanistan  1999      cases        745
    2  Afghanistan  2000      cases       2666
    3       Brazil  1999      cases      37737
    4       Brazil  2000      cases      80488
    5        China  1999      cases     212258
    6        China  2000      cases     213766
    7  Afghanistan  1999 population   19987071
    8  Afghanistan  2000 population   20595360
    9       Brazil  1999 population  172006362
    10      Brazil  2000 population  174504898
    (first 10 of 12 rows shown)
  • 21-22. separate()
    separate() divides one column into multiple columns.
    Usage: separate(data, columnToBeSeparated, into = names of the new columns, sep = regular expression for the separator)
    By default, the separator is any sequence of non-alphanumeric characters.
    > table3
    Source: local data frame [6 x 3]

          country  year              rate
           (fctr) (int)             (chr)
    1 Afghanistan  1999      745/19987071
    2 Afghanistan  2000     2666/20595360
    3      Brazil  1999   37737/172006362
    4      Brazil  2000   80488/174504898
    5       China  1999 212258/1272915272
    6       China  2000 213766/1280428583
  • 23. separate()
    > separate(table3, rate, into = c("cases", "population"))
    Source: local data frame [6 x 4]

          country  year  cases population
           (fctr) (int)  (chr)      (chr)
    1 Afghanistan  1999    745   19987071
    2 Afghanistan  2000   2666   20595360
    3      Brazil  1999  37737  172006362
    4      Brazil  2000  80488  174504898
    5       China  1999 212258 1272915272
    6       China  2000 213766 1280428583
  • 24. separate()
    > separate(table3, rate, into = c("cases", "population"), sep = "/")
    Source: local data frame [6 x 4]

          country  year  cases population
           (fctr) (int)  (chr)      (chr)
    1 Afghanistan  1999    745   19987071
    2 Afghanistan  2000   2666   20595360
    3      Brazil  1999  37737  172006362
    4      Brazil  2000  80488  174504898
    5       China  1999 212258 1272915272
    6       China  2000 213766 1280428583
  • 25. separate()
    The call on the slide is cut off before its closing parenthesis. The integer column types in the output suggest that it ended with convert = TRUE, which is assumed here.
    > t <- separate(table3, rate, into = c("cases", "population"), sep = "/", convert = TRUE)
    > t
    Source: local data frame [6 x 4]

          country  year  cases population
           (fctr) (int)  (int)      (int)
    1 Afghanistan  1999    745   19987071
    2 Afghanistan  2000   2666   20595360
    3      Brazil  1999  37737  172006362
    4      Brazil  2000  80488  174504898
    5       China  1999 212258 1272915272
    6       China  2000 213766 1280428583
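The difference between the character and integer results in the separate() slides can be sketched as below. The table3 data frame is rebuilt by hand from the slide values (an assumption), and the use of convert = TRUE is one plausible way to obtain integer columns rather than a confirmed reconstruction of the authors' call.

```r
library(tidyr)

# table3 rebuilt by hand from the slide: the rate column packs two numbers into one string
table3 <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year    = c(1999L, 2000L, 1999L, 2000L, 1999L, 2000L),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583"),
  stringsAsFactors = FALSE
)

# Without convert, the new columns stay character and cannot be divided directly.
as_text <- separate(table3, rate, into = c("cases", "population"), sep = "/")

# With convert = TRUE, separate() converts the pieces to a suitable type (integer here).
as_ints <- separate(table3, rate, into = c("cases", "population"),
                    sep = "/", convert = TRUE)

str(as_text$cases)
str(as_ints$cases)

# Now the rate question can be answered.
as_ints$rate <- as_ints$cases / as_ints$population
print(as_ints)
```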
  • 26-27. separate()
    You can also pass an integer or a vector of integers to sep. separate() interprets the integers as positions to split at. Positive values start at 1 at the far left of the strings; negative values start at -1 at the far right of the strings.
    > t1 <- separate(t, year, into = c("centuary", "year"), sep = 2)
    > t1
    Source: local data frame [6 x 5]

          country centuary  year  cases population
           (fctr)    (chr) (chr)  (int)      (int)
    1 Afghanistan       19    99    745   19987071
    2 Afghanistan       20    00   2666   20595360
    3      Brazil       19    99  37737  172006362
    4      Brazil       20    00  80488  174504898
    5       China       19    99 212258 1272915272
    6       China       20    00 213766 1280428583
  • 28. unite()
    unite() is the reverse of separate(). Here we combine centuary and year into a new column, year1. The separator must be the empty string to get values such as 1999.
    > ut <- unite(t1, "year1", centuary, year, sep = "")
    > ut
    Source: local data frame [6 x 4]

          country year1  cases population
           (fctr) (chr)  (int)      (int)
    1 Afghanistan  1999    745   19987071
    2 Afghanistan  2000   2666   20595360
    3      Brazil  1999  37737  172006362
    4      Brazil  2000  80488  174504898
    5       China  1999 212258 1272915272
    6       China  2000 213766 1280428583
  • 29. unite()
    unite() is the reverse of separate(). Using "/" with no surrounding spaces as the separator reproduces the original rate strings.
    > ut1 <- unite(t1, "rate", cases, population, sep = "/")
    > ut1
    Source: local data frame [6 x 4]

          country centuary  year              rate
           (fctr)    (chr) (chr)             (chr)
    1 Afghanistan       19    99      745/19987071
    2 Afghanistan       20    00     2666/20595360
    3      Brazil       19    99   37737/172006362
    4      Brazil       20    00   80488/174504898
    5       China       19    99 212258/1272915272
    6       China       20    00 213766/1280428583
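A small round trip illustrates that unite() and separate() undo each other. The t1 data frame below is a cut-down, hand-built stand-in for the one on the slides (only the country, centuary and year columns), so treat it as an assumption made for the example. The column name centuary keeps the spelling used in the slides.

```r
library(tidyr)

# Hypothetical stand-in for t1 with the year already split into two character columns
t1 <- data.frame(
  country  = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  centuary = c("19", "20", "19", "20", "19", "20"),
  year     = c("99", "00", "99", "00", "99", "00"),
  stringsAsFactors = FALSE
)

# unite(): paste the two columns together; sep = "" avoids a gap inside the year
ut <- unite(t1, "year1", centuary, year, sep = "")

# separate() with an integer sep splits after the 2nd character, reversing the unite()
back <- separate(ut, year1, into = c("centuary", "year"), sep = 2)

print(ut)
print(back)
```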
  • 30. Help
    For more information, type:
    > ?spread
    > ?gather
    > ?separate
    > ?unite
  • 31. 2. dplyr: a grammar of data manipulation
    dplyr is the next iteration of plyr (a tool for manipulating all kinds of data structures).
    dplyr provides a flexible grammar of data manipulation.
    It is a tool for manipulating data frames (tables).
  • 32. Grammar of dplyr
    select: return a subset of the columns of a data frame
    filter: extract a subset of rows from a data frame based on logical conditions
    arrange: reorder the rows of a data frame
    rename: rename variables in a data frame
    mutate: add new variables/columns or transform existing variables
    summarise: generate summary statistics of different variables in the data frame
    %>%: the "pipe" operator connects multiple verb actions together into a pipeline
  • 33. Common properties of dplyr functions
    The first argument is a data frame.
    The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator.
    The result of a function is a new data frame.
  • 35-36. Data set
    How to convert a month number into a month name?
    > airq <- airquality
    > for (i in 5:9) {
    +   l <- airq$Month == i
    +   airq$Month[l] <- month.abb[i]
    + }
    > str(airq)
    'data.frame': 153 obs. of 6 variables:
     $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
     $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
     $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
     $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
     $ Month  : chr  "May" "May" "May" "May" ...
     $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
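The loop above can also be written as a single vectorised lookup, since month.abb is a built-in character vector indexed by month number. The sketch below uses only base R and the built-in airquality data set; the MonthF factor column is a hypothetical addition showing one way to keep calendar order instead of alphabetical order.

```r
airq <- airquality

# Index month.abb by the month number: 5 -> "May", 6 -> "Jun", and so on.
airq$Month <- month.abb[airq$Month]

# Hypothetical extra column: a factor with levels in calendar order,
# so grouped summaries list May before Jun rather than Aug first.
airq$MonthF <- factor(airq$Month, levels = month.abb[5:9])

str(airq$Month)
table(airq$MonthF)
```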
  • 37. select()
    select(): extract columns from a data frame.
    > library(dplyr)
    > names(airq)
    [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
    > subset <- select(airq, Solar.R:Temp)
    > head(subset)
      Solar.R Wind Temp
    1     190  7.4   67
    2     118  8.0   72
    3     149 12.6   74
    4     313 11.5   62
    5      NA 14.3   56
    6      NA 14.9   66
  • 38. select()
    > subset <- select(airq, starts_with("So"))
    > subset[1:5, ]
    [1] 190 118 149 313  NA
  • 39. select()
    > subset <- select(airq, ends_with("mp"))
    > subset[1:5, ]
    [1] 67 72 74 62 56
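The three select() calls above can be run together on the built-in airquality data, as sketched below. The variable names (by_range, by_prefix, by_suffix, without_day) are illustrative choices that avoid reusing the name subset, which is also a base R function. The drop = FALSE argument and the negative selection are additions not shown on the slides.

```r
library(dplyr)

airq <- airquality

# A range of adjacent columns, by name
by_range <- select(airq, Solar.R:Temp)

# Helper functions match on the column names
by_prefix <- select(airq, starts_with("So"))
by_suffix <- select(airq, ends_with("mp"))

# A minus sign drops a column instead of keeping it
without_day <- select(airq, -Day)

names(by_range)
names(without_day)

# A one-column data frame collapses to a vector when indexed with [rows, ];
# drop = FALSE keeps it as a data frame.
by_prefix[1:5, , drop = FALSE]
```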
  • 40. filter()
    filter(): extract a subset of the rows of a given data frame.
    > rsubset <- filter(airq, Temp > 75)
    > rsubset[1:5, ]
      Ozone Solar.R Wind Temp Month Day
    1    45     252 14.9   81   May  29
    2   115     223  5.7   79   May  30
    3    37     279  7.4   76   May  31
    4    NA     286  8.6   78   Jun   1
    5    NA     186  9.2   84   Jun   4
  • 41. filter()
    > rsubset <- filter(airq, Temp > 70, Temp < 80)
    > rsubset[1:5, ]
      Ozone Solar.R Wind Temp Month Day
    1    36     118  8.0   72   May   2
    2    12     149 12.6   74   May   3
    3     7      NA  6.9   74   May  11
    4    11     320 16.6   73   May  22
    5   115     223  5.7   79   May  30
  • 42. filter()
    > rsubset <- filter(airq, Temp > 70, Temp < 80, Month %in% c("May", "Aug"))
    > rsubset[1:5, ]
      Ozone Solar.R Wind Temp Month Day
    1    36     118  8.0   72   May   2
    2    12     149 12.6   74   May   3
    3     7      NA  6.9   74   May  11
    4    11     320 16.6   73   May  22
    5   115     223  5.7   79   May  30
    > tail(rsubset)
       Ozone Solar.R Wind Temp Month Day
    11    31     244 10.9   78   Aug  19
    12    44     190 10.3   78   Aug  20
    13    21     259 15.5   77   Aug  21
    14     9      36 14.3   72   Aug  22
    15    NA     255 12.6   75   Aug  23
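As a rough sketch of how the conditions in filter() combine, the block below repeats the last call on the built-in airquality data and compares it with a single condition joined by &. Comma-separated conditions are combined with a logical AND, so both forms should keep the same rows. The names mild and same are illustrative.

```r
library(dplyr)

airq <- airquality
airq$Month <- month.abb[airq$Month]  # month numbers to names, as in the earlier slide

# Comma-separated conditions: a row is kept only if every condition is TRUE
mild <- filter(airq, Temp > 70, Temp < 80, Month %in% c("May", "Aug"))

# The same filter written as one expression with &
same <- filter(airq, Temp > 70 & Temp < 80 & Month %in% c("May", "Aug"))

head(mild, 5)
nrow(mild) == nrow(same)
```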
  • 43. arrange()
    arrange(): reorder the rows according to one of the variables.
    > asub <- arrange(airq, Temp)
    > asub[1:10, ]
       Ozone Solar.R Wind Temp Month Day
    1     NA      NA 14.3   56   May   5
    2      6      78 18.4   57   May  18
    3     NA      66 16.6   57   May  25
    4     NA      NA  8.0   57   May  27
    5     18      65 13.2   58   May  15
    6     NA     266 14.9   58   May  26
    7     19      99 13.8   59   May   8
    8      1       8  9.7   59   May  21
    9      8      19 20.1   61   May   9
    10     4      25  9.7   61   May  23
  • 44. arrange()
    > asub <- arrange(airq, desc(Temp))
    > asub[1:10, ]
       Ozone Solar.R Wind Temp Month Day
    1     76     203  9.7   97   Aug  28
    2     84     237  6.3   96   Aug  30
    3    118     225  2.3   94   Aug  29
    4     85     188  6.3   94   Aug  31
    5     NA     259 10.9   93   Jun  11
    6     73     183  2.8   93   Sep   3
    7     91     189  4.6   93   Sep   4
    8     NA     250  9.2   92   Jun  12
    9     97     267  6.3   92   Jul   8
    10    97     272  5.7   92   Jul   9
  • 45. rename()
    rename(): rename a variable.
    > resub <- rename(airq, NewTemp = Temp)
    > head(resub)
      Ozone Solar.R Wind NewTemp Month Day
    1    41     190  7.4      67   May   1
    2    36     118  8.0      72   May   2
    3    12     149 12.6      74   May   3
    4    18     313 11.5      62   May   4
    5    NA      NA 14.3      56   May   5
    6    28      NA 14.9      66   May   6
  • 46. rename()
    > resub <- rename(airq, NewTemp = Temp, "New Wind" = Wind)
    > head(resub)
      Ozone Solar.R New Wind NewTemp Month Day
    1    41     190      7.4      67   May   1
    2    36     118      8.0      72   May   2
    3    12     149     12.6      74   May   3
    4    18     313     11.5      62   May   4
    5    NA      NA     14.3      56   May   5
    6    28      NA     14.9      66   May   6
  • 47. mutate()
    mutate(): add a new column to a data frame.
    > msub <- mutate(airq, mdTemp = Temp - mean(Temp, na.rm = T))
    > head(msub)
      Ozone Solar.R Wind Temp Month Day     mdTemp
    1    41     190  7.4   67   May   1 -10.882353
    2    36     118  8.0   72   May   2  -5.882353
    3    12     149 12.6   74   May   3  -3.882353
    4    18     313 11.5   62   May   4 -15.882353
    5    NA      NA 14.3   56   May   5 -21.882353
    6    28      NA 14.9   66   May   6 -11.882353
  • 48. transmute()
    transmute(): similar to mutate(), but drops all non-transformed variables.
    > tmsub <- transmute(airq, mdTemp = Temp - mean(Temp, na.rm = T), WindSquare = (Wind * Wind))
    > head(tmsub)
          mdTemp WindSquare
    1 -10.882353      54.76
    2  -5.882353      64.00
    3  -3.882353     158.76
    4 -15.882353     132.25
    5 -21.882353     204.49
    6 -11.882353     222.01
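The contrast between mutate() and transmute() can be checked directly on the built-in airquality data, as in this minimal sketch. The column names follow the slides; writing TRUE instead of T is a stylistic choice, since T can be reassigned.

```r
library(dplyr)

# mutate() keeps every existing column and appends the new one
msub <- mutate(airquality, mdTemp = Temp - mean(Temp, na.rm = TRUE))

# transmute() keeps only the columns it creates
tmsub <- transmute(airquality,
                   mdTemp     = Temp - mean(Temp, na.rm = TRUE),
                   WindSquare = Wind * Wind)

names(msub)
names(tmsub)
head(tmsub)
```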
  • 49. group_by()
    group_by(): group a data frame by one or more variables.
    > # scramble the rows
    > sairq <- airq[sample(1:153, 153), ]
    > sairq[1:10, ]
        Ozone Solar.R Wind Temp Month Day
    6      28      NA 14.9   66   May   6
    85     80     294  8.6   86   Jul  24
    99    122     255  4.0   89   Aug   7
    80     79     187  5.1   87   Jul  19
    68     77     276  5.1   88   Jul   7
    89     82     213  7.4   88   Jul  28
    12     16     256  9.7   69   May  12
    86    108     223  8.0   85   Jul  25
    153    20     223 11.5   68   Sep  30
    73     10     264 14.3   73   Jul  12
  • 50. group_by()
    > grsub <- group_by(sairq, Month)
    > summarize(grsub, tmean = mean(Temp, na.rm = T), max(Wind), minSolar = min(Solar.R, na.rm = T))
    Source: local data frame [5 x 4]

      Month    tmean max(Wind) minSolar
      (chr)    (dbl)     (dbl)    (int)
    1   Aug 83.96774      15.5       24
    2   Jul 83.90323      14.9        7
    3   Jun 79.10000      20.7       31
    4   May 65.54839      20.1        8
    5   Sep 76.90000      16.6       14
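The grouped summary above can be sketched on the built-in airquality data as follows. Naming every summary column (maxWind instead of the automatic max(Wind)) and adding a row count with n() are illustrative choices beyond what the slide shows.

```r
library(dplyr)

airq <- airquality
airq$Month <- month.abb[airq$Month]  # month numbers to names, as in the earlier slide

# group_by() only marks the groups; summarise() then returns one row per group
by_month <- group_by(airq, Month)
stats <- summarise(by_month,
                   tmean    = mean(Temp, na.rm = TRUE),
                   maxWind  = max(Wind),
                   minSolar = min(Solar.R, na.rm = TRUE),
                   n        = n())

print(stats)
```

Because Month is a character column here, the groups come out in alphabetical order (Aug first), which matches the output on the slide.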
  • 51. %>% "pipe operator"
    The pipe operator combines multiple functions in a sequence: the result on its left becomes the first argument of the call on its right.
    > f(x) %>% f(y)
    > # equivalent to
    > f(f(x), y)
    > # if we use the . placeholder
    > f(x) %>% f(y, .)
    > # equivalent to
    > f(y, f(x))
    > third(second(first(x)))
    > # equivalent to
    > first(x) %>% second() %>% third()
  • 52. %>% "pipe operator"
    > airq %>% group_by(Month) %>% summarise(mtemp = mean(Temp, na.rm = T),
    +   Smean = mean(Solar.R, na.rm = T))
    Source: local data frame [5 x 3]

      Month    mtemp    Smean
      (chr)    (dbl)    (dbl)
    1   Aug 83.96774 171.8571
    2   Jul 83.90323 216.4839
    3   Jun 79.10000 190.1667
    4   May 65.54839 181.2963
    5   Sep 76.90000 167.4333
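To make the equivalence between nested calls and a pipeline concrete, the sketch below computes the same monthly summary both ways on the built-in airquality data and also shows the . placeholder. The variable names nested, piped and rounded are illustrative, and the pipe used is the magrittr %>% that dplyr re-exports.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- summarise(group_by(airquality, Month),
                    mtemp = mean(Temp, na.rm = TRUE))

# Piped form: read from left to right, each result feeds the next call
piped <- airquality %>%
  group_by(Month) %>%
  summarise(mtemp = mean(Temp, na.rm = TRUE))

# The . placeholder sends the left-hand side to an argument other than the first
rounded <- 2 %>% round(3.14159, digits = .)

print(piped)
print(rounded)
```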

Editor's Notes

  1. Data wrangling, sometimes referred to as data munging, is the process of cleaning and unifying messy, complex data sets and of transforming and mapping data from a "raw" form into a format that is more appropriate and valuable for downstream purposes such as analytics. As the volume of data and the number of data sources grow, organizing that data for analysis becomes increasingly essential.
  2. http://garrettgman.github.io/tidying/
  3. A key value pair is a simple way to record information. A pair contains two parts: a key that explains what the information describes, and a value that contains the actual information. So for example, Password: 0123456789 would be a key value pair. 0123456789 is the value, and it is associated with the key Password.