Introduction to dplyr and base R functions for data manipulation
Kamal Gupta Roy
Last Edited on 3rd Nov 2021
Instructions/Agenda and Learnings
1. Use of functions like ls(), getwd(), setwd(), rm()
2. Install packages (dslabs, dplyr)
3. Load packages (dslabs, dplyr) with library()
4. Read the murders dataset
5. Functions: nrow, ncol, head, tail, summary, class (try on both the data frame and a variable), str, names, levels, nlevels
6. Accessing a value by position in a data frame
7. Reading a vector from a data frame and doing basic arithmetic on it
8. Order/Arrange - Sorting the data
9. Selecting a column
10. Filtering rows
11. Creating a new variable
12. Summarizing data
13. Summarizing while grouping
14. Chaining Method
15. Exercise
dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarise (plus group_by)
Basic Codes
Directory Details
#### workspace
ls()
## character(0)
#To know what is the default working directory
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
# Setting a Working Directory using setwd()
#setwd("C:/Users/Admin/")
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
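The agenda also lists rm(), which the session above does not demonstrate. A minimal sketch of how ls() and rm() work together, using throwaway object names (tmp_a, tmp_b) that are made up for illustration:

```r
# Create two throwaway objects (hypothetical names, not from the original session)
tmp_a <- 1:5
tmp_b <- "hello"

# ls() returns the names of the objects in the current workspace
c("tmp_a", "tmp_b") %in% ls()

# rm() removes the named object; tmp_b is left untouched
rm(tmp_a)
a_exists <- exists("tmp_a", inherits = FALSE)
b_exists <- exists("tmp_b", inherits = FALSE)
```

rm(list = ls()) would clear every object in the workspace at once, so it is best used deliberately.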
Install packages
install.packages("dslabs")
install.packages("dplyr")
Load packages
library(dslabs)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
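Running install.packages() on every run re-downloads packages that are already present. One possible way to check first is sketched here; missing_packages() is a hypothetical helper written for this example, not a function from base R or dplyr, and the install line is left commented out on purpose:

```r
# Return the subset of `pkgs` that is not installed yet.
# missing_packages() is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dslabs", "dplyr"))
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```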
Read dataframe
murder <- data.frame(murders)
Basic check on data
nrow(murder)
## [1] 51
ncol(murder)
## [1] 5
head(murder)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
murder[1,1]
## [1] "Alabama"
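The murder[1,1] call above picks one cell by row and column position. The same bracket syntax selects whole rows or columns when one index is left empty. A small sketch, assuming a three-row copy of the data (murder_small is a name introduced here so the example runs without dslabs):

```r
# First three rows of the murders data, copied from the head() output above
murder_small <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  abb        = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017),
  total      = c(135, 19, 232),
  stringsAsFactors = FALSE
)

first_cell <- murder_small[1, 1]        # row 1, column 1: a single value
second_row <- murder_small[2, ]         # whole row: a one-row data frame
abb_column <- murder_small[, 2]         # whole column: a plain vector
az_total   <- murder_small[3, "total"]  # row by position, column by name
```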
tail(murder)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
summary(murder)
##     state               abb                      region     population
##  Length:51          Length:51          Northeast    : 9   Min.   :  563626
##  Class :character   Class :character   South        :17   1st Qu.: 1696962
##  Mode  :character   Mode  :character   North Central:12   Median : 4339367
##                                        West         :13   Mean   : 6075769
##                                                           3rd Qu.: 6636084
##                                                           Max.   :37253956
##      total
##  Min.   :   2.0
##  1st Qu.:  24.5
##  Median :  97.0
##  Mean   : 184.4
##  3rd Qu.: 268.0
##  Max.   :1257.0
class(murder)
## [1] "data.frame"
class(murder$state)
## [1] "character"
str(murder)
## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
names(murder)
## [1] "state" "abb" "region" "population" "total"
levels(murder$region)
## [1] "Northeast" "South" "North Central" "West"
nlevels(murder$region)
## [1] 4
Read a vector from data frame
mdr <- murder$total
sum(mdr)
## [1] 9403
mean(mdr)
## [1] 184.3725
max(mdr)
## [1] 1257
min(mdr)
## [1] 2
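Arithmetic on vectors is element-wise, so two columns pulled out of the data frame can be combined directly. A short sketch that computes a murder rate per 100,000 residents for the six states shown in the head() output (the rate calculation is an illustration added here, not part of the original session):

```r
# Values copied from the head(murder) output earlier in the handout
state      <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado")
population <- c(4779736, 710231, 6392017, 2915918, 37253956, 5029196)
total      <- c(135, 19, 232, 93, 1257, 65)

# Element-wise arithmetic: one rate per state
rate <- total / population * 100000   # murders per 100,000 residents

# which.max() gives the position of the largest value
highest <- state[which.max(rate)]
round(rate, 2)
```

Among these six rows the highest rate belongs to Arizona, not to California, even though California has by far the largest total.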
dplyr functions
Sorting data
Simple R
rway <- murder[order(murder$total),]
head(rway)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 35 North Dakota ND North Central 672591 4
## 30 New Hampshire NH Northeast 1316470 5
## 51 Wyoming WY West 563626 5
## 12 Hawaii HI West 1360301 7
## 42 South Dakota SD North Central 814180 8
dplyr
dpway <- arrange(murder, total)
head(dpway)
## state abb region population total
## 1 Vermont VT Northeast 625741 2
## 2 North Dakota ND North Central 672591 4
## 3 New Hampshire NH Northeast 1316470 5
## 4 Wyoming WY West 563626 5
## 5 Hawaii HI West 1360301 7
## 6 South Dakota SD North Central 814180 8
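Both calls above sort in ascending order. A sketch of the descending versions, assuming a six-row copy of the data (murder_small is a name introduced here so the example runs without dslabs):

```r
library(dplyr)

# Six rows copied from the head(murder) output earlier in the handout
murder_small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Base R: decreasing = TRUE reverses the sort order
rway_desc <- murder_small[order(murder_small$total, decreasing = TRUE), ]

# dplyr: wrap the column in desc()
dpway_desc <- arrange(murder_small, desc(total))
```

As in the outputs above, the base R result keeps the original row names while arrange() renumbers the rows from 1.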
Selecting a column
Simple R
rway <- murder[,"state"]
head(rway)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado"
class(rway)
## [1] "character"
rway <- murder[,c("state","total")]
head(rway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
dplyr
dpway <- select(murder,state)
head(dpway)
## state
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
class(dpway)
## [1] "data.frame"
dpway <- select(murder,state,total)
head(dpway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
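The class() calls above show the main difference: base R returns a character vector for a single column, while select() always returns a data frame. Each side can produce the other form. A sketch, assuming a three-row copy of the data (murder_small is a name introduced for this example):

```r
library(dplyr)

# Three rows copied from the head(murder) output earlier in the handout
murder_small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  total = c(135, 19, 232),
  stringsAsFactors = FALSE
)

# Base R: drop = FALSE keeps a single column as a data frame
rway_df <- murder_small[, "state", drop = FALSE]

# dplyr: pull() extracts a single column as a vector
dpway_vec <- pull(murder_small, state)

# dplyr: a minus sign drops a column instead of keeping it
no_total <- select(murder_small, -total)
```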
Filtering rows
Simple R
rway <- murder[murder$state=='California',]
head(rway)
## state abb region population total
## 5 California CA West 37253956 1257
dplyr
dpway <- filter(murder,state=='California')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' & abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California', abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' | abb=='WI')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 Wisconsin WI North Central 5686986 97
dpway <- filter(murder,abb %in% c('CA','WI','NY'))
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 New York NY Northeast 19378102 517
## 3 Wisconsin WI North Central 5686986 97
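The filters above all compare text columns. Numeric conditions work the same way, and the condition may itself use a value computed from the column. A sketch, assuming a six-row copy of the data (murder_small and the threshold of 100 are chosen here for illustration):

```r
library(dplyr)

# Six rows copied from the head(murder) output earlier in the handout
murder_small <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Base R: logical test inside the row index
rway_big <- murder_small[murder_small$total > 100, ]

# dplyr: same condition, columns referenced by bare name
dpway_big <- filter(murder_small, total > 100)

# The condition can use a summary value computed on the fly
above_avg <- filter(murder_small, population > mean(population))
```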
Creating a new variable
Simple R
murder$newpop <- murder$population / 1000
head(murder)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
dplyr
dpway <- mutate(murder,newpop=population/1000)
head(dpway)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
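mutate() can create several columns in one call, and later expressions may refer to columns created earlier in the same call. A sketch, assuming a three-row copy of the data (murder_small, rate, and rate_rounded are names introduced for this example):

```r
library(dplyr)

# Three rows copied from the head(murder) output earlier in the handout
murder_small <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  population = c(4779736, 710231, 6392017),
  total      = c(135, 19, 232),
  stringsAsFactors = FALSE
)

dpway_rate <- mutate(
  murder_small,
  rate         = total / population * 100000,  # murders per 100,000 residents
  rate_rounded = round(rate, 1)                # uses the column created just above
)
```

Unlike the base R assignment murder$newpop <- ..., mutate() returns a new data frame and leaves its input unchanged.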
summarise: Reduce variables to values
• Primarily useful with data that has been grouped by one or more variables
• group_by creates the groups that will be operated on
• summarise uses the provided aggregation function to summarise each group
dplyr way - summarize
summarise(murder,summurder=sum(total,na.rm=TRUE))
## summurder
## 1 9403
summarise(murder,avgmurder=mean(total,na.rm=TRUE))
## avgmurder
## 1 184.3725
summarise(murder,countrows=n())
## countrows
## 1 51
summarise(murder,summurder=sum(total,na.rm=TRUE),
          avgmurder=mean(total,na.rm=TRUE),countrows=n())
## summurder avgmurder countrows
## 1 9403 184.3725 51
dplyr way - group by
m1 <- group_by(murder,region)
# md: total murders per region; pop: average state population per region; cn: number of states
ab <- summarise(m1,md=sum(total, na.rm=TRUE),
                pop = mean(population, na.rm=TRUE),
                cn = n())
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 6146360 9
## 2 South 4195 6804378 17
## 3 North Central 1828 5577250 12
## 4 West 1911 5534273 13
Chaining Method
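# Same grouping as before, written as a pipeline; here pop is the total
# population of each region (sum), not the per-state average (mean)
ab <- murder %>%
  group_by(region) %>%
  summarise(md = sum(total, na.rm=TRUE),
            pop = sum(population, na.rm=TRUE),
            cn=n())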
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 55317240 9
## 2 South 4195 115674434 17
## 3 North Central 1828 66927001 12
## 4 West 1911 71945553 13
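A chain can continue past summarise(), so derived figures and sorting fit in the same pipeline. A sketch that adds a rate per 100,000 residents and sorts by it, assuming a six-row copy of the data (murder_small and the rate column are introduced here; the figures cover only these six states, not the full regions):

```r
library(dplyr)

# Six rows copied from the head(murder) output earlier in the handout
murder_small <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region     = c("South", "West", "West", "South", "West", "West"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

region_rate <- murder_small %>%
  group_by(region) %>%
  summarise(md  = sum(total),
            pop = sum(population),
            cn  = n(),
            .groups = "drop") %>%            # return an ungrouped result
  mutate(rate = md / pop * 100000) %>%       # murders per 100,000 residents
  arrange(desc(rate))
```

The .groups = "drop" argument requires dplyr 1.0.0 or later.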
Exercises
Exercise 1
Do the following for the murders dataset:
i. Get the murders dataset (as was done in class).
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names).
iii. Which three states have the highest population?
iv. How many states have a population above the average?
v. What is the total population of the US (as the actual number and in millions)?
vi. What is the total number of murders across the US?
vii. What is the average number of murders?
viii. What is the total number of murders in the South region?
ix. How many states are there in each region?
x. What is the murder rate in each region?
xi. Which is the most dangerous state?
Exercise 2
Do the following for the mtcars dataset:
i. Get the mtcars dataset.
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names).
iii. How many different gear counts are there?
iv. Which type of transmission is more common: automatic or manual?
v. What is the average hp by number of cylinders?
vi. What is the average hp by number of gears?
vii. Does mpg depend on the number of gears?
viii. Does the weight of a car depend on the number of cylinders?