dplyr Package
Introduction
Helps transform and manipulate data
Powerful tool to summarise data sets
Install: install.packages("dplyr")
Activate: library(dplyr)
File: Excel
Variables: 7
Observations: 153
▪ select
▪ filter
▪ arrange
▪ distinct
▪ mutate
▪ transmute
▪ group_by
▪ summarise
▪ pipe operator (%>%)
▪ slice
▪ count
Functions in dplyr
• Keeps only those variables (columns) that you
want to retain/extract.
• Syntax: select(dataset,[column1],[column2],…)
Examples:
▪ Select columns Month, Dealer, Item, Quantity: select(sales,Month,Dealer,Item,Qty)
▪ Select columns from Month to Quantity: select(sales,Month:Qty)
▪ Deselect column Month from the dataset: select(sales,-Month)
▪ Select columns ending with the letter “r”: select(sales,ends_with("r"))
▪ Select columns containing the letter “r”: select(sales,contains("r"))
▪ Select columns whose names match the regular expression “m.” (an “m” followed by any character): select(sales,matches("m."))
▪ Select columns named in a character vector: select(sales,one_of(c("Month","Dealer")))
▪ Select columns starting with the letter “d”: select(sales,starts_with("d"))
select()
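A minimal sketch of the select() helpers above. The `sales` data frame here is a hypothetical four-column stand-in (the real Excel file is not shown), and the dealer names are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Month  = c("April", "May", "May"),
  Dealer = c("North", "South", "North"),
  Item   = c("Pen", "Pencil", "Pen"),
  Qty    = c(40, 60, 75)
)

by_name  <- select(sales, Month, Dealer)     # pick columns by name
by_range <- select(sales, Dealer:Qty)        # contiguous range of columns
dropped  <- select(sales, -Month)            # everything except Month
ends_r   <- select(sales, ends_with("r"))    # names ending in "r"
starts_m <- select(sales, matches("^m"))     # regex anchored at the start

names(ends_r)
names(starts_m)
```

The helpers ignore case by default, which is why "^m" picks up Month.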
• Keeps only those records (rows) that you want to
retain/extract.
• Syntax: filter(dataset,criteria)
Examples:
▪ Item is Pen: filter(sales,Item=="Pen")
▪ Quantity is more than 50: filter(sales,Qty>50)
▪ Item is Pencil and Quantity is more than 50: filter(sales,Item=="Pencil"&Qty>50)
▪ Quantity is between 50 and 80 (both excluded): filter(sales,Qty>50&Qty<80)
▪ Item is Pencil or Quantity is more than 50: filter(sales,Item=="Pencil"|Qty>50)
filter()
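The filter() conditions above can be tried on a small data frame. This is a sketch only: `sales` is a hypothetical stand-in for the Excel data, and between() is shown as an inclusive alternative to the two-sided comparison.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Item = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty  = c(40, 60, 75, 90)
)

pens     <- filter(sales, Item == "Pen")                 # one condition
big      <- filter(sales, Qty > 50)                      # numeric comparison
both     <- filter(sales, Item == "Pencil" & Qty > 50)   # AND
either   <- filter(sales, Item == "Pencil" | Qty > 50)   # OR
in_range <- filter(sales, between(Qty, 50, 80))          # 50 <= Qty <= 80

nrow(either)
```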
Examples:
▪ We want to extract the Sales Manager, Item and Quantity but only for Pencil:
i) k=select(sales,SalesManager,Item,Qty)
filter(k,Item=="Pencil")
ii) select(filter(sales,Item=="Pencil"),SalesManager,Item,Qty)
iii) filter(select(sales,SalesManager,Item,Qty),Item=="Pencil")
▪ We want to extract Dealer, Item and Quantity for the Month of May:
i) filter(select(sales,Dealer,Item,Qty),sales$Month=="May")
(works only because the rows still line up with sales)
ii) filter(select(sales,Dealer,Item,Qty),Month=="May")
(fails: Month was already dropped by select())
iii) select(filter(sales,Month=="May"),Dealer,Item,Qty)
(filter first, then select)
select() and filter()
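The order of select() and filter() matters when the filter column is not among the selected ones. A small sketch, assuming a hypothetical `sales` data frame and no other object called Month in the session:

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Month  = c("April", "May", "May", "June"),
  Dealer = c("North", "South", "North", "South"),
  Item   = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty    = c(40, 60, 75, 90)
)

# Filter on Month while it still exists, then drop it.
ok <- select(filter(sales, Month == "May"), Dealer, Item, Qty)

# Selecting first removes Month, so the later filter cannot find it.
err <- tryCatch(
  filter(select(sales, Dealer, Item, Qty), Month == "May"),
  error = function(e) "error"
)

ok
```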
• Orders or sorts the records (rows) based on the
variable(s).
• By default the arrangement is in ascending order.
• Syntax: arrange(dataset,column1,[column2],…)
Examples:
▪ Sort the dataset by Month: arrange(sales,Month)
▪ Sort the dataset by Month and then Dealer: arrange(sales,Month,Dealer)
▪ Arrange the data in descending order of Quantity: arrange(sales,desc(Qty))
arrange()
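A short sketch of arrange() on a hypothetical `sales` data frame. It also shows one point the examples above do not cover: month names stored as text sort alphabetically, so a factor with calendar-ordered levels (here base R's month.name) is one way to get calendar order.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Month  = c("May", "April", "June", "May"),
  Dealer = c("South", "North", "South", "North"),
  Qty    = c(60, 40, 90, 75)
)

by_qty      <- arrange(sales, Qty)                # ascending (default)
by_qty_desc <- arrange(sales, desc(Qty))          # descending
by_dealer   <- arrange(sales, Dealer, desc(Qty))  # Dealer A-Z, largest Qty first

# Character months sort alphabetically (April, June, May), not by calendar.
alphabetical <- arrange(sales, Month)
# Ordering by a factor whose levels follow month.name gives calendar order.
calendar <- arrange(sales, factor(Month, levels = month.name))
```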
• Helps extract unique values (or unique combinations of values) from one or more variables.
• Syntax: distinct(dataset,[column1],[column2],…)
Examples:
▪ Find the names of the Dealers: distinct(sales,Dealer)
▪ Find the items sold by each Dealer: arrange(distinct(sales,Dealer,Item),Dealer)
distinct()
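A minimal sketch of distinct() with one and with two columns, using a hypothetical `sales` data frame that contains repeated Dealer/Item pairs.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Dealer = c("North", "South", "North", "South", "North"),
  Item   = c("Pen", "Pencil", "Pen", "Pencil", "Pencil"),
  Qty    = c(40, 60, 75, 90, 20)
)

dealers <- distinct(sales, Dealer)                         # unique dealers only
pairs   <- arrange(distinct(sales, Dealer, Item), Dealer)  # unique Dealer/Item pairs

pairs
```

Only the named columns are returned; Qty is dropped from both results.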
• Returns the dataset with a new variable (column) added; the original dataset is unchanged unless the result is assigned back
• Syntax: mutate(dataset,newcolumn=expression)
Example:
▪ Add a new column Target that is twice the Quantity: mutate(sales,Target=Qty*2)
mutate()
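A small sketch of mutate() on a hypothetical `sales` data frame, showing that the result has to be assigned to be kept.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Item = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty  = c(40, 60, 75, 90)
)

# mutate() returns a new data frame; sales itself is not modified.
with_target <- mutate(sales, Target = Qty * 2)

names(with_target)
```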
• Creates a new variable (column) and drops the existing ones
• Syntax: transmute(dataset,newcolumn=expression)
Example:
▪ Create a new column Target that is twice the Quantity: transmute(sales,Target=2*Qty)
transmute()
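A sketch contrasting transmute() with mutate(), assuming a hypothetical `sales` data frame. The `.keep = "none"` form of mutate() is shown as an equivalent for ungrouped data; it assumes a dplyr version that has the `.keep` argument (1.0.0 or later).

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Item = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty  = c(40, 60, 75, 90)
)

only_target <- transmute(sales, Target = 2 * Qty)              # keeps Target only
same_result <- mutate(sales, Target = 2 * Qty, .keep = "none") # same columns here

names(only_target)
```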
• Helps create groups in a dataset based on one or more variables.
• Useful when nested with other functions.
• Syntax: group_by(dataset,column1,[column2],…)
• Ungroup Syntax: ungroup(dataset)
Example:
▪ Create groups in the data based on Items: group_by(sales,Item)
▪ Get the maximum units sold for each item: filter(group_by(sales,Item),Qty==max(Qty))
group_by()
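A minimal sketch of group_by() followed by a per-group filter, on a hypothetical `sales` data frame. Once grouped, max(Qty) is evaluated separately within each Item.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Item = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty  = c(40, 60, 75, 90)
)

grouped <- group_by(sales, Item)             # same rows, plus grouping metadata
top     <- filter(grouped, Qty == max(Qty))  # best row within each Item
top     <- ungroup(top)                      # drop the grouping again

top
```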
• Helps generate a single number/statistic for the dataset, or one per group when the dataset is grouped
• Syntax: summarise(dataset,newvariable=function(column),…)
Examples:
▪ Total number of units sold across all Items:
summarise(sales,total=sum(Qty))
▪ Total number of units sold and total amount:
summarise(sales,t_Qty=sum(Qty),t_Amount=sum(Amount))
▪ Total number of records in the dataset:
summarise(sales,rowscount=n())
▪ Get the total number of records, quantity sold and amount for each item:
summarise(group_by(sales,Item),rcount=n(),unitqty=sum(Qty),totalamount=sum(Amount))
▪ Every statistic for each dealer and their respective items:
summarise(group_by(sales,Dealer,Item),rcount=n(),unitqty=sum(Qty),totalamount=sum(Amount))
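A short sketch of summarise() on the whole dataset and per group. The `sales` data frame and its Amount values are hypothetical stand-ins for the Excel data.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Item   = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty    = c(40, 60, 75, 90),
  Amount = c(400, 300, 750, 450)
)

# One row for the whole dataset.
overall <- summarise(sales, total = sum(Qty), rowscount = n())

# One row per Item.
per_item <- summarise(
  group_by(sales, Item),
  rcount      = n(),
  unitqty     = sum(Qty),
  totalamount = sum(Amount)
)

per_item
```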
summarise()
▪ We want to extract the top 6 records for Dealers who have sold the Item Pen only:
filter(sales,Item=="Pen")
select(filter(sales,Item=="Pen"),Item,Dealer,Qty)
arrange(select(filter(sales,Item=="Pen"),Item,Dealer,Qty),Dealer)
head(arrange(select(filter(sales,Item=="Pen"),Item,Dealer,Qty),Dealer))
▪ We want the maximum quantity of every item for the month of May with just Dealer, Item and Quantity variables (filter on Month before select() drops it):
filter(sales,Month=="May")
select(filter(sales,Month=="May"),Dealer,Item,Qty)
group_by(select(filter(sales,Month=="May"),Dealer,Item,Qty),Item,Dealer)
summarise(group_by(select(filter(sales,Month=="May"),Dealer,Item,Qty),Item,Dealer),max(Qty))
Assignment
• Belongs to the magrittr package (dplyr re-exports it).
• Helps structure a sequence of operations in a single statement, read from left to right.
• Helps avoid nesting of functions.
• Operator: %>%
Examples:
▪ We want to extract the top 6 records for Dealers who have sold the Item Pen only:
sales%>%filter(Item=="Pen")%>%select(Dealer,Item,Qty)%>%arrange(Dealer)%>%head()
▪ We want the maximum quantity of every item for the month of May with just Dealer, Item and Quantity variables:
sales%>%filter(Month=="May")%>%select(Dealer,Item,Qty)%>%group_by(Item,Dealer)%>%summarise(max(Qty))
pipe operator %>%
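A sketch comparing the nested and piped forms of the same query, on a hypothetical `sales` data frame. The column name `max_qty` and the `.groups = "drop"` argument (available from dplyr 1.0.0) are choices made for this illustration, not part of the original examples.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Month  = c("April", "May", "May", "June"),
  Dealer = c("North", "South", "North", "South"),
  Item   = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty    = c(40, 60, 75, 90)
)

# Nested: read from the inside out.
nested <- head(arrange(select(filter(sales, Item == "Pen"), Dealer, Item, Qty), Dealer))

# Piped: the same steps, read from left to right.
piped <- sales %>%
  filter(Item == "Pen") %>%
  select(Dealer, Item, Qty) %>%
  arrange(Dealer) %>%
  head()

# Maximum quantity per Item and Dealer for May.
may_max <- sales %>%
  filter(Month == "May") %>%
  group_by(Item, Dealer) %>%
  summarise(max_qty = max(Qty), .groups = "drop")

identical(nested, piped)
```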
• Helps extract records (rows) based on their
position.
• Syntax: slice(dataset,row numbers)
Examples:
▪ Select first ten rows: slice(sales,1:10)
▪ Select rows fifteen to twenty: slice(sales,15:20)
slice()
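A minimal sketch of slice() by position, using a hypothetical four-row `sales` data frame. slice_head() is shown as an alternative for taking the first rows; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Item = c("Pen", "Pencil", "Pen", "Pencil"),
  Qty  = c(40, 60, 75, 90)
)

first_two <- slice(sales, 1:2)          # rows 1 and 2
last_two  <- slice(sales, 3:4)          # rows 3 and 4
head_two  <- slice_head(sales, n = 2)   # same rows as first_two

first_two
```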
• Helps count the number of times each value has appeared in a variable.
• Syntax: count(dataset,[column1],[column2],…)
Examples:
▪ Count the number of times each Dealer has appeared: count(sales,Dealer)
▪ Count the rows where Item is Pen and where it is not (one row each for TRUE and FALSE): count(sales,Item=="Pen")
count()
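A short sketch of count() on a column and on a logical condition, with a hypothetical `sales` data frame. The name `is_pen` is introduced here only to give the condition column a readable label.

```r
library(dplyr)

# Hypothetical stand-in for the sales data; the real Excel file is not shown.
sales <- data.frame(
  Dealer = c("North", "South", "North", "South"),
  Item   = c("Pen", "Pencil", "Pen", "Pencil")
)

dealer_n <- count(sales, Dealer)                  # one row per Dealer, counts in n
pen_n    <- count(sales, is_pen = Item == "Pen")  # one row each for FALSE and TRUE

dealer_n
```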
Thanks!
Any questions?
You can find me at
▪ cc@wkvedu.com
