Learn to use dplyr (Feb 2015 Philly R User Meetup)


A presentation on how to use the dplyr package in R, given at the February 2015 R User Meetup in Philadelphia.

Published in: Software
  1. Learn to use dplyr package
     Fan Li @ Philly R User Meetup (R<-Gang)
     Demo:
     Source code:
  2. What is dplyr
     A package developed by Hadley Wickham to help transform tabular data.
     ● Unified, intuitive syntax
     ● Fast implementation in C++
     ● Supports various data backends (data frames, relational databases, etc.)
     install.packages("dplyr") # version 0.4 at the time of the talk
  3. Basic Operators (Verbs)
     select(data, col.1, col.2, …): select existing variables (columns)
     filter(data, condition.1, condition.2, …): filter the table by conditions
     arrange(data, col.1, col.2, …): sort the table by variables or expressions of variables
     mutate(data, newcol = …): create new variables
     group_by() + summarize(): summarize data per group
     For a graphic explanation, see Garrett Grolemund’s talk.
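The verbs on this slide can be tried on a small table. The following is a minimal sketch, not code from the talk: the `snow` data frame and its values are invented for illustration, and the result names (`picked`, `february`, and so on) are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
snow <- data.frame(
  Year  = c(2000, 2000, 2001, 2001),
  Month = c(1, 2, 1, 2),
  Snow  = c(3.0, 5.5, 0.0, 2.5)
)

picked   <- select(snow, Year, Snow)             # keep only two columns
february <- filter(snow, Month == 2, Snow > 0)   # multiple conditions are combined with AND
sorted   <- arrange(snow, desc(Snow))            # largest snowfall first
flagged  <- mutate(snow, Heavy = Snow > 3)       # add a derived logical column
per_year <- summarize(group_by(snow, Year),      # one output row per Year
                      Snow.Sum = sum(Snow))
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what lets them be chained.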
  4. Other Helper Functions
     transmute, tally, top_n, summarize_each, sample_n/sample_frac, distinct, rename, slice, n_distinct, first/last/nth
     Type ?function_name in R to see how to use each one.
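A few of these helpers in action, as a rough sketch on made-up data (the `obs` data frame and result names are hypothetical). Some helpers on the slide, such as summarize_each, have since been deprecated in newer dplyr releases and are left out here.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
obs <- data.frame(
  Year = c(2000, 2000, 2001, 2001, 2001),
  Snow = c(3, 5, 0, 2, 2)
)

renamed   <- rename(obs, Snowfall = Snow)        # syntax is new_name = old_name
unique_r  <- distinct(obs)                       # drop duplicate rows
first_two <- slice(obs, 1:2)                     # pick rows by position
only_new  <- transmute(obs, Snow.x2 = Snow * 2)  # like mutate(), but keeps only the new column
counts    <- tally(group_by(obs, Year))          # row count per group, stored in column n
n_years   <- n_distinct(obs$Year)                # number of unique values in a vector
last_snow <- last(obs$Snow)                      # last element of a vector
second    <- nth(obs$Snow, 2)                    # element at a given position
```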
  5. %>% Pipe Operator
     data %>% function() = function(data, …)
     foo() %>% bar() = bar(foo())
     Very useful for converting nested calls into a more logical chain of expressions.
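To make the equivalence concrete, here is a small sketch that writes the same computation in nested and piped form. The `snow` data frame is invented for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
snow <- data.frame(Year = c(2000, 2000, 2001), Snow = c(3, 5, 2))

# Nested form: has to be read from the inside out.
nested <- arrange(
  summarize(group_by(snow, Year), Snow.Sum = sum(Snow)),
  desc(Snow.Sum)
)

# Piped form: the same steps, read top to bottom.
piped <- snow %>%
  group_by(Year) %>%
  summarize(Snow.Sum = sum(Snow)) %>%
  arrange(desc(Snow.Sum))

# Both forms should give the same table.
same <- isTRUE(all.equal(as.data.frame(nested), as.data.frame(piped)))
```

The pipe passes the left-hand side as the first argument of the call on the right, which is why every dplyr verb takes the data as its first argument.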
  6. Example
     feb.snow2 <- weather %>%
       select(Year, Month, Day, Snow) %>%                 # Step 1. Select relevant variables
       filter(Year >= 1885, Month == 2) %>%               # Step 2. Filter by year and month
       group_by(Year) %>%                                 # Step 3. Group by year
       summarize(Snow.Sum = sum(Snow, na.rm = TRUE)) %>%  # Step 4. Summarize monthly snowfall
       arrange(-Snow.Sum, -Year)                          # Step 5. Sort table by monthly snowfall/year
     Demo with Philly weather data (1872-2001)
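The slide's pipeline depends on the talk's `weather` table, which is not included in this transcript. As a sketch, the same steps can be run against a tiny stand-in table; the rows below are made up and only the column names follow the slide.

```r
library(dplyr)

# Hypothetical stand-in for the talk's `weather` table; all values are made up.
weather <- data.frame(
  Year  = c(1884, 1899, 1899, 1899, 2000, 2000),
  Month = c(2, 2, 2, 3, 2, 2),
  Day   = c(1, 12, 13, 1, 5, 6),
  Snow  = c(4.0, 10.0, NA, 1.0, 2.0, 0.5)
)

feb.snow <- weather %>%
  select(Year, Month, Day, Snow) %>%                 # keep relevant columns
  filter(Year >= 1885, Month == 2) %>%               # drops the 1884 row and the March row
  group_by(Year) %>%
  summarize(Snow.Sum = sum(Snow, na.rm = TRUE)) %>%  # na.rm = TRUE skips the missing reading
  arrange(desc(Snow.Sum), desc(Year))                # desc() is equivalent to the slide's minus sign here
```

With this stand-in data the result has one row per remaining year, with the snowiest February first.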
  7. Performance Benefit
     ● C++ implementation
     ● Lazy evaluation
     ● Avoids accidental, expensive operations
     Usually much faster than base R. When an operation does take a while, it tells you with a progress bar.
  8. Data Backends
     Supports the three most popular open source databases (SQLite, MySQL and PostgreSQL), and Google’s BigQuery.
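As a rough sketch of what working against a database backend looks like, the example below runs the same verbs on an in-memory SQLite table. It is not from the talk: it assumes a current dplyr installation with the DBI, RSQLite, and dbplyr packages available, whereas dplyr 0.4 connected through functions such as src_sqlite(). The `snow` table is invented for illustration.

```r
library(dplyr)

# Assumption: the DBI, RSQLite, and dbplyr packages are installed.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data, invented for illustration only.
snow <- data.frame(Year = c(2000, 2000, 2001), Snow = c(3, 5, 2))
copy_to(con, snow, "snow")   # upload the data frame as a temporary table

# The verbs build a query; nothing is computed in R yet.
remote <- tbl(con, "snow") %>%
  group_by(Year) %>%
  summarize(Snow.Sum = sum(Snow, na.rm = TRUE))

show_query(remote)           # print the SQL that dplyr generated
result <- collect(remote)    # run the query and pull the rows into R

DBI::dbDisconnect(con)
```

Because the work happens in the database, only the summarized rows are transferred back to R when collect() is called.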
  9. Resources
     Reference manual: dplyr.pdf
     Vignettes: Data frames; Databases; Hybrid evaluation; Introduction to dplyr; Adding a new SQL backend; Non-standard evaluation; Two-table verbs; Window functions and grouped mutate/filter
     Cheatsheet by RStudio
  10. Other dplyr talks
      Columns: Date, Meeting, Title, Speaker, Link
      20150122 Advanced Data Manipulation Mike McCann [Slide]
      20150121 Berkeley Institute for Data Science Pipelines for Data Analysis Hadley Wickham [Video]
      20150114 RStudio Webinar Data Wrangling with R Garrett Grolemund [Slide][Video][Data]
      20150113 Upstate Data Analytics Wallace Campbell [Video][Data]
      20141202 Sheffield R Users Group how to find help online, data manipulation with plyr and dplyr Mathew Hall [Slide]
      20141126 Budapest BI Introduction to the dplyr R package Romain Francois [Slide]
      20141111 LA R users group Benchmarking dplyr and data.table (with biggish data) Szilard Pafka [Slide][Data]
      20141025 ACM DataScience Camp Data Manipulation Using R Ram Narasimhan [Slide][Video]
      20141022 Becoming a data ninja with dplyr Devin Pastoor [Slide]
      20141007 Davis R Users' Group dplyr: Data manipulation in R made easy Michael A. Levy [Slide][Video]
      20140825 RStudio Webinar Hands-on dplyr tutorial for faster data manipulation in R [Slide][Video]
      20140701 USER2014 dplyr: a grammar of data manipulation Hadley Wickham [Video]
      20140630 USER2014 Data manipulation with dplyr Hadley Wickham [Slide][Video][Data]
      20140214 Stanford HCI Group Expressing yourself in R Hadley Wickham [Slide][Data]
      See updated list at:
  11. “Hadley Ecosystem”
      Visualization: ggplot, ggmap, ggvis
      Data Wrangling: reshape, plyr, dplyr, tidyr
      Web: rvest, httr, xml2
      Other tools: stringr, lubridate, haven
      (Github Repo) (Advanced R Book) (R Packages Book)