Learn to use dplyr (Feb 2015 Philly R User Meetup)
1. Learn to use the dplyr package
Fan Li @ Philly R User Meetup (R<-Gang)
Demo: http://rpubs.com/lifan/phillyweather
Source code: https://github.com/lifan0127/meetup_dplyr_talk
2. What is dplyr
A package developed by Hadley Wickham to help
transform tabular data.
● Unified, intuitive syntax
● Fast implementation in C++
● Supports various data backends (data frames, relational databases, etc.)
install.packages("dplyr") # Version 0.4
3. Basic Operators (Verbs)
select(data, col.1, col.2, ...)               # select existing variables (columns)
filter(data, condition.1, condition.2, ...)   # keep the rows that meet all conditions
arrange(data, col.1, col.2, ...)              # sort the table by variables or expressions
mutate(data, newcol = ...)                    # create new variables
group_by() + summarize()                      # summarize data per group
For a graphic explanation, see Garrett Grolemund’s talk.
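As a minimal sketch of the five verbs listed above, the snippet below applies each one to a tiny data frame. The `snow` table, its values, and the derived `Snow.cm` column are hypothetical and invented only for illustration; they are not part of the talk's data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
snow <- data.frame(
  Year  = c(2010, 2010, 2011, 2011),
  Month = c(1, 2, 1, 2),
  Snow  = c(3.0, 12.5, 0.0, 7.2)
)

picked  <- select(snow, Year, Snow)             # keep two columns
feb     <- filter(snow, Month == 2, Snow > 0)   # conditions are combined with AND
sorted  <- arrange(snow, desc(Snow))            # largest snowfall first
with_cm <- mutate(snow, Snow.cm = Snow * 2.54)  # new column (assumes Snow is in inches)
by_year <- summarize(group_by(snow, Year),      # one row per year
                     Snow.Sum = sum(Snow))
```

Each verb takes a data frame as its first argument and returns a new one, which is what makes them easy to chain.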
5. %>% Pipe Operator
data %>% function()  =  function(data, …)
foo() %>% bar()  =  bar(foo())
Very useful for converting nested calls into
a more readable chain of expressions.
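The equivalence between the nested and the piped form can be checked directly. This is a small sketch; the vector `x` is an arbitrary example, and the pipe is the one dplyr re-exports from magrittr.

```r
library(dplyr)  # makes %>% available

x <- c(1, 4, 9, 16)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read left to right, each result becomes the first argument
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```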
6. Example
feb.snow2 <- weather %>%
  select(Year, Month, Day, Snow) %>%          # Step 1. Select relevant variables
  filter(Year >= 1885, Month == 2) %>%        # Step 2. Filter by year and month
  group_by(Year) %>%                          # Step 3. Group by year
  summarize(
    Snow.Sum = sum(Snow, na.rm = TRUE)) %>%   # Step 4. Summarize monthly snowfall
  arrange(-Snow.Sum, -Year)                   # Step 5. Sort table by monthly snowfall/year
Demo with Philly weather data (1872-2001)
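To try the pipeline without the demo data, the sketch below runs the same five steps on a made-up stand-in for `weather`. The rows are hypothetical and carry no real measurements; only the column names follow the example above. `desc()` is used in place of the unary minus, which gives the same ordering for numeric columns.

```r
library(dplyr)

# Hypothetical stand-in for the `weather` table: one row per day
weather <- data.frame(
  Year  = c(1884, 1899, 1899, 1899, 2010, 2010),
  Month = c(2, 2, 2, 3, 2, 2),
  Day   = c(1, 12, 13, 1, 6, 10),
  Snow  = c(5.0, 10.0, NA, 2.0, 8.0, 9.0)
)

feb.snow <- weather %>%
  select(Year, Month, Day, Snow) %>%               # keep relevant columns
  filter(Year >= 1885, Month == 2) %>%             # February, 1885 onwards
  group_by(Year) %>%                               # one group per year
  summarize(Snow.Sum = sum(Snow, na.rm = TRUE)) %>%  # NA days are ignored
  arrange(desc(Snow.Sum), desc(Year))              # snowiest February first

feb.snow
```

The 1884 row is dropped by the filter, the March row by the month condition, and the missing value is skipped by `na.rm = TRUE`.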
7. Performance Benefit
C++ implementation
Lazy evaluation
Avoid accidental, expensive operations
Usually much faster than base R. When an operation is slow,
it tells you with a progress bar.
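One way to check the speed claim on your own machine is to time the same grouped sum in base R and in dplyr. This is only a sketch: the data are random, the sizes are arbitrary assumptions, and the timings depend on hardware and package versions, so no expected numbers are given.

```r
library(dplyr)

set.seed(1)
n <- 1e5
# Random illustrative data: 1000 groups, one numeric column
d <- data.frame(g = sample(1:1000, n, replace = TRUE), x = runif(n))

# Base R: grouped sum with aggregate()
base_time <- system.time(
  base_res <- aggregate(x ~ g, data = d, FUN = sum)
)

# dplyr: the same grouped sum
dplyr_time <- system.time(
  dplyr_res <- d %>% group_by(g) %>% summarize(x = sum(x))
)

all.equal(base_res$x, dplyr_res$x)  # both approaches should agree
base_time["elapsed"]
dplyr_time["elapsed"]
```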
8. Data Backends
Supports the three most popular open source
databases (SQLite, MySQL and PostgreSQL), and
Google’s BigQuery.
http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
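The sketch below shows the same verbs running against an in-memory SQLite database. It assumes a current dplyr with the DBI, RSQLite and dbplyr packages installed, which is not the setup of the 0.4 release described here (that version exposed the connection through `src_sqlite()`). The `toy` table is hypothetical.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data copied into the database as table "weather"
toy <- data.frame(
  Year  = c(1899, 1899, 2010),
  Month = c(2, 2, 2),
  Snow  = c(10, 4, 17)
)
copy_to(con, toy, "weather")

# The verbs build a query; nothing is computed in R yet
query <- tbl(con, "weather") %>%
  filter(Month == 2) %>%
  group_by(Year) %>%
  summarize(Snow.Sum = sum(Snow, na.rm = TRUE)) %>%
  arrange(Year)

show_query(query)         # print the SQL that would be sent
result <- collect(query)  # run it and pull the rows into R

DBI::dbDisconnect(con)
```

The query only runs when `collect()` is called, which is the lazy evaluation mentioned in the performance slide.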
10.
Date | Meeting | Title | Speaker | Link
20150122 | | Advanced Data Manipulation | Mike McCann | [Slide]
20150121 | Berkeley Institute for Data Science | Pipelines for Data Analysis | Hadley Wickham | [Video]
20150114 | RStudio Webinar | Data Wrangling with R | Garrett Grolemund | [Slide][Video][Data]
20150113 | Upstate Data Analytics | | Wallace Campbell | [Video][Data]
20141202 | Sheffield R Users Group | how to find help online, data manipulation with plyr and dplyr | Mathew Hall | [Slide]
20141126 | Budapest BI | Introduction to the dplyr R package | Romain Francois | [Slide]
20141111 | LA R users group | Benchmarking dplyr and data.table (with biggish data) | Szilard Pafka | [Slide][Data]
20141025 | ACM DataScience Camp | Data Manipulation Using R | Ram Narasimhan | [Slide][Video]
20141022 | | Becoming a data ninja with dplyr | Devin Pastoor | [Slide]
20141007 | Davis R Users' Group | dplyr: Data manipulation in R made easy | Michael A. Levy | [Slide][Video]
20140825 | RStudio Webinar | Hands-on dplyr tutorial for faster data manipulation in R | | [Slide][Video]
20140701 | USER2014 | dplyr: a grammar of data manipulation | Hadley Wickham | [Video]
20140630 | USER2014 | Data manipulation with dplyr | Hadley Wickham | [Slide][Video][Data]
20140214 | Stanford HCI Group | Expressing yourself in R | Hadley Wickham | [Slide][Data]
See updated list at: https://github.com/lifan0127/meetup_dplyr_talk
11. “Hadley Ecosystem”
Visualization: ggplot2, ggmap, ggvis
Data Wrangling: reshape, plyr, dplyr, tidyr
Web: rvest, httr, xml2
Other tools: stringr, lubridate, haven
https://github.com/hadley (Github Repo)
http://adv-r.had.co.nz/ (Advanced R Book)
http://r-pkgs.had.co.nz/ (R Packages Book)