SlideShare a Scribd company logo
dplyr package
Fan Li @ Philly R User Meetup (R<-Gang)
Learn to use
Demo: http://rpubs.com/lifan/phillyweather
Source code: https://github.com/lifan0127/meetup_dplyr_talk
What is dplyr
A package developed by Hadley Wickham to help
transform tabular data.
● Unified, intuitive syntax
● Fast implementation in C++
● Support various data backends (dataframe, RDB, etc.)
install.package(“dplyr”) # Version 0.4
Basic Operators (Verbs)
select(data, col.1, col.2, …) select existing variables (columns)
filter(data, condition.1, condition.2, ...) filter table by conditions
arrange(data, col.1, col.2, …) sort table by variables or other logicals
mutate(data, newcol = …) create new variables
group_by() + summarize() summarize data per group
For a graphic explanation, see Garrett Grolemund’s talk.
Other Helper Functions
transmute
tally
top_n
summarize_each
sample_n/sample_frac
distinct
rename
slice
n_distinct
first/last/nth
Type ?function_name in R to find how to use
%>% Pipe Operator
data %>% function() function(data, …)
foo() %>% bar() bar(foo())
Very useful to convert nested structure into
more logical chain expression.
=
=
Example
feb.snow2 <- weather %>%
select(Year, Month, Day, Snow) %>% # Step 1. Select relevant variables
filter(Year >= 1885, Month == 2) %>% # Step 2. Filter by year and month
group_by(Year) %>% # Step 3. Group by year
summarize(
Snow.Sum = sum(Snow, na.rm = TRUE)) %>% # Step 4. Summarize monthly snowfall
arrange(-Snow.Sum, -Year) # Step 5. Sort table by monthly snowfall/year
Demo with Philly weather data (1872-2001)
Performance Benefit
C++ implementation
Lazy evaluation
Avoid accidental, expensive operations
Usually must faster than base R. Otherwise it will tell you
with a progress bar.
Data Backends
Supports the three most popular open source
databases (sqlite, mysql and postgresql), and
Google’s bigquery.
http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
Resources
Reference manual:
dplyr.pdf
Vignettes:
Data frames
Databases
Hybrid evaluation
Introduction to dplyr
Adding a new SQL backend
Non-standard evaluation
Two-table verbs
Window functions and grouped mutate/filter
Cheatsheet by RStudio
Date Meeting Title Speaker Link
20150122 Advanced Data Manipulation Mike McCann [Slide]
20150121 Berkeley Institute for Data Science Pipelines for Data Analysis Hadley Wickham [Video]
20150114 RStudio Webinar Data Wrangling with R Garrett Grolemund [Slide][Video][Data]
20150113 Upstate Data Analytics Wallace Campbell [Video][Data]
20141202 Sheffield R Users Group how to find help online, data manipulation with plyr and dplyr Mathew Hall [Slide]
20141126 Budapest BI Introduction to the dplyr R package Romain Francois [Slide]
20141111 LA R users group Benchmarking dplyr and data.table (with biggish data) Szilard Pafka [Slide][Data]
20141025 ACM DataScience Camp Data Manipulation Using R Ram Narasimhan [Slide][Video]
20141022 Becoming a data ninja with dplyr Devin Pastoor [Slide]
20141007 Davis R Users' Group dplyr: Data manipulation in R made easy Michael A. Levy [Slide][Video]
20140825 RStudio Webinar Hands-on dplyr tutorial for faster data manipulation in R [Slide][Video]
20140701 USER2014 dplyr: a grammar of data manipulation Hadley Wickham [Video]
20140630 USER2014 Data manipulation with dplyr Hadley Wickham [Slide][Video][Data]
20140214 Stanford HCI Group Expressing yourself in R Hadley Wickham [Slide][Data]
See updated list at: https://github.com/lifan0127/meetup_dplyr_talk
“Hadley Ecosystem”
Visualization
ggplot, ggmap, ggvis
Data Wrangling
reshape, plyr, dplyr, tidyr
Web
rvest, httr, xml2
Other tools
stringr, lubridate, heaven
https://github.com/hadley (Github Repo)
http://adv-r.had.co.nz/ (Advanced R Book)
http://r-pkgs.had.co.nz/ (R Packages Book)

More Related Content

What's hot

SPD and KEA: HDF5 based file formats for Earth Observation
SPD and KEA: HDF5 based file formats for Earth ObservationSPD and KEA: HDF5 based file formats for Earth Observation
SPD and KEA: HDF5 based file formats for Earth Observation
The HDF-EOS Tools and Information Center
 
Packages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージPackages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージ
Hiroki K
 
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
PingCAP
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
Andrew Lamb
 
UMLtoGraphDB: Mapping Conceptual Schemas to Graph Databases
UMLtoGraphDB: Mapping Conceptual Schemas to Graph DatabasesUMLtoGraphDB: Mapping Conceptual Schemas to Graph Databases
UMLtoGraphDB: Mapping Conceptual Schemas to Graph Databases
Gwendal Daniel
 
Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science Platform
Ted Habermann
 
Bicod2017
Bicod2017Bicod2017
Bicod2017
Rim Moussa
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
March 2012 HUG: JuteRC compiler
March 2012 HUG: JuteRC compilerMarch 2012 HUG: JuteRC compiler
March 2012 HUG: JuteRC compiler
Yahoo Developer Network
 
Azure Stream Analytics Project : On-demand real-time analytics
Azure Stream Analytics Project : On-demand real-time analyticsAzure Stream Analytics Project : On-demand real-time analytics
Azure Stream Analytics Project : On-demand real-time analytics
Lamprini Koutsokera
 
ICESat-2 Metadata and Status
ICESat-2 Metadata and StatusICESat-2 Metadata and Status
ICESat-2 Metadata and Status
The HDF-EOS Tools and Information Center
 
ArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & RoadmapArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & Roadmap
The HDF-EOS Tools and Information Center
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
Adnan Akhter
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked data
Alessandro Adamou
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
Ankit Beohar
 

What's hot (15)

SPD and KEA: HDF5 based file formats for Earth Observation
SPD and KEA: HDF5 based file formats for Earth ObservationSPD and KEA: HDF5 based file formats for Earth Observation
SPD and KEA: HDF5 based file formats for Earth Observation
 
Packages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージPackages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージ
 
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
UMLtoGraphDB: Mapping Conceptual Schemas to Graph Databases
UMLtoGraphDB: Mapping Conceptual Schemas to Graph DatabasesUMLtoGraphDB: Mapping Conceptual Schemas to Graph Databases
UMLtoGraphDB: Mapping Conceptual Schemas to Graph Databases
 
Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science Platform
 
Bicod2017
Bicod2017Bicod2017
Bicod2017
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
March 2012 HUG: JuteRC compiler
March 2012 HUG: JuteRC compilerMarch 2012 HUG: JuteRC compiler
March 2012 HUG: JuteRC compiler
 
Azure Stream Analytics Project : On-demand real-time analytics
Azure Stream Analytics Project : On-demand real-time analyticsAzure Stream Analytics Project : On-demand real-time analytics
Azure Stream Analytics Project : On-demand real-time analytics
 
ICESat-2 Metadata and Status
ICESat-2 Metadata and StatusICESat-2 Metadata and Status
ICESat-2 Metadata and Status
 
ArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & RoadmapArcGIS and Multi-D: Tools & Roadmap
ArcGIS and Multi-D: Tools & Roadmap
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked data
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
 

Viewers also liked

Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
Dr. Volkan OBAN
 
JavaScript-based Visualization in R
JavaScript-based Visualization in RJavaScript-based Visualization in R
JavaScript-based Visualization in RFan Li
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
Spark Summit
 
SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016
SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016
SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016
Tanya Cashorali
 
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
Penn State University
 
20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料
安隆 沖
 
Rlecturenotes
RlecturenotesRlecturenotes
Rlecturenotes
thecar1992
 
R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8
Muhammad Nabi Ahmad
 
Análisis espacial con R (asignatura de Master - UPM)
Análisis espacial con R (asignatura de Master - UPM)Análisis espacial con R (asignatura de Master - UPM)
Análisis espacial con R (asignatura de Master - UPM)
Vladimir Gutierrez, PhD
 
Paquete ggplot - Potencia y facilidad para generar gráficos en R
Paquete ggplot - Potencia y facilidad para generar gráficos en RPaquete ggplot - Potencia y facilidad para generar gráficos en R
Paquete ggplot - Potencia y facilidad para generar gráficos en R
Nestor Montaño
 
WF ED 540, Class Meeting 3 - mutate and summarise, 2016
WF ED 540, Class Meeting 3 - mutate and summarise, 2016WF ED 540, Class Meeting 3 - mutate and summarise, 2016
WF ED 540, Class Meeting 3 - mutate and summarise, 2016
Penn State University
 
WF ED 540, Class Meeting 3 - select, filter, arrange, 2016
WF ED 540, Class Meeting 3 - select, filter, arrange, 2016WF ED 540, Class Meeting 3 - select, filter, arrange, 2016
WF ED 540, Class Meeting 3 - select, filter, arrange, 2016
Penn State University
 
R seminar dplyr package
R seminar dplyr packageR seminar dplyr package
R seminar dplyr package
Muhammad Nabi Ahmad
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R Studio
Susan Johnston
 
Dplyr and Plyr
Dplyr and PlyrDplyr and Plyr
Dplyr and Plyr
Paul Richards
 
Data and donuts: Data Visualization using R
Data and donuts: Data Visualization using RData and donuts: Data Visualization using R
Data and donuts: Data Visualization using R
C. Tobin Magle
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
Edwin de Jonge
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
Florian Uhlitz
 
R and Rcmdr Statistical Software
R and Rcmdr Statistical SoftwareR and Rcmdr Statistical Software
R and Rcmdr Statistical Software
arttan2001
 
Tokyor36
Tokyor36Tokyor36

Viewers also liked (20)

Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
 
JavaScript-based Visualization in R
JavaScript-based Visualization in RJavaScript-based Visualization in R
JavaScript-based Visualization in R
 
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016
SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016
SportsDataViz using Plotly, Shiny and Flexdashboard - PlotCon 2016
 
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
 
20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料
 
Rlecturenotes
RlecturenotesRlecturenotes
Rlecturenotes
 
R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8
 
Análisis espacial con R (asignatura de Master - UPM)
Análisis espacial con R (asignatura de Master - UPM)Análisis espacial con R (asignatura de Master - UPM)
Análisis espacial con R (asignatura de Master - UPM)
 
Paquete ggplot - Potencia y facilidad para generar gráficos en R
Paquete ggplot - Potencia y facilidad para generar gráficos en RPaquete ggplot - Potencia y facilidad para generar gráficos en R
Paquete ggplot - Potencia y facilidad para generar gráficos en R
 
WF ED 540, Class Meeting 3 - mutate and summarise, 2016
WF ED 540, Class Meeting 3 - mutate and summarise, 2016WF ED 540, Class Meeting 3 - mutate and summarise, 2016
WF ED 540, Class Meeting 3 - mutate and summarise, 2016
 
WF ED 540, Class Meeting 3 - select, filter, arrange, 2016
WF ED 540, Class Meeting 3 - select, filter, arrange, 2016WF ED 540, Class Meeting 3 - select, filter, arrange, 2016
WF ED 540, Class Meeting 3 - select, filter, arrange, 2016
 
R seminar dplyr package
R seminar dplyr packageR seminar dplyr package
R seminar dplyr package
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R Studio
 
Dplyr and Plyr
Dplyr and PlyrDplyr and Plyr
Dplyr and Plyr
 
Data and donuts: Data Visualization using R
Data and donuts: Data Visualization using RData and donuts: Data Visualization using R
Data and donuts: Data Visualization using R
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 
R and Rcmdr Statistical Software
R and Rcmdr Statistical SoftwareR and Rcmdr Statistical Software
R and Rcmdr Statistical Software
 
Tokyor36
Tokyor36Tokyor36
Tokyor36
 

Similar to Learn to use dplyr (Feb 2015 Philly R User Meetup)

Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Serban Tanasa
 
Hive
HiveHive
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
Athens Big Data
 
Examining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail FilesExamining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail Files
Bobby Curtis
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
OSCON Byrum
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
restassure
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
Skillspeed
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataDataWorks Summit
 
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Data Con LA
 
Exploring sql server 2016 bi
Exploring sql server 2016 biExploring sql server 2016 bi
Exploring sql server 2016 bi
Antonios Chatzipavlis
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
Venkata Reddy Konasani
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
Puppet
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
Puppet
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Sumeet Singh
 
Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014thiruvel
 
Oracle GoldenGate
Oracle GoldenGateOracle GoldenGate
Oracle GoldenGate
Anar Godjaev
 
SQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQLSQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQL
Peter Eisentraut
 
QTP Automation Testing Tutorial 7
QTP Automation Testing Tutorial 7QTP Automation Testing Tutorial 7
QTP Automation Testing Tutorial 7
Akash Tyagi
 

Similar to Learn to use dplyr (Feb 2015 Philly R User Meetup) (20)

Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
Hive
HiveHive
Hive
 
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
 
Examining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail FilesExamining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail Files
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
 
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
 
Exploring sql server 2016 bi
Exploring sql server 2016 biExploring sql server 2016 bi
Exploring sql server 2016 bi
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
 
Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014
 
Oracle GoldenGate
Oracle GoldenGateOracle GoldenGate
Oracle GoldenGate
 
SQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQLSQL/MED: Doping for PostgreSQL
SQL/MED: Doping for PostgreSQL
 
QTP Automation Testing Tutorial 7
QTP Automation Testing Tutorial 7QTP Automation Testing Tutorial 7
QTP Automation Testing Tutorial 7
 

Recently uploaded

In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 

Recently uploaded (20)

In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 

Learn to use dplyr (Feb 2015 Philly R User Meetup)

  • 1. dplyr package Fan Li @ Philly R User Meetup (R<-Gang) Learn to use Demo: http://rpubs.com/lifan/phillyweather Source code: https://github.com/lifan0127/meetup_dplyr_talk
  • 2. What is dplyr A package developed by Hadley Wickham to help transform tabular data. ● Unified, intuitive syntax ● Fast implementation in C++ ● Support various data backends (dataframe, RDB, etc.) install.package(“dplyr”) # Version 0.4
  • 3. Basic Operators (Verbs) select(data, col.1, col.2, …) select existing variables (columns) filter(data, condition.1, condition.2, ...) filter table by conditions arrange(data, col.1, col.2, …) sort table by variables or other logicals mutate(data, newcol = …) create new variables group_by() + summarize() summarize data per group For a graphic explanation, see Garrett Grolemund’s talk.
  • 5. %>% Pipe Operator data %>% function() function(data, …) foo() %>% bar() bar(foo()) Very useful to convert nested structure into more logical chain expression. = =
  • 6. Example feb.snow2 <- weather %>% select(Year, Month, Day, Snow) %>% # Step 1. Select relevant variables filter(Year >= 1885, Month == 2) %>% # Step 2. Filter by year and month group_by(Year) %>% # Step 3. Group by year summarize( Snow.Sum = sum(Snow, na.rm = TRUE)) %>% # Step 4. Summarize monthly snowfall arrange(-Snow.Sum, -Year) # Step 5. Sort table by monthly snowfall/year Demo with Philly weather data (1872-2001)
  • 7. Performance Benefit C++ implementation Lazy evaluation Avoid accidental, expensive operations Usually must faster than base R. Otherwise it will tell you with a progress bar.
  • 8. Data Backends Supports the three most popular open source databases (sqlite, mysql and postgresql), and Google’s bigquery. http://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
  • 9. Resources Reference manual: dplyr.pdf Vignettes: Data frames Databases Hybrid evaluation Introduction to dplyr Adding a new SQL backend Non-standard evaluation Two-table verbs Window functions and grouped mutate/filter Cheatsheet by RStudio
  • 10. Date Meeting Title Speaker Link 20150122 Advanced Data Manipulation Mike McCann [Slide] 20150121 Berkeley Institute for Data Science Pipelines for Data Analysis Hadley Wickham [Video] 20150114 RStudio Webinar Data Wrangling with R Garrett Grolemund [Slide][Video][Data] 20150113 Upstate Data Analytics Wallace Campbell [Video][Data] 20141202 Sheffield R Users Group how to find help online, data manipulation with plyr and dplyr Mathew Hall [Slide] 20141126 Budapest BI Introduction to the dplyr R package Romain Francois [Slide] 20141111 LA R users group Benchmarking dplyr and data.table (with biggish data) Szilard Pafka [Slide][Data] 20141025 ACM DataScience Camp Data Manipulation Using R Ram Narasimhan [Slide][Video] 20141022 Becoming a data ninja with dplyr Devin Pastoor [Slide] 20141007 Davis R Users' Group dplyr: Data manipulation in R made easy Michael A. Levy [Slide][Video] 20140825 RStudio Webinar Hands-on dplyr tutorial for faster data manipulation in R [Slide][Video] 20140701 USER2014 dplyr: a grammar of data manipulation Hadley Wickham [Video] 20140630 USER2014 Data manipulation with dplyr Hadley Wickham [Slide][Video][Data] 20140214 Stanford HCI Group Expressing yourself in R Hadley Wickham [Slide][Data] See updated list at: https://github.com/lifan0127/meetup_dplyr_talk
  • 11. “Hadley Ecosystem” Visualization ggplot, ggmap, ggvis Data Wrangling reshape, plyr, dplyr, tidyr Web rvest, httr, xml2 Other tools stringr, lubridate, heaven https://github.com/hadley (Github Repo) http://adv-r.had.co.nz/ (Advanced R Book) http://r-pkgs.had.co.nz/ (R Packages Book)