SlideShare a Scribd company logo
1 of 24
Download to read offline
Data manipulation in R
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
February 3, 2017
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 1 / 20
The good, the bad, & the ugly
• R isn’t the best programming language out there
• But it happens to be great for data analysis
• The result is a steep learning curve with a high payoff
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 2 / 20
For instance . . .
• You’ll see a mix of camelCase, this.that, and snake case
conventions
• Dots (.) (mostly) don’t mean anything special
• Likewise, $ gets used in funny ways
• R is loosely typed, which can lead to unexpected coercions
and silent fails
• It also tries to be clever about variable scope, which can
backfire if you’re not careful
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 3 / 20
But it will help you . . .
• Do extremely fast exploratory data analysis
• Easily generate high-quality data visualizations
• Fit and evaluate pretty much any statistical model you can
think of
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
But it will help you . . .
• Do extremely fast exploratory data analysis
• Easily generate high-quality data visualizations
• Fit and evaluate pretty much any statistical model you can
think of
This will change the way you do data analysis, because you’ll ask
questions you wouldn’t have bothered to otherwise
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
Basic types
• int, double: for numbers
• character: for strings
• factor: for categorical variables (∼ struct or ENUM)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
Basic types
• int, double: for numbers
• character: for strings
• factor: for categorical variables (∼ struct or ENUM)
Factors are handy, but take some getting used to
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
Containers
• vector: for multiple values of the same type (∼ array)
• list: for multiple values of different types (∼ dictionary)
• data.frame: for tables of rectangular data of mixed types (∼
matrix)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
Containers
• vector: for multiple values of the same type (∼ array)
• list: for multiple values of different types (∼ dictionary)
• data.frame: for tables of rectangular data of mixed types (∼
matrix)
We’ll mostly work with data frames, which themselves are lists of
vectors
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
The tidyverse
The tidyverse is a collection of packages that work together to
make data analysis easier:
• dplyr for split / apply / combine type counting
• ggplot2 for making plots
• tidyr for reshaping and “tidying” data
• readr for reading and writing files
• ...
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 7 / 20
Tidy data
The core philosophy is that your data should be in a “tidy” table
with:
• One observation per row
• One variable per column
• One measured value per cell
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 8 / 20
Tidy data
• Most of the work goes into getting your data into shape
• After which descriptives statistics, modeling, and visualization
are easy
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 9 / 20
dplyr: a grammar of data manipulation
dplyr implements the split / apply / combine framework
discussed in the last lecture
• Its “grammar” has five main verbs used in the “apply” phase:
• filter: restrict rows based on a condition (N → N )
• arrange: reorder rows by a variable (N → N )
• select: pick out specific columns (K → K )
• mutate: create new or change existing columns (K → K )
• summarize: collapse a column into one value (N → 1)
• The group by function creates indices to take care of the
split and combine phases
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
dplyr: a grammar of data manipulation
dplyr implements the split / apply / combine framework
discussed in the last lecture
• Its “grammar” has five main verbs used in the “apply” phase:
• filter: restrict rows based on a condition (N → N )
• arrange: reorder rows by a variable (N → N )
• select: pick out specific columns (K → K )
• mutate: create new or change existing columns (K → K )
• summarize: collapse a column into one value (N → 1)
• The group by function creates indices to take care of the
split and combine phases
The cost is that you have to think “functionally”, in terms of
“vectorized” operations
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
filter
filter(trips, start_station_name == "Broadway & E 14 St")
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 11 / 20
arrange
arrange(trips, starttime)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 12 / 20
select
select(trips, starttime, stoptime,
start_station_name, end_station_name)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 13 / 20
mutate
mutate(trips, time_in_min = tripduration / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 14 / 20
mutate
summarize(trips, mean_duration = mean(tripduration) / 60,
sd_duration = sd(tripduration) / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 15 / 20
group by
trips_by_gender <- group_by(trips, gender)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 16 / 20
group by
trips_by_gender <- group_by(trips, gender)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 17 / 20
group by + summarize
summarize(trips_by_gender,
count = n(),
mean_duration = mean(tripduration) / 60,
sd_duration = sd(tripduration) / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 18 / 20
%>%: the pipe operator
trips %>%
group_by(gender) %>%
summarize(count = n(),
mean_duration = mean(tripduration) / 60,
sd_duration = sd(tripduration) / 60)
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 19 / 20
r4ds.had.co.nz
Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 20 / 20

More Related Content

Viewers also liked

Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overviewjakehofman
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...Andreas Önnerfors
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboralalexander_hv
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBGord Sissons
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศPetch Boonyakorn
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook Total
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLSimon Harris
 
Your moment is Waiting
Your moment is WaitingYour moment is Waiting
Your moment is Waitingrittujacob
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoopRussell Jurney
 
Enabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopEnabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopJason Plurad
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsRussell Jurney
 
ConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureEricsson
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 

Viewers also liked (20)

Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
”’I den svenska och tyska litteraturens mittpunkt’: Svenska Pommerns roll som...
 
Motivación laboral
Motivación laboralMotivación laboral
Motivación laboral
 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TB
 
ระบบสารสนเทศ
ระบบสารสนเทศระบบสารสนเทศ
ระบบสารสนเทศ
 
2016 Results & Outlook
2016 Results & Outlook 2016 Results & Outlook
2016 Results & Outlook
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Blistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQLBlistering fast access to Hadoop with SQL
Blistering fast access to Hadoop with SQL
 
tarea 7 gabriel
tarea 7 gabrieltarea 7 gabriel
tarea 7 gabriel
 
Your moment is Waiting
Your moment is WaitingYour moment is Waiting
Your moment is Waiting
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
JSON-LD Update
JSON-LD UpdateJSON-LD Update
JSON-LD Update
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Enabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPopEnabling Multimodel Graphs with Apache TinkerPop
Enabling Multimodel Graphs with Apache TinkerPop
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
ConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving FutureConsumerLab: The Self-Driving Future
ConsumerLab: The Self-Driving Future
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 

Similar to Modeling Social Data, Lecture 3: Data manipulation in R

Algorithms - Jeff Erickson.pdf
Algorithms - Jeff Erickson.pdfAlgorithms - Jeff Erickson.pdf
Algorithms - Jeff Erickson.pdfHannah Baker
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generationkrisztianbalog
 
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean MetricsYuxin Su
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Rich Heimann
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown BagDataTactics
 
Karen Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 PaperKaren Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 PaperKaren Tran
 
Are You Ready to Write Up Your Quantitative Data?
 Are You Ready to Write Up Your Quantitative Data? Are You Ready to Write Up Your Quantitative Data?
Are You Ready to Write Up Your Quantitative Data?DoctoralNet Limited
 
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...collwe
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanfordSakthivel C R
 
Data Science as a Career and Intro to R
Data Science as a Career and Intro to RData Science as a Career and Intro to R
Data Science as a Career and Intro to RAnshik Bansal
 
Using Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesUsing Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesEmir Muñoz
 
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...広樹 本間
 
What's "For Free" on Craigslist?
What's "For Free" on Craigslist? What's "For Free" on Craigslist?
What's "For Free" on Craigslist? Josh Mayer
 
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...GUANGYUAN PIAO
 
CMI2012 CCSS for Mathematics
CMI2012 CCSS for MathematicsCMI2012 CCSS for Mathematics
CMI2012 CCSS for MathematicsJanet Hale
 
Chapter-8 Relational Database Design
Chapter-8 Relational Database DesignChapter-8 Relational Database Design
Chapter-8 Relational Database DesignKunal Anand
 

Similar to Modeling Social Data, Lecture 3: Data manipulation in R (20)

Algorithms - Jeff Erickson.pdf
Algorithms - Jeff Erickson.pdfAlgorithms - Jeff Erickson.pdf
Algorithms - Jeff Erickson.pdf
 
Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Token
TokenToken
Token
 
Karen Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 PaperKaren Tran - ENGE 4994 Paper
Karen Tran - ENGE 4994 Paper
 
Are You Ready to Write Up Your Quantitative Data?
 Are You Ready to Write Up Your Quantitative Data? Are You Ready to Write Up Your Quantitative Data?
Are You Ready to Write Up Your Quantitative Data?
 
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
MultiC2: an Optimization Framework for Learning from Task and Worker Dual Het...
 
Assignment#11
Assignment#11Assignment#11
Assignment#11
 
Ch03 Mining Massive Data Sets stanford
Ch03 Mining Massive Data Sets  stanfordCh03 Mining Massive Data Sets  stanford
Ch03 Mining Massive Data Sets stanford
 
Data Science as a Career and Intro to R
Data Science as a Career and Intro to RData Science as a Career and Intro to R
Data Science as a Career and Intro to R
 
Using Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's TablesUsing Linked Data to Mine RDF from Wikipedia's Tables
Using Linked Data to Mine RDF from Wikipedia's Tables
 
R programming
R programmingR programming
R programming
 
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
 
What's "For Free" on Craigslist?
What's "For Free" on Craigslist? What's "For Free" on Craigslist?
What's "For Free" on Craigslist?
 
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...
 
Data analysis02 twovariables
Data analysis02 twovariablesData analysis02 twovariables
Data analysis02 twovariables
 
CMI2012 CCSS for Mathematics
CMI2012 CCSS for MathematicsCMI2012 CCSS for Mathematics
CMI2012 CCSS for Mathematics
 
Chapter-8 Relational Database Design
Chapter-8 Relational Database DesignChapter-8 Relational Database Design
Chapter-8 Relational Database Design
 

More from jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classificationjakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regressionjakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIjakehofman
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part Ijakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIjakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part Ijakehofman
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIjakehofman
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part Ijakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Computational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part IIComputational Social Science, Lecture 08: Counting Fast, Part II
Computational Social Science, Lecture 08: Counting Fast, Part II
 
Computational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part IComputational Social Science, Lecture 07: Counting Fast, Part I
Computational Social Science, Lecture 07: Counting Fast, Part I
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part I
 

Recently uploaded

social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 

Recently uploaded (20)

Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 

Modeling Social Data, Lecture 3: Data manipulation in R

  • 1. Data manipulation in R APAM E4990 Modeling Social Data Jake Hofman Columbia University February 3, 2017 Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 1 / 20
  • 2. The good, the bad, & the ugly • R isn’t the best programming language out there • But it happens to be great for data analysis • The result is a steep learning curve with a high payoff Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 2 / 20
  • 3. For instance . . . • You’ll see a mix of camelCase, this.that, and snake case conventions • Dots (.) (mostly) don’t mean anything special • Likewise, $ gets used in funny ways • R is loosely typed, which can lead to unexpected coercions and silent fails • It also tries to be clever about variable scope, which can backfire if you’re not careful Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 3 / 20
  • 4. But it will help you . . . • Do extremely fast exploratory data analysis • Easily generate high-quality data visualizations • Fit and evaluate pretty much any statistical model you can think of Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
  • 5. But it will help you . . . • Do extremely fast exploratory data analysis • Easily generate high-quality data visualizations • Fit and evaluate pretty much any statistical model you can think of This will change the way you do data analysis, because you’ll ask questions you wouldn’t have bothered to otherwise Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 4 / 20
  • 6. Basic types • int, double: for numbers • character: for strings • factor: for categorical variables (∼ struct or ENUM) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
  • 7. Basic types • int, double: for numbers • character: for strings • factor: for categorical variables (∼ struct or ENUM) Factors are handy, but take some getting used to Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 5 / 20
  • 8. Containers • vector: for multiple values of the same type (∼ array) • list: for multiple values of different types (∼ dictionary) • data.frame: for tables of rectangular data of mixed types (∼ matrix) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
  • 9. Containers • vector: for multiple values of the same type (∼ array) • list: for multiple values of different types (∼ dictionary) • data.frame: for tables of rectangular data of mixed types (∼ matrix) We’ll mostly work with data frames, which themselves are lists of vectors Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 6 / 20
  • 10. The tidyverse The tidyverse is a collection of packages that work together to make data analysis easier: • dplyr for split / apply / combine type counting • ggplot2 for making plots • tidyr for reshaping and “tidying” data • readr for reading and writing files • ... Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 7 / 20
  • 11. Tidy data The core philosophy is that your data should be in a “tidy” table with: • One observation per row • One variable per column • One measured value per cell Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 8 / 20
  • 12. Tidy data • Most of the work goes into getting your data into shape • After which descriptives statistics, modeling, and visualization are easy Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 9 / 20
  • 13. dplyr: a grammar of data manipulation dplyr implements the split / apply / combine framework discussed in the last lecture • Its “grammar” has five main verbs used in the “apply” phase: • filter: restrict rows based on a condition (N → N ) • arrange: reorder rows by a variable (N → N ) • select: pick out specific columns (K → K ) • mutate: create new or change existing columns (K → K ) • summarize: collapse a column into one value (N → 1) • The group by function creates indices to take care of the split and combine phases Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
  • 14. dplyr: a grammar of data manipulation dplyr implements the split / apply / combine framework discussed in the last lecture • Its “grammar” has five main verbs used in the “apply” phase: • filter: restrict rows based on a condition (N → N ) • arrange: reorder rows by a variable (N → N ) • select: pick out specific columns (K → K ) • mutate: create new or change existing columns (K → K ) • summarize: collapse a column into one value (N → 1) • The group by function creates indices to take care of the split and combine phases The cost is that you have to think “functionally”, in terms of “vectorized” operations Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 10 / 20
  • 15. filter filter(trips, start_station_name == "Broadway & E 14 St") Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 11 / 20
  • 16. arrange arrange(trips, starttime) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 12 / 20
  • 17. select select(trips, starttime, stoptime, start_station_name, end_station_name) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 13 / 20
  • 18. mutate mutate(trips, time_in_min = tripduration / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 14 / 20
  • 19. mutate summarize(trips, mean_duration = mean(tripduration) / 60, sd_duration = sd(tripduration) / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 15 / 20
  • 20. group by trips_by_gender <- group_by(trips, gender) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 16 / 20
  • 21. group by trips_by_gender <- group_by(trips, gender) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 17 / 20
  • 22. group by + summarize summarize(trips_by_gender, count = n(), mean_duration = mean(tripduration) / 60, sd_duration = sd(tripduration) / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 18 / 20
  • 23. %>%: the pipe operator trips %>% group_by(gender) %>% summarize(count = n(), mean_duration = mean(tripduration) / 60, sd_duration = sd(tripduration) / 60) Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 19 / 20
  • 24. r4ds.had.co.nz Jake Hofman (Columbia University) Data manipulation in R February 3, 2017 20 / 20