Analyzing NYC Transit Data:
Taxis, Ubers, and Citi Bikes
Todd Schneider
April 8, 2016
todd@toddwschneider.com
Where to find me
toddwschneider.com
github.com/toddwschneider
@todd_schneider
toddsnyder
Things I’ll talk about
• Taxi, Uber, and Citi Bike data
• Medium data analysis tools and tips
• Where does R fit in?
Taxi and Uber Data
http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
Citi Bike Data
http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
NYC Taxi and Uber Data
• Taxi & Limousine Commission released
public, trip-level data for over 1.1 billion
taxi rides 2009–2015
• Some public Uber data available as
well, thanks to a FOIL request by
FiveThirtyEight
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Citi Bike Data
• Citi Bike releases monthly data for every
individual ride
• Data includes timestamps and locations,
plus rider’s subscriber status, gender,
and age
https://www.citibikenyc.com/system-data
Generic Analysis Overview
1. Get raw data
2. Write code to process raw data into
something more useful
3. Analyze data
4. Write about what you found out
Analysis Tools
• PostgreSQL
• PostGIS
• R
• Command line
• JavaScript
https://github.com/toddwschneider/nyc-taxi-data
https://github.com/toddwschneider/nyc-citibike-data
Raw data processing goals
• Load flat files of varying file formats into a unified,
persistent PostgreSQL database that we can use to
answer questions about the data
• Do some one-time calculations to augment the raw
data
• We want to answer neighborhood-based questions,
so we’ll map latitude/longitude coordinates to NYC
census tracts
Processing raw data:
The reality
• Often messy, raw data can require massaging
• Not fun, takes a while, but is essential
• Specifically: we have to plan ahead a bit,
anticipate usage patterns, questions we’re going
to ask, then decide on schema
Raw Data
Specific issues encountered
with raw taxi data
• Some files contain empty lines and unquoted carriage
returns 😐
• Raw data files have different formats even within the
same cab type 😕
• Some files contain extra columns in every row 😠
• Some files contain extra columns in only some rows 😡
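A minimal sketch of how issues like these might be handled, written in Python for illustration rather than taken from the talk's import scripts. The helper name `clean_rows`, the `expected_columns` parameter, and the choice to treat every carriage return as a line end are all assumptions.

```python
import csv


def clean_rows(raw_text, expected_columns):
    """Yield rows with exactly expected_columns fields.

    Hypothetical helper: drops empty lines, treats stray carriage
    returns as line ends, and truncates rows with extra trailing columns.
    """
    # Assumption: every carriage return marks the end of a line.
    normalized = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    # Skip lines that are empty or whitespace-only.
    lines = [line for line in normalized.split("\n") if line.strip()]
    for row in csv.reader(lines):
        if len(row) < expected_columns:
            continue  # too short to repair safely, so skip it
        # Extra columns, whether in every row or only some, are cut off.
        yield row[:expected_columns]
```

Truncating to a fixed column count covers both the "extra columns in every row" and "extra columns in only some rows" cases with the same code path, at the cost of silently discarding whatever those columns held.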
How do we load a bunch of
files into a database?
• One at a time!
• Bash script loops through each raw data file, for
each file it executes code to process data and
insert records into a database table
https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh
How do we map latitude and
longitude to census tracts?
• PostGIS!
• Geographic information system (GIS) for
PostgreSQL
• Can do calculations of the form, “is a point inside a
polygon?”
• Every pickup/drop off is a point, NYC’s census
tracts are polygons
NYC Census
Tracts
• 2,166 tracts
• 196 neighborhood
tabulation areas
(NTAs)
Shapefiles
• Shapefile format describes geometries like points,
lines, polygons
• Many shapefiles publicly available, e.g. NYC
provides a shapefile that contains definitions for all
census tracts and NTAs
• PostGIS includes functionality to import shapefiles
Shapefile Example
PostGIS: ST_Within()
• ST_Within(geom A, geom B) function returns
true if and only if A is entirely within B
• A = pickup or drop off point
• B = NYC census tract polygon
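To make the "is a point inside a polygon?" question concrete, here is a small ray-casting sketch in Python. It is an illustration of the general idea only, not how PostGIS implements ST_Within(); the function name `point_in_polygon` and the vertex-list representation are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside polygon, a list of (x, y) vertices.

    Ray casting: count how many polygon edges a horizontal ray from the
    point crosses. An odd count means the point is inside. Points lying
    exactly on an edge are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

Every call walks all of the polygon's edges, which hints at why running this kind of test against many detailed polygons for over a billion points gets expensive.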
Spatial Indexes
• Problem: determining whether a point is inside an
arbitrary polygon is computationally intensive and
slow
• PostGIS spatial indexes to the rescue!
Spatial indexes in a nutshell
[Figure labels: bounding box, census tract]
Spatial Indexes
• Determining whether a point is inside a rectangle is easy!
• Spatial indexes store rectangular bounding boxes for
polygons, then when determining if a point is inside a
polygon, calculate in 2 steps:
1. Is the point inside the polygon’s bounding box?
2. If so, is the point inside the polygon itself?
• Most of the time the cheap first check will be false, then
we can skip the expensive second step
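The two-step check above can be sketched in Python as follows. This is a simplified illustration: it scans the bounding boxes in a plain loop, whereas a real spatial index arranges them in a tree so most boxes are never even visited. The names `bounding_box`, `in_box`, `in_polygon`, and `find_tract`, and the dictionary layout, are assumptions.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(x, y, box):
    """Cheap test: four comparisons against the rectangle."""
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(x, y, polygon):
    """Expensive exact test (ray casting over every edge)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def find_tract(x, y, tracts, boxes):
    """Return the id of the first tract containing (x, y), else None.

    tracts maps a hypothetical tract id to its vertex list; boxes maps
    the same ids to precomputed bounding boxes.
    """
    for tract_id, polygon in tracts.items():
        # Step 1: the cheap rectangle check rules out most tracts.
        if not in_box(x, y, boxes[tract_id]):
            continue
        # Step 2: exact check only for the few surviving candidates.
        if in_polygon(x, y, polygon):
            return tract_id
    return None
```

The boxes are computed once, up front, which mirrors the "create spatial index" step: a one-time cost that makes each later lookup cheaper.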
Putting it all together
• Download NYC census tracts shapefile, import
into database, create spatial index
• Download raw taxi/Uber/Citi Bike data files and
loop through them, one file at a time
• For each file: fix data issues, load into database,
calculate census tracts with ST_Within()
• Wait 3 days and voila!
Analysis, a.k.a. “the fun part”
• Ask fun and interesting
questions
• Try to answer them
• Rinse and repeat
Taxi maps
• Question: what does a map of every taxi pickup
and drop off look like?
• Each trip has a pickup and drop off location, plot a
bunch of dots at those locations
• Made entirely in R using ggplot2
Taxi maps
Taxi maps preprocess
• Problem: R can’t fit 1.1 billion rows in memory
• Solution: preprocess data by rounding
lat/long to 4 decimal places (~10
meters), count number of trips at each
aggregated point
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
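The rounding trick can be illustrated with a few lines of Python. The talk does this step in SQL (see the link above); the sketch below only mirrors the idea, and the function name `aggregate_points` and the tuple layout are assumptions.

```python
from collections import Counter


def aggregate_points(points, decimals=4):
    """Count trips per rounded (latitude, longitude) cell.

    points is an iterable of (lat, lon) pairs. Rounding to 4 decimal
    places merges nearby points into one cell roughly 10 meters across.
    """
    counts = Counter()
    for lat, lon in points:
        counts[(round(lat, decimals), round(lon, decimals))] += 1
    return counts
```

The output has one entry per occupied cell rather than one per trip, so its size depends on how many distinct locations exist, not on how many trips were taken.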
Render maps in R
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R
Data reliability
Every other comment on reddit:
Citi Bike Animation
• Map the position of every Citi Bike over
the course of a single day
• Google Maps Directions API for cycling
directions
• Leaflet.js for mapping
• Torque.js by CartoDB for animation
Citi Bike Assumptions
• Google Maps cycling directions
have strong bias for dedicated
bike lanes on 1st, 2nd, 8th, and
9th avenues
• Not necessarily true!
Modeling the relationship between
the weather and Citi Bike ridership
Modeling the relationship between
the weather and Citi Bike ridership
• Daily ridership data from Citi Bike
• Daily weather data from National Climatic
Data Center: temperature, precipitation,
snow depth
• Devise and calibrate model in R
Modeling the relationship between
the weather and Citi Bike ridership
Model specification
Calibration in R
• Uses the nlsLM() function from the
minpack.lm package, which applies the
Levenberg–Marquardt algorithm to
minimize nonlinear squared error
https://gist.github.com/toddwschneider/bac3350f84b2ff99969d
Model Results
Airport traffic
• Question: how long will my taxi take to get to the
airport?
• LGA, JFK, and EWR are each their own census
tracts
• Get all trips that dropped off in one of those tracts
• Calculate travel times from neighborhoods to airports
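As a rough illustration of the last step, the Python sketch below groups airport-bound trips by pickup neighborhood and summarizes their durations. Using the median as the summary is an assumption made for this example, as are the function name `median_minutes_by_neighborhood` and the tuple layout standing in for rows of the trips table.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_minutes_by_neighborhood(trips):
    """Map pickup neighborhood -> median trip duration in minutes.

    Each trip is a (neighborhood, pickup_time, dropoff_time) tuple,
    assumed to be already filtered to drop-offs in an airport tract.
    """
    durations = defaultdict(list)
    for neighborhood, pickup, dropoff in trips:
        minutes = (dropoff - pickup).total_seconds() / 60
        if minutes > 0:  # ignore zero or negative durations as bad data
            durations[neighborhood].append(minutes)
    return {name: median(values) for name, values in durations.items()}
```

A median is less sensitive than a mean to the occasional trip with a wildly wrong timestamp, which matters with raw data of this kind.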
Airport traffic
More fun stuff in the full posts
• On the realism of Die Hard 3
• Relationship between age, gender, and cycling
speed
• Neighborhoods with most nightlife
• East Hampton privacy concerns
• What time do investment bankers arrive at work?
“Medium data” analysis tips
What is “medium data”?
No clear answer, but my rough thinking:
• Tiny: fits in spreadsheet
• Small: doesn’t fit in spreadsheet, but fits in RAM
• Medium: too big for RAM, but fits on local hard disk
• Big: too big for local disk, has to be distributed across
many nodes
Use the right tool for the job
My personal toolkit (yours may vary!):
• PostgreSQL for storing and aggregating data. Geospatial
calculations with PostGIS extension
• R for modeling and plotting
• Command line tools for looping through files, loading data, text
processing on input data with sed, awk, etc.
• Ruby for making API calls, scraping websites, running web servers,
and sometimes using local Rails apps to organize relational data
• JavaScript for interactivity on the web
R + PostgreSQL
• The R ↔ Postgres link is invaluable! Use R and
Postgres for the things they’re respectively best at
• Postgres: persisting data in tables, rote number
crunching
• R: calibrating models, plotting
• RPostgreSQL package allows querying Postgres
from within R
Tip: pre-aggregate
• Think about how you’re going to access the data, and
consider creating intermediate aggregated tables
which can be used as building blocks for later
analysis
• Example: number of taxi trips grouped by pickup
census tract and date/time truncated to the hour
• Resulting table is only 30 million rows, easier to work
with than full trips table, and can still answer lots of
interesting questions
Pre-aggregating example
CREATE TABLE hourly_pickups AS
SELECT
  date_trunc('hour', pickup_datetime) AS pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid,
  COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY
  pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
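For readers less familiar with SQL, the same grouping logic can be sketched in Python. This is only an in-memory illustration of what the query computes, not a replacement for running it in Postgres; the function name `hourly_pickups` and the tuple layout are assumptions.

```python
from collections import Counter
from datetime import datetime


def hourly_pickups(trips):
    """Count trips per (pickup_hour, cab_type_id, tract_id).

    Each trip is a (pickup_datetime, cab_type_id, tract_id) tuple.
    """
    counts = Counter()
    for pickup_datetime, cab_type_id, tract_id in trips:
        if tract_id is None:
            continue  # mirrors WHERE pickup_nyct2010_gid IS NOT NULL
        # Equivalent of date_trunc('hour', ...): zero out smaller units.
        pickup_hour = pickup_datetime.replace(minute=0, second=0, microsecond=0)
        counts[(pickup_hour, cab_type_id, tract_id)] += 1
    return counts
```

Later questions, such as trips per day or per tract, can then be answered by summing over this much smaller table instead of rescanning every trip.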
How to get people to read
your work
• It has to be interesting. If you’re not excited,
probably nobody else is either
• Most people are distracted, and they read things in
“fast scroll” mode. Optimize for them
• The questions you ask are more important than
the methods you use to answer them
Specific tips
• Write in short paragraphs with straightforward
language
• Use plenty of section headers
• Good ratio of pictures to text
• Avoid the dreaded “wall of text”
Above all…
• Have fun!
• Keep an inquisitive mind.
Observe stuff happening around
you, ask questions about it, try to
answer those questions
Thanks!
todd@toddwschneider.com

