Capital Bikeshare
MAY 2, 2015
GEORGETOWN UNIVERSITY
SCHOOL OF CONTINUING STUDIES
CERTIFICATE IN DATA ANALYTICS
CAPSTONE PROJECT
SELMA ORR
RYAN DONAHUE
ODETTE RIVERA
NORA GOEBELBECKER
KARINA HIDALGO
Problem Statement
Cities want to build bike systems for economic development and sustainability, but they face
serious fiscal constraints.
As a result, the public has little tolerance for error – even though usage, and therefore
profitability, is highly variable.
Most popular bikeshare station, 2014:
Union Station – 131,700 trips
Least popular bikeshare station, 2014:
34th St & Minnesota Ave SE – 112 trips
For roughly the same fixed costs, Union Station handles more than
1,000 times as many trips each year as the least popular station
(131,700 / 112 ≈ 1,176).
Currently, there is no standardized, rigorous methodology for accurately predicting which
stations will be most heavily used.
Goal
Develop a model, based on Washington DC but applicable to other
U.S. cities, that will predict the popularity of bikeshare stations based
on characteristics of the area surrounding each station.
Such a model could be used to increase the popularity of new and
existing bikeshare systems, making them more financially
sustainable.
Hypotheses
• Bikeshare station popularity is influenced by station location: specifically, the
economic, demographic, and geographic characteristics of surrounding
neighborhoods.
• Certain determinants of bikeshare station popularity hold true across cities,
allowing for the construction of a model that could accurately predict the
popularity of bikeshare stations in other cities.
• Regression models, such as linear regression, are a good fit to predict station
popularity.
Project Background
This project builds primarily upon two previous analyses of Washington, DC:
1. Maximizing Bicycle Sharing: An Empirical Analysis of Capital Bikeshare Usage
• Multivariate regression to identify five factors that influenced bikeshare station popularity:
population 20-39, non-white population, retail proximity, Metro proximity, and distance from
system center.
2. Predicting the Popularity of Bicycle Sharing Stations: An Accessibility-Based Approach Using
Linear Regression and Random Forests
• Linear regression and random forest analysis to understand how job and residential proximity
influenced station popularity. Although the study attempted to extend the model to San Francisco
and Minneapolis, it found that the model was a poor predictor of station popularity in those cities.
Why revisit this topic?
• Our team identified other characteristics that might drive usage.
• The bikeshare system has expanded considerably since those studies, offering a larger and more
varied sample.
Data Sources and Ingestion
Data | Variable Type | Source | Year | Geography | Format
Bikeshare trips | Dependent | Capital Bikeshare | 2014 | N/A | CSV
Bikeshare stations | Dependent | DC Open Data | 2014 | Point | CSV/shapefile
Population - age | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
Population - race | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
DC liquor license locations | Independent | DC Open Data | 2014 | Point | CSV/shapefile
DC Metro stations (train and bus) | Independent | WMATA | 2014 | Point | CSV/shapefile
Parks (DC, National) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Campuses (college, university) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Historic landmarks | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Starbucks and McDonald’s locations | Independent | Various online sources | 2014 | Point | CSV
Data | Variable Type | Source | Year | Geography | Format
Total trips per bike per year | Dependent | Capital Bikeshare | 2014 | N/A | CSV
Data | Variable Type | Format
Amenities within walking distance of bikeshare stations (sum) | Independent | CSV
Distance from bikeshare station to each type of amenity (closest amenity within walking distance of station, 0.5 miles or less) | Independent | CSV
Socioeconomic characteristics of the population sharing the same census block group as the bikeshare station | Independent | CSV/shapefile
Capital Bikeshare Data
• Data attributes for each bike trip: Trip Duration, Start/End Station (address), Start/End Date Time,
Bike Number, and Member Type
• Divided by time period: a year of trip data downloaded from the Capital Bikeshare website came
in four files, each representing one quarter.
• A separate dataset with the coordinates of bikeshare station locations was obtained from the DC Open
Data website
• How to measure popularity? Trips leaving, trips arriving, total trips? Capacity differs at each
station, so popularity = total trips (arrive + depart)/bike/year
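The popularity measure above can be sketched in a few lines. This is a minimal illustration rather than the team's code: the station names, trip pairs, and bike counts are invented, and `bikes_per_station` is a hypothetical stand-in for whatever capacity figure was actually used.

```python
from collections import Counter


def popularity(trips, bikes_per_station):
    """Return total trips (departures + arrivals) per bike for each station.

    trips: iterable of (start_station, end_station) pairs covering one year.
    bikes_per_station: dict of station -> capacity (hypothetical input).
    """
    totals = Counter()
    for start, end in trips:
        totals[start] += 1  # trip departing from `start`
        totals[end] += 1    # trip arriving at `end`
    return {
        station: totals[station] / bikes
        for station, bikes in bikes_per_station.items()
        if bikes > 0
    }


# Toy data, invented for illustration only.
toy_trips = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "A")]
toy_bikes = {"A": 4, "B": 2, "C": 1}
scores = popularity(toy_trips, toy_bikes)
```

Dividing by capacity keeps a large, busy station from looking more popular than a small station that is used just as intensively.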
Census Data
• Socioeconomic and demographic data collected for all Census block groups within DC, in the form
of CSV files downloaded from American FactFinder
• Census block groups are the smallest geographic area for which sample data is collected (typically
600 to 3,000 residents)
• Challenge 1: how to link stations (discrete points) with block groups (boundaries)?
o Solution: ArcGIS identified each station's block group from the station's lat/long
• Challenge 2: how to deal with missing data?
o For missing rent (4 instances): impute using the average rent/income ratio across the city
o For missing income (2 instances): impute by averaging two adjacent block groups
o For missing population (10 instances, national park areas): leave blank
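The rent imputation rule above could look roughly like the sketch below. It assumes the "average rent/income ratio" is the mean of the per-block-group ratios (the slides do not say whether it was that or a ratio of city-wide totals); the field names and figures are hypothetical.

```python
def impute_rent(block_groups):
    """Fill missing rent as income * (city-wide mean rent/income ratio).

    block_groups: list of dicts with 'income' and 'rent' keys, where a
    missing value is None. Returns a new list; the input is not modified.
    """
    ratios = [
        bg["rent"] / bg["income"]
        for bg in block_groups
        if bg["rent"] is not None and bg["income"]
    ]
    if not ratios:
        raise ValueError("no complete rows to estimate the ratio from")
    mean_ratio = sum(ratios) / len(ratios)

    filled = []
    for bg in block_groups:
        row = dict(bg)
        if row["rent"] is None and row["income"] is not None:
            row["rent"] = row["income"] * mean_ratio
        filled.append(row)
    return filled


# Toy block groups, invented for illustration only.
toy = [
    {"id": "bg1", "income": 50000, "rent": 1000},   # ratio 0.02
    {"id": "bg2", "income": 100000, "rent": 4000},  # ratio 0.04
    {"id": "bg3", "income": 60000, "rent": None},   # rent to be imputed
]
result = impute_rent(toy)
```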
Nearby Amenity Data
• Selection of amenities based largely on past studies and common-sense drivers
of bikeshare usage: metro and bus stations, college campuses, DC and national
parks, entertainment, restaurants, bars (proxied by liquor licenses)
• Two ways to think about importance of amenities as drivers of usage:
o How close the single closest location of each type of amenity is (likely most
important for metro stations)
o How many locations of each type of amenity there are within a half mile
(likely most important for restaurants, bars)
• One challenge: whenever there wasn’t a single location of an amenity within a
half mile (common result for metro stations), ArcGIS identified distance as “0”
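Both amenity measures, and the "0" problem noted above, can be illustrated with a small sketch. It uses straight-line (haversine) distance as a stand-in for the ArcGIS analysis, and it returns `None` rather than 0 when no amenity falls within the radius, so that "nothing nearby" cannot be confused with "zero distance". The function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def amenity_features(station, amenities, radius_miles=0.5):
    """Return (count within radius, nearest distance within radius or None)."""
    dists = [
        haversine_miles(station[0], station[1], lat, lon)
        for lat, lon in amenities
    ]
    nearby = [d for d in dists if d <= radius_miles]
    nearest = min(nearby) if nearby else None  # None, not 0, when empty
    return len(nearby), nearest


# Invented coordinates: one amenity about 0.07 miles away, one over a mile away.
station = (38.900, -77.000)
count, nearest = amenity_features(station, [(38.901, -77.000), (38.920, -77.000)])
empty = amenity_features(station, [(38.920, -77.000)])
```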
Data Wrangling
The primary challenge in the data
wrangling process was to create an
architecture that links each individual
station with its census block group
(and associated socioeconomic
characteristics) as well as with
distances to surrounding physical
amenities.
[Diagram: Bike Station Address, Census Block Group, and Amenity Proximity linked in Census/Bike Station.csv]
Spatial Analysis
[Maps: Union Station; 34th and Minnesota Ave SE]
Data Wrangling
The four quarterly ridership files were loaded into a single
PostgreSQL table and merged with the Census/Bike Station.csv file to
add each station's stationID, latitude, and longitude.
The resulting file contained all 2014 trips with reference
information about each station, plus a common field uniquely
linking the records between the files.
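The merge described above can be sketched as a plain SQL join. The example below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL and assumes the join key is the station address; the table names, column names, and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE trips (quarter TEXT, start_station TEXT, end_station TEXT);
    CREATE TABLE stations (
        address TEXT PRIMARY KEY, station_id INTEGER,
        latitude REAL, longitude REAL
    );
    """
)

# Four quarterly "files" (invented rows) appended into one trips table.
quarterly_files = {
    "Q1": [("1st & A St", "2nd & B St")],
    "Q2": [("2nd & B St", "1st & A St")],
    "Q3": [("1st & A St", "1st & A St")],
    "Q4": [("2nd & B St", "2nd & B St")],
}
for quarter, rows in quarterly_files.items():
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?, ?)",
        [(quarter, start, end) for start, end in rows],
    )

conn.executemany(
    "INSERT INTO stations VALUES (?, ?, ?, ?)",
    [("1st & A St", 101, 38.90, -77.00), ("2nd & B St", 102, 38.91, -77.01)],
)

# Attach station ID and coordinates to every trip via the shared address field.
joined = conn.execute(
    """
    SELECT t.quarter, t.start_station, s.station_id, s.latitude, s.longitude
    FROM trips AS t
    JOIN stations AS s ON s.address = t.start_station
    ORDER BY t.quarter
    """
).fetchall()
```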
[Diagram: BikeStation_Census.csv, DCBikeTrips2014.csv, Census_BikeTrip.csv]
Data Exploration
Correlated independent variables with one another to anticipate collinearity. Shown here are
metro station proximity vs. density, and bar proximity vs. single households.
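A pairwise check of this kind can be sketched as follows. It is an illustration only: the 0.8 cutoff is an assumed threshold, `METRO_DIST` is a hypothetical variable name, and the values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.8):
    """List feature pairs whose |r| meets the (assumed) threshold."""
    names = sorted(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Toy columns, invented for illustration only.
toy_features = {
    "DENSITY": [1, 2, 3, 4, 5],
    "METRO_DIST": [10, 8, 6, 4, 2],
    "SINGLE": [3, 1, 4, 2, 5],
}
flagged = correlated_pairs(toy_features)
```

Pairs flagged this way are candidates for dropping one of the two variables before fitting a regression.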
A few variables that appear to have little correlation with station popularity: metro station
proximity and % of households in Census block group that commute by public transit.
A few variables that seem to be correlated with station popularity: % of population in Census
block group that drives to work, and % of population in Census block group with a college degree.
Data Exploration: Correlations with y
[Bar chart: correlation of each variable with y; Correlation axis from -0.600 to 0.600]
2014 Capital Bikeshare Member Survey
“Compared with all commuters in the region, they were, on average, considerably younger,
more likely to be male, Caucasian, and slightly less affluent.”
“Two-thirds (64%) of respondents said that at least one of the Capital Bikeshare trips they made
last month either started or ended at a Metrorail station and 21% had used bikeshare six or
more times for this purpose. About a quarter (24%) of respondents used Capital Bikeshare to
access a bus in the past month.”
Data Exploration: Excerpt from cross-correlation matrix
Study Methodology: Feature Selection, or better said, “Feature Wrangling”
Step 1: Ran a full Ordinary Least Squares (OLS) regression with all thirty independent variables using StatsModels in Python.
◦ R-squared: 0.636
◦ Adjusted R-squared: 0.577
◦ Eight variables with significant p-values: DRIVE, WHITE, DENSITY, AGE, WALK, BUS_N, TRANSIT, MCDON_N
◦ As expected, a very large condition number (1.11e+06), indicating strong multicollinearity
◦ F-statistic: 10.73 and Prob(F-statistic): 3.42e-25
◦ Designated this as Model 2
Step 2: Beginning with DRIVE, the variable with the highest linear correlation to y, sequentially added the
other variables to the OLS regression in order of correlation
◦ Any variable that triggered a multicollinearity warning was left out
◦ Any variable without a significant p-value was left out
◦ Seven variables with significant p-values: DRIVE, WHITE, LIQUOR_N, BUS_N, MCDON_N, SINGLE, CAMPUS_N
◦ Designated this as Model 1
◦ Note: Experimented quite a bit with a “k features” module, which adds features sequentially in order of descending correlation but does not
take multicollinearity into account
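The ordering logic of Step 2 can be sketched in simplified form. This is not the team's procedure: it replaces the StatsModels multicollinearity warning with an assumed pairwise-correlation cutoff of 0.8, and it omits the p-value check entirely. The feature values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(features, y, max_pair_corr=0.8):
    """Greedy selection: strongest |corr with y| first, skip collinear ones."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], y)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        # Proxy for the multicollinearity warning: high pairwise correlation
        # with any feature that has already been kept.
        collinear = any(
            abs(pearson(features[name], features[kept])) >= max_pair_corr
            for kept in selected
        )
        if not collinear:
            selected.append(name)
    return selected


# Toy data, invented for illustration only.
y = [1, 2, 3, 4, 5]
toy_features = {
    "DRIVE": [5, 4, 3, 2, 1],     # strongest correlation with y
    "TRANSIT": [1, 2, 3, 4, 6],   # nearly a mirror image of DRIVE
    "CAMPUS_N": [3, 1, 4, 2, 5],  # weaker, but not collinear with DRIVE
}
selected = select_features(toy_features, y)
```

In the toy data, TRANSIT is dropped because it is almost perfectly correlated with DRIVE, which was added first.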
Definitions of relevant features
Census:
DRIVE: share of population in Census block group that drives to work (Model 1 and Model 2)
WALK: share of population in Census block group that walks to work (Model 2)
TRANSIT: share of population in Census block group that takes transit to work (Model 2)
WHITE: white share of population in Census block group (Model 1 and Model 2)
DENSITY: population density in Census block group (Model 2)
AGE: median age in Census block group (Model 2)
SINGLE: share of households in Census block group that are single (i.e., non-family) (Model 1)
Amenities:
BUS_N: number of bus stations within half mile (Model 1 and Model 2)
MCDON_N: number of McDonald’s within half mile (Model 1 and Model 2)
LIQUOR_N: distance to nearest establishment with a liquor license (Model 1)
CAMPUS_N: number of college campuses within half a mile (Model 1)
[Figure: OLS Output for Model 1]
[Figure: Model 1 Results: y-Actual versus y-Pred]
[Figure: OLS Output for Model 2]
[Figure: Model 2 Results: y-Actual versus y-Pred]
Study Methodology: Machine Learning
Step 1: Selected the following regression types for Machine Learning on Model 1 and Model 2
◦ OLS
◦ Ridge
◦ RidgeCV
◦ Lasso
◦ LassoCV
◦ Decision Tree
◦ Random Forest
Step 2: Prepared the data
◦ Because there were only 201 stations (rows of data), opted against K-fold cross-validation
◦ Used repeated random sub-sampling validation, with 20% of the data held out for testing and 80% used for training in each split
◦ Ran each regression type n=15 times and averaged the results over the 15 trials
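Repeated random sub-sampling validation, as described in Step 2, can be sketched with the standard library alone. The example below is a minimal illustration: it uses a one-variable least-squares line in place of the regression types listed above, and the data are synthetic.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def r_squared(ys, preds):
    """Coefficient of determination of predictions on held-out data."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def repeated_subsampling_r2(xs, ys, n_trials=15, test_frac=0.2, seed=0):
    """Average test-set R-squared over repeated random 80/20 splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(2, round(len(xs) * test_frac))
    scores = []
    for _ in range(n_trials):
        rng.shuffle(indices)  # fresh random split on every trial
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        preds = [slope * xs[i] + intercept for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


# Synthetic data: a noisy straight line, invented for illustration only.
noise = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 1.0) for x in xs]
avg_r2 = repeated_subsampling_r2(xs, ys)
```

Unlike K-fold, the test sets of different trials may overlap, which is why the scores are averaged over many random splits.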
Model 1
Regression | R-Squared Averages, Ex. 1 | R-Squared Averages, Ex. 2 | R-Squared Averages, Ex. 3
OLS | 0.466372594903 | 0.472123692839 | 0.399698146173
Ridge | 0.469910211714 | 0.46817885095 | 0.406793762824
Ridge CV | 0.466525097058 | 0.472095157445 | 0.399928079436
Decision Tree (depth = 2) | 0.383743276391 | 0.395203148425 | 0.359422535555
Decision Tree (depth = 5) | 0.396411487916 | 0.399304877207 | 0.396627898939
Lasso | 0.213675679287 | 0.192417630297 | 0.20648584883
Lasso CV | 0.466325652215 | 0.472022317547 | 0.399672003362
Random Forest | 0.513764682165 | 0.521753949359 | 0.510745984545
[Figure: Model 1 Random Forest Results: y-Actual versus y-Predict]
Model 2
Regression | R-Squared Averages, Ex. 1 | R-Squared Averages, Ex. 2 | R-Squared Averages, Ex. 3
OLS | 0.520828019361 | 0.540831229205 | 0.513422473234
Ridge | 0.497919255131 | 0.516490091569 | 0.506607344645
Ridge CV | 0.516873054267 | 0.54031441364 | 0.515242580429
Decision Tree (depth = 2) | 0.32349467429 | 0.311292315645 | 0.285600505202
Decision Tree (depth = 5) | 0.353623405764 | 0.435172086349 | 0.350751443572
Lasso | 0.191216516526 | 0.217208742758 | 0.196249828561
Lasso CV | 0.52129767449 | 0.541115629648 | 0.506104546918
Random Forest | 0.468915321109 | 0.533377083505 | 0.42975292961
[Figure: Model 2 Results, Ridge CV: y-Actual versus y-Pred]
[Figure: Model 2 Results, Lasso CV: y-Actual versus y-Pred]
Data Product
• Groundwork: By analyzing the correlation between bikeshare station location and geographic and
demographic information, we obtained results that allow us to create a data product that predicts the likelihood of
success or failure of a new bikeshare station in the DC area prior to implementation.
• Results: Our models explain about half of the variance in bikeshare station popularity as measured by
our utilization factor. This should at least help in predicting the potential popularity of a station based on the
combination of demographic and geographic factors we identified as significant.
• Further applications: In addition to identifying promising locations for new bikeshare stations in DC, the results may also
generalize to other cities. Using data on the demographic and geographic factors we identified as significant, a user
could predict promising locations for bike stations during an initial roll-out, thus enhancing the overall success
of a new project without costly experimentation.
What worked?
GIS was critical to creating an effective architecture: it linked stations to amenities by distance, and
to Census block groups and associated data.
Since the data volume we handled was small, using a local machine rather than a powerful
database or cloud environment helped achieve faster results.
Spending time exploring the variables before beginning analysis. This, plus domain knowledge,
allowed the team to identify and manually address data issues that the software did not calculate
accurately from the beginning.
Trying many different types of regressions using different variables (including different forms of the
independent variables, such as log and natural log).
What didn’t work? And lessons learned.
Data sample was relatively small: started at roughly 350 stations, but shrank to roughly 200
once MD and VA locations were removed from the analysis.
Data and feature wrangling take a long time (80% of the process); domain expertise makes this
easier.
Would have been harder to detect and address anomalies and missing values in data had the
sample been larger; familiarity with DC allowed us to understand why data were missing for national
parks or institutional land uses.
Couldn’t do k-fold cross-validation given small sample size.
Decision tree model didn’t work for our analysis.
Conclusion
Model offers a good starting point for assessing likely popularity of station locations, using data that is readily
available for most major U.S. cities. Some subjective decision-making will still be required around major parks
(e.g., the National Mall) or institutional land uses (campuses, hospitals).
Bikeshare continues to gain momentum across the country. Future studies should:
• Use a larger sample
• Original idea: use DC as the model and test against NYC and Chicago. Better: use DC, NYC, and Chicago to build a model with a larger
sample, then test against a fourth city.
• Categorize stations by function within network
• Stations have different functions: residential feeders to metro stations, tourism. With a larger sample, these station
types could be separated and the drivers of popularity independently determined.
Important to note that there are valid reasons other than current popularity that should determine station
placement (e.g., equity, driving changes in travel behavior). This model helps ensure financial viability so that
these outcomes can be pursued.

More Related Content

What's hot

Mountain man brewing company case analysis
Mountain man brewing company case analysisMountain man brewing company case analysis
Mountain man brewing company case analysis
Abhishek Yadav
 
Burke: Learning and Growing through Marketing Research
Burke: Learning and Growing through Marketing ResearchBurke: Learning and Growing through Marketing Research
Burke: Learning and Growing through Marketing Research
Asif Mahmood Abbas
 
Integrated marketing communication
Integrated marketing communicationIntegrated marketing communication
Integrated marketing communication
Udit Jain
 
8 Use Cases for the Intelligent, Ideal Customer Profile
8 Use Cases for the Intelligent, Ideal Customer Profile  8 Use Cases for the Intelligent, Ideal Customer Profile
8 Use Cases for the Intelligent, Ideal Customer Profile
Doug Sechrist
 
Co-Creating Brands & Campaigns via Customer Communities
Co-Creating Brands & Campaigns via Customer CommunitiesCo-Creating Brands & Campaigns via Customer Communities
Co-Creating Brands & Campaigns via Customer CommunitiesTom De Ruyck
 
UBER-Current Strategy, Competition Analysis and Global Expansion
UBER-Current Strategy, Competition Analysis and Global ExpansionUBER-Current Strategy, Competition Analysis and Global Expansion
UBER-Current Strategy, Competition Analysis and Global Expansion
Shaminder Saini
 
Mountain Man Brewing Co. Case Study
Mountain Man Brewing Co. Case StudyMountain Man Brewing Co. Case Study
Mountain Man Brewing Co. Case Study
Avani Jain
 
Presentation uber
Presentation uberPresentation uber
Presentation uber
Souarv Dhar
 
Finance project Mountain man
Finance project  Mountain manFinance project  Mountain man
Finance project Mountain manNoor
 
[Dove] Customer Journey
[Dove] Customer Journey[Dove] Customer Journey
[Dove] Customer Journey
Michela Caltran
 
Ccd vs Barista
Ccd vs BaristaCcd vs Barista
Ccd vs Barista
Anurag Gupta
 
Final Class Presentation: Zipcar Strategy
Final Class Presentation: Zipcar StrategyFinal Class Presentation: Zipcar Strategy
Final Class Presentation: Zipcar Strategy
carolinestokes
 
Dove: evolution of a brand
Dove: evolution of a brand Dove: evolution of a brand
Dove: evolution of a brand
Sameer Mathur
 
Air France Digital Marketing Strategy
Air France Digital Marketing StrategyAir France Digital Marketing Strategy
Air France Digital Marketing Strategy
Gareth Jones
 
Uber Analysis with details 2017
Uber Analysis with details 2017 Uber Analysis with details 2017
Uber Analysis with details 2017
Lexi Jacobs
 
Carpool by Meru Digital Marketing Campaign
Carpool by Meru Digital Marketing CampaignCarpool by Meru Digital Marketing Campaign
Carpool by Meru Digital Marketing Campaign
MindShift Interactive
 
How to acquire your first 1,000 customers (based on what worked for Uber, Air...
How to acquire your first 1,000 customers (based on what worked for Uber, Air...How to acquire your first 1,000 customers (based on what worked for Uber, Air...
How to acquire your first 1,000 customers (based on what worked for Uber, Air...
Thales Teixeira
 
Mountain Man Brewing Company - Case Study
Mountain Man Brewing Company - Case StudyMountain Man Brewing Company - Case Study
Mountain Man Brewing Company - Case Study
Ashwin C
 
Air france case study
Air france case studyAir france case study
Air france case study
Arthur Marot
 
Vohra case
 Vohra case Vohra case
Vohra case
Abhishek Maloo
 

What's hot (20)

Mountain man brewing company case analysis
Mountain man brewing company case analysisMountain man brewing company case analysis
Mountain man brewing company case analysis
 
Burke: Learning and Growing through Marketing Research
Burke: Learning and Growing through Marketing ResearchBurke: Learning and Growing through Marketing Research
Burke: Learning and Growing through Marketing Research
 
Integrated marketing communication
Integrated marketing communicationIntegrated marketing communication
Integrated marketing communication
 
8 Use Cases for the Intelligent, Ideal Customer Profile
8 Use Cases for the Intelligent, Ideal Customer Profile  8 Use Cases for the Intelligent, Ideal Customer Profile
8 Use Cases for the Intelligent, Ideal Customer Profile
 
Co-Creating Brands & Campaigns via Customer Communities
Co-Creating Brands & Campaigns via Customer CommunitiesCo-Creating Brands & Campaigns via Customer Communities
Co-Creating Brands & Campaigns via Customer Communities
 
UBER-Current Strategy, Competition Analysis and Global Expansion
UBER-Current Strategy, Competition Analysis and Global ExpansionUBER-Current Strategy, Competition Analysis and Global Expansion
UBER-Current Strategy, Competition Analysis and Global Expansion
 
Mountain Man Brewing Co. Case Study
Mountain Man Brewing Co. Case StudyMountain Man Brewing Co. Case Study
Mountain Man Brewing Co. Case Study
 
Presentation uber
Presentation uberPresentation uber
Presentation uber
 
Finance project Mountain man
Finance project  Mountain manFinance project  Mountain man
Finance project Mountain man
 
[Dove] Customer Journey
[Dove] Customer Journey[Dove] Customer Journey
[Dove] Customer Journey
 
Ccd vs Barista
Ccd vs BaristaCcd vs Barista
Ccd vs Barista
 
Final Class Presentation: Zipcar Strategy
Final Class Presentation: Zipcar StrategyFinal Class Presentation: Zipcar Strategy
Final Class Presentation: Zipcar Strategy
 
Dove: evolution of a brand
Dove: evolution of a brand Dove: evolution of a brand
Dove: evolution of a brand
 
Air France Digital Marketing Strategy
Air France Digital Marketing StrategyAir France Digital Marketing Strategy
Air France Digital Marketing Strategy
 
Uber Analysis with details 2017
Uber Analysis with details 2017 Uber Analysis with details 2017
Uber Analysis with details 2017
 
Carpool by Meru Digital Marketing Campaign
Carpool by Meru Digital Marketing CampaignCarpool by Meru Digital Marketing Campaign
Carpool by Meru Digital Marketing Campaign
 
How to acquire your first 1,000 customers (based on what worked for Uber, Air...
How to acquire your first 1,000 customers (based on what worked for Uber, Air...How to acquire your first 1,000 customers (based on what worked for Uber, Air...
How to acquire your first 1,000 customers (based on what worked for Uber, Air...
 
Mountain Man Brewing Company - Case Study
Mountain Man Brewing Company - Case StudyMountain Man Brewing Company - Case Study
Mountain Man Brewing Company - Case Study
 
Air france case study
Air france case studyAir france case study
Air france case study
 
Vohra case
 Vohra case Vohra case
Vohra case
 

Similar to Capital Bikeshare Presentation

Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...
Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...
Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...
Project for Public Spaces & National Center for Biking and Walking
 
ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...
ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...
ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...
Association for Commuter Transportation (ACT)
 
Multimodal Impact Fees - Using Advanced Modeling Tools
Multimodal Impact Fees - Using Advanced Modeling ToolsMultimodal Impact Fees - Using Advanced Modeling Tools
Multimodal Impact Fees - Using Advanced Modeling Tools
Jonathan Slason
 
Taking Pedestrian and Bicycle Counting Programs to the Next Level
Taking Pedestrian and Bicycle Counting Programs to the Next Level Taking Pedestrian and Bicycle Counting Programs to the Next Level
Taking Pedestrian and Bicycle Counting Programs to the Next Level
Project for Public Spaces & National Center for Biking and Walking
 
scott_shaffer_board_final
scott_shaffer_board_finalscott_shaffer_board_final
scott_shaffer_board_finalScott Shaffer
 
New Tools for Estimating Walking and Bicycling Demand
New Tools for Estimating Walking and Bicycling DemandNew Tools for Estimating Walking and Bicycling Demand
New Tools for Estimating Walking and Bicycling Demand
Project for Public Spaces & National Center for Biking and Walking
 
Improving the quality and cost effectiveness of multimodal travel behavior da...
Improving the quality and cost effectiveness of multimodal travel behavior da...Improving the quality and cost effectiveness of multimodal travel behavior da...
Improving the quality and cost effectiveness of multimodal travel behavior da...
Sean Barbeau
 
Vancouver_BikeShare_SynergyGroup_CES2013
Vancouver_BikeShare_SynergyGroup_CES2013Vancouver_BikeShare_SynergyGroup_CES2013
Vancouver_BikeShare_SynergyGroup_CES2013
CesToronto
 
Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...
Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...
Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...
Project for Public Spaces & National Center for Biking and Walking
 
Boosting Active Transportation at the Regional Level: Setting and Meeting Per...
Boosting Active Transportation at the Regional Level: Setting and Meeting Per...Boosting Active Transportation at the Regional Level: Setting and Meeting Per...
Boosting Active Transportation at the Regional Level: Setting and Meeting Per...
Project for Public Spaces & National Center for Biking and Walking
 
Theme 3 The costumer experience
Theme 3 The costumer experienceTheme 3 The costumer experience
Theme 3 The costumer experienceBRTCoE
 
2016 Commuter Choice Summit - TDM Technology Session
2016 Commuter Choice Summit - TDM Technology Session2016 Commuter Choice Summit - TDM Technology Session
2016 Commuter Choice Summit - TDM Technology Session
Sean Barbeau
 
BRT Workshop - The Customer Experience
BRT Workshop - The Customer ExperienceBRT Workshop - The Customer Experience
BRT Workshop - The Customer Experience
WRI Ross Center for Sustainable Cities
 
Comparative Analysis of the Multi-modal Transportation Environments in the No...
Comparative Analysis of the Multi-modal Transportation Environments in the No...Comparative Analysis of the Multi-modal Transportation Environments in the No...
Comparative Analysis of the Multi-modal Transportation Environments in the No...
dperl88
 
ATS-16: Making Data Count, Krista Nordback
ATS-16: Making Data Count, Krista NordbackATS-16: Making Data Count, Krista Nordback
ATS-16: Making Data Count, Krista Nordback
BTAOregon
 
Commuting Connections: Carpooling and Cyberspace
Commuting Connections: Carpooling and CyberspaceCommuting Connections: Carpooling and Cyberspace
Commuting Connections: Carpooling and Cyberspace
Smart Commute
 
Connecting Bellingham
Connecting BellinghamConnecting Bellingham
Connecting BellinghamTranspo Group
 
Multimodal Mopbility Planning Using Big Data in Toronto
Multimodal Mopbility Planning Using Big Data in TorontoMultimodal Mopbility Planning Using Big Data in Toronto
Multimodal Mopbility Planning Using Big Data in Toronto
Dewan Masud Karim, P.Eng., PTOE
 

Similar to Capital Bikeshare Presentation (20)

Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...
Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...
Where Do I Start? New Tools to Prioritize Investments in Bicycle and Pedestri...
 
ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...
ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...
ACT 2014 Introduction to Shared Use Mobility-Carsharing and Bikesharing Trend...
 
Multimodal Impact Fees - Using Advanced Modeling Tools
Multimodal Impact Fees - Using Advanced Modeling ToolsMultimodal Impact Fees - Using Advanced Modeling Tools
Multimodal Impact Fees - Using Advanced Modeling Tools
 
Taking Pedestrian and Bicycle Counting Programs to the Next Level
Taking Pedestrian and Bicycle Counting Programs to the Next Level Taking Pedestrian and Bicycle Counting Programs to the Next Level
Taking Pedestrian and Bicycle Counting Programs to the Next Level
 
scott_shaffer_board_final
scott_shaffer_board_finalscott_shaffer_board_final
scott_shaffer_board_final
 
New Tools for Estimating Walking and Bicycling Demand
New Tools for Estimating Walking and Bicycling DemandNew Tools for Estimating Walking and Bicycling Demand
New Tools for Estimating Walking and Bicycling Demand
 
Improving the quality and cost effectiveness of multimodal travel behavior da...
Improving the quality and cost effectiveness of multimodal travel behavior da...Improving the quality and cost effectiveness of multimodal travel behavior da...
Improving the quality and cost effectiveness of multimodal travel behavior da...
 
Vancouver_BikeShare_SynergyGroup_CES2013
Vancouver_BikeShare_SynergyGroup_CES2013Vancouver_BikeShare_SynergyGroup_CES2013
Vancouver_BikeShare_SynergyGroup_CES2013
 
Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...
Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...
Morgan Whitcomb, Equity in Bike Share: Practical Methods for Addressing Equit...
 
Measure for Measure: Boston-based Technical Toolkits for Measuring Walkabilit...

Capital Bikeshare Presentation

  • 1. Capital Bikeshare MAY 2, 2015 GEORGETOWN UNIVERSITY SCHOOL OF CONTINUING STUDIES CERTIFICATE IN DATA ANALYTICS CAPSTONE PROJECT SELMA ORR RYAN DONAHUE ODETTE RIVERA NORA GOEBELBECKER KARINA HIDALGO
  • 2. Problem Statement Cities want to build bike systems for economic development and sustainability, but they face serious fiscal constraints.
  • 3. Problem Statement As a result, the public has little tolerance for error, even though usage, and therefore profitability, is highly variable. Most popular bikeshare station, 2014: Union Station (131,700 trips). Least popular bikeshare station, 2014: 34th St & Minnesota Ave SE (112 trips). For roughly the same fixed costs, more than 1,000 times as many riders each year use the Union Station location as use the least popular station.
  • 4. Problem Statement Currently, there is no standardized, rigorous methodology for accurately predicting which stations will be most heavily used.
  • 5. Goal Develop a model, based on Washington DC but applicable to other U.S. cities, that will predict the popularity of bikeshare stations based on characteristics of the area surrounding each station. Such a model could be used to increase the popularity of new and existing bikeshare systems, making them more financially sustainable.
  • 6. Hypotheses • Bikeshare station popularity is influenced by station location: specifically, the economic, demographic, and geographic characteristics of surrounding neighborhoods. • Certain determinants of bikeshare station popularity hold true across cities, allowing for the construction of a model that could accurately predict the popularity of bikeshare stations in other cities. • Regression models, such as linear regression, are a good fit to predict station popularity.
  • 7. Project Background This project builds primarily upon two previous analyses of Washington, DC: 1. Maximizing Bicycle Sharing: An Empirical Analysis of Capital Bikeshare Usage • Multivariate regression to identify five factors that influenced bikeshare station popularity: population 20-39, non-white population, retail proximity, Metro proximity, and distance from system center. 2. Predicting the Popularity of Bicycle Sharing Stations: An Accessibility-Based Approach Using Linear Regression and Random Forests • Linear regression and random forest analysis to understand how job and residential proximity influenced station popularity. Although the study attempted to extend the model to San Francisco and Minneapolis, it found that the model was a poor predictor of station popularity in those cities. Why revisit this topic? • Our team identified other characteristics that might drive usage. • The bikeshare system has expanded considerably since those studies, offering a larger and more varied sample.
  • 8. Data Sources and Ingestion
      Data | Variable Type | Source | Year | Geography | Format
      Bikeshare trips | Dependent | Capital Bikeshare | 2014 | N/A | CSV
      Bikeshare stations | Dependent | DC Open Data | 2014 | Point | CSV/shapefile
      Population - age | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
      Population - race | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
      DC liquor license locations | Independent | DC Open Data | 2014 | Point | CSV/shapefile
      DC Metro stations (train and bus) | Independent | WMATA | 2014 | Point | CSV/shapefile
      Parks (DC, National) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
      Campuses (college, university) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
      Historic landmarks | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
      Starbucks and McDonald’s locations | Independent | Various online sources | 2014 | Point | CSV
  • 9. Data Sources and Ingestion
      Data | Variable Type | Source | Year | Geography | Format
      Total trips per bike per year | Dependent | Capital Bikeshare | 2014 | N/A | CSV
      Bikeshare stations | Dependent | DC Open Data | 2014 | Point | CSV/shapefile
      Population - age | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
      Population - race | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
      DC liquor license locations | Independent | DC Open Data | 2014 | Point | CSV/shapefile
      DC Metro stations (train and bus) | Independent | WMATA | 2014 | Point | CSV/shapefile
      Parks (DC, National) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
      Campuses (college, university) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
      Historic landmarks | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
      Starbucks and McDonald’s locations | Independent | Various online sources | 2014 | Point | CSV

      Data | Variable Type | Format
      Amenities within walking distance from bikeshare stations (sum) | Independent | CSV
      Distance from bikeshare station to each type of amenity (closest amenity within walking distance of station, 0.5 miles or less) | Independent | CSV
      Socioeconomic characteristics of the population sharing the same census block group as the bikeshare station | Independent | CSV/shapefile
  • 10. Capital Bikeshare Data • Data attributes for each bike trip: Trip Duration, Start/End Station (address), Start/End Date Time, Bike Number, and Member Type • Divided by time period: a year of trip data downloaded from the Capital Bikeshare website was divided into four files, each representing a quarter. • A separate dataset with the coordinates of bikeshare station locations was obtained from the DC Open Data website • How to measure popularity? Trips leaving, trips arriving, total trips? Capacity differs at each station, so popularity = total trips (arrive + depart)/bike/year
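The popularity measure on this slide (total arrivals plus departures, divided by station capacity, over one year) can be sketched as follows. This is a minimal illustration, not the team's actual code: the function name, the `bikes_per_station` mapping, and the toy trips are all assumptions made for the example.

```python
from collections import Counter


def station_popularity(trips, bikes_per_station):
    """Return (departures + arrivals) per bike for each station.

    trips: iterable of (start_station, end_station) pairs for one year.
    bikes_per_station: hypothetical mapping of station -> capacity.
    """
    totals = Counter()
    for start, end in trips:
        totals[start] += 1  # one trip leaving the start station
        totals[end] += 1    # one trip arriving at the end station
    return {
        station: totals[station] / capacity
        for station, capacity in bikes_per_station.items()
        if capacity > 0  # skip stations with no recorded capacity
    }


# Toy data (made up): three trips between two stations.
trips = [
    ("Union Station", "Lincoln Memorial"),
    ("Lincoln Memorial", "Union Station"),
    ("Union Station", "Union Station"),
]
bikes_per_station = {"Union Station": 2, "Lincoln Memorial": 1}
popularity = station_popularity(trips, bikes_per_station)
```

Normalizing by capacity keeps a large station from looking popular merely because it can hold more bikes.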
  • 11. Census Data • Socioeconomic and demographic data collected for all Census block groups within DC, in the form of CSV files downloaded from American FactFinder • Census block groups are the smallest geographic area for which sample data is collected (typically 600 to 3,000 residents) • Challenge 1: how to link stations (discrete points) with block groups (boundaries)? o Solution: ArcGIS was able to identify block groups by lat/long of bikeshare station • Challenge 2: how to deal with missing data? o For missing rent (4 instances): impute by calculating average rent/income ratio across city o For missing income (2 instances): impute by averaging two adjacent block groups o For missing population (10 instances, national park areas): leave blank
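The three missing-data rules on this slide can be sketched roughly as below. The record layout, the `neighbors` mapping, and all values are hypothetical, and the sketch reads "average rent/income ratio" as the mean of per-block-group ratios, which is an assumption; the slide does not say how the ratio was computed.

```python
def average_rent_to_income(block_groups):
    """Mean rent/income ratio over block groups where both values are known."""
    ratios = [
        bg["rent"] / bg["income"]
        for bg in block_groups
        if bg["rent"] is not None and bg["income"]
    ]
    return sum(ratios) / len(ratios)


def impute(block_groups, neighbors):
    """Apply the slide's rules; `neighbors` maps an id to two adjacent ids."""
    ratio = average_rent_to_income(block_groups)
    by_id = {bg["id"]: dict(bg) for bg in block_groups}  # copy, do not mutate input
    for bg in by_id.values():
        if bg["income"] is None:
            # Missing income: average the two adjacent block groups
            # (assumes both neighbors have a known income).
            a, b = neighbors[bg["id"]]
            bg["income"] = (by_id[a]["income"] + by_id[b]["income"]) / 2
        if bg["rent"] is None:
            # Missing rent: citywide rent/income ratio times local income.
            bg["rent"] = ratio * bg["income"]
        # Missing population is deliberately left as None (blank).
    return list(by_id.values())


# Made-up records for illustration only.
groups = [
    {"id": "A", "income": 50000, "rent": 1000, "population": 1200},
    {"id": "B", "income": 100000, "rent": 2000, "population": 900},
    {"id": "C", "income": None, "rent": 1500, "population": 800},
    {"id": "D", "income": 80000, "rent": None, "population": None},
]
filled = impute(groups, neighbors={"C": ("A", "B")})
```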
  • 12. Nearby Amenity Data • Selection of amenities based largely on past studies and common-sense drivers of bikeshare usage: metro and bus stations, college campuses, DC and national parks, entertainment, restaurants, bars (proxied by liquor licenses) • Two ways to think about importance of amenities as drivers of usage: o How close the single closest location of each type of amenity is (likely most important for metro stations) o How many locations of each type of amenity there are within a half mile (likely most important for restaurants, bars) • One challenge: whenever there wasn’t a single location of an amenity within a half mile (common result for metro stations), ArcGIS identified distance as “0”
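The two amenity measures described here (distance to the closest location, and count within a half mile) can be sketched in plain Python. The team used ArcGIS; the haversine great-circle formula below is a swapped-in stand-in, and the coordinates are made up. The sketch returns `None` rather than 0 when no amenity falls within the radius, which avoids the ambiguity noted in the last bullet.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def amenity_features(station, amenities, radius_miles=0.5):
    """Count amenities within the radius and find the nearest one inside it."""
    distances = [
        haversine_miles(station[0], station[1], lat, lon) for lat, lon in amenities
    ]
    nearby = [d for d in distances if d <= radius_miles]
    return {
        "count": len(nearby),
        # None (not 0) marks "nothing within the radius".
        "nearest": min(nearby) if nearby else None,
    }


# Hypothetical station and amenity coordinates.
station = (38.8977, -77.0365)
amenities = [(38.8987, -77.0365), (39.0000, -77.0000)]
features = amenity_features(station, amenities)
empty = amenity_features(station, [(39.0000, -77.0000)])
```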
  • 13. Data Wrangling The primary challenge in the data wrangling process was to create an architecture that links each individual station with its census block group (and associated socioeconomic characteristics) as well as with distances to surrounding physical amenities. [Diagram: Census/Bike Station.csv, linking Bike Station Address, Census Block Group, and Amenity Proximity]
  • 16. Spatial Analysis Union Station 34th and Minnesota Ave SE
  • 17. Data Wrangling All four files of quarterly ridership data were loaded into a PostgreSQL table and merged with the Census/Bike station.csv file to add the stationID, latitude, and longitude of each station. The resulting file contained all 2014 trips along with reference information about each station, with a common field uniquely linking the records between the files. [Diagram labels: BikeStation _Census.csv, DCBikeTrips2014.csv, Census_BikeTrip.csv]
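The merge described on this slide amounts to a join between a trips table and a station reference table on a shared key. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL; the table names, column names, and rows are all hypothetical and are not taken from the project's files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the PostgreSQL database
conn.execute(
    "CREATE TABLE trips (start_station TEXT, end_station TEXT, duration_sec INTEGER)"
)
conn.execute(
    "CREATE TABLE stations (address TEXT PRIMARY KEY, station_id INTEGER, lat REAL, lon REAL)"
)

# Made-up rows; in the project, the four quarterly files would be loaded here.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("Union Station", "Lincoln Memorial", 900),
        ("Lincoln Memorial", "Union Station", 1200),
    ],
)
conn.executemany(
    "INSERT INTO stations VALUES (?, ?, ?, ?)",
    [
        ("Union Station", 1, 38.897, -77.006),
        ("Lincoln Memorial", 2, 38.889, -77.050),
    ],
)

# Attach the station id and coordinates to every trip via the shared address field.
rows = conn.execute(
    """
    SELECT t.start_station, s.station_id, s.lat, s.lon, t.duration_sec
    FROM trips AS t
    JOIN stations AS s ON s.address = t.start_station
    ORDER BY t.duration_sec
    """
).fetchall()
conn.close()
```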
  • 18. Data Exploration Correlated independent variables with one another to anticipate collinearity. Shown here are metro station proximity vs. density, and bar proximity vs. single households.
  • 19. Data Exploration A few variables that appear to have little correlation to station popularity: metro station proximity and % of households in Census block that commute by public transit.
  • 20. Data Exploration A few variables that seem to have correlation to station popularity: % of population in Census block that drives to work, and % of population in Census block with a college degree.
  • 21. Data Exploration: Correlations with y [Bar chart of each variable's correlation with y; axis labeled Correlation, ranging from -0.600 to 0.600]
  • 22. 2014 Capital Bikeshare Member Survey “Compared with all commuters in the region, they were, on average, considerably younger, more likely to be male, Caucasian, and slightly less affluent.” “Two-thirds (64%) of respondents said that at least one of the Capital Bikeshare trips they made last month either started or ended at a Metrorail station and 21% had used bikeshare six or more times for this purpose. About a quarter (24%) of respondents used Capital Bikeshare to access a bus in the past month.”
  • 23. Data Exploration: Excerpt from cross-correlation matrix
  • 24. Study Methodology: Feature Selection or, better said, “Feature Wrangling” Step 1: Ran a full Ordinary Least Squares regression with all thirty independent variables using StatsModels in Python. ◦ R-squared: 0.636 ◦ Adjusted R-squared: 0.577 ◦ Eight variables with significant p-values: DRIVE, WHITE, DENSITY, AGE, WALK, BUS_N, TRANSIT, MCDON_N ◦ As expected, a very large condition number (1.11e+06), indicating strong multicollinearity ◦ F-statistic: 10.73 and Prob(F-statistic): 3.42e-25 ◦ Designated this as Model 2 Step 2: Beginning with DRIVE, the variable with the highest linear correlation to y, sequentially added the other variables to the OLS regression in order of correlation ◦ Any variable that triggered a multicollinearity warning was left out ◦ Any variable without a significant p-value was left out ◦ Seven variables with significant p-values: DRIVE, WHITE, LIQUOR_N, BUS_N, MCDON_N, SINGLE, CAMPUS_N ◦ Designated this as Model 1 ◦ Note: Experimented extensively with the k features module, which adds features sequentially by descending correlation but does not account for multicollinearity
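Step 2 can be approximated in plain Python as a correlation-ordered forward screen. This is a simplified stand-in, not the team's procedure: it replaces the StatsModels multicollinearity warning with a pairwise-correlation threshold (`max_pairwise`, an assumed value), it omits the p-value filter entirely, and the feature values are toy numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def forward_screen(features, y, max_pairwise=0.8):
    """Add features in descending |corr with y|, skipping near-duplicates."""
    ranked = sorted(
        features, key=lambda name: abs(pearson(features[name], y)), reverse=True
    )
    selected = []
    for name in ranked:
        # Keep the feature only if it is not strongly correlated with
        # anything already selected (proxy for a multicollinearity check).
        if all(
            abs(pearson(features[name], features[kept])) < max_pairwise
            for kept in selected
        ):
            selected.append(name)
    return selected


# Toy values: WALK nearly mirrors DRIVE, so it is screened out.
y = [10, 8, 6, 4, 2]
features = {
    "DRIVE": [1, 2, 3, 4, 5],
    "WALK": [5, 4, 3, 2, 2],
    "BUS_N": [3, 1, 4, 1, 5],
}
selected = forward_screen(features, y)
```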
  • 25. Definitions of relevant features Census: DRIVE: share of population in Census block that drives to work (Model 1 and Model 2); WALK: share of population in Census block that walks to work (Model 2); TRANSIT: share of population in Census block that takes transit to work (Model 2); WHITE: white share of population in Census block (Model 1 and Model 2); DENSITY: population density in Census block (Model 2); AGE: median age in Census block (Model 2); SINGLE: share of households in block group that are single (i.e., non-family) (Model 1). Amenities: BUS_N: number of bus stations within half a mile (Model 1 and Model 2); MCDON_N: number of McDonald’s within half a mile (Model 1 and Model 2); LIQUOR_N: distance to nearest establishment with a liquor license (Model 1); CAMPUS_N: number of college campuses within half a mile (Model 1)
  • 26. OLS Output for Model 1
  • 27. Model 1 Results: y-Actual versus y-Pred
  • 28. OLS Output for Model 2
  • 29. Model 2 Results: y-Actual versus y-Pred
  • 30. Study Methodology: Machine Learning Step 1: Selected the following regression types for Machine Learning on Model 1 and Model 2 ◦ OLS ◦ Ridge ◦ RidgeCV ◦ Lasso ◦ LassoCV ◦ Decision Tree ◦ Random Forest Step 2: Prepared the data ◦ Because there were only 201 stations (rows of data), opted against k-fold cross-validation ◦ Used repeated random sub-sampling validation with 20% of the data held out for testing and 80% used for training ◦ Ran each regression type n=15 times and averaged the results across the 15 trials
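Repeated random sub-sampling validation, as described in Step 2, can be sketched as below: shuffle, hold out 20% for testing, fit on the remaining 80%, score R-squared on the held-out rows, and average over 15 trials. To stay self-contained the sketch fits a one-variable least-squares line instead of the regression types listed on the slide, and the 201 data points are synthetic; both are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def r_squared(ys, preds):
    """Coefficient of determination on a held-out set."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def repeated_subsampling_r2(xs, ys, n_trials=15, test_share=0.2, seed=0):
    """Average test R-squared over repeated random 80/20 splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(2, round(test_share * len(xs)))
    scores = []
    for _ in range(n_trials):
        rng.shuffle(indices)  # a fresh random split on every trial
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        preds = [slope * xs[i] + intercept for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


# Synthetic data sized like the study's sample (201 rows).
noise = random.Random(1)
xs = [i / 200 for i in range(201)]
clean_ys = [3 - 2 * x for x in xs]
noisy_ys = [y + noise.gauss(0, 0.1) for y in clean_ys]
clean_r2 = repeated_subsampling_r2(xs, clean_ys)
noisy_r2 = repeated_subsampling_r2(xs, noisy_ys)
```

Unlike k-fold cross-validation, the test sets of different trials may overlap, so some rows can be tested several times and others never.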
  • 31. Study Methodology: Machine Learning
      Model 1 | R-Squared Averages Ex. 1 | R-Squared Averages Ex. 2 | R-Squared Averages Ex. 3
      OLS | 0.466372594903 | 0.472123692839 | 0.399698146173
      Ridge | 0.469910211714 | 0.46817885095 | 0.406793762824
      Ridge CV | 0.466525097058 | 0.472095157445 | 0.399928079436
      Decision Tree (depth = 2) | 0.383743276391 | 0.395203148425 | 0.359422535555
      Decision Tree (depth = 5) | 0.396411487916 | 0.399304877207 | 0.396627898939
      Lasso | 0.213675679287 | 0.192417630297 | 0.20648584883
      Lasso CV | 0.466325652215 | 0.472022317547 | 0.399672003362
      Random Forest | 0.513764682165 | 0.521753949359 | 0.510745984545
  • 32. Model 1 Random Forest Results: y-Actual versus y-Predict
  • 33. Study Methodology: Machine Learning
      Model 2 | R-Squared Averages Ex. 1 | R-Squared Averages Ex. 2 | R-Squared Averages Ex. 3
      OLS | 0.520828019361 | 0.540831229205 | 0.513422473234
      Ridge | 0.497919255131 | 0.516490091569 | 0.506607344645
      Ridge CV | 0.516873054267 | 0.54031441364 | 0.515242580429
      Decision Tree (depth = 2) | 0.32349467429 | 0.311292315645 | 0.285600505202
      Decision Tree (depth = 5) | 0.353623405764 | 0.435172086349 | 0.350751443572
      Lasso | 0.191216516526 | 0.217208742758 | 0.196249828561
      Lasso CV | 0.52129767449 | 0.541115629648 | 0.506104546918
      Random Forest | 0.468915321109 | 0.533377083505 | 0.42975292961
  • 34. Model 2 Results Ridge CV: y-Actual versus y-Pred
  • 35. Model 2 Results Lasso CV: y-Actual versus y-Pred
  • 36. Data Product • Groundwork: By analyzing correlations among factors such as bikeshare station location and geographic and demographic information, we obtained results that allow us to create a data product that predicts the likelihood of success or failure of a new Bikeshare station in the DC area prior to implementation. • Results: Our models succeed in explaining about half of the variance in Bikeshare station popularity as measured by our utilization factor. This should at least help in predicting the potential popularity of a station based on the combination of demographic and geographic factors we identified as significant. • Further applications: In addition to identifying promising locations for new Bikeshare stations in DC, the results may also generalize to other cities. Using data on the demographic and geographic factors we identified as significant, a user could predict promising locations for bike stations during an initial roll-out, thus enhancing the overall success of a new project without costly experimentation.
  • 37. What worked? GIS was critical to creating an effective architecture: it linked stations to amenities by distance, and to Census blocks and associated data. Since the data volume that we handled was small, using a local machine rather than a powerful database or cloud environment helped to achieve faster results. Spending time exploring the variables before beginning analysis. This, plus domain knowledge, allowed the team to manually identify and address data issues that the software did not calculate accurately from the beginning. Trying many different types of regressions using different variables (including different forms of independent variable: log, natural log).
  • 38. What didn’t work? And lessons learned. Data sample was relatively small: started at roughly 350 stations, but shrank to roughly 200 once MD and VA locations were removed from the analysis. Data and feature wrangling take a long time (80% of the process); domain expertise makes this easier. It would have been harder to detect and address anomalies and missing values in the data had the sample been larger; familiarity with DC allowed us to understand why data were missing for national parks or institutional land uses. Couldn’t do k-fold cross-validation given the small sample size. Decision tree model didn’t work for our analysis.
  • 39. Conclusion Model offers a good starting point for assessing likely popularity of station locations, using data that is readily available for most major U.S. cities. Some subjective decision-making will still be required around major parks (e.g., National Mall) or institutional land uses (campuses, hospitals). Bikeshare continues to gain momentum across the country. Future studies should: • Use a larger sample • Idea: use DC as model, test against NYC and Chicago. Instead: use DC, NYC, and Chicago to build model with larger sample, test against fourth city. • Categorize stations by function within network • Stations have different functions: residential feeders to metro stations, tourism. With a larger sample, these station types could be separated and the drivers of popularity independently determined. Important to note that there are valid reasons other than current popularity that should determine station placement (e.g., equity, driving changes in travel behavior). This model helps ensure financial viability so that these outcomes can be pursued.

Editor's Notes

  1. The goals of this project are: 1. Develop a general model (using Washington DC as a test dataset) that will predict the popularity of bike share stations based on characteristics in proximate areas, in order to help cities deploy optimal station networks without expensive experimentation.
  5. This project builds primarily upon two previous analyses. The first study, based on Washington DC, used multivariate regression to identify five factors that influenced bikeshare station popularity: population 20-39, non-white population, retail proximity, Metro proximity, and distance from system center. The second study, also based on Washington DC, used linear regression and random forest analysis to understand how job and residential proximity influenced bikeshare station popularity. It then sought to extend the model to San Francisco and Minneapolis, but found that the model was a poor predictor of station popularity in those cities.
  6. (There’s a transition effect on this slide. Click once and this refined dataset table will slide in.)
  7. We joined these separate datasets.
  8. The primary challenge is to create an architecture that links each individual station with its block group (and associated socioeconomic characteristics) as well as with distances to surrounding physical amenities. The complexity arises because station locations are recorded as coordinates, block groups are boundaries that cover an array of coordinates, and physical amenities are variously recorded as addresses, intersections, and/or coordinates.