Building a Standard for Open Bikeshare Data
Originally published at Michael Schade’s Mystery Incorporated Blog
March 2nd, 2014
Should the bikeshare industry adopt an open data standard? As
bikesharing spreads to more cities, having a common method for
accessing and analyzing data will become more important. We know
that transit systems work best when agencies concentrate on their
core mission. Transit agencies are not in the information technology
business; all they should do is release their data to let third parties
build apps that let passengers use the systems.
To use open data, programmers need to know: Where is the
data? What are the files called? Which fields are available? What are
the fields called?
Bikesharing systems should adopt the standard of having a “data”
page which can be found by appending “data” immediately after the
main URL. This is what many U.S. government web sites are doing
(like justice.gov/data, dot.gov/data, state.gov/data, etc.). It would be
awesome to have consistent URLs like capitalbikeshare.com/data and
To standardize what the files are called, we have to decide how many
files are used, and what formats to use. Some systems do not
separate the station information data (which is static) from the station
status data (which is dynamic). The Capital Bikeshare XML file and
the Bixi Montreal XML file are examples of combining both static and
dynamic data in a single file (both use the Bixi public bike system).
This might be more convenient in some cases, but for systems that
frequently update their displays, it wastes a lot of bandwidth. This
process could be made more efficient by using two files. JCDecaux,
which manages many bikesharing systems in Europe, separates
the static data from the dynamic real-time data.
Denver’s B-cycle doesn’t seem to offer any data at all, though
Denver’s Open Data Catalog does offer a variety of formats for data
about B-cycle stations. I doubt this is the true, live system data,
because the locations are given as street addresses and not as latitude
and longitude coordinates.
In addition to information needed by apps, we also need historic data
in order to analyze how people use the system. The most common
kind is system metrics, such as the type released by Bay Area
Bikeshare. This typically shows ridership and membership totals, and
is good for showing how the system has grown. It would be updated at
the end of each day.
Planners and analysts rely on two other types of historic data: trip
history information shows every trip made within a certain period,
and station history data shows the status of the stations within a
certain period. The best example of the former is the Capital Bikeshare
trip history data page, which releases a new data set every quarter.
The latter is sometimes recorded by enthusiasts on their own initiative,
such as the CaBi Tracker website. In San Francisco, Eric Fisher keeps a
daily log of Bay Area Bikeshare stats at trafficways.org/babs (I used
his data in Probing Data from Bay Area Bikeshare).
The trip history and station history files need a naming convention to
reflect the content’s date range. CaBi’s largest quarterly file is 72.5MB,
for the 572,919 trips in the 2nd quarter of 2012 (they have now
started zipping the files). A filename format like
trips-2012-3-1-to-2012-5-30.csv would work well.
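As a rough illustration of such a naming convention, the sketch below
builds a filename from a date range. The function name and the "trips"
prefix argument are hypothetical; only the overall pattern comes from
the example above.

```python
from datetime import date


def trip_history_filename(start: date, end: date, prefix: str = "trips") -> str:
    """Build a name like trips-2012-3-1-to-2012-5-30.csv (hypothetical helper)."""
    if end < start:
        raise ValueError("end date precedes start date")

    def fmt(d: date) -> str:
        # Unpadded year-month-day, matching the pattern suggested in the text.
        return f"{d.year}-{d.month}-{d.day}"

    return f"{prefix}-{fmt(start)}-to-{fmt(end)}.csv"


print(trip_history_filename(date(2012, 3, 1), date(2012, 5, 30)))
```

A standard might instead prefer zero-padded dates (2012-03-01), since
those sort correctly as plain text; that would be a design choice for
the standard to settle.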
While the systems are expected to protect their customers’ privacy by
not including customer IDs, users should be able to download their
own personal trip history files, and those files should use the same
format as the main trip history files.
Finally, there should be a standard way of summarizing general
information about the entire system. Who provides the equipment,
who runs the system, which jurisdictions participate, where the system
is located and what its boundaries are, what the hours of operation
are, what the operating season is, what the URL is and other contact
info. And to really integrate all the various systems, we also could
benefit from having the URL for a standard-size logo image, plus the
system’s colors. This system information file should also include data
found in a manifest file, namely, a list of all the associated open-data
The system information should include definitions of available
membership types. This might merit being listed as a separate table.
Each membership type should include the cost and duration. We also
need to know how long rides can be, and what the charges are for
going beyond the time limit. For example, the CaBi pricing rules say
rides are free for the first 30 minutes; going up to 30 minutes longer
costs $2.00 for casual members (those with 1- or 3-day memberships)
and $1.50 for subscribers. In contrast, the Citi Bike pricing rules say
rides are free for the first 45 minutes; going up to 30 minutes longer
costs $4.00 for those with 24-hour & 7-day passes, and $2.50 for
those with annual memberships.
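To make the idea of a membership table concrete, here is a minimal
sketch that encodes only the rules quoted above. The class name, field
names, and the usage_fee helper are assumptions, not part of any
published feed. Charges beyond the first extra 30 minutes are not given
in the text, so the helper returns None for those rides rather than
guessing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MembershipType:
    system: str
    name: str
    free_minutes: int          # rides up to this length cost nothing extra
    first_overage_fee: float   # charge for up to 30 minutes past the free period


# Values taken from the CaBi and Citi Bike rules quoted in the text.
MEMBERSHIPS = [
    MembershipType("CaBi", "casual (1- or 3-day)", 30, 2.00),
    MembershipType("CaBi", "subscriber", 30, 1.50),
    MembershipType("Citi Bike", "24-hour or 7-day pass", 45, 4.00),
    MembershipType("Citi Bike", "annual", 45, 2.50),
]


def usage_fee(membership: MembershipType, ride_minutes: float) -> Optional[float]:
    """Return the overtime charge, or None when the text gives no rule."""
    if ride_minutes <= membership.free_minutes:
        return 0.0
    if ride_minutes <= membership.free_minutes + 30:
        return membership.first_overage_fee
    return None  # later pricing tiers are not described in the article


for m in MEMBERSHIPS:
    print(m.system, m.name, usage_fee(m, 50))
```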
This table summarizes the six types of bikesharing data:
System information: general info
Station information: a mostly-static list of all stations
Station status: the number of available bikes and docks
System metrics: membership and trip totals
Trip history: every trip made during a given period
Station history: a history of the station status list
Here’s how I would organize the files. I’ll use ▶ to indicate a primary
key (one that must be unique within the system), and ▷ to indicate a
foreign key (one that references another table’s primary key, and
which must exist).
The station information data is the information most likely to be
shared by bikeshare systems. At the very least, it includes the latitude
& longitude coordinates for every station, and the name. The file is
fairly static, changing mostly when new stations are added.
Here are the fields I would include, compared with CaBi (DC), Vélib
(Paris), and Denver’s B-cycle to see what names they use.
longitude (not used)
Most systems don’t use a region field, but for multi-jurisdictional
systems, it is important to know which jurisdiction manages each
station. For example, Capital Bikeshare operates within DC,
Montgomery County, Arlington, and Alexandria. Bay Area
Bikeshare operates within San Francisco, Redwood City, Palo Alto,
Mountain View, and San Jose. Nice Ride operates within Minneapolis
and St Paul. Other systems could use this field to track which
neighborhood the station is in.
Vélib appends the postal code & city to the address field, but these
would be better as separate fields. For example, the Bastille Richard
Lenoir station has an address of “2 BOULEVARD RICHARD LENOIR –
75011 PARIS”, but this should be just “2 BOULEVARD RICHARD
LENOIR”, with a zip of “75011” and a city of “Paris.” And there is no
reason for Vélib to use all-uppercase letters. The data should be in the
proper mixed-case (using French rules for capitalization), and
programs can easily convert to uppercase if they wish.
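The split described above can be sketched with a regular expression.
This is only an illustration based on the single example address in the
text; the pattern, the function name, and the output keys are
assumptions, and real Vélib data may contain addresses that do not fit.
The street is left in its original case, because proper French
capitalization needs more than a simple title-case call.

```python
import re

# Street, then a dash separator, then a five-digit postal code and the city.
ADDRESS_RE = re.compile(r"^(?P<street>.+?)\s+[–-]\s+(?P<zip>\d{5})\s+(?P<city>.+)$")


def split_address(raw: str) -> dict:
    """Split 'STREET – 75011 PARIS' into separate address, zip, and city values."""
    text = raw.strip()
    match = ADDRESS_RE.match(text)
    if match is None:
        # Leave unrecognized addresses untouched rather than guessing.
        return {"address": text, "zip": None, "city": None}
    return {
        "address": match.group("street"),
        "zip": match.group("zip"),
        # title() is a crude stand-in for proper mixed-case city names.
        "city": match.group("city").title(),
    }


print(split_address("2 BOULEVARD RICHARD LENOIR – 75011 PARIS"))
```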
I would suggest a message field so systems can communicate that a
station will be shutting down early, or moved to a new location. Or
during snow storms, the rebalancing van might not be able to service
Denver has other fields that should be considered for a standard.
“PROPERTY_TYPE” shows whether the station’s location
is Private or Public. This could be expanded to show exactly who the
property owner or responsible agency is. “POWER_TYPE” has values
of Solar Only, Wired Only, and Solar with Wire Backup.
Cities often provide temporary stations. The station ID should
correspond to a specific location. If a station returns to the same
location for an annual event, it should re-use the old ID.
The station status file should have the smallest amount of data needed
to describe the current state of each station. This is the file that will be
called most often, potentially thousands of times per minute, so every
byte counts. And many people will be querying this data from mobile
devices, another reason to keep the file size as small as possible.
Here’s how I would design the standard for this file, compared with
CaBi (DC) and Denver’s B-cycle to see what names they use. Because
I couldn’t find Denver’s XML feed, I used CityBike’s Denver JSON feed.
stationid ▷ id, terminalName
The bikes and docks numbers will generally add up to
the capacity value in the station information file, but if there are nonfunctioning bikes or docks, the total could be smaller. The open field
would be true or false. Sometimes stations are temporarily closed,
perhaps because they have become inaccessible. The time value shows
the last time the station communicated with the server. This is useful
to determine if the data might no longer be accurate, such as during a
Notice we don’t duplicate any of the fields in the station
information file, other than our foreign key, the stationid field.
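Here is a minimal sketch of how an app might combine the two files. The
field names stationid, bikes, docks, open, time, and capacity follow the
description above; the name field, the sample values, the describe
function, and the ten-minute staleness threshold are all hypothetical.

```python
# Static data, loaded rarely. Keyed by stationid (the primary key).
station_info = {
    101: {"name": "Example St & 1st Ave", "capacity": 15},
}

# Dynamic data, polled often. stationid is the foreign key.
station_status = [
    {"stationid": 101, "bikes": 6, "docks": 7, "open": True, "time": 1393770000},
]


def describe(status: dict, info: dict, now: int, max_age_seconds: int = 600) -> dict:
    """Join one status record to its station and derive two extra values."""
    station = info[status["stationid"]]  # a KeyError here means a broken foreign key
    accounted_for = status["bikes"] + status["docks"]
    return {
        "name": station["name"],
        "bikes": status["bikes"],
        "docks": status["docks"],
        "open": status["open"],
        # Bikes or docks that are not functioning show up as a shortfall.
        "out_of_service": station["capacity"] - accounted_for,
        # The station has not reported recently, so the counts may be wrong.
        "stale": now - status["time"] > max_age_seconds,
    }


print(describe(station_status[0], station_info, now=1393770300))
```

Because the status record carries only the key and the changing values,
the app downloads the station names and coordinates once and reuses
them for every status refresh.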
The trip history file also needs to be as compact as possible, not
because people will be downloading it frequently, but because these
files could be used to store millions of records.
The duration of each trip can be computed on-the-fly and doesn’t need
to be included in the file. The startstation and endstation values link up
to the stationid field in the station information file. The usertype field
describes the type of membership the rider has.
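Computing the duration on the fly could look like the sketch below. The
startstation, endstation, and usertype names come from the text; the
starttime and endtime column names, the use of Unix time for them, and
the sample row are assumptions made for illustration.

```python
import csv
import io

# A hypothetical two-line trip history file (header plus one trip).
SAMPLE = (
    "starttime,endtime,startstation,endstation,usertype\n"
    "1393750800,1393751700,101,102,subscriber\n"
)


def trip_durations(csv_text: str):
    """Yield (startstation, endstation, duration in seconds) for each trip."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = int(row["endtime"]) - int(row["starttime"])
        yield row["startstation"], row["endstation"], duration


for start, end, seconds in trip_durations(SAMPLE):
    print(f"{start} -> {end}: {seconds // 60} minutes")
```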
Though few systems release trip history data on a regular basis, there
have been occasions when systems have released data in support of a
visualization contest. The Hubway Data Visualization Challenge took
place in 2013, and included demographic data about the rider of each
trip: residential zip code, year of birth, and sex. The Divvy Data
Challenge (for Chicago) is currently underway; its data includes riders’
year of birth and sex.
The station history file should be a list of every change in status
(available bikes and docks) for every station, listed in chronological
order. In order to avoid having to repeat the state of the entire system
when only a few stations have new values, the file should start with
every station, and thereafter list a station only when it has changed.
The initial value would be needed in order to compute the state of any
later times recorded in the file.
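The replay this implies can be sketched in a few lines. The record
layout (time, stationid, bikes, docks), the sample values, and the
state_at function are assumptions; the approach is simply to start from
the initial full listing and apply each later change in order.

```python
# Hypothetical history: a full snapshot first, then only the changes.
HISTORY = [
    (1393718400, 101, 6, 9),   # initial state of station 101
    (1393718400, 102, 3, 8),   # initial state of station 102
    (1393718460, 101, 5, 10),  # one bike leaves station 101
    (1393718520, 102, 4, 7),   # one bike arrives at station 102
]


def state_at(history, when: int) -> dict:
    """Return {stationid: (bikes, docks)} as of the given Unix time.

    Assumes the records are in chronological order, as the text requires.
    """
    state = {}
    for time, stationid, bikes, docks in history:
        if time > when:
            break  # everything after this point is in the future
        state[stationid] = (bikes, docks)
    return state


print(state_at(HISTORY, 1393718490))
```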
The dominant data formats nowadays are XML and JSON. CSV is
also a good choice, as long as the data fits in a tabular format,
consisting of simple rows and columns. For CSV files, the order of
fields should be consistent.
The values of the fields are numeric, string, Boolean, and timestamp.
Boolean is easily expressed as “true” or “false,” and Unix time is a
common way of recording date and time.
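As a small sketch of reading those two value types from a text file,
the helpers below parse a Boolean and a Unix timestamp. The function
names are hypothetical, and treating the timestamp as whole seconds in
UTC is an assumption a standard would need to state explicitly.

```python
from datetime import datetime, timezone


def parse_bool(text: str) -> bool:
    """Accept only the literal values 'true' and 'false' (case-insensitive)."""
    value = text.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"not a Boolean: {text!r}")


def parse_unix_time(text: str) -> datetime:
    """Convert seconds since 1970-01-01 UTC into a timezone-aware datetime."""
    return datetime.fromtimestamp(int(text), tz=timezone.utc)


print(parse_bool("true"), parse_unix_time("1393718400").isoformat())
```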
When bikesharing open data is published and standardized, developers
and analysts can build tools, such as the Bike Share Map by Oliver
O’Brian, that make it easier for the public to discover and use
bikesharing systems across the globe. The vendors, operators, and managing
jurisdictions should work together to create a standard that can be
used by everyone.