Building a Standard for Open Bikeshare Data

Should the bikeshare industry adopt an open data standard? As bikesharing spreads to more cities, having a common method for accessing and analyzing data will become more important.

Transcript

  • 1. Building a Standard for Open Bikeshare Data

Originally published at Michael Schade’s Mystery Incorporated Blog, March 2nd, 2014

Should the bikeshare industry adopt an open data standard? As bikesharing spreads to more cities, having a common method for accessing and analyzing data will become more important.

We know that transit systems work best when agencies concentrate on their core mission. Transit agencies are not in the information technology business; all they should do is release their data so that third parties can build apps that let passengers use the systems.

To use open data, programmers need to know: Where is the data? What are the files called? Which fields are available? What are the fields called?

Bikesharing systems should adopt the standard of having a “data” page that can be found by appending “data” immediately after the main URL. This is what many U.S. government web sites are doing (like justice.gov/data, dot.gov/data, state.gov/data, etc.). It would be awesome to have consistent URLs like capitalbikeshare.com/data and velib.paris.fr/data.

To standardize what the files are called, we have to decide how many files are used and what formats to use. Some systems do not separate the station information data (which is static) from the station status data (which is dynamic). The Capital Bikeshare XML file and the Bixi Montreal XML file are examples of combining both static and dynamic data in a single file (both use the Bixi public bike system). This might be more convenient in some cases, but for apps that frequently update their displays, it wastes a lot of bandwidth, because the unchanged static data is downloaded again with every refresh. This process could be made more efficient by using two files. JCDecaux, which manages many bikesharing systems in Europe, separates the static data from the dynamic real-time data.

Denver’s B-cycle doesn’t seem to offer any data at all, though Denver’s Open Data Catalog does offer a variety of formats for data about B-cycle stations. I doubt this is the true, live system data, because the station locations are given as street addresses rather than latitude and longitude coordinates.
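As a rough illustration of the two-file idea, here is a minimal Python sketch that splits one combined station record into a static "information" part and a dynamic "status" part. The field names and the sample record are assumptions made up for this example; they are not taken from any real feed.

```python
# Hypothetical field names, chosen only for illustration.
STATIC_FIELDS = ("name", "lat", "lng")   # rarely change
DYNAMIC_FIELDS = ("bikes", "docks")      # change constantly


def split_station(record):
    """Split a combined station record into (information, status) dicts.

    Both parts keep the station ID so they can be joined again later.
    """
    station_id = record["stationid"]
    info = {"stationid": station_id}
    info.update({field: record[field] for field in STATIC_FIELDS})
    status = {"stationid": station_id}
    status.update({field: record[field] for field in DYNAMIC_FIELDS})
    return info, status


# A made-up combined record, as a single-file feed might present it.
combined = {
    "stationid": "31000",
    "name": "Example Station",
    "lat": 38.9,
    "lng": -77.0,
    "bikes": 5,
    "docks": 10,
}
info, status = split_station(combined)
```

An app would fetch the information part once and cache it, then poll only the much smaller status part.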
  • 2. In addition to the information needed by apps, we also need historic data in order to analyze how people use the system. The most common kind is system metrics, such as the type released by Bay Area Bikeshare. This typically shows ridership and membership totals, and is good for showing how the system has grown. It would be updated at the end of each day.

Planners and analysts rely on two other types of historic data: trip history data shows every trip made within a certain period, and station history data shows the status of the stations within a certain period. The best example of the former is the Capital Bikeshare trip history data page, which releases a new data set every quarter. The latter is sometimes recorded by enthusiasts on their own initiative, such as the CaBi Tracker website. In San Francisco, Eric Fisher keeps a daily log of Bay Area Bikeshare stats at trafficways.org/babs (I used his data in Probing Data from Bay Area Bikeshare).

The trip history and station history files need a naming convention that reflects the date range of their content. CaBi’s largest quarterly file is 72.5MB, for the 572,919 trips in the 2nd quarter of 2012 (they have now started zipping the files). A filename format like trips-2012-3-1-to-2012-5-30.csv would work well.

While the systems are expected to protect their customers’ privacy by not including customer IDs, users should be able to download their own personal trip history files, and those files should use the same format as the main trip history files.

Finally, there should be a standard way of summarizing general information about the entire system: who provides the equipment, who runs the system, which jurisdictions participate, where the system is located and what its boundaries are, what the hours of operation are, what the operating season is, and what the URL and other contact info are.
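Returning to the naming convention for the history files, here is a minimal Python sketch that builds a filename from a date range. It assumes the example above means trips-2012-3-1-to-2012-5-30.csv, with unpadded month and day numbers; the function name and the "trips" prefix default are hypothetical.

```python
from datetime import date


def trip_history_filename(start, end, prefix="trips"):
    """Build a filename such as trips-2012-3-1-to-2012-5-30.csv.

    Month and day are written without zero padding, following the
    example filename in the text.
    """
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return (
        f"{prefix}-{start.year}-{start.month}-{start.day}"
        f"-to-{end.year}-{end.month}-{end.day}.csv"
    )


filename = trip_history_filename(date(2012, 3, 1), date(2012, 5, 30))
print(filename)  # trips-2012-3-1-to-2012-5-30.csv
```

One design question a real standard would have to settle is whether to zero-pad the month and day (2012-03-01), which would make filenames sort chronologically.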
And to really integrate all the various systems, we could also benefit from having the URL for a standard-size logo image, plus the system’s colors. This system information file should also include the data found in a manifest file, namely a list of all the associated open-data files.

The system information should include definitions of the available membership types. This might merit being listed as a separate table. Each membership type should include the cost and duration. We also need to know how long rides can be, and what the charges are for going beyond the time limit. For example, the CaBi pricing rules say
  • 3. rides are free for the first 30 minutes; going up to 30 minutes longer costs $2.00 for casual members (those with 1- or 3-day memberships) and $1.50 for subscribers. In contrast, the Citi Bike pricing rules say rides are free for the first 45 minutes; going up to 30 minutes longer costs $4.00 for those with 24-hour & 7-day passes, and $2.50 for those with annual memberships.

This table summarizes the six types of bikesharing data:

Type                  Contents
System information    general info
Station information   a mostly-static list of all stations
Station status        the number of available bikes and docks
System metrics        membership and trip totals
Trip history          every trip made during a given period
Station history       a history of the station status list

Here’s how I would organize the files. I’ll use ▶ to indicate a primary key (one that must be unique within the system), and ▷ to indicate a foreign key (one that references another table’s primary key, and which must exist).

The station information data is the information most likely to be shared by bikeshare systems. At the very least, it includes the name and the latitude & longitude coordinates of every station. The file is fairly static, changing mostly when new stations are added. Here are the fields I would include, compared with CaBi (DC), Vélib (Paris), and Denver’s B-cycle to see what names they use.

Station information

Proposal      CaBi               Vélib        B-cycle
stationid ▶   id, terminalName   number       GLOBALID
name          name               name         STATION_NAME
address       (not used)         address      STATION_ADDRESS, ADDRESS_LINE1, ADDRESS_LINE2
region        (not used)         (not used)   CITY, STATE
zip           (not used)         (not used)   ZIP
lat           lat                latitude     (not used)
lng           long               longitude    (not used)
installed     installDate        (not used)   (not used)
removed       removalDate        (not used)   (not used)
public        public             (not used)   (not used)
capacity      (not used)         (not used)   NUM_DOCKS
message       (not used)         (not used)   (not used)

  • 4. Most systems don’t use a region field, but for multi-jurisdictional systems, it is important to know which jurisdiction manages each station. For example, Capital Bikeshare operates within DC, Montgomery County, Arlington, and Alexandria. Bay Area Bikeshare operates within San Francisco, Redwood City, Palo Alto, Mountain View, and San Jose. Nice Ride operates within Minneapolis and St Paul. Other systems could use this field to track which neighborhood the station is in.

Vélib appends the postal code & city to the address field, but these would be better as separate fields. For example, the Bastille Richard Lenoir station has an address of “2 BOULEVARD RICHARD LENOIR – 75011 PARIS”, but this should be just “2 BOULEVARD RICHARD LENOIR”, with a zip of “75011” and a city of “Paris.” And there is no reason for Vélib to use all-uppercase letters. The data should be in the proper mixed case (using French rules for capitalization), and programs can easily convert to uppercase if they wish.

I would suggest a message field so systems can announce that a station will be shutting down early or moving to a new location, or that the rebalancing van might not be able to service it during a snowstorm.

Denver has other fields that should be considered for a standard. “PROPERTY_TYPE” shows whether the station’s location is Private or Public. This could be expanded to show exactly who the property owner or responsible agency is. “POWER_TYPE” has values of Solar Only, Wired Only, and Solar with Wire Backup.

Cities often provide temporary stations. The station ID should correspond to a specific location. If a station returns to the same location for an annual event, it should re-use the old ID.
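To make the Vélib address point concrete, here is a minimal Python sketch that splits a combined address string into separate address, zip, and city values. It assumes the "street – 5-digit postal code city" layout seen in the single example above; the function name and the regular expression are hypothetical, and real Vélib data may contain other layouts.

```python
import re

# Assumed layout: "<street> – <5-digit postal code> <city>".
# Accepts either an en dash or a plain hyphen as the separator.
ADDRESS_PATTERN = re.compile(
    r"^(?P<address>.+?)\s+[–-]\s+(?P<zip>\d{5})\s+(?P<city>.+)$"
)


def split_velib_address(raw):
    """Return a dict with address, zip, and city, or None if no match."""
    match = ADDRESS_PATTERN.match(raw.strip())
    if match is None:
        return None
    return match.groupdict()


parts = split_velib_address("2 BOULEVARD RICHARD LENOIR – 75011 PARIS")
print(parts)
```

The sketch leaves the uppercase text untouched. Converting to proper mixed case under French capitalization rules is a separate problem that a simple title-case call would not solve.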
  • 5. The station status file should have the smallest amount of data needed to describe the current state of each station. This is the file that will be requested most often, potentially thousands of times per minute, so every byte counts. Many people will also be querying this data from mobile devices, which is another reason to keep the file as small as possible. Here’s how I would design the standard for this file, compared with CaBi (DC) and Denver’s B-cycle to see what names they use. Because I couldn’t find Denver’s XML feed, I used CityBike’s Denver JSON feed.

Station status

Proposal      CaBi                 Denver B-cycle
stationid ▷   id, terminalName     id, idx
bikes         nbBikes              bikes
docks         nbEmptyDocks         free
open          locked               (not used)
time          lastCommWithServer   timestamp

The bikes and docks numbers will generally add up to the capacity value in the station information file, but if there are nonfunctioning bikes or docks, the total could be smaller. The open field would be true or false. Sometimes stations are temporarily closed, perhaps because they have become inaccessible. The time value shows the last time the station communicated with the server. This is useful for determining whether the data might no longer be accurate, such as during a power outage.

Notice that we don’t duplicate any of the fields in the station information file, other than our foreign key, the stationid field.

The trip history file also needs to be as compact as possible, not because people will be downloading it frequently, but because these files could be used to store millions of records.

Trip history

startdate
startstation ▷
enddate
endstation ▷
bikeid
usertype

The duration of each trip can be computed on the fly from startdate and enddate, so it doesn’t need to be included in the file. The startstation and endstation values link up
  • 6. to the stationid field in the station information file. The usertype field describes the type of membership the rider has.

Though few systems release trip history data on a regular basis, there have been occasions when systems have released data in support of a visualization contest. The Hubway Data Visualization Challenge took place in 2013, and included demographic data about the rider of each trip: residential zip code, year of birth, and sex. The Divvy Data Challenge (for Chicago) is currently underway; its data includes riders’ year of birth and sex.

The station history file should be a list of every change in status (available bikes and docks) for every station, listed in chronological order. To avoid repeating the state of the entire system when only a few stations have new values, the file should start with a record for every station, and thereafter list a station only when it has changed. The initial values are needed in order to compute the state of the system at any later time recorded in the file.

Station history

stationid ▷
bikes
docks
open
time

The dominant data formats nowadays are XML and JSON. CSV is also a good choice, as long as the data fits a tabular format consisting of simple rows and columns. For CSV files, the order of fields should be consistent. Field values are numeric, string, Boolean, or timestamp. Boolean is easily expressed as “true” or “false,” and Unix time is a common way of recording date and time.

Publishing and standardizing bikesharing open data lets developers and analysts make it easier for the public to discover and use bikesharing systems across the globe, as the Bike Share Map by Oliver O’Brian does. The vendors, operators, and managing jurisdictions should work together to create a standard that can be used by everyone.
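The station history format described above (a full initial snapshot followed only by changes) can be replayed to recover the state of the system at any moment. Here is a minimal Python sketch of that replay. It assumes each row is a tuple in the order time, stationid, bikes, docks, open, with time in Unix seconds; the function name and the sample rows are made up for illustration.

```python
def state_at(history, when):
    """Return {stationid: status} as of Unix time `when`.

    `history` must be in chronological order and must begin with one
    row for every station, as the proposed file format requires.
    """
    state = {}
    for time, stationid, bikes, docks, is_open in history:
        if time > when:
            break  # later rows describe changes that have not happened yet
        # Each row overwrites the previous known status of that station.
        state[stationid] = {"bikes": bikes, "docks": docks, "open": is_open}
    return state


# Made-up sample: two stations, then two later changes.
history = [
    (1000, "A", 5, 10, True),   # initial snapshot
    (1000, "B", 3, 7, True),    # initial snapshot
    (1060, "A", 4, 11, True),   # one bike taken from A
    (1120, "B", 0, 10, False),  # B emptied and closed
]
snapshot = state_at(history, 1100)
```

In this sample, the snapshot at time 1100 reflects the change to station A but still shows station B in its initial state, because B's change at 1120 has not yet occurred.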
