Is it harder to find a taxi when it is raining?

© 2015 IBM Corporation
Are there really no taxis when it rains?
Open Data in action: anyone can
analyze!
data2day conference 2015, Karlsruhe
Wilfried Hoge – IT Architect Big Data – hoge@de.ibm.com @wilfriedhoge
Stephan Reimann – IT Specialist Big Data – stephan.reimann@de.ibm.com @stereimann

Motivation: a personal experience – it seems especially difficult to
get a taxi when it is raining
§  Is that true?
§  Can analytics provide the answer?
§  Is there any correlation between rain and taxi availability?

First we needed data ... Open Data was the key
§  "Open data is the idea that some data should be freely available to everyone to use and
republish as they wish, without restrictions from copyright, patents or other mechanisms of
control." [Wikipedia, https://en.wikipedia.org/wiki/Open_data]
§  Open Data is available in different fields, e.g. Science, Government
§  Open Government data is available at almost any level:
– EU http://open-data.europa.eu/en/data/
– US https://www.data.gov/
– Germany (GovData, the data portal for Germany) https://www.govdata.de/
– Bavaria https://opendata.bayern.de/
– Munich https://www.opengov-muenchen.de/
– Berlin http://daten.berlin.de/
– New York https://nycopendata.socrata.com/
– ...
§  Open Data is available in several categories: census data, traffic, education, environment,
economy, health, ...

There is plenty of Open Data, but sometimes it isn’t that easy to find
the data set you are looking for
§  We needed taxi & weather data
§  Since we couldn’t find an appropriate taxi data set for
Munich, we chose New York
§  The taxi data set is available at
http://www.andresmh.com/nyctaxitrips/ and contains two
parts: trip data & trip fares
§  The taxi data set contains all taxi trips in Manhattan for
2013, approx. 4 GB per month, too big overall to analyze
on a laptop
§  We found plenty of weather data, but none detailed
enough for our analysis: open weather data was only
available on a daily basis, whereas the taxi data has
exact timestamps
§  We decided to buy an appropriate data set with hourly
weather information for NYC at
https://weatherspark.com/ (approx. 10 €)
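Because the weather data is hourly while each taxi trip has an exact timestamp, one way to match the two is to truncate each pickup time to the full hour and look up the weather for that hour. The following is a minimal Python sketch of that idea only; the field names (`trip_id`, `pickup`, `rain_mm`) and all values are made-up assumptions, not the actual columns of either data set.

```python
from datetime import datetime

# Hypothetical hourly weather records: start of hour -> precipitation in mm.
hourly_rain_mm = {
    datetime(2013, 5, 8, 14): 0.0,
    datetime(2013, 5, 8, 15): 3.2,
}

# Hypothetical taxi trips with exact pickup timestamps.
trips = [
    {"trip_id": 1, "pickup": datetime(2013, 5, 8, 14, 7, 31)},
    {"trip_id": 2, "pickup": datetime(2013, 5, 8, 15, 42, 5)},
    {"trip_id": 3, "pickup": datetime(2013, 5, 8, 16, 1, 0)},
]


def truncate_to_hour(ts):
    """Drop minutes, seconds and microseconds so the timestamp matches an hourly key."""
    return ts.replace(minute=0, second=0, microsecond=0)


def attach_rain(trips, hourly_rain_mm):
    """Return copies of the trips with a rain_mm field (None if no weather record exists)."""
    return [
        {**trip, "rain_mm": hourly_rain_mm.get(truncate_to_hour(trip["pickup"]))}
        for trip in trips
    ]


trips_with_rain = attach_rain(trips, hourly_rain_mm)
```

Trips without a matching weather record keep `rain_mm = None`, so gaps in the weather data stay visible instead of being silently treated as dry hours.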

Then we needed tools to analyze the data; we chose cloud
services for their simplicity and agility
1. IDEAS
2. PROTOTYPE
3. FAIL FAST
4. PRODUCTION
•  With cloud services, ideas can
be realized quickly and simply:
•  Prototype ideas
•  Fail fast
•  Bring successful ideas into
production

We used IBM Bluemix for our “investigation”
[Figure: IBM Bluemix platform overview]
§  Flexible compute options to run apps / services: Instant Runtimes, Containers, Virtual Machines
§  Platform deployment options that meet your workload requirements: Bluemix Public, Bluemix Dedicated, Bluemix Local*
§  Catalog of services that extend apps’ functionality: Web, Data, Mobile, Analytics, Cognitive, IoT, Security, Yours
§  Your own hosted apps / services with support for many languages and runtimes
§  Cloud services fabric: Delivery, Storage, Network, Security
§  DevOps tooling; integration and API management
§  Operational excellence, visibility, hybrid portability
§  Runs on IBM SoftLayer or in your datacenter

To provide the data for analytics, we used the SoftLayer object store
because of its automatic compression and attractive price
§  Automatic partitioning
§  Automatic compression
§  ~4 ct per GB per month

We decided to analyze the data with dashDB, an in-memory analytical cloud
database, since the data exceeded the capacity of a laptop
§  Why?
– Easy to use
– No infrastructure required
– No tuning required, focus on analytics
www.ibm.com/software/data/dashdb/

dashDB made it simple to create the table structures and load the
data
1. Create the tables
2. Load the data
3. Start analyzing
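The three steps can be illustrated locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for dashDB (it is not the dashDB console or API), and the table name, column names and sample rows are made-up assumptions rather than the real schema of the taxi files.

```python
import csv
import io
import sqlite3

# Made-up sample rows standing in for one of the monthly trip files.
SAMPLE_CSV = """pickup_datetime,passenger_count,trip_distance
2013-05-08 14:07:31,1,2.4
2013-05-08 15:42:05,2,0.9
2013-05-09 09:15:00,1,5.1
"""

conn = sqlite3.connect(":memory:")  # local stand-in, not dashDB

# 1. Create the table
conn.execute(
    """
    CREATE TABLE trips (
        pickup_datetime TEXT,
        passenger_count INTEGER,
        trip_distance   REAL
    )
    """
)

# 2. Load the data, converting each CSV field to the column type
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = [
    (r["pickup_datetime"], int(r["passenger_count"]), float(r["trip_distance"]))
    for r in reader
]
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
conn.commit()

# 3. Start analyzing
trip_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
```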

Now we can use SQL to obtain first insights
SQL can also be used for sampling and data preparation ...
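As an illustration of both uses, the sketch below runs a per-day aggregation and a random sample. It again uses Python's built-in `sqlite3` as a local stand-in, so the SQL is in SQLite's dialect and date and random functions would look different on dashDB; the table, columns and rows are made-up assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in, not dashDB
conn.execute("CREATE TABLE trips (pickup_datetime TEXT, trip_distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2013-05-08 14:07:31", 2.4),
        ("2013-05-08 15:42:05", 0.9),
        ("2013-05-09 09:15:00", 5.1),
    ],
)

# First insight: number of trips and average distance per day.
# The first 10 characters of the timestamp text are the date (YYYY-MM-DD).
trips_per_day = conn.execute(
    """
    SELECT substr(pickup_datetime, 1, 10) AS trip_day,
           COUNT(*)                       AS num_trips,
           AVG(trip_distance)             AS avg_distance
    FROM trips
    GROUP BY trip_day
    ORDER BY trip_day
    """
).fetchall()

# Sampling: pick a random subset of rows for quick experiments.
sample = conn.execute("SELECT * FROM trips ORDER BY RANDOM() LIMIT 2").fetchall()
```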

With the integrated RStudio, we can now start with advanced
analytics, e.g. to find correlations
The data can be easily accessed from R via SQL
Start the integrated RStudio
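To make "finding correlations" concrete, here is a small Python sketch of the Pearson correlation coefficient between daily rainfall and the daily number of trips. It is a stand-in for the R code used in the talk, not a copy of it, and the numbers are invented for illustration only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented daily values: rainfall in mm and number of trips (in thousands).
rain_mm = [0.0, 0.0, 4.5, 12.0, 0.5, 0.0, 7.3]
trips_k = [455, 470, 448, 462, 471, 430, 459]

r = pearson_r(rain_mm, trips_k)
```

A value of `r` near 0 suggests no linear relationship; values near +1 or -1 suggest a strong one.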

Some observations made with R
§  Day of week seems to heavily influence the number of trips
§  New York has very few days with heavy rain, maybe not the best place
for our investigation
§  Season and holidays seem to influence the number of passengers per month

So no strong correlations so far, let’s try a t-test
http://matheguru.com/stochastik/t-test.html#rechner
The t-test indicates that the difference in the number of taxi trips
between days with and without rain is not statistically significant
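The slides do not say which variant of the t-test was used. As an assumption, the sketch below computes Welch's two-sample t statistic (which does not require equal variances) for trips on rainy versus dry days, using only the Python standard library. The daily trip counts are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (variance() is the sample variance).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Invented daily trip counts (in thousands) for rainy and dry days.
rainy_days = [452, 470, 431, 465, 448]
dry_days = [460, 455, 472, 440, 468]

t_stat, deg_freedom = welch_t(rainy_days, dry_days)
```

The p-value then follows from the t distribution with `deg_freedom` degrees of freedom, for example from a statistics library or an online calculator. A small absolute `t_stat` means the difference between the two group means is small relative to their spread.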

Shiny apps provide a simple way to create attractive, interactive
visualizations in R
§  Select an area and access information on individual trips
§  Get information about trip destinations visually
§  Where do trips to the airport start?

Shiny Apps are easy to create
Create a marker for each list element
Shiny app for selecting and passing data to Google Maps
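The core logic behind such an app (select the trips in an area, then create one marker per list element) can be sketched independently of Shiny. The Python below is an illustration of that logic only, not the app's R code; the trip fields, coordinates and the marker dictionary layout are assumptions.

```python
# Hypothetical trip records; coordinates are rough, made-up values.
trips = [
    {"trip_id": 1, "pickup_lat": 40.7580, "pickup_lon": -73.9855, "dropoff": "airport"},
    {"trip_id": 2, "pickup_lat": 40.7061, "pickup_lon": -74.0087, "dropoff": "airport"},
    {"trip_id": 3, "pickup_lat": 40.8075, "pickup_lon": -73.9626, "dropoff": "midtown"},
]


def trips_in_area(trips, south, west, north, east):
    """Keep only trips whose pickup point lies inside the selected bounding box."""
    return [
        t for t in trips
        if south <= t["pickup_lat"] <= north and west <= t["pickup_lon"] <= east
    ]


def to_markers(trips):
    """Create one marker description for each list element."""
    return [
        {
            "position": {"lat": t["pickup_lat"], "lng": t["pickup_lon"]},
            "title": f"Trip {t['trip_id']} to {t['dropoff']}",
        }
        for t in trips
    ]


# Select an arbitrary bounding box and build the markers for the trips inside it.
selected = trips_in_area(trips, south=40.70, west=-74.02, north=40.77, east=-73.93)
markers = to_markers(selected)
```

The resulting list of marker descriptions is what a map front end would then draw.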

So is it more difficult to get a taxi when it is raining?
§  Taxi trips are shorter when it is raining, but the average trip fare is
higher
-> More traffic? Fewer people cycling or walking? Traffic jams?
§  The t-test indicates the difference isn’t significant
§  We analyzed at day level; maybe an analysis on an hourly basis
would show different results
§  So it seems to be a personal impression rather than a correlation ... but
maybe New York just doesn’t get enough rain ;-)
§  Find your own answers -> https://github.com/WilHoge/NYC-Taxi-Demo
